Efficient Mixed-Model Association eXpedited (EMMAX) to Simultaneously Account for Relatedness and Stratification in Genome-Wide Association Studies

A few months ago I covered an algorithm called EMMA (Efficient Mixed-Model Association), implemented in R, that simultaneously corrects for both population stratification and relatedness in an association study. This method is very useful because most methods that account for relatedness in an association study assume a genetically (ethnically) homogeneous population, while methods that detect and correct for population stratification typically assume individuals are unrelated. The EMMA algorithm accounts for both types of population structure at once by using a linear mixed model with an empirically estimated relatedness matrix to model the correlation between the phenotypes of sample subjects.
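For readers who want the model written out, here is one common way to state a linear mixed model of this kind. The symbols are my own notation, not taken from the post: y is the phenotype vector, X holds the fixed effects (an intercept, covariates and the SNP being tested), K is the estimated relatedness matrix, and the two variance components are the parameters that have to be estimated.

```latex
\begin{aligned}
y &= X\beta + u + \varepsilon, \\
u &\sim \mathcal{N}\!\left(0,\ \sigma_g^2 K\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I\right), \\
\operatorname{Var}(y) &= \sigma_g^2 K + \sigma_e^2 I .
\end{aligned}
```

The off-diagonal entries of K are what let related (or similarly stratified) individuals have correlated phenotypes.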

The original EMMA algorithm, however, is computationally infeasible for datasets with thousands of individuals because the variance component parameters are estimated separately for each marker. On the authors' large GWAS dataset this takes about 10 minutes per marker, so a full scan would take over 6 years on a single processor. A new implementation of the algorithm called EMMAX (Efficient Mixed-Model Association eXpedited) makes the simplifying assumption that, because the effect of any given SNP on the trait is typically small, the variance parameters only need to be estimated once for the entire dataset rather than once for each marker.
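To make the "estimate once, test many" idea concrete, here is a minimal NumPy sketch. It is not the EMMAX code: the function names, the grid search over the variance ratio delta (residual variance divided by genetic variance), and the simulated data are all assumptions of mine for illustration. The variance ratio is fitted a single time under the model with no SNP, and every marker is then tested by ordinary least squares on data whitened with that fixed ratio.

```python
import math
import numpy as np

def fit_null_delta(y, X, K, n_grid=101):
    """Estimate delta = sigma_e^2 / sigma_g^2 once, with no SNP in the model."""
    n, p = X.shape
    s, U = np.linalg.eigh(K)              # K = U diag(s) U^T
    s = np.clip(s, 0.0, None)             # guard against tiny negative eigenvalues
    yr, Xr = U.T @ y, U.T @ X             # rotate so Var(y) becomes diagonal
    best_ll, best_delta = -np.inf, None
    for delta in np.exp(np.linspace(-5, 5, n_grid)):   # hypothetical search grid
        w = 1.0 / (s + delta)             # diagonal of H^{-1}, with H = K + delta*I
        XtWX = Xr.T @ (w[:, None] * Xr)
        beta = np.linalg.solve(XtWX, Xr.T @ (w * yr))
        resid = yr - Xr @ beta
        rss = float(np.sum(w * resid ** 2))
        # Restricted log-likelihood profiled over sigma_g^2, up to a constant
        ll = -0.5 * ((n - p) * math.log(rss / (n - p))
                     + np.sum(np.log(s + delta))
                     + np.linalg.slogdet(XtWX)[1])
        if ll > best_ll:
            best_ll, best_delta = ll, delta
    return best_delta, s, U

def scan_markers(y, X, K, G):
    """Test each column of G (n x m genotypes) using one shared delta.

    Assumes every marker is polymorphic; returns delta and 1-df test statistics.
    """
    n, p = X.shape
    delta, s, U = fit_null_delta(y, X, K)  # the expensive step, done only once
    sw = 1.0 / np.sqrt(s + delta)          # whitening weights
    yw = sw * (U.T @ y)
    Xw = sw[:, None] * (U.T @ X)
    Gw = sw[:, None] * (U.T @ G)
    stats = []
    for j in range(G.shape[1]):            # cheap per-marker least squares
        D = np.column_stack([Xw, Gw[:, j]])
        coef, *_ = np.linalg.lstsq(D, yw, rcond=None)
        resid = yw - D @ coef
        sigma2 = resid @ resid / (n - p - 1)
        cov = sigma2 * np.linalg.inv(D.T @ D)
        t = coef[-1] / math.sqrt(cov[-1, -1])
        stats.append(float(t * t))
    return delta, stats

# Simulated toy data (made up for this example)
rng = np.random.default_rng(0)
n, m = 60, 20
G = rng.integers(0, 3, size=(n, m)).astype(float)
Z = (G - G.mean(axis=0)) / G.std(axis=0)   # standardised genotypes
K = Z @ Z.T / m                            # simple empirical relatedness matrix
y = 0.8 * G[:, 0] + rng.normal(size=n)
X = np.ones((n, 1))                        # intercept only
delta, stats = scan_markers(y, X, K, G)
```

The eigendecomposition and the search over delta happen once, so the per-marker cost drops to a small regression. Re-fitting delta inside the marker loop would correspond to the slower per-marker approach described above.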

In the paper, the authors take the Northern Finland Birth Cohort and estimate genomic control inflation factors (λ) for three sets of test statistics: uncorrected, adjusted for the top 100 principal components using Eigenstrat, and corrected for structure using the EMMAX algorithm. The inflation factors were closest to 1 for the EMMAX-corrected tests. Further, whereas genomic control simply scales all test statistics downward without changing their rank, the EMMAX method does change the rank of the test statistics across SNPs.
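As a rough illustration of what an inflation factor is, here is a small standard-library sketch using the usual median-based definition: the median of the observed 1-df chi-square statistics divided by the median expected under the null. The function names and the example statistics are made up, and this is not the paper's code.

```python
import statistics

# Median of a 1-df chi-square distribution: the square of the 75th
# percentile of a standard normal (about 0.4549).
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2

def genomic_inflation(chi2_stats):
    """Median observed statistic divided by the median expected under the null."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control_adjust(chi2_stats):
    """Divide every statistic by the inflation factor (never by less than 1)."""
    inflation = max(genomic_inflation(chi2_stats), 1.0)
    return [x / inflation for x in chi2_stats]

def ranks(values):
    """Indices of the values, ordered from largest to smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

# Hypothetical test statistics for five SNPs
observed = [0.3, 1.2, 6.5, 0.9, 14.0]
adjusted = genomic_control_adjust(observed)
same_order = ranks(observed) == ranks(adjusted)
```

Because every statistic is divided by the same constant, `same_order` is always true, which is the point made above: genomic control rescales the statistics but cannot reorder SNPs.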

A beta version of EMMAX is available online, with a complete version to be released soon. Conveniently, the software accepts PLINK transposed ped (tped/tfam) files and covariate files as input (tped and tfam documentation here).
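If you have not worked with the transposed format before: a tped file has one row per SNP (chromosome, SNP ID, genetic distance, base-pair position, then two allele columns per individual), and the tfam file has one row per individual (family ID, individual ID, father, mother, sex, phenotype). The sketch below parses one line of each. The helper names and the example rows are hypothetical, and it assumes whitespace-delimited fields with "0" as the missing allele code.

```python
def parse_tfam_line(line):
    """Split one tfam row into its six standard fields."""
    fid, iid, father, mother, sex, phenotype = line.split()[:6]
    return {"family": fid, "individual": iid, "father": father,
            "mother": mother, "sex": sex, "phenotype": phenotype}

def parse_tped_line(line):
    """Split one tped row into marker information and per-individual allele pairs."""
    fields = line.split()
    chrom, snp_id, genetic_dist, position = fields[:4]
    alleles = fields[4:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per individual")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"chrom": chrom, "snp": snp_id,
            "genetic_dist": float(genetic_dist),
            "position": int(position),
            "genotypes": genotypes}

def allele_counts(genotypes, counted_allele, missing="0"):
    """Copies of counted_allele per individual (None where the genotype is missing)."""
    counts = []
    for a, b in genotypes:
        if a == missing or b == missing:
            counts.append(None)
        else:
            counts.append((a == counted_allele) + (b == counted_allele))
    return counts

# Made-up example rows: one SNP typed on four individuals, and one individual
marker = parse_tped_line("1 rs0000001 0 5000 A A A G 0 0 G G")
person = parse_tfam_line("FAM1 IND1 0 0 1 2")
dosage = allele_counts(marker["genotypes"], "G")
```

Here `dosage` comes out as `[0, 1, None, 2]`: zero, one and two copies of G, with the third individual missing.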

Nature Genetics Technical Report - Variance component model to account for sample structure in genome-wide association studies