
PLATO, an Alternative to PLINK

Almost since the beginning of genome-wide association studies, the PLINK software package (developed by Shaun Purcell’s group at the Broad Institute and MGH) has been the standard for manipulating the large-scale data produced by these studies.  Over the course of its development, numerous features and options were added to enhance its capabilities, but it is best known for its core functionality: quality control and standard association tests.

Nearly 10 years ago (around the time PLINK was just getting started), the CHGR Computational Genomics Core (CGC) at Vanderbilt University started work on a similar framework for implementing genotype QC and association tests.  This project, called PLATO, has stayed active primarily to provide functionality and control that (for one reason or another) is unavailable in PLINK.  We have found it especially useful for processing ADME and DMET panel data – it supports QC and association tests of multi-allelic variants.    

PLATO runs via a command-line interface, but accepts a batch file that allows users to specify an order of operations for QC filtering steps.  When running multiple QC steps in a single run of PLINK, the order of application is hard-coded and not well documented.  As a result, users wanting this level of control must run a sequence of PLINK commands, generating new data files at each step, which leads to longer compute times and greater disk usage.  PLATO also has a variety of data reformatting options for other genetic analysis programs, making it easy to run EIGENSTRAT, for example.

PLATO's QC output from each filtering step is much more detailed, allowing output per group (founders only, parents only, etc.) and giving more detail on why samples fail sex checks, Hardy-Weinberg checks, and Mendelian inconsistency checks, to facilitate deeper investigation of these errors.  And with family data, disabling samples due to poor genotype quality retains pedigree information useful for phasing and transmission tests. Full documentation and download links can be found here (https://chgr.mc.vanderbilt.edu/plato).  Special thanks to Yuki Bradford in the CGC for her thoughts on this post.

Getting Genetics Done 2012 In Review

Here are links to all of this year's posts (excluding seminar/webinar announcements), with the most visited posts in bold italic. As always, you can follow me on Twitter for more frequent updates. Happy new year!

New Year's Resolution: Learn How to Code

Annotating limma Results with Gene Names for Affy Microarrays

Your Publications (with PMCID) as a PubMed Query

Pathway Analysis for High-Throughput Genomics Studies

find | xargs ... Like a Boss

Redesign by Subtraction

Video Tip: Convert Gene IDs with Biomart

RNA-Seq Methods & March Twitter Roundup

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

Video Tip: Use Ensembl BioMart to Quickly Get Ortholog Information

Stepping Outside My Open-Source Comfort Zone: A First Look at Golden Helix SVS

How to Stay Current in Bioinformatics/Genomics

The HaploReg Database for Functional Annotation of SNPs

Identifying Pathogens in Sequencing Data

Browsing dbGaP Results

Fix Overplotting with Colored Contour Lines

Plotting the Frequency of Twitter Hashtag Usage Over Time with R and ggplot2

Cscan: Finding Gene Expression Regulators with ENCODE ChIP-Seq Data

More on Exploring Correlations in R

DESeq vs edgeR Comparison

Learn R and Python, and Have Fun Doing It

STAR: ultrafast universal RNA-seq aligner

RegulomeDB: Identify DNA Features and Regulatory Elements in Non-Coding Regions

Copy Text to the Local Clipboard from a Remote SSH Session

Differential Isoform Expression With RNA-Seq: Are We Really There Yet?

Browsing dbGaP Results


Thanks to the excellent work of Lucia Hindorff and colleagues at NHGRI, the GWAS catalog provides a great reference for the cumulative results of GWAS for various phenotypes.  Anyone familiar with GWAS also likely knows about dbGaP – the NCBI repository for genotype-phenotype relationships – and the wealth of data it contains.  While dbGaP is often thought of as a way to get access to existing genotype data, analysis results are often deposited into dbGaP as well.  Individual-level data (like genotypes) are generally considered “controlled access”, requiring special permission to retrieve or use.  Summary-level data, such as association p-values, are a bit more accessible.  There are two tools available from the dbGaP website: the Association Results Browser and the Phenotype-Genotype Integrator (PheGenI).  These tools provide a search interface for examining previous GWAS associations.

The Association Results Browser provides a simple table listing of associations, searchable by SNP, gene, or phenotype.  It contains the information from the NHGRI GWAS catalog, as well as additional associations from dbGaP deposited studies.  I’ve shown an example below for multiple sclerosis.  You can restrict the search to the dbGaP-specific results by changing the “Source” selection.  If you are looking for the impact of a SNP, this is a nice supplement to the catalog.  Clicking on a p-value brings up the GaP browser, which provides a more graphical (but perhaps less useful) view of the data.



The PheGenI tool provides similar search functionality, but attempts to provide phenotype categories rather than more specific phenotype associations.  Essentially, phenotype descriptions are linked to MeSH terms to provide categories such as “Chemicals and Drugs”, or “Hemic and Lymphatic Diseases”.  PheGenI seems most useful if searching from the phenotype perspective, while the association browser seems better for SNP or Gene searches.  All these tools are under active development, and I look forward to seeing their future versions.

A New Dimension to Principal Components Analysis



In general, the standard practice for correcting for population stratification in genetic studies is to use principal components analysis (PCA) to categorize samples along different ethnic axes.  Price et al. published on this in 2006, and since then PCA plots have been a common component of published GWAS studies.  One key advantage to using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of their ethnic ancestry.  Using either full GWAS data or a set of ancestry-informative markers (AIMs), PCA can be easily conducted using available software packages like EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA analysis to provide a frame of reference for the ethnic groups.

Once computed, each sample has values that correspond to a position in the new coordinate system, effectively clustering samples together by ethnic similarity.  The results of this analysis are usually plotted/visualized to identify ethnic outliers or to simply examine the structure of the data.  A common problem, however, is that it may take more than the first two principal components to identify groups.

To illustrate, I will plot some PCs generated from 125 AIMs for a recent study of ours.  I generated these using the GCTA software and loaded the top 5 PCs into R using the read.table() function.  I loaded the top 5, but for continental ancestry, I've found that the top 3 are usually enough to separate groups.  The values look something like this:
    new_ruid        pc1            pc2               pc3              pc4               pc5
1    11596  4.10996e-03 -0.002883830  0.003100840 -0.00638232  0.00709780
2     5415  3.22958e-03 -0.000299851 -0.005358910  0.00660643  0.00430520
3    11597 -4.35116e-03  0.013282400  0.006398130  0.01721600 -0.02275470
4     5416  4.01592e-03  0.001408180  0.005077310  0.00159497  0.00394816
5     3111  3.04779e-03 -0.002079510 -0.000127967 -0.00420436  0.01257460
6    11598  6.15318e-06 -0.000279919  0.001060880  0.00606267  0.00954331
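For the record, loading them is a one-liner (the filename here is hypothetical):

pca <- read.table("pca_top5.txt", header = TRUE)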
With the PCs in a dataframe called pca, I can plot the first two using this command:
plot(pca$pc1, pca$pc2)

We might also want to look at the next two PCs:
plot(pca$pc2, pca$pc3)

It's probably best to look at all of them together:
pairs(pca[2:4])


So this is where my mind plays tricks on me.  I can't make much sense out of these plots -- there should be four ethnic groups represented, but it's hard to see who goes where.  To look at all of these dimensions simultaneously, we need a 3D plot.  Now 3D plots (especially 3D scatterplots) aren't highly regarded -- in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots -- but in this case I found them quite useful.

Using a library called rgl, I generated a 3D scatterplot like so:

library(rgl)
plot3d(pca[2:4])


Now, using the mouse I could rotate and play with the cloud of data points, and it became clearer how the ethnic groups sorted out.  Just to double-check my intuition, I ran a model-based clustering algorithm (mclust) on the data.  Different parameters obviously produce different cluster patterns, but I found that using an "ellipsoidal model with equal variances" and a cluster size of 4 identified the groups I thought should be there based on the overlay with the HapMap samples.

library(mclust)
fit <- Mclust(pca[2:4], G = 4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification)
Basically, the red sphere corresponds to the European descent group, the green indicates the admixed African American group, the black corresponds to the Hispanic group, and the blue identifies the Asian descent group.  We are still a bit confused as to why the Asian descent samples don't form a more compact cluster -- it may be due to relatively poor performance of these AIMs in Asian descent groups.   Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups.  The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty.  We can plot this like so:
plot(pca[2:3], cex = fit$uncertainty*10)
I had to scale the uncertainty factor by 10 to make the questionable points more visible in this plot, shown as the hollow circles.  We will likely drop these samples from any stratified analyses.  We can export the cluster assignment by accessing the fit$classification component, and we have our samples assigned to an ethnic group.
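To write those assignments out alongside sample IDs, something like this works (a sketch; the ID column matches the dataframe above, and the output filename is mine):

assignments <- data.frame(new_ruid = pca$new_ruid, cluster = fit$classification)
write.table(assignments, "cluster_assignments.txt", quote = FALSE, row.names = FALSE)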



My thoughts on ICHG 2011

I’m a bit exhausted from a week of excellent science at ICHG. First, let me say that Montreal is a truly remarkable city with fantastic food and a fascinating blend of architectural styles, all making the meeting a fun place to be…. Now on to the genomics – I’ll recap a few of the most exciting sessions I attended. You can find a live-stream of tweets from the meeting by searching the #ICHG2011 and #ICHG hashtags.


On Wednesday, Marylyn Ritchie (@MarylynRitchie) and Nancy Cox organized “Beyond Genome-wide Association Studies”.  Nancy Cox presented some ideas on how to integrate multiple “intermediate” associations for SNPs, such as expression QTLs and newly discovered protein QTLs (more on pQTLs later). This approach, which she called Functional Unit Analysis, would group signals together based on the genes they influence. Nicholas Schork presented some nice examples of the pros and cons of sequence-level annotation algorithms.  Trey Ideker gave a very nice talk illustrating some of the properties of epistasis in yeast protein interaction networks. One of the more striking points he made was that epistasis tends to occur between full protein complexes rather than within elements of the complexes themselves. Marylyn Ritchie presented the ideas behind her ATHENA software for machine learning analysis of genetic data, and Manuel Mattesian from Tim Becker’s group presented the methods in their INTERSNP software for doing large-scale interaction analysis. What was most impressive about this session is that there were clear attempts to incorporate underlying biological complexity into data analysis.

On Thursday, I attended the second Statistical Genetics section, called “Expanding Genome-wide Association Studies”, organized by Saurabh Ghosh and Daniel Shriner. Having recently attended IGES, I feel pretty “up” on newer analysis techniques, but this session had a few talks that sparked my interest. The first three talks were related to haplotype phasing and the issues surrounding computational accuracy and speed. The basic goal of all these methods is to efficiently estimate genotypes for a common set of loci for all samples of a study using a set of reference haplotypes, usually from the HapMap or 1000 Genomes data. Despite these advances, it seems like phasing haplotypes for thousands of samples is still a massive undertaking that requires a high-performance computing cluster. There were several talks about ongoing epidemiological studies, including the Kaiser Permanente UCSF cohort. Neil Risch presented an elegant study design implementing four custom GWAS chips for the four targeted populations. Looks like the data hasn't started to flow from this yet, but when it does we’re sure to learn about lots of interesting ethnic-specific disease effects. My good friend and colleague Dana Crawford presented an in silico GWAS study of hypothyroidism. In her best NPR voice, Dana showed how electronic medical records with GWAS data in the eMERGE network can be re-used to construct entirely new studies nested within the data collected for other specific disease purposes. Her excellent postdoc, Logan Dumitrescu, presented several gene-environment interactions between lipid levels and vitamins A and E from Dana’s EAGLE study. Finally, Paul O’Reilly presented a cool new way to look at multiple phenotypes by essentially flipping a typical regression equation around, estimating coefficients that relate each phenotype in a study to a single SNP genotype as an outcome. This rather clever approach, called MultiPhen, is similar to log-linear models I’ve seen used for transmission-based analysis, and allows you to model the “interaction” among phenotypes in much the same way you would look at SNP interactions.

By far the most interesting talks of the meeting (for me) were in the Genomics section on Gene Expression, organized by Tomi Pastinen and Mark Corbett. Chris Mason started the session off with a fantastic demonstration of the power of RNA-seq. Examining transcriptomes of 14 non-human primate species, they validated many of the computational predictions in the AceView gene build, and illustrated that most “exome” sequencing is probably examining less than half of all transcribed sequences. Rupali Patwardhan talked about a system for examining the impact of promoter and enhancer mutations in whole mice, essentially using mutagenesis screens to localize these regions. Ron Hause presented work on the protein QTLs that Nancy Cox alluded to earlier in the conference. Using a high-throughput form of western blots, they systematically examined levels of over 400 proteins in the Yoruba HapMap cell lines. They also illustrated that only about 50% of eQTLs identified in these lines actually alter protein levels. Stephen Montgomery spoke about the impact of rare genetic variants within a transcript on transcript levels. Essentially he showed an epistatic effect on expression, where transcripts with deleterious alleles are less likely to be expressed – an intuitive and fascinating finding, especially for those considering rare-variant analysis. Athma Pai presented a new class of QTLs that influence mRNA decay rates. By measuring multiple time points using RNA-seq, she found individual-level variants that alter decay, which she calls dQTLs. Veronique Adoue looked at cis-eQTLs relative to transcription factor binding sites using ChIP, and Alfonso Buil showed how genetic variants influence gene expression networks (or correlation among gene expression) across tissue types.

I must say, despite all the awesome work presented in this session, Michael Snyder stole the show with his talk on the “Snyderome” – his own personal -omics profile collected over 21 months. His whole genome was sequenced by Complete Genomics, and processed using Rong Chen and Atul Butte’s risk-o-gram to quantify his disease risk. His profile predicted increased risk of T2D, so he began collecting glucose measures and, lo and behold, he saw a sustained spike in blood glucose levels a few days after a common cold. His interpretation was that an environmental stress knocked him into a pseudo-diabetic state, and his transcriptome and proteome results corroborated this idea. Granted, this is an N of 1, and there is still lots of work to be done before this type of analysis revolutionizes medicine, but the take-home message is salient – multiple -omics are better than one, and everyone’s manifestation of a complex disease is different. This was truly thought-provoking work, and it nicely closed an entire session devoted to understanding the intermediate impact of genetic variants to better understand disease complexity.

This is just my take of a really great meeting -- I'm sure I missed lots of excellent talks.  If you saw something good please leave a comment and share!

Estimating Trait Heritability from GWAS Data

Peter Visscher and colleagues have recently published a flurry of papers employing a new software package called GCTA to estimate the heritability of traits using GWAS data (GCTA stands for Genome-wide Complex Trait Analysis -- clever acronymity!). The tool, supported (and presumably coded) by Jian Yang, is remarkably easy to use, based in part on the familiar PLINK command-line interface. The GCTA homepage provides an excellent walk-through of the available options.



The basic idea is to use GWAS data to estimate the degree of "genetic sharing" or relatedness among the samples, computing what the authors call a genetic relationship matrix (GRM). The degree of genetic sharing among samples is then related to the amount of phenotypic sharing using restricted maximum likelihood (REML). The result is an estimate of the variance explained by the SNPs used to generate the GRM. Full details of the stats, along with all the gory matrix notation, can be found in their software publication.
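In practice this boils down to two short commands, something like the following (a sketch; the filenames are hypothetical, and the flags follow the GCTA documentation):

# Step 1: build the genetic relationship matrix from a PLINK binary fileset
gcta64 --bfile mydata --autosome --make-grm --out mydata

# Step 2: relate the GRM to the phenotype with REML
gcta64 --grm mydata --pheno mydata.phen --reml --out mydata_reml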



The approach has been applied to several disorders studied by the WTCCC and to a recent study of human height. Interestingly, the developers have also used the approach to partition the trait variance across chromosomes, resulting in something similar to population-based variance-components linkage analysis. The approach works for both quantitative and dichotomous traits; however, the authors warn that variance estimates of dichotomous trait liability are influenced by genotyping artifacts.



The package also includes several other handy features, including a relatively easy way to estimate principal components for population structure correction, a GWAS simulation tool, and a regression-based LD mapping tool. Download and play -- a binary is available for Linux, MacOS, and DOS/Windows.

Replication of >180 Genetic Associations with Self-Report Medical Data

DNA genotyping and sequencing are getting cheaper every day. As Oxford Nanopore CTO Clive Brown recently discussed at Genomes Unzipped, when the cost of a full DNA sequence begins to fall below $1000, the value of having that information far outweighs the cost of data generation.

Participant collection and ascertainment, however, isn't getting cheaper any time soon, spurring a burgeoning interest in using DNA biobanks and electronic medical records (EMRs) for genomics research (reviewed here). In fact, this is exactly the focus of the eMERGE network - a consortium of five sites with biobanks linked to electronic medical records for genetic research. The first order of business for the eMERGE network was assessment: can DNA biobanks + EMRs be used for genetic research? This question was answered with a demonstration project using Vanderbilt University's BioVU biobank+EMR. Here, 21 previously reported associations with five complex diseases were tested for association with electronically abstracted phenotypes from BioVU's EMR. This forest plot shows that of the 21 previously reported associations (red), five replicated at a nominal significance threshold, and for the rest, the calculated odds ratios (blue) trended in the same direction as the reported association.



While electronic phenotyping is much cheaper than ascertainment in the traditional sense, it can still be costly and labor-intensive, involving iterative cycles of manual medical record chart review followed by refinement of natural language processing algorithms. In many instances, self-report can be comparably accurate, and much easier to obtain (for example, compare the eMERGE network's hypothyroidism phenotyping algorithm with simply asking the question: "Have you ever been diagnosed with Hashimoto's Thyroiditis?").

This is the approach to genetic research 23andMe is taking. Joyce Tung presented some preliminary results at last year's ASHG conference, and now you can read the preprint of the research paper online at Nature Precedings - "Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data". In this paper the authors amassed self-reported data from >20,000 genotyped 23andMe customers on 50 medical phenotypes and attempted to replicate known findings from the GWAS catalog. Using only self-reported data, the authors were able to replicate, at a nominal significance level, 75% of the previously reported associations that they were powered to detect. Figure 1 from the preprint is similar to the figure above, where blue X's are prior GWAS hits and black dots and lines represent their ORs and 95% CIs:



One might ask whether there is any confounding due to the fact that 23andMe customers can view trait/disease risks before answering questions (see this paper and this discussion by Luke Jostins). The authors investigated this and did not observe a consistent or significant effect of seeing genetic risk results before answering questions.  There's also a good discussion regarding SNPs that failed to replicate, and a general discussion of using self-report from a recontactable genotyped cohort for genetic studies.

Other than 23andMe's customer base, I can't think of any other genotyped, recontactable cohort with an incentive to fill out research surveys for this kind of study. But this team has shown, here and in the past, that this model for performing genetic research works and is substantially cheaper than traditional ascertainment or even mining electronic medical records. I look forward to seeing more research results from this group!

Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data

Resources for Pathway Analysis, Functional Annotation, and Interactome Reconstruction with Omics Data

I just read a helpful paper on pathway analysis and interactome reconstruction:

Tieri, P., de la Fuente, A., Termanini, A., & Franceschi, C. (2011). Integrating Omics Data for Signaling Pathways, Interactome Reconstruction, and Functional Analysis. In Bioinformatics for Omics Data, Methods in Molecular Biology, vol. 719. doi: 10.1007/978-1-61779-027-0. (PubMed).

The authors describe how each of the tools listed below can be used in pathway analysis and functional annotation, and give an example of using several of these resources to map the interactome of the transcription factor NF-kappa-B.

While a more extensive list of hundreds of tools and databases for biological pathway analysis can be found at Pathguide, this looks like a good starting point.

 
  • APID Agile Protein Interaction DataAnalyzer – http://bioinfow.dep.usal.es/apid/index.htm
  • Ariadne Genomics Pathway Studio – http://www.ariadnegenomics.com/products/pathway-studio
  • BIND Biomolecular Interaction Network Database – http://www.bind.ca
  • BioGRID The Biological General Repository for Interaction Datasets – http://www.thebiogrid.org
  • BiologicalNetworks – http://biologicalnetworks.net
  • CellDesigner – http://www.celldesigner.org
  • DIP Database of Interacting Proteins – http://dip.doe-mbi.ucla.edu/dip/Main.cgi
  • GenMAPP Gene Map Annotator and Pathway Profiler – http://www.genmapp.org
  • HPRD Human Protein Reference Database – http://www.hprd.org
  • HUBBA Hub objects analyzer – http://hub.iis.sinica.edu.tw/Hubba
  • Ingenuity Systems – http://www.ingenuity.com
  • KEGG Kyoto Encyclopedia of Genes and Genomes – http://www.genome.jp/kegg
  • MINT the Molecular INTeraction database – http://mint.bio.uniroma2.it/mint
  • NCI-Nature Pathway Interaction Database – http://pid.nci.nih.gov
  • Pathguide: the pathway resource list – http://www.pathguide.org
  • Pathway Commons – http://www.pathwaycommons.org
  • R Project for Statistical Computing – http://www.r-project.org
  • SBW Systems Biology Workbench – http://sbw.sourceforge.net
  • TRANSFAC & TRANSPATH – http://www.gene-regulation.com
  • TRED Transcriptional Regulatory Element Database – http://rulai.cshl.edu/cgi-bin/TRED

Integrating Omics data for signaling pathways, interactome reconstruction, and functional analysis.

More Command-Line Text Munging Utilities

In a previous post I linked to gcol as a quick and intuitive alternative to awk. I just stumbled across yet another set of handy text file manipulation utilities from the creators of the BEAGLE software for GWAS data imputation and analysis. In addition to several command line utilities for converting and formatting BEAGLE files, there are several tools for doing basic text processing tasks on the command line:

  • changecolumn.jar - replace values in a column of an input file.
  • changeline.jar - replace values in a line of an input file.
  • cut.jar - extract columns from a file.
  • filtercolumns.jar - filters columns of input data according to the values in a line.
  • filterlines.jar - filters lines of input data according to the values in a column.
  • paste.jar - pastes together files that have shared initial columns followed by data columns.
  • transpose.jar - transposes rows and columns of an input file. 
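These are plain Java archives, so invocation follows the usual jar pattern. I haven't spelled out the arguments for each tool here, so treat this as a hypothetical example (see the documentation for each tool's actual arguments):

# hypothetical invocation -- argument and I/O conventions vary by tool
java -jar transpose.jar < input.txt > transposed.txt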
Much of what these tools do can probably be emulated with some creative use of Unix commands and pipes. But since these are all Java archives, they should work on any platform, not just Unix/Linux. Hit the link below to see the full list and documentation.

BEAGLE Utilities for text manipulation

PLINK/SEQ for Analyzing Large-Scale Genome Sequencing Data

PLINK/SEQ is an open-source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command-line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users.

PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase. It supports several data types (multiallelic, phased, imputation probabilities, indels, and structural variants), and can handle datasets much larger than what can fit into memory. PLINK/SEQ also comes bundled with several reference databases of gene transcripts and sequence & variation projects, including dbSNP and 1000 Genomes Project data.

As with PLINK, the documentation is good, and there's a tutorial using 1000 Genomes Project data.

PLINK/SEQ - A library for the analysis of genetic variation data

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

Last year I showed you how to create manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify.

I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.

Here's a quick demo.

First, either copy and paste the code from GitHub, or run the following commands in R to download and source the code from GitHub (you can't directly read from https in R, so you have to download the file first, then source it). Note that the command is different on Mac vs. Windows.

Download the function code on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r", method="curl")

Download the function code on Windows (leave out the method="curl" argument):

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r")

Now, source the script containing the functions.

source("./qqman.r")


Next, load some GWAS results, and take a look at the relevant columns (same as above, download first then read locally from disk). This is standard output from PLINK's --assoc option.

Download example data on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz", method="curl")

Download example data on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz")

Read in the sample results, and take a look at the first few lines:

results = read.table(gzfile("./plink.assoc.txt.gz"), header=T)

head(subset(results, select=c(SNP, CHR, BP, P)))

The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded NA before reading into R, which you can do with a quick sed command (example after the data below). Here's what the data looks like:

         SNP CHR        BP       P
1 rs10495434   1 235800006 0.62220
2  rs6689417   1  46100028 0.06195
3  rs3897197   1 143700035 0.10700
4  rs2282450   1 202300047 0.47280
5   rs567279   1  66400050      NA
6 rs11208515   1  64900051 0.53430
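The sed step might look something like this, assuming failed tests are written as -9 surrounded by whitespace (an assumption -- check how your own output codes missing values first; this pattern also won't catch a -9 at the very end of a line):

sed 's/[[:space:]]-9[[:space:]]/ NA /g' plink.assoc > plink.assoc.clean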


Now, create a basic manhattan plot:

manhattan(results)



If you type args(manhattan) you can see the options you can set. Here are a few:

colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.

ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.

cex.x.axis: this can be used to shrink the x-axis labels by setting this value less than 1. This is handy if some of the tick labels aren't showing up because the plot region is too small.

*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.

limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23 (where 23 = X).

suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.

annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. Example below.

... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).

Make a better-looking manhattan plot. Change the plot colors and point shape, and remove the threshold lines:

manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)


Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:

Download a SNP list on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt", method="curl")

Download a SNP list on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt")

Read in the SNP list, and create an annotated plot:

snps_to_highlight = scan("./snps.txt", character())

manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")



Next, zoom in and plot only the results for chromosome 11, still highlighting those SNPs. Notice that the x-axis changes from chromosome to genomic coordinate.

manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chr11")



Finally, make a quantile-quantile plot by passing the qq() function a vector of p-values:

qq(results$P)


Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.

The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. The function source code, data, SNPs of interest, and example code used in this post are all available in the qqman GitHub repository.

Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates

I'm working on a project using next-gen sequencing to fine-map a genetic association in a gene region. Now that I've sequenced the region in a small sample, I'm picking SNPs to genotype in a larger sample. When designing the genotyping assay the lab will need flanking sequence.

This is easy to get for SNPs in dbSNP, but what about novel SNPs? Specifically, given a list of genomic positions where about half of them are potentially novel SNPs in the population I'm studying, how can I quickly and automatically fetch flanking sequence from the reference sequence? I asked how to do this on previously mentioned Biostar, and got several answers within minutes.

Fortunately this can easily be done using R and Bioconductor. For several reasons BioMart wouldn't do what I needed it to do, and I don't know the UCSC/Ensembl schemas well enough to use a query or the Perl API.

This R code below (modified from Adrian's answer on BioStar) does the trick:
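It boils down to something like this (a sketch rather than the original gist: it assumes the hg19 reference, and the function name, chromosome, offset, and coordinates are mine):

library(BSgenome.Hsapiens.UCSC.hg19)

# Return flanking sequence around a position, with the alleles marked in
# the middle -- handy for designing genotyping assays.
getflank <- function(position, alleles = "[N/N]", chr = "chr12", offset = 100) {
  left  <- as.character(getSeq(Hsapiens, chr, position - offset, position - 1))
  right <- as.character(getSeq(Hsapiens, chr, position + 1, position + offset))
  paste(left, alleles, right, sep = "")
}

getflank(25289643, alleles = "[C/T]")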

Prune GWAS data in R

Hansong Wang, our biostats professor here at the Hawaii Cancer Center, generously gave me some R code that goes through a SNP annotation file (i.e., a mapfile) and selects SNPs that are at least a certain specified distance apart. You might want to do this if you're picking a subset of SNPs for PCA, for instance. PLINK has an LD pruning feature, but if you can't load your data into PLINK, this poor man's pruning based on physical distance (not LD) is a quick solution.

Provide the function with a data frame containing column names "chrom" and "position", where the SNPs are ordered by chromosome and position. By default the function selects SNPs that are at least 100kb apart, but you can change this with the optional second argument. The function returns the row indices corresponding to the SNPs you want to keep. Then simply subset your dataset, selecting only those row indices and all columns.
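Here's one way such a function might look (a sketch of the idea, not Hansong's original code):

# Return row indices of SNPs at least min.dist apart, per chromosome.
# Assumes the data frame is sorted by chrom and position.
prune.by.distance <- function(map, min.dist = 1e5) {
  keep <- integer(0)
  for (chr in unique(map$chrom)) {
    last.pos <- -Inf
    for (i in which(map$chrom == chr)) {
      if (map$position[i] - last.pos >= min.dist) {
        keep <- c(keep, i)               # keep this SNP...
        last.pos <- map$position[i]      # ...and measure from it next
      }
    }
  }
  keep
}

# Usage: subset the mapfile to the pruned SNPs
# pruned <- mapfile[prune.by.distance(mapfile), ]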

New GenABEL Website, and more *ABEL software

The *ABEL suite of R packages and software for genetic analysis has grown substantially since the appearance of GenABEL and the previously mentioned ProbABEL R packages. There are now a handful of useful R packages and other software utilities facilitating genome-wide association studies, analysis of imputed data, meta-analysis, efficient data storage, prediction, parallelization, and mixed model methods. The new GenABEL website also has prominent links to extensive documentation in manuals and tutorials. Finally, there's a new forum you can visit if you can't find an answer in the manuals or tutorials.

The GenABEL Project for Statistical Genetics Analysis

Parallelize IBD estimation with PLINK

Obtaining the probability that zero, one, or two alleles are shared identical by descent (IBD) is useful for many reasons in a GWAS analysis. A while back I showed you how to visualize sample relatedness using R and ggplot2, which requires IBD estimates. PLINK's --genome option uses IBS and allele frequencies to infer IBD. While a recent article in Nature Reviews Genetics on IBD and IBS analysis demonstrates potentially superior approaches, PLINK's approach is definitely the easiest because of its superior data management capabilities. The problem with IBD inference is that while computation time is linear with respect to the number of SNPs, it's quadratic (read: SLOW) with respect to the number of samples. With GWAS data on 10,000 samples, (10,000 choose 2) = 49,995,000 pairwise IBD estimates. This can take quite some time to calculate on a single processor.

A developer in Will's lab, Justin Giles, wrote a Perl script which utilizes one of PLINK's advanced features, --genome-lists, which takes two files as arguments. You can read about this feature under the advanced hints section of the PLINK documentation on IBS clustering. Each of these files contains a list of family IDs and individual IDs of samples for whom you'd like to calculate IBD. In other words, you can break up the IBD calculations by groups of samples, instead of requiring a single process to do it all. The Perl script also takes the base filename of your binary pedfile and parses the .fam file to split up the list of individuals into small chunks. The size of these chunks is specified by the user. Assuming you have access to a high-performance computing cluster using Torque/PBS for scheduling and job submission, the script also writes out PBS files that can be used to submit each of the segments to a node on a supercomputing cluster (although this can easily be modified to fit other parallelization frameworks, so modify the script as necessary). The script also needs all the minor allele frequencies (which can easily be obtained with the --freq option in PLINK).

One of the first things the script does is parse and split up your .fam file into chunks of N individuals (where N is set by the user - I used 100, and estimation only took ~15 minutes). This can be accomplished with a simple gawk command:

gawk '{print $1,$2}' data.fam | split -d -a 3 -l 100 - tmp.list

Then the script sets up some PBS scripts (like shell scripts) to run PLINK commands:
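Each PBS script essentially wraps a PLINK call pairing up two of the chunks, along these lines (a sketch; the chunk filenames follow the split command above, and the output names match the list below):

plink --bfile data --read-freq data.frq --genome --genome-lists tmp.list000 tmp.list001 --out data.sub.1.2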



At which point you'll have the output files...

data.sub.1.1.genome
data.sub.1.2.genome
data.sub.1.3.genome
data.sub.1.4.genome
data.sub.2.2.genome
data.sub.2.3.genome
data.sub.2.4.genome
data.sub.3.3.genome
data.sub.3.4.genome
data.sub.4.4.genome


...that you can easily concatenate.

Here's the Perl script below. To run it, give it the full path to your binary pedfile, the number of individuals in each "chunk" to infer IBD between, and the fully qualified path to your .frq file that you get from running plink --freq. If you're not using PBS to submit jobs, you'll have to modify the code a little bit in the main print statement in the middle. If you're not running this in your /scratch/username/ibd directory, you'll want to change that on line 57. You'll also want to change your email address on line 38 if you want to receive emails from your scheduler if you use PBS.



After you submit all these jobs, you can very easily run these commands to concatenate the results and clean up the temporary files:

cat data.sub.*genome > results.genome
rm tmp.list*
rm data.sub.*

Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

While writing my thesis I came across this nice review by Rita Cantor, Kenneth Lange, and Janet Sinsheimer at UCLA, "Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application." Skip the introduction unless you're new to GWAS, in which case you'll probably want to start with this more recent review by Teri Manolio. After skipping the intro you'll find a succinct introduction to meta-analysis for GWAS with lots of very good references, including these among others:

DerSimonian R., Laird N. Meta-analysis in clinical trials. Control. Clin. Trials. 1986;7:177–188.

Fleiss J.L. The statistical basis of meta-analysis. Stat. Methods Med. Res. 1993;2:121–145.

Yesupriya A., Yu W., Clyne M., Gwinn M., Khoury M.J. The continued need to synthesize the results of genetic associations across multiple studies. Genet. Med. 2008;10:633–635.

Lau J., Ioannidis J.P., Schmid C.H. Quantitative synthesis in systematic reviews. Ann. Intern. Med. 1997;127:820–826.

Allison D.B., Schork N.J. Selected methodological issues in meiotic mapping of obesity genes in humans: Issues of power and efficiency. Behav. Genet. 1997;27:401–421.

Ioannidis J.P., Gwinn M., Little J., Higgins J.P., Bernstein J.L., Boffetta P., Bondy M., Bray M.S., Brenchley P.E., Buffler P.A., Human Genome Epidemiology Network and the Network of Investigator Networks. A road map for efficient and reliable human genome epidemiology. Nat. Genet. 2006;38:3–5.

de Bakker P.I., Ferreira M.A., Jia X., Neale B.M., Raychaudhuri S., Voight B.F. Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum. Mol. Genet. 2008;17(R2):R122–R128.

Sagoo G.S., Little J., Higgins J.P., Human Genome Epidemiology Network. Systematic reviews of genetic association studies. PLoS Med. 2009;6:e28.

Zeggini E., Ioannidis J.P. Meta-analysis in genome-wide association studies. Pharmacogenomics. 2009;10:191–201.

Egger M., Smith G.D., Phillips A.N. Meta-analysis: Principles and procedures. BMJ. 1997;315:1533–1537.

Ioannidis J.P., Patsopoulos N.A., Evangelou E. Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE. 2007;2:e841.

This section covers using imputation in meta-analysis, fixed effects versus random effects meta-analysis, canned software for meta-analysis (such as METAL), Bayesian hierarchical approaches, and references to many applications of meta-analysis in GWAS.

After the meta-analysis section there's a nice section on modeling epistasis, or gene-gene interactions, to prioritize associations with links to other reviews of statistical methods, and brief coverage of data mining procedures like CART, MDR, random forests, conditional entropy methods, neural networks, genetic programming, logic regression, pattern mining, Bayesian partitioning, and penalized regression approaches, again with lots of references. This section also covers parameterization of epistatic models, and covers some of the computation and statistical issues you'll face with the dimensionality problem.

Finally, the review concludes with a section on pathway analysis. As the review admits, pathway analysis in GWAS has no set of strict guidelines or best practices, and new approaches arise every day.

While this review is nearly a year old at this point, I think it's a real gem because of all the references it offers, especially in the meta-analysis and epistasis sections.

AJHG: Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

Empowering Personal Genomics by Considering Regulatory Cis-Epistasis and Heterogeneity

Will Bush and I just heard that our paper "Multivariate Analysis of Regulatory SNPs: Empowering Personal Genomics by Considering Cis-Epistasis and Heterogeneity" was accepted for publication and a talk at the Personal Genomics session of the 2011 Pacific Symposium on Biocomputing.

Your humble GGD contributors embarked on our first collaborative paper using genome-wide transcriptome data and genome-wide SNP data from HapMap lymphoblastoid cell lines to examine an alternative mechanism for how epistasis might affect human traits. Many human traits are driven by alterations in gene expression, and it's known that common genetic variation affects the expression of nearby genes. We also know that epistasis is ubiquitous and affects human traits. Combining these three ideas, is it possible that genetic variation can interact epistatically to exert a cis-regulatory effect on the expression of nearby genes? If so, what is the genomic and statistical structure of these epistatically interacting multilocus models? Are genes which are affected by cis-epistasis associated with complex human disease or morphological phenotypes? If so, how might we use this knowledge to guide the reanalysis of existing datasets? We addressed these questions here using experimental data from HapMap cell lines. If you're interested in seeing the paper please email me, or try to catch our talk at PSB (a meeting worth going to!).

Abstract: Understanding how genetic variants impact the regulation and expression of genes is important for forging mechanistic links between variants and phenotypes in personal genomics studies.  In this work, we investigate statistical interactions among variants that alter gene expression and identify 79 genes showing highly significant interaction effects consistent with genetic heterogeneity.  Of the 79 genes, 28 have been linked to phenotypes through previous genomic studies.  We characterize the structural and statistical nature of these 79 cis-epistasis models, and show that interacting regulatory SNPs often lie far apart from each other and can be quite distant from the gene they regulate.  By using cis-epistasis models that account for more variance in gene expression, investigators may improve the power and replicability of their genomics studies, and more accurately estimate an individual's gene expression level, improving phenotype prediction.

Pacific Symposium on Biocomputing 2011

Success of GWAS Approach Demonstrated by Latest Lipids Meta-Analysis

Last year I linked to a series of perspectives in NEJM with contrasting views on the success or failure of GWAS - David Goldstein's paper and Nick Wade's synopsis that soon followed in the New York Times being particularly pessimistic. Earlier this year I was swayed by an essay in Cell by Jon McClellan and Mary-Claire King condemning the common disease/common variant hypothesis and chalking up most GWAS hits to population stratification. The main argument - that the frequency of the risk allele is hypervariable even in European populations - was persuasive until Kai Wang pointed out on this blog, and recently expanded in a correspondence in Cell, that McClellan and King used a cohort with extremely small sample sizes to estimate allele frequencies, and that this locus was no more variable than the average SNP in European populations. (There are a series of three letters in Cell, including a response by McClellan and King, that are definitely worth reading here, here, and here.)

One of the main arguments in David Goldstein's 2009 NEJM paper is that doing GWAS with increasingly larger sample sizes will not yield meaningful discoveries, especially if the newly detected loci explain such a small proportion of the heritability of the trait being studied. A paper published last week in Nature provides empirical evidence refuting such claims. "Biological, clinical and population relevance of 95 loci for blood lipids" by Teslovich et al. (with Goncalo Abecasis, Cristen Willer, Sekar Kathiresan, Leena Peltonen, Kari Stefansson, Yurii Aulchenko, Chiara Sabatti, Robert Hegele, Francis Collins, and many, many other co-authors) presents a meta-analysis of blood lipid levels in over 100,000 samples in multiple ethnic groups. This study identified 95 loci (59 novel) associated with total cholesterol, LDL, HDL, or triglycerides, explaining 10-12% of the variation in these traits. A handful of these loci demonstrated clear clinical and/or biological significance: several are common variants in or near genes harboring rare variants known to cause extreme dyslipidemias, and for others the authors demonstrated altered lipid levels in mice after disturbing the regulation of several of the newly discovered genes. Furthermore, most of the newly discovered loci were significant in other non-European populations with the same direction of association.

This study demonstrates that combining studies by meta-analysis to achieve massive sample sizes, capable of detecting extremely small effects, can result in both clinically and biologically meaningful discoveries using GWAS. This study also demonstrates that most of the significant results are in fact associated with lipid traits across global populations, which has implications for enabling personal genomics / personalized medicine in non-European populations. Furthermore, as Teri Manolio noted in her recent review in NEJM, one cannot equate variance explained with potential clinical importance: the Type II diabetes associated genes PPARG and KCNJ11 and the psoriasis-associated IL12B encode proteins that are drug targets for thiazolidinediones, sulfonylureas, and anti-p40 antibodies respectively, yet all of these associations have odds ratios less than 1.45. So a GWAS with >100,000 samples uncovers new loci with extremely small effects... while these loci alone may not be useful today for treatment or clinical risk stratification, it's difficult to judge the importance of these loci until you perturb the system with a pharmaceutical or some other environmental intervention.

A true testament to the success of GWAS, the paper is a pleasure to read (even though, unfortunately, the real substance of the paper is buried in the 19 tables and 3 figures of the 83-page supplement).

Biological, clinical and population relevance of 95 loci for blood lipids

How to Read a Genome-Wide Association Study (@GenomesUnzipped)

Jeff Barrett (@jcbarret on Twitter) over at Genomes Unzipped (@GenomesUnzipped) has posted a nice guide for the uninitiated on how to read a GWAS paper. Barrett outlines five critical areas that readers should pay attention to: sample size, quality control, confounding (including population substructure), the replication requirement, and biological significance. It would be nice to see a follow-up post like this on things to look out for in studies that investigate other forms of human genetic variation, such as copy number polymorphism, rare variation, or gene-environment interaction.

And this is also a convenient point for me to mention Genomes Unzipped - a collaborative blog covering topics relevant to the personal genomics industry, featuring posts by several of my favorite bloggers, including Daniel MacArthur (of Genetic Future), Luke Jostins (of Genetic Inference), Dan Vorhaus (of Genomics Law Report), Jan Aerts (of Saaien Tist), Jeff Barrett, Caroline Wright, Katherine Morley, and Vincent Plagnol. GNZ, as it's called, has only been live for about two weeks, but looks like a good one to follow as the personal genomics industry begins to mature over the next few years.


Genomes Unzipped: How to Read a Genome-Wide Association Study

Using Expression Data to Mine the "Gray Zone" of GWAS

Researchers in the ENGAGE consortium used a clever technique to leverage genome-wide expression data to select or prioritize genes for GWAS analysis. The investigators published the novel candidate genes for obesity in this month's PLoS Genetics, but I think the method they used here is more interesting.

If you're studying obesity and you find that expression of some gene correlates with BMI, you have a problem: you don't know whether the correlation indicates a causal relationship or whether the changes in gene expression were simply reactive to changes in body composition. This is the case when looking at unrelated individuals - some correlations will be reactive, others potentially causal. However, if you look within identical twin pairs, you know that the correlations you see are reactive: MZ twins are genetically identical, so expression differences between co-twins can't be genetically driven. The authors took an interesting approach, prioritizing for GWAS analysis those genes that were correlated in the unrelated individuals only, and not in the MZ twins.

Following up these "causal" genes in a GWAS analysis, the authors found that the p-value distribution was highly biased away from the null - in other words, more of these genes were associated than you'd expect by chance. The genes dubbed reactive were biased toward the null, i.e., fewer variants in these genes were associated with the phenotype.

While not everyone has easy access to whole-genome expression data on MZ twins before doing a GWAS, I wonder if the idea can be extended to siblings or even more distant relatives, perhaps leveraging the kinship coefficient as a measure of relatedness between two individuals to "nudge" the transcript in question more toward causal versus reactive. Anyhow, check out the paper linked below - it's a very clever idea.

(On a slightly related note, check out this interesting discussion about open access publishing a la PLoS versus traditional scientific publishing)

PLoS Genetics: Use of Genome-Wide Expression Data to Mine the “Gray Zone” of GWA Studies Leads to Novel Candidate Obesity Genes