Pages

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

Last year I showed you how to create manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify.

I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.

Here's a quick demo.

First, either copy and paste the code from GitHub, or run the following commands in R to download and source the code from GitHub (you can't directly read from https in R, so you have to download the file first, the source it). Note the command is different on Mac vs Windows.

Download the function code on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r", method="curl")

Download the function code on Windows, leave out the method="curl" argument:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r")

Now, source the script containing the functions.

source("./qqman.r")


Next, load some GWAS results, and take a look at the relevant columns (same as above, download first then read locally from disk). This is standard output from PLINK's --assoc option.

Download example data on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz", method="curl")

Download example data on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz")

Read in the sample results, and take a look at the first few lines:

results = read.table(gzfile("./plink.assoc.txt.gz"), header=T)

head(subset(results, select=c(SNP, CHR, BP, P)))

The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded NA before reading into R. Do this with a quick sed command. Here's what the data looks like:

         SNP CHR        BP       P
1 rs10495434   1 235800006 0.62220
2  rs6689417   1  46100028 0.06195
3  rs3897197   1 143700035 0.10700
4  rs2282450   1 202300047 0.47280
5   rs567279   1  66400050      NA
6 rs11208515   1  64900051 0.53430


Now, create a basic manhattan plot (click the image to enlarge):

manhattan(results)



If you type args(manhattan) you can see the options you can set. Here are a few:

colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.

ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.

cex.x.axis: this can be used to shrink the x-axis labels by setting this value less than 1. This is handy if some of the tick labels aren't showing up because the plot region is too small.

*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.

limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23(x).

suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.

annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. example below.

... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).

Make a better looking manhattan plot. Change the plot colors, point shape, and remove the threshold lines:

manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)


Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:

Download a SNP list on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt", method="curl")

Download a SNP list on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt")

Read in the SNP list, and create an annotated plot:

snps_to_highlight = scan("./snps.txt", character())

manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")



Finally, zoom in and plot only the results for chromosome 11, still highlighting those results. Notice that the x-axis changes from chromosome to genomic coordinate.

manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chr11")



Finally, make a quantile-quantile plot of the p-values. To make a basic qq-plot of the p-values, pass the qq() function a vector of p-values:

qq(results$P)


Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.

The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. The function source code, data, SNPs of interest, and example code used in this post are all available in the qqman GitHub repository.