Pages

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

Last year I showed you how to create manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify.

I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.

Here's a quick demo.

First, either copy and paste the code from GitHub, or run the following commands in R to download and source the code from GitHub (you can't directly read from https in R, so you have to download the file first, the source it). Note the command is different on Mac vs Windows.

Download the function code on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r", method="curl")

Download the function code on Windows, leave out the method="curl" argument:

download.file("https://raw.github.com/stephenturner/qqman/master/qqman.r", destfile="./qqman.r")

Now, source the script containing the functions.

source("./qqman.r")


Next, load some GWAS results, and take a look at the relevant columns (same as above, download first then read locally from disk). This is standard output from PLINK's --assoc option.

Download example data on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz", method="curl")

Download example data on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/plink.assoc.txt.gz", destfile="./plink.assoc.txt.gz")

Read in the sample results, and take a look at the first few lines:

results = read.table(gzfile("./plink.assoc.txt.gz"), header=T)

head(subset(results, select=c(SNP, CHR, BP, P)))

The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded NA before reading into R. Do this with a quick sed command. Here's what the data looks like:

         SNP CHR        BP       P
1 rs10495434   1 235800006 0.62220
2  rs6689417   1  46100028 0.06195
3  rs3897197   1 143700035 0.10700
4  rs2282450   1 202300047 0.47280
5   rs567279   1  66400050      NA
6 rs11208515   1  64900051 0.53430


Now, create a basic manhattan plot (click the image to enlarge):

manhattan(results)



If you type args(manhattan) you can see the options you can set. Here are a few:

colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.

ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.

cex.x.axis: this can be used to shrink the x-axis labels by setting this value less than 1. This is handy if some of the tick labels aren't showing up because the plot region is too small.

*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.

limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23(x).

suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.

annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. example below.

... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).

Make a better looking manhattan plot. Change the plot colors, point shape, and remove the threshold lines:

manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)


Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:

Download a SNP list on Mac:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt", method="curl")

Download a SNP list on Windows:

download.file("https://raw.github.com/stephenturner/qqman/master/snps.txt", destfile="./snps.txt")

Read in the SNP list, and create an annotated plot:

snps_to_highlight = scan("./snps.txt", character())

manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")



Finally, zoom in and plot only the results for chromosome 11, still highlighting those results. Notice that the x-axis changes from chromosome to genomic coordinate.

manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chr11")



Finally, make a quantile-quantile plot of the p-values. To make a basic qq-plot of the p-values, pass the qq() function a vector of p-values:

qq(results$P)


Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.

The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. The function source code, data, SNPs of interest, and example code used in this post are all available in the qqman GitHub repository.

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier that I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in Virtualbox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools I'd recommend installing the GNU C++ compiler and another development version of a compression library, zlib1g-dev. This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++
sudo apt-get install zlib1g-dev

First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make

Now do the same thing for vcf tools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz 
cd vcftools_0.1.4a/
make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.

Let's say that you wanted to pull all the 1000 genomes data from the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium using Haploview.

First, use tabix to hit the 1000 genomes FTP site, pulling data from the 20080804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix doesn't download the entire 1000 Genomes data and pulls only the sections you need, this is extremely fast. This should take around a minute, depending on your web connection and CPU speeds.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

Using LaTeX for Math Formulas on the Web

I love the idea of using R+LaTeX+Sweave for reproducible research. This is even easier now that R has a jazzy new IDE that supports Sweave syntax highlighting and automatic PDF generation.

I know I'm going to take some flak for saying this, but let's be honest here... If you're working in the biomedical sciences, chances are, your collaborators have never heard of Sweave. Physicians only use LaTeX during surgery. Lots of folks you work with probably think real applied statistics can only be done in SAS (if you're one of them, please see http://www.r-project.org/). Most biomedical journals will only accept MS Word .doc files during manuscript submission. NIH grant applications use a standardized MS Word template.

These are a few of the reasons I don't routinely incorporate LaTeX+Sweave in my analysis workflow.

That said, one of the things LaTeX is really good for is mathematical typesetting. Writing out math formulae using LaTeX is fast, intuitive, and your plain-text code is portable. If you're ever posting a question on stats.stackexchange, editing Wikipedia, or if you're like me and keep your lab notebook online in a private blog online, using LaTeX conventions for typesetting formulae can be extremely handy.

Codecogs.com's Online LaTeX Equation Editor makes it very simple to use HTML to add formulae to your blog or anywhere on the web. The idea's simple - you type in the LaTeX code for the formula you want, e.g.

SS_{err}=\sum_i({y_i-\hat{y}_i})^2

And you'll get this HTML code:

<img src="http://latex.codecogs.com/gif.latex?SS_{err}=\sum_i({y_i-\hat{y}_i})^2" title="SS_{err}=\sum_i({y_i-\hat{y}_i})^2" />

This HTML code generates a hosted image that you can copy and paste anywhere on the web you like. Paste this into the compose window in Blogger, and it looks like this:



Online LaTeX Equation Editor

Sharing Genomics Data: NIH Data Sharing Policies Past, Present, and Future

This looks like an interesting and relevant webinar for anyone receiving federal funding for genetics projects. You can view it online by registering http://www.readytalk.com (Access Code 5345825) or listen by phone: 1-866-740-1260 (Access Code 5345825).

iDASH Webinar Series
University of California, San Diego
Sharing Genomics Data: NIH Data Sharing Policies Past, Present, and Future
Laura Rodriguez, Ph.D.
Director for the Office of Policy, Communication, and Education

ABSTRACT: One of the key tenets of genomics research is rapid and broad access to basic genomic data.  The National Institutes of Health (NIH) is similarly committed to open data sharing practices in order to ensure access to federally-funded data and resources and promote maximum public benefit through the public's biomedical research investment.  In this webinar, the principles underlying NIH's genomic data sharing policies will be reviewed and an overview will be provided of the policy model and expectations for protecting the interests of research participants whose data is contained within primary genomic datasets.

SPEAKER BIO: Laura Lyman Rodriguez, Ph.D., is the Director for the Office of Policy, Communication, and Education and the Senior Advisor to the Director for Research Policy at the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH). Dr. Rodriguez works to develop and implement policy for research initiatives at the NHGRI, as well as trans-NIH programs.  She is particularly interested in the policy and ethics questions related to the inclusion of human research participants in genomics and genetics research.  Among other activities, Dr. Rodriguez has provided leadership for many of the policy development activities pertaining to genomic data sharing and the creation of the database for Genotypes and Phenotypes (dbGaP) at the NIH.
Dr. Rodriguez received her bachelor of science with honors in biology from Washington and Lee University in Virginia and earned a doctorate in cell biology from Baylor College of Medicine in Texas.

Date and Time
Friday, April 15, 2011
11:00AM-11:50AM (PST)

DETAILS ON HOW TO JOIN THE MEETING:
Web: http://www.readytalk.com/ (Join meeting with Access Code 5345825)
Audio: 1-866-740-1260 (Access Code 5345825) 

Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates

I'm working on a project using next-gen sequencing to fine-map a genetic association in a gene region. Now that I've sequenced the region in a small sample, I'm picking SNPs to genotype in a larger sample. When designing the genotyping assay the lab will need flanking sequence.

This is easy to get for SNPs in dbSNP, but what about novel SNPs? Specifically, given a list of genomic positions where about half of them are potentially novel SNPs in the population I'm studying, how can I quickly and automatically fetch flanking sequence from the reference sequence? I asked how to do this on previously mentioned Biostar, and got several answers within minutes.

Fortunately this can easily be done using R and Bioconductor. For several reasons BioMart wouldn't do what I needed it to do, and I don't know the UCSC/Ensembl schemas well enough to use a query or the Perl API.

This R code below (modified from Adrian's answer on BioStar) does the trick:

Monday Links: 23andMe, RStudio, PacBio+Galaxy, Data Science One-Liners, Post-Linkage RFA, SSH

Lately I haven't written as many full length posts as usual, but here's a quick roundup of a few links I've shared on Twitter (@genetics_blog) over the last week:

First, 23andMe is having a big DNA Day Sale ($108) for the kit + 1 year of their personal genome subscription service https://www.23andme.com/.

Previously mentioned R IDE RStudio released a new beta (version 0.93) that includes several new features and bugfixes. Among the new features, my favorites include the ability to customize the pane layout, and the included {manipulate} package that facilitates interactive plotting: http://bit.ly/gR9ir8.

The Pacific Biosciences open-source Analysis suite is out: http://j.mp/f9QP0F (genomeweb subscribers only), and you can watch a video tutorial (free) on Galaxy-PacBio integration here: http://j.mp/i1FzEm.

If you already have existing genotype/phenotype data on families you've collected, you might check out this new RFA from NHLBI and NHGRI: “Life After Linkage: the Future of Family Studies” http://1.usa.gov/dR9gHb: This FOA issued by NHLBI and NHGRI solicits applications which integrate novel molecular data with existing genotype and phenotype data in families to identify and characterize  genes influencing complex disorders.  Applications for this program should have previously examined families with some genotyping and phenotypes available.  Proposals may utilize focused ascertainment from families, for example, using risk profile, extreme phenotypes or families contributing heavily to genomic signals.  The program may support the addition of targeted or whole genome sequencing, exon sequencing, copy number variants (CNVs), expression profiling, metabolomics, epigenomics,  and/or novel biomarkers.  This may require additional sample collection and/or reconsent.  Applications should include a strong analytic component focused on bioinformatics and integration of multiple data types.

Data Science Hand Tools http://oreil.ly/hyhb6Z. This piece includes nice demonstrations of using grep, awk, cut, colrm, sort, uniq, wc, find, xargs, and other simple and flexible UNIX one-liners to do some serious data munging.

A new paper from Lucia Hindorff, Elizabeth Gillanders, and Teri Manolio at NHGRI: Genetic architecture of cancer & complex diseases: Lessons learned & future directions http://1.usa.gov/h28J2A

Bioinformatics special issue includes acompilation of all Bioinformatics papers on Next-Generation Sequencing http://bit.ly/DcfAm

And finally, simple instructions for setting up password-free SSH login http://linuxproblem.org/art_9.html

QR Code on Conference Posters

I found a really nice tip on Postersession.com's blog about putting QR codes in your research posters (via @gw_dailyscan). QR codes, short for quick response, is a two-dimensional bar code that can be read by most smartphones with a camera. The idea is that you can generate a QR code that links to a PDF of the poster, or to your own personal website or bio, eliminating the need for handouts or business cards, and put this in a prominent spot in your poster:


Generating a QR code is easy using Google's URL shortener and a simple URL hack.

  1. First, copy the URL of the website you'd like to generate a QR code for. Head over to goo.gl, and paste in the URL. 
  2. You'll get a shortened URL that will redirect to the URL that you pasted, something that looks like this: http://goo.gl/4Yvle
  3. Now, copy that short URL, and add ".qr" to the end, and enter that in your address bar: http://goo.gl/4Yvle.qr. That will give you an image that you can copy and paste into your poster. 

You can also get statistics on how many people have scanned that QR code and visited your site by adding a "+" to the end of the shortened URL:  http://goo.gl/4Yvle+.