Pages

Video Tip: Convert Gene IDs with Biomart

I get asked frequently how to convert from one gene identifier to another. This can be tricky, especially when relying on gene symbols, as Will pointed out in a previous post a few years ago. There are several tools that can do this, including DAVID and the previously mentioned new Biomart ID Converter, but I still prefer using the Ensembl Biomart for this because of its added flexibility and annotation.

I've started putting together video screencasts for things like this, especially when several of the core's clients ask the same question. In this example, I'll show you how to quickly convert from the Affymetrix Mouse Gene 1.0 ST microarray probeset IDs to an Ensembl gene ID and gene symbol.



You can also do this programmatically in R using the biomaRt package in Bioconductor.

Redesign by Subtraction

GGD has a new look. I was inspired by Gina Trapani (Smarterware, Lifehacker) to remove any extra lines, links, and other "ink" that doesn't serve any purpose, and I hope the site appears cleaner and easier to read. I also wanted the extra horizontal space for larger images and avoid the dreaded side-scrolling in posts with lots of code like this one. The space will also be handy for short instructional screencasts I've been recording for my clients here in the core, which I'll start posting here soon. Leave a comment if anything appears broken. I hope you like the new look!

find | xargs ... Like a Boss

*Edit March 12* Be sure to look at the comments, especially the commentary on Hacker News - you can supercharge the find|xargs idea by using find|parallel instead.
---

Do you ever discover a trick to do something better, faster, or easier, and wish you could reclaim all the wasted time and effort before your discovery? I've had this happen many times - learning how to use vim, learning about querying relational databases, switching from EndNote to Mendeley, etc.

Now I wish I could reclaim all the time I spent writing perl code to do something that find|xargs can do much more easily. I'll lead you through an example using a problem I ran into and solved easily with find|xargs.

The problem:

I'm doing an RNA-seq experiment where I have three replicates {1,2,3} for two conditions {C,M}. I've run Tophat to align the raw reads to a reference genome, and saved the output of each run to a separate directory. Within those directories I have an alignment (accepted_hits.bam) and some other stuff.



Now, I want to assemble transcripts separately for each alignment. All the alignments are 1 directory deep in the tophat output directory. Furthermore, I want to submit a separate job for each assembly onto our cluster using qsub. In a former life I would have wrapped a perl script around this to write shell scripts that the scheduler would run, and then write a second perl script that would submit all the jobs to the scheduler.

Here's a command that will do it all in one go:

PROCS=8; find `pwd` -name "accepted_hits.bam" | xargs -i echo qsub -l ncpus=$PROCS -- `which cufflinks` -p $PROCS -o {}-cufflinks -g genes.gtf {} | sh

So let's break this down one piece at a time to see what's going on here.

First the find command. If I want to find all the files in the current directory recursively through any subdirectory, I can use the find command with the -name argument. You can see that find is searching "." (the current directory), and is returning the path (relative to ".") for each file it finds.



But if I'm submitting jobs to a scheduler, I want to use the full absolute path. You can modify find's output by telling it to search in "." but give it the full path. Even better, tell find to search in `pwd` (those are backticks, usually above the tab key. The shell will run the pwd command, and insert that output into the find command. See below.



Now, here's where xargs comes in. xargs lets you build new commands based on the standard input. In other words, you can build a command to run on each line that gets piped to xargs. Use the -i option to create a command for each individual line on the pipe, and use {} as a placeholder. The example below should help.



So here's whats going on. In the first step, I'm piping the output of find into xargs, and xargs is creating a new command that will echo "somecommand {}", where {} is a placeholder for what gets piped into xargs. So you can replace somecommand with anycommand -any -options -you -want, and echo all that back out to the STDOUT.

The second part simply pipes whatever is echo'd to the STDIN to the bash shell ( | sh). So bash will run each command it receives on the pipe. Since somecommand doesn't exist, I'm getting an error.

Below, I'm building the command to run cufflinks on each alignment, and dump that output back out to a new directory based on the pattern of the alignment (bam) file name. But since in the next step I want to parallelize this on the cluster, and the cluster won't know that cufflinks is in my path, I need to tell it where to find cufflinks. I could give it the path, but I would rather use the backtick trick I showed you above to let the shell tell the shell where cufflinks resides by using `which cufflinks`.



In the final piece, I'm adding a few more components.



First, I'm setting a shell variable called PROCS. I can access this variable later by using $PROCS. This is how many CPUs I want to allocate to each assembly job. The ";" separates two commands. Instead of using xargs to build a cufflinks command to run at the shell directly, I'm using xargs to build a qsub command that will qsub these jobs to the cluster. To qsub something from the command line, the syntax is

qsub -- myprog -o -p -t arg1 arg2

Where are the PBS directives to qsub (like the number of processors I want, the amount of RAM I require, etc). myprog is the program you're running. -o, -p, -t are options to myprog, and arg1 and arg2 are the arguments to myprog. The two hyphens are required between the qsub options and the commands you actually want to run.

So in the command above, I'm using xargs to build up a qsub command, substituting $PROCS (8 here) for the number of CPUs I require, calling cufflinks from the full path using the backtick trick, telling cufflinks that I want $PROCS (8) processors, use genes.gtf for reference annotation based transcript (RABT) assembly, and run all that cufflinks stuff on the supplied alignment.

In summary, this one-liner will submit six 8-processor jobs. It sure beats writing perl scripts. There are a zillion other uses for piping together find with xargs. If you have a useful one-liner, please share it in the comments. Happy piping!

Pathway Analysis for High-Throughput Genomics Studies

I get a lot of requests in the core about running a "pathway analysis." Someone ran a handful of gene expression arrays, or better yet, ran an RNA-seq experiment (with replicates!). These, and many other kinds of high-throughput assays (GWAS, ChIP-seq, etc.) result in a list of genes and some associated p-value, fold change, or other statistic.

Here's some R code to download public data from a study on susceptibility to colorectal cancer. I realize that this is an oversimplified design - that's not the focus here. You can read more about proper design and contrast matrices in the limma vignette. You can read more about the study and the samples in the paper or on the GEO page.



A thoughtful and thorough analysis doesn't end with a list of genes and P-values. You spent way too much money on an experiment just to end up with a list of genes and p-values at some arbitrary cutoff. I often see investigators going down the list and cherry-picking genes that they think are important or might play a role, and trying to create a story involving those cherry-picked genes. I believe it's better to be more agnostic, taking a data-driven approach to framing results in a functional context.

How do you do this? This review recently published in PLoS Comp Bio is an excellent place to start:

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

I wrote an evaluation of this paper at Faculty of 1000 - here's an expanded version. The paper gives an overview of the available methods for pathway analysis, and categorizes these methods into three functional classes: over-representation analysis, functional class scoring, and topology-based methods. The paper discusses the advantages and limitations of each approach, as well as future development directions.

Over-representation analysis is what you would typically do with something like gene ontology annotations. The Gene Ontology (GO) project is an effort to describe gene products from all organisms using a consistent language (ontologies are formal representations of knowledge domains; GO does this with cell biology). A biological process is typically made up of a group of genes, as opposed to an individual gene alone. The idea of over-representation analysis is that if a biological process is abnormal in a given study, the co-functioning genes are more likely to be selected as a relevant group by the high-throughput screening technologies. For example, if 10% of the your genes selected by a microarray experiment are kinases, as opposed to 1% of the genes in the whole human genome (this is the gene population background) that are kinases, then you have significant over-representation in your genes for kinases. There are many, many tools for doing over-representation analysis using gene ontology terms, but they all rely on the same idea, typically using a hypergeometric test. You can create directed acyclic graphs like the one below, color coding nodes based on the statistical significance of the term overrepresentation in your gene list. See the paper above for a discussion of limitations of over-representation analysis, and also check out this paper on the use and misuse of gene ontology annotations.


Functional Class Scoring. One of the biggest limitations of over-representation analysis is that you have to arbitrarily decide what you call significant and what you don't. If you use an FDR threshold of 0.05 and fold change cutoff of 2, you'll lose all those genes with a fold change of 1.95 or FDR 0.051, which are arguably just as important as the genes just under the arbitrary cutoff. Further, over-representation analyses only use the number of genes and ignore how strongly those genes are associated with whatever you're studying. Pathway analysis methods that the above review classifies as "Functional Class Scoring" methods use all the genes you have as well as their association statistics (e.g. fold change), and compute a running enrichment score for gene groupings (based on some functional knowledge like gene ontology or KEGG pathways). Gene Set Enrichment Analysis is one of the more popular approaches in this category. The plot below shows the kind of results you get from GSEA.  From the GSEA documentation: "The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The top portion of the plot shows the running ES for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set. Gene sets with a distinct peak at the beginning (such as the one shown here) or end of the ranked list are generally the most interesting." So this shows that you have significant enrichment in the cancer phenotype (labeled "LL") for genes involved in the KEGG Spliceosome pathway.

Topology based methods. The above methods don't take into account how genes interact with each other (e.g. activation, inhibition, phosphorylation, direct/indirect interaction, ubiquitination, etc). Pathway topology methods this extra information in computing pathway-level statistics. I've recently started using the bioconductor SPIA package (Signaling Pathway Impact Analysis), which integrates lists of differentially expressed genes, their fold changes, and pathway topology, to identify pathways associated with condition you're studying. The code below runs SPIA on the analysis we did above. Make sure you successfully loaded all the R packages in the code above.

You can read more about the results SPIA is giving you in their paper and vignette. Briefly, each pathway is represented by a point, and the x and y axes are showing increasing evidence for the involvement of this pathway in your disease (the total "perturbation" in the pathway based on your gene expression data).

It's worth noting that all of the above mentioned methods have limitations. We don't fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don't have information about how different isoforms have different effects - which is something we'll get from RNA-seq experiments. Annotations are often incorrect and inaccurate, and we don't have very much cell-type specific or dynamic information about these pathways. Finally, the analysis of large gene lists with the current enrichment tools is still more of an exploratory data-mining procedure rather than a pure statistical solution and analytical endpoint.

Despite the limitations, the kinds of methods discussed here and reviewed elsewhere (see below) are still very useful for extracting biological meaning and framing results from high-throughput experiments in a functional context. The bioinformatics core here at UVA can do any of the kinds of analyses discussed above. Please fill out a consultation request form and I'll be happy to talk to you about what kinds of insight you may be able to glean from your existing data or in an experiment you're planning.

Other useful literature on the subject:


Holmans, P., Green, E. K., Pahwa, J. S., Ferreira, M. a R., Purcell, S. M., Sklar, P., Owen, M. J., et al. (2009). Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. American journal of human genetics, 85(1), 13-24. doi:10.1016/j.ajhg.2009.05.011

Huang, D. W., Sherman, B. T., & Lempicki, R. a. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1), 1-13. doi:10.1093/nar/gkn923

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. (C. A. Ouzounis, Ed.)PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375

Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature reviews. Genetics, 9(7), 509-15. doi:10.1038/nrg2363

Wang, K., Li, M., & Bucan, M. (2007). Pathway-Based Approaches for Analysis of Genomewide Association Studies. American journal of human genetics, 81(6), 1278-1283. doi:10.1086/522374

Yaspan, B. L., & Veatch, O. J. (2011). Strategies for Pathway Analysis from GWAS Data. Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.], Chapter 1(October), Unit1.20. doi:10.1002/0471142905.hg0120s71