find | xargs ... Like a Boss

*Edit March 12* Be sure to look at the comments, especially the commentary on Hacker News - you can supercharge the find|xargs idea by using find|parallel instead.
---

Do you ever discover a trick to do something better, faster, or easier, and wish you could reclaim all the wasted time and effort before your discovery? I've had this happen many times - learning how to use vim, learning about querying relational databases, switching from EndNote to Mendeley, etc.

Now I wish I could reclaim all the time I spent writing perl code to do something that find|xargs can do much more easily. I'll lead you through an example using a problem I ran into and solved easily with find|xargs.

The problem:

I'm doing an RNA-seq experiment where I have three replicates {1,2,3} for two conditions {C,M}. I've run Tophat to align the raw reads to a reference genome, and saved the output of each run to a separate directory. Within those directories I have an alignment (accepted_hits.bam) and some other stuff.
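
To give you a concrete picture, the layout looks roughly like this (the directory names, and the exact contents of each tophat output directory, are made up for illustration):

$ ls
C1_tophat/  C2_tophat/  C3_tophat/  M1_tophat/  M2_tophat/  M3_tophat/  genes.gtf
$ ls C1_tophat/
accepted_hits.bam  deletions.bed  insertions.bed  junctions.bed  logs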



Now, I want to assemble transcripts separately for each alignment. All the alignments are one directory deep in the tophat output directory. Furthermore, I want to submit a separate job for each assembly to our cluster using qsub. In a former life I would have written one perl script to generate shell scripts for the scheduler to run, and a second perl script to submit all those jobs to the scheduler.

Here's a command that will do it all in one go:

PROCS=8; find `pwd` -name "accepted_hits.bam" | xargs -i echo qsub -l ncpus=$PROCS -- `which cufflinks` -p $PROCS -o {}-cufflinks -g genes.gtf {} | sh

So let's break this down one piece at a time to see what's going on here.

First, the find command. If I want to find all files with a given name in the current directory, recursively through any subdirectory, I can use find with the -name argument. You can see that find is searching "." (the current directory) and returning the path (relative to ".") for each file it finds.
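
For example, using the hypothetical directories above:

$ find . -name "accepted_hits.bam"
./C1_tophat/accepted_hits.bam
./C2_tophat/accepted_hits.bam
./C3_tophat/accepted_hits.bam
./M1_tophat/accepted_hits.bam
./M2_tophat/accepted_hits.bam
./M3_tophat/accepted_hits.bam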



But if I'm submitting jobs to a scheduler, I want the full absolute path. You can modify find's output by telling it to search in "." but return the full path. Even better, tell find to search in `pwd` (those are backticks, usually above the Tab key). The shell will run the pwd command and insert its output into the find command. See below.
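
Something like this, where /home/turnersd/rnaseq stands in for whatever pwd returns on your system:

$ find `pwd` -name "accepted_hits.bam"
/home/turnersd/rnaseq/C1_tophat/accepted_hits.bam
/home/turnersd/rnaseq/C2_tophat/accepted_hits.bam
/home/turnersd/rnaseq/C3_tophat/accepted_hits.bam
/home/turnersd/rnaseq/M1_tophat/accepted_hits.bam
/home/turnersd/rnaseq/M2_tophat/accepted_hits.bam
/home/turnersd/rnaseq/M3_tophat/accepted_hits.bam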



Now, here's where xargs comes in. xargs lets you build new commands based on the standard input. In other words, you can build a command to run on each line that gets piped to xargs. Use the -i option to create a command for each individual line on the pipe, and use {} as a placeholder. The example below should help.
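
Here's a sketch with a made-up placeholder command (only the first two of the six alignments are shown):

$ find `pwd` -name "accepted_hits.bam" | xargs -i echo somecommand -any -options {}
somecommand -any -options /home/turnersd/rnaseq/C1_tophat/accepted_hits.bam
somecommand -any -options /home/turnersd/rnaseq/C2_tophat/accepted_hits.bam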



So here's what's going on. In the first step, I'm piping the output of find into xargs, and xargs is creating a new command that will echo "somecommand {}", where {} is a placeholder for each line piped into xargs. So you can replace somecommand with anycommand -any -options -you -want, and echo all of that back out to STDOUT.

The second part simply pipes whatever was echoed to STDOUT into a shell ( | sh), so the shell runs each command it receives on the pipe. Since somecommand doesn't exist, I'm getting an error.

Below, I'm building the command to run cufflinks on each alignment, dumping the output into a new directory named after the alignment (bam) file. But since in the next step I want to parallelize this on the cluster, and the cluster won't know that cufflinks is in my path, I need to tell it where to find cufflinks. I could type out the path, but I'd rather use the backtick trick I showed you above and let the shell figure out where cufflinks resides with `which cufflinks`.
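
It looks roughly like this; the paths are made up, and `which cufflinks` expands to wherever cufflinks lives on your system (only the first two alignments are shown):

$ find `pwd` -name "accepted_hits.bam" | xargs -i echo `which cufflinks` -o {}-cufflinks -g genes.gtf {}
/usr/local/bin/cufflinks -o /home/turnersd/rnaseq/C1_tophat/accepted_hits.bam-cufflinks -g genes.gtf /home/turnersd/rnaseq/C1_tophat/accepted_hits.bam
/usr/local/bin/cufflinks -o /home/turnersd/rnaseq/C2_tophat/accepted_hits.bam-cufflinks -g genes.gtf /home/turnersd/rnaseq/C2_tophat/accepted_hits.bam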



In the final piece, I'm adding a few more components.
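
Leaving off the final | sh for a moment shows the qsub commands that will actually be submitted (again with made-up paths, first two alignments only):

$ PROCS=8; find `pwd` -name "accepted_hits.bam" | xargs -i echo qsub -l ncpus=$PROCS -- `which cufflinks` -p $PROCS -o {}-cufflinks -g genes.gtf {}
qsub -l ncpus=8 -- /usr/local/bin/cufflinks -p 8 -o /home/turnersd/rnaseq/C1_tophat/accepted_hits.bam-cufflinks -g genes.gtf /home/turnersd/rnaseq/C1_tophat/accepted_hits.bam
qsub -l ncpus=8 -- /usr/local/bin/cufflinks -p 8 -o /home/turnersd/rnaseq/C2_tophat/accepted_hits.bam-cufflinks -g genes.gtf /home/turnersd/rnaseq/C2_tophat/accepted_hits.bam

Tacking | sh back on the end runs each of those qsub commands, submitting all the jobs.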



First, I'm setting a shell variable called PROCS. I can access this variable later by using $PROCS. This is how many CPUs I want to allocate to each assembly job. The ";" separates two commands. Instead of using xargs to build a cufflinks command to run at the shell directly, I'm using xargs to build a qsub command that will qsub these jobs to the cluster. To qsub something from the command line, the syntax is

qsub <options> -- myprog -o -p -t arg1 arg2

where <options> are the PBS directives to qsub (like the number of processors I want, the amount of RAM I require, etc.), myprog is the program you're running, -o, -p, and -t are options to myprog, and arg1 and arg2 are the arguments to myprog. The two hyphens are required between the qsub options and the command you actually want to run.

So in the command above, I'm using xargs to build up a qsub command, substituting $PROCS (8 here) for the number of CPUs I require, calling cufflinks by its full path using the backtick trick, telling cufflinks that I want $PROCS (8) processors, using genes.gtf for reference annotation based transcript (RABT) assembly, and running all of that on the supplied alignment.

In summary, this one-liner will submit six 8-processor jobs. It sure beats writing perl scripts. There are a zillion other uses for piping together find with xargs. If you have a useful one-liner, please share it in the comments. Happy piping!

How To Get Allele Frequencies and Create a PED file from 1000 Genomes Data

I recently analyzed some next-generation sequencing data, and I first wanted to compare the allele frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier than I thought, as long as you're a little comfortable with the Linux command line.

First, you'll need a Linux system, and two utilities: tabix and vcftools.

I'm virtualizing an Ubuntu Linux system in VirtualBox on my Windows 7 machine. I had a little trouble compiling vcftools on my Ubuntu system out of the box. Before trying to compile tabix and vcftools, I'd recommend installing the GNU C++ compiler and the development headers for the zlib compression library (zlib1g-dev). This is easy in Ubuntu. Just enter these commands at the terminal:

sudo apt-get install g++
sudo apt-get install zlib1g-dev

First, download tabix. I'm giving you the direct link to the most current version as of this writing, but you might go to the respective sourceforge pages to get the most recent version yourself. Use tar to unpack the download, go into the unzipped directory, and type "make" to compile the executable.

wget http://sourceforge.net/projects/samtools/files/tabix/tabix-0.2.3.tar.bz2
tar -jxvf tabix-0.2.3.tar.bz2
cd tabix-0.2.3/
make

Now do the same thing for vcftools:

wget http://sourceforge.net/projects/vcftools/files/vcftools_v0.1.4a.tar.gz
tar -zxvf vcftools_v0.1.4a.tar.gz 
cd vcftools_0.1.4a/
make

The vcftools binary will be in the cpp directory. Copy both the tabix and vcftools executables to wherever you want to run your analysis.
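
For example, assuming you downloaded and built both in your home directory (the analysis directory name here is arbitrary):

mkdir ~/1kg
cp tabix-0.2.3/tabix ~/1kg/
cp vcftools_0.1.4a/cpp/vcftools ~/1kg/
cd ~/1kg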

Let's say that you wanted to pull all the 1000 Genomes data for the CETP gene on chromosome 16, compute allele frequencies, and drop a linkage format PED file so you can look at linkage disequilibrium in Haploview.

First, use tabix to hit the 1000 Genomes FTP site, pulling data from the 20100804 release for the CETP region (chr16:56,995,835-57,017,756), and save that output to a file called genotypes.vcf. Because tabix pulls down only the region you need rather than the entire 1000 Genomes dataset, this is extremely fast. It should take around a minute, depending on your web connection and CPU speed.

./tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf

Not too difficult, right? Now use vcftools (which works a lot like plink) to compute allele frequencies. This should take less than one second.

./vcftools --vcf genotypes.vcf --freq --out allelefrequencies

Finally, use vcftools to create a linkage format PED and MAP file that you can use in PLINK or Haploview. This took about 30 seconds for me.

./vcftools --vcf genotypes.vcf --plink --out plinkformat

That's it. It looks like you can also dig around in the supporting directory on the FTP site and pull out genotypes for specific ethnic groups as well (EUR, AFR, and ASN HapMap populations).

Monday Links: 23andMe, RStudio, PacBio+Galaxy, Data Science One-Liners, Post-Linkage RFA, SSH

Lately I haven't written as many full length posts as usual, but here's a quick roundup of a few links I've shared on Twitter (@genetics_blog) over the last week:

First, 23andMe is having a big DNA Day Sale ($108) for the kit + 1 year of their personal genome subscription service https://www.23andme.com/.

Previously mentioned R IDE RStudio released a new beta (version 0.93) that includes several new features and bugfixes. Among the new features, my favorites include the ability to customize the pane layout, and the included {manipulate} package that facilitates interactive plotting: http://bit.ly/gR9ir8.

The Pacific Biosciences open-source Analysis suite is out: http://j.mp/f9QP0F (genomeweb subscribers only), and you can watch a video tutorial (free) on Galaxy-PacBio integration here: http://j.mp/i1FzEm.

If you already have existing genotype/phenotype data on families you've collected, you might check out this new RFA from NHLBI and NHGRI: “Life After Linkage: the Future of Family Studies” http://1.usa.gov/dR9gHb: This FOA issued by NHLBI and NHGRI solicits applications which integrate novel molecular data with existing genotype and phenotype data in families to identify and characterize  genes influencing complex disorders.  Applications for this program should have previously examined families with some genotyping and phenotypes available.  Proposals may utilize focused ascertainment from families, for example, using risk profile, extreme phenotypes or families contributing heavily to genomic signals.  The program may support the addition of targeted or whole genome sequencing, exon sequencing, copy number variants (CNVs), expression profiling, metabolomics, epigenomics,  and/or novel biomarkers.  This may require additional sample collection and/or reconsent.  Applications should include a strong analytic component focused on bioinformatics and integration of multiple data types.

Data Science Hand Tools http://oreil.ly/hyhb6Z. This piece includes nice demonstrations of using grep, awk, cut, colrm, sort, uniq, wc, find, xargs, and other simple and flexible UNIX one-liners to do some serious data munging.

A new paper from Lucia Hindorff, Elizabeth Gillanders, and Teri Manolio at NHGRI: Genetic architecture of cancer & complex diseases: Lessons learned & future directions http://1.usa.gov/h28J2A

A Bioinformatics special issue includes a compilation of all Bioinformatics papers on next-generation sequencing http://bit.ly/DcfAm

And finally, simple instructions for setting up password-free SSH login http://linuxproblem.org/art_9.html

Unix and Perl for Biologists

This looks like a must-read for anyone starting out in computational biology without extensive experience at the command line.  The 135-page document linked at the bottom of the Google Group page looks like an excellent primer with lots of examples that could probably be completed in a day or two, and provides a great start for working in a Linux/Unix environment and programming with Perl. This started out as a graduate student course at UC Davis, and is now freely available for anyone who wants to learn Unix and Perl. Also, don't forget about the printable linux command line cheat sheet I posted here long ago.

Google Groups: Unix and Perl for Biologists

Use PuTTY and XMing to see Linux graphics via SSH on your Windows computer

Do you use SSH to connect to a remote Linux machine from your local Windows computer? Ever needed to run a program on that Linux machine that displays graphical output or uses a GUI? I was in this position last week, trying to make ggplot2 figures in R from an analysis of GWAS data that required a 64-bit Linux machine with more RAM than my 32-bit Windows machine can see.

If you try plotting something in R on a Linux machine over a plain SSH session, you'll get this nasty error message:

Error in function (display = "", width, height, pointsize, gamma, bg,: 
X11 I/O error while opening X11 connection to 'localhost:10.0'

Turns out there's a very easy way to see graphical output over your SSH session. First, if you're not already using PuTTY for SSH, download putty.exe from here. Next, download, install, and run Xming. While Xming is running in your system tray, log into a Linux server using PuTTY as you normally would. Then type this command at the terminal to log into the Linux server of your choice (here, pepperjack), with -X (uppercase) to enable X11 forwarding.

ssh -X pepperjack.mc.vanderbilt.edu

If all goes well, you should now be able to use programs that display graphical output or interfaces, even though they're running on the remote Linux machine rather than your local Windows computer.
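
A quick way to test that forwarding works (assuming the standard xclock demo program is installed on the remote machine) is to launch it in the background; a small clock window should pop up on your Windows desktop:

xclock &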




Xming - PC X Server

Xming download link on SourceForge

Sync your home directories on ACCRE and the local Linux servers (a.k.a. "the cheeses")

Vanderbilt ACCRE users with PCs only...

If you use ACCRE to run multi-processor jobs you'll be glad to know that they now allow you to map your home directory to your local desktop using Samba (so you can access your files through My Computer as you normally would with local files).  Just submit a help request on their website and they'll get you set up.

Now if you have both your ACCRE home and your home on the cheeses mapped, you can easily sync the files between the two.  Download Microsoft's free SyncToy to do the job.  It's pretty dead simple to set up, and one click will synchronize files between the two servers.


I didn't want to synchronize everything, so I set it up to only sync directories containing perl scripts and other programs that I commonly use on both machines. SyncToy seems pretty useful for backing up your files, too.

Microsoft SyncToy

Ask ACCRE to let you map your home

Get the full path to a file in Linux / Unix

In the last post I showed you how to point to a file in Windows and get the full path copied to your clipboard. I wanted to come up with something similar for a Linux environment. This is helpful on Vampire/ACCRE because you have to fully qualify the path to every file you use when you submit jobs with PBS scripts. So I wrote a little perl script:

#!/usr/bin/perl
# Grab the current working directory and strip the trailing newline
chomp($pwd=`pwd`);
# With no arguments, just print the working directory
print "$pwd\n" if @ARGV==0;
# Otherwise, print the full path to each file given as an argument
foreach (@ARGV) {print "$pwd/$_\n";}

You can copy this from me; just put it in your bin directory, like this:

cp /home/turnersd/bin/path ~/bin

Make it executable, like this:

chmod +x ~/bin/path

Here it is in action. Let's say I wanted to print the full path to all the .txt files in the current directory. Call the program with the files whose paths you want as arguments:


[turnersd@vmps21 replicated]$ ls
parseitnow.pbs
parsing_program.pl
replic.txt
tophits.txt
 
[turnersd@vmps21 replicated]$ path *.txt
/projects/HDL/epistasis/replicated/replic.txt
/projects/HDL/epistasis/replicated/tophits.txt


Sure, it's only a little bit quicker than typing pwd, copying that, then spelling out the filenames.  But if you have long filenames or lots of filenames you want to copy, this should get things done faster.  Enjoy.

Convert PLINK output to CSV

I tip my hat to Will for showing me this little command line trick. PLINK's output looks nice when you print it to the screen, but it can be a pain to load into Excel or a MySQL database because the fields are separated by a variable number of spaces. This little command line trick converts a variable-space delimited PLINK output file into a comma delimited file.

You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.

cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv

You'll have created a new file called mendel.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.

Before:

turnersd@provolone~: cat mendel.txt
FID PAT MAT CHLD N
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206

After:

turnersd@provolone~: cat mendel.csv
FID,PAT,MAT,CHLD,N
1089,16223073,16223062,1,149
1116,16233564,16233589,1,114
123,16230767,16230725,2,189
12,16221778,16221803,1,116
12,16221805,16221822,1,98
12,16230486,16230496,1,76
12,16231205,16232111,2,180
134,16222939,16222945,2,140
1758,16230755,16231121,2,206


If you're interested in the details of what this is doing, here you go:

First, you cat the contents of the file and pipe it to a command called sed. The text between the single quotes in the sed command is a regular expression, which is similar to doing a find-and-replace in MS Word. It searches for whatever is between the first pair of slashes and replaces it with whatever is between the next two slashes. You need the -r option, the "s" before the first slash, and the "g" after the last slash to make it work right.

/^\s+// is the first regular expression. \s is special and means whitespace; \s+ means any amount of whitespace, and the ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so any leading whitespace is replaced with nothing. This trims whitespace from the beginning of the line, which is important because the next part turns any remaining whitespace into a comma, and we don't want the line to start with a comma.

/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.
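
To see both expressions working together, you can feed a single record from the example above through the same pipeline with echo (first line is the command, second is the output):

echo "  1089 16223073  16223062 1  149" | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g'
1089,16223073,16223062,1,149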

Linux tip: history and !

Ever find yourself trying to remember a series of steps you recently did in Linux? Try typing the command "history" at the command line (without the quotes). You'll see a long list of your most recently used commands. It looks like this:

1018 ls
1019 cd
1020 cd /scratch/turnersd/
1021 ls
1022 cd
1023 grep -P "\s(10|9|8)\s" /scratch/turnersd/alz/parsed.txt | awk '{print $1"\n"$2}' | sort | uniq | perl -pi -e 's/RS/rs/g'
1024 history


Which brings me to my second tip, ! commands. Notice that when you type history the commands are numbered. 1023 in particular was a long string of commands I wouldn't want to retype over and over. Fortunately Linux lets me repeat that command any time I want just by typing an exclamation point followed by the number, like this — !1023 — at the command line, which does the same thing as typing it in the long way.
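
For example, running !1023 at the prompt expands to the full command (bash prints the expanded line before executing it):

$ !1023
grep -P "\s(10|9|8)\s" /scratch/turnersd/alz/parsed.txt | awk '{print $1"\n"$2}' | sort | uniq | perl -pi -e 's/RS/rs/g'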

Linux Command Du Jour: time

I briefly mentioned "time" in the previously posted Linux command line cheat sheet, but I can't overstate its utility, especially for ACCRE/Vampire users. The time command does exactly what it sounds like: it times how long anything you run at the command line takes to finish. And it couldn't be easier to use. Just type the word "time" before whatever you would normally type at the command line.

For example, instead of:

> ./sMDR run.cfg

Type this in instead:

> time ./sMDR run.cfg

The MDR analysis (or whatever other command) runs just as before, but now a couple of extra lines are displayed telling you how long the command took to execute. (Depending on whether you're doing this on the local cheeses or on Vampire, the output will look slightly different, but you want to look at the "real" time.) If you don't have anything to run right now, fire up a terminal window and just try "time ls". Obviously it won't take long to run, but you can see how the time command works.
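
After the usual directory listing, the output looks something like this (the numbers here are made up, and this is the format printed by the bash built-in time; /usr/bin/time looks a bit different):

> time ls

real    0m0.004s
user    0m0.001s
sys     0m0.002s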

Where this becomes very useful is if you are preparing to run analyses on Vampire, where you must estimate the time it takes to execute a command or analysis. You can time smaller versions of a large analysis for a better estimate of how long something will take. For more information, type "man time" at the command line.

Ditch Excel for AWK!

How often have you needed to extract a certain column from a file, or strip out only a few individuals from a ped file? This is often done in Excel, but that requires loading the data, manipulating it, then saving it back into an acceptable format, which may mean converting tabs to spaces or other aggravating things. In many circumstances a powerful Linux command, AWK, can accomplish the same thing in far fewer steps (at least once you get the hang of it).

AWK is practically its own language, and in fact portions of Perl are based on AWK. Let's say you have a file called text.txt and you want to find all lines of that file that contain the word "The":

> awk '/The/' text.txt

Or you'd like to see all lines that start with "rs":

> awk '/^rs/' text.txt

Or perhaps most usefully, you want to strip the top 5 lines out of a file:

> awk 'NR > 5' text.txt
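
And to pull out a single column, the kind of extraction mentioned at the top, printing just the second whitespace-delimited field of every line looks like this:

> awk '{print $2}' text.txt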

This just scratches the surface of course... for a good tutorial with examples, check out this site:
http://www.vectorsite.net/tsawk_1.html#m2

I'll also look into setting up a post with AWK snippets for common useful procedures...

Will

Write Your First Perl Script

And it will do way more than display "Hello, world!" to the screen. An anonymous commenter on one of our Linux posts recently asked how to write scripts that will automate the same analysis on multiple files. While there are potentially several ways to do this, perl will almost always get the job done. Here I'll pose a situation you may run into where perl may help, and we'll break down what this simple perl script does, line by line.

Let's say you want to run something like plink or MDR, or any other program that you call at the command line with a configuration or batch file containing the details of the analysis. For this simple case, let's pretend we're running MDR on a dataset we've stratified three ways by race. We have three MDR config files: afr-am.cfg, caucasian.cfg, and hispanic.cfg, each of which will run MDR on the respective dataset. Now, we could manually run MDR three separate times, and in fact, in this scenario that may be easier. But when you need to run dozens, or maybe hundreds, of analyses, you'll need a way to automate things. Check out this perl script, and I'll explain what it's doing below. Fire up something like nano or pico, copy/paste this, and save the file as "runMDR.pl":

foreach $current_cfg (@ARGV)
{
# This will run sMDR
`./sMDR $current_cfg`;
}
# Hooray, we're done!

Now, if you call this script from the command line like this, giving the config files you want to run as arguments to the script, it will run sMDR on all three datasets, one after the other:

> perl runMDR.pl afr-am.cfg caucasian.cfg hispanic.cfg

You could also use the asterisk to pass everything that ends with ".cfg" as an argument to the script:

> perl runMDR.pl *.cfg

Okay, let's break this down, step by step.
  1. First, some syntax. Perl ignores everything on a line after the # sign; this lets you comment your code so you can remember what it does later. The little ` marks on the 4th line are backticks, usually found above the Tab key on your keyboard. And that semicolon is important.
  2. @ARGV is an array that contains the arguments you pass to the program (the MDR config files here), and $current_cfg is a variable that takes on each element of @ARGV, one at a time.
  3. Each time $current_cfg takes on a new value, perl executes the code between the curly braces. So the first time through, perl executes `./sMDR afr-am.cfg`;. Whatever is between the backticks is executed exactly as if you had typed it into the shell yourself. Here, I'm assuming you have sMDR and afr-am.cfg in the current directory.
  4. Once perl executes the block of code between the braces for each element of @ARGV, it quits, and now you'll have results for all three analyses.
A few final thoughts... If the stuff you're automating is going to take a while to complete, you may consider checking out Greg's previous tutorial on screen. Next, if whatever program you're running over and over again displays output to the screen, you'll have to add an extra line to see that output yourself, or write that output to a file. Also, don't forget your comments! Perl can be quick to write but difficult to understand later on, so comment your scripts well. Finally, if you need more help and you can't find it here or here, many of the folks on this hall have used perl for some time, so ask around!
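
On that second point, here's a minimal sketch of one way to do it: capture the program's output from the backticks into a variable and print it (you could just as easily write it to a file instead):

foreach $current_cfg (@ARGV)
{
# Run sMDR and capture everything it prints to STDOUT...
$output = `./sMDR $current_cfg`;
# ...then print it so you can actually see it
print $output;
}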

Screen How-to

`screen` is one of the most useful system tools I've ever used. It allows you to begin a process and then detach from the process to let it continue to run in the background. You can later re-attach to the screen to check progress, even from a remote location.

To install screen on a Debian-based system, you would do:

apt-get install screen

or for a RedHat-based system:

yum install screen

Some usage examples:

To start a job inside a screen, a Ruby script for example, you would issue a command something like this:

> screen my_script.rb

The job begins to run exactly as it would if you had not used a screen.

To detach from the screen you would use these keystrokes:

> Ctrl + a, Ctrl + d.

After this the Ruby script is still running.

To reconnect to the running screen you type:

> screen -dr

The 'd' means detach and the 'r' means re-attach to the current terminal.

If you have lots of screens running, you'll be prompted to specify which screen you want to re-attach to your terminal. For example:

> screen -dr
There are several suitable screens on:
5190.pts-0.pluto (03/18/2009 01:36:22 PM) (Detached)
5134.pts-0.pluto (03/18/2009 01:14:08 PM) (Detached)

Type "screen [-d] -r [pid.]tty.host" to resume one of them.


At this point you have to specify like this:

screen -dr 5190.pts-0.pluto

or

screen -dr 5134.pts-0.pluto
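
A related trick, if those pid.tty.host strings are hard to remember: you can name a session when you start it with -S and then re-attach by that name (the name "myjob" below is just an example):

screen -S myjob my_script.rb
screen -r myjob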

I find screen most useful for starting long-running jobs on remote servers. I can start the job, then log out and let it run without worrying about what my local system is doing. I can reboot or log off without any issues. Later I can re-attach the screen to my terminal to check progress as needed. When the job is done, so is the screen; they are self-cleaning. :)

More info can be found here:

http://www.linuxmanpages.com/man1/screen.1.php

Linux Tutorial

Last week I posted a one-page reference guide that gives a short description of the Linux commands I most commonly use. To accompany this, here is a detailed walkthrough filled with examples that will introduce any beginner to the basics of using Linux in just a few hours. Many thanks to Eric Torstenson in the Ritchie lab for putting this together.

Linux Tutorial (PDF)

Linux Command Line Cheat Sheet

Whenever we have new students rotating through our lab who've never used Linux, I always end up scrounging around the world wide series-of-tubes only to find some command line reference that's not really useful for students. So I made my own one-page reference guide that gives a basic description of 99% of the commands most of you will use. Feel free to distribute.

Linux / Unix Command Line Cheat Sheet (PDF)