Use plyr instead of _apply() in R
Enter plyr.
As I mentioned previously, the plyr functions (ddply, in particular), are intuitive, usually returning the result that you wanted. The ddply function splits up your dataset based on one or more grouping variables, applies some function or statistic, and summarizes the results returning a dataframe.
Here's JD Long's screencast showing how plyr makes a task like this easy where the apply function fails.
Cerebral Mastication - Struggling with apply() in R
Capture system commands as R objects with system(..., intern=T)
This command issues "ls *_hdl*.qassoc" to the system, just as if you were typing it in at the terminal. The intern=TRUE tells the system command to treat the output as an R object. I can store this in myfiles, then do some processing on myfiles, which is a vector containing all the filenames of files I want to process.
[1] "0-all.01_hdl_modeled.qassoc"
[2] "0-all.02_hdl_modeled_polyresid.qassoc"
[3] "0-all.03_hdl_med.qassoc"
[4] "0-all.04_hdl_med_polyresid.qassoc"
[5] "0-all.05_hdl_med_smokage_polyresid.qassoc"
[6] "1-male.01_hdl_modeled.qassoc"
[7] "1-male.02_hdl_modeled_polyresid.qassoc"
[8] "1-male.03_hdl_med.qassoc"
[9] "1-male.04_hdl_med_polyresid.qassoc"
[10] "1-male.05_hdl_med_smokage_polyresid.qassoc"
Sync files across multiple computers with Dropbox (PC, Mac & Linux too!)
Once you sign up and install on all your computers, you'll have a special folder, where if you save something there on one computer, it is automatically created and stays synchronized in the same folder on all your other computers. What's more, if you use someone else's computer, you can access all your files through a web interface because they're all securely backed up online. I've been using this for a while now to sync all the papers I'm working on, RefMan/EndNote databases, config files, and R functions I reuse all the time. You can also create "public" folders. Put something here, and you can get a direct link to the file online to share with other folks. For example, here's a link to some R code I wrote to use ggplot2 to make manhattan plots and QQ-plots for every PLINK output file in the current directory (I'm hoping to clean this code up and include this with some other functions I've written into a package on CRAN soon).
I can't recommend this little app enough. If you're still not convinced, check out this short video that explains what Dropbox is all about and shows of just how simple it is to use.
You get a whopping 2GB for free, but if you use the registration link provided below, you'll get an extra free 1/4GB. Happy holidays from GGD, and I'll catch up with you all next week!
Dropbox - Secure online backup and synchronization
Review: The challenges of sequencing by synthesis
Review: The challenges of sequencing by synthesis (Nature Biotechnology)
Recent improvements to Pubget
1. Citation matching. Pubget's citation matcher seems to work better than Pubmed most of the time. Try going to Pubget and pasting any of these random citations into the search bar:
J Biol Chem 277: 30738-30745
Nucleic Acids Res 2004;32:4812-20.
Evol. Biol. 7, 214 (2007).
2. The PaperPlane bookmarklet. Go here and drag the link to your bookmark toolbar. Now, if you're searching from pubmed, click the bookmarklet for one-click access to the PDF.
3. If you have a long list of PMIDs, separate them with commas and you can paste them directly into the search bar.
Pubget (Vanderbilt institutional link)
Pubget (If you're anywhere else)
Seminar announcement: A Multivariate Methodology for Analyzing Genome-wide Association Studies
Department of Biostatistics Seminar/Workshop Series: A Multivariate Methodology for Analyzing Genome-wide Association Studies, by Janice Brodsky, PhD, UCLA.
Wednesday, December 16, 1:30-2:30pm, MRB III Conference Room 1220
Intended Audience: Persons interested in applied statistics, statistical theory, epidemiology, health services research, clinical trials methodology, statistical computing, statistical graphics, R users or potential users
In the last few years, high-dimensional genome-wide association (GWA) studies have become a common tool in genetics for investigating which genes are associated with physical traits. However, the results of many GWA studies have fewer genes than expected or even no genes at all. This does not necessarily indicate that there are no genetic associations in the data: genes with weaker associations or which only work in groups will be missed with the standard GWA statistical analysis. We present a multivariate methodology for analyzing GWA data which is designed to handle weaker signals, dependent data, and multicollinearity. We applied this method to a large GWA study, and the results were consistent with previously performed studies. We also discuss extensions of the methodology.
Follow GGD on Twitter @genetics_blog
Browse R Graphics with the R Graph Gallery and the R Graphical Manual
Now let's say you've seen a certain graphic before, and you want to find the package you need to download and which function you should use to make the plot. That's where the R Graph Gallery and the R Graphical Manual can become very useful. Both sites give you thumbnail previews of graphics produced by functions bundled with certain R packages, code for producing the graphic, and which R packages you need to download for the functions used to create the graphic. The R Graphical Manual is much more comprehensive, and is categorized based on CRAN Task Views (CTV) categories (check out all 29 pages of graphics in the Genetics task view).
R Graphical Manual
R Graph Gallery
Sequencing technologies — the next generation
Following up on last week's coverage of the Genotyping Portal, check out this new review article on next-generation sequencing in Nature Reviews Genetics. One major focus of this paper is that the next generation of sequencing platforms each use fundamentally different technologies. Because of this, it's likely that multiple platforms will coexist in the marketplace, and different platforms will have clear advantages over others for particular biological applications. The paper has some nice figures illustrating how the technology works in sequencing by reversible terminators used by Illumina/Solexa and Helicos BioSciences, emulsion PCR used by Life/APG's SOLiD ligation platform and the Roche/454 Pyrosequencing system, and the highly-anticipated real-time single-molecule sequencing from Pacific Biosciences. Finally, there's a table giving the pros, cons, biological applications, cost, read length, run time, and references for each of the next-gen sequencing applications. Finally, a revealing piece of information I found in the last table showing sequencing statistics on personal genomes shows that the sequencing of Stephen Quake's genome with Helicos a few months ago cost only $48,000, a decrease of several orders of magnitude compared to the sequencing of J. Craig Venter's genome (Sanger), which cost an estimated $70,000,000 just a few years ago.
The author of the paper, Michael L. Metzker, is an associate professor of genetics at Baylor College of Medicine, a senior manager at the Human Genome Sequencing Center at Baylor, and President & CEO of LaserGen, Inc., Houston, TX.
Sequencing Technologies - The Next Generation (NRG AOP)
Genotyping Portal: A comprehensive (and freely available) online resource about methods for DNA genotyping, screening and sequencing
Diego Forero has compiled a comprehensive list of primary publications on commonly used SNP genotyping and DNA sequencing technologies (including SNP arrays, Sequenom, TaqMan, Pyrosequencing, Molecular Beacons, FP-TDI, Invader, xMAP, SNaPshot, SNPlex, Sanger, 454, Illumina, Helicos, SOLiD, Complete Genomics, Bisulfite sequencing, and others). Also included here are links to review articles, protocols, and links to manufacturers of reagents and equipment. Where available, links are included to open access versions of the papers on PubMed Central.
This is an excellent resource for anyone who is generally interested in how these technologies work. For 2nd year grad students at Vanderbilt, you will be asked about some of these technologies on your qualifying exam!
Genotyping Portal: A comprehensive (and freely available) online resource about methods for DNA genotyping, screening and sequencing
Use PuTTY and XMing to see Linux graphics via SSH on your Windows computer
You try plotting something in R on a Linux machine in an SSH session you'll get this nasty error message:
Turns out there's a very easy way to see graphical output over your SSH terminal. First, if you're not already using PuTTY for SSH, download putty.exe from here. Next, download, install, and run Xming. While Xming is running in your system tray, log into the Linux server as you normally would using PuTTY. Then type this command at the terminal to log into the linux server of your choice (here, pepperjack), with the -X (uppercase) to enable X11 forwarding.
If all goes well you should now be able to use programs that utilize graphical output or interfaces, which are running on the remote Linux machine rather than your local windows computer.
Xming - PC X Server
Xming download link on SourceForge
Get Started with Machine Learning in R
For more, check out the link below.
A beautiful WWW: Guide to Getting Started in Machine Learning
Use Doodle to Schedule Meetings
Doodle: easy scheduling
NYT: SAS threatened by R
NYT: At a Software Powerhouse, the Good Life Is Under Siege
NYT Slideshow: At SAS, Taking Care of Employees Is Good Business
Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat
Looks like the retreat is free to anyone who registers using the form on the website below. I'm going, hope to see a few of you there.
Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat
Seminar: Reproducible Research with R, LaTeX, & Sweave
Looks like her slides as well as much more introductory material on R, Latex, and Sweave are on her website.
Reproducible Research with R, LaTeX, & Sweave
QQ plots of p-values in R using ggplot2
Of course you can use R's built-in qqplot() function, but I could never figure out a way to add the diagonal using base graphics. To get the function I wrote to make qqplots, copy and paste this into your R session:
# Thanks to Daniel Shriner at NHGRI for providing this code for creating expected and observed values
o = -log10(sort(pvector,decreasing=F))
e = -log10( 1:length(o)/length(o) )
# you could use base graphics
#plot(e,o,pch=19,cex=0.25, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))),xlim=c(0,max(e)),ylim=c(0,max(e)))
#lines(e,e,col="red")
#You'll need ggplot2 installed to do the rest
plot=qplot(e,o, xlim=c(0,max(e)), ylim=c(0,max(o))) + stat_abline(intercept=0,slope=1, col="red")
plot=plot+opts(title=title)
plot=plot+scale_x_continuous(name=expression(Expected~~-log[10](italic(p))))
plot=plot+scale_y_continuous(name=expression(Observed~~-log[10](italic(p))))
if (spartan) plot=plot+opts(panel.background=theme_rect(col="grey50"), panel.grid.minor=theme_blank())
plot
}
Also, make sure you have ggplot2 installed:
If you already have it installed, load it:
library(ggplot2)
The function takes a vector of p-values, optionally a title, and a third option to make the plot black and white. You can check the default arguments to the function by typing args(qq).
qq(data$Pvals, title="My Quantile-Quantile Plot")
You can also make the plot more "Spartan" by removing the title (setting it to NULL) and by making the color scheme black on white:
qq(data$Pvals, title=NULL, spartan=TRUE)
Stay tuned - I'm also putting together a way to import your PLINK output files and make better Manhattan plots using R and ggplot2 than you ever could with Haploview.
Update Tuesday, November 10, 2009: A big tip of the hat to the anonymous commenter who pointed out that this can easily be done in the base graphics package, as well as adding confidence intervals. As always, we welcome your comments if you know of a better or easier way to do anything mentioned here!
Split, apply, and combine in R using PLYR
While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2. If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.
Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus. You can generate some fake data using this command:
The data should look like this:
X1 X2 SNP1 SNP2
1 -0.73671602 2.76733658 AA BB
2 -1.83947190 5.70194282 AA BB
3 -0.46963370 4.89393264 AA BB
4 0.85105460 6.37610477 AA BB
5 -1.69043368 8.11399122 AA BB
6 -1.47995127 4.32866212 AA BB
7 0.21026063 4.32073489 AA BB
8 0.19038088 4.26631298 AA BB
9 -0.29759084 4.07349852 AA BB
10 -0.46843672 7.01048905 AA BB
11 0.93804369 3.78855823 Aa Bb
12 -1.46223800 5.08756968 Aa Bb
13 -0.01338604 -0.01494702 Aa Bb
14 0.13704332 4.53625220 Aa Bb
15 0.59214151 3.75293124 Aa Bb
16 -0.27658821 5.78701933 Aa Bb
17 0.05527139 5.99460464 Aa Bb
18 1.00077756 5.26838121 Aa Bb
19 0.62871877 6.98314827 Aa Bb
20 1.18925876 8.61457227 Aa Bb
21 -0.40389888 3.36446128 aa bb
22 -1.43420595 6.50801614 aa bb
23 1.79086285 5.39616828 aa bb
24 0.15886243 3.98010396 aa bb
25 0.57746024 4.93605364 aa bb
26 -1.11158987 6.40760662 aa bb
27 1.93484117 3.33698750 aa bb
28 0.10803302 4.56442931 aa bb
29 0.56648591 4.48502267 aa bb
30 0.03616814 5.77160936 aa bb
Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2. First, install and load the plyr package:
Now run the following command:
And you get exactly what you asked for:
SNP1 SNP2 meanX1 sdX1 meanX2 sdX2
1 aa bb 0.2223019 1.0855063 4.875046 1.144623
2 Aa Bb 0.2789043 0.7817727 4.979809 2.286914
3 AA BB -0.5730538 0.8834038 5.185301 1.601626
That command above may appear complicated at first, but it isn't really. Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing. Also, check out the slides he presented at a recent R meetup discussion:
There's more to plyr than ddply, so check it out for yourself!
Common disorders are quantitative traits
---
That's the argument made by Plomin, Haworth, and Davis in a great review paper just published online in Nature Reviews Genetics. One of the main themes presented here is that as a disorder becomes more common, a case-control design becomes less and less powerful to identify associated genetic variants because cases become less extreme and the control group becomes increasingly contaminated by "near cases".
Some qualitative disorders have obvious relevant quantitative traits - BMI for obesity, blood pressure for hypertension, mood for depression. I recently authored a review with Dana and Marylyn making a similar argument in the context of pharmacogenomics research. The authors admit, however, that quantitative measurements are not at all apparent for some disorders, such as alcoholism, arthritis, autism, dementia, diabetes, or heart disease.
The review also has a glossary for the uninitiated, encouraging the use of quantitative vocabulary, like linear regression or ANOVA instead of logistic regression, variance and mean differences rather than risk, and covariance rather than comorbidity.
The conclude with statement that the extremes of a distribution are important medically, but there is no scientific or statistical advantage in analyzing artificially constructed disease labels that evolved historically on the basis of symptoms rather than etiology.
Common disorders are quantitative traits (NRG AOP).
Annotated LD-Plots
This plot design displays p-values from association or other types of statistics in the context of quality control measures such as genotyping efficiency, HWE p-values, and minor allele frequency. It also displays association signals relative to LD structure, so the user can clearly see if multiple SNPs in the same LD block show evidence for association. The haplotypes and their relative frequencies are shown along the top of the figure, and gray shading carries the haplotype structure up through the other statistics in the figure. Categorical stats, such as inclusion on genotyping platforms, whether a SNP was imputed or typed, or the consequence of the SNP (coding/non-coding) can be displayed also.
LD-Plus is available via a web interface, and is an evolving project. To give it a try, visit https://chgr.mc.vanderbilt.edu/ldplus. Details on how to properly format input files is available under the "Documentation" link on the site. We'd love your feedback on the design, and what other features we might include in the design!
Check out the GGD poster at ASHG!
Free Book: Elements of Statistical Learning
Download it here or view it online here.
Visualizing sample relatedness in a GWAS using PLINK and R
PLINK allows you to estimate genomewide IBD-sharing coefficients between seemingly unrelated individuals from whole-genome data. You can read lots more about the particular method they use to infer IBD given IBS and allele frequencies in the PLINK paper. It's relatively straightforward to compute these estimates. If you already have your data in pedfile format, use the following command (assuming mydata.ped and mydata.map are in the current directory where you run PLINK).
This will usually take a few days if you use genome-wide data. You could speed this up by first thinning your dataset to around 100,000 markers using the --thin option to create a smaller dataset.
After you've run this you'll have a file in that directory called plink.genome. You can read about the output here, but we're really interested in Z1 and Z0, the proportion of markers identical by descent 1 and 0 respectively, for every pair of individuals in the dataset. You can plot Z1 by Z0 and color code them by the relationship type as coded in the pedfile using this R code.
Here's what the columns of interest will look like:
RT Z1 Z0
1 UN 0.1697 0.8294
2 OT 0.5034 0.4879
3 OT 0.4403 0.5439
4 OT 0.5200 0.4674
5 UN 0.1646 0.8311
6 OT 0.5519 0.4481
First, fire up R, go to the directory where your plink results are located and load in the data:
Next, set up the plot area. The par option makes the points solid instead of an open circle, and the plot command sets up your plot and axes without adding any points.
Your plot should be empty, like this:
Now use this command to plot points where the relationship type is FS, or full sibling. Make the points green (col=3).
Do the same thing with half sibs, "other related" (have the same family ID), parent offspring pairs, and unrelated, making each of them a different color.
You might want to rerun the commands for "other related" and especially half sibs, because they were covered up when you plotted unrelateds.
Finally, add a legend
For a recap, your code should look like this:
d = read.table("plink.genome", header=T)
par(pch=16)
with(d,plot(Z0,Z1, xlim=c(0,1), ylim=c(0,1), type="n"))
with(subset(d,RT=="FS") , points(Z0,Z1,col=3))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="PO") , points(Z0,Z1,col=2))
with(subset(d,RT=="UN") , points(Z0,Z1,col=1))
with(subset(d,RT=="OT") , points(Z0,Z1,col=4))
with(subset(d,RT=="HS") , points(Z0,Z1,col="darkorange"))
What you can see from the above plot is that self-reported parent-offspring pairs share 100% of their alleles IBD=1. There is a single sib pair at the bottom left where Pr(IBD=1)=0 and Pr(IBD=0)=0. This means that this sib pair shares 2 alleles identical by descent at every locus across the genome. This is either a duplicated sample or a set of identical twins. The full sibs cluster right around the middle of the plot, and you have lots of unrelated or "other related" pairs stretching from completely unrelated (bottom right quadrant) to relatively closely related (2nd or 3rd degree relatives, right around the middle of the plot). If you see self-reported sibs showing up where the unrelated individuals cluster on this plot, or vice versa, this should clue you in on potential sample identity problems.
...
Just for fun, let's make the same plot using ggplot2. Make sure you have the data loaded as above, then install ggplot2 if you haven't already
You only have to install the package once. Next load the package:
Use this incredibly simple command to make the plot:
qplot(Z0,Z1,data=d,colour=RT)
I'm not a huge fan of these colors, so lets make them the same as above:
That's a little better. Obviously the code to do this is MUCH simpler in ggplot2. The only problem I have with this is that it automatically plots the points in the order of the levels of RT, ie. FS, then HS, then OT, PO, then UN, so you end up covering up all your half-sibs with "other related" and unrelated pairs. Post a comment if you know how to go back and replot the HS points over the UN points. For more on ggplot2, see my previous post comparing histograms made using Stata, R base, R lattice and R ggplot2.
R Commander: A Basic Statistics GUI for R
R commander depends on lots of other packages to function properly, so to install it along with all it's dependencies, type this in at the R command line:
Note that this may take 5-10 minutes depending on your connection since it's downloading lots of extra packages, but you only need to do that once. Now that it's installed, any time you want to load it, open R and type:
If you're familiar with Stata or SPSS you should find the menus and dialog boxes intuitive and self-explanatory. I've put in a couple screenshots below showing R commander's data editor, GLM tool, and graphics menu.
There are a few other GUIs for R that I've never used, including Rattle (the R Analytical Tool To Learn Easily), and JGR (Java Gui for R). If you prefer one of these to Rcmdr, let us here why in the comments!
The R Commander: A Basic-Statistics GUI for R
Genomics to Confirm Nationality?
http://www.sciencemag.org/cgi/content/full/326/5949/30
Essentially, the United Kingdom is experimenting with using mitochondrial and Y-chromosome markers to determine if asylum-seekers that flee persecution are actually from the nation in question.
Advocates of this policy don't seem to understand that genetic variation is not nation-specific. Just as you cannot "look" at a person and determine their nationality, you cannot look at their DNA to make that determination either.
What happens when a consumer genetics company goes bankrupt?
Bankruptcy law authorizes the sale of the assets of a business in bankruptcy, and genomic data is likely the most valuable asset of any DTC genomics company. First the authors dissect the privacy policy and terms of service for three major DTC companies: 23andMe, deCODE Genetics, and TruGenetics. Next there's a discussion of how the legal system would treat a DTC genomics company's bankruptcy. The series wraps up with a brief discussion of how this ultimately affects the average DTC genomics cutomer.
Genomics Law Report: What happens if a DTC Genomics Company Goes Belly-Up?
JBrowse: a JavaScript Based Genome Browser
Genome Browsers are nothing new, but JBrowse is a new JavaScript based genome browser that uses information from the UCSC genome browser and has the look and feel of Google Maps. It's extremely easy to zoom in and out and scroll around because all the "work" is being done by your computer rather than some server farm thousands of miles away. OpenHelix is calling it a gamechanger, and they have a nice video demonstration showing off some of JBrowse's features. Click the Drosophila or Homo sapiens genome and give JBrowse a spin for yourself!
The JBrowse genome browser
Comparison of plots using Stata, R base, R lattice, and R ggplot2, Part I: Histograms
First I'll start with the three graphing systems in R: base, lattice, and ggplot2. If you don't have the last two packages installed, go ahead and download them:
Now load these two packages, and download this fake dataset I made up containing 100 samples each from three different genotypes ("geno") and a continuous outcome ("trait")
Now let's get started...
R: base graphics
R: lattice
R: ggplot2
qplot(trait,data=mydat,facets=geno~.)
# Update Tuesday, September 22, 2009
# A commenter mentioned that this code did not work.
# If the above code does not work, try explicitly
# stating that you want a histogram:
qplot(trait,geom="histogram",data=mydat,facets=geno~.)
Stata
insheet using "http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv", comma clear
histogram trait, by(geno, col(1))
Commentary
In my opinion ggplot2 is the clear winner. Again I'll concede that all of the above graphing systems give you an incredible amount of control of every aspect of the graph, but I'm only looking for what gives me the best out-of-the-box default plot using the shortest command possible. R's base graphics give you a rather spartan plot, with very wide bins. It also requires 4 lines of code. (If you can shorten this, please comment). By default, the base graphics system gives you counts (frequency) on the vertical axis. The lattice package in R does a little better perhaps, but the default color scheme is visually less than stellar. Also, I'm not sure why the axis labels switch sides every other plot, and the ticks on top of the plot are probably unnecessary. I still think the bins are too wide. You lose some information especially on the bottom plot towards the right tail. The vertical axis is proportion of total. Stata's default plot looks very similar to lattice, but again uses a very unattractive color scheme. It uses density for the vertical axis, which may not mean much to non-statisticians. The default plot made by ggplot2 is just hands-down good-looking. There are no unnecessary lines delimiting the bins, and the binwidth is appropriate. The vertical axis represents counts. The black bars on the light-gray background have a good data-ink ratio. And it required the 2nd shortest command, only 3 characters longer than the Stata equivalent.
I'm ordering the ggplot2 book (Amazon, ~$50), so as I figure out how to do more with ggplot2 I'll post more comparisons like this. If you use SPSS, SAS, MATLAB, or something else, post the code in a comment here and send me a picture or link to the plot and I'll post it here.
PCG Journal Club Articles, 9/11
~Julia
Kim S, Xing EP. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 2009 Aug; 5(8):e1000587.
Zamar D, Tripp B, Ellis G, Daley D. Path: a tool to failitate pathway-based genetic association analysis. Bioinformatics. 2009 Sep 15; 25(18):2444-6
R clinic this week: Regression Modeling Strategies in R
To install the rms package, start R and type:
Then to load it any time thereafter,
The R clinic is held by the Vanderbilt biostatistics department every Thursday 2-3pm and free to anyone who wants to attend. More information here.
Find the function you're looking for in R
First, fire up R, then install the sos package (don't omit the quotes):
It'll ask you to choose a mirror. Choose the closest one. After it installs, load the package (omit the quotes this time):
This loaded all the functions that come with the sos package, including a particularly useful one called findFn. It scans the "function" entries in Jonathan Baron's "R site search" database. Give it a try, using "epistasis" with the quotes as the keyword.
This should open up a web browser that displays relevant functions, the package you need to download (using the above procedure) to use the function, and a link to the help page for that function.
You can also use ??? as an alias for findFn. Try it like this (use the quotes):
Once you have the sos package installed, type vignette("sos") for more information on how to use various functions in this package.
If you still can't find what you're looking for, check out my previous post on finding help on R, and if all else fails, don't forget about Theresa Scott's free weekly R clinic / Q&A sessions.
Machine Learning in R
Be sure to check out one of Will's previous post on hierarchical clustering in R.
Revolutions: Machine learning in R, in a nutshell
Sync your home directories on ACCRE and the local Linux servers (a.k.a. "the cheeses")
If you use ACCRE to run multi-processor jobs you'll be glad to know that they now allow you to map your home directory to your local desktop using Samba (so you can access your files through My Computer as you normally would with local files). Just submit a help request on their website and they'll get you set up.
Now if you have both your ACCRE home and your home on the cheeses mapped, you can easily sync the files between the two. Download Microsoft's free SyncToy to do the job. It's pretty dead simple to set up, and one click will synchronize files between the two servers.
I didn't want to synchronize everything, so I set it up to only sync directories that contain perl scripts and other programs that I commonly use on both machines. SyncToy also seems pretty useful for backing up your files too.
Microsoft SyncToy
Ask ACCRE to let you map your home
Get the full path to a file in Linux / Unix
#!/usr/bin/perl
chomp($pwd=`pwd`);
print "$pwd\n" if @ARGV==0;
foreach (@ARGV) {print "$pwd/$_\n";}
You can copy this from me, just put it in your bin directory, like this:
Make it executable, like this:
chmod +x ~/bin/path
Here it is in action. Let's say I wanted to print out the full path to all the .txt files in the current directory. Call the program with arguments as the files you want to print the path to:
[turnersd@vmps21 replicated]$ ls
parseitnow.pbs
parsing_program.pl
replic.txt
tophits.txt
[turnersd@vmps21 replicated]$ path *.txt
/projects/HDL/epistasis/replicated/replic.txt
/projects/HDL/epistasis/replicated/tophits.txt
Sure, it's only a little bit quicker than typing pwd, copying that, then spelling out the filenames. But if you have long filenames or lots of filenames you want to copy, this should get things done faster. Enjoy.
ClipPath copies filename and path from windows for loading into R
Download the zipfile from the website below (here's a direct link to the zip file). Extract it's contents, right click on ClipPath.inf, and choose install. You can always uninstall later through the control panel.
ClipPath Shell Extension
GGD posts now printer-friendly
PCG Journal Club Articles, 8/28
~Julia
Gurwitz D, Fortier I, Lunshof JE, Knoppers BM. Research ethics: Children and population biobanks. Science. 2009 Aug 14; 325(5942):818-9
Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009 Aug; 10(8):551-64.
Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet. 2009 Sep; 10(9):639-50.
Koga Y, Pelizzola M, Cheng E, Krauthammer M, Sznol M, Ariyan S, Narayan D, Molinaro AM, Halaban R, Weissman SM. Genome-wide screen of promoter methylation identifies novel markers in melanoma. Genome Res. 2009 Aug; 19(8):1462-70.
Monier A, Pagarete A, de Vargas C, Allen MJ, Read B, Claverie JM, Ogata H. Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus. Genome Res. 2009 Aug; 19(8):1441-9.
Raveh-Sadka T, Levo M, Segal E. Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Res. 2009 Aug; 19(8):1480-96.
Rosenberg NA, Vanliere JM. Replication of genetic associations as pseudoreplication due to shared genealogy. Genet Epidemiol. 2009 Sep; 33(6):479-87.
Schmitz D, Netzer C, Henn W. An offer you can't refuse? Ethical implications of non-invasive prenatal diagnosis. Nat Rev Genet. 2009 Aug; 10(8):515.
Convert PLINK output to CSV
You need to be on a Linux/Unix machine to do this. Here's the command. I'm looking at results from Mendelian errors here. Replace "mendel" with the results file you want to reformat, and put this all on one line.
cat mendel.txt | sed -r 's/^\s+//g' | sed -r 's/\s+/,/g' > mendel.csv
You'll have created a new file called results.hwe.csv that you can now open directly in Excel or load into a database more easily than you could with the default output.
Before:
turnersd@provolone~: cat mendel.txt
FID PAT MAT CHLD N
1089 16223073 16223062 1 149
1116 16233564 16233589 1 114
123 16230767 16230725 2 189
12 16221778 16221803 1 116
12 16221805 16221822 1 98
12 16230486 16230496 1 76
12 16231205 16232111 2 180
134 16222939 16222945 2 140
1758 16230755 16231121 2 206
After:
turnersd@provolone~: cat mendel.csv
FID,PAT,MAT,CHLD,N
1089,16223073,16223062,1,149
1116,16233564,16233589,1,114
123,16230767,16230725,2,189
12,16221778,16221803,1,116
12,16221805,16221822,1,98
12,16230486,16230496,1,76
12,16231205,16232111,2,180
134,16222939,16222945,2,140
1758,16230755,16231121,2,206
If you're interested in the details of what this is doing here you go:
First, you cat the contents of the file and pipe it to a command called sed. The thing between the single quotes in the sed command is called a regular expression, which is similar to doing a find-and-replace in MS Word. What this does is searches for the thing between the first pair of slashes and replaces it with the thing between the next two slashes. You need the -r option, and the "s" before the first and the "g" after the last slash to make it work right.
/^\s+// is the first regular expression. \s is special and it means means search for whitespace. \s+ means search for any amount of whitespace. The ^ means only look for it at the beginning of the line. Notice there is nothing between the second and third slashes, so it will replace any whitespace with nothing. This part will trim any whitespace from the beginning of the line, which is important because in the next part we're turning any remaining whitespace into a comma, so we don't want the line to start with a comma.
/\s+/,/ is the second regular expression. Again we're searching for a variable amount of whitespace but this time replacing it with a comma.
Great Quote from Our Own Doug Mortlock
'It’s stunning to see a genetic modification like this,' developmental geneticist Douglas Mortlock of Vanderbilt University in Nashville, Tenn., says of the new study, published online July 16 in Science. 'This is the gene that makes wiener dogs short-legged.'
Beautiful Info-graphic
The graphic is essentially a stacked density plot with time (24 hours) on the X-axis. Clicking on a different group of individuals provides a very smooth transition to the new density distribution, allowing an animated visual comparison. In some ways, this animated version provides an easier comparison than showing multiple versions of the same figure. Furthermore, there is just something compelling about this figure that begs you to examine it more closely...
http://www.nytimes.com/interactive/2009/07/31/business/20080801-metrics-graphic.html?scp=3&sq=infographic&st=cse
Next-Gen Sequencing
Wellcome Trust feature on next-gen sequencing
Systems Biology Graphical Notation
SBGN defines a comprehensive set of symbols with precise semantics, together with detailed syntactic rules defining their use and how diagrams are to be interpreted. By standardizing the visual notation, SBGN can serve as a bridge between different communities in research, education, publishing, and more. The real payoff will come when researchers are as familiar with the notation as electronics engineers are familiar with the notation of circuit schematics. If researchers are saved the time and effort required to familiarize themselves with different notations, they can spend more time thinking about the biology being depicted.
You can view the project info here.
ggplot2 workshop at Vanderbilt
http://lookingatdata.com/
PCG Journal Club Articles, 7/31
~Julia
Bogdanowicz W, Allen M, Branicki W, Lembring M, Gajewska M, Kupiec T. Genetic identification of putative remains of the famous astronomer Nicolaus Copernicus. Proc Natl Acad Sci U S A, 2009 Jul 7 [Epub ahead of print].
Green RC, Roberts JS, Cupples LA, Relkin NR, Whitehouse PJ, Brown T, Eckert SL, Butson M, Sadovnick AD, Quaid KA, Chen C, Cook-Deegan R, Farrer LA; REVEAL Study Group. Disclosure of APOE genotype for risk of Alzheimer's disease. N Engl J Med, 2009 Jul 16; 361(3):245-54.
International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, Sklar P. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 2009 Aug 6; 460(7256):748-52.
Pevzner P, Shamir R. Computing has changed biology--biology education must catch up. Science, 2009 Jul 31; 325(5940):541-2. No abstract available.
Pomerantz MM, Ahmadiyeh N, Jia L, Herman P, Verzi MP, Doddapaneni H, Beckwith CA, Chan JA, Hills A, Davis M, Yao K, Kehoe SM, Lenz HJ, Haiman CA, Yan C, Henderson BE, Frenkel B, Barretina J, Bass A, Tabernero J, Baselga J, Regan MM, Manak JR, Shivdasani R, Coetzee GA, Freedman ML. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet, 2009 Aug; 41(8):882-4.
Rosser ZH, Balaresque P, Jobling MA. Gene conversion between the X chromosome and the male-specific region of the Y chromosome at a translocation hotspot. Am J Hum Genet, 2009 Jul; 85(1):130-4.
Tuupanen S, Turunen M, Lehtonen R, Hallikas O, Vanharanta S, Kivioja T, Björklund M, Wei G, Yan J, Niittymäki I, Mecklin JP, Järvinen H, Ristimäki A, Di-Bernardo M, East P, Carvajal-Carmona L, Houlston RS, Tomlinson I, Palin K, Ukkonen E, Karhu A, Taipale J, Aaltonen LA. The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nat Genet, 2009 Aug; 41(8):885-90. Epub 2009 Jun 28.
Wain LV, Armour JA, Tobin MD. Genomic copy number variation, human health, and disease. Lancet, 2009 Jul 25; 374(9686):340-50. Review.