
What happens when a consumer genetics company goes bankrupt?

Dan Vorhaus and Lawrence Moore recently put together this excellent three-part series on Genomics Law Report.  Headlines about deCODE Genetics on the brink of insolvency and major shifts in the upper management of 23andMe inspired this series of posts on what would happen if a direct-to-consumer (DTC) genomics company declares bankruptcy.

Bankruptcy law authorizes the sale of the assets of a business in bankruptcy, and genomic data is likely the most valuable asset of any DTC genomics company.  First the authors dissect the privacy policies and terms of service for three major DTC companies: 23andMe, deCODE Genetics, and TruGenetics.  Next there's a discussion of how the legal system would treat a DTC genomics company's bankruptcy.  The series wraps up with a brief discussion of how this ultimately affects the average DTC genomics customer.

Genomics Law Report: What happens if a DTC Genomics Company Goes Belly-Up?

JBrowse: a JavaScript Based Genome Browser



Genome browsers are nothing new, but JBrowse is a new JavaScript-based genome browser that uses information from the UCSC genome browser and has the look and feel of Google Maps.  It's extremely easy to zoom in and out and scroll around because all the "work" is being done by your computer rather than some server farm thousands of miles away.  OpenHelix is calling it a gamechanger, and they have a nice video demonstration showing off some of JBrowse's features.  Click the Drosophila or Homo sapiens genome and give JBrowse a spin for yourself!

The JBrowse genome browser

Comparison of plots using Stata, R base, R lattice, and R ggplot2, Part I: Histograms

One of the nicer things about many statistics packages is the extremely granular control you get over your graphical output.  But I lack the patience to set dozens of command line flags in R, and I'd rather not power the computer by pumping the mouse trying to set all the clicky-box options in Stata's graphics editor.  I want something that just looks nice, using the out-of-the-box defaults.  Here's a little comparison of 4 different graphing systems (three using R, and one using Stata) and their default output for plotting a histogram of a continuous variable split over three levels of a categorical variable.

First I'll start with the three graphing systems in R: base, lattice, and ggplot2.  If you don't have the last two packages installed, go ahead and download them:

install.packages("ggplot2")
install.packages("lattice")

Now load these two packages, and download this fake dataset I made up containing 100 samples each from three different genotypes ("geno") and a continuous outcome ("trait").

mydat=read.csv("http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv",header=T)
library(ggplot2)
library(lattice)

Now let's get started...

R: base graphics

par(mfrow=c(3,1))
with(subset(mydat,geno=="aa"),hist(trait))
with(subset(mydat,geno=="Aa"),hist(trait))
with(subset(mydat,geno=="AA"),hist(trait))
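For anyone who finds the three near-identical calls repetitive, one possible way to condense them is to split the trait by genotype and loop over the pieces.  This is only a sketch: it uses simulated data as a stand-in for the downloaded file (the seed and means are made up; only the column names geno and trait come from the dataset above), and the by_geno name is introduced here purely for illustration.

```r
# Simulated stand-in for the post's dataset: 100 samples per genotype.
# The means and seed are arbitrary assumptions, not the real data.
set.seed(1)
mydat <- data.frame(
  geno  = rep(c("aa", "Aa", "AA"), each = 100),
  trait = c(rnorm(100, mean = 10), rnorm(100, mean = 12), rnorm(100, mean = 14))
)

# Split the trait values into a named list, one element per genotype
by_geno <- split(mydat$trait, mydat$geno)

# Stack one default histogram per genotype in a single column
par(mfrow = c(length(by_geno), 1))
for (g in names(by_geno)) hist(by_geno[[g]], main = g, xlab = "trait")
```

The panels come out in the sort order of the genotype labels rather than the order typed by hand, and the styling is still the same spartan base default.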




R: lattice

histogram(~trait | factor(geno), data=mydat, layout=c(1,3))


R: ggplot2

qplot(trait,data=mydat,facets=geno~.)

# Update Tuesday, September 22, 2009
# A commenter mentioned that this code did not work.
# If the above code does not work, try explicitly
# stating that you want a histogram:
qplot(trait,geom="histogram",data=mydat,facets=geno~.)



Stata

insheet using "http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv", comma clear
histogram trait, by(geno, col(1))





Commentary

In my opinion ggplot2 is the clear winner.  Again I'll concede that all of the above graphing systems give you an incredible amount of control of every aspect of the graph, but I'm only looking for what gives me the best out-of-the-box default plot using the shortest command possible.

R's base graphics give you a rather spartan plot, with very wide bins.  It also requires 4 lines of code.  (If you can shorten this, please comment).  By default, the base graphics system gives you counts (frequency) on the vertical axis.

The lattice package in R does a little better perhaps, but the default color scheme is visually less than stellar.  Also, I'm not sure why the axis labels switch sides every other plot, and the ticks on top of the plot are probably unnecessary.  I still think the bins are too wide.  You lose some information especially on the bottom plot towards the right tail.  The vertical axis is proportion of total.

Stata's default plot looks very similar to lattice, but again uses a very unattractive color scheme.  It uses density for the vertical axis, which may not mean much to non-statisticians.

The default plot made by ggplot2 is just hands-down good-looking.  There are no unnecessary lines delimiting the bins, and the binwidth is appropriate.  The vertical axis represents counts.  The black bars on the light-gray background have a good data-ink ratio.  And it required the 2nd shortest command, only 3 characters longer than the Stata equivalent.

I'm ordering the ggplot2 book (Amazon, ~$50), so as I figure out how to do more with ggplot2 I'll post more comparisons like this.  If you use SPSS, SAS, MATLAB, or something else, post the code in a comment here and send me a picture or link to the plot and I'll post it here.

PCG Journal Club Articles, 9/11

There were only a couple of citations for articles discussed at this week's PCG meeting (September 11). Our next meeting is scheduled for September 26.

~Julia


Kim S, Xing EP. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 2009 Aug; 5(8):e1000587.

Zamar D, Tripp B, Ellis G, Daley D. Path: a tool to facilitate pathway-based genetic association analysis. Bioinformatics. 2009 Sep 15; 25(18):2444-6.

R clinic this week: Regression Modeling Strategies in R

At this week's R clinic Frank Harrell will unveil the new rms (Regression Modeling Strategies) package, a replacement for the R Design package.  He will demonstrate how it differs from Design, especially in its enhanced graphics for displaying effects in regression models.  Frank will also discuss the implementation of quantile regression in rms.  The rms package website has links to the manual, examples of graphical output, and printable reference cards for many of the package's commands.  It also makes the point that many of rms's graphics capabilities are modular and will play nicely with the previously mentioned ggplot2.

To install the rms package, start R and type:

install.packages("rms", dependencies=TRUE)

Then to load it any time thereafter,

library(rms)

The R clinic is held by the Vanderbilt biostatistics department every Thursday 2-3pm and is free to anyone who wants to attend.  More information here.

Find the function you're looking for in R

Every R user, no matter the level of experience, has had trouble finding the package or function that does what they want, and then figuring out how to use it.  The sos package in R just made that a lot easier.

First, fire up R, then install the sos package (don't omit the quotes):

install.packages("sos")

It'll ask you to choose a mirror.  Choose the closest one.  After it installs, load the package (omit the quotes this time):

library(sos)

This loaded all the functions that come with the sos package, including a particularly useful one called findFn.  It scans the "function" entries in Jonathan Baron's "R site search" database.  Give it a try, using "epistasis" with the quotes as the keyword.

findFn("epistasis")

This should open up a web browser that displays relevant functions, the package you need to download (using the above procedure) to use the function, and a link to the help page for that function.


You can also use ??? as an alias for findFn.  Try it like this (use the quotes):

???"genome wide"

Once you have the sos package installed, type vignette("sos") for more information on how to use various functions in this package.

If you still can't find what you're looking for, check out my previous post on finding help on R, and if all else fails, don't forget about Theresa Scott's free weekly R clinic / Q&A sessions.

Machine Learning in R

The Revolutions blog recently posted a link to R code by Joshua Reich with self-contained examples of using machine learning techniques in R, including various clustering methods (k-means, nearest neighbor, and kernel), recursive partitioning (CART), principal components analysis, linear discriminant analysis, and support vector machines.  The post also links to some slides that go over the basics of machine learning.  Looks like a good place to start learning about ML before hand-rolling your own code.

Be sure to check out one of Will's previous posts on hierarchical clustering in R.

Revolutions: Machine learning in R, in a nutshell

Sync your home directories on ACCRE and the local Linux servers (a.k.a. "the cheeses")

Vanderbilt ACCRE users with PCs only...

If you use ACCRE to run multi-processor jobs you'll be glad to know that they now allow you to map your home directory to your local desktop using Samba (so you can access your files through My Computer as you normally would with local files).  Just submit a help request on their website and they'll get you set up.

Now if you have both your ACCRE home and your home on the cheeses mapped, you can easily sync the files between the two.  Download Microsoft's free SyncToy to do the job.  It's pretty dead simple to set up, and one click will synchronize files between the two servers.


I didn't want to synchronize everything, so I set it up to only sync directories that contain perl scripts and other programs that I commonly use on both machines.  SyncToy also seems pretty useful for backing up your files too.

Microsoft SyncToy

Ask ACCRE to let you map your home

Get the full path to a file in Linux / Unix

In the last post I showed you how to point to a file in Windows and get the full path copied to your clipboard.  I wanted to come up with something similar for a Linux environment.  This is helpful on Vampire/ACCRE because you have to fully qualify the path to every file you use when you submit jobs with PBS scripts.  So I wrote a little Perl script:

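#!/usr/bin/perl
use strict;
use warnings;
# Print the full path to each file named on the command line,
# or just the current directory if no arguments are given.
chomp(my $pwd = `pwd`);
print "$pwd\n" if @ARGV == 0;
print "$pwd/$_\n" foreach @ARGV;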

You can copy this from me; just put it in your bin directory, like this:

cp /home/turnersd/bin/path ~/bin

Make it executable, like this:

chmod +x ~/bin/path

Here it is in action. Let's say I wanted to print out the full path to all the .txt files in the current directory.  Call the program with arguments as the files you want to print the path to:


[turnersd@vmps21 replicated]$ ls
parseitnow.pbs
parsing_program.pl
replic.txt
tophits.txt
 
[turnersd@vmps21 replicated]$ path *.txt
/projects/HDL/epistasis/replicated/replic.txt
/projects/HDL/epistasis/replicated/tophits.txt


Sure, it's only a little bit quicker than typing pwd, copying that, then spelling out the filenames.  But if you have long filenames or lots of filenames you want to copy, this should get things done faster.  Enjoy.
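For those who spend more time in R than at the shell, roughly the same behaviour can be sketched with base R's getwd() and file.path().  The function name full_paths is made up for this example; like the Perl script, it simply prepends the working directory and does not check that the files exist.

```r
# Hypothetical helper mirroring the Perl script: prepend the working
# directory to each file name, or return the directory alone if none given.
full_paths <- function(files = character(0)) {
  if (length(files) == 0) return(getwd())
  file.path(getwd(), files)
}

# Example using the two .txt file names from the listing above
print(full_paths(c("replic.txt", "tophits.txt")))
```

Calling full_paths() with no arguments returns just the working directory, matching what the script prints when run without arguments.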

ClipPath copies filename and path from windows for loading into R

I wish I had discovered this long ago.  Loading data into R or MySQL requires you to specify the full path to the file.  If you do this on a Windows machine there are two annoyances.  First, if you save something to your desktop, the path to your desktop is really long.  Second, Windows by default uses backslashes "\" in the file path, while R and other software require forward slashes "/".  ClipPath is a tiny program that adds an entry to your right-click menu to copy the full file path with forward slashes, so you can paste the filename into whatever program you're using.



Download the zipfile from the website below (here's a direct link to the zip file).  Extract its contents, right click on ClipPath.inf, and choose install.  You can always uninstall later through the control panel.
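If installing a shell extension is not an option, the backslash problem can also be handled inside R.  A minimal sketch, assuming a made-up Windows path (win_path below is hypothetical): gsub() with fixed = TRUE swaps each literal backslash for a forward slash.

```r
# Hypothetical path as it would be pasted from Windows Explorer.
# Backslashes must be doubled inside an R string literal.
win_path <- "C:\\Users\\me\\Desktop\\mydata.csv"

# fixed = TRUE treats the pattern as a literal backslash, not a regex
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)
```

Another base R option is file.choose(), which opens a file picker and returns the chosen path, so read.csv(file.choose()) avoids typing the path at all.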

ClipPath Shell Extension

GGD posts now printer-friendly

A quick announcement about a formatting fix here - you can now print posts from GGD a little more cleanly.  When you print it should no longer include the title and sidebar, so most posts should now only use a page or two.

PCG Journal Club Articles, 8/28

Here are citations for the articles discussed at our most recent meeting (August 28). I have also appended a link to the new Nature Reviews Genetics series, Fundamental concepts in genetics, at the end. Our next meeting is scheduled for September 11.

~Julia


Gurwitz D, Fortier I, Lunshof JE, Knoppers BM. Research ethics: Children and population biobanks. Science. 2009 Aug 14; 325(5942):818-9.

Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009 Aug; 10(8):551-64.

Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet. 2009 Sep; 10(9):639-50.

Koga Y, Pelizzola M, Cheng E, Krauthammer M, Sznol M, Ariyan S, Narayan D, Molinaro AM, Halaban R, Weissman SM. Genome-wide screen of promoter methylation identifies novel markers in melanoma. Genome Res. 2009 Aug; 19(8):1462-70.

Monier A, Pagarete A, de Vargas C, Allen MJ, Read B, Claverie JM, Ogata H. Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus. Genome Res. 2009 Aug; 19(8):1441-9.

Raveh-Sadka T, Levo M, Segal E. Incorporating nucleosomes into thermodynamic models of transcription regulation. Genome Res. 2009 Aug; 19(8):1480-96.

Rosenberg NA, Vanliere JM. Replication of genetic associations as pseudoreplication due to shared genealogy. Genet Epidemiol. 2009 Sep; 33(6):479-87.

Schmitz D, Netzer C, Henn W. An offer you can't refuse? Ethical implications of non-invasive prenatal diagnosis. Nat Rev Genet. 2009 Aug; 10(8):515.


Fundamental concepts in genetics: http://www.nature.com/nrg/series/fundamental/index.html