23andMe GCPM recaps and FDA meeting on Laboratory Developed Tests
You can find two nice recaps of last week's personalized medicine policy forum on Genomics Law Report and 23andMe's blog, The Spittoon. Also of interest: today and tomorrow the FDA is holding a public meeting to discuss issues surrounding the potential oversight of laboratory developed tests (a category that DTC genetic testing may fall into). You can find the agenda and links to the live (free) webcast here, or you could follow the #FDALDT hashtag on Twitter. The Washington Post published this nice piece over the weekend summarizing the issues.
Genomics and the Consumer: The Present and Future of Personalized Medicine
If you don't follow GGD on Twitter, you may not have seen this: California State Senator Alex Padilla and 23andMe are hosting a policy forum entitled "Genomics and the Consumer: The Present and Future of Personalized Medicine" today in San Francisco. The agenda looks very exciting, featuring talks by Anne Wojcicki, co-founder of 23andMe, Leroy Hood from the Institute for Systems Biology, Dan Vorhaus (@genomicslawyer, editor of Genomics Law Report), Senator Padilla, and several others. Hopefully 23andMe will record this discussion and make it available online soon. In the meantime, you can follow the #gcpm hashtag for live updates from those in attendance.
QQ plot of p-values in R using base graphics
Update Tuesday, September 14, 2010: Fixed the ylim issue; the code now sets the y-axis limit based on the smallest observed p-value.
A while back Will showed you how to create QQ plots of p-values in Stata and in R using the now-deprecated sma package. A bit later on I showed you how to do the same thing in R using ggplot2. As much as we (and our readers) love ggplot2 around here, it can be quite a bit slower than the built-in base graphics. This only recently became a problem for me, when I tried creating a quantile-quantile plot of over 12 million p-values. I wrote the code to do this in base graphics, which is substantially faster than the ggplot2 code I posted a while back. The code and an example are below.
Here's what the resulting QQ-plot will look like:
Illumina Sequencing Seminar Series
Next week Brent Anderson with Illumina will be hosting a seminar series showcasing presentations from Vanderbilt scientists using Illumina technology to power their next-generation sequencing studies. Here's the schedule:
Tuesday, July 13, 2010
Vanderbilt University
Light Hall Room 512
- 1:00 Registration
- 1:30 Introduction (Brent Anderson, Illumina)
- 1:45 Whole Transcriptome Analysis of Pancreatic Progenitor Cells (Mark Magnuson, Vanderbilt)
- 2:15 Targeted Next-Gen Sequencing in Drug Induced Torsades de Pointes (Andrea Ramirez, Vanderbilt)
- 2:45 Studying Gene Structure, Expression, & Regulation Using the HiSeq 2000 (Haley Fiske, Illumina)
All code on GGD is Free (Open Source BSD)
At the request of a commenter I just wanted to clarify that any code released here for R or anything else is free and open source unless specifically stated otherwise. The open source BSD license for any code on GGD can be found on this copyright page.
Convert PLINK output to CSV Revisited
A while back, Stephen wrote a very nice post about converting PLINK output to a CSV file. If you are like me, you have used this a thousand times, enough to get tired of typing lots of sed commands.
I just crafted a little Bash function that accomplishes the same thing with a single easy-to-type command. Insert the following text into your .bashrc file. This file is generally hidden in your UNIX home directory (you can see it if you type 'ls -al').
This version converts the infile to a tab-delimited output.
function cleanplink
{
# Collapse whitespace runs to tabs, drop the leading tab, and write NA as \N
sed -r 's/\s+/\t/g' "$1" | sed -r 's/^\t//g' | sed -r 's/NA/\\N/g' > "$1.txt"
}
And this version converts to a CSV file.
function cleanplink
{
# Collapse whitespace runs to commas, drop the leading comma, and write NA as \N
sed -r 's/\s+/,/g' "$1" | sed -r 's/^,//g' | sed -r 's/NA/\\N/g' > "$1.csv"
}
Both versions above also convert "NA" to \N, which MySQL reads as NULL when loading a file. You can remove that step if you'd like:
function cleanplink
{
# Same CSV conversion, but NA values are left untouched
sed -r 's/\s+/,/g' "$1" | sed -r 's/^,//g' > "$1.csv"
}
You use this function as follows:
bush@queso:~$ cleanplink plinkresults.assoc
and it produces a file with the same name, but with a ".csv" or a ".txt" on the end.
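To see what the conversion actually does, here is a minimal sketch run against a tiny made-up file. The file name demo.assoc and its four columns are invented for illustration (real PLINK output has more columns), and the sed flags assume GNU sed.

```shell
# Hypothetical two-line file imitating PLINK's space-padded layout.
printf ' CHR  SNP     BP      P\n   1  rs1    100     NA\n' > demo.assoc

# The CSV version of cleanplink, with "$1" quoted so that
# file names containing spaces still work.
cleanplink() {
  sed -r 's/\s+/,/g' "$1" | sed -r 's/^,//g' | sed -r 's/NA/\\N/g' > "$1.csv"
}

cleanplink demo.assoc

# Show the result: the header row, then the data row with NA written as \N.
cat demo.assoc.csv
```

One thing to watch: the s/NA/\\N/g step replaces the letters NA wherever they occur, so an identifier that happens to contain "NA" would be altered too. With GNU sed, matching \bNA\b instead is one way to restrict the substitution to whole fields.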
Using Expression Data to Mine the "Gray Zone" of GWAS
Researchers in the ENGAGE consortium used a clever technique to leverage genome-wide expression data to select or prioritize genes for GWAS analysis. The investigators published the novel candidate genes for obesity in this month's PLoS Genetics, but I think the method they used here is more interesting.
If you're studying obesity and you find that expression of some gene correlates with BMI, you have a problem: you don't know whether the correlation indicates a causal relationship or whether the changes in gene expression were simply reactive to changes in body composition. This is the case when looking at unrelated individuals: some correlations will be reactive, others potentially causal. However, if you're looking only within identical twins, the correlations you see cannot reflect inherited sequence differences, because MZ twins are genetically identical, so they can be treated as reactive. The authors took an interesting approach: they prioritized for GWAS analysis the genes whose expression was correlated with BMI in the unrelated individuals only, and not in the MZ twins.
Following up these "causal" genes in a GWAS analysis, the authors found that the p-value distribution was highly biased away from the null. In other words, more variants in these genes were associated with the phenotype than you'd expect by chance. The genes dubbed reactive were biased toward the null, i.e. fewer variants in these genes were associated with the phenotype.
While not everyone has easy access to whole-genome expression data on MZ twins before doing a GWAS, I wonder if the idea can be extended to siblings or even more distant relatives, perhaps using the kinship coefficient as a measure of relatedness between two individuals to "nudge" the transcript in question toward causal versus reactive. Anyhow, check out the paper linked below; it's a very clever idea.
(On a slightly related note, check out this interesting discussion about open access publishing a la PLoS versus traditional scientific publishing)
PLoS Genetics: Use of Genome-Wide Expression Data to Mine the “Gray Zone” of GWA Studies Leads to Novel Candidate Obesity Genes