The R Journal just published its inaugural peer-reviewed issue. Aligned with the open-source mantra, the journal is free and openly accessible. It features short articles on topics focused on R, including notes about new add-on packages, hints for R newcomers, application reports detailing examples of data analysis with R, and other news items. The current issue in PDF and information about submission can be found at the link below.
The R Journal
PCG Journal Club Articles, 5/13 & 5/27
Hi! I'm Julia Wall, the Technical Writer in Vanderbilt's CHGR. Every two weeks or so, I'll be posting citation links to the articles that the students in our Program for Computational Genomics (PCG) Journal Club discuss in their twice-monthly meetings. Enjoy!
PCG Journal Club Articles, 5/13
Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BM, Kappos L; GeneMSA Consortium, Polman CH, Matthews PM, Hauser SL, Gibson RA, Oksenberg JR, Barnes MR. Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet, 18(11):2078-90 (2009).
Stoddart D, Heron AJ, Mikhailova E, Maglia G, Bayley H. Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc Natl Acad Sci USA, 106(19):7702-7 (2009).
PCG Journal Club Articles, 5/27
Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C, Plagnol V, Pociot F, Schuilenburg H, Smyth DJ, Stevens H, Todd JA, Walker NM, Rich SS; The Type 1 Diabetes Genetics Consortium. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nature Genetics, 2009 May 10 [Epub ahead of print].
Biswas S, Scheinfeldt LB, Akey JM. Genome-wide insights into the patterns and determinants of fine-scale population structure in humans. AJHG, 84(5): 641-50 (2009).
Glessner JT, Wang K, Cai G, Korvatska O, Kim CE, Wood S, Zhang H, Estes A, Brune CW, Bradfield JP, Imielinski M, Frackelton EC, Reichert J, Crawford EL, Munson J, Sleiman PM, Chiavacci R, Annaiah K, Thomas K, Hou C, Glaberson W, Flory J, Otieno F, Garris M, Soorya L, Klei L, Piven J, Meyer KJ, Anagnostou E, Sakurai T, Game RM, Rudd DS, Zurawiecki D, McDougle CJ, Davis LK, Miller J, Posey DJ, Michaels S, Kolevzon A, Silverman JM, Bernier R, Levy SE, Schultz RT, Dawson G, Owley T, McMahon WM, Wassink TH, Sweeney JA, Nurnberger JI, Coon H, Sutcliffe JS, Minshew NJ, Grant SF, Bucan M, Cook EH, Buxbaum JD, Devlin B, Schellenberg GD, Hakonarson H. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature, 2009 April 28 [Epub ahead of print].
Ioannidis JP, Thomas G, Daly MJ. Validating, augmenting and refining genome-wide association signals. Nat Rev Genet, 10(5): 318-29 (2009).
Newcombe PJ, Verzilli C, Casas JP, Hingorani AD, Smeeth L, Whittaker JC. Multilocus Bayesian meta-analysis of gene-disease associations. AJHG, 84(5): 567-80 (2009).
Spencer SL, Gaudet S, Albeck JG, Burke JM, Sorger PK. Non-genetic origins of cell-to-cell variability in TRAIL-induced apoptosis. Nature, 459(7245): 428-32 (2009).
Thomas A, Camp NJ, Farnham J, Allen-Brady K, Cannon-Albright LA. Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays. Ann Hum Genet, 72(Pt2): 279-87 (2008).
Yoshida Y, Makita Y, Heida N, Asano S, Matsushima A, Ishii M, Mochizuki Y, Masuya H, Wakana S, Kobayashi N, Toyoda T. PosMed (Positional Medline): prioritizing genes with an artificial neural network comprising medical documents to accelerate positional cloning. Nucleic Acids Res, 2009 May 25 [Epub ahead of print].
Statistics and sex appeal
Google's chief economist was recently quoted as saying "The sexy job in the next ten years will be statisticians… The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that’s going to be a hugely important skill." I'll leave you for the weekend with this ego-boosting article relating how our skill set as statisticians is a hot commodity in the real world.
Dataspora Blog: The three sexy skills of data geeks
Journal club notes coming soon
The Program in Computational Genomics holds a journal club twice a month. Our technical writer Julia Wall will soon start posting references here to the articles we talked about. Keep an eye out to see some of the latest research we're discussing!
Analysis workshop on Haploview, PLINK, STRUCTURE, and WASP
In case you missed the email, Lana Olson here in the CHGR is holding a workshop next Thursday, June 4 at 1:30pm on using Haploview, WASP, PLINK, and STRUCTURE. The workshop will be held in the CHGR conference room, 519 Light Hall.
Free one-day R course at Vanderbilt
The Vanderbilt Kennedy Center is offering a free (repeat, free) one-day introductory course to the R statistical computing language on June 23, taught by Theresa Scott from the department of Biostatistics. You can find contact/registration info at the link below.
Vanderbilt Kennedy Center - An Introduction to the Fundamentals & Functionality of the R Language
In case you missed it, if you've used R before and need a quick reference, print out the previously mentioned R reference card.
Illumina Omni
Illumina recently announced their latest chip for GWA studies, the HumanOmni1-Quad. Generating more than 4 million data points, it assays over 30,000 coding SNPs, contains content derived from the 1000 Genomes Project, and has probes covering 11,000 common and rare CNVs.
I doubt the information gain will immediately be worth the premium price for this chip, but it will be available next month nonetheless.
Press release: HumanOmni1-Quad
Has anyone ever used Galaxy?
Has anyone ever used Galaxy? I saw their presentation at last year's ASHG. It seems like a great way to collaborate on and keep a record of analyses in an easy web interface without having to download any software. If you've used it for genetic analysis and you'd like to write a bit about it here (whether you're a Vanderbilt person or not), email me or post in the comments.
Would a gene by any other name be just as significant?
So you have found significant SNPs from a study, and you are investigating the region. Browsing through Ensembl or Entrez-Gene, you find a coding region nearby. Atop this coding region, you see a collection of letters that are commonly used to refer to this gene, let's say "MYLK". So you begin a PubMed search to find publications that describe the function of this gene, searching with "MYLK". Seems reasonable, right?
Beware! Unfortunately, gene names or acronyms are NOT a standardized way of identifying coding regions. According to Gene Cards, the coding region with the symbol "MYLK" has 14 different symbol aliases, and four unique descriptions! To be complete, conduct a PubMed search using all of these terms. For example, searching PubMed for MYLK retrieves only 30 articles, mostly involving muscle contraction. Searching for MLCK, on the other hand, retrieves 847 articles! These references place much more emphasis on the neural activities of the gene, so perhaps different groups of investigators use different symbols.
To make matters worse, according to Entrez-Gene, MYLK is the "official" gene symbol, yet fewer than 5% of the PubMed articles use that designation! If possible, use the Entrez-Gene or Ensembl gene ID when referencing a gene in the literature to help avoid this confusion.
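If you end up searching on many aliases at once, it can help to build the combined query programmatically rather than typing it by hand. Here is a minimal Python sketch of that idea. The function name build_pubmed_query is made up for this example, the alias list below contains only the two symbols mentioned above plus the gene's full name (the remaining aliases would have to be copied from Gene Cards), and restricting the search to the [Title/Abstract] field is just one reasonable choice.

```python
def build_pubmed_query(aliases, field="Title/Abstract"):
    """Join gene aliases into one OR-ed PubMed search string.

    Hypothetical helper for illustration: it only builds the text you
    would paste into the PubMed search box; it does not contact PubMed.
    """
    unique = []
    seen = set()
    for alias in aliases:
        alias = alias.strip()
        key = alias.upper()
        # Skip blanks and case-insensitive repeats (e.g. "MYLK" vs "mylk").
        if alias and key not in seen:
            seen.add(key)
            unique.append(alias)
    # Quote each term so multi-word names are searched as phrases.
    return " OR ".join(f'"{alias}"[{field}]' for alias in unique)


# Only MYLK and MLCK come from the post; add the other aliases yourself.
query = build_pubmed_query(["MYLK", "MLCK", "myosin light chain kinase"])
print(query)
```

Paste the printed string into the PubMed search box. Article counts will differ from the numbers quoted above, since they change as new papers are indexed.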
Wolfram Alpha as a bioinformatics tool
Just released last week by the makers of Mathematica, Wolfram Alpha is something like a search engine. It calls itself a "computational knowledge engine" and describes its lofty goal as a "long-term project to make all systematic knowledge immediately computable by anyone."
From their homepage you can link to a page showing examples of how to use it, but I was interested in seeing how much biology Wolfram Alpha knows, and I've got to say I'm impressed with the results.
(Note: their servers are pretty busy I guess, so if the links don't work the first time, or the search times out, try reloading.)
Check out the results I got when I searched for APOE. It correctly interpreted the fact that I wanted information about the human gene, and accordingly gave me information about the gene and its location, along with a chromosome ideogram, a reference sequence, splice structures, and more.
I was also impressed to see what happened when I entered a random string of ACGT's. It correctly interpreted my query as a nucleotide sequence, told me the amino acid sequence it would make, correctly guessed how often this sequence would be found in the genome if bases occur randomly, and gave me gene names, positions, and ideograms of the places where this sequence is actually found in the human genome.
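The "how often would this turn up by chance" number is easy to approximate yourself. The Python sketch below is my own back-of-the-envelope version, not a description of how Wolfram Alpha computes it. It assumes all four bases are equally likely and independent, counts one strand only, and uses a round 3.1 billion bases as the genome length; all three are assumptions, and the function name is made up for this example.

```python
def expected_random_hits(sequence, genome_length=3_100_000_000):
    """Expected number of exact matches to `sequence` in random DNA.

    Assumes each base is A, C, G or T with probability 1/4, independently,
    and counts start positions on a single strand. The default genome
    length is a rough figure for the human genome (an assumption).
    """
    k = len(sequence)
    # A k-mer can start at (genome_length - k + 1) different positions.
    start_positions = max(genome_length - k + 1, 0)
    # Each position matches with probability (1/4)^k.
    return start_positions * 0.25 ** k


for seq in ["ACGTACGT", "ACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT"]:
    print(len(seq), expected_random_hits(seq))
```

The expected count drops by a factor of four with every extra base, so under these assumptions a sequence of about 16 bases is expected to occur less than once by chance in a genome this size.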
Finally, I tried searching for a SNP that I have an interest in.
For being only days old, and for not being specifically developed as a bioinformatics tool, it's pretty impressive what it can do already. It should be interesting to see what else they come up with.
Linux tip: history and !
Ever find yourself trying to remember a series of steps you recently did in Linux? Try typing the command "history" at the command line (without the quotes). You'll see a long list of your most recently used commands. It looks like this:
1018 ls
1019 cd
1020 cd /scratch/turnersd/
1021 ls
1022 cd
1023 grep -P "\s(10|9|8)\s" /scratch/turnersd/alz/parsed.txt | awk '{print $1"\n"$2}' | sort | uniq | perl -pi -e 's/RS/rs/g'
1024 history
Which brings me to my second tip, ! commands. Notice that when you type history the commands are numbered. 1023 in particular was a long string of commands I wouldn't want to retype over and over. Fortunately the shell lets me repeat that command any time I want just by typing an exclamation point followed by the number (for example, !1023) at the command line, which does the same thing as typing it in the long way.
Friday fun: Abstract mad libs
For those of you writing abstracts that will get you to Hawaii this fall for IGES (due June 1) and ASHG (due June 2), just follow this template from Piled Higher and Deeper.
Not just another MDR paper
Nature Reviews Genetics just published an excellent paper on interaction analysis by Heather Cordell. This masterfully written review starts by defining interaction, then delves into strategies to statistically model it in human genetics studies. She covers several regression-based procedures, Bayesian approaches, data-mining methods, and other related techniques, referring the reader to the pertinent publications and software (usually free) available to perform these analyses. The review wraps up with a discussion of the biological interpretation of statistical models of interaction.
Genome-wide association studies: Detecting gene–gene interactions that underlie human diseases (NRG Advance Online)
Don't categorize continuous variables!
If you're doing an analysis with variables that naturally vary on a continuous scale, like age or smoking pack-years, NEVER be tempted to categorize individuals into groups - there's nearly always a better approach that utilizes the full distribution of values. It may seem convenient for a particular analysis you're doing but you'll take an enormous hit in power and precision. Frank Harrell in Vanderbilt's Biostatistics department wrote an excellent list of reasons why this is a terrible idea.
For an interactive example, take 5 seconds to look at this applet that illustrates the problem. Look at the t-statistic for the regression coefficient, and especially the R-squared. Moving the slider to the right simulates splitting observations at the median of X into two categories. The regression coefficient fluctuates, but both the t-statistic and the R-squared shrink substantially - enough to make the result no longer significant. Here's what it looks like:
Using the full distribution of your data:
b=.43 ; p=.036 ; R²=.221
After a median split:
b=.29 ; p=.275 ; R²=.066
The bottom line here is that if you have continuous variables, pick an analysis method that doesn't discard useful variation in your data! See the two previous posts (part I and part II) for help choosing the best method for the data types you have.
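If you'd rather see the effect in simulated data than in an applet, here is a rough Python sketch using only the standard library. It does not reproduce the applet's numbers above. The sample size, the effect size, the noise level, and the random seed are all arbitrary choices made for illustration.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation, which equals R-squared for a
    simple linear regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def median_split_demo(n=5000, seed=1):
    """Simulate y = x + noise, then compare R-squared when x is used
    as-is versus dichotomized at its median (illustrative settings)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 2) for xi in x]
    cut = statistics.median(x)
    # Median split: everyone above the median is 1, everyone else is 0.
    x_split = [1.0 if xi > cut else 0.0 for xi in x]
    return r_squared(x, y), r_squared(x_split, y)


r2_full, r2_split = median_split_demo()
print(f"continuous X: R^2 = {r2_full:.3f}")
print(f"median split: R^2 = {r2_split:.3f}")
```

The same y values are explained noticeably worse once x is collapsed into two groups, which is the loss of power and precision described above.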
AAAS personalized medicine meeting webcast
AAAS and the Food and Drug Law Institute (FDLI) are holding three 2-day meetings in Washington DC on personalized medicine, addressing the scientific discoveries, business models, and policy changes that are necessary to develop personalized treatments and diagnostics. All three meetings will be available via a free live webcast where viewers can use response technology to interact with speakers and session chairs.
The first meeting is June 1-2, and the agenda is available online. One session, "State of the Science: Connecting Biomarkers and Diagnostics" is chaired by Teri Manolio, who will be speaking at the CHGR's June 10 Retreat. Other sessions cover consumer genetics in the open market, ethical, legal, and social implications (ELSI), and more.
Personalized medicine: Planning for the Future
June 1-2 2009: Colloquium I Agenda
Highlight all the acronyms in a Word document
A tip of the hat to Lifehacker for pointing this out.
Ever been nailed by a reviewer or a thesis committee for using too many acronyms without defining them well? There's an easy way built into MS Word to find and highlight all the acronyms in a document. It's a nice check after you're done with a first draft, and you might be surprised by how many you find!
First, hit Ctrl-F to bring up a find box, and type this in exactly as written:
<[A-Z]{2,}>
Hit the "More >>" button, then check the "Use Wildcards" box. Finally, click the "Reading Highlight" box, then click "Highlight All".
After you do that, you'll see all your uppercase acronyms highlighted throughout the entire document.
That find query basically tries to find any two or more adjacent capital letters. See the original article referenced above and the comments below it for more similar wildcard search tricks, like how to extend this idea to include numbers, or using regular expressions.
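If your draft is in plain text rather than Word, roughly the same check can be done with a regular expression. The Python sketch below is an approximation of the Word wildcard search above, not an exact equivalent: it uses \b word boundaries, which treat digits and underscores as part of a word, so something like a gene symbol containing a number will not be matched. The function name and the sample sentence are made up for this example.

```python
import re
from collections import Counter

# Two or more adjacent capital letters standing alone as a word.
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text):
    """Return a dict mapping each all-caps acronym to how often it appears."""
    return dict(Counter(ACRONYM_PATTERN.findall(text)))


sample = "The GWAS used SNP chips; each SNP was tested, as reported at ASHG."
for acronym, count in sorted(find_acronyms(sample).items()):
    print(acronym, count)
```

The counts make it easy to spot acronyms that appear only once or twice and may not be worth introducing at all.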
100 publications every grad student should read
Jason Moore at the previously mentioned Epistasis Blog has begun compiling a list of 100 papers every grad student should read, broken down by discipline. Right now the list is in its infancy, but it's a good start. I'll post here when the list is updated again.
100 Publications Every Graduate Student Should Read
UPDATE 2009-05-08: The list has grown substantially since yesterday. Check the link again!
Gene regulation: A new toolbox for mapping regulatory sites
Here's a paper in Nature Reviews Genetics highlighting two recently published methods for mapping regulatory sites. Most currently used procedures rely on ChIP, where you can only examine sites one protein at a time. The two methods discussed here can potentially overcome these limitations - one mostly identifies cis-regulatory sites near promoters, and the other is capable of identifying sites farther away, including long range enhancer elements.
R Reference Card (PDF)
Last week I posted a short tutorial on how to merge datasets using R. R is free, open-source statistical computing software and a programming language (get R here). The only downside is a steeper learning curve because the documentation is sparse and often difficult to understand at first. Once you start using it, you'll realize it can do anything SPSS, SAS, and Stata can do, and its graphing capabilities are light years ahead of everything else. Lately its use is becoming more mainstream, gaining popularity in the life sciences among other fields, and knowing how to use R is a marketable skill to have on a job hunt.
If you've used R at least a few times before then this printable reference card is really handy for remembering which functions do what and how to use them. If you've never used R before, there are several online resources to teach you the basics of R, and a free book (PDF) written specifically for people who have used SPSS or SAS before and now want to learn R. Also, check back here in the future for more R tutorials or examples.
R Reference Card (PDF) via Rpad.org
Free tutorials on bioinformatics and model organisms resources
OpenHelix offers free access to a handful of their tutorials on genomics and bioinformatics resources, including the UCSC Genome Browser, Seattle SNPs, Genome Variation Server, and the VISTA Comparative Genomics tools. Below those, you can also find similar tutorials on how to use the free tools and resources for most of the major model organisms (FlyBase, ZFIN, WormBase, etc.).
Each of the tutorials has video lessons, PowerPoint slides, handouts, and practice exercises available for download.
OpenHelix: Free tutorials on bioinformatics and model organisms resources
Genetic diversity in African populations
An international team led by Sarah Tishkoff, in collaboration with our own Scott Williams, and former CHGR member Jason Moore, published yesterday in Science the largest, most comprehensive characterization of genetic variation in over 100 different African populations. The graphic below summarizes some of this regional variation by displaying the proportion of each of 14 color-coded ancestral populations that are found in modern African subpopulations.
UPDATE 2009-05-01: I didn't read through the 100+ page supplement with 40+ figures, but this has made quite a splash in the popular science journalism press. You can read about the highlights here, here, here, here, here, here, here, and here.
Science: The Genetic Structure and History of Africans and African Americans