
Understanding the ENSEMBL Schema

ENSEMBL is a frequently used resource for various genomics and transcriptomics tasks.  The ENSEMBL website and MART tools provide easy access to their rich database, but ENSEMBL also provides flat-file downloads of their entire database and a public MySQL portal.  You can access this with MySQL Workbench (or any MySQL client) using the following:
Host:     useastdb.ensembl.org
User:     anonymous
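After connecting, SHOW DATABASES lists everything available; the core databases are named by species, release, and assembly, so the release number below is only an illustration (use whichever release is current).
SHOW DATABASES LIKE "homo_sapiens_core%";
USE homo_sapiens_core_70_37;  -- example only: human core database, release 70, GRCh37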
Once inside, you can get a sense for what the ENSEMBL schema (or data model) is like.  First, it's important to understand the ENSEMBL ID system.  Most of the primary entities in the ENSEMBL database (genes, exons, transcripts, proteins) have a formal, stable identifier (beginning with ENSG, ENSE, ENST, and ENSP respectively) that does not change from build to build.  These stable identifiers live in the *_stable_id tables (gene_stable_id, exon_stable_id, transcript_stable_id, and translation_stable_id).  All of these entities also have an internal identifier (an integer).  Once you have an internal ID for the entity of interest, details of the entity can be found in the gene, exon, transcript, and translation (protein) tables.  For example, the following query will retrieve a list of all transcripts and their exons for a given gene.
SELECT * FROM gene_stable_id a
INNER JOIN gene b ON a.gene_id = b.gene_id
INNER JOIN transcript c ON b.gene_id = c.gene_id
INNER JOIN exon_transcript d ON c.transcript_id = d.transcript_id
INNER JOIN exon e ON d.exon_id = e.exon_id
INNER JOIN transcript_stable_id f ON c.transcript_id = f.transcript_id
INNER JOIN exon_stable_id g ON e.exon_id = g.exon_id
WHERE a.stable_id = "ENSG00000139618";  -- e.g., BRCA2; substitute the stable ID of your gene of interest
The exon_transcript table contains a mapping of each exon to any transcripts containing it, along with a rank indicating the exon's position within a given transcript.  To retrieve exons for a list of genes by their ENSEMBL IDs, the IDs could be loaded into a table and joined to the gene_stable_id table in the query above (a sketch of this appears after the coordinate query below).  To pull the GRCh37 (build 37) chromosome and coordinates for an exon, use the following:
SELECT a.exon_id, b.name, a.seq_region_start, a.seq_region_end FROM exon a
INNER JOIN seq_region b ON a.seq_region_id = b.seq_region_id
INNER JOIN coord_system c ON b.coord_system_id = c.coord_system_id
WHERE c.version = "GRCh37";
In this query, the seq_region table contains a field called name that identifies the contig to which the coordinates refer, in this case the chromosome number. 
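As a sketch of the list-based lookup mentioned above: for a handful of genes the simplest variant is an IN list on the stable IDs (the two IDs below, BRCA2 and TP53, are just examples); for longer lists, load the IDs into a table of your own and join it to gene_stable_id on its stable_id column, as described.
SELECT f.stable_id AS transcript, g.stable_id AS exon, d.rank
FROM gene_stable_id a
INNER JOIN transcript c ON a.gene_id = c.gene_id
INNER JOIN exon_transcript d ON c.transcript_id = d.transcript_id
INNER JOIN transcript_stable_id f ON c.transcript_id = f.transcript_id
INNER JOIN exon_stable_id g ON d.exon_id = g.exon_id
WHERE a.stable_id IN ("ENSG00000139618", "ENSG00000141510")
ORDER BY f.stable_id, d.rank;
The ORDER BY on exon_transcript.rank returns each transcript's exons in order, as described above.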

There are also extensive cross-references in the ENSEMBL database.  To retrieve alternate identifiers for a set of transcripts, execute the following: 
SELECT * FROM transcript_stable_id a
INNER JOIN transcript b ON a.transcript_id = b.transcript_id
INNER JOIN object_xref c ON b.transcript_id = c.ensembl_id
INNER JOIN xref d ON c.xref_id = d.xref_id
INNER JOIN external_db e ON d.external_db_id = e.external_db_id
WHERE c.ensembl_object_type = "Transcript"
LIMIT 20;
ENSEMBL organizes cross-references (xrefs) for all entity types into a single table, object_xref.  This table contains an ensembl_object_type field that is "Transcript", "Gene", or "Translation", and an ensembl_id that matches a transcript_id, gene_id, or translation_id respectively.  Replace "transcript" in the above query with "gene" or "translation" to retrieve gene or protein cross-references.  A list of all external cross-reference sources can be found by querying:
SELECT db_name FROM external_db;
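For example, to pull only gene-level cross-references from a single source, restrict the query by db_name (HGNC is used here purely as an illustration; any db_name returned above will do, and SELECT * works if you would rather not name the xref columns):
SELECT a.stable_id, d.dbprimary_acc, d.display_label, e.db_name
FROM gene_stable_id a
INNER JOIN gene b ON a.gene_id = b.gene_id
INNER JOIN object_xref c ON b.gene_id = c.ensembl_id
INNER JOIN xref d ON c.xref_id = d.xref_id
INNER JOIN external_db e ON d.external_db_id = e.external_db_id
WHERE c.ensembl_object_type = "Gene"
AND e.db_name = "HGNC"
LIMIT 20;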
There is a great deal of information within the ENSEMBL database that can be accessed using SQL, which for some types of operations is easier than using the MART or web interface.  Full details of the ENSEMBL schema can be found here (http://useast.ensembl.org/info/docs/api/core/core_schema.html).


Software Carpentry Bootcamp at University of Virginia

A couple of weeks ago I, with the help of others here at UVA, organized a Software Carpentry bootcamp, instructed by Steve Crouch, Carlos Anderson, and Ben Morris. The day before the course started, Charlottesville was racked by nearly a foot of snow, widespread power outages, and many cancelled incoming flights. Luckily our instructors arrived just in time, and power was (mostly) restored shortly before the boot camp started. Despite the conditions, the course was very well-attended.

Software Carpentry's aim is to teach researchers (usually graduate students) basic computing concepts and skills so that they can get more done in less time, and with less pain. They're a volunteer organization funded by Mozilla and the Sloan Foundation, and they led this two-day bootcamp completely free of charge to us.

The course started out with a head-first dive into Unix and Bash scripting, followed by a tutorial on automation with Make, concluding the first day with an introduction to Python. The second day covered version control with git, Python code testing, and wrapped up with an introduction to databases and SQL. At the conclusion of the course, participants offered near-universal positive feedback, with the git and Make tutorials being exceptionally popular.

Software Carpentry's approach to teaching these topics is unlike many others that I've seen. Rather than lecturing for hours, the instructors inject very short (~5 minute) partnered exercises between every ~15 minutes of instruction in 1.5-hour sessions. With two full days of intensive instruction and your computer in front of you, it's all too easy to get distracted by an email, get lost in your everyday responsibilities, and zone out for the rest of the session. The exercises keep participants paying attention and accountable to their partners.

All of the bootcamp's materials are freely available:

Unix and Bash: https://github.com/redcurry/bash_tutorial
Python Introduction: https://github.com/redcurry/python_tutorial
Git tutorial: https://github.com/redcurry/git_tutorial
Databases & SQL: https://github.com/bendmorris/swc_databases
Everything else: http://users.ecs.soton.ac.uk/stc/SWC/tutorial-materials-virginia.zip

Perhaps more relevant to a broader audience are the online lectures and materials available on the Software Carpentry Website, which include all the above topics, as well as many others.

We capped the course at 50, and had 95 register within a day of opening registration, so we'll likely do this again in the future. I sit in countless meetings where faculty lament how nearly all basic science researchers enter grad school or their postdoc woefully unprepared for this brave new world of data-rich high-throughput science. Self-paced online learning works well for some, but if you're in a department or other organization that could benefit from a free, on-site, intensive introduction to the topics listed above, I highly recommend contacting Software Carpentry and organizing your own bootcamp.

Finally, when organizing an optional section of the course, we let participants vote whether they preferred learning number crunching with NumPy, or SQL/databases; SQL won by a small margin. However, Katherine Holcomb in UVACSE has graciously volunteered to teach a two-hour introduction to NumPy this week, regardless of whether you participated in the boot camp (although some basic Python knowledge is recommended). This (free) short course is this Thursday, March 21, 2-4pm, in the same place as the bootcamp (Brown Library Classroom in Clark Hall). Sign up here.

Stop Hosting Data and Code on your Lab Website

It's happened to all of us. You read about a new tool, database, web service, software package, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.

THE PROBLEM

This isn't an uncommon problem. See the following two articles:
Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS ONE 6.9 (2011): e24914. 
Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.
The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:
  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these, 78% said the service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher moved on to a permanent position.
The Wren paper found that of 1630 URLs identified in PubMed abstracts, only 63% were consistently available. That rate was far worse for anonymous-login FTP sites (33%).

OpenHelix recently started this thread on Biostar as an obituary section for bioinformatics tools and resources that have vanished.

It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace. 

SOFTWARE

I won't spend much time on this because most readers are probably aware of source code repositories for hosting software projects. Unless you're not releasing the source code to your software (aside: starting an open-source project is a way to stake a claim in a field, not a real risk for getting yourself scooped), I can think of no benefit to hosting your code on your lab website when there are plenty of better alternatives available, such as Sourceforge, GitHub, Google Code, and others. In addition to free project hosting, these services provide version control, wikis, bug trackers, mailing lists, and other tools that enable transparent and open development, with the end result being a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:

Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802. 

Casey Bergman recently analyzed where bioinformaticians are hosting their code, finding that the growth rate of GitHub is outpacing both Google Code and Sourceforge. Indeed, GitHub hosts more repositories than there are articles in Wikipedia, and has an excellent tutorial and interactive learning modules to help you learn how to use it. However, Bergman also points out how easy it is to delete a repository from GitHub or Google Code, where repositories are published by individuals who hold the keys to preservation (as opposed to Sourceforge, where it is extremely difficult to remove a project once it's been released).

DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE

For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.

Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data. 

Upon uploading to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to take credit for the products of your research. 

To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush, where each Ensembl gene refers to an Entrez Gene entry with the same coordinates, and that Entrez Gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).

Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).



Once I dropped in the data I downloaded from Vanderbilt's website (linked from the original blog post), I entered some optional metadata, a description, and a link back to the original post:



I then instantly receive a citeable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:

Ensembl/Entrez hg19/GRCh37 Consensus Genes. Stephen Turner. figshare. Retrieved 21:31, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.103113

There are also links to the side that allow you to export that citation directly to your reference manager of choice.

Finally, as an experiment, I also uploaded this entire blog post to Figshare, where it is now citable and permanently archived:

Stop Hosting Data and Code on your Lab Website. Stephen Turner. figshare. Retrieved 22:51, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.105125.

RegulomeDB: Identify DNA Features and Regulatory Elements in Non-Coding Regions

Many papers have noted the challenges associated with assigning function to non-coding genetic variation, and since the majority of GWAS hits for common traits are non-coding, resources that suggest a mechanism for these associations are desperately needed.

Boyle and colleagues have constructed a database called RegulomeDB to provide functional assignments to variants using data from manual curation, ChIP-seq data, chromatin state information, eQTLs across multiple cell lines, and some computational predictions generated from DNase footprinting and transcription factor binding motifs.

RegulomeDB implements a tiered category system (1-6) where category 1 has an eQTL association in addition to other ENCODE sources of data, categories 2-5 have ENCODE data only with no eQTL associations, and category 6 has evidence of a binding motif change only. As you might imagine, the number of variants in each category increases as the category number increases.



Their simple but impressive interface will accept rs numbers or whole BED, GFF, or VCF files for annotation.  The resulting output (example above) is downloadable, providing both the specifics of the annotation (such as the transcription factor binding to the region) and the functional score for the variant.
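If you're preparing a batch query, the input can be as simple as a list of rs numbers (one per line) or a plain three-column BED file of the regions you want annotated; the coordinates below are arbitrary placeholders (check the site for its exact input conventions):
chr1    1000000    1000500
chr2    2000000    2000050
chrX    5000000    5000010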

http://regulome.stanford.edu/

Browsing dbGaP Results


Thanks to the excellent work of Lucia Hindorff and colleagues at NHGRI, the GWAS catalog provides a great reference for the cumulative results of GWAS for various phenotypes.  Anyone familiar with GWAS also likely knows about dbGaP – the NCBI repository for genotype-phenotype relationships – and the wealth of data it contains.  While dbGaP is often thought of as a way to get access to existing genotype data, analysis results are often deposited into dbGaP as well.  Individual-level data (like genotypes) are generally considered "controlled access", requiring special permission to retrieve or use.  Summary-level data, such as association p-values, are a bit more accessible.  There are two tools available from the dbGaP website: the Association Results Browser and the Phenotype-Genotype Integrator (PheGenI).  These tools provide a search interface for examining previous GWAS associations.

The Association Results Browser provides a simple table listing of associations, searchable by SNP, gene, or phenotype.  It contains the information from the NHGRI GWAS catalog, as well as additional associations from dbGaP deposited studies.  I’ve shown an example below for multiple sclerosis.  You can restrict the search to the dbGaP-specific results by changing the “Source” selection.  If you are looking for the impact of a SNP, this is a nice supplement to the catalog.  Clicking on a p-value brings up the GaP browser, which provides a more graphical (but perhaps less useful) view of the data.



The PheGenI tool provides similar search functionality, but attempts to provide phenotype categories rather than more specific phenotype associations.  Essentially, phenotype descriptions are linked to MeSH terms to provide categories such as “Chemicals and Drugs”, or “Hemic and Lymphatic Diseases”.  PheGenI seems most useful if searching from the phenotype perspective, while the association browser seems better for SNP or Gene searches.  All these tools are under active development, and I look forward to seeing their future versions.

The HaploReg Database for Functional Annotation of SNPs


The ENCODE project continues to generate massive numbers of data points on how genes are regulated.  This data will be of incredible use for understanding the role of genetic variation, both for altering low-level cellular phenotypes (like gene expression or splicing) and for complex disease phenotypes.  While it is all deposited into the UCSC browser, ENCODE data is not always the easiest to access or manipulate. 

To make epigenomic tracks from the ENCODE project more accessible for interpretation in the context of new or existing GWAS hits, Luke Ward and Manolis Kellis at the Broad Institute have developed a database called HaploReg.  HaploReg uses LD and SNP information from the 1000 Genomes Project to map known genetic variants onto ENCODE data, providing a potential mechanism for SNP influence.  HaploReg will annotate SNPs with evolutionary constraint measures, predicted chromatin states, and how SNPs alter the position weight matrices of known transcription factors.

Here's a screenshot for a SNP associated with HDL cholesterol levels, showing summary information for several SNPs in LD at r² > 0.9 in CEU. Clicking each SNP link provides further info.



In addition to providing annotations of user-submitted SNPs, HaploReg also provides cross-references from the NHGRI GWAS Catalog, allowing users to explore the mechanisms behind disease-associated SNPs.  Check out the site here: http://www.broadinstitute.org/mammals/haploreg/haploreg.php and explore the functionality of any SNPs you might find associated in your work.  The more functional information we can include in our manuscripts, the more likely our associations are to be tested in a model system.
