It's happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to
http://instititution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.
THE PROBLEM
This isn't an uncommon problem. See the following two articles:
Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS one 6.9 (2011): e24914.
Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.
The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Research Web Server Issue between 2003 and 2009:
- Only 72% were still available at the published address.
- The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
- The authors could only confirm positive functionality for 45%.
- Only 274 of the 872 corresponding authors answered an email.
- Of these, 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.
The Wren paper found that of 1630 URLs identified in PubMed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).
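To get a feel for what such a survey involves, here is a minimal sketch of a link checker in Python. It is not the method used in either paper; the function names (`http_status`, `survey`, `availability`) are made up for illustration, and it only handles HTTP(S) URLs, not FTP sites.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error such as 404
    except (urllib.error.URLError, OSError, ValueError):
        return 0  # DNS failure, refused connection, timeout, or malformed URL


def survey(urls: Iterable[str],
           fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each URL to True if it answered with a 2xx status (after redirects)."""
    return {url: 200 <= fetch(url) < 300 for url in urls}


def availability(results: Dict[str, bool]) -> float:
    """Fraction of surveyed URLs that were reachable."""
    return sum(results.values()) / len(results) if results else 0.0
```

Calling `availability(survey(my_urls))` on a list of published addresses gives a single availability figure. A one-off check like this only tells you whether a URL answered at that moment; judging whether a resource is consistently available would presumably mean repeating the check over time. Some servers also reject HEAD requests, so a fallback to GET may be needed in practice.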
It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace.
SOFTWARE
I won't spend much time here because most readers are probably aware of source code repositories for hosting software projects. Unless you're keeping your source code closed (aside: starting an open-source project is a way to stake a claim in a field, not a real risk of getting yourself scooped), I can think of no benefit to hosting your code on your lab website when there are plenty of better alternatives available, such as SourceForge, GitHub, Google Code, and others. In addition to free project hosting, tools like these provide version control, wikis, bug trackers, mailing lists, and other services that enable transparent and open development, with the end result of a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:
DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE
For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.
Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data.
Upon uploading your data to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to instantly take credit for the products of your research.
To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).
Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).
Once I dropped in the data I downloaded from Vanderbilt's website (linked from the original blog post), I entered some optional metadata, a description, and a link back to the original post:
I then instantly received a citable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:
There are also links to the side that allow you to export that citation directly to your reference manager of choice.
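If you'd rather script this than click the export links, DOIs can also be resolved to formatted citations through DOI content negotiation. The snippet below is a rough sketch, assuming the DOI's registration agency supports the requested format; the helper names (`citation_request`, `fetch_citation`) are my own, and `10.1234/example` in the usage note is a placeholder, not a real DOI.

```python
import urllib.request

DOI_RESOLVER = "https://doi.org/"

# Media types commonly accepted for DOI content negotiation.
CITATION_FORMATS = {
    "bibtex": "application/x-bibtex",
    "ris": "application/x-research-info-systems",
    "text": "text/x-bibliography",
}


def citation_request(doi: str, style: str = "bibtex") -> urllib.request.Request:
    """Build a request asking the DOI resolver for a citation in the given style."""
    return urllib.request.Request(
        DOI_RESOLVER + doi,
        headers={"Accept": CITATION_FORMATS[style]},
    )


def fetch_citation(doi: str, style: str = "bibtex", timeout: float = 10.0) -> str:
    """Resolve the DOI over the network and return the citation as text."""
    with urllib.request.urlopen(citation_request(doi, style), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Something like `print(fetch_citation("10.1234/example", "ris"))`, with your own DOI substituted in, should return a record you can import into a reference manager.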
Finally, as an experiment, I also uploaded this entire blog post to Figshare, where it is now citable and permanently archived: