Showing posts with label Writing. Show all posts

Stop Hosting Data and Code on your Lab Website

It's happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there's no trace of what you were looking for.

THE PROBLEM

This isn't an uncommon problem. See the following two articles:
Schultheiss, Sebastian J., et al. "Persistence and availability of web services in computational biology." PLoS one 6.9 (2011): e24914. 
Wren, Jonathan D. "404 not found: the stability and persistence of URLs published in MEDLINE." Bioinformatics 20.5 (2004): 668-672.
The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Research Web Server Issue between 2003 and 2009:
  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these, 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.
The Wren paper found that of 1630 URLs identified in PubMed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).
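
If you're curious how the links in your own papers are holding up, a few lines of Python will give you a rough answer. This is only a minimal sketch: the function names `check_url` and `availability` are made up for illustration, and a single request at one point in time says nothing about whether a URL is consistently available.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10.0):
    """Return 'ok', 'http <code>', or 'unreachable' for a single URL."""
    try:
        # HEAD asks for headers only, so nothing large gets downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return f"http {err.code}"
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return "unreachable"


def availability(results):
    """Percentage of URLs in a {url: status} dict whose status is 'ok'."""
    if not results:
        return 0.0
    ok = sum(1 for status in results.values() if status == "ok")
    return 100.0 * ok / len(results)


# Example use (needs network access; put your own published links here):
# urls = ["http://dx.doi.org/10.6084/m9.figshare.103113"]
# print(availability({u: check_url(u) for u in urls}))
```

Some servers reject HEAD requests, so a real survey would probably fall back to GET and retry on several different days before declaring a link dead.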

OpenHelix recently started this thread on Biostar as an obituary section for bioinformatics tools and resources that have vanished.

It's a fact that most of us academics move around a fair amount. Often we may not deem a tool we developed or data we collected and released to be worth transporting and maintaining. After some grace period, the resource disappears without a trace. 

SOFTWARE

I won't spend much time on this because most readers are probably aware of source code repositories for hosting software projects. Unless you're not releasing the source code to your software (aside: starting an open-source project is a way to stake a claim in a field, not a real risk for getting yourself scooped), I can think of no benefit to hosting your code on your lab website when there are plenty of better alternatives available, such as Sourceforge, GitHub, Google Code, and others. In addition to free project hosting, tools like these provide version control, wikis, bug trackers, mailing lists and other services to enable transparent and open development, with the end result of a better product and higher visibility. For more tips on open scientific software development, see this short editorial in PLoS Comp Bio:

Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8(12): e1002802. 

Casey Bergman recently analyzed where bioinformaticians are hosting their code, and found that the growth rate of GitHub is outpacing both Google Code and Sourceforge. Indeed, GitHub hosts more repositories than there are articles in Wikipedia, and has an excellent tutorial and interactive learning modules to help you learn how to use it. However, Bergman also points out how easy it is to delete a repository from GitHub and Google Code, where repositories are published by individuals who hold the keys to preservation (as opposed to Sourceforge, where it is extremely difficult to remove a project once it's been released).

DATA, FIGURES, SLIDES, WEB SERVICES, OR ANYTHING ELSE

For everything else there's Figshare. Figshare lets you host and publicly share unlimited data (or store data privately up to 1GB). The name suggests a site for sharing figures, but Figshare allows you to permanently store and share any research object. That can be figures, slides, negative results, videos, datasets, or anything else. If you're running a database server or web service, you can package up the source code on one of the repositories mentioned above, and upload to Figshare a virtual machine image of the server running it, so that the service will be available to users long after you've lost the time, interest, or money to maintain it.

Research outputs stored at Figshare are archived in the CLOCKSS geographically and geopolitically distributed network of redundant archive nodes, located at 12 major research libraries around the world. This means that content will remain available indefinitely for everyone after a "trigger event," and ensures this work will be maximally accessible and useful over time. Figshare is hosted using Amazon Web Services to ensure the highest level of security and stability for research data. 

Upon uploading your data to Figshare, your data becomes discoverable, searchable, shareable, and instantly citable with its own DOI, allowing you to instantly take credit for the products of your research. 

To show you how easy this is, I recently uploaded a list of "consensus" genes generated by Will Bush where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene (discussed in more detail in this previous post).

Create an account, and hit the big upload link. You'll be given a screen to drag and drop anything you'd like here (there's also a desktop uploader for larger files).



Once I dropped in the data I downloaded from Vanderbilt's website linked from the original blog post, I entered some optional metadata: a description and a link back to the original post:



I then instantly received a citable DOI where the data is stored permanently, regardless of Will's future at Vanderbilt:

Ensembl/Entrez hg19/GRCh37 Consensus Genes. Stephen Turner. figshare. Retrieved 21:31, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.103113

There are also links to the side that allow you to export that citation directly to your reference manager of choice.

Finally, as an experiment, I also uploaded this entire blog post to Figshare, where it is now citable and permanently archived:

Stop Hosting Data and Code on your Lab Website. Stephen Turner. figshare. Retrieved 22:51, Dec 19, 2012 (GMT). http://dx.doi.org/10.6084/m9.figshare.105125.

Computing for Data Analysis, and Other Free Courses

Coursera's free Computing for Data Analysis course starts today. It's a four-week course, requiring about 3-5 hours/week. A bit about the course:
In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment, discuss generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing which includes programming in R, reading data into R, creating informative data graphics, accessing R packages, creating R packages with documentation, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.
There are also hundreds of other free courses scheduled for this year. While the Computing for Data Analysis course is more about using R, the Data Analysis course is more about the methods and experimental designs you'll use, with a smaller emphasis on the R language. There are also courses on Scientific Computing, Algorithms, Health Informatics in the Cloud, Natural Language Processing, Introduction to Data Science, Scientific Writing, Neural Networks, Parallel Programming, Statistics 101, Systems Biology, Data Management for Clinical Research, and many, many others. See the link below for the full listing.

Free Courses on Coursera

Galaxy Project Group on CiteULike and Mendeley


The Galaxy Project started using CiteULike to organize papers that are about, use, or reference Galaxy. The Galaxy CiteULike group is open to any CUL user, and once you join, you can add papers to the group, assign tags, and rate papers.

While not a CUL user, I'm a big fan of Mendeley for managing references, PDFs, and creating bibliographies (and so are many of you). I'm happy to hear that the Galaxy folks also set up a Galaxy Mendeley Group, also open to the public for anyone to join. If you join the Galaxy public Mendeley group, all of the group's references will show up in your Mendeley library (and these won't count against your personal quota).

Just one important thing to note: The Mendeley group is a mirror of the CiteULike group, so if you want to add more publications to the Galaxy Group, add them on CiteULike, not Mendeley (it doesn't work the other way around - papers added to Mendeley won't make it to the CUL group).

Galaxy Project Group on CiteULike and Mendeley

Using LaTeX for Math Formulas on the Web

I love the idea of using R+LaTeX+Sweave for reproducible research. This is even easier now that R has a jazzy new IDE that supports Sweave syntax highlighting and automatic PDF generation.

I know I'm going to take some flak for saying this, but let's be honest here... If you're working in the biomedical sciences, chances are, your collaborators have never heard of Sweave. Physicians only use LaTeX during surgery. Lots of folks you work with probably think real applied statistics can only be done in SAS (if you're one of them, please see http://www.r-project.org/). Most biomedical journals will only accept MS Word .doc files during manuscript submission. NIH grant applications use a standardized MS Word template.

These are a few of the reasons I don't routinely incorporate LaTeX+Sweave in my analysis workflow.

That said, one of the things LaTeX is really good for is mathematical typesetting. Writing out math formulae using LaTeX is fast, intuitive, and your plain-text code is portable. If you're ever posting a question on stats.stackexchange, editing Wikipedia, or if you're like me and keep your lab notebook online in a private blog, using LaTeX conventions for typesetting formulae can be extremely handy.

Codecogs.com's Online LaTeX Equation Editor makes it very simple to use HTML to add formulae to your blog or anywhere on the web. The idea's simple - you type in the LaTeX code for the formula you want, e.g.

SS_{err}=\sum_i({y_i-\hat{y}_i})^2

And you'll get this HTML code:

<img src="http://latex.codecogs.com/gif.latex?SS_{err}=\sum_i({y_i-\hat{y}_i})^2" title="SS_{err}=\sum_i({y_i-\hat{y}_i})^2" />
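
If you have a lot of formulae, you could also build that tag with a short script instead of going through the editor each time. Here's a minimal Python sketch, assuming the same `gif.latex` endpoint that appears in the tag above. The helper name `latex_img_tag` is made up for illustration, and the formula is percent-encoded so that characters like backslashes and braces are safe inside a URL.

```python
from html import escape
from urllib.parse import quote

# Endpoint copied from the tag generated by the online editor.
CODECOGS_BASE = "http://latex.codecogs.com/gif.latex?"


def latex_img_tag(formula):
    """Build an <img> tag that points at a Codecogs-rendered formula."""
    # Percent-encode everything that isn't an unreserved URL character.
    src = CODECOGS_BASE + quote(formula, safe="")
    # Escape the formula for use inside an HTML attribute.
    title = escape(formula, quote=True)
    return f'<img src="{src}" title="{title}" />'


print(latex_img_tag(r"SS_{err}=\sum_i({y_i-\hat{y}_i})^2"))
```

Running this prints an `<img>` tag equivalent to the one above, just with the formula percent-encoded in the `src` URL.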

This HTML code generates a hosted image that you can copy and paste anywhere on the web you like. Paste this into the compose window in Blogger, and it looks like this:



Online LaTeX Equation Editor

Embed R Code with Syntax Highlighting on your Blog

Note 2010-11-17: there's more than one way to do this. See the updated post from 2010-11-17.

If you use Blogger or even Wordpress you've probably found that it's complicated to post code snippets with spacing preserved and syntax highlighting (especially for R code). I've discovered a few workarounds that involve hacking the Blogger HTML template and linking to someone else's JavaScript templates, but it isn't pretty and I'm relying on someone else to perpetually host and maintain the necessary JavaScript. GitHub Gists make this really easy. GitHub is a source code hosting and collaborative/social coding website, and gist.github.com makes it very easy to post, share, and embed code snippets with syntax highlighting for almost any language you can think of.

Here's an example of some R code I posted a few weeks ago on making QQ plots of p-values using R base graphics.



The Perl highlighter also works well. Here's some code I posted recently to help clean up PLINK output:



Simply head over to gist.github.com and paste in your code, select a language for syntax highlighting, and hit "Create Public Gist." The embed button will give you a line of HTML that you can paste into your blog to embed the code directly.

Finally, if you're using Wordpress you can get the Github Gist plugin for Wordpress to get things done even faster. A big tip of the hat to economist J.D. Long (blogger at Cerebral Mastication) for pointing this out to me.

Using R, LaTeX, and Sweave for Reproducible Research: Handouts, Templates, & Other Resources

Several readers emailed me or left a comment on my previous announcement of Frank Harrell's workshop on using Sweave for reproducible research asking if we could record the seminar. Unfortunately we couldn't record audio or video, but take a look at the Sweave/Latex page on the Biostatistics Dept Wiki. Here you can find Frank's slideshow from today and the handout from today (a PDF statistical report with all the LaTeX/R code necessary to produce it). While this was more of an advanced Sweave/LaTeX workshop, you can also find an introduction to LaTeX and an introduction to reproducible research using R, LaTeX, and Sweave, both by Theresa Scott.

In addition to lots of other helpful tips, you'll also find the following resources to help you learn to use both Sweave and LaTeX:

Keep your lab notebook in a private blog

In my previous post about Q10 a commenter suggested a software called "The Journal" by davidRM for productively keeping track of experiments, datasets, projects, etc. I've never tried this software before, but about a year ago I ditched my pen and paper lab notebook for an electronic lab notebook in the form of a blog using Blogger, the same platform I use to write Getting Genetics Done.

The idea of using a blogging platform for your lab notebook is pretty simple and the advantages are numerous. All your entries are automatically dated and appear chronologically. You can view your notebook or make new entries from anywhere in the world. You can copy and paste useful code snippets, upload images and documents, and take advantage of the tags and search features present in most blogging platforms. I keep my lab notebook private - search engines can't index it, and I have to give someone permission to view before they can see it. Once you've allowed someone to view your blog/notebook, you can also allow them to comment on posts. This is a great way for my mentor to keep track of results and make suggestions. And I can't count how many times I've gone back to an older notebook entry to view the code that I used to do something quickly in R or PLINK but didn't bother to save anywhere else.

Of course Blogger isn't the only platform that can do this, although it's free and one of the easiest to get set up, especially if you already have a Google account. Wordpress is very similar, and has tons of themes. You can find lots of comparisons between the two online.  If you have your own web host, you can install the open-source version of Wordpress on your own host, for added security and access control (see this explanation of the differences between Wordpress.com and Wordpress.org).

Another note-taking platform I've been using recently is Evernote. The Lifehacker blog has a great overview of Evernote's features. It runs as a desktop application that syncs across all your computers, and it's also available online through a web interface. The free account lets you sync 40MB a month, which is roughly the equivalent of 20,000 typed notes. This quota resets every month, and you start fresh at 40MB. You can also attach PDFs to a note, and link notes to URLs. Every note is full-text searchable.

And then of course there's the non-free option: Microsoft OneNote.  Although it will set you back a few bucks, it integrates very nicely with many features on your Windows machine. I've never used OneNote.

FreeMyPDF.com unlocks PDFs for submitting to PubMed Central

Do you submit manuscripts to journals that are not indexed in PubMed? This can make it difficult for others to find your publications, especially if they don't have a subscription to the journal. This often happens to us when we publish in computer science journals. Using the NIH manuscript submission system you can upload your manuscript to PubMed Central, which provides free open access, and is indexed in PubMed. This takes less than 5 minutes to do per manuscript, and it makes it much easier for you and any other interested parties to access your publications. Furthermore, if you use NIH funding, you are required by law to make any publications resulting from this funding free and publicly available. Before you upload, check with the journal's editor or publisher to make sure you're not breaching any copyright agreement.

I've uploaded a few of my own papers, and a snag I often run into is that the publisher will "lock" the PDF by enabling security which prevents software from extracting data from the PDF file. FreeMyPDF.com will remove the data extraction, printing, and other security restrictions from your PDF, making it compatible with the NIH manuscript submission system.
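
Before uploading, you can get a rough idea of whether a PDF has security enabled without opening it in a viewer. The Python sketch below is a crude heuristic, not a real PDF parser: it assumes that a secured PDF carries an `/Encrypt` entry somewhere in its raw bytes, and the function names and the file name `paper.pdf` are hypothetical. A proper PDF library would be more reliable.

```python
from pathlib import Path


def looks_secured(pdf_bytes):
    """Heuristic: True if the raw PDF bytes mention an /Encrypt entry."""
    # Secured PDFs reference an encryption dictionary from the trailer.
    return b"/Encrypt" in pdf_bytes


def file_looks_secured(path):
    """Apply the same heuristic to a PDF file on disk."""
    return looks_secured(Path(path).read_bytes())


# Example use (hypothetical file name):
# if file_looks_secured("paper.pdf"):
#     print("This PDF appears to be locked.")
```

A file that trips this check is a good candidate for unlocking before you try the submission system.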

NIH Manuscript Submission System

NIH Open Access Policy

FreeMyPDF.com - Removes security from viewable PDFs

Sync files across multiple computers with Dropbox (PC, Mac & Linux too!)

Do you ever find yourself switching back and forth between your work computer, your laptop, and your home computer?  This happens to me all the time when I'm writing.  Rather than carry all your files on a USB stick and risk losing it or corrupting your data, give Dropbox a try.  It's dead simple, and works for PC, Mac, and Linux too.

Once you sign up and install on all your computers, you'll have a special folder: anything you save there on one computer is automatically copied to, and kept synchronized with, the same folder on all your other computers. What's more, if you use someone else's computer, you can access all your files through a web interface because they're all securely backed up online. I've been using this for a while now to sync all the papers I'm working on, RefMan/EndNote databases, config files, and R functions I reuse all the time. You can also create "public" folders. Put something here, and you can get a direct link to the file online to share with other folks. For example, here's a link to some R code I wrote to use ggplot2 to make manhattan plots and QQ-plots for every PLINK output file in the current directory (I'm hoping to clean this code up and include this with some other functions I've written into a package on CRAN soon).

I can't recommend this little app enough. If you're still not convinced, check out this short video that explains what Dropbox is all about and shows off just how simple it is to use.

You get a whopping 2GB for free, but if you use the registration link provided below, you'll get an extra free 1/4GB.  Happy holidays from GGD, and I'll catch up with you all next week!

Dropbox - Secure online backup and synchronization

Highlight all the acronyms in a Word document

A tip of the hat to Lifehacker for pointing this out.

Ever been nailed by a reviewer or a thesis committee for using too many acronyms without defining them well? There's an easy way built into MS Word to find and highlight all the acronyms in a document. It's a nice check after you're done with a first draft, and you might be surprised by how many you find!

First, hit Ctrl-F to bring up a find box, and type this in exactly as written:

<[A-Z]{2,}>

Hit the "More >>" button, then check the "Use Wildcards" box. Finally, click the "Reading Highlight" box, then click "Highlight All". It should look like this:



After you do that, you'll see all your uppercase acronyms highlighted throughout the entire document. It will look something like this:



That find query basically tries to find any two or more adjacent capital letters. See the original article referenced above and the comments below it for more similar wildcard search tricks, like how to extend this idea to include numbers, or using regular expressions.
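
If your draft lives in plain text rather than Word, roughly the same search can be done with a regular expression. Here's a minimal Python sketch, assuming that `\b` word boundaries are a fair stand-in for Word's `<` and `>` start-of-word and end-of-word markers. The function name and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Two or more adjacent capital letters forming a whole word.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text):
    """Count every all-caps word of two or more letters in the text."""
    return Counter(ACRONYM.findall(text))


sample = ("We ran PLINK on the GWAS data, then fit the model in R. "
          "PLINK output was cleaned in Perl.")
print(find_acronyms(sample).most_common())
# [('PLINK', 2), ('GWAS', 1)]
```

Single letters like "R" and mixed-case words like "Perl" are skipped, just as they are by the Word query.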

Use plain text when you need to just write

When you need to focus and get some serious writing done, it may be a good idea to ditch your word processor and go with plain text instead. Save all your formatting and spell-checking to do later in one step. I just tried this out on a review article I needed to write and found it much easier to concentrate without all of MS Word's squiggly underlining, autocorrecting, autoformatting, and fancy toolbar buttons begging to be clicked. If you're using Windows, Notepad works fine, but try out a tiny free program called Q10. It's a full screen text editor that by default uses easy-on-the-eyes light text on a black background. You can also run it directly from a USB stick. When you're done, just open the .txt file using Word, and from then on save it as .doc. By default it comes enabled with some silly typewriter sound when you type, but it's easy enough to turn that off.

Q10 - Full screen distraction-free text editor