
Comparison of plots using Stata, R base, R lattice, and R ggplot2, Part I: Histograms

One of the nicer things about many statistics packages is the extremely granular control you get over your graphical output.  But I lack the patience to set dozens of function arguments in R, and I'd rather not power the computer by pumping the mouse trying to set all the clicky-box options in Stata's graphics editor.  I want something that just looks nice, using the out-of-the-box defaults.  Here's a little comparison of four different graphing systems (three in R and one in Stata) and their default output for plotting a histogram of a continuous variable split over three levels of a categorical variable.

First I'll start with the three graphing systems in R: base, lattice, and ggplot2.  If you don't have the last two packages installed, go ahead and download them:

install.packages("ggplot2")
install.packages("lattice")

Now load these two packages and download this fake dataset I made up, containing 100 samples each from three different genotypes ("geno") and a continuous outcome ("trait"):

mydat <- read.csv("http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv", header=TRUE)
library(ggplot2)
library(lattice)

Now let's get started...

R: base graphics

par(mfrow=c(3,1))
with(subset(mydat,geno=="aa"),hist(trait))
with(subset(mydat,geno=="Aa"),hist(trait))
with(subset(mydat,geno=="AA"),hist(trait))
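For anyone wondering whether the base version can be shortened, here is one possible sketch: split the trait by genotype and loop over the pieces. The simulated data frame is a stand-in I made up so the snippet runs on its own; it copies the layout of the CSV (100 rows per genotype) but not its values, and the name "groups" is just illustrative.

```r
# Stand-in data (an assumption): three genotypes, 100 values each,
# mimicking the layout of the post's CSV but not its actual values.
set.seed(1)
mydat <- data.frame(
  geno  = rep(c("aa", "Aa", "AA"), each = 100),
  trait = rnorm(300, mean = rep(c(10, 12, 14), each = 100))
)

# Split the trait values into one vector per genotype.
groups <- split(mydat$trait, mydat$geno)

# One panel per genotype, stacked vertically, then one hist() call per group.
par(mfrow = c(length(groups), 1))
for (g in names(groups)) hist(groups[[g]], main = g, xlab = "trait")
```

With the data already loaded, that is two lines of plotting code rather than four, and it does not need to be edited if a genotype is added. The panel order follows the sort order of the genotype labels, which may differ from the order used above.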




R: lattice

histogram(~trait | factor(geno), data=mydat, layout=c(1,3))


R: ggplot2

qplot(trait,data=mydat,facets=geno~.)

# Update Tuesday, September 22, 2009
# A commenter mentioned that this code did not work.
# If the above code does not work, try explicitly
# stating that you want a histogram:
qplot(trait,geom="histogram",data=mydat,facets=geno~.)
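A note for readers on newer versions of ggplot2: qplot() is marked as deprecated in recent releases, so here is a rough equivalent using the full ggplot() syntax. The simulated data frame is a made-up stand-in with the same columns as the CSV, and the object name "p" is just illustrative.

```r
library(ggplot2)

# Stand-in data (an assumption): same columns as the post's CSV,
# but simulated values rather than the real file.
set.seed(1)
mydat <- data.frame(
  geno  = rep(c("aa", "Aa", "AA"), each = 100),
  trait = rnorm(300, mean = rep(c(10, 12, 14), each = 100))
)

# Histogram of trait, one row of panels per genotype.
# bins = 30 is the value geom_histogram() falls back to anyway;
# stating it just silences the "pick a better value" message.
p <- ggplot(mydat, aes(x = trait)) +
  geom_histogram(bins = 30) +
  facet_grid(geno ~ .)
print(p)
```

It is longer than the qplot() one-liner, so it does not help the character count, but it should keep working after qplot() goes away.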



Stata

insheet using "http://people.vanderbilt.edu/~stephen.turner/ggd/2009-09-21-histodemo.csv", comma clear
histogram trait, by(geno, col(1))





Commentary

In my opinion ggplot2 is the clear winner.  Again, I'll concede that all of the above graphing systems give you an incredible amount of control over every aspect of the graph, but I'm only looking for what gives me the best out-of-the-box default plot using the shortest command possible.

R's base graphics give you a rather spartan plot, with very wide bins.  It also requires 4 lines of code.  (If you can shorten this, please comment.)  By default, the base graphics system gives you counts (frequency) on the vertical axis.

The lattice package in R does a little better perhaps, but the default color scheme is visually less than stellar.  Also, I'm not sure why the axis labels switch sides every other plot, and the ticks on top of the plot are probably unnecessary.  I still think the bins are too wide: you lose some information, especially on the bottom plot towards the right tail.  The vertical axis is percent of total.

Stata's default plot looks very similar to lattice, but again uses a very unattractive color scheme.  It uses density for the vertical axis, which may not mean much to non-statisticians.

The default plot made by ggplot2 is just hands-down good-looking.  There are no unnecessary lines delimiting the bins, and the binwidth is appropriate.  The vertical axis represents counts.  The black bars on the light-gray background have a good data-ink ratio.  And it required the 2nd shortest command, only 3 characters longer than the Stata equivalent.
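Since the four defaults put different things on the vertical axis, here is a small sketch of how counts, percent of total, and density relate for the same set of bins. It uses hist() with plot = FALSE on made-up values (an assumption, not the post's data), and the variable names are only illustrative.

```r
# Stand-in values (an assumption), not taken from the post's CSV.
set.seed(1)
trait <- rnorm(100, mean = 10)

# Compute the bins without drawing anything.
h <- hist(trait, plot = FALSE)

counts  <- h$counts                        # base graphics and ggplot2 default
percent <- 100 * h$counts / sum(h$counts)  # lattice's percent-of-total scale
density <- h$density                       # Stata's default: share per unit of bin width

# Counts add up to n, percents add up to 100,
# and density times bin width adds up to 1.
sum(counts)
sum(percent)
sum(density * diff(h$breaks))
```

The bar shapes are identical in all three cases when the bins are equally wide; only the labels on the vertical axis change.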

I'm ordering the ggplot2 book (Amazon, ~$50), so as I figure out how to do more with ggplot2 I'll post more comparisons like this.  If you use SPSS, SAS, MATLAB, or something else, post the code in a comment here and send me a picture or link to the plot and I'll post it here.