Pages

Split, apply, and combine in R using PLYR



While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2.  If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.

Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus.  You can generate some fake data using this command:

mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))

The data should look like this:

            X1          X2 SNP1 SNP2
1  -0.73671602  2.76733658   AA   BB
2  -1.83947190  5.70194282   AA   BB
3  -0.46963370  4.89393264   AA   BB
4   0.85105460  6.37610477   AA   BB
5  -1.69043368  8.11399122   AA   BB
6  -1.47995127  4.32866212   AA   BB
7   0.21026063  4.32073489   AA   BB
8   0.19038088  4.26631298   AA   BB
9  -0.29759084  4.07349852   AA   BB
10 -0.46843672  7.01048905   AA   BB
11  0.93804369  3.78855823   Aa   Bb
12 -1.46223800  5.08756968   Aa   Bb
13 -0.01338604 -0.01494702   Aa   Bb
14  0.13704332  4.53625220   Aa   Bb
15  0.59214151  3.75293124   Aa   Bb
16 -0.27658821  5.78701933   Aa   Bb
17  0.05527139  5.99460464   Aa   Bb
18  1.00077756  5.26838121   Aa   Bb
19  0.62871877  6.98314827   Aa   Bb
20  1.18925876  8.61457227   Aa   Bb
21 -0.40389888  3.36446128   aa   bb
22 -1.43420595  6.50801614   aa   bb
23  1.79086285  5.39616828   aa   bb
24  0.15886243  3.98010396   aa   bb
25  0.57746024  4.93605364   aa   bb
26 -1.11158987  6.40760662   aa   bb
27  1.93484117  3.33698750   aa   bb
28  0.10803302  4.56442931   aa   bb
29  0.56648591  4.48502267   aa   bb
30  0.03616814  5.77160936   aa   bb


Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2.  First, install and load the plyr package:

install.packages("plyr")
library(plyr)

Now run the following command:

ddply(mydata, c("SNP1","SNP2"), function(df) data.frame(meanX1=mean(df$X1), sdX1=sd(df$X1), meanX2=mean(df$X2), sdX2=sd(df$X2)))

And you get exactly what you asked for:

  SNP1 SNP2     meanX1      sdX1   meanX2     sdX2
1   aa   bb  0.2223019 1.0855063 4.875046 1.144623
2   Aa   Bb  0.2789043 0.7817727 4.979809 2.286914
3   AA   BB -0.5730538 0.8834038 5.185301 1.601626

That command above may appear complicated at first, but it isn't really.  Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing.  Also, check out the slides he presented at a recent R meetup discussion:



There's more to plyr than ddply, so check it out for yourself!