Pages

Merging data from different files using R

A reader asked yesterday how you would merge data from two different files. For example, let's say you have a ped file with genotypes for individuals, and another file that had some other data for some of the individuals in the pedfile, like clinical or environmental covariates. How would you go about automatically putting the clinical data from file 2 into the appropriate rows in file 1? Without using a database, the easiest way is probably using the "merge" function in R that will do the trick with a one-line command. Here's a short tutorial to get you started.

First, start up R on the cheeses simply by typing in the uppercase letter R and hit enter. Alternatively you could use R if it's installed on your computer. Now type in this command to view a small dataset I made up. It's not a real pedfile, but it will do the trick. Make sure to put it all on one line:

read.table("http://people.vanderbilt.edu/~stephen.turner/ggd/ped.txt")

You should see the dataset displayed:

indiv_id SNP1 SNP2
1 1 1 1
2 2 1 1
3 3 0 0
4 4 1 0
5 5 1 1
6 6 1 0
7 7 1 1
8 8 0 1
9 9 1 1
10 10 1 1

We have 10 individuals numbered 1-10 in a column called "indiv_id" that have genotype information at 2 loci. Now let's view another dataset that has fake clinical covariates for a few of those individuals:

read.table("http://people.vanderbilt.edu/~stephen.turner/ggd/meta.txt")

Again, you will see the dataset displayed.

indiv_id cov1 cov2
1 1 1.14 74.6
2 3 4.50 79.4
3 4 0.80 48.2
4 6 1.39 68.1
5 7 3.89 84.8
6 8 0.42 86.1

In this dataset, we have clinical information on the individuals in the ped file above, but only on some of the individuals. Now, let's merge what information we have here with the ped file above.

But first, since all we did above was view the datasets, let's run those commands again, but this time storing the tables in a "data frame" (R parlance for dataset). Type in these two commands. The " <- " is an assignment. It means create a data frame called ped from the following table from the internet, and the same with a data frame called meta.

ped <- read.table("http://people.vanderbilt.edu/~stephen.turner/ped.txt")


meta <- read.table("http://people.vanderbilt.edu/~stephen.turner/meta.txt")

Now if you just type in the name of the data frame, e.g. ped, and hit enter, you will see the data frame displayed. Now it's time for the real magic. Type the following command, and I'll explain what it does below:

merge(ped,meta,by="indiv_id",all=TRUE)

The output should look like this:

indiv_id SNP1 SNP2 cov1 cov2
1 1 1 1 1.14 74.6
2 2 1 1 NA NA
3 3 0 0 4.50 79.4
4 4 1 0 0.80 48.2
5 5 1 1 NA NA
6 6 1 0 1.39 68.1
7 7 1 1 3.89 84.8
8 8 0 1 0.42 86.1
9 9 1 1 NA NA
10 10 1 1 NA NA

This command uses R's "merge" function to combine the data frames you created above, ped and meta, into a single data frame. It does this by joining the two data frames on a common variable called "indiv_id". The all=TRUE option tells R to join the two data frames, even if one has fewer entries than another. You'll see the NA's listed for a few individuals. NA is R's way of coding missing data. If you didn't include the all=TRUE, observations with any missing data for any variable would be deleted!

Finally, if you want to save this data to a file, re-run the command above, assigning it to a variable, then use the write.table command to write it to your home directory.

superpedfile <- merge(ped,meta,by="indiv_id",all=TRUE)

write.table(superpedfile,file="~/superped.txt")

To exit R, type quit(). To do this on your own data, see the help on read.table by typing ?read.table at the R command line. If anybody knows of a better way to do this post it in the comments!