> # install the package
> install.packages("sqldf")
>
> #load it
> library(sqldf)
>
> # set the random seed
> set.seed(42)
>
> #generate some data
> df1 = data.frame(id=1:10,class=rep(c("case","ctrl"),5))
> df2 = data.frame(id=1:10,cov=round(runif(10)*10,1))
>
> #look at the data
> df1
id class
1 1 case
2 2 ctrl
3 3 case
4 4 ctrl
5 5 case
6 6 ctrl
7 7 case
8 8 ctrl
9 9 case
10 10 ctrl
> df2
id cov
1 1 9.1
2 2 9.4
3 3 2.9
4 4 8.3
5 5 6.4
6 6 5.2
7 7 7.4
8 8 1.3
9 9 6.6
10 10 7.1
>
> # do an inner join
> sqldf("select * from df1 join df2 on df1.id=df2.id")
id class id cov
1 1 case 1 9.1
2 2 ctrl 2 9.4
3 3 case 3 2.9
4 4 ctrl 4 8.3
5 5 case 5 6.4
6 6 ctrl 6 5.2
7 7 case 7 7.4
8 8 ctrl 8 1.3
9 9 case 9 6.6
10 10 ctrl 10 7.1
>
> # where clauses
> sqldf("select * from df1 join df2 on df1.id=df2.id where class='case'")
id class id cov
1 1 case 1 9.1
2 3 case 3 2.9
3 5 case 5 6.4
4 7 case 7 7.4
5 9 case 9 6.6
>
> # lots of sql fun
> sqldf("select df1.id, df2.cov as covariate from df1 join df2 on df1.id=df2.id where class='case' and cov>3 order by cov")
id covariate
1 5 6.4
2 9 6.6
3 7 7.4
4 1 9.1
Use SQL queries to manipulate data frames in R with sqldf package
I've covered a few topics in the past including the plyr package, which is kind of like "GROUP BY" for R, and the merge function for merging datasets. I only recently found the sqldf package for R, and it's already one of the most useful packages I've ever installed. The main function in the package is sqldf(), which takes a quoted string as an argument. You can treat data frames as tables as if they were in a relational database. You can use some of the finer aspects of SQL like the INNER JOIN or the subquery, which are extremely difficult operations to mimic using standard R programming. While this isn't an SQL tutorial, try out some of these commands to see what sqldf can do for you. Read more about the sqldf package here.