
R Snippet for Sampling from a Dataframe

It took me a while to figure this out, so I thought I'd share.

I have a dataframe with millions of observations in it, and I want to estimate a 2D density from it, which is a memory-intensive process. Running kde2d on the full dataframe throws an error: R tries to allocate a vector that is gigabytes in size. A reasonable alternative is to run the function on a smaller random subset of the data. R has a nifty sample function, but it is designed to randomly sample from a vector, not from the rows of a dataframe. The sample function CAN work for this though, if you sample the row indices and use them to subset the dataframe, like so:
sub <- Dataset[sample(nrow(Dataset), size = 100000, replace = FALSE), ]


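Now sub contains 100000 randomly chosen rows of my dataframe, with all of the original columns. This example samples without replacement, so size cannot be larger than the number of rows. If you set replace=TRUE, you get a sample with replacement instead, and the same row can show up more than once.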
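
Here is a rough end-to-end sketch of the whole idea, in case it helps. The data is a made-up stand-in (two columns of random normals that I named x and y), the grid size n = 50 is arbitrary, and I am assuming kde2d is the one from the MASS package. Treat it as an illustration rather than the exact code I ran.

```r
# Toy stand-in for the real data; the column names x and y are made up.
set.seed(42)  # makes the random sample reproducible
Dataset <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Cap the sample size at the number of rows so that sampling without
# replacement cannot fail on a small dataframe.
n_sub <- min(100000, nrow(Dataset))
sub <- Dataset[sample(nrow(Dataset), size = n_sub, replace = FALSE), ]

# Estimate the 2D density on the subset only (assumes kde2d from MASS).
dens <- MASS::kde2d(sub$x, sub$y, n = 50)
dim(dens$z)  # dimensions of the grid of density estimates
```

Calling set.seed first means you get the same subset every time you rerun the script, which is handy if you want to compare density plots between runs.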