Summarize Missing Data for all Variables in a Data Frame in R

Something like this probably already exists in an R package somewhere out there, but I needed a function to summarize how much missing data I have in each variable of a data frame in R. Pass a data frame to this function and for each variable it'll give you the number of missing values, the total N, and the proportion missing.

propmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(is.na(x)), n=length(x), propmiss=sum(is.na(x))/length(x)))

Let's try it out.

# simulate some fake data
fakedata <- data.frame(var1=c(1,2,NA,4,NA,6,7,8,9,10), var2=c(11,NA,NA,14,NA,16,17,NA,19,NA))

print(fakedata)
   var1 var2
1     1   11
2     2   NA
3    NA   NA
4     4   14
5    NA   NA
6     6   16
7     7   17
8     8   NA
9     9   19
10   10   NA

# summarize the missing data
propmiss(fakedata)
$var1
  nmiss  n propmiss
1     2 10      0.2

$var2
  nmiss  n propmiss
1     5 10      0.5

Running that function returns a list of data.frame objects, one per variable. You can access the proportion missing for var1 by running propmiss(fakedata)$var1$propmiss.
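
If you want more than one value out of that list, it may be handier to store the result once and index into it. Here's a quick sketch using the same function and fake data from above; the names miss and allprops and the 25% cutoff are just placeholders I made up for illustration.

```r
# Same function and example data as above, repeated so this runs on its own
propmiss <- function(dataframe) lapply(dataframe,function(x) data.frame(nmiss=sum(is.na(x)), n=length(x), propmiss=sum(is.na(x))/length(x)))
fakedata <- data.frame(var1=c(1,2,NA,4,NA,6,7,8,9,10), var2=c(11,NA,NA,14,NA,16,17,NA,19,NA))

# Store the list once instead of re-running the function for every lookup
# ("miss" is an illustrative name)
miss <- propmiss(fakedata)

# Proportion missing for a single variable
miss$var1$propmiss

# Proportion missing for every variable, as a named numeric vector
allprops <- sapply(miss, function(x) x$propmiss)
allprops

# Names of the variables with more than 25% missing (arbitrary cutoff)
names(allprops)[allprops > 0.25]
```

With the fake data above, the last line should pick out var2 only, since var1 is 20% missing and var2 is 50% missing.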

*Edit 2011-02-23*

Commenter A. Friedman asked for a version of this function that gives you the output as a data frame. The function's a bit uglier because something was being coerced as a list, but this does the trick: