Here's the R code to generate the data and all the figures in this post.
# Generate some data
library(MASS)
set.seed(101)
n <- 50000
X <- mvrnorm(n, mu=c(.5,2.5), Sigma=matrix(c(1,.6,.6,1), ncol=2))
# A color palette from blue to yellow to red
library(RColorBrewer)
k <- 11
my.cols <- rev(brewer.pal(k, "RdYlBu"))
## compute 2D kernel density, see MASS book, pp. 130-131
z <- kde2d(X[,1], X[,2], n=50)
# Make the base plot
plot(X, xlab="X label", ylab="Y label", pch=19, cex=.4)
# Draw the colored contour lines
contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE, lwd=2)
# Add lines for the mean of X and Y
abline(h=mean(X[,2]), v=mean(X[,1]), col="gray", lwd=1.5)
# Add the correlation coefficient to the top left corner
legend("topleft", paste("R=", round(cor(X)[1,2],3)), bty="n")
## Other methods to fix overplotting
# Make points smaller - use a single pixel as the plotting character
plot(X, pch=".")
# Hexbinning
library(hexbin)
plot(hexbin(X[,1], X[,2]))
# Make points semi-transparent
library(ggplot2)
qplot(X[,1], X[,2], alpha=I(.1))
# The smoothScatter function (graphics package)
smoothScatter(X)
Here's the problem: there are 50,000 points in this plot, causing extreme overplotting. (This is a simple multivariate normal distribution, but if the distribution were more complex, overplotting might obscure a relationship in the data that you didn't know about.)
I liked the solution they used in the paper referenced above. Contour lines were placed throughout the data, indicating the density of the data in that region. Further, the contour lines were "heat" colored from blue to red, indicating increasing data density. Optionally, you can add vertical and horizontal lines that intersect the means, and a legend that includes the correlation coefficient between the two variables.
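One detail worth knowing: when contour() is given only nlevels, it picks its own "pretty" levels, so the number of lines drawn may not be exactly the number of colors supplied. Here is a minimal sketch that passes the levels explicitly so each line gets its own color. The choice of equally spaced levels and the names levs and heat.cols are my assumptions for illustration, and colorRampPalette() is swapped in for RColorBrewer to keep the example dependency-free.

```r
library(MASS)

set.seed(101)
X <- mvrnorm(50000, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))

# kde2d() returns a list: grid coordinates x and y, and a matrix z of density estimates
z <- kde2d(X[, 1], X[, 2], n = 50)

# k equally spaced density levels strictly inside the estimated range (an assumption)
k <- 11
levs <- seq(min(z$z), max(z$z), length.out = k + 2)[-c(1, k + 2)]

# One color per level, running from blue (sparse) through yellow to red (dense)
heat.cols <- colorRampPalette(c("blue", "yellow", "red"))(k)

plot(X, xlab = "X label", ylab = "Y label", pch = 19, cex = .4)
contour(z, levels = levs, col = heat.cols, drawlabels = FALSE, add = TRUE, lwd = 2)
```

With explicit levels, the lowest contour is always blue and the highest always red, whatever the range of the density estimate.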
There are many other ways to solve an overplotting problem: reducing the size of the points, making points transparent, or using hex-binning.
Using a single pixel for each data point:
Using hexagonal binning to display density (hexbin package):
Finally, using semi-transparency (10% opacity; easiest using the ggplot2 package):
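If you'd rather stay in base graphics, semi-transparency can also be had by giving the point color an alpha channel. This is a rough sketch rather than the approach used for the figure; the name pt.col is just for illustration, and it assumes your graphics device supports transparency (the PDF and most screen devices do).

```r
library(MASS)

set.seed(101)
X <- mvrnorm(50000, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))

# Black at roughly 10% opacity; adjustcolor() is in the grDevices package
pt.col <- adjustcolor("black", alpha.f = 0.1)

# Overlapping points accumulate, so dense regions appear darker
plot(X, pch = 19, cex = .4, col = pt.col, xlab = "X label", ylab = "Y label")
```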
Edit July 7, 2012 - From Pete's comment below, the smoothScatter() function in the built-in graphics package produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate. You can change the colors using the colramp option, and change how many outliers are plotted with the nrpoints option. Here, 100 outliers are plotted as single black pixels - outliers here being points in the areas of lowest regional density.
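As a quick sketch of those two options, here is smoothScatter() with a custom color ramp and more outlier points. The white-to-red palette, the name heat.ramp and the value 500 are arbitrary choices of mine, not anything from Pete's comment.

```r
library(MASS)

set.seed(101)
X <- mvrnorm(50000, mu = c(.5, 2.5), Sigma = matrix(c(1, .6, .6, 1), ncol = 2))

# colramp takes a function that maps an integer n to n colors
heat.ramp <- colorRampPalette(c("white", "yellow", "red"))

# nrpoints sets how many points from the lowest-density regions are drawn individually
smoothScatter(X, colramp = heat.ramp, nrpoints = 500,
              xlab = "X label", ylab = "Y label")
```

Setting nrpoints = 0 suppresses the individual points altogether, leaving only the smoothed density.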
How do you deal with overplotting when you have many points?