## Scatterplot Matrices – Data Analysis with R

As Dean described, we should let the data speak to determine variables of interest. There’s a tool that we can use to create a number of scatter plots automatically. It’s called a scatter plot matrix. In a scatter plot matrix. There’s a grid of scatter plots between every pair of variables. As we’ve seen, scatter plots are great, but not necessarily suited for all types of variables. For example, categorical ones. So there are other types of visualizations that can be created instead of scatter plots. Like box plots or histograms when the variables are categorical. Let’s produce the scatter plot matrix for our pseudo Facebook data set. We’re going to use the GGally package to do so. So make sure you’ve installed it and then go ahead and load it using the library command. Now, I’m also going to set the theme here too. Now, there’s two other things that we want to do. First we want to set the seed so we get reproducible results. Now, you might be wondering why we set the seed in the first place. And it’s because we’re going to sample from our data set. Our data set contains all these variables and I actually don’t want all the variables. I don’t want user ID, year joined, or year joined.bucket. So what I can do is subset my data frame and then sample from that sub set. If I check out the variables in my subset data frame these are the ones of interest. Now I didn’t use year joined or year joined.bucket, because this one’s a categorical variable and really these were derived from tenure. Now I’m ready to use the GG pairs function inside of GGally to create this scatter plot matrix. Now, I’ve already run this piece of code and I do want to warn you that it takes a long time for this to generate. It might even take over an hour. Feel free to run the command and if its taking a long time for your plot to generate just come back to your computer at another time. We’ve also included a pdf of the scatter plot in the instructor notes so you can check that out as well. Here’s our scatter plot matrix, and notice in the upper part of the matrix we can see the correlation coefficients for the pairs of variables. For age and date of birth year, the correlation is actually negative one. And we can see that on the scatter plot. Sometimes we may want to produce these types of matrices so that way we can produce one number summaries of the different relationships of our variables. This is just like the correlation work that we did in lesson four. So, I’ve described the plots above the diagonal for the scatter plot matrix, but what do you notice in the lower left of the scatter plot matrix? Write a few sentences about what you see in this next quiz. And pay careful attention to the variable of gender. What types of plots do you think these are?