Dependent Sample Assessment Plots Using granovaGG and R

9/4/2011 Update: granovaGG is now available directly from CRAN.

Just over one year ago, I wrote about creating Dependent Sample Assessment Plots (DSAP) Using granova and R. Since then, Brian Danielak has been developing a new, ggplot2-based version of granova named granovaGG, which is almost ready for release on CRAN. This article updates my earlier granova-based version, but leaves much of the article text unchanged.

DSAPs are a way of visualizing data from analyses of two dependent samples. One of at least four ways[1] to think about such data is as pre-intervention and post-intervention scores, collected when studying the effects of an intervention.

[1] See Pruzek and Helmreich, "Enhancing Dependent Sample Analyses Using Graphics," Journal of Statistics Education, Volume 17, Number 1 (2009), updated in 2011.

Suppose you’re an educator and you administer an assessment to students at the beginning of a unit, asking about their level of confidence in or understanding of a topic. You then teach a lesson that spans some period of time. At the end, you collect responses to the same questions again. You now have a dependent sample: two responses that relate to the same individual, for some number of individuals.

Student     Pre  Post
Adam         22    45
Beth         33    30
Cindy        35    53
David        32    55
Elisabeth    27    40

For such a small sample, you can quickly eye-ball the raw data and see that there seems to have been an upward shift in scores, but is it significant (in the statistical sense)? Could you so easily eye-ball the results for a class of 20, 30, or 100 students? Probably not.
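
One way to put a number on the significance question for the five students above is a paired t-test in base R. This is only a minimal sketch, and the choice of a paired t-test is my assumption about what "significant" should mean here; the variable names are illustrative.

```r
# Scores for the five students, in matched order (Adam, Beth, Cindy, David, Elisabeth)
pre  <- c(22, 33, 35, 32, 27)
post <- c(45, 30, 53, 55, 40)

# Paired t-test on the post-minus-pre differences (base R, stats package)
result <- t.test(post, pre, paired = TRUE)

# Mean of the paired differences and the two-sided p-value
mean_difference <- unname(result$estimate)
p_value <- result$p.value

print(mean_difference)
print(p_value)
```

A p-value below 0.05 would correspond to the conventional 5% significance level.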

Data visualization is an attempt to reveal patterns in data by converting it from raw numbers to graphic images where we can more easily discern clusters (small groups of students who exhibit similar score patterns), outliers (unusually high or low scores), and effects of treatments (did the instruction result in learning?).

An assessment plot is simply a specialized scatter plot showing the pairs of values as (x, y) coordinates. When we enhance the scatter plot a little, we can gain quick insight into patterns in our data. Consider the following Dependent Sample Assessment Plot:

[Figure: Dependent Sample Assessment Plot, ggplot2-based]

The plot has several features worth mentioning:

  • The x-axis and y-axis use the same range (they’re on the same scale), so the plot is square.
  • I’ve plotted post-assessment scores along the x-axis so that the mean difference will be positive for increases in post-scores and negative for decreases in post-scores.
  • The solid black line running from the lower-left to the upper-right represents points whose x and y values are the same: (10, 10), (20, 20), and so on. This is called the identity line. If there were no change between the pre- and post-assessment, we would expect the points to fall along this 45-degree line.
  • Any point below this line represents a positive change (the score increased from the pre- to the post-assessment).
  • Any point above this line represents a negative change (the score decreased from the pre- to the post-assessment).
  • The horizontal, thinly-dashed line represents the pre-assessment mean; here, 29.8.
  • The vertical, thinly-dashed line represents the post-assessment mean; here, 44.6.
  • The thick, dashed line running diagonally represents the mean of the differences between post- and pre-assessment scores (the difference mean); here, 14.8. That is, post-assessment scores were 14.8 points higher than pre-assessment scores, on average.
  • The green bar indicates the 95% confidence interval: the range of values for the population mean difference that are reasonable, in light of these data.
  • If the green bar overlaps the identity line, then any observed difference is not statistically significant.
  • Conversely, if the green bar does not overlap the identity line, then any observed difference is statistically significant. (It’s up to the analyst to decide whether it’s of practical significance!)
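
The quantities listed above can be reproduced by hand. The following is a minimal sketch in base R, assuming the green bar is the usual t-based 95% confidence interval for the mean of the paired differences (granovaGG's internals may differ in detail). All variable names are illustrative.

```r
# Scores for the five students, in matched order
pre  <- c(22, 33, 35, 32, 27)
post <- c(45, 30, 53, 55, 40)

# Positions of the thinly-dashed horizontal and vertical lines
pre_mean  <- mean(pre)
post_mean <- mean(post)

# Per-student change and its mean (the thick dashed diagonal)
differences <- post - pre
n <- length(differences)
difference_mean <- mean(differences)

# Assumed: t-based 95% confidence interval for the mean difference
standard_error <- sd(differences) / sqrt(n)
margin <- qt(0.975, df = n - 1) * standard_error
confidence_interval <- c(lower = difference_mean - margin,
                         upper = difference_mean + margin)

# An interval that excludes zero corresponds to a green bar
# that does not overlap the identity line
excludes_zero <- unname(confidence_interval["lower"] > 0 ||
                        confidence_interval["upper"] < 0)

print(confidence_interval)
print(excludes_zero)
```

A mean difference of zero corresponds to the identity line, which is why checking the interval against zero mirrors checking the green bar against that line.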

This simple visualization offers us much information quickly and scales well to moderately sized classes. (See the 40-student example, below.)

Free software is available to help us generate these graphs with just a little effort on our part. R, a statistical programming environment, is available for download for Windows, Macintosh, or Linux operating systems and offers a wealth of data management, analysis, and visualization tools. Here I’ll focus on only one such tool: granovaGG.

The name granovaGG abbreviates "Graphical Analysis of Variance", with GG for ggplot2. It is a free R package written by Brian Danielak and myself, with inspiration and guidance from Bob Pruzek and Jim Helmreich. In fact, the above plot was generated using granovaGG.

In a week or so, granovaGG will be available on CRAN. In the meantime, to install R and the latest development release of granovaGG and produce this plot, you would:

  1. Download R
  2. Install R per your operating system’s usual process
  3. Launch R
  4. Type the following commands within R
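install.packages(pkgs="devtools", dependencies=TRUE)
library(devtools)  # makes install_github() available
install_github(repo="granovaGG", username="briandk", branch="dev")
library(granovaGG)
x <- cbind(post=c(45, 30, 53, 55, 40), pre=c(22, 33, 35, 32, 27))
granovagg.ds(x)  # draw the dependent sample assessment plot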

In the future, to run your own analysis using your own pre- and post-assessment data, you would simply:

  1. Launch R
  2. Type the following commands within R
x <- cbind(post=c(45, 30, 53, 55, 40), pre=c(22, 33, 35, 32, 27))
granovaGG::granovagg.ds(x)

replacing the numbers on the line beginning with

x <-

with your own pre- and post-assessment scores. It’s important that the two lists of numbers are in matched order. That is, 22 and 45 are the scores for the first student, 33 and 30 are the scores for the second student, and so on. Also, notice that I’ve entered the post-assessment scores first so that they will appear on the x-axis.
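
One way to make matched order harder to get wrong is to keep each student's two scores in the same row of a data frame and build x from its columns. The sketch below uses the five students from the table earlier; the data frame name `scores` and its column names are my own illustrative choices, not anything granovaGG requires.

```r
# One row per student keeps each pre/post pair together
scores <- data.frame(
  student = c("Adam", "Beth", "Cindy", "David", "Elisabeth"),
  pre     = c(22, 33, 35, 32, 27),
  post    = c(45, 30, 53, 55, 40)
)

# Drop any student missing either score, so the remaining pairs stay matched
scores <- scores[complete.cases(scores[, c("pre", "post")]), ]

# Post-scores go first so that they land on the x-axis
x <- cbind(post = scores$post, pre = scores$pre)

print(x)
```

The resulting x has the same two-column layout as the hand-typed version above.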

In closing, consider the data below: 40 pairs of scores, shown both as raw numbers and as a granovaGG plot. Can you eye-ball any trends from the raw data? What about from the plot?

[Figure: Dependent Sample Assessment Plot, ggplot2-based #2]

Pre Post
21.72 50.68
33.26 36.39
20.41 51.55
26.06 36.26
33.62 32.18
27.16 20.13
30.38 27.90
28.84 59.17
33.00 34.18
36.36 41.76
30.36 48.92
33.16 35.80
31.42 43.73
39.90 40.53
29.90 52.50
35.42 34.67
29.14 38.33
22.31 29.02
27.96 41.19
24.42 57.77
23.65 28.68
26.94 26.31
35.09 23.09
38.29 53.75
39.09 50.23
30.21 23.43
24.78 35.14
34.26 54.71
31.64 20.91
31.41 27.45
23.84 48.05
36.11 25.58
37.45 59.60
29.38 56.05
39.72 51.28
29.82 36.91
31.82 21.71
24.82 23.96
37.80 49.52
38.45 56.68