Same Teachers, Similar Students, Similar Tests, Different Answers

One of my favorite studies to date about VAMs was conducted by John Papay, an economist once at Harvard and now at Brown University. In the study titled “Different Tests, Different Answers: The Stability of Teacher Value-Added Estimates Across Outcome Measures,” published in 2009 by the 3rd best and most reputable peer-reviewed journal, American Educational Research Journal, Papay presents evidence that different yet similar tests (i.e., similar in content, and similar in when they were administered to similar sets of students) do not provide similar answers about teachers’ value-added performance. This is an issue of validity: if the tests are measuring the same things for the same folks at the same times, similar if not the same results should be realized. But they are not. Rather, Papay found moderate-sized rank correlations, ranging from r=0.15 to r=0.58, among the value-added estimates derived from the different tests.

A recently released study (albeit not yet peer-reviewed) has found similar results, potentially solidifying this finding further into our understandings about VAMs and their issues, particularly in terms of validity (or truth in VAM-based results). This study, “Comparing Estimates of Teacher Value-Added Based on Criterion- and Norm-Referenced Tests,” released by the U.S. Department of Education and conducted by four researchers representing the University of Notre Dame, Basis Policy Research, and the American Institutes for Research, provides evidence, again, that different yet similar tests (i.e., in this case a criterion-referenced state assessment and a widely used norm-referenced test given in the same subject around the same time) yield only moderately correlated estimates of teacher-level value-added.

If we had confidence in the validity of the inferences based on value-added measures, these correlations (or, more simply put, “relationships”) should be much higher than what these researchers found: correlations in the range of 0.44 to 0.65, similar to what Papay found. While the ideal correlation coefficient is, in this case, r=+1.0, that is very rarely achieved. But for the purposes for which teacher-level value-added is currently being used, correlations above r=+.70/r=+.80 would (and should) be most desired, and possibly required, before high-stakes decisions about teachers are made based on these data.

In addition, the researchers in this study found that, on average, only 33.3% of teachers were positioned in the same range of scores (using quintiles, or bands each covering 20% of teachers) by both sets of value-added estimates in the same school year. This too has implications for validity in that, again, teachers’ value-added estimates should fall in the same ranges when similar tests are used, if any valid inferences are to be made using value-added estimates.
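For readers who want to see how these two statistics relate, here is a minimal sketch in Python. It is not the study's method or data: the `simulate` function, its parameters, and the assumption that each test's estimate equals a teacher's "true" effect plus independent noise are all hypothetical and for illustration only. It computes a rank correlation between two sets of estimates and the share of teachers placed in the same quintile by both.

```python
import random


def ranks(values):
    """Return 0-based ranks (assumes no ties, which holds for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n - 1) / 2  # mean of the ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for both rank lists
    return cov / var


def same_quintile_share(x, y):
    """Share of teachers placed in the same 20% band by both sets of estimates."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    same = sum((a * 5 // n) == (b * 5 // n) for a, b in zip(rx, ry))
    return same / n


def simulate(n_teachers=1000, noise_sd=1.0, seed=0):
    """Hypothetical data: one 'true' effect per teacher, plus independent
    noise for each of two tests. Purely illustrative, not the study's data."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    test_a = [t + rng.gauss(0, noise_sd) for t in true_effect]
    test_b = [t + rng.gauss(0, noise_sd) for t in true_effect]
    return test_a, test_b


if __name__ == "__main__":
    a, b = simulate()
    print("rank correlation:", round(spearman(a, b), 2))
    print("same-quintile share:", round(same_quintile_share(a, b), 2))
```

Raising or lowering `noise_sd` in this toy setup shows how the same-quintile share moves with the correlation. If quintile placement were purely random, about 20% of teachers would land in the same band on both tests, which is a useful baseline when reading a figure like 33.3%.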

Again, please note that this study has not yet been peer-reviewed. While some results naturally seem to make more sense, like the ones reviewed above, peer review matters equally for results with which we might tend to agree and results we might tend to reject. Take anything that is not peer-reviewed, for that matter, as just that: a study with methods and findings not yet critically vetted by the research community.

2 thoughts on “Same Teachers, Similar Students, Similar Tests, Different Answers”

  1. Hi Everyone,
    Nice blog! I was asked what the correlation between value-added estimates (like in this study) or the correlation between actual and estimated value-added (what we are really interested in, and this can be higher or lower than the correlation between estimates depending on how the estimates are made) should be for the different purposes of value-added estimates (e.g., research versus employment decisions). Can you say how you came up with the .7/.8? In an ideal world, do you think policy makers should set some minimum for this (probably between actual and estimated) before using these for high-stakes decisions?

    ps. Given this web site is about problems in measuring and ranking a complex construct, the value a teacher adds, it is worth adding that saying a journal is 3rd “best” in some category is also tricky. In the UK where university finances from government are based (in a small part) on where the faculty publish, this is a big issue.

  2. Agreed on journal ranking systems. We can leave it at “one of the top journals” published in education in the U.S. As for the .7/.8 correlation, this threshold, while still arbitrary, would be better, albeit still imperfect, IF high stakes are to be attached to output. In a policy world, this matters even more IF high stakes are to be attached to such output. The higher the correlation, the better, hands down. This is also necessary if the correlations between VAM estimates and other indicators (like other “similar” tests in this post) continue to illustrate moderate-to-medium sized correlations. The “signals” are there, but they are certainly not strong enough to warrant current consequential use.
