A few days ago, on Diane Ravitch’s blog, a person posted the following comment, which Diane sent to me for a response:
“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”
Here’s what I wrote in response. I’m sharing it here with you all because proposals like this are becoming increasingly common across the country, which makes this a question of increasingly high priority and interest:
“We have three issues here when equating different tests SIMPLY because these tests test the same students on the same things around the same time.
First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.
Second, doing this ASSUMES that all standardized tests are vertically scaled, that is, that scores increase linearly as students progress through different grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry without any theoretical or other foundation to support it.
Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his study, Different tests, different answers: The stability of teacher value-added estimates across outcome measures] found that value-added estimates range WIDELY across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”
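To make the “norming” point a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the scores are made up (they come from no real test and are not from Papay’s study), and the function names `z_scores` and `ranks` are mine. It shows that two tests can be rescaled to the same mean and standard deviation and still order the same students quite differently.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Rescale scores to mean 0 and standard deviation 1 ("norming")."""
    m = mean(scores)
    s = pstdev(scores)
    return [(x - m) / s for x in scores]

def ranks(scores):
    """Return each student's rank (1 = highest score); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

# Hypothetical scores for the SAME six students on two different tests.
# These numbers are invented purely for illustration.
test_a = [410, 455, 470, 520, 560, 610]
test_b = [38, 22, 41, 30, 45, 27]

z_a = z_scores(test_a)
z_b = z_scores(test_b)

# After norming, both tests "look" alike: mean near 0, SD near 1.
print("Test A normed mean/SD:", round(mean(z_a), 6), round(pstdev(z_a), 6))
print("Test B normed mean/SD:", round(mean(z_b), 6), round(pstdev(z_b), 6))

# But the students are not ordered the same way on the two tests.
print("Ranks on Test A:", ranks(test_a))
print("Ranks on Test B:", ranks(test_b))
```

Matching the means and standard deviations changes only how the numbers look. It does nothing to make the two tests agree about which students (or, by extension, which teachers) come out ahead.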
See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.