A few days ago, on Diane Ravitch’s blog, a person posted the following comment, which Diane sent to me for a response:
“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”
Here’s what I wrote in response, which I’m sharing here with you all, as this too is becoming increasingly common across the country; hence, it is becoming a question of high priority and interest:
“We have three issues here with equating different tests SIMPLY because these tests test the same students on the same things at around the same time.
First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.
Second, doing this ASSUMES that all standardized tests are vertically scaled, whereby scores increase linearly as students progress through the grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry, without any real theoretical or other foundation in support (see the sketch that follows this response).
Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates range WIDELY across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them and expect similar results regardless. They will not be similar…because the test matters.”
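To make the “norming” point above a bit more concrete, here is a minimal sketch in Python using entirely made-up, simulated data (the test names, scales, and numbers are hypothetical, not anyone’s actual method). It simply shows that re-scaling two different tests to the same mean and standard deviation makes their distributions look alike, yet the same students can still land in quite different places on each test.

```python
# Minimal sketch (hypothetical, simulated data): re-norming two different tests
# to the same mean and standard deviation makes them LOOK similar, but it does
# not make them interchangeable measures of what the same students know.
import numpy as np

rng = np.random.default_rng(0)
n_students = 500

# Assume each test only partly overlaps with what students were actually taught:
# a shared achievement component plus test-specific content and error.
achievement = rng.normal(0, 1, n_students)
test_a = 200 + 20 * (0.7 * achievement + 0.7 * rng.normal(0, 1, n_students))   # hypothetical "Test A" scale
test_b = 500 + 100 * (0.7 * achievement + 0.7 * rng.normal(0, 1, n_students))  # hypothetical "Test B" scale

def z_score(x):
    """Re-norm a score distribution to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

z_a, z_b = z_score(test_a), z_score(test_b)

# After norming, the two score distributions look "similar"...
print(round(z_a.mean(), 2), round(z_a.std(), 2))  # ~0.0, ~1.0
print(round(z_b.mean(), 2), round(z_b.std(), 2))  # ~0.0, ~1.0

# ...yet the same students land in noticeably different places on the two tests,
# which is the "different tests, different answers" problem Papay documents.
def ranks(x):
    return np.argsort(np.argsort(x))

print(round(np.corrcoef(ranks(z_a), ranks(z_b))[0, 1], 2))  # well below 1.0
```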
See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.
From an anonymous follower: “The equating of scores across standardized tests is fraught with problems and probably invalid. My guess is the VAMsters will say, they only use tests to establish a relative position in the distribution (stack ranking) and with a large enough population taking each test, they can assume that positions would not change much, so they can be used for their (very noisy…my editorializing) predictions without doing much to change the (in)validity of their measures.”
The use of three different achievement tests for value-added measurement is fraught with peril. First and foremost, it is not equating; it is linking. Placing test scores from three independent tests on the same test-score scale is a simple act, but the resulting data may not be validly interpretable. The major issue is content: How well does each test match up with what students are being taught? We refer to this as “instructional sensitivity.” State-sponsored tests that match up with a state’s content standards have the most instructional sensitivity, as teachers are supposed to be teaching content highly related to the test.
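To show just how mechanically simple that “simple act” of linking is, here is a minimal sketch of linear (mean/sigma) linking with hypothetical numbers; nothing about it speaks to the content, or instructional sensitivity, question just raised.

```python
# A minimal sketch of linear (mean/sigma) linking with hypothetical numbers:
# Test B scores are placed on Test A's scale by matching means and standard
# deviations. The arithmetic is trivial; the validity questions are not.
import numpy as np

def mean_sigma_link(scores_b, scores_a):
    """Rescale Test B scores so their mean and SD match Test A's scale."""
    slope = scores_a.std() / scores_b.std()
    intercept = scores_a.mean() - slope * scores_b.mean()
    return slope * scores_b + intercept

rng = np.random.default_rng(1)
test_a = rng.normal(250, 25, size=1000)   # hypothetical state test scale
test_b = rng.normal(600, 80, size=1000)   # hypothetical off-the-shelf test scale

linked_b = mean_sigma_link(test_b, test_a)
print(round(linked_b.mean(), 1), round(linked_b.std(), 1))  # now roughly 250, 25

# Nothing in this transformation checks whether Test B covers the state's
# content standards, i.e., whether it has any instructional sensitivity.
```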
That said, value-added measurement is a statistical abomination. It is not defensible. Our national test standards call for multiple sources of information about student achievement if we want to properly assess student learning.
Causal attribution is very difficult without experiments and random assignment.
Current methods of value-added measurement fall far short of our national standards.