Following up on a recent post asking whether “Combining Different Tests to Measure Value-Added [Is] Valid?,” an anonymous VAMboozled! follower emailed me an article on the topic that I had never read. The article and its contents, as they relate to this question, are important enough to share with you all here.
The 2005 article titled “Adjusting for Differences in Tests,” authored by Robert Linn – a highly respected scholar in the academy and Distinguished Professor Emeritus in Research & Evaluation Methodology at the University of Colorado at Boulder – and published by the esteemed National Academy of Sciences, is all about “[t]he desire to treat scores obtained from different tests as if they were interchangeable, or at least comparable.” This is very much in line with current trends to combine different tests to measure value-added, given statisticians’ and (particularly in the universe of VAMs) econometricians’ “natural predilections” to find (and force) solutions to what they see as very rational and practical problems. Hence, this paper is still very relevant today.
To equate tests, as per Dorans and Holland (2000) as cited by Linn, five requirements must be met. These requirements “enjoy a broad professional consensus,” and they apply just as much when tests are equated within VAMs:
- The Equal Construct Requirement: Tests that measure different constructs (e.g., mathematics versus reading/language arts) should not be equated. States that are using different subject area tests as pretests for other subject area tests to measure growth over time, for example, are in serious violation of this requirement.
- The Equal Reliability Requirement: Tests that measure the same construct but that differ in reliability (i.e., consistency at one point in time or over time, as measured in a variety of ways) should not be equated. Most large-scale tests yield different levels of reliability, often depending on whether they yield norm- or criterion-referenced data given the norms or state standards to which they are aligned. Likewise, the types of consequences attached to test results throw reliability off as well.
- The Symmetry Requirement: The equating function for equating the scores of Y to those of X should be the inverse of the equating function for equating the scores of X to those of Y. I am unaware of studies that have addressed this requirement in the VAM context; it is typically overlooked and simply assumed.
- The Equity Requirement: It ought to be a matter of indifference for an examinee to be tested by either one of the two tests that have been equated. It is not, however, a matter of indifference in the universe of VAMs, as best demonstrated by the research conducted by Papay (2010) in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study.
- The Population Invariance Requirement: The choice of (sub)population used to compute the equating function between the scores on tests X and Y should not matter. In other words, the equating function used to link the scores of X and Y should be population invariant. As also noted in Papay (2010), however, even when the same populations are used, the scores on tests X and Y DO matter (see the illustrative sketch following this list).
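To make the Symmetry and Population Invariance requirements a bit more concrete, here is a minimal sketch of simple linear (mean-sigma) linking, written in Python. This is my own illustration, not anything from Linn’s paper: the simulated scores, the subgroup labels, and the `linear_equating` helper are all hypothetical.

```python
# Illustrative sketch only (not from Linn, 2005): simple linear (mean-sigma)
# linking of test Y onto the scale of test X, with rough checks of the
# Symmetry and Population Invariance requirements. All data are simulated.
import numpy as np

def linear_equating(y_scores, x_scores):
    """Return a function that maps Y-scale scores onto the X scale by
    matching means and standard deviations."""
    mu_x, sd_x = np.mean(x_scores), np.std(x_scores)
    mu_y, sd_y = np.mean(y_scores), np.std(y_scores)
    return lambda y: mu_x + (sd_x / sd_y) * (y - mu_y)

rng = np.random.default_rng(0)
x = rng.normal(500, 100, size=5000)   # simulated scores on test X
y = rng.normal(50, 10, size=5000)     # simulated scores on test Y (different scale)

y_to_x = linear_equating(y, x)
x_to_y = linear_equating(x, y)

# Symmetry Requirement: converting Y -> X and then X -> Y should return the
# original score, i.e., the two linking functions are inverses of one another.
print(np.isclose(x_to_y(y_to_x(60.0)), 60.0))   # True for linear linking

# Population Invariance Requirement: the linking function should not depend on
# which (sub)population is used to compute it. Recompute it within two
# hypothetical subgroups and compare the converted scores.
group = rng.integers(0, 2, size=5000)           # hypothetical subgroup labels
link_g0 = linear_equating(y[group == 0], x[group == 0])
link_g1 = linear_equating(y[group == 1], x[group == 1])
print(link_g0(60.0), link_g1(60.0))             # a large gap would signal a violation
```

Real equating studies, of course, use far more careful designs (common items, common examinees, and so on); the point here is only that these requirements are empirically checkable conditions, not formalities.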
Accordingly, here are Linn’s two key points of caution: (1) Factors that complicate such conversions, and make them less reliable and less valid (or invalid altogether), include “the content, format, and margins of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests.” Likewise, (2) “it cannot be assumed that converted scores will behave just like those of the test whose scale is adopted.” That is, converting or norming scores to permit longitudinal analyses (i.e., in the case of VAMs) can very well violate what the test scores mean, even if common sense might tell us they essentially mean the same thing.
Hence, and also as per the National Research Council (NRC), it is not feasible to compare “the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale.” Tests vary so much “in content, item format, conditions of administration, and consequences attached to [test] results that the linked scores could not be considered sufficiently comparable to justify the development of [a] single equivalency scale.”
“Although NAEP [the National Assessment of Educational Progress – the nation’s “best” test] would seem to hold the greatest promise for linking state assessments… NAEP has no consequences either for individual students or for the schools they attend. State assessments, on the other hand, clearly have important consequences for schools and in a number of states they also have significant consequences for individual students” and now teachers. This is the key factor that throws off the potential to equate tests that are not designed to be interchangeable. This is certainly not to say, however, that we should simply increase the consequences across tests to make them more interchangeable. It’s just not that simple – see the five requirements above.