As per Standard 1.2 of the newly released 2014 Standards for Educational and Psychological Testing, authored by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME):
A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation. (p. 23)
After a series of recent conversations with a colleague, Harris Zwerling, a researcher for the Pennsylvania State Education Association, I asked him to submit a guest blog about what he has observed, as per this standard, while working with Pennsylvania Value-Added Assessment System (PVAAS, the state's version of the Education Value-Added Assessment System [EVAAS] administered by SAS) data in his state. Here is what he wrote:
Recent Vamboozled posts have focused on one of the qualities of the achievement tests used as the key measures in value-added models (VAMs) and other “growth” models. Specifically, the instructional sensitivity of current student achievement tests has been questioned (see, for example, here). Evidence of whether the achievement tests in use can actually detect the influence of instruction on student test performance should, accordingly, be a central part of validity evidence required for VAMs [as per the specific standard also mentioned above]. After all, it is achievement test scores that provide the bases for VAM-based estimates. In addition, while many have noted that the statistical methods employed to generate value-added estimates will force a distribution of “teacher effectiveness” scores, it is not clear that the distinctions generated and measured have educational (or clinical) meaning.
This concern can be illustrated with an example taken from comments I previously made while serving on a Gates Foundation-funded stakeholders committee tasked with making recommendations for the redesign of Pennsylvania’s teacher and principal evaluation systems (see the document to which I refer here). Let me note that what follows is only suggestive of the possible limitations of some of Pennsylvania’s state achievement tests (the PSSAs) and is not offered as an example of an actual analysis of their instructional sensitivity. The value-added calculations, which I converted to raw scores, were performed by a research team from Mathematica Policy Research (for more information click here).
From my comments in this report:
“…8th grade Reading provided the smallest teacher effects from the three subjects reported in Table III.1. The difference the [Final Report] estimates comparing the teacher at the 15th percentile of effectiveness to the average teacher (50th percentile) is -22 scaled score points on the 8th grade PSSA Reading test…[Referring] to the 2010 PSSA Technical Manual raw score table…for the 8th grade Reading test, that would be a difference of approximately 2 raw score points, or the equivalent of 2 multiple choice (MC) questions (1 point apiece) or half credit on one OE [open-ended] question. (There are 40 MC questions and 4 (3-point) OE questions on the Reading test.) The entire range from the 15th percentile of effectiveness to the 85th percentile of effectiveness…covers approximately 3.5 raw score points [emphasis added]” (p. 9)…“[T]he range of teacher effectiveness covering the 5th to the 95th percentiles (73 scaled score points) represents approximately a 5.5 point change in the raw score (i.e., 5.5 of 52 total possible points) [emphasis added]” (p. 9).
The 5th to 95th percentiles corresponds to the effectiveness range used in the PVAAS.
“It would be difficult to see how the range of teacher effects on the 8th grade Reading test would be considered substantively meaningful if this result holds up in a larger and random sample. In the case of this subject and grade level test, the VAM does not appear to be able to detect a range of teacher effectiveness that is substantially more discriminating than those reported for more traditional forms of evaluation.” (pp. 9-10)
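To make the scale of these quoted differences concrete, the raw-score spans can be expressed as fractions of the test's total points. The sketch below is back-of-the-envelope arithmetic only, using the figures quoted above (3.5 and 5.5 raw score points; 40 one-point MC questions plus 4 three-point OE questions, for 52 total); it does not reproduce the scaled-to-raw conversion from the PSSA Technical Manual.

```python
# Illustrative arithmetic using the figures quoted from the Mathematica
# report; the scaled-score-to-raw-score conversion itself is not shown.
TOTAL_RAW_POINTS = 40 * 1 + 4 * 3  # 40 MC (1 pt each) + 4 OE (3 pts each) = 52

# Approximate raw-score spans of the reported effectiveness ranges
ranges = {
    "15th-85th percentile of effectiveness": 3.5,
    "5th-95th percentile of effectiveness": 5.5,
}

for label, span in ranges.items():
    share = span / TOTAL_RAW_POINTS
    print(f"{label}: {span} of {TOTAL_RAW_POINTS} raw points "
          f"({share:.1%} of the total raw score)")
```

In other words, even the widest reported band of “teacher effectiveness” corresponds to roughly a tenth of the available raw score points on this test.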
Among other issues, researchers have considered which scaling properties are necessary for measuring growth (see, for example, here), whether the tests’ scale properties met the assumptions of the statistical models being used (see, for example, here), if growth in student achievement is scale dependent (see, for example, here), and even if tests that were vertically scaled could meet the assumptions required by regression-based models (see, for example, here).
Another issue raised by researchers is the fundamental question of whether state tests can be put to more than one use, that is, as a valid measure of both student progress and teacher performance, or whether the multiple uses inevitably lead to behaviors that would invalidate the tests for either purpose (see, for example, here). All this suggests that before racing ahead to implement value-added components of teacher evaluation systems, states have an obligation to assure all involved that these psychometric issues have been explicitly addressed and that the tests used have been properly validated for use in value-added measurements of “teacher effectiveness.”
So…How many states have validated their student achievement tests for use as measures of teachers’ performance? How many have produced evidence of their tests’ instructional sensitivity, or of the educational significance of the distinctions made by the value-added models they use?
One major vendor of value-added measures (i.e., SAS, as in SAS-EVAAS) has long held that the tests need only 1) have sufficient “stretch” in their scales “to ensure that progress could be measured for low- and high-achieving students,” 2) be tests that are “highly related to the academic standards,” and 3) have scales that “are sufficiently reliable from one year to the next” (see, for example, here). These assertions should be subject to independent psychometric verification, but they have not been. Rather, and notably, in all the literature it has produced regarding its VAMs, SAS does not indicate that it requires or conducts the tests that researchers have designed for determining instructional sensitivity. It asserts, nonetheless, that it can provide accurate measures of individual teacher effectiveness.
To date, Pennsylvania’s Department of Education (PDE) has apparently ignored the fact that the Data Recognition Corporation (DRC), the vendor of its state tests, has avoided making any claims that its tests are valid for use as measures of teacher performance (see, for example, here). In the validity chapter of DRC’s most recent Technical Report, for example, DRC lists the main purposes of the PSSA, the second of which is to “Provid[e] information on school and district accountability” (p. 283). The omission of teacher accountability here certainly stands out.
To date, again, none of the publicly released DRC technical reports for Pennsylvania’s state tests provides any indication that instructional sensitivity has been examined, either. Nor do the reports provide a basis for others to make their own judgment. In fact, they make it clear that historically the PSSA exams were designed for school-level accountability and only later moved toward measuring individual student mastery of Pennsylvania’s academic standards. Various PSSA tests will sample an entire academic standard with as few as one or two questions (see, for example, Appendix C here), which aligns with the evidence presented in the first section of this post.
In sum, VAMs by design force a distribution of “teacher effectiveness,” even when the measured differences are substantively quite small, and much smaller than lay observers, or educational policymakers in particular, might otherwise believe (in the case of the 8th grade PSSA Reading test, 5.5 of 52 total possible points covers the entire range from the bottom to the top categories of “teacher effectiveness”). It is therefore essential that both instructional sensitivity and educational significance be explicitly established. Otherwise, untrained users of the data may labor under the misapprehension that the differences observed provide a solid basis for making such (low- and high-stakes) personnel decisions.
This is a completely separate concern from the fact that, at best, measured “teacher effects” may explain only 1% to 15 or 20% of the variance in student test scores, a figure dwarfed by the variance explained by factors beyond the control of teachers. To date, SAS has not publicly reported the amount of variance explained by its PVAAS models for any of Pennsylvania’s state tests. I believe every vendor of value-added models should report this information for every achievement test being used as a measure of “teacher effectiveness.” This, too, would provide essential context for understanding the substantive meaning of reported scores.