Recall from a prior post that the peer-reviewed journal titled *Educational Researcher (ER)* – recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou – Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University – and Matthew Springer – Assistant Professor of Public Policy also at Vanderbilt.

As written into the articles’ abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

- “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
- “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78).
- “[R]esources will be wasted if teachers are targeted for interventions without taking

into account the probability that the ratings they receive are based on error” (p. 78). - “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game

a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs) and despite the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed, or researched. - Related, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it’s rare; hence, it should not significantly impact teachers value added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
- In terms of teachers versus proctors or other teachers monitoring students when they take large-scale standardized tests (that are used across all states to calculate value-added estimates), researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant that I do in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but for those who still allow this, this finding might also be relevant. “An alternative interpretation of these findings is that students

naturally do better when their own teacher supervises the exam as

opposed to a teacher they do not know” (p. 83).

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS) used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee and many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with EVAAS practice whereby teachers’ EVAAS scores change retroactively, as more data become available, to get at more “precision” even though teachers’ scores can change one or two years well after the initial score is registered (and used for whatever purposes).

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. *Educational Researcher, 44*(2), 77-86. doi:10.3102/0013189X15574904