In a recent paper published in the peer-reviewed journal Education Finance and Policy, coauthors Cassandra Guarino (Indiana University – Bloomington), Mark Reckase (Michigan State University), and Jeffrey Wooldridge (Michigan State University) ask and then answer the following question: “Can Value-Added Measures of Teacher Performance Be Trusted?” While what I write below is taken from what I read via the official publication, I link here to the working paper that was published online via the Education Policy Center at Michigan State University (i.e., not for a fee).
From the abstract, the authors “investigate whether commonly used value-added estimation strategies produce accurate estimates of teacher effects under a variety of scenarios. [They] estimate teacher effects [using] simulated student achievement data sets that mimic plausible types of student grouping and teacher assignment scenarios. [They] find that no one method accurately captures true teacher effects in all scenarios, and the potential for misclassifying teachers as high- or low-performing can be substantial.”
In more specific terms elsewhere in the paper, the authors use simulated data to “represent controlled conditions” that most closely match “the relatively simple conceptual model upon which value-added estimation strategies are based.” This is the strength of the study: the authors’ findings represent best-case scenarios, whereas with real-world data “conditions are [much] more complex.” Hence, testing various statistical estimators, controls, and approaches on simulated data becomes “the best way to discover fundamental flaws and differences among them when they should be expected to perform at their best.” Their main findings follow.
- “No one [value-added] estimator performs well under all plausible circumstances, but some are more robust than others…[some] fare better than expected…[and] some of the most popular methods are neither the most robust nor ideal.” In other words, calculating value-added is messy, regardless of the sophistication of the statistical specifications and controls used, and this messiness can seriously throw off the validity of the inferences to be drawn about teachers, even given the fanciest models and methodological approaches we currently have going (i.e., those models and model specifications being advanced via policy).
- “[S]ubstantial proportions of teachers can be misclassified as ‘below average’ or ‘above average’ as well as in the bottom and top quintiles of the teacher quality distribution, even in [these] best-case scenarios.” This means that the misclassification errors we see with real-world data also appear with simulated data. This raises even more concern about whether VAMs will ever be able to get it right or, in this case, counter the effects of the nonrandom assignment of students to classrooms and of teachers to those classrooms.
- The researchers found that “even in the best scenarios and under the simplistic and idealized conditions imposed by [their] data-generating process, the potential for misclassifying above-average teachers as below average or for misidentifying the ‘worst’ or ‘best’ teachers remains nontrivial, particularly if teacher effects are relatively small. Applying the [most] commonly used [value-added approaches] results in misclassification rates that range from at least 7 percent to more than 60 percent, depending upon the estimator and scenario.” So even with a nearly perfect dataset, one much cleaner than those that come from actual children and their test scores in real schools, misclassification rates can exceed 60 percent, depending on the estimator and scenario.
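The point that misclassification grows when teacher effects are small relative to the noise in student scores can be illustrated with a toy simulation. The sketch below is not the authors’ data-generating process, nor any of the estimators they test. It assumes students are randomly assigned (an even friendlier case than most of their scenarios), it uses a simple classroom-average gain as the “value-added” estimate, and every function name and parameter value in it is hypothetical.

```python
import random


def simulate_misclassification(n_teachers=1000, class_size=25,
                               effect_sd=0.1, noise_sd=1.0, seed=0):
    """Share of teachers placed on the wrong side of average.

    Assumptions (illustrative only, not taken from the paper):
    - true teacher effects are normal with standard deviation effect_sd
    - each student's gain is the teacher effect plus normal noise
    - students are randomly assigned, so the estimate is unbiased
    """
    rng = random.Random(seed)

    # True (unobservable) teacher effects.
    true_effects = [rng.gauss(0.0, effect_sd) for _ in range(n_teachers)]

    # Naive estimate: the mean gain of the students in each classroom.
    estimates = []
    for effect in true_effects:
        gains = [effect + rng.gauss(0.0, noise_sd) for _ in range(class_size)]
        estimates.append(sum(gains) / class_size)

    true_mean = sum(true_effects) / n_teachers
    est_mean = sum(estimates) / n_teachers

    # A teacher is misclassified when the estimate puts them above
    # average but the truth is below average, or the reverse.
    flipped = sum(
        (t > true_mean) != (e > est_mean)
        for t, e in zip(true_effects, estimates)
    )
    return flipped / n_teachers


if __name__ == "__main__":
    for sd in (0.05, 0.1, 0.25, 0.5):
        rate = simulate_misclassification(effect_sd=sd)
        print(f"effect_sd={sd}: misclassified share = {rate:.2f}")
```

Running the script prints the misclassified share for several hypothetical effect sizes. Even with no sorting of students at all, the share rises as `effect_sd` shrinks toward the sampling noise of a single classroom average, which is the same direction as the pattern the authors describe.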
In sum, researchers conclude that while certain VAMs hold more promise than others, they may not be capable of overcoming the many obstacles presented by the non-random assignment of students to teachers (and teachers to classrooms).
In their own words, “it is clear that every estimator has an Achilles heel (or more than one area of potential weakness)” that can distort teacher-level output in highly consequential ways. Hence, “[t]he degree of error in [VAM] estimates…may make them less trustworthy for the specific purpose of evaluating individual teachers” than we might think.
More mischief happens when the results are reduced to HEDI ratings (Highly effective, Effective, Developing, Ineffective), especially given the federal definition of “highly effective”: production of more than a year’s worth of “growth,” meaning gains in test scores.
Of course, there is no real calendar year there as the rhetoric implies. The language is deceptive. There is a big difference between a calendar year, an instructional year, and an “accountability year.”