In a study just released on the website of Education Next, researchers discuss results from their recent examinations of “new teacher-evaluation systems in four school districts that are at the forefront of the effort [emphasis added] to evaluate teachers meaningfully.” The four districts’ evaluation systems were based on classroom observations, achievement-test gains for the whole school (i.e., school-level value-added), performance on non-standardized tests, and some measure of teacher professionalism and/or commitment to the school community.
Researchers found the following: The ratings assigned to teachers under the four districts’ leading evaluation systems, which were based primarily (i.e., 50–75%) on observations and did not include value-added scores except for the amazingly low 20% of teachers who were VAM-eligible, were “sufficiently predictive” of a teacher’s future performance. The researchers later define “sufficiently predictive” in terms of predictive validity coefficients ranging from 0.33 to 0.38, which are actually quite “low” coefficients in reality. They nonetheless go on to call these coefficients “quite predictive.”
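To see just how low these coefficients are, it helps to square them: the square of a correlation coefficient gives the share of variance in future performance that the evaluation ratings actually explain. Here is a back-of-the-envelope check in Python; only the 0.33–0.38 range comes from the article, and the code itself is purely illustrative:

```python
# The squared correlation (r^2) is the proportion of variance in future
# teacher performance that the evaluation ratings account for.
for r in (0.33, 0.38):
    print(f"r = {r:.2f} -> variance explained = {r ** 2:.1%}")
# Coefficients of 0.33-0.38 thus explain only about 11-14% of the
# variance in future performance.
```

In other words, roughly 86–89% of the variance in a teacher’s future performance is left unexplained by these “quite predictive” ratings.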
While such low coefficients are to be expected, as per others’ research on this topic, one must question how the authors arrived at their determinations that these were “sufficiently” and “quite” predictive (see also Bill Honig’s comments at the bottom of this article). The authors do qualify these classifications later, writing that “[t]he degree of correlation confirms that these systems perform substantially better in predicting future teacher performance than traditional systems based on paper credentials and years of experience.” They explain further that these correlations are “in the range that is typical of systems for evaluating and predicting future performance in other fields of human endeavor, including, for example, those used to make management decisions on player contracts in professional sports.” So it seems their qualifications rest on a relative (“better than”) judgment, not an empirical one. That being said, this is certainly something to consume critically, particularly given the ways they have inappropriately categorized these coefficients.
Researchers also found the following: “The stability generated by the districts’ evaluation systems range[d] from a bit more than 0.50 for teachers with value-added scores to about 0.65 when value-added is not a component of the score.” In other words, districts’ “[e]valuation scores that [did] not include value-added [were] more stable [when districts] assign[ed] more weight to observation scores, which [were demonstrably] more stable over time than value-added scores.” Put simply, observational scores outperformed value-added scores. Likewise, the stability observed in the value-added scores (i.e., 0.50) fell within the upper range of the coefficients reported elsewhere in the research. So researchers also confirmed that teacher-level value-added scores remain quite inconsistent from year to year, still (and too often) varying widely, and wildly, over time.
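To illustrate what year-to-year stability coefficients of 0.50 versus 0.65 mean in practice, here is a minimal simulation sketch in Python. It assumes, purely for illustration, that scores are normally distributed; the function name and all parameters are hypothetical, and only the two stability values come from the article:

```python
import math
import random

def simulate_retention(rho, n=10_000):
    """Simulate two years of scores whose year-to-year correlation is
    rho, and return the share of year-1 top-quartile teachers who are
    still in the top quartile in year 2."""
    year1 = [random.gauss(0, 1) for _ in range(n)]
    # Year-2 score = correlated carry-over from year 1 plus fresh noise.
    year2 = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
             for x in year1]
    cut1 = sorted(year1)[int(0.75 * n)]  # 75th-percentile cutoffs
    cut2 = sorted(year2)[int(0.75 * n)]
    top1 = [i for i in range(n) if year1[i] >= cut1]
    return sum(1 for i in top1 if year2[i] >= cut2) / len(top1)

random.seed(42)
for rho in (0.50, 0.65):  # the stability coefficients quoted above
    print(f"stability {rho:.2f}: {simulate_retention(rho):.0%} "
          "of top-quartile teachers remain top-quartile the next year")
```

In this simulation, under either stability level roughly half of one year’s top-quartile teachers fall out of the top quartile the following year, which is exactly the kind of year-to-year volatility described above.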
Researchers’ key recommendations, aimed at improving the quality of data derived from classroom observations: “Teacher evaluations should include two to three annual classroom observations, with at least one observation being conducted by a trained external observer.” They provide some evidence in support of this assertion in the full article. In addition, they assert that “[c]lassroom observations should carry at least as much weight as test-score gains in determining a teacher’s overall evaluation score.” I would argue, though, based on their (and others’) results, that they made an even stronger case throughout their paper for observations in lieu of teacher-level value-added.
Put differently, and in their own words – words with which I agree: “[M]ost of the action and nearly all the opportunities for improving teacher evaluations lie in the area of classroom observations rather than in test-score gains.” So there it is.
Note: The authors of this article also discuss the “bias” inherent in classroom observations. Based on their findings, for example, they recommend that “districts adjust classroom observation scores for the degree to which the students assigned to a teacher create challenging conditions for the teacher. Put simply, the current observation systems are patently unfair to teachers who are assigned less-able and -prepared students. The result is an unintended but strong incentive for good teachers to avoid teaching low-performing students and to avoid teaching in low-performing schools.” While I did not highlight these sections above, do click here if you want to read more.