New Evidence that Developmental (and Formative) Approaches to Teacher Evaluation Systems Work

Susan Moore Johnson – Professor of Education at Harvard University and author of another important article on how value-added models (VAMs) often reinforce the walls of “egg-crate” schools (here) – recently published, along with two co-authors, an article in the esteemed, peer-reviewed Educational Evaluation and Policy Analysis. The article, titled “Investing in Development: Six High-Performing, High-Poverty Schools Implement the Massachusetts Teacher Evaluation Policy,” can be downloaded here (in its free, pre-publication form).

In this piece, as taken from the abstract, they “studied how six high-performing, high-poverty [traditional, charter, and state-supervised] schools in one large Massachusetts city implemented the state’s new teacher evaluation policy” (p. 383). They aimed to learn how these “successful” schools, with “success” defined by each school’s state accountability ranking along with its “public reputation,” approached the state’s teacher evaluation system and its components (e.g., classroom observations, follow-up feedback, and the construction and treatment of teachers’ summative evaluation ratings). They also investigated how educators within these schools “interacted to shape the character and impact of [the state’s] evaluation” (p. 384).

Akin to Moore Johnson’s aforementioned work, she and her colleagues argue that “to understand whether and how new teacher evaluation policies affect teachers and their work, we must investigate [the] day-to-day responses [of] those within the schools” (p. 384). Hence, they explored “how the educators in these schools interpreted and acted on the new state policy’s opportunities and requirements and, overall, whether they used evaluation to promote greater accountability, more opportunities for development, or both” (p. 384).

They found that “despite important differences among the six successful schools [they] studied (e.g., size, curriculum and pedagogy, student discipline codes), administrators responded to the state evaluation policy in remarkably similar ways, giving priority to the goal of development over accountability [emphasis added]” (p. 385). In addition, “[m]ost schools not only complied with the new regulations of the law but also went beyond them to provide teachers with more frequent observations, feedback, and support than the policy required. Teachers widely corroborated their principal’s reports that evaluation in their school was meant to improve their performance and they strongly endorsed that priority” (p. 385).

Overall, and accordingly, they concluded that “an evaluation policy focusing on teachers’ development can be effectively implemented in ways that serve the interests of schools, students, and teachers” (p. 402). This is especially true when (1) evaluation efforts are “well grounded in the observations, feedback, and support of a formative evaluation process;” (2) states rely on “capacity building in addition to mandates to promote effective implementation;” and (3) schools also benefit from spillover effects from other, positive, state-level policies (i.e., states do not take Draconian approaches to other educational policies) that, in these cases included policies permitting district discretion and control over staffing and administrative support (p. 402).

Relatedly, such developmental and formatively focused teacher evaluation systems can work, they also conclude, when schools are led by highly effective principals who are free to select high-quality teachers. Their findings suggest that this “is probably the most important thing district officials can do to ensure that teacher evaluation will be a constructive, productive process” (p. 403). In sum, “as this study makes clear, policies that are intended to improve schooling depend on both administrators and teachers for their effective implementation” (p. 403).

Please note, however, that this study was conducted before districts in this state were required to incorporate standardized test scores to measure teachers’ effects (e.g., using VAMs); hence, the assertions and conclusions that the authors set forth throughout this piece should be read with that important caveat in mind. Perhaps these findings should matter even more, though, in that they offer at least some proof that teacher evaluation works IF used for developmental and formative (versus, or perhaps in lieu of, summative) purposes.

Citation: Reinhorn, S. K., Moore Johnson, S., & Simon, N. S. (2017). Investing in development: Six high-performing, high-poverty schools implement the Massachusetts teacher evaluation policy. Educational Evaluation and Policy Analysis, 39(3), 383–406. doi:10.3102/0162373717690605

The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft — an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here) — and one of his co-authors, Matthew P. Steinberg — an Assistant Professor of Education Policy at the University of Pennsylvania — just published another article in this same journal, “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and freely accessible, at least for now, article here; see also its original and what should be its enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matter, matter differently for different types of teachers within and across school districts and states, and matter also in that so often these weights are arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the consequences attached to these measures, given the arbitrary and political constructions they represent, can be both professionally and personally consequential (i.e., career and life changing, respectively). How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).
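To make the mechanics concrete, here is a minimal simulation sketch. This is my own illustration, not Steinberg and Kraft’s actual model or the MET data: the distributions, the weights, and the proficiency cutoff are all hypothetical. It draws a norm-referenced “VAM” rating centered on the midpoint of a 1–4 scale and a criterion-referenced observation rating with a higher mean, then shows how the share of teachers clearing a fixed proficiency cutoff shrinks as the VAM’s weight grows:

```python
import random

random.seed(42)
N = 10_000
PROFICIENT = 3.0  # hypothetical proficiency cutoff on a 1-4 rating scale

# Norm-referenced "VAM" ratings: normalized, so they center on the scale's
# midpoint regardless of teachers' absolute quality (hypothetical parameters).
vam = [min(4.0, max(1.0, random.gauss(2.5, 0.6))) for _ in range(N)]

# Observation ratings: criterion-referenced, with a high mean in practice
# (most teachers score near "proficient"; again, hypothetical parameters).
obs = [min(4.0, max(1.0, random.gauss(3.2, 0.4))) for _ in range(N)]

rates = {}
for w_vam in (0.2, 0.5, 0.8):
    # Weighted composite of the two measures.
    composite = [w_vam * v + (1 - w_vam) * o for v, o in zip(vam, obs)]
    rates[w_vam] = sum(c >= PROFICIENT for c in composite) / N
    print(f"VAM weight {w_vam:.1f}: {rates[w_vam]:.1%} rated proficient")
```

Because the VAM component has the lower relative mean, every point of weight shifted toward it drags the composite down, so fewer teachers clear the same fixed cutoff. That is the arithmetic behind the pattern Steinberg and Kraft report, and it holds regardless of whether any individual teacher actually changed.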

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Some states are then celebrated over others (e.g., by the National Council on Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously when, as evidenced herein, this is (unfortunately) due more to manipulation than to true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It is not necessarily that teacher effects vary in reality; rather, the manipulation of the weights on the back end causes such variation and then leads, quite literally, to delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests that VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also raises a related and important issue of fairness, in that equally effective teachers may be five or so times more likely (e.g., in states like New Mexico) to be rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).
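The same arithmetic explains the eligibility gap. The following sketch is, again, my own hypothetical illustration (made-up rating distributions and a made-up 50/50 weighting, not New Mexico’s actual system): it compares the “ineffective” rate for teachers rated on observations alone against teachers whose composite shifts half its weight onto a norm-referenced VAM:

```python
import random

random.seed(7)
N = 10_000
INEFFECTIVE = 2.0  # hypothetical "ineffective" cutoff on a 1-4 scale

# Hypothetical ratings: high-mean observation scores versus
# midpoint-centered (norm-referenced) VAM scores.
obs = [min(4.0, max(1.0, random.gauss(3.2, 0.4))) for _ in range(N)]
vam = [min(4.0, max(1.0, random.gauss(2.5, 0.6))) for _ in range(N)]

# Non-VAM-eligible teachers: rated on observations only.
rate_obs_only = sum(o < INEFFECTIVE for o in obs) / N

# VAM-eligible teachers: half the weight shifts to the lower-mean VAM.
rate_with_vam = sum(0.5 * v + 0.5 * o < INEFFECTIVE
                    for v, o in zip(vam, obs)) / N

print(f"obs-only ineffective rate: {rate_obs_only:.2%}")
print(f"with-VAM ineffective rate: {rate_with_vam:.2%}")
```

Even though both groups are drawn from identical underlying distributions (i.e., they are equally effective by construction), the VAM-eligible group lands below the cutoff far more often, simply because part of its composite was swapped for a lower-mean, norm-referenced measure.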

Finally, it should also be noted, as an important caveat here, that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences” (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Relatedly, please recall that one of the MET researchers’ goals was to determine which weights, per multiple measure, were empirically defensible. The MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might claim that this weight here or that weight there is empirically defensible should be asked for the evidence in support.


Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Education Policy Center.

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–396. doi:10.3102/0013189X17726752

Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014, and just one day ago (October 10, 2017) the case reached its final federal settlement. Click here to read the “Settlement and Full and Final Release Agreement.” In short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of the EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state, or in this case district, shall deprive any person of life, liberty, or property, without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here, p. 13). This was confirmed by one of the plaintiffs’ expert witnesses, who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here, p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here, p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in its attorney and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others evaluated using the EVAAS, or any other VAM that is similarly “unverifiable” for that matter, should take note, at minimum.