States’ Teacher Evaluation Systems Moving in the “Right” Direction

Last week, the University of Colorado Boulder’s National Education Policy Center (NEPC) published a technical report that one of my current and one of my former doctoral students helped me to research and write. While you can navigate to and read the press release here, as well as download and read the full report here, I thought I would summarize the report’s most interesting facts in this post for the readers/followers of this blog, who are likely most interested in the findings pertaining to states’ revised teacher evaluation systems following the federal passage of the Every Student Succeeds Act (ESSA).

In short, for purposes of this study we collected and analyzed the 51 (i.e., 50 states plus Washington DC) revised teacher evaluation plans submitted to the federal government post-ESSA (i.e., spring/summer of 2017). As specific only to states’ teacher evaluation systems, we report three key findings:

— First, the role of growth or value-added models (VAMs) for teacher evaluation purposes is declining. That is, the share of states using statewide growth models or VAMs has decreased from 42% to 30% since 2014. This is certainly a step in the “right,” defined as research-informed, direction. See also Figure 1 below (Close, Amrein-Beardsley, & Collins, 2018, p. 13).

— Second, because ESSA loosened federal control of teacher evaluation, many states no longer have a one-size-fits-all teacher evaluation system. This is allowing local districts to make more choices about models, implementation, execution, and the like, in the contexts of the schools and communities in which schools exist.

— Third, the rhetoric surrounding teacher evaluation has changed: language about holding teachers accountable for their value-added effects, or lack thereof, is much less evident in post-ESSA plans. Rather, new plans make note of providing data to teachers as a means of supporting professional development and improvement, essentially shifting the purpose of the evaluation system away from summative and toward formative use.

We also set forth recommendations for states in this report, based on the evidence noted above (and presented in much more detail in the full report). The recommendations that directly pertain to states’ (and districts’) teacher evaluation systems are that states/districts:

  1. Take advantage of decreased federal control by formulating revised assessment policies informed by the viewpoints of as many stakeholders as feasible. Such informed revision can help remedy earlier weaknesses, promote effective implementation, stress correct interpretation, and yield formative information.
  2. Ensure that teacher evaluation systems rely on a balanced system of multiple measures, without disproportionate weight assigned to any one measure as allegedly “superior” to any other. If measures contradict one another, however, output from all measures should be interpreted judiciously.
  3. Emphasize data useful as formative feedback in state systems, so that specific weaknesses in student learning can be identified, targeted and used to inform teachers’ professional development.
  4. Mandate ongoing research and evaluation of state assessment systems and ensure that adequate resources are provided to support [ongoing] evaluation [efforts].
  5. Set goals for reducing proficiency gaps and outline procedures for developing strategies to effectively reduce gaps once they have been identified.

We hope this information helps, especially the states and districts still looking to other states to see what is trending. While we note in the title of this blog post as well as the title of the full report that all of this represents “some steps in the right direction,” there is still much work to be done. This is especially true in states like New Mexico (see my most recent post about the ongoing lawsuit in this state here) and other states that have yet to give up on the false promises and limited research behind such educational policies, established almost one decade ago (e.g., Race to the Top; Duncan, 2009).


Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems after the passage of the Every Student Succeeds Act: Some steps in the right direction. Boulder, CO: National Education Policy Center (NEPC). Retrieved from

Duncan, A. (2009, July 4). The race to the top begins: Remarks by Secretary Arne Duncan. Retrieved from

New Mexico Teacher Evaluation Lawsuit Updates

In December of 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence is to be presented by the state is currently set for this October. See more information about this ruling here.

As the expert witness for the plaintiffs in this case, I was deposed a few weeks ago here in Phoenix, given my analyses of the state’s data (supported by one of my PhD students – Tray Geiger). In short, we found and I testified during the deposition that:

  • In terms of uniformity and fairness, approximately 70% of New Mexico teachers appear to be ineligible to be assessed using VAMs, and this proportion held constant across the years of data analyzed. This matters all the more because, when VAM-based data are used to make consequential decisions about teachers, accountability-eligible teachers are the ones relatively more likely to realize the negative, or reap the positive, consequences attached to VAM-based estimates.
  • In terms of reliability (or the consistency of teachers’ VAM-based scores over time), approximately 40% of teachers differed by one quintile (quintiles are derived when a sample or population is divided into fifths) and approximately 28% of teachers differed, from year to year, by two or more quintiles in terms of their VAM-derived effectiveness ratings. These results make sense when New Mexico’s results are situated within the current literature, in which teachers classified as “effective” one year can have a 25%-59% chance of being classified as “ineffective” the next, or vice versa, with other permutations also possible.
  • In terms of validity (i.e., concurrent-related evidence of validity), and importantly as also situated within the current literature, the correlations between New Mexico teachers’ VAM-based and observational scores ranged from r = 0.153 to r = 0.210. Not only are these correlations very weak[1], they are also very weak relative to the literature, in which correlations between multiple VAMs and observational scores typically fall within 0.30 ≤ r ≤ 0.50.
  • In terms of bias, New Mexico’s Caucasian teachers had significantly higher observation scores than non-Caucasian teachers, implying, also as per the current research, that Caucasian teachers may be (falsely) perceived as being better teachers than non-Caucasian teachers given bias within these instruments and/or bias of the scorers observing and scoring teachers using these instruments in practice. See prior posts about observational-based bias here, here and here.
  • Also of note in terms of bias was that: (1) teachers with fewer years of experience yielded VAM scores that were significantly lower than those of teachers with more years of experience, with similar patterns noted across teachers’ observation scores, which could all mean, in line with common sense as well as the research, that teachers with more experience are typically better teachers; (2) teachers who taught English language learners (ELLs) or special education students had lower VAM scores across the board than those who did not teach such students; (3) teachers who taught gifted students had significantly higher VAM scores than teachers who did not, which runs counter to the current research evidencing that gifted students often thwart or prevent their teachers from demonstrating growth given ceiling effects; and (4) teachers in schools with lower relative proportions of ELLs, special education students, students eligible for free-or-reduced lunches, and students from racial minority backgrounds, as well as higher relative proportions of gifted students, consistently had significantly higher VAM scores. These results suggest that teachers in these schools are as a group better, and/or that VAM-based estimates might be biased against teachers not teaching in these schools, preventing them from demonstrating comparable growth.
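
For readers curious how a year-to-year reliability figure of the kind noted in the list above might be computed, here is a minimal sketch in Python. It is not the analysis submitted to the court: the function names, the rank-based quintile rule, and the toy scores are all assumptions made for illustration only.

```python
# A minimal sketch, NOT the analysis used in the lawsuit.
# Assumption: quintiles are assigned by rank within each year,
# with 1 = lowest fifth and 5 = highest fifth.

def quintile_ranks(scores):
    """Return a quintile (1..5) for each score, based on its rank."""
    n = len(scores)
    # Indices of teachers ordered from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    quintiles = [0] * n
    for rank, idx in enumerate(order):
        quintiles[idx] = rank * 5 // n + 1
    return quintiles


def quintile_shift_shares(year1, year2):
    """Share of teachers whose quintile moved by 0, 1, or 2+ between years.

    year1 and year2 are lists of scores for the SAME teachers,
    in the same order.
    """
    if len(year1) != len(year2) or not year1:
        raise ValueError("need two non-empty score lists of equal length")
    q1 = quintile_ranks(year1)
    q2 = quintile_ranks(year2)
    shifts = [abs(a - b) for a, b in zip(q1, q2)]
    n = len(shifts)
    return {
        "same": sum(s == 0 for s in shifts) / n,
        "one": sum(s == 1 for s in shifts) / n,
        "two_or_more": sum(s >= 2 for s in shifts) / n,
    }


# Hypothetical toy scores for ten teachers (not real data).
toy_year1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
toy_year2 = list(reversed(toy_year1))
print(quintile_shift_shares(toy_year1, toy_year2))
```

With real data, the "one" and "two_or_more" shares would be the figures comparable to the approximate 40% and 28% reported above, although tied scores would need a tie-handling rule that this sketch does not include.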

To read more about the data and methods used, as well as other findings, please see my affidavit submitted to the court attached here: Affidavit Feb2018.

As a recent update, I should also note that a few weeks ago, as per an article in the Albuquerque Journal, New Mexico’s teacher evaluation system is now likely to be overhauled, or simply “expired,” as early as 2019. In short, “all three Democrats running for governor and the lone Republican candidate…have expressed misgivings about using students’ standardized test scores to evaluate the effectiveness of [New Mexico’s] teachers, a key component of the current system [at issue in this lawsuit and] imposed by the administration of outgoing Gov. Susana Martinez.” All four candidates described the current system “as fundamentally flawed and said they would move quickly to overhaul it.”

While I/we will proceed with our efforts pertaining to this lawsuit until further notice, this is also important to note at this time in that it seems that New Mexico’s incoming policymakers are going to be much wiser than those of late, at least in these regards.

[1] Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.
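
The interpretation bands in this footnote can be written as a small lookup, sketched below in Python. Two things here are my assumptions and not part of the footnote: the function name `interpret_r`, and the choice to assign a boundary value (e.g., exactly 0.2, which appears in two bands) to the stronger band. The sketch also takes the absolute value of r, although the footnote only lists non-negative values.

```python
# A minimal sketch of the footnote's bands; boundary handling is an assumption.

def interpret_r(r):
    """Return the footnote's verbal label for a correlation coefficient r."""
    magnitude = abs(r)  # assumption: strength is judged on |r|
    if magnitude > 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude >= 0.8:
        return "very strong"
    if magnitude >= 0.6:
        return "strong"
    if magnitude >= 0.4:
        return "moderate"
    if magnitude >= 0.2:
        return "weak"
    return "very weak"


# The two endpoints reported for New Mexico, plus the literature's typical range.
for value in (0.153, 0.210, 0.30, 0.50):
    print(value, interpret_r(value))
```

Values that sit close to a cutoff, such as those near 0.2, are sensitive to whichever boundary convention is chosen, so the labels are best read as rough guides.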