Learning from What Doesn’t Work in Teacher Evaluation

One of my doctoral students — Kevin Close — and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share out with all of you, especially before the study is no longer open-access or free (see full study as currently available here). As the title indicates, the study is about how states, school districts, and schools can “Learn from What Doesn’t Work in Teacher Evaluation,” given an analysis that the two of us conducted of all documents pertaining to the four teacher evaluation and value-added model (VAM)-centered lawsuits in which I have been directly involved, and that I have also covered in this blog. These lawsuits include Lederman v. King in New York (see here), American Federation of Teachers et al. v. Public Education Department in New Mexico (see here), Houston Federation of Teachers v. Houston Independent School District in Texas (see here), and Trout v. Knox County Board of Education in Tennessee (see here).

Via this analysis we set out to comb through the legal documents to identify the strongest objections, as also recognized by the courts in these lawsuits, to VAMs as teacher measurement and accountability strategies. “The lessons to be learned from these cases are both important and timely” given that “[u]nder the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work.”

The most pertinent and also common issues as per these cases were as follows:

(1) Inconsistencies in teachers’ VAM-based estimates from one year to the next that are sometimes “wildly different.” Across these lawsuits, issues with reliability were very evident, in that teachers classified as “effective” one year were either theorized or demonstrated to have around a 25%-59% chance of being classified as “ineffective” the next year, or vice versa, with other permutations also possible. As per our profession’s Standards for Educational and Psychological Testing, VAM estimates of teacher effectiveness should instead be more or less consistent over time, from one year to the next, regardless of the type of students and perhaps subject areas that teachers teach.

(2) Bias in teachers’ VAM-based estimates was also of note, whereby documents suggested or evidenced that bias, or rather biased estimates of teachers’ actual effects, does indeed exist (although this area was also of most contention and dispute). Specific to VAMs, since teachers are not randomly assigned the students they teach, whether their students are invariably more or less motivated, smart, knowledgeable, or capable can bias students’ test-based data, and teachers’ test-based data when aggregated. Court documents, although again not without counterarguments, suggested that VAM-based estimates are sometimes biased, especially when relatively homogeneous sets of students (e.g., English Language Learners (ELLs), gifted and special education students, free-or-reduced lunch eligible students) are non-randomly concentrated into schools, purposefully placed into classrooms, or both. Research suggests that this also sometimes happens regardless of the sophistication of the statistical controls used to block said bias.

(3) The gaming mechanisms in play within teacher evaluation systems in which VAMs play a key role, or carry significant evaluative weight, were also of legal concern and dispute. That administrators sometimes inflate the observational ratings of teachers whom they want to protect, while simultaneously offsetting the weight the VAMs sometimes carry, was of note, as was the inverse. That administrators also sometimes lower teachers’ ratings to better align them with their “more objective” VAM counterparts was also at issue. “So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their ‘subjective’ classroom ratings were in sync with the VAM’s ‘objective’ scores.” Both sets of behavior distort the validity (or “truthfulness”) of any teacher evaluation system and are in violation of the same, aforementioned Standards for Educational and Psychological Testing that call for VAM scores and observation ratings to be kept separate. One indicator should never be adjusted to offset or to fit the other.

(4) Transparency, or the lack thereof, was also a common issue across cases. Transparency, which can be defined as the extent to which something is accessible and readily capable of being understood, pertains to whether VAM-based estimates are accessible and make sense to those at the receiving ends. “Not only should [teachers] have access to [their VAM-based] information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.” In no case was this more legally pertinent than in Houston Federation of Teachers v. Houston Independent School District in Texas. Here, the presiding judge ruled that teachers did have “legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well.”

In the main article (here) we also discuss what states are now doing to (hopefully) improve upon their teacher evaluation systems in terms of using multiple measures to help to evaluate teachers more holistically. We emphasize the (in)formative versus the summative and high-stakes functions of such systems, and allowing teachers to take ownership over such systems in their development and implementation. I will leave you all to read the full article (here) for these details.

In sum, though, when rethinking states’ teacher evaluation systems, especially given the new liberties afforded to states via the Every Student Succeeds Act (ESSA), educators, education leaders, policymakers, and the like would do well to look to the past for guidance on what not to do — and what to do better. These legal cases can certainly inform such efforts.

Reference: Close, K., & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan, 100(1), 15-19. Retrieved from http://www.kappanonline.org/learning-from-what-doesnt-work-in-teacher-evaluation/

Can More Teachers Be Covered Using VAMs?

Some researchers continue to explore the potential worth of value-added models (VAMs) for measuring teacher effectiveness. While I do not endorse the perpetual tweaking of this or twisting of that to explore how VAMs might be made “better” for such purposes, especially given the decades of research we now have evidencing the plethora of problems with using VAMs for such purposes, I do try to write about current events, including current research published on this topic, for this blog. Hence, I write here about a study researchers from Mathematica Policy Research released last month, about whether more teachers might be VAM-eligible (download the full study here).

One of the main issues with VAMs is that they can typically be used to measure the effects of only approximately 30% of all public school teachers. The other 70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers) or teachers who do not teach the core subject areas assessed using large-scale standardized tests (e.g., mathematics and reading/language arts) cannot be evaluated or held accountable using VAM data. This is more generally termed an issue with fairness, defined by our profession’s Standards for Educational and Psychological Testing as the impartiality of “test score interpretations for intended use(s) for individuals from all [emphasis added] relevant subgroups” (p. 219). Issues of fairness arise when a test, or test-based inference or use impacts some more than others in unfair or prejudiced, yet often consequential ways.

Accordingly, in this study researchers explored whether VAMs can be used to evaluate teachers of subject areas that are only tested occasionally and in non-consecutive grade levels (e.g., science and social studies in grades 4 and 7, or 5 and 8). More specifically, they explored whether teachers’ students’ other, consecutively administered subject area tests (i.e., mathematics and reading/language arts) can be used to help isolate teachers’ contributions to students’ achievement in said excluded subject areas. Indeed, it is true that “states and districts have little information about how value-added models [VAMs] perform in grades when tests in the same subject are not available from the previous year.” Yet, states (e.g., New Mexico) continue to do this without evidence that it works. This is also one point of contention in the ongoing lawsuit there. Hence, the purpose of this study was to explore (using state-level data from Oklahoma) how well doing this works, again, given that the use of such proxy pretests “could allow states and districts to increase the number of teachers for whom value-added models [could] be used” (i.e., increase fairness).

However, researchers found that when doing just this (1) VAM estimates that do not account for same-subject pretests may be less credible than estimates that use same-subject pretests from prior and adjacent grade levels (note that the authors do not explicitly define what they mean by credible, but I infer the term to be synonymous with valid). In addition, (2) doing this may subsequently lead to relatively more biased VAM estimates, even more so than changing some other features of VAMs, and (3) doing this may make VAM estimates less precise, or reliable. Put more succinctly, using mathematics and reading/language arts as pretest scores to help measure (e.g., science and social studies) teachers’ value-added effects yields VAM estimates that are less credible (aka less valid), more biased, and less precise (aka less reliable).
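For readers less familiar with how a pretest enters a value-added calculation, the following is a deliberately simplified sketch, not the model the Mathematica researchers used. It assumes a single pretest, an ordinary least squares fit across all students, and a teacher's "value-added" taken as the mean residual of his or her students. The function name, teacher labels, and scores are all hypothetical.

```python
from collections import defaultdict


def residual_value_added(records):
    """Crude value-added sketch (an assumption, not any state's actual model).

    records: list of (teacher, pretest, posttest) tuples.
    Fits posttest = intercept + slope * pretest across all students, then
    returns each teacher's mean residual (actual minus predicted posttest).
    """
    n = len(records)
    xs = [pre for _, pre, _ in records]
    ys = [post for _, _, post in records]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("pretest scores must vary to fit a slope")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x

    residuals = defaultdict(list)
    for teacher, pre, post in records:
        residuals[teacher].append(post - (intercept + slope * pre))
    return {teacher: sum(r) / len(r) for teacher, r in residuals.items()}


# Hypothetical data: (teacher, pretest, posttest) for six students.
records = [
    ("Teacher A", 1, 2), ("Teacher A", 2, 3), ("Teacher A", 3, 4),
    ("Teacher B", 1, 0), ("Teacher B", 2, 1), ("Teacher B", 3, 2),
]
estimates = residual_value_added(records)
print(estimates)
```

Under these assumptions, using a proxy pretest simply means passing a different subject's prior score (e.g., mathematics) in the pretest position while keeping, say, the science posttest. The study summarized above examines what that substitution does to the resulting estimates; this toy example does not reproduce those findings.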

The authors conclude that “some policy makers might interpret [these] findings as firm evidence against using value-added estimates that rely on proxy pretests [may be] too strong. The choice between different evaluation measures always involves trade-offs, and alternatives to value-added estimates [e.g., classroom observations and student learning objectives {SLOs}] also have important limitations.”

Their suggestion, rather, is for “[p]olicymakers [to] reduce the weight given to value-added estimates from models that rely on proxy pretests relative to the weight given to those of other teachers in subjects with pretests.” With all of this, I disagree. Using this or that statistical adjustment, or shrinkage approach, or adjusted weights, or…etc., is, as I said before, at this point frivolous.

Reference: Walsh, E., Dotter, D., & Liu, A. Y. (2018). Can more teachers be covered? The accuracy, credibility, and precision of value-added estimates with proxy pre-tests. Washington DC: Mathematica Policy Research. Retrieved from https://www.mathematica-mpr.com/our-publications-and-findings/publications/can-more-teachers-be-covered-the-accuracy-credibility-and-precision-of-value-added-estimates

Effects of the Los Angeles Times Prior Publications of Teachers’ Value-Added Scores

In one of my older posts (here), I wrote about the Los Angeles Times and its controversial move to solicit Los Angeles Unified School District (LAUSD) students’ test scores via an open-records request, calculate LAUSD teachers’ value-added scores themselves, and then publish thousands of LAUSD teachers’ value-added scores along with their “effectiveness” classifications (e.g., least effective, less effective, average, more effective, and most effective) on their Los Angeles Teacher Ratings website. They have done this repeatedly since 2010, and they have done this all the while despite the major research-based issues surrounding teachers’ value-added estimates (that hopefully followers of this blog know at least somewhat well). This is also of professional frustration for me since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Song) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings against it, Felch and Song thanked me for my time and moved forward.

Just yesterday, the National Education Policy Center (NEPC) at the University of Colorado – Boulder, published a Newsletter in which authors answer the following question, as taken from the Newsletter’s title: “Whatever Happened with the Los Angeles Times’ Decision to Publish Teachers’ Value-Added Scores?” Here is what they found, by summarizing one article and two studies on the topic, although you can also certainly read the full report here.

  • Publishing the scores meant already high-achieving students were assigned to the classrooms of higher-rated teachers the next year, [found a study in the peer-reviewed Economics of Education Review]. That could be because affluent or well-connected parents were able to pull strings to get their kids assigned to those top teachers, or because those teachers pushed to teach the highest-scoring students. In other words, the academically rich got even richer — an unintended consequence of what could be considered a journalistic experiment in school reform.
  • The decision to publish the scores led to: (1) A temporary increase in teacher turnover; (2) Improvements in value-added scores; and (3) No impact on local housing prices.
  • The Los Angeles Times’ analysis erroneously concluded that there was no relationship between value-added scores and levels of teacher education and experience.
  • It failed to account for the fact that teachers are non-randomly assigned to classes in ways that benefit some and disadvantage others.
  • It generated results that changed when Briggs and Domingue tweaked the underlying statistical model [i.e., yielding different value-added estimates and classifications for the same teachers].
  • It produced “a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average).”

After the Los Angeles Times used a different approach in 2011, Catherine Durso found:

  • Class composition varied so much that comparisons of value-added scores of two teachers were only valid if both teachers are assigned students with similar characteristics.
  • Annual fluctuations in results were so large that they led to widely varying conclusions from one year to the next for the same teacher.
  • There was strong evidence that results were often due to the teaching environment, not just the teacher.
  • Some teachers’ scores were based on very little data.

In sum, while “[t]he debate over publicizing value-added scores, so fierce in 2010, has since died down to a dull roar,” more states (e.g., as in New York and Virginia), organizations (e.g., Matt Barnum’s Chalkbeat), and news outlets (e.g., the Los Angeles Times has apparently discontinued this practice, although their website is still live) need to take a stand against or prohibit the publications of individual teachers’ value-added results from here on out. As I noted to Jason Felch and Jason Song a long time ago, this IS simply bad practice.

A Win in New Jersey: Tests to Now Account for 5% of Teachers’ Evaluations

Phil Murphy, the Governor of New Jersey, is keeping his campaign promise to parents, students, and educators, according to a news article just posted by the New Jersey Education Association (NJEA; see here). As per the New Jersey Commissioner of Education, Dr. Lamont Repollet (who was a classroom teacher himself), throughout New Jersey, Partnership for Assessment of Readiness for College and Careers (PARCC) test scores will now account for just 5% of a teacher’s evaluation, which is down from 30% as mandated for approximately five years prior by both Murphy’s and Repollet’s predecessors.
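To make the practical meaning of this change concrete, here is a minimal sketch of how a test-based score's pull on an overall rating shrinks when its weight drops from 30% to 5%. It assumes the overall rating is a simple weighted average of a test-based score and everything else combined. The function name, the rating scale, and the example scores are hypothetical and are not taken from New Jersey's actual evaluation rules.

```python
def composite_rating(test_score, other_score, test_weight):
    """Weighted average of a test-based score and all other measures combined.

    This is an illustrative assumption about how a composite might be formed,
    not New Jersey's actual formula.
    """
    if not 0.0 <= test_weight <= 1.0:
        raise ValueError("test_weight must be between 0 and 1")
    return test_weight * test_score + (1.0 - test_weight) * other_score


# Hypothetical teacher on a hypothetical 1-4 scale:
# a low test-based score but strong ratings on everything else.
test_score, other_score = 1.0, 3.5

at_30_percent = composite_rating(test_score, other_score, 0.30)
at_5_percent = composite_rating(test_score, other_score, 0.05)
print(round(at_30_percent, 3), round(at_5_percent, 3))
```

For this hypothetical teacher, the composite moves from about 2.75 at a 30% weight to about 3.375 at a 5% weight. In other words, the same test-based score costs this teacher far less of the overall rating.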

With this, the New Jersey Department of Education and the Murphy administration have “shown their respect for the research.” Because state law continues to require that standardized test scores play some role in teacher evaluation, a decrease to 5% is a victory, perhaps with a revocation of this law forthcoming.

“Today’s announcement is another step by Gov. Murphy toward keeping a campaign promise to rid New Jersey’s public schools of the scourge of high-stakes testing. While tens of thousands of families across the state have already refused to subject their children to PARCC, schools are still required to administer it and educators are still subject to its arbitrary effects on their evaluation. By dramatically lowering the stakes for the test, Murphy is making it possible for educators and students alike to focus more time and attention on real teaching and learning.” Indeed, “this is a victory of policy over politics, powered by parents and educators.”

Way to go New Jersey!

ACT Also Finds but Discourages States’ Decreased Use of VAMs Post-ESSA

Last June (2018), I released a blog post covering three key findings from a study that I along with two others conducted on states’ revised teacher evaluation systems post the passage of the federal government’s Every Student Succeeds Act (ESSA; see the full study here). In short, we evidenced that (1) the role of growth or value-added models (VAMs) for teacher evaluation purposes is declining across states, (2) many states are embracing the increased local control afforded them via ESSA and no longer have one-size-fits-all teacher evaluation systems, and (3) the rhetoric surrounding teacher evaluation has changed (e.g., language about holding teachers accountable is increasingly less evident than language about teachers’ professional development and support).

Last week, a similar study was released by the ACT standardized testing company. As per the title of this report (see the full report here), they too found that there is a “Shrinking Use of Growth” across states’ “Teacher Evaluation Legislation since ESSA.” They also found that for some states there was a “complete removal” of the state’s teacher evaluation system (e.g., Maine, Washington), a postponement of the state’s teacher evaluation systems until further notice (e.g., Connecticut, Indiana, Tennessee) or a complete prohibition of the use of students’ standardized tests in any teachers’ evaluations moving forward (e.g., Connecticut, Idaho). Otherwise, as we also found, states are increasingly “allowing districts, rather than the state, to determine their evaluation frameworks” themselves.

Unlike in our study, however, ACT (perhaps not surprisingly as a standardized testing company that also advertises its tests’ value-added capacities; see, for example, here) cautions states against “the complete elimination of student growth as part of teacher evaluation systems” in that they have “clearly” (without citations in support) proven “their value and potential value.” Hence, ACT defines this as “a step backward.” In our aforementioned study and blog post we reviewed some of the actual evidence (with citations in support) and would, accordingly, contest all-day-long that these systems have a “clear” “proven” “value.” Hence, we (also perhaps not surprisingly) called our similar findings “steps in the right direction.”

Regardless, in all fairness, they recommend that states not “respond to the challenges of using growth measures in evaluation systems by eliminating their use” but “first consider less drastic measures such as:

  • postponing the use of student growth for employment decisions while refinements to the system can be made;
  • carrying out special studies to better understand the growth model; and/or
  • reviewing evaluation requirements for teachers who teach in untested grades and subjects so that the measures used more accurately reflect their performance.

“Pursuing such refinements, rather than reversing efforts to make teacher evaluation more meaningful and reflective of performance, is the best first step toward improving states’ evaluation systems.”

Citation: Croft, M., Guffy, G., & Vitale, D. (2018). The shrinking use of growth: Teacher evaluation legislation since ESSA. Iowa City, IA: ACT. Retrieved from https://www.act.org/content/dam/act/unsecured/documents/teacher-evaluation-legislation-since-essa.pdf

Fired “Ineffective” Teacher Wins Battle with DC Public Schools

In November of 2013, I published a blog post about a “working paper” released by the National Bureau of Economic Research (NBER) and written by authors Thomas Dee – Economics and Educational Policy Professor at Stanford, and James Wyckoff – Economics and Educational Policy Professor at the University of Virginia. In the study titled “Incentives, Selection, and Teacher Performance: Evidence from IMPACT,” Dee and Wyckoff (2013) analyzed the controversial IMPACT educator evaluation system that was put into place in Washington DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee. In this paper, Dee and Wyckoff (2013) presented what they termed to be “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” via DC’s IMPACT initiative worked to improve the performance of high-performing teachers, and that dismissal threats worked to increase the voluntary attrition of low-performing teachers, as well as improve the performance of the students of the teachers who replaced them.

I critiqued this study in full (see both short and long versions of this critique here), ultimately asserting that the study had “fatal flaws” which compromised the exaggerated claims Dee and Wyckoff (2013) advanced. This past January (2017) they published another report, titled “Teacher Turnover, Teacher Quality, and Student Achievement in DCPS,” which was also (prematurely) released as a “working paper” by the same NBER. I also critiqued this study (see here).

Anyhow, a public interest story that should be of interest to followers of this blog was published two days ago in The Washington Post. The article, “‘I’ve Been a Hostage for Nine Years’: Fired Teacher Wins Battle with D.C. Schools,” details one fired, now 53-year-old, veteran teacher’s last nine years after being one of nearly 1,000 educators fired during the tenure of Michelle Rhee. He was fired after district “leaders,” using the IMPACT system and a teacher evaluation system prior, deemed him “ineffective.” He “contested his dismissal, arguing that he was wrongly fired and that the city was punishing him for being a union activist and for publicly criticizing the school system.” That he made a significant salary at the time (2009) also likely had something to do with it in terms of cost-savings, although this is more peripherally discussed in this piece.

In short, “an arbitrator [just] ruled in favor of the fired teacher, a decision that could entitle him to hundreds of thousands of dollars in back pay and the opportunity to be a District teacher again” although, perhaps not surprisingly, he might not take them up on that offer. As well, apparently this teacher “isn’t the only one fighting to get his job back. Other educators who were fired years ago and allege unjust dismissals [as per the IMPACT system] are waiting for their cases to be settled.” The school system can appeal this ruling.

The Gates Foundation’s Expensive ($335 Million) Teacher Evaluation Missteps

The header of an Education Week article released last week (click here) was that “[t]he Bill & Melinda Gates Foundation’s multi-million-dollar, multi-year effort aimed at making teachers more effective largely fell short of its goal to increase student achievement, including among low-income and minority students.”

An evaluation of the Gates Foundation’s Intensive Partnerships for Effective Teaching initiative funded at $290 million, an extension of its Measures of Effective Teaching (MET) project funded at $45 million, was the focus of this article. The MET project was led by Thomas Kane (Professor of Education and Economics at Harvard and expert witness on the defendant’s side of the ongoing lawsuit supporting New Mexico’s MET project-esque statewide teacher evaluation system; see here and here), and both projects were primarily meant to hold teachers accountable using their students’ test scores via growth or value-added models (VAMs) and financial incentives. Both projects were tangentially meant to improve staffing and professional development opportunities, improve the retention of the teachers of “added value,” and ultimately lead to more-effective teaching and student achievement, especially in low-income schools and schools with higher relative proportions of racial minority students. The six-year evaluation of focus in this Education Week article was conducted by the RAND Corporation and the American Institutes for Research, and the evaluation was also funded by the Gates Foundation (click here for the evaluation report, see below for the full citation of this study).

Their key finding was that Intensive Partnerships for Effective Teaching district/school sites (see them listed here) implemented new measures of teaching effectiveness and modified personnel policies, but they did not achieve their goals for students.

Evaluators also found (see also here):

  • The sites succeeded in implementing measures of effectiveness to evaluate teachers and made use of the measures in a range of human-resource decisions.
  • Every site adopted an observation rubric that established a common understanding of effective teaching. Sites devoted considerable time and effort to train and certify classroom observers and to observe teachers on a regular basis.
  • Every site implemented a composite measure of teacher effectiveness that included scores from direct classroom observations of teaching and a measure of growth in student achievement.
  • Every site used the composite measure to varying degrees to make decisions about human resource matters, including recruitment, hiring, placement, tenure, dismissal, professional development, and compensation.

Overall, the initiative did not achieve its goals for student achievement or graduation, especially for low-income and racial minority students. With minor exceptions, student achievement, access to effective teaching, and dropout rates were also not dramatically better than they were for similar sites that did not participate in the intensive initiative.

Their recommendations were as follows (see also here):

  • Reformers should not underestimate the resistance that could arise if changes to teacher-evaluation systems have major negative consequences.
  • A near-exclusive focus on teacher evaluation systems such as these might be insufficient to improve student outcomes. Many other factors might also need to be addressed, ranging from early childhood education, to students’ social and emotional competencies, to the school learning environment, to family support. Dramatic improvement in outcomes, particularly for low-income and racial minority students, will likely require attention to many of these factors as well.
  • In change efforts such as these, it is important to measure the extent to which each of the new policies and procedures is implemented in order to understand how the specific elements of the reform relate to outcomes.

Reference:

Stecher, B. M., Holtzman, D. J., Garet, M. S., Hamilton, L. S., Engberg, J., Steiner, E. D., Robyn, A., Baird, M. D., Gutierrez, I. A., Peet, E. D., de los Reyes, I. B., Fronberg, K., Weinberger, G., Hunter, G. P., & Chambers, J. (2018). Improving teaching effectiveness: Final report. The Intensive Partnerships for Effective Teaching through 2015–2016. Santa Monica, CA: The RAND Corporation. Retrieved from https://www.rand.org/pubs/research_reports/RR2242.html

States’ Teacher Evaluation Systems Moving in the “Right” Direction

Last week, a technical report that one of my current and one of my former doctoral students helped me to research and write was published by the University of Colorado Boulder’s National Education Policy Center (NEPC). While you can navigate to and read the press release here, as well as download and read the full report here, I thought I would summarize the report’s most interesting facts in this post, for the readers/followers of this blog who are likely more interested in the findings pertaining to states’ revised teacher evaluation systems, post the federal passage of the Every Student Succeeds Act (ESSA).

In short, we collected and analyzed for purposes of this study the 51 (i.e., 50 states plus Washington DC) revised teacher evaluation plans submitted to the federal government post ESSA (i.e., spring/summer of 2017). As specific only to states’ teacher evaluation systems, we had three key findings:

— First, the role of growth or value-added models (VAMs) for teacher evaluation purposes is declining. That is, the proportion of states using statewide growth models or VAMs has decreased from 42% to 30% since 2014. This is certainly a step in the “right,” defined as research-informed, direction. See also Figure 1 below (Close, Amrein-Beardsley, & Collins, 2018, p. 13).

— Second, because ESSA loosened federal control of teacher evaluation, many states no longer have a one-size-fits-all teacher evaluation system. This is allowing local districts to make more choices about models, implementation, execution, and the like, in the contexts of the schools and communities in which schools exist.

— Third, the rhetoric surrounding teacher evaluation has changed: language about holding teachers accountable for their value-added effects, or lack thereof, is much less evident in post-ESSA plans. Rather, new plans make note of providing data to teachers as a means of supporting professional development and improvement, essentially shifting the purpose of the evaluation system away from summative and toward formative use.

We also set forth recommendations for states in this report, as based on the evidence noted above (and presented in much more detail in the full report). The recommendations that also directly pertain to states’ (and districts’) teacher evaluation systems are that states/districts:

  1. Take advantage of decreased federal control by formulating revised assessment policies informed by the viewpoints of as many stakeholders as feasible. Such informed revision can help remedy earlier weaknesses, promote effective implementation, stress correct interpretation, and yield formative information.
  2. Ensure that teacher evaluation systems rely on a balanced system of multiple measures, without disproportionate weight assigned to any one measure as allegedly “superior” to any other. If measures contradict one another, however, output from all measures should be interpreted judiciously.
  3. Emphasize data useful as formative feedback in state systems, so that specific weaknesses in student learning can be identified, targeted and used to inform teachers’ professional development.
  4. Mandate ongoing research and evaluation of state assessment systems and ensure that adequate resources are provided to support [ongoing] evaluation [efforts].
  5. Set goals for reducing proficiency gaps and outline procedures for developing strategies to effectively reduce gaps once they have been identified.

We hope this information helps, especially the states and districts still looking to other states to see what is trending. While we note in the title of this blog post as well as the title of the full report that all of this represents “some steps in the right direction,” there is still much work to be done. This is especially true in states such as New Mexico (see my most recent post about the ongoing lawsuit in this state here) and other states which have yet to give up on the false promises and limited research of such educational policies established almost one decade ago (e.g., Race to the Top; Duncan, 2009).

Citations:

Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems after the passage of the Every Student Succeeds Act: Some steps in the right direction. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/publication/state-assessment

Duncan, A. (2009, July 24). The race to the top begins: Remarks by Secretary Arne Duncan. Retrieved from http://www.ed.gov/news/speeches/2009/07/07242009.html

New Mexico Teacher Evaluation Lawsuit Updates

In December of 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence is to be presented by the state is currently set for this October. See more information about this ruling here.

As the expert witness for the plaintiffs in this case, I was deposed a few weeks ago here in Phoenix, given my analyses of the state’s data (supported by one of my PhD students – Tray Geiger). In short, we found and I testified during the deposition that:

  • In terms of uniformity and fairness, roughly 70% of New Mexico teachers appear to be ineligible to be assessed using VAMs, and this proportion held constant across the years of data analyzed. This matters all the more when VAM-based data are used to make consequential decisions about teachers, because accountability-eligible teachers are also those who are relatively more likely to realize the negative or reap the positive consequences attached to VAM-based estimates.
  • In terms of reliability (or the consistency of teachers’ VAM-based scores over time), approximately 40% of teachers differed by one quintile (quintiles are derived when a sample or population is divided into fifths) and approximately 28% of teachers differed, from year to year, by two or more quintiles in terms of their VAM-derived effectiveness ratings. These results make sense when New Mexico’s results are situated within the current literature, in which teachers classified as “effective” one year can have a 25%-59% chance of being classified as “ineffective” the next, or vice versa, with other permutations also possible.
  • In terms of validity (i.e., concurrent related evidence of validity), and importantly as also situated within the current literature, the correlations between New Mexico teachers’ VAM-based and observational scores ranged from r = 0.153 to r = 0.210. Not only are these correlations very weak[1] in absolute terms, they are also weak relative to the literature, in which correlations between multiple VAMs and observational scores typically range from r = 0.30 to r = 0.50.
  • In terms of bias, New Mexico’s Caucasian teachers had significantly higher observation scores than non-Caucasian teachers, implying, also as per the current research, that Caucasian teachers may be (falsely) perceived as being better teachers than non-Caucasian teachers given bias within these instruments and/or bias of the scorers observing and scoring teachers using these instruments in practice. See prior posts about observational-based bias here, here and here.
  • Also of note in terms of bias was that: (1) teachers with fewer years of experience yielded VAM scores that were significantly lower than those of teachers with more years of experience, with similar patterns noted across teachers’ observation scores, which could all mean, in line with common sense as well as the research, that teachers with more experience are typically better teachers; (2) teachers who taught English language learners (ELLs) or special education students had lower VAM scores across the board than those who did not teach such students; (3) teachers who taught gifted students had significantly higher VAM scores than teachers who did not teach gifted students, which runs counter to the current research evidencing that teachers’ gifted students often thwart or prevent them from demonstrating growth given ceiling effects; (4) teachers in schools with lower relative proportions of ELLs, special education students, students eligible for free-or-reduced lunches, and students from racial minority backgrounds, as well as higher relative proportions of gifted students, consistently had significantly higher VAM scores. These results suggest that teachers in these schools are as a group better, and/or that VAM-based estimates might be biased against teachers not teaching in these schools, preventing them from demonstrating comparable growth.
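For readers curious how the year-to-year quintile comparison in the reliability finding can be computed, here is a minimal sketch in Python. It is not the analysis code used for the affidavit. The function names, the rank-based way of forming quintiles, and the ten made-up scores are all assumptions for illustration only.

```python
from collections import Counter


def quintiles(scores):
    """Assign each score a quintile from 1 (lowest fifth) to 5 (highest fifth) by rank.

    Assumption: ties are broken by position in the list; a real analysis
    would need an explicit tie rule.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    result = [0] * n
    for rank, idx in enumerate(order):
        result[idx] = rank * 5 // n + 1
    return result


def quintile_shift_shares(year1, year2):
    """Share of teachers whose quintile moved by 0, 1, or 2+ between two years."""
    q1, q2 = quintiles(year1), quintiles(year2)
    # Cap the shift at 2 so that every larger move is counted as "two or more".
    counts = Counter(min(abs(a - b), 2) for a, b in zip(q1, q2))
    n = len(q1)
    return {
        "same": counts[0] / n,
        "one": counts[1] / n,
        "two_or_more": counts[2] / n,
    }


# Hypothetical VAM scores for the same ten teachers in two consecutive years.
hypothetical_year1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
hypothetical_year2 = [2, 1, 5, 3, 10, 6, 7, 8, 4, 9]
print(quintile_shift_shares(hypothetical_year1, hypothetical_year2))
```

With real data, the "one" and "two_or_more" shares would be the figures to compare against the approximately 40% and 28% reported above.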

To read more about the data and methods used, as well as other findings, please see my affidavit submitted to the court attached here: Affidavit Feb2018.

Also by way of a recent update, I should note that a few weeks ago, as per an article in the Albuquerque Journal, New Mexico’s teacher evaluation system is now likely to be overhauled, or simply “expired,” as early as 2019. In short, “all three Democrats running for governor and the lone Republican candidate…have expressed misgivings about using students’ standardized test scores to evaluate the effectiveness of [New Mexico’s] teachers, a key component of the current system [at issue in this lawsuit and] imposed by the administration of outgoing Gov. Susana Martinez.” All four candidates described the current system “as fundamentally flawed and said they would move quickly to overhaul it.”

While I/we will proceed with our efforts pertaining to this lawsuit until further notice, this is also important to note at this time in that it seems that New Mexico’s policymakers of new are going to be much wiser than those of late, at least in these regards.

[1] Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.
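The bands in this footnote can be written as a small lookup, sketched here in Python. The function name is hypothetical. Two details are assumptions because the footnote does not settle them: a value sitting exactly on a boundary (e.g., r = 0.2) is placed in the stronger band, and negative correlations are judged by their absolute size.

```python
def interpret_r(r):
    """Label the strength of a correlation using the bands in footnote [1].

    Assumptions: boundary values go to the stronger band, and the sign of r
    is ignored (the footnote lists only non-negative values).
    """
    size = abs(r)
    if size > 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak, if any"


# The New Mexico range reported above, followed by the range typical of the literature.
for r in (0.153, 0.210, 0.30, 0.50):
    print(r, interpret_r(r))
```

Under this boundary rule, 0.153 falls in the lowest band, while 0.210 sits just above the 0.2 cutoff.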

 

New Mexico’s Motion for Summary Judgment, Following Houston’s Precedent-Setting Ruling

Recall that in New Mexico, just over two years ago, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence was to be presented by the state has since been repeatedly postponed, with teacher-level consequences prohibited all the while. See more information about this ruling here.

Recall as well that in Houston, just this past May, a district judge ruled that Houston Independent School District (HISD) teachers who had VAM scores (as based on the Education Value-Added Assessment System (EVAAS)) had legitimate claims that EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). More specifically, in what turned out to be a huge and unprecedented victory, the judge ruled that because HISD teachers “ha[d] no meaningful way to ensure correct calculation of their EVAAS scores,” they were, as a result, “unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.” This ruling ultimately led the district to end the use of the EVAAS for teacher termination throughout Houston. See more information about this ruling here.

Just this past week, plaintiffs in New Mexico charged that the Houston ruling regarding Houston teachers’ Fourteenth Amendment due process protections also applies to teachers throughout the state of New Mexico.

As per an article titled “Motion For Summary Judgment Filed In New Mexico Teacher Evaluation Lawsuit,” the American Federation of Teachers and Albuquerque Teachers Federation filed a “motion for summary judgment in the litigation in our continuing effort to make teacher evaluations beneficial and accurate in New Mexico.” They, too, are “seeking a determination that the [state’s] failure to provide teachers with adequate information about the calculation of their VAM scores violated their procedural due process rights.”

“The evidence demonstrates that neither school administrators nor educators have been provided with sufficient information to replicate the [New Mexico] VAM score calculations used as a basis for teacher evaluations. The VAM algorithm is complex, and the general overview provided in the NMTeach Technical Guide is not enough to pass constitutional muster. During previous hearings, educators testified they do not receive an explanation at the time they receive their annual evaluation, and teachers have been subjected to performance growth plans based on low VAM scores, without being given any guidance or explanation as to how to raise that score on future evaluations. Thus, not only do educators not understand the algorithm used to derive the VAM score that is now part of the basis for their overall evaluation rating, but school administrators within the districts do not have sufficient information on how the score is derived in order to replicate it or to provide professional development, whether as part of a disciplinary scenario or otherwise, to assist teachers in raising their VAM score.”

For more information about this update, please click here.