Learning from What Doesn’t Work in Teacher Evaluation

One of my doctoral students — Kevin Close — and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share out with all of you, especially before the study is no longer open-access or free (see full study as currently available here). As the title indicates, the study is about how states, school districts, and schools can “Learn from What Doesn’t Work in Teacher Evaluation,” given an analysis that the two of us conducted of all documents pertaining to the four teacher evaluation and value-added model (VAM)-centered lawsuits in which I have been directly involved, and that I have also covered in this blog. These lawsuits include Lederman v. King in New York (see here), American Federation of Teachers et al. v. Public Education Department in New Mexico (see here), Houston Federation of Teachers v. Houston Independent School District in Texas (see here), and Trout v. Knox County Board of Education in Tennessee (see here).

Via this analysis we set out to comb through the legal documents to identify the strongest objections, as also recognized by the courts in these lawsuits, to VAMs as teacher measurement and accountability strategies. “The lessons to be learned from these cases are both important and timely” given that “[u]nder the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work.”

The most pertinent and also common issues as per these cases were as follows:

(1) Inconsistencies in teachers’ VAM-based estimates from one year to the next that are sometimes “wildly different.” Across these lawsuits, issues with reliability were very evident, whereas teachers classified as “effective” one year were either theorized or demonstrated to have around a 25%-59% chance of being classified as “ineffective” the next year, or vice versa, with other permutations also possible. As per our profession’s Standards for Educational and Psychological Testing, reliability should, rather, be observed whereby VAM estimates of teacher effectiveness are more or less consistent over time, from one year to the next, regardless of the type of students and perhaps subject areas that teachers teach.

(2) Bias in teachers’ VAM-based estimates were also of note, whereby documents suggested or evidenced that bias, or rather biased estimates of teachers’ actual effects does indeed exist (although this area was also of most contention and dispute). Specific to VAMs, since teachers are not randomly assigned the students they teach, whether their students are invariably more or less motivated, smart, knowledgeable, or capable can bias students’ test-based data, and teachers’ test-based data when aggregated. Court documents, although again not without counterarguments, suggested that VAM-based estimates are sometimes biased, especially when relatively homogeneous sets of students (i.e., English Language Learners (ELLs), gifted and special education students, free-or-reduced lunch eligible students) are non-randomly concentrated into schools, purposefully placed into classrooms, or both. Research suggests that this also sometimes happens regardless of the the sophistication of the statistical controls used to block said bias.

(3) The gaming mechanisms in play within teacher evaluation systems in which VAMs play a key role, or carry significant evaluative weight, were also of legal concern and dispute. That administrators sometimes inflate the observational ratings of their teachers whom they want to protect, while simultaneously offsetting the weight the VAMs sometimes carry was of note, as was the inverse. That administrators also sometimes lower teachers’ ratings to better align them with their “more objective” VAM counterparts were also at issue. “So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their ‘subjective’ classroom ratings were in sync with the VAM’s ‘objective’ scores.” Both sets of behavior distort the validity (or “truthfulness”) of any teacher evaluation system and are in violation of the same, aforementioned Standards for Educational and Psychological Testing that call for VAM scores and observation ratings to be kept separate. One indicator should never be adjusted to offset or to fit the other.

(4) Transparency, or the lack thereof, was also a common issue across cases. Transparency, which can be defined as the extent to which something is accessible and readily capable of being understood, pertains to whether VAM-based estimates are accessible and make sense to those at the receiving ends. “Not only should [teachers] have access to [their VAM-based] information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.” In no case was this more legally pertinent than in Houston Federation of Teachers v. Houston Independent School District in Texas. Here, the presiding judge ruled that teachers did have “legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well.”

In the main article (here) we also discuss what states are now doing to (hopefully) improve upon their teacher evaluation systems in terms of using multiple measures to help to evaluate teachers more holistically. We emphasize the (in)formative versus the summative and high-stakes functions of such systems, and allowing teachers to take ownership over such systems in their development and implementation. I will leave you all to read the full article (here) for these details.

In sum, though, when rethinking states’ teacher evaluation systems, especially given the new liberties afforded to states via the Every Student Succeeds Act (ESSA), educators, education leaders, policymakers, and the like would do well to look to the past for guidance on what not to do — and what to do better. These legal cases can certainly inform such efforts.

Reference: Close, K., & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan, 100(1), 15-19. Retrieved from http://www.kappanonline.org/learning-from-what-doesnt-work-in-teacher-evaluation/

Can More Teachers Be Covered Using VAMs?

Some researchers continue to explore the potential worth of value-added models (VAMs) for measuring teacher effectiveness. Not that I endorse the perpetual tweaking of this or twisting of that to explore how VAMs might be made “better” for such purposes, also given the abundance of decades research we now have evidencing the plethora of problems with using VAMs for such purposes, I do try to write about current events including current research published on this topic for this blog. Hence, I write here about a study researchers from Mathematica Policy Research released last month, about whether more teachers might be VAM-eligible (download the full study here).

One of the main issues with VAMs is that they can typically be used to measure the effects of only approximately 30% of all public school teachers. The other 70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers) or teachers who do not teach the core subject areas assessed using large-scale standardized tests (e.g., mathematics and reading/language arts) cannot be evaluated or held accountable using VAM data. This is more generally termed an issue with fairness, defined by our profession’s Standards for Educational and Psychological Testing as the impartiality of “test score interpretations for intended use(s) for individuals from all [emphasis added] relevant subgroups” (p. 219). Issues of fairness arise when a test, or test-based inference or use impacts some more than others in unfair or prejudiced, yet often consequential ways.

Accordingly, in this study researchers explored whether VAMs can be used to evaluate teachers of subject areas that are only tested occasionally and in non-consecutive grade levels (e.g., science and social studies, for example, in grades 4 and 7 or 5 and 8) using teachers’ students’ other, consecutively administered subject area tests (i.e., mathematics and reading/language arts) can be used to help isolate teachers’ contributions to students’ achievement in said excluded subject areas. Indeed, it is true that “states and districts have little information about how value-added models [VAMs] perform in grades when tests in the same subject are not available from the previous year.” Yet, states (e.g., New Mexico) continue to do this without evidence that it works. This is also one point of contention in the ongoing lawsuit there. Hence, the purpose of this study was to explore (using state-level data from Oklahoma) how well doing this works, again, given the use of such proxy pretests “could allow states and districts to increase the number of teachers for whom value-added models [could] be used” (i.e., increase fairness).

However, researchers found that when doing just this (1) VAM estimates that do not account for a same-subject pretests may be less credible than estimates that use same-subject pretests from prior and adjacent grade levels (note that authors do not explicitly define what they mean by credible but infer the term to be synonymous with valid). In addition, (2) doing this may subsequently lead to relatively more biased VAM estimates, even more so than changing some other features of VAMs, and (3) doing this may make VAM estimates less precise, or reliable. Put more succinctly, using mathematics and reading/language arts as pretest scores to help measure (e.g., science and social studies) teachers’ value-added effects yields VAM estimates that are less credible (aka less valid), more biased, and less precise (aka less reliable).

The authors conclude that “some policy makers might interpret [these] findings as firm evidence against using value-added estimates that rely on proxy pretests [may be] too strong. The choice between different evaluation measures always involves trade-offs, and alternatives to value-added estimates [e.g., classroom observations and student learning objectives {SLOs)] also have important limitations.”

Their suggestion, rather, is for “[p]olicymakers [to] reduce the weight given to value-added estimates from models that rely on proxy pretests relative to the weight given to those of other teachers in subjects with pretests.” With all of this, I disagree. Using this or that statistical adjustment, or shrinkage approach, or adjusted weights, or…etc., is as I said before, at this point frivolous.

Reference: Walsh, E., Dotter, D., & Liu, A. Y. (2018). Can more teachers be covered? The accuracy, credibility, and precision of value-added estimates with proxy pre-tests. Washington DC: Mathematica Policy Research. Retrieved from https://www.mathematica-mpr.com/our-publications-and-findings/publications/can-more-teachers-be-covered-the-accuracy-credibility-and-precision-of-value-added-estimates

Effects of the Los Angeles Times Prior Publications of Teachers’ Value-Added Scores

In one of my older posts (here), I wrote about the Los Angeles Times and its controversial move to solicit Los Angeles Unified School District (LAUSD) students’ test scores via an open-records request, calculate LAUSD teachers’ value-added scores themselves, and then publish thousands of LAUSD teachers’ value-added scores along with their “effectiveness” classifications (e.g., least effective, less effective, average, more effective, and most effective) on their Los Angeles Teacher Ratings website. They did this, repeatedly, since 2010, and they have done this all the while despite the major research-based issues surrounding teachers’ value-added estimates (that hopefully followers of this blog know at least somewhat well). This is also of professional frustration for me since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Strong) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings against it, Felch and Song thanked me for my time and moved forward.

Just yesterday, the National Education Policy Center (NEPC) at the University of Colorado – Boulder, published a Newsletter in which authors answer the following question, as taken from the Newsletter’s title: “Whatever Happened with the Los Angeles Times’ Decision to Publish Teachers’ Value-Added Scores?” Here is what they found, by summarizing one article and two studies on the topic, although you can also certainly read the full report here.

  • Publishing the scores meant already high-achieving students were assigned to the classrooms of higher-rated teachers the next year, [found a study in the peer-reviewed Economics of Education Review]. That could be because affluent or well-connected parents were able to pull strings to get their kids assigned to those top teachers, or because those teachers pushed to teach the highest-scoring students. In other words, the academically rich got even richer — an unintended consequence of what could be considered a journalistic experiment in school reform.
  • The decision to publish the scores led to: (1) A temporary increase in teacher turnover; (2) Improvements
    in value-added scores; and (3) No impact on local housing prices.
  • The Los Angeles Times’ analysis erroneously concluded that there was no relationship between value-added scores and levels of teacher education and experience.
  • It failed to account for the fact that teachers are non-randomly assigned to classes in ways that benefit some and disadvantage others.
  • It generated results that changed when Briggs and Domingue tweaked the underlying statistical model [i.e., yielding different value-estimates and classifications for the same teachers].
  • It produced “a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average).”

After the Los Angeles Times’ used a different approach in 2011, Catherine Durso found:

  • Class composition varied so much that comparisons of value-added scores of two teachers were only valid if both teachers are assigned students with similar characteristics.
  • Annual fluctuations in results were so large that they lead to widely varying conclusions from one year to the next for the same teacher.
  • There was strong evidence that results were often due to the teaching environment, not just the teacher.
  • Some teachers’ scores were based on very little data.

In sum, while “[t]he debate over publicizing value-added scores, so fierce in 2010, has since died down to
a dull roar,” more states (e.g., like in New York and Virginia), organizations (e.g., like Matt Barnum’s Chalbeat), and news outlets (e.g., the Los Angeles Times has apparently discontinued this practice, although their website is still live) need to take a stand against or prohibit the publications of individual teachers’ value-added results from hereon out. As I noted to Jason Felch and Jason Strong a long time ago, this IS simply bad practice.

A Win in New Jersey: Tests to Now Account for 5% of Teachers’ Evaluations

Phil Murphy, the Governor of New Jersey, is keeping his campaign promise to parents, students, and educators, according to a news article just posted by the New Jersey Education Association (NJEA; see here). As per the New Jersey Commissioner of Education – Dr. Lamont Repollet, who was a classroom teacher himself — throughout New Jersey, Partnership for Assessment of Readiness for College and Careers (PARCC) test scores will now account for just 5% of a teacher’s evaluation, which is down from 30% as mandated for approxunatelt five years prior by both Murphy’s and Repollet’s predecessors.

Alas, the New Jersey Department of Education and the Murphy administration have “shown their respect for the research.” Because state law continues to require that standardized test scores play some role in teacher evaluation, a decrease to 5% is a victory, perhaps with a revocation of this law forthcoming.

“Today’s announcement is another step by Gov. Murphy toward keeping a campaign promise to rid New Jersey’s public schools of the scourge of high-stakes testing. While tens of thousands of families across the state have already refused to subject their children to PARCC, schools are still required to administer it and educators are still subject to its arbitrary effects on their evaluation. By dramatically lowering the stakes for the test, Murphy is making it possible for educators and students alike to focus more time and attention on real teaching and learning.” Indeed, “this is a victory of policy over politics, powered by parents and educators.”

Way to go New Jersey!

New Mexico Loses Major Education Finance Lawsuit (with Rulings Related to Teacher Evaluation System)

Followers of this blog should be familiar with the ongoing teacher evaluation lawsuit in New Mexico. The lawsuit — American Federation of Teachers – New Mexico and the Albuquerque Federation of Teachers (Plaintiffs) v. New Mexico Public Education Department (Defendants) — is being heard by a state judge who ruled in 2015 that all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were to be suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. This case is set to be heard in court again this November (see more about this case from my most recent update here).

While this lawsuit has been occurring, however, it is important to note that two other very important New Mexico cases (that have since been consolidated into one) have been ongoing since around the same time (2014) — Martinez v. State of New Mexico and Yazzie v. State of New Mexico. Plaintiffs in this lawsuit, filed by the New Mexico Center on Law and Poverty and the Mexican American Legal Defense and Education Fund (MALDEF), argued that the state’s schools are inadequately funded; hence, the state is also denying New Mexico students their constitutional rights to an adequate education.

Last Friday, a different state judge presiding over this case ruled, “in a blistering, landmark decision,” that New Mexico is in fact :violating the constitutional rights of at-risk students by failing to provide them with a sufficient education.” As such, the state, its governor, and its public education department (PED) are “to establish a funding system that meets constitutional requirements by April 15 [of] next year” (see full article here).

As this case does indeed pertain to the above mentioned teacher evaluation lawsuit of interest within this blog, it is also important to note that the judge:

  • “[R]ejected arguments by [Governor] Susana Martinez’s administration that the education system is improving…[and]…that the state was doing the best with what it had” (see here).
  • Emphasized that “New Mexico children [continue to] rank at the very bottom in the country for educational achievement” (see here).
  • Added that “New Mexico doesn’t have enough teachers…[and]…New Mexico teachers are among the lowest paid in the country” (see here).
  • “[S]uggested the state teacher evaluation system ‘may be contributing to the lower quality of teachers in high-need schools…[also given]…punitive teacher evaluation systems that penalize teachers for working in high-need schools contribute to problem in this category of schools” (see here).
  • And concluded that all of “the programs being lauded by PED are not changing this [bleak] picture” (see here) and, more specifically, “offered a scathing assessment of the ways in which New Mexico has failed its children,” again, taking “particular aim at the state’s punitive teacher evaluation system” (see here).

Apparently, the state plans to appeal the decision (see a related article here).

Fired “Ineffective” Teacher Wins Battle with DC Public Schools

In November of 2013, I published a blog post about a “working paper” released by the National Bureau of Economic Research (NBER) and written by authors Thomas Dee – Economics and Educational Policy Professor at Stanford, and James Wyckoff – Economics and Educational Policy Professor at the University of Virginia. In the study titled “Incentives, Selection, and Teacher Performance: Evidence from IMPACT,” Dee and Wyckoff (2013) analyzed the controversial IMPACT educator evaluation system that was put into place in Washington DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee. In this paper, Dee and Wyckoff (2013) presented what they termed to be “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” via DC’s IMPACT initiative worked to improve the performance of high-performing teachers, and that dismissal threats worked to increase the voluntary attrition of low-performing teachers, as well as improve the performance of the students of the teachers who replaced them.

I critiqued this study in full (see both short and long versions of this critique here), ultimately asserting that the study had “fatal flaws” which compromised the exaggerated claims Dee and Wyckoff (2013) advanced. This past January (2017) they published another report, titled “Teacher Turnover, Teacher Quality, and Student Achievement in DCPS,” which was also (prematurely) released as a “working paper” by the same NBER. I also critiqued this study here).

Anyhow, a public interest story that should be of interest to followers of this blog was published two days ago in The Washington Post. The article, “I’ve Been a Hostage for Nine Years’: Fired Teacher Wins Battle with D.C. Schools,” details one fired, now 53-year old, veteran’s teachers last nine years after being one of nearly 1,000 educators fired during the tenure of Michelle Rhee. He was fired after district “leaders,” using the IMPACT system and a teacher evaluation system prior, deemed him “ineffective.” He “contested his dismissal, arguing that he was wrongly fired and that the city was punishing him for being a union activist and for publicly criticizing the school system.” That he made a significant salary at the time (2009) also likely had something to do with it in terms of cost-savings, although this is more peripherally discussed in this piece.

In short, “an arbitrator [just] ruled in favor of the fired teacher, a decision that could entitle him to hundreds of thousands of dollars in back pay and the opportunity to be a District teacher again” although, perhaps not surprisingly, he might not take them up on that  offer. As well, apparently this teacher “isn’t the only one fighting to get his job back. Other educators who were fired years ago and allege unjust dismissals [as per the IMPACT system] are waiting for their cases to be settled.” The school system can appeal this ruling.

The Gates Foundation’s Expensive ($335 Million) Teacher Evaluation Missteps

The header of an Education Week article released last week (click here) was that “[t]he Bill & Melinda Gates Foundation’s multi-million-dollar, multi-year effort aimed at making teachers more effective largely fell short of its goal to increase student achievement-including among low-income and minority students.”

An evaluation of Gates Foundation’s Intensive Partnerships for Effective Teaching initiative funded at $290 million, an extension of its Measures of Effective Teaching (MET) project funded at $45 million, was the focus of this article. The MET project was lead by Thomas Kane (Professor of Education and Economics at Harvard, former leader of the MET project, and expert witness on the defendant’s side of the ongoing lawsuit supporting New Mexico’s MET project-esque statewide teacher evaluation system; see here and here), and both projects were primarily meant to hold teachers accountable using their students test scores via growth or value-added models (VAMs) and financial incentives. Both projects were tangentially meant to improve staffing, professional development opportunities, improve the retention of the teachers of “added value,” and ultimately lead to more-effective teaching and student achievement, especially in low-income schools and schools with higher relative proportions of racial minority students. The six-year evaluation of focus in this Education Week article was conducted by the RAND Corporation and the American Institutes for Research, and the evaluation was also funded by the Gates Foundation (click here for the evaluation report, see below for the full citation of this study).

Their key finding was that Intensive Partnerships for Effective Teaching district/school sites (see them listed here) implemented new measures of teaching effectiveness and modified personnel policies, but they did not achieve their goals for students.

Evaluators also found (see also here):

  • The sites succeeded in implementing measures of effectiveness to evaluate teachers and made use of the measures in a range of human-resource decisions.
  • Every site adopted an observation rubric that established a common understanding of effective teaching. Sites devoted considerable time and effort to train and certify classroom observers and to observe teachers on a regular basis.
  • Every site implemented a composite measure of teacher effectiveness that included scores from direct classroom observations of teaching and a measure of growth in student achievement.
  • Every site used the composite measure to varying degrees to make decisions about human resource matters, including recruitment, hiring, placement, tenure, dismissal, professional development, and compensation.

Overall, the initiative did not achieve its goals for student achievement or graduation, especially for low-income and racial minority students. With minor exceptions, student achievement, access to effective teaching, and dropout rates were also not dramatically better than they were for similar sites that did not participate in the intensive initiative.

Their recommendations were as follows (see also here):

  • Reformers should not underestimate the resistance that could arise if changes to teacher-evaluation systems have major negative consequences.
  • A near-exclusive focus on teacher evaluation systems such as these might be insufficient to improve student outcomes. Many other factors might also need to be addressed, ranging from early childhood education, to students’ social and emotional competencies, to the school learning environment, to family support. Dramatic improvement in outcomes, particularly for low-income and racial minority students, will likely require attention to many of these factors as well.
  • In change efforts such as these, it is important to measure the extent to which each of the new policies and procedures is implemented in order to understand how the specific elements of the reform relate to outcomes.

Reference:

Stecher, B. M., Holtzman, D. J., Garet, M. S., Hamilton, L. S., Engberg, J., Steiner, E. D., Robyn, A., Baird, M. D., Gutierrez, I. A., Peet, E. D., de los Reyes, I. B., Fronberg, K., Weinberger, G., Hunter, G. P., & Chambers, J. (2018). Improving teaching effectiveness: Final report. The Intensive Partnerships for Effective Teaching through 2015–2016. Santa Monica, CA: The RAND Corporation. Retrieved from https://www.rand.org/pubs/research_reports/RR2242.html

New Mexico Teacher Evaluation Lawsuit Updates

In December of 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence is to be presented by the state is currently set for this October. See more information about this ruling here.

As the expert witness for the plaintiffs in this case, I was deposed a few weeks ago here in Phoenix, given my analyses of the state’s data (supported by one of my PhD students – Tray Geiger). In short, we found and I testified during the deposition that:

  • In terms of uniformity and fairness, there seem to be 70% or so of New Mexico teachers who are ineligible to be assessed using VAMs, and this proportion held constant across the years of data analyzed. This is even more important to note knowing that when VAM-based data are to be used to make consequential decisions about teachers, issues with fairness and uniformity become even more important given accountability-eligible teachers are also those who are relatively more likely to realize the negative or reap the positive consequences attached to VAM-based estimates.
  • In terms of reliability (or the consistency of teachers’ VAM-based scores over time), approximately 40% of teachers differed by one quintile (quintiles are derived when a sample or population is divided into fifths) and approximately 28% of teachers differed, from year-to-year, by two or more quintiles in terms of their VAM-derived effectiveness ratings. These results make sense when New Mexico’s results are situated within the current literature, whereas teachers classified as “effective” one year can have a 25%-59% chance of being classified as “ineffective” the next, or vice versa, with other permutations also possible.
  • In terms of validity (i.e., concurrent related evidence of validity), and importantly as also situated within the current literature, the correlations between New Mexico teachers’ VAM-based and observational scores ranged from r = 0.153 to r = 0.210. Not only are these correlations very weak[1], they are also very weak as appropriately situated within the literature, via which it is evidenced that correlations between multiple VAMs and observational scores typically range from 0.30 ≤ r ≤ 0.50.
  • In terms of bias, New Mexico’s Caucasian teachers had significantly higher observation scores than non-Caucasian teachers implying, also as per the current research, that Caucasian teachers may be (falsely) perceived as being better teachers than non-Caucasians teachers given bias within these instruments and/or bias of the scorers observing and scoring teachers using these instruments in practice. See prior posts about observational-based bias here, here and here.
  • Also of note in terms of bias was that: (1) teachers with fewer years of experience yielded VAM scores that were significantly lower than teachers with more years of experience, with similar patterns noted across teachers’ observation scores, which could all mean, as also in line with common sense as well as the research, that teachers with more experience are typically better teachers; (2) teachers who taught English language learners (ELLs) or special education students had lower VAM scores across the board than those who did not teach such students; (3) teachers who taught gifted students had significantly higher VAM scores than non-gifted teachers which runs counter to the current research evidencing that teachers’ gifted students oft-thwart or prevent them from demonstrating growth given ceiling effects; (4) teachers in schools with lower relative proportions of ELLs, special education students, students eligible for free-or-reduced lunches, and students from racial minority backgrounds, as well as higher relative proportions of gifted students, consistently had significantly higher VAM scores. These results suggest that teachers in these schools are as a group better, and/or that VAM-based estimates might be biased against teachers not teaching in these schools, preventing them from demonstrating comparable growth.

To read more about the data and methods used, as well as other findings, please see my affidavit submitted to the court attached here: Affidavit Feb2018.

Although, also in terms of a recent update, I should also note that a few weeks ago, as per an article in the AlbuquerqueJournal, New Mexico’s teacher evaluation systems is now likely to be overhauled, or simply “expired” as early as 2019. In short, “all three Democrats running for governor and the lone Republican candidate…have expressed misgivings about using students’ standardized test scores to evaluate the effectiveness of [New Mexico’s] teachers, a key component of the current system [at issue in this lawsuit and] imposed by the administration of outgoing Gov. Susana Martinez.” All four candidates described the current system “as fundamentally flawed and said they would move quickly to overhaul it.”

While I/we will proceed our efforts pertaining to this lawsuit until further notice, this is also important to note at this time in that it seems that New Mexico’s policymakers of new are going to be much wiser than those of late, at least in these regards.

[1] Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.

 

New Mexico’s Motion for Summary Judgment, Following Houston’s Precedent-Setting Ruling

Recall that in New Mexico, just over two years ago, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence was to be presented by the state was repeatedly postponed since, yet with teacher-level consequences prohibited all the while. See more information about this ruling here.

Recall as well that in Houston, just this past May, that a district judge ruled that Houston Independent School District (HISD) teachers’ who had VAM scores (as based on the Education Value-Added Assessment System (EVAAS)) had legitimate claims regarding how EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). More specifically, in what turned out to be a huge and unprecedented victory, the judge ruled that because HISD teachers “ha[d] no meaningful way to ensure correct calculation of their EVAAS scores,” they were, as a result, “unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.” This ruling ultimately led the district to end the use of the EVAAS for teacher termination throughout Houston. See more information about this ruling here.

Just this past week, New Mexico charged that the Houston ruling regarding Houston teachers’ Fourteenth Amendment due process protections also applies to teachers throughout the state of New Mexico.

As per an article titled “Motion For Summary Judgment Filed In New Mexico Teacher Evaluation Lawsuit,” the American Federation of Teachers and Albuquerque Teachers Federation filed a “motion for summary judgment in the litigation in our continuing effort to make teacher evaluations beneficial and accurate in New Mexico.” They, too, are “seeking a determination that the [state’s] failure to provide teachers with adequate information about the calculation of their VAM scores violated their procedural due process rights.”

“The evidence demonstrates that neither school administrators nor educators have been provided with sufficient information to replicate the [New Mexico] VAM score calculations used as a basis for teacher evaluations. The VAM algorithm is complex, and the general overview provided in the NMTeach Technical Guide is not enough to pass constitutional muster. During previous hearings, educators testified they do not receive an explanation at the time they receive their annual evaluation, and teachers have been subjected to performance growth plans based on low VAM scores, without being given any guidance or explanation as to how to raise that score on future evaluations. Thus, not only do educators not understand the algorithm used to derive the VAM score that is now part of the basis for their overall evaluation rating, but school administrators within the districts do not have sufficient information on how the score is derived in order to replicate it or to provide professional development, whether as part of a disciplinary scenario or otherwise, to assist teachers in raising their VAM score.”

For more information about this update, please click here.

Bias in VAMs, According to Validity Expert Michael T. Kane

During the still ongoing, value-added lawsuit in New Mexico (see my most recent update about this case here), I was honored to testify as the expert witness on behalf of the plaintiffs (see, for example, here). I was also fortunate to witness the testimony of the expert witness who testified on behalf of the defendants – Thomas Kane, Economics Professor at Harvard and former Director of the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. During Kane’s testimony, one of the highlights (i.e., for the plaintiffs), or rather the low-lights (i.e., for him and the defendants), in my opinion, was when one of the plaintiff’s attorney’s questioned Kane, on the stand, about his expertise in the area of validity. In sum, Kane responded that he defined himself as an “expert” in the area, having also been trained by some of the best. Consequently, the plaintiff’s attorney’s questioned Kane about different types of validity evidences (e.g., construct, content, criterion), and Kane could not answer those questions. The only form of validity evidence with which he was familiar, and which he could clearly define, was evidence related to predictive validity. This hardly made him the expert he proclaimed himself to be minutes prior.

Let’s not mince words, though, or in this case names.

A real expert in validity (and validity theory) is another Kane, who goes by the full name of Michael T. Kane. This Kane is The Samuel J. Messick Chair in Test Validity at the Educational Testing Service (ETS); this Kane wrote one of the best, most contemporary, and currently most foundational papers on validity (see here); and this Kane just released an ETS-sponsored paper on Measurement Error and Bias in Value-Added Models certainly of interest here. I summarize this piece below (see the PDF of this report here).

In this paper Kane examines “the origins of [value-added model (VAM)-based] bias and its potential impact” and indicates that bias that is observed “is an increasing linear function of the student’s prior achievement and can be quite large (e.g., half a true-score standard deviation) for very low-scoring and high-scoring students [i.e., students in the extremes of any normal distribution]” (p. 1). Hence, Kane argues, “[t]o the extent that students with relatively low or high prior scores are clustered in particular classes and schools, the student-level bias will tend to generate bias in VAM estimates of teacher and school effects” (p. 1; see also prior posts about this type of bias here, here, and here; see also Haertel (2013) cited below). Kane concludes that “[a]djusting for this bias is possible, but it requires estimates of generalizability (or reliability) coefficients that are more accurate and precise than those that are generally available for standardized achievement tests” (p. 1; see also prior posts about issues with reliability across VAMs here, here, and here).

Kane’s more specific points of note:

  • To accurately calculate teachers’/schools’ value-added, “current and prior scores have to be on the same scale (or on vertically aligned scales) for the differences to make sense. Furthermore, the scale has to be an interval scale in the sense that a difference of a certain number of points has, at least approximately, the same meaning along the scale, so that it makes sense to compare gain scores from different parts of the scale…some uncertainty about scale characteristics is not a problem for many applications of vertical scaling, but it is a serious problem if the proposed use of the scores (e.g., educational accountability based on growth scores) demands that the vertical scale be demonstrably equal interval” (p. 1).
  • Likewise, while some approaches can be used to minimize the need for such scales (e.g., residual gain scores, covariate-adjustment models, and ordinary least squares (OLS) regression approaches which are of specific interest in this piece), “it is still necessary to assume [emphasis added] that a difference of a certain number of points has more or less the same meaning along the score scale for the current test scores” (p. 2).
  • Related, “such adjustments can [still] be biased to the extent that the predicted score does not include all factors that may have an impact on student performance. Bias can also result from errors of measurement in the prior scores included in the prediction equation…[and this can be]…substantial” (p. 2).
  • Accordingly, “gains for students with high true scores on the prior year’s test will be overestimated, and the gains for students with low true scores in the prior year will be underestimated. To the extent that students with relatively low and high true scores tend to be clustered in particular classes and schools, the student-level bias will generate bias in estimates of teacher and school effects” (p. 2).
  • Hence, if not corrected, this source of bias could have a substantial negative impact on estimated VAM scores for teachers and schools that serve students with low prior true scores and could have a substantial positive impact for teachers and schools that serve mainly high-performing students” (p. 2).
  • Put differently, random errors in students’ prior scores may “tend to add a positive bias to the residual gain scores for students with prior scores above the population mean, and they [may] tend to add a negative bias to the residual gain scores for students with prior scores below the mean. Th[is] bias is associated with the well-known phenomenon of regression to the mean” (p. 10).
  • Although, at least this latter claim — that students with relatively high true scores in the prior year could substantially and positively impact their teachers’/schools value-added estimates — does run somewhat contradictory to other claims as evidenced in the literature in terms of the extent to which ceiling effects substantially and negatively impact their teachers’/schools value-added estimates (see, for example, Point #7 as per the ongoing lawsuit in Houston here, and see also Florida teacher Luke Flint’s “Story” here).
  • In sum, and as should be a familiar conclusion to followers of this blog, “[g]iven that the results of VAMs may be used for high-stakes decisions about teachers and schools in the context of accountability programs,…any substantial source of bias would be a matter of great concern” (p. 2).

Citation: Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton, NJ: Educational Testing Service (ETS) Research Report Series. doi:10.1002/ets2.12153 Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full

See also Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).