Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)

David C. Berliner, Regents’ Professor Emeritus here at Arizona State University (ASU), who also happens to be my former, albeit forever, mentor, recently took up research on the use of test scores to evaluate teachers, for example, using value-added models (VAMs). While David is world-renowned for his research in educational psychology, and more specific to this case, for his expertise on effective teaching behaviors and how to capture and observe them, he has also now ventured into the VAM-related debates.

Accordingly, he recently presented his newest, soon-to-be-published research on using standardized tests to evaluate teachers, something he aptly termed in the title of his presentation “A Policy Fiasco.” He delivered the speech to an audience in Melbourne, Australia, and you can click here for the full videotaped presentation. Given that the whole presentation takes about one hour to watch (although I must say the full hour is well worth it), I highlight below his key points. These should certainly be of interest to you all as followers of this blog, and hopefully others.

Of main interest are his 14 reasons, “big and small,” for his “judgment that assessing teacher competence using standardized achievement tests is nearly worthless.”

Here are his 14 reasons:

  1. “When using standardized achievement tests as the basis for inferences about the quality of teachers, and the institutions from which they came, it is easy to confuse the effects of sociological variables on standardized test scores” and the effects teachers have on those same scores. Sociological variables (e.g., chronic absenteeism) continue to distort others’ even best attempts to disentangle them from the very instructional variables of interest. These variables, which we also term biasing variables, should not be inappropriately dismissed as purportedly “controlled for” statistically.
  2. In law, we do not hold people accountable for the actions of others, for example, when a child kills another child and the parents are not charged as guilty. Hence, “[t]he logic of holding [teachers and] schools of education responsible for student achievement does not fit into our system of law or into the moral code subscribed to by most western nations.” Relatedly, should medical schools, or doctors for that matter, be held accountable for the health of their patients? One of the best parts of his talk, in fact, is about the medical field and the corollaries Berliner draws between doctors and medical schools, and teachers and colleges of education, respectively (around the 19-25 minute mark of his video presentation).
  3. Professionals are often held harmless for their lower success rates with clients who have observable difficulties in meeting the demands and the expectations of the professionals who attend to them. In medicine again, for example, when working with impoverished patients, “[t]here is precedent for holding [doctors] harmless for their lowest success rates with clients who have observable difficulties in meeting the demands and expectations of the [doctors] who attend to them, but the dispensation we offer to physicians is not offered to teachers.”
  4. There are other quite acceptable sources of data, besides tests, for judging the efficacy of teachers and teacher education programs. “People accept the fact that treatment and medicine may not result in the cure of a disease. Practicing good medicine is the goal, whether or not the patient gets better or lives.” It is equally true that competent teaching can occur independent of student learning or of the achievement test scores that serve as proxies for said learning. A teacher can literally “save lives” and not move the metrics used to measure teacher effectiveness.
  5. Reliance on standardized achievement test scores as the source of data about teacher quality will inevitably promote confusion between “successful” instruction and “good” instruction. “Successful” instruction gets test scores up. “Good” instruction leaves lasting impressions, fosters further interest by the students, makes them feel competent in the area, etc. Good instruction is hard to measure, but remains the goal of our finest teachers.
  6. Relatedly, teachers affect individual students greatly, but affect standardized achievement test scores very little. All can think of how their own teachers impacted their lives in ways that cannot be captured on a standardized achievement test. Standardized achievement test scores are much more related to home, neighborhood, and cohort than they are to teachers’ instructional capabilities. In more contemporary terms, this is also due to the fact that large-scale standardized tests have (still) never been validated to measure student growth over time, nor have they been validated to attribute that growth to teachers. “Teachers have huge effects, it’s just that the tests are not sensitive to them.”
  7. Teachers’ effects on standardized achievement test scores fade quickly, becoming barely discernible after a few years. So we might not want to overly worry about most teachers’ effects on their students—good or bad—as they are hard to detect on tests after two or so years. To use these ephemeral effects to then hold teacher education programs accountable seems even more problematic.
  8. Observational measures of teacher competency and achievement tests of teacher competency do not correlate well. This suggests nothing more than that one or both of these measures, and likely the latter, are malfunctioning in their capacities to measure the teacher effectiveness construct. See other Vamboozled posts about this here, here, and here.
  9. Different standardized achievement tests, both purporting to measure reading, mathematics, or science at the same grade level, will give different estimates of teacher competency. That is because different test developers have different visions of what it means to be competent in each of these subject areas. Thus one achievement test in these subject areas could find a teacher exemplary, but another test of those same subject areas would find the teacher lacking. What then? Have we an unstable teacher or an ill-defined subject area?
  10. Tests can be administered early or late in the fall, early or late in the spring, and the dates they are given influence the judgments about whether a teacher is performing well or poorly. Teacher competency should not be determined by minor differences in the date of testing, but that happens frequently.
  11. No standardized achievement tests have provided proof that their items are instructionally sensitive. If test items do not, because they cannot “react to good instruction,” how can one make a claim that the test items are “tapping good instruction?”
  12. Teacher effects show up more dramatically on teacher-made tests than on standardized achievement tests because the former are based on the enacted curriculum, while the latter are based on the desired curriculum. Tests become as much as seven times more instructionally sensitive the closer they are to the classroom (i.e., teacher-made tests).
  13. The opt-out testing movement invalidates inferences about teachers and schools that can be made from standardized achievement test results. It’s not bad to remove these kids from taking these tests, and perhaps it is even necessary in our over-tested schools, but the tests, and the VAM estimates derived via these tests, are far less valid when that happens. This is because the students who opt out are likely different in significant ways from those who do take the tests. This severely limits the validity claims that are made.
  14. Assessing new teachers with standardized achievement tests is likely to yield many false negatives. That is, the assessments would identify teachers early in their careers as ineffective in improving test scores, which is, in fact, often the case for new teachers. Two or three years later that could change. Perhaps the last thing we want to do in a time of teacher shortage is discourage new teachers while they acquire their skills.
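
Reason #13 in particular makes a statistical claim that can be illustrated with a toy simulation. This is a hypothetical sketch: the model, the numbers, and the size of the opt-out effect are all invented for illustration, not drawn from any actual VAM. The point is simply that when the students who opt out differ systematically from those who test, the surviving sample misrepresents the teacher’s effect.

```python
import random

random.seed(1)

# Invented setup: each student's score gain reflects both a fixed "teacher
# effect" and sociological background variables. Students with stronger
# backgrounds opt out more often, so the tested sample is non-random.
TRUE_TEACHER_EFFECT = 2.0

students = []
for _ in range(10_000):
    background = random.gauss(0, 10)              # sociological variables
    gain = TRUE_TEACHER_EFFECT + 0.5 * background + random.gauss(0, 3)
    opts_out = background > 5                     # non-random opt-out rule
    students.append((gain, opts_out))

# Naive "teacher effect" estimate: mean gain of whoever took the test.
all_mean = sum(g for g, _ in students) / len(students)
tested = [g for g, out in students if not out]
tested_mean = sum(tested) / len(tested)

print(f"estimate with all students tested: {all_mean:.2f}")
print(f"estimate after opt-outs:           {tested_mean:.2f}")
```

With everyone tested, the estimate lands near the true effect; once the higher-background students opt out, the same computation substantially understates it, through no fault of the teacher.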

Doug Harris on the “Every Student Succeeds Act” and Reasons to Continue with VAMs Regardless

Two weeks ago I wrote a post about the passage of the “Every Student Succeeds Act” (ESSA), and its primary purpose to reduce “the federal footprint and restore local control, while empowering parents and education leaders to hold schools accountable for effectively teaching students” within their states. More specific to VAMs, I wrote about how ESSA will allow states to decide how to weight their standardized test scores and decide whether and how to evaluate teachers with or without said scores.

Doug Harris, an Associate Professor of Economics at Tulane University in Louisiana and, as I’ve written prior, “a ‘cautious’ but quite active proponent of VAMs” (see another two posts about Harris here and here), recently took to an Education Week blog to respond to ESSA as well. His post, titled “NCLB 3.0 Could Improve the Use of Value-Added and Other Measures,” can be read with a subscription. For those of you without a subscription, however, I’ve highlighted some of his key arguments below, along with my responses to those I felt were most pertinent here.

In his post, Harris argues that one element of ESSA “clearly reinforces value-added–the continuation of annual testing.” He treats the two (i.e., value-added and testing) as synonyms, with which I would certainly disagree. The latter precedes the former, and without the latter the former would not exist. In other words, what matters is what one does with the test scores; test scores do not by default mean the same thing as VAMs.
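
To make this distinction concrete, here is a minimal, hypothetical sketch of what “doing something with the test scores” means. A VAM-style estimate is a statistical artifact computed from scores (here, each teacher’s mean residual from a simple regression of current on prior scores), not the scores themselves. All data and the model are invented for illustration; real VAMs are far more elaborate.

```python
# Toy "value-added" illustration: the test produces scores; the VAM is a
# model applied to those scores afterward. Data below are fabricated.

def simple_regression(xs, ys):
    """Ordinary least squares fit for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# (teacher, prior score, current score) -- invented classroom records
records = [
    ("A", 50, 58), ("A", 60, 67), ("A", 70, 77),
    ("B", 50, 52), ("B", 60, 61), ("B", 70, 71),
]

a, b = simple_regression([r[1] for r in records], [r[2] for r in records])

# Each teacher's "value-added": mean residual of their students' scores.
value_added = {}
for teacher, pre, post in records:
    residual = post - (a + b * pre)
    value_added.setdefault(teacher, []).append(residual)

for teacher, resids in sorted(value_added.items()):
    print(teacher, round(sum(resids) / len(resids), 2))
```

The same six test scores could feed many different models, each yielding different teacher rankings, which is precisely why the scores and the VAM estimates built upon them should not be conflated.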

In addition, while the continuation of annual testing is written into ESSA, the use of VAMs is not, and their use for teacher evaluations, or not, is now left to each state to decide. Hence, it is true that ESSA “means the states will no longer have to abide by [federal/NCLB] waivers and there is a good chance that some states will scale back aggressive evaluation and accountability [measures].”

Harris discusses the controversies surrounding VAMs, as argued by both VAM proponents and critics, and he contends in the end that “[i]t is difficult to weigh these pros and cons… because we have so little evidence on the [formative] effects of using value-added measures on teaching and learning.” I would argue that we have plenty of evidence on the formative effects of using VAMs, given the evidence dating back now almost thirty years (e.g., from Tennessee). This is one area of research where I do not believe we need more evidence to support the lack of formative effects, or rather of instructionally informative benefits to be drawn from VAM use. Again, this is largely due to the fact that VAM estimates are based on tests that are so far removed from the realities of the classroom that the tests, as well as the VAM estimates derived thereafter, are not “instructionally sensitive.” Should Harris argue for value-added using “instructionally sensitive” tests, perhaps, this suggestion for future research might carry some more weight or potential.

Harris also discusses some “other ways to use value-added” should states still decide to do so (e.g., within a flagging system whereby VAMs, as the more “objective” measure, could be used to raise red flags for individual teachers who might require further investigation using other, less “objective” measures). Given the millions in taxpayer revenues it would require to do even this (again, for only the roughly 30% of all teachers who are primarily elementary school teachers of primary subject areas), cost should certainly be of note on this one. This suggestion also overlooks VAMs’ still-prevalent methodological and measurement issues, and how these issues should likely prevent VAMs from playing even a primary role as the key criterion used to flag teachers. Nor is this warranted from an educational measurement standpoint.

Harris continues that “The point is that it would be a shame if we went back to a no-feedback world, and even more of a shame if we did so because of an over-stated connection between evaluation, accountability, and value-added.” In my professional opinion, he is incorrect on both of these points: (1) Teachers are not (really) using the feedback they are getting from their VAM reports, as described prior, for a variety of reasons including transparency, usability, actionability, etc.; hence, we are nowhere nearer to some utopian “feedback world” than we were pre-VAMs. All of the related research, in fact, points to observational feedback as the most useful when it comes to garnering and using formative feedback. (2) There is nothing “over-stated” in terms of the “connection between evaluation, accountability, and value-added.” Rather, what he terms possibly “over-stated” is actually real in practice, and overly abused rather than overstated. This is not just a matter of semantics.

Finally, the one point upon which we agree, with some distinction, is that ESSA “is also an opportunity because it requires schools to move beyond test scores in accountability. This will mean a reduced focus on value-added to student test scores, but that’s OK if the new measures provide a more complete assessment of the value [defined in much broader terms] that schools provide.” ESSA, then, offers the nation “a chance to get value-added [defined in much broader terms] right and correct the problems with how we measure teacher and school performance.” If we define “value-added” in much broader terms, even if test scores are not a part of the “value-added” construct, this would likely be a step in the right direction.

Another Take on New Mexico’s Ruling on the State’s Teacher Evaluation/VAM System

John Thompson, a historian and teacher, wrote a post just published in Diane Ravitch’s blog (here) in which he took a closer look at the New Mexico court decision of which I was a part and which I covered a few weeks ago (here). This is the case in which state District Judge David K. Thomson, who presided over the five-day teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data.

Historian and teacher John Thompson adds another, independent take on this ruling (again here), also having read through Judge Thomson’s entire ruling. Here is what he wrote:

New Mexico District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. As Audrey Amrein-Beardsley explains, the state “can proceed with ‘developing’ and ‘improving’ its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.”

This is wonderful news. As the American Federation of Teachers observes, “Superintendents, principals, parents, students and the teachers have spoken out against a system that is so rife with errors that in some districts as many as 60 percent of evaluations were incorrect. It is telling that the judge characterizes New Mexico’s system as a ‘policy experiment’ and says that it seems to be a ‘Beta test where teachers bear the burden for its uneven and inconsistent application.’”

A close reading of the ruling makes it clear that this case is an even greater victory over the misuse of test-driven accountability than even the jubilant headlines suggest. It shows that Judge Thomson made the right ruling on the key issues for the right reasons, and he seems to be predicting that other judges will be following his legal logic. Litigation over value-added teacher evaluations is being conducted in 14 states, and the legal battleground is shifting to the place where corporate reformers are weakest. No longer are teachers being forced to prove that there is no rational basis for defending the constitutionality of value-added evaluations. Now, the battleground is shifting to the actual implementation of those evaluations and how they violate state laws.

Judge Thomson concludes that the state’s evaluation systems don’t “resemble at all the theory” they were based on. He agreed with the district superintendent who compared it to the Wizard of Oz, where “the guy is behind the curtain and pulling levers and it is loud.” Some may say that the Wizard’s behavior is “understandable,” but that is not the judge’s concern. The Court must determine whether the consequences are assessed by a system that is “objective and uniform.” Clearly, it has been impossible in New Mexico and elsewhere for reformers to meet the requirements they mandated, and that is the legal terrain where VAM proponents must now fight.

The judge thus concludes, “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate.”

The state of New Mexico counters by citing cases in Florida and Tennessee as precedents. But Judge Thomson writes that those cases ruled against early challenges based on equal protection or constitutional issues, even as they also cited practical concerns about implementation. He writes of the Florida (Cook) case, “The language in the Cook case could be lifted from the Court findings in this case.” That state’s judge decided, “The unfairness of this system is not lost on this Court.” Judge Thomson also argues, “The (Florida) Court in fact seemed to predict the type of legal challenge that could result …‘The individual plaintiffs have a separate remedy to challenge an evaluation on procedural due process grounds if an evaluation is actually used to deprive the teacher of an evaluation right.’”

The question in Florida and Tennessee had been whether there was “a conceivable rational basis” for proceeding with the teacher evaluation policy experiment. Below are some of the more irrational results of those evaluations. The facts in the New Mexico case may be somewhat more absurd than those in other places that have implemented VAMs but, given the inherent flaws in those evaluations, I doubt they are qualitatively worse. In fact, Audrey Amrein-Beardsley testified about a similar outcome in Houston, which was as awful as the New Mexico travesties and led to about one fourth of the teachers subject to those evaluations being placed on “growth plans.”

As has become common across the nation, New Mexico teachers have been evaluated on students who aren’t in the teachers’ classrooms. They have been held accountable for test results from subjects that the teacher didn’t teach. Science teachers might be evaluated on a student taught in 2011, based on how that student scored in 2013.

The judge cited testimony regarding a case where 50% of the teachers rated Minimally Effective had missing data due to reassignment to a wrong group. One year, a district questioned the state’s data, and immediately it saw an unexplained 11% increase in effective teachers. The next year, also without explanation, the state’s original numbers on effectiveness were reduced by 6%.

One teacher taught 160 students but was evaluated on scores of 73 of them and was then placed on a plan for improvement. Because of the need to quantify the effectiveness of teachers in Group B and Group C, who aren’t subject to state End of Instruction tests, there are 63 different tests being used in one district to generate high-stakes data. And, when changing tests to the Common Core PARCC test, the state has to violate scientific protocol, and mix and match test score results in an indefensible manner. Perhaps just as bad, in 2014-15, 76% of teachers were still being evaluated on less than three years of data.

The Albuquerque situation seems exceptionally important because it serves 25% of the state’s students, and it is the type of high-poverty system where value-added evaluations are likely to be most unreliable and invalid. It had 1728 queries about data and 28% of its teachers ranked below the Effective level. The judge noted that if you teach a core subject, you are twice as likely as a French teacher to be judged Ineffective. But, that was not the most shocking statistic. In Albuquerque, Group A elementary teachers (where VAMs play a larger role) are five times more likely to be rated below Effective than their colleagues in Group B. In Roswell, Group B teachers are three times more likely to be rated below Effective than Group C teachers.

Curiously, VAM advocate Tom Kane testified, but he did so in a way that made it unclear whether he saw himself as a witness for the defense or the plaintiffs. When asked about Amrein-Beardsley’s criticism of using tests that weren’t designed for evaluating teachers, Kane countered that the Gates Foundation MET study used random samples and concluded that differing tests could be used in a way that was “useful in evaluating teachers” and valid predictors of student achievement. Kane also replied that he could estimate the state’s error rate “on average,” but he couldn’t estimate error rates for individual teachers. He did not address the judge’s real concern about whether New Mexico’s use of VAMs was uniform and objective.

I am not a lawyer but I have years of experience as a legal historian. Although I have long been disappointed that the legal profession did not condemn value-added evaluations as a violation of our democracy’s fundamental principles, I also knew that the first wave of lawsuits challenging VAMs would face an uphill battle. Using teachers as guinea pigs in a risky experiment, where non-educators imposed their untested opinions on public schools, was always bad policy. Along with their other sins, value-added evaluations would mean collective punishment of some teachers merely for teaching in schools and classes where it is harder to meet dubious test score growth targets. But, many officers of the court might decide that they did not have the grounds to overrule new teacher evaluation laws. They might have to hold their noses while ruling in favor of laws that make a mockery of our tenets of fairness in a constitutional democracy.

During the last few years, rather than force those who would destroy the hard-earned legal rights of teachers to meet the legal standard of “strict scrutiny,” those who would fire teachers without proving that their data was reliable and valid have mostly had to show that their policies were not irrational. Now that their policies are being implemented, reformers must defend the ways that their VAMs are actually being used. Corporate reformers and the Duncan administration were able to coerce almost all of the states into writing laws requiring quantitative components in teacher evaluations. Not surprisingly, it has often proven impossible to implement their schemes in a rational manner.

In theory, corporate reformers could have won if they required the high-stakes use of flawed metrics while maintaining the message discipline that they are famous for. School administrators could have been trained to say that they were merely enforcing the law when they assessed consequences based on metrics. Their job would have been to recite the standard soundbite when firing teachers – saying that their metrics may or may not reflect the actual performance of the teacher in question – but the law required that practice. Life’s not fair, they could have said, and whether or not the individual teacher was being unfairly sacrificed, the administrators who enforced the law were just following orders. It was the will of the lawmakers that the firing of the teachers with the lowest VAMs – regardless of whether the metric reflected actual effectiveness – would make schools more like corporations, so practitioners would have to accept it. But, this is one more case where reformers ignored the real world, did not play out the education policy and legal chess game, and did not anticipate that rulings such as Judge Thomson’s would soon be coming.

In the real world, VAM advocates had to claim that their results represented the actual effectiveness of teachers and that, somehow, their scheme would someday improve schools. This liberated teachers and administrators to fight back in the courts. Moreover, top-down reformers set out to impose the same basic system on every teacher, in every type of class and school, in our diverse nation. When this top-down micromanaging met reality, proponents of test-driven evaluations had to play so many statistical games, create so many made-up metrics, and improvise in so many bizarre ways, that the resulting mess would be legally indefensible.

And, that is why the cases in Florida and Tennessee might soon be seen as the end of the beginning of the nation’s repudiation of value-added evaluations. The New Mexico case, along with the renewal of the federal ESEA and the departure of Arne Duncan, is clearly the beginning of the end. Had VAM proponents objectively briefed attorneys on the strengths and weaknesses of their theories, they could have thought through the inevitable legal process. On the other hand, I doubt that Kane and his fellow economists knew enough about education to be able to anticipate the inevitable, unintended results of their theories on schools. In numerous conversations with VAM true believers, rarely have I met one who seemed to know enough about the nuts and bolts about schools to be able to brief legal advisors, much less anticipate the inevitable results that would eventually have to be defended in court.

Special Issue of “Educational Researcher” (Paper #5 of 9): Teachers’ Perceptions of Observations and Student Growth

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#5 of 9) here, titled “Teacher Perspectives on Evaluation Reform: Chicago’s REACH [Recognizing Educators Advancing Chicago Students] Students.” This one is authored by Jennie Jiang, Susan Sporte, and Stuart Luppescu, all of whom are associated with The University of Chicago’s Consortium on Chicago School Research, and all of whom conducted survey- and interview-based research on teachers’ perceptions of the Chicago Public Schools (CPS) teacher evaluation system, twice since it was implemented in 2012–2013. They did this across CPS’s almost 600 schools and its more than 12,000 teachers, with high stakes recently attached to teacher evaluations (e.g., professional development plans, remediation, tenure attainment, teacher dismissal/contract non-renewal; p. 108).

Directly related to the Review of Article #4 prior (i.e., #4 of 9 on observational systems’ potentials here), these researchers found that Chicago teachers are, in general, positive about the evaluation system, primarily given the system’s observational component (i.e., the Charlotte Danielson Framework for Teaching, used twice per year for tenured teachers, which counts for 75% of teachers’ evaluation scores), and not given the inclusion of student growth in this evaluation system (which counts for the other 25%). Researchers also found, however, that overall satisfaction with the REACH system at large is declining at a statistically significant rate over time, perhaps as teachers get to know the system better.

This system, like the strong majority of others across the nation, is based on only these two components, although the growth measure includes a combination of two different metrics (i.e., value-added scores and growth on “performance tasks,” as per the grades and subject areas taught). See more information about how these measures are broken down by teacher type in Table 1 (p. 107), and see also p. 107 for the different types of measures used (e.g., the Northwest Evaluation Association’s Measures of Academic Progress assessment (NWEA-MAP), a Web-based, computer-adaptive, multiple-choice assessment that is used to measure value-added scores for teachers in grades 3–8).

As for the student growth component, more specifically, when researchers asked teachers “if their evaluation relies too heavily on student growth, 65% of teachers agreed or strongly agreed” (p. 112); “Fifty percent of teachers disagreed or strongly disagreed that NWEA-MAP [and other off-the-shelf tests used to measure growth in CPS offered] a fair assessment of their student’s learning” (p. 112); “teachers expressed concerns about the narrow representation of student learning that is measured by standardized tests and the increase in the already heavy testing burden on teachers and students” (p. 112); and “Several teachers also expressed concerns that measures of student growth were unfair to teachers in more challenging schools [i.e., bias], because student growth is related to the supports that students may or may not receive outside of the classroom” (p. 112). One teacher explained this concern, writing: “I think the part that I find unfair is that so much of what goes on in these kids’ lives is affecting their academics, and those are things that a teacher cannot possibly control” (p. 112).

As for the performance tasks meant to complement (or serve as) the student growth or VAM measure, teachers were discouraged by how subjective these were, and how susceptible to distortion, because teachers “score their own students’ performance tasks at both the beginning and end of the year. Teachers noted that if they wanted to maximize their student growth score, they could simply give all students a low score on the beginning-of-year task and a higher score at the end of the year” (p. 113).
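
The gaming concern here is simple arithmetic, sketched below with invented rubric numbers (the 1–5 scale and the scores themselves are hypothetical, not taken from REACH): because the same teacher assigns both the beginning-of-year and end-of-year scores, the growth metric is fully under the scorer’s control.

```python
# Hypothetical illustration of the self-scoring incentive: a teacher who
# scores both testing occasions can manufacture maximal "growth" by rating
# everyone low early and high late. Rubric scale (1-5) and scores invented.

def growth(pre_scores, post_scores):
    """Mean end-of-year score minus mean beginning-of-year score."""
    return (sum(post_scores) / len(post_scores)
            - sum(pre_scores) / len(pre_scores))

honest_pre, honest_post = [3, 3, 4, 4], [4, 4, 4, 5]   # scored sincerely
gamed_pre, gamed_post = [1, 1, 1, 1], [5, 5, 5, 5]     # scored strategically

print("honest growth:", growth(honest_pre, honest_post))
print("gamed growth: ", growth(gamed_pre, gamed_post))
```

The strategically scored class shows several times the “growth” of the sincerely scored one, even if the students learned exactly the same amount, which is the teachers’ point about the measure’s susceptibility to distortion.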

As for the observational component, however, researchers found that “almost 90% of teachers agreed that the feedback they were provided in post-observation conferences” (p. 111) was of high value; the observational, and more importantly the post-observational, processes made teachers and their supervisors more accountable for their effectiveness and, more importantly, their improvement. While in the conclusions section of this article the authors stretch this finding out a bit, writing that “Overall, this study finds that there is promise in teacher evaluation reform in Chicago” (p. 114), primarily based on their findings about “the new observation process” (p. 114) being used in CPS, recall from the Review of Article #4 prior (i.e., #4 of 9 on observational systems’ potentials here) that these observational systems are not “new and improved.” Rather, these are the same observational systems that, given the levels of subjectivity featured and highlighted in reports like “The Widget Effect” (here), brought us to our current (over)reliance on VAMs.

Researchers also found that teachers were generally confused about the REACH system, and about what actually “counted,” and for how much, in their evaluations. The most confusion surrounded the student growth or value-added component, as (based on prior research) would be expected. Beginning teachers reported more clarity than did relatively more experienced teachers, high school teachers, and teachers of special education students, and all of this was related to the extent to which a measure of student growth directly impacted teachers’ evaluations. Teachers receiving school-wide value-added scores were also relatively more critical.

Lastly, researchers found that in 2014, “79% of teachers reported that the evaluation process had increased their levels of stress and anxiety, and almost 60% of teachers agreed or strongly agreed the evaluation process takes more effort than the results are worth.” Again, beginning teachers were “consistently more positive on all…measures than veteran teachers; elementary teachers were consistently more positive than high school teachers, special education teachers were significantly more negative about student growth than general teachers,” and the like (p. 113). And all of this was positively and significantly related to teachers’ perceptions of their school’s leadership, perceptions of the professional communities at their schools, and teachers’ perceptions of evaluation writ large.


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; and see the Review of Article #4 – on observational systems’ potentials here.

Article #5 Reference: Jiang, J. Y., Sporte, S. E., & Luppescu, S. (2015). Teacher perspectives on evaluation reform: Chicago’s REACH students. Educational Researcher, 44(2), 105-116. doi:10.3102/0013189X15575517

Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#4 of 9) here, titled “Make Room Value-Added: Principals’ Human Capital Decisions and the Emergence of Teacher Observation Data.” This one is authored by Ellen Goldring, Jason A. Grissom, Christine Neumerski, Marisa Cannata, Mollie Rubin, Timothy Drake, and Patrick Schuermann, all of whom are associated with Vanderbilt University.

This article is primarily about (1) the extent to which the data generated by “high-quality observation systems” can inform principals’ human capital decisions (e.g., teacher hiring, contract renewal, assignment to classrooms, professional development), and (2) the extent to which principals are relying less on test scores derived via value-added models (VAMs), when making the same decisions, and why. Here are some of their key (and most important, in my opinion) findings:

  • Principals across all school systems revealed major hesitations and challenges regarding the use of VAM output for human capital decisions. Barriers preventing VAM use included the timing of data availability (e.g., the fall), which is well after human capital decisions are made (p. 99).
  • VAM output are too far removed from the practice of teaching (p. 99), and this lack of instructional sensitivity impedes, if not entirely prevents, their actual versus hypothetical use for school/teacher improvement.
  • “Principals noted they did not really understand how value-added scores were calculated, and therefore they were not completely comfortable using them” (p. 99). Likewise, principals reported that because teachers did not understand how the systems worked either, teachers did not use VAM output data either (p. 100).
  • VAM output are not transparent when used to determine compensation, and especially when used to evaluate teachers teaching nontested subject areas. In districts that use school-wide VAM output to evaluate teachers in nontested subject areas, in fact, principals reported regularly ignoring VAM output altogether (pp. 99-100).
  • “Principals reported that they perceived observations to be more valid than value-added measures” (p. 100); hence, principals reported using observational output much more, again, in terms of making human capital decisions and making such decisions “valid” (p. 100).
  • “One noted exception to the use of value-added scores seemed to be in the area of assigning teachers to particular grades, subjects, and classes. Many principals mentioned they use value-added measures to place teachers in tested subjects and with students in grade levels that ‘count’ for accountability purpose…some principals [also used] VAM [output] to move ineffective teachers to untested grades, such as K-2 in elementary schools and 12th grade in high schools” (p. 100).

Of special note here is also the following finding: “In half of the systems [in which researchers investigated these systems], there [was] a strong and clear expectation that there be alignment between a teacher’s value-added growth score and observation ratings…Sometimes this was a state directive and other times it was district-based. In some systems, this alignment is part of the principal’s own evaluation; principals receive reports that show their alignment” (p. 101). In other words, principals are being evaluated and held accountable given the extent to which their observations of their teachers match their teachers’ VAM-based data. If misalignment is noticed, it is not taken to be the fault of either measure (e.g., in terms of measurement error); rather, it is taken to be the fault of the principal, who is critiqued for inaccuracy and therefore (perversely) incentivized to skew his or her observational data (the only data over which the principal has control) to artificially match the VAM-based output. This clearly distorts validity, or rather the validity of the inferences that are to be made using such data. Appropriately, principals also “felt uncomfortable [with this] because they were not sure if their observation scores should align primarily…with the VAM” output (p. 101).
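The perverse incentive here can be made concrete. The sketch below assumes a hypothetical alignment metric (mean absolute gap between observation ratings and VAM scores); the article does not specify how districts compute alignment. The point is that nudging observation ratings toward the VAM output improves the principal’s measured “alignment” whether or not the VAM scores are accurate, which is exactly the distortion of validity described above.

```python
# Hypothetical illustration of the alignment incentive (p. 101).
# NOTE: the alignment metric (mean absolute gap) is assumed for
# illustration; the article does not define districts' actual metric.

def mean_abs_gap(obs, vam):
    """Average absolute difference between observation and VAM ratings."""
    return sum(abs(o - v) for o, v in zip(obs, vam)) / len(obs)

vam_scores = [1, 4, 2, 5, 3]   # teachers' VAM ratings (1-5 scale)
honest_obs = [3, 3, 4, 3, 3]   # principal's independent judgments

# Skewed ratings: move each observation halfway toward the VAM score.
skewed_obs = [o + 0.5 * (v - o) for o, v in zip(honest_obs, vam_scores)]

print(mean_abs_gap(honest_obs, vam_scores))  # → 1.4
print(mean_abs_gap(skewed_obs, vam_scores))  # → 0.7 ("better" alignment)
```

Note that the “improvement” from 1.4 to 0.7 comes purely from discarding the principal’s independent information, not from any change in teaching.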

“In sum, the use of observation data is important to principals for a number of reasons: It provides a ‘bigger picture’ of the teacher’s performance, it can inform individualized and large group professional development, and it forms the basis of individualized support for remediation plans that serve as the documentation for dismissal cases. It helps principals provide specific and ongoing feedback to teachers. In some districts, it is beginning to shape the approach to teacher hiring as well” (p. 102).

The only significant weakness, again in my opinion, with this piece is that the authors write that these observational data, at focus in this study, are “new,” thanks to recent federal initiatives. They write, for example, that “data from structured teacher observations—both quantitative and qualitative—constitute a new [emphasis added] source of information principals and school systems can utilize in decision making” (p. 96). They are also “beginning to emerge [emphasis added] in the districts…as powerful engines for principal data use” (p. 97). I would beg to differ, as these systems have not changed much over time, pre- and post- these federal initiatives, as claimed (without evidence or warrant) by these authors herein. See, for example, Table 1 on p. 98 of the article to judge whether what they have included within the list of components of such new and “complex, elaborate teacher observation systems” is actually new or much different than most of the observational systems in use prior. As an aside, one such system in use and of issue in this examination is one with which I am familiar, in use in the Houston Independent School District. Click here to also see whether this system is more “complex” or “elaborate” over and above such systems prior.

Also recall that one of the key reports that triggered the current call for VAMs, as the “more objective” measures needed to measure and therefore improve teacher effectiveness, was based on data suggesting that “too many teachers” were being rated as satisfactory or above (see “The Widget Effect” report here). The observational systems in use then are essentially the same observational systems still in use today. This stands in stark contradiction to the authors’ claims throughout this piece, for example, when they write that “Structured teacher observations, as integral components of teacher evaluations, are poised to be a very powerful lever for changing principal leadership and the influence of principals on schools, teachers, and learning.” This counters all that is in, and all that came from, “The Widget Effect” report.


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; and see the Review of Article #3 – on VAMs’ potentials here.

Article #4 Reference: Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. doi:10.3102/0013189X15575031

The “Every Student Succeeds Act” (ESSA) To Replace “No Child Left Behind” (NCLB)

Yesterday, the US “Senate overwhelmingly passe[d] new national education legislation” called the “Every Student Succeeds Act” (ESSA; formerly known as the “Student Success Act,” H.R. 5). The ESSA passed the Senate with an 85-12 vote, and it is officially set to replace “No Child Left Behind” (NCLB) once President Obama signs it into law (expected later today). This same act passed, with a similar margin, in the US House last October (see a prior post about this here).

The ESSA is to reduce “the federal footprint and restore local control, while empowering parents and education leaders to hold schools accountable for effectively teaching students” within their states, and also “[reset] Washington’s relationship with the nation’s 100,000 public schools” and its nearly 50 million public school students and their 3.4 million public school teachers, while “sending significant power back to states and local districts while maintaining limited federal oversight of education.” Peripherally, albeit substantially, this will also impact those who greatly influence (and/or profit from) the “public school market estimated to be worth about $700 billion” (e.g., testing companies, value-added modelers/contractors).

More specifically, ESSA is to:

  • Replace the current national accountability scheme based on high stakes tests with state-led accountability systems, returning responsibility for measuring student and school performance to states and school districts. States will still be required, however, to test students annually in mathematics and reading in grades three through eight and once in high school, as per NCLB’s earlier provisions. States will also be required to publicly report these scores according to race, income, ethnicity, disability, and whether students are English-language learners (ELLs).
  • Allow states to decide how to weight these and other test scores and, more importantly as related to this blog, decide whether and how to evaluate teachers with or without said scores. States will be able to “set their own goals and timelines for academic progress, though their plans must be approved by the federal Department of Education.” About this latter caveat there exists some uncertainty; hence, we will have to see how this one plays out.
  • Related, ESSA will excuse states with NCLB waivers from having to adopt stronger accountability measures based on student- and teacher-level growth, as per current (and soon-to-be-past) federal legislative requirements. All 43 states currently holding waivers are, accordingly, to be released from these waivers no later than August. “It is unclear [however] whether states will retain [these] policies absent a federal mandate.”
  • Overall, ESSA will protect state and local autonomy over decisions in the classroom by preventing, for example, the US Secretary of Education from coercing states into adopting federal initiatives. As per the same Washington Post article, “The new law will significantly reduce the legal authority of the education secretary, who [will] be legally barred from influencing state decisions about academic benchmarks, such as the Common Core State Standards, teacher evaluations and other policies.”

This “is the single biggest step toward local control of public schools in 25 years,” said Senator Lamar Alexander (Republican-Tennessee), chair of the Senate education panel and a chief architect of the law along with Senator Patty Murray (Democrat-Washington).

See other related articles on this here, here, and here. As per this last post, the Badass Teachers Association (BATs) highlight both the good and the bad in ESSA as they see it. The good more or less mirrors that which is highlighted above, the bad includes legitimate concerns about how ESSA will allow for more charter schools, more room for Teach For America (TFA), “Pay for Success” for investors, and the like.

The Public Release of Value-Added Scores Does Not (Yet) Impact Real Estate

As per ScienceDaily, a resource “for the latest research news,” research just conducted by economists at Michigan State and Cornell evidences that “New school-evaluation method fails to affect housing prices.” See also a press release about this study on Michigan State’s website here, and see what I believe is a pre-publication version of the full study here.

As asserted in both pieces, the study, recently published in the Journal of Urban Economics, is the first to examine how the public release of such data is reflected in housing prices. Researchers, more specifically, examined whether and to what extent the (very controversial) public release of teachers’ VAM data by the Los Angeles Times impacted housing prices in Los Angeles. To read a prior post on this release, click here.

While for some time now we have known from similar research studies, conducted throughout the pre-VAM era, that students’ test scores are correlated with (or cause) rises in housing prices, these researchers evidenced that, thus far, the same does not seem to be true in the case of VAMs. That is, the public consumption of publicly available value-added data, at least in Los Angeles, does not (yet) seem to be correlated with, or causing, much of anything in the housing market.

“The implication: Either people don’t value the popular new measures or they don’t fully understand them.” Perhaps another implication is that it is just (unfortunately) a matter of time. I write this in consideration of the fact that while researchers included data from more than 63,000 home sales as per the Los Angeles County Assessor’s Office, they did so in only the eight-month period following the public release of the VAM data. True effects might be lagged; hence, readers might interpret these results as preliminary, for now.

Victory in Court: Consequences Attached to VAMs Suspended Throughout New Mexico

Great news for New Mexico and New Mexico’s approximately 23,000 teachers, and great news for states and teachers potentially elsewhere, in terms of setting precedent!

Late yesterday, state District Judge David K. Thomson, who presided over the ongoing teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set, for now, for April) that the system is reliable, valid, fair, uniform, and the like.

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), is unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here).

Thereafter, one scheduled day of testimonies turned into five in Santa Fe, running from the end of September through the beginning of October (each of which I covered here, here, here, here, and here). I served as the expert witness for the plaintiffs’ side, along with other witnesses including lawmakers (e.g., a state senator) and educators (e.g., teachers, superintendents) who made various (and very articulate) claims about the state’s teacher evaluation system on the stand. Thomas Kane served as the expert witness for the defendant’s side, along with other witnesses including lawmakers and educators who made counter claims about the system, some of which backfired, unfortunately for the defense, primarily during cross-examination.

See articles released about this ruling this morning in the Santa Fe New Mexican (“Judge suspends penalties linked to state’s teacher eval system”) and the Albuquerque Journal (“Judge curbs PED teacher evaluations”). See also the AFT’s press release, written by AFT President Randi Weingarten, here. Click here for the full 77-page Order written by Judge Thomson (see also, below, five highlights I pulled from this Order).

The journalist of the Santa Fe New Mexican, though, provided the most detailed information about Judge Thomson’s Order, writing, for example, that the “ruling by state District Judge David Thomson focused primarily on the complicated combination of student test scores used to judge teachers. The ruling [therefore] prevents the Public Education Department [PED] from denying teachers licensure advancement or renewal, and it strikes down a requirement that poorly performing teachers be placed on growth plans.” In addition, the Judge noted that “the teacher evaluation system varies from district to district, which goes against a state law calling for a consistent evaluation plan for all educators.”

The PED continues to stand by its teacher evaluation system, calling the court challenge “frivolous” and “a legal PR stunt,” all the while noting that Judge Thomson’s decision “won’t affect how the state conducts its teacher evaluations.” Indeed it will, for now and until the state’s teacher evaluation system is vetted and validated, and “the court” is “assured” that the system can actually be used to take the “consequential actions” against teachers “required” by the state’s PED.

Here are some other highlights that I took directly from Judge Thomson’s ruling, capturing what I viewed as his major areas of concern about the state’s system (click here, again, to read Judge Thomson’s full Order):

  • Validation Needed: “The American Statistical Association says ‘estimates from VAM should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAM are used for high stake[s] purposes” (p. 1). These are the measures, assumptions, limitations, and the like that are to be made transparent in this state.
  • Uniformity Required: “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate” (p. 2)…with the existing statutory framework for teacher evaluations for licensure purposes requiring “that the teacher be evaluated for ‘competency’ against a ‘highly objective uniform statewide standard of evaluation’ to be developed by PED” (p. 4). “It is the term ‘highly objective uniform’ that is the subject matter of this suit” (p. 4), whereby the state and no other “party provided [or could provide] the Court a total calculation of the number of available district-specific plans possible given all the variables” (p. 54). See also the Judge’s points #78-#80 (starting on page 70) for some of the factors that helped to “establish a clear lack of statewide uniformity among teachers” (p. 70).
  • Transparency Missing: “The problem is that it is not easy to pull back the curtain, and the inner workings of the model are not easily understood, translated or made accessible” (p. 2). “Teachers do not find the information transparent or accurate” and “there is no evidence or citation that enables a teacher to verify the data that is the content of their evaluation” (p. 42). In addition, “[g]iven the model’s infancy, there are no real studies to explain or define the [s]tate’s value-added system…[hence, the consequences and decisions]…that are to be made using such system data should be examined and validated prior to making such decisions” (p. 12).
  • Consequences Halted: “Most significant to this Order, [VAMs], in this [s]tate and others, are being used to make consequential decisions…This is where the rubber hits the road [as per]…teacher employment impacts. It is also where, for purposes of this proceeding, the PED departs from the statutory mandate of uniformity requiring an injunction” (p. 9). In addition, it should be noted that indeed “[t]here are adverse consequences to teachers short of termination” (p. 33) including, for example, “a finding of ‘minimally effective’ [that] has an impact on teacher licenses” (p. 41). These, too, are to be halted under this injunction Order.
  • Clarification Required: “[H]ere is what this [O]rder is not: This [O]rder does not stop the PED’s operation, development and improvement of the VAM in this [s]tate, it simply restrains the PED’s ability to take consequential actions…until a trial on the merits is held” (p. 2). In addition, “[a] preliminary injunction differs from a permanent injunction, as does the factors for its issuance…’ The objective of the preliminary injunction is to preserve the status quo [minus the consequences] pending the litigation of the merits. This is quite different from finally determining the cause itself” (p. 74). Hence, “[t]he court is simply enjoining the portion of the evaluation system that has adverse consequences on teachers” (p. 75).

The PED also argued that “an injunction would hurt students because it could leave in place bad teachers.” As per Judge Thomson, “That is also a faulty argument. There is no evidence that temporarily halting consequences due to the errors outlined in this lengthy Opinion more likely results in retention of bad teachers than in the firing of good teachers” (p. 75).

Finally, given my involvement in this lawsuit and given the team with whom I was/am still so fortunate to work (see picture below), including all of those who testified as part of the team and whose testimonies clearly proved critical in Judge Thomson’s final Order, I want to thank everyone for all of their time, energy, and efforts in this case, thus far, on behalf of the educators attempting to (still) do what they love to do — teach and serve students in New Mexico’s public schools.


Left to right: (1) Stephanie Ly, President of AFT New Mexico; (2) Dan McNeil, AFT Legal Department; (3) Ellen Bernstein, ATF President; (4) Shane Youtz, Attorney at Law; and (5) me 😉

Brookings’ Critique of AERA Statement on VAMs, and Henry Braun’s Rebuttal

Two weeks ago I published a post about the newly released “American Educational Research Association (AERA) Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.”

In this post I also included a summary of the AERA Council’s eight key, and very important, points about VAMs and VAM use. I also noted that I contributed to this piece in one of its earliest forms. More importantly, however, the person who managed the statement’s external review and also assisted the AERA Council in producing the final statement before it was officially released was Boston College’s Dr. Henry Braun, Boisi Professor of Education and Public Policy and Educational Research, Measurement, and Evaluation.

Just this last week, the Brookings Institution published a critique of the AERA statement for, in my opinion, no other apparent reason than just being critical. The critique was written by Brookings affiliate Michael Hansen and University of Washington Bothell’s Dan Goldhaber, titled a “Response to AERA statement on Value-Added Measures: Where are the Cautionary Statements on Alternative Measures?”

Accordingly, I invited Dr. Henry Braun to respond, and he graciously agreed:

In a recent posting, Michael Hansen and Dan Goldhaber complain that the AERA statement on the use of VAMs does not take a similarly critical stance with respect to “alternative measures”. True enough! The purpose of the statement is to provide a considered, research-based discussion of the issues related to the use of value-added scores for high-stakes evaluation. It culminates in a set of eight requirements to be met before such use should be made.

The AERA statement does not stake out an extreme position. First, it is grounded in the broad research literature on drawing causal inferences from observational data subject to strong selection (i.e., the pairing of teachers and students is highly non-random), as well as empirical studies of VAMs in different contexts. Second, the requirements are consistent with the AERA, American Psychological Association (APA), and National Council on Measurement in Education (NCME) Standards for Educational and Psychological Testing. Finally, its cautions are in line with those expressed in similar statements released by the Board on Testing and Assessment of the National Research Council and by the American Statistical Association.

Hansen and Goldhaber are certainly correct when they assert that, in devising an accountability system for educators, a comparative perspective is essential. One should consider the advantages and disadvantages of different indicators, which ones to employ, how to combine them and, most importantly, consider both the consequences for educators and the implications for the education system as a whole. Nothing in the AERA statement denies the importance of subjecting all potential indicators to scrutiny. Indeed, it states: “Justification should be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process.” Of course, guidelines for designing evaluation systems would constitute a challenge of a different order!

In this context, it must be recognized that rankings based on VAM scores and ratings based on observational protocols will necessarily have different psychometric and statistical properties. Moreover, they both require a “causal leap” to justify their use: VAM scores are derived directly from student test performance, but require a way of linking to the teacher of record. Observational ratings are based directly on a teacher’s classroom performance, but require a way of linking back to her students’ achievement or progress.

Thus, neither approach is intrinsically superior to the other. But the singular danger with VAM scores, being the outcome of a sophisticated statistical procedure, is that they are seen by many as providing a gold standard against which other indicators should be judged. Both the AERA and ASA statements offer a needed corrective, by pointing out the path that must be traversed before an indicator based on VAM scores approaches the status of a gold standard. Though the requirements listed in the AERA statement may be aspirational, they do offer signposts against which we can judge how far we have come along that path.

Henry Braun, Lynch School of Education, Boston College