New Teacher Evaluation Report Released by the Network for Public Education

A new report on current teacher evaluation systems throughout the US was just released by the Network for Public Education. The report is titled, “Teachers Talk Back: Educators on the Impact of Teacher Evaluation,” and below are their findings, followed by a condensed version of their six recommendations (as taken from the Executive Summary, although you can read the full 17-page report, again, here):


  • Teachers and principals believe that evaluations based on student test scores, especially Value Added Measures (VAMs), are neither valid nor reliable measures of their work. They believe that VAM scores punish teachers who work with the most vulnerable students. Of the respondents, 83% indicated that the use of test scores in evaluations has had a negative impact on instruction, and 88% said that more time is spent on test prep than ever before.
  • Evaluations based on frameworks and rubrics, such as those created by Danielson and Marzano, have resulted in wasting far too much time. This is damaging the very work evaluation is supposed to improve, as valuable time is diverted to engage in related compliance exercises and paperwork. Of the respondents, 84% reported a significant increase in teacher time spent on evaluations.
  • The emphasis on improving test scores has overwhelmed every aspect of teachers’ work, forcing them to spend precious collaborative time poring over student data rather than having conversations about students and instruction. Sixty-six percent of respondents reported a negative impact on relationships with their students as a result of the pressure to focus on test scores.
  • Over half of the respondents (52.08%) reported witnessing evidence of bias against veteran educators. This supports evidence that evaluations are having a disparate impact, contributing to a decline in teachers of color, veteran teachers, and those serving students in poverty. A recent study (ASI, 2015) found that changes to evaluation practices have coincided with a precipitous drop in the number of black teachers in nine major cities.
  • Teacher professional development tied to the evaluation process is having a stifling effect on teachers, by undermining their sense of autonomy, and limiting their capacity for real professional growth. 85% of respondents indicated that high quality professional development is not connected to their evaluations, and 84% reported a negative effect on conversations between teachers and supervisors. Collegial relationships have also been affected, with 81% of respondents reporting negative changes in conversations with colleagues.


  1. We recommend an immediate halt to the use of test scores as any part of teacher evaluation.
  2. We recommend that teacher collaboration not be tied to evaluation but instead be a teacher-led cooperative process that focuses on their students’ and their own professional learning.
  3. We recommend that the observation process focus on improving instruction—resulting in reflection and dialogue between teacher and observer—the result should be a narrative, not a number.
  4. We recommend that evaluations require less paperwork and documentation so that more time can be spent on reflection and improvement of instruction.
  5. We recommend an immediate review of the impact that evaluations have had on teachers of color and veteran teachers.
  6. We recommend that teachers not be “scored” on professional development activities nor that professional development be dictated by evaluation scores rather than teacher needs.

Again, to read more, please see the full article, as also cited here: The Network for Public Education. (2020). Teachers talk back: Educators on the impact of teacher evaluation.

Why Bother Testing in 2021?

David Berliner and Gene Glass, both dear mentors of mine while I was a PhD student at Arizona State University (ASU) and beyond, are scholar-leaders in the area of educational policy, also with specializations in test-based policies and test uses, misuses, and abuses. Today, via Diane Ravitch’s blog, they published their “thoughts about the value of annual testing in 2021.” Pasted below and also linked to here is their “must read!”

Gene V Glass
David C. Berliner

At a recent Education Writers Association seminar, Jim Blew, an assistant to Betsy DeVos at the Department of Education, opined that the Department is inclined not to grant waivers to states seeking exemptions from the federally mandated annual standardized achievement testing. States like Michigan, Georgia, and South Carolina were seeking a one year moratorium. Blew insisted that “even during a pandemic [tests] serve as an important tool in our education system.” He said that the Department’s “instinct” was to grant no waivers. What system he was referring to and important to whom are two questions we seek to unravel here.

Without question, the “system” of the U.S. Department of Education has a huge stake in enforcing annual achievement testing. It’s not just that the Department’s relationship is at stake with Pearson Education, the U.K. corporation that is the major contractor for state testing, with annual revenues of nearly $5 billion. The Department’s image as a “get tough” defender of high standards is also at stake. Pandemic be damned! We can’t let those weak-kneed blue states get away with covering up the incompetence of those teacher unions.

To whom are the results of these annual testings important? Governors? District superintendents? Teachers?

How the governors feel about the test results depends entirely on where they stand on the political spectrum. Blue state governors praise the findings when they are above the national average, and they call for increased funding when they are below. Red state governors, whose state’s scores are generally below average, insist that the results are a clear call for vouchers and more charter schools – in a word, choice. District administrators and teachers live in fear that they will be blamed for bad scores; and they will.

Fortunately, all the drama and politicking about the annual testing is utterly unnecessary. Last year’s district or even schoolhouse average almost perfectly predicts this year’s average. Give us the average Reading score for Grade Three for any medium or larger size district for the last year and we’ll give you the average for this year within a point or two. So at the very least, testing every year is a waste of time and money – money that might ultimately help cover the salary of executives like John Fallon, Pearson Education CEO, whose total compensation in 2017 was more than $4 million.

But we wouldn’t even need to bother looking up a district’s last year’s test scores to know where their achievement scores are this year. We can accurately predict those scores from data that cost nothing. It is well known and has been for many years – just Google “Karl R. White” 1982 – that a school’s average socio-economic status (SES) is an accurate predictor of its achievement test average. “Accurate” here means a correlation exceeding .80. Even though a school’s racial composition overlaps considerably with the average wealth of the families it serves, adding Race to the prediction equation will improve the prediction of test performance. Together, SES and Race tell us much about what is actually going on in the school lives of children: the years of experience of their teachers; the quality of the teaching materials and equipment; even the condition of the building they attend.

Don’t believe it? Think about this. In a recent year the free and reduced lunch rate (FRL) at the 42 largest high schools in Nebraska was correlated with the school’s average score in Reading, Math, and Science on the Nebraska State Assessments. The correlations obtained were FRL & Reading r = -.93, FRL & Science r = -.94, and FRL & Math r = -.92. Correlation coefficients don’t get higher than 1.00.
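For readers unfamiliar with the statistic, the following is a minimal sketch of how a Pearson correlation like those above is computed. The function name `pearson_r` and the data are hypothetical and made up for illustration; they are not the Nebraska figures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical illustration data (NOT the actual Nebraska schools):
# percent of students on free/reduced lunch, and average reading score
frl_rate = [12, 25, 33, 41, 48, 57, 66, 78]
reading_avg = [92, 88, 85, 80, 79, 72, 70, 61]

r = pearson_r(frl_rate, reading_avg)
print(f"r = {r:.2f}")
```

A strongly negative value means that schools with higher poverty rates tend to have lower average scores, which is the pattern the essay describes.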

If you can know the schools’ test scores from their poverty rate, why give the test?

In fact, Chris Tienken answered that very question in New Jersey. With data on household income, % single parent households, and parent education level in each township, he predicted a township’s rates of scoring “proficient” on the New Jersey state assessment. In Maple Shade Township, 48.71% of the students were predicted to be proficient in Language Arts; the actual proficiency rate was 48.70%. In Mount Arlington township, 61.4% were predicted proficient; 61.5% were actually proficient. And so it went. Demographics may not be destiny for individuals, but when you want a reliable, quick, inexpensive estimate of how a school, township, or district is doing in terms of their achievement scores on a standardized test of achievement, demographics really are destiny, until governments at many levels get serious about addressing the inequities holding back poor and minority schools!
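The general idea of predicting a proficiency rate from demographics can be sketched with an ordinary least-squares line. This is a simplified illustration, not Tienken's model: it uses a single hypothetical predictor (median household income), whereas the work described above used several, and all numbers below are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical townships: median household income (in $1000s)
# and percent of students scoring "proficient"
income = [40, 55, 70, 85, 100, 120]
pct_proficient = [38, 47, 55, 62, 71, 80]

slope, intercept = fit_line(income, pct_proficient)

# Predict the proficiency rate for a township not used in the fit
new_income = 90
predicted = intercept + slope * new_income
print(f"predicted proficiency at ${new_income}k: {predicted:.1f}%")
```

The prediction for the new township comes entirely from census-style data, without administering any test there.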

There is one more point to consider here: a school can more easily “fake” its achievement scores than it can fake its SES and racial composition. Test scores can be artificially raised by paying a test prep company, or giving just a tiny bit more time on the test, looking the other way as students whip out their cell phones during the test, by looking at the test before hand and sharing some “ideas” with students about how they might do better on the tests, or examining the tests after they are given and changing an answer or two here and there. These are not hypothetical examples; they go on all the time.

However, don’t the principals and superintendents need the test data to determine which teachers are teaching well and which ones ought to be fired? That seems logical but it doesn’t work. Our colleague Audrey Amrein Beardsley and her students have addressed this issue in detail on the blog VAMboozled. In just one study, a Houston teacher was compared to other teachers in other schools sixteen different times over four years. Her students’ test scores indicated that she was better than the other teachers 8 times and worse than the others 8 times. So, do achievement tests tell us whether we have identified a great teacher, or a bad teacher? Or do the tests merely reveal who was in that teacher’s class that particular year? Again, the makeup of the class – demographics like social class, ethnicity, and native language – are powerful determiners of test scores.
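As a rough aid to intuition only (this is not an analysis from the study): if each of the 16 comparisons were no more informative than a fair coin toss, an 8-to-8 split would be the single most likely outcome. The coin-toss model and the function name `prob_exact_heads` are assumptions made here for illustration.

```python
from math import comb

def prob_exact_heads(n_tosses, n_heads):
    """Probability of exactly n_heads in n_tosses of a fair coin."""
    return comb(n_tosses, n_heads) / 2 ** n_tosses

# Chance of exactly 8 "better" results in 16 coin-toss comparisons
p_even_split = prob_exact_heads(16, 8)

# Which count of "better" results is most probable under the coin-toss model?
most_likely = max(range(17), key=lambda k: prob_exact_heads(16, k))

print(f"P(8 of 16) = {p_even_split:.3f}; most likely count = {most_likely}")
```

Under this assumed model, an even split has roughly a one-in-five chance, more than any other single outcome.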

But wait. Don’t the teachers need the state standardized test results to know how well their students are learning, what they know and what is still to be learned? Not at all. By Christmas, but certainly by springtime when most of the standardized tests are given, teachers can accurately tell you how their students will rank on those tests. Just ask them! And furthermore, they almost never get the information about their students’ achievement until the fall following the year they had those students in class making the information value of the tests nil!

In a pilot study by our former ASU student Annapurna Ganesh, a dozen 2nd and 3rd grade teachers ranked their children in terms of their likely scores on their upcoming Arizona state tests. Correlations were uniformly high – as high in one class as +.96! In a follow up study, with a larger sample, here are the correlations found for 8 of the third-grade teachers who predicted the ranking of their students on that year’s state of Arizona standardized tests:

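[Table image not reproduced: rank-order correlations between teachers’ predicted rankings and students’ actual rankings on the Arizona state tests, for 8 third-grade teachers]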

In this third grade sample, the lowest rank order coefficient between a teacher’s ranking of the students and the student’s ranking on the state Math or Reading test was +.72! Berliner took these results to the Arizona Department of Education, informing them that they could get the information they wanted about how children are doing in about 10 minutes and for no money! He was told that he was “lying,” and shown out of the office. The abuse must go on. Contracts must be honored.
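A rank-order (Spearman) coefficient of the kind reported above can be computed as in the minimal sketch below. It assumes no tied ranks, and the class of ten students is hypothetical; these are not the data from the Arizona study.

```python
def spearman_rho_no_ties(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same n items.

    Assumes each ranking is a permutation of 1..n (no ties), so the
    shortcut formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.
    """
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two ranked items")
    expected = list(range(1, n + 1))
    if sorted(rank_a) != expected or sorted(rank_b) != expected:
        raise ValueError("each ranking must be a permutation of 1..n with no ties")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical class of ten students:
# the teacher's predicted rank for each student, and the rank of
# that same student's score on the state test
teacher_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_rank = [2, 1, 3, 5, 4, 6, 8, 7, 10, 9]

rho = spearman_rho_no_ties(teacher_rank, test_rank)
print(f"rho = {rho:.2f}")
```

Here the teacher misplaces several students by one position, yet the coefficient stays close to 1, which shows how a high value can coexist with small individual misrankings.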

Predicting rank can’t tell you the national percentile of this child or that, but that information is irrelevant to teachers anyway. Teachers usually know which child is struggling, which is soaring, and what both of them need. That is really the information that they need!

Thus far as we argue against the desire of our federal Department of Education to reinstitute achievement testing in each state, we neglected to mention a test’s most important characteristic: its validity. We mention here, briefly, just one type of validity, content validity. To have content validity students in each state have to be exposed to/taught the curriculum for which the test is appropriate. The US Department of Education seems not to have noticed that since March 2020 public schooling has been in a bit of an upheaval! The assumption that each district, in each state, has provided equal access to the curriculum on which a state’s test is based is fraught under normal circumstances. In a pandemic it is a remarkably stupid assumption! We assert that no state achievement test will be content valid if given in the 2020-2021 school year. Furthermore, those who help in administering and analyzing such tests are likely in violation of the testing standards of the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education. In addition to our other concerns with state standardized tests, there is no defensible use of an invalid test. Period.

We are not opposed to all testing, just to stupid testing. The National Assessment Governing Board voted 12 to 10 in favor of administering NAEP in 2021. There is some sense to doing so. NAEP tests fewer than 1 in 1,000 students in grades 4, 8, and 12. As a valid longitudinal measure, the results could tell us the extent of the devastation of the coronavirus.

We end this essay with some good news. The DeVos Department of Education position on Spring 2021 testing is likely to be utterly irrelevant. She and assistant Blew are likely to be watching the operation of the Department of Education from the sidelines after January 21, 2021. We can only hope that members of a new administration read this and understand that some of the desperately needed money for American public schools can come from the huge federal budget for standardized testing. Because in seeking the answer to the question “Why bother testing in 2021?” we have necessarily confronted the more important question: “Why ever bother to administer these mandated tests?”

We hasten to add that we are not alone in this opinion. Among measurement experts competent to opine on such things, our colleagues at the National Education Policy Center likewise question the wisdom of a 2021 federal government mandated testing.

Special Issue: Moving Teacher Evaluation Forward

At the beginning of this week, in the esteemed, online, open-access, and peer-reviewed journal of which I am the Lead Editor, Education Policy Analysis Archives, a special issue for which I also served as the Guest Editor was published. The special issue is about Policies and Practices of Promise in Teacher Evaluation and, more specifically, about how after the federal passage of the Every Student Succeeds Act (ESSA) in 2015, state leaders have (or have not) changed their teacher evaluation systems, potentially for the better. Changing for the better is defined throughout this special issue as aligning with the theoretical and empirical research currently available in the literature base surrounding contemporary teacher evaluation systems, as well as the theoretical and empirical research that is presented in the ten pieces included in this special issue.

The pieces include: one introduction, a set of two peer-reviewed theoretical commentaries, and seven empirical articles, via which authors present or discuss teacher evaluation policies and practices that may help us move (hopefully, well) beyond high-stakes teacher evaluation systems, especially as solely or primarily based on teachers’ impacts on growing their students’ standardized test scores over time (e.g., via the use of SGMs or VAMs). Below are all of the articles included.

Happy Reading!

  1. Policies and Practices of Promise in Teacher Evaluation: The Introduction to the Special Issue. Authored by Audrey Amrein-Beardsley (Note: this piece was editorially reviewed by journal leaders other than myself prior to publication).
  2. Teacher Accountability, Datafication and Evaluation: A Case for Reimagining Schooling. A Commentary Authored by Jessica Holloway.
  3. Excavating Theory in Teacher Evaluation: Implementing Evaluation Frameworks as Wengerian Boundary Objects. A Commentary Authored by Kelley M. King and Noelle A. Paufler.
  4. Putting Teacher Evaluation Systems on the Map: An Overview of States’ Teacher Evaluation Systems Post–Every Student Succeeds Act. Authored by Kevin Close, Audrey Amrein-Beardsley, and Clarin Collins. (Note: as one of the authors of this piece, I had no involvement in the peer-review process whatsoever).
  5. How Middle School Special and General Educators Make Sense of and Respond to Changes in Teacher Evaluation Policy. Authored by Alisha M. B. Braun and Peter Youngs.
  6. Entangled Educator Evaluation Apparatuses: Contextual Influences on New Policies. Authored by Jake Malloy (University of Wisconsin – Madison).
  7. Improving Instructional Practice Through Peer Observation and Feedback: A Review of the Literature. Authored by Brady L. Ridge (Utah State University) and Alyson L. Lavigne.
  8. Using Global Observation Protocols to Inform Research on Teaching Effectiveness and School Improvement: Strengths and Emerging Limitations. Authored by Sean Kelly, Robert Bringe (University of North Carolina-Chapel Hill), Esteban Aucejo, and Jane Cooley Fruehwirth.
  9. Better Integrating Summative and Formative Goals in the Design of Next Generation Teacher Evaluation Systems. Authored by Timothy G. Ford and Kim Kappler Hewitt.
  10. Moving Forward While Looking Back: Lessons Learned from Teacher Evaluation Litigation Concerning Value-Added Models (VAMs). Authored by Mark Paige.

NCTQ on States’ Teacher Evaluation Systems’ Failures, Again

In February of 2017, the controversial National Council on Teacher Quality (NCTQ) — created by the conservative Thomas B. Fordham Institute and funded (in part) by the Bill & Melinda Gates Foundation as “part of a coalition for ‘a better orchestrated agenda’ for accountability, choice, and using test scores to drive the evaluation of teachers” (see here) — issued a report about states’ teacher evaluation systems titled: “Running in Place: How New Teacher Evaluations Fail to Live Up to Promises.” See another blog post about a similar (and also slanted) study NCTQ conducted two years prior here. The NCTQ recently published another — a “State of the States: Teacher & Principal Evaluation Policy.” Like I did in those two prior posts, I summarize this report, only as per their teacher evaluation policy findings and assertions, below.

  • In 2009, only 15 states required objective measures of student growth (e.g., VAMs) in teacher evaluations; by 2015 this number increased nearly threefold to 43 states. However, as swiftly as states moved to make these changes, many of them have made a hasty retreat. Now there are 34 states requiring such measures. These modifications to these nine states’ evaluation systems are “poorly supported by research literature” which, of course, is untrue. Of note, as well, is that there is no literature cited to support this very statement.
  • For an interesting and somewhat interactive chart capturing what states are doing in the areas of their teacher and principal evaluation systems, however, you might want to look at NCTQ’s Figure 3 (again, within the full report, here). Not surprisingly, NCTQ subtotals these indicators by state and essentially categorizes states by the extent to which they have retreated from such “research-backed policies.”
    • You can also explore states’ laws, rules, and regulations, that range from data about teacher preparation, licensing, and evaluation to data about teacher compensation, professional development, and dismissal policies via NCTQ’s State Teacher Policy Database here.
  • Do states use data from state standardized tests to evaluate their teachers? See the (promising) evidence of states backing away from research-backed policies here (as per NCTQ’s Figure 5).
  • Also of interest is the number of states in which student surveys are being used to evaluate teachers, which is something reportedly trending across states, but perhaps not so much as currently thought (as per NCTQ’s Figure 9).
  • The NCTQ also backs the “research-backed benefits” of using such surveys, primarily (and again not surprisingly) in that they correlate (albeit at very weak-to-weak magnitudes) with the more objective measures (e.g., VAMs) still being pushed by the NCTQ. The NCTQ also, entirely, overlooks the necessary conditions required to make the data derived from student surveys, as well as their use, “reliable and valid” as over-simplistically claimed.

The rest of the report includes the NCTQ’s findings and assertions regarding states’ principal evaluation systems. If these are of interest, please scroll to the lower part of the document, again, available here.

Citation: Ross, E. & Walsh, K. (2019). State of the States 2019: Teacher and Principal Evaluation Policy. Washington, DC: National Council on Teacher Quality (NCTQ).

Racing to Nowhere: Ways Teacher Evaluation Reform Devalues Teachers

In a recent blog (see here), I posted about a teacher evaluation brief written by Alyson Lavigne and Thomas Good for Division 15 of the American Psychological Association (see here). There, Lavigne and Good voiced their concerns about inadequate teacher evaluation practices that did not help teachers improve instruction, and they described in detail the weaknesses of testing and observation practices used in current teacher evaluation practices.

In their book, Enhancing Teacher Education, Development, and Evaluation, they discuss other factors which diminish the value of teachers and teaching. They note that for decades various federal documents, special commissions, summits, and foundation reports have periodically and blatantly asserted (with limited or no evidence) that American schools and teachers are tragically flawed; at times the finger has even been pointed at our students (e.g., A Nation at Risk chided students for their poor effort and performance). These reports, ranging from the Sputnik fear to the Race to the Top crisis, have pointed to an immediate and dangerous crisis. The cause of the crisis: our inadequate schools that place America at scientific, military, or economic peril.

Given the plethora of media reports that follow these pronouncements of school crises (and pending doom), citizens are taught at least implicitly that schools are a mess, but the solutions are easy…if only teachers worked hard enough. Thus, when reforms fail, many policy makers scapegoat teachers as inadequate or uncaring. Lavigne and Good contend that these sweeping reforms (and their failures) reinforce the notion that teachers are inadequate. As the authors note, most teachers do an excellent job in supporting student growth and should be recognized for this accomplishment. In contrast, and unfortunately, teachers are scapegoated for conditions (e.g., poverty) that they cannot control.

They reiterate (and effectively emphasize) that an unexplored collateral damage (beyond the enormous cost and wasted resources of teachers and administrators) is the impact that sweeping and failed reform has upon citizens’ willingness to invest in public education. Policy makers and the media must recognize that teachers are competent and hard working and accomplish much despite the inadequate conditions in which they work.

Read more here: Enhancing Teacher Education, Development, and Evaluation

Teacher Evaluation Recommendations Endorsed by the Educational Psychology Division of the American Psychological Association (APA)

Recently, the Educational Psychology Division of the American Psychological Association (APA) endorsed a set of recommendations, captured within a research brief for policymakers, pertaining to best practices when evaluating teachers. The brief, which can be accessed here, was authored by Alyson Lavigne, Assistant Professor at Utah State, and Tom Good, Professor Emeritus at the University of Arizona.

In general, they recommend that states’/districts’ teacher evaluation efforts emphasize improving teaching in informed and formative ways versus categorizing and stratifying teachers in terms of their effectiveness in outcome-based and summative ways. As per recent evidence (see, for example, here), post the passage of the Every Student Succeeds Act (ESSA) in 2015, it seems states and districts are already heading in this direction.

Otherwise, they note that prior emphases on using teachers’ students’ test scores via, for example, the use of value-added models (VAMs) to hold teachers accountable for their effects on student achievement and simultaneously using observational systems (the two most common teacher evaluation measures of teacher evaluation’s recent past) is “problematic and [has] not improved student achievement” as a result of states’ and districts’ past efforts in these regards. Both teacher evaluation measures “fail to recognize the complexity of teaching or how to measure it.”

More specifically in terms of VAMs: (1) VAM scores do not adequately compare teachers given the varying contexts in which teachers teach and the varying factors that influence teaching and student learning; (2) Teacher effectiveness often varies over time making it difficult to achieve appropriate reliability (i.e., consistency) to justify VAM use, especially for high-stakes decision-making purposes; (3) VAMs can only attempt to capture effects for approximately 30% of all teachers, raising serious issues with fairness and uniformity; (4) VAM scores do not help teachers improve their instruction, also in that often teachers and their administrators do not have access, have late access, or simply do not understand their VAM-based data well enough to use them in formative ways; and (5) Using VAMs discourages collegial exchange and sharing of ideas and resources.

More specifically in terms of observations: (1) Given classroom teaching is so complex, dynamic, and contextual, these measures are problematic given no systems that are currently available capture all aspects of good teaching; (2) Observing and providing teachers with feedback warrants significant time, attention, and resources but oft-receives little in all regards; (3) Principals have still not been prepared well enough to observe or provide useful feedback to teachers; and (4) The common practice of three formal observations/year/teacher does not adequately account for the fact that teacher practice and performance vary over time, across subject areas and students, and the like. I would add here a (5) in that these observational systems have also been evidenced as biased in that, for example, teachers representing certain racial and ethnic backgrounds might be more likely than others to receive lower observational scores (see prior posts on these studies here, here and here).

In consideration of the above, what they recommend in terms of moving teacher evaluation systems forward follows:

  • Eliminate high-stakes teacher evaluations based only on student achievement data and especially limited observations (all should consider if and how additional observers, beyond just principals, might be leveraged);
  • Provide opportunities for teachers to be heard, for example, in terms of when and how they might be evaluated and to what ends;
  • Improve teacher evaluation systems in fundamental ways using technology, collaboration, and other innovations to transform teaching practice;
  • Emphasize formative feedback within and across teacher evaluation systems in that “improving instruction should be at least as important as evaluating instruction.”

You can see more of their criticisms of the current and recommendations for the future, again, in the full report here.

New Mexico Lawsuit: Final Update

In December 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence was to be presented was set, and re-set, and re-set again, never to actually occur. More specifically, after December 2015 and through 2018, multiple depositions and hearings occurred. In April 2019, the case was reassigned to a new judge (via a mass reassignment state policy), again, while the injunction was still in place.

Thereafter, teacher evaluation was a hot policy issue during the state’s 2018 gubernatorial election. The now-prior state governor, Republican Susana Martinez, who essentially ordered and helped shape the state’s teacher evaluation system at issue during this lawsuit, had reached the maximum number of terms served and could not run again. All candidates running to replace her had grave concerns about the state’s teacher evaluation system. Democrat Michelle Lujan Grisham ended up winning.

Two days after Grisham was sworn in, she signed an Executive Order for the entire state system to be amended, including no longer using value-added data to evaluate teachers. Her Executive Order also stipulated that the state department was to work with teachers, administrators, parents, students, and the like, to determine more appropriate methods of measuring teacher effectiveness. While the education task force charged with this task is still in the process of finalizing the state’s new system, it is important to note that now, although actually beginning in the 2018-2019 school year, teachers are being evaluated via (primarily) classroom observations and student/family surveys. The value-added component (and a teacher attendance component that was also the source of contention during this lawsuit) were removed entirely from the state’s teacher evaluation framework.

Likewise, the plaintiffs (the lawyers, teachers, and administrators with whom I worked on this case) are no longer moving forward with the 2015 lawsuit, as Grisham’s Executive Order also rendered this lawsuit moot.

The initial victory that we achieved in 2015 ultimately yielded a victory in the end. Way to go New Mexico!

*This post was co-authored by one of my PhD students – Tray Geiger – who is finishing up his dissertation about this case.

Teachers “Grow” Their Students’ Heights As Much As Their Achievement

The title of this post captures the key findings of a study that has come across my desk now over 25 times during the past two weeks; hence, I decided to summarize and share out, also as significant to our collective understandings about value-added models (VAMs).

The study — “Teacher Effects on Student Achievement and Height: A Cautionary Tale” — was recently published by the National Bureau of Economic Research (NBER) (Note 1) and authored by Marianne Bitler (Professor of Economics at the University of California, Davis), Sean Corcoran (Associate Professor of Public Policy and Education at Vanderbilt University), Thurston Domina (Professor at the University of North Carolina at Chapel Hill), and Emily Penner (Assistant Professor at the University of California, Irvine).

In short, study researchers used administrative data from New York City Public Schools to estimate the “value” teachers “add” to student achievement, and (also in comparison) to student height. The assumption herein, of course, is that teachers cannot plausibly or literally “grow” their students’ heights. If they were found to do so using a VAM (also oft-referred to as “growth” models, hereafter referred to more generally as VAMs), this would threaten the overall validity of the output derived via any such VAM, given VAMs’ sole purposes are to measure teacher effects on “growth” in student achievement and only student achievement over time. Put differently, if a VAM was found to “grow” students’ height, this would ultimately negate the validity of any such VAM given the very purposes for which VAMs have been adopted, implemented, and used, misused, and abused across states, especially over the last decade.

Notwithstanding, study researchers found that “the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement” (Abstract). More specifically, they found that the “estimated teacher ‘effects’ on height [were] comparable in magnitude to actual teacher effects on math and ELA achievement, 0.22 [standard deviations] compared to 0.29 [standard deviations] and 0.26 [standard deviations], respectively” (p. 24).

Put differently, teacher effects, as measured by a commonly used VAM, were about the same in terms of the extent to which teachers “added value” to their students’ growth in achievement over time and their students’ physical heights. Clearly, this raises serious questions about the overall validity of this VAM (and perhaps all VAMs) in terms of not only what they are intended to do, but also what they did (at least in this study). To yield such spurious results (i.e., results that are nonsensical and more likely due to noise than anything else) threatens the overall validity of the output derived via these models, as well as the extent to which their output can or should be trusted. This is clearly an issue with validity, or rather the validity of the inferences to be drawn from this (and perhaps/likely any other) VAM.
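To make the point about noise concrete, below is a minimal sketch in Python. It is not the model or the data used in the study; it is a toy simulation under stated assumptions: every teacher's true effect on height is exactly zero, student outcomes are standardized (mean 0, standard deviation 1), and a teacher's "effect" is naively estimated as his/her class average. The function name, the number of teachers, and the class size are all hypothetical choices made for illustration only.

```python
import random
import statistics


def simulate_teacher_effect_sd(n_teachers=200, class_size=25, seed=0):
    """Spread of naive teacher 'effects' when the true effect is zero.

    Hypothetical toy setup (not the study's VAM): each student's
    standardized height gain is pure noise, and each teacher's
    'effect' is estimated as the mean gain in his/her class.
    """
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_teachers):
        # No teacher contributes anything here: gains are noise only.
        gains = [rng.gauss(0.0, 1.0) for _ in range(class_size)]
        class_means.append(statistics.fmean(gains))
    # Standard deviation of the estimated teacher 'effects'.
    return statistics.pstdev(class_means)


sd = simulate_teacher_effect_sd()
sd_large_classes = simulate_teacher_effect_sd(class_size=100)
print(f"SD of estimated 'effects' (class size 25):  {sd:.3f}")
print(f"SD of estimated 'effects' (class size 100): {sd_large_classes:.3f}")
```

Even though no teacher in this sketch affects height at all, the estimated "effects" still vary across teachers, by roughly 1 divided by the square root of the class size. The spread shrinks as classes grow, which is what one would expect if the "effects" are sampling noise rather than anything teachers actually did.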

Ultimately, the authors conclude that the findings from their paper should “serve as a cautionary tale” for the use of VAMs in practice. With all due respect to my colleagues, in my opinion their findings are much more serious than those that might merely warrant caution. Only one other study of which I am aware (Note 2), as akin to the study conducted here, could be as damning to the validity of VAMs and their too often “naïve application[s]” (p. 24).

Citation: Bitler, M., Corcoran, S., Domina, T., & Penner, E. (2019, November). Teacher effects on student achievement and height: A cautionary tale. National Bureau of Economic Research (NBER) Working Paper No. 26480. Retrieved from

Note 1: As I have oft-commented in prior posts about papers published by the NBER, it is important to note that NBER papers such as these (i.e., “working papers”) have not been internally reviewed (e.g., by NBER Board Directors), nor have they been peer-reviewed or vetted. Rather, such “working papers” are widely circulated for discussion and comment, prior to what the academy of education would consider appropriate vetting. While listed in the front matter of this piece are highly respected scholars who helped critique and likely improve this paper, this is not the same as putting any such piece through a double-blind, peer-reviewed process. Hence, caution is also warranted here when interpreting study results.

Note 2: Rothstein (2009, 2010) conducted a falsification test by which he tested, also counter-intuitively, whether a teacher in the future could cause, or have an impact on his/her students’ levels of achievement in the past. Rothstein demonstrated that given non-random student placement (and tracking) practices, VAM-based estimates of future teachers could be used to predict students’ past levels of achievement. More generally, Rothstein demonstrated that both typical and complex VAMs demonstrated counterfactual effects and did not mitigate bias because students are consistently and systematically grouped in ways that explicitly bias value-added estimates. Otherwise, the backwards predictions Rothstein demonstrated could not have been made.

Citations: Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537-571. doi:

Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. Quarterly Journal of Economics, 125(1), 175-214. doi:10.1162/qjec.2010.125.1.175

Mapping America’s Teacher Evaluation Plans Under ESSA

One of my doctoral students (Kevin Close), one of my former doctoral students (Clarin Collins), and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share out with all of you, especially before the study is no longer open-access or free (see the full article as currently available here). As the title of this post (which is the same as the title of the article) indicates, the study is about research the three of us conducted, by surveying every state (or interviewing leaders at every state’s department of education), about how each state changed its teacher evaluation system after the passage of the Every Student Succeeds Act (ESSA).

In short, we found that states have reduced their use of growth or value-added models (VAMs) within their teacher evaluation systems. In addition, states that are still using such models are using them in much less consequential ways, while many states are offering more alternatives for measuring the relationships between student achievement and teacher effectiveness. Additionally, state teacher evaluation plans now contain more language supporting formative teacher feedback (i.e., a noteworthy change from states’ prior summative and oft-highly consequential teacher evaluation systems). State departments of education also seem to be allowing districts to develop and implement more flexible teacher evaluation systems, while simultaneously acknowledging the challenges of supporting increased local control, especially when varied local systems make it difficult to compare data across schools and districts.

Again, you can read more here. See also the longer version of this study, if interested, here.

Litigating Algorithms, Beyond Education

This past June, I presented at a conference at New York University (NYU) called Litigating Algorithms. Most attendees were lawyers, law students, and the like, all of whom were there to discuss the multiple ways that they have collectively and independently been challenging governmental uses of algorithm-based, decision-making systems (e.g., VAMs) across disciplines. I was there to present about how VAMs have been used by states and school districts in education, as well as to present the key issues with VAMs as litigated via the lawsuits in which I have been engaged (e.g., Houston, New Mexico, New York, Tennessee, and Texas). The conference was sponsored by the AI Now Institute, also at NYU, which has as its mission to examine the social implications of artificial intelligence (AI), in collaboration with the Center on Race, Inequality, and the Law, affiliated with the NYU School of Law.

Anyhow, they just released their report from this conference and I thought it important to share out with all of you, also in that it details the extent to which similar AI systems are being used across disciplines beyond education, and it details how such uses (misuses and abuses) are being litigated in court.

See the press release below, and see the full report here.


Litigating Algorithms 2019 U.S. Report – New Challenges to Government Use of Algorithmic Decision Systems

Today the AI Now Institute and NYU Law’s Center on Race, Inequality, and the Law published new research on the ways litigation is being used as a tool to hold government accountable for using algorithmic tools that produce harmful results.

Algorithmic decision systems (ADS) are often sold as offering a number of benefits, from mitigating human bias and error, to cutting costs and increasing efficiency, accuracy, and reliability. Yet proof of these advantages is rarely offered, even as evidence of harm increases. Within health care, criminal justice, education, employment, and other areas, the implementation of these technologies has resulted in numerous problems with profound effects on millions of people’s lives.

More than 19,000 Michigan residents were incorrectly disqualified from food-assistance benefits by an errant ADS. A similar system automatically and arbitrarily cut Oregonians’ disability benefits. And an ADS falsely labeled 40,000 workers in Michigan as having committed unemployment fraud. These are a handful of examples that make clear the profound human consequences of the use of ADS, and the urgent need for accountability and validation mechanisms. 

In recent years, litigation has become a valuable tool for understanding the concrete and real impacts of flawed ADS and holding government accountable when it harms us. 

The Report picks up where our 2018 report left off, revisiting the first wave of U.S. lawsuits brought against government use of ADS, and examining what progress, if any, has been made.  We also explore a new wave of legal challenges that raise significant questions, including:

  1. What access, if any, criminal defense attorneys should have to law enforcement ADS in order to challenge allegations leveled by the prosecution; 
  2. The profound human consequences of erroneous or vindictive uses of governmental ADS; and 
  3. The evolution of the Illinois Biometric Information Privacy Act, America’s most powerful biometric privacy law, and what its potential impact on ADS accountability might be. 

This report offers concrete insights from actual cases involving plaintiffs and lawyers seeking justice in the face of harmful ADS. These cases illuminate many ways that ADS are perpetuating concrete harms, and the ways ADS companies are pushing against accountability and transparency.

The report also outlines several recommendations for advocates and other stakeholders interested in using litigation as a tool to hold government accountable for its use of ADS.

Citation: Richardson, R., Schultz, J. M., & Southerland, V. M. (2019). Litigating algorithms 2019 US report: New challenges to government use of algorithmic decision systems. New York, NY: AI Now Institute. Retrieved from