Special Issue: Moving Teacher Evaluation Forward

At the beginning of this week, in the esteemed, online, open-access, and peer-reviewed journal of which I am the Lead Editor, Education Policy Analysis Archives, a special issue for which I also served as the Guest Editor was published. The special issue is about Policies and Practices of Promise in Teacher Evaluation and, more specifically, about how after the federal passage of the Every Student Succeeds Act (ESSA) in 2015, state leaders have (or have not) changed their teacher evaluation systems, potentially for the better. Changing for the better is defined throughout this special issue as aligning with the theoretical and empirical research currently available in the literature base surrounding contemporary teacher evaluation systems, as well as the theoretical and empirical research that is presented in the ten pieces included in this special issue.

The pieces include: one introduction, a set of two peer-reviewed theoretical commentaries, and seven empirical articles, via which authors present or discuss teacher evaluation policies and practices that may help us move (hopefully, well) beyond high-stakes teacher evaluation systems, especially as solely or primarily based on teachers’ impacts on growing their students’ standardized test scores over time (e.g., via the use of SGMs or VAMs). Below are all of the articles included.

Happy Reading!

  1. Policies and Practices of Promise in Teacher Evaluation: The Introduction to the Special Issue. Authored by Audrey Amrein-Beardsley (Note: this piece was editorially reviewed by journal leaders other than myself prior to publication).
  2. Teacher Accountability, Datafication and Evaluation: A Case for Reimagining Schooling. A Commentary Authored by Jessica Holloway.
  3. Excavating Theory in Teacher Evaluation: Implementing Evaluation Frameworks as Wengerian Boundary Objects. A Commentary Authored by Kelley M. King and Noelle A. Paufler.
  4. Putting Teacher Evaluation Systems on the Map: An Overview of States’ Teacher Evaluation Systems Post–Every Student Succeeds Act. Authored by Kevin Close, Audrey Amrein-Beardsley, and Clarin Collins. (Note: as one of the authors of this piece, I had no involvement in the peer-review process whatsoever).
  5. How Middle School Special and General Educators Make Sense of and Respond to Changes in Teacher Evaluation Policy. Authored by Alisha M. B. Braun and Peter Youngs.
  6. Entangled Educator Evaluation Apparatuses: Contextual Influences on New Policies. Authored by Jake Malloy (University of Wisconsin – Madison).
  7. Improving Instructional Practice Through Peer Observation and Feedback: A Review of the Literature. Authored by Brady L. Ridge (Utah State University) and Alyson L. Lavigne.
  8. Using Global Observation Protocols to Inform Research on Teaching Effectiveness and School Improvement: Strengths and Emerging Limitations. Authored by Sean Kelly, Robert Bringe (University of North Carolina-Chapel Hill), Esteban Aucejo, and Jane Cooley Fruehwirth.
  9. Better Integrating Summative and Formative Goals in the Design of Next Generation Teacher Evaluation Systems. Authored by Timothy G. Ford and Kim Kappler Hewitt.
  10. Moving Forward While Looking Back: Lessons Learned from Teacher Evaluation Litigation Concerning Value-Added Models (VAMs). Authored by Mark Paige.

NCTQ on States’ Teacher Evaluation Systems’ Failures, Again

In February of 2017, the controversial National Council on Teacher Quality (NCTQ), created by the conservative Thomas B. Fordham Institute and funded (in part) by the Bill & Melinda Gates Foundation as “part of a coalition for ‘a better orchestrated agenda’ for accountability, choice, and using test scores to drive the evaluation of teachers” (see here), issued a report about states’ teacher evaluation systems titled: “Running in Place: How New Teacher Evaluations Fail to Live Up to Promises.” See another blog post about a similar (and also slanted) study NCTQ conducted two years prior here. The NCTQ recently published another: a “State of the States: Teacher & Principal Evaluation Policy.” As I did in those two prior posts, I summarize this report, only as per their teacher evaluation policy findings and assertions, below.

  • In 2009, only 15 states required objective measures of student growth (e.g., VAMs) in teacher evaluations; by 2015 this number increased nearly threefold to 43 states. However, as swiftly as states moved to make these changes, many of them have made a hasty retreat. Now there are 34 states requiring such measures. According to the NCTQ, the modifications to these nine states’ evaluation systems are “poorly supported by research literature,” which, of course, is untrue. Of note, as well, is that there is no literature cited to support this very statement.
  • For an interesting and somewhat interactive chart capturing what states are doing in the areas of their teacher and principal evaluation systems, however, you might want to look at NCTQ’s Figure 3 (again, within the full report, here). Not surprisingly, NCTQ subtotals these indicators by state and essentially categorizes states by the extent to which they have retreated from such “research-backed policies.”
    • You can also explore states’ laws, rules, and regulations, that range from data about teacher preparation, licensing, and evaluation to data about teacher compensation, professional development, and dismissal policies via NCTQ’s State Teacher Policy Database here.
  • Do states use data from state standardized tests to evaluate their teachers? See the (promising) evidence of states backing away from research-backed policies here (as per NCTQ’s Figure 5).
  • Also of interest is the number of states in which student surveys are being used to evaluate teachers, which is something reportedly trending across states, but perhaps not so much as currently thought (as per NCTQ’s Figure 9).
  • The NCTQ also backs the “research-backed benefits” of using such surveys, primarily (and again not surprisingly) in that they correlate (albeit at very weak-to-weak magnitudes) with the more objective measures (e.g., VAMs) still being pushed by the NCTQ. The NCTQ also, entirely, overlooks the necessary conditions required to make the data derived from student surveys, as well as their use, “reliable and valid” as over-simplistically claimed.

The rest of the report includes the NCTQ’s findings and assertions regarding states’ principal evaluation systems. If these are of interest, please scroll to the lower part of the document, again, available here.

Citation: Ross, E. & Walsh, K. (2019). State of the States 2019: Teacher and Principal Evaluation Policy. Washington, DC: National Council on Teacher Quality (NCTQ).

Racing to Nowhere: Ways Teacher Evaluation Reform Devalues Teachers

In a recent blog (see here), I posted about a teacher evaluation brief written by Alyson Lavigne and Thomas Good for Division 15 of the American Psychological Association (see here). There, Lavigne and Good voiced their concerns about inadequate teacher evaluation practices that did not help teachers improve instruction, and they described in detail the weaknesses of testing and observation practices used in current teacher evaluation practices.

In their book, Enhancing Teacher Education, Development, and Evaluation, they discuss other factors which diminish the value of teachers and teaching. They note that for decades various federal documents, special commissions, summits, and foundation reports have periodically asserted, blatantly and with limited or no evidence, that American schools and teachers are tragically flawed; at times the finger has even been pointed at our students (e.g., A Nation at Risk chided students for their poor effort and performance). These reports, ranging from the Sputnik fear to the Race to the Top crisis, have pointed to an immediate and dangerous crisis. The cause of the crisis: our inadequate schools, which place America at scientific, military, or economic peril.

Given the plethora of media reports that follow these pronouncements of school crises (and pending doom), citizens are taught at least implicitly that schools are a mess, but the solutions are easy…if only teachers worked hard enough. Thus, when reforms fail, many policy makers scapegoat teachers as inadequate or uncaring. Lavigne and Good contend that these sweeping reforms (and their failures) reinforce the notion that teachers are inadequate. As the authors note, most teachers do an excellent job in supporting student growth and should be recognized for this accomplishment. In contrast, and unfortunately, teachers are scapegoated for conditions (e.g., poverty) that they cannot control.

They reiterate (and effectively emphasize) that an unexplored form of collateral damage (beyond the enormous cost and wasted resources of teachers and administrators) is the impact that sweeping and failed reform has upon citizens’ willingness to invest in public education. Policy makers and the media must recognize that teachers are competent and hardworking and accomplish much despite the inadequate conditions in which they work.

Read more here: Enhancing Teacher Education, Development, and Evaluation

Teacher Evaluation Recommendations Endorsed by the Educational Psychology Division of the American Psychological Association (APA)

Recently, the Educational Psychology Division of the American Psychological Association (APA) endorsed a set of recommendations, captured within a research brief for policymakers, pertaining to best practices when evaluating teachers. The brief, which can be accessed here, was authored by Alyson Lavigne, Assistant Professor at Utah State, and Tom Good, Professor Emeritus at the University of Arizona.

In general, they recommend that states’/districts’ teacher evaluation efforts emphasize improving teaching in informed and formative ways versus categorizing and stratifying teachers in terms of their effectiveness in outcome-based and summative ways. As per recent evidence (see, for example, here), post the passage of the Every Student Succeeds Act (ESSA) in 2015, it seems states and districts are already heading in this direction.

Otherwise, they note that prior emphases on using teachers’ students’ test scores via, for example, the use of value-added models (VAMs) to hold teachers accountable for their effects on student achievement and simultaneously using observational systems (the two most common teacher evaluation measures of teacher evaluation’s recent past) is “problematic and [has] not improved student achievement” as a result of states’ and districts’ past efforts in these regards. Both teacher evaluation measures “fail to recognize the complexity of teaching or how to measure it.”

More specifically in terms of VAMs: (1) VAM scores do not adequately compare teachers given the varying contexts in which teachers teach and the varying factors that influence teaching and student learning; (2) Teacher effectiveness often varies over time making it difficult to achieve appropriate reliability (i.e., consistency) to justify VAM use, especially for high-stakes decision-making purposes; (3) VAMs can only attempt to capture effects for approximately 30% of all teachers, raising serious issues with fairness and uniformity; (4) VAM scores do not help teachers improve their instruction, also in that often teachers and their administrators do not have access, have late access, or simply do not understand their VAM-based data well enough to use them in formative ways; and (5) Using VAMs discourages collegial exchange and sharing of ideas and resources.

More specifically in terms of observations: (1) Given classroom teaching is so complex, dynamic, and contextual, these measures are problematic given no systems that are currently available capture all aspects of good teaching; (2) Observing and providing teachers with feedback warrants significant time, attention, and resources but often receives little in all regards; (3) Principals have still not been prepared well enough to observe or provide useful feedback to teachers; and (4) The common practice of three formal observations/year/teacher does not adequately account for the fact that teacher practice and performance vary over time, across subject areas and students, and the like. I would add here a (5) in that these observational systems have also been evidenced as biased in that, for example, teachers representing certain racial and ethnic backgrounds might be more likely than others to receive lower observational scores (see prior posts on these studies here, here and here).

In consideration of the above, what they recommend in terms of moving teacher evaluation systems forward follows:

  • Eliminate high-stakes teacher evaluations based only on student achievement data and especially limited observations (all should consider if and how additional observers, beyond just principals, might be leveraged);
  • Provide opportunities for teachers to be heard, for example, in terms of when and how they might be evaluated and to what ends;
  • Improve teacher evaluation systems in fundamental ways using technology, collaboration, and other innovations to transform teaching practice;
  • Emphasize formative feedback within and across teacher evaluation systems in that “improving instruction should be at least as important as evaluating instruction.”

You can see more of their criticisms of current systems and recommendations for the future, again, in the full report here.

New Mexico Lawsuit: Final Update

In December 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence was to be presented was set, and re-set, and re-set again, never to actually occur. More specifically, after December 2015 and through 2018, multiple depositions and hearings occurred. In April 2019, the case was reassigned to a new judge (via a mass reassignment state policy), again, while the injunction was still in place.

Meanwhile, teacher evaluation was a hot policy issue during the state’s 2018 gubernatorial election. The now-prior state governor, Republican Susana Martinez, who essentially ordered and helped shape the state’s teacher evaluation system at issue during this lawsuit, had reached the maximum number of terms served and could not run again. All candidates running to replace her had grave concerns about the state’s teacher evaluation system. Democrat Michelle Lujan Grisham ended up winning.

Two days after Grisham was sworn in, she signed an Executive Order for the entire state system to be amended, including no longer using value-added data to evaluate teachers. Her Executive Order also stipulated that the state department was to work with teachers, administrators, parents, students, and the like, to determine more appropriate methods of measuring teacher effectiveness. While the education task force charged with this task is still in the process of finalizing the state’s new system, it is important to note that now, although actually beginning in the 2018-2019 school year, teachers are being evaluated via (primarily) classroom observations and student/family surveys. The value-added component (and a teacher attendance component that was also the source of contention during this lawsuit) were removed entirely from the state’s teacher evaluation framework.

Likewise, the plaintiffs (the lawyers, teachers, and administrators with whom I worked on this case) are no longer moving forward with the 2015 lawsuit as Grisham’s Executive Order also rendered this lawsuit moot.

The initial victory that we achieved in 2015 ultimately yielded a victory in the end. Way to go New Mexico!

*This post was co-authored by one of my PhD students – Tray Geiger – who is finishing up his dissertation about this case.

Teachers “Grow” Their Students’ Heights As Much As Their Achievement

The title of this post captures the key findings of a study that has come across my desk now over 25 times during the past two weeks; hence, I decided to summarize and share out, also as significant to our collective understandings about value-added models (VAMs).

The study — “Teacher Effects on Student Achievement and Height: A Cautionary Tale” — was recently published by the National Bureau of Economic Research (NBER) (Note 1) and authored by Marianne Bitler (Professor of Economics at the University of California, Davis), Sean Corcoran (Associate Professor of Public Policy and Education at Vanderbilt University), Thurston Domina (Professor at the University of North Carolina at Chapel Hill), and Emily Penner (Assistant Professor at the University of California, Irvine).

In short, study researchers used administrative data from New York City Public Schools to estimate the “value” teachers “add” to student achievement, and (also in comparison) to student height. The assumption herein, of course, is that teachers cannot plausibly or literally “grow” their students’ heights. If they were found to do so using a VAM (also oft-referred to as “growth” models, hereafter referred to more generally as VAMs), this would threaten the overall validity of the output derived via any such VAM, given VAMs’ sole purposes are to measure teacher effects on “growth” in student achievement and only student achievement over time. Put differently, if a VAM was found to “grow” students’ height, this would ultimately negate the validity of any such VAM given the very purposes for which VAMs have been adopted, implemented, and used, misused, and abused across states, especially over the last decade.

Notwithstanding, study researchers found that “the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement” (Abstract). More specifically, they found that the “estimated teacher ‘effects’ on height [were] comparable in magnitude to actual teacher effects on math and ELA achievement, 0.22 [standard deviations] compared to 0.29 [standard deviations] and 0.26 [standard deviations], respectively” (p. 24).

Put differently, teacher effects, as measured by a commonly used VAM, were about the same in terms of the extent to which teachers “added value” to their students’ growth in achievement over time and their students’ physical heights. Clearly, this raises serious questions about the overall validity of this (and perhaps all) VAMs in terms of not only what they are intended to do, but also what they did (at least in this study). To yield such spurious results (i.e., results that are nonsensical and more likely due to noise than anything else) threatens the overall validity of the output derived via these models, as well as the extent to which their output can or should be trusted. This is clearly an issue with validity, or rather the validity of the inferences to be drawn from this (and perhaps/likely any other) VAM.
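To make the noise point concrete, here is a toy simulation in Python. It is only a sketch under loud assumptions, not the model the study researchers used: the function name, the 200 teachers, the class size of 25, and the noise level are all hypothetical choices made for illustration. Every simulated teacher has a true effect of exactly zero on the outcome (think height), yet the naive per-teacher averages still spread out, simply because each average is computed from one small class.

```python
import random
import statistics


def simulate_placebo_teacher_effects(n_teachers=200, class_size=25,
                                     noise_sd=1.0, seed=0):
    """Return naive per-teacher 'effects' on an outcome teachers cannot affect.

    Assumption (hypothetical, for illustration only): each student's outcome
    is pure noise with mean 0, so every teacher's TRUE effect is exactly 0.
    """
    rng = random.Random(seed)
    estimated_effects = []
    for _ in range(n_teachers):
        # One classroom of students whose outcomes ignore the teacher entirely.
        outcomes = [rng.gauss(0.0, noise_sd) for _ in range(class_size)]
        # The naive 'teacher effect' is just the classroom mean.
        estimated_effects.append(statistics.fmean(outcomes))
    return estimated_effects


effects = simulate_placebo_teacher_effects()
spread = statistics.pstdev(effects)      # SD of the estimated 'effects'
sampling_noise = 1.0 / 25 ** 0.5         # noise_sd / sqrt(class_size)

print(f"SD of estimated placebo effects: {spread:.3f}")
print(f"SD expected from sampling noise alone: {sampling_noise:.3f}")
```

Under these made-up settings, the spread of the estimated “effects” sits close to what sampling noise alone would produce, even though no simulated teacher did anything at all. A nonzero standard deviation of estimated teacher effects is therefore not, by itself, evidence that teachers caused anything.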

Ultimately the authors conclude that the findings from their paper should “serve as a cautionary tale” for the use of VAMs in practice. With all due respect to my colleagues, in my opinion their findings are much more serious than those that might merely warrant caution. Only one other study of which I am aware (Note 2), as akin to the study conducted here, could be as damning to the validity of VAMs and their too often “naïve application[s]” (p. 24).

Citation: Bitler, M., Corcoran, S., Domina, T., & Penner, E. (2019, November). Teacher effects on student achievement and height: A cautionary tale. National Bureau of Economic Research (NBER) Working Paper No. 26480. Retrieved from https://www.nber.org/papers/w26480.pdf

Note 1: As I have oft-commented in prior posts about papers published by the NBER, it is important to note that NBER papers such as these (i.e., “working papers”) have not been internally reviewed (e.g., by NBER Board Directors), nor have they been peer-reviewed or vetted. Rather, such “working papers” are widely circulated for discussion and comment, prior to what the academy of education would consider appropriate vetting. While listed in the front matter of this piece are highly respected scholars who helped critique and likely improve this paper, this is not the same as putting any such piece through a double-blind, peer-reviewed process. Hence, caution is also warranted here when interpreting study results.

Note 2: Rothstein (2009, 2010) conducted a falsification test by which he tested, also counter-intuitively, whether a teacher in the future could cause, or have an impact on his/her students’ levels of achievement in the past. Rothstein demonstrated that given non-random student placement (and tracking) practices, VAM-based estimates of future teachers could be used to predict students’ past levels of achievement. More generally, Rothstein demonstrated that both typical and complex VAMs demonstrated counterfactual effects and did not mitigate bias because students are consistently and systematically grouped in ways that explicitly bias value-added estimates. Otherwise, the backwards predictions Rothstein demonstrated could not have been made.

Citations: Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537-571. doi:10.1162/edfp.2009.4.4.537

Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. Quarterly Journal of Economics, 125(1), 175-214. doi:10.1162/qjec.2010.125.1.175
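The logic of the falsification test described in Note 2 can also be sketched with a toy simulation. To be clear, this is not Rothstein's model or data: the function name, the 1,000 students, the class size of 25, and the assumption of perfect tracking by prior score are all hypothetical and deliberately extreme. The sketch compares how much students' past scores differ across their future classrooms when students are tracked versus when they are assigned at random.

```python
import random
import statistics


def future_class_means_spread(tracked, n_students=1000, class_size=25, seed=1):
    """SD, across FUTURE classrooms, of students' mean PAST scores.

    Assumption (hypothetical): a future teacher cannot affect past scores,
    so any spread beyond sampling noise reflects how students were sorted.
    """
    rng = random.Random(seed)
    past_scores = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    if tracked:
        # Extreme tracking: classrooms are filled in order of past score.
        past_scores.sort()
    else:
        # Random assignment: classroom membership ignores past score.
        rng.shuffle(past_scores)
    class_means = [
        statistics.fmean(past_scores[start:start + class_size])
        for start in range(0, n_students, class_size)
    ]
    return statistics.pstdev(class_means)


tracked_spread = future_class_means_spread(tracked=True)
random_spread = future_class_means_spread(tracked=False)

print(f"Spread of past scores across future classes (tracked): {tracked_spread:.2f}")
print(f"Spread of past scores across future classes (random):  {random_spread:.2f}")
```

When students are tracked, their future classroom "predicts" their past achievement far better than it does under random assignment, which is the kind of backwards prediction that should be impossible if estimates reflected teachers' causal effects alone.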

Mapping America’s Teacher Evaluation Plans Under ESSA

One of my doctoral students (Kevin Close), one of my former doctoral students (Clarin Collins), and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share out with all of you, especially before the study is no longer open-access or free (see the full article as currently available here). As the title of this post (which is the same as the title of the article) indicates, the study is about research the three of us conducted, by surveying every state (or interviewing leaders at every state’s department of education), about how each state changed its teacher evaluation system post the passage of the Every Student Succeeds Act (ESSA).

In short, we found states have reduced their use of growth or value-added models (VAMs) within their teacher evaluation systems. In addition, states that are still using such models are using them in much less consequential ways, while many states are offering more alternatives for measuring the relationships between student achievement and teacher effectiveness. Additionally, state teacher evaluation plans also contain more language supporting formative teacher feedback (i.e., a noteworthy change from states’ prior summative and oft-highly consequential teacher evaluation systems). State departments of education also seem to be allowing districts to develop and implement more flexible teacher evaluation systems, while simultaneously acknowledging the challenges of supporting increased local control, especially when varied local systems make it difficult to compare data across schools and districts.

Again, you can read more here. See also the longer version of this study, if interested, here.

Litigating Algorithms, Beyond Education

This past June, I presented at a conference at New York University (NYU) called Litigating Algorithms. Most attendees were lawyers, law students, and the like, all of whom were there to discuss the multiple ways that they have collectively and independently been challenging governmental uses of algorithm-based, decision-making systems (e.g., VAMs) across disciplines. I was there to present about how VAMs have been used by states and school districts in education, as well as present the key issues with VAMs as litigated via the lawsuits in which I have been engaged (e.g., Houston, New Mexico, New York, Tennessee, and Texas). The conference was sponsored by the AI Now Institute, also at NYU, which has as its mission to examine the social implications of artificial intelligence (AI), and in collaboration with the Center on Race, Inequality, and the Law, affiliated with the NYU School of Law.

Anyhow, they just released their report from this conference and I thought it important to share out with all of you, also in that it details the extent to which similar AI systems are being used across disciplines beyond education, and it details how such uses (misuses and abuses) are being litigated in court.

See the press release below, and see the full report here.


Litigating Algorithms 2019 U.S. Report – New Challenges to Government Use of Algorithmic Decision Systems

Today the AI Now Institute and NYU Law’s Center on Race, Inequality, and the Law published new research on the ways litigation is being used as a tool to hold government accountable for using algorithmic tools that produce harmful results.

Algorithmic decision systems (ADS) are often sold as offering a number of benefits, from mitigating human bias and error, to cutting costs and increasing efficiency, accuracy, and reliability. Yet proof of these advantages is rarely offered, even as evidence of harm increases. Within health care, criminal justice, education, employment, and other areas, the implementation of these technologies has resulted in numerous problems with profound effects on millions of people’s lives.

More than 19,000 Michigan residents were incorrectly disqualified from food-assistance benefits by an errant ADS. A similar system automatically and arbitrarily cut Oregonians’ disability benefits. And an ADS falsely labeled 40,000 workers in Michigan as having committed unemployment fraud. These are a handful of examples that make clear the profound human consequences of the use of ADS, and the urgent need for accountability and validation mechanisms. 

In recent years, litigation has become a valuable tool for understanding the concrete and real impacts of flawed ADS and holding government accountable when it harms us. 

The Report picks up where our 2018 report left off, revisiting the first wave of U.S. lawsuits brought against government use of ADS, and examining what progress, if any, has been made.  We also explore a new wave of legal challenges that raise significant questions, including:

  1. What access, if any, criminal defense attorneys should have to law enforcement ADS in order to challenge allegations leveled by the prosecution; 
  2. The profound human consequences of erroneous or vindictive uses of governmental ADS; and 
  3. The evolution of the Illinois Biometric Information Privacy Act, America’s most powerful biometric privacy law, and what its potential impact on ADS accountability might be. 

This report offers concrete insights from actual cases involving plaintiffs and lawyers seeking justice in the face of harmful ADS. These cases illuminate many ways that ADS are perpetuating concrete harms, and the ways ADS companies are pushing against accountability and transparency.

The report also outlines several recommendations for advocates and other stakeholders interested in using litigation as a tool to hold government accountable for its use of ADS.

Citation: Richardson, R., Schultz, J. M., & Southerland, V. M. (2019). Litigating algorithms 2019 US report: New challenges to government use of algorithmic decision systems. New York, NY: AI Now Institute. Retrieved from https://ainowinstitute.org/litigatingalgorithms-2019-us.html

More on the VAM (Ab)Use in Florida

In my most recent post, about it being “Time to Organize in Florida” (see here), I wrote about how in Florida teachers were (two or so weeks after the start of the school year) being removed from teaching in Florida schools if their state-calculated, teacher-level VAM scores deemed them as teachers who “needed improvement” or were “unsatisfactory.” Yes – they were being removed from teaching in low performing schools IF their VAM scores, and VAM scores alone, deemed them as not adding value.

A reporter from the Tampa Bay Times with whom I spoke on this story just published his article all about this, titled “Florida’s ‘VAM Score’ for Rating Teachers is Still Around, and Still Hated” (see the full article here, and see the article’s full reference below). This piece captures the situation in Florida better than my prior post; hence, please give it a read.

Again, click here to read.

Full citation: Solochek, J. S. (2019). Florida’s ‘VAM score’ for rating teachers is still around, and still hated. Tampa Bay Times. Retrieved from https://www.tampabay.com/news/gradebook/2019/09/23/floridas-vam-score-for-rating-teachers-is-still-around-and-still-hated/

New (Unvetted) Research about Washington DC’s Teacher Evaluation Reforms

In November of 2013, I published a blog post about a “working paper” released by the National Bureau of Economic Research (NBER) and written by authors Thomas Dee – Economics and Educational Policy Professor at Stanford, and James Wyckoff – Economics and Educational Policy Professor at the University of Virginia. In the study titled “Incentives, Selection, and Teacher Performance: Evidence from IMPACT,” Dee and Wyckoff (2013) analyzed the controversial IMPACT educator evaluation system that was put into place in Washington DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee. In this paper, Dee and Wyckoff (2013) presented what they termed to be “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” via DC’s IMPACT initiative worked to improve the performance of high-performing teachers, and that dismissal threats worked to increase the voluntary attrition of low-performing teachers, as well as improve the performance of the students of the teachers who replaced them.

I critiqued this study in full (see both short and long versions of this critique here), and ultimately asserted that the study had “fatal flaws” which compromised the (exaggerated) claims Dee and Wyckoff (2013) advanced. These flaws included but were not limited to that only 17% of the teachers included in this study (i.e., teachers of reading and mathematics in grades 4 through 8) were actually evaluated under the value-added component of the IMPACT system. Put inversely, 83% of the teachers included in this study about teachers’ “value-added” did not have student test scores available to determine if they were indeed of “added value.” That is, 83% of the teachers evaluated, rather, were assessed on their overall levels of effectiveness or subsequent increases/decreases in effectiveness as per only the subjective observational and other self-report data included within the IMPACT system. Hence, while authors’ findings were presented as hard fact, given the 17% fact, their (exaggerated) conclusions did not at all generalize across teachers given the sample limitations, and despite what they claimed.

In short, the extent to which Dee and Wyckoff (2013) oversimplified very complex data, and with it a very complex context and policy situation, after which they exaggerated questionable findings, was at issue and should have been reconciled or cleared up prior to the study’s release. I should add that this study was published in 2015 in the (economics-oriented and not-educational-policy specific) Journal of Policy Analysis and Management (see here), although I have not since revisited the piece to analyze, comparatively (e.g., via a content analysis), the original 2013 to the final 2015 piece.

Anyhow, they are at it again. Just this past January (2016) they published another report, albeit alongside two additional authors: Melinda Adnot – a Visiting Assistant Professor at the University of Virginia, and Veronica Katz – an Educational Policy PhD student, also at the University of Virginia. This study, titled “Teacher Turnover, Teacher Quality, and Student Achievement in DCPS,” was also (prematurely) released as a “working paper” by the same NBER, again, without any internal or external vetting but (irresponsibly) released “for discussion and comment.”

Hence, I provide my “discussion and comments” below, all the while underscoring how this continues to be problematic, also given the fact that I was contacted by the media for comment. Frankly, no media reports should be released about these (or for that matter any other) “working papers” until they are not only internally but also externally reviewed (e.g., in press or published, post vetting). Unfortunately, as it too commonly does, NBER released this report, clearly without such concern. Now, we as the public are responsible for consuming this study with much critical caution, while also advocating that others do (and helping others to do) the same. Hence, I write into this post my critiques of this particular study.

First, the primary assumption (i.e., the “conceptual model”) driving this Adnot, Dee, Katz, & Wyckoff (2016) piece is that low-performing teachers should be identified and replaced with more effective teachers. This is akin to the assumption noted in the first Dee and Wyckoff (2013) piece. It should be noted here that in DCPS teachers rated as “Ineffective” or consecutively as “Minimally Effective” are “separated” from the district; hence, DCPS has adopted educational policies that align with this “conceptual model” as well. Interesting to note is how researchers, purportedly external to DCPS, entered into this study with the same a priori “conceptual model.” This, in and of itself, is an indicator of researcher bias (see also forthcoming).

Nonetheless, Adnot et al.’s (2016) highlighted finding was that “on average, DCPS replaced teachers who left with teachers who increased student achievement by 0.08 SD [standard deviations] in math.” Buried further into the report they also found that DCPS replaced teachers who left with teachers who increased student achievement by 0.05 SD in reading (at not a 5% but a 10% statistical significance level). These findings, in simpler but also more realistic terms, mean that (if actually precise and correct, also given all of the problems with how teacher classifications were determined at the DCPS level), “effective” mathematics teachers who replaced “ineffective” mathematics teachers increased student achievement by approximately 2.7%, and “effective” reading teachers who replaced “ineffective” reading teachers increased student achievement by approximately 1.7% (at not a 5% but a 10% statistical significance level). These are hardly groundbreaking results, as these proportional movements likely represented one or maybe two total test items on the large-scale standardized tests used to assess DCPS’s policy impacts.
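As a rough check on the arithmetic above, the sketch below converts the reported effect sizes into percentage-type terms in two ways. It rests on assumptions that are not stated in the study or in this post: that test scores are approximately normally distributed, and that the 2.7% and 1.7% figures are consistent with scaling each effect size by the roughly 34.1% of a normal distribution that lies between the mean and one standard deviation above it. The function names are hypothetical and for illustration only.

```python
import math

# Effect sizes reported by Adnot et al. (2016), in standard deviations (SD).
EFFECT_SD = {"math": 0.08, "reading": 0.05}


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_share(effect_sd):
    """Assumed shortcut: scale the effect by the share of a normal
    distribution lying between the mean and +1 SD (about 34.1%)."""
    share_within_one_sd = normal_cdf(1.0) - 0.5
    return 100.0 * effect_sd * share_within_one_sd


def percentile_shift(effect_sd):
    """Percentile points gained by a student who starts at the median,
    assuming normally distributed scores."""
    return 100.0 * (normal_cdf(effect_sd) - 0.5)


for subject, effect in EFFECT_SD.items():
    print(f"{subject}: {linear_share(effect):.1f}% (linear scaling), "
          f"{percentile_shift(effect):.1f} percentile points (normal CDF)")
```

Under these assumptions, the linear scaling gives about 2.7% for mathematics and 1.7% for reading, and the normal-curve reading gives a shift of roughly 3 and 2 percentile points, respectively, for a student starting at the median. Either way, the movements are small.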

Interesting to also note is that not only were the “small effects” exaggerated to mean so much more than what they are actually worth (see also forthcoming), but also that only the larger of the two findings – the mathematics finding – is highlighted in the abstract. The complementary and smaller reading effect is actually buried in the text. Also buried is that these findings pertain only to general education teachers in grades four through eight who were value-added eligible, akin to Dee and Wyckoff’s (2013) earlier piece (e.g., typically 30% of a school’s population, although Dee and Wyckoff’s (2013) piece marked this percentage at 17%).

As mentioned prior, none of this would have likely happened had this piece been internally and/or externally reviewed prior to this study’s release.

Regardless, Adnot et al. (2016) also found that the attrition of relatively higher-performing teachers (e.g., “Effective” or “Highly Effective”) had a negative but also statistically insignificant effect.

Hence, it can be concluded that the “finding” highlighted in the abstract of this piece was not the only “finding”; rather, there were other findings that the researchers (perhaps) purposefully buried in the text. It is possible, in other words, that because these other findings did not support the researchers’ a priori conclusions and claims, the researchers chose not to bring attention to these findings, or rather the lack thereof (e.g., in the abstract).

Related, I should note that in a few places the authors exaggerate how, for example, teachers’ effects on their students’ achievement are so tangible, without any mention of contrary reports, namely as published by the American Statistical Association (ASA), in which the ASA evidenced that these (oft-exaggerated) teacher effects account for no more than 1%-14% of the variance in students’ growth scores (see more information here). In fact, teacher effectiveness is very likely not “qualitatively large” as Adnot et al. (2016) argue, without evidence in this piece, and also imply throughout this piece as a foundational part of their aforementioned “conceptual model.”

Likewise, while most everyone would likely agree that there are “stark inequities” in students’ access to effective teachers, how to improve this condition is certainly of great debate, which is neither explicitly nor implicitly acknowledged anywhere in this piece. Rather, much disagreement and debate, in fact, still exist regarding whether inducing teacher turnover will get “us” really anywhere in terms of school reform, as also related to how big (or small) teachers’ effects on students’ measurable performance actually are, as discussed prior. Accordingly, and perhaps not surprisingly, Adnot et al. (2016) cite only the articles of other researchers who are members of their proverbial choir (e.g., Eric Hanushek who, without actual evidence, has been hypothesizing for nearly a decade now about how replacing “ineffective” teachers with “effective” teachers will reform America’s schools) to support these same a priori conclusions. Consequently, the professional integrity of the researchers must be called into question given these simple (albeit biased) errors.

Taking all of this into consideration, I would hardly call the findings they advanced in this piece (and emphasized in the abstract) solid indicators of the “overall positive effects of teacher turnover,” with only one statistically but not practically significant finding of note in mathematics (i.e., a 2.7% increase, if accurate). None of this could/should, accordingly, lead anyone to conclude that “the supply of entering teachers appears to be of sufficient quality to sustain a relatively high turnover rate.”

Hence, this is yet another case of these authors oversimplifying very complex data to oversimplify a very complex context and policy situation, after which they exaggerated negligible findings while also dismissing others.

Related, would we not expect greater results given that teachers who are deemed highly effective are to be given one-time bonuses of up to $25,000, and permanent increases to their base salaries of up to $27,000 per year? This bang, or lack thereof, may not be worth the buck, either.

Additionally, is an annual attrition rate of “low-performing teachers” (e.g., classified as such for one or two consecutive years) in the district, currently hanging at around 46%, worth these diminutive results?

Did they also actually find, overall, that “high-poverty schools actually improve as a result of teacher turnover?” I don’t think so, but do give this study a full read to test their as well as my conclusions for yourself (see, again, the full study here).

In the end, Adnot et al. (2016) do conclude that they found “that the overall effect of teacher turnover in DCPS conservatively had no effect on achievement and, under reasonable assumptions, improved achievement.” This is a MUCH more balanced interpretation of this study, although I would certainly question their “reasonable assumptions” (see also prior). Moreover, it is curious that we had to wait until the end for the actual headline of this study. This is especially important given that others, including members of the media, public, and policymaking community, might not make it that far (i.e., trusting only what is in the abstract).


Adnot, M., Dee, T., Katz, V., & Wyckoff, J. (2016). Teacher turnover, teacher quality, and student achievement in DCPS [Washington DC Public Schools]. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w21922.pdf

Dee, T., & Wyckoff, J. (2013). Incentives, selection, and teacher performance: Evidence from IMPACT. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w19529.pdf