Rest in Peace, EVAAS Developer William L. Sanders

Over the last 3.5 years since I developed this blog, I have written many posts about one particular value-added model (VAM) – the Education Value-Added Assessment System (EVAAS), formerly known as the Tennessee Value-Added Assessment System (TVAAS), now known by some states as the TxVAAS in Texas, the PVAAS in Pennsylvania, and also known as the generically-named EVAAS in states like Ohio, North Carolina, and South Carolina (and many districts throughout the nation). It is this model on which I have conducted most of my research (see, for example, the first piece I published about this model here, in which most of the claims I made still stand, although EVAAS modelers disagreed here). And it is this model that is at the source of the majority of the teacher evaluation lawsuits in which I have been or still am currently engaged (see, for example, details about the Houston lawsuit here, the former Tennessee lawsuit here, and the new Texas lawsuit here, although the model is more peripheral in this particular case).

Anyhow, the original EVAAS model (i.e, the TVAAS) was originally developed by a man named William L. Sanders who ultimately sold it to SAS Institute Inc. that now holds all rights to the proprietary model. See, for example, here. See also examples of prior posts about Sanders here, here, here, here, here, and here. See also examples of prior posts about the EVAAS here, here, here, here, here, and here.

It is William L. Sanders who just passed away and we sincerely hope may rest in peace.

Sanders had a bachelors degree in animal science and a doctorate in statistics and quantitative genetics. As an adjunct professor and agricultural statistician in the college of business at the University of Knoxville, Tennessee, he developed in the late 1980s his TVAAS.

Sanders thought that educators struggling with student achievement in the state should “simply” use more advanced statistics, similar to those used when modeling genetic and reproductive trends among cattle, to measure growth, hold teachers accountable for that growth, and solve the educational measurement woes facing the state of Tennessee at the time. It was to be as simple as that…. I should also mention that given this history, not surprisingly, Tennessee was one of the first states to receive Race to the Top funds to the tune of $502 million to further advance this model; hence, this has also contributed to this model’s popularity across the nation.

Nonetheless, Sanders passed away this past Thursday, March 16, 2017, from natural causes in Columbia, Tennessee. As per his obituary here,

  • He was most well-known for developing “a method used to measure a district, school, and teacher’s effect on student performance by tracking the year-to-year progress of students against themselves over their school career with various teachers’ classes.”
  • He “stood for a hopeful view that teacher effectiveness dwarfs all other factors as a predictor of student academic growth…[challenging]…decades of assumptions that student family life, income, or ethnicity has more effect on student learning.”
  • He believed, in the simplest of terms, “that educational influence matters and teachers matter most.”

Of course, we have much research evidence to counter these claims, but for now we will just leave all of this at that. Again, may he rest in peace.

New Texas Lawsuit: VAM-Based Estimates as Indicators of Teachers’ “Observable” Behaviors

Last week I spent a few days in Austin, one day during which I provided expert testimony for a new state-level lawsuit that has the potential to impact teachers throughout Texas. The lawsuit — Texas State Teachers Association (TSTA) v. Texas Education Agency (TEA), Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas.

The key issue is that, as per the state’s Texas Education Code (Sec. § 21.351, see here) regarding teachers’ “Recommended Appraisal Process and Performance Criteria,” The Commissioner of Education must adopt “a recommended teacher appraisal process and criteria on which to appraise the performance of teachers. The criteria must be based on observable, job-related behavior, including: (1) teachers’ implementation of discipline management procedures; and (2) the performance of teachers’ students.” As for the latter, the State/TEA/Commissioner defined, as per its Texas Administrative Code (T.A.C., Chapter 15, Sub-Chapter AA, §150.1001, see here), that teacher-level value-added measures should be treated as one of the four measures of “(2) the performance of teachers’ students;” that is, one of the four measures recognized by the State/TEA/Commissioner as an “observable” indicator of a teacher’s “job-related” performance.

While currently no district throughout the State of Texas is required to use a value-added component to assess and evaluate its teachers, as noted, the value-added component is listed as one of four measures from which districts must choose at least one. All options listed in the category of “observable” indicators include: (A) student learning objectives (SLOs); (B) student portfolios; (C) pre- and post-test results on district-level assessments; and (D) value-added data based on student state assessment results.

Related, the state has not recommended or required that any district, if the value-added option is selected, to choose any particular value-added model (VAM) or calculation approach. Nor has it recommended or required that any district adopt any consequences as attached to these output; however, things like teacher contract renewal and sharing teachers’ prior appraisals with other districts in which teachers might be applying for new jobs is not discouraged. Again, though, the main issue here (and the key points to which I testified) was that the value-added component is listed as an “observable” and “job-related” teacher effectiveness indicator as per the state’s administrative code.

Accordingly, my (5 hour) testimony was primarily (albeit among many other things including the “job-related” part) about how teacher-level value-added data do not yield anything that is observable in terms of teachers’ effects. Likewise, officially referring to these data in this way is entirely false, in fact, in that:

  • “We” cannot directly observe a teacher “adding” (or detracting) value (e.g., with our own eyes, like supervisors can when they conduct observations of teachers in practice);
  • Using students’ test scores to measure student growth upwards (or downwards) and over time, as is very common practice using the (very often instructionally insensitive) state-level tests required by No Child Left Behind (NCLB), and doing this once per year in mathematics and reading/language arts (that includes prior and other current teachers’ effects, summer learning gains and decay, etc.), is not valid practice. That is, doing this has not been validated by the scholarly/testing community; and
  • Worse and less valid is to thereafter aggregate this student-level growth to the teacher level and then call whatever “growth” (or the lack thereof) is because of something the teacher (and really only the teacher did), as directly “observable.” These data are far from assessing a teacher’s causal or “observable” impacts on his/her students’ learning and achievement over time. See, for example, the prior statement released about value-added data use in this regard by the American Statistical Association (ASA) here. In this statement it is written that: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation [emphasis added to note that this is variation explained which = correlational versus causal research] in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability [emphasis added] to teachers. This is not saying that teachers have little effect on students, but that variation among teachers [emphasis added] accounts for a small part of the variation [emphasis added] in [said test] scores. The majority of the variation in [said] test scores is [inversely, 86%-99% related] to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences.”

If any of you have anything to add to this, please do so in the comments section of this post. Otherwise, I will keep you posted on how this goes. My current understanding is that this one will be headed to court.

New Article Published on Using Value-Added Data to Evaluate Teacher Education Programs

A former colleague, a current PhD student, and I just had an article released about using value-added data to (or rather not to) evaluate teacher education/preparation, higher education programs. The article is titled “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added,” and the abstract of the article is included below.

If there is anyone out there who might be interested in this topic, please note that the journal in which this piece was published (online first and to be published in its paper version later) – Teaching Education – has made the article free for its first 50 visitors. Hence, I thought I’d share this with you all first.

If you’re interested, do access the full piece here.

Happy reading…and here’s the abstract:

In this study researchers examined the effectiveness of one of the largest teacher education programs located within the largest research-intensive universities within the US. They did this using a value-added model as per current federal educational policy imperatives to assess the measurable effects of teacher education programs on their teacher graduates’ students’ learning and achievement as compared to other teacher education programs. Correlational and group comparisons revealed little to no relationship between value-added scores and teacher education program regardless of subject area or position on the value-added scale. These findings are discussed within the context of several very important data and methodological challenges researchers also made transparent, as also likely common across many efforts to evaluate teacher education programs using value-added approaches. Such transparency and clarity might assist in the creation of more informed value-added practices (and more informed educational policies) surrounding teacher education accountability.

David Berliner on The Purported Failure of America’s Schools

My primary mentor, David Berliner (Regents Professor at Arizona State University (ASU)) wrote, yesterday, a blog post for the Equity Alliance Blog (also at ASU) on “The Purported Failure of America’s Schools, and Ways to Make Them Better” (click here to access the original blog post). See other posts about David’s scholarship on this blog here, here, and here. See also one of our best blog posts that David also wrote here, about “Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs).”

In sum, for many years David has been writing “about the lies told about the poor performance of our students and the failure of our schools and teachers.” For example, he wrote one of the education profession’s all time classics and best sellers: The Manufactured Crisis: Myths, Fraud, And The Attack On America’s Public Schools (1995). If you have not read it, you should! All educators should read this book, on that note and in my opinion, but also in the opinion of many other iconic educational scholars throughout the U.S. (Paufler, Amrein-Beardsley, Hobson, under revision for publication).

While the title of this book accurately captures its contents, more specifically it “debunks the myths that test scores in America’s schools are falling, that illiteracy is rising, and that better funding has no benefit. It shares the good news about public education.” I’ve found the contents of this book to still be my best defense when others with whom I interact attack America’s public schools, as often misinformed and perpetuated by many American politicians and journalists.

In this blog post David, once again, debunks many of these myths surrounding America’s public schools using more up-to-date data from international tests, our country’s National Assessment of Educational Progress (NAEP), state-level SAT and ACT scores, and the like. He reminds us of how student characteristics “strongly influence the [test] scores obtained by the students” at any school and, accordingly, “strongly influence” or bias these scores when used in any aggregate form (e.g., to hold teachers, schools, districts, and states accountable for their students’ performance).

He reminds us that “in the US, wealthy children attending public schools that serve the wealthy are competitive with any nation in the world…[but in]…schools in which low-income students do not achieve well, [that are not competitive with many nations in the world] we find the common correlates of poverty: low birth weight in the neighborhood, higher than average rates of teen and single parenthood, residential mobility, absenteeism, crime, and students in need of special education or English language instruction.” These societal factors explain poor performance much more (i.e., more variance explained) than any school-level, and as pertinent to this blog, teacher-level factor (e.g., teacher quality as measured by large-scale standardized test scores).

In this post David reminds us of much, much more, that we need to remember and also often recall in defense of our public schools and in support of our schools’ futures (e.g., research-based notes to help “fix” some of our public schools).

Again, please do visit the original blog post here to read more.

Difficulties When Combining Multiple Teacher Evaluation Measures

A new study about multiple “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in the American Educational Research Association (AERA) sponsored and highly-esteemed Educational Evaluation and Policy Analysis journal. You can find the free and full version of this study here.

In this study authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – at the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). Multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”

This, of course, has huge implications for teacher evaluation systems as based upon multiple measures in that “accuracy” means “validity” and “valid” decisions cannot be made as based on “invalid” or “inaccurate” data that can so arbitrarily change. In other words, what this means is that likely never will a decision about a teacher being this or that actually mean this or that. In fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures combined are assumed to function otherwise. This is especially interesting, again and as stated prior, that the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with the attachment of consequences to teachers’ multiple evaluation measures, as assumed (by the state but not the state’s consultant?) to be accurate and correct (see, for example, here).

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively referred others to use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given all of this also makes the social science arbitrary, which is also a legal issue in and of itself, as also quasi noted.

Now, while I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, as unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary, how principal component, unit weight, policy weight, optimal prediction approaches vary), I will also note that forcing the fit of such multiple measures in such ways, especially without a thorough background in and understanding of reliability and validity and what reliability means for validity (i.e., with rather high levels of reliability required before any valid inferences and especially high-stakes decisions can be made) is certainly unwise.

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly cause pause, if not a justifiable grievance as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

NCTQ on States’ Teacher Evaluation Systems’ Failures

The controversial National Council on Teacher Quality (NCTQ) — created by the conservative Thomas B. Fordham Institute and funded (in part) by the Bill & Melinda Gates Foundation as “part of a coalition for ‘a better orchestrated agenda’ for accountability, choice, and using test scores to drive the evaluation of teachers” (see here; see also other instances of controversy here and here) — recently issued yet another report about state’s teacher evaluation systems titled: “Running in Place: How New Teacher Evaluations Fail to Live Up to Promises.” See a related blog post in Education Week about this report here. See also a related blog post about NCTQ’s prior large-scale (and also slanted) study — “State of the States 2015: Evaluating Teaching, Leading and Learning” — here. Like I did in that post, I summarize this study below.

From the abstract: Authors of this report find that “within the 30 states that [still] require student learning measures to be at least a significant factor in teacher evaluations, state guidance and rules in most states allow teachers to be rated effective even if they receive low scores on the student learning component of the evaluation.” They add in the full report that in many states “a high score on an evaluation’s observation and [other] non-student growth components [can] result in a teacher earning near or at the minimum number of points needed to earn an effective rating. As a result, a low score on the student growth component of the evaluation is sufficient in several states to push a teacher over the minimum number of points needed to earn a summative effective rating. This essentially diminishes any real influence the student growth component has on the summative evaluation rating” (p. 3-4).

The first assumption surrounding the authors’ main tenets they make explicit: that “[u]nfortunately, [the] policy transformation [that began with the publication of the “Widget Effect” report in 2009] has not resulted in drastic alterations in outcomes” (p. 2). This is because, “[in] effect…states have been running in place” (p. 2) and not using teachers’ primarily test-based indicators for high-stakes decision-making. Hence, “evaluation results continue to look much like they did…back in 2009” (p. 2). The authors then, albeit ahistorically, ask, “How could so much effort to change state laws result in so little actual change?” (p. 2). Yet they don’t realize (or care to realize) that this is because we have almost 40 years of evidence that really any type of test-based, educational accountability policies and initiatives have never yield their intended consequences (i.e., increased student achievement on national and international indicators). Rather, the authors argue, that “most states’ evaluation laws fated these systems to status quo results long before” they really had a chance (p. 2).

The authors’ second assumption they imply: that the two most often used teacher evaluation indicators (i.e., the growth or value-added and observational measures) should be highly correlated, which many argue they should be IF in fact they are measuring general teacher effectiveness. But the more fundamental assumption here is that if the student learning (i.e., test based) indicators do not correlate with the observational indicators, the latter MUST be wrong, biased, distorted, and accordingly less trustworthy and the like. They add that “teachers and students are not well served when a teacher is rated effective or higher even though her [sic] students have not made sufficient gains in their learning over the course of a school year” (p. 4). Accordingly, they add that “evaluations should require that a teacher is rated well on both the student growth measures and the professional practice component (e.g., observations, student surveys, etc.) in order to be rated effective” (p. 4). Hence, also in this report the authors put forth recommendations for how states might address this challenge. See these recommendations forthcoming, as also related to a new phenomenon my students and I are studying called artificial inflation.

Artificial inflation is a term I recently coined to represent what is/was happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators to align their observational ratings of teachers’ effectiveness with teachers’ value-added scores, with the latter being (sometimes relentlessly) considered the “objective measure” around which all other measures (e.g., subjective observational measures) should revolve, or align. Hence, the push is to conflate the latter “subjective” measure to match the former “objective” measure, even if the process of artificial conflation causes both indicators to become invalid. As per my affidavit from the still ongoing lawsuit in Houston (see here), “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s “Standards for Educational and Psychological Testing.”


Here is one important figure, taken out of context in some ways on purpose (e.g., as the text surrounding this particular figure is ironically, subjectively used to define what the NCTQ defines as as indicators or progress, or regress).

Near Figure 1 (p. 1) the authors note that “as of January 2017, there has been little evidence of a large-scale reversal of states’ formal evaluation policies. In fact, only four states (Alaska, Mississippi, North Carolina, and Oklahoma) have reversed course on factoring student learning into a teacher’s evaluation rating” (p. 3). While this reversal of four is not illustrated in their accompanying figure, see also a prior post about what other states, beyond just these four states of dishonorable mention, have done to “reverse” the “course” (p. 3) here. While the authors shame all states for minimizing teachers’ test-based ratings before these systems had a chance, as also ignorant to what they cite as “a robust body of research” (without references or citations here, and few elsewhere in a set of footnotes), they add that it remains an unknown as to “why state educational agencies put forth regulations or guidance that would allow teachers to be rated effective without meeting their student growth goals” (p. 4). Many of us know that this was often done to counter the unreliable and invalid results often yielded via the “objective” test-based sides of things that the NCTQ continues to advance.

Otherwise, here are also some important descriptive findings:

  • Thirty states require measures of student academic growth to be at least a significant factor within teacher evaluations; another 10 states require some student growth, and 11 states do not require any objective measures of student growth (p. 5).
  • With only [emphasis added] two exceptions, in the 30 states where student
    growth is at least a significant factor in teacher evaluations, state
    rules or guidance effectively allow teachers who have not met student
    growth goals to still receive a summative rating of at least effective (p. 5).
  • In 18 [of these 30] states, state educational agency regulations and/or guidance
    explicitly permit teachers to earn a summative rating of effective even after earning a less-than-effective score on the student learning portion of their evaluations…these regulations meet the letter of the law while still allowing teachers with low ratings on
    student growth measures to be rated effective or higher (p. 5). In Colorado, for example…a teacher can earn a rating of highly effective with a score of just 1 for student growth (which the state classifies as “less than expected”) in conjunction with a top professional practice score (p. 4).
  • Ten states do not specifically address whether a teacher who has not met student growth goals may be rated as effective or higher. These states neither specifically allow nor specifically disallow such a scenario, but by failing to provide guidance to prevent such an occurrence, they enable it to exist (p. 6).
  • Only two of the 30 states (Indiana and Kentucky) make it impossible for a teacher who has not been found effective at increasing student learning to receive a summative rating of effective (p. 6).

Finally, here are some of their important recommendations, as related to all of the above, and to create more meaningful teacher evaluation systems. So they argue, states should:

  • Establish policies that preclude teachers from earning a label of effective if they are found ineffective at increasing student learning (p. 12).
  • Track the results of discrete components within evaluation systems, both statewide and districtwide. In districts where student growth measures and observation measures are significantly out of alignment, states should reevaluate their systems and/or offer districts technical assistance (p. 12). ][That is, states should possibly promote artificial inflation as we have observed elsewhere. The authors add that] to ensure that evaluation ratings better reflect teacher performance, states should [more specifically] track the results of each evaluation measure to pinpoint where misalignment between components, such as between student learning and observation measures, exists. Where major components within an evaluation system are significantly misaligned, states should examine their systems and offer districts technical assistance where needed, whether through observation training or examining student growth models or calculations (p. 12-13). [Tennessee, for example,] publishes this information so that it is transparent and publicly available to guide actions by key stakeholders and point the way to needed reforms (p. 13).

See also state-by-state reports in the appendices of the full report, in case your state was one of the state’s that responded or, rather, “recognized the factual accuracy of this analysis.”

Citation: Walsh, K., Joseph, N., Lakis, K., & Lubell, S. (2017). Running in place: How new teacher evaluations fail to live up to promises. Washington DC: National Council on Teacher Quality (NCTQ). Retrieved from

Last Saturday Night Live’s VAM-Related Skit

For those of you who may have missed it last Saturday, Melissa McCarthy portrayed Sean Spicer — President Trump’s new White House Press Secretary and Communications Director — in one of the funniest of a very funny set of skits recently released on Saturday Night Live. You can watch the full video, compliments of YouTube, here:

In one of the sections of the skit, though, “Spicer” introduces “Betsy DeVos” — portrayed by Kate McKinnon and also just today confirmed as President Trump’s Secretary of Education — to answer some very simple questions about today’s public schools which she, well, very simply could not answer. See this section of the clip starting at about 6:00 (of the above 8:00 minute total skit).

In short, “the man” reporter asks “DeVos” how she values “growth versus proficiency in [sic] measuring progress in students.” Literally at a loss of words, “DeVos” responds that she really doesn’t “know anything about school.” She rambles on, until “Spicer” pushes her off of the stage 40-or-so seconds later.

Humor set aside, this was the one question Saturday Night Live writers wrote into this skit, which reminds us that what we know more generally as the purpose of VAMs is still alive and well in our educational rhetoric as well as popular culture. As background, this question apparently came from Minnesota Sen. Al Franken’s prior, albeit similar question during DeVos’s confirmation hearing.

Notwithstanding, Steve Snyder – the editorial director of The 74 — an (allegedly) non-partisan, honest, and fact-based backed by Editor-in-Chief Campbell Brown (see prior posts about this news site here and here) — took the opportunity to write a “featured” piece about this section of the script (see here). The purpose of the piece was, as the title illustrates, to help us “understand” the skit, as well as it’s important meaning for all of “us.”

Snyder notes that Saturday Night Live writers, with their humor, might have consequently (and perhaps mistakenly) “made their viewers just a little more knowledgeable about how their child’s school works,” or rather should work, as “[g]rowth vs. proficiency is a key concept in the world of education research.” Thereafter, Snyder falsely asserts that more than 2/3rds of educational researchers agree that VAMs are a good way to measure school quality. If you visit the actual statistic cited in this piece, however, as “non-partison, honest, and fact-based” that it is supposed to be, you would find (here) that this 2/3rds consists of 57% of responding American Education Finance Association (AEFA) members, and AEFA members alone, who are certainly not representative of “educational researchers” as claimed.

Regardless, Snyder asks: “Why are researchers…so in favor of [these] growth measures?” Because this disciplinary subset does not represent educational researchers writ large, but only a subset, Snyder.

As it is with politics today, many educational researchers who define themselves as aligned with the disciplines of educational finance or educational econometricians are substantively more in favor of VAMs than those who align more with the more general disciplines of educational research and educational measurement, methods, and statistics, in general. While this is somewhat of a sweeping generalization, which is not wise as I also argue and also acknowledge in this piece, there is certainly more to be said here about the validity of the inferences drawn here, and (too) often driven via the “media” like The 74.

The bottom line is to question and critically consume everything, and everyone who feels qualified to write about particular things without enough expertise in most everything, including in this case good and professional journalism, this area of educational research, and what it means to make valid inferences and then responsibly share them out with the public.

States’ Teacher Evaluation Systems Now “All over the Map”

We are now just one year past the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states must no longer set up teacher-evaluation systems based in significant part on their students’ test scores. As per a recent article written in Education Week, accordingly, most states are still tinkering with their teacher evaluation systems—particularly regarding the student growth or value-added measures (VAMs) that were also formerly required to help states assesses teachers’ purported impacts on students’ test scores over time.

“States now have a newfound flexibility to adjust their evaluation systems—and in doing so, they’re all over the map.” Likewise, though, “[a] number of states…have been moving away from [said] student growth [and value-added] measures in [teacher] evaluations,” said a friend, colleague, co-editor, and occasional writer on this blog (see, for example, here and here) Kimberly Kappler Hewitt (University of North Carolina at Greensboro).  She added that this is occurring “whether [this] means postponing [such measures’] inclusion, reducing their percentage in the evaluation breakdown, or eliminating those measures altogether.”

While states like Alabama, Iowa, and Ohio seem to still be moving forward with the attachment of students’ test scores to their teachers, other states seem to be going “back and forth” or putting a halt to all of this altogether (e.g, California). Alaska cut back the weight of the measure, while New Jersey tripled the weight to count for 30% of a teacher’s evaluation score, and then introduced a bill to reduce it back to 0%. In New York teacher are to still receive a test-based evaluation score, but it is not to be tied to consequences and completely revamped by 2019. In Alabama a bill that would have tied 25% of a teacher’s evaluation to his/her students’ ACT and ACT Aspire college-readiness tests has yet to see the light of day. In North Carolina state leaders re-framed the use(s) of such measures to be more for improvement tool (e.g., for professional development), but not “a hammer” to be used against schools or teachers. The same thing is happening in Oklahoma, although this state is not specifically mentioned in this piece.

While some might see all of this as good news — or rather better news than what we have seen for nearly the last decade during which states, state departments of education, and practitioners have been grappling with and trying to make sense of student growth measures and VAMs — others are still (and likely forever will be) holding onto what now seems to be some of the now unclenched promises attached to such stronger accountability measures.

Namely in this article, Daniel Weisberg of The New Teacher Project (TNTP) and author of the now famous “Widget Effect” report — about “Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness” that helped to “inspire” the last near-decade of these policy-based reforms — “doesn’t see states backing away” from using these measures given ESSA’s new flexibility. We “haven’t seen the clock turn back to 2009, and I don’t think [we]’re going to see that.”

Citation: Will, M. (2017). States are all over the map when it comes to how they’re looking to approach teacher-evaluation systems under ESSA. Education Week. Retrieved from

The Elephant in the Room – Fairness

While VAMs have many issues pertaining, fundamentally, to their levels of reliability, validity, and bias, they are wholeheartedly unfair. This is one thing that is so very important but so rarely discussed when those external to VAM-based metrics and metrics use are debating, mainly the benefits of VAMs.

Issues of “fairness” arise when a test, or more likely its summative (i.e., summary and sometimes consequential) and formative (i.e., informative) uses, impact some more than others in unfair yet often important ways. In terms of VAMs, the main issue here is that VAM-based estimates can be produced for only approximately 30-40% of all teachers across America’s public schools. The other 60-70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers), cannot altogether be evaluated or “held accountable” using teacher- or individual-level VAM data.

Put differently, what VAM-based data provide, in general, “are incredibly imprecise and inconsistent measures of supposed teacher effectiveness for only a tiny handful [30-40%] of teachers in a given school” (see reference here). But this is often entirely overlooked, not only in the debates surrounding VAM use (and abuse) but also in the discussions surrounding how many taxpayer-derived funds are still being used to support such a (purportedly) reformatory overhaul of America’s public education system. The fact of the matter is that VAMs only directly impact the large minority.

While some states and districts are rushing into adopting “multiple measures” to alleviate at least some of these issues with fairness, what state and district leaders don’t entirely understand is that this, too, is grossly misguided. Should any of these states and districts also tie serious consequences to such output (e.g., merit pay, performance plans, teacher termination, denial of tenure), or rather tie serious consequences to measures of growth derived via any varieties of the “multiple assessment” that can be pulled from increasingly prevalent multiple assessment “menus,” states and districts are also setting themselves for lawsuits…no joke! Starting with the basic psychometrics, and moving onto the (entire) lack of research in support of using more “off-the-shelf” tests to help alleviate issues with fairness, would be the (easy) approach to take in a court of law as, really, doing any of this is entirely wrong.

School-level value-added is also being used to accommodate the issue of “fairness,” just less frequently now than before given the aforementioned “multiple assessment” trends. Regardless, many states and districts also continue to attribute a school-level aggregate score to teachers who do not teach primarily reading/language arts and mathematics, primarily in grades 3-8. That’s right, a majority of teachers receive a value-added score that is based on students whom they do not teach. This also calls for legal recourse, also in that this has been a contested issue within all of the lawsuits in which I’ve thus far been engaged.

Another Study about Bias in Teachers’ Observational Scores

Following-up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers representing racial and ethnic minority background might be more likely than others to not only receive lower relatively scores but also be more likely identified for possible dismissal as a result of their relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

Authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the same district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year
    comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., vie performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here), is not only unwise, it’s also possibly prejudiced.

While study authors note that their findings do not necessarily “explain why the
patterns exist or to what they may be attributed,” and that there is a “need
for further research on the potential causes of the gaps identified, as well as strategies for
ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from