The New York Times on “The Little Known Statistician” Who Passed

As many of you may recall, I wrote a post last March about the passing of William L. Sanders at age 74. Sanders developed the Education Value-Added Assessment System (EVAAS) — the value-added model (VAM) on which I have conducted most of my research (see, for example, here and here) and the VAM at the core of most of the teacher evaluation lawsuits in which I have been (or still am) engaged (see here, here, and here).

Over the weekend, though, The New York Times released a similar piece about Sanders’s passing, titled “The Little-Known Statistician Who Taught Us to Measure Teachers.” Because I had multiple colleagues and blog followers email me (or email me about) this article, I thought I would share it out with all of you, with some additional comments, of course, but also given the comments I already made in my prior post here.

First, I will start by saying that the title of this article is misleading in that what this “little-known” statistician contributed to the field of education was hardly “little” in terms of its size and impact. Rather, Sanders and his associates at SAS Institute Inc. greatly influenced our nation in terms of the last decade of our nation’s educational policies, as largely bent on high-stakes teacher accountability for educational reform. This occurred in large part due to Sanders’s (and others’) lobbying efforts when the federal government ultimately choose to incentivize and de facto require that all states hold their teachers accountable for their value-added, or lack thereof, while attaching high-stakes consequences (e.g., teacher termination) to teachers’ value-added estimates. This, of course, was to ensure educational reform. This occurred at the federal level, as we all likely know, primarily via Race to the Top and the No Child Left Behind Waivers essentially forced upon states when states had to adopt VAMs (or growth models) to also reform their teachers, and subsequently their schools, in order to continue to receive the federal funds upon which all states still rely.

It should be noted, though, that we as a nation have been relying upon similar high-stakes educational policies since the late 1970s (i.e., for now over 35 years); however, we have literally no research evidence that these high-stakes accountability policies have yielded any of their intended effects, as still perpetually conceptualized (see, for example, Nevada’s recent legislative ruling here) and as still advanced via large- and small-scale educational policies (e.g., we are still A Nation At Risk in terms of our global competitiveness). Yet, we continue to rely on the logic in support of such “carrot and stick” educational policies, even with this last decade’s teacher- versus student-level “spin.” We as a nation could really not be more ahistorical in terms of our educational policies in this regard.

Regardless, Sanders contributed to all of this at the federal level (that also trickled down to the state level) while also actively selling his VAM to state governments as well as local school districts (i.e., including the Houston Independent School District in which teacher plaintiffs just won a recent court ruling against the Sanders value-added system here), and Sanders did this using sets of (seriously) false marketing claims (e.g., purchasing and using the EVAAS will help “clear [a] path to achieving the US goal of leading the world in college completion by the year 2020”). To see two empirical articles about the claims made to sell Sanders’s EVAAS system, the research non-existent in support of each of the claims, and the realities of those at the receiving ends of this system (i.e., teachers) as per their experiences with each of the claims, see here and here.

Hence, to assert that what this “little known” statistician contributed to education was trivial or inconsequential is entirely false. Thankfully, with the passage of the Every Student Succeeds Act” (ESSA) the federal government came around, in at least some ways. While not yet acknowledging how holding teachers accountable for their students’ test scores, while ideal, simply does not work (see the “Top Ten” reasons why this does not work here), at least the federal government has given back to the states the authority to devise, hopefully, some more research-informed educational policies in these regards (I know….).

Nonetheless, may he rest in peace (see also here), perhaps also knowing that his forever stance of “[making] no apologies for the fact that his methods were too complex for most of the teachers whose jobs depended on them to understand,” just landed his EVAAS in serious jeopardy in court in Houston (see here) given this stance was just ruled as contributing to the violation of teachers’ Fourteenth Amendment rights (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process [emphasis added]).

Also Last Thursday in Nevada: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

Last Thursday was a BIG day in terms of value-added models (VAMs). For those of you who missed it, US Magistrate Judge Smith ruled — in Houston Federation of Teachers (HFT) et al. v. Houston Independent School District (HISD) — that Houston teacher plaintiffs’ have legitimate claims regarding how their EVAAS value-added estimates, as used (and abused) in HISD, was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). See post here: “A Big Victory in Court in Houston.” On the same day, “we” won another court case — Texas State Teachers Association v. Texas Education Agency —  on which The Honorable Lora J. Livingston ruled that the state was to remove all student growth requirements from all state-level teacher evaluation systems. In other words, and in the name of increased local control, teachers throughout Texas will no longer be required to be evaluated using their students’ test scores. See prior post here: “Another Big Victory in Court in Texas.”

Also last Thursday (it was a BIG day, like I said), I testified, again, regarding a similar provision (hopefully) being passed in the state of Nevada. As per a prior post here, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” More specifically, as per AB320 the state would eliminate statewide, standardized test results as a mandated teacher evaluation measure but allow local assessments to account for 20% of a teacher’s total evaluation. AB320 is still in work session. It has the votes in committee and on the floor, thus far.

The National Center on Teacher Quality (NCTQ), unsurprisingly (see here and here), submitted (unsurprising) testimony against AB320 that can be read here, and I submitted testimony (I think, quite effectively 😉 ) refuting their “research-based” testimony, and also making explicit what I termed “The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers” here. I have also pasted my submission below, in case anybody wants to forward/share any of my main points with others, especially others in similar positions looking to impact state or local educational policies in similar ways.

*****

May 4, 2017

Dear Assemblywoman Miller:

Re: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

While I understand that the National Council on Teacher Quality (NCTQ) submitted a letter expressing their opposition against Assembly Bill (AB) 320, it should be officially noted that, counter to that which the NCTQ wrote into its “research-based” letter,[1] the American Statistical Association (ASA), the American Educational Research Association (AERA), the National Academy of Education (NAE), and other large-scale, highly esteemed, professional educational and educational research/measurement associations disagree with the assertions the NCTQ put forth. Indeed, the NCTQ is not a nonpartisan research and policy organization as claimed, but one of only a small handful of partisan operations still in existence and still pushing forward what is increasingly becoming dismissed as America’s ideal teacher evaluation systems (e.g., announced today, Texas dropped their policy requirement that standardized test scores be used to evaluate teachers; Connecticut moved in the same policy direction last month).

Accordingly, these aforementioned and highly esteemed organizations have all released statements cautioning all against the use of students’ large-scale, state-level standardized tests to evaluate teachers, primarily, for the following research-based reasons, that I have limited to ten for obvious purposes:

  1. The ASA evidenced that teacher effects correlate with only 1-14% of the variance in their students’ large-scale standardized test scores. This means that the other 86%-99% of the variance is due to factors outside of any teacher’s control (e.g., out-of-school and student-level variables). That teachers’ effects, as measured by large-scaled standardized tests (and not including other teacher effects that cannot be measured using large-scaled standardized tests), account for such little variance makes using them to evaluate teachers wholly irrational and unreasonable.
  1. Large-scale standardized tests have always been, and continue to be, developed to assess levels of student achievement, but not levels of growth in achievement over time, and definitely not growth in achievement that can be attributed back to a teacher (i.e., in terms of his/her effects). Put differently, these tests were never designed to estimate teachers’ effects; hence, using them in this regard is also psychometrically invalid and indefensible.
  1. Large-scale standardized tests, when used to evaluate teachers, often yield unreliable or inconsistent results. Teachers who should be (more or less) consistently effective are, accordingly, being classified in sometimes highly inconsistent ways year-to-year. As per the current research, a teacher evaluated using large-scale standardized test scores as effective one year has a 25% to 65% chance of being classified as ineffective the following year(s), and vice versa. This makes the probability of a teacher being identified as effective, as based on students’ large-scale test scores, no different than the flip of a coin (i.e., random).
  1. The estimates derived via teachers’ students’ large-scale standardized test scores are also invalid. Very limited evidence exists to support that teachers whose students’ yield high- large-scale standardized tests scores are also effective using at least one other correlated criterion (e.g., teacher observational scores, student satisfaction survey data), and vice versa. That these “multiple measures” don’t map onto each other, also given the error prevalent in all of the “multiple measures” being used, decreases the degree to which all measures, students’ test scores included, can yield valid inferences about teachers’ effects.
  1. Large-scale standardized tests are often biased when used to measure teachers’ purported effects over time. More specifically, test-based estimates for teachers who teach inordinate proportions of English Language Learners (ELLs), special education students, students who receive free or reduced lunches, students retained in grade, and gifted students are often evaluated not as per their true effects but group effects that bias their estimates upwards or downwards given these mediating factors. The same thing holds true with teachers who teach English/language arts versus mathematics, in that mathematics teachers typically yield more positive test-based effects (which defies logic and commonsense).
  1. Related, large-scale standardized tests estimates are fraught with measurement errors that negate their usefulness. These errors are caused by inordinate amounts of inaccurate and missing data that cannot be replaced or disregarded; student variables that cannot be statistically “controlled for;” current and prior teachers’ effects on the same tests that also prevent their use for making determinations about single teachers’ effects; and the like.
  1. Using large-scale standardized tests to evaluate teachers is unfair. Issues of fairness arise when these test-based indicators impact some teachers more than others, sometimes in consequential ways. Typically, as is true across the nation, only teachers of mathematics and English/language arts in certain grade levels (e.g., grades 3-8 and once in high school) can be measured or held accountable using students’ large-scale test scores. Across the nation, this leaves approximately 60-70% of teachers as test-based ineligible.
  1. Large-scale standardized test-based estimates are typically of very little formative or instructional value. Related, no research to date evidences that using tests for said purposes has improved teachers’ instruction or student achievement as a result. As per UCLA Professor Emeritus James Popham: The farther the test moves away from the classroom level (e.g., a test developed and used at the state level) the worst the test gets in terms of its instructional value and its potential to help promote change within teachers’ classrooms.
  1. Large-scale standardized test scores are being used inappropriately to make consequential decisions, although they do not have the reliability, validity, fairness, etc. to satisfy that for which they are increasingly being used, especially at the teacher-level. This is becoming increasingly recognized by US court systems as well (e.g., in New York and New Mexico).
  1. The unintended consequences of such test score use for teacher evaluation purposes are continuously going unrecognized (e.g., by states that pass such policies, and that states should acknowledge in advance of adapting such policies), given research has evidenced, for example, that teachers are choosing not to teach certain types of students whom they deem as the most likely to hinder their potentials positive effects. Principals are also stacking teachers’ classes to make sure certain teachers are more likely to demonstrate positive effects, or vice versa, to protect or penalize certain teachers, respectively. Teachers are leaving/refusing assignments to grades in which test-based estimates matter most, and some are leaving teaching altogether out of discontent or in professional protest.

[1] Note that the two studies the NCTQ used to substantiate their “research-based” letter would not support the claims included. For example, their statement that “According to the best-available research, teacher evaluation systems that assign between 33 and 50 percent of the available weight to student growth ‘achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than measured by a single test’ is false. First, the actual “best-available” research comes from over 10 years of peer-reviewed publications on this topic, including over 500 peer-reviewed articles. Second, what the authors of the Measures of Effective Teaching (MET) Studies found was that the percentages to be assigned to student test scores were arbitrary at best, because their attempts to empirically determine such a percentage failed. This face the authors also made explicit in their report; that is, they also noted that the percentages they suggested were not empirically supported.

Breaking News: A Big Victory in Court in Houston

Recall from multiple prior posts (see here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores, derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than anywhere else in the nation (e.g., the termination of 221 teachers in just one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just yesterday, United States Magistrate Judge Stephen Wm. Smith denied in the United States District Court, Southern District of Texas, the district’s request for summary judgment given the plaintiffs’ due process claims. Put differently, Judge Smith ruled that the plaintiffs’ did have legitimate claims regarding how EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). Hence, on this charge, this case is officially going to trial.

This is a huge victory, and one unprecedented that will likely set precedent, trial pending, for others, and more specifically other teachers.

Of primary issue will be the following (as taken from Judge Smith’s Summary Judgment released yesterday): “Plaintiffs [will continue to] challenge the use of EVAAS under various aspects of the Fourteenth Amendment, including: (1) procedural due process, due to lack of sufficient information to meaningfully challenge terminations based on low EVAAS scores,” and given “due process is designed to foster government decision-making that is both fair and accurate.”

Related, and of most importance, as also taken directly from Judge Smith’s Summary, he wrote:

  • HISD’s value-added appraisal system poses a realistic threat to deprive plaintiffs of constitutionally protected property interests in employment.
  • HISD does not itself calculate the EVAAS score for any of its teachers. Instead, that task is delegated to its third party vendor, SAS. The scores are generated by complex algorithms, employing “sophisticated software and many layers of calculations.” SAS treats these algorithms and software as trade secrets, refusing to divulge them to either HISD or the teachers themselves. HISD has admitted that it does not itself verify or audit the EVAAS scores received from SAS, nor does it engage any contractor to do so. HISD further concedes that any effort by teachers to replicate their own scores, with the limited information available to them, will necessarily fail. This has been confirmed by plaintiffs’ expert, who was unable to replicate the scores despite being given far greater access to the underlying computer codes than is available to an individual teacher [emphasis added, as also related to a prior post about how SAS claimed that plaintiffs violated SAS’s protective order (protecting its trade secrets), that the court overruled, see here].
  • The EVAAS score might be erroneously calculated for any number of reasons, ranging from data-entry mistakes to glitches in the computer code itself. Algorithms are human creations, and subject to error like any other human endeavor. HISD has acknowledged that mistakes can occur in calculating a teacher’s EVAAS score; moreover, even when a mistake is found in a particular teacher’s score, it will not be promptly corrected. As HISD candidly explained in response to a frequently asked question, “Why can’t my value-added analysis be recalculated?”:
    • Once completed, any re-analysis can only occur at the system level. What this means is that if we change information for one teacher, we would have to re- run the analysis for the entire district, which has two effects: one, this would be very costly for the district, as the analysis itself would have to be paid for again; and two, this re-analysis has the potential to change all other teachers’ reports.
  • The remarkable thing about this passage is not simply that cost considerations trump accuracy in teacher evaluations, troubling as that might be. Of greater concern is the house-of-cards fragility of the EVAAS system, where the wrong score of a single teacher could alter the scores of every other teacher in the district. This interconnectivity means that the accuracy of one score hinges upon the accuracy of all. Thus, without access to data supporting all teacher scores, any teacher facing discharge for a low value-added score will necessarily be unable to verify that her own score is error-free.
  • HISD’s own discovery responses and witnesses concede that an HISD teacher is unable to verify or replicate his EVAAS score based on the limited information provided by HISD.
  • According to the unrebutted testimony of plaintiffs’ expert, without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious “black box,” impervious to challenge.
  • While conceding that a teacher’s EVAAS score cannot be independently verified, HISD argues that the Constitution does not require the ability to replicate EVAAS scores “down to the last decimal point.” But EVAAS scores are calculated to the second decimal place, so an error as small as one hundredth of a point could spell the difference between a positive or negative EVAAS effectiveness rating, with serious consequences for the affected teacher.

Hence, “When a public agency adopts a policy of making high stakes employment decisions based on secret algorithms incompatible with minimum due process, the proper remedy is to overturn the policy.”

Moreover, he wrote, that all of this is part of the violation of teaches’ Fourteenth Amendment rights. Hence, he also wrote, “On this summary judgment record, HISD teachers have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.”

Otherwise, Judge Smith granted summary judgment to the district on the other claims forwarded by the plaintiffs, including plaintiffs’ equal protection claims. All of us involved in the case — recall that Jesse Rothstein and I served as the expert witnesses on behalf of the plaintiffs, and Thomas Kane of the Measures of Effective Teaching (MET) Project and John Friedman of the infamous Chetty et al. studies (see here and here) served as the expert witnesses on behalf of the defendants — knew that all of the plaintiffs’ claims would be tough to win given all of the constitutional legal standards would be difficult for plaintiffs to satisfy (e.g., that evaluating teachers using their value-added scores was not “unreasonable” was difficult to prove, as it was in the Tennessee case we also fought and was then dismissed on similar grounds (see here)).

Nonetheless, that “we” survived on the due process claim is fantastic, especially as this is the first case like this of which we are aware across the country.

Here is the press release, released last night by the AFT:

May 4, 2017 – AFT, Houston Federation of Teachers Hail Court Ruling on Flawed Evaluation System

Statements by American Federation of Teachers President Randi Weingarten and Houston Federation of Teachers President Zeph Capo on U.S. District Court decision on Houston’s Evaluation Value-Added Assessment System (EVAAS), known elsewhere as VAM or value-added measures:

AFT President Randi Weingarten: “Houston developed an incomprehensible, unfair and secret algorithm to evaluate teachers that had no rational meaning. This is the algebraic formula: = + (Σ∗≤Σ∗∗ × ∗∗∗∗=1)+

“U.S. Magistrate Judge Stephen Smith saw that it was seriously flawed and posed a threat to teachers’ employment rights; he rejected it. This is a huge victory for Houston teachers, their students and educators’ deeply held contention that VAM is a sham.

“The judge said teachers had no way to ensure that EVAAS was correctly calculating their performance score, nor was there a way to promptly correct a mistake. Judge Smith added that the proper remedy is to overturn the policy; we wholeheartedly agree. Teaching must be about helping kids develop the skills and knowledge they need to be prepared for college, career and life—not be about focusing on test scores for punitive purposes.”

HFT President Zeph Capo: “With this decision, Houston should wipe clean the record of every teacher who was negatively evaluated. From here on, teacher evaluation systems should be developed with educators to ensure that they are fair, transparent and help inform instruction, not be used as a punitive tool.”

New Texas Lawsuit: VAM-Based Estimates as Indicators of Teachers’ “Observable” Behaviors

Last week I spent a few days in Austin, one day during which I provided expert testimony for a new state-level lawsuit that has the potential to impact teachers throughout Texas. The lawsuit — Texas State Teachers Association (TSTA) v. Texas Education Agency (TEA), Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas.

The key issue is that, as per the state’s Texas Education Code (Sec. § 21.351, see here) regarding teachers’ “Recommended Appraisal Process and Performance Criteria,” The Commissioner of Education must adopt “a recommended teacher appraisal process and criteria on which to appraise the performance of teachers. The criteria must be based on observable, job-related behavior, including: (1) teachers’ implementation of discipline management procedures; and (2) the performance of teachers’ students.” As for the latter, the State/TEA/Commissioner defined, as per its Texas Administrative Code (T.A.C., Chapter 15, Sub-Chapter AA, §150.1001, see here), that teacher-level value-added measures should be treated as one of the four measures of “(2) the performance of teachers’ students;” that is, one of the four measures recognized by the State/TEA/Commissioner as an “observable” indicator of a teacher’s “job-related” performance.

While currently no district throughout the State of Texas is required to use a value-added component to assess and evaluate its teachers, as noted, the value-added component is listed as one of four measures from which districts must choose at least one. All options listed in the category of “observable” indicators include: (A) student learning objectives (SLOs); (B) student portfolios; (C) pre- and post-test results on district-level assessments; and (D) value-added data based on student state assessment results.

Related, the state has not recommended or required that any district, if the value-added option is selected, to choose any particular value-added model (VAM) or calculation approach. Nor has it recommended or required that any district adopt any consequences as attached to these output; however, things like teacher contract renewal and sharing teachers’ prior appraisals with other districts in which teachers might be applying for new jobs is not discouraged. Again, though, the main issue here (and the key points to which I testified) was that the value-added component is listed as an “observable” and “job-related” teacher effectiveness indicator as per the state’s administrative code.

Accordingly, my (5 hour) testimony was primarily (albeit among many other things including the “job-related” part) about how teacher-level value-added data do not yield anything that is observable in terms of teachers’ effects. Likewise, officially referring to these data in this way is entirely false, in fact, in that:

  • “We” cannot directly observe a teacher “adding” (or detracting) value (e.g., with our own eyes, like supervisors can when they conduct observations of teachers in practice);
  • Using students’ test scores to measure student growth upwards (or downwards) and over time, as is very common practice using the (very often instructionally insensitive) state-level tests required by No Child Left Behind (NCLB), and doing this once per year in mathematics and reading/language arts (that includes prior and other current teachers’ effects, summer learning gains and decay, etc.), is not valid practice. That is, doing this has not been validated by the scholarly/testing community; and
  • Worse and less valid is to thereafter aggregate this student-level growth to the teacher level and then call whatever “growth” (or the lack thereof) is because of something the teacher (and really only the teacher did), as directly “observable.” These data are far from assessing a teacher’s causal or “observable” impacts on his/her students’ learning and achievement over time. See, for example, the prior statement released about value-added data use in this regard by the American Statistical Association (ASA) here. In this statement it is written that: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation [emphasis added to note that this is variation explained which = correlational versus causal research] in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability [emphasis added] to teachers. This is not saying that teachers have little effect on students, but that variation among teachers [emphasis added] accounts for a small part of the variation [emphasis added] in [said test] scores. The majority of the variation in [said] test scores is [inversely, 86%-99% related] to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences.”

If any of you have anything to add to this, please do so in the comments section of this post. Otherwise, I will keep you posted on how this goes. My current understanding is that this one will be headed to court.

New Article Published on Using Value-Added Data to Evaluate Teacher Education Programs

A former colleague, a current PhD student, and I just had an article released about using value-added data to (or rather not to) evaluate teacher education/preparation, higher education programs. The article is titled “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added,” and the abstract of the article is included below.

If there is anyone out there who might be interested in this topic, please note that the journal in which this piece was published (online first and to be published in its paper version later) – Teaching Education – has made the article free for its first 50 visitors. Hence, I thought I’d share this with you all first.

If you’re interested, do access the full piece here.

Happy reading…and here’s the abstract:

In this study researchers examined the effectiveness of one of the largest teacher education programs located within the largest research-intensive universities within the US. They did this using a value-added model as per current federal educational policy imperatives to assess the measurable effects of teacher education programs on their teacher graduates’ students’ learning and achievement as compared to other teacher education programs. Correlational and group comparisons revealed little to no relationship between value-added scores and teacher education program regardless of subject area or position on the value-added scale. These findings are discussed within the context of several very important data and methodological challenges researchers also made transparent, as also likely common across many efforts to evaluate teacher education programs using value-added approaches. Such transparency and clarity might assist in the creation of more informed value-added practices (and more informed educational policies) surrounding teacher education accountability.

States’ Teacher Evaluation Systems Now “All over the Map”

We are now just one year past the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states must no longer set up teacher-evaluation systems based in significant part on their students’ test scores. As per a recent article written in Education Week, accordingly, most states are still tinkering with their teacher evaluation systems—particularly regarding the student growth or value-added measures (VAMs) that were also formerly required to help states assesses teachers’ purported impacts on students’ test scores over time.

“States now have a newfound flexibility to adjust their evaluation systems—and in doing so, they’re all over the map.” Likewise, though, “[a] number of states…have been moving away from [said] student growth [and value-added] measures in [teacher] evaluations,” said a friend, colleague, co-editor, and occasional writer on this blog (see, for example, here and here) Kimberly Kappler Hewitt (University of North Carolina at Greensboro).  She added that this is occurring “whether [this] means postponing [such measures’] inclusion, reducing their percentage in the evaluation breakdown, or eliminating those measures altogether.”

While states like Alabama, Iowa, and Ohio seem to still be moving forward with the attachment of students’ test scores to their teachers, other states seem to be going “back and forth” or putting a halt to all of this altogether (e.g, California). Alaska cut back the weight of the measure, while New Jersey tripled the weight to count for 30% of a teacher’s evaluation score, and then introduced a bill to reduce it back to 0%. In New York teacher are to still receive a test-based evaluation score, but it is not to be tied to consequences and completely revamped by 2019. In Alabama a bill that would have tied 25% of a teacher’s evaluation to his/her students’ ACT and ACT Aspire college-readiness tests has yet to see the light of day. In North Carolina state leaders re-framed the use(s) of such measures to be more for improvement tool (e.g., for professional development), but not “a hammer” to be used against schools or teachers. The same thing is happening in Oklahoma, although this state is not specifically mentioned in this piece.

While some might see all of this as good news — or rather better news than what we have seen for nearly the last decade during which states, state departments of education, and practitioners have been grappling with and trying to make sense of student growth measures and VAMs — others are still (and likely forever will be) holding onto what now seems to be some of the now unclenched promises attached to such stronger accountability measures.

Namely in this article, Daniel Weisberg of The New Teacher Project (TNTP) and author of the now famous “Widget Effect” report — about “Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness” that helped to “inspire” the last near-decade of these policy-based reforms — “doesn’t see states backing away” from using these measures given ESSA’s new flexibility. We “haven’t seen the clock turn back to 2009, and I don’t think [we]’re going to see that.”

Citation: Will, M. (2017). States are all over the map when it comes to how they’re looking to approach teacher-evaluation systems under ESSA. Education Week. Retrieved from http://www.edweek.org/ew/articles/2017/01/04/assessing-quality-of-teaching-staff-still-complex.html?intc=EW-QC17-TOC&_ga=1.138540723.1051944855.1481128421

Another Study about Bias in Teachers’ Observational Scores

Following-up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers representing racial and ethnic minority background might be more likely than others to not only receive lower relatively scores but also be more likely identified for possible dismissal as a result of their relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

Authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the same district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year
    comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., vie performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here), is not only unwise, it’s also possibly prejudiced.

While study authors note that their findings do not necessarily “explain why the
patterns exist or to what they may be attributed,” and that there is a “need
for further research on the potential causes of the gaps identified, as well as strategies for
ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Miami-Dade, Florida’s Recent “Symbolic” and “Artificial” Teacher Evaluation Moves

Last spring, Eduardo Porter – writer of the Economic Scene column for The New York Times – wrote an excellent article, from an economics perspective, about that which is happening with our current obsession in educational policy with “Grading Teachers by the Test” (see also my prior post about this article here; although you should give the article a full read; it’s well worth it). In short, though, Porter wrote about what economist’s often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the great (e.g., high-stakes) value (mis)placed on any measure, and the distortion (i.e., in terms of artificial inflation or deflation, depending on the desired direction of the measure) that often-to-always comes about as a result.

Well, it’s happened again, this time in Miami-Dade, Florida, where the Miami-Dade district’s teachers are saying its now “getting harder to get a good evaluation” (see the full article here). Apparently, teachers evaluation scores, from last to this year, are being “dragged down,” primarily given teachers’ students’ performances on tests (as well as tests of subject areas that and students whom they do not teach).

“In the weeks after teacher evaluations for the 2015-16 school year were distributed, Miami-Dade teachers flooded social media with questions and complaints. Teachers reported similar stories of being evaluated based on test scores in subjects they don’t teach and not being able to get a clear explanation from school administrators. In dozens of Facebook posts, they described feeling confused, frustrated and worried. Teachers risk losing their jobs if they get a series of low evaluations, and some stand to gain pay raises and a bonus of up to $10,000 if they get top marks.”

As per the figure also included in this article, see the illustration of how this is occurring below; that is, how it is becoming more difficult for teachers to get “good” overall evaluation scores but also, and more importantly, how it is becoming more common for districts to simply set different cut scores to artificially increase teachers’ overall evaluation scores.

00-00 template_cs5

“Miami-Dade say the problems with the evaluation system have been exacerbated this year as the number of points needed to get the “highly effective” and “effective” ratings has continued to increase. While it took 85 points on a scale of 100 to be rated a highly effective teacher for the 2011-12 school year, for example, it now takes 90.4.”

This, as mentioned prior, is something called “artificial deflation,” whereas the quality of teaching is likely not changing nearly to the extent the data might illustrate it is. Rather, what is happening behind the scenes (e.g., the manipulation of cut scores) is giving the impression that indeed the overall teacher system is in fact becoming better, more rigorous, aligning with policymakers’ “higher standards,” etc).

This is something in the educational policy arena that we also call “symbolic policies,” whereas nothing really instrumental or material is happening, and everything else is a facade, concealing a less pleasant or creditable reality that nothing, in fact, has changed.

Citation: Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. The Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html#storylink=cpy

New Mexico: Holding Teachers Accountable for Missing More Than 3 Days of Work

One state that seems to still be going strong after the passage of last January’s Every Student Succeeds Act (ESSA) — via which the federal government removed (or significantly relaxed) its former mandates that all states adopt and use of growth and value-added models (VAMs) to hold their teachers accountable (see here) — is New Mexico.

This should be of no surprise to followers of this blog, especially those who have not only recognized the decline in posts via this blog post ESSA (see a post about this decline here), but also those who have noted that “New Mexico” is the state most often mentioned in said posts post ESSA (see for example here, here, and here).

Well, apparently now (and post  revisions likely caused by the ongoing lawsuit regarding New Mexico’s teacher evaluation system, of which attendance is/was a part; see for example here, here, and here), teachers are to now also be penalized if missing more than three days of work.

As per a recent article in the Santa Fe New Mexican (here), and the title of this article, these new teacher attendance regulations, as to be factored into teachers’ performance evaluations, has clearly caught schools “off guard.”

“The state has said that including attendance in performance reviews helps reduce teacher absences, which saves money for districts and increases students’ learning time.” In fact, effective this calendar year, 5 percent of a teacher’s evaluation is to be made up of teacher attendance. New Mexico Public Education Department spokesman Robert McEntyre clarified that “teachers can miss up to three days of work without being penalized.” He added that “Since attendance was first included in teacher evaluations, it’s estimated that New Mexico schools are collectively saving $3.5 million in costs for substitute teachers and adding 300,000 hours of instructional time back into [their] classrooms.”

“The new guidelines also do not dock teachers for absences covered by the federal Family and Medical Leave Act, or absences because of military duty, jury duty, bereavement, religious leave or professional development programs.” Reported to me only anecdotally (i.e., I could not find evidence of this elsewhere), the new guidelines might also dock teachers for engaging in professional development or overseeing extracurricular events such as debate team performances. If anybody has anything to add on this end, especially as evidence of this, please do comment below.

Value-Added for Kindergarten Teachers in Ecuador

In a study a colleague of mine recently sent me, authors of a study recently released in The Quarterly Journal of Economics and titled “Teacher Quality and Learning Outcomes in Kindergarten,” (nearly randomly) assigned two cohorts of more than 24,000 kindergarten students to teachers to examine whether, indeed and once again, teacher behaviors are related to growth in students’ test scores over time (i.e., value-added).

To assess this, researchers administered 12 tests to the Kindergarteners (I know) at the beginning and end of the year in mathematics and language arts (although apparently the 12 posttests only took 30-40 minutes to complete, which is a content validity and coverage issue in and of itself, p. 1424). They also assessed something they called the executive function (EF), and that they defined as children’s inhibitory control, working memory, capacity to pay attention, and cognitive flexibility, all of which they argue to be related to “Volumetric measures of prefrontal cortex size [when] predict[ed]” (p. 1424). This, along with the fact that teachers’ IQs were also measured (using the Spanish-speaking version of the Wechsler Adult Intelligence Scale) speaks directly to the researchers’ background theory and approach (e.g., recall our world’s history with craniometry, aptly captured in one of my favorite books — Stephen J. Gould’s best selling “The Mismeasure of Man”). Teachers were also observed using the Classroom Assessment Scoring System (CLASS), and parents were also solicited for their opinions about their children’s’ teachers (see other measures collected p. 1417-1418).

What should by now be some familiar names (e.g., Raj Chetty, Thomas Kane) served as collaborators on the study. Likewise, their works and the works of other likely familiar scholars and notorious value-added supporters (e.g., Eric Hanushek, Jonah Rockoff) are also cited throughout in support as evidence of “substantial research” (p. 1416) in support of value-added models (VAMs). Of course, this is unfortunate but important to point out in that this is an indicator of “researcher bias” in and of itself. For example, one of the authors’ findings really should come at no surprise: “Our results…complement estimates from [Thomas Kane’s Bill & Melinda Gates Measures of Effective Teaching] MET project” (p. 1419); although, the authors in a very interesting footnote (p. 1419) describe in more detail than I’ve seen elsewhere all of the weaknesses with the MET study in terms of its design, “substantial attrition,” “serious issue[s]” with contamination and compliance, and possibly/likely biased findings caused by self-selection given the extent to which teachers volunteered to be a part of the MET study.

Also very important to note is that this study took place in Ecuador. Apparently, “they,” including some of the key players in this area of research noted above, are moving their VAM-based efforts across international waters, perhaps in part given the Every Student Succeeds Act (ESSA) recently passed in the U.S., that we should all know by now dramatically curbed federal efforts akin to what is apparently going on now and being pushed here and in other developing countries (although the authors assert that Ecuador is a middle-income country, not a developing country, even though this categorization apparently only applies to the petroleum rich sections of the nation). Related, they assert that, “concerns about teacher quality are likely to be just as important in [other] developing countries” (p. 1416); hence, adopting VAMs in such countries might just be precisely what these countries need to “reform” their schools, as well.

Unfortunately, many big businesses and banks (e.g., the Inter-American Development Bank that funded this particular study) are becoming increasingly interested in investing in and solving these and other developing countries’ educational woes, as well, via measuring and holding teachers accountable for teacher-level value-added, regardless of the extent to which doing this has not worked in the U.S to improve much of anything. Needless to say, many who are involved with these developing nation initiatives, including some of those mentioned above, are also financially benefitting by continuing to serve others their proverbial Kool-Aid.

Nonetheless, their findings:

  • First, they “estimate teacher (rather than classroom) effects of 0.09 on language and math” (p. 1434). That is, just less than 1/10th of a standard deviation, or just over a 3% move in the positive direction away from the mean.
  • Similarly, the “estimate classroom effects of 0.07 standard deviation on EF” (p. 1433). That is, precisely 7/100th of a standard deviation, or about a 2% move in the positive direction away from the mean.
  • They found that “children assigned to teachers with a 1-standard deviation higher CLASS score have between 0.05 and 0.07 standard deviation higher end-of-year test scores” (p. 1437), or a 1-2% move in the positive direction away from the mean.
  • And they found that “that parents generally give higher scores to better teachers…parents are 15 percentage points more likely to classify a teacher who produces 1 standard deviation higher test scores as ‘‘very good’’ rather than ‘‘good’’ or lower” (p. 1442). This is quite an odd way of putting it, along with the assumption that the difference between “very good” and “good” is not arbitrary but empirically grounded, along with whatever reason a simple correlation was not more simply reported.
  • Their most major finding is that “a 1 standard deviation increase in classroom quality, corrected for sampling error, results in 0.11 standard deviation higher test scores in both language and math” (p. 1433; see also other findings from p. 1434-447).

Interestingly, the authors equivocate all of these effects to teacher or classroom “shocks,” although I’d hardly call them “shocks” that inherently imply a large, unidirectional, and causal impact. Moreover, this also implies how the authors, also as economists, still view this type of research (i.e., not correlational, even with close-to-random assignment, although they make a slight mention of this possibility on p. 1449).

Nonetheless, the authors conclude that in this article they effectively evidenced “that there are substantial differences [emphasis added] in the amount of learning that takes place in language, math, and executive function across kindergarten classrooms in Ecuador” (p. 1448). In addition, “These differences are associated with differences in teacher behaviors and practices,” as observed, and “that parents can generally tell better from worse teachers, but do not meaningfully alter their investments in children in response to random shocks [emphasis added] to teacher quality” (p. 1448).

Ultimately, they find that “value added is a useful summary measure of teacher quality in Ecuador” (p. 1448). Go figure…

They conclude “to date, no country in Latin America regularly calculates the value added of teachers,” yet “in virtually all countries in the region, decisions about tenure, in-service training, promotion, pay, and early retirement are taken with no regard for (and in most cases no knowledge about) a teacher’s effectiveness” (p. 1448). Also sound familiar??

“Value added is no silver bullet,” and indeed it is not as per much evidence now existent throughout the U.S., “but knowing which teachers produce more or less learning among equivalent students [is] an important step to designing policies to improve learning outcomes” (p. 1448), they also recognizably argue.

Citation: Araujo, M. C., Carneiro, P.,  Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in Kindergarten. The Quarterly Journal of Economics, 1415–1453. doi:10.1093/qje/qjw016  Retrieved from http://qje.oxfordjournals.org/content/131/3/1415.abstract