Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just one day ago (October 10, 2017) came the case’s final federal suit settlement. Click here to read the “Settlement and Full and Final Release Agreement.” But in short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case district shall deprive any person of life, liberty, or property, without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here p. 13). This was confirmed by one of the plaintiffs’ expert witnesses, who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in its attorney and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others using the EVAAS or another VAM for that matter (e.g., that is also “unverifiable”) do take note, at minimum.

The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Whether the systems have actually been reformed, though, is also up for debate in that states are still using many of the same observational systems they were using prior (i.e., not the “binary checklists” exaggerated in the original as well as this report, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output was combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft, an Assistant Professor of Education & Economics at Brown University, and Allison F. Gilmour, an Assistant Professor at Temple University (and former doctoral student at Vanderbilt University), revisited the original report. Just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play, here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently debated whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated as highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.
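To make concrete what such a symmetrical distribution would imply, here is a minimal sketch that computes the share of teachers who would land in each of five rating categories if observational scores were forced to follow a normal curve. The cut points and category labels are hypothetical, chosen only for illustration; they are not taken from any actual evaluation system or from the studies discussed here.

```python
from statistics import NormalDist

# Hypothetical cut points (in standard-deviation units) separating five
# rating categories. These are illustrative assumptions only.
CUTS = [-1.5, -0.5, 0.5, 1.5]
LABELS = [
    "Highly Ineffective",
    "Ineffective",
    "Effective",
    "Very Effective",
    "Highly Effective",
]


def category_shares(cuts=CUTS):
    """Return the share of a standard normal distribution in each category."""
    dist = NormalDist(mu=0.0, sigma=1.0)
    edges = [dist.cdf(c) for c in cuts]
    lower = [0.0] + edges   # lower cumulative bound of each category
    upper = edges + [1.0]   # upper cumulative bound of each category
    return [hi - lo for lo, hi in zip(lower, upper)]


shares = category_shares()
for label, share in zip(LABELS, shares):
    print(f"{label:>20}: {share:6.1%}")
```

Under these assumed cut points, the lowest and highest categories hold equal shares by construction. That is exactly the premise at issue: the symmetry comes from the distributional assumption, not from anything observed about teachers.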

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see Figure below, as per Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record about how this is also occurring in response to the “Widget Effect” report and the educational policies that followed).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of this study come from one district, their findings are similar to others evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and using survey methods to put into check these systems, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also at issue in this regard, in terms of the 1% issue, are (a) how the time and effort it takes supervisors to assist/desist after rating teachers low can sometimes make assigning low ratings not worth it; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.

References:

Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from http://files.eric.ed.gov/fulltext/ED533382.pdf

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). “The Widget Effect.” Education Digest, 75(2), 31–35.

A “Next Generation” Vision for School, Teacher, and Student Accountability

Within a series of prior posts (see, for example, here and here), I have written about what the Every Student Succeeds Act (ESSA), passed in December of 2015, means for the U.S., or more specifically states’ school and teacher evaluation systems as per the federal government’s prior mandates requiring their use of growth and value-added models (VAMs).

Related, states were recently (this past May) required to submit to the federal government their revised school and teacher evaluation plans, post ESSA, given how they have changed, or not. While I have a doctoral student currently gathering updated teacher evaluation data, state-by-state, and our preliminary findings indicate that “things” have not (yet) changed much post ESSA, at least at the teacher level of focus in this study and except for in a few states (e.g., Connecticut, Oklahoma), states still have the liberties to change that which they do on both ends (i.e., school and teacher accountability).

Recently, a colleague shared with me a study titled “Next Generation Accountability: A Vision for School Improvement Under ESSA” that warrants coverage here, in hopes that states are still “out there” trying to reform their school and teacher evaluation systems, of course, for the better. While the document was drafted by folks coming from the aforementioned state of Oklahoma, who are also affiliated with the Learning Policy Institute, it is important to note that the document was also vetted by some “heavy hitters” in this line of research including, but not limited to, David C. Berliner (Arizona State University), Peter W. Cookson Jr. (American Institutes for Research (AIR)), Linda Darling-Hammond (Stanford University), and William A. Firestone (Rutgers University).

As per ESSA, states are to have increased opportunities “to develop innovative strategies for advancing equity, measuring success, and developing cycles of continuous improvement” while using “multiple measures to assess school and student performance” (p. iii). Likewise, the authors of this report state that “A broader spectrum of indicators, going well beyond a summary of annual test performance, seems necessary to account transparently for performance and assign responsibility for improvement.”

Here are some of their more specific recommendations that I found of value for blog followers:

  • The continued use of a single composite indicator to reduce and then sort teachers or schools by their overall effectiveness or performance (e.g., using teacher “effectiveness” categories or school A–F letter grades) is myopic, to say the least. This is because doing this (a) misses all that truly “matters,” including multidimensional concepts and (non)cognitive competencies we want students to know and to be able to do, not captured by large-scale tests; and (b) inhibits the usefulness of what may be informative, stand-alone data (i.e., as taken from “multiple measures” individually) once these data are reduced and then collapsed so that they can be used for hierarchical categorizations and rankings. This also (c) very much trivializes the multiple causes of low achievement, also of importance and in much greater need of attention.
  • Accordingly, “Next Generation” accountability systems should include “a broad palette of functionally significant indicators to replace [such] single composite indicators [as this] will likely be regarded as informational rather than controlling, thereby motivating stakeholders to action” (p. ix). Stakeholders should be defined in the following terms…
  • “Next Generation” accountability systems should incorporate principles of “shared accountability,” whereby educational responsibility and accountability should be “distributed across system components and not foisted upon any one group of actors or stakeholders” (p. ix). “[E]xerting pressure on stakeholders who do not have direct control over [complex educational] elements is inappropriate and worse, harmful” (p. ix). Accordingly, the goal of “shared accountability” is to “create an accountability environment in which all participants [including governmental organizations] recognize their obligations and commitments in relation to each other” (p. ix) and their collective educational goals.
  • To facilitate this, “Next Generation” information systems should be designed and implemented in order to service the “dual reporting needs of compliance with federal mandates and the particular improvement needs of a state’s schools,” while also addressing “the different information needs of state, district, school site leadership, teachers, and parents” (p. ix). Data may include, at minimum, data on school resources, processes, outcomes, and other nuanced indicators, and this information must be made transparent and accessible in order for all types of data users to be responsive, holistically and individually (e.g., at school or classroom levels). The formative functions of such “Next Generation” informational systems, accordingly, take priority, at least for initial terms, until informational data can be used to, with priority, “identify and transform schools in catastrophic failure” (p. ix).
  • Related, all test- or other educational measurement-related components of states’ “Next Generation” statutes and policies should adhere to the Standards for Educational and Psychological Testing, and more specifically their definitions of reliability, validity, bias, fairness, and the like. Statutes and policies should also be written “in the least restrictive and prescriptive terms possible to allow for [continuous] corrective action and improvement” (p. x).
  • Finally, “Next Generation” accountability systems should adhere to the following five essentials: “(a) state, district, and school leaders must create a system-wide culture grounded in “learning to improve;” (b) learning to improve using [the aforementioned informational systems also] necessitates the [overall] development of [students’] strong pedagogical data-literacy skills; (c) resources in addition to funding—including time, access to expertise, and collaborative opportunities—should be prioritized for sustaining these ongoing improvement efforts; (d) there must be a coherent structure of state-level support for learning to improve, including the development of a strong Longitudinal Data System (LDS) infrastructure; and (e) educator labor market policy in some states may need adjustment to support the above elements” (p. x).

To read more, please access the full report here.

In sum, “Next Generation” accountability systems aim at “a loftier goal—universal college and career readiness—a goal that current accountability systems were not designed to achieve. To reach this higher level, next generation accountability must embrace a wider vision, distribute trustworthy performance information, and build support infrastructure, while eliciting the assent, support, and enthusiasm of citizens and educators” (p. vii).

As briefly noted prior, “a few states have been working to put more supportive, humane accountability systems in place, but others remain stuck in a compliance mindset that undermines their ability to design effective accountability systems” (p. vii). Perhaps (or perhaps likely) this is because for the past decade or so states invested so much time, effort, and money in “reforming” their prior teacher evaluation systems as formerly required by the federal government. This included investments in states’ growth models or VAMs, onto which many/most states seem to be holding firm.

Hence, while it seems that the residual effects of the federal government’s former efforts are still dominating states’ actions with regards to educational accountability, hopefully some states can at least begin to lead the way to what will likely yield the educational reform…still desired…

The New York Times on “The Little Known Statistician” Who Passed

As many of you may recall, I wrote a post last March about the passing of William L. Sanders at age 74. Sanders developed the Education Value-Added Assessment System (EVAAS) — the value-added model (VAM) on which I have conducted most of my research (see, for example, here and here) and the VAM at the core of most of the teacher evaluation lawsuits in which I have been (or still am) engaged (see here, here, and here).

Over the weekend, though, The New York Times released a similar piece about Sanders’s passing, titled “The Little-Known Statistician Who Taught Us to Measure Teachers.” Because I had multiple colleagues and blog followers email me (or email me about) this article, I thought I would share it out with all of you, with some additional comments, of course, but also given the comments I already made in my prior post here.

First, I will start by saying that the title of this article is misleading in that what this “little-known” statistician contributed to the field of education was hardly “little” in terms of its size and impact. Rather, Sanders and his associates at SAS Institute Inc. greatly influenced our nation in terms of the last decade of our nation’s educational policies, as largely bent on high-stakes teacher accountability for educational reform. This occurred in large part due to Sanders’s (and others’) lobbying efforts when the federal government ultimately chose to incentivize and de facto require that all states hold their teachers accountable for their value-added, or lack thereof, while attaching high-stakes consequences (e.g., teacher termination) to teachers’ value-added estimates. This, of course, was to ensure educational reform. This occurred at the federal level, as we all likely know, primarily via Race to the Top and the No Child Left Behind Waivers essentially forced upon states when states had to adopt VAMs (or growth models) to also reform their teachers, and subsequently their schools, in order to continue to receive the federal funds upon which all states still rely.

It should be noted, though, that we as a nation have been relying upon similar high-stakes educational policies since the late 1970s (i.e., for now over 35 years); however, we have literally no research evidence that these high-stakes accountability policies have yielded any of their intended effects, as still perpetually conceptualized (see, for example, Nevada’s recent legislative ruling here) and as still advanced via large- and small-scale educational policies (e.g., we are still A Nation At Risk in terms of our global competitiveness). Yet, we continue to rely on the logic in support of such “carrot and stick” educational policies, even with this last decade’s teacher- versus student-level “spin.” We as a nation could really not be more ahistorical in terms of our educational policies in this regard.

Regardless, Sanders contributed to all of this at the federal level (that also trickled down to the state level) while also actively selling his VAM to state governments as well as local school districts (i.e., including the Houston Independent School District in which teacher plaintiffs just won a recent court ruling against the Sanders value-added system here), and Sanders did this using sets of (seriously) false marketing claims (e.g., purchasing and using the EVAAS will help “clear [a] path to achieving the US goal of leading the world in college completion by the year 2020”). To see two empirical articles about the claims made to sell Sanders’s EVAAS system, the research non-existent in support of each of the claims, and the realities of those at the receiving ends of this system (i.e., teachers) as per their experiences with each of the claims, see here and here.

Hence, to assert that what this “little known” statistician contributed to education was trivial or inconsequential is entirely false. Thankfully, with the passage of the Every Student Succeeds Act (ESSA) the federal government came around, in at least some ways. While not yet acknowledging how holding teachers accountable for their students’ test scores, while ideal, simply does not work (see the “Top Ten” reasons why this does not work here), at least the federal government has given back to the states the authority to devise, hopefully, some more research-informed educational policies in these regards (I know….).

Nonetheless, may he rest in peace (see also here), perhaps also knowing that his forever stance of “[making] no apologies for the fact that his methods were too complex for most of the teachers whose jobs depended on them to understand,” just landed his EVAAS in serious jeopardy in court in Houston (see here) given this stance was just ruled as contributing to the violation of teachers’ Fourteenth Amendment rights (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process [emphasis added]).

Also Last Thursday in Nevada: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

Last Thursday was a BIG day in terms of value-added models (VAMs). For those of you who missed it, US Magistrate Judge Smith ruled — in Houston Federation of Teachers (HFT) et al. v. Houston Independent School District (HISD) — that Houston teacher plaintiffs have legitimate claims that the use of their EVAAS value-added estimates, as used (and abused) in HISD, was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). See post here: “A Big Victory in Court in Houston.” On the same day, “we” won another court case — Texas State Teachers Association v. Texas Education Agency — in which The Honorable Lora J. Livingston ruled that the state was to remove all student growth requirements from all state-level teacher evaluation systems. In other words, and in the name of increased local control, teachers throughout Texas will no longer be required to be evaluated using their students’ test scores. See prior post here: “Another Big Victory in Court in Texas.”

Also last Thursday (it was a BIG day, like I said), I testified, again, regarding a similar provision (hopefully) being passed in the state of Nevada. As per a prior post here, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” More specifically, as per AB320 the state would eliminate statewide, standardized test results as a mandated teacher evaluation measure but allow local assessments to account for 20% of a teacher’s total evaluation. AB320 is still in work session. It has the votes in committee and on the floor, thus far.

The National Council on Teacher Quality (NCTQ), unsurprisingly (see here and here), submitted (unsurprising) testimony against AB320 that can be read here, and I submitted testimony (I think, quite effectively 😉 ) refuting their “research-based” testimony, and also making explicit what I termed “The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers” here. I have also pasted my submission below, in case anybody wants to forward/share any of my main points with others, especially others in similar positions looking to impact state or local educational policies in similar ways.

*****

May 4, 2017

Dear Assemblywoman Miller:

Re: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

While I understand that the National Council on Teacher Quality (NCTQ) submitted a letter expressing their opposition to Assembly Bill (AB) 320, it should be officially noted that, counter to that which the NCTQ wrote into its “research-based” letter,[1] the American Statistical Association (ASA), the American Educational Research Association (AERA), the National Academy of Education (NAE), and other large-scale, highly esteemed, professional educational and educational research/measurement associations disagree with the assertions the NCTQ put forth. Indeed, the NCTQ is not a nonpartisan research and policy organization as claimed, but one of only a small handful of partisan operations still in existence and still pushing forward what is increasingly becoming dismissed as America’s ideal teacher evaluation systems (e.g., announced today, Texas dropped their policy requirement that standardized test scores be used to evaluate teachers; Connecticut moved in the same policy direction last month).

Accordingly, these aforementioned and highly esteemed organizations have all released statements cautioning all against the use of students’ large-scale, state-level standardized tests to evaluate teachers, primarily, for the following research-based reasons, that I have limited to ten for obvious purposes:

  1. The ASA evidenced that teacher effects account for only 1-14% of the variance in their students’ large-scale standardized test scores. This means that the other 86%-99% of the variance is due to factors outside of any teacher’s control (e.g., out-of-school and student-level variables). That teachers’ effects, as measured by large-scale standardized tests (and not including other teacher effects that cannot be measured using large-scale standardized tests), account for such little variance makes using them to evaluate teachers wholly irrational and unreasonable.
  2. Large-scale standardized tests have always been, and continue to be, developed to assess levels of student achievement, but not levels of growth in achievement over time, and definitely not growth in achievement that can be attributed back to a teacher (i.e., in terms of his/her effects). Put differently, these tests were never designed to estimate teachers’ effects; hence, using them in this regard is also psychometrically invalid and indefensible.
  3. Large-scale standardized tests, when used to evaluate teachers, often yield unreliable or inconsistent results. Teachers who should be (more or less) consistently effective are, accordingly, being classified in sometimes highly inconsistent ways year-to-year. As per the current research, a teacher evaluated using large-scale standardized test scores as effective one year has a 25% to 65% chance of being classified as ineffective the following year(s), and vice versa. This makes the probability of a teacher being identified as effective, as based on students’ large-scale test scores, no different than the flip of a coin (i.e., random).
  4. The estimates derived via teachers’ students’ large-scale standardized test scores are also invalid. Very limited evidence exists to support that teachers whose students yield high large-scale standardized test scores are also effective using at least one other correlated criterion (e.g., teacher observational scores, student satisfaction survey data), and vice versa. That these “multiple measures” don’t map onto each other, also given the error prevalent in all of the “multiple measures” being used, decreases the degree to which all measures, students’ test scores included, can yield valid inferences about teachers’ effects.
  5. Large-scale standardized tests are often biased when used to measure teachers’ purported effects over time. More specifically, teachers who teach inordinate proportions of English Language Learners (ELLs), special education students, students who receive free or reduced lunches, students retained in grade, and gifted students are often evaluated not as per their true effects but as per group effects that bias their test-based estimates upwards or downwards given these mediating factors. The same thing holds true with teachers who teach English/language arts versus mathematics, in that mathematics teachers typically yield more positive test-based effects (which defies logic and common sense).
  6. Related, large-scale standardized test estimates are fraught with measurement errors that negate their usefulness. These errors are caused by inordinate amounts of inaccurate and missing data that cannot be replaced or disregarded; student variables that cannot be statistically “controlled for;” current and prior teachers’ effects on the same tests that also prevent their use for making determinations about single teachers’ effects; and the like.
  7. Using large-scale standardized tests to evaluate teachers is unfair. Issues of fairness arise when these test-based indicators impact some teachers more than others, sometimes in consequential ways. Typically, as is true across the nation, only teachers of mathematics and English/language arts in certain grade levels (e.g., grades 3-8 and once in high school) can be measured or held accountable using students’ large-scale test scores. Across the nation, this leaves approximately 60-70% of teachers as test-based ineligible.
  8. Large-scale standardized test-based estimates are typically of very little formative or instructional value. Related, no research to date evidences that using tests for said purposes has improved teachers’ instruction or student achievement as a result. As per UCLA Professor Emeritus James Popham: The farther the test moves away from the classroom level (e.g., a test developed and used at the state level) the worse the test gets in terms of its instructional value and its potential to help promote change within teachers’ classrooms.
  9. Large-scale standardized test scores are being used inappropriately to make consequential decisions, although they do not have the reliability, validity, fairness, etc. to satisfy that for which they are increasingly being used, especially at the teacher-level. This is becoming increasingly recognized by US court systems as well (e.g., in New York and New Mexico).
  10. The unintended consequences of such test score use for teacher evaluation purposes are continuously going unrecognized (e.g., by states that pass such policies, and that states should acknowledge in advance of adopting such policies), given research has evidenced, for example, that teachers are choosing not to teach certain types of students whom they deem as the most likely to hinder their potential positive effects. Principals are also stacking teachers’ classes to make sure certain teachers are more likely to demonstrate positive effects, or vice versa, to protect or penalize certain teachers, respectively. Teachers are leaving/refusing assignments to grades in which test-based estimates matter most, and some are leaving teaching altogether out of discontent or in professional protest.
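
The point above about inconsistent year-to-year classifications can be illustrated with a minimal sketch. It rests on assumptions that are mine and not findings from the research cited: teachers’ scores in two consecutive years are treated as bivariate normal with correlation rho, and “effective” is taken to mean scoring above the median. Under those assumptions, the chance that a teacher rated effective one year is rated ineffective the next is 1/2 - arcsin(rho)/pi. The function name flip_probability is hypothetical.

```python
import math


def flip_probability(rho: float) -> float:
    """Chance that a teacher above the median in year 1 falls below it in year 2.

    Assumes (for illustration only) bivariate normal scores with
    year-to-year correlation rho and a median cut for "effective".
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be between -1 and 1")
    # Orthant probability for a bivariate normal, conditioned on year 1 > median.
    return 0.5 - math.asin(rho) / math.pi


if __name__ == "__main__":
    for rho in (0.0, 0.2, 0.5, 0.8):
        print(f"rho={rho:.1f}: P(reclassified) = {flip_probability(rho):.2f}")
```

With no year-to-year correlation at all, the result is exactly 0.5, which is the coin flip described above.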

[1] Note that the two studies the NCTQ used to substantiate their “research-based” letter would not support the claims included. For example, their statement that “According to the best-available research, teacher evaluation systems that assign between 33 and 50 percent of the available weight to student growth ‘achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than measured by a single test’” is false. First, the actual “best-available” research comes from over 10 years of peer-reviewed publications on this topic, including over 500 peer-reviewed articles. Second, what the authors of the Measures of Effective Teaching (MET) Studies found was that the percentages to be assigned to student test scores were arbitrary at best, because their attempts to empirically determine such a percentage failed. This fact the authors also made explicit in their report; that is, they also noted that the percentages they suggested were not empirically supported.

Nevada (Potentially) Dropping Students’ Test Scores from Its Teacher Evaluation System

This week in Nevada “Lawmakers Mull[ed] Dropping Student Test Scores from Teacher Evaluations,” as per a recent article in The Nevada Independent (see here). This would be quite a move from 2011 when the state (as backed by state Republicans, not backed by federal Race to the Top funds, and as inspired by Michelle Rhee) passed into policy a requirement that 50% of all Nevada teachers’ evaluations were to rely on said data. The current percentage rests at 20%, but it is to double next year to 40%.

Nevada is one of a still uncertain number of states looking to retract the weight and purported “value-added” of such measures. Note also that last week Connecticut dropped some of its test-based components of its teacher evaluation system (see here). All of this is occurring, of course, post the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states are no longer required to set up teacher-evaluation systems based in significant part on their students’ test scores.

Accordingly, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” The Democratic Assembly Speaker, for example, said that “he’s always been troubled that teachers are rated on standardized test scores,” more specifically noting: “I don’t think any single teacher that I’ve talked to would shirk away from being held accountable…[b]ut if they’re going to be held accountable, they want to be held accountable for things that … reflect their actual work.” I’ve never met a teacher who would disagree with this statement.

Anyhow, this past Monday the state’s Assembly Education Committee heard public testimony on these matters and three bills “that would alter the criteria for how teachers’ effectiveness is measured.” These three bills are as follows:

  • AB212 would prohibit the use of student test scores in evaluating teachers, while
  • AB320 would eliminate statewide [standardized] test results as a measure but allow local assessments to account for 20 percent of the total evaluation.
  • AB312 would ensure that teachers in overcrowded classrooms not be penalized for certain evaluation metrics deemed out of their control given the student-to-teacher ratio.

Many presented testimony in support of these bills over an extended period of time on Tuesday. I was also invited to speak, during which I “cautioned lawmakers against being ‘mesmerized’ by the promised objectivity of standardized tests. They have their own flaws, [I] argued, estimating that 90-95 percent of researchers who are looking at the effects of high-stakes testing agree that they’re not moving the dial [really whatsoever] on teacher performance.”

Lawmakers have until the end of tomorrow (i.e., Friday) to pass these bills out of committee. Otherwise, they will die.

Of course, I will keep you posted, but things are currently looking “very promising,” especially for AB320.

NCTQ on States’ Teacher Evaluation Systems’ Failures

The controversial National Council on Teacher Quality (NCTQ) — created by the conservative Thomas B. Fordham Institute and funded (in part) by the Bill & Melinda Gates Foundation as “part of a coalition for ‘a better orchestrated agenda’ for accountability, choice, and using test scores to drive the evaluation of teachers” (see here; see also other instances of controversy here and here) — recently issued yet another report about states’ teacher evaluation systems titled: “Running in Place: How New Teacher Evaluations Fail to Live Up to Promises.” See a related blog post in Education Week about this report here. See also a related blog post about NCTQ’s prior large-scale (and also slanted) study — “State of the States 2015: Evaluating Teaching, Leading and Learning” — here. Like I did in that post, I summarize this study below.

From the abstract: Authors of this report find that “within the 30 states that [still] require student learning measures to be at least a significant factor in teacher evaluations, state guidance and rules in most states allow teachers to be rated effective even if they receive low scores on the student learning component of the evaluation.” They add in the full report that in many states “a high score on an evaluation’s observation and [other] non-student growth components [can] result in a teacher earning near or at the minimum number of points needed to earn an effective rating. As a result, a low score on the student growth component of the evaluation is sufficient in several states to push a teacher over the minimum number of points needed to earn a summative effective rating. This essentially diminishes any real influence the student growth component has on the summative evaluation rating” (p. 3-4).

The first assumption surrounding the authors’ main tenets they make explicit: that “[u]nfortunately, [the] policy transformation [that began with the publication of the “Widget Effect” report in 2009] has not resulted in drastic alterations in outcomes” (p. 2). This is because, “[in] effect…states have been running in place” (p. 2) and not using teachers’ primarily test-based indicators for high-stakes decision-making. Hence, “evaluation results continue to look much like they did…back in 2009” (p. 2). The authors then, albeit ahistorically, ask, “How could so much effort to change state laws result in so little actual change?” (p. 2). Yet they don’t realize (or care to realize) that this is because we have almost 40 years of evidence that really any type of test-based, educational accountability policies and initiatives have never yielded their intended consequences (i.e., increased student achievement on national and international indicators). Rather, the authors argue that “most states’ evaluation laws fated these systems to status quo results long before” they really had a chance (p. 2).

The authors’ second assumption they imply: that the two most often used teacher evaluation indicators (i.e., the growth or value-added and observational measures) should be highly correlated, which many argue they should be IF in fact they are measuring general teacher effectiveness. But the more fundamental assumption here is that if the student learning (i.e., test based) indicators do not correlate with the observational indicators, the latter MUST be wrong, biased, distorted, and accordingly less trustworthy and the like. They add that “teachers and students are not well served when a teacher is rated effective or higher even though her [sic] students have not made sufficient gains in their learning over the course of a school year” (p. 4). Accordingly, they add that “evaluations should require that a teacher is rated well on both the student growth measures and the professional practice component (e.g., observations, student surveys, etc.) in order to be rated effective” (p. 4). Hence, also in this report the authors put forth recommendations for how states might address this challenge. See these recommendations forthcoming, as also related to a new phenomenon my students and I are studying called artificial inflation.

Artificial inflation is a term I recently coined to represent what is/was happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators to align their observational ratings of teachers’ effectiveness with teachers’ value-added scores, with the latter being (sometimes relentlessly) considered the “objective measure” around which all other measures (e.g., subjective observational measures) should revolve, or align. Hence, the push is to conflate the latter “subjective” measure to match the former “objective” measure, even if the process of artificial conflation causes both indicators to become invalid. As per my affidavit from the still ongoing lawsuit in Houston (see here), “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s ‘Standards for Educational and Psychological Testing.’”

Nonetheless…

Here is one important figure, taken out of context in some ways on purpose (e.g., as the text surrounding this particular figure is ironically, subjectively used to define what the NCTQ defines as indicators of progress, or regress).

Near Figure 1 (p. 1) the authors note that “as of January 2017, there has been little evidence of a large-scale reversal of states’ formal evaluation policies. In fact, only four states (Alaska, Mississippi, North Carolina, and Oklahoma) have reversed course on factoring student learning into a teacher’s evaluation rating” (p. 3). While this reversal of four is not illustrated in their accompanying figure, see also a prior post about what other states, beyond just these four states of dishonorable mention, have done to “reverse” the “course” (p. 3) here. While the authors shame all states for minimizing teachers’ test-based ratings before these systems had a chance, as also ignorant to what they cite as “a robust body of research” (without references or citations here, and few elsewhere in a set of footnotes), they add that it remains an unknown as to “why state educational agencies put forth regulations or guidance that would allow teachers to be rated effective without meeting their student growth goals” (p. 4). Many of us know that this was often done to counter the unreliable and invalid results often yielded via the “objective” test-based sides of things that the NCTQ continues to advance.

Otherwise, here are also some important descriptive findings:

  • Thirty states require measures of student academic growth to be at least a significant factor within teacher evaluations; another 10 states require some student growth, and 11 states do not require any objective measures of student growth (p. 5).
  • With only [emphasis added] two exceptions, in the 30 states where student growth is at least a significant factor in teacher evaluations, state rules or guidance effectively allow teachers who have not met student growth goals to still receive a summative rating of at least effective (p. 5).
  • In 18 [of these 30] states, state educational agency regulations and/or guidance explicitly permit teachers to earn a summative rating of effective even after earning a less-than-effective score on the student learning portion of their evaluations…these regulations meet the letter of the law while still allowing teachers with low ratings on student growth measures to be rated effective or higher (p. 5). In Colorado, for example…a teacher can earn a rating of highly effective with a score of just 1 for student growth (which the state classifies as “less than expected”) in conjunction with a top professional practice score (p. 4).
  • Ten states do not specifically address whether a teacher who has not met student growth goals may be rated as effective or higher. These states neither specifically allow nor specifically disallow such a scenario, but by failing to provide guidance to prevent such an occurrence, they enable it to exist (p. 6).
  • Only two of the 30 states (Indiana and Kentucky) make it impossible for a teacher who has not been found effective at increasing student learning to receive a summative rating of effective (p. 6).

Finally, here are some of their important recommendations, as related to all of the above, and to create more meaningful teacher evaluation systems. So they argue, states should:

  • Establish policies that preclude teachers from earning a label of effective if they are found ineffective at increasing student learning (p. 12).
  • Track the results of discrete components within evaluation systems, both statewide and districtwide. In districts where student growth measures and observation measures are significantly out of alignment, states should reevaluate their systems and/or offer districts technical assistance (p. 12). [That is, states should possibly promote artificial inflation as we have observed elsewhere. The authors add that] to ensure that evaluation ratings better reflect teacher performance, states should [more specifically] track the results of each evaluation measure to pinpoint where misalignment between components, such as between student learning and observation measures, exists. Where major components within an evaluation system are significantly misaligned, states should examine their systems and offer districts technical assistance where needed, whether through observation training or examining student growth models or calculations (p. 12-13). [Tennessee, for example,] publishes this information so that it is transparent and publicly available to guide actions by key stakeholders and point the way to needed reforms (p. 13).

See also state-by-state reports in the appendices of the full report, in case your state was one of the states that responded or, rather, “recognized the factual accuracy of this analysis.”

Citation: Walsh, K., Joseph, N., Lakis, K., & Lubell, S. (2017). Running in place: How new teacher evaluations fail to live up to promises. Washington DC: National Council on Teacher Quality (NCTQ). Retrieved from http://www.nctq.org/dmsView/Final_Evaluation_Paper

Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers representing racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of their relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

Authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the same district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise, it’s also possibly prejudiced.

While study authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

The “Value-Added” of Teacher Preparation Programs: New Research

The journal Economics of Education Review recently published a study titled “Teacher Quality Differences Between Teacher Preparation Programs: How Big? How Reliable? Which Programs Are Different?” The study was authored by researchers at the University of Texas – Austin, Duke University, and Tulane. The pre-publication version of this piece can be found here.

As the title implies, the purpose of the study was to “evaluate statistical methods for estimating teacher quality differences between TPPs [teacher preparation programs].” Needless to say, this research is particularly relevant, here, given “Sixteen US states have begun to hold teacher preparation programs (TPPs) accountable for teacher quality, where quality is estimated by teacher value-added to student test scores.” The federal government continues to support and advance these initiatives, as well (see, for example, here).

But this research study is also particularly important because researchers found that “[t]he most convincing estimates [of TPP quality] [came] from a value-added model where confidence intervals [were] widened;” that is, the extent to which measurement errors were permitted was dramatically increased, and the intervals were also widened further using statistical corrections. Yet even when using these statistical techniques and accommodations, they found that it was still “rarely possible to tell which TPPs, if any, [were] better or worse than average.”

They therefore concluded that “[t]he potential benefits of TPP accountability may be too small to balance the risk that a proliferation of noisy TPP estimates will encourage arbitrary and ineffective policy actions” in response. More specifically, and in their own words, they found that:

  1. Differences between TPPs. While most of [their] results suggest that real differences between TPPs exist, the differences [were] not large [or large enough to make or evidence the differentiation between programs as conceptualized and expected]. [Their] estimates var[ied] a bit with their statistical methods, but averaging across plausible methods [they] conclude[d] that between TPPs the heterogeneity [standard deviation (SD) was] about .03 in math and .02 in reading. That is, a 1 SD increase in TPP quality predict[ed] just [emphasis added] a [very small] .03 SD increase in student math scores and a [very small] .02 SD increase in student reading scores.
  2. Reliability of TPP estimates. Even if the [above-mentioned] differences between TPPs were large enough to be of policy interest, accountability could only work if TPP differences could be estimated reliably. And [their] results raise doubts that they can. Every plausible analysis that [they] conducted suggested that TPP estimates consist[ed] mostly of noise. In some analyses, TPP estimates appeared to be about 50% noise; in other analyses, they appeared to be as much as 80% or 90% noise…Even in large TPPs the estimates were mostly noise [although]…[i]t is plausible [although perhaps not probable]…that TPP estimates would be more reliable if [researchers] had more than one year of data…[although states smaller than the one in this study — Texas]…would require 5 years to accumulate the amount of data that [they used] from one year of data.
  3. Notably Different TPPs. Even if [they] focus[ed] on estimates from a single model, it remains hard to identify which TPPs differ from the average…[Again,] TPP differences are small and estimates of them are uncertain.
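
The noise figures quoted above can be turned into a rough calculation of how much additional years of data might help. This is only a sketch under assumptions that are mine, not the study authors’: the noise in each year’s estimate is independent, true TPP differences are stable across years, and averaging n years therefore divides the noise variance by n. The function name reliability_after_years is hypothetical.

```python
def reliability_after_years(noise_share: float, years: int) -> float:
    """Share of variance in an n-year average estimate that is signal.

    noise_share: fraction of a single-year estimate's variance that is noise.
    years: number of years averaged (assumed independent noise each year).
    """
    if not 0.0 <= noise_share <= 1.0:
        raise ValueError("noise_share must be between 0 and 1")
    if years < 1:
        raise ValueError("years must be at least 1")
    signal = 1.0 - noise_share       # stable, true-difference variance
    noise = noise_share / years      # averaging shrinks noise variance
    return signal / (signal + noise)


if __name__ == "__main__":
    for share in (0.5, 0.8, 0.9):
        for n in (1, 5):
            r = reliability_after_years(share, n)
            print(f"noise share {share:.0%}, {n} year(s): reliability {r:.2f}")
```

Under these assumptions, an estimate that is 80% noise in a single year would still be a little under half noise after five years of averaging.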

In conclusion, that researchers found “that there are only small teacher quality differences between TPPs” might seem surprising, but not really given the outcome variables they used to measure and assess TPP effects were students’ test scores. In short, students’ test scores are three times removed from the primary unit of analysis in studies like these. That is, (1) the TPP is to be measured by the effectiveness of its teacher graduates, and (2) teacher graduates are to be measured by their purported impacts on their students’ test scores, while (3) students’ test scores are only to be used, and have only been validated, for measuring student learning and achievement. These test scores have not been validated to assess and measure, in the inverse, teachers’ causal impacts on said achievements or TPPs’ impacts, via teachers, on said achievements.

If this sounds confusing, it is, and also highly nonsensical, but this is also a reason why this is so difficult to do, and as evidenced in this study, improbable to do this well or as theorized in that TPP estimates are sensitive to error, insensitive given error, and, accordingly, highly uncertain and invalid.

Citation: von Hippel, P. T., Bellows, L., Osborne, C., Lincove, J. A., & Mills, N. (2016). Teacher quality differences between teacher preparation programs: How big? How reliable? Which programs are different? Economics of Education Review, 53, 31–45. doi:10.1016/j.econedurev.2016.05.002

U.S. Department of Education: Value-Added Not Good for Evaluating Schools and Principals

Just this month, the Institute of Education Sciences (IES) wing of the U.S. Department of Education released a report about using value-added models (VAMs) for measuring school principals’ performance. The article conducted by researchers at Mathematica Policy Research and titled “Can Student Test Scores Provide Useful Measures of School Principals’ Performance?” can be found online here, with my summary of the study findings highlighted next and herein.

Before the passage of the Every Student Succeeds Act (ESSA), 40 states had written into their state statutes, as incentivized by the federal government, the use of growth in student achievement for annual principal evaluation purposes. More states had written growth/value-added models (VAMs) into statute for teacher evaluation purposes, which we have covered extensively via this blog, but this pertains only to school and/or principal evaluation purposes. Now since the passage of ESSA, and the reduction in the federal government’s control over state-level policies, states have much more liberty to decide whether to continue using student achievement growth for either purpose. This paper is positioned within this reasoning, and more specifically to help states decide whether or to what extent they might (or might not) continue to move forward with using growth/VAMs for school and principal evaluation purposes.

Researchers, more specifically, assessed (1) reliability – or the consistency or stability of these ratings over time, which is important “because only stable parts of a rating have the potential to contain information about principals’ future performance; unstable parts reflect only transient aspects of their performance;” and (2) one form of multiple evidences of validity – the predictive validity of these principal-level measures, with predictive validity defined as “the extent to which ratings from these measures accurately reflect principals’ contributions to student achievement in future years.” In short, “A measure could have high predictive validity only if [emphasis added] it was highly stable between consecutive years [i.e., reliability]…and its stable part was strongly related to principals’ contributions to student achievement” over time (i.e., predictive validity).
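
The quoted “only if” condition can be written as a simple product. As a sketch, assume (this is my assumption, not the report’s model) that a rating is the sum of a stable part and an unstable part, and that the unstable part is uncorrelated with both the stable part and the principal’s future contribution. Then the correlation between the rating and the future contribution equals the square root of the stable share of the rating’s variance, times the correlation between the stable part and that contribution. The function name predictive_correlation is hypothetical.

```python
import math


def predictive_correlation(stable_share: float, stable_corr: float) -> float:
    """Correlation between a rating and a principal's future contribution.

    stable_share: fraction of the rating's variance that is stable over time.
    stable_corr: correlation between the stable part and the future contribution.
    Assumes the unstable part is pure noise, unrelated to the contribution.
    """
    if not 0.0 <= stable_share <= 1.0:
        raise ValueError("stable_share must be between 0 and 1")
    if not -1.0 <= stable_corr <= 1.0:
        raise ValueError("stable_corr must be between -1 and 1")
    return math.sqrt(stable_share) * stable_corr


if __name__ == "__main__":
    for share, corr in ((0.25, 0.5), (0.25, 0.9), (0.81, 0.5), (0.81, 0.9)):
        print(f"stable share {share:.2f}, stable corr {corr:.1f}: "
              f"predictive corr {predictive_correlation(share, corr):.2f}")
```

Both factors must be high for the product to be high, which mirrors the condition stated in the passage above.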

Researchers used principal-level value-added (unadjusted and adjusted for prior achievement and other potentially biasing demographic variables) to more directly examine “the extent to which student achievement growth at a school differed from average growth statewide for students with similar prior achievement and background characteristics.” Also important to note is that the data they used to examine school-level value-added came from Pennsylvania, which is one of a handful of states that uses the popular and proprietary (and controversial) Education Value-Added Assessment System (EVAAS) statewide.

Here are the researchers’ key findings, taken directly from the study’s summary (again, for more information see the full manuscript here).

  • The two performance measures in this study that did not account for students’ past achievement—average achievement and adjusted average achievement—provided no information for predicting principals’ contributions to student achievement in the following year.
  • The two performance measures in this study that accounted for students’ past achievement—school value-added and adjusted school value-added—provided, at most, a small amount of information for predicting principals’ contributions to student achievement in the following year. This was due to instability and inaccuracy in the stable parts.
  • Averaging performance measures across multiple recent years did not improve their accuracy for predicting principals’ contributions to student achievement in the following year. In simpler terms, a principal’s average rating over three years did not predict his or her future contributions more accurately than did a rating from the most recent year only. This is more of a statistical finding than one that has direct implications for policy and practice (except for silly states who might, despite findings like those presented in this study, decide that they can use one year to do this not at all well instead of three years to do this not at all well).

Their bottom line? “…no available measures of principal [/school] performance have yet been shown to accurately identify principals [/schools] who will contribute successfully to student outcomes in future years,” especially if based on students’ test scores, although the researchers also assert that “no research has ever determined whether non-test measures, such as measures of principals’ leadership practices, [have successfully or accurately] predict[ed] their future contributions” either.

The researchers follow up with a highly cautionary note: “the value-added measures will make plenty of mistakes when trying to identify principals [/schools] who will contribute effectively or ineffectively to student achievement in future years. Therefore, states and districts should exercise caution when using these measures to make major decisions about principals. Given the inaccuracy of the test-based measures, state and district leaders and researchers should also make every effort to identify nontest measures that can predict principals’ future contributions to student outcomes [instead].”

Citation: Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from http://ies.ed.gov/ncee/pubs/2016002/pdf/2016002.pdf