Learning from What Doesn’t Work in Teacher Evaluation

One of my doctoral students — Kevin Close — and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share out with all of you, especially before the study is no longer open-access or free (see full study as currently available here). As the title indicates, the study is about how states, school districts, and schools can “Learn from What Doesn’t Work in Teacher Evaluation,” given an analysis that the two of us conducted of all documents pertaining to the four teacher evaluation and value-added model (VAM)-centered lawsuits in which I have been directly involved, and that I have also covered in this blog. These lawsuits include Lederman v. King in New York (see here), American Federation of Teachers et al. v. Public Education Department in New Mexico (see here), Houston Federation of Teachers v. Houston Independent School District in Texas (see here), and Trout v. Knox County Board of Education in Tennessee (see here).

Via this analysis we set out to comb through the legal documents to identify the strongest objections, as also recognized by the courts in these lawsuits, to VAMs as teacher measurement and accountability strategies. “The lessons to be learned from these cases are both important and timely” given that “[u]nder the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work.”

The most pertinent and also common issues as per these cases were as follows:

(1) Inconsistencies in teachers’ VAM-based estimates from one year to the next that are sometimes “wildly different.” Across these lawsuits, issues with reliability were very evident, in that teachers classified as “effective” one year were either theorized or demonstrated to have around a 25%-59% chance of being classified as “ineffective” the next year, or vice versa, with other permutations also possible. As per our profession’s Standards for Educational and Psychological Testing, reliability should instead be evident in VAM estimates of teacher effectiveness that are more or less consistent over time, from one year to the next, regardless of the types of students and, perhaps, the subject areas that teachers teach.
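
To make the reliability problem concrete, below is a minimal, purely illustrative simulation (mine, not part of the published study or the lawsuits): it assumes a year-to-year correlation of 0.35 between a teacher’s VAM scores and uses arbitrary quintile cutoffs for “effective” and “ineffective,” both assumptions chosen only for illustration, and it shows how readily classifications flip from one year to the next under those conditions.

```python
import numpy as np

# Minimal, illustrative sketch (mine, not from the study or the lawsuits):
# how a modest year-to-year correlation between a teacher's VAM scores can
# produce large reclassification rates. The correlation (0.35) and the
# quintile cutoffs are assumptions chosen only for illustration.
rng = np.random.default_rng(0)
n = 100_000
r = 0.35  # assumed year-to-year correlation of a teacher's VAM scores

year1 = rng.standard_normal(n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.standard_normal(n)

top_y1 = year1 >= np.quantile(year1, 0.80)      # labeled "effective" in year 1
below_median_y2 = year2 < np.median(year2)      # below average in year 2
bottom_y2 = year2 <= np.quantile(year2, 0.20)   # labeled "ineffective" in year 2

print(f"Top-quintile teachers falling below the median next year: "
      f"{below_median_y2[top_y1].mean():.0%}")
print(f"Top-quintile teachers falling to the bottom quintile next year: "
      f"{bottom_y2[top_y1].mean():.0%}")
```

Under these assumed values, roughly 30% of the teachers labeled most effective one year fall below the median the next, and a smaller but nontrivial share drop all the way to the bottom quintile, which is the kind of volatility at issue across these cases.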

(2) Bias in teachers’ VAM-based estimates was also of note, in that documents suggested or evidenced that biased estimates of teachers’ actual effects do indeed exist (although this was also the area of most contention and dispute). Specific to VAMs, because teachers are not randomly assigned the students they teach, whether their students are more or less motivated, smart, knowledgeable, or capable can bias students’ test-based data, and teachers’ test-based data when aggregated. Court documents, although again not without counterarguments, suggested that VAM-based estimates are sometimes biased, especially when relatively homogeneous sets of students (e.g., English Language Learners (ELLs), gifted and special education students, free-or-reduced-lunch-eligible students) are non-randomly concentrated into schools, purposefully placed into classrooms, or both. Research suggests that this sometimes happens regardless of the sophistication of the statistical controls used to block said bias.
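
To illustrate the mechanism only (this is a stylized sketch, not any state’s or vendor’s actual VAM), the following simulation gives teachers small true effects and concentrates an unmeasured student advantage in some classrooms; a naive estimate built from mean classroom gains then absorbs that concentration as apparent teacher effectiveness. All parameter values are arbitrary and for illustration.

```python
import numpy as np

# Stylized sketch (not any state's or vendor's actual VAM): if an unmeasured
# student advantage is concentrated in some classrooms, a naive estimate of
# the "teacher effect" built from mean classroom gains absorbs that
# concentration as bias. All parameter values are arbitrary illustrations.
rng = np.random.default_rng(1)
n_teachers, class_size = 200, 25

true_teacher_effect = rng.normal(0, 0.10, n_teachers)   # what we hope to recover
classroom_advantage = rng.normal(0, 0.20, n_teachers)   # unmeasured, non-randomly assigned

estimates = np.empty(n_teachers)
for t in range(n_teachers):
    # Each student's gain = teacher effect + classroom-level advantage + noise.
    gains = (true_teacher_effect[t]
             + classroom_advantage[t]
             + rng.normal(0, 0.30, class_size))
    estimates[t] = gains.mean()  # naive "value-added" = mean classroom gain

bias = estimates - true_teacher_effect
print(f"Correlation of the bias with classroom advantage: "
      f"{np.corrcoef(bias, classroom_advantage)[0, 1]:.2f}")
```

Under these assumed values the correlation printed at the end is close to 1, meaning the estimate largely tracks who was placed in the classroom rather than what the teacher did; more sophisticated VAMs add statistical controls, but as noted above, research suggests such controls do not always remove this kind of bias.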

(3) The gaming mechanisms in play within teacher evaluation systems in which VAMs play a key role, or carry significant evaluative weight, were also of legal concern and dispute. That administrators sometimes inflate the observational ratings of the teachers whom they want to protect, thereby offsetting the weight the VAMs sometimes carry, was of note. That administrators also sometimes lower teachers’ ratings to better align them with their “more objective” VAM counterparts was also at issue. “So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their ‘subjective’ classroom ratings were in sync with the VAM’s ‘objective’ scores.” Both sets of behavior distort the validity (or “truthfulness”) of any teacher evaluation system and violate the same, aforementioned Standards for Educational and Psychological Testing, which call for VAM scores and observation ratings to be kept separate. One indicator should never be adjusted to offset or to fit the other.

(4) Transparency, or the lack thereof, was also a common issue across cases. Transparency, which can be defined as the extent to which something is accessible and readily capable of being understood, pertains to whether VAM-based estimates are accessible and make sense to those at the receiving ends. “Not only should [teachers] have access to [their VAM-based] information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.” In no case was this more legally pertinent than in Houston Federation of Teachers v. Houston Independent School District in Texas. Here, the presiding judge ruled that teachers did have “legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well.”

In the main article (here) we also discuss what states are now doing to (hopefully) improve upon their teacher evaluation systems, in terms of using multiple measures to evaluate teachers more holistically. We emphasize the (in)formative versus the summative and high-stakes functions of such systems, as well as the importance of allowing teachers to take ownership of such systems’ development and implementation. I will leave you all to read the full article (here) for these details.

In sum, though, when rethinking states’ teacher evaluation systems, especially given the new liberties afforded to states via the Every Student Succeeds Act (ESSA), educators, education leaders, policymakers, and the like would do well to look to the past for guidance on what not to do — and what to do better. These legal cases can certainly inform such efforts.

Reference: Close, K., & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan, 100(1), 15-19. Retrieved from http://www.kappanonline.org/learning-from-what-doesnt-work-in-teacher-evaluation/

A Win in New Jersey: Tests to Now Account for 5% of Teachers’ Evaluations

Phil Murphy, the Governor of New Jersey, is keeping his campaign promise to parents, students, and educators, according to a news article just posted by the New Jersey Education Association (NJEA; see here). As per the New Jersey Commissioner of Education, Dr. Lamont Repollet, who was a classroom teacher himself, Partnership for Assessment of Readiness for College and Careers (PARCC) test scores will now account for just 5% of a teacher’s evaluation throughout New Jersey, down from the 30% mandated for approximately five years prior by both Murphy’s and Repollet’s predecessors.

At last, the New Jersey Department of Education and the Murphy administration have “shown their respect for the research.” Because state law continues to require that standardized test scores play some role in teacher evaluation, a decrease to 5% is a victory, perhaps with a revocation of this law forthcoming.

“Today’s announcement is another step by Gov. Murphy toward keeping a campaign promise to rid New Jersey’s public schools of the scourge of high-stakes testing. While tens of thousands of families across the state have already refused to subject their children to PARCC, schools are still required to administer it and educators are still subject to its arbitrary effects on their evaluation. By dramatically lowering the stakes for the test, Murphy is making it possible for educators and students alike to focus more time and attention on real teaching and learning.” Indeed, “this is a victory of policy over politics, powered by parents and educators.”

Way to go New Jersey!

The Gates Foundation’s Expensive ($335 Million) Teacher Evaluation Missteps

The headline of an Education Week article released last week (click here) was that “[t]he Bill & Melinda Gates Foundation’s multi-million-dollar, multi-year effort aimed at making teachers more effective largely fell short of its goal to increase student achievement — including among low-income and minority students.”

An evaluation of the Gates Foundation’s Intensive Partnerships for Effective Teaching initiative, funded at $290 million as an extension of its Measures of Effective Teaching (MET) project funded at $45 million, was the focus of this article. The MET project was led by Thomas Kane (Professor of Education and Economics at Harvard, former leader of the MET project, and expert witness on the defendant’s side of the ongoing lawsuit supporting New Mexico’s MET project-esque statewide teacher evaluation system; see here and here), and both projects were primarily meant to hold teachers accountable for their students’ test scores via growth or value-added models (VAMs) and financial incentives. Both projects were also, if more tangentially, meant to improve staffing and professional development opportunities, improve the retention of teachers of “added value,” and ultimately lead to more effective teaching and student achievement, especially in low-income schools and schools with higher relative proportions of racial minority students. The six-year evaluation of focus in this Education Week article was conducted by the RAND Corporation and the American Institutes for Research, and the evaluation was also funded by the Gates Foundation (click here for the evaluation report; see below for the full citation of this study).

Their key finding was that Intensive Partnerships for Effective Teaching district/school sites (see them listed here) implemented new measures of teaching effectiveness and modified personnel policies, but they did not achieve their goals for students.

Evaluators also found (see also here):

  • The sites succeeded in implementing measures of effectiveness to evaluate teachers and made use of the measures in a range of human-resource decisions.
  • Every site adopted an observation rubric that established a common understanding of effective teaching. Sites devoted considerable time and effort to train and certify classroom observers and to observe teachers on a regular basis.
  • Every site implemented a composite measure of teacher effectiveness that included scores from direct classroom observations of teaching and a measure of growth in student achievement.
  • Every site used the composite measure to varying degrees to make decisions about human resource matters, including recruitment, hiring, placement, tenure, dismissal, professional development, and compensation.

Overall, the initiative did not achieve its goals for student achievement or graduation, especially for low-income and racial minority students. With minor exceptions, student achievement, access to effective teaching, and dropout rates were also not dramatically better than they were for similar sites that did not participate in the intensive initiative.

Their recommendations were as follows (see also here):

  • Reformers should not underestimate the resistance that could arise if changes to teacher-evaluation systems have major negative consequences.
  • A near-exclusive focus on teacher evaluation systems such as these might be insufficient to improve student outcomes. Many other factors might also need to be addressed, ranging from early childhood education, to students’ social and emotional competencies, to the school learning environment, to family support. Dramatic improvement in outcomes, particularly for low-income and racial minority students, will likely require attention to many of these factors as well.
  • In change efforts such as these, it is important to measure the extent to which each of the new policies and procedures is implemented in order to understand how the specific elements of the reform relate to outcomes.

Reference:

Stecher, B. M., Holtzman, D. J., Garet, M. S., Hamilton, L. S., Engberg, J., Steiner, E. D., Robyn, A., Baird, M. D., Gutierrez, I. A., Peet, E. D., de los Reyes, I. B., Fronberg, K., Weinberger, G., Hunter, G. P., & Chambers, J. (2018). Improving teaching effectiveness: Final report. The Intensive Partnerships for Effective Teaching through 2015–2016. Santa Monica, CA: The RAND Corporation. Retrieved from https://www.rand.org/pubs/research_reports/RR2242.html

States’ Teacher Evaluation Systems Moving in the “Right” Direction

Last week, a technical report that one of my current and one of my former doctoral students helped me to research and write was published by the University of Colorado Boulder’s National Education Policy Center (NEPC). While you can navigate to and read the press release here, as well as download and read the full report here, I thought I would summarize the report’s most interesting facts in this post for the readers/followers of this blog, who are likely more interested in the findings pertaining to states’ revised teacher evaluation systems post the federal passage of the Every Student Succeeds Act (ESSA).

In short, for purposes of this study we collected and analyzed the 51 (i.e., 50 states plus Washington DC) revised teacher evaluation plans submitted to the federal government post ESSA (i.e., in spring/summer of 2017). Again, as specific only to states’ teacher evaluation systems, we found three key things:

— First, the role of growth or value-added models (VAMs) for teacher evaluation purposes is declining. That is, the percentage of states using statewide growth models or VAMs has decreased from 42% to 30% since 2014. This is certainly a step in the “right,” defined here as research-informed, direction. See also Figure 1 of the full report (Close, Amrein-Beardsley, & Collins, 2018, p. 13).

— Second, because ESSA loosened federal control of teacher evaluation, many states no longer have a one-size-fits-all teacher evaluation system. This is allowing local districts to make more choices about models, implementation, execution, and the like, in the contexts of the schools and communities in which schools exist.

— Third, the rhetoric surrounding teacher evaluation has changed: language about holding teachers accountable for their value-added effects, or lack thereof, is much less evident in post-ESSA plans. Rather, new plans make note of providing data to teachers as a means of supporting professional development and improvement, essentially shifting the purpose of the evaluation system away from summative and toward formative use.

We also set forth recommendations for states in this report, as based on the evidence noted above (and presented in much more detail in the full report). The recommendations that also directly pertain to states’ (and districts’) teacher evaluation systems are that states/districts:

  1. Take advantage of decreased federal control by formulating revised assessment policies informed by the viewpoints of as many stakeholders as feasible. Such informed revision can help remedy earlier weaknesses, promote effective implementation, stress correct interpretation, and yield formative information.
  2. Ensure that teacher evaluation systems rely on a balanced system of multiple measures, without disproportionate weight assigned to any one measure as allegedly “superior” to any other. If measures contradict one another, however, output from all measures should be interpreted judiciously.
  3. Emphasize data useful as formative feedback in state systems, so that specific weaknesses in student learning can be identified, targeted and used to inform teachers’ professional development.
  4. Mandate ongoing research and evaluation of state assessment systems and ensure that adequate resources are provided to support [ongoing] evaluation [efforts].
  5. Set goals for reducing proficiency gaps and outline procedures for developing strategies to effectively reduce gaps once they have been identified.

We hope this information helps, especially the states and districts still looking to other states to see what is trending. While we note in the title of this blog post, as well as in the title of the full report, that all of this represents “some steps in the right direction,” there is still much work to be done. This is especially true in states like New Mexico (see my most recent post about the ongoing lawsuit in this state here) and in other states that have yet to give up on the false promises and limited research behind such educational policies, established almost one decade ago (e.g., Race to the Top; Duncan, 2009).

Citations:

Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems after the passage of the Every Student Succeeds Act: Some steps in the right direction. Boulder, CO: National Education Policy Center (NEPC). Retrieved from http://nepc.colorado.edu/publication/state-assessment

Duncan, A. (2009, July 4). The race to the top begins: Remarks by Secretary Arne Duncan. Retrieved from http://www.ed.gov/news/speeches/2009/07/07242009.html

Identifying Effective Teacher Preparation Programs Using VAMs Does Not Work

“A New Study [does not] Show Why It’s So Hard to Improve Teacher Preparation” Programs (TPPs). More specifically, it shows why using value-added models (VAMs) to evaluate TPPs, and then ideally improving them using the value-added data derived, is nearly if not entirely impossible.

This is precisely why yet another perhaps commonsensical but highly improbable federal policy move, one meant to imitate great teacher education programs and shut down ineffective ones based on their graduates’ students’ test-based performance over time (i.e., value-added), continues to fail.

Accordingly, in another study referenced in the article above, not yet peer-reviewed or published, titled “How Much Does Teacher Quality Vary Across Teacher Preparation Programs? Reanalyzing Estimates from [Six] States,” authors Paul T. von Hippel, from the University of Texas at Austin, and Laura Bellows, a PhD student at Duke University, investigated “whether the teacher quality differences between TPPs are large enough to make [such] an accountability system worthwhile” (p. 2). More specifically, using a meta-analysis technique, they reanalyzed the results of such evaluations in six of the approximately 16 states doing this (i.e., New York, Louisiana, Missouri, Washington, Texas, and Florida), each of whose evaluations had ultimately yielded a peer-reviewed publication, and they found “that teacher quality differences between most TPPs [were] negligible [at approximately] 0-0.04 standard deviations in student test scores” (p. 2).

They also highlight some of the statistical practices that exaggerated the “true” differences noted between TPPs in each of these studies, and in these types of studies in general, and consequently conclude that the “results of TPP evaluations in different states may vary not for substantive reasons, but because of the[se] methodological choices” (p. 5). Likewise, as is the case with value-added research in general, when “[f]aced with the same set of results, some authors may [also] believe they see intriguing differences between TPPs, while others may believe there is not much going on” (p. 6). That being said, I will not cover these statistical/technical issues further here. Do read the full study for these details, though, as they are also important.

Related, they found that in every state, the variation they statistically observed was greater among relatively small TPPs than among large ones. They suggest that this occurs due to estimation or statistical methods that may be inadequate for the task at hand. However, if this is true, it also means that because there is relatively less variation observed among large TPPs, it may be much more difficult “to single out a large TPP that is significantly better or worse than average” (p. 30). Accordingly, there are several ways to mistakenly single out a TPP as exceptional, or not, merely given TPP size. This is obviously problematic.
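
The size issue is, at bottom, a sampling-noise issue, and a minimal simulation (mine, not the authors’ reanalysis) can show it: give every TPP identical true quality, assume a spread of 0.15 standard deviations for individual teacher-level estimates (an arbitrary value chosen for illustration), and estimate each program’s mean from either 10 or 200 graduates.

```python
import numpy as np

# Illustrative sketch (mine, not the authors' reanalysis): every TPP is given
# the same true quality, and each program's mean graduate effect is estimated
# from noisy teacher-level scores. The teacher-level spread (0.15 SD) is an
# arbitrary assumption; the 0.04 cutoff echoes the paper's reported upper
# bound on true TPP differences.
rng = np.random.default_rng(2)
n_programs = 2_000
teacher_sd = 0.15
small_n, large_n = 10, 200  # graduates per small vs. large program

small_means = rng.normal(0, teacher_sd / np.sqrt(small_n), n_programs)
large_means = rng.normal(0, teacher_sd / np.sqrt(large_n), n_programs)

cutoff = 0.04
print(f"Small TPPs appearing to differ by more than {cutoff} SD: "
      f"{(np.abs(small_means) > cutoff).mean():.0%}")
print(f"Large TPPs appearing to differ by more than {cutoff} SD: "
      f"{(np.abs(large_means) > cutoff).mean():.1%}")
```

Under these assumptions a large share of the small programs drift past the 0.04 SD mark purely by chance, while the large programs almost never do, which is one way a small TPP can be mistakenly singled out as exceptional, or as failing, merely because of its size.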

Nonetheless, the authors also note that before they began this study, in Missouri, Texas, and Washington, “the differences between TPPs appeared small or negligible” (p. 29), but in Louisiana and New York “they appeared more substantial” (p. 29). After their (re)analyses, however, they found that the results from and across these six different states were “more congruent” (p. 29), as also noted prior (i.e., differences between TPPs between 0 and 0.04 SDs in student test scores).

“In short,” they conclude, “TPP evaluations may have some policy value, but the value is more modest than was originally envisioned. [Likewise, it] is probably not meaningful to rank all the TPPs in a state; the true differences between most TPPs are too small to matter, and the estimated differences consist mostly of noise” (p. 29). As per the article cited prior, they added that “[i]t appears that differences between [programs] are rarely detectable, and that if they could be detected they would usually be too small to support effective policy decisions.”

To see a similar study, which colleagues and I conducted in Arizona and which was recently published in Teaching Education, see “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added,’” summarized and referenced here.

Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM), developed in Tennessee by William L. Sanders, who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than in any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just one day ago (October 10, 2017) came the case’s final federal suit settlement. Click here to read the “Settlement and Full and Final Release Agreement.” But in short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of the EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state, or in this case district, shall deprive any person of life, liberty, or property, without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here p. 13). This was confirmed by one of the plaintiffs’ expert witnesses, who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in attorneys’ and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others using the EVAAS or another VAM for that matter (e.g., that is also “unverifiable”) do take note, at minimum.

The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Whether the systems have actually been reformed, though, is also up for debate, in that states are still using many of the same observational systems they were using prior (i.e., not the “binary checklists” exaggerated in the original as well as in this report, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output was combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft, an Assistant Professor of Education & Economics at Brown University, and Allison F. Gilmour, an Assistant Professor at Temple University (and former doctoral student at Vanderbilt University), revisited the original report. Just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently argued over whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.
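
For illustration only (the percentages below are hypothetical, not figures from Kraft and Gilmour or from the “Widget Effect” report), here is what that bell-curve expectation looks like next to the kind of top-heavy rating distribution these reports describe:

```python
# Hypothetical shares chosen only to illustrate the contrast; they are not
# data from Kraft and Gilmour (2017) or the "Widget Effect" report.
categories = ["Ineffective", "Developing", "Effective", "Highly Effective"]
observed = [0.01, 0.05, 0.60, 0.34]     # skewed toward the top
bell_curve = [0.16, 0.34, 0.34, 0.16]   # roughly symmetric, "normal" expectation

for cat, obs, bell in zip(categories, observed, bell_curve):
    print(f"{cat:<17} observed ~{obs:4.0%}   bell-curve expectation ~{bell:4.0%}")
```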

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see the figure in Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are also (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record about how this is also occurring in response to the “Widget Effect” report and the educational policies that followed).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of this study come from one district, their findings are similar to others evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and with using survey methods to check these systems, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also of issue in this regard, in terms of the 1% issue, are (a) how the time and effort it takes supervisors to assist/desist after rating teachers low sometimes makes low ratings seem not worth assigning; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.

References:

Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from http://files.eric.ed.gov/fulltext/ED533382.pdf

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect. Education Digest, 75(2), 31–35.

A “Next Generation” Vision for School, Teacher, and Student Accountability

Within a series of prior posts (see, for example, here and here), I have written about what the Every Student Succeeds Act (ESSA), passed in December of 2015, means for the U.S., or more specifically states’ school and teacher evaluation systems as per the federal government’s prior mandates requiring their use of growth and value-added models (VAMs).

Related, states were recently (this past May) required to submit to the federal government their revised school and teacher evaluation plans, post ESSA, given how they have changed, or not. I have a doctoral student currently gathering updated teacher evaluation data, state by state, and our preliminary findings indicate that “things” have not (yet) changed much post ESSA, at least at the teacher level of focus in this study and except in a few states (e.g., Connecticut, Oklahoma). Regardless, states still have the liberty to change what they do on both ends (i.e., school and teacher accountability).

Recently, a colleague shared with me a study titled “Next Generation Accountability: A Vision for School Improvement Under ESSA” that warrants coverage here, in hopes that states are still “out there” trying to reform their school and teacher evaluation systems, of course, for the better. While the document was drafted by folks coming from the aforementioned state of Oklahoma, who are also affiliated with the Learning Policy Institute, it is important to note that the document was also vetted by some “heavy hitters” in this line of research including, but not limited to, David C. Berliner (Arizona State University), Peter W. Cookson Jr. (American Institutes for Research (AIR)), Linda Darling-Hammond (Stanford University), and William A. Firestone (Rutgers University).

As per ESSA, states are to have increased opportunities “to develop innovative strategies for advancing equity, measuring success, and developing cycles of continuous improvement” while using “multiple measures to assess school and student performance” (p. iii). Likewise, the authors of this report state that “A broader spectrum of indicators, going well beyond a summary of annual test performance, seems necessary to account transparently for performance and assign responsibility for improvement.”

Here are some of their more specific recommendations that I found of value for blog followers:

  • The continued use of a single composite indicator to reduce and then sort teachers or schools by their overall effectiveness or performance (e.g., using teacher “effectiveness” categories or school A–F letter grades) is myopic, to say the least. This is because doing this (a) misses all that truly “matters,” including  multidimensional concepts and (non)cognitive competencies we want students to know and to be able to do, not captured by large-scale tests; and (b) inhibits the usefulness of what may be informative, stand-alone data (i.e., as taken from “multiple measures” individually) once these data are reduced and then collapsed so that they can be used for hierarchical categorizations and rankings. This also (c) very much trivializes the multiple causes of low achievement, also of importance and in much greater need of attention.
  • Accordingly, “Next Generation” accountability systems should include “a broad palette of functionally significant indicators to replace [such] single composite indicators [as this] will likely be regarded as informational rather than controlling, thereby motivating stakeholders to action” (p. ix). Stakeholders should be defined in the following terms…
  • “Next Generation” accountability systems should incorporate principles of “shared accountability,” whereby educational responsibility and accountability should be “distributed across system components and not foisted upon any one group of actors or stakeholders” (p. ix). “[E]xerting pressure on stakeholders who do not have direct control over [complex educational] elements is inappropriate and worse, harmful” (p. ix). Accordingly, the goal of “shared accountability” is to “create an accountability environment in which all participants [including governmental organizations] recognize their obligations and commitments in relation to each other” (p. ix) and their collective educational goals.
  • To facilitate this, “Next Generation” information systems should be designed and implemented in order to service the “dual reporting needs of compliance with federal mandates and the particular improvement needs of a state’s schools,” while also addressing “the different information needs of state, district, school site leadership, teachers, and parents” (p. ix). Data may include, at minimum, data on school resources, processes, outcomes, and other nuanced indicators, and this information must be made transparent and accessible in order for all types of data users to be responsive, holistically and individually (e.g., at school or classroom levels). The formative functions of such “Next Generation” informational systems, accordingly, take priority, at least for initial terms, until informational data can be used to, with priority, “identify and transform schools in catastrophic failure” (p. ix).
  • Related, all test- or other educational measurement-related components of states’ “Next Generation” statutes and policies should adhere to the Standards for Educational and Psychological Testing, and more specifically their definitions of reliability, validity, bias, fairness, and the like. Statutes and policies should also be written “in the least restrictive and prescriptive terms possible to allow for [continuous] corrective action and improvement” (p. x).
  • Finally, “Next Generation” accountability systems should adhere to the following five essentials: “(a) state, district, and school leaders must create a system-wide culture grounded in “learning to improve;” (b) learning to improve using [the aforementioned informational systems also] necessitates the [overall] development of [students’] strong pedagogical data-literacy skills; (c) resources in addition to funding—including time, access to expertise, and collaborative opportunities—should be prioritized for sustaining these ongoing improvement efforts; (d) there must be a coherent structure of state-level support for learning to improve, including the development of a strong Longitudinal Data System (LDS) infrastructure; and (e) educator labor market policy in some states may need adjustment to support the above elements” (p. x).

To read more, please access the full report here.

In sum, “Next Generation” accountability systems aim at “a loftier goal—universal college and career readiness—a goal that current accountability systems were not designed to achieve. To reach this higher level, next generation accountability must embrace a wider vision, distribute trustworthy performance information, and build support infrastructure, while eliciting the assent, support, and enthusiasm of citizens and educators” (p. vii).

As briefly noted prior, “a few states have been working to put more supportive, humane accountability systems in place, but others remain stuck in a compliance mindset that undermines their ability to design effective accountability systems” (p. vii). Perhaps (or perhaps likely) this is because, for the past decade or so, states invested so much time, effort, and money into “reforming” their prior teacher evaluation systems as formerly required by the federal government. This included investments in states’ growth models or VAMs, to which many/most states seem to be holding firm.

Hence, while it seems that the residual effects of the federal government’s former efforts still dominate states’ actions with regard to educational accountability, hopefully some states can at least begin to lead the way toward the educational reform…still desired…

The New York Times on “The Little Known Statistician” Who Passed

As many of you may recall, I wrote a post last March about the passing of William L. Sanders at age 74. Sanders developed the Education Value-Added Assessment System (EVAAS) — the value-added model (VAM) on which I have conducted most of my research (see, for example, here and here) and the VAM at the core of most of the teacher evaluation lawsuits in which I have been (or still am) engaged (see here, here, and here).

Over the weekend, though, The New York Times released a similar piece about Sanders’s passing, titled “The Little-Known Statistician Who Taught Us to Measure Teachers.” Because I had multiple colleagues and blog followers email me (or email me about) this article, I thought I would share it out with all of you, with some additional comments, of course, but also given the comments I already made in my prior post here.

First, I will start by saying that the title of this article is misleading in that what this “little-known” statistician contributed to the field of education was hardly “little” in terms of its size and impact. Rather, Sanders and his associates at SAS Institute Inc. greatly influenced the last decade of our nation’s educational policies, which were largely bent on high-stakes teacher accountability for educational reform. This occurred in large part due to Sanders’s (and others’) lobbying efforts when the federal government ultimately chose to incentivize and de facto require that all states hold their teachers accountable for their value-added, or lack thereof, while attaching high-stakes consequences (e.g., teacher termination) to teachers’ value-added estimates. This, of course, was to ensure educational reform. This occurred at the federal level, as we all likely know, primarily via Race to the Top and the No Child Left Behind waivers, which essentially forced states to adopt VAMs (or growth models) to also reform their teachers, and subsequently their schools, in order to continue to receive the federal funds upon which all states still rely.

It should be noted, though, that we as a nation have been relying upon similar high-stakes educational policies since the late 1970s (i.e., for now over 35 years); however, we have literally no research evidence that these high-stakes accountability policies have yielded any of their intended effects, as still perpetually conceptualized (see, for example, Nevada’s recent legislative ruling here) and as still advanced via large- and small-scale educational policies (e.g., we are still A Nation At Risk in terms of our global competitiveness). Yet, we continue to rely on the logic in support of such “carrot and stick” educational policies, even with this last decade’s teacher- versus student-level “spin.” We as a nation could really not be more ahistorical in terms of our educational policies in this regard.

Regardless, Sanders contributed to all of this at the federal level (that also trickled down to the state level) while also actively selling his VAM to state governments as well as local school districts (i.e., including the Houston Independent School District in which teacher plaintiffs just won a recent court ruling against the Sanders value-added system here), and Sanders did this using sets of (seriously) false marketing claims (e.g., purchasing and using the EVAAS will help “clear [a] path to achieving the US goal of leading the world in college completion by the year 2020”). To see two empirical articles about the claims made to sell Sanders’s EVAAS system, the research non-existent in support of each of the claims, and the realities of those at the receiving ends of this system (i.e., teachers) as per their experiences with each of the claims, see here and here.

Hence, to assert that what this “little known” statistician contributed to education was trivial or inconsequential is entirely false. Thankfully, with the passage of the Every Student Succeeds Act (ESSA), the federal government came around, in at least some ways. While not yet acknowledging how holding teachers accountable for their students’ test scores, while ideal, simply does not work (see the “Top Ten” reasons why this does not work here), at least the federal government has given back to the states the authority to devise, hopefully, some more research-informed educational policies in these regards (I know….).

Nonetheless, may he rest in peace (see also here), perhaps also knowing that his forever stance of “[making] no apologies for the fact that his methods were too complex for most of the teachers whose jobs depended on them to understand,” just landed his EVAAS in serious jeopardy in court in Houston (see here) given this stance was just ruled as contributing to the violation of teachers’ Fourteenth Amendment rights (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process [emphasis added]).

Also Last Thursday in Nevada: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

Last Thursday was a BIG day in terms of value-added models (VAMs). For those of you who missed it, US Magistrate Judge Smith ruled — in Houston Federation of Teachers (HFT) et al. v. Houston Independent School District (HISD) — that Houston teacher plaintiffs had legitimate claims that their EVAAS value-added estimates, as used (and abused) in HISD, violated their Fourteenth Amendment due process protections (i.e., no state, or in this case organization, shall deprive any person of life, liberty, or property, without due process). See post here: “A Big Victory in Court in Houston.” On the same day, “we” won another court case — Texas State Teachers Association v. Texas Education Agency — in which The Honorable Lora J. Livingston ruled that the state was to remove all student growth requirements from all state-level teacher evaluation systems. In other words, and in the name of increased local control, teachers throughout Texas will no longer be required to be evaluated using their students’ test scores. See prior post here: “Another Big Victory in Court in Texas.”

Also last Thursday (it was a BIG day, like I said), I testified, again, regarding a similar provision (hopefully) being passed in the state of Nevada. As per a prior post here, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” More specifically, as per AB320 the state would eliminate statewide, standardized test results as a mandated teacher evaluation measure but allow local assessments to account for 20% of a teacher’s total evaluation. AB320 is still in work session. It has the votes in committee and on the floor, thus far.

The National Council on Teacher Quality (NCTQ), unsurprisingly (see here and here), submitted testimony against AB320 that can be read here, and I submitted testimony (I think, quite effectively 😉 ) refuting their “research-based” testimony, also making explicit what I termed “The ‘Top Ten’ Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers” here. I have also pasted my submission below, in case anybody wants to forward/share any of my main points with others, especially others in similar positions looking to impact state or local educational policies in similar ways.

*****

May 4, 2017

Dear Assemblywoman Miller:

Re: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

While I understand that the National Council on Teacher Quality (NCTQ) submitted a letter expressing their opposition against Assembly Bill (AB) 320, it should be officially noted that, counter to that which the NCTQ wrote into its “research-based” letter,[1] the American Statistical Association (ASA), the American Educational Research Association (AERA), the National Academy of Education (NAE), and other large-scale, highly esteemed, professional educational and educational research/measurement associations disagree with the assertions the NCTQ put forth. Indeed, the NCTQ is not a nonpartisan research and policy organization as claimed, but one of only a small handful of partisan operations still in existence and still pushing forward what is increasingly becoming dismissed as America’s ideal teacher evaluation systems (e.g., announced today, Texas dropped their policy requirement that standardized test scores be used to evaluate teachers; Connecticut moved in the same policy direction last month).

Accordingly, these aforementioned and highly esteemed organizations have all released statements cautioning against the use of students’ large-scale, state-level standardized tests to evaluate teachers, primarily for the following research-based reasons, which I have limited to ten for obvious purposes:

  1. The ASA evidenced that teacher effects account for only 1%-14% of the variance in their students’ large-scale standardized test scores. This means that the other 86%-99% of the variance is due to factors outside of any teacher’s control (e.g., out-of-school and student-level variables). That teachers’ effects, as measured by large-scale standardized tests (and not including other teacher effects that cannot be measured using large-scale standardized tests), account for such little variance makes using them to evaluate teachers wholly irrational and unreasonable.
  2. Large-scale standardized tests have always been, and continue to be, developed to assess levels of student achievement, but not levels of growth in achievement over time, and definitely not growth in achievement that can be attributed back to a teacher (i.e., in terms of his/her effects). Put differently, these tests were never designed to estimate teachers’ effects; hence, using them in this regard is also psychometrically invalid and indefensible.
  3. Large-scale standardized tests, when used to evaluate teachers, often yield unreliable or inconsistent results. Teachers who should be (more or less) consistently effective are, accordingly, being classified in sometimes highly inconsistent ways year-to-year. As per the current research, a teacher evaluated using large-scale standardized test scores as effective one year has a 25% to 65% chance of being classified as ineffective the following year(s), and vice versa. This makes the probability of a teacher being identified as effective, as based on students’ large-scale test scores, no different than the flip of a coin (i.e., random).
  4. The estimates derived via teachers’ students’ large-scale standardized test scores are also invalid. Very limited evidence exists to support that teachers whose students yield high large-scale standardized test scores are also effective using at least one other correlated criterion (e.g., teacher observational scores, student satisfaction survey data), and vice versa. That these “multiple measures” don’t map onto each other, also given the error prevalent in all of the “multiple measures” being used, decreases the degree to which all measures, students’ test scores included, can yield valid inferences about teachers’ effects.
  5. Large-scale standardized tests are often biased when used to measure teachers’ purported effects over time. More specifically, teachers who teach inordinate proportions of English Language Learners (ELLs), special education students, students who receive free or reduced lunches, students retained in grade, and gifted students are often evaluated not as per their true effects but as per group effects that bias their estimates upwards or downwards given these mediating factors. The same thing holds true with teachers who teach English/language arts versus mathematics, in that mathematics teachers typically yield more positive test-based effects (which defies logic and commonsense).
  6. Related, large-scale standardized test-based estimates are fraught with measurement errors that negate their usefulness. These errors are caused by inordinate amounts of inaccurate and missing data that cannot be replaced or disregarded; student variables that cannot be statistically “controlled for;” current and prior teachers’ effects on the same tests that also prevent their use for making determinations about single teachers’ effects; and the like.
  7. Using large-scale standardized tests to evaluate teachers is unfair. Issues of fairness arise when these test-based indicators impact some teachers more than others, sometimes in consequential ways. Typically, as is true across the nation, only teachers of mathematics and English/language arts in certain grade levels (e.g., grades 3-8 and once in high school) can be measured or held accountable using students’ large-scale test scores. Across the nation, this leaves approximately 60-70% of teachers as test-based ineligible.
  8. Large-scale standardized test-based estimates are typically of very little formative or instructional value. Related, no research to date evidences that using tests for said purposes has improved teachers’ instruction or student achievement as a result. As per UCLA Professor Emeritus James Popham: The farther the test moves away from the classroom level (e.g., a test developed and used at the state level), the worse the test gets in terms of its instructional value and its potential to help promote change within teachers’ classrooms.
  9. Large-scale standardized test scores are being used inappropriately to make consequential decisions, although they do not have the reliability, validity, fairness, etc. to satisfy that for which they are increasingly being used, especially at the teacher level. This is becoming increasingly recognized by US court systems as well (e.g., in New York and New Mexico).
  10. The unintended consequences of such test score use for teacher evaluation purposes continuously go unrecognized (e.g., by states that pass such policies, and that states should acknowledge in advance of adopting such policies), given research has evidenced, for example, that teachers are choosing not to teach certain types of students whom they deem the most likely to hinder their potential positive effects. Principals are also stacking teachers’ classes to make sure certain teachers are more likely to demonstrate positive effects, or vice versa, to protect or penalize certain teachers, respectively. Teachers are leaving/refusing assignments to grades in which test-based estimates matter most, and some are leaving teaching altogether out of discontent or in professional protest.

[1] Note that the two studies the NCTQ used to substantiate their “research-based” letter would not support the claims included. For example, their statement that “According to the best-available research, teacher evaluation systems that assign between 33 and 50 percent of the available weight to student growth ‘achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than measured by a single test’” is false. First, the actual “best-available” research comes from over 10 years of peer-reviewed publications on this topic, including over 500 peer-reviewed articles. Second, what the authors of the Measures of Effective Teaching (MET) studies found was that the percentages to be assigned to student test scores were arbitrary at best, because their attempts to empirically determine such a percentage failed. The authors also made this fact explicit in their report; that is, they also noted that the percentages they suggested were not empirically supported.