Learning from What Doesn’t Work in Teacher Evaluation

One of my doctoral students — Kevin Close — and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share out with all of you, especially before the study is no longer open-access or free (see full study as currently available here). As the title indicates, the study is about how states, school districts, and schools can “Learn from What Doesn’t Work in Teacher Evaluation,” given an analysis that the two of us conducted of all documents pertaining to the four teacher evaluation and value-added model (VAM)-centered lawsuits in which I have been directly involved, and that I have also covered in this blog. These lawsuits include Lederman v. King in New York (see here), American Federation of Teachers et al. v. Public Education Department in New Mexico (see here), Houston Federation of Teachers v. Houston Independent School District in Texas (see here), and Trout v. Knox County Board of Education in Tennessee (see here).

Via this analysis we set out to comb through the legal documents to identify the strongest objections, as also recognized by the courts in these lawsuits, to VAMs as teacher measurement and accountability strategies. “The lessons to be learned from these cases are both important and timely” given that “[u]nder the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work.”

The most pertinent and also common issues as per these cases were as follows:

(1) Inconsistencies in teachers’ VAM-based estimates from one year to the next that are sometimes “wildly different.” Across these lawsuits, issues with reliability were very evident, in that teachers classified as “effective” one year were either theorized or demonstrated to have around a 25%-59% chance of being classified as “ineffective” the next year, or vice versa, with other permutations also possible. As per our profession’s Standards for Educational and Psychological Testing, reliability should instead be observed, whereby VAM-based estimates of teacher effectiveness are more or less consistent over time, from one year to the next, regardless of the types of students and perhaps subject areas that teachers teach.
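A minimal simulation can make this instability concrete. The sketch below is hypothetical (the effect size, noise level, and “top 20%” classification cut are my assumptions, not values from the lawsuits): it models each teacher’s VAM score as a stable true effect plus year-specific noise, which is enough to produce the kind of year-to-year reclassification the court documents describe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 100_000

# Assume a stable "true" teacher effect plus large year-specific noise;
# noise_sd is chosen so the year-to-year score correlation is roughly 0.4,
# in the neighborhood of figures reported in the VAM literature.
true_effect = rng.normal(0, 1, n_teachers)
noise_sd = 1.2  # corr(year1, year2) = 1 / (1 + 1.2**2) ≈ 0.41
year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
year2 = true_effect + rng.normal(0, noise_sd, n_teachers)

# Classify the top 20% of scores each year as "effective."
cut1 = np.quantile(year1, 0.8)
cut2 = np.quantile(year2, 0.8)
effective_y1 = year1 >= cut1
still_effective = (year2 >= cut2)[effective_y1].mean()

print(f"year-to-year score correlation: {np.corrcoef(year1, year2)[0, 1]:.2f}")
print(f"'effective' in year 1 but not in year 2: {1 - still_effective:.0%}")
```

With these assumed parameters, roughly half or more of the teachers rated “effective” in year one lose that rating in year two, even though no teacher’s true effectiveness changed at all.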

(2) Bias in teachers’ VAM-based estimates was also of note, whereby documents suggested or evidenced that biased estimates of teachers’ actual effects do indeed exist (although this area was also the one of most contention and dispute). Specific to VAMs, because teachers are not randomly assigned the students they teach, whether their students are more or less motivated, smart, knowledgeable, or capable can bias students’ test-based data, and teachers’ test-based data when aggregated. Court documents, although again not without counterarguments, suggested that VAM-based estimates are sometimes biased, especially when relatively homogeneous sets of students (i.e., English Language Learners (ELLs), gifted and special education students, free-or-reduced-lunch-eligible students) are non-randomly concentrated into schools, purposefully placed into classrooms, or both. Research suggests that this sometimes happens regardless of the sophistication of the statistical controls used to block said bias.

(3) The gaming mechanisms in play within teacher evaluation systems in which VAMs play a key role, or carry significant evaluative weight, were also of legal concern and dispute. That administrators sometimes inflate the observational ratings of the teachers whom they want to protect, thereby offsetting the weight the VAMs sometimes carry, was of note, as was the inverse: that administrators also sometimes lower teachers’ observational ratings to better align them with their “more objective” VAM counterparts. “So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their ‘subjective’ classroom ratings were in sync with the VAM’s ‘objective’ scores.” Both sets of behavior distort the validity (or “truthfulness”) of any teacher evaluation system and violate the same, aforementioned Standards for Educational and Psychological Testing, which call for VAM scores and observation ratings to be kept separate. One indicator should never be adjusted to offset or to fit the other.

(4) Transparency, or the lack thereof, was also a common issue across cases. Transparency, which can be defined as the extent to which something is accessible and readily capable of being understood, pertains to whether VAM-based estimates are accessible and make sense to those at the receiving ends. “Not only should [teachers] have access to [their VAM-based] information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.” In no case was this more legally pertinent than in Houston Federation of Teachers v. Houston Independent School District in Texas. Here, the presiding judge ruled that teachers did have “legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well.”

In the main article (here) we also discuss what states are now doing to (hopefully) improve upon their teacher evaluation systems in terms of using multiple measures to help to evaluate teachers more holistically. We emphasize the (in)formative versus the summative and high-stakes functions of such systems, and allowing teachers to take ownership over such systems in their development and implementation. I will leave you all to read the full article (here) for these details.

In sum, though, when rethinking states’ teacher evaluation systems, especially given the new liberties afforded to states via the Every Student Succeeds Act (ESSA), educators, education leaders, policymakers, and the like would do well to look to the past for guidance on what not to do — and what to do better. These legal cases can certainly inform such efforts.

Reference: Close, K., & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan, 100(1), 15-19. Retrieved from http://www.kappanonline.org/learning-from-what-doesnt-work-in-teacher-evaluation/

Effects of the Los Angeles Times Prior Publications of Teachers’ Value-Added Scores

In one of my older posts (here), I wrote about the Los Angeles Times and its controversial move to solicit Los Angeles Unified School District (LAUSD) students’ test scores via an open-records request, calculate LAUSD teachers’ value-added scores itself, and then publish thousands of LAUSD teachers’ value-added scores along with their “effectiveness” classifications (e.g., least effective, less effective, average, more effective, and most effective) on its Los Angeles Teacher Ratings website. The paper did this repeatedly, beginning in 2010, all the while despite the major research-based issues surrounding teachers’ value-added estimates (that hopefully followers of this blog know at least somewhat well). This is also of professional frustration for me, since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Song) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings against it, Felch and Song thanked me for my time and moved forward.

Just yesterday, the National Education Policy Center (NEPC) at the University of Colorado – Boulder published a newsletter in which the authors answer the following question, taken from the newsletter’s title: “Whatever Happened with the Los Angeles Times’ Decision to Publish Teachers’ Value-Added Scores?” Here is what they found, summarizing one article and two studies on the topic, although you can also certainly read the full report here.

  • Publishing the scores meant already high-achieving students were assigned to the classrooms of higher-rated teachers the next year, [found a study in the peer-reviewed Economics of Education Review]. That could be because affluent or well-connected parents were able to pull strings to get their kids assigned to those top teachers, or because those teachers pushed to teach the highest-scoring students. In other words, the academically rich got even richer — an unintended consequence of what could be considered a journalistic experiment in school reform.
  • The decision to publish the scores led to: (1) a temporary increase in teacher turnover; (2) improvements in value-added scores; and (3) no impact on local housing prices.
  • The Los Angeles Times’ analysis erroneously concluded that there was no relationship between value-added scores and levels of teacher education and experience.
  • It failed to account for the fact that teachers are non-randomly assigned to classes in ways that benefit some and disadvantage others.
  • It generated results that changed when Briggs and Domingue tweaked the underlying statistical model [i.e., yielding different value-estimates and classifications for the same teachers].
  • It produced “a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average).”

After the Los Angeles Times used a different approach in 2011, Catherine Durso found:

  • Class composition varied so much that comparisons of the value-added scores of two teachers were only valid if both teachers were assigned students with similar characteristics.
  • Annual fluctuations in results were so large that they led to widely varying conclusions from one year to the next for the same teacher.
  • There was strong evidence that results were often due to the teaching environment, not just the teacher.
  • Some teachers’ scores were based on very little data.

In sum, while “[t]he debate over publicizing value-added scores, so fierce in 2010, has since died down to a dull roar,” more states (e.g., New York and Virginia), organizations (e.g., Matt Barnum’s Chalkbeat), and news outlets (e.g., the Los Angeles Times, which has apparently discontinued this practice, although its website is still live) need to take a stand against, or prohibit, the publication of individual teachers’ value-added results from hereon out. As I noted to Jason Felch and Jason Song a long time ago, this IS simply bad practice.

New Texas Lawsuit: VAM-Based Estimates as Indicators of Teachers’ “Observable” Behaviors

Last week I spent a few days in Austin, one day during which I provided expert testimony for a new state-level lawsuit that has the potential to impact teachers throughout Texas. The lawsuit is Texas State Teachers Association (TSTA) v. Texas Education Agency (TEA), Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas.

The key issue is that, as per the state’s Texas Education Code (§ 21.351, see here) regarding teachers’ “Recommended Appraisal Process and Performance Criteria,” the Commissioner of Education must adopt “a recommended teacher appraisal process and criteria on which to appraise the performance of teachers. The criteria must be based on observable, job-related behavior, including: (1) teachers’ implementation of discipline management procedures; and (2) the performance of teachers’ students.” As for the latter, the State/TEA/Commissioner defined, as per its Texas Administrative Code (T.A.C., Chapter 15, Sub-Chapter AA, §150.1001, see here), that teacher-level value-added measures should be treated as one of four measures of “(2) the performance of teachers’ students;” that is, one of four measures recognized by the State/TEA/Commissioner as an “observable” indicator of a teacher’s “job-related” performance.

While currently no district throughout the State of Texas is required to use a value-added component to assess and evaluate its teachers, as noted, the value-added component is listed as one of four measures from which districts must choose at least one. All options listed in the category of “observable” indicators include: (A) student learning objectives (SLOs); (B) student portfolios; (C) pre- and post-test results on district-level assessments; and (D) value-added data based on student state assessment results.

Relatedly, the state has not recommended or required that any district, if the value-added option is selected, choose any particular value-added model (VAM) or calculation approach. Nor has it recommended or required that any district attach any consequences to these output; however, things like teacher contract renewal and the sharing of teachers’ prior appraisals with other districts to which teachers might be applying for new jobs are not discouraged. Again, though, the main issue here (and the key point to which I testified) was that the value-added component is listed as an “observable” and “job-related” teacher effectiveness indicator as per the state’s administrative code.

Accordingly, my (five-hour) testimony was primarily (albeit among many other things, including the “job-related” part) about how teacher-level value-added data do not yield anything that is observable in terms of teachers’ effects. Indeed, officially referring to these data in this way is entirely false, in that:

  • “We” cannot directly observe a teacher “adding” (or detracting) value (e.g., with our own eyes, like supervisors can when they conduct observations of teachers in practice);
  • Using students’ test scores to measure student growth upwards (or downwards) over time, as is very common practice using the (very often instructionally insensitive) state-level tests required by No Child Left Behind (NCLB), and doing this once per year in mathematics and reading/language arts (in measures that also capture prior and other current teachers’ effects, summer learning gains and decay, etc.), is not valid practice. That is, doing this has not been validated by the scholarly/testing community; and
  • Worse, and less valid, is to thereafter aggregate this student-level growth to the teacher level, attribute whatever “growth” (or the lack thereof) to something the teacher (and really only the teacher) did, and then call it directly “observable.” These data are far from assessing a teacher’s causal or “observable” impacts on his/her students’ learning and achievement over time. See, for example, the prior statement released about value-added data use in this regard by the American Statistical Association (ASA) here. In this statement it is written that: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation [emphasis added to note that this is variation explained which = correlational versus causal research] in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability [emphasis added] to teachers. This is not saying that teachers have little effect on students, but that variation among teachers [emphasis added] accounts for a small part of the variation [emphasis added] in [said test] scores. The majority of the variation in [said] test scores is [inversely, 86%-99% related] to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences.”
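To see why a small share of explained variance matters, consider the following sketch. The 10% figure is an assumption I have drawn from the middle of the ASA’s 1%-14% range; the rest of the simulation is hypothetical, and the point is arithmetic rather than causal:

```python
import numpy as np

rng = np.random.default_rng(1)
n_students = 200_000

# Hypothetical decomposition mirroring the ASA's figures: suppose teachers
# account for ~10% of the variance in student test scores, with the
# remaining ~90% driven by out-of-classroom factors.
teacher_effect = rng.normal(0, np.sqrt(0.10), n_students)
other_factors = rng.normal(0, np.sqrt(0.90), n_students)
scores = teacher_effect + other_factors

share = np.var(teacher_effect) / np.var(scores)
corr = np.corrcoef(teacher_effect, scores)[0, 1]
print(f"variance share attributable to teachers: {share:.0%}")
print(f"implied teacher-score correlation: {corr:.2f}")  # ≈ sqrt(0.10) ≈ 0.32
```

In other words, even granting teachers a 10% variance share, a student’s observed score is overwhelmingly a function of everything else, which is exactly the ASA’s caution about treating such scores as observable teacher effects.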

If any of you have anything to add to this, please do so in the comments section of this post. Otherwise, I will keep you posted on how this goes. My current understanding is that this one will be headed to court.

The Elephant in the Room – Fairness

While VAMs have many issues pertaining, fundamentally, to their levels of reliability, validity, and bias, they are also wholly unfair. This is one thing that is so very important but so rarely discussed when those external to VAM development and use debate, mainly, the benefits of VAMs.

Issues of “fairness” arise when a test, or more likely its summative (i.e., summary and sometimes consequential) and formative (i.e., informative) uses, impact some more than others in unfair yet often important ways. In terms of VAMs, the main issue here is that VAM-based estimates can be produced for only approximately 30-40% of all teachers across America’s public schools. The other 60-70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers), cannot altogether be evaluated or “held accountable” using teacher- or individual-level VAM data.

Put differently, what VAM-based data provide, in general, “are incredibly imprecise and inconsistent measures of supposed teacher effectiveness for only a tiny handful [30-40%] of teachers in a given school” (see reference here). But this is often entirely overlooked, not only in the debates surrounding VAM use (and abuse) but also in the discussions surrounding how many taxpayer-derived funds are still being used to support such a (purportedly) reformatory overhaul of America’s public education system. The fact of the matter is that VAMs directly impact only a minority of teachers, albeit a large one.

While some states and districts are rushing to adopt “multiple measures” to alleviate at least some of these issues with fairness, what state and district leaders don’t entirely understand is that this, too, is grossly misguided. Should any of these states and districts also tie serious consequences to such output (e.g., merit pay, performance plans, teacher termination, denial of tenure), or rather tie serious consequences to measures of growth derived via any of the varieties of “multiple assessments” that can be pulled from increasingly prevalent multiple-assessment “menus,” these states and districts are also setting themselves up for lawsuits…no joke! Starting with the basic psychometrics, and moving on to the (entire) lack of research in support of using more “off-the-shelf” tests to help alleviate issues with fairness, would be the (easy) approach to take in a court of law as, really, doing any of this is entirely wrong.

School-level value-added is also being used to address the issue of “fairness,” just less frequently now than before, given the aforementioned “multiple assessment” trends. Regardless, many states and districts continue to attribute a school-level aggregate score to teachers who do not primarily teach reading/language arts or mathematics in grades 3-8. That’s right: a majority of teachers receive a value-added score that is based on students whom they do not teach. This, too, calls for legal recourse, in that it has been a contested issue within all of the lawsuits in which I’ve thus far been engaged.

Miami-Dade, Florida’s Recent “Symbolic” and “Artificial” Teacher Evaluation Moves

Last spring, Eduardo Porter – writer of the Economic Scene column for The New York Times – wrote an excellent article, from an economics perspective, about that which is happening with our current obsession in educational policy with “Grading Teachers by the Test” (see also my prior post about this article here, although you should give the article a full read; it’s well worth it). In short, though, Porter wrote about what economists often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the great (e.g., high-stakes) value (mis)placed on any measure, and the distortion (i.e., artificial inflation or deflation, depending on the desired direction of the measure) that often-to-always comes about as a result.

Well, it’s happened again, this time in Miami-Dade, Florida, where the district’s teachers are saying it’s now “getting harder to get a good evaluation” (see the full article here). Apparently, teachers’ evaluation scores, from last year to this year, are being “dragged down,” primarily given their students’ performances on tests (including tests in subject areas, and of students, that they do not teach).

“In the weeks after teacher evaluations for the 2015-16 school year were distributed, Miami-Dade teachers flooded social media with questions and complaints. Teachers reported similar stories of being evaluated based on test scores in subjects they don’t teach and not being able to get a clear explanation from school administrators. In dozens of Facebook posts, they described feeling confused, frustrated and worried. Teachers risk losing their jobs if they get a series of low evaluations, and some stand to gain pay raises and a bonus of up to $10,000 if they get top marks.”

As per the figure included in the article, it is becoming more difficult for teachers to get “good” overall evaluation scores; more importantly, it is also becoming more common for districts to simply reset cut scores, artificially shifting teachers’ overall evaluation ratings.

“Miami-Dade [teachers] say the problems with the evaluation system have been exacerbated this year as the number of points needed to get the “highly effective” and “effective” ratings has continued to increase. While it took 85 points on a scale of 100 to be rated a highly effective teacher for the 2011-12 school year, for example, it now takes 90.4.”

This, as mentioned prior, is something called “artificial deflation,” whereby the quality of teaching is likely not changing nearly to the extent the data might illustrate. Rather, what is happening behind the scenes (e.g., the manipulation of cut scores) is giving the impression that the overall teacher evaluation system is in fact becoming better, more rigorous, aligned with policymakers’ “higher standards,” etc.
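The arithmetic of this deflation is straightforward to illustrate. In the hypothetical sketch below, the distribution of teachers’ overall scores is held fixed (the mean and spread are my assumptions); the only inputs taken from the article are the two cut scores, 85.0 for 2011-12 and 90.4 more recently:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical distribution of overall evaluation scores (0-100 scale);
# underlying teaching quality is identical in both comparisons below.
scores = np.clip(rng.normal(88, 5, 50_000), 0, 100)

share_old = (scores >= 85.0).mean()  # 2011-12 "highly effective" cut
share_new = (scores >= 90.4).mean()  # later cut reported for Miami-Dade

print(f"rated 'highly effective' at the 85.0 cut: {share_old:.0%}")
print(f"rated 'highly effective' at the 90.4 cut: {share_new:.0%}")
```

Identical scores, far fewer “highly effective” ratings: the apparent drop in teacher quality is produced entirely by moving the cut.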

This is also something we in the educational policy arena call “symbolic policy,” whereby nothing really instrumental or material is happening, and everything else is a facade, concealing a less pleasant or creditable reality that nothing, in fact, has changed.

Citation: Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. The Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html#storylink=cpy

VAM-Based Chaos Reigns in Florida, as Caused by State-Mandated Teacher Turnovers

The state of Florida is another one of our states to watch in that, even after the passage of the Every Student Succeeds Act (ESSA) last January, the state is still moving forward with using its VAMs for high-stakes accountability reform. See my most recent post about one district in Florida here, after the state ordered it to dismiss a good number of its teachers as per their low VAM scores when this school year started. After realizing this also caused or contributed to a teacher shortage in the district, the district scrambled to hire substitute teachers contracted through Kelly Services to replace them, after which it also put administrators back into the classroom to help alleviate the bad situation turned worse.

In a recent article published by The Ledger, teachers from the same Polk County School District (enrollment of roughly 100,000 students) added much-needed details and also voiced concerns about all of this; author Madison Fantozzi titled the piece “Polk teachers: We are more than value-added model scores.”

Throughout this piece Fantozzi covers the story of Elizabeth Keep, a teacher who was “plucked from” the middle school in which she had taught for 13 years, after which she was involuntarily placed at a district high school “just days before she was to report back to work.” She was one of 35 teachers moved from five schools deemed in need of reform as based on the schools’ value-added scores, although this was clearly done with no real concern or regard for the disruption it would cause these teachers, not to mention the students on the exiting and receiving ends. Likewise, and according to Keep, “If you asked students what they need, they wouldn’t say a teacher with a high VAM score…They need consistency and stability.” Apparently not. In Keep’s case, she “went from being the second most experienced person in [her middle school’s English] department…where she was department chair and oversaw the gifted program, to a [new, and never before] 10th- and 11th-grade English teacher” at the high school to which she was moved.

As background, when Polk County School District officials presented turnaround plans to the State Board of Education last July, school board members “were most critical of their inability to move ‘unsatisfactory’ teachers out of the schools and ‘effective’ teachers in.” One board member, for example, expressed finding it “horrendous” that the district was “held hostage” by the extent to which the local union was protecting teachers from being moved as per their value-added scores. Referring to the union, and its interference in this “reform,” he accused the unions of “shackling” the district and preventing its intended reforms. Note that the “effective” teachers who are to replace the “ineffective” ones can earn bonuses of up to $7,500 per year to help “turn around” the schools into which they enter.

Likewise, the state’s Commissioner of Education concurred, saying that she also “wanted ‘unsatisfactory’ teachers out and ‘highly effective’ teachers in,” again, with effectiveness being defined by teachers’ value-added, or lack thereof, even though (1) the teachers targeted had only one or two of the three years of value-added data required by state statute, and (2) the district’s senior director of assessment, accountability and evaluation noted that, in line with a plethora of other research findings, teachers evaluated using the state’s VAM have a 51% chance of changing their scores from one year to the next. This lack of reliability, as we know it, should outright prevent any such moves; without some level of stability, the valid inferences from which valid decisions are to be made simply cannot be drawn.

Nonetheless, state board of education members “unanimously… threatened to take [all of the district’s poor-performing schools] over or close them in 2017-18 if district officials [didn’t] do what [the Board said].” See also other tales of similar districts in the article available, again, here.

In Keep’s case, her “unsatisfactory” VAM score [the one that caused the district to move her], paired with her “highly effective” in-class observations by her administrators, brought her overall district evaluation to “effective”…[although she also notes that]…her VAM scores fluctuate because the state has created a moving target. Regardless, Keep was notified “five days before teachers were due back to their assigned schools Aug. 8 [after which she was] told she had to report to a new school with a different start time that [also] disrupted her 13-year routine and family that shares one car.”

VAM-based chaos reigns, especially in Florida.

New Mexico Lawsuit Update

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last fall, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), was unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here). Again, I’m serving as the expert witness on the side of the plaintiffs in this suit.

As you all likely also recall, in December of 2015, State District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state could proceed with “developing” and “improving” its teacher evaluation system, but the state was not to make any consequential decisions about New Mexico’s teachers using the data the state collected until the state (and/or others external to the state) could evidence to the court during another trial (initially set for April 2016, then postponed to October 2016, and likely to be postponed again) that the system is reliable, valid, fair, uniform, and the like (see prior post on this ruling here).

Well, many of you have (since these prior posts) written requesting updates regarding this lawsuit, and here is one as released jointly by the AFT and ATF. This accurately captures the current and ongoing situation:

September 23, 2016

Many of you will remember the classic Christmas program, Rudolph the Red-Nosed Reindeer, and how the terrible and menacing abominable snowman became harmless once his teeth were removed. This is how you should view the PED evaluation you recently received – a harmless abominable snowman.

The math is still wrong, the methodology deeply flawed, but the preliminary injunction achieved by our union, removed the teeth from PED’s evaluations, and so there is no reason to worry. As explained below, we will continue to fight these evaluations and will not rest until the PED institutes an evaluation system that is fair, meaningful, and consistently applied.

For all of you, who just got arbitrarily labeled by the PED in your summative evaluations, just remember, like the abominable snowman, these labels have no teeth, and your career is safe.

2014-2015 Evaluations

These evaluations, as you know, were the subject of our lawsuit filed in 2014. As a result of the Court’s order, the preliminary injunction, no negative consequences can result from your value-added scores.

In an effort to comply with the Court’s order, the PED announced in May it would be issuing new regulations.  This did not happen, and it did not happen in June, in July, in August, or in September. The bottom line is the PED still has not issued new regulations – though it still promises that those regulations are coming soon. So much for accountability.

The trial on the old regulations, scheduled for October 24, has been postponed based upon the PED’s repetitive assertions that new regulations would be issued.

In addition, we have repeatedly asked the PED to provide their data, which they finally did, however it lacked the codebook necessary to meaningfully interpret the data. We view this as yet another stall tactic.

Soon, we will petition the Court for an order compelling PED to produce the documents it promised months ago. Our union’s lawyers and expert witnesses will use this data to critically analyze the PED’s claims and methodology … again.

2015-2016 Evaluations

Even though the PED has condensed the number of ways an educator can be evaluated in a false attempt to satisfy the Courts, the fact remains that value-added models are based on false math and highly inaccurate data. In addition to the PED’s information we have requested for the 2014-2015 evaluations, we have requested all data associated with the current 2015-2016 evaluations.

If our experts determine the summative evaluation scores are again, “based on fundamentally, and irreparably, flawed methodology which is further plagued by consistent and appalling data errors,” we will also challenge the 2015-2016 evaluations. If the PED ever releases new regulations, and we determine that they violate statute (again), we will challenge those regulations, as well.

Rest assured our union will not stop challenging the PED until we are satisfied they have adopted an evaluation system that is respectful of students and educators. We will keep you updated as we learn more information, including the release of new regulations and the rescheduled trial date.

In Solidarity,

Stephanie Ly                                   Ellen Bernstein
President, AFT NM                         President, ATF

Houston Education and Civil Rights Summit (Friday, Oct. 14 to Saturday, Oct. 15)

For those of you interested, and perhaps close to Houston, Texas, I will be presenting my research on the Houston Independent School District’s (now hopefully past) use of the Education Value-Added Assessment System for more high-stakes, teacher-level consequences than anywhere else in the nation.

As you may recall from prior posts (see, for example, here, here, and here), seven teachers in the district, with the support of the Houston Federation of Teachers (HFT), are taking the district to federal court over how their value-added scores are/were being used, and allegedly abused. The case, Houston Federation of Teachers, et al. v. Houston ISD, is still ongoing; although, also as per a prior post, the school board just this past June, in a 3:3 split vote, elected to no longer pay an annual $680K to SAS Institute Inc. to calculate the district’s EVAAS estimates. Hence, by non-renewing this contract, it appears, at least for the time being, that the district is free from its prior history of using the EVAAS for high-stakes accountability. See also this post here for an analysis of Houston’s test scores post EVAAS implementation, as compared to other districts in the state of Texas. Apparently, all of the time and energy invested did not pay off for the district or, more importantly, its teachers and the students located within its boundaries.

Anyhow, those presenting at and attending the conference (the Houston Education and Civil Rights Summit, also sponsored and supported by United Opt Out National) will prioritize and focus on the “continued challenges of public education and the teaching profession [that] have only been exacerbated by past and current policies and practices,” as well as “the shifting landscape of public education and its impact on civil and human rights and civil society.”

As mentioned, I will be speaking alongside two featured speakers: Samuel Abrams, Director of the National Center for the Study of Privatization in Education (NCSPE) and an instructor at Columbia’s Teachers College, and Julian Vasquez Heilig, Professor of Educational Leadership and Policy Studies at California State University, Sacramento, and creator of the blog Cloaking Inequality. For more information about these and other speakers, many of whom are practitioners, see the conference website available, again, here.

When is it? Friday, October 14, 2016 at 4:00 PM through to Saturday, October 15, 2016 at 8:00 PM (CDT).

Where is it? Houston Hilton Post Oak – 2001 Post Oak Blvd, Houston, TX 77056

Hope to see you there!

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 30 years ago (i.e., in 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, as they still are now, trying to hold teachers accountable for their students’ learning as measured by states’ standardized test scores. Although the article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, its contents still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform schools that have been in disrepair, and (still) in need of major repair, for 20+ years.

Here are some of the points I found of most “value”:

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” Meanwhile, “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot, and likely never will, be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly, if not wholly, impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot, and likely never will, be “controlled” or “factored” in or out, either.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” As it was then, so it is today: “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

Center on the Future of American Education, on America’s “New and Improved” Teacher Evaluation Systems

Thomas Toch — education policy expert and research fellow at Georgetown University, and founding director of the Center on the Future of American Education — just released, as part of the Center, a report titled: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. He sent this to me for my thoughts, and I decided to summarize my thoughts here, with thanks and all due respect to the author, as clearly we are on different sides of the spectrum in terms of the literal “value” America’s new teacher evaluation systems might in fact “add” to the reformation of America’s public schools.

While quite a long and meaty report, here are some of the points I think are important to address publicly:

First, is it true that prior teacher evaluation systems (which were almost if not entirely based on teacher observations) yielded satisfactory ratings for “nearly every teacher”? Indeed, this is true. However, what we have seen since 2009, when states began to adopt what were then (and in many ways still are) viewed as America’s “new and improved” or “strengthened” teacher evaluation systems, is that for 70% of America’s teachers these systems are still based only on the observational indicators used before, because value-added estimates are calculable for only 30% of America’s teachers. As also noted in this report, it is for these 70% that “the superficial teacher [evaluation] practices of the past” (p. 2) will remain the same, although I disagree with this particular adjective, especially when these measures are used for formative purposes. While certainly imperfect, these are not simply “flimsy checklists” of no use or value. There is, indeed, much empirical research to support this assertion.

Likewise, these observational systems have not really changed since 2009, or 1999 for that matter (not that they could change all that much); they are not, however, in their “early stages” (p. 2) of development. This includes the Danielson Framework, explicitly propped up in this piece as an exemplar, despite the fact that it has been used across states and districts for decades and is still not functioning as intended, especially when summative decisions about teacher effectiveness are to be made (see, for example, here).

Hence, in some states and districts (sometimes via educational policy), principals or other observers are now being asked, or required, to deliberately assign teachers to lower observational categories, or to assign approximate proportions of teachers to each observational category used. Where the instrument might not distribute scores “as currently needed,” one way to game the system is to tell principals, for example, that they should allot only X% of teachers to each of the three-to-five categories most often used across such instruments. In fact, in an article one of my doctoral students and I have forthcoming, we have termed this, with empirical evidence, the “artificial deflation” of observational scores, as externally persuaded or required. Worse, this sometimes signals to the greater public that these “new and improved” teacher evaluation systems are being used for more discriminatory purposes (i.e., to actually differentiate between good and bad teachers on some sort of discriminating continuum), or that, indeed, there is a normal distribution of teachers as per their levels of effectiveness. While certainly there is some type of distribution, no evidence whatsoever exists to suggest that those who fall on the wrong side of the mean are, in fact, ineffective, and vice versa. It’s all relative, seriously, and unfortunately.

Relatedly, the goal here is really not to “thoughtfully compare teacher performances,” but to evaluate teachers against a set of criteria by which they can be evaluated and judged (i.e., whereby criterion-referenced inferences and decisions can be made). Conversely, comparing teachers in norm-referenced ways, however (socially) Darwinian and resonant with many, does not necessarily work, either. This is precisely what the authors of The Widget Effect report did, after which they argued for wide-scale system reform, so that increased discrimination among teachers, and reduced indifference on the part of evaluating principals, could occur. However, as also evidenced in the aforementioned article, the increasing presence of normal curves illustrating “new and improved” teacher observational distributions does not necessarily mean anything normal.

And is it true that these systems were not used often enough, or were used only “rarely,” to fire teachers in the past? Perhaps, although there are no data to support such assertions, either. This very argument was at the heart of the Vergara v. California case (see, for example, here): that teacher tenure laws, as well as laws protecting teachers’ due process rights, were keeping “grossly ineffective” teachers in the classroom. While no expert on either side could produce for the Court any hard numbers regarding how many “grossly ineffective” teachers were in fact being protected by such archaic rules and procedures, I would estimate (based on my years of experience as a teacher) that this number is much lower than many believe (and perhaps perpetuate) it to be. In fact, there was only one teacher I recall, who taught with me in a highly urban school, whom I would have classified as grossly ineffective, and who was also tenured. He was ultimately fired, and quite easily fired, as he also knew that he just didn’t have it.

Now to be clear, here, I do think that not just “grossly ineffective” but also simply “bad teachers” should be fired, but the indicators used to do this must yield valid inferences, as based on the evidence, as critically and appropriately consumed by the parties involved, after which valid and defensible decisions can and should be made. Whether one calls this due process in a proactive sense, or a wrongful termination suit in a retroactive sense, what matters most, though, is that the evidence supports the decision. This is the very issue at the heart of many of the lawsuits currently ongoing on this topic, as many of you know (see, for example, here).

Finally, where is the evidence, I ask, for many of the declarations included within and throughout this report? A review of the 133 endnotes, for example, reveals only a very small handful of references to the larger literature on this topic (see a very comprehensive list of this literature here, here, and here). This is also highly problematic in this piece, as only the usual suspects (e.g., Sandi Jacobs, Thomas Kane, Bill Sanders) are cited to support the assertions advanced.

Take, for example, the following declaration: “a large and growing body of state and local implementation studies, academic research, teacher surveys, and interviews with dozens of policymakers, experts, and educators all reveal a much more promising picture: The reforms have strengthened many school districts’ focus on instructional quality, created a foundation for making teaching a more attractive profession, and improved the prospects for student achievement” (p. 1). Where is the evidence? There is no such evidence, and none published in high-quality, scholarly, peer-reviewed journals of which I am aware. Again, publications released by the National Council on Teacher Quality (NCTQ) and from the Measures of Effective Teaching (MET) studies, still not externally reviewed and still considered internal technical reports with “issues,” don’t necessarily count. Accordingly, no such evidence has been introduced, by either side, in any court case in which I am involved, likely because such evidence does not exist, again, at any unbiased, vetted, and/or generalizable empirical level. While Thomas Kane has introduced some of his MET study findings in the cases in Houston and New Mexico, these might be some of the easiest pieces of evidence to target, accordingly, given these issues.

Otherwise, the only thing in this piece with which I agree, as that which I view as true and good given the research literature, is that teachers are now being observed more often, by more people, in more depth, and, in perhaps some cases, with better observational instruments. Accordingly, teachers, also as per the research, seem to appreciate and enjoy the additional and more frequent/useful feedback and discussions about their practice, as increasingly offered. This, I would agree, is something very positive that has come out of the nation’s policy-based focus on its “new and improved” teacher evaluation systems, again as largely required by the federal government, especially pre-Every Student Succeeds Act (ESSA).

Overall, and in sum, “the research reveals that comprehensive teacher-evaluation models are stronger than the sum of their parts.” Unfortunately, however, this is untrue: while such a holistic view is ideal, in educational measurement terms a system based on multiple measures is entirely limited by its weakest indicator. That weakest part is currently the value-added indicator (i.e., the one with the lowest levels of reliability and, relatedly, issues with validity and bias), the indicator at issue within this particular blog and the indicator of the most interest, as it is this indicator that has truly changed our overall approaches to the evaluation of America’s teachers. It has yet to deliver, however, especially if it is to be used for high-stakes, consequential decision-making purposes (e.g., incentives, getting rid of “bad apples”).
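For readers who want to see the measurement point rather than take it on faith, here is a quick toy simulation. The numbers are illustrative and entirely my own (they do not come from the report or from any particular VAM study): each teacher has a stable "true" effect, an observation score adds modest noise to it, a value-added score adds heavy noise, and the composite is an equally weighted average. Comparing two simulated years of scores for the same teachers shows the composite's year-to-year consistency being dragged well below that of the observation score alone.

```python
# Toy simulation (hypothetical noise levels, not empirical estimates):
# a composite of a reliable observation score and a noisy value-added
# score is less consistent year to year than the observation score alone.
import random

random.seed(0)
N = 20000  # simulated teachers

true_effects = [random.gauss(0, 1) for _ in range(N)]  # stable teacher effects

def year_scores():
    """One year of scores: each indicator = true effect + fresh noise."""
    obs, vam, comp = [], [], []
    for t in true_effects:
        o = t + random.gauss(0, 0.5)   # observation: modest noise
        v = t + random.gauss(0, 2.0)   # value-added: heavy noise
        obs.append(o)
        vam.append(v)
        comp.append((o + v) / 2)       # equally weighted composite
    return obs, vam, comp

def corr(x, y):
    """Pearson correlation, used here as year-to-year reliability."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

obs1, vam1, comp1 = year_scores()  # year 1
obs2, vam2, comp2 = year_scores()  # year 2: same teachers, new noise

print(f"observation year-to-year r: {corr(obs1, obs2):.2f}")   # ~0.80
print(f"value-added year-to-year r: {corr(vam1, vam2):.2f}")   # ~0.20
print(f"composite   year-to-year r: {corr(comp1, comp2):.2f}")  # ~0.49
```

Under these assumed noise levels, folding the value-added score into the composite cuts the year-to-year correlation roughly in half relative to using the observation score by itself, which is the sense in which a multiple-measure system is limited by its least reliable part.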

Feel free to read more here, as publicly available: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. See also the other claims regarding the benefits of said systems within (e.g., these systems as foundations for new teacher roles and responsibilities, smarter employment decisions, prioritizing classrooms, increased focus on improved standards). See also the recommendations offered, some of which I agree with on the observational side (e.g., ensuring that teachers receive multiple observations during a school year by multiple evaluators), and none of which I agree with on the value-added side (e.g., using at least two years of student achievement data in teacher evaluation ratings; researchers agree, rather, that three years of value-added data are needed, as based on at least four years of student-level test data). There are, of course, many other recommendations included. You all can be the judges of those.