Effects of the Los Angeles Times Prior Publications of Teachers’ Value-Added Scores

In one of my older posts (here), I wrote about the Los Angeles Times and its controversial move to solicit Los Angeles Unified School District (LAUSD) students’ test scores via an open-records request, calculate LAUSD teachers’ value-added scores themselves, and then publish thousands of LAUSD teachers’ value-added scores along with their “effectiveness” classifications (e.g., least effective, less effective, average, more effective, and most effective) on their Los Angeles Teacher Ratings website. They did this, repeatedly, since 2010, and they have done this all the while despite the major research-based issues surrounding teachers’ value-added estimates (that hopefully followers of this blog know at least somewhat well). This is also of professional frustration for me since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Strong) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings against it, Felch and Song thanked me for my time and moved forward.

Just yesterday, the National Education Policy Center (NEPC) at the University of Colorado – Boulder, published a Newsletter in which authors answer the following question, as taken from the Newsletter’s title: “Whatever Happened with the Los Angeles Times’ Decision to Publish Teachers’ Value-Added Scores?” Here is what they found, by summarizing one article and two studies on the topic, although you can also certainly read the full report here.

  • Publishing the scores meant already high-achieving students were assigned to the classrooms of higher-rated teachers the next year, [found a study in the peer-reviewed Economics of Education Review]. That could be because affluent or well-connected parents were able to pull strings to get their kids assigned to those top teachers, or because those teachers pushed to teach the highest-scoring students. In other words, the academically rich got even richer — an unintended consequence of what could be considered a journalistic experiment in school reform.
  • The decision to publish the scores led to: (1) A temporary increase in teacher turnover; (2) Improvements
    in value-added scores; and (3) No impact on local housing prices.
  • The Los Angeles Times’ analysis erroneously concluded that there was no relationship between value-added scores and levels of teacher education and experience.
  • It failed to account for the fact that teachers are non-randomly assigned to classes in ways that benefit some and disadvantage others.
  • It generated results that changed when Briggs and Domingue tweaked the underlying statistical model [i.e., yielding different value-estimates and classifications for the same teachers].
  • It produced “a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average).”

After the Los Angeles Times’ used a different approach in 2011, Catherine Durso found:

  • Class composition varied so much that comparisons of value-added scores of two teachers were only valid if both teachers are assigned students with similar characteristics.
  • Annual fluctuations in results were so large that they lead to widely varying conclusions from one year to the next for the same teacher.
  • There was strong evidence that results were often due to the teaching environment, not just the teacher.
  • Some teachers’ scores were based on very little data.

In sum, while “[t]he debate over publicizing value-added scores, so fierce in 2010, has since died down to
a dull roar,” more states (e.g., like in New York and Virginia), organizations (e.g., like Matt Barnum’s Chalbeat), and news outlets (e.g., the Los Angeles Times has apparently discontinued this practice, although their website is still live) need to take a stand against or prohibit the publications of individual teachers’ value-added results from hereon out. As I noted to Jason Felch and Jason Strong a long time ago, this IS simply bad practice.

Fired “Ineffective” Teacher Wins Battle with DC Public Schools

In November of 2013, I published a blog post about a “working paper” released by the National Bureau of Economic Research (NBER) and written by authors Thomas Dee – Economics and Educational Policy Professor at Stanford, and James Wyckoff – Economics and Educational Policy Professor at the University of Virginia. In the study titled “Incentives, Selection, and Teacher Performance: Evidence from IMPACT,” Dee and Wyckoff (2013) analyzed the controversial IMPACT educator evaluation system that was put into place in Washington DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee. In this paper, Dee and Wyckoff (2013) presented what they termed to be “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” via DC’s IMPACT initiative worked to improve the performance of high-performing teachers, and that dismissal threats worked to increase the voluntary attrition of low-performing teachers, as well as improve the performance of the students of the teachers who replaced them.

I critiqued this study in full (see both short and long versions of this critique here), ultimately asserting that the study had “fatal flaws” which compromised the exaggerated claims Dee and Wyckoff (2013) advanced. This past January (2017) they published another report, titled “Teacher Turnover, Teacher Quality, and Student Achievement in DCPS,” which was also (prematurely) released as a “working paper” by the same NBER. I also critiqued this study here).

Anyhow, a public interest story that should be of interest to followers of this blog was published two days ago in The Washington Post. The article, “I’ve Been a Hostage for Nine Years’: Fired Teacher Wins Battle with D.C. Schools,” details one fired, now 53-year old, veteran’s teachers last nine years after being one of nearly 1,000 educators fired during the tenure of Michelle Rhee. He was fired after district “leaders,” using the IMPACT system and a teacher evaluation system prior, deemed him “ineffective.” He “contested his dismissal, arguing that he was wrongly fired and that the city was punishing him for being a union activist and for publicly criticizing the school system.” That he made a significant salary at the time (2009) also likely had something to do with it in terms of cost-savings, although this is more peripherally discussed in this piece.

In short, “an arbitrator [just] ruled in favor of the fired teacher, a decision that could entitle him to hundreds of thousands of dollars in back pay and the opportunity to be a District teacher again” although, perhaps not surprisingly, he might not take them up on that  offer. As well, apparently this teacher “isn’t the only one fighting to get his job back. Other educators who were fired years ago and allege unjust dismissals [as per the IMPACT system] are waiting for their cases to be settled.” The school system can appeal this ruling.

The Gates Foundation’s Expensive ($335 Million) Teacher Evaluation Missteps

The header of an Education Week article released last week (click here) was that “[t]he Bill & Melinda Gates Foundation’s multi-million-dollar, multi-year effort aimed at making teachers more effective largely fell short of its goal to increase student achievement-including among low-income and minority students.”

An evaluation of Gates Foundation’s Intensive Partnerships for Effective Teaching initiative funded at $290 million, an extension of its Measures of Effective Teaching (MET) project funded at $45 million, was the focus of this article. The MET project was lead by Thomas Kane (Professor of Education and Economics at Harvard, former leader of the MET project, and expert witness on the defendant’s side of the ongoing lawsuit supporting New Mexico’s MET project-esque statewide teacher evaluation system; see here and here), and both projects were primarily meant to hold teachers accountable using their students test scores via growth or value-added models (VAMs) and financial incentives. Both projects were tangentially meant to improve staffing, professional development opportunities, improve the retention of the teachers of “added value,” and ultimately lead to more-effective teaching and student achievement, especially in low-income schools and schools with higher relative proportions of racial minority students. The six-year evaluation of focus in this Education Week article was conducted by the RAND Corporation and the American Institutes for Research, and the evaluation was also funded by the Gates Foundation (click here for the evaluation report, see below for the full citation of this study).

Their key finding was that Intensive Partnerships for Effective Teaching district/school sites (see them listed here) implemented new measures of teaching effectiveness and modified personnel policies, but they did not achieve their goals for students.

Evaluators also found (see also here):

  • The sites succeeded in implementing measures of effectiveness to evaluate teachers and made use of the measures in a range of human-resource decisions.
  • Every site adopted an observation rubric that established a common understanding of effective teaching. Sites devoted considerable time and effort to train and certify classroom observers and to observe teachers on a regular basis.
  • Every site implemented a composite measure of teacher effectiveness that included scores from direct classroom observations of teaching and a measure of growth in student achievement.
  • Every site used the composite measure to varying degrees to make decisions about human resource matters, including recruitment, hiring, placement, tenure, dismissal, professional development, and compensation.

Overall, the initiative did not achieve its goals for student achievement or graduation, especially for low-income and racial minority students. With minor exceptions, student achievement, access to effective teaching, and dropout rates were also not dramatically better than they were for similar sites that did not participate in the intensive initiative.

Their recommendations were as follows (see also here):

  • Reformers should not underestimate the resistance that could arise if changes to teacher-evaluation systems have major negative consequences.
  • A near-exclusive focus on teacher evaluation systems such as these might be insufficient to improve student outcomes. Many other factors might also need to be addressed, ranging from early childhood education, to students’ social and emotional competencies, to the school learning environment, to family support. Dramatic improvement in outcomes, particularly for low-income and racial minority students, will likely require attention to many of these factors as well.
  • In change efforts such as these, it is important to measure the extent to which each of the new policies and procedures is implemented in order to understand how the specific elements of the reform relate to outcomes.

Reference:

Stecher, B. M., Holtzman, D. J., Garet, M. S., Hamilton, L. S., Engberg, J., Steiner, E. D., Robyn, A., Baird, M. D., Gutierrez, I. A., Peet, E. D., de los Reyes, I. B., Fronberg, K., Weinberger, G., Hunter, G. P., & Chambers, J. (2018). Improving teaching effectiveness: Final report. The Intensive Partnerships for Effective Teaching through 2015–2016. Santa Monica, CA: The RAND Corporation. Retrieved from https://www.rand.org/pubs/research_reports/RR2242.html

New Mexico’s Motion for Summary Judgment, Following Houston’s Precedent-Setting Ruling

Recall that in New Mexico, just over two years ago, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence was to be presented by the state was repeatedly postponed since, yet with teacher-level consequences prohibited all the while. See more information about this ruling here.

Recall as well that in Houston, just this past May, that a district judge ruled that Houston Independent School District (HISD) teachers’ who had VAM scores (as based on the Education Value-Added Assessment System (EVAAS)) had legitimate claims regarding how EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). More specifically, in what turned out to be a huge and unprecedented victory, the judge ruled that because HISD teachers “ha[d] no meaningful way to ensure correct calculation of their EVAAS scores,” they were, as a result, “unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.” This ruling ultimately led the district to end the use of the EVAAS for teacher termination throughout Houston. See more information about this ruling here.

Just this past week, New Mexico charged that the Houston ruling regarding Houston teachers’ Fourteenth Amendment due process protections also applies to teachers throughout the state of New Mexico.

As per an article titled “Motion For Summary Judgment Filed In New Mexico Teacher Evaluation Lawsuit,” the American Federation of Teachers and Albuquerque Teachers Federation filed a “motion for summary judgment in the litigation in our continuing effort to make teacher evaluations beneficial and accurate in New Mexico.” They, too, are “seeking a determination that the [state’s] failure to provide teachers with adequate information about the calculation of their VAM scores violated their procedural due process rights.”

“The evidence demonstrates that neither school administrators nor educators have been provided with sufficient information to replicate the [New Mexico] VAM score calculations used as a basis for teacher evaluations. The VAM algorithm is complex, and the general overview provided in the NMTeach Technical Guide is not enough to pass constitutional muster. During previous hearings, educators testified they do not receive an explanation at the time they receive their annual evaluation, and teachers have been subjected to performance growth plans based on low VAM scores, without being given any guidance or explanation as to how to raise that score on future evaluations. Thus, not only do educators not understand the algorithm used to derive the VAM score that is now part of the basis for their overall evaluation rating, but school administrators within the districts do not have sufficient information on how the score is derived in order to replicate it or to provide professional development, whether as part of a disciplinary scenario or otherwise, to assist teachers in raising their VAM score.”

For more information about this update, please click here.

Bias in VAMs, According to Validity Expert Michael T. Kane

During the still ongoing, value-added lawsuit in New Mexico (see my most recent update about this case here), I was honored to testify as the expert witness on behalf of the plaintiffs (see, for example, here). I was also fortunate to witness the testimony of the expert witness who testified on behalf of the defendants – Thomas Kane, Economics Professor at Harvard and former Director of the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. During Kane’s testimony, one of the highlights (i.e., for the plaintiffs), or rather the low-lights (i.e., for him and the defendants), in my opinion, was when one of the plaintiff’s attorney’s questioned Kane, on the stand, about his expertise in the area of validity. In sum, Kane responded that he defined himself as an “expert” in the area, having also been trained by some of the best. Consequently, the plaintiff’s attorney’s questioned Kane about different types of validity evidences (e.g., construct, content, criterion), and Kane could not answer those questions. The only form of validity evidence with which he was familiar, and which he could clearly define, was evidence related to predictive validity. This hardly made him the expert he proclaimed himself to be minutes prior.

Let’s not mince words, though, or in this case names.

A real expert in validity (and validity theory) is another Kane, who goes by the full name of Michael T. Kane. This Kane is The Samuel J. Messick Chair in Test Validity at the Educational Testing Service (ETS); this Kane wrote one of the best, most contemporary, and currently most foundational papers on validity (see here); and this Kane just released an ETS-sponsored paper on Measurement Error and Bias in Value-Added Models certainly of interest here. I summarize this piece below (see the PDF of this report here).

In this paper Kane examines “the origins of [value-added model (VAM)-based] bias and its potential impact” and indicates that bias that is observed “is an increasing linear function of the student’s prior achievement and can be quite large (e.g., half a true-score standard deviation) for very low-scoring and high-scoring students [i.e., students in the extremes of any normal distribution]” (p. 1). Hence, Kane argues, “[t]o the extent that students with relatively low or high prior scores are clustered in particular classes and schools, the student-level bias will tend to generate bias in VAM estimates of teacher and school effects” (p. 1; see also prior posts about this type of bias here, here, and here; see also Haertel (2013) cited below). Kane concludes that “[a]djusting for this bias is possible, but it requires estimates of generalizability (or reliability) coefficients that are more accurate and precise than those that are generally available for standardized achievement tests” (p. 1; see also prior posts about issues with reliability across VAMs here, here, and here).

Kane’s more specific points of note:

  • To accurately calculate teachers’/schools’ value-added, “current and prior scores have to be on the same scale (or on vertically aligned scales) for the differences to make sense. Furthermore, the scale has to be an interval scale in the sense that a difference of a certain number of points has, at least approximately, the same meaning along the scale, so that it makes sense to compare gain scores from different parts of the scale…some uncertainty about scale characteristics is not a problem for many applications of vertical scaling, but it is a serious problem if the proposed use of the scores (e.g., educational accountability based on growth scores) demands that the vertical scale be demonstrably equal interval” (p. 1).
  • Likewise, while some approaches can be used to minimize the need for such scales (e.g., residual gain scores, covariate-adjustment models, and ordinary least squares (OLS) regression approaches which are of specific interest in this piece), “it is still necessary to assume [emphasis added] that a difference of a certain number of points has more or less the same meaning along the score scale for the current test scores” (p. 2).
  • Related, “such adjustments can [still] be biased to the extent that the predicted score does not include all factors that may have an impact on student performance. Bias can also result from errors of measurement in the prior scores included in the prediction equation…[and this can be]…substantial” (p. 2).
  • Accordingly, “gains for students with high true scores on the prior year’s test will be overestimated, and the gains for students with low true scores in the prior year will be underestimated. To the extent that students with relatively low and high true scores tend to be clustered in particular classes and schools, the student-level bias will generate bias in estimates of teacher and school effects” (p. 2).
  • Hence, if not corrected, this source of bias could have a substantial negative impact on estimated VAM scores for teachers and schools that serve students with low prior true scores and could have a substantial positive impact for teachers and schools that serve mainly high-performing students” (p. 2).
  • Put differently, random errors in students’ prior scores may “tend to add a positive bias to the residual gain scores for students with prior scores above the population mean, and they [may] tend to add a negative bias to the residual gain scores for students with prior scores below the mean. Th[is] bias is associated with the well-known phenomenon of regression to the mean” (p. 10).
  • Although, at least this latter claim — that students with relatively high true scores in the prior year could substantially and positively impact their teachers’/schools value-added estimates — does run somewhat contradictory to other claims as evidenced in the literature in terms of the extent to which ceiling effects substantially and negatively impact their teachers’/schools value-added estimates (see, for example, Point #7 as per the ongoing lawsuit in Houston here, and see also Florida teacher Luke Flint’s “Story” here).
  • In sum, and as should be a familiar conclusion to followers of this blog, “[g]iven that the results of VAMs may be used for high-stakes decisions about teachers and schools in the context of accountability programs,…any substantial source of bias would be a matter of great concern” (p. 2).

Citation: Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton, NJ: Educational Testing Service (ETS) Research Report Series. doi:10.1002/ets2.12153 Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full

See also Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

A North Carolina Teacher’s Guest Post on His/Her EVAAS Scores

A teacher from the state of North Carolina recently emailed me for my advice regarding how to help him/her read and understand his/her recently received Education Value-Added Assessment System (EVAAS) value added scores. You likely recall that the EVAAS is the model I cover most on this blog, also in that this is the system I have researched the most, as well as the proprietary system adopted by multiple states (e.g., Ohio, North Carolina, and South Carolina) and districts across the country for which taxpayers continue to pay big $. Of late, this is also the value-added model (VAM) of sole interest in the recent lawsuit that teachers won in Houston (see here).

You might also recall that the EVAAS is the system developed by the now late William Sanders (see here), who ultimately sold it to SAS Institute Inc. that now holds all rights to the VAM (see also prior posts about the EVAAS here, here, here, here, here, and here). It is also important to note, because this teacher teaches in North Carolina where SAS Institute Inc. is located and where its CEO James Goodnight is considered the richest man in the state, that as a major Grand Old Party (GOP) donor “he” helps to set all of of the state’s education policy as the state is also dominated by Republicans. All of this also means that it is unlikely EVAAS will go anywhere unless there is honest and open dialogue about the shortcomings of the data.

Hence, the attempt here is to begin at least some honest and open dialogue herein. Accordingly, here is what this teacher wrote in response to my request that (s)he write a guest post:

***

SAS Institute Inc. claims that the EVAAS enables teachers to “modify curriculum, student support and instructional strategies to address the needs of all students.”  My goal this year is to see whether these claims are actually possible or true. I’d like to dig deep into the data made available to me — for which my state pays over $3.6 million per year — in an effort to see what these data say about my instruction, accordingly.

For starters, here is what my EVAAS-based growth looks like over the past three years:

As you can see, three years ago I met my expected growth, but my growth measure was slightly below zero. The year after that I knocked it out of the park. This past year I was right in the middle of my prior two years of results. Notice the volatility [aka an issue with VAM-based reliability, or consistency, or a lack thereof; see, for example, here].

Notwithstanding, SAS Institute Inc. makes the following recommendations in terms of how I should approach my data:

Reflecting on Your Teaching Practice: Learn to use your Teacher reports to reflect on the effectiveness of your instructional delivery.

The Teacher Value Added report displays value-added data across multiple years for the same subject and grade or course. As you review the report, you’ll want to ask these questions:

  • Looking at the Growth Index for the most recent year, were you effective at helping students to meet or exceed the Growth Standard?
  • If you have multiple years of data, are the Growth Index values consistent across years? Is there a positive or negative trend?
  • If there is a trend, what factors might have contributed to that trend?
  • Based on this information, what strategies and instructional practices will you replicate in the current school year? What strategies and instructional practices will you change or refine to increase your success in helping students make academic growth?

Yet my growth index values are not consistent across years, as also noted above. Rather, my “trends” are baffling to me.  When I compare those three instructional years in my mind, nothing stands out to me in terms of differences in instructional strategies that would explain the fluctuations in growth measures, either.

So let’s take a closer look at my data for last year (i.e., 2016-2017).  I teach 7th grade English/language arts (ELA), so my numbers are based on my students reading grade 7 scores in the table below.

What jumps out for me here is the contradiction in “my” data for achievement Levels 3 and 4 (achievement levels start at Level 1 and top out at Level 5, whereas levels 3 and 4 are considered proficient/middle of the road).  There is moderate evidence that my grade 7 students who scored a Level 4 on the state reading test exceeded the Growth Standard.  But there is also moderate evidence that my same grade 7 students who scored Level 3 did not meet the Growth Standard.  At the same time, the number of students I had demonstrating proficiency on the same reading test (by scoring at least a 3) increased from 71% in 2015-2016 (when I exceeded expected growth) to 76% in school year 2016-2017 (when my growth declined significantly). This makes no sense, right?

Hence, and after considering my data above, the question I’m left with is actually really important:  Are the instructional strategies I’m using for my students whose achievement levels are in the middle working, or are they not?

I’d love to hear from other teachers on their interpretations of these data.  A tool that costs taxpayers this much money and impacts teacher evaluations in so many states should live up to its claims of being useful for informing our teaching.

The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft — an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here), and another of his co-authors Matthew P. Steinberg — an Assistant Professor of Education Policy at the University of Pennsylvania — just published another article in this same journal on “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and freely accessible, at least for now, article here; see also its original and what should be enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matter, matter differently for different types of teachers within and across school districts and states, and matter also in that so often these weights are arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the consequences attached to these measures, also given the arbitrary and political constructions they represent, can be both professionally and personally, career and life changing, respectively. How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Thereafter, some states over others are then celebrated (e.g., by the National Center of Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously than others when, as evidenced herein, this is (unfortunately) more due to manipulation than true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It’s not necessarily that they vary in reality, but the manipulation of the weights on the back end, rather, cause such variation and then lead to, quite literally, such delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also causes a related but also important issue with fairness, whereas equally effective teachers, just by being VAM eligible, may be five-or-so times likely (e.g., in states like New Mexico) of being rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).

Finally, it should also be noted as an important caveat here, that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences. (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Related, please recall that via the MET studies one of the researchers’ goals was to determine which weights per multiple measure were empirically defensible. MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might say that this weight here or that weight there is empirically defensible should be asked for the evidence in support.

Citations:

Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Educational Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–
396. doi:10.3102/0013189X17726752 Retrieved from http://journals.sagepub.com/doi/abs/10.3102/0013189X17726752

Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just one day ago (October 10, 2017) came the case’s final federal suit settlement. Click here to read the “Settlement and Full and Final Release Agreement.” But in short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case district shall deprive any person of life, liberty, or property, without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here p. 13). This was confirmed by the one of the plaintiffs’ expert witness who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable. (see here p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in its attorney and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others using the EVAAS or another VAM for that matter (e.g., that is also “unverifiable”) do take note, at minimum.

“Virginia SGP” Overruled

You might recall from a post I released approximately 1.5 years ago a story about how a person who self-identifies as “Virginia SGP,” who is also now known as Brian Davison — a parent of two public school students in the affluent Loudoun, Virginia area (hereafter referred to as Virginia SGP), sued the state of Virginia in an attempt to force the release of teachers’ student growth percentile (SGP) data for all teachers across the state.

More specifically, Virginia SGP “pressed for the data’s release because he thinks parents have a right to know how their children’s teachers are performing, information about public employees that exists but has so far been hidden. He also want[ed] to expose what he sa[id was] Virginia’s broken promise to begin [to use] the data to evaluate how effective the state’s teachers are.” The “teacher data should be out there,” especially if taxpayers are paying for it.

In January of 2016, a Richmond, Virginia judge ruled in Virginia SGP’s favor. The following April, a Richmond Circuit Court judge ruled that the Virginia Department of Education was to also release Loudoun County Public Schools’ SGP scores by school and by teacher, including teachers’ identifying information. Accordingly, the judge noted that the department of education and the Loudoun school system failed to “meet the burden of proof to establish an exemption’ under Virginia’s Freedom of Information Act [FOIA]” preventing the release of teachers’ identifiable information (i.e., beyond teachers’ SGP data). The court also ordered VDOE to pay Davison $35,000 to cover his attorney fees and other costs.

As per an article published last week, the Virginia Supreme Court overruled this former ruling, noting that the department of education did not have to provide teachers’ identifiable information along with teachers’ SGP data, after all.

See more details in the actual article here, but ultimately the Virginia Supreme Court concluded that the Richmond Circuit Court “erred in ordering the production of these documents containing teachers’ identifiable information.” The court added that “it was [an] error for the circuit court to order that the School Board share in [Virginia SGP’s] attorney’s fees and costs,” pushing that decision (i.e., the decision regarding how much to pay, if anything at all, in legal fees) back down to the circuit court.

Virginia SGP plans to ask for a rehearing of this ruling. See also his comments on this ruling here.

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest, and I thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written into Audrey’s post prior) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful the points are colored based on their quintile scores as per simulated teachers’ value-added divided into the lowest 20%, next 20%, etc.

In this figure you can see by looking at the blue “least squares line” that, “on average,” as a simulated teacher’s value-added estimate increases the average of a teacher’s observational estimate increases. However, there is a lot of variability (or scatter points) around the (scatterplot) line. Given this variability, we can make statements about averages, such as “on average” teachers in the top 20% for VAM scores will likely have on average higher observed observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in VAM score. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R squared). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to say statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) of VAM have a very large spread in their scores on the observational score, with 95% of the scores in the top quintile being in between the 7th and 98th percentiles for their observational scores. So, here if we observe a VAM for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than their observational score is likely to be in the top 90%. Similarly, if we observe a VAM in the bottom 20%, we cannot say much more than their observational score is likely to be somewhere in the bottom 90%. That’s not saying a lot, in terms of precision, but also in terms of practice.

The second scatterplot I ran to test how bias that only impacts a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I simulated a situation where, again, there are two values present in a population of teachers: a teacher’s value-added and a teacher’s observational score. Then I insert a group of teachers (as Audrey described) who represent 20% of a population and teach a disproportionate number of students who come from relatively lower socioeconomic, high racial minority, etc. backgrounds, and I assume this group is measured with negative bias on both indicators and this group has a moderate correlation between indicators of r = 0.50. The other 80% of the population is assumed to be uncorrelated. Note: for this demonstration I assume that this group includes 20% of teachers from the aforementioned population, these teachers I assume to be measured with negative bias (by one standard deviation on average) on both measures, and, again, I set their correlation at r = 0.50 with the other 80% of teachers at a correlation of zero.

What you can see is that if there is bias in this correlation that impacts only a certain group on the two instrument indicators; hence, it is possible that this bias can result in an observed correlation overall. In other words, a strong correlation noted in just one group of teachers (i.e., teachers scoring the lowest on their value-added and observational indicators in this case) can be relatively stronger than the “weak” correlation observed on average or overall.

Another, possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quantiles on VAM have a different linear relationship with the observational score. For example, in the plot there is not a constant slope, but teachers who are in the first quintile on VAM I assume to have a correlation of r = 0.50 with observational scores, the second quintile I assume to have a correlation of r = 0.20, and the other quintiles I assume to be uncorrelated. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e. a very small chance that a correlation of this size would be observed by random chance alone if the true correlation was zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in the mean observational score depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not informative as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson Correlations (the standard correlation coefficient) measure the overall strength of  linear relationships between X and Y, but if X and Y have a non-linear relationship (like as illustrated in the above), this statistic can be very misleading.

Note also that for all of these simulations very small p-values are observed (e.g., p-values <0.0000001 which, again, mean these correlations are statistically significant or that the probability of observing correlations this large by chance if the true correlation is zero, is nearly 0%). What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between”Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).

At the end of the day r = 0.28 is still a “weak” correlation. In addition, it might be “weak,” on average, but much stronger and statistically and practically significant for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above) typically teaching the highest needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.