New Mexico Teacher Evaluation Lawsuit Updates

In December of 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence is to be presented by the state is currently set for this October. See more information about this ruling here.

As the expert witness for the plaintiffs in this case, I was deposed a few weeks ago here in Phoenix, given my analyses of the state’s data (supported by one of my PhD students – Tray Geiger). In short, we found and I testified during the deposition that:

  • In terms of uniformity and fairness, roughly 70% of New Mexico teachers appear to be ineligible to be assessed using VAMs, and this proportion held constant across the years of data analyzed. This matters all the more when VAM-based data are used to make consequential decisions about teachers, because accountability-eligible teachers are also those who are relatively more likely to realize the negative or reap the positive consequences attached to VAM-based estimates.
  • In terms of reliability (or the consistency of teachers’ VAM-based scores over time), approximately 40% of teachers differed by one quintile (quintiles are derived when a sample or population is divided into fifths) and approximately 28% of teachers differed, from year to year, by two or more quintiles in terms of their VAM-derived effectiveness ratings. These results make sense when New Mexico’s results are situated within the current literature, in which teachers classified as “effective” one year can have a 25%-59% chance of being classified as “ineffective” the next, or vice versa, with other permutations also possible.
  • In terms of validity (i.e., concurrent related evidence of validity), and importantly as also situated within the current literature, the correlations between New Mexico teachers’ VAM-based and observational scores ranged from r = 0.153 to r = 0.210. Not only are these correlations very weak[1], they are also very weak as appropriately situated within the literature, via which it is evidenced that correlations between multiple VAMs and observational scores typically range from 0.30 ≤ r ≤ 0.50.
  • In terms of bias, New Mexico’s Caucasian teachers had significantly higher observation scores than non-Caucasian teachers, implying, also as per the current research, that Caucasian teachers may be (falsely) perceived as being better teachers than non-Caucasian teachers given bias within these instruments and/or bias of the scorers observing and scoring teachers using these instruments in practice. See prior posts about observational-based bias here, here and here.
  • Also of note in terms of bias was that: (1) teachers with fewer years of experience yielded VAM scores that were significantly lower than teachers with more years of experience, with similar patterns noted across teachers’ observation scores, which could all mean, in line with common sense as well as the research, that teachers with more experience are typically better teachers; (2) teachers who taught English language learners (ELLs) or special education students had lower VAM scores across the board than those who did not teach such students; (3) teachers who taught gifted students had significantly higher VAM scores than teachers who did not teach gifted students, which runs counter to the current research evidencing that gifted students often thwart or prevent their teachers from demonstrating growth given ceiling effects; (4) teachers in schools with lower relative proportions of ELLs, special education students, students eligible for free-or-reduced lunches, and students from racial minority backgrounds, as well as higher relative proportions of gifted students, consistently had significantly higher VAM scores. These results suggest that teachers in these schools are as a group better, and/or that VAM-based estimates might be biased against teachers not teaching in these schools, preventing them from demonstrating comparable growth.
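
For readers who want to see how year-to-year quintile shifts like those in the reliability finding above can be tallied, here is a minimal sketch. It is not the analysis used in the affidavit: the function names, the rank-based quintile assignment, and the ten teachers’ scores are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def quintile_labels(scores):
    """Assign each score a quintile, 1 (lowest fifth) to 5 (highest fifth), by rank.

    Ties are broken by position in the list (an arbitrary, illustrative choice).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // len(scores) + 1
    return labels


def quintile_shift_shares(year1, year2):
    """Share of teachers whose quintile moved by 0, 1, or 2+ between two years."""
    q1 = quintile_labels(year1)
    q2 = quintile_labels(year2)
    counts = Counter(min(abs(a - b), 2) for a, b in zip(q1, q2))
    n = len(year1)
    return {
        "same": counts[0] / n,
        "one": counts[1] / n,
        "two_or_more": counts[2] / n,
    }


# Hypothetical VAM scores for the same ten teachers in two consecutive years.
year1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
year2 = [0.2, 0.1, 0.6, 0.9, 0.3, 0.4, 1.0, 0.5, 0.7, 0.8]
shares = quintile_shift_shares(year1, year2)
print(shares)
```

With real data, each list would hold one score per teacher, in the same teacher order for both years.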

To read more about the data and methods used, as well as other findings, please see my affidavit submitted to the court attached here: Affidavit Feb2018.

As a recent update, I should also note that a few weeks ago, as per an article in the Albuquerque Journal, New Mexico’s teacher evaluation system is now likely to be overhauled, or simply “expired,” as early as 2019. In short, “all three Democrats running for governor and the lone Republican candidate…have expressed misgivings about using students’ standardized test scores to evaluate the effectiveness of [New Mexico’s] teachers, a key component of the current system [at issue in this lawsuit and] imposed by the administration of outgoing Gov. Susana Martinez.” All four candidates described the current system “as fundamentally flawed and said they would move quickly to overhaul it.”

While I/we will proceed with our efforts pertaining to this lawsuit until further notice, this is also important to note at this time in that it seems that New Mexico’s new policymakers are going to be much wiser than those of late, at least in these regards.

[1] Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.

 

An Important but False Claim about the EVAAS in Ohio

Just this week in Ohio – a state that continues to contract with SAS Institute Inc. for test-based accountability output from its Education Value-Added Assessment System – SAS’s EVAAS Director, John White, “defended” the use of his model statewide, during which he also claimed before Ohio’s Joint Education Oversight Committee (JEOC) that “poorer schools do no better or worse on student growth than richer schools” when using the EVAAS model.

For the record, this is false. First, about five years ago in Ohio, while the state of Ohio was using the same EVAAS model, Ohio’s The Plain Dealer in conjunction with StateImpact Ohio found that Ohio’s “value-added results show that districts, schools and teachers with large numbers of poor students tend to have lower value-added results than those that serve more-affluent ones.” They also found that:

  • Value-added scores were 2½ times higher on average for districts where the median family income is above $35,000 than for districts with income below that amount.
  • For low-poverty school districts, two-thirds had positive value-added scores — scores indicating students made more than a year’s worth of progress.
  • For high-poverty school districts, two-thirds had negative value-added scores — scores indicating that students made less than a year’s progress.
  • Almost 40 percent of low-poverty schools scored “Above” the state’s value-added target, compared with 20 percent of high-poverty schools.
  • At the same time, 25 percent of high-poverty schools scored “Below” state value-added targets while low-poverty schools were half as likely to score “Below.” See the study here.

Second, about three years ago, similar results were evidenced in Pennsylvania – another state that uses the same EVAAS statewide, although in Pennsylvania the model is known as the Pennsylvania Education Value-Added Assessment System (PVAAS). Research for Action (click here for more about the organization and its mission), more specifically, evidenced that bias also appears to exist particularly at the school-level. See more here.

Third, and related, in Arizona – my state that is also using growth to measure school-level value-added, albeit not with the EVAAS – the same issues with bias are being evidenced when measuring school-level growth for similar purposes. Just two days ago, for example, The Arizona Republic evidenced that the “schools with ‘D’ and ‘F’ letter grades” recently released by the state board of education “were more likely to have high percentages of students eligible for free and reduced-price lunch, an indicator of poverty” (see more here). In actuality, the correlation is as high or “strong” as r = -0.60 (i.e., correlation coefficient values that land between ± 0.50 and ± 1.00 are often said to indicate “strong” correlations). What this means in more pragmatic terms is that the better the school letter grade received the lower the level of poverty at the school (i.e., a negative correlation which indicates in this case that as the letter grade goes up the level of poverty goes down).
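
To make the direction of such a correlation concrete, here is a minimal sketch of how a Pearson r between school letter grades and school poverty could be computed. The ten schools below are entirely hypothetical (they are not the Arizona data), and coding letter grades as F = 0 through A = 4 is an assumption made only for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical schools: letter grade coded F=0 ... A=4, and the percent of
# students eligible for free and reduced-price lunch at each school.
grade_points = [4, 4, 3, 3, 2, 2, 1, 1, 0, 0]
pct_poverty = [20, 45, 30, 60, 40, 70, 55, 85, 65, 90]

r = pearson_r(grade_points, pct_poverty)
print(f"r = {r:.2f}")
```

A negative r here means that higher letter grades go with lower poverty; an r at or near zero is what the claim that poorer schools do no better or worse would require.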

While the state of Arizona combines with growth a proficiency measure (always strongly correlated with poverty), and this explains at least some of the strength of this correlation (although combining proficiency with growth is also a practice endorsed and encouraged by John White), this strong correlation is certainly at issue.

More specifically at issue, though, should be how to get any such correlation down to zero or near-zero (if possible), which is the only correlation that would, in fact, warrant any such claim, again as noted to the JEOC this week in Ohio, that “poorer schools do no better or worse on student growth than richer schools”.

Identifying Effective Teacher Preparation Programs Using VAMs Does Not Work

“A New Study [does not] Show Why It’s So Hard to Improve Teacher Preparation” Programs (TPPs). More specifically, it shows why using value-added models (VAMs) to evaluate TPPs, and then ideally improving them using the value-added data derived, is nearly if not entirely impossible.

This is precisely why yet another, perhaps commonsensical but highly improbable, federal policy move to imitate great teacher education programs and shut down ineffective ones, as based on their graduates’ students’ test-based performance over time (i.e., value-added), continues to fail.

Accordingly, in another, although not-yet peer-reviewed or published study referenced in the article above, titled “How Much Does Teacher Quality Vary Across Teacher Preparation Programs? Reanalyzing Estimates from [Six] States,” authors Paul T. von Hippel, from the University of Texas at Austin, and Laura Bellows, a PhD Student from Duke University, investigated “whether the teacher quality differences between TPPs are large enough to make [such] an accountability system worthwhile” (p. 2). More specifically, using a meta-analysis technique, they reanalyzed the results of such evaluations in six of the approximately 16 states doing this (i.e., in New York, Louisiana, Missouri, Washington, Texas, and Florida), each of which ultimately yielded a peer-reviewed publication, and they found “that teacher quality differences between most TPPs [were] negligible [at approximately] 0-0.04 standard deviations in student test scores” (p. 2).

They also highlight some of the statistical practices that exaggerated the “true” differences noted between TPPs in each of these studies, and in these types of studies in general, and consequently conclude that the “results of TPP evaluations in different states may vary not for substantive reasons, but because of the[se] methodological choices” (p. 5). Likewise, as is the case with value-added research in general, when “[f]aced with the same set of results, some authors may [also] believe they see intriguing differences between TPPs, while others may believe there is not much going on” (p. 6). With that being said, I will not cover these statistical/technical issues further here. Do read the full study for these details, though, as they are also important.

Related, they found that in every state, the variation that they statistically observed was greater among relatively small TPPs versus large ones. They suggest that this occurs, accordingly, due to estimation or statistical methods that may be inadequate for the task at hand. However, if this is true this also means that because there is relatively less variation observed among large TPPs, it may be much more difficult “to single out a large TPP that is significantly better or worse than average” (p. 30). Accordingly, there are several ways to mistakenly single out a TPP as exceptional or less than, merely given TPP size. This is obviously problematic.
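
One way to see how program size alone could produce this pattern is a small simulation in which every TPP has exactly the same true effect, so any differences between program averages are pure noise. This is only an illustrative sketch, not the authors’ method: the function name, the program sizes (5 versus 200 teachers), and the assumed teacher-level standard deviation of 0.15 are all hypothetical.

```python
import random
import statistics


def simulate_tpp_estimates(n_programs, teachers_per_program, rng, teacher_sd=0.15):
    """Estimated program means when every program has the SAME true effect (zero).

    teacher_sd is an assumed spread of individual teacher effects, in student
    test-score standard deviations; it is illustrative, not an empirical value.
    """
    estimates = []
    for _ in range(n_programs):
        teacher_effects = [
            rng.gauss(0.0, teacher_sd) for _ in range(teachers_per_program)
        ]
        estimates.append(statistics.fmean(teacher_effects))
    return estimates


rng = random.Random(0)  # fixed seed so the run is repeatable
small = simulate_tpp_estimates(200, 5, rng)    # 200 small programs, 5 teachers each
large = simulate_tpp_estimates(200, 200, rng)  # 200 large programs, 200 teachers each

spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
print(f"spread among small TPPs: {spread_small:.3f}")
print(f"spread among large TPPs: {spread_large:.3f}")
```

Even with no true differences at all, the small programs’ averages spread out far more than the large programs’ averages, which is why a small TPP can look exceptional (or exceptionally poor) by chance.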

Nonetheless, the authors also note that before they began this study, “the differences between TPPs appeared small or negligible” (p. 29) in Missouri, Texas, and Washington, but in Louisiana and New York “they appeared more substantial” (p. 29). After their (re)analyses, however, they found that the results from and across these six different states were “more congruent” (p. 29), as also noted prior (i.e., differences between TPPs around 0 and 0.04 SDs in student test scores).

“In short,” they conclude, “TPP evaluations may have some policy value, but the value is more modest than was originally envisioned. [Likewise, it] is probably not meaningful to rank all the TPPs in a state; the true differences between most TPPs are too small to matter, and the estimated differences consist mostly of noise” (p. 29). As per the article cited prior, they added that “It appears that differences between [programs] are rarely detectable, and that if they could be detected they would usually be too small to support effective policy decisions.”

To see a study similar to this, that colleagues and I conducted in Arizona, and that was recently published in Teaching Education, see “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added’” summarized and referenced here.

New Mexico’s Motion for Summary Judgment, Following Houston’s Precedent-Setting Ruling

Recall that in New Mexico, just over two years ago, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence was to be presented by the state was repeatedly postponed since, yet with teacher-level consequences prohibited all the while. See more information about this ruling here.

Recall as well that in Houston, just this past May, a district judge ruled that Houston Independent School District (HISD) teachers who had VAM scores (as based on the Education Value-Added Assessment System (EVAAS)) had legitimate claims regarding how EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). More specifically, in what turned out to be a huge and unprecedented victory, the judge ruled that because HISD teachers “ha[d] no meaningful way to ensure correct calculation of their EVAAS scores,” they were, as a result, “unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.” This ruling ultimately led the district to end the use of the EVAAS for teacher termination throughout Houston. See more information about this ruling here.

Just this past week, New Mexico charged that the Houston ruling regarding Houston teachers’ Fourteenth Amendment due process protections also applies to teachers throughout the state of New Mexico.

As per an article titled “Motion For Summary Judgment Filed In New Mexico Teacher Evaluation Lawsuit,” the American Federation of Teachers and Albuquerque Teachers Federation filed a “motion for summary judgment in the litigation in our continuing effort to make teacher evaluations beneficial and accurate in New Mexico.” They, too, are “seeking a determination that the [state’s] failure to provide teachers with adequate information about the calculation of their VAM scores violated their procedural due process rights.”

“The evidence demonstrates that neither school administrators nor educators have been provided with sufficient information to replicate the [New Mexico] VAM score calculations used as a basis for teacher evaluations. The VAM algorithm is complex, and the general overview provided in the NMTeach Technical Guide is not enough to pass constitutional muster. During previous hearings, educators testified they do not receive an explanation at the time they receive their annual evaluation, and teachers have been subjected to performance growth plans based on low VAM scores, without being given any guidance or explanation as to how to raise that score on future evaluations. Thus, not only do educators not understand the algorithm used to derive the VAM score that is now part of the basis for their overall evaluation rating, but school administrators within the districts do not have sufficient information on how the score is derived in order to replicate it or to provide professional development, whether as part of a disciplinary scenario or otherwise, to assist teachers in raising their VAM score.”

For more information about this update, please click here.

Bias in VAMs, According to Validity Expert Michael T. Kane

During the still ongoing value-added lawsuit in New Mexico (see my most recent update about this case here), I was honored to testify as the expert witness on behalf of the plaintiffs (see, for example, here). I was also fortunate to witness the testimony of the expert witness who testified on behalf of the defendants – Thomas Kane, Economics Professor at Harvard and former Director of the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. During Kane’s testimony, one of the highlights (i.e., for the plaintiffs), or rather the low-lights (i.e., for him and the defendants), in my opinion, was when one of the plaintiffs’ attorneys questioned Kane, on the stand, about his expertise in the area of validity. In sum, Kane responded that he defined himself as an “expert” in the area, having also been trained by some of the best. Consequently, the plaintiffs’ attorney questioned Kane about different types of validity evidence (e.g., construct, content, criterion), and Kane could not answer those questions. The only form of validity evidence with which he was familiar, and which he could clearly define, was evidence related to predictive validity. This hardly made him the expert he proclaimed himself to be minutes prior.

Let’s not mince words, though, or in this case names.

A real expert in validity (and validity theory) is another Kane, who goes by the full name of Michael T. Kane. This Kane is The Samuel J. Messick Chair in Test Validity at the Educational Testing Service (ETS); this Kane wrote one of the best, most contemporary, and currently most foundational papers on validity (see here); and this Kane just released an ETS-sponsored paper on Measurement Error and Bias in Value-Added Models certainly of interest here. I summarize this piece below (see the PDF of this report here).

In this paper Kane examines “the origins of [value-added model (VAM)-based] bias and its potential impact” and indicates that bias that is observed “is an increasing linear function of the student’s prior achievement and can be quite large (e.g., half a true-score standard deviation) for very low-scoring and high-scoring students [i.e., students in the extremes of any normal distribution]” (p. 1). Hence, Kane argues, “[t]o the extent that students with relatively low or high prior scores are clustered in particular classes and schools, the student-level bias will tend to generate bias in VAM estimates of teacher and school effects” (p. 1; see also prior posts about this type of bias here, here, and here; see also Haertel (2013) cited below). Kane concludes that “[a]djusting for this bias is possible, but it requires estimates of generalizability (or reliability) coefficients that are more accurate and precise than those that are generally available for standardized achievement tests” (p. 1; see also prior posts about issues with reliability across VAMs here, here, and here).

Kane’s more specific points of note:

  • To accurately calculate teachers’/schools’ value-added, “current and prior scores have to be on the same scale (or on vertically aligned scales) for the differences to make sense. Furthermore, the scale has to be an interval scale in the sense that a difference of a certain number of points has, at least approximately, the same meaning along the scale, so that it makes sense to compare gain scores from different parts of the scale…some uncertainty about scale characteristics is not a problem for many applications of vertical scaling, but it is a serious problem if the proposed use of the scores (e.g., educational accountability based on growth scores) demands that the vertical scale be demonstrably equal interval” (p. 1).
  • Likewise, while some approaches can be used to minimize the need for such scales (e.g., residual gain scores, covariate-adjustment models, and ordinary least squares (OLS) regression approaches which are of specific interest in this piece), “it is still necessary to assume [emphasis added] that a difference of a certain number of points has more or less the same meaning along the score scale for the current test scores” (p. 2).
  • Related, “such adjustments can [still] be biased to the extent that the predicted score does not include all factors that may have an impact on student performance. Bias can also result from errors of measurement in the prior scores included in the prediction equation…[and this can be]…substantial” (p. 2).
  • Accordingly, “gains for students with high true scores on the prior year’s test will be overestimated, and the gains for students with low true scores in the prior year will be underestimated. To the extent that students with relatively low and high true scores tend to be clustered in particular classes and schools, the student-level bias will generate bias in estimates of teacher and school effects” (p. 2).
  • Hence, “if not corrected, this source of bias could have a substantial negative impact on estimated VAM scores for teachers and schools that serve students with low prior true scores and could have a substantial positive impact for teachers and schools that serve mainly high-performing students” (p. 2).
  • Put differently, random errors in students’ prior scores may “tend to add a positive bias to the residual gain scores for students with prior scores above the population mean, and they [may] tend to add a negative bias to the residual gain scores for students with prior scores below the mean. Th[is] bias is associated with the well-known phenomenon of regression to the mean” (p. 10).
  • Although, at least this latter claim (that students with relatively high true scores in the prior year could substantially and positively impact their teachers’/schools’ value-added estimates) does run somewhat contrary to other claims as evidenced in the literature in terms of the extent to which ceiling effects substantially and negatively impact their teachers’/schools’ value-added estimates (see, for example, Point #7 as per the ongoing lawsuit in Houston here, and see also Florida teacher Luke Flint’s “Story” here).
  • In sum, and as should be a familiar conclusion to followers of this blog, “[g]iven that the results of VAMs may be used for high-stakes decisions about teachers and schools in the context of accountability programs,…any substantial source of bias would be a matter of great concern” (p. 2).
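
To illustrate the mechanism behind these points, here is a minimal simulation sketch of how random measurement error in prior scores can bias residual gain scores. It is not Kane’s analysis: every number below (a true-score SD of 1.0, an error SD of 0.5, an identical true gain of 1.0 for all students, and the ±1.0 cutoffs for “high” and “low” students) is an assumption chosen only for illustration.

```python
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of ys on xs (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)  # fixed seed so the run is repeatable
n_students = 20000

# True prior achievement (never observed directly).
true_prior = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
# Observed prior score = true score + random measurement error (assumed SD 0.5).
observed_prior = [t + rng.gauss(0.0, 0.5) for t in true_prior]
# Every student truly gains the same amount (1.0), so any differences in
# estimated "gain" between groups of students are artifacts.
current = [t + 1.0 + rng.gauss(0.0, 0.5) for t in true_prior]

slope, intercept = ols_slope_intercept(observed_prior, current)
residual_gain = [y - (intercept + slope * x) for x, y in zip(observed_prior, current)]

high = [r for r, t in zip(residual_gain, true_prior) if t > 1.0]
low = [r for r, t in zip(residual_gain, true_prior) if t < -1.0]
mean_resid_high = sum(high) / len(high)
mean_resid_low = sum(low) / len(low)

print(f"slope on observed prior score: {slope:.2f}")
print(f"mean residual gain, high true prior scores: {mean_resid_high:+.2f}")
print(f"mean residual gain, low true prior scores:  {mean_resid_low:+.2f}")
```

Because the error in the prior scores flattens the regression slope below 1, students with high true prior scores end up with positive residual gains on average and students with low true prior scores with negative ones, even though everyone in this sketch gained the same amount.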

Citation: Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton, NJ: Educational Testing Service (ETS) Research Report Series. doi:10.1002/ets2.12153 Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full

See also Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

A North Carolina Teacher’s Guest Post on His/Her EVAAS Scores

A teacher from the state of North Carolina recently emailed me for my advice regarding how to help him/her read and understand his/her recently received Education Value-Added Assessment System (EVAAS) value-added scores. You likely recall that the EVAAS is the model I cover most on this blog, also in that this is the system I have researched the most, as well as the proprietary system adopted by multiple states (e.g., Ohio, North Carolina, and South Carolina) and districts across the country for which taxpayers continue to pay big $. Of late, this is also the value-added model (VAM) of sole interest in the recent lawsuit that teachers won in Houston (see here).

You might also recall that the EVAAS is the system developed by the now late William Sanders (see here), who ultimately sold it to SAS Institute Inc. that now holds all rights to the VAM (see also prior posts about the EVAAS here, here, here, here, here, and here). It is also important to note, because this teacher teaches in North Carolina where SAS Institute Inc. is located and where its CEO James Goodnight is considered the richest man in the state, that as a major Grand Old Party (GOP) donor “he” helps to set all of the state’s education policy as the state is also dominated by Republicans. All of this also means that it is unlikely EVAAS will go anywhere unless there is honest and open dialogue about the shortcomings of the data.

Hence, the attempt here is to begin at least some honest and open dialogue. Accordingly, here is what this teacher wrote in response to my request that (s)he write a guest post:

***

SAS Institute Inc. claims that the EVAAS enables teachers to “modify curriculum, student support and instructional strategies to address the needs of all students.”  My goal this year is to see whether these claims are actually possible or true. I’d like to dig deep into the data made available to me — for which my state pays over $3.6 million per year — in an effort to see what these data say about my instruction, accordingly.

For starters, here is what my EVAAS-based growth looks like over the past three years:

As you can see, three years ago I met my expected growth, but my growth measure was slightly below zero. The year after that I knocked it out of the park. This past year I was right in the middle of my prior two years of results. Notice the volatility [aka an issue with VAM-based reliability, or consistency, or a lack thereof; see, for example, here].

Notwithstanding, SAS Institute Inc. makes the following recommendations in terms of how I should approach my data:

Reflecting on Your Teaching Practice: Learn to use your Teacher reports to reflect on the effectiveness of your instructional delivery.

The Teacher Value Added report displays value-added data across multiple years for the same subject and grade or course. As you review the report, you’ll want to ask these questions:

  • Looking at the Growth Index for the most recent year, were you effective at helping students to meet or exceed the Growth Standard?
  • If you have multiple years of data, are the Growth Index values consistent across years? Is there a positive or negative trend?
  • If there is a trend, what factors might have contributed to that trend?
  • Based on this information, what strategies and instructional practices will you replicate in the current school year? What strategies and instructional practices will you change or refine to increase your success in helping students make academic growth?

Yet my growth index values are not consistent across years, as also noted above. Rather, my “trends” are baffling to me.  When I compare those three instructional years in my mind, nothing stands out to me in terms of differences in instructional strategies that would explain the fluctuations in growth measures, either.

So let’s take a closer look at my data for last year (i.e., 2016-2017).  I teach 7th grade English/language arts (ELA), so my numbers are based on my students’ grade 7 reading scores in the table below.

What jumps out for me here is the contradiction in “my” data for achievement Levels 3 and 4 (achievement levels start at Level 1 and top out at Level 5, whereas levels 3 and 4 are considered proficient/middle of the road).  There is moderate evidence that my grade 7 students who scored a Level 4 on the state reading test exceeded the Growth Standard.  But there is also moderate evidence that my same grade 7 students who scored Level 3 did not meet the Growth Standard.  At the same time, the number of students I had demonstrating proficiency on the same reading test (by scoring at least a 3) increased from 71% in 2015-2016 (when I exceeded expected growth) to 76% in school year 2016-2017 (when my growth declined significantly). This makes no sense, right?

Hence, and after considering my data above, the question I’m left with is actually really important:  Are the instructional strategies I’m using for my students whose achievement levels are in the middle working, or are they not?

I’d love to hear from other teachers on their interpretations of these data.  A tool that costs taxpayers this much money and impacts teacher evaluations in so many states should live up to its claims of being useful for informing our teaching.

More of Kane’s “Objective” Insights on Teacher Evaluation Measures

You might recall from a series of prior posts (see, for example, here, here, and here), the name of Thomas Kane — an economics professor from Harvard University who directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, who also testified as an expert witness in two lawsuits (i.e., in New Mexico and Houston) opposite me (and in the case of Houston, also opposite Jesse Rothstein).

He, along with Andrew Bacher-Hicks (PhD Candidate at Harvard), Mark Chin (PhD Candidate at Harvard), and Douglas Staiger (Economics Professor at Dartmouth), just released yet another National Bureau of Economic Research (NBER) “working paper” (i.e., not peer-reviewed, and in this case not internally reviewed by NBER for public consumption and use either) titled “An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.” I review this study here.

Using Kane’s MET data, they test whether 66 mathematics teachers’ performance measured (1) by using teachers’ student test achievement gains (i.e., calculated using value-added models (VAMs)), classroom observations, and student surveys, and (2) under naturally occurring (i.e., non-experimental) settings “predicts performance following random assignment of that teacher to a class of students” (p. 2). More specifically, researchers “observed a sample of fourth- and fifth-grade mathematics teachers and collected [these] measures…[under normal conditions, and then in]…the third year…randomly assigned participating teachers to classrooms within their schools and then again collected all three measures” (p. 3).

They concluded that “the test-based value-added measure—is a valid predictor of teacher impacts on student achievement following random assignment” (p. 28). This finding “is the latest in a series of studies” (p. 27) substantiating this not-surprising, as-oft-Kane-asserted finding, or as he might assert it, fact. I should note here that no other studies substantiating the “latest in a series of studies” (p. 27) claim are referenced or cited, but a quick review of the 31 total references included in this report shows that 16/31 (52%) were conducted only by econometricians (i.e., not statisticians or other educational researchers) on this general topic, of which 10/16 (63%) are not peer reviewed and of which 6/16 (38%) are either authored or co-authored by Kane (1/6 being published in a peer-reviewed journal). The other articles cited are about the measurements used, the general methods used in this study, and four other articles written on the topic not authored by econometricians. Needless to say, there is a clear and, unfortunately, not surprising slant in this piece; had it gone through any respectable vetting process, this should (and would) have been caught and addressed prior to this study’s release.

I must add that this reminds me of Kane’s New Mexico testimony (see here) where he, again, “stressed that numerous studies [emphasis added] show[ed] that teachers [also] make a big impact on student success.” He stated this on the stand while expressly contradicting the findings of the American Statistical Association (ASA). While testifying otherwise, and again, he also only referenced (non-representative) studies in his (or rather defendants’) support, authored primarily by him (e.g., as per his MET studies) and some of his other econometric friends (e.g., Raj Chetty, Eric Hanushek, Doug Staiger), as also cited within this piece here. This was also a concern registered by the court, in terms of whether Kane’s expertise was that of a generalist (i.e., competent across multi-disciplinary studies conducted on the matter) or a “selectivist” (i.e., biased in terms of his prejudice against, or rather selectivity of certain studies for confirmation, inclusion, or acknowledgment). This is also certainly relevant, and should be taken into consideration here.

Otherwise, in this study the authors also found that the Mathematical Quality of Instruction (MQI) observational measure (one of two observational measures they used in this study, with the other one being the Classroom Assessment Scoring System (CLASS)) was a valid predictor of teachers’ classroom observations following random assignment. The MQI also did “not seem to be biased by the unmeasured characteristics of students [a] teacher typically teaches” (p. 28). This also expressly contradicts what is now an emerging set of studies evidencing the contrary, also not cited in this particular piece (see, for example, here, here, and here), some of which were also conducted using Kane’s MET data (see, for example, here and here).

Finally, authors’ evidence on the predictive validity of student surveys was inconclusive.

Needless to say…

Citation: Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w23478

New Evidence that Developmental (and Formative) Approaches to Teacher Evaluation Systems Work

Susan Moore Johnson – Professor of Education at Harvard University and author of another important article regarding how value-added models (VAMs) oft-reinforce the walls of “egg-crate” schools (here) – recently published (along with two co-authors) an article in the esteemed, peer-reviewed Educational Evaluation and Policy Analysis. The article titled: Investing in Development: Six High-Performing, High-Poverty Schools Implement the Massachusetts Teacher Evaluation Policy can be downloaded here (in its free, pre-publication form).

In this piece, as taken from the abstract, they “studied how six high-performing, high-poverty [and traditional, charter, under state supervision] schools in one large Massachusetts city implemented the state’s new teacher evaluation policy” (p. 383). They aimed to learn how these “successful” schools, with “success” defined by the state’s accountability ranking per school along with its “public reputation,” approached the state’s teacher evaluation system and its system components (e.g., classroom observations, follow-up feedback, and the construction and treatment of teachers’ summative evaluation ratings). They also investigated how educators within these schools “interacted to shape the character and impact of [the state’s] evaluation” (p. 384).

Akin to Moore Johnson’s aforementioned work, she and her colleagues argue that “to understand whether and how new teacher evaluation policies affect teachers and their work, we must investigate [the] day-to-day responses [of] those within the schools” (p. 384). Hence, they explored “how the educators in these schools interpreted and acted on the new state policy’s opportunities and requirements and, overall, whether they used evaluation to promote greater accountability, more opportunities for development, or both” (p. 384).

They found that “despite important differences among the six successful schools [they] studied (e.g., size, curriculum and pedagogy, student discipline codes), administrators responded to the state evaluation policy in remarkably similar ways, giving priority to the goal of development over accountability [emphasis added]” (p. 385). In addition, “[m]ost schools not only complied with the new regulations of the law but also went beyond them to provide teachers with more frequent observations, feedback, and support than the policy required. Teachers widely corroborated their principal’s reports that evaluation in their school was meant to improve their performance and they strongly endorsed that priority” (p. 385).

Overall, and accordingly, they concluded that “an evaluation policy focusing on teachers’ development can be effectively implemented in ways that serve the interests of schools, students, and teachers” (p. 402). This is especially true when (1) evaluation efforts are “well grounded in the observations, feedback, and support of a formative evaluation process;” (2) states rely on “capacity building in addition to mandates to promote effective implementation;” and (3) schools also benefit from spillover effects from other, positive, state-level policies (i.e., states do not take Draconian approaches to other educational policies) that, in these cases, included policies permitting district discretion and control over staffing and administrative support (p. 402).

Related, such developmental and formatively-focused teacher evaluation systems can work, they also conclude, when schools are led by highly effective principals who are free to select high quality teachers. Their findings suggest that this “is probably the most important thing district officials can do to ensure that teacher evaluation will be a constructive, productive process” (p. 403). In sum, “as this study makes clear, policies that are intended to improve schooling depend on both administrators and teachers for their effective implementation” (p. 403).

Please note, however, that this study was conducted before districts in this state were required to incorporate standardized test scores to measure teachers’ effects (e.g., using VAMs); hence, the assertions and conclusions that authors set forth throughout this piece should be read and taken into consideration given that important caveat. Perhaps findings should matter even more in that here is at least some proof that teacher evaluation works IF used for developmental and formative (versus or perhaps in lieu of summative) purposes.

Citation: Reinhorn, S. K., Moore Johnson, S., & Simon, N. S. (2017). Investing in development: Six high-performing, high-poverty schools implement the Massachusetts teacher evaluation policy. Educational Evaluation and Policy Analysis, 39(3), 383–406. doi:10.3102/0162373717690605 Retrieved from https://projectngt.gse.harvard.edu/files/gse-projectngt/files/eval_041916_unblinded.pdf

The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft — an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here), and another of his co-authors Matthew P. Steinberg — an Assistant Professor of Education Policy at the University of Pennsylvania — just published another article in this same journal on “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and freely accessible, at least for now, article here; see also its original and what should be enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matters, matters differently for different types of teachers within and across school districts and states, and matters also in that so often these weights are arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the consequences attached to these measures, also given the arbitrary and political constructions they represent, can be career changing professionally and life changing personally. How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).
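To make the arithmetic behind this finding concrete, here is a minimal sketch (it is not Steinberg and Kraft's simulation, and it uses no MET or New Mexico data). It assumes six hypothetical teachers, each with a VAM-based score centered near the middle of a 1-4 scale and an observation score with a higher mean, and an assumed cutoff of 3.0 for a "proficient" summative rating. The names TEACHERS, PROFICIENT_CUTOFF, composite, and proficiency_rate are all invented for illustration.

```python
# Hypothetical teachers: (VAM-based score, observation score), both on a 1-4 scale.
# Every value here is invented for illustration; none comes from real data.
TEACHERS = [
    (1.5, 3.1),
    (2.0, 3.2),
    (2.5, 3.1),
    (2.5, 3.4),
    (3.1, 3.3),
    (3.5, 3.6),
]

PROFICIENT_CUTOFF = 3.0  # assumed summative threshold, not taken from any real system


def composite(vam, obs, vam_weight):
    """Weighted average of the two measures; the weights sum to 1."""
    return vam_weight * vam + (1.0 - vam_weight) * obs


def proficiency_rate(teachers, vam_weight, cutoff=PROFICIENT_CUTOFF):
    """Share of teachers whose composite score meets or exceeds the cutoff."""
    rated = [composite(v, o, vam_weight) >= cutoff for v, o in teachers]
    return sum(rated) / len(rated)


# Same teachers, same scores; only the weight on the VAM-based measure changes.
rates = {w: proficiency_rate(TEACHERS, w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)}
for w, rate in rates.items():
    print(f"VAM weight {w:.2f}: {rate:.0%} rated proficient")
```

In this toy data set, the share of teachers rated proficient falls as the weight on the lower-mean, VAM-based measure rises, even though no teacher's underlying scores change. The size of the drop depends entirely on the invented scores and cutoff, so only the direction of the effect should be read from it.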

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Thereafter, some states over others are then celebrated (e.g., by the National Council on Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously than others when, as evidenced herein, this is (unfortunately) more due to manipulation than true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It’s not necessarily that they vary in reality; rather, the manipulation of the weights on the back end causes such variation and then leads to, quite literally, such delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also causes a related but also important issue with fairness, whereas equally effective teachers may be five-or-so times more likely (e.g., in states like New Mexico) to be rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).

Finally, it should also be noted, as an important caveat here, that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences” (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Related, please recall that via the MET studies one of the researchers’ goals was to determine which weights per multiple measure were empirically defensible. MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might say that this weight here or that weight there is empirically defensible should be asked for the evidence in support.

Citations:

Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Educational Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–396. doi:10.3102/0013189X17726752 Retrieved from http://journals.sagepub.com/doi/abs/10.3102/0013189X17726752

Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just one day ago (October 10, 2017) came the case’s final federal suit settlement. Click here to read the “Settlement and Full and Final Release Agreement.” But in short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case district shall deprive any person of life, liberty, or property, without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here p. 13). This was confirmed by one of the plaintiffs’ expert witnesses, who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in its attorney and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others using the EVAAS or another VAM for that matter (e.g., that is also “unverifiable”) do take note, at minimum.