Bias in VAMs, According to Validity Expert Michael T. Kane

During the still-ongoing value-added lawsuit in New Mexico (see my most recent update about this case here), I was honored to testify as the expert witness on behalf of the plaintiffs (see, for example, here). I was also fortunate to witness the testimony of the expert witness who testified on behalf of the defendants: Thomas Kane, Economics Professor at Harvard and former Director of the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. During Kane’s testimony, one of the highlights for the plaintiffs, or rather lowlights for him and the defendants, in my opinion, was when one of the plaintiffs’ attorneys questioned Kane, on the stand, about his expertise in the area of validity. In sum, Kane responded that he defined himself as an “expert” in the area, having also been trained by some of the best. Consequently, the plaintiffs’ attorney questioned Kane about different types of validity evidence (e.g., construct, content, criterion), and Kane could not answer those questions. The only form of validity evidence with which he was familiar, and which he could clearly define, was evidence related to predictive validity. This hardly made him the expert he had proclaimed himself to be just minutes prior.

Let’s not mince words, though, or in this case names.

A real expert in validity (and validity theory) is another Kane, who goes by the full name of Michael T. Kane. This Kane holds the Samuel J. Messick Chair in Test Validity at the Educational Testing Service (ETS); this Kane wrote one of the best, most contemporary, and currently most foundational papers on validity (see here); and this Kane just released an ETS-sponsored paper on Measurement Error and Bias in Value-Added Models that is certainly of interest here. I summarize this piece below (see the PDF of this report here).

In this paper Kane examines “the origins of [value-added model (VAM)-based] bias and its potential impact” and indicates that the bias observed “is an increasing linear function of the student’s prior achievement and can be quite large (e.g., half a true-score standard deviation) for very low-scoring and high-scoring students [i.e., students at the extremes of any normal distribution]” (p. 1). Hence, Kane argues, “[t]o the extent that students with relatively low or high prior scores are clustered in particular classes and schools, the student-level bias will tend to generate bias in VAM estimates of teacher and school effects” (p. 1; see also prior posts about this type of bias here, here, and here; see also Haertel (2013) cited below). Kane concludes that “[a]djusting for this bias is possible, but it requires estimates of generalizability (or reliability) coefficients that are more accurate and precise than those that are generally available for standardized achievement tests” (p. 1; see also prior posts about issues with reliability across VAMs here, here, and here).

Kane’s more specific points of note:

  • To accurately calculate teachers’/schools’ value-added, “current and prior scores have to be on the same scale (or on vertically aligned scales) for the differences to make sense. Furthermore, the scale has to be an interval scale in the sense that a difference of a certain number of points has, at least approximately, the same meaning along the scale, so that it makes sense to compare gain scores from different parts of the scale…some uncertainty about scale characteristics is not a problem for many applications of vertical scaling, but it is a serious problem if the proposed use of the scores (e.g., educational accountability based on growth scores) demands that the vertical scale be demonstrably equal interval” (p. 1).
  • Likewise, while some approaches can be used to minimize the need for such scales (e.g., residual gain scores, covariate-adjustment models, and ordinary least squares (OLS) regression approaches which are of specific interest in this piece), “it is still necessary to assume [emphasis added] that a difference of a certain number of points has more or less the same meaning along the score scale for the current test scores” (p. 2).
  • Relatedly, “such adjustments can [still] be biased to the extent that the predicted score does not include all factors that may have an impact on student performance. Bias can also result from errors of measurement in the prior scores included in the prediction equation…[and this can be]…substantial” (p. 2).
  • Accordingly, “gains for students with high true scores on the prior year’s test will be overestimated, and the gains for students with low true scores in the prior year will be underestimated. To the extent that students with relatively low and high true scores tend to be clustered in particular classes and schools, the student-level bias will generate bias in estimates of teacher and school effects” (p. 2).
  • Hence, “if not corrected, this source of bias could have a substantial negative impact on estimated VAM scores for teachers and schools that serve students with low prior true scores and could have a substantial positive impact for teachers and schools that serve mainly high-performing students” (p. 2).
  • Put differently, random errors in students’ prior scores may “tend to add a positive bias to the residual gain scores for students with prior scores above the population mean, and they [may] tend to add a negative bias to the residual gain scores for students with prior scores below the mean. Th[is] bias is associated with the well-known phenomenon of regression to the mean” (p. 10). A simple simulation of this mechanism follows this list.
  • Note that at least this latter claim, that students with relatively high true scores in the prior year could substantially and positively impact their teachers’/schools’ value-added estimates, runs somewhat contrary to other claims evidenced in the literature about the extent to which ceiling effects substantially and negatively impact those same teachers’/schools’ value-added estimates (see, for example, Point #7 as per the ongoing lawsuit in Houston here, and see also Florida teacher Luke Flint’s “Story” here).
  • In sum, and as should be a familiar conclusion to followers of this blog, “[g]iven that the results of VAMs may be used for high-stakes decisions about teachers and schools in the context of accountability programs,…any substantial source of bias would be a matter of great concern” (p. 2).
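
To make the mechanism Kane describes more concrete, here is a minimal simulation sketch of my own (this is not Kane’s analysis; the reliability of 0.80 and all other numbers below are hypothetical) showing how random error in the prior-year score, once run through a covariate-adjustment (residual gain) model, produces bias that increases roughly linearly with students’ prior true scores:

# Illustrative simulation (not Kane's): measurement error in the prior-year score
# biases residual gain scores as a roughly linear function of the prior TRUE score.
# The reliability and sample size below are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
reliability = 0.80                                         # hypothetical prior-test reliability

true_prior = rng.normal(0.0, 1.0, n)                       # prior-year true scores
err_sd = np.sqrt(1.0 / reliability - 1.0)                  # yields Var(true)/Var(observed) = 0.80
observed_prior = true_prior + rng.normal(0.0, err_sd, n)   # observed prior scores (with error)
current = true_prior + rng.normal(0.0, 0.5, n)             # current scores: no true differential growth

# Covariate-adjustment (residual gain) model: regress current scores on OBSERVED prior scores.
slope, intercept = np.polyfit(observed_prior, current, 1)  # slope is attenuated toward the reliability
residual_gain = current - (intercept + slope * observed_prior)

# If the adjustment were unbiased, average residual gains would be near zero at every level
# of prior true score; instead they rise with prior true score (regression to the mean).
for lo, hi in [(-3, -2), (-2, -1), (-1, 1), (1, 2), (2, 3)]:
    in_bin = (true_prior >= lo) & (true_prior < hi)
    print(f"prior true score in [{lo:+d}, {hi:+d}): mean residual gain = {residual_gain[in_bin].mean():+.3f}")

In this sketch no student actually grows more than any other, yet students with high prior true scores receive systematically positive residual gains and students with low prior true scores receive systematically negative ones. To the extent such students cluster in particular classrooms or schools, those student-level biases aggregate into biased teacher- and school-level estimates, which is exactly Kane’s concern above.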

Citation: Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton, NJ: Educational Testing Service (ETS) Research Report Series. doi:10.1002/ets2.12153. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full

See also Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

A North Carolina Teacher’s Guest Post on His/Her EVAAS Scores

A teacher from the state of North Carolina recently emailed me for my advice regarding how to help him/her read and understand his/her recently received Education Value-Added Assessment System (EVAAS) value-added scores. You likely recall that the EVAAS is the model I cover most on this blog, given that it is also the system I have researched the most, as well as the proprietary system adopted by multiple states (e.g., Ohio, North Carolina, and South Carolina) and districts across the country, for which taxpayers continue to pay big $. Of late, this is also the value-added model (VAM) of sole interest in the recent lawsuit that teachers won in Houston (see here).

You might also recall that the EVAAS is the system developed by the now late William Sanders (see here), who ultimately sold it to SAS Institute Inc., which now holds all rights to the VAM (see also prior posts about the EVAAS here, here, here, here, here, and here). It is also important to note, because this teacher teaches in North Carolina where SAS Institute Inc. is located and where its CEO James Goodnight is considered the richest man in the state, that as a major Grand Old Party (GOP) donor “he” helps to set all of the state’s education policy, as the state is also dominated by Republicans. All of this also means that it is unlikely the EVAAS will go anywhere unless there is honest and open dialogue about the shortcomings of the data.

Hence, the attempt here is to begin at least some of that honest and open dialogue. Accordingly, here is what this teacher wrote in response to my request that (s)he write a guest post:

***

SAS Institute Inc. claims that the EVAAS enables teachers to “modify curriculum, student support and instructional strategies to address the needs of all students.” My goal this year is to see whether these claims are actually possible or true. Accordingly, I’d like to dig deep into the data made available to me (data for which my state pays over $3.6 million per year) in an effort to see what these data say about my instruction.

For starters, here is what my EVAAS-based growth looks like over the past three years:

As you can see, three years ago I met my expected growth, but my growth measure was slightly below zero. The year after that I knocked it out of the park. This past year I was right in the middle of my prior two years of results. Notice the volatility [aka an issue with VAM-based reliability, or consistency, or a lack thereof; see, for example, here].

Notwithstanding, SAS Institute Inc. makes the following recommendations in terms of how I should approach my data:

Reflecting on Your Teaching Practice: Learn to use your Teacher reports to reflect on the effectiveness of your instructional delivery.

The Teacher Value Added report displays value-added data across multiple years for the same subject and grade or course. As you review the report, you’ll want to ask these questions:

  • Looking at the Growth Index for the most recent year, were you effective at helping students to meet or exceed the Growth Standard?
  • If you have multiple years of data, are the Growth Index values consistent across years? Is there a positive or negative trend?
  • If there is a trend, what factors might have contributed to that trend?
  • Based on this information, what strategies and instructional practices will you replicate in the current school year? What strategies and instructional practices will you change or refine to increase your success in helping students make academic growth?

Yet my growth index values are not consistent across years, as also noted above. Rather, my “trends” are baffling to me.  When I compare those three instructional years in my mind, nothing stands out to me in terms of differences in instructional strategies that would explain the fluctuations in growth measures, either.

So let’s take a closer look at my data for last year (i.e., 2016-2017). I teach 7th grade English/language arts (ELA), so my numbers are based on my students’ grade 7 reading scores in the table below.

What jumps out for me here is the contradiction in “my” data for achievement Levels 3 and 4 (achievement levels start at Level 1 and top out at Level 5, with Levels 3 and 4 considered proficient/middle of the road). There is moderate evidence that my grade 7 students who scored a Level 4 on the state reading test exceeded the Growth Standard. But there is also moderate evidence that my same grade 7 students who scored Level 3 did not meet the Growth Standard. At the same time, the percentage of my students demonstrating proficiency on the same reading test (by scoring at least a Level 3) increased from 71% in 2015-2016 (when I exceeded expected growth) to 76% in 2016-2017 (when my growth declined significantly). This makes no sense, right?

Hence, and after considering my data above, the question I’m left with is actually really important:  Are the instructional strategies I’m using for my students whose achievement levels are in the middle working, or are they not?

I’d love to hear from other teachers on their interpretations of these data.  A tool that costs taxpayers this much money and impacts teacher evaluations in so many states should live up to its claims of being useful for informing our teaching.

More of Kane’s “Objective” Insights on Teacher Evaluation Measures

You might recall from a series of prior posts (see, for example, here, here, and here) the name of Thomas Kane, an economics professor at Harvard University who directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, and who also testified as an expert witness in two lawsuits (i.e., in New Mexico and Houston) opposite me (and, in the case of Houston, also opposite Jesse Rothstein).

He, along with Andrew Bacher-Hicks (PhD Candidate at Harvard), Mark Chin (PhD Candidate at Harvard), and Douglas Staiger (Economics Professor at Dartmouth College), just released yet another National Bureau of Economic Research (NBER) “working paper” (i.e., not peer-reviewed, and in this case not internally reviewed by NBER for public consumption and use either) titled “An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.” I review this study here.

Using Kane’s MET data, they test whether 66 mathematics teachers’ performance, as measured (1) via teachers’ student test achievement gains (i.e., calculated using value-added models (VAMs)), classroom observations, and student surveys, and (2) under naturally occurring (i.e., non-experimental) settings, “predicts performance following random assignment of that teacher to a class of students” (p. 2). More specifically, researchers “observed a sample of fourth- and fifth-grade mathematics teachers and collected [these] measures…[under normal conditions, and then in]…the third year…randomly assigned participating teachers to classrooms within their schools and then again collected all three measures” (p. 3).
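
For readers curious about the general logic of this kind of validation, here is a rough sketch of my own (this is not the authors’ code or data; the teacher count of 66 comes from the paper, but every other number and name below is hypothetical): the pre-period measure is used to predict teachers’ measured impacts after random assignment, and the prediction slope is then examined.

# Rough sketch (mine, not the authors') of the logic of a random-assignment validation:
# does a teacher's pre-period value-added estimate predict their measured impact once
# students are randomly assigned? All data below are simulated and hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 66                                              # sample size reported in the paper

true_effect = rng.normal(0.0, 0.15, n_teachers)              # hypothetical true teacher effects
pre_period_vam = true_effect + rng.normal(0.0, 0.10, n_teachers)          # noisy pre-period estimates
randomized_year_impact = true_effect + rng.normal(0.0, 0.10, n_teachers)  # impacts after random assignment

# Regress randomized-year impacts on the pre-period estimates. With raw (unshrunken)
# estimates, as simulated here, noise attenuates the slope below 1.
slope, intercept = np.polyfit(pre_period_vam, randomized_year_impact, 1)
print(f"prediction slope = {slope:.2f}")

In the actual forecast-bias literature, as I understand it, the pre-period estimates are typically shrunken (i.e., empirical Bayes) before this comparison is made, so that a prediction coefficient near 1, rather than an attenuated one, is the benchmark for calling the measure an unbiased forecast.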

They concluded that the test-based value-added measure “is a valid predictor of teacher impacts on student achievement following random assignment” (p. 28). This finding “is the latest in a series of studies” (p. 27) substantiating this not-surprising, as-oft-Kane-asserted finding, or, as he might assert it, fact. I should note here that no other studies substantiating this “latest in a series of studies” (p. 27) claim are referenced or cited, but a quick review of the 31 total references included in this report shows that 16/31 (52%) are studies conducted only by econometricians (i.e., not statisticians or other educational researchers) on this general topic, of which 10/16 (63%) are not peer-reviewed and 6/16 (38%) are either authored or co-authored by Kane (only 1/6 of which was published in a peer-reviewed journal). The other references cited pertain to the measurements used, the general methods used in this study, and four other articles written on the topic but not authored by econometricians. Needless to say, the slant in this piece is quite obvious and, unfortunately, not surprising; had it gone through any respectable vetting process, this sh/would have been caught and addressed prior to this study’s release.

I must add that this reminds me of Kane’s New Mexico testimony (see here) where he, again, “stressed that numerous studies [emphasis added] show[ed] that teachers [also] make a big impact on student success.” He stated this on the stand while expressly contradicting the findings of the American Statistical Association (ASA). While testifying, and again, he referenced only (non-representative) studies in his (or rather the defendants’) support, authored primarily by him (e.g., as per his MET studies) and some of his other econometric friends (e.g., Raj Chetty, Eric Hanushek, Doug Staiger), as also cited within this piece here. This was also a concern registered by the court, in terms of whether Kane’s expertise was that of a generalist (i.e., competent across the multi-disciplinary studies conducted on the matter) or a “selectivist” (i.e., biased in terms of which studies he selects, or rather acknowledges, for confirmation or inclusion). This is also certainly relevant, and should be taken into consideration here.

Otherwise, in this study the authors also found that the Mathematical Quality of Instruction (MQI) observational measure (one of the two observational measures they used in this study, the other being the Classroom Assessment Scoring System (CLASS)) was a valid predictor of teachers’ classroom observations following random assignment. The MQI also did “not seem to be biased by the unmeasured characteristics of students [a] teacher typically teaches” (p. 28). This expressly contradicts what is now an emerging set of studies evidencing the contrary, which are also not cited in this particular piece (see, for example, here, here, and here), some of which were also conducted using Kane’s MET data (see, for example, here and here).

Finally, the authors’ evidence on the predictive validity of student surveys was inconclusive.

Needless to say…

Citation: Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w23478