More of Kane’s “Objective” Insights on Teacher Evaluation Measures

You might recall from a series of prior posts (see, for example, here, here, and here) the name of Thomas Kane, an economics professor at Harvard University who directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, and who also testified as an expert witness in two lawsuits (i.e., in New Mexico and Houston) opposite me (and, in the Houston case, also opposite Jesse Rothstein).

He, along with Andrew Bacher-Hicks (PhD Candidate at Harvard), Mark Chin (PhD Candidate at Harvard), and Douglas Staiger (Economics Professor at Dartmouth), just released yet another National Bureau of Economic Research (NBER) “working paper” (i.e., not peer-reviewed, and in this case not internally reviewed by NBER for public consumption and use either) titled “An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.” I review this study here.

Using Kane’s MET data, they test whether 66 mathematics teachers’ performance, as measured (1) by students’ test achievement gains (i.e., calculated using value-added models (VAMs)), classroom observations, and student surveys, and (2) under naturally occurring (i.e., non-experimental) settings, “predicts performance following random assignment of that teacher to a class of students” (p. 2). More specifically, researchers “observed a sample of fourth- and fifth-grade mathematics teachers and collected [these] measures…[under normal conditions, and then in]…the third year…randomly assigned participating teachers to classrooms within their schools and then again collected all three measures” (p. 3).

They concluded that “the test-based value-added measure—is a valid predictor of teacher impacts on student achievement following random assignment” (p. 28). This finding “is the latest in a series of studies” (p. 27) substantiating this not-surprising, as-oft-Kane-asserted finding, or as he might assert it, fact. I should note here that no other studies substantiating this “latest in a series of studies” (p. 27) claim are referenced or cited. A quick review of the 31 total references included in this report, however, reveals that 16/31 (52%) were conducted only by econometricians (i.e., not statisticians or other educational researchers) on this general topic, of which 10/16 (63%) are not peer reviewed and 6/16 (38%) are authored or co-authored by Kane (1/6 of which was published in a peer-reviewed journal). The other articles cited are about the measurements used or the general methods used in this study, plus four other articles written on the topic but not authored by econometricians. Needless to say, there is a clear and obvious slant in this piece, which is unfortunately not surprising; had it gone through any respectable vetting process, this sh/would have been caught and addressed prior to the study’s release.

I must add that this reminds me of Kane’s New Mexico testimony (see here) where he, again, “stressed that numerous studies [emphasis added] show[ed] that teachers [also] make a big impact on student success.” He stated this on the stand while expressly contradicting the findings of the American Statistical Association (ASA). While testifying otherwise, and again, he also referenced only (non-representative) studies in his (or rather the defendants’) support, authored primarily by him (e.g., as per his MET studies) and some of his econometric colleagues (e.g., Raj Chetty, Eric Hanushek, Doug Staiger), as also cited within this piece here. This was also a concern registered by the court, in terms of whether Kane’s expertise was that of a generalist (i.e., competent across the multi-disciplinary studies conducted on the matter) or a “selectivist” (i.e., biased in terms of his prejudice against, or rather his selectivity of, certain studies for confirmation, inclusion, or acknowledgment). This is certainly relevant here, and should be taken into consideration.

Otherwise, in this study the authors also found that the Mathematical Quality of Instruction (MQI) observational measure (one of the two observational measures used in this study, the other being the Classroom Assessment Scoring System (CLASS)) was a valid predictor of teachers’ classroom observations following random assignment. The MQI also did “not seem to be biased by the unmeasured characteristics of students [a] teacher typically teaches” (p. 28). This expressly contradicts what is now an emerging set of studies evidencing the contrary, also not cited in this particular piece (see, for example, here, here, and here), some of which were also conducted using Kane’s MET data (see, for example, here and here).

Finally, authors’ evidence on the predictive validity of student surveys was inconclusive.

Needless to say…

Citation: Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w23478

New Evidence that Developmental (and Formative) Approaches to Teacher Evaluation Systems Work

Susan Moore Johnson – Professor of Education at Harvard University and author of another important article regarding how value-added models (VAMs) oft-reinforce the walls of “egg-crate” schools (here) – recently published (along with two co-authors) an article in the esteemed, peer-reviewed Educational Evaluation and Policy Analysis. The article, titled “Investing in Development: Six High-Performing, High-Poverty Schools Implement the Massachusetts Teacher Evaluation Policy,” can be downloaded here (in its free, pre-publication form).

In this piece, as taken from the abstract, they “studied how six high-performing, high-poverty [and traditional, charter, under state supervision] schools in one large Massachusetts city implemented the state’s new teacher evaluation policy” (p. 383). They aimed to learn how these “successful” schools, with “success” defined by the state’s accountability ranking per school along with its “public reputation,” approached the state’s teacher evaluation system and its system components (e.g., classroom observations, follow-up feedback, and the construction and treatment of teachers’ summative evaluation ratings). They also investigated how educators within these schools “interacted to shape the character and impact of [the state’s] evaluation” (p. 384).

Akin to Moore Johnson’s aforementioned work, she and her colleagues argue that “to understand whether and how new teacher evaluation policies affect teachers and their work, we must investigate [the] day-to-day responses [of] those within the schools” (p. 384). Hence, they explored “how the educators in these schools interpreted and acted on the new state policy’s opportunities and requirements and, overall, whether they used evaluation to promote greater accountability, more opportunities for development, or both” (p. 384).

They found that “despite important differences among the six successful schools [they] studied (e.g., size, curriculum and pedagogy, student discipline codes), administrators responded to the state evaluation policy in remarkably similar ways, giving priority to the goal of development over accountability [emphasis added]” (p. 385). In addition, “[m]ost schools not only complied with the new regulations of the law but also went beyond them to provide teachers with more frequent observations, feedback, and support than the policy required. Teachers widely corroborated their principal’s reports that evaluation in their school was meant to improve their performance and they strongly endorsed that priority” (p. 385).

Overall, and accordingly, they concluded that “an evaluation policy focusing on teachers’ development can be effectively implemented in ways that serve the interests of schools, students, and teachers” (p. 402). This is especially true when (1) evaluation efforts are “well grounded in the observations, feedback, and support of a formative evaluation process;” (2) states rely on “capacity building in addition to mandates to promote effective implementation;” and (3) schools also benefit from spillover effects from other, positive, state-level policies (i.e., states do not take Draconian approaches to other educational policies) that, in these cases included policies permitting district discretion and control over staffing and administrative support (p. 402).

Related, such developmental and formatively focused teacher evaluation systems can work, they also conclude, when schools are led by highly effective principals who are free to select high-quality teachers. Their findings suggest that this “is probably the most important thing district officials can do to ensure that teacher evaluation will be a constructive, productive process” (p. 403). In sum, “as this study makes clear, policies that are intended to improve schooling depend on both administrators and teachers for their effective implementation” (p. 403).

Please note, however, that this study was conducted before districts in this state were required to incorporate standardized test scores to measure teachers’ effects (e.g., using VAMs); hence, the assertions and conclusions the authors set forth throughout this piece should be read with that important caveat in mind. Perhaps these findings should matter even more, though, in that they offer at least some proof that teacher evaluation works IF used for developmental and formative (versus, or perhaps in lieu of, summative) purposes.

Citation: Reinhorn, S. K., Moore Johnson, S., & Simon, N. S. (2017). Investing in development: Six high-performing, high-poverty schools implement the Massachusetts teacher evaluation policy. Educational Evaluation and Policy Analysis, 39(3), 383–406. doi:10.3102/0162373717690605 Retrieved from https://projectngt.gse.harvard.edu/files/gse-projectngt/files/eval_041916_unblinded.pdf

The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft, an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here), and another of his co-authors, Matthew P. Steinberg, an Assistant Professor of Education Policy at the University of Pennsylvania, just published another article in this same journal on “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and freely accessible, at least for now, article here; see also its original and what should be its enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matter, matter differently for different types of teachers within and across school districts and states, and matter also in that so often these weights are arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the stakes attached to these measures, also given the arbitrary and political constructions they represent, can be both professionally and personally consequential (i.e., career and life changing, respectively). How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).
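To see the mechanics at work, here is a minimal, purely illustrative sketch in Python (this is not Steinberg and Kraft’s analysis; the score distributions, the weights, and the composite cut score below are hypothetical choices of mine). Because observation scores tend to sit near the top of their scale while norm-referenced VAM scores center on their mean, shifting weight toward the VAM pulls composites down, and fewer teachers clear the same proficiency threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical teachers

# Norm-referenced VAM scores: centered near the middle of their distribution.
vam = rng.normal(loc=50, scale=10, size=n)                   # percentile-like scale
# Observation scores: negatively skewed, most teachers rated near the top.
obs = np.clip(rng.normal(loc=3.4, scale=0.4, size=n), 1, 4)  # 1-4 rubric

# Rescale both to 0-100 so the weights are comparable.
vam_s = (vam - vam.min()) / (vam.max() - vam.min()) * 100
obs_s = (obs - 1) / 3 * 100

threshold = 60  # hypothetical "proficient" cut score on the composite

for w_vam in (0.2, 0.5, 0.8):
    composite = w_vam * vam_s + (1 - w_vam) * obs_s
    pct_proficient = (composite >= threshold).mean() * 100
    print(f"VAM weight {w_vam:.0%}: {pct_proficient:.1f}% rated proficient")
```

Under these made-up numbers, the proficiency rate falls steadily as the VAM weight rises, even though nothing about the simulated teachers changes; only the weighting scheme does.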

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Thereafter, some states are celebrated over others (e.g., by the National Council on Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously when, as evidenced herein, this is (unfortunately) due more to manipulation than to true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It is not necessarily that teacher effects vary in reality; rather, the manipulation of the weights on the back end causes such variation and then leads to, quite literally, delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also raises a related and important issue of fairness, whereby equally effective teachers may be five-or-so times more likely (e.g., in states like New Mexico) to be rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).

Finally, it should also be noted as an important caveat here that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences” (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Related, please recall that via the MET studies one of the researchers’ goals was to determine which weights per multiple measure were empirically defensible. MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might say that this weight here or that weight there is empirically defensible should be asked for the evidence in support.

Citations:

Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Educational Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–396. doi:10.3102/0013189X17726752 Retrieved from http://journals.sagepub.com/doi/abs/10.3102/0013189X17726752

Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014, and just one day ago (October 10, 2017) came the case’s final federal settlement. Click here to read the “Settlement and Full and Final Release Agreement.” In short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of the EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state, or in this case district, shall deprive any person of life, liberty, or property without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here, p. 13). This was confirmed by one of the plaintiffs’ expert witnesses, who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here, p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here, p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in attorney and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others being evaluated using the EVAAS, or any other VAM that is similarly “unverifiable,” should take note, at minimum.

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest and thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written into Audrey’s post prior) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful, the points are colored by quintile of the simulated teachers’ value-added scores (i.e., the lowest 20%, the next 20%, and so on).

In this figure you can see, by looking at the blue “least squares line,” that “on average,” as a simulated teacher’s value-added estimate increases, the average of that teacher’s observational estimate increases. However, there is a lot of variability (or scatter) around the line. Given this variability, we can make statements about averages, such as that teachers in the top 20% for VAM scores will, on average, have higher observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in these scores. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R-squared). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to make statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) on VAM have a very large spread in their observational scores, with 95% of the scores in the top quintile falling between the 7th and 98th percentiles for their observational scores. So, if we observe a VAM score for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than that their observational score is likely to be somewhere in the top 90%. Similarly, if we observe a VAM score in the bottom 20%, we cannot say much more than that their observational score is likely to be somewhere in the bottom 90%. That is not saying a lot, in terms of precision or in terms of practice.
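For readers who want to reproduce the flavor of this first simulation, below is a rough analogue written in Python (not Dr. Kapitula’s actual R code; the seed and sample size are arbitrary, so the exact figures will differ slightly from those above).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
r = 0.28  # the correlation at issue

# Draw (VAM, observation) pairs from a bivariate normal with correlation r.
cov = [[1, r], [r, 1]]
vam, obs = rng.multivariate_normal([0, 0], cov, size=n).T

print("R-squared:", round(np.corrcoef(vam, obs)[0, 1] ** 2, 3))  # roughly 0.08

# Spread of observation-score percentiles for teachers in the top VAM quintile.
obs_pct = (obs.argsort().argsort() / n) * 100      # percentile rank of each obs score
top_q = vam >= np.quantile(vam, 0.8)
lo, hi = np.percentile(obs_pct[top_q], [2.5, 97.5])
print(f"95% of top-quintile VAM teachers fall between the "
      f"{lo:.0f}th and {hi:.0f}th observation percentiles")
```

With a true correlation of 0.28, the printed interval spans most of the observation-score distribution, which is the point: knowing a teacher is in the top VAM quintile narrows the prediction of their observational score only slightly.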

I ran the second simulation to test how bias that impacts only a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I simulated a situation where, again, there are two measures for a population of teachers: a teacher’s value-added score and a teacher’s observational score. I then inserted a group of teachers (as Audrey described) who represent 20% of the population and teach a disproportionate number of students who come from relatively lower socioeconomic, higher racial minority, etc. backgrounds; I assume this group is measured with negative bias on both indicators and has a moderate correlation between indicators of r = 0.50. For the other 80% of the population, the two measures are assumed to be uncorrelated. Note: for this demonstration I assume that this group includes 20% of the teachers in the population, that these teachers are measured with negative bias (by one standard deviation, on average) on both measures, and, again, that their within-group correlation is set at r = 0.50, with the correlation for the other 80% of teachers set at zero.

What you can see is that bias impacting only a certain group of teachers on the two indicators can, by itself, produce an observed correlation overall. In other words, a correlation present in just one group of teachers (i.e., in this case, the teachers scoring lowest on both their value-added and observational indicators) can be relatively stronger than the “weak” correlation observed on average or overall.
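A rough Python analogue of this second setup (again, not the original R code; the group size, the one-standard-deviation shift, and the within-group r = 0.50 simply mirror the assumptions stated above) is below. The overall correlation it prints emerges even though 80% of the simulated teachers have completely unrelated scores on the two measures.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
n_biased = int(0.2 * n)   # the 20% of teachers assumed to be measured with negative bias

# Biased group: shifted down one SD on both measures, within-group r = 0.50.
biased = rng.multivariate_normal([-1, -1], [[1, 0.5], [0.5, 1]], size=n_biased)

# Remaining 80%: the two measures are uncorrelated.
rest = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=n - n_biased)

vam, obs = np.vstack([biased, rest]).T
print("Overall correlation:", round(np.corrcoef(vam, obs)[0, 1], 2))
```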

Another possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quintiles on VAM have a different linear relationship with the observational score. For example, in the plot there is not a constant slope; rather, teachers in the first quintile on VAM are assumed to have a correlation of r = 0.50 with observational scores, teachers in the second quintile a correlation of r = 0.20, and teachers in the other quintiles no correlation at all. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e., a very small chance that a correlation of this size would be observed by random chance alone if the true correlation were zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in their mean observational scores depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not informative, as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson correlations (the standard correlation coefficient) measure the overall strength of linear relationships between X and Y, but if X and Y have a non-linear relationship (as illustrated above), this statistic can be very misleading.
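A rough Python analogue of this third, non-linear setup is below (again, not the original R code, and my piecewise construction will not reproduce her r = 0.24 exactly; it only reproduces the pattern of a small overall correlation driven mostly by the bottom quintiles).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
vam = rng.normal(size=n)

# Quintile-specific relationships: a slope of 0.5 in the bottom VAM quintile,
# 0.2 in the second quintile, and no relationship in the top three quintiles.
q1, q2 = np.quantile(vam, [0.2, 0.4])
slope = np.where(vam < q1, 0.5, np.where(vam < q2, 0.2, 0.0))
obs = slope * vam + rng.normal(scale=np.sqrt(1 - slope**2), size=n)

r = np.corrcoef(vam, obs)[0, 1]
print("Overall Pearson correlation:", round(r, 2))  # small, but "significant" with n this large
```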

Note also that for all of these simulations very small p-values are observed (e.g., p-values < 0.0000001), which, again, means these correlations are statistically significant, or that the probability of observing correlations this large by chance, if the true correlation were zero, is nearly 0%. What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between ‘Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).

At the end of the day r = 0.28 is still a “weak” correlation. In addition, it might be “weak,” on average, but much stronger and statistically and practically significant for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above) typically teaching the highest needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Whether the systems have actually been reformed is also up for debate, however, in that states are still using many of the same observational systems they were using before (i.e., not the “binary checklists” exaggerated in the original report as well as in this one, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output were combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft, an Assistant Professor of Education & Economics at Brown University, and Allison F. Gilmour, an Assistant Professor at Temple University (and former doctoral student at Vanderbilt University), revisited the original report. Just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except for in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play, here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently debated whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated as highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see Figure below, as per Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are also (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record about how this is also occurring in response to the “Widget Effect” report and the educational policies that followed).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of the study come from one district, their findings are similar to those evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and with using survey methods to keep these systems in check, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also of issue in this regard, in terms of the 1% issue, is (a) the time and effort it takes supervisors to assist/desist after rating teachers low, which sometimes makes assigning low ratings seem not worth it; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.

References:

Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from http://files.eric.ed.gov/fulltext/ED533382.pdf

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect. Education Digest, 75(2), 31–35.

Observational Systems: Correlations with Value-Added and Bias

A colleague recently sent me a report released in November of 2016 by the Institute of Education Sciences (IES) division of the U.S. Department of Education that should be of interest to blog followers. The study is about “The content, predictive power, and potential bias in five widely used teacher observation instruments” and is authored by affiliates of Mathematica Policy Research.

Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies, researchers examined five widely used teacher observation instruments. Instruments included the more generally popular Classroom Assessment Scoring System (CLASS) and Danielson Framework for Teaching (of general interest in this post), as well as the more subject-specific instruments including the Protocol for Language Arts Teaching Observations (PLATO), the Mathematical Quality of Instruction (MQI), and the UTeach Observational Protocol (UTOP) for science and mathematics teachers.

Researchers examined these instruments in terms of (1) what they measure (which is not of general interest in this post), but also (2) the relationships of observational output to teachers’ impacts on growth in student learning over time (as measured using a standard value-added model (VAM)), and (3) whether observational output are biased by the characteristics of the students non-randomly (or in this study randomly) assigned to teachers’ classrooms.

As per #2 above, researchers found that the instructional practices captured across these instruments modestly [emphasis added] correlate with teachers’ value-added scores, with an adjusted (and likely, artificially inflated; see Note 1 below) correlation coefficient between observational and value added indicators at: 0.13 ≤ r ≤ 0.28 (see also Table 4, p. 10). As per the higher, adjusted r (emphasis added; see also Note 1 below), they found that these instruments’ classroom management dimensions most strongly (r = 0.28) correlated with teachers’ value-added.

Related, also at issue here is that such correlations are not “modest,” but rather “weak” to “very weak” (see Note 2 below). While all correlation coefficients were statistically significant, this is much more likely due to the sample size used in this study versus the actual or practical magnitude of these results. “In sum” this hardly supports the overall conclusion that “observation scores predict teachers’ value-added scores” (p. 11); although, it should also be noted that this summary statement, in and of itself, suggests that the value-added score is the indicator around which all other “less objective” indicators are to revolve.

As per #3 above, researchers found that students randomly assigned to teachers’ classrooms (as per the MET data, although there were some noncompliance issues with the random assignment employed in the MET studies) do bias teachers’ observational scores, for better or worse, and more often in English language arts than in mathematics. More specifically, they found that for the Danielson Framework and CLASS (the two more generalized instruments examined in this study, also of main interest in this post), teachers with relatively more racial/ethnic minority and lower-achieving students (in that order, although these are correlated themselves) tended to receive lower observation scores. Bias was observed more often for the Danielson Framework than for the CLASS, but it was observed in both cases. An “alternative explanation [may be] that teachers are providing less-effective instruction to non-White or low-achieving students” (p. 14).

Notwithstanding, and in sum, in classrooms in which students were randomly assigned to teachers, teachers’ observational scores were biased by students’ group characteristics, which also means that bias is likely even more prevalent in classrooms to which students are non-randomly assigned (which is common practice). These findings are also akin to those found elsewhere (see, for example, two similar studies here), as this was also evidenced in mathematics, which may also be due to the random assignment factor present in this study. In other words, if non-random assignment of students into classrooms is the practice, a biasing influence may (likely) still exist in English language arts and mathematics.

The long and short of it, though, is that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts (see also here), especially when considering these systems’ (in)formative purposes. But to suggest that because these observational indicators (artificially) correlate with teachers’ value-added scores at “weak” and “very weak” levels (see Notes 1 and 2 below), these observational systems might “add” more “value” to the summative sides of teacher evaluations (i.e., their predictive value) is premature, not to mention a bit absurd. Adding import to this statement is the fact that, as duly noted in this study, these observational indicators are oft-to-sometimes biased against teachers who teach lower-achieving and racial minority students, even when random assignment is present, making such bias worse when non-random assignment, which is very common, occurs.

Hence, and again, this does not make the case for the summative uses of really either of these indicators or instruments, especially when high-stakes consequences are to be attached to output from either indicator (or both indicators together given the “weak” to “very weak” relationships observed). On the plus side, though, remain the formative functions of the observational indicators.

*****

Note 1: Researchers used the “year-to-year variation in teachers’ value-added scores to produce an adjusted correlation [emphasis added] that may be interpreted as the correlation between teachers’ average observation dimension score and their underlying value added—the value added that is [not very] stable [or reliable] for a teacher over time, rather than a single-year measure (Kane & Staiger, 2012)” (p. 9). This practice, and the statistic it yields, has not been externally vetted. Likewise, it also likely yields a correlation coefficient that is falsely inflated. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in both of which Kane is one of the defendants’ expert witnesses testifying in support of his/this practice.
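To illustrate why such an adjustment inflates the coefficient, and assuming it works like a standard disattenuation correction for the year-to-year unreliability of single-year value-added scores (my reading of Note 1, not a claim about the report’s exact computation), here is a tiny numerical sketch with hypothetical numbers:

```python
# Hypothetical illustration of disattenuation: dividing an observed correlation
# by the square root of an assumed reliability pushes the coefficient upward.
r_observed = 0.20    # hypothetical raw correlation between observation and single-year VAM
reliability = 0.50   # hypothetical year-to-year reliability of the VAM scores
r_adjusted = r_observed / reliability ** 0.5
print(round(r_adjusted, 2))  # 0.28 -- notably larger than the unadjusted 0.20
```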

Note 2: As is common with social science research when interpreting correlation coefficients: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.
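Encoded as a trivial helper (boundary values are assigned to the higher band here, which is simply my choice since the bands above share endpoints), the correlations reported in this study fall squarely in the bottom two bands:

```python
def describe_correlation(r: float) -> str:
    """Classify |r| using the rule-of-thumb bands in Note 2."""
    r = abs(r)
    if r >= 0.8:
        return "very strong"
    if r >= 0.6:
        return "strong"
    if r >= 0.4:
        return "moderate"
    if r >= 0.2:
        return "weak"
    return "very weak"

print(describe_correlation(0.28))  # weak
print(describe_correlation(0.13))  # very weak
```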

*****

Citation: Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from https://ies.ed.gov/ncee/edlabs/regions/midatlantic/pdf/REL_2017191.pdf

Breaking News: Another Big Victory in Court in Texas

Earlier today I released a post regarding “A Big Victory in Court in Houston,” in which I wrote about how, yesterday, US Magistrate Judge Smith ruled — in Houston Federation of Teachers et al. v. Houston Independent School District — that the Houston teacher plaintiffs have legitimate claims that the use (and abuse) of their Education Value-Added Assessment System (EVAAS) value-added scores in HISD violated their Fourteenth Amendment due process protections (i.e., no state, or in this case organization, shall deprive any person of life, liberty, or property without due process). Hence, on this charge, this case is officially going to trial.

Well, also yesterday, “we” won another court case on which I also served as an expert witness (I served as an expert witness on behalf of the plaintiffs alongside Jesse Rothstein in the court case noted above). As per this case — Texas State Teachers Association v. Texas Education Agency, Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas (although there were three similar cases also filed – see all four referenced below) — The Honorable Lora J. Livingston ruled that the Defendants are to make revisions to 19 Tex. Admin. Code § 150.1001 that most notably include the removal of (A) student learning objectives [SLOs], (B) student portfolios, (C) pre and post test results on district level assessments; or (D) value added data based on student state assessment results. In addition, “The rules do not restrict additional factors a school district may consider…,” and “Under the local appraisal system, there [will be] no required weighting for each measure…,” although districts can choose to weight whatever measures they might choose. “Districts can also adopt an appraisal system that does not provide a single, overall summative rating.” That is, increased local control.

If the Texas Education Agency (TEA) does not adopt the regulations put forth by the court by next October, this case will continue. This does not look likely, however, in that, as per a news article released today, here, Texas “Commissioner of Education Mike Morath…agreed to revise the [state’s] rules in exchange for the four [below] teacher groups’ suspending their legal challenges.” As noted prior, the terms of this settlement call for the removal of the above-mentioned, state-required, four growth measures when evaluating teachers.

This was also highlighted in a news article, released yesterday, here, with this one more generally about how teachers throughout Texas will no longer be evaluated using their students’ test scores, again, as required by the state.

At the crux of this case, as also highlighted in this particular piece, and to which I testified (quite extensively), was that the value-added measures formerly required/suggested by the state did not constitute teachers’ “observable,” job-related behaviors. See also a prior post about this case here.

*****

Cases Contributing to this Ruling:

1. Texas State Teachers Association v. Texas Education Agency, Mike Morath, in his Official Capacity as Commissioner of Education for the State of Texas; in the 345th Judicial District Court, Travis County, Texas

2. Texas Classroom Teachers Association v. Mike Morath, Texas Commissioner of Education; in the 419th Judicial District Court, Travis County, Texas

3. Texas American Federation of Teachers v. Mike Morath, Commissioner of Education, in his official capacity, and Texas Education Agency; in the 201st Judicial District Court, Travis County, Texas

4. Association of Texas Professional Educators v. Mike Morath, the Commissioner of Education and the Texas Education Agency; in the 200th District Court of Travis County, Texas.

Breaking News: A Big Victory in Court in Houston

Recall from multiple prior posts (see here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores, derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than anywhere else in the nation (e.g., the termination of 221 teachers in just one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014, and just yesterday United States Magistrate Judge Stephen Wm. Smith, in the United States District Court, Southern District of Texas, denied the district’s request for summary judgment given the plaintiffs’ due process claims. Put differently, Judge Smith ruled that the plaintiffs did have legitimate claims regarding how EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state, or in this case organization, shall deprive any person of life, liberty, or property without due process). Hence, on this charge, this case is officially going to trial.

This is a huge victory, and an unprecedented one that will likely set precedent, trial pending, for others, and more specifically for other teachers.

Of primary issue will be the following (as taken from Judge Smith’s Summary Judgment released yesterday): “Plaintiffs [will continue to] challenge the use of EVAAS under various aspects of the Fourteenth Amendment, including: (1) procedural due process, due to lack of sufficient information to meaningfully challenge terminations based on low EVAAS scores,” and given “due process is designed to foster government decision-making that is both fair and accurate.”

Related, and of most importance, as also taken directly from Judge Smith’s Summary, he wrote:

  • HISD’s value-added appraisal system poses a realistic threat to deprive plaintiffs of constitutionally protected property interests in employment.
  • HISD does not itself calculate the EVAAS score for any of its teachers. Instead, that task is delegated to its third party vendor, SAS. The scores are generated by complex algorithms, employing “sophisticated software and many layers of calculations.” SAS treats these algorithms and software as trade secrets, refusing to divulge them to either HISD or the teachers themselves. HISD has admitted that it does not itself verify or audit the EVAAS scores received from SAS, nor does it engage any contractor to do so. HISD further concedes that any effort by teachers to replicate their own scores, with the limited information available to them, will necessarily fail. This has been confirmed by plaintiffs’ expert, who was unable to replicate the scores despite being given far greater access to the underlying computer codes than is available to an individual teacher [emphasis added, as also related to a prior post about how SAS claimed that plaintiffs violated SAS’s protective order (protecting its trade secrets), that the court overruled, see here].
  • The EVAAS score might be erroneously calculated for any number of reasons, ranging from data-entry mistakes to glitches in the computer code itself. Algorithms are human creations, and subject to error like any other human endeavor. HISD has acknowledged that mistakes can occur in calculating a teacher’s EVAAS score; moreover, even when a mistake is found in a particular teacher’s score, it will not be promptly corrected. As HISD candidly explained in response to a frequently asked question, “Why can’t my value-added analysis be recalculated?”:
    • Once completed, any re-analysis can only occur at the system level. What this means is that if we change information for one teacher, we would have to re- run the analysis for the entire district, which has two effects: one, this would be very costly for the district, as the analysis itself would have to be paid for again; and two, this re-analysis has the potential to change all other teachers’ reports.
  • The remarkable thing about this passage is not simply that cost considerations trump accuracy in teacher evaluations, troubling as that might be. Of greater concern is the house-of-cards fragility of the EVAAS system, where the wrong score of a single teacher could alter the scores of every other teacher in the district. This interconnectivity means that the accuracy of one score hinges upon the accuracy of all. Thus, without access to data supporting all teacher scores, any teacher facing discharge for a low value-added score will necessarily be unable to verify that her own score is error-free.
  • HISD’s own discovery responses and witnesses concede that an HISD teacher is unable to verify or replicate his EVAAS score based on the limited information provided by HISD.
  • According to the unrebutted testimony of plaintiffs’ expert, without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious “black box,” impervious to challenge.
  • While conceding that a teacher’s EVAAS score cannot be independently verified, HISD argues that the Constitution does not require the ability to replicate EVAAS scores “down to the last decimal point.” But EVAAS scores are calculated to the second decimal place, so an error as small as one hundredth of a point could spell the difference between a positive or negative EVAAS effectiveness rating, with serious consequences for the affected teacher.

Hence, “When a public agency adopts a policy of making high stakes employment decisions based on secret algorithms incompatible with minimum due process, the proper remedy is to overturn the policy.”

Moreover, he wrote that all of this is part of the violation of teachers’ Fourteenth Amendment rights. Hence, he also wrote, “On this summary judgment record, HISD teachers have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.”

Otherwise, Judge Smith granted summary judgment to the district on the other claims forwarded by the plaintiffs, including plaintiffs’ equal protection claims. All of us involved in the case — recall that Jesse Rothstein and I served as the expert witnesses on behalf of the plaintiffs, and Thomas Kane of the Measures of Effective Teaching (MET) Project and John Friedman of the infamous Chetty et al. studies (see here and here) served as the expert witnesses on behalf of the defendants — knew that all of the plaintiffs’ claims would be tough to win, given that the constitutional legal standards would be difficult for plaintiffs to satisfy (e.g., proving that evaluating teachers using their value-added scores was “unreasonable” was difficult, as it also was in the Tennessee case we fought that was dismissed on similar grounds (see here)).

Nonetheless, that “we” survived on the due process claim is fantastic, especially as this is the first case like this of which we are aware across the country.

Here is the press release, released last night by the AFT:

May 4, 2017 – AFT, Houston Federation of Teachers Hail Court Ruling on Flawed Evaluation System

Statements by American Federation of Teachers President Randi Weingarten and Houston Federation of Teachers President Zeph Capo on the U.S. District Court decision on Houston’s Education Value-Added Assessment System (EVAAS), known elsewhere as VAM or value-added measures:

AFT President Randi Weingarten: “Houston developed an incomprehensible, unfair and secret algorithm to evaluate teachers that had no rational meaning. This is the algebraic formula: [the EVAAS formula displayed in the original release does not reproduce in plain text]

“U.S. Magistrate Judge Stephen Smith saw that it was seriously flawed and posed a threat to teachers’ employment rights; he rejected it. This is a huge victory for Houston teachers, their students and educators’ deeply held contention that VAM is a sham.

“The judge said teachers had no way to ensure that EVAAS was correctly calculating their performance score, nor was there a way to promptly correct a mistake. Judge Smith added that the proper remedy is to overturn the policy; we wholeheartedly agree. Teaching must be about helping kids develop the skills and knowledge they need to be prepared for college, career and life—not be about focusing on test scores for punitive purposes.”

HFT President Zeph Capo: “With this decision, Houston should wipe clean the record of every teacher who was negatively evaluated. From here on, teacher evaluation systems should be developed with educators to ensure that they are fair, transparent and help inform instruction, not be used as a punitive tool.”

The Tripod Student Survey Instrument: Its Factor Structure and Value-Added Correlations

The Tripod student perception survey instrument is a “research-based” instrument increasingly being used by states to add to their teacher evaluation systems based on “multiple measures.” While there are other instruments also in use, as well as student survey instruments being developed by states and local districts, this one in particular is gaining in popularity, also in that it was used throughout the Bill & Melinda Gates Foundation’s ($43 million worth of) Measures of Effective Teaching (MET) studies. A current estimate (as per the study discussed in this post) is that during the 2015–2016 school year approximately 1,400 schools purchased and administered the Tripod. See also a prior post (here) about this instrument, or more specifically about a chapter of a book about the instrument authored by the instrument’s developer and the lead researcher of the research surrounding it – Ronald Ferguson.

In a study recently released in the esteemed American Educational Research Journal (AERJ), and titled “What Can Student Perception Surveys Tell Us About Teaching? Empirically Testing the Underlying Structure of the Tripod Student Perception Survey,” researchers found that the Tripod’s factor structure did not “hold up.” That is, Tripod’s 7Cs (i.e., seven constructs including: Care, Confer, Captivate, Clarify, Consolidate, Challenge, Classroom Management; see more information about the 7Cs here) and the 36 items that are positioned within each of the 7Cs did not fit the 7C framework as theorized by instrument developer(s).

Rather, using the MET database (N=1,049 middle school math class sections; N=25,423 students), researchers found that an alternative bi-factor structure (i.e., two versus seven constructs) best fit the Tripod items: (1) a general responsivity dimension that includes (more or less) all items and is unrelated to (2) a classroom management dimension that governs responses to the items surrounding teachers’ classroom management. Researchers were unable to distinguish seven separate dimensions across the items.
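
For readers curious about what a two- versus seven-factor structure means in practice, below is a minimal, purely illustrative sketch in Python. It does not reproduce the authors’ confirmatory bi-factor analyses of the actual MET data; it simply simulates 36 survey items driven by a general factor plus a classroom-management-specific factor, and then checks that an exploratory two-factor solution recovers that split. All loadings, sample sizes, and item assignments below are made up.

```python
# Not the authors' analysis: a simulated, exploratory stand-in for the idea
# that two factors (general responsivity + classroom management) underlie
# the 36 Tripod items. Every quantity here is hypothetical.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)
n_students, n_items = 25_000, 36
mgmt_items = np.arange(30, 36)  # pretend the last six items tap classroom management

general = rng.normal(size=(n_students, 1))  # general "responsivity" factor
mgmt = rng.normal(size=(n_students, 1))     # classroom-management factor

loadings_general = np.full(n_items, 0.7)
loadings_mgmt = np.zeros(n_items)
loadings_mgmt[mgmt_items] = 0.6

items = (general * loadings_general
         + mgmt * loadings_mgmt
         + rng.normal(scale=0.5, size=(n_students, n_items)))

# An exploratory two-factor solution should separate the management items
# from the rest, echoing the bi-factor result reported in the article.
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
fa.fit(items)

print(np.round(fa.components_[:, :4], 2))    # loadings for a few general items
print(np.round(fa.components_[:, 30:34], 2)) # loadings for a few management items
```

Again, the published study used confirmatory bi-factor models and formal model-fit comparisons, not an exploratory analysis like this one; the sketch is only meant to convey the intuition behind the two-dimensional result.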

Researchers also found that the two alternative factors noted — general responsivity and classroom management — were positively associated with teacher value-added scores. More specifically, results suggested that these two factors were positively and statistically significantly associated with teachers’ value-added measures based on state mathematics tests (standardized coefficients were .25 and .25, respectively), although for undisclosed reasons, results apparently suggested nothing about these two factors’ (cor)relationships with value-added estimates based on state English/language arts (ELA) tests. As per authors’ findings in the area of mathematics, prior researchers have also found low to moderate agreement between teacher ratings and student perception ratings; hence, this particular finding simply adds another source of convergent evidence.
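
To unpack what a standardized coefficient of .25 means here: with both the survey factor score and the value-added estimate expressed in standard-deviation units, a coefficient of .25 implies that teachers one standard deviation higher on the factor score are estimated to be about a quarter of a standard deviation higher in value-added. The toy sketch below (simulated data only, with a hypothetical factor score; it is not the study’s model) makes that arithmetic concrete.

```python
# Hypothetical illustration of a standardized coefficient of ~.25.
# Simulated data; this does not reproduce the study's models.

import numpy as np

rng = np.random.default_rng(1)
n_teachers = 1_000

factor_score = rng.normal(size=n_teachers)  # e.g., a "general responsivity" score
value_added = 0.25 * factor_score + rng.normal(scale=np.sqrt(1 - 0.25**2), size=n_teachers)

z = lambda x: (x - x.mean()) / x.std()
slope = np.polyfit(z(factor_score), z(value_added), 1)[0]

# With both variables z-scored, the regression slope equals the correlation:
# a one-SD increase in the factor corresponds to ~0.25 SD higher value-added.
print(round(slope, 2))
```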

Authors do give multiple reasons and plausible explanations as to why they found what they did, which you all can read in more depth via the full article, linked to above and fully cited below. Authors also note that “It is unclear whether the original 7Cs that describe the Tripod instrument were intended to capture seven distinct dimensions on which students can reliably discriminate among teachers or whether the 7Cs were merely intended to be more heuristic domains that map out important aspects of teaching” (p. 1859); hence, this is also important to keep in mind given study findings.

As per study authors, and to their knowledge, “this study [was] the first to systematically investigate the multidimensionality of the Tripod student perception survey” (p. 1863).

Citation: Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. doi:10.3102/0002831216671864. Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0002831216671864