A colleague recently sent me a report, released in November 2016 by the Institute of Education Sciences (IES) of the U.S. Department of Education, that should be of interest to blog followers. The study, authored by affiliates of Mathematica Policy Research, is about “The content, predictive power, and potential bias in five widely used teacher observation instruments.”

Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies, researchers examined five widely used teacher observation instruments. These included the more generally popular Classroom Assessment Scoring System (CLASS) and Danielson Framework for Teaching (of main interest in this post), as well as three more subject-specific instruments: the Protocol for Language Arts Teaching Observations (PLATO), the Mathematical Quality of Instruction (MQI), and the UTeach Observational Protocol (UTOP) for science and mathematics teachers.

Researchers examined these instruments in terms of (1) what they measure (not of general interest in this post), (2) the relationships between observational output and teachers’ impacts on growth in student learning over time (as measured using a standard value-added model (VAM)), and (3) whether observational output is biased by the characteristics of the students non-randomly (or, in this study, randomly) assigned to teachers’ classrooms.

As per #2 above, researchers found that the instructional practices captured across these instruments *modestly* [emphasis added] correlate with teachers’ value-added scores, with adjusted (and likely artificially inflated; see Note 1 below) correlation coefficients between the observational and value-added indicators of 0.13 ≤ *r* ≤ 0.28 (see also Table 4, p. 10). As per the higher, *adjusted* *r* (emphasis added; see also Note 1 below), they found that these instruments’ classroom management dimensions correlated most strongly (*r* = 0.28) with teachers’ value-added.

Relatedly, also at issue here is that such correlations are not “modest,” but rather “weak” to “very weak” (see Note 2 below). While all correlation coefficients were statistically significant, this is much more likely due to the sample size used in this study than to the actual or practical magnitude of these results. “In sum,” this hardly supports the overall conclusion that “observation scores predict teachers’ value-added scores” (p. 11); although it should also be noted that this summary statement, in and of itself, suggests that the value-added score is the indicator around which all other “less objective” indicators are to revolve.

As per #3 above, researchers found that the characteristics of students randomly assigned to teachers’ classrooms (as per the MET data, although there were some noncompliance issues with the random assignment employed in the MET studies) do bias teachers’ observational scores, for better or worse, and more often in English language arts than in mathematics. More specifically, they found that for the Danielson Framework and CLASS (the two more generalized instruments examined in this study, also of main interest in this post), teachers with relatively more racial/ethnic minority and lower-achieving students (in that order, although these characteristics are themselves correlated) tended to receive lower observation scores. Bias was observed more often for the Danielson Framework than for the CLASS, but it was observed in both cases. An “alternative explanation [may be] that teachers are providing less-effective instruction to non-White or low-achieving students” (p. 14).

Notwithstanding, and in sum, in classrooms in which students were randomly assigned to teachers, teachers’ observational scores were biased by students’ group characteristics, which also means that bias is likely even more prevalent in classrooms to which students are non-randomly assigned (which is common practice). These findings are also akin to those found elsewhere (see, for example, two similar studies here), where bias was also evidenced in mathematics; the weaker evidence for mathematics here may be due to the random assignment present in this study. In other words, where non-random assignment of students into classrooms is the practice, a biasing influence may (likely) still exist in English language arts *and* mathematics.

The long and short of it, though, is that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts (see also here), especially when considering these systems’ (in)formative purposes. But to suggest that because these observational indicators (artificially) correlate with teachers’ value-added scores at “weak” and “very weak” levels (see Notes 1 and 2 below), these observational systems might “add” more “value” to the summative sides of teacher evaluations (i.e., their predictive value) is premature, not to mention a bit absurd. Adding import to this statement is the fact that, as duly noted in this study, these observational indicators are often-to-sometimes biased against teachers who teach lower-achieving and racial minority students, even when random assignment is present; such bias is likely worse when non-random assignment, which is very common, occurs.

Hence, and again, this does not make the case for the summative uses of either of these indicators or instruments, especially when high-stakes consequences are to be attached to output from either indicator (or both indicators together, given the “weak” to “very weak” relationships observed). On the plus side, though, remain the formative functions of the observational indicators.

*****

Note 1: Researchers used the “year-to-year variation in teachers’ value-added scores to produce an *adjusted correlation* [emphasis added] that may be interpreted as the correlation between teachers’ average observation dimension score and their underlying value added—the value added that is [not very] stable [or reliable] for a teacher over time, rather than a single-year measure (Kane & Staiger, 2012)” (p. 9). Neither this practice nor the statistic derived from it has been externally vetted. Likewise, it also likely yields a correlation coefficient that is falsely inflated. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in which Kane is serving as one of the defendants’ expert witnesses, testifying in support of his/this practice in both cases.
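While the report does not spell out the formula in this passage, an adjustment of this kind typically amounts to classical disattenuation: dividing the observed correlation by the square root of the measure’s year-to-year reliability. A minimal sketch, with hypothetical numbers I am assuming for illustration (not taken from the report), shows how this inflates the coefficient:

```python
# Illustration with assumed numbers (not from the report): "disattenuating"
# a correlation for unreliability in value-added inflates the coefficient.
import math

r_observed = 0.20        # assumed raw correlation (observation vs. single-year VAM)
vam_reliability = 0.50   # assumed year-to-year stability of value-added scores

# Classical disattenuation: divide by the square root of the reliability
r_adjusted = r_observed / math.sqrt(vam_reliability)
print(round(r_adjusted, 2))  # 0.28 -- larger than the raw 0.20
```

The lower the assumed reliability of value-added, the larger the upward adjustment, which is why an unvetted reliability estimate can drive an inflated coefficient.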

Note 2: As is common with social science research when interpreting correlation coefficients: 0.8 ≤ *r* ≤ 1.0 = a very strong correlation; 0.6 ≤ *r* ≤ 0.8 = a strong correlation; 0.4 ≤ *r* ≤ 0.6 = a moderate correlation; 0.2 ≤ *r* ≤ 0.4 = a weak correlation; and 0 ≤ *r* ≤ 0.2 = a very weak correlation, if any at all.
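For illustration, these conventional bands can be written as a simple lookup (my sketch, using the cutoffs above), which places the study’s adjusted 0.13–0.28 range squarely in the “very weak” to “weak” bands:

```python
# A simple lookup over the conventional social-science bands from Note 2.
def strength(r):
    r = abs(r)
    if r >= 0.8:
        return "very strong"
    if r >= 0.6:
        return "strong"
    if r >= 0.4:
        return "moderate"
    if r >= 0.2:
        return "weak"
    return "very weak"

# The study's adjusted coefficients span 0.13 to 0.28:
print(strength(0.13), strength(0.28))  # very weak weak
```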

*****

Citation: Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from https://ies.ed.gov/ncee/edlabs/regions/midatlantic/pdf/REL_2017191.pdf

“Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies…”

Anyone who has looked at these studies knows that the only reason for the data to be massaged, massaged, and massaged again to extract conclusions is this: There is not much there except quantity, easy to access and routinely taken for granted, as if valid, reliable, and conclusive results can be obtained by having yet another go at the analyses. I have seen more than one article in Educational Research trying to get more mileage from the MET studies. I wonder if Gates is financing these “play it again” studies. Did IES reviewers even look at the fact that the Danielson observations were made by watching videos that were made and selected by the teachers themselves, with the promise of video equipment for participation? Did anyone look at the number of rubrics in the Danielson protocol, the number of observations required, and the number of different observers likely to be required to produce reliable ratings?

“This practice or its statistic derived has not been externally vetted. Likewise, this also likely yields a correlation coefficient that is falsely inflated. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in which Kane is one of the defendants’ expert witnesses in both cases testifying in support of his/this practice.”

I hope that someday the MET studies are discredited, if necessary by lawsuits and some penalty to the marketers of VAM and “correlations” as if they were evidence of teacher effectiveness. Don’t get me started on the Ron Ferguson student surveys. Thanks for permission to vent.

We embrace your “venting,” Laura. Thanks for all of that you “add” to these discussions.

Hi Audrey–I posted this to twitter this morning and received the following response from Dylan Wiliam, one of the top experts on assessment in the UK:

“As I pointed out in ‘Leadership for Teacher Learning,’ you would need 11 years of data on a teacher to get a 0.9 reliable rating of quality.”

Leadership for Teacher Learning is his recent book: https://www.amazon.com/Leadership-Teacher-Learning-Creating-Teachers/dp/1941112269
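A figure like “11 years for a 0.9 reliable rating” is consistent with the Spearman-Brown prophecy formula for the reliability of an average of yearly scores. The single-year reliability of 0.45 below is my assumed value for illustration, not a number taken from Wiliam’s book:

```python
# Sketch (my illustration, not from Wiliam's book): Spearman-Brown
# prophecy formula for the reliability of an average of n yearly scores.
def reliability_of_average(single_year_reliability, n_years):
    r = single_year_reliability
    return n_years * r / (1 + (n_years - 1) * r)

# Assuming a single-year reliability of 0.45 (an assumed value),
# averaging 11 years of data reaches 0.9:
print(round(reliability_of_average(0.45, 11), 2))  # 0.9
```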

I don’t know if I would agree with that, actually. Current research says the best you are going to get (IF you don’t artificially inflate the numbers; see above) is no more than a 0.4 or 0.5 with three years of data. After that, the “added value” of more years plateaus. Hence, the current and research-based recommendation (to which many but not all adhere) is to use three years of data to make (still unreliable) decisions about teachers. When high stakes are attached, this is hardly acceptable.

David Berliner also replied to me via email re: this post this a.m. Here’s what he wrote: “…what you forgot to say, if you do more on this, is that if the construct underlying one measure of effective teaching and the construct underlying another measure of effective teaching (almost always) correlate at .3 or less, they only share 10% of the variance that these construct[s] measure. Thus, inarguably, one or both of these measures of teacher effectiveness cannot be measuring that construct.”
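The arithmetic behind Berliner’s point is just the squared correlation: two measures correlating at *r* = 0.3 share only *r*² = 0.09, i.e., roughly 10% of their variance:

```python
# Shared variance between two measures is the squared correlation.
r = 0.3
shared_variance = r ** 2
print(f"{shared_variance:.0%}")  # 9% -- roughly the 10% Berliner cites
```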

Indeed, I agree with this. Related, I had a conversation with a colleague who is an econometrician this morning about these correlations (i.e., correlations between observational scores and value-added estimates), and while we could not figure out a way to test our hypothesis, our hypothesis was as follows: If there is some bias present in value-added estimates, and some bias present in the observational estimates (see post above), perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

I ran a quick simulation in R and you can see how this would induce a correlation. Theoretically it is definitely the case. So with assumptions you could prove it but I thought a little simulation would help illustrate it. I also think just looking at scatterplots of these low correlations can be very helpful to people because they can see the people with high values on one measure and low on the other. I will send you the html file with my R notebook in case you want to look at it or post it.
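In the spirit of that simulation (a sketch in Python rather than R, with made-up effect sizes), two noisy measures that share only a common bias term end up correlated even though neither contains any true common signal:

```python
# Sketch with made-up effect sizes: "value-added" and "observation" scores
# share no true signal, only a common bias tied to classroom composition.
import random

random.seed(0)
n = 5000
scores = []
for _ in range(n):
    disadvantage = random.gauss(0, 1)               # classroom composition
    vam = random.gauss(0, 1) - 0.4 * disadvantage   # biased value-added
    obs = random.gauss(0, 1) - 0.4 * disadvantage   # biased observation
    scores.append((vam, obs))

# Pearson correlation computed by hand (no external libraries)
mean_v = sum(v for v, _ in scores) / n
mean_o = sum(o for _, o in scores) / n
cov = sum((v - mean_v) * (o - mean_o) for v, o in scores) / n
sd_v = (sum((v - mean_v) ** 2 for v, _ in scores) / n) ** 0.5
sd_o = (sum((o - mean_o) ** 2 for _, o in scores) / n) ** 0.5
r = cov / (sd_v * sd_o)
print(round(r, 2))  # close to the theoretical 0.16 / 1.16, about 0.14
```

With the assumed bias weight of 0.4, the theoretical correlation is 0.16/1.16 ≈ 0.14, right in the “very weak” range observed in the study, despite zero true overlap between the measures.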

I just read the whole paper, and the statistical methodology in this paper is terrible. They count statistically significant results and act as if that is meaningful. It is very important to realize that the difference between a statistically significant result and a non-statistically significant result is not necessarily itself statistically significant. If you look at Table C-7, you see all the different individual effect estimates and their standard errors. Note that the non-significant ones typically have larger standard errors. So their conclusions are incorrect. They may as well have just said that measures with less variability are statistically significantly different from zero; note that the confidence intervals for many of them would still come very close to zero as well, just doing a quick estimate of ±2 × SE. I really hope this paper is not used to make any high-stakes decisions in terms of which instruments to use.
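To make that concrete, here is a small example with made-up numbers (not taken from Table C-7): one estimate clears the significance bar and the other does not, yet the difference between the two is nowhere near significant:

```python
# Made-up estimates illustrating the point about counting significance.
import math

est_a, se_a = 0.30, 0.10   # z = 3.0 -> "statistically significant"
est_b, se_b = 0.25, 0.15   # z ~ 1.67 -> not significant (larger SE)

# But the difference between the two estimates is trivial:
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
print(round(diff / se_diff, 2))  # z for the difference: 0.28 -- far from significant
```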