More of Kane’s “Objective” Insights on Teacher Evaluation Measures

You might recall from a series of prior posts (see, for example, here, here, and here) the name of Thomas Kane, an economics professor at Harvard University who directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, and who also testified as an expert witness in two lawsuits (i.e., in New Mexico and Houston) opposite me (and, in the Houston case, also opposite Jesse Rothstein).

He, along with Andrew Bacher-Hicks (PhD candidate at Harvard), Mark Chin (PhD candidate at Harvard), and Douglas Staiger (economics professor at Dartmouth), just released yet another National Bureau of Economic Research (NBER) “working paper” (i.e., not peer-reviewed, and in this case not internally reviewed by NBER for public consumption and use either) titled “An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.” I review this study here.

Using Kane’s MET data, they test whether 66 mathematics teachers’ performance, as measured (1) via teachers’ students’ test achievement gains (i.e., calculated using value-added models (VAMs)), classroom observations, and student surveys, and (2) under naturally occurring (i.e., non-experimental) settings, “predicts performance following random assignment of that teacher to a class of students” (p. 2). More specifically, researchers “observed a sample of fourth- and fifth-grade mathematics teachers and collected [these] measures…[under normal conditions, and then in]…the third year…randomly assigned participating teachers to classrooms within their schools and then again collected all three measures” (p. 3).

They concluded that “the test-based value-added measure—is a valid predictor of teacher impacts on student achievement following random assignment” (p. 28). This finding “is the latest in a series of studies” (p. 27) substantiating this not-surprising, as-oft-Kane-asserted finding, or as he might assert it, fact. I should note here that no other studies substantiating the “latest in a series of studies” (p. 27) claim are referenced or cited, but a quick review of the 31 total references included in this report reveals that 16/31 (52%) were conducted by econometricians alone (i.e., not statisticians or other educational researchers) on this general topic, of which 10/16 (63%) are not peer-reviewed and 6/16 (38%) are authored or co-authored by Kane (only 1/6 published in a peer-reviewed journal). The other articles cited pertain to the measurements used, the general methods used in this study, and four other articles written on the topic not authored by econometricians. Needless to say, there is a clear and unfortunately unsurprising slant in this piece; had it gone through any respectable vetting process, this sh/would have been caught and addressed prior to this study’s release.

I must add that this reminds me of Kane’s New Mexico testimony (see here) where he, again, “stressed that numerous studies [emphasis added] show[ed] that teachers [also] make a big impact on student success.” He stated this on the stand while expressly contradicting the findings of the American Statistical Association (ASA). While testifying, and again, he referenced only (non-representative) studies in his (or rather the defendants’) support, authored primarily by him (e.g., as per his MET studies) and some of his econometric friends (e.g., Raj Chetty, Eric Hanushek, Doug Staiger), as also cited within this piece here. This was also a concern registered by the court, in terms of whether Kane’s expertise was that of a generalist (i.e., competent across multi-disciplinary studies conducted on the matter) or a “selectivist” (i.e., biased in terms of his prejudice against, or rather selectivity of, certain studies for confirmation, inclusion, or acknowledgment). This is certainly relevant and should be taken into consideration here.

Otherwise, in this study the authors also found that the Mathematical Quality of Instruction (MQI) observational measure (one of two observational measures they used in this study, the other being the Classroom Assessment Scoring System (CLASS)) was a valid predictor of teachers’ classroom observations following random assignment. The MQI also did “not seem to be biased by the unmeasured characteristics of students [a] teacher typically teaches” (p. 28). This expressly contradicts what is now an emerging set of studies evidencing the contrary, also not cited in this particular piece (see, for example, here, here, and here), some of which were also conducted using Kane’s MET data (see, for example, here and here).

Finally, authors’ evidence on the predictive validity of student surveys was inconclusive.

Needless to say…

Citation: Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w23478

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest, and I thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written into Audrey’s post prior) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful, the points are colored based on quintiles of the simulated teachers’ value-added scores (i.e., the lowest 20%, the next 20%, etc.).

In this figure you can see, by looking at the blue “least squares line,” that “on average,” as a simulated teacher’s value-added estimate increases, the average of that teacher’s observational estimate increases as well. However, there is a lot of variability (or scatter) around the line. Given this variability, we can make statements about averages, such as that teachers in the top 20% for VAM scores will, on average, likely have higher observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores accounts for only about 8% of the variation in the observational scores. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R squared). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to make statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) on VAM have a very large spread in their observational scores, with 95% of the scores in the top quintile falling between the 7th and 98th percentiles for their observational scores. So, if we observe a VAM score for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than that their observational score is likely to be in the top 90%. Similarly, if we observe a VAM score in the bottom 20%, we cannot say much more than that their observational score is likely to be somewhere in the bottom 90%. That is not saying a lot, in terms of precision or in terms of practice.
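Laura’s simulations were run in R; as a rough companion sketch, the same first scenario can be set up in Python (the sample size of 1,000 simulated teachers and the seed are my assumptions, not hers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # assumed number of simulated teachers
true_r = 0.28     # the "weak" correlation at issue

# Draw (VAM, observation) pairs from a bivariate normal with correlation 0.28
cov = [[1.0, true_r], [true_r, 1.0]]
vam, obs = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

r_hat = np.corrcoef(vam, obs)[0, 1]
print(f"sample r = {r_hat:.2f}")
print(f"variance explained (R squared) = {r_hat**2:.1%}")
```

Squaring r is all it takes to see the point being made here: a “weak” r near 0.28 leaves roughly 92% of the variation in one measure unexplained by the other.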

I ran the second scatterplot to test how bias that impacts only a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I simulated a situation where, again, there are two values present for each teacher in a population: a teacher’s value-added score and a teacher’s observational score. I then inserted a group of teachers (as Audrey described) who represent 20% of the population and teach a disproportionate number of students from relatively lower socioeconomic, higher racial minority, etc. backgrounds. I assume this group is measured with negative bias (by one standard deviation, on average) on both indicators and has a moderate within-group correlation between indicators of r = 0.50, while the other 80% of the population is assumed to be uncorrelated on the two indicators.

What you can see is that bias impacting only a certain group on the two indicators can, by itself, produce an observed correlation overall. In other words, a strong correlation in just one group of teachers (here, the teachers scoring lowest on both their value-added and observational indicators) can be much stronger than, and can drive, the “weak” correlation observed on average or overall.
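A hedged Python sketch of this second scenario (again with an assumed population of 1,000 simulated teachers; only the 20% / one-standard-deviation / r = 0.50 structure comes from the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
n_biased = n // 5  # the 20% assumed to be measured with negative bias

# 80% of teachers: the two indicators are uncorrelated
unbiased = rng.multivariate_normal([0.0, 0.0],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   size=n - n_biased)
# 20% of teachers: one SD lower on both indicators, within-group r = 0.50
biased = rng.multivariate_normal([-1.0, -1.0],
                                 [[1.0, 0.5], [0.5, 1.0]],
                                 size=n_biased)

vam, obs = np.vstack([unbiased, biased]).T
r_overall = np.corrcoef(vam, obs)[0, 1]
print(f"overall r = {r_overall:.2f}")
```

With these assumptions the overall correlation comes out around 0.2: a positive, “weak” overall correlation emerges even though the indicators are uncorrelated for 80% of the population.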

Another possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quintiles on VAM have different linear relationships with the observational score. For example, in the plot there is not a constant slope; rather, teachers in the first (lowest) quintile on VAM are assumed to have a correlation of r = 0.50 with observational scores, teachers in the second quintile a correlation of r = 0.20, and the other quintiles are assumed to be uncorrelated. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e., a very small chance that a correlation of this size would be observed by random chance alone if the true correlation were zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in mean observational scores depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not informative, as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson correlations (the standard correlation coefficient) measure the overall strength of linear relationships between X and Y, but if X and Y have a non-linear relationship (as illustrated above), this statistic can be very misleading.
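This third scenario can also be sketched in Python. The construction below (standardizing VAM within each quintile before adding noise) is my own assumed way of inducing the stated within-quintile correlations; it reproduces the qualitative result (a small positive overall correlation driven almost entirely by the bottom quintiles) though not necessarily the exact r = 0.24 from the R simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
vam = rng.normal(size=n)

# Assumed within-quintile correlations, bottom quintile to top
r_by_quintile = [0.50, 0.20, 0.0, 0.0, 0.0]

# Assign each simulated teacher a quintile by VAM rank
cuts = np.quantile(vam, [0.2, 0.4, 0.6, 0.8])
quintile = np.searchsorted(cuts, vam)

# Build observational scores with the target correlation inside each quintile
obs = np.empty(n)
for q, r_q in enumerate(r_by_quintile):
    idx = quintile == q
    z = (vam[idx] - vam[idx].mean()) / vam[idx].std()
    obs[idx] = r_q * z + np.sqrt(1.0 - r_q**2) * rng.normal(size=idx.sum())

r_overall = np.corrcoef(vam, obs)[0, 1]
r_bottom = np.corrcoef(vam[quintile == 0], obs[quintile == 0])[0, 1]
print(f"overall r = {r_overall:.2f}, bottom-quintile r = {r_bottom:.2f}")
```

The bottom quintile alone carries a moderate relationship while the overall Pearson correlation stays small, which is exactly the misclassification hazard described above.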

Note also that for all of these simulations very small p-values are observed (e.g., p-values < 0.0000001), which, again, means these correlations are statistically significant, or that the probability of observing correlations this large by chance, if the true correlation were zero, is nearly 0%. What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between ‘Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion of these topics, if interested).
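The gap between statistical and practical significance is easy to check by hand: the standard test of a zero correlation uses t = r·√((n − 2)/(1 − r²)), and for the sample sizes at issue a normal approximation to the p-value suffices. The helper function below is illustrative, not from the post:

```python
import math

def approx_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: true correlation = 0, via the t statistic
    t = r * sqrt((n - 2) / (1 - r^2)) and a normal approximation to its
    null distribution (adequate for large n)."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))  # 2 * (1 - Phi(|t|))

# The same "weak" r = 0.28 goes from non-significant to wildly significant
# as the sample grows, while still explaining only ~8% of the variance
for n in (30, 300, 3000):
    print(n, f"p = {approx_p_value(0.28, n):.2e}")
```

In other words, a tiny p-value says a lot about the sample size and very little about whether the relationship is strong enough to act on.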

At the end of the day r = 0.28 is still a “weak” correlation. In addition, it might be “weak,” on average, but much stronger and statistically and practically significant for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above) typically teaching the highest needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned as stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind, or generate some of your own (see, for example, here), to dive deeper into what these correlations might mean and how significant they might really be.

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

Observational Systems: Correlations with Value-Added and Bias

A colleague recently sent me a report released in November of 2016 by the Institute of Education Sciences (IES) division of the U.S. Department of Education that should be of interest to blog followers. The study is about “The content, predictive power, and potential bias in five widely used teacher observation instruments” and is authored by affiliates of Mathematica Policy Research.

Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies, researchers examined five widely used teacher observation instruments. Instruments included the more generally popular Classroom Assessment Scoring System (CLASS) and Danielson Framework for Teaching (of general interest in this post), as well as the more subject-specific instruments including the Protocol for Language Arts Teaching Observations (PLATO), the Mathematical Quality of Instruction (MQI), and the UTeach Observational Protocol (UTOP) for science and mathematics teachers.

Researchers examined these instruments in terms of (1) what they measure (which is not of general interest in this post), but also (2) the relationships of observational output to teachers’ impacts on growth in student learning over time (as measured using a standard value-added model (VAM)), and (3) whether observational output is biased by the characteristics of the students non-randomly (or, in this study, randomly) assigned to teachers’ classrooms.

As per #2 above, researchers found that the instructional practices captured across these instruments modestly [emphasis added] correlate with teachers’ value-added scores, with adjusted (and likely artificially inflated; see Note 1 below) correlation coefficients between observational and value-added indicators of 0.13 ≤ r ≤ 0.28 (see also Table 4, p. 10). As per the higher, adjusted r (emphasis added; see also Note 1 below), they found that these instruments’ classroom management dimensions most strongly (r = 0.28) correlated with teachers’ value-added.

Relatedly, also at issue here is that such correlations are not “modest,” but rather “weak” to “very weak” (see Note 2 below). While all correlation coefficients were statistically significant, this is much more likely due to the sample size used in this study than to the actual or practical magnitude of these results. “In sum,” this hardly supports the overall conclusion that “observation scores predict teachers’ value-added scores” (p. 11); although it should also be noted that this summary statement, in and of itself, suggests that the value-added score is the indicator around which all other “less objective” indicators are to revolve.

As per #3 above, researchers found that students randomly assigned to teachers’ classrooms (as per the MET data, although there were some noncompliance issues with the random assignment employed in the MET studies) do bias teachers’ observational scores, for better or worse, and more often in English language arts than in mathematics. More specifically, they found that for the Danielson Framework and the CLASS (the two more generalized instruments examined in this study, also of main interest in this post), teachers with relatively more racial/ethnic minority and lower-achieving students (in that order, although these are themselves correlated) tended to receive lower observation scores. Bias was observed more often for the Danielson Framework than for the CLASS, but it was observed in both cases. An “alternative explanation [may be] that teachers are providing less-effective instruction to non-White or low-achieving students” (p. 14).

Notwithstanding, and in sum, in classrooms in which students were randomly assigned to teachers, teachers’ observational scores were biased by students’ group characteristics, which also means that bias is likely even more prevalent in classrooms to which students are non-randomly assigned (which is common practice). These findings are also akin to those found elsewhere (see, for example, two similar studies here), as this was also evidenced in mathematics, which may be due to the random assignment factor present in this study. In other words, where non-random assignment of students into classrooms is the practice, a biasing influence may (likely) still exist in both English language arts and mathematics.

The long and short of it, though, is that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts (see also here), especially when considering these systems’ (in)formative purposes. But to suggest that because these observational indicators (artificially) correlate with teachers’ value-added scores at “weak” and “very weak” levels (see Notes 1 and 2 below), these observational systems might “add” more “value” to the summative sides of teacher evaluations (i.e., their predictive value) is premature, not to mention a bit absurd. Adding import to this statement is the fact that, as duly noted in this study, these observational indicators are oft-to-sometimes biased against teachers who teach lower-achieving and racial minority students, even when random assignment is present, making such bias worse when non-random assignment, which is very common, occurs.

Hence, and again, this does not make the case for the summative uses of really either of these indicators or instruments, especially when high-stakes consequences are to be attached to output from either indicator (or both indicators together given the “weak” to “very weak” relationships observed). On the plus side, though, remain the formative functions of the observational indicators.

*****

Note 1: Researchers used the “year-to-year variation in teachers’ value-added scores to produce an adjusted correlation [emphasis added] that may be interpreted as the correlation between teachers’ average observation dimension score and their underlying value added—the value added that is [not very] stable [or reliable] for a teacher over time, rather than a single-year measure (Kane & Staiger, 2012)” (p. 9). This practice, and the statistic derived from it, has not been externally vetted, and it likely yields a falsely inflated correlation coefficient. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in which Kane is one of the defendants’ expert witnesses, testifying in support of his/this practice in both cases.

Note 2: As is common with social science research when interpreting correlation coefficients: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.
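For readers who prefer these bands in code form, they translate into a trivial helper function (the handling of exact boundary values, e.g., assigning r = 0.8 to “very strong,” is my own choice, since the ranges above overlap at their endpoints):

```python
def correlation_strength(r: float) -> str:
    """Classify |r| using the social-science bands given in Note 2."""
    bands = [(0.8, "very strong"), (0.6, "strong"),
             (0.4, "moderate"), (0.2, "weak")]
    magnitude = abs(r)
    for cutoff, label in bands:
        if magnitude >= cutoff:
            return label
    return "very weak, if any at all"

print(correlation_strength(0.28))  # the study's highest adjusted r -> "weak"
print(correlation_strength(0.13))  # the study's lowest adjusted r -> "very weak"
```

Applied to the study’s own range of 0.13 ≤ r ≤ 0.28, every reported coefficient lands in the “weak” or “very weak” band, not the “modest” one.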

*****

Citation: Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from https://ies.ed.gov/ncee/edlabs/regions/midatlantic/pdf/REL_2017191.pdf

New Article Published on Using Value-Added Data to Evaluate Teacher Education Programs

A former colleague, a current PhD student, and I just had an article released about using (or rather not using) value-added data to evaluate teacher education/preparation programs in higher education. The article is titled “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added,’” and the abstract of the article is included below.

If there is anyone out there who might be interested in this topic, please note that the journal in which this piece was published (online first and to be published in its paper version later) – Teaching Education – has made the article free for its first 50 visitors. Hence, I thought I’d share this with you all first.

If you’re interested, do access the full piece here.

Happy reading…and here’s the abstract:

In this study researchers examined the effectiveness of one of the largest teacher education programs, located within one of the largest research-intensive universities in the US. They did this using a value-added model, as per current federal educational policy imperatives, to assess the measurable effects of teacher education programs on their teacher graduates’ students’ learning and achievement as compared to other teacher education programs. Correlational and group comparisons revealed little to no relationship between value-added scores and teacher education program, regardless of subject area or position on the value-added scale. These findings are discussed within the context of several very important data and methodological challenges researchers also made transparent, as likely common across many efforts to evaluate teacher education programs using value-added approaches. Such transparency and clarity might assist in the creation of more informed value-added practices (and more informed educational policies) surrounding teacher education accountability.

Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers from racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of those relatively lower evaluation scores.

The study, authored by the Regional Educational Laboratory (REL) and sponsored by the U.S. Department of Education (Institute of Education Sciences), is titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” and can be found here; a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

Authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the same district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise; it is also possibly prejudicial.

While study authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington, DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Ohio Rejects Subpar VAM, for Another VAM Arguably Less Subpar?

From a prior post coming from Ohio (see here), you may recall that Ohio state legislators recently introduced a bill to review its state’s value-added model (VAM), especially as it pertains to the state’s use of their VAM (i.e., the Education Value-Added Assessment System (EVAAS); see more information about the use of this model in Ohio here).

As per an article published last week in The Columbus Dispatch, the Ohio Department of Education (ODE) apparently rejected a proposal made by the state’s pro-charter-school Ohio Coalition for Quality Education and the state’s largest online charter school, both of which wanted to supplement (or replace) the state’s VAM with another, unnamed “Similar Students” measure (which could be the Student Growth Percentiles model discussed prior on this blog, for example, here, here, and here) used in California.

The ODE charged that this measure “would lower expectations for students with different backgrounds, such as those in poverty,” which is not a common criticism of this model (if I have the model correct), nor is it a common criticism of the model they already have in place. In fact, and again if I have the model correct, these are really the only two models that do not statistically control for potentially biasing factors (e.g., student demographic and other background factors) when calculating teachers’ value-added; hence, what they charge about this model may in actuality be no different from what they are already doing. Accordingly, statements like that made by Chris Woolard, senior executive director of the ODE, are false: “At the end of the day, our system right now has high expectations for all students. This (California model) violates that basic principle that we want all students to be able to succeed.”

The models, again if I am correct, are very much the same. While the California measure might indeed consider “student demographics such as poverty, mobility, disability and limited-English learners,” this model (if I am correct on the model) does not statistically factor these variables out. If anything, the state’s EVAAS system does, even though EVAAS modelers claim otherwise, by statistically controlling for students’ prior performance, which (unfortunately) already has these demographics built in. In essence, they are already doing the same thing they now protest.

Indeed, as per a statement made by Ron Adler, president of the Ohio Coalition for Quality Education, not only is it “disappointing that ODE spends so much time denying that poverty and mobility of students impedes their ability to generate academic performance…they [continue to] remain absolutely silent about the state’s broken report card and continually defend their value-added model that offers no transparency and creates wild swings for schools across Ohio” (i.e., the EVAAS system, although in all fairness all VAMs and SGPs yield the “wild swings” noted). See, for example, here.

What might be worse, though, is that the ODE apparently found that, depending on the variables used in the California model, it produced different results. Guess what! All VAMs, depending on the variables used, produce different results. In fact, using the same data and different VAMs for the same teachers at the same time also produces (in some cases grossly) different results. The bottom line here is that anyone who thinks any VAM is yielding estimates from which valid or “true” statements can be made is fooling themselves.

Bias in Teacher Observations, As Well

Following a post last month titled “New Empirical Evidence: Students’ ‘Persistent Economic Disadvantage’ More Likely to Bias Value-Added Estimates,” Matt Barnum — senior staff writer for The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown and covering America’s education system “in crisis” (see, also, a prior post about The 74 here) — followed up with a tweet via Twitter. He wrote: “Yes, though [bias caused by economic disadvantage] likely applies with equal or even more force to other measures of teacher quality, like observations.” I replied via Twitter that I disagreed with this statement in that I was unaware of research in support of his assertion, and Barnum sent me two articles to review thereafter.

I attempted to review both of these articles herein, although I quickly realized that I had actually read and reviewed the first (2014) piece on this blog (see the original post here; see also a 2014 Brookings Institution article summarizing the piece here). In short, in this study researchers found that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts, especially for (in)formative purposes. However, researchers found that observational bias also exists, akin to value-added bias, in that teachers non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students entering their classrooms with lower levels of prior achievement. Researchers concluded that because districts “do not have processes in place to address the possible biases in observational scores,” statistical adjustments might be made to offset said bias, as might external observers/raters be brought in to yield more “objective” observational assessments of teachers.

For the second study, and for this post, I gave the piece a more thorough read (you can find the full study, pre-publication, here). Using data from the Measures of Effective Teaching (MET) Project, in which random assignment was used (or, more accurately, attempted), researchers also explored the extent to which the students enrolled in teachers’ classrooms influence classroom observational scores.

They found, primarily, that:

  1. “[T]he context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ performance” as measured by teacher observations. More specifically, “ELA [English/language arts] teachers were more than twice as likely to be rated in the top performance quintile if [nearly randomly] assigned the highest achieving students compared with teachers assigned the lowest achieving students,” and “math teachers were more than 6 times as likely.” In addition, “approximately half of the teachers—48% in ELA and 54% in math—were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores.”
  2. “[T]he intentional sorting of teachers to students has a significant influence on measured performance” as well. More specifically, results further suggest that “higher performing students [are, at least sometimes] endogenously sorted into the classes of higher performing teachers…Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would reveal more effective teacher performance, as measured by classroom observation scores, than may actually be true.”

So, the non-random assignment of teachers biases both the value-added and observational components written into America’s now “more objective” teacher evaluation systems, as (formerly) required of all states complying with federal initiatives and incentives (e.g., Race to the Top). In addition, when those responsible for assigning students to classrooms (sub)consciously favor teachers with high prior observational scores, the issues are exacerbated. This is especially important when observational (and value-added) data are used for high-stakes accountability, in that the data yielded via both measurement systems may be less likely to reflect “true” teaching effectiveness due to “true” bias. “Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality,” and vice versa.

Citation Study #1: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf

Citation Study #2: Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293-317. doi:10.3102/0162373715616249. Retrieved from http://static.politico.com/58/5f/f14b2b144846a9b3365b8f2b0897/study-of-classroom-observations-of-teachers.pdf


New Empirical Evidence: Students’ “Persistent Economic Disadvantage” More Likely to Bias Value-Added Estimates

The National Bureau of Economic Research (NBER) recently released a circulated but not yet internally or externally reviewed study titled “The Gap within the Gap: Using Longitudinal Data to Understand Income Differences in Student Achievement.” Note that we have covered NBER studies such as this one on this blog before; in all fairness, then, and as I have noted before, this paper should be critically consumed, as should my interpretations of the authors’ findings.

Nevertheless, this study is authored by Katherine Michelmore — Assistant Professor of Public Administration and International Affairs at Syracuse University, and Susan Dynarski — Professor of Public Policy, Education, and Economics at the University of Michigan, and this study is entirely relevant to value-added models (VAMs). Hence, below I cover their key highlights and takeaways, as I see them. I should note up front, however, that the authors did not directly examine how the new measure of economic disadvantage that they introduce (see below) actually affects calculations of teacher-level value-added. Rather, they motivate their analyses by saying that calculating teacher value-added is one application of their analyses.

The background to their study is as follows: “Gaps in educational achievement between high- and low-income children are growing” (p. 1), but the data used to capture “high- and low-income” in Michigan (i.e., the state in which their study took place), and in many if not most other states throughout the US, capture “income” demographics in very rudimentary, blunt, and often binary ways (i.e., “yes” for students who are eligible to receive federally funded free-or-reduced-price lunches and “no” for those who are ineligible).

Consequently, in this study the authors “leverage[d] the longitudinal structure of these data sets to develop a new measure of persistent economic disadvantage” (p. 1), all the while defining “persistent economic disadvantage” by the extent to which students were “eligible for subsidized meals in every grade since kindergarten” (p. 8). Students “who [were] never eligible for subsidized meals during those grades [were] defined as never [being economically] disadvantaged” (p. 8), and students who were eligible for subsidized meals for variable years were defined as “transitorily disadvantaged” (p. 8). This all runs counter, however, to the binary codes typically used, again, across the nation.
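Under the authors’ definitions, the classification logic is straightforward. Here is a minimal sketch (function and variable names are mine, for illustration, not the authors’):

```python
def classify_disadvantage(eligibility_by_grade):
    """Classify a student by subsidized-meal eligibility history.

    eligibility_by_grade: list of booleans, one per grade since
    kindergarten (True = eligible for subsidized meals that grade).
    """
    if all(eligibility_by_grade):
        return "persistently disadvantaged"   # eligible in every grade
    if not any(eligibility_by_grade):
        return "never disadvantaged"          # eligible in no grade
    return "transitorily disadvantaged"       # eligible in some grades

# A K-8 record (9 grades) for each case:
print(classify_disadvantage([True] * 9))
print(classify_disadvantage([False] * 9))
print(classify_disadvantage([True, False, True, True, False, True, True, True, True]))
```

The contrast with the usual binary flag is the whole point: under a binary code, the first and third students above would be indistinguishable in any given year.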

Appropriately, then, their goal (among other things) was to see how the new measure they constructed to better capture “persistent economic disadvantage” might help when calculating teacher-level value-added. They accordingly argue (among other things) that not accounting for persistent disadvantage might yield value-added estimates that are more biased “against teachers of [and perhaps schools educating] persistently disadvantaged children” (p. 3). This, of course, also depends on how persistently disadvantaged students are (non)randomly assigned to teachers.

Statistics like the following, also reported in their study — “Students [in Michigan] [persistently] disadvantaged by 8th grade were six times more likely to be black and four times more likely to be Hispanic, compared to those who were never disadvantaged” — speak volumes, not only to the importance of their findings for educational policy, but also for the teachers and schools still being evaluated using value-added scores, and for the researchers investigating, criticizing, promoting, or even trying to make these models better (if that is possible). In short, teachers who disproportionately teach persistently disadvantaged students in urban areas might realize relatively more biased value-added estimates as a result.

For value-added purposes, then, it is clear that the assumption that student disadvantage can be controlled for using such basal indicators of current economic disadvantage is overly simplistic, and relying on test scores alone to account for economic disadvantage (i.e., as promoted in most versions of the Education Value-Added Assessment System (EVAAS)) is likely worse. More specifically, the assumption that economic disadvantage does not impact some students more than others over time, or over the period of data being used to capture value-added (typically 3-5 years of students’ test score data), is also highly suspect. The finding “[t]hat children who are persistently disadvantaged perform worse than those who are disadvantaged in only some grades” (p. 14) also violates another fundamental assumption: that teachers’ effects are consistent over time for similar students who learn at more or less consistent rates, regardless of these and other demographics.

The bottom line, then, is that the indicator that should be used in place of our current proxies for economic disadvantage is the number of grades students spend in economic disadvantage. If the value-added indicator does not effectively account for the “negative, nearly linear relationship between [students’ test] scores and the number of grades spent in economic disadvantage” (p. 18), while controlling for other student demographics and school fixed effects, value-added estimates will likely be (even) more biased against the teachers who teach these students.
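A toy simulation can show why the count of grades spent in disadvantage explains score variation that a binary flag misses. All coefficients and variable names below are invented for illustration; this is not the authors’ data or model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated students: grades (0-8) spent eligible for subsidized meals,
# with a negative, roughly linear effect on current test scores (mirroring
# the pattern the paper reports; the -1.5 coefficient is made up).
grades_disadv = rng.integers(0, 9, n)
prior_score = rng.normal(50, 10, n)
score = 0.7 * prior_score - 1.5 * grades_disadv + rng.normal(0, 5, n)

# Naive control: a binary "currently/ever disadvantaged" flag only.
binary_flag = (grades_disadv > 0).astype(float)
X_naive = np.column_stack([np.ones(n), prior_score, binary_flag])
# Richer control: the count of grades spent in disadvantage.
X_rich = np.column_stack([np.ones(n), prior_score, grades_disadv])

beta_naive, *_ = np.linalg.lstsq(X_naive, score, rcond=None)
beta_rich, *_ = np.linalg.lstsq(X_rich, score, rcond=None)

resid_naive = score - X_naive @ beta_naive
resid_rich = score - X_rich @ beta_rich
print("residual variance, binary flag:", resid_naive.var())
print("residual variance, grade count:", resid_rich.var())
```

The leftover (unexplained) variance under the binary flag is what a VAM would then misattribute to teachers, disproportionately to those teaching the most persistently disadvantaged students.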

Otherwise, teachers who teach students with persistent economic disadvantage will likely have it worse (i.e., in terms of bias) than teachers who teach students with current economic disadvantage; teachers who teach students with economic disadvantage in their current or past histories will have it worse than teachers who teach students without (m)any prior economic disadvantages; and so on.

Citation: Michelmore, K., & Dynarski, S. (2016). The gap within the gap: Using longitudinal data to understand income differences in student achievement. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w22474

Using VAMs “In Not Very Intelligent Ways:” A Q&A with Jesse Rothstein

The American Prospect — a self-described “liberal intelligence” magazine — last week featured a question-and-answer, interview-based article with Jesse Rothstein — Professor of Economics at the University of California, Berkeley — on “The Economic Consequences of Denying Teachers Tenure.” Rothstein is a great choice for this one in that he is indeed an economist, but one of the few, really, who is deep into the research literature and who, accordingly, holds a balanced set of research-based beliefs about value-added models (VAMs), their current uses in America’s public schools, and what they can and cannot do (theoretically) to support school reform. He is probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms biases (or distorts) value-added estimations, pretty much regardless of the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here, and a follow-up to this study here.

In this article, though, the interviewer — Rachel Cohen — interviews Jesse primarily about how in California a higher court recently reversed the Vergara v. California decision that would have weakened teacher employment protections throughout the state (see also here). “In 2014, in Vergara v. California, a Los Angeles County Superior Court judge ruled that a variety of teacher job protections worked together to violate students’ constitutional right to an equal education. This past spring, in a 3–0 decision, the California Court of Appeals threw this ruling out.”

Here are the highlights in my opinion, by question and answer, although there is much more information in the full article here:

Cohen: “Your research suggests that even if we got rid of teacher tenure, principals still wouldn’t fire many teachers. Why?”

Rothstein: “It’s basically because in most cases, there’s just not actually a long list of [qualified] people lining up to take the jobs; there’s a shortage of qualified teachers to hire. If you deny tenure to someone, that creates a new job opening. But if you’re not confident you’ll be able to fill it with someone else, that doesn’t make you any better off. Lots of schools recognize it makes more sense to keep the teacher employed, and incentivize them with tenure… I’ve studied this, and it’s basically economics 101. There is evidence that you get more people interested in teaching when the job is better, and there is evidence that firing teachers reduces the attractiveness of the job.”

Cohen: “Aren’t most teachers pretty bad their first year? Are we denying them a fair shot if we make tenure decisions so soon?”

Rothstein: “Even if they’re struggling, you can usually tell if things will turn out to be okay. There is quite a bit of evidence for someone to look at.”

Cohen: “Value-added models (VAM) played a significant role in the Vergara trial. You’ve done a lot of research on these tools. Can you explain what they are?”

Rothstein: “[The] value-added model is a statistical tool that tries to use student test scores to come up with estimates of teacher effectiveness. The idea is that if we define teacher effectiveness as the impact that teachers have on student test scores, then we can use statistics to try to then tell us which teachers are good and bad. VAM played an odd role in the trial. The plaintiffs were arguing that now, with VAM, we have these new reliable measures of teacher effectiveness, so we should use them much more aggressively, and we should throw out the job statutes. It was a little weird that the judge took it all at face value in his decision.”

Cohen: “When did VAM become popular?”

Rothstein: “I would say it became a big deal late in the [George W.] Bush administration. That’s partly because we had new databases that we hadn’t had previously, so it was possible to estimate on a large scale. It was also partly because computers had gotten better. And then VAM got a huge push from the Obama administration.”

Cohen: “So you’re skeptical of VAM.”

Rothstein: “I think the metrics are not as good as the plaintiffs made them out to be. There are bias issues, among others.”

Cohen: “During the Vergara trials you testified against some of Harvard economist Raj Chetty’s VAM research, and the two of you have been going back and forth ever since. Can you describe what you two are arguing about?”

Rothstein: “Raj’s testimony at the trial was very focused on his work regarding teacher VAM. After the trial, I really dug in to understand his work, and I probed into some of his assumptions, and found that they didn’t really hold up. So while he was arguing that VAM showed unbiased results, and VAM results tell you a lot about a teacher’s long-term outcomes, I concluded that what his approach really showed was that value-added scores are moderately biased, and that they don’t really tell us one way or another about a teacher’s long-term outcomes” (see more about this debate here).

Cohen: “Could VAM be improved?”

Rothstein: “It may be that there is a way to use VAM to make a better system than we have now, but we haven’t yet figured out how to do that. Our first attempts have been trying to use them in not very intelligent ways.”

Cohen: “It’s been two years since the Vergara trial. Do you think anything’s changed?”

Rothstein: “I guess in general there’s been a little bit of a political walk-back from the push for VAM. And this retreat is not necessarily tied to the research evidence; sometimes these things just happen. But I’m not sure the trial court opinion would have come out the same if it were held today.”

Again, see more from this interview, also about teacher evaluation systems in general, job protections, and the like in the full article here.

Citation: Cohen, R. M. (2016, August 4). Q&A: The economic consequences of denying teachers tenure. The American Prospect. Retrieved from http://prospect.org/article/qa-economic-consequences-denying-teachers-tenure