Recall the Chetty, Friedman, and Rockoff studies at focus of many posts on this blog in the past (see for example here, here, and here)? These studies were cited in President Obama’s 2012 State of the Union address. Since, they have been cited by every VAM proponent as the key set of studies to which others should defer, especially when advancing, or defending in court, the large- and small-scale educational policies bent on VAM-based accountability for educational reform.
In a newly released working, not-yet-peer-reviewed, National Bureau of Economic Research (NBER) paper, Chetty, Friedman, and Rockoff attempt to assess how “Using Lagged Outcomes to Evaluate Bias in Value-Added Models [VAMs]” might better address the amount of bias in VAM-based estimates due to the non-random assignment of students to teachers (a.k.a. sorting). Accordingly, Chetty et al. argue that the famous “Rothstein” falsification test (a.k.a. the Jesse Rothstein — Associate Professor of Economics at University of California – Berkeley — falsification test) that is oft-referenced/used to test for the presence of bias in VAM-based estimates might not be the most effective approach. This is the second time this set of researchers have argued with Rothstein about the merits of his falsification test (see prior posts about these debates here and here).
In short, at question is the extent to which teacher-level VAM-based estimates might be influenced by the groups of students a teacher is assigned to teach. If biased, the value-added estimates are said to be biased or markedly different from the actual parameter of interest the VAM is supposed to estimate, ideally, in an unbiased way. If bias is found, the VAM-based estimates should not be used in personnel evaluations, especially those associated with high-stakes consequences (e.g., merit pay, teacher termination). Hence, in order to test for the presence of the bias, Rothstein demonstrated that he could predict past outcomes of students with current teacher value-added estimates, which is impossible (i.e., the prediction of past outcomes). One would expect that past outcomes should not be related to current teacher effectiveness, so if the Rothstein falsification test proves otherwise, it indicates the presence of bias. Rothstein also demonstrated that this was (is still) the case with all conventional VAMs.
In their new study, however, Chetty et al. demonstrate that there might be another explanation regarding why Rothstein’s falsification test would reveal bias, even if there might not be bias in VAM estimates, and this bias is not caused by student sorting. Rather, the bias might result from different reasons, given the presence of what they term as dynamic sorting (i.e., there are common trends across grades and years, known as correlated shocks). Likewise, they argue, small sample sizes for a teacher, which are normally calculated as the number of students in a teacher’s class or on a teacher’s roster, also cause such bias. However, this problem cannot be solved even with the large scale data since the number of students per teacher remains the same, independent of the total number of students in any data set.
Chetty et al., then, using simulated data (i.e., generated with predetermined characteristics of teachers and students), demonstrate that even in the absence of bias, when dynamic sorting is not accounted for in a VAM, teacher-level VAM estimates will be correlated with lagged student outcomes that will still “reveal” said bias. However, they argue that the correlations observed will be due to noise rather than, again, the non-random sorting of students as claimed by Rothstein.
So, the bottom line is that bias exists, it just depends on whose side one might fall to claim from where it came.
Accordingly, Chetty et al. offer two potential solutions: (1) “We” develop VAMs that might account for dynamic sorting and be, thus, more robust to misspecification, or (2) “We” use experimental or quasi-experimental data to estimate the magnitude of such bias. This all, of course, assumes we should continue with our use of VAMs for said purposes, but given the academic histories of these authors, this is of no surprise.
Chetty et al. ultimately conclude that more research is needed on this matter, and that researchers should focus future studies on quantifying the bias that appears within and across any VAM, thus providing a potential threshold for an acceptable magnitude of bias, versus trying to prove its existence or lack thereof.
Thanks to ASU Assistant Professor of Education Economics, Margarita Pivovarova, for her review of this study