In January I wrote a post about “The Study that Keeps on Giving…” Specifically, this post was about the study conducted and authored by Raj Chetty (Economics Professor at Harvard), John Friedman (Assistant Professor of Public Policy at Harvard), and Jonah Rockoff (Associate Professor of Finance and Economics at Harvard) that was published first in 2011 (in its non-peer-reviewed and not even internally reviewed form) by the National Bureau of Economic Research (NBER) and then published again by NBER in January of 2014 (in the same form) but this time split into two separate studies (see them split here and here).
Their re-release of the same albeit split study was what prompted the title of the initial “The Study that Keeps on Giving…” post. Little did I know then, though, that the reason this study was re-released in split form was that it was soon to be published in a peer-reviewed journal. Its non-peer-reviewed publication status was a major source of prior criticism. While journal editors seemed to have suggested the split, NBER seemingly took advantage of this opportunity to publicize this study in two forms, regardless and without prior explanation.
Anyhow, this came to my attention when the study’s lead author – Raj Chetty – emailed me a few weeks ago, emailed Diane Ravitch on the same email, and also apparently emailed other study “critics” at the same time (see prior reviews of this study as per this study’s other notable “critics” here, here, here, and here) to notify all of us that this study made it through peer review and was to be published in a forthcoming issue of the American Economic Review. While Diane and I responded to our joint email (as other critics may have done as well), we ultimately promised Chetty that we would not share the actual contents of any of the approximately 20 email exchanges that went back and forth among the three of us over the following days.
What I can say, though, is that no genuine concern was expressed by Chetty or on behalf of his co-authors, in particular, about the intended or unintended consequences that came about as a result of his study, nor how many policymakers since used and abused study results for political gain and the further advancement of VAM-based policies. Instead, emails were more or less self-promotional and celebratory, especially given that President Obama cited the study in his 2012 State of the Union Address and that Chetty apparently continues to advise U.S. Secretary of Education Arne Duncan about his similar VAM-based policies. Perhaps, next, the Nobel prize committee might pay this study its due respects, now overdue, but again I only paraphrase from that which I inferred from these email conversations.
As a refresher, Chetty et al. conducted value-added analyses on a massive data set (with over 1 million student-level test and tax records) and presented (highly-questionable) evidence that favored teachers’ long-lasting, enduring, and in some cases miraculous effects. While some of the findings would have been very welcomed to the profession, had they indeed been true (e.g., high value-added teachers substantively affect students incomes in their adult years), the study’s authors overstated their findings, and they did not duly consider (or provide evidence to counter) the alternative hypotheses in terms of what other factors besides teachers might have caused the outcomes they observed (e.g., those things that happen outside of schools while students are in school and throughout students’ lives).
Nor did they consider, or rather satisfactorily consider, how the non-random assignment of students into both schools and classrooms might have biased the effects observed, whereas the students in high “value-added” teachers’ classrooms might have been more “likely to succeed” regardless of, or even despite the teacher effect, on both the short and long term effects demonstrated in their findings…then widely publicized via the media and beyond throughout other political spheres.
Rather, Chetty et al. advanced what they argued were a series of causal effects by exploiting a series of correlations that they turned attributional. They did this because (I believe) they truly believe that their sophisticated econometric models and the sophisticated controls and approaches they use in fact work as intended. Perhaps this also explains why Chetty et al. give pretty much all credit in the area of value-added research to econometricians, and they do this throughout their papers, all the while over-citing the works of their economic researchers/friends but not the others (besides Berkeley economist Jesse Rothstein, see the full reference to his study here) who have also outright contradicted their findings, with evidence. Apparently, educational researchers do not have much to add on this topic, but I digress.
But this is too a serious fault as “they” (and I don’t mean to make sweeping generalizations here) have never been much for understanding what goes into the data they analyze, as socially constructed and largely context dependent. Nor do they seem to care to fully understand the realities of the classrooms from which they receive such data, or what test scores actually mean, or when using them what one can and cannot actually infer. This, too, was made clear via our email exchange. It seems this from-the-sky-down view of educational data is the best (as well as the most convenient) approach that “they” might even expressly prefer, so that they do not have to get their data fingers dirty and deal with the messiness that always surrounds these types of educational data and always comes into play when conducting most any type of educational research that relies (in this case solely) on students’ large-scale standardized test scores.
Regardless, I decided to give this study yet another review to see if, now that this study has made it through the peer review process, I was missing something. I wasn’t. The studies are pretty much exactly the same as they were when first released (which unfortunately does not say much for peer review). The first study here is about VAM-based bias and how VAM estimates that control for students’ prior test scores “exhibit little bias despite the grouping of students” and despite the number of studies not referenced or cited that continue to evidence the opposite. The second study here is about teacher-level value-added and how teachers with a lot of it (purportedly) cause grander things throughout their students’ lives. More specifically, they found that “students [non-randomly] assigned to high [value-added] teachers are more likely to attend college, earn higher salaries, and are less likely to have children as teenagers.” They also found that “[r]eplacing a teacher whose [value-added] is in the bottom 5% with an average teacher would increase the present value of students’ lifetime income by approximately $250,000 per classroom [emphasis added].” Please note that this overstated figure is not per student; had it been broken out by student it would have rather become “chump change,” for the lack of a better term, which serves as one example of just one of their classic exaggerations. They do, however, when you read through the actual text, tone their powerful language down a bit to note that, on average, this is more accurately $185,000, still per classroom. Again, to read the more thorough critiques conducted by scholars with also impressive academic profiles, I suggest readers click here, here, here, or here.
What I did find important to bring to light during this round of review were the assumptions that, thanks to Chetty and his emails, were made more obvious (and likewise troublesome) than before. These are the, in some cases, “very strong” assumptions that Chetty et al. make explicit in both of their studies (see Assumptions 1-3 in the first and second papers). These are also the assumptions they make explicit, with “evidence” why they should not reject these assumptions (most likely, and in some cases clearly) because their study relied on such assumptions. The assumptions they made were so strong, in fact, at one point they even mention that it would be useful could they have “relaxed” some of the assumptions they made. In other cases, they justify their adoption of these assumptions given the data limitations and methodological issues they faced, plain and simply because there was no other way to conduct (or continue) their analyses without making and agreeing to these assumptions.
So, see if you agree with the following three assumptions they make most explicit and use throughout both studies (although other assumptions are littered throughout both pieces), yourselves. I would love for Chetty et al. to discuss whether their assumptions in fact hold given the realities the everyday teacher, school, or classroom face. But again, I digress…
Assumption 1 [Stationarity]: Teacher levels of value-added as based on growth in student achievement over time follows a stationary, unchanging, constant, and consistent process. On average, “teacher quality does not vary across calendar years and [rather] depends only on the amount of time that elapses between” years. While completely nonsensical to the average adult with really any commonsense, this assumption, to them, helped them “simplif[y] the estimation of teacher [value-added] by reducing the number of parameters” needed in their models, or more appropriately needed to model their effects.
Assumption 2 [Selection on Excluded Observables]: Students are sorted or assigned to teachers on excluded observables that can be estimated. See a recent study that I conducted with a doctoral student of mine (that was just published in this month’s issue of the highly esteemed American Educational Research Journal here) in which we found, with evidence, that 98% of the time this assumption is false. Students are non-randomly sorted on “observables” and “non-observables” (most of which are not and cannot be included in such data sets) 98% of the time; both types of variables bias teacher-level value-added over time given the statistical procedures meant to control for these variables do not work effectively well, especially for students in the extremes or on both sides of the normal bell curve. While convenient, especially when conducting this type of far-removed research, this assumption is false and cannot really be taken seriously given the pragmatic realities of schools.
Assumption 3 [Teacher Switching as a Quasi-Experiment]: Changes in teacher-level value-added scores across cohorts within a school-grade are orthogonal (i.e., non-overlapping, uncorrelated, or independent) with changes in other determinants of student scores. While Chetty et al. themselves write that this assumption “could potentially be violated by endogenous student or teacher sorting to schools over time,” they also state that “[s]tudent sorting at an annual frequency is minimal because of the costs of changing schools” which is yet another unchecked assumption without reference(s) in support. They further note that “[w]hile endogenous teacher sorting is plausible over long horizons, the high-frequency changes [they] analyze are likely driven by idiosyncratic shocks such as changes in staffing needs, maternity leaves, or the relocation of spouses.” These are all plausible assumptions too, right? Is “high-frequency teacher turnover…uncorrelated with student and school characteristics?” Concerns about this and really all of these assumptions, and ultimately how they impact study findings, should certainly cause pause.
My final point, interestingly enough, also came up during the email exchanges mentioned above. Chetty made the argument that he, more or less, had no dog in the fight surrounding value-added. In the first sentence of the first manuscript, however, he (and his colleagues) wrote, “Are teachers’ impacts on students’ test scores (“value-added”) a good measure of their quality?” The answer soon thereafter and repeatedly made known in both papers becomes an unequivocal “Yes.” Chetty et al. write in the first paper that “[they] established that value-added measures can help us [emphasis added as “us” is undefined] identify which teachers have the greatest ability to raise students’ test scores.” In the second paper, they write that “We find that teacher [value-added] has substantial impacts on a broad range of outcomes.” Apparently, Chetty wasn’t representing his and/or his colleagues’ honest “research-based” opinions and feelings about VAMs in one place (i.e., our emails) or the other (his publications) very well.
Contradictions…as nettlesome as those dirtly little assumptions I suppose.
Regarding your study: Oh how I wish I could be surprised that someone from the ADE didn’t know how classes are set up, but given how clueless Huppenthal is, as well as his 2 predecessors (at least), I guess I shouldn’t be. It’s been a long time since anyone truly qualified has held that office, nor staffed it with qualified people. Sad, since I am a product of a once decent AZ public education system K-12 and higher ed, and I miss what it once was.
My aunt taught in a very large district in Utah, and her class lists were randomly sorted by computer her last 4 years of teaching. She says those were her worst 4 years of teaching ever, and it wasn’t until after she retired and reflected on it that she realized that was probably the contributing factor. She felt like she was never able to get a harmonious community going, no matter what she tried. In my district up here in WA State, we carefully sort our students to create balance in a number of ways, including the number of parent volunteers, and it is a low-stress but collegial and thoughtful process. Most of our principals have little to do with the class lists other than double checking to make sure they are balanced. If Arne Duncan had succeeded in forcing us to change the one word in our teacher eval legislation from can (use standardized student test scores) to must, I dread to think how that would change the dynamics of setting up class lists. Way more stressful, much more unpleasant. Like we need more stress.
Glad to see a good example of mixed methods. I get much more meaning out of a study with both qualitative and quantitative elements. Typical elementary teacher, I suppose. We like our comments.
But one thing to ask:
How does anyone get tenure at HARVARD shoving a study into such tortured assumptions? I can almost hear the data crying in pain…