A little over a week ago I published a post about some “Economists Declar[ing] Victory for VAMs,” in response to an article titled “The Science Of Grading Teachers Gets High Marks,” written by Andrew Flowers, the “quantitative editor” of the economics site fivethirtyeight.com.
Valerie Strauss, author of the Answer Sheet section of The Washington Post, was apparently also busy finding out more about Flowers’ piece, and his take, at the same time. She communicated with Flowers via email, after which she communicated with me via email to help her/us respond to Flowers’ claims. These email exchanges, and more, were just published in her Answer Sheet section of The Washington Post here.
For those of you interested in reading the whole thing, do click here. For those of you interested in just the email exchanges, as a follow-up to my previous post here, I’ve pasted the highlights of the conversation for you all below…with compliments to Valerie for including what she viewed as the key points for discussion/thought.
From her post:
I asked Audrey Amrein-Beardsley, a former middle- and high-school mathematics teacher who is now associate professor in Arizona State University’s Mary Lou Fulton Teachers College and a VAM researcher, about the FiveThirtyEight blog post and e-mail comments by Flowers. She earned a Ph.D. in 2002 from Arizona State University in the Division of Educational Leadership and Policy Studies with an emphasis on research methods. She had already written about Flowers’ blog post on her VAMBoozled! blog, which you can see here.
Here are her comments on what Flowers wrote to me in the e-mail. Some of them are technical, as any discussion about formulas would be:
Flowers: “The piece I wrote that was recently published by FiveThirtyEight was focused on a specific type of value-added model (VAM) — the one developed by Chetty, Friedman and Rockoff (CFR). In my reading of the literature on VAMs, including the American Statistical Association’s (ASA) statement, I felt it fair to characterize the CFR research as cutting-edge.”
Amrein-Beardsley: There is no such thing as a “cutting-edge” VAM. Just because Chetty had access to millions of data observations does not make his actual VAM more sophisticated than any of those in use otherwise or in other ways. The fact of the matter is that all states have essentially the same school-level data (i.e., very similar test scores by students over time, links to teachers, and a series of typically dichotomous/binary variables meant to capture things like special education status, English language status, free-and-reduced lunch eligibility, etc.). These latter variables are the ones used, or not used depending on the model, for VAM-based analyses. While Chetty used these data and also had access to other demographic data (e.g., IRS data, correlated with other demographic data as well) to supplement the data from NYC schools, the data, whether dichotomous or continuous (the latter a step in the right direction), still cannot and do not capture all of the things we know from the research that influence student learning, achievement, and more specifically growth in achievement in schools. These are the unquantifiable/uncontrollable variables that (will likely forever) continue to distort the measurement of teachers’ causal effects, and that cannot be captured using IRS data alone. For example, unless Chetty had data to capture teachers’ residual effects (from prior years), out-of-school learning, parental impacts on learning or a lack thereof, summer learning and decay, etc., it is virtually impossible, no matter how sophisticated any model or dataset is, to make such causal claims. Yes, such demographic variables are correlated with, for example, family income, [but] they are not correlated to the extent that they can remove systematic error from the model.
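To make concrete what such models do and do not control for, here is a minimal sketch of a generic covariate-adjustment VAM (the notation is mine, purely for illustration, and is not CFR’s exact specification):

\[ A_{it} = \beta A_{i,t-1} + \gamma' X_{it} + \mu_{j(i,t)} + \varepsilon_{it} \]

where \(A_{it}\) is student i’s test score in year t, \(X_{it}\) is the vector of (typically dichotomous) demographic controls, \(\mu_{j(i,t)}\) is the estimated effect of student i’s teacher j, and \(\varepsilon_{it}\) is the error term. The unquantifiable influences listed above (prior teachers’ residual effects, out-of-school learning, summer decay) all land in \(\varepsilon_{it}\), and to the extent that they are correlated with how students are assigned to classrooms, they bias the estimate of \(\mu\), no matter how rich the demographic data.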
Accordingly, Chetty’s model is no more sophisticated or “cutting-edge” than any other. There are probably, now, five-plus models being used today (i.e., the EVAAS, the Value-Added Research Center (VARC) model, the RAND Corporation model, the American Institutes for Research (AIR) model, and the Student Growth Percentiles (SGP) model). All of them except for the SGP have been developed by economists, and they are likely just as sophisticated in their design (1) given minor tweaks to model specifications and (2) given various data limitations and restrictions. In fact, the EVAAS, because it’s been around for over twenty years (in use in Tennessee since 1993, and in years of development prior), is probably considered the best and most sophisticated of all VAMs, and because it’s now run by the SAS analytics software corporation, I (and likely many other VAM researchers) would likely put our money down on that model any day over Chetty’s model, if both had access to the same dataset. Chetty might even agree with this assertion, although he would disagree with the EVAAS’s (typical) lack of controls for student background variables/demographics — a point of contention that has been debated, now, for years, with research evidence supporting both approaches; hence, the intense debates about VAM-based bias, now also going on for years.
—
Flowers: “So, because the CFR research is so advanced, much of the ASA’s [American Statistical Association’s] critique does not apply to it. In its statement, the ASA says VAMs “generally… not directly measure potential teacher contributions toward other student outcomes” (emphasis added). Well, this CFR work I profiled is the exception — it explicitly controls for student demographic variables (by using millions of IRS records linked to their parents). And, as I’ll explain below, the ASA statement’s point that VAMs are only capturing correlation, not causation, also does not apply to the CFR model (in my view). The ASA statement is still smart, though. I’m not dismissing it. I just thought — given how superb the CFR research was — that it wasn’t really directed at the paper I covered.”
Amrein-Beardsley: This is based on the false assumption, addressed above, that Chetty’s model is “so advanced” or “cutting edge,” or, as now written here, “superb.” When you appropriately remove or reject this assumption, the ASA’s critique applies to Chetty’s model along with the rest of them. Should we not give the ASA credit for taking all models into consideration when it wrote its statement, especially as the statement came well after Chetty’s model had hit the public? Would the ASA not have written, somewhere, that its critique applies to all models “except for” the one used by Chetty et al. if it too agreed this one was exempt from its critiques? This singular statement is absurd in and of itself, as is the statement that Flowers isn’t “dismissing it.” I’m sure the ASA would be thrilled to hear it. More specifically, the majority of models “explicitly control for student demographics” — Chetty’s model is far from the only one (see the first response above; again, this is one of the most contentious issues going). Given this, and the above, it is true that all “VAMs are only capturing correlation, not causation,” and all VAMs are doing this at a mediocre level of quality. The true challenge, should Chetty take it on, would be to put his model up against the other VAMs mentioned above, using the same NYC school-level dataset, and prove to the public that his model is so “cutting-edge” that it does not suffer from the serious issues with reliability, validity, bias, etc. with which all other modelers are contending. Perhaps Flowers’ main problem in this piece is that he conflated model sophistication with dataset quality; the former is likely no better (or worse) than any of the others.
Lastly, as for what “wasn’t really directed at the paper [Flowers] covered”…let’s talk about the 20+ years of research we have on VAMs that Flowers dismissed, implicitly, in that it was not written by economists, whereas Jesse Rothstein was positioned as the only respected critic of VAMs. My best estimate, and I’ll stick with it today, is that approximately 90 percent of all value-added researchers, including econometricians and statisticians alike, have grave concerns about these models, and consensus has been reached regarding many of their current issues. Only folks like Chetty and Kane (the two pro-VAM scholars), however, were positioned as leading thought and research in this area. Flowers, before he wrote such a piece, really should have done more homework. This also includes the other critiques of Chetty’s work, not mentioned whatsoever in this piece albeit very important to understanding it (see, for example, here, here, here, and here).
—
Flowers: “That said, I felt like the criticism of the CFR work by other academic economists, as well as the general caution of the ASA, warranted inclusion — and so I reached out to Jesse Rothstein, the most respected “anti-VAM” economist, for comment. I started and ended the piece with the perspective of “pro-VAM” voices because that was the peg of the story — this new exchange between CFR and Rothstein — and, if one reads both papers and talks to both sides, I thought it was clear how the debate tilted in the favor of CFR.”
Amrein-Beardsley: Again, why only the critiques of other “academic economists,” or actually just one other academic economist to be specific (i.e., Jesse Rothstein, who most would agree is “the most respected ‘anti-VAM’ economist”)? Everybody knows Chetty and Kane (the other economist to whom Flowers “reached out”) are colleagues/buddies and very much on the same page and side of all of this, so Rothstein was really the only respected critic included to represent the other side. All of this is biased in and of itself (see also the studies above for economists’ and statisticians’ other critiques), and quite frankly insulting to/marginalizing of the other well-respected scholars also conducting solid empirical research in this area (e.g., Henry Braun, Stephen Raudenbush, Jonathan Papay, Sean Corcoran). Nonetheless, this “new exchange” between Chetty and Rothstein is not “new” as claimed. It actually started back in October, to be specific (see here, for example). I too have read both papers and talked to both sides, and would hardly say it’s “clear how the debate” tilts either way. It’s educational research, and complicated, and not nearly as objective, hard, conclusive, or ultimately victorious as Flowers claims.
—
Flowers: “Now, why is that? I think there are two (one could argue three) empirical arguments at stake here. First, are the CFR results, based on NYC public schools, reproducible in other settings? If not — if other researchers can’t produce similar estimates with different data — then that calls it into question. Second, assuming the reproducibility bar is passed, can the CFR’s specification model withstand scrutiny; that is, is CFR’s claim to capture teacher value-added in isolation of all other factors (e.g., demographic characteristics, student sorting, etc.) really believable? This second argument is less about data than about statistical modeling…What I found was that there was complete agreement (even by Rothstein) on this first empirical argument. CFR’s results are reproducible even by their critics, in different settings (Rothstein replicated in North Carolina). That’s amazing, right?”
Amrein-Beardsley: These claims are actually quite interesting in that there is a growing set of research evidence that all models, using the same datasets, actually yield similar results. It’s really no surprise, and certainly not “amazing,” that Kane replicated Chetty’s results, or that Rothstein replicated them, more or less, as well. Even what some argue is the least sophisticated VAM (although some would cringe at calling it a VAM) – the Student Growth Percentiles (SGP) model – has demonstrated itself, even without using student demographics in model specifications/controls, to yield similar output when the same datasets are used. One of my doctoral students, in fact, ran five different models using the same dataset and yielded inter-/intra-model correlations that some could actually consider “amazing.” That is because, as at least some contend, these models are quite similar, and they yield similar results given their similarities, and also their limitations. Some even go as far as calling all such models “garbage in, garbage out” systems, given the test data they all (typically) use to generate VAM-based estimates, and almost regardless of the extent to which model specifications differ. So replication, in this case, is certainly not the cat’s meow. One must also look to other traditional notions of educational measurement: reliability/consistency (which is not at high-enough levels, especially across teacher types), validity (which is not at high-enough levels, especially for high-stakes purposes), etc., in that “replicability” alone is more common than Flowers (and perhaps others) might assume. Just like it takes multiple measures to get at teachers’ effects, it takes multiple measures to assess model quality. Relying on replication alone is remiss.
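For readers curious about what such a model-agreement check looks like mechanically, here is a minimal sketch in Python. The numbers are simulated placeholders, not estimates from any real VAM or dataset; the two “models” simply recover a common signal from the same underlying data, plus their own noise:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Simulated stand-ins for two VAMs' teacher-effect estimates,
# both run on the SAME test-score data for 1,000 hypothetical teachers.
rng = np.random.default_rng(0)
signal = rng.normal(0.0, 1.0, 1000)             # information shared via the common data
model_a = signal + rng.normal(0.0, 0.5, 1000)   # stand-in for one model's estimates
model_b = signal + rng.normal(0.0, 0.5, 1000)   # stand-in for another model's estimates

# Agreement between the two sets of estimates: Pearson for the scores
# themselves, Spearman for the teacher rankings they imply.
r, _ = pearsonr(model_a, model_b)
rho, _ = spearmanr(model_a, model_b)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

Because both “models” are fed the same underlying scores, the correlations come out high almost by construction, which is the point above: high cross-model or cross-dataset agreement, on its own, says little about reliability or validity for high-stakes use.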
—
Flowers: “For those curious about this third empirical argument, I would refer anyone back to CFR’s second paper (American Economic Review, 2014b), where they impressively demonstrate how students taught by teachers with high VAM scores, all things equal, grow up to have higher earnings (through age 28), avoid teen pregnancy at greater rates, attend better colleges, etc. This is based off an administrative data set from the IRS — that’s millions of students, over 30 years. Of course, it all hinges on the first study’s validity (that VAM is unbiased) — which was the center of debate between Rothstein and CFR.”
Amrein-Beardsley: The jury is definitely still out on this, across all studies… Plenty of studies demonstrate (with solid evidence) that bias exists, and plenty of others demonstrate (with solid evidence) that it doesn’t.
—
Flowers: “Long story, short: the CFR research has withstood criticism from Rothstein (a brilliant economist, whom CFR greatly respects), and their findings were backed up by other economists in the field (yes, some of them do have a “pro-VAM” bias, but such is social science).”
Amrein-Beardsley: Long story, short: the CFR research has [not] withstood criticism from Rothstein (a brilliant economist, whom CFR [and many others] greatly respect), and their findings were backed up by other economists [i.e., two to be exact] in the field (yes, some of them [only Chetty’s buddy Kane] do have a “pro-VAM” bias, but such is social science). Such is the biased stance taken by Flowers in this piece, as well.
—
Flowers: “If one really wants to poke holes in the CFR research, I’d look to its setting: New York City. What if NYC’s standardized tests are just better at capturing students’ long-run achievement? That’s possible. If it’s hard to do what NYC does elsewhere in the U.S., then CFR’s results may not apply.”
Amrein-Beardsley: First, plenty of respected researchers have already poked what I would consider “enough” holes in the CFR research. Second, Flowers clearly does not know much about current standardized tests in that they are all constructed under contract with the same testing companies, they all include the same types of items, they all measure (more or less) the same set of standards… they all undergo the same sets of bias, discrimination, etc. analyses, and the like. As for their capacities to measure growth, they all suffer from a lack of horizontal, but more importantly, vertical equating; their growth output is distorted because the tests (from pre to post) capture one full year’s growth; and they cannot isolate teachers’ residual effects, summer growth/decay, etc., given that the pretests are not given in the same year, within the same teacher’s classroom.
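To illustrate that last point, consider a simple decomposition of an observed pre-to-post gain (again, my notation, purely for illustration):

\[ A_{it} - A_{i,t-1} = \tau_{j(i,t)} + \rho_{i,t-1} + s_{it} + o_{it} + \varepsilon_{it} \]

where \(\tau_{j(i,t)}\) is the current teacher’s contribution, \(\rho_{i,t-1}\) the residual effects of prior teachers, \(s_{it}\) summer growth or decay, \(o_{it}\) out-of-school learning, and \(\varepsilon_{it}\) noise. Because the pretest is given in the prior year, in another teacher’s classroom, the entire left-hand side gets attributed to \(\tau\) unless the model can separate out the other terms, which, absent vertically equated tests and data on summer and out-of-school learning, it cannot.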