National Opinion Poll on Testing and Test-Based Teacher Accountability

Share Button

In an article titled “Testing Doesn’t Measure up for Americans,” just released by Phi Delta Kappa (PDK) International in partnership with the flagship professional polling organization Gallup, you can read the results of the annual public poll that PDK/Gallup conduct every year regarding the public’s thoughts on America’s public education system. Pollers polled a nationally representative samples of 3,499 Americans over 18 via the web and 1,001 Americans over 18 via phone. As most pertinent to our purposes here, here are the key findings with which you all might be most interested.

“Americans agree that there is too much testing in schools, but few parents report that their children are complaining about excessive testing. Most Americans believe parents should have the right to opt out of standardized testing, but few said they
would exercise that option themselves.”

  • 55% of Americans and 61% of public school parents oppose including student
    scores on standardized tests as part of teacher evaluations.
  • To measure the effectiveness of the public schools, using tests came in last as a measure of effectiveness with just 14% of public school parents valuing test scores as very important for such measurements.
  • 64% of Americans and a similar proportion of public school parents said there is too much emphasis on standardized testing in the public schools in their community with just 7% believing there’s not enough.
  • When asked what ideas were most important for improving public schools
    in their community from a list of five options, testing ranked last in importance
    once again.
  • Americans split on whether parents should be allowed to excuse their child from taking one or more standardized tests: 41% said yes, 44% said no.
  • A majority of public school parents said they would not excuse their own child from taking a standardized test; nearly one-third said they would excuse their own child.
  • Lack of financial support is the biggest problem facing American schools, according to respondents to the PDK/Gallup poll. That’s been a consistent message from the public for the past 10 years. Having sufficient money to spend would improve the quality of the public schools, according to a sizeable portion of American adults.


Citation: Phi Delta Kappa (PDK) International/Gallup. (2015). The 47the Annual PDK/Gallup poll of the public’s attitudes toward the public schools. Phi Delta Kappan, 97(1). Retrieved from

Share Button

New Research Study: Controlling for Student Background Variables Matters

Share Button

An article about the “Sensitivity of Teacher Value-Added Estimates to Student and Peer Control Variables” was recently published in the peer-reviewed Journal of Research on Educational Effectiveness. While this article is not open-access, here is a link to the article released by Mathematica prior, which nearly mirrors the final published article.

In this study, researchers Matthew Johnson, Stephen Lipscomb, and Brian Gill, all of whom are associated with Mathematica, examined the sensitivity and precision of various VAMs, the extent to which their estimates vary given whether modelers include student- and peer-level background characteristics as control variables, and the extent to which their estimates vary given whether modelers include 1+ years of students’ prior achievement scores, also as control variables. They did this while examining state data, as also compared to what they called District X – a district within the state with three-times more African-American students, two-times more students receiving free or reduced-price lunch, and generally more special educations students than the state average. While the state data included more students, the district data included additional control variables, supporting researchers’ analyses.

Here are the highlights, with thanks to lead author Matthew Johnson for edits and clarifications.

  • Different VAMs produced similar results overall [when using the same data], almost regardless of specifications. “[T]eacher estimates are highly correlated across model specifications. The correlations [they] observe[d] in the state and district data range[d] from 0.90 to 0.99 relative to [their] baseline specification.”

This has been something that has been evidenced in the literature prior, when the same models are applied to the same datasets taken from the same sets of tests at the same time. Hence, many critics argue that similar results come about when using different VAMs on the same data because conditions are such that when using the same, fallible, large-scale standardized test data, even the most sophisticated models are processing “garbage in” and “garbage out.” When the tests used and inserted into the same VAM vary, however, even if the tests used are measuring the same constructs (e.g., mathematics learning in grade X), things go haywire. For more information about this, please see Papay (2010) here.

  • However, “even correlation coefficients above 0.9 do not preclude substantial amounts of potential misclassification of teachers across performance categories.” The researchers also found that, even with such consistencies, 26% of teachers rated in the bottom quintile were placed in higher performance categories under an alternative model.
  • Modeling choices impacted the rankings of teachers in District X in “meaningful” ways, given District X’s varying student demographics compared with those in the state overall. In other words, VAMs that do not include all relevant student characteristics can penalize teachers in districts that serve students who are more disadvantaged than statewide averages.

See an article related to whether VAMs can include all relevant student characteristics, also given the non-random assignment of students to teachers (and teachers to classrooms) here.


Original Version Citation: Johnson, M., Lipscomb, S., & Gill, B. (2013). Sensitivity of teacher value-added estimates to student and peer control variables. Princeton, NJ: Mathematica Policy Research.

Published Version Citation: Johnson, M., Lipscomb, S., & Gill, B. (2013). Sensitivity of teacher value-added estimates to student and peer control variables. Journal of Research on Educational Effectiveness, 8(1), 60-83. doi:10.1080/19345747.2014.967898

Share Button

Why Gene Glass is No Longer a Measurement Specialist

Share Button

One of my mentors – Dr. Gene Glass (formerly at ASU and now at Boulder) wrote a letter earlier this week on his blog, titled “Why I Am No Longer a Measurement Specialist.” This is a must read for all of you following the current policy trends not only surrounding teacher-level accountability, but also high-stakes testing in general.

Gene – one of the most well-established and well-known measurement specialists in and outside of the field of education, world renowned for developing “meta-analysis,” writes:

I was introduced to psychometrics in 1959. I thought it was really neat.By 1960, I was programming a computer on a psychometrics research project funded by the Office of Naval Research. In 1962, I entered graduate school to study educational measurement under the top scholars in the field.

My mentors – both those I spoke with daily and those whose works I read – had served in WWII. Many did research on human factors — measuring aptitudes and talents and matching them to jobs. Assessments showed who were the best candidates to be pilots or navigators or marksmen. We were told that psychometrics had won the war; and of course, we believed it.

The next wars that psychometrics promised it could win were the wars on poverty and ignorance. The man who led the Army Air Corps effort in psychometrics started a private research center. (It exists today, and is a beneficiary of the millions of dollars spent on Common Core testing.) My dissertation won the 1966 prize in Psychometrics awarded by that man’s organization. And I was hired to fill the slot recently vacated by the world’s leading psychometrician at the University of Illinois. Psychometrics was flying high, and so was I.

Psychologists of the 1960s & 1970s were saying that just measuring talent wasn’t enough. Talents had to be matched with the demands of tasks to optimize performance. Measure a learning style, say, and match it to the way a child is taught. If Jimmy is a visual learner, then teach Jimmy in a visual way. Psychometrics promised to help build a better world. But twenty years later, the promises were still unfulfilled. Both talent and tasks were too complex to yield to this simple plan. Instead, psychometricians grew enthralled with mathematical niceties. Testing in schools became a ritual without any real purpose other than picking a few children for special attention.

Around 1980, I served for a time on the committee that made most of the important decisions about the National Assessment of Educational Progress. The project was under increasing pressure to “grade” the NAEP results: Pass/Fail; A/B/C/D/F; Advanced/Proficient/Basic. Our committee held firm: such grading was purely arbitrary, and worse, would only be used politically. The contract was eventually taken from our organization and given to another that promised it could give the nation a grade, free of politics. It couldn’t.

Measurement has changed along with the nation. In the last three decades, the public has largely withdrawn its commitment to public education. The reasons are multiple: those who pay for public schools have less money, and those served by the public schools look less and less like those paying taxes.

The degrading of public education has involved impugning its effectiveness, cutting its budget, and busting its unions. Educational measurement has been the perfect tool for accomplishing all three: cheap and scientific looking.

International tests have purported to prove that America’s schools are inefficient or run by lazy incompetents. Paper-and-pencil tests seemingly show that kids in private schools – funded by parents – are smarter than kids in public schools. We’ll get to the top, so the story goes, if we test a teacher’s students in September and June and fire that teacher if the gains aren’t great enough.

There has been resistance, of course. Teachers and many parents understand that children’s development is far too complex to capture with an hour or two taking a standardized test. So resistance has been met with legislated mandates. The test company lobbyists convince politicians that grading teachers and schools is as easy as grading cuts of meat. A huge publishing company from the UK has spent $8 million in the past decade lobbying Congress. Politicians believe that testing must be the cornerstone of any education policy.

The results of this cronyism between corporations and politicians have been chaotic. Parents see the stress placed on their children and report them sick on test day. Educators, under pressure they see as illegitimate, break the rules imposed on them by governments. Many teachers put their best judgment and best lessons aside and drill children on how to score high on multiple-choice tests. And too many of the best teachers exit the profession.

When measurement became the instrument of accountability, testing companies prospered and schools suffered. I have watched this happen for several years now. I have slowly withdrawn my intellectual commitment to the field of measurement. Recently I asked my dean to switch my affiliation from the measurement program to the policy program. I am no longer comfortable being associated with the discipline of educational measurement.

Gene V Glass
Arizona State University
National Education Policy Center
University of Colorado Boulder

Share Button

Stanford Professor Haertel: Short Video about VAM Reliability and Bias

Share Button

I just came across this 3-minute video that you all might/should find of interest (click here for direct link to this video on YouTube; click here to view the video’s original posting on Stanford’s Center for Opportunity Policy in Education (SCOPE)).

Featured is Stanford’s Professor Emeritus – Dr. Edward Haertel – describing what he sees as two major flaws in the use of VAMs for teacher evaluation and accountability. These are two flaws serious enough, he argues, to prevent others from using VAM scores to make high-stakes decisions about really any of America’s public school teachers. “Like all measurements, these scores are imperfect. They are appropriate and useful for some purposes, but not for others. Viewed from a measurement perspective, value-added scores have limitations that make them unsuitable for high-stakes personnel decisions.”

The first problem is the unreliability of VAM scores which is attributed to noise from the data. The effect of a teacher is important, but weak when all of the other contributing factors are taken into account. The separation of the effect of a teacher from all the other effects is very difficult. This isn’t a flaw that can be fixed by more sophisticated statistical models; it is innate to the data collected.

The second problem is that the models must account for bias. The bias is the difference in circumstances faced by a teacher in a strong school and a teacher in a high-needs school. The instructional history of a student includes out of school support, peer support, and the academic learning climate of the school and VAMs do not take these important factors into account.

Share Button

NY Teacher Lederman’s Day in Court

Share Button

Do you recall the case of Sheri Lederman? The Long Island teacher who, apparently by all accounts other than her composite growth (or value-added) score is a terrific 4th grade/18 year veteran teacher, who received a score of 1 out of 20 after she scored a 14 out of 20 the year prior (see prior posts herehere and here; see also here and here)?

With her husband, attorney Bruce Lederman leading her case, she is suing the state of New York (the state in which Governor Cuomo is pushing to now have teachers’ value-added scores count for approximately 50% of their total evaluations) to challenge the state’s teacher evaluation system. She is also being fully supported by her students, her principal, her superintendent, and a series of VAM experts including: Linda Darling-Hammond (Stanford), Aaron Pallas (Columbia University Teachers College), Carol Burris (Educator and Principal of the Year from New York), Brad Lindell (Long Island Research Consultant), and me (Arizona State University) (see their/our expert witness affidavits here). See also an affidavit more recently submitted by Jesse Rothstein (Berkeley) here, as well as the full document explaining the entire case – the Memorandum of Law – here.

Well, the Ledermans had their day in court this past Wednesday (August 12, 2015).

It was apparent in the hearing that the Judge carefully read all the papers prior, and he was fully familiar with the issues. As per Bruce Lederman, “[t]he issue that seemed to catch the Judge’s attention the most was whether it was rational to have a system which decides in advance that 7% of teachers will be ineffective, regardless of actual results. The Judge asked numerous questions about whether it was fair to use a bell curve,” whereby when using a bell curve to distribute teachers’ growth or value-added scores, there will always be a set of “ineffective” teachers, regardless of whether in face they are truly “ineffective.” This occurs not naturally but by the statistical manipulation needed to fit all scores within the normal distribution needed to spread out the scores in order to make relative distinctions and categorizations (e.g., highly effective, effective, ineffective), the validity of which are highly uncertain (see, for example, a prior post here). Hence, “[t]he Judge pressed the lawyer representing New York’s Education Department very hard on this particular issue,” but the state’s lawyer did not (most likely because she could not) give the Judge a satisfactory explanation, justification, or rationale.

For more information on the case, see here the video that I feel best captures the case, thanks to CBS news in Albany. For another video see here, compliments of NBC news in Albany. See also two additional articles, here and here, with the latter including the photo of Sheri and Bruce Lederman pasted below.

a - ledermans_0

Share Button

Individual-Level VAM Scores Over Time: “Less Reliable than Flipping a Coin”

Share Button

In case you missed it (I did), an article authored by Stuart Yeh (Associate Professor at the University of Minnesota) titled “A Reanalysis of the Effects of Teacher Replacement Using Value-Added Modeling” was (recently) published in the esteemed, peer-reviewed journal: Teachers College Record. While the publication suggests a 2013 publication date, my understanding is that the actual article was more recently released.

Regardless, it’s contents are important to share, particularly in terms of VAM-based levels of reliability, whereas reliability is positioned as follows: “The question of stability [reliability/consistency] is not a question about whether average teacher performance rises, declines, or remains flat over time. The issue that concerns critics of VAM is whether individual teacher performance fluctuates over time in a way that invalidates inferences that an individual teacher is “low-” or “high-” performing. This distinction is crucial because VAM is increasingly being applied such that individual teachers who are identified as low-performing are to be terminated. From the perspective of individual teachers, it is inappropriate and invalid to fire a teacher whose performance is low this year but high the next year, and it is inappropriate to retain a teacher whose performance is high this year but low next year. Even if average teacher performance remains stable over time, individual teacher performance may fluctuate wildly from year to year” (p. 7).

Yeh’s conclusions, then (and as based on the evidence presented in this piece) is that “VAM is less reliable than flipping a coin for the purpose of categorizing high- and low-performing teachers” (p. 19). More specifically, VAMs have an estimated, overall error rate of 59% (see Endnote 2, page 26 for further explanation).

That being said, not only is the assumption that teacher quality is a fixed characteristic (i.e., that a high-performing teacher this year will be a high-performing teacher next year, and a low-performing teacher this year will be a low-performing teacher next year) false and not supported by the available data, albeit continuously assumed by many VAM proponents, (including Chetty et al.; see, for example, here, here, and here), prior estimates that using VAMs to identify teachers is no different than the flip of a coin may actually be an underestimate given current reliability estimates (see also Table 2, p. 19; see also p. 25, 26).

In section two of this article, for those of you following the Chetty et al. debates, Yeh also critiques other assumptions supporting and behind the Chetty et al. studies (see other, similar critiques here, here, here, and here). For example, Yeh critiques the VAM-based proposals to raise student achievement by (essentially) terminating low-value-added teachers. Here, the assumption is that “the use of VAM to identify and replace the lowest-performing 5% of teachers with average teachers would increase student achievement and would translate into sizable gains in the lifetime earnings of their students” (p. 2). However, because this also assumes that “there is an adequate supply of unemployed teachers who are ready and willing to be hired and would perform at a level that is 2.04 standard deviations above the performance of teachers who are fired based on value-added rankings [and] Chetty et al. do not justify this assumption with empirical data” (p. 14), this too proves much more idealistic than realistic in the grand scheme of things.

In section three of this article, for those of you generally interested in better and in this case more cost effective solutions, Yeh discusses a number of cost-effectiveness analyses comparing 22 leading approaches for raising student achievement, the results of which suggest that “the most efficient approach—rapid performance feedback (RPF)—is approximately 5,700 times as efficient as the use of VAM to identify and replace low-performing teachers” (p. 25; see also p. 23-24).


Citation: Yeh, S. S. (2013). A re-analysis of the effects of teacher replacement using value-added modeling. Teachers College Record, 115(12), 1-35. Retrieved from

Share Button

Special Issue of “Educational Researcher” Examines Value-Added Measures (Paper #1 of 9)

Share Button

A few months ago, the flagship journal of the American Educational Research Association (AERA) – the peer-reviewed journal titled Educational Researcher (ER) – published a “Special Issue” including nine articles examining value-added measures (VAMs) (i.e., one introduction (reviewed below), four feature articles, one essay, and three commentaries). I will review each of these pieces separately over the next few weeks or so, although if any of you want an advanced preview, do click here as AERA made each of these articles free and accessible.

In this “Special Issue” editors Douglas Harris – Associate Professor of Economics at Tulane University – and Carolyn Herrington – Professor of Educational Leadership and Policy at Florida State University – solicited “[a]rticles from leading scholars cover[ing] a range of topics, from challenges in the design and implementation of teacher evaluation systems, to the emerging use of teacher observation information by principals as an alternative to VAM data in making teacher staffing decisions.” They challenged authors “to participate in the important conversation about value-added by providing rigorous evidence, noting that successful policy implementation and design are the product of evaluation and adaption” (assuming “successful policy implementation and design” exist, but I digress).

More specifically, in the co-editors’ Introduction to the Special Issue, Harris and Herrington note that in this special issue they “pose dozens of unanswered questions [see below], not only about the net effects of these policies on measurable student outcomes, but about the numerous, often indirect ways in which [unintended] and less easily observed effects might arise.” This section is of, in my opinion, the most “added value.”

Here are some of their key assertions:

  • “[T]eachers and principals trust classroom observations more than value added.”
  • “Teachers—especially the better ones—want to know what exactly they are doing well and doing poorly. In this respect, value-added measures are unhelpful.”
  • “[D]istrust in value-added measures may be partly due to [or confounded with] frustration with high-stakes testing generally.”
  • “Support for value added also appears stronger among administrators than teachers…But principals are still somewhat skeptical.”
  • “[T]he [pre-VAM] data collection process may unintentionally reduce the validity and credibility of value-added measures.”
  • “[I]t seems likely that support for value added among educators will decrease as the stakes increase.”
  • “[V]alue-added measures suffer from much higher missing data rates than classroom observation[s].”
  • “[T]he timing of value-added measures—that they arrive only once a year and during the middle of the school year when it is hard to adjust teaching assignments—is a real concern among teachers and principals alike.”
  • “[W]e cannot lose sight of the ample evidence against the traditional model [i.e., based on snapshot measures examined once per year as was done for decades past, or pre-VAM].” This does not make VAMs “better,” but with this statement most researchers agree.

Inversely, here are some points or assertions that should cause pause:

  • “The issue is not whether value-added measures are valid but whether they can be used in a way that improves teaching and learning.” I would strongly argue that validity is a pre-condition to use, as we do not want educators using invalid data to even attempt to improve teaching and learning. I’m actually surprised this statement was published, as so scientifically and pragmatically off-based.
  • We know “very little” about how educators “actually respond to policies that use value-added measures.” Clearly, the co-editors are not followers of this blog, other similar outlets (e.g., The Washington Post’s Answer Sheet), or other articles published in the media as well as scholarly journals, about educator use, interpretation, opinion, response, and the like to VAMs. (For example articles published in scholarly journals see, for example, here, here, and here).
  • “[I]n the debate about these policies, perspective has taken over where the evidence trail ends.” Rather, the evidence trail is already quite saturated in many respects, as study after study continues to evidence the same things (e.g., inconsistencies in teacher-level ratings over time, mediocre correlations between VAM and observational output, all of which matter most if high-stakes decisions are to be tied to value-added output).
  • [T]he best available evidence [is just] beginning to emerge on the use of value added.” Perhaps the authors of this piece are correct if focusing only on use, or the lack thereof as we have a good deal of evidence that much use is not happening given issues with transparency, accessibility, comprehensibility, relevance, fairness, and the like.

In the end, and also of high value, the authors of this piece offer others (e.g., graduate students, practitioners, or academics looking take note) some interesting points for future research, given VAMs are likely still to stay, for at least awhile. Some, although not all of their suggested questions for future research are included here:

  • How do educators’ perceptions impact their behavioral responses with and towards VAMs or VAM output? Does administrator skepticism affect how they use these measures?
  • Does the use of VAMs actually lead to more teaching to the test and shallow instruction, aimed more at developing basic skills than at critical thinking and creative
    problem solving?
  • Does the approach further narrow the curriculum and lead to more gaming of the system or, following Campbell’s law, distort the measures in ways that make them less informative?
  • In the process of sanctioning and dismissing low-performing teachers, do value-added-based accountability policies sap the creativity and motivation of our best teachers?
  • Does the use of value-added measures reduce trust and undermine collaboration among educators and principals, thus weakening the organization as a whole?
  • Are these unintended effects made worse by the common misperception that when
    teachers help their colleagues within the school, they could reduce their own value-added measures?
  • Aside from these incentives, does the general orientation toward individual
    performance lead teachers to think less about their colleagues and organizational roles and responsibilities?
  • Are teacher and administrator preparation programs helping to prepare future educators for value-added-based accountability by explaining the measures and their potential uses and misuses?
  • Although not explicitly posed as a question in this piece, but important, what are their (i.e., VAMs and VAM outputs) benefits, intended or otherwise?

Some of these questions are discussed or answered at least in part in the eight articles included in this “Special Issue” also to be highlighted in the next few weeks or so, one by one. Do stay tuned.


Article #1 Reference: Harris, D. N., & Herrington, C. D. (2015). Editors’ introduction: The use of teacher value-added measures in schools: New evidence, unanswered questions, and future prospects. Educational Researcher, 44(2), 71-76. doi:10.3102/0013189X15576142

Share Button

EVAAS, Value-Added, and Teacher Branding

Share Button

I do not think I ever shared this video out, and now following up on another post, about the potential impact these videos should really have, I thought now is an appropriate time to share. “We can be the change,” and social media can help.

My former doctoral student and I put together this video, after conducting a study with teachers in the Houston Independent School District and more specifically four teachers whose contracts were not renewed due in large part to their EVAAS scores in the summer of 2011. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools, post the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Share Button

Key & Peele of Comedy Central: The Teacher Draft

Share Button

In case you missed it…this one is worth a watch.

The Comedy Central duo, Key and Peele, discuss in this comedy clip teachers, as if they were professional athletes (and also worth the money and fame).

Click here to “Watch Key & Peele Talk About Teachers in the Same Way SportsCenter Talks About Athletes.”

Share Button

Follow-Up on “Economists Declaring Victory…”

Share Button

Over one week ago I published a post about some “Economists Declar[ing] Victory for VAMs,” as per an article titled “The Science Of Grading Teachers Gets High Marks,” written by the economics site’s “quantitative editor” Andrew Flowers.

Valerie Strauss, author of the Answer Sheet section of The Washington Post, was apparently also busy finding more about Flowers’ piece, as well as his take, at the same time. She communicated with Flowers via email, after which she communicated with me via email to help her/us respond to Flowers’ claims. These email exchanges, and more, were just published on her Answer Sheet section of The Washington Post here.

For those of you interested in reading the whole thing, do click here. For those of you interested in just the email exchanges, as a follow-up to my previous post here, I’ve pasted the highlights of the conversation for you all below…with compliments to Valerie for including what she viewed as the key points for discussion/thought.

From her post:

I asked Audrey Amrein-Beardsley, a former middle- and high-school mathematics teacher who is now associate professor in Arizona State University’s Mary Lou Fulton Teachers College and a VAM researcher, about the FiveThirtyEight blog post and e-mail comments by Flowers. She earned a Ph.D. in 2002 from Arizona State University in the Division of Educational Leadership and Policy Studies with an emphasis on research methods. She had already written about Flowers’ blog post on her VAMBoozled! blog, which you can see here.

Here are her comments on what Flowers wrote to me in the e-mail. Some of them are technical, as any discussion about formulas would be:

Flowers: “The piece I wrote that was recently published by FiveThirtyEight was focused on a specific type of value-added model (VAM) — the one developed by Chetty, Friedman and Rockoff (CFR). In my reading of the literature on VAMs, including the American Statistical Association’s (ASA) statement, I felt it fair to characterize the CFR research as cutting-edge.”

Amrein-Beardsley: There is no such thing as a “cutting-edge” VAM. Just because Chetty had access to millions of data observations does not make his actual VAM more sophisticated than any of those in use otherwise or in other ways. The fact of the matter is is that all states have essentially the same school level data (i.e., very similar test scores by students over time, links to teachers, and series of typically dichotomous/binary variables meant to capture things like special education status, English language status, free-and-reduced lunch eligibility, etc.). These latter variables are the ones used, or not used depending on the model for VAM-based analyses. While Chetty used these data and also had access to other demographic data (e.g., IRS data, correlated with other demographic data as well), and he could use these data to supplement the data from NYC schools, the data whether dichotomous or continuous (which is a step in the right direction) still cannot and do not capture all of the things we know from the research that influence student learning, achievement, and more specifically growth in achievement in schools. These are the unquantifiable/uncontrollable variables that (will likely forever) continue to distort the measurement of teachers’ causal effects, and that cannot be captured using IRS data alone. For example, unless Chetty had data to capture teachers’ residuals effects (from prior years), out of school learning, parental impacts on learning or a lack thereof, summer learning and decay, etc. it is virtually impossible, no matter how sophisticated any model or dataset is, to make such causal claims. Yes, such demographic variables are correlated with, for example, family income [but]  they are not correlated to the extent that they can remove systematic error from the model.

Accordingly, Chetty’s model is no more sophisticated or “cutting–edge” than any other. There are probably, now, five+ models being used today (i.e., the EVAAS, the Value–Added Research Center (VARC) model, the RAND Corporation model, the American Institute for Research (AIR) model, and the Student Growth Percentiles (SGP) model). All of the them except for the SGP have been developed by economists, and they are likely just as sophisticated in their design (1) given minor tweaks to model specifications and (2) given various data limitations and restrictions. In fact, the EVAAS, because it’s been around for over twenty years (in use in Tennessee since 1993, and in years of development prior), is probably considered the best and most sophisticated of all VAMs, and because it’s now run by the SAS analytics software corporation, I (and likely many other VAM researchers) would likely put our money down on that model any day over Chetty’s model, if both had access to the same dataset. Chetty might even agree with this assertion, although he would disagree with the EVAAS’s (typical) lack of use of controls for student background variables/demographics — a point of contention that has been debated, now, for years, with research evidence supporting both approaches; hence, the intense debates about VAM–based bias, now also going on for years.

Flowers: “So, because the CFR research is so advanced, much of the ASA’s [American Statistical Association’s] critique does not apply to it. In its statement, the ASA says VAMs “generally… not directly measure potential teacher contributions toward other student outcomes” (emphasis added). Well, this CFR work I profiled is the exception — it explicitly controls for student demographic variables (by using millions of IRS records linked to their parents). And, as I’ll explain below, the ASA statement’s point that VAMs are only capturing correlation, not causation, also does not apply to the CFR model (in my view). The ASA statement is still smart, though. I’m not dismissing it. I just thought — given how superb the CFR research was — that it wasn’t really directed at the paper I covered.”

Amrein-Beardsley: This is based on the false assumption, addressed above, that Chetty’s model is “so advanced” or “cutting edge,” or now as written here “superb.” When you appropriately remove or reject this assumption, ASA’s critique applies to Chetty’s model along with the rest of them. Should we not give credit to the ASA for taking into consideration all models when they wrote this statement, especially as they wrote their statement well after Chetty’s model had hit the public? Would the ASA not have written, somewhere, that their critique applies to all models “except for” the one used by Chetty et al because they too agreed this one was exempt from their critiques? This singular statement is absurd in and of itself, as is the statement that Flowers isn’t “dismissing it.” I’m sure the ASA would be thrilled to hear. More specifically, the majority of models “explicitly control for student demographics” — Chetty’s model is by far not the only one (see the first response above, as again, this is one of the most contentious issues going). Given this, and the above, it is true that all “VAMs are only capturing correlation, not causation,” and all VAMs are doing this at a mediocre level of quality. The true challenge, should Chetty take it on, would be to put his model up against the other VAMs mentioned above, using the same NYC school-level dataset, and prove to the public that his model is so “cutting-edge” that it does not suffer from the serious issues with reliability, validity, bias, etc. with which all other modelers are contending. Perhaps Flowers’ main problem in this piece is that he conflated model sophistication with dataset quality, whereby the former is likely no better (or worse) than any of the others.

Lastly, for what “wasn’t really directly at the paper [Flowers] covered…let’s talk about the 20+ years of research we have on VAMs that Flowers dismissed, implicitly in that it was not written by economists, whereas Jesse Rothstein was positioned as the only respected critic of VAMs. My best estimates, and I’ll stick with them today, is that approximately 90 percent of all value-added researchers, including econometricians and statisticians alike, have grave concerns about these models, and consensus has been reached regarding many of their current issues. Only folks like Chetty and Kain (the two-pro VAM scholars), however, were positioned as leading thought and research in this area. Flowers, before he wrote such a piece, really should have done more homework. This also includes the other critiques of Chetty’s work, not mentioned whatsoever in this piece albeit very important to understanding it (see, for example, here, here, here, and here).

Flowers: “That said, I felt like the criticism of the CFR work by other academic economists, as well as the general caution of the ASA, warranted inclusion — and so I reached out to Jesse Rothstein, the most respected “anti-VAM” economist, for comment. I started and ended the piece with the perspective of “pro-VAM” voices because that was the peg of the story — this new exchange between CFR and Rothstein — and, if one reads both papers and talks to both sides, I though it was clear how the debate tilted in the favor of CFR.”

Amrein-Beardsley: Again, why only the critiques of other “academic economists,” or actually just one other academic economist to be specific (i.e., Jesse Rothstein, who most would agree is “the most respected ‘anti-VAM’ economist)? Everybody knows Chetty and Kane (the other economist to whom Flowers “reached out) are colleagues/buddies and very much on the same page and side of all of this, so Rothstein was really the only respected critic included to represent the other side. All of this is biased in and of itself (see also studies above for economists’ and statisticians’ other critiques),and quite frankly insulting to/marginalizing of the other well-respected scholars also conducting solid empirical research in this area (e.g., Henry Braun, Stephen Raudenbush, Jonathan Papay, Sean Corcoran). Nonetheless, this “new exchange” between Chetty and Rothstein is not “new” as claimed. It actually started back in October to be specific (see, here, for example). I too have read both papers and talked to both sides, and would hardly say it’s “clear how the debate” tilts either way. It’s educational research, and complicated, and not nearly objective, hard, conclusive, or ultimately victorious as Flowers claims.

Flowers: “Now, why is that? I think there are two (one could argue three) empirical arguments at stake here. First, are the CFR results, based on NYC public schools, reproducible in other settings? If not — if other researchers can’t produced similar estimates with different data — then that calls it into question. Second, assuming the reproducibility bar is passed, can the CFR’s specification model withstand scrutiny; that is, is CFR’s claim to capture teacher value-added in isolation of all other factors (e.g., demographic characteristics, student sorting, etc.) really believable? This second argument is less about data than about statistical modeling…What I found was that there was complete agreement (even by Rothstein) on this first empirical argument. CFR’s results are reproducible even by their critics, in different settings (Rothstein replicated in North Carolina). That’s amazing, right? “

Amrein-Beardsley: These claims are actually quite interesting in that there is a growing set of research evidence that all models, using the same datasets, actually yield similar results. It’s really no surprise, and certainly not “amazing” that Kane replicated Chetty’s results, or that Rothstein replicated them, more or less, as well. Even what some argue is the least sophisticated VAM (although some would cringe calling it a VAM) – the Student Growth Percentiles (SGP) model – has demonstrated itself, even without using student demographics in model specifications/controls, to yield similar output when the same datasets are used. One of my doctoral students, in fact, ran five different models using the same dataset and yielded inter/intra correlations that some could actually consider “amazing.” That is because, what at least some contend, these models are quite similar, and yield similar results given their similarities, and also their limitations. Some even go as far as calling all such models “garbage in, garbage out” systems, given the test data they all (typically) use to generate VAM-based estimates, and almost regardless of the extent to which model specifications differ. So replication, in this case, is certainly not the cat’s meow. One must also look to other traditional notions of educational measurement: reliability/consistency (which is not at high-enough levels, especially across teacher types), validity (which is not at high-enough levels, especially for high-stakes purposes), etc. in that “replicability” alone is more common than Flowers (and perhaps others) might assume. Just like it takes multiple measures to get at teachers’ effects, it takes multiple measures to assess model quality. Using replication, alone, is remiss.

Flowers: “For those curious about this third empirical argument, I would refer anyone back to CFR’s second paper in (American Economic Review 2014b), where they impressively demonstrate how students taught by teachers with high VAM scores, all things equal, grow up to have higher earnings (through age 28), avoid teen pregnancy at greater rates, attend better colleges, etc. This is based off an administrative data set from the IRS — that’s millions of students, over 30 years. Of course, it all hinges on the first study’s validity (that VAM is unbiased)— which was the center of debate between Rothstein and CFR.”

Amrein-Beardsley: The jury is definitely still out on this, across all studies…. Plenty of studies demonstrate (with solid evidence) that bias exists and plenty others demonstrate (with solid evidence) that it doesn’t.

Flowers: “Long story, short: the CFR research has withstood criticism from Rothstein (a brilliant economist, whom CFR greatly respects), and their findings were backed up by other economists in the field (yes, some of them do have a “pro-VAM” bias, but such is social science).”

Amrein-Beardsley: Long story, short: the CFR research has [not] withstood criticism from Rothstein (a brilliant economist, whom CFR [and many others] greatly respect, and their findings were backed up by other economists [i.e., two to be exact] in the field (yes, some of them [only Chetty’s buddy Kane] do have a “pro-VAM” bias, but such is social science). Such is the biased stance taken by Flowers in this piece, as well.

Flowers: “If one really wants to poke holes in the CFR research, I’d look to its setting: New York City. What if NYC’s standardized test are just better at capturing students’ long-run achievement? That’s possible. If it’s hard to do what NYC does elsewhere in the U.S., then CFR’s results may not apply.”

 Amrein-Beardsley: First, plenty of respected researchers have already poked what I would consider as “enough” holes in the CFR research. Second, Flowers clearly does not know much about current standardized tests in that they are all constructed under contract with the same testing companies, they all include the same types of items, they all measure (more or less) the same set of standards… they all undergo the same sets of bias, discrimination, etc. analyses, and the like. As for their capacities to measure growth, they all suffer from a lack of horizontal, but more importantly, vertical equating; their growth output are all distorted because the tests (from pre to post) all capture one full year’s of growth; and they cannot isolate teachers’ residuals, summer growth/decay, etc. given that the pretests are not given the same year, within the same teacher’s classroom.

Share Button