Follow-Up on “Economists Declaring Victory…”

Over one week ago I published a post about some “Economists Declar[ing] Victory for VAMs,” as per an article titled “The Science Of Grading Teachers Gets High Marks,” written by the economics site’s “quantitative editor” Andrew Flowers.

Valerie Strauss, author of the Answer Sheet section of The Washington Post, was apparently also busy finding out more about Flowers’ piece, as well as his take, at the same time. She communicated with Flowers via email, after which she communicated with me via email to help her/us respond to Flowers’ claims. These email exchanges, and more, were just published on her Answer Sheet section of The Washington Post here.

For those of you interested in reading the whole thing, do click here. For those of you interested in just the email exchanges, as a follow-up to my previous post here, I’ve pasted the highlights of the conversation for you all below…with compliments to Valerie for including what she viewed as the key points for discussion/thought.

From her post:

I asked Audrey Amrein-Beardsley, a former middle- and high-school mathematics teacher who is now associate professor in Arizona State University’s Mary Lou Fulton Teachers College and a VAM researcher, about the FiveThirtyEight blog post and e-mail comments by Flowers. She earned a Ph.D. in 2002 from Arizona State University in the Division of Educational Leadership and Policy Studies with an emphasis on research methods. She had already written about Flowers’ blog post on her VAMBoozled! blog, which you can see here.

Here are her comments on what Flowers wrote to me in the e-mail. Some of them are technical, as any discussion about formulas would be:

Flowers: “The piece I wrote that was recently published by FiveThirtyEight was focused on a specific type of value-added model (VAM) — the one developed by Chetty, Friedman and Rockoff (CFR). In my reading of the literature on VAMs, including the American Statistical Association’s (ASA) statement, I felt it fair to characterize the CFR research as cutting-edge.”

Amrein-Beardsley: There is no such thing as a “cutting-edge” VAM. Just because Chetty had access to millions of data observations does not make his actual VAM more sophisticated than any of those in use otherwise or in other ways. The fact of the matter is that all states have essentially the same school-level data (i.e., very similar test scores by students over time, links to teachers, and series of typically dichotomous/binary variables meant to capture things like special education status, English language status, free-and-reduced lunch eligibility, etc.). These latter variables are the ones used, or not used, depending on the model, for VAM-based analyses. While Chetty used these data and also had access to other demographic data (e.g., IRS data, correlated with other demographic data as well), and he could use these data to supplement the data from NYC schools, the data, whether dichotomous or continuous (which is a step in the right direction), still cannot and do not capture all of the things we know from the research that influence student learning, achievement, and more specifically growth in achievement in schools. These are the unquantifiable/uncontrollable variables that (will likely forever) continue to distort the measurement of teachers’ causal effects, and that cannot be captured using IRS data alone. For example, unless Chetty had data to capture teachers’ residual effects (from prior years), out-of-school learning, parental impacts on learning or a lack thereof, summer learning and decay, etc., it is virtually impossible, no matter how sophisticated any model or dataset is, to make such causal claims. Yes, such demographic variables are correlated with, for example, family income, [but] they are not correlated to the extent that they can remove systematic error from the model.

Accordingly, Chetty’s model is no more sophisticated or “cutting-edge” than any other. There are probably, now, five or more models being used today (i.e., the EVAAS, the Value-Added Research Center (VARC) model, the RAND Corporation model, the American Institute for Research (AIR) model, and the Student Growth Percentiles (SGP) model). All of them except for the SGP have been developed by economists, and they are likely just as sophisticated in their design (1) given minor tweaks to model specifications and (2) given various data limitations and restrictions. In fact, the EVAAS, because it’s been around for over twenty years (in use in Tennessee since 1993, and in years of development prior), is probably considered the best and most sophisticated of all VAMs, and because it’s now run by the SAS analytics software corporation, I (and likely many other VAM researchers) would likely put our money down on that model any day over Chetty’s model, if both had access to the same dataset. Chetty might even agree with this assertion, although he would disagree with the EVAAS’s (typical) lack of use of controls for student background variables/demographics — a point of contention that has been debated, now, for years, with research evidence supporting both approaches; hence, the intense debates about VAM-based bias, now also going on for years.

Flowers: “So, because the CFR research is so advanced, much of the ASA’s [American Statistical Association’s] critique does not apply to it. In its statement, the ASA says VAMs “generally… not directly measure potential teacher contributions toward other student outcomes” (emphasis added). Well, this CFR work I profiled is the exception — it explicitly controls for student demographic variables (by using millions of IRS records linked to their parents). And, as I’ll explain below, the ASA statement’s point that VAMs are only capturing correlation, not causation, also does not apply to the CFR model (in my view). The ASA statement is still smart, though. I’m not dismissing it. I just thought — given how superb the CFR research was — that it wasn’t really directed at the paper I covered.”

Amrein-Beardsley: This is based on the false assumption, addressed above, that Chetty’s model is “so advanced” or “cutting edge,” or now as written here “superb.” When you appropriately remove or reject this assumption, ASA’s critique applies to Chetty’s model along with the rest of them. Should we not give credit to the ASA for taking into consideration all models when they wrote this statement, especially as they wrote their statement well after Chetty’s model had hit the public? Would the ASA not have written, somewhere, that their critique applies to all models “except for” the one used by Chetty et al because they too agreed this one was exempt from their critiques? This singular statement is absurd in and of itself, as is the statement that Flowers isn’t “dismissing it.” I’m sure the ASA would be thrilled to hear. More specifically, the majority of models “explicitly control for student demographics” — Chetty’s model is by far not the only one (see the first response above, as again, this is one of the most contentious issues going). Given this, and the above, it is true that all “VAMs are only capturing correlation, not causation,” and all VAMs are doing this at a mediocre level of quality. The true challenge, should Chetty take it on, would be to put his model up against the other VAMs mentioned above, using the same NYC school-level dataset, and prove to the public that his model is so “cutting-edge” that it does not suffer from the serious issues with reliability, validity, bias, etc. with which all other modelers are contending. Perhaps Flowers’ main problem in this piece is that he conflated model sophistication with dataset quality, whereby the former is likely no better (or worse) than any of the others.

Lastly, as for what “wasn’t really directed at the paper [Flowers] covered,” let’s talk about the 20+ years of research we have on VAMs that Flowers dismissed, implicitly, in that it was not written by economists, whereas Jesse Rothstein was positioned as the only respected critic of VAMs. My best estimate, and I’ll stick with it today, is that approximately 90 percent of all value-added researchers, including econometricians and statisticians alike, have grave concerns about these models, and consensus has been reached regarding many of their current issues. Only folks like Chetty and Kane (the two pro-VAM scholars), however, were positioned as leading thought and research in this area. Flowers, before he wrote such a piece, really should have done more homework. This also includes the other critiques of Chetty’s work, not mentioned whatsoever in this piece albeit very important to understanding it (see, for example, here, here, here, and here).

Flowers: “That said, I felt like the criticism of the CFR work by other academic economists, as well as the general caution of the ASA, warranted inclusion — and so I reached out to Jesse Rothstein, the most respected “anti-VAM” economist, for comment. I started and ended the piece with the perspective of “pro-VAM” voices because that was the peg of the story — this new exchange between CFR and Rothstein — and, if one reads both papers and talks to both sides, I thought it was clear how the debate tilted in the favor of CFR.”

Amrein-Beardsley: Again, why only the critiques of other “academic economists,” or actually just one other academic economist to be specific (i.e., Jesse Rothstein, who most would agree is “the most respected ‘anti-VAM’ economist”)? Everybody knows Chetty and Kane (the other economist to whom Flowers “reached out”) are colleagues/buddies and very much on the same page and side of all of this, so Rothstein was really the only respected critic included to represent the other side. All of this is biased in and of itself (see also studies above for economists’ and statisticians’ other critiques), and quite frankly insulting to/marginalizing of the other well-respected scholars also conducting solid empirical research in this area (e.g., Henry Braun, Stephen Raudenbush, Jonathan Papay, Sean Corcoran). Nonetheless, this “new exchange” between Chetty and Rothstein is not “new” as claimed. It actually started back in October, to be specific (see here, for example). I too have read both papers and talked to both sides, and would hardly say it’s “clear how the debate” tilts either way. It’s educational research, and complicated, and not nearly as objective, hard, conclusive, or ultimately victorious as Flowers claims.

Flowers: “Now, why is that? I think there are two (one could argue three) empirical arguments at stake here. First, are the CFR results, based on NYC public schools, reproducible in other settings? If not — if other researchers can’t produce similar estimates with different data — then that calls it into question. Second, assuming the reproducibility bar is passed, can the CFR’s specification model withstand scrutiny; that is, is CFR’s claim to capture teacher value-added in isolation of all other factors (e.g., demographic characteristics, student sorting, etc.) really believable? This second argument is less about data than about statistical modeling…What I found was that there was complete agreement (even by Rothstein) on this first empirical argument. CFR’s results are reproducible even by their critics, in different settings (Rothstein replicated in North Carolina). That’s amazing, right?”

Amrein-Beardsley: These claims are actually quite interesting in that there is a growing set of research evidence that all models, using the same datasets, actually yield similar results. It’s really no surprise, and certainly not “amazing,” that Kane replicated Chetty’s results, or that Rothstein replicated them, more or less, as well. Even what some argue is the least sophisticated VAM (although some would cringe calling it a VAM) – the Student Growth Percentiles (SGP) model – has demonstrated itself, even without using student demographics in model specifications/controls, to yield similar output when the same datasets are used. One of my doctoral students, in fact, ran five different models using the same dataset and yielded inter/intra correlations that some could actually consider “amazing.” That is because, as at least some contend, these models are quite similar, and yield similar results given their similarities, and also their limitations. Some even go as far as calling all such models “garbage in, garbage out” systems, given the test data they all (typically) use to generate VAM-based estimates, and almost regardless of the extent to which model specifications differ. So replication, in this case, is certainly not the cat’s meow. One must also look to other traditional notions of educational measurement: reliability/consistency (which is not at high-enough levels, especially across teacher types), validity (which is not at high-enough levels, especially for high-stakes purposes), etc., in that “replicability” alone is more common than Flowers (and perhaps others) might assume. Just like it takes multiple measures to get at teachers’ effects, it takes multiple measures to assess model quality. Using replication, alone, is remiss.
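The point about inter-model correlations is easy to illustrate with a toy simulation. The Python sketch below is purely hypothetical (the model names, noise levels, and data are my assumptions, not the doctoral student’s actual analysis): it treats several “models” as noisy readings of the same underlying test-score signal, which is essentially the situation when different VAMs are fed the same dataset.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n_teachers = 500
# Shared signal: the same student test-score data feeding every model.
base = [random.gauss(0, 1) for _ in range(n_teachers)]
# Each hypothetical "model" adds only its own specification noise.
models = {name: [b + random.gauss(0, 0.3) for b in base]
          for name in ("EVAAS", "VARC", "SGP")}

for a in models:
    for b in models:
        if a < b:
            print(f"{a} vs {b}: r = {pearson(models[a], models[b]):.2f}")
```

Because every simulated model shares the same input signal, the pairwise correlations should land around 0.9: similar output from similar inputs is exactly what one would expect, which is why replication alone says little about validity.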

Flowers: “For those curious about this third empirical argument, I would refer anyone back to CFR’s second paper in (American Economic Review 2014b), where they impressively demonstrate how students taught by teachers with high VAM scores, all things equal, grow up to have higher earnings (through age 28), avoid teen pregnancy at greater rates, attend better colleges, etc. This is based off an administrative data set from the IRS — that’s millions of students, over 30 years. Of course, it all hinges on the first study’s validity (that VAM is unbiased)— which was the center of debate between Rothstein and CFR.”

Amrein-Beardsley: The jury is definitely still out on this, across all studies…. Plenty of studies demonstrate (with solid evidence) that bias exists and plenty others demonstrate (with solid evidence) that it doesn’t.

Flowers: “Long story, short: the CFR research has withstood criticism from Rothstein (a brilliant economist, whom CFR greatly respects), and their findings were backed up by other economists in the field (yes, some of them do have a “pro-VAM” bias, but such is social science).”

Amrein-Beardsley: Long story, short: the CFR research has [not] withstood criticism from Rothstein (a brilliant economist, whom CFR [and many others] greatly respect), and their findings were backed up by other economists [i.e., two to be exact] in the field (yes, some of them [only Chetty’s buddy Kane] do have a “pro-VAM” bias, but such is social science). Such is the biased stance taken by Flowers in this piece, as well.

Flowers: “If one really wants to poke holes in the CFR research, I’d look to its setting: New York City. What if NYC’s standardized tests are just better at capturing students’ long-run achievement? That’s possible. If it’s hard to do what NYC does elsewhere in the U.S., then CFR’s results may not apply.”

Amrein-Beardsley: First, plenty of respected researchers have already poked what I would consider “enough” holes in the CFR research. Second, Flowers clearly does not know much about current standardized tests in that they are all constructed under contract with the same testing companies, they all include the same types of items, they all measure (more or less) the same set of standards… they all undergo the same sets of bias, discrimination, etc. analyses, and the like. As for their capacities to measure growth, they all suffer from a lack of horizontal but, more importantly, vertical equating; their growth output is all distorted because the tests (from pre to post) all capture one full year’s growth; and they cannot isolate teachers’ residual effects, summer growth/decay, etc., given that the pretests are not given the same year, within the same teacher’s classroom.

The Multiple Teacher Evaluation System(s) in New Mexico, from a Concerned New Mexico Parent

A “concerned New Mexico parent” who wrote a prior post for this blog here, wrote another for you all below, about the sheer number of different teacher evaluation systems, or variations, now in place in his/her state of New Mexico. (S)he writes:

Readers of this blog are well aware of the limitations of VAMs for evaluating teachers. However, many readers may not be aware that there are actually many system variations used to evaluate teachers. In the state of New Mexico, for example, 217 different variations are used to evaluate the many and diverse types of teachers teaching in the state [and likely all other states].

But. Is there any evidence that they are valid? NO. Is there any evidence that they are equivalent? NO. Is there any evidence that this is fair? NO.

The New Mexico Public Education Department (NMPED) provides a framework for teacher evaluations, and the final teacher evaluation should be weighted as follows: Improved Student Achievement (50%), Teacher Observations (25%), and Multiple Measures (25%).

Every school district in New Mexico is required to submit a detailed evaluation plan of specifically what measures will be used to satisfy the overall NMPED 50-25-25 percentage framework, after which NMPED approves all plans.
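In arithmetic terms, the 50-25-25 framework is nothing more than a weighted sum of three component scores. The Python sketch below is illustrative only: the component names and the 0-100 scale are my assumptions, not NMPED’s actual computation.

```python
# NMPED's published 50-25-25 percentage split, expressed as weights.
WEIGHTS = {
    "improved_student_achievement": 0.50,
    "teacher_observations": 0.25,
    "multiple_measures": 0.25,
}

def composite_score(components):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

example = {
    "improved_student_achievement": 62.0,
    "teacher_observations": 80.0,
    "multiple_measures": 74.0,
}
print(composite_score(example))  # 0.5*62 + 0.25*80 + 0.25*74 = 69.5
```

Note that the formula itself is trivial; everything of consequence lives in the choice of weights and in what each district plugs into each component, which is precisely the parent’s complaint below.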

The exact details of any district’s educator effectiveness plan can be found on the NMTEACH website, as every public and charter school plan is posted here.

There are massive differences between how groups of teachers are graded between districts, however, which distorts most everything about the system(s), including the extent to which similar (and different) teachers might be similarly (and fairly) evaluated and assessed.

Even within districts, there are massive differences in how grade level (elementary, middle, high school) teachers are evaluated.

And, even something as seemingly simple as evaluating K-2 teachers requires 42 different variations in scoring.

Table 1 below shows the number of different scales used to calculate teacher effectiveness for each group of teachers and each grade level, for example, at the state level.

New Mexico divides all teachers into three categories — group A teachers have scores based on the statewide test (mathematics, English/language arts (ELA)), group B teachers (e.g. music or history) do not have a corresponding statewide test, and group C teachers teach grades K-2. Table 1 shows the number of scales used by New Mexico school districts for each teacher group. It is further broken down by grade-level. For example, as illustrated, there are 42 different scales used to evaluate Elementary-level Group A teachers in New Mexico. The column marked “Unique (one-offs)” indicates the number of scales that are completely unique for a given teacher group and grade-level. For example, as illustrated, there are 11 unique scales used to grade Group B High School teachers, and for each of these eleven scales, only one district, one grade-level, and one teacher group is evaluated within the entire state.

Based on the size of the school district, a unique scale may be grading as few as a dozen teachers! In addition, there are 217 scales used statewide, with 99 of these scales being unique (by teacher)!
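Tallies of “scales used” and “one-offs” like those reported in Table 1 reduce to a simple frequency count over the approved district plans. The Python sketch below uses invented records (the district names and scale IDs are placeholders, not actual NMTEACH data):

```python
from collections import Counter

# Hypothetical records: (district, group, grade, scale_id) tuples standing in
# for the evaluation plans posted on the NMTEACH website.
plans = [
    ("District1", "A", "Elem", "scale-07"),
    ("District2", "A", "Elem", "scale-07"),
    ("District3", "A", "Elem", "scale-12"),
    ("District1", "B", "HS",   "scale-88"),
]

usage = Counter(scale for (_, _, _, scale) in plans)
scales_used = len(usage)                            # distinct scales statewide
one_offs = [s for s, n in usage.items() if n == 1]  # used by exactly one plan
print(scales_used, sorted(one_offs))
```

Run over the real statewide data, the same two tallies would produce the 217 variants and 99 one-offs described above.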

Table 1: New Mexico Teacher Evaluation System(s)

Group                                              Grade   Scales Used    Unique (one-offs)
Group A (SBA-based;                                All     58             15
  e.g. 5th grade English teacher)                  Elem    42             10
                                                   MS      37             2
                                                   HS      37             3
Group B (non-SBA;                                  All     117            56
  e.g. Elem music teacher)                         Elem    67             37
                                                   MS      62             8
                                                   HS      61             11
Group C (grades K-2)                               All     42             28
                                                   Elem    42             28
TOTAL                                                      217 variants   99 one-offs

The table above highlights the spectacular absurdity of the New Mexico Teacher Evaluation System.

(The complete listings of all variants for the three groups are contained here (in Table A for Group A), here (in Table B for Group B), and here (in Table C for Group C). The abbreviations and notes for these tables are listed here (in Table D).)

By approving all of these different formulas, all things considered, NMPED is also making the following nonsensical claims:

NMPED Claim: The prototype 50-25-25 percentage split has some validity.

There is no evidence to support this division between student achievement measures, observation, and multiple measures at all. It simply represents what NMPED could politically “get away with” in terms of a formula. Why not 60-20-20 or 57-23-20 or 46-18-36, etcetera? The NMPED prototype scale has no proven validity whatsoever.

NMPED Claim: All 217 formulas are equivalent to evaluate teachers.

This claim by NMPED is absurd on its face. Is there any evidence that they have cross-validated the tests? There is no evidence that any of these scales are valid or accurate measures of “teacher effectiveness.” Also, there is no evidence whatsoever that they are equivalent.

Further, if the formulas are equivalent (as NMPED claims), why is New Mexico wasting money on technology for administering SBA tests or End-of-Course exams? Why not use an NMPED-approved formula that includes tests like Discovery, MAPS, DIBELS, or Star that are already being used?

NMPED Claim: Teacher Attendance and Student Surveys are interchangeable.

According to the approved plans, many districts assign 10% to Teacher Attendance while other districts assign 10% to Student Surveys. Both variants have been approved by NMPED.

Mathematically (i.e., in terms of the proportions either is to be allotted), they appear to be interchangeable. If that is so, why is NMPED also specifically trying to enforce Teacher Attendance as an element of the evaluation scale? Why did Hanna Skandera proclaim to the press that this measure improved New Mexico education? (For typical news coverage on this topic, for example, see here.)

The use of teacher attendance appears to be motivated by union-busting rather than any mathematical rationale.

NMPED Claim: All observation methods are equivalent.

NMPED allows for three very different observation methods to be used for 40% of the final score. Each method is somewhat complicated and involves different observers.

There is no indication that NMPED has evaluated the reliability or validity of these three very different observation methods, or tested their results for equivalence. They simply assert that they are equivalent.

NMPED Claim: These formulas will be used to rate teachers.

These formulas are the worst kind of statistical jiggery-pokery (to use a newly current phrase). NMPED presents a seemingly rational, scientific number to the public using invalid and unvalidated mathematical manipulations and then determines teachers’ careers based on the completely bogus New Mexico teacher evaluation system(s).

Conclusion: Not only is the emperor naked, he has a closet containing 217 equivalent outfits at home!

Economists Declare Victory for VAMs

On a popular economics site, authors use “hard numbers” to tell compelling stories, and this time the compelling story told is about value-added models and all of the wonders they are working, thanks to the “hard numbers” derived via model output, to reform the way “we” evaluate and hold teachers accountable for their effects.

In an article titled “The Science Of Grading Teachers Gets High Marks,” this site’s “quantitative editor” (?!?) – Andrew Flowers – writes about how “the science” behind using “hard numbers” to evaluate teachers’ effects is, fortunately for America and thanks to the efforts of (many/most) econometricians, gaining much-needed momentum.

Not to really anyone’s surprise, the featured economics study of this post is…wait for it…the Chetty et al. study at the focus of much controversy and many prior posts on this blog (see, for example, here, here, here, and here). This is the study cited in President Obama’s 2012 State of the Union address when he said that, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.”

These are the “hard numbers” that have since been duly critiqued by scholars from California to New York (see, for example, here, here, here, and here), but that’s not mentioned in this post. What is mentioned, however, is the notable work of economist Jesse Rothstein, whose work I have also cited in prior posts (see, for example, here, here, here, and here), as he has also countered Chetty et al.’s claims, not to mention added critical research to the topic of VAM-based bias.

What is also mentioned, not to really anyone’s surprise again, though, is that Thomas Kane – a colleague of Chetty’s at Harvard who has also been the source of prior VAMboozled! posts (see, for example, here, here, and here), and who also replicated Chetty’s results as notably cited/used during the Vergara v. California case last summer – endorses Chetty’s work throughout this same article. The article’s author “reached out” to Kane “to get more perspective,” although I, for one, question how random this implied casual reach really was… Recall a recent post about our “(Unfortunate) List of VAMboozlers?” Two of our five total honorees include Chetty and Kane – the same two “hard number” economists prominently featured in this piece.

Nonetheless, this article’s “quantitative editor” (?!?) Flowers sides with them (i.e., Chetty and Kane), and ultimately declares victory for VAMs, writing that VAMs “accurately isolate a teacher’s impact on students”…”[t]he implication[s] being, school administrators can legitimately use value-added scores to hire, fire and otherwise evaluate teacher performance.”

This “cutting-edge science,” as per a quote taken from Chetty’s co-author Friedman (Brown University), captures it all: “It’s almost like we’re doing real, hard science…Well, almost. But by the standards of empirical social science — with all its limitations in experimental design, imperfect data, and the hard-to-capture behavior of individuals — it’s still impressive….[F]or what has been called the “credibility revolution” in empirical economics, it’s a win.”

“Students Matter” Sues 13 California Districts for Not Using Student Test Scores to Evaluate Teachers

From Diane Ravitch’s Blog, here:

“Students Matter,” the Silicon Valley-funded group that launched the Vergara lawsuit to block teacher tenure in California, is now suing 13 school districts for their failure to use test scores in evaluating teachers.

The goal is to compel the entire state to use value-added-modeling (VAM), despite the fact that experience and research have demonstrated its invalidity and lack of reliability.

The Southern California school systems named in the latest filing are El Monte City, Inglewood Unified, Chaffey Joint Union, Chino Valley Unified, Ontario-Montclair, Saddleback Valley Unified, Upland Unified and Victor Elementary District. The others are Fairfield-Suisun Unified, Fremont Union, Pittsburg Unified, San Ramon Valley Unified and Antioch Unified.

“School districts are not going to get away with bargaining away their ability to use test scores to evaluate teachers,” said attorney Joshua S. Lipshutz, who is working on behalf of Students Matter. “That’s a direct violation of state law.”

The plaintiffs are six California residents, including some parents and teachers, three of whom are participating anonymously.

In all, the districts serve about 250,000 students, although the group’s goal is to compel change across California.

“The impact is intended to be statewide, to show that no school district is above the law,” Lipshutz said.

The plaintiffs are not asking the courts to determine how much weight test scores should be given in a performance review, Lipshutz said. He cited research, however, suggesting that test scores should account for 30% to 40% of an evaluation.

The current case, Doe vs. Antioch, builds on earlier litigation involving the Los Angeles Unified School District. In 2012, a Los Angeles Superior Court judge ruled that the school system had to include student test scores in teacher evaluations. But the judge also allowed wide latitude for negotiation between the union and district.

The court decision was based on the 1971 Stull Act, which set out rules for teacher evaluations. Many districts had failed for decades to comply with it, according to some experts.

Will the Silicon Valley billionaires help to find new teachers when the state faces massive teacher shortages based on the litigation they continue to file?

It’s a VAM Shame…

It’s a VAM shame to see how VAMs have been used, erroneously yet as assuredly perfect-to-near-perfect indicators of educational quality, to influence educational policy. A friend and colleague of mine just sent me a PowerPoint that William L. Sanders – the developer of the Tennessee Value-Added Assessment System (TVAAS) now more popularly known as the Education Value-Added Assessment System (EVAAS®) and “arguably the most ardent supporter and marketer of [for-profit] value-added” (Harris, 2011; see also prior links about Sanders and his T/EVAAS model here, here, here, and here) – presented to the Tennessee Board of Education back in 2013.

The simple and straightforward (and hence policymaker-friendly) PowerPoint titled “Teacher Characteristics and Effectiveness” consists of seven total slides with figures illustrating three key points: teacher value-added as calculated using the TVAAS model does not differ by (1) years of teaching experience, (2) teachers’ education level, and (3) teacher salary. In other words, and as translated into simpler terms but also terms that have greatly influenced (and continue to influence) educational policy: (1) years of teacher experience do not matter, (2) advanced degrees do not matter, and (3) teacher salaries do not matter.

While it’s difficult to determine how this particular presentation influenced educational policy in Tennessee (see, for example, here), at a larger scale these are the three key policy trends that have since directed (and continue to direct) state policy initiatives in particular. What is trending in educational policy is to evaluate teachers only by their teacher-level value-added. At the same time, this “research” supports simultaneous calls to dismantle teachers’ traditional salary schedules that reward teachers for their years of experience (which matters, as per other research) and advanced degrees (on which other research is mixed).

This “research” evidence is certainly convenient when calls for budget cuts are politically in order. But this “research” is also more than unfortunate in that the underlying assumption in support of all of this is that VAMs are perfect-to-near-perfect indicators of educational quality; hence, their output data can and should be trusted. Likewise, all of the figures illustrated in this and many other similar PowerPoints can be wholly trusted because they are based on VAMs.

Despite the plethora of methodological and pragmatic issues with VAMs, highlighted here within the first post I ever posted on this blog and also duly noted by the American Statistical Association as well as other associations (e.g., the National Association of Secondary School Principals (NASSP), the National Academy of Education), these VAMs are being used to literally change and set bunkum educational policy, because so many care not to be bothered with the truth, however inconvenient.

Like I wrote, it’s a VAM shame…

The Flaws of Using Value-Added Models for Teacher Assessment: Video By Edward Haertel

Within this blog, my team and I have tried to make available various resources for our various followers (who, by the way, now number over 13,000; see, for example, here).

These resources include, but are not limited to, our lists of research articles (see the “Top 15” articles here, the “Top 25” articles here, all articles published in AERA journals here, and all suggested research articles, books, etc. here), our list of VAM Scholars (whom I/we most respect, even if their research-based opinions differ), VAMboozlers (who represent the opposite, with my/our level of respect questionable), and internet videos, all housed and archived here. This includes, for example, what I still believe is the best video yet on all of these issues combined — HBO’s Last Week Tonight with John Oliver on Standardized Testing (which includes a decent section, also, on value-added models).

But one video in our collection we have not yet explicitly made public. It was, however, just posted on another website, which reminded us that it indeed deserves special attention, and a special post.

The video, available here as well as on this blog here, features Dr. Edward Haertel — National Academy of Education member and Professor Emeritus at Stanford University — talking about “The Flaws of Using Value-Added Models for Teacher Assessment.” The video is just over three minutes long; do give it a watch here.

VAM Scholars and An (Unfortunate) List of VAMboozlers

Some time ago I posted a list of those I consider the (now) top 35 VAM Scholars whose research folks out there should be following, especially if they need good (mainly) peer-reviewed research to help them and others (e.g., local, regional, and state policymakers) become more informed about VAMs and their related policies. If you missed this post, do check out these VAM Scholars here.

Soon after, a colleague suggested that I follow this list up with a list of those whom, in a prior post and as appropriate to this blog, I termed VAMboozlers.

VAMboozlers are VAM scholars whose research I would advise consumers to consume carefully. These researchers might be (in my opinion) prematurely optimistic about the potentials of VAMs, contrary to what approximately 90% of the empirical research in this area would support; these scholars might use methods that over-simplistically approach very complex problems and accordingly make sweeping, unwarranted, and perhaps invalid assertions regardless; these folks might have financial or other vested interests in the VAMs being adopted and implemented; or the like.

While I aim to keep this section of the blog as professional and fair, open, and aboveboard as possible, I simultaneously hope to make such information on this blog more actionable and accessible for blog followers and readers.

Accordingly, here is my (still working) list of VAMboozlers:

*If you have any recommendations for this list, please let me know

Article on the “Heroic” Assumptions Surrounding VAMs Published (and Currently Available) in TCR

My former doctoral student and I wrote a paper about the “heroic” assumptions surrounding VAMs. We titled it “‘Truths’ Devoid of Empirical Proof: Underlying Assumptions Surrounding Value-Added Models in Teacher Evaluation,” and it was just published in the esteemed Teachers College Record (TCR). It is also open and accessible, for free, for one week here. I have also pasted the abstract below for more information.


Despite the overwhelming and research-based concerns regarding value-added models (VAMs), VAM advocates, policymakers, and supporters continue to hold strong to VAMs’ purported, yet still largely theoretical strengths and potentials. Those advancing VAMs have, more or less, adopted and promoted an agreed-upon, albeit “heroic,” set of assumptions, without independent, peer-reviewed research in support. These “heroic” assumptions transcend promotional, policy, media, and research-based pieces, but they have never been fully investigated, explicated, or made explicit as a set or whole. These assumptions, though often violated, are often ignored in order to promote VAM adoption and use, and also to sell for-profits’ and sometimes non-profits’ VAM-based systems to states and districts. The purpose of this study was to make obvious the assumptions that have been made within the VAM narrative and that, accordingly, have often been accepted without challenge. Ultimately, sources for this study included 470 distinctly different written pieces, from both traditional and non-traditional sources. The results of this analysis suggest that the preponderance of sources propagating unfounded assertions are fostering a sort of VAM echo chamber that seems impenetrable by even the most rigorous and trustworthy empirical evidence.

More (This Time Obvious) Correlations between Race to the Top and State Policies

About one year ago I published a post titled “States on the VAMwagon Most Likely to Receive Race to the Top Funds,” in which I wrote about correlational analyses revealing that state-level policies that rely at least in part on VAMs are indeed more common in states that (1) allocate less money than the national average for schooling, (2) allocate relatively less in terms of per-pupil expenditures, (3) have more centralized governments, (4) are more highly populated and educate relatively larger populations of poor, racial minority, and language minority students, and (5) have residents who predominantly vote for the Republican Party and, relatedly, Republican initiatives. All of these underlying correlations help explain why such policies are more popular, and accordingly adopted, in certain states versus others.

Later, Mathematica released a News Brief (sponsored by the U.S. Department of Education’s Institute of Education Sciences) titled “Alignment of State Teacher Evaluation Policies with Race to the Top Priorities.” In it, Mathematica wrongly claimed to be “the first to present data on the extent to which states, both those that received Race to the Top grants and those that did not, reported requiring teacher evaluation policies aligned with Race to the Top priorities as of spring 2012.”

Beat ya to it, Mathematica, but whatever 😉

Anyhow, they also found (continuing the list above) that states that won Race to the Top monies were states that (6) required more teacher evaluation and accountability policies, (7) used (or proposed to use) multiple measures to evaluate teacher performance, (8) used (or proposed to use) multiple rating categories to classify teacher effectiveness, (9) conducted (or proposed to conduct) teacher evaluations on an annual basis, and (10) used (or proposed to use) evaluation results to inform decisions regarding teacher compensation and career advancement. Go figure!

Bias in School-Level Value-Added, Related to High V. Low Attrition

In a 2013 study titled “Re-testing PISA Students One Year Later: On School Value Added Estimation Using OECD-PISA” (Organisation for Economic Co‑operation and Development-Programme for International Student Assessment), researchers Bratti and Checchi explored a unique PISA international test-score data set from Italy.

“[I]n two regions of North Italy (Valle d’Aosta and the autonomous province of Trento) the PISA 2009 test was re-administered to the same students one year later.” Hence, the authors had the unique opportunity to analyze what happens with school-level value-added when the same students are retested in two adjacent years, using a very strong standardized achievement test (i.e., the PISA).

Researchers found that “cross-sectional measures of school value added based on PISA…tend to be very volatile over time whenever there is a high year-to-year attrition in the student population.” In addition, some of this volatility can be mitigated when longitudinal measures of school value added take into account students’ prior test scores; however, higher consistency (less volatility) tends to be more evident in schools in which there is little attrition/transition. Inversely, lower consistency (higher volatility) tends to be more evident in schools in which there is much attrition/transition.

Researchers observed correlations “as high as 0.92 in Trento and … close to zero in Valle d’Aosta” when the VAM was not used to control for past test scores. When a more sophisticated VAM was used (accounting for students’ prior performance and school fixed effects), however, researchers found that the “coefficient [was] much higher for Valle d’Aosta than for Trento.” So, the correlations flip-flopped based on model specifications, with the more advanced specifications yielding “the better” or “more accurate” value-added output.

Researchers attribute this to panel attrition in that “in Trento only 8% of the students who were originally tested in 2009 dropped out or changed school in 2010, [but] the percentage [rose] to about 21% in Valle d’Aosta” at the same time.

Likewise, “[i]n educational settings characterized by high student attrition, this will lead to very volatile measures of VA.” Inversely, “in settings characterized by low student attrition (drop-out or school changes), longitudinal and cross-sectional measures of school VA turn out to be very correlated.”
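To see intuitively why attrition drives this volatility, here is a toy simulation in Python. Everything in it is hypothetical: the school-effect size, noise levels, and student counts are invented for illustration and are not drawn from the Bratti and Checchi data. It simply recomputes schools’ mean scores after a fraction of each school’s students is replaced, and compares the year-to-year correlation of those school-level means under low versus high attrition.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_va_correlation(attrition, n_schools=200, n_students=50, reps=100):
    """Average year-to-year correlation of school mean scores when a
    fraction `attrition` of each school's students is replaced by new
    random students in year 2. All parameters are hypothetical."""
    corrs = []
    for _ in range(reps):
        # a persistent "true" school effect, plus persistent student ability
        school_effect = rng.normal(0.0, 0.3, size=(n_schools, 1))
        ability = rng.normal(0.0, 1.0, size=(n_schools, n_students))
        year1 = (school_effect + ability
                 + rng.normal(0.0, 0.5, ability.shape)).mean(axis=1)
        # year 2: some students leave and are replaced by new draws
        n_leave = int(attrition * n_students)
        ability2 = ability.copy()
        if n_leave:
            ability2[:, :n_leave] = rng.normal(0.0, 1.0, (n_schools, n_leave))
        year2 = (school_effect + ability2
                 + rng.normal(0.0, 0.5, ability.shape)).mean(axis=1)
        corrs.append(np.corrcoef(year1, year2)[0, 1])
    return float(np.mean(corrs))

low = simulate_va_correlation(0.08)   # low attrition, as in Trento (~8%)
high = simulate_va_correlation(0.50)  # exaggerated attrition, for contrast
# with more attrition, the year-to-year correlation drops
```

Under these (made-up) parameters, the school means correlate more strongly across years when attrition is low, which is the basic mechanism the researchers describe: the more the tested population churns, the less stable the cross-sectional school-level measure.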

The views expressed herein and throughout all pages associated with this blog are solely those of the authors and may not reflect those of Arizona State University (ASU) or Mary Lou Fulton Teachers College (MLFTC). While the authors and others associated with this blog are affiliated with ASU and MLFTC, all opinions, views, original entries, errors, and the like should be attributable to the authors and content developers of this blog, not whatsoever to ASU or MLFTC.