Is Alabama the New, New Mexico?

In Alabama, the Grand Old Party (GOP) has put forth a draft bill to be entitled as an act and ultimately called the Rewarding Advancement in Instruction and Student Excellence (RAISE) Act of 2016. The purpose of the act will be to…wait for it…use test scores to grade and pay teachers annual bonuses (i.e., “supplements”) as per their performance. More specifically, the bill is to “provide a procedure for observing and evaluating teachers” to help make “significant differentiation[s] in pay, retention, promotion, dismissals, and other staffing decisions, including transfers, placements, and preferences in the event of reductions in force, [as] primarily [based] on evaluation results.” Related, Alabama districts may no longer use teachers’ “seniority, degrees, or credentials as a basis for determining pay or making the retention, promotion, dismissal, and staffing decisions.” Genius!

Accordingly, Larry Lee, whose blog is based on the premise that “education is everyone’s business,” sent me this bill to review, critique, and help make everyone’s business. I attach it here for others who are interested, but I also summarize and critique its most relevant (but also contemptible) issues below.

Eligible Alabama teachers are (after a staggered period of time) to be evaluated primarily (i.e., for up to 45% of a teacher’s total evaluation score) on the extent to which they purportedly cause student growth in achievement, with student growth being defined as the teachers’ purported impacts on “[t]he change in achievement for an individual student between two or more points in time.” Teachers are also to be observed at least twice per year (i.e., for up to 45% of a teacher’s total evaluation score) by their appropriate and appropriately trained evaluators/supervisors, and an unnamed and undefined set of parent and student surveys is to be used to evaluate the teachers (i.e., for up to 15% of a teacher’s total evaluation score).
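To make the arithmetic of such a weighting scheme concrete, here is a minimal sketch in Python. It assumes each component is scored on a 0 to 100 scale and that the final weights must sum to 100%; neither detail comes from the bill, and the function and variable names are hypothetical. Only the three "up to" caps are taken from the percentages quoted above. Note that the caps add up to 105%, so not every component can sit at its maximum at once.

```python
# Caps quoted from the draft bill as summarized above ("up to" percentages).
MAX_WEIGHTS = {"growth": 0.45, "observation": 0.45, "surveys": 0.15}


def composite_score(scores, weights):
    """Return a weighted composite on a 0-100 scale.

    Assumptions (not from the bill): each component score is on a 0-100
    scale, and the chosen weights must sum to 1.0.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    for name, weight in weights.items():
        if weight < 0 or weight > MAX_WEIGHTS[name] + 1e-9:
            raise ValueError(f"weight for {name!r} is outside its allowed range")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)


# Illustrative, made-up numbers for one teacher.
example_scores = {"growth": 50, "observation": 80, "surveys": 90}
example_weights = {"growth": 0.45, "observation": 0.40, "surveys": 0.15}
example_total = composite_score(example_scores, example_weights)
```

In this made-up case, a weak growth score drags down an otherwise strong profile, which is why the validity of the growth measure matters so much when it carries the largest allowable weight.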

Again, no real surprises here, as the adoption of such measures is common among states like Alabama (and New Mexico); it is when these components are explained in more detail that things really go awry.

“For grade levels and subjects for which student standardized assessment data is not available and for teachers for whom student standardized assessment data is not available, the [state’s] department [of education] shall establish a list of preapproved options for governing boards to utilize to measure student growth.” This is precisely what has gotten the whole state of New Mexico wrapped up in, and currently losing, its ongoing lawsuit (see my most recent post on this here). While providing districts with menus of preapproved assessment options might make sense to policymakers, any self-respecting researcher or even assessment commoner should know why this is entirely inappropriate. The best research study explaining why doing just this will set any state up for lawsuits comes from Brown University’s John Papay, in his highly esteemed and highly cited article “Different tests, different answers: The stability of teacher value-added estimates across outcome measures.” The title of this research article alone should explain well enough why simply positioning and offering up such tests in such casual (and quite careless) ways makes way for legal recourse.

Otherwise, the only test mentioned that is also to be used to measure teachers’ purported impacts on student growth is the ACT Aspire – the ACT test corporation’s “college and career readiness” test that is aligned to and connected with their more familiar college-entrance ACT. This, too, was one of the sources of the aforementioned lawsuit in New Mexico in terms of what we call content validity, in that states cannot simply pull in tests that are not adequately aligned with a state’s curriculum (e.g., I could find no information about the alignment of the ACT Aspire to Alabama’s curriculum here, which is also highly problematic as this information should definitely be available) and that have not been validated for such purposes (i.e., to measure teachers’ impacts on student growth).

Regardless of the tests, however, all of the secondary measures to be used to evaluate Alabama teachers (e.g., student and parent survey scores, observational scores) are also to be “correlated with impacts on student achievement results.” We’ve also increasingly seen this becoming the case across the nation, whereby state/district leaders are not simply assessing whether these indicators are independently correlated, which they should be if they all, in fact, help to measure our construct of interest (i.e., teacher effectiveness); rather, state/district leaders are manufacturing and forcing these correlations via what I have termed “artificial conflation” strategies (see also a recent post here about how this is one of the fundamental and critical points of litigation in Houston).

The state is apparently also set on going “all in” on evaluating their principals in many of the same ways, although I did not critique those sections for this particular post.

Most importantly, though, for those of you who have access to such leaders in Alabama, do send them this post so they might be a bit more proactive, and appropriately more careful and cautious, before going down this poor educational policy path. While I do embrace my professional responsibility as a public scholar to be called to court to testify about all of this when such high-stakes consequences are ultimately, yet inappropriately, based upon invalid inferences, I’d much rather be proactive in this regard and save states and states’ taxpayers their time and money, respectively.

Accordingly, I see the state is also to put out a request for proposals to retain an external contractor to help them measure said student growth and teachers’ purported impacts on it. I would also be more than happy to help the state negotiate this contract, much more wisely than so many other states and districts have negotiated similar contracts thus far (e.g., without asking for reliability and validity evidence as a contractual deliverable)…should this poor educational policy actually come to fruition.

Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS

Recall from a prior post that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking their district to federal court to fight for their rights as professionals, and over how their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), have allegedly violated those rights. The case, Houston Federation of Teachers, et al. v. Houston ISD, is to officially begin in court early this summer.

More specifically, the teachers are arguing that EVAAS output is inaccurate; that the EVAAS is unfair; that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach; that the EVAAS fails to control for student-level factors that impact how well teachers perform but that are outside of teachers’ control (e.g., parental effects); that the EVAAS is incomprehensible and hence very difficult, if not impossible, to actually use to improve upon their instruction (i.e., it is not actionable); and, accordingly, that teachers’ due process rights are being violated because teachers do not have adequate opportunities to change as a result of their EVAAS results.

The EVAAS is the one value-added model (VAM) on which I’ve conducted most of my research, also in this district (see, for example, here, here, here, and here); hence, Jesse Rothstein (Professor of Public Policy and Economics at the University of California, Berkeley, who also conducts extensive research on VAMs) and I are serving as the expert witnesses in this case.

What was recently released regarding this case is a summary of the contents of our affidavits, as interpreted by the authors of the attached “EVAAS Litigation Update,” in which the authors declare, with our and others’ research in support, that “Studies Declare EVAAS ‘Flawed, Invalid and Unreliable.’” Here are the twelve key highlights, again, as summarized by the authors of this report and re-summarized, by me, below:

  1. Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  2. The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, albeit sophisticated VAM using data from HISD both times, he found that results “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness from which valid interpretations and assertions can be made.
  3. EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  4. EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average.”
  5. Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  6. The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  7. Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects’ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mitigated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved tests.
  8. There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  9. Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  10. HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from them and even though they are paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’ (AERA Statement, 2015).”
  11. EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  12. Related, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc. which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.” Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students within HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”
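To illustrate why the fourth point above matters (overstated precision), here is a minimal sketch in Python. It is emphatically not the EVAAS calculation, which is proprietary; it is a generic normal-approximation check of whether an estimate can be told apart from an average of zero. The function name, the category labels, and all numbers are hypothetical.

```python
from statistics import NormalDist


def classify(estimate, std_error, confidence=0.95):
    """Classify a value-added estimate relative to an average of zero.

    A generic confidence-interval check (an assumption for illustration),
    not the method used by any particular vendor.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    low = estimate - z * std_error
    high = estimate + z * std_error
    if low > 0:
        return "above average"
    if high < 0:
        return "below average"
    return "not distinguishable from average"


# Same made-up estimate, two different standard errors.
with_correct_se = classify(1.5, 1.0)      # interval spans zero
with_understated_se = classify(1.5, 0.6)  # interval sits above zero
```

With the larger standard error the teacher is not distinguishable from average; shrink the standard error and the very same estimate is reported as a real difference. That is the mechanism behind a model indicating "too often" that teachers differ from the average.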

Brookings’ Critique of AERA Statement on VAMs, and Henry Braun’s Rebuttal

Two weeks ago I published a post about the newly released “American Educational Research Association (AERA) Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.”

In this post I also included a summary of the AERA Council’s eight key, and very important, points about VAMs and VAM use. I also noted that I contributed to this piece in one of its earliest forms. More importantly, however, the person who managed the statement’s external review and also assisted the AERA Council in producing the final statement before it was officially released was Boston College’s Dr. Henry Braun, Boisi Professor of Education and Public Policy and Educational Research, Measurement, and Evaluation.

Just this last week, the Brookings Institution published a critique of the AERA statement for, in my opinion, no other apparent reason than just being critical. The critique, written by Brookings affiliate Michael Hansen and University of Washington Bothell’s Dan Goldhaber, is titled “Response to AERA statement on Value-Added Measures: Where are the Cautionary Statements on Alternative Measures?”

Accordingly, I invited Dr. Henry Braun to respond, and he graciously agreed:

In a recent posting, Michael Hansen and Dan Goldhaber complain that the AERA statement on the use of VAMs does not take a similarly critical stance with respect to “alternative measures”. True enough! The purpose of the statement is to provide a considered, research-based discussion of the issues related to the use of value-added scores for high-stakes evaluation. It culminates in a set of eight requirements to be met before such use should be made.

The AERA statement does not stake out an extreme position. First, it is grounded in the broad research literature on drawing causal inferences from observational data subject to strong selection (i.e., the pairings of teachers and students is highly non-random), as well as empirical studies of VAMs in different contexts. Second, the requirements are consistent with the AERA, American Psychological Association (APA), and National Council on Measurement in Education (NCME) Standards for Educational and Psychological Testing. Finally, its cautions are in line with those expressed in similar statements released by the Board on Testing and Assessment of the National Research Council and by the American Statistical Association.

Hansen and Goldhaber are certainly correct when they assert that, in devising an accountability system for educators, a comparative perspective is essential. One should consider the advantages and disadvantages of different indicators, which ones to employ, how to combine them and, most importantly, consider both the consequences for educators and the implications for the education system as a whole. Nothing in the AERA statement denies the importance of subjecting all potential indicators to scrutiny. Indeed, it states: “Justification should be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process.” Of course, guidelines for designing evaluation systems would constitute a challenge of a different order!

In this context, it must be recognized that rankings based on VAM scores and ratings based on observational protocols will necessarily have different psychometric and statistical properties. Moreover, they both require a “causal leap” to justify their use: VAM scores are derived directly from student test performance, but require a way of linking to the teacher of record. Observational ratings are based directly on a teacher’s classroom performance, but require a way of linking back to her students’ achievement or progress.

Thus, neither approach is intrinsically superior to the other. But the singular danger with VAM scores, being the outcome of a sophisticated statistical procedure, is that they are seen by many as providing a gold standard against which other indicators should be judged. Both the AERA and ASA statements offer a needed corrective, by pointing out the path that must be traversed before an indicator based on VAM scores approaches the status of a gold standard. Though the requirements listed in the AERA statement may be aspirational, they do offer signposts against which we can judge how far we have come along that path.

Henry Braun, Lynch School of Education, Boston College

“Value-Less” Value-Added Data

Peter Greene, a veteran teacher of English in Pennsylvania, a state that uses its own version of the Education Value-Added Assessment System (EVAAS), wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread claims being made, and most lacking research support, about using the data derived via value-added models (VAMs) to inform and improve what teachers do in their classrooms.

Greene sententiously critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on– well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist– you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no — all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line– we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions along with instructional videos that seem to assume that the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) to explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”

Follow-Up on “Economists Declaring Victory…”

Over one week ago I published a post about some “Economists Declar[ing] Victory for VAMs,” as per an article titled “The Science Of Grading Teachers Gets High Marks,” written by FiveThirtyEight’s “quantitative editor” Andrew Flowers.

Valerie Strauss, author of the Answer Sheet section of The Washington Post, was apparently also busy finding out more about Flowers’ piece, as well as his take, at the same time. She communicated with Flowers via email, after which she communicated with me via email to help her/us respond to Flowers’ claims. These email exchanges, and more, were just published on her Answer Sheet section of The Washington Post here.

For those of you interested in reading the whole thing, do click here. For those of you interested in just the email exchanges, as a follow-up to my previous post here, I’ve pasted the highlights of the conversation for you all below…with compliments to Valerie for including what she viewed as the key points for discussion/thought.

From her post:

I asked Audrey Amrein-Beardsley, a former middle- and high-school mathematics teacher who is now associate professor in Arizona State University’s Mary Lou Fulton Teachers College and a VAM researcher, about the FiveThirtyEight blog post and e-mail comments by Flowers. She earned a Ph.D. in 2002 from Arizona State University in the Division of Educational Leadership and Policy Studies with an emphasis on research methods. She had already written about Flowers’ blog post on her VAMBoozled! blog, which you can see here.

Here are her comments on what Flowers wrote to me in the e-mail. Some of them are technical, as any discussion about formulas would be:

Flowers: “The piece I wrote that was recently published by FiveThirtyEight was focused on a specific type of value-added model (VAM) — the one developed by Chetty, Friedman and Rockoff (CFR). In my reading of the literature on VAMs, including the American Statistical Association’s (ASA) statement, I felt it fair to characterize the CFR research as cutting-edge.”

Amrein-Beardsley: There is no such thing as a “cutting-edge” VAM. Just because Chetty had access to millions of data observations does not make his actual VAM more sophisticated than any of those in use otherwise or in other ways. The fact of the matter is that all states have essentially the same school-level data (i.e., very similar test scores by students over time, links to teachers, and series of typically dichotomous/binary variables meant to capture things like special education status, English language status, free-and-reduced lunch eligibility, etc.). These latter variables are the ones used, or not used depending on the model, for VAM-based analyses. While Chetty used these data and also had access to other demographic data (e.g., IRS data, correlated with other demographic data as well), and he could use these data to supplement the data from NYC schools, the data whether dichotomous or continuous (which is a step in the right direction) still cannot and do not capture all of the things we know from the research that influence student learning, achievement, and more specifically growth in achievement in schools. These are the unquantifiable/uncontrollable variables that (will likely forever) continue to distort the measurement of teachers’ causal effects, and that cannot be captured using IRS data alone. For example, unless Chetty had data to capture teachers’ residual effects (from prior years), out of school learning, parental impacts on learning or a lack thereof, summer learning and decay, etc. it is virtually impossible, no matter how sophisticated any model or dataset is, to make such causal claims. Yes, such demographic variables are correlated with, for example, family income [but] they are not correlated to the extent that they can remove systematic error from the model.

Accordingly, Chetty’s model is no more sophisticated or “cutting-edge” than any other. There are probably, now, five+ models being used today (i.e., the EVAAS, the Value-Added Research Center (VARC) model, the RAND Corporation model, the American Institutes for Research (AIR) model, and the Student Growth Percentiles (SGP) model). All of them except for the SGP have been developed by economists, and they are likely just as sophisticated in their design (1) given minor tweaks to model specifications and (2) given various data limitations and restrictions. In fact, the EVAAS, because it’s been around for over twenty years (in use in Tennessee since 1993, and in years of development prior), is probably considered the best and most sophisticated of all VAMs, and because it’s now run by the SAS analytics software corporation, I (and likely many other VAM researchers) would likely put our money down on that model any day over Chetty’s model, if both had access to the same dataset. Chetty might even agree with this assertion, although he would disagree with the EVAAS’s (typical) lack of use of controls for student background variables/demographics, a point of contention that has been debated, now, for years, with research evidence supporting both approaches; hence, the intense debates about VAM-based bias, now also going on for years.

Flowers: “So, because the CFR research is so advanced, much of the ASA’s [American Statistical Association’s] critique does not apply to it. In its statement, the ASA says VAMs “generally… not directly measure potential teacher contributions toward other student outcomes” (emphasis added). Well, this CFR work I profiled is the exception — it explicitly controls for student demographic variables (by using millions of IRS records linked to their parents). And, as I’ll explain below, the ASA statement’s point that VAMs are only capturing correlation, not causation, also does not apply to the CFR model (in my view). The ASA statement is still smart, though. I’m not dismissing it. I just thought — given how superb the CFR research was — that it wasn’t really directed at the paper I covered.”

Amrein-Beardsley: This is based on the false assumption, addressed above, that Chetty’s model is “so advanced” or “cutting edge,” or now as written here “superb.” When you appropriately remove or reject this assumption, ASA’s critique applies to Chetty’s model along with the rest of them. Should we not give credit to the ASA for taking into consideration all models when they wrote this statement, especially as they wrote their statement well after Chetty’s model had hit the public? Would the ASA not have written, somewhere, that their critique applies to all models “except for” the one used by Chetty et al because they too agreed this one was exempt from their critiques? This singular statement is absurd in and of itself, as is the statement that Flowers isn’t “dismissing it.” I’m sure the ASA would be thrilled to hear. More specifically, the majority of models “explicitly control for student demographics” — Chetty’s model is by far not the only one (see the first response above, as again, this is one of the most contentious issues going). Given this, and the above, it is true that all “VAMs are only capturing correlation, not causation,” and all VAMs are doing this at a mediocre level of quality. The true challenge, should Chetty take it on, would be to put his model up against the other VAMs mentioned above, using the same NYC school-level dataset, and prove to the public that his model is so “cutting-edge” that it does not suffer from the serious issues with reliability, validity, bias, etc. with which all other modelers are contending. Perhaps Flowers’ main problem in this piece is that he conflated model sophistication with dataset quality, whereby the former is likely no better (or worse) than any of the others.

Lastly, for what “wasn’t really direct[ed] at the paper [Flowers] covered”…let’s talk about the 20+ years of research we have on VAMs that Flowers dismissed, implicitly in that it was not written by economists, whereas Jesse Rothstein was positioned as the only respected critic of VAMs. My best estimate, and I’ll stick with it today, is that approximately 90 percent of all value-added researchers, including econometricians and statisticians alike, have grave concerns about these models, and consensus has been reached regarding many of their current issues. Only folks like Chetty and Kane (the two pro-VAM scholars), however, were positioned as leading thought and research in this area. Flowers, before he wrote such a piece, really should have done more homework. This also includes the other critiques of Chetty’s work, not mentioned whatsoever in this piece albeit very important to understanding it (see, for example, here, here, here, and here).

Flowers: “That said, I felt like the criticism of the CFR work by other academic economists, as well as the general caution of the ASA, warranted inclusion — and so I reached out to Jesse Rothstein, the most respected “anti-VAM” economist, for comment. I started and ended the piece with the perspective of “pro-VAM” voices because that was the peg of the story — this new exchange between CFR and Rothstein — and, if one reads both papers and talks to both sides, I though it was clear how the debate tilted in the favor of CFR.”

Amrein-Beardsley: Again, why only the critiques of other “academic economists,” or actually just one other academic economist to be specific (i.e., Jesse Rothstein, who most would agree is “the most respected ‘anti-VAM’ economist”)? Everybody knows Chetty and Kane (the other economist to whom Flowers “reached out”) are colleagues/buddies and very much on the same page and side of all of this, so Rothstein was really the only respected critic included to represent the other side. All of this is biased in and of itself (see also studies above for economists’ and statisticians’ other critiques), and quite frankly insulting to/marginalizing of the other well-respected scholars also conducting solid empirical research in this area (e.g., Henry Braun, Stephen Raudenbush, Jonathan Papay, Sean Corcoran). Nonetheless, this “new exchange” between Chetty and Rothstein is not “new” as claimed. It actually started back in October to be specific (see, here, for example). I too have read both papers and talked to both sides, and would hardly say it’s “clear how the debate” tilts either way. It’s educational research, and complicated, and not nearly as objective, hard, conclusive, or ultimately victorious as Flowers claims.

Flowers: “Now, why is that? I think there are two (one could argue three) empirical arguments at stake here. First, are the CFR results, based on NYC public schools, reproducible in other settings? If not — if other researchers can’t produced similar estimates with different data — then that calls it into question. Second, assuming the reproducibility bar is passed, can the CFR’s specification model withstand scrutiny; that is, is CFR’s claim to capture teacher value-added in isolation of all other factors (e.g., demographic characteristics, student sorting, etc.) really believable? This second argument is less about data than about statistical modeling…What I found was that there was complete agreement (even by Rothstein) on this first empirical argument. CFR’s results are reproducible even by their critics, in different settings (Rothstein replicated in North Carolina). That’s amazing, right? “

Amrein-Beardsley: These claims are actually quite interesting in that there is a growing set of research evidence that all models, using the same datasets, actually yield similar results. It’s really no surprise, and certainly not “amazing,” that Kane replicated Chetty’s results, or that Rothstein replicated them, more or less, as well. Even what some argue is the least sophisticated VAM (although some would cringe calling it a VAM), the Student Growth Percentiles (SGP) model, has demonstrated itself, even without using student demographics in model specifications/controls, to yield similar output when the same datasets are used. One of my doctoral students, in fact, ran five different models using the same dataset and yielded inter/intra correlations that some could actually consider “amazing.” That is because, as at least some contend, these models are quite similar, and yield similar results given their similarities, and also their limitations. Some even go as far as calling all such models “garbage in, garbage out” systems, given the test data they all (typically) use to generate VAM-based estimates, and almost regardless of the extent to which model specifications differ. So replication, in this case, is certainly not the cat’s meow. One must also look to other traditional notions of educational measurement: reliability/consistency (which is not at high-enough levels, especially across teacher types), validity (which is not at high-enough levels, especially for high-stakes purposes), etc., in that “replicability” alone is more common than Flowers (and perhaps others) might assume. Just like it takes multiple measures to get at teachers’ effects, it takes multiple measures to assess model quality. Relying on replication alone is remiss.
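For readers curious what checking agreement across models might look like in practice, here is a minimal sketch. It is not the analysis my doctoral student ran; the model names and teacher-level estimates below are simulated placeholders, and the only point is to show how pairwise correlations between the outputs of several models, run on the same teachers, could be computed.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(estimates):
    """Correlate every pair of models.

    `estimates` maps a model name to a list of teacher-level estimates,
    with teachers listed in the same order for every model.
    """
    return {
        (a, b): pearson(estimates[a], estimates[b])
        for a, b in combinations(sorted(estimates), 2)
    }


# Simulated (hypothetical) data: three made-up models that all draw on the
# same underlying test-score signal, each adding its own noise.
rng = random.Random(0)
shared_signal = [rng.gauss(0, 1) for _ in range(200)]
estimates = {
    name: [s + rng.gauss(0, 0.3) for s in shared_signal]
    for name in ("model_a", "model_b", "model_c")
}

correlations = pairwise_correlations(estimates)
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

Because the simulated models share the same input signal, their outputs correlate highly almost by construction, which is the "garbage in, garbage out" concern in miniature: high agreement between models says little about whether any of them is valid.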

Flowers: “For those curious about this third empirical argument, I would refer anyone back to CFR’s second paper in (American Economic Review 2014b), where they impressively demonstrate how students taught by teachers with high VAM scores, all things equal, grow up to have higher earnings (through age 28), avoid teen pregnancy at greater rates, attend better colleges, etc. This is based off an administrative data set from the IRS — that’s millions of students, over 30 years. Of course, it all hinges on the first study’s validity (that VAM is unbiased)— which was the center of debate between Rothstein and CFR.”

Amrein-Beardsley: The jury is definitely still out on this, across all studies…. Plenty of studies demonstrate (with solid evidence) that bias exists and plenty others demonstrate (with solid evidence) that it doesn’t.

Flowers: “Long story, short: the CFR research has withstood criticism from Rothstein (a brilliant economist, whom CFR greatly respects), and their findings were backed up by other economists in the field (yes, some of them do have a “pro-VAM” bias, but such is social science).”

Amrein-Beardsley: Long story, short: the CFR research has [not] withstood criticism from Rothstein (a brilliant economist, whom CFR [and many others] greatly respect), and their findings were backed up by other economists [i.e., two to be exact] in the field (yes, some of them [only Chetty’s buddy Kane] do have a “pro-VAM” bias, but such is social science). Such is the biased stance taken by Flowers in this piece, as well.

Flowers: “If one really wants to poke holes in the CFR research, I’d look to its setting: New York City. What if NYC’s standardized test are just better at capturing students’ long-run achievement? That’s possible. If it’s hard to do what NYC does elsewhere in the U.S., then CFR’s results may not apply.”

Amrein-Beardsley: First, plenty of respected researchers have already poked what I would consider “enough” holes in the CFR research. Second, Flowers clearly does not know much about current standardized tests in that they are all constructed under contract with the same testing companies, they all include the same types of items, they all measure (more or less) the same set of standards… they all undergo the same sets of bias, discrimination, etc. analyses, and the like. As for their capacities to measure growth, they all suffer from a lack of horizontal, but more importantly, vertical equating; their growth outputs are all distorted because the tests (from pre to post) all capture one full year of growth; and they cannot isolate teachers’ residuals, summer growth/decay, etc. given that the pretests are not given the same year, within the same teacher’s classroom.

VAM Scholars and An (Unfortunate) List of VAMboozlers

Some time ago I posted a list of those I consider the (now) top 35 VAM Scholars whose research folks out there should be following, especially if they need good (mainly) peer-reviewed research to help them and others (e.g., local, regional, and state policymakers) become more informed about VAMs and their related policies. If you missed this post, do check out these VAM Scholars here.

Soon after, a colleague suggested that I follow this list up with a list of what I termed in a prior post, as appropriate to this blog, the VAMboozlers.

VAMboozlers are VAM Scholars whose research I would advise consumers to consume carefully. These researchers might be (in my opinion) prematurely optimistic about the potentials of VAMs contrary to what approximately 90% of the empirical research in this area would support; these scholars might use methods that over-simplistically approach very complex problems and accordingly make often sweeping, unwarranted, and perhaps invalid assertions regardless; these folks might have financial or other vested interests in the VAMs being adopted and implemented; or the like.

While I aim to keep this section of the blog as professional and fair, open, and aboveboard as possible, I simultaneously hope to make such information on this blog more actionable and accessible for blog followers and readers.

Accordingly, here is my (still working) list of VAMboozlers:

*If you have any recommendations for this list, please let me know

Data Secrecy in DC Continued…

Following a recent post on “Data Secrecy Violating Data Democracy in DC Public Schools (DCPS),” the lawyer(s) from Washington DC sent me an email, including the actual complaint they filed in DC Superior Court to get access to the DC teacher evaluations. With their permission, I include this complaint here, for those of you who might be interested.

The chronology and description of their information request is detailed in the complaint, and the chronology of the attempt to codify the FOIA exemption (under Mayor Bowser) follows (also, as per the above-mentioned lawyer(s)):

On Feb 20, 2015, the American Federation of Teachers (AFT) and Washington Teachers Union (WTU) appealed to Mayor Bowser to require DCPS and DC’s Office of the State Superintendent of Education (OSSE) to turn over the state’s teacher IMPACT evaluation scores (with names redacted) for school years 2009-10 through 2013-14. On March 3, 2015, emergency legislation was introduced (i.e., legislation in support of “a radical new secrecy provision to hide the information that’s being used to make [such] big decisions”). On March 18, 2015, Mayor Bowser denied AFT/WTU’s appeal for teacher IMPACT scores (again, with names redacted). On March 30, 2015, Mayor Bowser signed the emergency legislation exempting educator evaluations and effectiveness ratings from being disclosed. On April 14, 2015, AFT/WTU filed suit to overturn the decision of DCPS and Mayor Bowser. On June 2, 2015, the permanent legislation exempting educator evaluations from FOIA was placed in the DC budget bill “at the request of the Mayor.”

As it also turns out, the prior mayor (Mayor Gray) introduced “emergency” legislation in 2014 to keep teacher evaluations exempt from FOIA as well, and this legislation was actually about to expire when Mayor Bowser recently introduced the emergency, and now permanent, legislation. Mayor Gray’s justification was different than current Mayor Bowser’s, however. According to the legislative history, under former Mayor Gray’s watch, emergency legislation was needed to keep teacher evaluations secret because charter schools throughout DC were refusing to turn over their teacher evaluations to the OSSE, out of fear that the OSSE would release them (e.g., like they did in Los Angeles Unified, via the Los Angeles Times).

Nonetheless, Mayor Gray felt that neither he nor OSSE could compel the charters to turn over their teacher evaluations. Now, Mayor Bowser wants permanent legislation that would exempt teacher evaluations from FOIA, but Mayor Bowser and DCPS are both arguing that the legislation would only apply to charters.

Why? It is not clear. The proposed legislation does not limit the exemption, but rather states: “Individual educator evaluations and effectiveness ratings, observation, and value-added data collected or maintained by OSSE are not public records and shall not be subject to disclosure…” It is also important to note, though, that charter operators can use whatever evaluation system or performance measures they want. So they are also exempt, in general.

Kaya Henderson, the DCPS Chancellor (formerly Michelle Rhee’s Deputy Chancellor), was on NPR last week on The Politics Hour hosted by Kojo Nnamdi, during which she also insisted that the new legislation was only to apply to charters.

The WTU President, Liz Davis, will be on Kojo’s show this Thursday to address the DCPS IMPACT evaluations and collective bargaining (CBA) negotiations.

The Silencing of the Educators: A Shocking Idea, and Trending

In a recent post I published titled, “New Mexico UnEnchanted,” I described a great visit I recently made to Las Cruces to meet with students, parents, teachers, school board members, state leaders, and the like. In this post, I also described something I found shocking as I had never heard of this before. Under the “leadership” of Hanna Skandera — former Florida Deputy Commissioner of Education under former Governor Jeb Bush and head of the New Mexico Public Education Department — teachers throughout the state are being silenced.

New Mexico now requires teachers to sign a contractual document stating that they are not to “diminish the significance or importance of the tests” (see, for example, slide 7 here) or they could lose their jobs. Teachers are not to speak negatively about the tests in their classrooms or in public; if they do, they could be found in violation of their contracts. At my main presentation in New Mexico, a few teachers even approached me afterward “in secret,” whispering their concerns in fear of being “found out.” Rumor also has it that Hanna Skandera has requested the names and license numbers of any teachers who have helped or encouraged students to protest the state’s “new” PARCC test(s), as well.

One New Mexico teacher asked whether “this is a quelling of free speech and professional communication?” I believe it most certainly is a Constitutional violation. I am also shocked to now find out that something quite similar is occurring in my state of Arizona.

Needless to say, neither of our states (nor many states typically in the sunbelt, for that matter) is short on bad ideas, but this is getting absolutely ridiculous, especially as this silencing of the educators seems to be yet another bad idea that is actually trending.

As per a recent article in our local paper – The Arizona Republic – Arizona “legislators want to gag school officials” in an amendment to Senate Bill 1172 that will prohibit “an employee of a school district or charter school, acting on the district’s or charter school’s behalf, from distributing electronic materials to influence the outcome of an election or to advocate support for or opposition to pending or proposed legislation.”

The charge is also that this is a retaliatory move by AZ legislators, made in response to a series of recent protests over serious budget cuts several weeks ago. “Perhaps [this is] to keep [educators] from talking about how the legislature has shortchanged Arizona’s school kids by hundreds of millions of dollars since the recession, and how the legislature is still making it nearly impossible for many districts to take care of even [schools’] most basic needs.”

In addition, is this even Constitutional? An Arizona School Boards Association (ASBA) spokesperson is cited as responding, saying “SB 1172 raises grave constitutional concerns. It may violate school and district officials free speech rights and almost certainly chills protected speech by school officials and the parents and community members that interact with them. It will freeze the flow of information to the public that seeks to ascertain the impact of pending legislation on their schools and children’s education.”

As per a related announcement released by the ASBA, this “could have a chilling effect on the free speech rights of school and district officials” throughout the state but also (likely) beyond if this continues to catch on. School officials may be held “liable for a $5,000 civil fine just for sharing information on the positive or negative impacts of proposed legislation to parents or reporters.”

Time to fight back, again. If you are a citizen of Arizona (citizens only) and feel that the Arizona community (and potentially beyond) is entitled to the free flow of information and that free speech is worth protecting, click here to contact your legislators to oppose SB1172.

Houston, We Have A Problem: New Research Published about the EVAAS

New VAM research was recently published in the peer-reviewed Education Policy Analysis Archives journal, titled “Houston, We Have a Problem: Teachers Find No Value in the SAS Education Value-Added Assessment System (EVAAS®).” This article was published by a former doctoral student of mine, Clarin Collins, who is now a researcher at a large non-profit. I asked her to write a guest post for you all summarizing the full study (linked again here). Here is what she wrote.

Given that I work in the field of philanthropy, completed a doctoral program more than two years ago, and recently became a new mom, you might question why I worked on an academic publication and am writing about it here as a guest blogger. My motivation is simple: the teachers. Teachers continue to be at the crux of the national education reform efforts as they are blamed for the nation’s failing education system and student academic struggles. National and state legislation has been created and implemented as believed remedies to “fix” this problem by holding teachers accountable for student progress as measured by achievement gains.

While countless researchers have highlighted the faults of teacher accountability systems and growth models (unfortunately to fall on the deaf ears of those mandating such policies), very rarely are teachers asked how such policies play out in practice, or for their opinions, as representing their voices in all of this. The goal of this research, therefore, was first, to see how one such teacher evaluation policy is playing out in practice and second, to give voice to marginalized teachers, those who are at the forefront of these new policy initiatives. That being said, while I encourage you to check out the full article [linked again here], I highlight key findings in this summary, using the words of teachers as often as possible to permit them, really, to speak for themselves.

In this study I examined the SAS Education Value-Added Assessment System (EVAAS) in practice, as perceived and experienced by teachers in the Southwest School District (SSD). SSD [a pseudonym] is using EVAAS for high-stakes consequences more than any other district or state in the country. I used a mixed-method design including a large-scale electronic survey to investigate the model’s reliability and validity; to determine whether teachers used the EVAAS data in formative ways as intended; to gather teachers’ opinions on EVAAS’s claimed benefits and statements; and to understand the unintended consequences that might have also occurred as a result of EVAAS use in SSD.

Results revealed that the EVAAS model produced split and inconsistent results among teacher participants, regardless of subject or grade level taught, calling the model’s reliability into question. As one teacher stated, “In three years, I was above average, below average and average.” Teachers indicated that it was the students and their varying background demographics who biased their EVAAS results, and much that was demonstrated via their scores was beyond the control of teachers. “[EVAAS] depends a lot on home support, background knowledge, current family situation, lack of sleep, whether parents are at home, in jail, etc. [There are t]oo many outside factors – behavior issues, etc.” that apparently are not controlled or accounted for in the model.
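To make the reliability concern concrete, the following is a minimal sketch of one simple consistency check: the share of consecutive-year pairs in which a teacher received the same rating. It is an illustration only, not the analysis used in the study; the teacher labels and ratings are hypothetical, with the first entry merely mirroring the pattern in the teacher’s quote above.

```python
def consistency_rate(ratings_by_teacher):
    """Share of consecutive-year pairs in which a teacher kept the same rating.

    `ratings_by_teacher` maps a teacher label to that teacher's ratings in
    chronological order. Returns NaN when there are no year pairs to compare.
    """
    same = 0
    total = 0
    for ratings in ratings_by_teacher.values():
        for previous, current in zip(ratings, ratings[1:]):
            total += 1
            if previous == current:
                same += 1
    return same / total if total else float("nan")


# Hypothetical ratings for illustration only.
ratings = {
    "teacher_1": ["above average", "below average", "average"],
    "teacher_2": ["average", "average", "average"],
}

rate = consistency_rate(ratings)
print(f"Share of year-to-year pairs with the same rating: {rate:.2f}")
```

A rate near 1 would suggest stable ratings from year to year; the lower it falls, the harder it is to defend attaching high-stakes consequences to any single year’s score.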

Teachers reported dissimilar EVAAS and principal observation scores, reducing the criterion-related validity of both measures of teacher quality. Some even reported that principals changed their observation scores to match their EVAAS scores; “One principal told me one year that even though I had high [state standardized test] scores and high Stanford [test] scores, the fact that my EVAAS scores showed no growth, it would look bad to the superintendent.” Added another teacher, “I had high appraisals but low EVAAS, so they had to change the appraisals to match lower EVAAS scores.”

The majority of teachers disagreed with SAS’s marketing claims, such as that EVAAS reports are easy to use to improve instruction and that EVAAS will ensure growth opportunities for all students. Teachers called the reports “vague” and “unclear” and were “not quite sure how to interpret” and use the data to inform their instruction. As one teacher explained, she looked at her EVAAS report “only to guess as to what to do for the next group in my class.”

Many unintended consequences associated with the high-stakes use of EVAAS emerged through teachers’ responses, which revealed, among others, that teachers felt heightened pressure and competition, which they believed reduced morale and collaboration, and encouraged cheating or teaching to the test in an attempt to raise EVAAS scores. Teachers made comments such as, “To gain the highest EVAAS score, drill and kill and memorization yields the best results, as does teaching to the test,” and “When I figured out how to teach to the test, the scores went up,” as well as, “EVAAS leaves room for me to teach to the test and appear successful.”

Teachers realized this emphasis on test scores was detrimental for students, as one teacher wrote, “As a result of the emphasis on EVAAS, we teach less math, not more. Too much drill and kill and too little understanding [for the] love of math… Raising a generation of children under these circumstances seems best suited for a country of followers, not inventors, not world leaders.”

Teachers also admitted they are not collaborating to share best practices as much anymore: “Since the inception of the EVAAS system, teachers have become even more distrustful of each other because they are afraid that someone might steal a good teaching method or materials from them and in turn earn more bonus money. This is not conducive to having a good work environment, and it actually is detrimental to students because teachers are not willing to share ideas or materials that might help increase student learning and achievement.”

While I realize this body of work could simply add to “the shelves” along with those findings of other researchers striving to deflate and demystify this latest round of education reform, if nothing else, I hope the teachers who participated in this study know I am determined to let their true experiences, perceptions of their experiences, and voices be heard.


Again, to find out more information including the statistics in support of the above assertions and findings, please click here to read the full study.

Rothstein, Chetty et al., and (Now) Kane on Bias

Here’s an update to a recent post about research conducted by Berkeley Associate Professor of Economics – Jesse Rothstein.

In Rothstein’s recently released study, he provides evidence that puts the aforementioned Chetty et al. results under a more appropriate light. Rothstein’s charge, again, is that Chetty et al. (perhaps unintentionally) masked evidence of bias in their now infamous VAM-based study, which in turn biased Chetty et al.’s (perpetual) claims that teachers caused effects in student achievement growth over time. These effects, rather, might have been more likely caused by bias given the types of students non-randomly assigned to teachers’ classrooms versus “true teacher effects.”

In addition, while in his study Rothstein replicated Chetty et al.’s overall results using a similar data set, so did Thomas Kane – a colleague of Chetty’s at Harvard who has also been the source of prior VAMboozled! posts here, here, and here. During the Vergara v. California case last summer, the plaintiffs’ team actually used Kane’s (and colleagues’) replication-study results to validate Chetty et al.’s initial results.

However, Rothstein did not replicate Chetty et al.’s findings when it came to bias (the best evidence of this is offered in Rothstein’s study’s Appendix B). Conversely, Kane’s (and colleagues’) study did not, then, have any of the prior year score analyses needed to analyze and assess bias, so the extent to which Chetty et al.’s results were due to bias was then more or less moot.

But after Rothstein released his recent study effectively critiquing Chetty et al. on this point, Kane (and colleagues) released the results Kane presented at the Vergara trial (see here). However, Kane (and colleagues) seemingly released an updated version of “Kane’s” initial results to counter Rothstein, in support of Chetty. In other words, Kane seems to have released his study (perhaps) more in support of his colleague Chetty than in the name of conducting good, independent research.

Oh the tangled web Chetty and Kane (purportedly) continue to weave.

See also Chetty et al.’s direct response to Rothstein here.