Chetty Et Al…Are “Et” It “Al” Again!

As per a recent post about Raj Chetty, the fantabulous study by him and his colleagues (i.e., Chetty, Friedman, and Rockoff) that "Keeps on Giving…," and a recent set of emails exchanged among Raj, Diane, and me about their (implied) Nobel-worthy work in this area, it seems Chetty et al are "et" it "al" again!

Excuse my snarkiness here, as I am still in shock that they took it upon themselves to "take on" the American Statistical Association. Yes, the ASA, as critiqued by Chetty et al. If that's not a sign of how Chetty et al feel about their work in this area, then I don't know what is.

Given my close proximity to this debate, I thought it wise to consult a colleague of mine–one who, in fact, is an economist (like Chetty et al) and who is also an expert in this area–to write a response to their newly released "take" on the ASA's recently released "Statement on Using Value-Added Models for Educational Assessment." Margarita Pivovarova is also an economist conducting research in this area, and I value her perspective more than that of many other potential reviewers out there, as she is both smart and wise when it comes to the careful use and interpretation of data and methods like those at the focus of these papers. Here's what she wrote:

What an odd set of opponents in the interpretation of the ASA statement, this one apparently being a set of scholars who self-selected to answer what they seemingly believed others were asking; that is, "What say Chetty et al on this?"

That being said, the overall tone of this response is reminiscent of a "response to the editor," as if, in a more customary manner, the ASA had reviewed [the aforementioned] Chetty et al paper and Chetty et al had been invited to respond to the ASA. Interestingly here, though, Chetty et al seemed compelled to take a self-assigned "expert" stance on the ASA's statement, regardless of the fact that this statement represents the expertise and collective stance of a whole set of highly regarded and highly deserving scholars in this area, who quite honestly would NOT have asked for Chetty et al's advice on the matter.

This is the ASA, and these are the ASA experts Chetty et al also explicitly marginalize in the reference section of their critique. [Note for yourselves the major lack of alignment between Chetty et al.'s references versus the 70 articles linked to here on this website]. This should give others pause in terms of why Chetty et al are so extremely selective in their choices of relevant literature – the literature on which they rely "to prove" some of their own statements and "to prove" some of the ASA's statements false or suspect. The largest chunk of the best and most relevant literature [as again linked to here] is completely left out of their "critique." This also includes the whole set of peer-reviewed articles published in educational and education finance journals. A search on ERIC (the Educational Resources Information Center) with "value-added" as the keyword for the last 5 years yields 406 entries, and a similar search in JSTOR (a shared digital library) returns 495. This is quite a deep and rich literature to leave out, outright and in its entirety, of Chetty et al's purportedly more informed and expert take on this statement, ESPECIALLY given many in the ASA wrote these aforementioned, marginalized articles! While Chetty et al critique the ASA for not including citations, this is quite customary in a position statement that is meant to be research-based but not cluttered with the hundreds of articles in support of the association's position.

Another point of contention surrounds ASA Point #5: "VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model." I should mention that this was taken from within the ASA statement, interpreted, and then defined by Chetty et al as "ASA Point #5." Notwithstanding, the answer to "ASA Point #5" guides policy implications, so it is important to consider: Do value-added estimates yield causal interpretations, as opposed to just descriptive information? In 2004, the editors of the American Educational Research Association's (AERA) Journal of Educational and Behavioral Statistics (JEBS) published a special issue entirely devoted to value-added models (VAMs). One notable publication in that issue was written by Rubin, Stuart, and Zanutto (2004). Any discussion of the causal interpretations of VAMs cannot be devoid of this foundational article and the potential outcomes framework, or Rubin Causal Model, advanced within this piece and elsewhere (see also Holland, 1986). These authors clearly communicated that value-added estimates cannot be considered causal unless a set of "heroic assumptions" is agreed to and imposed. Accordingly, and pertinent here, "anyone familiar with education will realize that this [is]…fairly unrealistic." Even back then, Rubin et al suggested we switch gears and instead evaluate interventions and reward incentives based on the descriptive qualities of the indicators and estimates derived via VAMs. This is something Chetty et al overlooked here and continue to overlook and neglect in their research.
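For readers less familiar with the Rubin Causal Model, here is a schematic statement of its core idea in generic potential outcomes notation (my rendering, not the notation of any of the papers discussed here):

```latex
% Potential outcomes (Rubin Causal Model), schematic only.
% Y_i(1): student i's outcome if taught by the teacher in question;
% Y_i(0): the same student's outcome under a counterfactual teacher.
\[
  \tau_i \;=\; Y_i(1) - Y_i(0)
\]
% Only one of the two outcomes is ever observed for any given student, so
% interpreting a value-added estimate as the average of \tau_i requires strong
% ("heroic") assumptions about how students are assigned to teachers.
```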

Another concern surrounds how Chetty et al suggest "we" deal with the interpretation of VAM estimates (what they define as ASA Points #2 and #3). The grander question here is: Does one need to be an expert to understand the "truth" behind the numbers? If we, in general, are talking about those who are in charge of making decisions based on these numbers, then the answer is yes, or at least they need to be aware of what the numbers do and do not do! While different levels of sophistication are probably required to develop VAMs versus interpret VAM results, interpretation is a science in itself and ultimately requires background knowledge about the model's design and how the estimates were generated if informed, accurate, and valid inferences are to be drawn. While it is true that using an iPad doesn't require one to understand the technology behind it, the same cannot be said about the interpretation of this set of statistics upon which people's lives increasingly depend.

Related, the ASA rightfully states that the standardized test scores used in VAMs are not the only outcomes that should be of interest to policy makers and stakeholders (ASA Point #4). In view of recent findings on the role of cognitive versus non-cognitive skills in students' future well-being, the current agreement is that test scores might not even be one of the most important outcomes capturing a student's educated self. Addressing this point, Chetty et al, rather, demonstrate how "interpretation matters." In Jackson's (2013) study [not surprisingly cited by Chetty et al, likely because it was published by the National Bureau of Economic Research (NBER) – the same organization that published their now infamous study without it being peer or even internally reviewed way back when], Jackson used value-added techniques to estimate teacher effects on the non-cognitive and long-term outcomes of students. One of his main findings was that teachers who are good at boosting test scores are not always the same teachers who have positive and long-lasting effects on non-cognitive skills acquisition. In fact, value-added to test scores and to non-cognitive outcomes for the same teacher were then and have since been shown to be only weakly correlated [or related] with one another. Regardless, Chetty et al use the results of standardized test scores to rank teachers and then use those rankings to estimate the effects of "highly-effective" teachers on the long-run outcomes of students, REGARDLESS OF THE RESEARCH in which their study should have been situated (excluding the work of Thomas Kane and Eric Hanushek). As noted before, if value-added estimates from standardized test scores cannot be interpreted as causal, then the effect of "high value-added" teachers on college attendance, earnings, and reduced teenage birth rates CANNOT BE CONSIDERED CAUSAL either.

Lastly, the ASA expressed concerns about the sensitivity of value-added estimates to model specifications (ASA Point #6). Likewise, researchers contributing to a now large body of literature have recently explored this same issue, and collectively they have found that value-added estimates are highly sensitive to the assessments being used, even within the same subject areas (Papay, 2011), and across different subject areas taught by the same teachers given different student compositions (Loeb & Candelaria, 2012). Echoing those findings, others (also ironically cited by Chetty et al in their discussion to demonstrate the opposite) have found that "even when the correlations between performance estimates generated by different models are quite high for the workforce as a whole, there are still sizable differences in teacher rankings generated by different models that are associated with the composition of students in a teacher's classroom" (Goldhaber, Walch, & Gabele, 2014). And others have noted that "models that are less aggressive in controlling for student-background and schooling-environment information systematically assign higher rankings to more-advantaged schools, and to individuals who teach at these schools" (Ehlert, Koedel, Parsons, & Podgursky, 2014).

In sum, these are only a few "points" from this "point-by-point discussion" that would, or rather should, immediately strike anyone fairly familiar with the debate surrounding the use and abuse of VAMs as way off-base. While the debates surrounding VAMs continue, it is all too easy for a casual and/or careless observer to lose the big picture behind the surrounding debates and get distracted by technicalities. This certainly seems to be the case here, as Chetty et al continue to press their technical arguments from a million miles away from the actual schools', teachers', and students' data they study, and from which they make miracles.

*To see VAMboozled!’s initial “take” on the same ASA statement, a take that in our minds still stands as it was not a self-promotional critique but only a straightforward interpretation for our followers, please click here.

Consumer Alert: Researchers in New Study Find “Surprisingly Weak” Correlations among VAMs and Other Teacher Quality Indicators

Two weeks ago, an article in the U.S. News and World Report (as well as similar articles in Education Week and linked to from the homepage of the American Educational Research Association [AERA]) highlighted the results of recent research conducted by the University of Southern California's Morgan Polikoff and the University of Pennsylvania's Andrew Porter. The research article was released online here, in the AERA-based, peer-reviewed, and highly esteemed journal Educational Evaluation and Policy Analysis.
As per the study's abstract, the researchers found (and their peer reviewers apparently agreed) that the associations between teachers' instructional alignment and both their contributions to student learning and their effectiveness on VAMs, using data from the Bill & Melinda Gates Foundation's Measures of Effective Teaching (MET) study, were "surprisingly weak," or as per the aforementioned U.S. News and World Report article, "weak to nonexistent."
Researchers, specifically, analyzed the (co)relationships among VAM estimates and observational data, student survey data, and other data pertaining to whether teachers aligned their instruction with state standards. They did this using data taken from 327 fourth and eighth grade math and English teachers in six school districts, again as derived via the aforementioned MET study.
Researchers concluded that “there were few if any correlations that large [i.e., greater than r = 0.3] between any of the indicators of pedagogical quality and the VAM scores. Nor were there many correlations of that magnitude in the main MET study. Simply put, the correlations of value-added with observational measures of pedagogical quality, student survey measures, and instructional alignment were small” (Polikoff & Porter, 2014, p. 13).
Interestingly enough, the research I recently conducted with my current doctoral student (see Paufler & Amrein-Beardsley, here) was used to supplement these researchers' findings. In Education Week, the article's author, Holly Yettick, wrote the following:
In addition to raising questions about the sometimes weak correlations between value-added assessments and other teacher-evaluation methods, researchers continue to assess how the models are created, interpreted, and used.
In a study that appears in the current issue of the American Educational Research Journal, Noelle A. Paufler and Audrey Amrein-Beardsley, a doctoral candidate and an associate professor at Arizona State University, respectively, conclude that elementary school students are not randomly distributed into classrooms. That finding is significant because random distribution of students is a technical assumption that underlies some value-added models.
Even when value-added models do account for nonrandom classroom assignment, they typically fail to consider behavior, personality, and other factors that profoundly influenced the classroom-assignment decisions of the 378 Arizona principals surveyed. That, too, can bias value-added results.

The Study that Keeps on Giving…(Hopefully) in its Final Round

In January I wrote a post about "The Study that Keeps on Giving…" Specifically, this post was about the study conducted and authored by Raj Chetty (Economics Professor at Harvard), John Friedman (Assistant Professor of Public Policy at Harvard), and Jonah Rockoff (Associate Professor of Finance and Economics at Columbia) that was published first in 2011 (in its non-peer-reviewed and not even internally reviewed form) by the National Bureau of Economic Research (NBER) and then published again by NBER in January of 2014 (in the same form) but this time split into two separate studies (see them split here and here).

Their re-release of the same albeit split study was what prompted the title of the initial "The Study that Keeps on Giving…" post. Little did I know then, though, that the reason this study was re-released in split form was that it was soon to be published in a peer-reviewed journal. Its non-peer-reviewed publication status was a major source of prior criticism. While journal editors seem to have suggested the split, NBER seemingly took advantage of this opportunity to publicize this study in two forms, regardless, and without prior explanation.

Anyhow, this came to my attention when the study's lead author – Raj Chetty – emailed me a few weeks ago, copied Diane Ravitch on the same email, and also apparently emailed other study "critics" at the same time (see prior reviews of this study by its other notable "critics" here, here, here, and here) to notify all of us that this study had made it through peer review and was to be published in a forthcoming issue of the American Economic Review. While Diane and I responded to our joint email (as other critics may have done as well), we ultimately promised Chetty that we would not share the actual contents of any of the approximately 20 email exchanges that went back and forth among the three of us over the following days.

What I can say, though, is that no genuine concern was expressed by Chetty, or on behalf of his co-authors, about the intended or unintended consequences that came about as a result of his study, nor about how many policymakers have since used and abused the study's results for political gain and the further advancement of VAM-based policies. Instead, the emails were more or less self-promotional and celebratory, especially given that President Obama cited the study in his 2012 State of the Union Address and that Chetty apparently continues to advise U.S. Secretary of Education Arne Duncan about his similar VAM-based policies. Perhaps, next, the Nobel prize committee might pay this study its due respects, now overdue, but again I only paraphrase from that which I inferred from these email conversations.

As a refresher, Chetty et al. conducted value-added analyses on a massive data set (with over 1 million student-level test and tax records) and presented (highly questionable) evidence that favored teachers' long-lasting, enduring, and in some cases miraculous effects. While some of the findings would have been very welcome to the profession had they indeed been true (e.g., that high value-added teachers substantively affect students' incomes in their adult years), the study's authors overstated their findings, and they did not duly consider (or provide evidence to counter) the alternative hypotheses in terms of what other factors besides teachers might have caused the outcomes they observed (e.g., those things that happen outside of schools while students are in school and throughout students' lives).

Nor did they consider, or rather satisfactorily consider, how the non-random assignment of students into both schools and classrooms might have biased the effects observed, whereby the students in high "value-added" teachers' classrooms might have been more "likely to succeed" regardless of, or even despite, the teacher effect, in terms of both the short- and long-term effects demonstrated in their findings…then widely publicized via the media and beyond throughout other political spheres.

Rather, Chetty et al. advanced what they argued were a series of causal effects by exploiting a series of correlations that they turned attributional. They did this because (I believe) they truly believe that their sophisticated econometric models and the sophisticated controls and approaches they use in fact work as intended. Perhaps this also explains why Chetty et al. give pretty much all credit in the area of value-added research to econometricians, and they do this throughout their papers, all the while over-citing the works of their economic researchers/friends but not the others (besides Berkeley economist Jesse Rothstein, see the full reference to his study here) who have also outright contradicted their findings, with evidence. Apparently, educational researchers do not have much to add on this topic, but I digress.

But this, too, is a serious fault, as "they" (and I don't mean to make sweeping generalizations here) have never been much for understanding what goes into the data they analyze, as socially constructed and largely context dependent. Nor do they seem to care to fully understand the realities of the classrooms from which they receive such data, or what test scores actually mean, or, when using them, what one can and cannot actually infer. This, too, was made clear via our email exchange. It seems this from-the-sky-down view of educational data is the best (as well as the most convenient) approach, one that "they" might even expressly prefer, so that they do not have to get their data fingers dirty and deal with the messiness that always surrounds these types of educational data and always comes into play when conducting most any type of educational research that relies (in this case solely) on students' large-scale standardized test scores.

Regardless, I decided to give this study yet another review to see if, now that it has made it through the peer review process, I was missing something. I wasn't. The studies are pretty much exactly the same as they were when first released (which unfortunately does not say much for peer review). The first study here is about VAM-based bias and how VAM estimates that control for students' prior test scores "exhibit little bias despite the grouping of students," and this despite the number of studies, not referenced or cited, that continue to evidence the opposite. The second study here is about teacher-level value-added and how teachers with a lot of it (purportedly) cause grander things throughout their students' lives. More specifically, they found that "students [non-randomly] assigned to high [value-added] teachers are more likely to attend college, earn higher salaries, and are less likely to have children as teenagers." They also found that "[r]eplacing a teacher whose [value-added] is in the bottom 5% with an average teacher would increase the present value of students' lifetime income by approximately $250,000 per classroom [emphasis added]." Please note that this overstated figure is not per student; had it been broken out by student it would rather have become "chump change," for lack of a better term, which serves as just one example of their classic exaggerations. They do, however, when you read through the actual text, tone their powerful language down a bit to note that, on average, this is more accurately $185,000, still per classroom. Again, to read the more thorough critiques conducted by other scholars with impressive academic profiles, I suggest readers click here, here, here, or here.
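To put the "$250,000 per classroom" figure in per-student perspective, here is the back-of-the-envelope arithmetic (the class size and working-life length below are my own illustrative assumptions, not figures taken from their papers):

```python
# Back-of-the-envelope breakdown of the "$250,000 per classroom" claim.
# Class size and working-life length are illustrative assumptions, not figures
# drawn from Chetty et al.'s papers.
classroom_gain = 250_000   # claimed present value of lifetime income gains, per classroom
class_size = 28            # assumed number of students in the classroom
working_years = 40         # assumed length of a working life

per_student = classroom_gain / class_size   # roughly $8,900 in lifetime income per student
per_year = per_student / working_years      # roughly $220 per student per year, undiscounted

print(f"Per student, lifetime: ${per_student:,.0f}")
print(f"Per student, per working year: ${per_year:,.0f}")
```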

What I did find important to bring to light during this round of review were the assumptions that, thanks to Chetty and his emails, became more obvious (and likewise troublesome) than before. These are the, in some cases, "very strong" assumptions that Chetty et al. make explicit in both of their studies (see Assumptions 1-3 in the first and second papers), along with "evidence" as to why these assumptions should not be rejected (most likely, and in some cases clearly, because their studies relied on them). The assumptions they made were so strong, in fact, that at one point they even mention that it would have been useful could they have "relaxed" some of them. In other cases, they justify their adoption of these assumptions given the data limitations and methodological issues they faced, plainly and simply because there was no other way to conduct (or continue) their analyses without making and agreeing to these assumptions.

So, see for yourselves if you agree with the following three assumptions, the ones they make most explicit and use throughout both studies (although other assumptions are littered throughout both pieces). I would love for Chetty et al. to discuss whether their assumptions in fact hold given the realities the everyday teacher, school, or classroom faces. But again, I digress…

Assumption 1 [Stationarity]: Teacher levels of value-added, as based on growth in student achievement over time, follow a stationary, unchanging, constant, and consistent process. On average, "teacher quality does not vary across calendar years and [rather] depends only on the amount of time that elapses between" years. While completely nonsensical to the average adult with really any common sense, this assumption helped them "simplif[y] the estimation of teacher [value-added] by reducing the number of parameters" needed in their models, or more appropriately needed to model their effects.

Assumption 2 [Selection on Excluded Observables]: Students are sorted or assigned to teachers on excluded observables that can be estimated. See a recent study that I conducted with a doctoral student of mine (just published in this month's issue of the highly esteemed American Educational Research Journal here) in which we found, with evidence, that 98% of the time this assumption is false. Students are non-randomly sorted on "observables" and "non-observables" (most of which are not and cannot be included in such data sets) 98% of the time; both types of variables bias teacher-level value-added over time, given that the statistical procedures meant to control for these variables do not work effectively, especially for students at the extremes, on both sides of the normal bell curve. While convenient, especially when conducting this type of far-removed research, this assumption is false and cannot really be taken seriously given the pragmatic realities of schools.

Assumption 3 [Teacher Switching as a Quasi-Experiment]: Changes in teacher-level value-added scores across cohorts within a school-grade are orthogonal (i.e., non-overlapping, uncorrelated, or independent) to changes in other determinants of student scores. While Chetty et al. themselves write that this assumption "could potentially be violated by endogenous student or teacher sorting to schools over time," they also state that "[s]tudent sorting at an annual frequency is minimal because of the costs of changing schools," which is yet another unchecked assumption without reference(s) in support. They further note that "[w]hile endogenous teacher sorting is plausible over long horizons, the high-frequency changes [they] analyze are likely driven by idiosyncratic shocks such as changes in staffing needs, maternity leaves, or the relocation of spouses." These are all plausible assumptions too, right? Is "high-frequency teacher turnover…uncorrelated with student and school characteristics?" Concerns about this and really all of these assumptions, and ultimately how they impact study findings, should certainly give one pause.
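For readers who prefer to see the flavor of Assumptions 1 and 3 in symbols, here is a schematic rendering in generic notation (mine, not the papers' exact statements):

```latex
% Assumption 1 (stationarity), schematically: the covariance of a teacher j's
% value-added \mu_{jt} across two years depends only on the lag k, not on the
% calendar year t.
\[
  \operatorname{Cov}\!\big(\mu_{jt},\, \mu_{j,t-k}\big) \;=\; \sigma_{\mu}(k)
  \quad \text{for all } t .
\]
% Assumption 3 (teacher switching as a quasi-experiment), schematically: changes
% across cohorts in the mean teacher value-added of school s, grade g, year t
% are uncorrelated with changes in all other determinants \varepsilon of
% student scores.
\[
  \operatorname{Cov}\!\big(\Delta \bar{\mu}_{sgt},\, \Delta \varepsilon_{sgt}\big) \;=\; 0 .
\]
```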

My final point, interestingly enough, also came up during the email exchanges mentioned above. Chetty made the argument that he, more or less, had no dog in the fight surrounding value-added. In the first sentence of the first manuscript, however, he (and his colleagues) wrote, “Are teachers’ impacts on students’ test scores (“value-added”) a good measure of their quality?” The answer soon thereafter and repeatedly made known in both papers becomes an unequivocal “Yes.” Chetty et al. write in the first paper that “[they] established that value-added measures can help us [emphasis added as “us” is undefined] identify which teachers have the greatest ability to raise students’ test scores.” In the second paper, they write that “We find that teacher [value-added] has substantial impacts on a broad range of outcomes.”  Apparently, Chetty wasn’t representing his and/or his colleagues’ honest “research-based” opinions and feelings about VAMs in one place (i.e., our emails) or the other (his publications) very well.

Contradictions…as nettlesome as those dirty little assumptions, I suppose.

Tennessee’s Education Commissioner Gives “Inspiring” TEDxNashville Talk

The purpose of TED talks is, and since their inception in 1984 has been, to share "ideas worth spreading" in the most innovative and engaging of ways. It seems that somebody in the state of Tennessee hijacked this idea, however, and put the Tennessee Education Commissioner (Kevin Huffman) on the TEDxNashville stage to talk about teachers, teacher accountability, and why teachers need to "work harder" to meet higher standards and be held more accountable if they don't.

Watch the video here:

And/Or read the accompanying article here. But I have summarized the key points, as I see them, for you all below in case you’d rather not view/read for yourself.

Before we begin, though, I should make explicit that Commissioner Huffman was formerly an executive with Teach for America (TFA), and it was from his "grungy, non-profit cubicle in DC" that the Governor of Tennessee picked him and ultimately placed him into the state Commissioner position. So, as explained by him, he "wasn't a total novice" in terms of America's public schools because 1) he was in charge of policy and politics at TFA, 2) he taught in Houston for three years through TFA, and 3) he brought with him to Tennessee prior experience "dealing" with Capitol Hill. All of this (and a law degree) made him qualified to serve as Tennessee's Education Commissioner.

This background knowledge might help others understand from where his (misinformed and quite simply uninspiring) sense of "reality" about teachers in America's, and in particular Tennessee's, public schools comes. This should also help to explain some of the comments on which he simply "struck out" – as demonstrated via this video. Oh – and he was previously married to Michelle Rhee, head of StudentsFirst, former Chancellor of Washington D.C.'s public schools, and the source of other VAMboozled! posts here, here, and here. So his ideas/comments, or "strikes," as I've called them below, might actually make sense in this context.

Strike 1 – Huffman formerly wrote an education column for The Washington Post, during which time he endured readers' "anonymous…boo and hiss" comments. This did not change his thinking, however, but rather helped him develop the "thick skin" he needed in preparation for his job as Commissioner. This was yet another "critical prerequisite" for his future leadership role: not to ingest reader feedback and criticism for what it was worth, but rather to reject it and use it as justification to stay the course…and ultimately to celebrate this self-professed obsession, as demonstrated below.

Strike 2 – He has a self-described "nerdy obsession" with data, and results, and focusing on data that lead to results. While he came into Tennessee as a "change agent," however, he noticed that a lot of the changes he desired were already in place…and had already been appropriately validated by the state of Tennessee "winning" the Race to the Top competition and receiving the $500 million that came along with "winning" it. What he didn't note, however, was that while his professed and most important belief was that "results trump all," the state of Tennessee had been carrying this mantra since the early 1990s, when it adopted the Tennessee Value-Added Assessment System (TVAAS) for increased accountability. This was probably well before he even graduated from college.

Regardless, he noted that nothing had been working in Tennessee for all of this time, until he came into his leadership position and forced this policy's fit. "If you can deliver unassailable evidence [emphasis added] that students are learning more and students are benefiting, then those results [should] carry the day." He did not once mention that the state's use of the TVAAS was THE reason the state of Tennessee was among the first to win Race to the Top funds…and that even though it had been following the same "results trump all" mantra since the early 1990s, Tennessee was still ranked 44th in the nation all of those years later, and in some areas ranked lower in terms of its national levels of achievement (i.e., on the National Assessment of Educational Progress [NAEP]), twenty years in…when he took office. Regardless of the state's history, however, Commissioner Huffman went all in on "raising academic standards" and focusing on the state's "evaluation" system. While "there was a bunch of work" previously done in Tennessee on this note, he took these two tenets "more seriously" and "really drove [them] right in."

Strike 3 – After Huffman arrived and visited all 136 school districts in the state, and thereafter took these things "more seriously," there were soon "all of these really good signs" that things were working. The feedback he got in response to his initiatives was all [emphasis added] positive. He kept hearing things like "kids are [now] learning more [and] instruction is getting better," thanks to him, and he would hear the same feedback regardless of whether people liked him and/or his "results trump all" policies. More shocking here, though, was his self-promotion, given that there are currently three major lawsuits in the state, all surrounding the state's teacher evaluation system as carried forth by him. This is the state, in fact, "winning" the "Race to the Lawsuit" competition, should one currently exist.

In terms of the hard, test-based evidence in support of his effectiveness, though, while the state tests were demonstrating the effectiveness of his policies, Huffman was really waiting for the aforementioned NAEP results to see if what he was doing was working. And here's what happened: two years after he arrived, "Tennessee's kids had the most growth of any kids in America."

But here’s what REALLY happened.

While Sanders (the TVAAS developer who first convinced the state legislature to adopt his model for high-stakes accountability purposes in the 1990s) and others (including U.S. Secretary of Education Arne Duncan) also claimed that Tennessee's use of accountability instruments caused Tennessee's NAEP gains (besides the fact that the purported gains were over two decades delayed), others have since spoiled the celebration because 1) the results also demonstrated an expanding achievement gap in Tennessee; 2) the state's lowest socioeconomic students continue to perform poorly, despite Huffman's claims; 3) Tennessee didn't make gains significantly different from those of many other states; and 4) other states with similar accountability instruments and policies (e.g., Colorado, Louisiana) did not make similar gains, while states without such instruments and policies (e.g., Kentucky, Iowa, Washington) did. I should add that Kentucky's achievement gap is also narrowing and its lowest socioeconomic students have made significant gains. This is important to note, as Huffman repeatedly compares his state to theirs.

Claiming that NAEP scores increased because of TVAAS-use, and other stringent accountability policies developed by Huffman, was and continues to be unwarranted (see also this article in Education Week).

So my apologies to Tennessee because it appears that TEDx was a TED strike-out this time. Tennessee’s Commissioner struck out, and the moot pieces of evidence supporting these three strikes will ultimately unveil their true selves in the very near future. For now, though, there is really nothing to celebrate, even if Commissioner Huffman brings you cake to convince you otherwise (as referenced in the video).

In the end, this – "the Tennessee story" (as Huffman calls it) – reminds me of a story from way back, when then-Governor George W. Bush spoke about "the Miracle in Texas" in an all too familiar light…he also ultimately "struck out." Bush talked then of all of the indicators, similar to those here, that were to be celebrated thanks to his (which was really Ross Perot's) similar high-stakes policy initiative. While the gains in Texas were, at the same time, evidenced to have been artificially inflated, this never hit the headlines. Rather, these artificial results helped President George W. Bush advance No Child Left Behind (NCLB) at the national level.

As we are all now aware, or should be now aware, NCLB, now more than 10 years on, never demonstrated its intended consequences; rather, it demonstrated only negative, unintended consequences. In addition, the state of Texas, in which such a similar miracle occurred and where NCLB was first conceived, is now performing about average as compared to the nation…and losing ground.

Call me a cynic, call me a realist…but there it is.

VAMs and the “Dummies In Charge:” A Clever “Must Read”

Peter Greene wrote a very clever, poignant, and to-the-point article about VAMs, titled "VAMs for Dummies," in The Blog of The Huffington Post. While I have already tweeted, facebooked, and shared this one, and done everything else short of printing it out for my files, I thought it imperative I also share it out with you. Also – Greene gave us at VAMboozled! a wonderful shout out, directing readers here to find out more. So Peter — even though I've never met you — thanks for the kudos and keep it coming. This is a fabulous piece!

Click here to read his piece in full. I’ve also pasted it below, mainly because this one is a keeper. See also a link to a “nifty” 3-minute video on VAMs below.

If you don’t spend every day with your head stuck in the reform toilet, receiving the never-ending education swirly that is school reformy stuff, there are terms that may not be entirely clear to you. One is VAM — Value-Added Measure.

VAM is a concept borrowed from manufacturing. If I take one dollar’s worth of sheet metal and turn it into a lovely planter that I can sell for ten dollars, I’ve added nine dollars of value to the metal.

It’s a useful concept in manufacturing management. For instance, if my accounting tells me that it costs me ten dollars in labor to add five dollars of value to an object, I should plan my going-out-of-business sale today.

And a few years back, when we were all staring down the NCLB law requiring that 100 percent of our students be above average by this year, it struck many people as a good idea — let’s check instead to see if teachers are making students better. Let’s measure if teachers have added value to the individual student.

There are so many things wrong with this conceptually, starting with the idea that a student is like a piece of manufacturing material and continuing on through the reaffirmation of the school-is-a-factory model of education. But there are other problems as well.

1) Back in the manufacturing model, I knew how much value my piece of metal had before I started working my magic on it. We have no such information for students.

2) The piece of sheet metal, if it just sits there, will still be a piece of sheet metal. If anything, it will get rusty and less valuable. But a child, left to its own devices, will still get older, bigger, and smarter. A child will add value on its own, out of thin air. Almost like it was some living, breathing sentient being and not a piece of raw manufacturing material.

3) All pieces of sheet metal are created equal. Any that are too not-equal get thrown in the hopper. On the assembly line, each piece of metal is as easy to add value to as the last. But here we have one more reformy idea predicated on the idea that children are pretty much identical.

How to solve these three big problems? Call the statisticians!

This is the point at which that horrifying formula that pops up in these discussions appears. Or actually, a version of it, because each state has its own special sauce when it comes to VAM. In Pennsylvania, our special VAM sauce is called PVAAS [i.e., the EVAAS in Pennsylvania]. I went to a state training session about PVAAS in 2009 and wrote about it for my regular newspaper gig. Here's what I said at the time about how the formula works:

PVAAS uses a thousand points of data to project the test results for students. This is a highly complex model that three well-paid consultants could not clearly explain to seven college-educated adults, but there were lots of bars and graphs, so you know it’s really good. I searched for a comparison and first tried “sophisticated guess;” the consultant quickly corrected me–“sophisticated prediction.” I tried again–was it like a weather report, developed by comparing thousands of instances of similar conditions to predict the probability of what will happen next? Yes, I was told. That was exactly right. This makes me feel much better about PVAAS, because weather reports are the height of perfect prediction.

Here’s how it’s supposed to work. The magic formula will factor in everything from your socio-economics through the trends over the past X years in your classroom, throw in your pre-testy thing if you like, and will spit out a prediction of how Johnny would have done on the test in some neutral universe where nothing special happened to Johnny. Your job as a teacher is to get your real Johnny to do better on The Test than Alternate Universe Johnny would.

See? All that’s required for VAM to work is believing that the state can accurately predict exactly how well your students would have done this year if you were an average teacher. How could anything possibly go wrong??

And it should be noted — all of these issues occur in the process before we add refinements such as giving VAM scores based on students that the teacher doesn’t even teach. There is no parallel for this in the original industrial VAM model, because nobody anywhere could imagine that it’s not insanely ridiculous.

If you want to know more, the interwebs are full of material debunking this model, because nobody — I mean nobody — believes in it except politicians and corporate privateers. So you can look at anything from this nifty three minute video to the awesome blog Vamboozled by Audrey Amrein-Beardsley.

This is one more example of a feature of reformy stuff that is so top-to-bottom stupid that it’s hard to understand. But whether you skim the surface, look at the philosophical basis, or dive into the math, VAM does not hold up. You may be among the people who feel like you don’t quite get it, but let me reassure you — when I titled this “VAM for Dummies,” I wasn’t talking about you. VAM is always and only for dummies; it’s just that right now, those dummies are in charge.
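For anyone who wants to see Greene's "Alternate Universe Johnny" logic in the simplest possible terms, here is a toy sketch of the basic predicted-versus-actual calculation (purely illustrative; it is not PVAAS, EVAAS, or any state's actual model, and every number and variable name below is made up):

```python
# Toy illustration of the "Alternate Universe Johnny" logic: predict each
# student's score from prior data, then credit (or blame) the teacher for the
# gap between actual and predicted scores. Operational VAMs are far more
# elaborate (and often proprietary); this is intuition only.
import numpy as np

rng = np.random.default_rng(0)

# A district-wide sample used to build the prediction model.
n = 1000
prior_score = rng.normal(500, 50, n)      # last year's test score
ses = rng.normal(0, 1, n)                 # stand-in "socio-economics" control
score = 0.9 * prior_score + 5 * ses + 60 + rng.normal(0, 20, n)

X = np.column_stack([np.ones(n), prior_score, ses])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)   # simple OLS prediction model

# One teacher's classroom of 25 students, simulated to score a bit above prediction.
m = 25
c_prior = rng.normal(500, 50, m)
c_ses = rng.normal(0, 1, m)
c_score = 0.9 * c_prior + 5 * c_ses + 60 + 4.0 + rng.normal(0, 20, m)  # +4 "teacher effect"

c_pred = np.column_stack([np.ones(m), c_prior, c_ses]) @ coef
value_added = (c_score - c_pred).mean()   # average gap: real Johnny vs. Alternate Universe Johnny
print(f"Estimated 'value added' for this classroom: {value_added:.1f} points")
```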

 

Breaking News: Houston Teachers Suing over their District’s EVAAS Use

It’s time!! While lawsuits are emerging across the nation (namely three in Tennessee and one in Florida), one significant and potentially huge lawsuit was just filed in federal court this past Wednesday in Houston on behalf of seven teachers working in the Houston Independent School District.

The one place where I have conducted quite extensive research on VAMs, and on the Education Value-Added Assessment System (EVAAS) and its use in this district in particular, is Houston. Accordingly, I'm honored to report that I'm helping folks move forward with their case.

Houston, the 7th largest urban district in the country, is widely recognized for its (inappropriate) use of the EVAAS for consequential decision-making purposes (e.g., teacher merit pay and, in the case of this article, teacher termination), more so than anywhere else in the nation. Read a previous article about this in general, and also about four teachers who were fired in Houston due in large part to their EVAAS scores, here. See also a 12-minute YouTube video about these same teachers here.

According to a recent post in The Washington Post, the teachers/plaintiffs are arguing that EVAAS output is inaccurate, that the EVAAS is unfair, that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach, that the EVAAS system fails to control for student-level factors that impact how well teachers perform but that are outside of teachers' control (e.g., parental effects), that the EVAAS is incomprehensible and hence very difficult if not impossible to actually use to improve instruction, and, accordingly, that teachers' due process rights are being violated because teachers do not have adequate opportunities to change as a result of their EVAAS results.

"The suit further alleges that teachers' rights to equal protection under the Constitution are being abridged because teachers with below-average scores find themselves receiving harsher scores on a separate measure of instructional practice." That is, teachers are claiming that their administrators are changing, and therefore distorting, other measures of their effectiveness (e.g., observational scores) because administrators are being told to trust the EVAAS data and to force alignment between the EVAAS and observational output, in order to (1) manufacture greater alignment between the two and (2) artificially inflate what are currently the very low correlations being observed between the two.

As well, and unique to the EVAAS but also very interesting, recalibration of the model is causing teachers' "effectiveness" scores to adjust retroactively when additional data are added to the longitudinal data system. This, according to EVAAS technicians, helps to make the system more rigorous, but in reality it causes teachers' scores from prior years to change after the fact. So yes, a teacher who might be deemed ineffective one year, and then penalized, might have his/her prior "ineffective" score retroactively changed to "effective" for that same year once more recent data are added and re-calibrated to refine the EVAAS output. Sounds complicated, I mean comical, but it's unfortunately not.
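To see how adding a year of data can retroactively move an earlier estimate, here is a minimal sketch using a generic shrinkage-style estimator (purely illustrative; the actual EVAAS algorithms are proprietary and far more elaborate, and every number below is made up):

```python
# Illustration only (not SAS EVAAS's actual algorithm): a shrinkage-style
# estimate of a teacher's effect is re-computed whenever a new year of data
# arrives, so the score attached to an earlier year can change after the fact.
import numpy as np

def shrunken_estimate(classroom_mean_gains, shrinkage=2.0):
    """Pull the raw mean gain toward zero; less data means more pulling."""
    gains = np.asarray(classroom_mean_gains, dtype=float)
    n = len(gains)
    return n / (n + shrinkage) * gains.mean()

year1_gain = [-4.0]              # a weak-looking first year
year2_gain = [-4.0, 6.0]         # a stronger second year is added later

print(shrunken_estimate(year1_gain))   # estimate reported after year 1: about -1.33
print(shrunken_estimate(year2_gain))   # re-estimated after year 2: about 0.50
# The teacher's year-1 standing is, in effect, revised upward retroactively.
```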

This one is definitely one to watch, and being closest to this one, I will certainly keep everyone updated as things progress, as best I can. This, in my mind, is the lawsuit that "could affect [VAM-based] evaluation systems well beyond Texas."

Yes!! Let’s sure hope so…assuming all goes well, as it should, as this lawsuit is to be based on the research.

To watch an informative short (2:30) video produced by ABC news in Houston, click here. To download the complete lawsuit click here.

 

States Most Likely to Receive Race to the Top Funds

Since 2009, the US Department of Education, via its Race to the Top initiative, has given literally billions in federal taxpayer funds to incentivize states to adopt its various educational policies, many of them neither research-based nor research-informed. As pertinent here, the main "reform" has been VAMs, with funding going to states that have them, are willing to adopt them, and are willing to use them for low-stakes and preferably high-stakes decisions about teachers, schools, and districts.

Diane Ravitch recently posted a piece about the Education Law Center finding that there was an interesting pattern to the distribution of Race to the Top grants. The Education Law Center, in an Education Justice article, found that the states and districts with the least fair and equitable state school finance systems were the states that won a large share of RTTT grants.

Interesting, indeed, but not surprising. There is an underlying reason for this, as based on standard correlations anybody can run or calculate using state-level demographics and some basic descriptive statistics.

In this case, correlational analyses reveal that state-level policies that rely at least in part on VAMs are indeed more common in states that allocate less money for schooling than the national average. More specifically, they are more likely found in states in which yearly per-pupil expenditures are lower than the national average (as demonstrated in the aforementioned post). They are more likely found in states in which students perform worse, or have lower relative percentages of "proficient" students, as per the US's (good) national test (i.e., the National Assessment of Educational Progress [NAEP]). They are more likely found in states that have more centralized governments, rather than those with more powerful counties and districts as per local control. They are more likely to be found in more highly populated states and states with relatively larger populations of poor and racial and language minority students. And they are more likely to be found in red states in which residents predominantly vote for the Republican Party.
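For those who want to check this kind of claim themselves, here is a minimal sketch of the sort of state-level correlation one might run (the data below are placeholders for illustration, not the actual state figures; a real analysis would use all 50 states and published demographics):

```python
# Minimal sketch of the kind of state-level correlation described above.
# The arrays below are hypothetical placeholders, NOT actual state data.
import numpy as np

# 1 = state uses VAM-based policies for consequential decisions, 0 = it does not
uses_vam = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# Yearly per-pupil expenditures (thousands of dollars) for the same hypothetical states
per_pupil_spending = np.array([9.1, 8.7, 14.2, 9.8, 13.5, 15.0, 10.2, 12.8, 9.5, 13.9])

# Pearson (here, point-biserial) correlation between VAM adoption and spending;
# a negative value would indicate VAM policies cluster in lower-spending states.
r = np.corrcoef(uses_vam, per_pupil_spending)[0, 1]
print(f"Correlation between VAM adoption and per-pupil spending: {r:.2f}")
```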

All of these underlying correlations indeed explain why such policies are more popular, and accordingly adopted, in certain states versus others. As well, these underlying correlations help to explain the correlation of interest as presented by the Education Law Center in its aforementioned Education Justice article. Indeed, these states disproportionately received Race to the Top funds, just as their political and other state-level demographics would have predicted, as these are the states most likely to climb on board the VAMwagon (noting that some states had already done so prior to Race to the Top and hence won first-round Race to the Top funds [e.g., Tennessee]).

Please note, however, that as with all imperfect correlations found in correlational research, there are outliers. In this case, these would include blue states that adopt VAMs for consequential purposes (e.g., Colorado) or red states that continue to move relatively more slowly in terms of their VAM-based policies and initiatives (e.g., Texas, Arizona, and Mississippi). (Co)related again, these would also include states with relatively fewer, and relatively more, poor and minority students and English Language Learners (ELLs), respectively.

For more about these correlations and state-level research, please see: Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: A national overview. Teachers College Record, 116(1). Retrieved from: http://www.tcrecord.org/Content.asp?ContentId=17291

American Statistical Association (ASA) Position Statement on VAMs

In my most recent post, about the Top 14 research-based articles about VAMs, I included a great research-based statement that was released just last week by the American Statistical Association (ASA), titled the "ASA Statement on Using Value-Added Models for Educational Assessment."

It is short, accessible, easy to understand, and hard to dispute, so I wanted to be sure nobody missed it as this is certainly a must read for all of you following this blog, not to mention everybody else dealing/working with VAMs and their related educational policies. Likewise, this represents the current, research-based evidence and thinking of probably 90% of the educational researchers and econometricians (still) conducting research in this area.

Again, the ASA is the best statistical organization in the U.S. and likely one of, if not the, best statistical associations in the world. Some of the most important parts of their statement, taken directly from their full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher's ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that included in the report the ASA makes recommendations regarding the "key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM." These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

Do give the position statement a read and use it as needed!

Correction: Make the “Top 13” VAM Articles the “Top 14”

As per my most recent post earlier today, about the Top 13 research-based articles about VAMs, lo and behold, another great research-based statement was just this week released by the American Statistical Association (ASA), titled the "ASA Statement on Using Value-Added Models for Educational Assessment."

So, let’s make the Top 13 the Top 14 and call it a day. I say “day” deliberately; this is such a hot and controversial topic it is often hard to keep up with the literature in this area, on literally a daily basis.

As per this outstanding statement released by the ASA – the best statistical organization in the U.S. and one of, if not the, best statistical associations in the world – some of the most important parts of their statement, taken directly from their full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher's ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that included in the report the ASA makes recommendations regarding the "key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM." These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

If you’re going to choose one article to read and review, this week or this month, and one that is thorough and to the key points, this is the one I recommend you read…at least for now!

Another Lawsuit in Rochester, New York

Following my most recent post, it seems, again, and as per another post on Diane Ravitch's blog, that the Rochester (New York) Teachers Association is also now "suing the state over its teacher evaluation system, alleging that it does not take into account the impact of poverty on classroom performance."

As per the original post, “The Rochester Teachers Association today filed a lawsuit alleging that the Regents and State Education Department failed to adequately account for the effects of severe poverty and, as a result, unfairly penalized Rochester teachers on their APPR (Annual Professional Performance Review) evaluations.

The suit, filed in state Supreme Court in Albany by New York State United Teachers on behalf of the RTA and more than 100 Rochester teachers, argues the State Education Department did not adequately account for student poverty in setting student growth scores on state tests in grades 4-8 math and English language arts. In addition, SED imposed rules for Student Learning Objectives and implemented evaluations in a way that made it more difficult for teachers of economically disadvantaged students to achieve a score of “effective” or better. As a result, the lawsuit alleges the Regents and SED violated teachers’ rights to fair evaluations and equal protection under the law.

SED computes a growth score based on student performance on state standardized tests, which is then used in teacher evaluations.

Nearly 90 percent of Rochester students live in poverty. The lawsuit says SED’s failure to appropriately compensate for student poverty when calculating student growth scores resulted in about one-third of Rochester’s teachers receiving overall ratings of “developing” or “ineffective” in 2012-13, even though 98 percent were rated “highly effective” or “effective” by their principals on the 60 points tied to their instructional classroom practices. Statewide, just 5 percent of teachers received “developing” or “ineffective” ratings.

“The State Education Department’s failure to properly factor in the devastating impact of Rochester’s poverty in setting growth scores and providing guidance for developing SLOs resulted in city teachers being unfairly rated in their evaluations,” Iannuzzi said. “Rochester teachers work with some of the most disadvantaged students in the state. They should not face stigmatizing labels based on discredited tests and the state’s inability to adequately account for the impact of extreme poverty when measuring growth.”

RTA President Adam Urbanski said an analysis of Rochester teachers’ evaluations for 2012-13 demonstrated clearly the effects of poverty and student attendance, for example, were not properly factored in for teachers’ evaluations. As a result, “dedicated and effective teachers received unfair ratings based on student outcomes that were beyond their control. The way the State Education Department implemented the state testing portion of APPR adds up to nothing more than junk science.”

Urbanski stressed Rochester teachers embrace accountability, support objective and constructive evaluations, and accept that testing has a place in education. “Tests should be used to inform instruction and can be effective tools to help improve teaching and learning,” he said. “But SED’s obsession with standardized testing and data collection has perverted the goals of testing while, ironically, failing to accurately measure the one impact that matters most: the effects of poverty on student achievement.”