Can VAMs Be Trusted?

In a recent paper published in the peer-reviewed journal Education Finance and Policy, coauthors Cassandra Guarino (Indiana University – Bloomington), Mark Reckase (Michigan State University), and Jeffrey Wooldridge (Michigan State University) ask and then answer the following question: “Can Value-Added Measures of Teacher Performance Be Trusted?” While what I write below draws on the official publication, I link here to the working paper published online (i.e., free of charge) via the Education Policy Center at Michigan State University.

From the abstract, the authors “investigate whether commonly used value-added estimation strategies produce accurate estimates of teacher effects under a variety of scenarios. [They] estimate teacher effects [using] simulated student achievement data sets that mimic plausible types of student grouping and teacher assignment scenarios. [They] find that no one method accurately captures true teacher effects in all scenarios, and the potential for misclassifying teachers as high- or low-performing can be substantial.”

Elsewhere, and in more specific terms, the authors use simulated data to “represent controlled conditions” that most closely match “the relatively simple conceptual model upon which value-added estimation strategies are based.” This is the strength of the study: the authors’ findings represent best-case scenarios, whereas with real-world and real-life data “conditions are [much] more complex.” Hence, testing various statistical estimators, controls, approaches, and the like on simulated data becomes “the best way to discover fundamental flaws and differences among them when they should be expected to perform at their best.” (For a minimal sketch of what such a simulation looks like, see the code that follows the list of findings below.)

They found…

  • “No one [value-added] estimator performs well under all plausible circumstances, but some are more robust than others…[some] fare better than expected…[and] some of the most popular methods are neither the most robust nor ideal.” In other words, calculating value-added is messy, regardless of the sophistication of the statistical specifications and controls used, and this messiness can seriously throw off the validity of the inferences to be drawn about teachers, even given the fanciest models and methodological approaches we currently have going (i.e., those models and model specifications being advanced via policy).
  • “[S]ubstantial proportions of teachers can be misclassified as ‘below average’ or ‘above average’ as well as in the bottom and top quintiles of the teacher quality distribution, even in [these] best-case scenarios.” This means that the misclassification errors we are seeing with real-world data also appear with simulated data. This raises even more concern about whether VAMs will ever be able to get it right, or in this case, counter the effects of the nonrandom assignment of students to classrooms (and of teachers to those classrooms).
  • Researchers found that “even in the best scenarios and under the simplistic and idealized conditions imposed by [their] data-generating process, the potential for misclassifying above-average teachers as below average or for misidentifying the “worst” or “best” teachers remains nontrivial, particularly if teacher effects are relatively small. Applying the [most] commonly used [value-added approaches] results in misclassification rates that range from at least 7 percent to more than 60 percent, depending upon the estimator and scenario.” So even with a pretty perfect dataset, or a dataset much cleaner than those that come from actual children and their test scores in real schools, misclassification errors can impact teachers upwards of 60% of the time.
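
For those who want to see the logic of such a simulation in miniature, here is a minimal sketch in Python. To be clear, the data-generating choices, parameter values, and the deliberately simple gain-score estimator below are my own illustrative assumptions, not those of Guarino, Reckase, and Wooldridge, whose study examines a range of estimators under far more carefully constructed grouping and assignment scenarios.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 200, 25   # illustrative sizes (my assumption)

# "True" teacher effects and students' prior achievement
true_effect = rng.normal(0, 0.15, n_teachers)
prior = rng.normal(0, 1, (n_teachers, n_students))

# Non-random grouping: more effective teachers tend to receive stronger students
ranks = np.argsort(np.argsort(true_effect)) / n_teachers
prior += (0.5 * ranks)[:, None]

# Current-year scores: persistence of prior achievement + teacher effect + noise
score = 0.7 * prior + true_effect[:, None] + rng.normal(0, 0.5, (n_teachers, n_students))

# A deliberately simple estimator: each teacher's average gain score
estimated = (score - prior).mean(axis=1)

# How often does the estimator place a teacher in the wrong quintile?
def quintile(x):
    return np.digitize(x, np.quantile(x, [0.2, 0.4, 0.6, 0.8]))

misclassified = np.mean(quintile(true_effect) != quintile(estimated))
print(f"Teachers placed in the wrong quintile: {misclassified:.0%}")
```

Even this toy setup reproduces the qualitative point: with non-random grouping and a too-simple estimator, a nontrivial share of teachers lands in the wrong quintile, even though these data are far cleaner than anything real schools produce.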

In sum, researchers conclude that while certain VAMs hold more promise than others, they may not be capable of overcoming the many obstacles presented by the non-random assignment of students to teachers (and teachers to classrooms).

In their own words, “it is clear that every estimator has an Achilles heel (or more than one area of potential weakness)” that can distort teacher-level output in highly consequential ways. Hence, “[t]he degree of error in [VAM] estimates…may make them less trustworthy for the specific purpose of evaluating individual teachers” than we might think.

Rothstein, Chetty et al., and (Now) Kane on Bias

Here’s an update to a recent post about research conducted by Jesse Rothstein, Associate Professor of Economics at the University of California, Berkeley.

In Rothstein’s recently released study, he provides evidence that puts the aforementioned Chetty et al. results in a more appropriate light. Rothstein’s charge, again, is that Chetty et al. (perhaps unintentionally) masked evidence of bias in their now infamous VAM-based study, which in turn undermines Chetty et al.’s (perpetual) claims that teachers caused the effects observed in student achievement growth over time. Those effects might instead be largely attributable to bias, given the types of students non-randomly assigned to teachers’ classrooms, rather than to “true teacher effects.”

In addition, just as Rothstein replicated Chetty et al.’s overall results using a similar data set, so did Thomas Kane, a colleague of Chetty’s at Harvard who has also been the subject of prior VAMboozled! posts here, here, and here. During the Vergara v. California case last summer, the plaintiffs’ legal team actually used Kane’s (and colleagues’) replication-study results to validate Chetty et al.’s initial results.

However, Rothstein did not replicate Chetty et al.’s findings when it came to bias (the best evidence of this is offered in Appendix B of Rothstein’s study). Kane’s (and colleagues’) study, by contrast, did not include any of the prior-year score analyses needed to assess bias, so it could not speak to the extent to which Chetty et al.’s results were due to bias.

But after Rothstein released his recent study critiquing Chetty et al. on this point, Kane (and colleagues) released the results Kane presented at the Vergara trial (see here). Kane (and colleagues) seemingly released this updated version of Kane’s initial results to counter Rothstein, in support of Chetty. In other words, Kane seems to have released his study more in support of his colleague Chetty than in the name of conducting good, independent research.

Oh the tangled web Chetty and Kane (purportedly) continue to weave.

See also Chetty et al.’s direct response to Rothstein here.

VAMs and Observations: Consistencies, Correlations, and Contortions

A research article was recently published in the peer-reviewed journal Education Policy Analysis Archives titled “The Stability of Teacher Performance and Effectiveness: Implications for Policies Concerning Teacher Evaluation.” The article was written by professors/doctoral researchers from Baylor University, Texas A&M University, and the University of South Carolina. The researchers set out to explore the stability (or fluctuation) of 132 teachers’ effectiveness ratings, across 23 schools in South Carolina, over time, using both observational and VAM-based output.

Researchers found, not surprisingly given prior research in this area, that neither teacher performance using value-added nor effectiveness using observations was highly stable over time. This is most problematic when “sound decisions about continued employment, tenure, and promotion are predicated on some degree of stability over time. It is imprudent to make such decisions of the performance and effectiveness of a teacher as “Excellent” one year and “Mediocre” the next.”

They also observed “a generally weak positive relationship between the two sets of ratings” (i.e., value-added estimates and observations), a relationship that has also been documented throughout much of the literature.

On both of these findings, really, we are well past the point of saturation. That is, we could not have more agreement across research studies on the following: (1) teacher-level value-added scores are highly unstable over time, and (2) these value-added scores do not align well with observational scores, as they should if both measures were appropriately capturing the “teacher effectiveness” construct.

Another interesting finding from this study, discussed before but not yet at the point of saturation in the research literature like the prior two, is that (3) teacher performance ratings based on observational data are also markedly different across schools. At issue here is “that performance ratings may be school-specific.” Or, as per a recent post on this blog, that there is indeed much “Arbitrariness Inherent in Teacher Observations.” This is also highly problematic in that where a teacher happens to be housed might determine his/her ratings, based not necessarily (or entirely) on his/her actual “quality” or “effectiveness” but on his/her location, his/her rater, and his/her rater’s scoring approach, given raters’ differential tendencies toward leniency or severity. This might leave us with more of a luck-of-the-draw approach than an actually “objective” measurement of true teacher quality, contrary to current and popular (especially policy) beliefs.

Accordingly, and also per the research, this is not getting much better, given, as per the authors of this article as well as many other scholars: (1) “the variance in value-added scores that can be attributed to teacher performance rarely exceeds 10 percent”; (2) the in many ways “gross” measurement errors that come, first, from the tests being used to calculate value-added; (3) the restricted ranges in teacher effectiveness scores, also given these tests’ limited stretch and depth and their instructional insensitivity — this was also at the heart of a recent post, wherein it was demonstrated that “the entire range from the 15th percentile of effectiveness to the 85th percentile of [teacher] effectiveness [using the EVAAS] cover[ed] approximately 3.5 raw score points [given the tests used to measure value-added]”; (4) the context or student, family, school, and community background effects that simply cannot be controlled for, or factored out; and (5) these background effects especially at the classroom/teacher level when students are not randomly assigned to classrooms (and teachers are not randomly assigned to teach those classrooms)…although random assignment will likely never happen for the sake of improving the sophistication and rigor of the value-added model over students’ “best interests.”

Rothstein, Chetty et al., and VAM-Based Bias

Recall the Chetty et al. study at the focus of many posts on this blog (see for example here, here, and here)? The study was cited in President Obama’s 2012 State of the Union address when Obama said, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and it was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.” Well, this study is at the center of a new, and very interesting, VAM-based debate, again.

This time, new research conducted by Jesse Rothstein, Associate Professor of Economics at the University of California, Berkeley, provides evidence that puts the aforementioned Chetty et al. results in a more appropriate light. While Rothstein and others have written critiques of the Chetty et al. study before (see prior reviews here, here, here, and here), what Rothstein recently found (in his working, not-yet-peer-reviewed study here) is that by using “teacher switching” statistical procedures, Chetty et al. masked evidence of bias in their prior study. While Chetty et al. have repeatedly claimed bias was not an issue (see for example a series of emails on this topic here), it seems indeed it was.

While Rothstein replicated Chetty et al.’s overall results using a similar dataset, Rothstein did not replicate Chetty et al.’s findings when it came to bias. As mentioned, Chetty et al. used a process of “teacher switching” to test for bias in their study, and by doing so found, with evidence, that bias did not exist in their value-added output. Rothstein found that when “teacher switching” is appropriately controlled, however, “bias accounts for about 20% of the variance in [VAM] scores.” This makes suspect, more now than before, Chetty et al.’s prior assertions that their model, and their findings, were immune to bias.

What this means, as per Rothstein, is that “teacher switching [the process used by Chetty et al.] is correlated with changes in students’ prior grade scores that bias the key coefficient toward a finding of no bias.” Hence, there was a reason Chetty et al. did not find bias in their value-added estimates: they did not use the proper statistical controls for bias in the first place. When properly controlled, or adjusted, the estimates yield “evidence of moderate bias;” hence, “[t]he association between [value-added] and long-run outcomes is not robust and quite sensitive to controls.”
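
To make the “20% of the variance” claim concrete, one common way to read such a statement (in my own shorthand notation, not Rothstein’s exact formulation) is to decompose each teacher’s estimated value-added into a true effect, a bias term, and estimation error:

```latex
\widehat{VA}_j \;=\; VA_j \;+\; b_j \;+\; e_j,
\qquad
\frac{\operatorname{Var}(b_j)}{\operatorname{Var}\!\big(\widehat{VA}_j\big)} \;\approx\; 0.20
```

On this reading, roughly one fifth of the spread in teachers’ scores would reflect whom they were assigned to teach rather than how well they taught.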

This has major implications in the sense that this makes suspect the causal statements also made by Chetty et al. and repeated by President Obama, the Vergara v. California judge, and others – that “high value-added” teachers caused students to ultimately realize higher long-term incomes, fewer pregnancies, etc. X years down the road. If Chetty et al. did not appropriately control for bias, which again Rothstein argues with evidence they did not, it is likely that students would have realized these “things” almost if not entirely regardless of their teachers or what “value” their teachers purportedly “added” to their learning X years prior.

In other words, students were likely not randomly assigned to classrooms in either the Chetty et al. or the Rothstein datasets (making these datasets comparable). So if the statistical controls used did not effectively “control for” the non-random assignment of students into classrooms, teachers may have been assigned high value-added scores not necessarily because they were high value-added teachers but because they were non-randomly assigned higher-performing, higher-aptitude, etc. students in the first place and as a whole. Thereafter, they were given credit for the aforementioned long-term outcomes, regardless.

If the name Jesse Rothstein sounds familiar, it should. I have referenced his research in prior posts here, here, and here, as he is well-known in the area of VAM research, in particular for a series of papers in which he provided evidence that students who are assigned to classrooms in non-random ways can create biased, teacher-level value-added scores. If random assignment were the norm (i.e., whereby students are randomly assigned to classrooms and, ideally, teachers are randomly assigned to teach those classrooms of randomly assigned students), teacher-level bias would not be so problematic. However, given research I also recently conducted on this topic (see here), random assignment (at least in the state of Arizona) occurs 2% of the time, at best. Principals otherwise outright reject the notion, as random assignment is not viewed as in “students’ best interests,” regardless of whether randomly assigning students to classrooms might mean “more accurate” value-added output as a result.

So it seems, we either get the statistical controls right (which I doubt is possible) or we randomly assign (which I highly doubt is possible). Otherwise, we are left wondering whether value-added analyses will ever work as per their intended (and largely ideal) purposes, especially when it comes to evaluating and holding accountable America’s public school teachers for their effectiveness.

—–

In case you’re interested, Chetty et al. have responded to Rothstein’s critique. Their full response can be accessed here. Not surprisingly, they first highlight that Rothstein (and another set of their colleagues at Harvard) replicated their results. That “value-added (VA) measures of teacher quality show very consistent properties across different settings” is what Chetty et al. focus on first and foremost. What they dismiss, however, is whether the main concerns raised by Rothstein threaten the validity of their methods and their conclusions. They also dismiss the fact that Rothstein had already addressed Chetty et al.’s counterpoints, in Appendix B of his paper, given that Chetty et al. shared their concerns with Rothstein prior to his study’s release.

Nonetheless, the concerns Chetty et al. attempt to counter are whether their “teacher-switching” approach was invalid, and whether the “exclusion of teachers with missing [value-added] estimates biased the[ir] conclusion[s]” as well. The extent to which missing data bias value-added estimates has also been discussed before, namely when statisticians force the assumption in their analyses that missing data are “missing at random” (MAR), which is a difficult (although for some, like Chetty et al., necessary) assumption to swallow (see, for example, the Braun 2004 reference here).

Jesse Rothstein on Teacher Evaluation and Teacher Tenure

Last week, Max Ehrenfreund wrote a piece for the Washington Post’s Wonkblog titled “Teacher tenure has little to do with student achievement, economist says.” For those of you who do not know Jesse Rothstein, he’s an Associate Professor of Economics at the University of California, Berkeley, and he is one of the leading researchers/economists examining teacher evaluation and accountability policies writ large, as well as the value-added models (VAMs) being used for such purposes. He’s probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms indeed biases (or distorts) value-added estimations, pretty much regardless of the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here.

Anyhow, in this piece Ehrenfreund discusses teacher evaluation and teacher tenure with Rothstein. Some of the key take-aways from the interview, and for this audience, follow, but do read the full piece, linked again here, if so inclined:

Rothstein, on teacher evaluation:

  • In terms of evaluating teachers, “[t]here’s no perfect method. I think there are lots of methods that give you some information, and there are lots of problems with any method. I think there’s been a tendency in thinking about methods to prioritize cheap methods over methods that might be more expensive. In particular, there’s been a tendency to prioritize statistical computations based on student test scores, because all you need is one statistician and the test score data. Classroom observation requires having lots of people to sit in the back of lots and lots of classrooms and make judgments.”
  • Why the interest in value-added? “I think that’s a complicated question. It seems scientific, in a way that other methods don’t. Partly it has to do with the fact that it’s cheap, and it seems like an easy answer.”
  • What about the fantabulous study Raj Chetty and his colleagues (Friedman and Rockoff) conducted about teachers’ value-added (which has been the source of many prior posts herein)? “I don’t think anybody disputes that good teachers are important, that teachers matter. I have some methodological concerns about that study, but in any case, even if you take it at face value, what it tells you is that higher value-added teachers’ students earn more on average.”
  • What are the alternatives? “We could double teachers’ salaries. I’m not joking about that. The standard way that you make a profession a prestigious, desirable profession, is you pay people enough to make it attractive. The fact that that doesn’t even enter the conversation tells you something about what’s wrong with the conversation around these topics. I could see an argument that says it’s just not worth it, that it would cost too much. The fact that nobody even asks the question tells me that people are only willing to consider cheap solutions.”

Rothstein, on teacher tenure:

  • “Getting good teachers in front of classrooms is tricky,” and it will likely “still be a challenge without tenure, possibly even harder. There are only so many people willing to consider teaching as a career, and getting rid of tenure could eliminate one of the job’s main attractions.”
  • Likewise, “there are certainly some teachers in urban, high-poverty settings that are not that good, and we ought to be figuring out ways to either help them get better or get them out of the classroom. But it’s important to keep in mind that that’s only one of several sources of the problem.”
  • “Even if you give the principal the freedom to fire lots of teachers, they won’t do it very often, because they know the alternative is worse.” The alternative being replacing an ineffective teacher with an even less effective teacher. Contrary to what is oft-assumed, highly qualified teachers are not knocking down the doors to teach in such schools.
  • Teacher tenure is “really a red herring” in the sense that debating tenure ultimately misleads and distracts others from the more relevant and important issues at hand (e.g., recruiting strong teachers into such schools). Tenure “just doesn’t matter that much. If you got rid of tenure, you would find that the principals don’t really fire very many people anyway” (see also point above).

Texas Hangin’ its Hat on its New VAM System

A fellow blogger, James Hamric, author of Hammy’s Education Reform Blog, emailed me a few weeks ago to share a recent post he wrote about teacher evaluations in Texas, which he titled “The good, the bad and the ridiculous.”

It seems that the Texas Education Agency (TEA), which serves a role in Texas similar to that of a state department of education elsewhere, recently posted details about the state’s new Teacher Evaluation and Support System (TESS), which the state submitted to the U.S. Department of Education to satisfy the conditions of its No Child Left Behind (NCLB) waiver, excusing Texas from NCLB’s prior goal that all students in the state (and all other states) would be 100% proficient in mathematics and reading/language arts by 2014.

“80% of TESS will be rubric based evaluations consisting of formal observations, self assessment and professional development across six domains…The remaining 20% of TESS ‘will be reflected in a student growth measure at the individual teacher level that will include a value-add score based on student growth as measured by state assessments.’ These value added measures (VAMs) will only apply to approximately one quarter of the teachers [however, and as is the case more or less throughout the country] – those [who] teach testable subjects/grades. For all the other teachers, local districts will have flexibility for the remaining 20% of the evaluation score.” This “flexibility” will include options such as student learning objectives (SLOs), portfolios, or district-level pre- and post-tests.

Hamric then goes on to review his concerns about the VAM-based component. While we have highlighted these issues and concerns many times before on this blog, I do recommend reading them as summarized by others besides those of us who write here. This may just help to saturate our minds, and also prepare us to defend ourselves against the “good, bad, and the ridiculous” and perhaps work toward better systems of teacher evaluation, as is really the goal. Click here, again, to read his post in full.

Related, Hamric concludes with the following, “the vast majority of educators want constructive feedback, almost to a fault. As long as the administrator is well trained and qualified, a rubric based evaluation should be sufficient to assess the effectiveness of a teacher. While the mathematical validity of value added models are accepted in more economic and concrete realms, they should not be even a small part of educator evaluations and certainly not any part of high-stakes decisions as to continuing employment. It is my hope that, as Texas rolls out TESS in pilot districts in the 2014-2015 school year, serious consideration will be given to removing the VAM component completely.”

Brookings Institution Paper on VAMs and Their More Important Observational Counterparts

A paper recently released by the Brookings Institution (click here to read the paper in full) has “added” some “value” to our thinking about the use of observations in the teacher evaluation systems at issue here, and in most all educational policy circles these days.

The researchers, situating their study in the federal context surrounding these systems (including the more than $4 billion released to 19 states via Race to the Top and the 43 NCLB waivers now also granted), examined the observational systems, rather than the VAMs themselves, as these observational systems typically accompany the VAM components in these (oft-high-stakes) systems.

Researchers found that the observational components of these systems certainly “add” more “value” than their VAM counterparts. This is true largely because, as researchers found in their analyses, only 22% of teachers were VAM-eligible (which is significantly lower than others’ estimates, which usually hover around 30%). In this case, observational systems were the only real “hard data” available for the other 78% of teachers across school sites. In addition, researchers found (although we have known this for some time) that “classroom observations have the potential of providing formative feedback to teachers that [theoretically] helps them improve their practice [more than VAMs]…[because] feedback from [VAMs]…is often too delayed and vague to produce improvement in teaching.”

Researchers do note, however, that “improvements are needed” if such observational systems are to carry the weight with which they are currently being tasked, again given the above and their findings below:

  1. Observational biases also exist, as per their research, whereby teachers who are non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students entering their classrooms with lower levels of prior achievement;
  2. Related, districts “do not have processes in place to address the possible biases in observational scores,” not to mention the VAM scores with which they are most often combined across contemporary systems. While the authors suggest that statistical adjustments be made (a simple sketch of what such an adjustment might look like follows this list), I’m not certain I agree with this, for the same research-based reasons I don’t agree with this approach with VAMs — no matter how sophisticated the statistics, they do not effectively control for such bias;
  3. “The reliability of both value-added measures and demographic-adjusted teacher evaluation scores is dependent on sample size, such that these measures will be less reliable and valid when calculated in small districts than in large districts.” While the authors recommend that states also control for this statistically, I would default to reality, in that districts interpreting and using these data should certainly keep this in mind, especially if they are set to make high-stakes decisions based on these, yet another set of still faulty data;
  4. “Observations conducted by outside observers are more valid than observations conducted by school administrators. At least one observation of a teacher each year should [be] conducted by a trained observer from outside the teacher’s school who does not have substantial prior knowledge of the teacher being observed.” On this, I would agree, but I would still recommend that observers either represent or fully understand the full context in which teachers are to be observed before passing “more objective” judgments about them;
  5. And finally, their best recommendation…wait for it…that “[t]he inclusion of a school value-added component in teachers’ evaluation scores negatively impacts good teachers in bad schools and positively impacts bad teachers in good schools. This measure should be eliminated or reduced to a low weight in teacher evaluation programs.”
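
As a concrete (and deliberately simple) illustration of the kind of “statistical adjustment” the authors propose for observational scores, and about which I remain skeptical, here is a sketch that regresses teachers’ observation ratings on their classrooms’ average incoming achievement and keeps the residual as the “adjusted” rating. The data, variable names, and single-covariate adjustment are my own assumptions, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 300

class_prior = rng.normal(0, 1, n_teachers)      # class mean incoming achievement
true_practice = rng.normal(0, 1, n_teachers)     # what observers intend to measure
# Observed rating partly reflects whom the teacher was assigned, not just practice
observed = 0.8 * true_practice + 0.4 * class_prior + rng.normal(0, 0.3, n_teachers)

# "Adjustment": remove the part of the rating predictable from incoming achievement
slope, intercept = np.polyfit(class_prior, observed, 1)
adjusted = observed - (intercept + slope * class_prior)

print("corr(observed, class_prior):", round(np.corrcoef(observed, class_prior)[0, 1], 2))
print("corr(adjusted, class_prior):", round(np.corrcoef(adjusted, class_prior)[0, 1], 2))
```

The adjusted ratings are, by construction, uncorrelated with incoming achievement; whether that residual is any closer to “true” practice, or simply a different distortion, is precisely the point of dispute.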

Citation: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf

Chetty Et Al…Are “Et” It “Al” Again!

As per a recent post about Raj Chetty, the fantabulous study he and his colleagues (i.e., Chetty, Friedman, and Rockoff) “Keep on Giving…,” and a recent set of emails exchanged among Raj, Diane, and me about their (implied) Nobel-worthy work in this area, given said fantabulous study, it seems Chetty et al. are “et” it “al” again!

Excuse my snarkiness here, as I am still in shock that they took it upon themselves to “take on” the American Statistical Association. Yes, the ASA, as critiqued by Chetty et al. If that’s not a sign of how Chetty et al. feel about their work in this area, then I don’t know what is.

Given my already close proximity to this debate, I thought it wise to consult a colleague of mine, one who, in fact, is an economist (like Chetty et al.) and who is also an expert in this area, to write a response to their newly released “take” on the ASA’s recently released “Statement on Using Value-Added Models for Educational Assessment.” Margarita Pivovarova is also an economist conducting research in this area, and I value her perspective more than that of many other potential reviewers out there, as she is both smart and wise when it comes to the careful use and interpretation of data and methods like those at the focus of these papers. Here’s what she wrote:

What an odd set of opponents in the interpretation of the ASA statement, this one apparently being a set of scholars who self-selected to answer what they seemingly believed others were asking; that is, “What say Chetty et al. on this?”

That being said, the overall tone of this response reads like a “response to the editor,” as if, as is more customary, the ASA had reviewed [the aforementioned] Chetty et al. paper and Chetty et al. had been invited to respond to the ASA. Interestingly here, though, Chetty et al. seemed compelled to take a self-assigned “expert” stance on the ASA’s statement, regardless of the fact that this statement represents the expertise and collective stance of a whole set of highly regarded and highly deserving scholars in this area, who quite honestly would NOT have asked for Chetty et al.’s advice on the matter.

This is the ASA, and these are the ASA experts Chetty et al. also explicitly marginalize in the reference section of their critique. [Note for yourselves the major lack of alignment between Chetty et al.’s references and the 70 articles linked to here on this website]. This should give others pause as to why Chetty et al. are so extremely selective in their choices of relevant literature – the literature on which they rely “to prove” some of their own statements and “to prove” some of the ASA’s statements false or suspect. The largest chunk of the best and most relevant literature [as again linked to here] is completely left out of their “critique.” This also includes the whole set of peer-reviewed articles published in educational and education finance journals. A search on ERIC (the Educational Resources Information Center) with “value-added” as a key word for the last 5 years yields 406 entries, and a similar search in JSTOR (a shared digital library) returns 495. This is quite a deep and rich literature to leave out, outright and in its entirety, of Chetty et al.’s purportedly more informed and expert take on this statement, ESPECIALLY given that many in the ASA wrote these aforementioned, marginalized articles! While Chetty et al. critique the ASA for not including citations, this is quite customary in a position statement that is meant to be research-based but not cluttered with the hundreds of articles in support of the association’s position.

Another point of contention surrounds ASA Point #5: VAMs typically measure correlation, not causation; effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model. Although I should mention that this was taken from within the ASA statement, interpreted, and then defined by Chetty et al. as “ASA Point #5.” Notwithstanding, the answer to “ASA Point #5” guides policy implications, so it is important to consider: Do value-added estimates yield causal interpretations as opposed to just descriptive information? In 2004, the editors of the American Educational Research Association’s (AERA) Journal of Educational and Behavioral Statistics (JEBS) published a special issue entirely devoted to value-added models (VAMs). One notable publication in that issue was written by Rubin, Stuart, and Zanutto (2004). Any discussion of the causal interpretations of VAMs cannot be devoid of this foundational article and the potential outcomes framework, or Rubin Causal Model, advanced within this piece and elsewhere (see also Holland, 1986). These authors clearly communicated that value-added estimates cannot be considered causal unless a set of “heroic assumptions” are agreed to and imposed. Accordingly, and pertinent here, “anyone familiar with education will realize that this [is]…fairly unrealistic.” Even back then, Rubin et al. suggested we switch gears and instead evaluate interventions and reward incentives based on the descriptive qualities of the indicators and estimates derived via VAMs. This is something Chetty et al. overlooked here and continue to overlook and neglect in their research.

Another concern surrounds how Chetty et al. suggest “we” deal with the interpretation of VAM estimates (what they define as ASA Points #2 and #3). The grander question here is: Does one need to be an expert to understand the “truth” behind the numbers? If we are talking about those who are in charge of making decisions based on these numbers, then the answer is yes, or at least they should be aware of what the numbers do and do not do! While different levels of sophistication are probably required to develop VAMs versus to interpret VAM results, interpretation is a science in itself and ultimately requires background knowledge about the model’s design and how the estimates were generated if informed, accurate, and valid inferences are to be drawn. While it is true that using an iPad doesn’t require one to understand the technology behind it, the same cannot be said about the interpretation of this set of statistics upon which people’s lives increasingly depend.

Related, the ASA rightfully states that the standardized test scores used in VAMs are not the only outcomes that should be of interest to policy makers and stakeholders (ASA Point #4). In view of recent findings on the role of cognitive versus non-cognitive skills for students’ future well-being, the current agreement is that test scores might not even be one of the most important outcomes capturing a student’s educated self. Addressing this point, Chetty et al., rather, demonstrate how “interpretation matters.” In Jackson’s (2013) study [not surprisingly cited by Chetty et al., likely because it was published by the National Bureau of Economic Research (NBER) – the same organization that published their now infamous study without it being peer or even internally reviewed way back when], Jackson used value-added techniques to estimate teacher effects on the non-cognitive and long-term outcomes of students. One of his main findings was that teachers who are good at boosting test scores are not always the same teachers who have positive and long-lasting effects on non-cognitive skills acquisition. In fact, value-added to test scores and value-added to non-cognitive outcomes for the same teacher were then and have since been shown to be only weakly correlated [or related] with one another. Regardless, Chetty et al. use the results of standardized test scores to rank teachers and then use those rankings to estimate the effects of “highly effective” teachers on the long-run outcomes of students, REGARDLESS OF THE RESEARCH in which their study should have been situated (excluding the work of Thomas Kane and Eric Hanushek). As noted before, if value-added estimates from standardized test scores cannot be interpreted as causal, then the effect of “high value-added” teachers on college attendance, earnings, and reduced teenage birth rates CANNOT BE CONSIDERED CAUSAL either.

Lastly, the ASA expressed concerns about the sensitivity of value-added estimates to model specifications (ASA Point #6). Likewise, researchers contributing to a now large body of literature have recently explored this same issue, and collectively they have found that value-added estimates are highly sensitive to the assessments being used, even within the same subject areas (Papay, 2011), and across different subject areas taught by the same teachers given different student compositions (Loeb & Candelaria, 2012). Echoing those findings, others (also ironically cited by Chetty et al. in their discussion to demonstrate the opposite) have found that “even when the correlations between performance estimates generated by different models are quite high for the workforce as a whole, there are still sizable differences in teacher rankings generated by different models that are associated with the composition of students in a teacher’s classroom” (Goldhaber, Walch, & Gabele, 2014). And others have noted that “models that are less aggressive in controlling for student-background and schooling-environment information systematically assign higher rankings to more-advantaged schools, and to individuals who teach at these schools” (Ehlert, Koedel, Parsons, & Podgursky, 2014).

In sum, these are only a few “points” from this “point-by-point discussion” that would, or rather should, immediately strike anyone fairly familiar with the debate surrounding the use and abuse of VAMs as way off-base. While the debates surrounding VAMs continue, it is so easy for a casual and/or careless observer to lose the big picture behind the surrounding debates and get distracted by technicalities. This certainly seems to be the case here, as Chetty et al continue to argue their technical arguments from a million miles away from the actual schools’, teachers’, and students’ data they study, and from which they make miracles.

*To see VAMboozled!’s initial “take” on the same ASA statement, a take that in our minds still stands as it was not a self-promotional critique but only a straightforward interpretation for our followers, please click here.

The Study that Keeps on Giving…(Hopefully) in its Final Round

In January I wrote a post about “The Study that Keeps on Giving…” Specifically, this post was about the study conducted and authored by Raj Chetty (Economics Professor at Harvard), John Friedman (Assistant Professor of Public Policy at Harvard), and Jonah Rockoff (Associate Professor of Finance and Economics at Columbia) that was published first in 2011 (in its non-peer-reviewed and not even internally reviewed form) by the National Bureau of Economic Research (NBER) and then published again by NBER in January of 2014 (in the same form), but this time split into two separate studies (see them split here and here).

Their re-release of the same albeit split study was what prompted the title of the initial “The Study that Keeps on Giving…” post. Little did I know then, though, that the reason this study was re-released in split form was that it was soon to be published in a peer-reviewed journal. Its non-peer-reviewed publication status was a major source of prior criticism. While journal editors seemed to have suggested the split, NBER seemingly took advantage of this opportunity to publicize this study in two forms, regardless and without prior explanation.

Anyhow, this came to my attention when the study’s lead author – Raj Chetty – emailed me a few weeks ago, copied Diane Ravitch on the same email, and also apparently emailed other study “critics” at the same time (see prior reviews of this study by its other notable “critics” here, here, here, and here) to notify all of us that this study had made it through peer review and was to be published in a forthcoming issue of the American Economic Review. While Diane and I responded to our joint email (as other critics may have done as well), we ultimately promised Chetty that we would not share the actual contents of any of the approximately 20 email exchanges that went back and forth among the three of us over the following days.

What I can say, though, is that no genuine concern was expressed by Chetty, or on behalf of his co-authors, about the intended or unintended consequences that have come about as a result of his study, nor about how many policymakers have since used and abused the study’s results for political gain and the further advancement of VAM-based policies. Instead, the emails were more or less self-promotional and celebratory, especially given that President Obama cited the study in his 2012 State of the Union Address and that Chetty apparently continues to advise U.S. Secretary of Education Arne Duncan about his similar VAM-based policies. Perhaps, next, the Nobel prize committee might pay this study its due respects, now overdue, but again I only paraphrase from that which I inferred from these email conversations.

As a refresher, Chetty et al. conducted value-added analyses on a massive data set (with over 1 million student-level test and tax records) and presented (highly questionable) evidence that favored teachers’ long-lasting, enduring, and in some cases miraculous effects. While some of the findings would have been very welcome to the profession had they indeed been true (e.g., that high value-added teachers substantively affect students’ incomes in their adult years), the study’s authors overstated their findings, and they did not duly consider (or provide evidence to counter) the alternative hypotheses in terms of what other factors besides teachers might have caused the outcomes they observed (e.g., those things that happen outside of schools while students are in school and throughout students’ lives).

Nor did they consider, or rather satisfactorily consider, how the non-random assignment of students into both schools and classrooms might have biased the effects observed, whereby the students in high “value-added” teachers’ classrooms might have been more “likely to succeed” regardless of, or even despite, the teacher effect, in terms of both the short- and long-term effects demonstrated in their findings…then widely publicized via the media and beyond throughout other political spheres.

Rather, Chetty et al. advanced what they argued were a series of causal effects by exploiting a series of correlations that they turned attributional. They did this because (I believe) they truly believe that their sophisticated econometric models and the sophisticated controls and approaches they use in fact work as intended. Perhaps this also explains why Chetty et al. give pretty much all credit in the area of value-added research to econometricians, and they do this throughout their papers, all the while over-citing the works of their economist friends but not the others (besides Berkeley economist Jesse Rothstein, see the full reference to his study here) who have outright contradicted their findings, with evidence. Apparently, educational researchers do not have much to add on this topic, but I digress.

But this too is a serious fault, as “they” (and I don’t mean to make sweeping generalizations here) have never been much for understanding what goes into the data they analyze, as socially constructed and largely context dependent. Nor do they seem to care to fully understand the realities of the classrooms from which they receive such data, or what test scores actually mean, or, when using them, what one can and cannot actually infer. This, too, was made clear via our email exchange. It seems this from-the-sky-down view of educational data is the best (as well as the most convenient) approach that “they” might even expressly prefer, so that they do not have to get their data fingers dirty and deal with the messiness that always surrounds these types of educational data and always comes into play when conducting most any type of educational research that relies (in this case solely) on students’ large-scale standardized test scores.

Regardless, I decided to give this study yet another review to see if, now that it has made it through the peer review process, I was missing something. I wasn’t. The studies are pretty much exactly the same as they were when first released (which unfortunately does not say much for peer review). The first study here is about VAM-based bias and how VAM estimates that control for students’ prior test scores “exhibit little bias despite the grouping of students,” this despite the number of studies not referenced or cited that continue to evidence the opposite. The second study here is about teacher-level value-added and how teachers with a lot of it (purportedly) cause grander things throughout their students’ lives. More specifically, they found that “students [non-randomly] assigned to high [value-added] teachers are more likely to attend college, earn higher salaries, and are less likely to have children as teenagers.” They also found that “[r]eplacing a teacher whose [value-added] is in the bottom 5% with an average teacher would increase the present value of students’ lifetime income by approximately $250,000 per classroom [emphasis added].” Please note that this overstated figure is not per student; had it been broken out by student it would have become “chump change,” for lack of a better term, which serves as just one example of their classic exaggerations. They do, however, when you read through the actual text, tone their powerful language down a bit to note that, on average, this figure is more accurately $185,000, still per classroom. Again, to read the more thorough critiques conducted by scholars with similarly impressive academic profiles, I suggest readers click here, here, here, or here.
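
For a sense of why the per-classroom framing flatters the result, here is the back-of-envelope arithmetic, assuming a class of roughly 28 students (an assumption of mine, used only for illustration):

```latex
\frac{\$250{,}000 \text{ per classroom}}{\approx 28 \text{ students}}
\;\approx\; \$9{,}000 \text{ per student (present value of lifetime income)}
```

That works out to something on the order of a couple hundred dollars per year when spread across a roughly 40-year working life, which is why the per-student framing sounds far less miraculous than the per-classroom one.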

What I did find important to bring to light during this round of review were the assumptions that, thanks to Chetty and his emails, were made more obvious (and likewise more troublesome) than before. These are the, in some cases, “very strong” assumptions that Chetty et al. make explicit in both of their studies (see Assumptions 1–3 in the first and second papers), along with “evidence” for why these assumptions should not be rejected, most likely (and in some cases clearly) because their studies relied on them. The assumptions they made were so strong, in fact, that at one point they even mention that it would have been useful could they have “relaxed” some of them. In other cases, they justify their adoption of these assumptions given the data limitations and methodological issues they faced, plainly and simply because there was no other way to conduct (or continue) their analyses without making and agreeing to these assumptions.

So, see for yourselves if you agree with the following three assumptions, the ones they make most explicit and use throughout both studies (although other assumptions are littered throughout both pieces). I would love for Chetty et al. to discuss whether their assumptions in fact hold given the realities the everyday teacher, school, or classroom faces. But again, I digress…

Assumption 1 [Stationarity]: Teacher value-added, as based on growth in student achievement over time, follows a stationary, unchanging, constant, and consistent process. On average, “teacher quality does not vary across calendar years and [rather] depends only on the amount of time that elapses between” years. While completely nonsensical to the average adult with really any common sense, this assumption helped them “simplif[y] the estimation of teacher [value-added] by reducing the number of parameters” needed in their models, or, more appropriately, needed to model their effects.
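
In rough notation of my own (not Chetty et al.’s exact formulation): if \mu_{j,t} denotes teacher j’s quality in year t, stationarity says that the covariance of a teacher’s quality across any two years depends only on the gap between them, not on the calendar years themselves:

```latex
\operatorname{Cov}\!\left(\mu_{j,t},\, \mu_{j,t+s}\right) \;=\; \sigma_{s}
\quad \text{for every calendar year } t \text{ and gap } s
```

In plain terms, drift in a teacher’s effectiveness is allowed, but its pattern is assumed to be identical whether the gap spans, say, 2005–2007 or 2010–2012.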

Assumption 2 [Selection on Excluded Observables]: Students are sorted or assigned to teachers on excluded observables that can be estimated. See a recent study that I conducted with a doctoral student of mine (just published in this month’s issue of the highly esteemed American Educational Research Journal here) in which we found, with evidence, that 98% of the time this assumption is false. Students are non-randomly sorted on “observables” and “non-observables” (most of which are not and cannot be included in such data sets) 98% of the time; both types of variables bias teacher-level value-added over time, given that the statistical procedures meant to control for these variables do not work effectively, especially for students at the extremes, on both sides of the normal bell curve. While convenient, especially when conducting this type of far-removed research, this assumption is false and cannot really be taken seriously given the pragmatic realities of schools.

Assumption 3 [Teacher Switching as a Quasi-Experiment]: Changes in teacher-level value-added scores across cohorts within a school-grade are orthogonal (i.e., non-overlapping, uncorrelated, or independent) to changes in other determinants of student scores. While Chetty et al. themselves write that this assumption “could potentially be violated by endogenous student or teacher sorting to schools over time,” they also state that “[s]tudent sorting at an annual frequency is minimal because of the costs of changing schools,” which is yet another unchecked assumption without reference(s) in support. They further note that “[w]hile endogenous teacher sorting is plausible over long horizons, the high-frequency changes [they] analyze are likely driven by idiosyncratic shocks such as changes in staffing needs, maternity leaves, or the relocation of spouses.” These are all plausible assumptions too, right? Is “high-frequency teacher turnover…uncorrelated with student and school characteristics?” Concerns about this, and really all of these assumptions, and ultimately how they impact study findings, should certainly give pause.
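
For readers who want a picture of what this quasi-experiment amounts to, the design can be sketched, again in rough notation of my own rather than the authors’ exact specification, as a regression of changes in mean scores across adjacent cohorts within a school-grade on changes in the mean value-added of the teaching staff:

```latex
\Delta \bar{A}_{sg,t} \;=\; \alpha \;+\; \lambda \, \Delta \overline{VA}_{sg,t} \;+\; \varepsilon_{sg,t},
\qquad
\operatorname{Cov}\!\left(\Delta \overline{VA}_{sg,t},\, \varepsilon_{sg,t}\right) \;=\; 0
```

Here s indexes schools, g grades, and t cohorts; the orthogonality condition on the right is Assumption 3, and a \lambda near 1 is what gets read as evidence of “no bias.” If staffing changes track changes in the kinds of students a school-grade serves, that condition fails, and so does the interpretation.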

My final point, interestingly enough, also came up during the email exchanges mentioned above. Chetty made the argument that he, more or less, had no dog in the fight surrounding value-added. In the first sentence of the first manuscript, however, he (and his colleagues) wrote, “Are teachers’ impacts on students’ test scores (“value-added”) a good measure of their quality?” The answer soon thereafter and repeatedly made known in both papers becomes an unequivocal “Yes.” Chetty et al. write in the first paper that “[they] established that value-added measures can help us [emphasis added as “us” is undefined] identify which teachers have the greatest ability to raise students’ test scores.” In the second paper, they write that “We find that teacher [value-added] has substantial impacts on a broad range of outcomes.”  Apparently, Chetty wasn’t representing his and/or his colleagues’ honest “research-based” opinions and feelings about VAMs in one place (i.e., our emails) or the other (his publications) very well.

Contradictions…as nettlesome as those dirty little assumptions, I suppose.

Breaking News: Houston Teachers Suing over their District’s EVAAS Use

It’s time!! While lawsuits are emerging across the nation (namely three in Tennessee and one in Florida), one significant and potentially huge lawsuit was just filed in federal court this past Wednesday in Houston on behalf of seven teachers working in the Houston Independent School District.

The one place where I have conducted quite extensive research on VAMs, and on the Education Value-Added Assessment System (EVAAS) and its use in this district in particular, is Houston. Accordingly, I’m honored to report that I’m helping folks move forward with their case.

Houston, the 7th largest urban district in the country, is widely recognized for its (inappropriate) use of the EVAAS for more consequential decision-making purposes (e.g., teacher merit pay and, in the case of this lawsuit, teacher termination), more so than anywhere else in the nation. Read a previous article about this in general, and also about four teachers who were fired in Houston due in large part to their EVAAS scores, here. See also a 12-minute YouTube video about these same teachers here.

According to a recent post in The Washington Post, the teachers/plaintiffs are arguing that EVAAS output is inaccurate, that the EVAAS is unfair, that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach, that the EVAAS fails to control for student-level factors that impact how well teachers perform but that are outside of teachers’ control (e.g., parental effects), that the EVAAS is incomprehensible and hence very difficult if not impossible to actually use to improve instruction, and, accordingly, that teachers’ due process rights are being violated because teachers do not have adequate opportunities to change as a result of their EVAAS results.

“The suit further alleges that teachers’ rights to equal protection under the Constitution are being abridged because teachers with below-average scores find themselves receiving harsher scores on a separate measure of instructional practice.” That is, teachers are claiming that their administrators are changing, and therefore distorting, other measures of their effectiveness (e.g., observational scores) because administrators are being told to trust the EVAAS data and to force alignment between the EVAAS and observational output, which serves to (1) manufacture greater agreement between the two and (2) artificially inflate what are currently the very low correlations observed between them.

As well, and unique to (but also very interesting in) the case of the EVAAS, recalibration of the model is also causing teachers’ “effectiveness” scores to adjust retroactively when additional data are added to the longitudinal data system. This, according to EVAAS technicians, helps to make the system more rigorous, but in reality it causes teachers’ scores from prior years to change after the fact. So yes, a teacher who might be deemed ineffective one year, and then penalized, might have his/her prior “ineffective” score retroactively changed to “effective” for that same year when more recent data are added and the model is re-calibrated to refine the EVAAS output. Sounds complicated, I mean comical, but it’s unfortunately not.
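
To illustrate why re-estimating with added data can retroactively change an earlier score, consider a toy example based on simple shrinkage toward the district average. This is emphatically not the EVAAS model itself (which is a far more elaborate longitudinal mixed model); it is only meant to show the mechanism: when everything is re-fit on all years of data, every estimate, including last year’s, can move.

```python
import numpy as np

# Toy setup: a teacher's observed mean classroom gain in two successive years,
# measured with noise, shrunk toward the district average of 0 (all values assumed).
obs_noise_var = 0.04          # assumed variance of a yearly classroom mean
teacher_var   = 0.02          # assumed variance of true teacher effects

def shrunken_estimate(yearly_means):
    """Shrinkage-style estimate of the teacher effect given all years available so far."""
    k = len(yearly_means)
    # Precision-weighted shrinkage of the multi-year average toward 0
    weight = teacher_var / (teacher_var + obs_noise_var / k)
    return weight * np.mean(yearly_means)

year1_gain = 0.30             # looks strong with one year of data
year2_gain = -0.10            # a weak second year arrives

after_year1 = shrunken_estimate([year1_gain])
after_year2 = shrunken_estimate([year1_gain, year2_gain])

print(f"Estimate for this teacher after year 1 only: {after_year1:+.2f}")
print(f"Estimate after re-fitting with year 2 added: {after_year2:+.2f}")
# The judgment originally attached to year 1 has effectively been revised after the fact.
```

In this toy, the teacher who looked comfortably above average on year-1 data alone looks much closer to average once the model is re-fit with year 2 included, which is the retroactive-change problem described above, in miniature.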

This one is definitely one to watch, and being closest to it, I will certainly keep everyone updated as things progress, as best I can. This, in my mind, is the lawsuit that “could affect [VAM-based] evaluation systems well beyond Texas.”

Yes!! Let’s sure hope so…assuming all goes well, as it should, as this lawsuit is to be based on the research.

To watch an informative short (2:30) video produced by ABC news in Houston, click here. To download the complete lawsuit click here.