Moshe Adler on Chetty et al.

Recall the study that has been the focus of many posts on this blog? It was written by Raj Chetty of Harvard and two of his colleagues, and it was first cited in President Obama’s 2012 State of the Union address, when Obama said, “We know a good teacher can increase the lifetime income of a classroom by over $250,000.” More recently it was the focus of much attention when the judge in Vergara v. California cited Chetty’s testimony and the team’s research as evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.”

Well, one of this study’s biggest critics, and one of its most important given that he is an economist himself, has written a response to all of this. More specifically, Moshe Adler – an economist affiliated with Columbia University and Empire State College, State University of New York, and author of the 2010 book, Economics for the Rest of Us: Debunking the Science That Makes Life Dismal – first (in April 2014) wrote a critique of Chetty et al.’s studies for the National Education Policy Center (NEPC), in which he also compared the study’s first release in 2011 (cited by Obama) to its current 2014 version (cited by the judge in Vergara v. California). That critique can be found here. Chetty et al. then wrote a response (in May 2014) that can be found at the bottom of the page here, after which Adler released a final response (in June 2014) that can also be found at the very bottom of the page here.

For those of you interested in reading all three pieces, all three are really quite critical to understanding Chetty et al.’s methods, as well as their methods’ shortcomings; their assumptions, as well as their feasibility in context; their results, as unfortunately overstated as they are; and the like. For those of you who want just the highlights, though (as also referenced in the press release just released here), here are the nine key points taken from what are now the two studies. These are the nine concerns that Moshe raised in the first round, to which Chetty et al. responded, but which seemingly still stand as solid criticisms.

As per Chetty et al.’s study #1:

  1. Value-added scores in this report and in general are unstable from year to year and from test to test, but the report ignores this instability.
  2. The report inflates the effect of teacher value-added by assuming that a child’s test scores can be improved endlessly [over time].
  3. The procedure that the report develops for calculating teacher value-added varies greatly between subjects within school levels (math or English in elementary/high school) and between schools within subjects (elementary or middle school math/English), indicating that the calculated teacher value-added may be random.
  4. The commonly used method for determining how well a model predicts results is through correlations and illustrated through scatterplot graphs. The report does not [however] present either the calculations or graphs and instead invents its own novel graph to show that the measurements of value-added produce good predictions of future teacher performance. But this is misleading. Notwithstanding the graph, it is possible that the quality of predictions in the report was poor.
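To make point 4 concrete, here is a minimal sketch, on simulated data only, of how a graph of group averages can look far tighter than the student-level correlation behind it. The data, the slope of 0.3, and the 20 equal-sized bins are all assumptions made up for illustration; this is not the report's data, and whether the report's graph was built this way is likewise an assumption.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def binned_means(xs, ys, n_bins=20):
    """Sort by x, cut into equal-sized bins, return the mean x and mean y per bin."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    bx, by = [], []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        bx.append(sum(p[0] for p in chunk) / len(chunk))
        by.append(sum(p[1] for p in chunk) / len(chunk))
    return bx, by

# Hypothetical data: a weak true relationship buried in a lot of noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(20000)]
y = [0.3 * xi + rng.gauss(0, 1) for xi in x]

raw_r = pearson(x, y)            # correlation across individual observations
bx, by = binned_means(x, y)
binned_r = pearson(bx, by)       # correlation across the 20 bin averages

print(f"individual-level r = {raw_r:.2f}, bin-average r = {binned_r:.2f}")
```

Averaging within bins cancels most of the individual noise, so the bin averages line up neatly even when predictions for any single observation are poor. That is why a conventional scatterplot and correlation are the more informative display.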

As per Chetty et al.’s study #2:

  1. An earlier version of the report found that an increase in teacher value-added has no effect on income at age 30, but this result is not mentioned in this revised version. Instead, the authors state that they did not have a sufficiently large sample to investigate the relationship between teacher value-added and income at any age after 28, but this claim is untrue. They had 220,000 observations (p. 15), which is a more than sufficiently large sample for their analysis.
  2. The method used to calculate the 1.34% increase [that a 1 standard deviation increase in teacher VA raises wages at age 28 by 1.34% (p. 37)] is misleading, since observations with no reported income were included in the analysis, while high earners were excluded. If done properly, it is possible that the effect of teacher value-added is to decrease, not increase, income at age 28 (or 30).
  3. The increase in annual income at age 28 due to having a higher quality teacher “improved” dramatically from the first version of the report ($182 per year, report of December, 2011) to the next ($286 per year, report of September, 2013). Because the data sets are not identical, a slight discrepancy between estimates is to be expected. But since the discrepancy is so large, it suggests that the correlation between teacher value-added and income later in life is random.
  4. In order to achieve its estimate of a $39,000 income gain per student, the report makes the assumption that the 1.34% increase in income at age 28 will be repeated year after year. Because no increase in income was detected at age 30, [however] and because 29.6% of the observations consisted of non-filers, this assumption is unjustified.
  5. The effect of teacher value-added on test scores fades out rapidly. The report deals with this problem by citing two studies that it claims buttress the validity of its own results. This claim is [also] both wrong and misleading.
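For a sense of scale on point 3, the size of the jump between the two reported estimates can be worked out directly. The sketch below uses only the two dollar figures quoted above; the function name is just an illustrative helper.

```python
def relative_change(old, new):
    """Change from old to new, expressed as a fraction of old."""
    return (new - old) / old

# Estimated annual income gain at age 28, as quoted above.
first_version = 182   # dollars per year, December 2011 report
later_version = 286   # dollars per year, September 2013 report

change = relative_change(first_version, later_version)
print(f"The estimate grew by {change:.0%} between versions")
```

An estimate that moves by more than half between versions is hard to describe as a "slight discrepancy," which is the crux of Adler's concern.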

Chetty Et Al…Are “Et” It “Al” Again!

As per a recent post about Raj Chetty, the fantabulous study he and his colleagues (i.e., Chetty, Friedman, and Rockoff) “Keep on Giving…,” and a recent set of emails exchanged among Raj, Diane, and me about their (implied) Nobel-worthy work in this area, given said fantabulous study, it seems Chetty et al are “et” it “al” again!

Excuse my snarkiness here as I am still in shock that they took it upon themselves to “take on” the American Statistical Association. Yes, the ASA as critiqued by Chetty et al. If that’s not a sign of how Chetty et al feel about their work in this area, then I don’t know what is.

Given my already close proximity to this debate, I thought it wise to consult a colleague of mine–one who, in fact, is an economist (like Chetty et al) and who is also an expert in this area–to write a response to their newly released “take” on the ASA’s recently released “Statement on Using Value-Added Models for Educational Assessment.” Margarita Pivovarova is also an economist conducting research in this area, and I value her perspective more than many other potential reviewers out there, as she is both smart and wise when it comes to the careful use and interpretation of data and methods like those at the focus of these papers. Here’s what she wrote:

What an odd set of opponents in the interpretation of ASA statement, this one apparently being a set of scholars who self-selected themselves to answer what they seemingly believed others were asking; that is, “What say Chetty et al on this?”

That being said, the overall tone of this response reminds me of a “response to the editor,” as if, in a more customary manner, the ASA had reviewed [the aforementioned] Chetty et al paper and Chetty et al were invited to respond to the ASA. Interestingly here, though, Chetty et al seemed compelled to take a self-assigned “expert” stance on the ASA’s statement, regardless of the fact that this statement represents the expertise and collective stance of a whole set of highly regarded and highly deserving scholars in this area, who quite honestly would NOT have asked for Chetty et al’s advice on the matter.

This is the ASA, and these are the ASA experts Chetty et al also explicitly marginalize in the reference section of their critique. [Note for yourselves the major lack of alignment between Chetty et al.’s references versus the 70 articles linked to here on this website]. This should give others pause in terms of why Chetty et al are so extremely selective in their choices of relevant literature – the literature on which they rely “to prove” some of their own statements and “to prove” some of the ASA’s statements false or suspect. The largest chunk of the best and most relevant literature [as again linked to here] is completely left out of their “critique.” This also includes the whole set of peer-reviewed articles published in educational and education finance journals. A search on ERIC (The Educational Resources Information Center) with “value-added” as key words for the last 5 years yields 406 entries, and a similar search in JSTOR (a shared digital library) returns 495. This is quite a deep and rich literature to leave out, outright and in its entirety, of Chetty et al’s purportedly more informed and expert take on this statement, ESPECIALLY given many in the ASA wrote these aforementioned, marginalized articles! While Chetty et al critique the ASA for not including citations, this is quite customary in a position statement that is meant to be research-based, but not cluttered with the hundreds of articles in support of the association’s position.

Another point of contention surrounds ASA Point #5: VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model. I should mention, though, that this was taken from within the ASA statement, interpreted, and then defined by Chetty et al as “ASA Point #5.” Notwithstanding, the answer to “ASA Point #5” guides policy implications, so it is important to consider: Do value-added estimates yield causal interpretations as opposed to just descriptive information? In 2004, the editors of the American Educational Research Association’s (AERA) Journal of Educational and Behavioral Statistics (JEBS) published a special issue entirely devoted to value-added models (VAMs). One notable publication in that issue was written by Rubin, Stuart, and Zanutto (2004). Any discussion of the causal interpretations of VAMs cannot be devoid of this foundational article and the potential outcomes framework, or Rubin Causal Model, advanced within this piece and elsewhere (see also Holland, 1986). These authors clearly communicated that value-added estimates cannot be considered causal unless a set of “heroic assumptions” are agreed to and imposed. Accordingly, and pertinent here, “anyone familiar with education will realize that this [is]…fairly unrealistic.” Even back then, Rubin et al suggested we switch gears and instead evaluate interventions and reward incentives as based on the descriptive qualities of the indicators and estimates derived via VAMs. This is something Chetty et al overlooked here and continue to overlook and neglect in their research.

Another concern surrounds how Chetty et al suggest “we” deal with the interpretation of VAM estimates (what they define as ASA Points #2 and #3). The grander question here is: Does one need to be an expert to understand the “truth” behind the numbers? If we, in general, are talking about those who are in charge of making decisions based on these numbers, then the answer is yes, or at least aware of what the numbers do and do not do! While probably different levels of sophistication are required to develop VAMs versus interpret VAM results, interpretation is a science in itself and ultimately requires background knowledge about the model’s design and how the estimates were generated if informed, accurate, and valid inferences are to be drawn. While it is true that using an iPad doesn’t require one to understand the technology behind it, the same cannot be said about the interpretation of this set of statistics upon which people’s lives increasingly depend.

Related, ASA rightfully states that the standardized test scores used in VAMs are not the only outcomes that should be of interest for policy makers and stake-holders (ASA Point #4). In view of recent findings of the role of cognitive versus non-cognitive skills for students’ future well-being, current agreement is that test scores might not even be one of the most important outcomes capturing a student’s educated self. Addressing this point, Chetty et al, rather, demonstrate how “interpretation matters.” In Jackson’s (2013) study [not surprisingly cited by Chetty et al, likely because it was published by the National Bureau of Economic Research (NBER) – the same organization that published their now infamous study without it being peer or even internally reviewed way back when], Jackson used value-added techniques to estimate teacher effects on the non-cognitive and long-term outcomes of students. One of his main findings was that teachers who are good at boosting test scores are not always the same teachers who have positive and long-lasting effects on non-cognitive skills acquisition. In fact, value-added to test scores and to non-cognitive outcomes for the same teacher were then and have since been shown to be only weakly correlated [or related] with one another. Regardless, Chetty et al use the results of standardized test scores to rank teachers and then use those rankings to estimate the effects of “highly-effective” teachers on the long-run outcomes of students, REGARDLESS OF THE RESEARCH in which their study should have been situated (excluding the work of Thomas Kane and Eric Hanushek). As noted before, if value-added estimates from standardized test scores cannot be interpreted as causal, then the effect of “high value-added” teachers on college attendance, earnings, and reduced teenage birth rates CANNOT BE CONSIDERED CAUSAL either.

Lastly, ASA expressed concerns about the sensitivity of value-added estimates to model specifications (ASA Point #6). Likewise, researchers contributing to a now large body of literature have also recently explored this same issue, and collectively they have found that value-added estimates are highly sensitive to the assessments being used, even within the same subject areas (Papay, 2011), and across different subject areas taught by the same teachers given different student compositions (Loeb & Candelaria, 2012). Echoing those findings, others (also ironically cited by Chetty et al in the discussion to demonstrate the opposite) have also found that “even when the correlations between performance estimates generated by different models are quite high for the workforce as a whole, there are still sizable differences in teacher rankings generated by different models that are associated with the composition of students in a teacher’s classroom” (Goldhaber, Walch, & Gabele, 2014). And others have noted that “models that are less aggressive in controlling for student-background and schooling-environment information systematically assign higher rankings to more-advantaged schools, and to individuals who teach at these schools” (Ehlert, Koedel, Parsons, and Podgursky, 2014).

In sum, these are only a few “points” from this “point-by-point discussion” that would, or rather should, immediately strike anyone fairly familiar with the debate surrounding the use and abuse of VAMs as way off-base. While the debates surrounding VAMs continue, it is so easy for a casual and/or careless observer to lose the big picture behind the surrounding debates and get distracted by technicalities. This certainly seems to be the case here, as Chetty et al continue to argue their technical arguments from a million miles away from the actual schools’, teachers’, and students’ data they study, and from which they make miracles.

*To see VAMboozled!’s initial “take” on the same ASA statement, a take that in our minds still stands as it was not a self-promotional critique but only a straightforward interpretation for our followers, please click here.

The Study that Keeps on Giving…(Hopefully) in its Final Round

In January I wrote a post about “The Study that Keeps on Giving…” Specifically, this post was about the study conducted and authored by Raj Chetty (Economics Professor at Harvard), John Friedman (Assistant Professor of Public Policy at Harvard), and Jonah Rockoff (Associate Professor of Finance and Economics at Columbia) that was published first in 2011 (in its non-peer-reviewed and not even internally reviewed form) by the National Bureau of Economic Research (NBER) and then published again by NBER in January of 2014 (in the same form) but this time split into two separate studies (see them split here and here).

Their re-release of the same albeit split study was what prompted the title of the initial “The Study that Keeps on Giving…” post. Little did I know then, though, that the reason this study was re-released in split form was that it was soon to be published in a peer-reviewed journal. Its non-peer-reviewed publication status was a major source of prior criticism. While journal editors seemed to have suggested the split, NBER seemingly took advantage of this opportunity to publicize this study in two forms, regardless and without prior explanation.

Anyhow, this came to my attention when the study’s lead author – Raj Chetty – emailed me a few weeks ago, emailed Diane Ravitch on the same email, and also apparently emailed other study “critics” at the same time (see prior reviews of this study as per this study’s other notable “critics” here, here, here, and here) to notify all of us that this study made it through peer review and was to be published in a forthcoming issue of the American Economic Review. While Diane and I responded to our joint email (as other critics may have done as well), we ultimately promised Chetty that we would not share the actual contents of any of the approximately 20 email exchanges that went back and forth among the three of us over the following days.

What I can say, though, is that no genuine concern was expressed by Chetty or on behalf of his co-authors, in particular, about the intended or unintended consequences that came about as a result of his study, nor about how many policymakers have since used and abused study results for political gain and the further advancement of VAM-based policies. Instead, emails were more or less self-promotional and celebratory, especially given that President Obama cited the study in his 2012 State of the Union Address and that Chetty apparently continues to advise U.S. Secretary of Education Arne Duncan about his similar VAM-based policies. Perhaps, next, the Nobel Prize committee might pay this study its due respects, now overdue, but again I only paraphrase from that which I inferred from these email conversations.

As a refresher, Chetty et al. conducted value-added analyses on a massive data set (with over 1 million student-level test and tax records) and presented (highly-questionable) evidence that favored teachers’ long-lasting, enduring, and in some cases miraculous effects. While some of the findings would have been very welcomed to the profession, had they indeed been true (e.g., high value-added teachers substantively affect students’ incomes in their adult years), the study’s authors overstated their findings, and they did not duly consider (or provide evidence to counter) the alternative hypotheses in terms of what other factors besides teachers might have caused the outcomes they observed (e.g., those things that happen outside of schools while students are in school and throughout students’ lives).

Nor did they consider, or rather satisfactorily consider, how the non-random assignment of students into both schools and classrooms might have biased the effects observed, in that the students in high “value-added” teachers’ classrooms might have been more “likely to succeed” regardless of, or even despite, the teacher effect, on both the short- and long-term outcomes demonstrated in their findings…then widely publicized via the media and beyond throughout other political spheres.

Rather, Chetty et al. advanced what they argued were a series of causal effects by exploiting a series of correlations that they turned attributional. They did this because (I believe) they truly believe that their sophisticated econometric models and the sophisticated controls and approaches they use in fact work as intended. Perhaps this also explains why Chetty et al. give pretty much all credit in the area of value-added research to econometricians, and they do this throughout their papers, all the while over-citing the works of their economic researchers/friends but not the others (besides Berkeley economist Jesse Rothstein, see the full reference to his study here) who have also outright contradicted their findings, with evidence. Apparently, educational researchers do not have much to add on this topic, but I digress.

But this, too, is a serious fault as “they” (and I don’t mean to make sweeping generalizations here) have never been much for understanding what goes into the data they analyze, as socially constructed and largely context dependent. Nor do they seem to care to fully understand the realities of the classrooms from which they receive such data, or what test scores actually mean, or when using them what one can and cannot actually infer. This, too, was made clear via our email exchange. It seems this from-the-sky-down view of educational data is the best (as well as the most convenient) approach that “they” might even expressly prefer, so that they do not have to get their data fingers dirty and deal with the messiness that always surrounds these types of educational data and always comes into play when conducting most any type of educational research that relies (in this case solely) on students’ large-scale standardized test scores.

Regardless, I decided to give this study yet another review to see if, now that this study has made it through the peer review process, I was missing something. I wasn’t. The studies are pretty much exactly the same as they were when first released (which unfortunately does not say much for peer review). The first study here is about VAM-based bias and how VAM estimates that control for students’ prior test scores “exhibit little bias despite the grouping of students” and despite the number of studies not referenced or cited that continue to evidence the opposite. The second study here is about teacher-level value-added and how teachers with a lot of it (purportedly) cause grander things throughout their students’ lives. More specifically, they found that “students [non-randomly] assigned to high [value-added] teachers are more likely to attend college, earn higher salaries, and are less likely to have children as teenagers.” They also found that “[r]eplacing a teacher whose [value-added] is in the bottom 5% with an average teacher would increase the present value of students’ lifetime income by approximately $250,000 per classroom [emphasis added].” Please note that this overstated figure is not per student; had it been broken out by student it would have rather become “chump change,” for the lack of a better term, which serves as one example of just one of their classic exaggerations. They do, however, when you read through the actual text, tone their powerful language down a bit to note that, on average, this is more accurately $185,000, still per classroom. Again, to read the more thorough critiques conducted by scholars with also impressive academic profiles, I suggest readers click here, here, here, or here.
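To put the per-classroom framing in perspective, here is a minimal sketch of the per-student arithmetic. The class sizes are hypothetical placeholders, since none is given here, and the figures are the present-value lifetime amounts quoted above, not annual amounts.

```python
def per_student(per_classroom, class_size):
    """Split a per-classroom dollar figure evenly across the students in the class."""
    if class_size <= 0:
        raise ValueError("class_size must be positive")
    return per_classroom / class_size

# Per-classroom figures quoted above (present value of lifetime income).
headline = 250_000
toned_down = 185_000

# Hypothetical class sizes, chosen only for illustration.
for class_size in (20, 25, 30):
    print(
        f"class of {class_size}: "
        f"${per_student(headline, class_size):,.0f} vs "
        f"${per_student(toned_down, class_size):,.0f} per student"
    )
```

Whatever class size one assumes, the per-student lifetime figure is a small fraction of the headline number, and it is then spread across an entire working life.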

What I did find important to bring to light during this round of review were the assumptions that, thanks to Chetty and his emails, were made more obvious (and likewise troublesome) than before. These are the, in some cases, “very strong” assumptions that Chetty et al. make explicit in both of their studies (see Assumptions 1-3 in the first and second papers). These are also the assumptions they make explicit, with “evidence” why they should not reject these assumptions (most likely, and in some cases clearly) because their study relied on such assumptions. The assumptions they made were so strong, in fact, that at one point they even mention that it would be useful if they could have “relaxed” some of the assumptions they made. In other cases, they justify their adoption of these assumptions given the data limitations and methodological issues they faced, plainly and simply because there was no other way to conduct (or continue) their analyses without making and agreeing to these assumptions.

So, see if you agree with the following three assumptions they make most explicit and use throughout both studies (although other assumptions are littered throughout both pieces), yourselves. I would love for Chetty et al. to discuss whether their assumptions in fact hold given the realities the everyday teacher, school, or classroom face. But again, I digress…

Assumption 1 [Stationarity]: Teacher levels of value-added as based on growth in student achievement over time follow a stationary, unchanging, constant, and consistent process. On average, “teacher quality does not vary across calendar years and [rather] depends only on the amount of time that elapses between” years. While completely nonsensical to the average adult with really any common sense, this assumption, to them, helped them “simplif[y] the estimation of teacher [value-added] by reducing the number of parameters” needed in their models, or more appropriately needed to model their effects.

Assumption 2 [Selection on Excluded Observables]: Students are sorted or assigned to teachers on excluded observables that can be estimated. See a recent study that I conducted with a doctoral student of mine (that was just published in this month’s issue of the highly esteemed American Educational Research Journal here) in which we found, with evidence, that 98% of the time this assumption is false. Students are non-randomly sorted on “observables” and “non-observables” (most of which are not and cannot be included in such data sets) 98% of the time; both types of variables bias teacher-level value-added over time, given that the statistical procedures meant to control for these variables do not work effectively, especially for students at the extremes on either side of the normal bell curve. While convenient, especially when conducting this type of far-removed research, this assumption is false and cannot really be taken seriously given the pragmatic realities of schools.

Assumption 3 [Teacher Switching as a Quasi-Experiment]: Changes in teacher-level value-added scores across cohorts within a school-grade are orthogonal (i.e., non-overlapping, uncorrelated, or independent) with changes in other determinants of student scores. While Chetty et al. themselves write that this assumption “could potentially be violated by endogenous student or teacher sorting to schools over time,” they also state that “[s]tudent sorting at an annual frequency is minimal because of the costs of changing schools” which is yet another unchecked assumption without reference(s) in support. They further note that “[w]hile endogenous teacher sorting is plausible over long horizons, the high-frequency changes [they] analyze are likely driven by idiosyncratic shocks such as changes in staffing needs, maternity leaves, or the relocation of spouses.” These are all plausible assumptions too, right? Is “high-frequency teacher turnover…uncorrelated with student and school characteristics?” Concerns about this and really all of these assumptions, and ultimately how they impact study findings, should certainly cause pause.

My final point, interestingly enough, also came up during the email exchanges mentioned above. Chetty made the argument that he, more or less, had no dog in the fight surrounding value-added. In the first sentence of the first manuscript, however, he (and his colleagues) wrote, “Are teachers’ impacts on students’ test scores (“value-added”) a good measure of their quality?” The answer soon thereafter and repeatedly made known in both papers becomes an unequivocal “Yes.” Chetty et al. write in the first paper that “[they] established that value-added measures can help us [emphasis added as “us” is undefined] identify which teachers have the greatest ability to raise students’ test scores.” In the second paper, they write that “We find that teacher [value-added] has substantial impacts on a broad range of outcomes.”  Apparently, Chetty wasn’t representing his and/or his colleagues’ honest “research-based” opinions and feelings about VAMs in one place (i.e., our emails) or the other (his publications) very well.

Contradictions…as nettlesome as those dirty little assumptions I suppose.

Research Brief: Access to “Effective Teaching” as per VAMs

Researchers of a brief released from the Institute of Education Sciences (IES), the primary research arm of the United States Department of Education (USDOE), recently set out to “shed light on the extent to which disadvantaged students have access to effective teaching, based on value-added measures [VAMs]” as per three recent IES studies that have since been published in peer-reviewed journals and that include in their analyses 17 total states.

Researchers found, overall, that: (1) disadvantaged students receive less-effective teaching and have less access to effective teachers on average, a gap worth about four weeks of achievement in reading and about two weeks of achievement in mathematics as per VAM-based estimates, and (2) students’ access to effective teaching varies across districts.

On point (1), this is something we have known for years, contrary to what the authors of this brief write (i.e., “there has been limited research on the extent to which disadvantaged students receive less effective teaching than other students”). They simply dismiss a plethora of studies because researchers did not use VAMs to evaluate “effective teaching.” Linda Darling-Hammond’s research, in particular, has been critically important in this area for decades. It is a fact that, on average, students in high-needs schools that disproportionally serve the needs of disadvantaged students have less access to teachers who have certain teacher-quality indicators (e.g., National Board Certification and advanced degrees/expertise in content-areas, although these things are argued not to matter in this brief). In addition, there are also higher teacher turnover rates in such schools, and oftentimes such schools become “dumping grounds” for teachers who cannot be terminated due to many of the tenure laws currently at focus and under fire across the nation. This is certainly a problem, as is disadvantaged students’ access to effective teachers. So, agreed!

On point (2), agreed again. Students’ access to effective teaching varies across districts. There is indeed a lot of variation in terms of teacher quality across districts, thanks largely to local (and historical) educational policies (e.g., district and school zoning, charter and magnet schools, open enrollment, vouchers and other choice policies promoting public school privatization), all of which continue to perpetuate these problems. No surprise really, here, either, as we have also known this for decades, thanks to research that has not been based solely on the use of VAMs but research by, for example, Jonathan Kozol, bell hooks, and Jean Anyon to name a few.

What is most relevant here, though, and in particular for readers of this blog, is that the authors of this brief used misinformed approaches when writing this brief and advancing their findings. That is, they used VAMs to examine the extent to which disadvantaged students receive “less effective teaching” by defining “less effective teaching” using only VAM estimates as the indicators of effectiveness, and as relatively compared to other teachers across the schools and districts in which they found that such grave disparities exist. All the while, not once did they mention how these disparities very likely biased the relative estimates on which they based their main findings.

Most importantly, they blindly agreed to a largely unchecked and largely false assumption that the teachers caused the relatively low growth in scores rather than the low growth being caused by the bias inherent in the VAMs being used to estimate the relative levels of “effective teaching” across teachers. This is the bias that across VAMs is still, it seems weekly, becoming more apparent and of increasing concern (see, for example, a recent post about a research study demonstrating this bias here).

This is also the same issue I detailed in a recent post titled, “Chicken or the Egg?” in which I deconstructed the “Which came first, the chicken or the egg?” question in the context of VAMs. This is becoming increasingly important as those using VAM-based data are using them to make causal claims, when only correlational (or in simpler terms relational) claims can and should be made. The fundamental question in this brief should have been, rather, “What is the real case of cause and consequence” when examining “effective teaching” in these studies across these states? True teacher effectiveness, or teacher effectiveness along with the bias inherent in and across VAMs given the relativistic comparisons on which VAM estimates are based…or both?!?

Interestingly enough, not once was “bias” even mentioned in either the brief or its accompanying technical appendix. It seems to these researchers, there ain’t no such thing. Hence, their claims are valid and should be interpreted as such.

That being said, we cannot continue to use VAM estimates (emphasis added) to support claims about bad teachers causing low achievement among disadvantaged students when VAM researchers increasingly evidence that these models cannot control for the disadvantages that disadvantaged students bring with them to the schoolhouse door. Until these models are bias-free (which is unlikely), never can claims be made that the teachers caused the growth (or lack thereof), or in this case more or less growth than other similar teachers with different sets of students non-randomly attending different districts and schools and non-randomly assigned into different classrooms with different teachers.

VAMs are biased by the very nature of the students and their disadvantages, both of which clearly contribute to the VAM estimates themselves.
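
To make the chicken-or-egg worry concrete, here is a minimal simulation sketch in Python. Everything in it is a hypothetical assumption made for illustration, not anything taken from the brief: the function names, the 80%/20% classroom shares, the size of the out-of-school penalty, and the deliberately naive “value-added” estimate (class-average gain relative to the average teacher). Every simulated teacher is given exactly the same true effect, so any gap in the estimates can only come from who is sitting in the classroom.

```python
import random
import statistics


def simulate_vam(n_teachers=200, class_size=25, disadvantage_penalty=2.0, seed=0):
    """Return (high_needs, naive_estimate) per teacher.

    ASSUMPTION: every teacher has the same true effect (zero); only the
    mix of students differs between classrooms.
    """
    rng = random.Random(seed)
    rows = []
    for t in range(n_teachers):
        # Non-random sorting: the first half of teachers work in high-needs schools.
        high_needs = t < n_teachers // 2
        share_disadvantaged = 0.8 if high_needs else 0.2
        gains = []
        for _ in range(class_size):
            disadvantaged = rng.random() < share_disadvantaged
            # Gain = true teacher effect (0) + out-of-school factor + noise.
            penalty = disadvantage_penalty if disadvantaged else 0.0
            gains.append(0.0 - penalty + rng.gauss(0, 5))
        rows.append((high_needs, statistics.mean(gains)))
    grand_mean = statistics.mean(g for _, g in rows)
    # Naive "value-added": class mean gain relative to the average teacher.
    return [(hn, g - grand_mean) for hn, g in rows]


def mean_estimate_by_group(estimates):
    """Average estimate for high-needs teachers and for all other teachers."""
    high = statistics.mean(v for hn, v in estimates if hn)
    low = statistics.mean(v for hn, v in estimates if not hn)
    return high, low


def share_of_bottom_from_high_needs(estimates, fraction=0.05):
    """Fraction of the lowest-rated teachers who work in high-needs schools."""
    ranked = sorted(estimates, key=lambda pair: pair[1])
    k = max(1, int(len(ranked) * fraction))
    return sum(1 for hn, _ in ranked[:k] if hn) / k


if __name__ == "__main__":
    est = simulate_vam()
    print(mean_estimate_by_group(est))
    print(share_of_bottom_from_high_needs(est))
```

Under these assumptions, teachers in the high-needs half should receive lower estimates on average and crowd into the bottom 5%, even though no teacher is any better or worse than another. Real VAMs include more controls than this toy does; the sketch only illustrates what happens to whatever disadvantage a model fails to absorb.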

It is also certainly worth mentioning that the research cited throughout this brief is not representative of the grander peer-reviewed research available in this area (e.g., research derived via Michelle Rhee’s “Students First”?!?). Likewise, having great familiarity with the authors of not only the three studies cited in this brief, but also the others cited “in support,” let’s just say their aforementioned sheer lack of attention to bias and what bias meant for the validity of their findings was (unfortunately) predictable.

As far as I’m concerned, the (small) differences they report in achievement might as well be real or true, but to claim that teachers caused the differences because of their effectiveness, or lack thereof, is certainly false and untrue.

Citation: Institute of Education Sciences. (2014, January). Do disadvantaged students get less effective teaching? Key findings from recent Institute of Education Sciences studies. National Center for Education Evaluation and Regional Assistance. Retrieved from

Chicken or the Egg?

“Which came first, the chicken or the egg?” is an age-old question, but is more importantly a dilemma about identifying the real cases of cause and consequence.

Recall from a recent post that currently in the state of California nine public school students are challenging California’s teacher tenure system, arguing that their right to a good education is being violated by job protections that protect ineffective teachers, but do not protect the students from being instructed by said teachers. Recall, as well, that a wealthy technology magnate [David Welch] is financing the whole case, which is also affiliated with and backed by Students Matter. The ongoing suit is called “Vergara v. California.”

Welch and Students Matter have thus far brought to testify an all-star cast, most recently including Thomas Kane, an economics professor from Harvard University. Kane also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, not surprisingly as a VAM advocate, advancing a series of highly false claims about the wonderful potentials of VAMs. Potentials that, once again, did not pass any sort of peer review, but that still made it to the US Congress. To read about the many methodological and other problems with the MET studies click here.

If I were to make a list of VAMboozlers, Kane would be near the top of the list, especially as he is increasingly using his Harvard affiliation to advance his own (profitable) credibility in this area. To read an insightful post about just this, read VAMboozled! reader Laura Chapman’s comment at the bottom of a recent post here, in which she wrote, “Harvard is only one of a dozen high profile institutions that has become the source of propaganda about K-12 education and teacher performance as measured by scores on standardized tests.”

Anyhow, and as per a recent article in the Los Angeles Times, Kane testified that “Black and Latino students are more likely to get ineffective teachers in Los Angeles schools than white and Asian students,” and that “the worst teachers–in the bottom 5%–taught 3.2% of white students and 5.4% of Latino students. If ineffective teachers were evenly distributed, you’d expect that 5% of each group of students would have these low-rated instructors.” He concluded that “The teaching-quality imbalance especially hurts the neediest students because ‘rather than assign them more effective teachers to help close the gap with white students they’re assigned less effective teachers, which results in the gap being slightly wider in the following year.’”

Kane’s research was, of course, used to support the claim that bad teachers are causing the disparities that he cited, regardless of the fact the inverse could be also, equally, or even more true–that the value-added measures used to measure teacher effectiveness in these schools are biased by the very nature of the students in these schools that are contributing their low test scores to such estimates. As increasingly being demonstrated in the literature, these models are biased by the types of students in the classrooms and schools that contribute to the measures themselves.

So which one came first? The chicken or the egg? The question here, really, and that I wish defendants would have posed, was whether the students in these schools caused such teachers to appear less effective when in fact they might have been just as effective as “similar” teachers teaching more advantaged kids across town. What we do know from the research literature is that, indeed, there are higher turnover rates in such schools, and oftentimes such schools become “dumping grounds” for teachers who cannot be terminated due to such tenure laws – this is certainly a problem. But to claim that teachers in such schools are causing poor achievement is certainly cause for concern, not to mention a professional and research-based ethics concern as well.

Kane’s “other notable finding was that the worst teachers in Los Angeles are doing more harm to students than the worst ones in other school systems that he compared. The other districts were New York City, Charlotte-Mecklenburg, Dallas, Denver, Memphis and Hillsborough County in Florida.” Don’t ask me how he figured that one out, across states that use different tests, different systems, and have in their schools entirely different and unique populations. Amazing what some economists can accomplish with advanced mathematical models…and just a few (heroic) assumptions.

The Study That Keeps On Giving…

About two months ago, I posted (1) a critique of a highly publicized Mathematica Policy Research study released to the media about the vastly overstated “value” of value-added measures, and (2) another critique of another study released to the media by the National Bureau of Economic Research (NBER). This one, like the other, was not peer-reviewed, or even internally reviewed, yet it was released despite its major issues (e.g., overstated findings about VAMs based on a sample for which only 17% of teachers actually had value-added data).

Again, neither study went through a peer review process, both were fraught with methodological and conceptual issues that did not warrant study findings, and both, regardless, were released to the media for wide dissemination.

Yet again, VAM enthusiasts are attempting to VAMboozle policymakers and the general public with another faulty study, again released by the National Bureau of Economic Research (NBER). But, in an unprecedented move, this time NBER has released the same, highly flawed study three times, even though the first study first released in 2011 still has not made it through peer-review to official publication and it has, accordingly, not proved itself as anything more than a technical report with major methodological issues.

In the first study (2011) Raj Chetty (Economics Professor at Harvard), John Friedman (Assistant Professor of Public Policy at Harvard), and Jonah Rockoff (Associate Professor of Finance and Economics at Harvard) conducted value-added analyses on a massive data set and (over-simplistically) presented (highly-questionable) evidence that favored teachers’ long-lasting, enduring, and in some cases miraculous effects. While some of the findings would have been very welcome to the profession, had they indeed been true (e.g., high value-added teachers substantively affect students’ incomes in their adult years), the study’s authors way overstated their findings, and they did not consider alternative hypotheses in terms of what other factors besides teachers might have caused the outcomes they observed (e.g., those things that happen outside of schools).

Accordingly, and more than appropriately, this study has only been critiqued since, in subsequent attempts to undo what should not have been done in the first place (thanks to both the media and the study’s authors given the exaggerated spin they spun given their results). See, for example, one peer-reviewed critique here, two others conducted by well-known education scholars (i.e., Bruce Baker [Education Professor at Rutgers] and Dale Ballou [Associate Professor of Education at Vanderbilt]) here and here, and another released by the Institute of Education Sciences’ What Works Clearinghouse here.

Maybe in response to their critics, maybe to drive the false findings into more malformed policies, maybe because Chetty (the study’s lead author) just received the John Bates Clark Medal awarded by the American Economic Association, or maybe to have the last word, NBER just released the same exact paper in two more installments. See the second and third releases, positioned as Part I and Part II, to see that they are exactly the same but being promulgated, yet again. While “they” acknowledge that they have done this on the first page of each of the two, it is pretty unethical to go the second round given all of the criticism, the positive and negative press this “working paper” received after its original release(s), and given the study has still not made it through to print in a peer-reviewed journal.

*Thanks to Sarah Polasky for helping with this post.

The 2013 Bunkum Awards & the Gates Foundation’s $45 Million MET Studies

’Tis the award season, and during this time every year, the National Education Policy Center (NEPC) recognizes the “lowlights” in educational research over the previous year, in their annual Bunkum Awards. To view the entertaining video presentation of the awards, hosted by my mentor David Berliner (Arizona State University), please click here.

Lowlights, specifically defined, include research studies in which researchers present, and often oversell thanks to many media outlets, “weak data, shoddy analyses, and overblown recommendations.” As the Razzies are to the Oscars in the Academy of Film, so are the Bunkums to the best educational research studies in the Academy of Education. And like the Razzies, “As long as the bunk [like junk] keeps flowing, the awards will keep coming.”

As per David Berliner, in his introduction in the video, “the taxpayers who finance public education deserve smart [educational] policies based on sound [research-based] evidence.” This is precisely why these awards are both necessary, and morally imperative.

One among this year’s deserving honorees is of particular pertinence here. This is the, drum roll: ‘We’re Pretty Sure We Could Have Done More with $45 Million’ Award, awarded to the Bill & Melinda Gates Foundation for Two Culminating Reports they released this year from their Measures of Effective Teaching (MET) Project. To see David’s presentation on this award, specifically, scroll to minute 3:15 (to 4:30) in the aforementioned video.

Those at NEPC write about these studies: “We think it important to recognize whenever so little is produced at such great cost. The MET researchers gathered a huge data base reporting on thousands of teachers in six cities. Part of the study’s purpose was to address teacher evaluation methods using randomly assigned students. Unfortunately, the students did not remain randomly assigned and some teachers and students did not even participate. This had deleterious effects on the study–limitations that somehow got overlooked in the infinite retelling and exaggeration of the findings.

When the MET researchers studied the separate and combined effects of teacher observations, value-added test scores, and student surveys, they found correlations so weak that no common attribute or characteristic of teacher-quality could be found. Even with 45 million dollars and a crackerjack team of researchers, they could not define an “effective teacher.” In fact, none of the three types of performance measures captured much of the variation in teachers’ impacts on conceptually demanding tests. But that didn’t stop the Gates folks, in a reprise from their 2011 Bunkum-winning ways, from announcing that they’d found a way to measure effective teaching, nor did it deter the federal government from strong-arming states into adoption of policies tying teacher evaluation to measures of students’ growth.”

To read the full critique of both of these studies, written by Jesse Rothstein (University of California – Berkeley) and William Mathis (University of Colorado – Boulder), please click here.

Unpacking DC’s Impact, or the Lack Thereof

Recently, I posted a critique of the newly released and highly publicized Mathematica Policy Research study about the (vastly overstated) “value” of value-added measures and their ability to effectively measure teacher quality. The study, which did not go through a peer review process, is fraught with methodological and conceptual problems, which I dismantled in the post, with a consumer alert.

Yet again, VAM enthusiasts are attempting to VAMboozle policymakers and the general public with another faulty study, this time released to the media by the National Bureau of Economic Research (NBER). The “working paper” (i.e., not peer-reviewed, and in this case not even internally reviewed by those at NBER) analyzed the controversial teacher evaluation system (i.e., IMPACT) that was put into place in DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee.

The authors, Thomas Dee and James Wyckoff (2013), present what they term “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” worked to improve the “performance” of high-performing teachers, and that “dismissal threats” worked to increase the “voluntary attrition of low-performing teachers.” The authors, however, and similar to those of the Mathematica study, assert highly troublesome claims that rest on a plethora of problems. Had this study undergone peer review before it was released to the public and hailed in the media, it would not have created the media hype that ensued. Hence, it is appropriate to issue yet another consumer alert.

The most major problems include, but are not limited to, the following:

“Teacher Performance:” Probably the largest fatal flaw, or the study’s most major limitation was that only 17% of the teachers included in this study (i.e., teachers of reading and mathematics in grades 4 through 8) were actually evaluated under the IMPACT system for their “teacher performance,” or for that which they contributed to the system’s most valued indicator: student achievement. Rather, 83% of the teachers did not have student test scores available to determine if they were indeed effective (or not) using individual value-added scores. It is implied throughout the paper, as well as the media reports covering this study post release, that “teacher performance” was what was investigated when in fact for four out of five DC teachers their “performance” was evaluated only as per what they were observed doing or self-reported doing all the while. These teachers were evaluated on their “performance” using almost exclusively (except for the 5% school-level value-added indicator) the same subjective measures integral to many traditional evaluation systems as well as student achievement/growth on teacher-developed and administrator-approved classroom-based tests, instead.

Score Manipulation and Inflation: Related, a major study limitation was that the aforementioned indicators that were used to define and observe changes in “teacher performance” (for the 83% of DC teachers) were based almost entirely on highly subjective, highly manipulable, and highly volatile indicators of “teacher performance.” The socially constructed indicators used throughout this study were undoubtedly subject to score bias by manipulation and artificial inflation, as teachers (and their evaluators) were able to influence their ratings. While evidence of this was provided in the study, the authors banally dismissed this possibility as “theoretically [not really] reasonable.” When using tests, and especially subjective indicators to measure “teacher performance,” one must exercise caution to ensure that those being measured do not engage in manipulation and inflation techniques known to effectively increase the scores derived and valued, particularly within such high-stakes accountability systems. Again, for 83% of the teachers their “teacher performance” indicators were almost entirely manipulable (with the exception of school-level value-added weighted at 5%).

Unrestrained Bias: Related, the authors set forth a series of assumptions throughout their study that would have permitted readers to correctly predict the study’s findings without reading it. This is highly problematic, as well, and this would not have been permitted had the scientific community been involved. Researcher bias can certainly impact (or sway) study findings and this most certainly happened here.

Other problems include gross overstatements (e.g., about how the IMPACT system has evidenced itself as financially sound and sustainable over time), dismissed yet highly complex technical issues (e.g., about classification errors and the arbitrary thresholds the authors used to statistically define and examine whether teachers “jumped” thresholds and became more effective), and other over-simplistic treatments of major methodological and pragmatic issues (e.g., cheating in DC Public Schools and whether this impacted outcome “teacher performance” data), and the like.

To read the full critique of the NBER study, click here.

The claims the authors have asserted in this study are disconcerting, at best. I wouldn’t be as worried if I knew that this paper truly was in a “working” state and still had to undergo peer-review before being released to the public. Unfortunately, it’s too late for this, as NBER irresponsibly released the report without such concern. Now, we as the public are responsible for consuming this study with critical caution and advocating for our peers and politicians to do the same.

LA Times up to the Same Old Antics

The Los Angeles Times has made it in its own news, again, for its controversial open public records request soliciting the student test scores of all Los Angeles Unified School District (LAUSD) teachers. They have done this repeatedly since 2011 — the first time the Los Angeles Times hired external contractors to calculate LAUSD teachers’ value-added scores, so that they could publish the teachers’ value-added scores on their Los Angeles Teacher Ratings website. They have also done this, repeatedly since 2011, despite the major research-based issues (see for example a National Education Policy Center (NEPC) analysis that followed; see also a recent/timely post about this on another blog at

This time, the Los Angeles Times is charging that the district is opposed to all of this because it wants to keep the data “secret?” How about being opposed to this because doing this is just plain wrong?!

This is an ongoing source of frustration for me since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Song) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings, research-based apprehensions, and cautionary notes, Felch and Song thanked me for my time and moved forward.

It became apparent to me soon thereafter that I was merely a pawn in their media game, so that they could ultimately assert that: “The Times shared [what they did] with leading researchers, including skeptics of value added methods, and incorporated their input.” While they did not incorporate any of my input, not one bit, nor, as far as I could tell, any of the input they might have garnered from any of the other skeptics of value added methods with whom they claimed to have spoken, this claim became particularly handy when they had to defend themselves and their actions once the Los Angeles Times’ Ethics and Standards board stepped in on their behalf.

While I wrote words of detest in response to their “Ethics and Standards” post back then, the truth of the matter was, and still is, that what they continue to do at the Los Angeles Times has literally nothing to do with the current VAM-based research. It is based on “research” only if one defines research as a far-removed action involving far-removed statisticians crunching numbers as part of a monetary contract.

Mathematica Study – Value-Added & Teacher Experience

A study released yesterday by Mathematica Policy Research (and sponsored by the U.S. Department of Education) titled “Teachers with High “Value Added” Can Boost Test Scores in Low-Performing Schools” implies that, yet again, value-added estimates are the key statistical indicators we as a nation should be using, above all else, to make pragmatic and policy decisions about America’s public school teachers. In this case, the argument is that value-added estimates can and should be used to make decisions about where to position high value-added teachers so that they might have greater effects, as well as greater potentials to “add” more “value” to student learning and achievement over time.

While most if not all educational researchers agree with the fundamental idea behind this research study — that it is important to explore ways to improve America’s lowest achieving schools by providing students in such schools increased access to better teachers — this study overstates its “value-added” results.

Hence, I have to issue a Consumer Alert! While I don’t recommend reading the whole technical report (at over 200 single-spaced pages), the main issues with this piece, again not in terms of its overall purpose but in terms of its value-added felicitations, follow:

  1. The high value-added teachers who were selected to participate in this study, and transfer into high-needs schools to teach for two years, were disproportionately National Board Certified Teachers and teachers with more years of teaching experience. The finding that these teachers, selected only because they were high value-added teachers, boosted test scores was confounded by the very fact that they were compared to “similar” teachers in the high-needs schools, many of whom were not certified as exemplary teachers and many of whom (20%) were new teachers…as in, entirely new to the teaching profession! While the high value-added teachers who chose to teach in higher needs schools for two years (with $20,000 bonuses to boot) were likely wonderful teachers in their own rights, the same study results would have likely been achieved by simply choosing teachers with more than X years of experience or choosing teachers whose supervisors selected them as “the best.” Hence, this study was not about using “value-added” as the arbiter of all that is good and objective in measuring teacher effects, it was about selecting teachers who were distinctly different from the teachers to whom they were compared and attributing the predictable results back to the “value-added” selections that were made.
  2. Related, many of the politicians and policymakers who are advancing national and state value-added initiatives and policies are continuously using sets of false assumptions about teacher experience, teacher credentials, and how/why these things do not matter to advance their agendas. Rather, in this study, it seems that teacher experience and credentials mattered the most. Results from this study, hence, contradict initiatives, for example, to get rid of salary schedules that rely on years of experience and credentials, as value-added scores, as evidenced in this study, do seem to capture these other variables (i.e., experience and credentials) as well.
  3. Another argument could be made against the millions of dollars in taxpayer generated funds that our politicians are pouring into these initiatives, as well as federally-funded studies like these. For years, we have known that America’s best teachers are disproportionately located in America’s best schools. While this too was one of this study’s “new” findings, this is nowhere near “new” to the research literature in this area. This has not just recently become a “growing concern” (see, for example, an article that I wrote in Phi Delta Kappan back in 2007 that helps to explore this issue’s historical roots). In addition, we have about 20 years of research studies examining teacher transfer initiatives like the one studied here, but such initiatives are so costly that they are rarely if ever sustainable, and they typically fizzle out in due time. This is yet another instance in which the U.S. Department of Education took an ahistorical approach and funded a research study with a question to which we already knew the answer: “Transferring teacher talent” works to improve schools but not with real-world education funds. It would be fabulous should significant investments be made in this area, though!
  4. Interesting to add, as well, is that “[s]pearheading” this study were staff from The New Teacher Project (TNTP). This group authored the now famous The Widget Effect study. This group is also famously known for advancing false notions that teachers matter so much that they can literally wipe out all of the social issues that continue to ail so many of America’s public schools (e.g., poverty). This study’s ahistorical and overstated results are a bit less surprising taking this into consideration.
  5. Finally, call me a research purist but another Consumer Alert! should automatically ding whenever “random assignment,” “random sampling,” “a randomized design,” or a like term (e.g., “a multisite randomized experiment,” as was used here) is used, especially in a study’s marketing materials. In this study, such terms are exploited when in fact “treatment” teachers were selected due to high-value-added scores (that were not calculated consistently across participating districts), only 35 to 65 percent of teachers had enough value-added data to be eligible for selection, teachers were surveyed to determine if they wanted to participate (self-selection bias), teachers were approved to participate by their supervisors, and then teachers were interviewed by the new principals for whom they were to work, as they were actually for hire. To make this truly random, principals would have had to agree to whatever placements they got and all high-value-added teachers would have had to have been randomly selected and then randomly placed, and given equal probabilities in their placements, regardless of the actors and human agents involved. On the flip side, the “control” teachers to whom the “treatment” (i.e., high-value-added) teachers were to be compared should have also been randomly selected from a pool of control applicants who should have been “similar” to the other teachers in the control schools. To get at more valid results, the control group teachers should have better represented “the typical teachers” at the control schools (i.e., not brand new teachers entering the field). “[N]ormal hiring practices” should not have been followed if study results were to answer whether expert teachers would have greater effects on student achievement than other teachers in high-needs schools.
The research question answered, instead, was whether high-value-added teachers, a statistically significant group of whom were National Board Certified and had about four more years of teaching experience on average, would have greater effects on student achievement than other newly hired teachers, many of whom (20%) were brand new to the profession.
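
The confounding argument in points 1 and 5 can be sketched as a toy simulation in Python. This is only an illustration under loudly stated assumptions, none of which come from the Mathematica report: the function names, the 0-to-25-year experience range, the idea that effectiveness rises over the first five years and then plateaus, the noise levels, and the sample sizes are all hypothetical. The sketch compares two ways of picking “treatment” teachers (by value-added score, or simply by experience) against a control group built the way the study’s controls were described, with 20% brand-new teachers.

```python
import random
import statistics


def make_teachers(n, rng):
    """Hypothetical pool of (years_experience, true_effect, value_added_score)."""
    pool = []
    for _ in range(n):
        years = rng.randint(0, 25)
        # ASSUMPTION: effectiveness rises over the first five years, then plateaus.
        true_effect = 0.1 * min(years, 5) + rng.gauss(0, 0.2)
        # ASSUMPTION: the value-added score is a noisy measurement of the true effect.
        va_score = true_effect + rng.gauss(0, 0.3)
        pool.append((years, true_effect, va_score))
    return pool


def mean_effect(group):
    """Average true effect of a group of teachers."""
    return statistics.mean(t[1] for t in group)


def compare_selection_rules(n=4000, n_control=1000, seed=1):
    rng = random.Random(seed)
    pool = make_teachers(n, rng)

    # Rule A: "treatment" teachers are the top 20% by value-added score.
    by_va = sorted(pool, key=lambda t: t[2], reverse=True)[: n // 5]
    # Rule B: ignore value-added entirely; pick anyone with 5+ years of experience.
    by_experience = [t for t in pool if t[0] >= 5]

    # Control group from "normal hiring": 20% brand-new teachers (zero years),
    # the remainder drawn at random from the general pool.
    novices = [(0, rng.gauss(0, 0.2), None) for _ in range(n_control // 5)]
    others = rng.sample(pool, n_control - len(novices))
    control = novices + others

    return {
        "va_minus_control": mean_effect(by_va) - mean_effect(control),
        "experience_minus_control": mean_effect(by_experience) - mean_effect(control),
        "mean_years_va": statistics.mean(t[0] for t in by_va),
        "mean_years_control": statistics.mean(t[0] for t in control),
    }


if __name__ == "__main__":
    for name, value in compare_selection_rules().items():
        print(name, round(value, 3))
```

In this toy world both selection rules beat the novice-heavy control group, and the teachers picked by value-added also turn out to be more experienced than the controls. That does not prove anything about the actual study; it only shows that a comparison built this way cannot separate “value-added” from experience.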