Unpacking DC’s Impact, or the Lack Thereof

Recently, I posted a critique of the newly released and highly publicized Mathematica Policy Research study about the (vastly overstated) “value” of value-added measures and their ability to effectively measure teacher quality. The study, which did not go through a peer review process, is fraught with methodological and conceptual problems, which I dismantled in that post along with a consumer alert.

Yet again, VAM enthusiasts are attempting to VAMboozle policymakers and the general public with another faulty study, this time released to the media by the National Bureau of Economic Research (NBER). The “working paper” (i.e., not peer-reviewed, and in this case not even internally reviewed by those at NBER) analyzed the controversial teacher evaluation system (i.e., IMPACT) that was put into place in DC Public Schools (DCPS) under then-Chancellor Michelle Rhee.

The authors, Thomas Dee and James Wyckoff (2013), present what they term “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” worked to improve the “performance” of high-performing teachers, and that “dismissal threats” worked to increase the “voluntary attrition of low-performing teachers.” The authors, however, like those of the Mathematica study, assert highly troublesome claims despite a plethora of problems. Had this study undergone peer review before it was released to the public and hailed in the media, these problems would likely have prevented the hype that ensued. Hence, it is appropriate to issue yet another consumer alert.

The most serious problems include, but are not limited to, the following:

“Teacher Performance:” Probably the largest fatal flaw, and the study’s greatest limitation, was that only 17% of the teachers included in this study (i.e., teachers of reading and mathematics in grades 4 through 8) were actually evaluated under the IMPACT system for their “teacher performance,” that is, for what they contributed to the system’s most valued indicator: student achievement. The other 83% of teachers did not have student test scores available to determine whether they were indeed effective (or not) using individual value-added scores. It is implied throughout the paper, as well as in the media reports covering the study after its release, that “teacher performance” was what was investigated, when in fact, for four out of five DC teachers, “performance” was evaluated only in terms of what they were observed doing or self-reported doing. These teachers were instead evaluated on their “performance” using almost exclusively (except for the 5% school-level value-added indicator) the same subjective measures integral to many traditional evaluation systems, as well as student achievement/growth on teacher-developed and administrator-approved classroom-based tests.

Score Manipulation and Inflation: Related, a major limitation was that the indicators used to define and observe changes in “teacher performance” (for the 83% of DC teachers noted above) were almost entirely highly subjective, highly manipulable, and highly volatile. These socially constructed indicators were undoubtedly subject to score bias via manipulation and artificial inflation, as teachers (and their evaluators) were able to influence their ratings. While evidence of this was provided in the study, the authors banally dismissed the possibility as “theoretically [not really] reasonable.” When using tests, and especially subjective indicators, to measure “teacher performance,” one must exercise caution to ensure that those being measured do not engage in the manipulation and inflation techniques known to effectively increase the scores derived and valued, particularly within such high-stakes accountability systems. Again, for 83% of the teachers, the “teacher performance” indicators were almost entirely manipulable (with the exception of school-level value-added, weighted at 5%).

Unrestrained Bias: Related, the authors set forth a series of assumptions throughout their study that would have permitted readers to correctly predict the study’s findings without reading it. This is highly problematic as well, and it would not have been permitted had the scientific community been involved. Researcher bias can certainly impact (or sway) study findings, and that most certainly happened here.

Other problems include gross overstatements (e.g., about how the IMPACT system has evidenced itself as financially sound and sustainable over time), dismissals of highly complex technical issues (e.g., classification errors and the arbitrary thresholds the authors used to statistically define and examine whether teachers “jumped” thresholds and became more effective), and over-simplistic treatments of other major methodological and pragmatic issues (e.g., cheating in DC Public Schools and whether it impacted the outcome “teacher performance” data), and the like.

To read the full critique of the NBER study, click here.

The claims the authors have asserted in this study are disconcerting, at best. I wouldn’t be as worried if I knew that this paper truly was in a “working” state and still had to undergo peer review before being released to the public. Unfortunately, it is too late for that, as NBER irresponsibly released the report without such concern. Now we, as the public, are responsible for consuming this study with critical caution and for advocating that our peers and politicians do the same.

NY Teacher of the Year Not “Highly Effective” at “Adding Value”

Diane Ravitch recently posted “New York’s Teacher of the Year Is Not Rated ‘Highly Effective’” on her blog. In her post she writes about Kathleen Ferguson, New York State’s Teacher of the Year, who has also received numerous excellence-in-teaching awards, including her district’s teacher of the year award.

Ms. Ferguson, despite the clear consensus that she is an excellent teacher, was not even able to evidence herself as “highly effective” on her evaluation, largely because she teaches a disproportionate number of special education students in her classroom. She recently testified before a Senate Education Committee, on her own and other teachers’ behalves, about these rankings, their (over)reliance on student test scores to define “effectiveness,” and the Common Core. To see the original report and to hear a short piece of her testimony, please click here. “This system [simply] does not make sense,” Ferguson said.

In terms of VAMs, this presents evidence of issues with what is called “concurrent-related evidence of validity.” When gathering concurrent-related evidence of validity (see the full definition on the Glossary page of this site), it is necessary to assess, for example, whether teachers who post large or small value-added gains or losses over time are the same teachers deemed effective or ineffective, respectively, over the same period of time using other independent quantitative and qualitative measures of teacher effectiveness (e.g., external teaching excellence awards, as in the case here). If the evidence points in the same direction, concurrent validity is supported, and this adds to the overall argument in support of “validity.” Conversely, if the evidence does not point in the same direction, concurrent validity is not supported, further limiting the overall argument in support of “validity.”

Only when similar sources of evidence support similar inferences and conclusions can our confidence in the sources as independent measures of the same construct (i.e., teaching effectiveness) increase.
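
To make this concrete, here is a minimal, hypothetical sketch (in Python) of how one might check for concurrent-related evidence of validity: compare teachers’ value-added scores against an independent measure of effectiveness and see whether the two rank teachers similarly. The numbers, the 1-to-4 rubric, and the choice of a Spearman rank correlation are all illustrative assumptions of my own; none of it comes from the data behind Ms. Ferguson’s case.

```python
# A minimal sketch with invented data: do two independent measures of
# "teaching effectiveness" rank the same teachers in roughly the same order?
from scipy.stats import spearmanr

vam_scores    = [0.8, -1.2, 0.3, 2.1, -0.4, 1.5, -2.0, 0.0]  # hypothetical value-added estimates
other_ratings = [3, 2, 4, 3, 1, 4, 2, 3]                     # hypothetical independent ratings (1-4 rubric)

rho, p_value = spearmanr(vam_scores, other_ratings)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.2f})")

# A strongly positive rho would support concurrent validity; a weak or negative
# rho, as in cases like Ms. Ferguson's where the measures contradict one another,
# would not.
```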

This is also highlighted in research I have conducted in Houston. Please watch this video (about 12 minutes long) to investigate/understand “the validity issue” further, through the eyes of four teachers who had similar stories but were terminated due to value-added scores that, for the most part, contradicted other indicators of their effectiveness as teachers. You can also read the original study here.

http://www.youtube.com/watch?v=18R2af2CKk8

Florida Newspaper Following “Suit”

On Thursday (November 14th), I wrote about what is happening in the courts of Los Angeles regarding the LA Times’ controversial open public records request soliciting the student test scores of all Los Angeles Unified School District (LAUSD) teachers (see LA Times up to the Same Old Antics). Today I write about the same thing happening in the courts of Florida as per The Florida Times-Union’s suit against the state (see Florida teacher value-added data is public record, appeal court rules).

As the headline reads, “Florida’s controversial value-added teacher data are [to be] public record” and released to The Times-Union for the same reasons and purposes they are being released, again, to the LA Times. These (in many ways right) reasons include: (1) the data should not be exempt from public inspection, (2) these records, because they are public, should be open for public consumption, (3) parents as members of the public and direct consumers of these data should be granted access, and the list goes on.

What is in too many ways wrong, however, is that while the court wrote that the data are “only one part of a larger spectrum of criteria by which a public school teacher is evaluated,” the data will be consumed by a highly assuming, highly unaware, and highly uninformed public as the only source of data that count.

Worse, because the data are derived via complex analyses of “objective” test scores that yield (purportedly) hard numbers from which teacher-level value-added can be calculated using even more complex statistics, the public will trust the statisticians behind the scenes, because they are smart, and they will consume the numbers as true and valid because smart people constructed them.

The key here, though, is that they are, in fact, constructed. In every step of the process of constructing the value-added data, there are major issues that arise and major decisions that are made. Both cause major imperfections in the data that will in the end come out clean (i.e., as numbers), even though they are still super dirty on the inside.

As written by The Times-Union reporter, “Value-added is the difference between the learning growth a student makes in a teacher’s class and the statistically predicted learning growth the student should have earned based on previous performance.” It is just not as simple and as straightforward as that.
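
For reference, here is a minimal sketch (in Python) of what that “simple” definition would look like if taken at face value: predict each student’s current score from prior performance, then attribute the average leftover difference to the teacher. The toy numbers, the two hypothetical teachers, and the plain least-squares prediction are all illustrative assumptions on my part, not the proprietary model Florida actually uses.

```python
# A minimal sketch of the "actual growth minus predicted growth" definition,
# using invented data and ordinary least squares in place of the far more
# complicated models used in practice.
import numpy as np

prior_scores   = np.array([410.0, 455.0, 520.0, 480.0, 390.0, 505.0])
current_scores = np.array([430.0, 460.0, 540.0, 470.0, 420.0, 500.0])
teacher        = np.array(["A", "A", "A", "B", "B", "B"])

# "Predicted" score: a simple regression of current scores on prior scores.
slope, intercept = np.polyfit(prior_scores, current_scores, 1)
predicted = intercept + slope * prior_scores

# "Value-added": the average residual (actual minus predicted) for each teacher's students.
for t in np.unique(teacher):
    residuals = current_scores[teacher == t] - predicted[teacher == t]
    print(f"Teacher {t}: estimated value-added = {residuals.mean():+.1f} scale-score points")
```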

Here are just some reasons why it is not that simple:

(1) Large-scale standardized achievement tests offer very narrow measures of what students have achieved, although they are assumed to measure the breadth and depth of student learning covering an entire year.
(2) The 40 to 50 total tested items do not represent the hundreds of more complex items we value more.
(3) Test developers use statistical tools to remove the items that too many students answered correctly, making much of what we value not at all valued on the tests.
(4) Calculating value-added, or growth “upwards” over time, requires that the scales used to measure growth from one year to the next be on scales of equal units, but this is (to my knowledge) never the case.
(5) The standard statistical response, otherwise, is to norm the test scores before using them, but this then means that for every winner there must be a loser, or in this case that as some teachers get better, other teachers must get worse, which does not reflect reality (see the short sketch after this list).
(6) Value-added estimates, then, do not indicate whether a teacher is good or highly effective, as is also often assumed, but rather whether a teacher is purportedly better or worse than other teachers to whom they are compared, teachers who teach entirely different sets of students who are not randomly assigned to their classrooms but for whom they are to demonstrate “similar” levels of growth.
(7) Teachers assigned more “difficult-to-teach” students are then held accountable for demonstrating similar growth regardless of the numbers of, for example, English Language Learners (ELLs) or special education students in their classes (although statisticians argue they can “control for this,” despite recent research evidence to the contrary).
(8) This becomes more of a biasing issue because statisticians cannot effectively control for what happens (or what does not happen) over the summers. In every current state-level value-added system these tests are administered annually, from the spring of year X to the spring of year Y, always capturing the summer months in the post-test scores and biasing scores dramatically in one way or another based on that which is entirely out of the control of the school.
(9) This forces the value-added statisticians to statistically assume that summer learning growth and decay matter the same for all students, despite the research-based fact that different types of students lose or gain variable levels of knowledge over the summer months.
(10) And this goes into (11), (12), and so forth.
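
On point (5), here is a tiny sketch (again in Python, with invented numbers of my own) of why norming forces winners and losers: once teacher-level estimates are mean-centered and rescaled, they are defined only relative to one another, so the gains and losses cancel out no matter how much teaching improved overall.

```python
import numpy as np

# Invented teacher-level growth estimates in raw scale-score points.
raw_estimates = np.array([4.0, 2.5, 1.0, 0.5, -0.5])  # nearly every teacher's students "grew"

# Norming: center on the average teacher and rescale (a z-score).
normed = (raw_estimates - raw_estimates.mean()) / raw_estimates.std()

print(normed.round(2))         # [ 1.58  0.63 -0.32 -0.63 -1.26]
print(round(normed.sum(), 6))  # 0.0 (up to floating point): every "winner" is offset by a "loser"
```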

But the numbers do not reflect any of this, now do they.

LA Times up to the Same Old Antics

The Los Angeles Times has made it into its own news, again, for its controversial open public records request soliciting the student test scores of all Los Angeles Unified School District (LAUSD) teachers. The Times has done this repeatedly since 2011, when it first hired external contractors to calculate LAUSD teachers’ value-added scores so that it could publish those scores on its Los Angeles Teacher Ratings website. It has continued to do so despite the major research-based issues (see, for example, a National Education Policy Center (NEPC) analysis that followed; see also a recent/timely post about this on another blog at techcrunch.com).

This time, the Los Angeles Times is charging that the district is opposed to all of this because it wants to keep the data “secret.” How about being opposed to this because doing this is just plain wrong?!

This is an ongoing source of frustration for me, since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Song) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings, research-based apprehensions, and cautionary notes, Felch and Song thanked me for my time and moved forward.

It became apparent to me soon thereafter that I was merely a pawn in their media game, there so that they could ultimately assert: “The Times shared [what they did] with leading researchers, including skeptics of value added methods, and incorporated their input.” They did not incorporate any of my input, not one bit, nor, as far as I could tell, any of the input they might have garnered from the other skeptics of value-added methods with whom they claimed to have spoken. Still, the claim became particularly handy when they had to defend themselves and their actions after the Los Angeles Times Ethics and Standards board stepped in on their behalves.

While I wrote words of protest in response to their “Ethics and Standards” post back then, the truth of the matter was, and still is, that what they continue to do at the Los Angeles Times has literally nothing to do with current VAM-based research. It is based on “research” only if one defines research as a far-removed exercise involving far-removed statisticians crunching numbers as part of a monetary contract.