Doug Harris on the (Increased) Use of Value-Added in Louisiana

Thus far, four books have been written about value-added models (VAMs) in education: one (2005) scholarly, edited book published prior to our heightened policy interest in VAMs; one (2012) that is less scholarly and more of a field guide on how to use VAM-based data; my recent (2014) scholarly book; and another (2011) scholarly book written by Doug Harris. Doug is an Associate Professor of Economics at Tulane University in Louisiana. He is also, as I’ve written before, “a ‘cautious’ but quite active proponent of VAMs.”

There is quite an interesting history surrounding these latter two books, given that Harris and I have quite different views on VAMs and their potentials in education. To read more about our differing opinions, you can read a review of Harris’s book I wrote for Teachers College Record, and another review a former doctoral student and I wrote for Education Review, to which he responded in his (and his book’s) defense, to which we also responded (with a “rebuttal to a rebuttal,” if you will). What was ultimately titled a “Value-Added Smackdown” in a blog post featured in Education Week got, let’s just say, a little out of hand, with the “smackdown” ending up focusing almost solely on our claim that Harris believed, and we disagreed with, the notion that “value-added [was and still is] good enough to be used for [purposes of] educational accountability.” We asserted then, and I continue to assert now, that “value-added is not good enough to be attaching any sort of consequences much less any such decisions to its output. Value-added may not even be good enough even at the most basic, pragmatic level.”

Harris continues to disagree…

Just this month he released a technical report to his state’s school board (i.e., the Louisiana Board of Elementary and Secondary Education (BESE)), in which he (unfortunately) evidenced that he has not (yet) changed his scholarly stripes, even given the most recent research about VAMs’ increasingly apparent methodological, statistical, and pragmatic limitations (see, for example, here, here, here, here, and here), and the recent position statement released by the American Statistical Association underscoring the key points being evidenced across (most) educational research studies. See also the 24 articles published about VAMs across all American Educational Research Association (AERA) journals here, along with open-access links to the actual articles.

In this report Harris makes “Recommendations to Improve the Louisiana System of Accountability for Teachers, Leaders, Schools, and Districts,” the main one being that the state focus “more on student learning or growth—[by] specifically, calculating the predicted test scores and rewarding schools based on how well students do compared with those predictions.” The five recommendations (of six total) that pertain to our interests here, in more detail and with my commentary, include the following:

1. “Focus more on student growth [i.e., value-added] in order to better measure the performance of schools.” Not that there is any research evidence in support, but “The state should [also] aim for a 50-50 split between growth and achievement levels [i.e., not based on value-added].” Doing this at the school accountability level “would also improve alignment with teacher accountability, which includes student growth [i.e., teacher-level value-added] as 50% of the evaluation.” (For what such a 50-50 composite might look like in practice, see the rough sketch following this list.)

2. “Reduce uneven incentives and avoid ‘incentive cliffs’ by increasing [school performance score] points more gradually as students move to higher performance levels,” notwithstanding the fact that no research to date has evidenced that such incentives incentivize much of anything intended, at least in education. Regardless, and despite the research, “Giving more weight to achievement growth [will help to create] more even [emphasis added] incentives (see Recommendation #1).”

3. Related, “Create a larger number of school letter grades [to] create incentives for all schools to improve,” by adding +/- extensions to the school letter grades, because “[i]f there were more categories, the next [school letter grade] level would always be within reach…. This way all schools will have an incentive to improve, whereas currently only those who are at the high end of the B-D categories have much incentive.” If only the real world of education worked according to such simple idioms, like those underlying the theories supporting incentives (e.g., the carrot dangled just beyond the mule’s reach will make the mule pull the cart harder).

5. “Eliminate the first override provision in the teacher accountability system, which automatically places teachers who are ‘Ineffective’ on either measure in the ‘Ineffective’ performance category.” With this recommendation, I fully agree, as Louisiana is one of the most extreme states when it comes to attaching consequences to problematic data, although I don’t think Harris would agree with my “problematic” classification. This would mean that “teachers who appear highly effective on one measure could not end up in the ‘Ineffective’ category,” which for this state would certainly be a step in the right direction. However, Harris’s assertion that doing this would also help prevent principals from saving truly ineffective teachers (e.g., by countering teachers’ value-added scores with artificially inflated or allegedly fake observational scores) is one I find insulting on behalf of principals as professionals.

6. “Commission a full-scale third party evaluation of the entire accountability system focused on educator responses and student outcomes.” With this recommendation, I also fully agree, under certain conditions: (1) the external evaluator is indeed external to the system and has no conflicts of interest, including financial ones (even prior to payment for the external review); (2) what the external evaluator is to investigate is informed by the research, in terms of the framing of the questions that need to be asked; (3) as also recommended by Harris, the perspectives of those involved (e.g., principals and teachers) are included in the evaluation design; and (4) all parties formally agree to releasing all data regardless of what (positive or negative) the external evaluator might evidence and find.
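
For readers who want to see, concretely, what recommendation #1’s “predicted test scores” and 50-50 growth/achievement blend might look like, here is a minimal sketch in Python. To be clear, the simulated data, model, and scaling constants below are hypothetical placeholders of my own, not Harris’s or Louisiana’s actual formulas, which the report does not fully specify.

```python
# A minimal, hypothetical sketch of a "predicted score" growth measure and a
# 50-50 growth/achievement composite, as described in recommendation #1.
# The simulated data, model, and scaling are illustrative placeholders,
# NOT the actual Louisiana (or Harris's proposed) formulas.
import numpy as np

rng = np.random.default_rng(0)

n_students, n_schools = 5000, 25
school = rng.integers(0, n_schools, size=n_students)   # each student's school
prior = rng.normal(300, 25, size=n_students)           # prior-year scale scores
school_effect = rng.normal(0, 5, size=n_schools)       # simulated "true" school effects
current = 0.9 * prior + 40 + school_effect[school] + rng.normal(0, 15, size=n_students)

# Predict current scores from prior scores, using all students "statewide."
slope, intercept = np.polyfit(prior, current, 1)
predicted = slope * prior + intercept
residual = current - predicted                          # actual minus predicted

# Per-school pieces: "growth" = average actual-minus-predicted score;
# "achievement" = average current score; composite = a 50-50 blend of the two
# after a rough z-score-style rescaling.
for s in range(n_schools):
    mask = school == s
    growth_z = residual[mask].mean() / residual.std()
    achievement_z = (current[mask].mean() - current.mean()) / current.std()
    composite = 0.5 * growth_z + 0.5 * achievement_z
    print(f"school {s:2d}: growth_z={growth_z:+.2f}  "
          f"achievement_z={achievement_z:+.2f}  composite={composite:+.2f}")
```

Real systems add many more controls (student demographics, multiple prior years of scores, etc.); the point here is only to show the basic structure of a predicted-score growth measure blended 50-50 with an achievement level.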

Harris’s additional details and “other, more modest recommendations” include the following:

  • Keep “value-added out of the principal [evaluation] measure,” but “the state should consider calculating principal value-added measures and issuing reports that describe patterns of variation (e.g., variation in performance overall [in] certain kinds of schools) both for the state as a whole and specific districts.” This reminds me of the time that value-added measures for teachers were to be used only for descriptive purposes. While noble as a recommendation, we know from history what policymakers can do once the data are made available.
  • “Additional Teacher Accountability Recommendations” start on page 11 of the report, although all of these (unfortunately, again) focus on value-added model twists and tweaks (e.g., how to adjust for ceiling effects for schools and teachers with disproportionate numbers of gifted/high-achieving students, how to watch for and account for bias) to make the teacher value-added model even better.

Harris concludes that “With these changes, Louisiana would have one of the best accountability systems in the country. Rather than weakening accountability, these recommendations [would] make accountability smarter and make it more likely to improve students’ academic performance.” Following these recommendations would “make the state a national leader.” While Harris cites 20 years of failed attempts in Louisiana and across the country as the reason America’s public education system has not improved its public school students’ academic performance, I’d argue it’s more like 40 years of failed attempts, because Harris’s (and so many others’) accountability-bent logic is seriously flawed.

Correction: Make the “Top 13” VAM Articles the “Top 14”

As per my most recent post earlier today about the Top 13 research-based articles about VAMs, lo and behold, another great research-based statement was just this week released by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

So, let’s make the Top 13 the Top 14 and call it a day. I say “day” deliberately; this is such a hot and controversial topic that it is often hard to keep up with the literature in this area on literally a daily basis.

As per this outstanding statement released by the ASA – the best statistical organization in the U.S. and one of, if not the, best statistical associations in the world – some of the most important parts of the statement, taken directly from the full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality. [See also the small simulation sketch following this list.]
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.
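
To give a feel for the scale of point #7 (that teachers account for roughly 1% to 14% of the variability in test scores), here is a small simulation of my own, not ASA’s, in which a “teacher effect” is built in, by construction, to explain 10% of the total variance:

```python
# A small simulation (my illustration, not ASA's) of point #7 above: even when
# a real "teacher effect" exists, it explains only a modest share of the total
# variability in student test scores, and classroom averages track it imperfectly.
import numpy as np

rng = np.random.default_rng(1)

n_teachers, students_per_teacher = 200, 25
teacher_sd, student_sd = 1.0, 3.0   # chosen so teachers explain ~10% of variance

teacher_effect = rng.normal(0, teacher_sd, size=n_teachers)
scores = (teacher_effect[:, None]
          + rng.normal(0, student_sd, size=(n_teachers, students_per_teacher)))

share = teacher_sd**2 / (teacher_sd**2 + student_sd**2)
print(f"teacher share of total variance (by construction): {share:.0%}")  # 10%

# Classroom means only partly recover the true teacher effects:
class_means = scores.mean(axis=1)
corr = np.corrcoef(class_means, teacher_effect)[0, 1]
print(f"correlation between class means and true teacher effects: {corr:.2f}")
```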

Also important to point out is that, in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, but are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).
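
As a rough illustration of why those confidence intervals and standard errors matter (a toy example of my own, not any state’s actual formula), consider how quickly the uncertainty around a single teacher’s value-added “estimate” swamps the estimate itself at a typical class size:

```python
# A toy example (mine, not any state's actual reporting) of why value-added
# "estimates" should always be accompanied by standard errors and confidence
# intervals. All numbers are hypothetical.
import math

# Suppose a teacher's 25 students gained, on average, 0.10 standard deviations
# more than predicted, and the student-level residual gains have an SD of 0.8.
mean_gain, sd_gain, n = 0.10, 0.8, 25

std_error = sd_gain / math.sqrt(n)        # standard error of the class mean
ci_low = mean_gain - 1.96 * std_error     # approximate 95% confidence interval
ci_high = mean_gain + 1.96 * std_error

print(f"estimate = {mean_gain:+.2f}, SE = {std_error:.2f}")
print(f"approximate 95% CI = [{ci_low:+.2f}, {ci_high:+.2f}]")
# Here the interval runs from about -0.21 to +0.41; it comfortably includes
# zero, so the data cannot distinguish this teacher from an "average" one.
```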

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

If you’re going to choose one article to read and review this week or this month, one that is thorough and gets right to the key points, this is the one I recommend…at least for now!

More from an English Teacher in North Carolina

The same English teacher, Chris Gilbert, whom I referenced in a recent post just wrote yet another great piece in The Washington Post.

He writes about an automated phone call he received informing him (and the rest of his colleagues) that the top 25% of teachers in his district were to be offered four-year contracts and an additional, annual $500 in exchange for relinquishing their tenure rights. This was recently added to a slew of other legislative actions in his state of North Carolina including, but not limited to, another year without pay increases (making this the 5th year without increases), no more tenure, no more salary increases for earning master’s/doctoral degrees, and no more class-size caps.

The problems with just this 25% policy, as he writes, include the following: the “policy reflects the view that teachers are inadequately motivated to do their jobs;” it implies, without any evidence, that only an arbitrarily set “25% of a district’s teachers deserve a raise;” it facilitates a “culture of competition [that] kills the collaboration that is integral to effective education;” “[t]he idea that a single teacher’s influence can be isolated [using VAMs] is absurd;” and, in general, the policy “reflects a myopic approach to reform.”

More Value-Added Problems in DC’s Public Schools

Over the past month I have posted two entries about what’s going on in DC’s public schools with the value-added-based teacher evaluation system developed and advanced by the former School Chancellor Michelle Rhee and carried on by the current School Chancellor Kaya Henderson.

The first post was about a bogus “research” study in which National Bureau of Economic Research (NBER)/University of Virginia and Stanford researchers made overstated and false claims that the system was indeed working and effective, despite the fact that (among other problems) 83% of the teachers in the study did not have student test scores available to measure their “value added.” The second post was about a DC teacher’s experiences being evaluated under this system (as part of the aforementioned 83%) using almost solely his administrator’s and master educator’s observational scores. That post demonstrated, with data, how error prone this part of the DC system also proved to be.

Adding to the value-added issues in DC, DC public school officials disclosed (the day before winter break), and two Washington Post articles then reported (see the first article here and the second here), that 44 DC public school teachers also received incorrect evaluation scores for the last academic year (2012-2013) because of technical errors in the ways the scores were calculated. One of the 44 teachers was fired as a result, although (s)he is now looking to be reinstated and compensated for the salary lost.

While “[s]chool officials described the errors as the most significant since the system launched a controversial initiative in 2009 to evaluate teachers in part on student test scores,” they also downplayed the situation as impacting only 44 teachers.

VAM formulas are certainly “subject to error,” and they are subject to error always, across the board, for teachers in general as well as for the 470 DC public school teachers with value-added scores based on student test scores. Put more accurately, just over 10% (n=470) of all DC teachers (n=4,000) were evaluated using their students’ test scores, which is even less than the roughly 17% implied by the 83% figure mentioned above. And for about 10% of these teachers (n=44), calculation errors were found.
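
For anyone who wants to check the arithmetic behind these percentages, here is a quick back-of-the-envelope calculation using only the counts reported above (roughly 4,000 DC teachers in total, 470 with test-score-based value-added scores, and 44 with miscalculated scores):

```python
# Back-of-the-envelope check of the percentages above, using the counts
# reported in this post.
total_teachers = 4000       # approximate number of DC public school teachers
with_value_added = 470      # teachers with value-added scores from student tests
with_errors = 44            # teachers whose scores were miscalculated

print(f"share evaluated on student test scores: {with_value_added / total_teachers:.1%}")  # ~11.8%
print(f"share of those with miscalculated scores: {with_errors / with_value_added:.1%}")   # ~9.4%
```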

This is not a “minor glitch,” as written into a recent Huffington Post article covering the same story, which positions the teachers’ unions as almost irrational for “slamming the school system for the mistake and raising broader questions about the system.” It is a major glitch caused both by inappropriate “weightings” of teachers’ administrators’ and master educators’ observational scores, as well as by “a small technical error” that directly impacted the teachers’ value-added calculations. It is a major glitch with major implications, about which others, including not just those from the unions but many (e.g., 90%) from the research community, are concerned. It is a major glitch that does warrant additional concern about this AND all of the other statistical and other errors not mentioned but prevalent in all value-added scores (e.g., the errors always found in large-scale standardized tests, particularly given their non-equivalent scales; the errors caused by missing data; the errors caused by small class sizes; the errors caused by summer learning loss/gains; the errors caused by other teachers’ simultaneous and carryover effects; the errors caused by parental and peer effects [see also this recent post about these]; etc.).

So what type of consequence is in store for those perpetuating such nonsense? This includes, particularly here, those charged with calculating and releasing value-added “estimates” (“estimates” because these are not, and should never be interpreted as, hard data), but also the reporters who report on these issues without understanding them or reading the research about them. I, for one, would like to see them held accountable for the “value” they too are to “add” to our thinking about these social issues, rather than detracting and distracting readers from the very real, research-based issues at hand.

New Research Study: Incentives with No Impact

Following VAMboozled!’s most recent post (November 25, 2013) about the non-peer-reviewed National Bureau of Economic Research (NBER) study of DC’s IMPACT program, another recently published study, this time in the Journal of Labor Economics, a top peer-reviewed field journal, found no impact of incentives in New York City’s (NYC) public schools, despite the large-scale, multimillion-dollar program. The author, Roland Fryer Jr. (2013), analyzed data from a school-based experiment “to better understand the impact of teacher incentives on student achievement.”

A randomized experiment, the gold standard in applied work of this kind, was implemented in more than 200 NYC public schools. The schools decided on the specific incentive scheme, either team or individual. The stakes were relatively high: on average, a high-performing school (i.e., a school that met 100% of its target) received a transfer of $180,000, and a school that met 75% of the target received $90,000. Not bad, by all accounts!

The target was set based on a school’s performance in terms of students’ achievement, improvement, and the learning environment. Yes, a fraction of schools met the target and received the transfers, but this did not improve students’ achievement, to say the least. If anything, the incentive in fact worsened students’ performance. The estimates from the experiment imply that if a student attended a middle school with the incentive in place for three years, his/her math test scores would decline by 0.138 of a standard deviation and his/her reading scores would drop by 0.09 of a standard deviation.
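
To put those effect sizes into more familiar terms, here is my own rough conversion (not a calculation from Fryer’s paper), assuming approximately normally distributed test scores: a decline of 0.138 standard deviations moves an otherwise average student from the 50th percentile down to roughly the 45th.

```python
# Converting the reported effect sizes into percentile terms, assuming roughly
# normally distributed test scores. This is my own illustration, not a
# calculation from Fryer (2013).
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

effects = {"math": -0.138, "reading": -0.09}   # SD effects after three years

for subject, effect in effects.items():
    # Where would a previously average (50th percentile) student land?
    new_percentile = normal_cdf(effect) * 100
    print(f"{subject}: roughly the {new_percentile:.0f}th percentile")
```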

Not only that, but the incentive program had no effect on teachers’ absenteeism or retention in their schools or the district, nor did it affect teachers’ perceptions of the learning environment in their schools. Literally, the estimated $75 million invested and spent brought zero return!

This evaluation, together with a few others (Glazerman & Seifullah, 2010; Springer et al., 2010; Vigdor, 2008), raises questions about the financial effectiveness of similar incentives in schools, about achievement-based accountability measures in particular, and about their ability to positively affect students’ achievement.

Thanks to Margarita Pivovarova – Assistant Professor of Economics at Arizona State University – for this post.

References:

Fryer, R. G. (2013). Teacher incentives and student achievement: Evidence from New York City Public Schools. Journal of Labor Economics, 31(2), 373-407.

Glazerman, S., & Seifullah, A. (2010). An evaluation of the Teacher Advancement Program (TAP) in Chicago: Year two impact report. Washington, DC: Mathematica Policy Research. Retrieved from http://www.mathematica-mpr.com/publications/pdfs/education/tap_yr2_rpt.pdf

Springer, M. G., Ballou, D., Hamilton, L. S., Le, V.-N., Lockwood, J.R., McCaffrey, D.F., Pepper, M., & Stecher, B.M. (2010). Teacher pay for performance: Experimental evidence from the project on incentives in teaching. Nashville, TN: National Center on Performance Incentives. Retrieved from http://www.rand.org/content/dam/rand/pubs/reprints/2010/RAND_RP1416.pdf

Vigdor, J. L. (2008). Teacher salary bonuses in North Carolina. Nashville, TN: National Center on Performance Incentives. Retrieved from https://my.vanderbilt.edu/performanceincentives/files/2012/10/200803_Vigdor_TeacherBonusesNC.pdf

Unpacking DC’s Impact, or the Lack Thereof

Recently, I posted a critique of the newly released and highly publicized Mathematica Policy Research study about the (vastly overstated) “value” of value-added measures and their ability to effectively measure teacher quality. The study, which did not go through a peer review process, is fraught with methodological and conceptual problems, which I dismantled in the post with a consumer alert.

Yet again, VAM enthusiasts are attempting to VAMboozle policymakers and the general public with another faulty study, this time released to the media by the National Bureau of Economic Research (NBER). The “working paper” (i.e., not peer-reviewed, and in this case not even internally reviewed by those at NBER) analyzed the controversial teacher evaluation system (i.e., IMPACT) that was put into place in DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee.

The authors, Thomas Dee and James Wyckoff (2013), present what they term “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” worked to improve the “performance” of high-performing teachers, and that “dismissal threats” worked to increase the “voluntary attrition of low-performing teachers.” The authors, however, and similar to those of the Mathematica study, assert highly troublesome claims on the basis of a plethora of problems; had this study undergone peer review before it was released to the public and hailed in the media, it would not have created the media hype that ensued. Hence, it is appropriate to issue yet another consumer alert.

The most major problems include, but are not limited to, the following:

“Teacher Performance:” Probably the largest fatal flaw, or the study’s most major limitation, was that only 17% of the teachers included in this study (i.e., teachers of reading and mathematics in grades 4 through 8) were actually evaluated under the IMPACT system for their “teacher performance,” or for that which they contributed to the system’s most valued indicator: student achievement. The other 83% of the teachers did not have student test scores available to determine whether they were indeed effective (or not) using individual value-added scores. It is implied throughout the paper, as well as in the media reports covering this study post-release, that “teacher performance” was what was investigated, when in fact, for four out of five DC teachers, “performance” was evaluated only in terms of what they were observed doing or reported doing all the while. These teachers were instead evaluated on their “performance” almost exclusively (except for the 5% school-level value-added indicator) using the same subjective measures integral to many traditional evaluation systems, along with student achievement/growth on teacher-developed and administrator-approved classroom-based tests.

Score Manipulation and Inflation: Related, a major study limitation was that the aforementioned indicators used to define and observe changes in “teacher performance” (for the 83% of DC teachers) were based almost entirely on highly subjective, highly manipulable, and highly volatile indicators of “teacher performance.” The socially constructed indicators used throughout this study were undoubtedly subject to score bias via manipulation and artificial inflation, as teachers (and their evaluators) were able to influence their ratings. While evidence of this was provided in the study, the authors banally dismissed this possibility as “theoretically [not really] reasonable.” When using tests, and especially subjective indicators, to measure “teacher performance,” one must exercise caution to ensure that those being measured do not engage in manipulation and inflation techniques known to effectively increase the scores derived and valued, particularly within such high-stakes accountability systems. Again, for 83% of the teachers, the “teacher performance” indicators were almost entirely manipulable (with the exception of school-level value-added, weighted at 5%).

Unrestrained Bias: Related, the authors set forth a series of assumptions throughout their study that would have permitted readers to correctly predict the study’s findings without reading it. This is highly problematic, as well, and this would not have been permitted had the scientific community been involved. Researcher bias can certainly impact (or sway) study findings and this most certainly happened here.

Other problems include gross overstatements (e.g., about how the IMPACT system has evidenced itself as financially sound and sustainable over time), dismissed yet highly complex technical issues (e.g., classification errors and the arbitrary thresholds the authors used to statistically define and examine whether teachers “jumped” thresholds and became more effective), over-simplistic treatments of major methodological and pragmatic issues (e.g., cheating in DC Public Schools and whether this impacted “teacher performance” outcome data), and the like.

To read the full critique of the NBER study, click here.

The claims the authors have asserted in this study are disconcerting, at best. I wouldn’t be as worried if I knew that this paper truly was in a “working” state and still had to undergo peer-review before being released to the public. Unfortunately, it’s too late for this, as NBER irresponsibly released the report without such concern. Now, we as the public are responsible for consuming this study with critical caution and advocating for our peers and politicians to do the same.

Why VAMs & Merit Pay Aren’t Fair

An “oldie” (i.e., published about one year ago), but a goodie! This one is already posted in the video gallery of this site, but it recently came up again as a good, short (at three minutes) video that captures some of the main issues.
Check it out and share as (so) needed!

Six Reasons Why VAMs and Merit Pay Aren’t Fair