Special Issue of “Educational Researcher” (Paper #2 of 9): VAMs’ Measurement Errors, Issues with Retroactive Revisions, and (More) Problems with Using Test Scores

Recall from a prior post that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the second of these nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou – Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University – and Matthew Springer – Assistant Professor of Public Policy, also at Vanderbilt.

As written in the article’s abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

  • “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
  • “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78). (For a concrete, albeit hypothetical, illustration of this dynamic, see the simulation sketch following this list.)
  • “[R]esources will be wasted if teachers are targeted for interventions without taking
    into account the probability that the ratings they receive are based on error” (p. 78).
  • “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game
    a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs) and despite the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed or researched.
  • Related, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it is rare; hence, it should not significantly impact teachers’ value-added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
  • In terms of teachers versus proctors or other teachers monitoring students when they take the large-scale standardized tests (that are used across all states to calculate value-added estimates), the researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant than I do, in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations, this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but where this is still allowed, this finding might also be relevant. “An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know” (p. 83).
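
To make the class-size point above concrete, here is a minimal simulation sketch (mine, not the authors’; the cutoff, noise level, and all other parameter values are illustrative assumptions, not EVAAS or study parameters). It shows how two teachers with identical true effectiveness, one with 25 tested students and one with 100, face very different odds of falling below a cut score purely because of sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_EFFECT = 0.0   # both teachers are exactly average (illustrative assumption)
STUDENT_SD = 1.0    # student-level noise around the teacher effect, in test SD units
CUTOFF = -0.20      # hypothetical "ineffective" threshold, also in test SD units
N_TRIALS = 100_000  # number of simulated classrooms per class size

def share_flagged(n_students: int) -> float:
    """Share of trials in which a truly average teacher's naive value-added
    estimate (here, simply the classroom mean) falls below the cutoff."""
    scores = rng.normal(TRUE_EFFECT, STUDENT_SD, size=(N_TRIALS, n_students))
    estimates = scores.mean(axis=1)
    return float((estimates < CUTOFF).mean())

small, large = share_flagged(25), share_flagged(100)
print(f"25 tested students:  flagged 'ineffective' in {small:.1%} of trials")
print(f"100 tested students: flagged 'ineffective' in {large:.1%} of trials")
print(f"The smaller class is {small / large:.1f}x more likely to be flagged, "
      f"despite identical true effectiveness.")
```

With these made-up values, the 25-student teacher is flagged roughly 16% of the time versus roughly 2% for the 100-student teacher – on the order of seven times more likely – which sits squarely within the 4-to-12-times range the authors report.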

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS), used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee and in many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with the EVAAS practice whereby teachers’ EVAAS scores change retroactively, as more data become available, to achieve more “precision,” even though teachers’ scores can still change one or two years after the initial score is registered (and used for whatever purposes).
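
For readers unfamiliar with the decision rule at issue, the conventional approach the authors describe boils down to the following (the notation below is mine, not EVAAS’s): divide a teacher’s estimated effect by its standard error, and flag the teacher as differing from average only if the resulting t-statistic clears the two-sided 95% threshold.

```latex
t_j = \frac{\hat{\theta}_j}{SE(\hat{\theta}_j)}, \qquad
\text{report teacher } j \text{ as above or below average only if } |t_j| \ge 1.96
```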

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86. doi:10.3102/0013189X15574904

Special Issue of “Educational Researcher” Examines Value-Added Measures (Paper #1 of 9)

A few months ago, the flagship journal of the American Educational Research Association (AERA) – the peer-reviewed journal titled Educational Researcher (ER) – published a “Special Issue” including nine articles examining value-added measures (VAMs) (i.e., one introduction (reviewed below), four feature articles, one essay, and three commentaries). I will review each of these pieces separately over the next few weeks or so, although if any of you want an advance preview, do click here, as AERA has made each of these articles free and accessible.

In this “Special Issue” editors Douglas Harris – Associate Professor of Economics at Tulane University – and Carolyn Herrington – Professor of Educational Leadership and Policy at Florida State University – solicited “[a]rticles from leading scholars cover[ing] a range of topics, from challenges in the design and implementation of teacher evaluation systems, to the emerging use of teacher observation information by principals as an alternative to VAM data in making teacher staffing decisions.” They challenged authors “to participate in the important conversation about value-added by providing rigorous evidence, noting that successful policy implementation and design are the product of evaluation and adaption” (assuming “successful policy implementation and design” exist, but I digress).

More specifically, in the co-editors’ Introduction to the Special Issue, Harris and Herrington note that in this special issue they “pose dozens of unanswered questions [see below], not only about the net effects of these policies on measurable student outcomes, but about the numerous, often indirect ways in which [unintended] and less easily observed effects might arise.” This section offers, in my opinion, the most “added value.”

Here are some of their key assertions:

  • “[T]eachers and principals trust classroom observations more than value added.”
  • “Teachers—especially the better ones—want to know what exactly they are doing well and doing poorly. In this respect, value-added measures are unhelpful.”
  • “[D]istrust in value-added measures may be partly due to [or confounded with] frustration with high-stakes testing generally.”
  • “Support for value added also appears stronger among administrators than teachers…But principals are still somewhat skeptical.”
  • “[T]he [pre-VAM] data collection process may unintentionally reduce the validity and credibility of value-added measures.”
  • “[I]t seems likely that support for value added among educators will decrease as the stakes increase.”
  • “[V]alue-added measures suffer from much higher missing data rates than classroom observation[s].”
  • “[T]he timing of value-added measures—that they arrive only once a year and during the middle of the school year when it is hard to adjust teaching assignments—is a real concern among teachers and principals alike.”
  • “[W]e cannot lose sight of the ample evidence against the traditional model [i.e., based on snapshot measures examined once per year as was done for decades past, or pre-VAM].” This does not make VAMs “better,” but with this statement most researchers agree.

Inversely, here are some points or assertions that should give pause:

  • “The issue is not whether value-added measures are valid but whether they can be used in a way that improves teaching and learning.” I would strongly argue that validity is a pre-condition of use, as we do not want educators using invalid data to even attempt to improve teaching and learning. I’m actually surprised this statement was published, as it is so scientifically and pragmatically off-base.
  • We know “very little” about how educators “actually respond to policies that use value-added measures.” Clearly, the co-editors are not followers of this blog, other similar outlets (e.g., The Washington Post’s Answer Sheet), or other articles published in the media as well as in scholarly journals about educators’ use, interpretation, opinions, and responses to VAMs, and the like. (For examples of articles published in scholarly journals, see here, here, and here.)
  • “[I]n the debate about these policies, perspective has taken over where the evidence trail ends.” Rather, the evidence trail is already quite saturated in many respects, as study after study continues to evidence the same things (e.g., inconsistencies in teacher-level ratings over time, mediocre correlations between VAM and observational output, all of which matter most if high-stakes decisions are to be tied to value-added output).
  • “[T]he best available evidence [is just] beginning to emerge on the use of value added.” Perhaps the authors of this piece are correct if focusing only on use, or the lack thereof, as we have a good deal of evidence that much use is not happening, given issues with transparency, accessibility, comprehensibility, relevance, fairness, and the like.

In the end, and also of high value, the authors of this piece offer others (e.g., graduate students, practitioners, or academics looking to take note) some interesting points for future research, given that VAMs are likely here to stay, at least for a while. Some, although not all, of their suggested questions for future research are included here:

  • How do educators’ perceptions impact their behavioral responses with and towards VAMs or VAM output? Does administrator skepticism affect how they use these measures?
  • Does the use of VAMs actually lead to more teaching to the test and shallow instruction, aimed more at developing basic skills than at critical thinking and creative
    problem solving?
  • Does the approach further narrow the curriculum and lead to more gaming of the system or, following Campbell’s law, distort the measures in ways that make them less informative?
  • In the process of sanctioning and dismissing low-performing teachers, do value-added-based accountability policies sap the creativity and motivation of our best teachers?
  • Does the use of value-added measures reduce trust and undermine collaboration among educators and principals, thus weakening the organization as a whole?
  • Are these unintended effects made worse by the common misperception that when
    teachers help their colleagues within the school, they could reduce their own value-added measures?
  • Aside from these incentives, does the general orientation toward individual
    performance lead teachers to think less about their colleagues and organizational roles and responsibilities?
  • Are teacher and administrator preparation programs helping to prepare future educators for value-added-based accountability by explaining the measures and their potential uses and misuses?
  • Although not explicitly posed as a question in this piece, one more is important: What are the benefits of VAMs and VAM output, intended or otherwise?

Some of these questions are discussed or answered at least in part in the eight articles included in this “Special Issue” also to be highlighted in the next few weeks or so, one by one. Do stay tuned.

*****

Article #1 Reference: Harris, D. N., & Herrington, C. D. (2015). Editors’ introduction: The use of teacher value-added measures in schools: New evidence, unanswered questions, and future prospects. Educational Researcher, 44(2), 71-76. doi:10.3102/0013189X15576142

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video out, and now, following up on another post about the potential impact these videos should really have, I thought now is an appropriate time to share it. “We can be the change,” and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District – more specifically, four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but it is also about what is happening in general in their schools following the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Article on the “Heroic” Assumptions Surrounding VAMs Published (and Currently Available) in TCR

My former doctoral student and I wrote a paper about the “heroic” assumptions surrounding VAMs. We titled it “‘Truths’ Devoid of Empirical Proof: Underlying Assumptions Surrounding Value-Added Models in Teacher Evaluation,” and it was just published in the esteemed Teachers College Record (TCR). It is also open and accessible for one week, for free, here. I have also pasted the abstract below for more information.

Abstract

Despite the overwhelming and research-based concerns regarding value-added models (VAMs), VAM advocates, policymakers, and supporters continue to hold strong to VAMs’ purported, yet still largely theoretical strengths and potentials. Those advancing VAMs have, more or less, adopted and promoted a set of agreed-upon, albeit “heroic,” assumptions, without independent, peer-reviewed research in support. These “heroic” assumptions transcend promotional, policy, media, and research-based pieces, but they have never been fully investigated, explicated, or made explicit as a set or whole. These assumptions, though often violated, are often ignored in order to promote VAM adoption and use, and also to sell for-profits’ and sometimes non-profits’ VAM-based systems to states and districts. The purpose of this study was to make obvious the assumptions that have been made within the VAM narrative and that, accordingly, have often been accepted without challenge. Ultimately, sources for this study included 470 distinctly different written pieces, from both traditional and non-traditional sources. The results of this analysis suggest that the preponderance of sources propagating unfounded assertions are fostering a sort of VAM echo chamber that seems impenetrable by even the most rigorous and trustworthy empirical evidence.

Evidence of Grade and Subject-Level Bias in Value-Added Measures: Article Published in TCR

One of my most recent posts was about William Sanders — developer of the Tennessee Value-Added Assessment System (TVAAS), which is now more popularly known as the Education Value-Added Assessment System (EVAAS®) — and his forthcoming 2015 James Bryant Conant Award, one of the nation’s most prestigious education honors, which will be awarded to him next month by the Education Commission of the States (ECS).

Sanders is to be honored for his “national leader[ship] in value-added assessments, [as] his [TVAAS/EVAAS] work has [informed] key policy discussion[s] in states across the nation.”

Ironically, this was announced the same week that one of my former doctoral students — Jessica Holloway-Libell, who is soon to be an Assistant Professor at Kansas State University — had a paper published in the esteemed Teachers College Record about this very model. Her paper, titled “Evidence of Grade and Subject-Level Bias in Value-Added Measures,” can be accessed (at least for the time being) here.

You might also recall this topic, though, as we posted her two initial drafts of this article over one year ago, here and here. Both posts followed the analyses she conducted after a VAMboozled follower emailed us expressing his suspicions about grade and subject area bias in his district in Tennessee, in which he was/still is a school administrator.  The question he posed was whether his suspicions were correct, and whether this was happening elsewhere in his state, using Sanders’ TVAAS/EVAAS model.

Jessica found it was.

More specifically, Jessica found that:

  1. Teachers of students in 4th and 8th grades were much more likely to receive positive value-added scores than teachers of other grades (e.g., 5th, 6th, and 7th grades); hence, 4th- and 8th-grade teachers appear to be generally better teachers in Tennessee, per the TVAAS/EVAAS model.
  2. Mathematics teachers (theoretically throughout Tennessee) are, overall, more effective than Tennessee’s English/language arts teachers, regardless of school district; hence, mathematics teachers appear to be generally better than English/language arts teachers in Tennessee, per the TVAAS/EVAAS model.

Being a former mathematics teacher myself, and hence subject-area biased, I’d like to support the second claim as true. But the fact of the matter is that the counterclaims in this case are obviously – and likely entirely – true instead.

It’s not that either of these sets of teachers is in fact better; it’s that Sanders’ TVAAS/EVAAS model – the model for which Sanders is receiving this esteemed award – is yielding biased output. It is doing this for whatever reason (e.g., measurement error, test construction), but this just adds to the list of other problems (see, for example, here, here, and here) and, quite frankly, to the reasons why this model, not to mention its master creator, is undeserving of really any award, except for a Bunkum, perhaps.

Vanderbilt Researchers on Performance Pay, VAMs, and SLOs

Do higher paychecks translate into higher student test scores? That is the question two researchers at Vanderbilt – Ryan Balch (a recent Graduate Research Assistant at Vanderbilt’s National Center on Performance Incentives) and Matthew Springer (Assistant Professor of Public Policy and Education and Director of Vanderbilt’s National Center on Performance Incentives) – attempted to answer in a recent study of the REACH pay-for-performance program in Austin, Texas (a nationally recognized performance program model with $62.3 million in federal support). The study, published in Economics of Education Review, can be found here, but for a $19.95 fee; hence, I’ll do my best to explain the study’s contents so you all can save your money, unless of course you too want to dig deeper.

As background (and as explained on the first page of the full paper), the theory behind performance pay is that tying teacher pay to teacher performance provides “strong incentives” to improve outcomes of interest. “It can help motivate teachers to higher levels of performance and align their behaviors and interests with institutional goals.” I should note, however, that there is very mixed evidence from over 100 years of research on performance pay regarding whether it has ever worked. Economists tend to believe it works while educational researchers tend to disagree.

Regardless, in this study – as per a ResearchNews@Vanderbilt post highlighting it – researchers found that teacher-level growth in student achievement in mathematics and reading in schools in which teachers were given monetary performance incentives was significantly higher during the first year of the program’s implementation (2007-2008) than was the same growth in the nearest matched neighborhood schools where teachers were not given performance incentives. Similar gains were maintained the following year, yet (as per the full report) no additional growth or loss was noted otherwise.

As per the full report as well, researchers more specifically found that students who were enrolled in the REACH program made gains in mathematics that were 0.13 to 0.17 standard deviations greater, and (although not as evident or highlighted in the text of the actual report, but within a related table) gains in reading that were 0.05 to 0.10 standard deviations greater, although the reading gains were also less significant in statistical terms. Curious…
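
To put effect sizes of this magnitude in rough context, here is a quick back-of-the-envelope conversion of my own (assuming normally distributed test scores, which is a simplification): gains of 0.13 to 0.17 standard deviations would move an average student from the 50th percentile to roughly the 55th to 57th percentile, and gains of 0.05 to 0.10 standard deviations to roughly the 52nd to 54th.

```python
from scipy.stats import norm

# Translate standardized gains (in student-level SD units) into the approximate
# percentile an average (50th-percentile) student would reach, assuming normality.
for effect in (0.05, 0.10, 0.13, 0.17):
    print(f"a gain of {effect:.2f} SD ~ the {norm.cdf(effect) * 100:.0f}th percentile")
```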

While the method by which schools were matched was well detailed, and inter-school descriptive statistics were presented to help readers determine whether the schools sampled for this study were in fact comparable (although statistics that would help us determine whether the noted inter-school differences were statistically significant were not), the statistics comparing the teachers in REACH schools with the teachers in non-REACH schools to whom they were compared were completely missing. Hence, it is impossible to even begin to determine whether the matching methodology used actually yielded comparable samples down to the teacher level – the heart of this research study. This is a fatal flaw that, in my opinion, should have prevented this study from being published, at least as is, as without this information we have no guarantees that teachers within these schools were indeed comparable.

Regardless, researchers also examined teachers’ Student Learning Objectives (SLOs) – the incentive program’s “primary measure of individual teacher performance” given so many teachers are still VAM-ineligible (see a prior post about SLOs, here). They examined whether SLO scores correlated with VAM scores, for those teachers who had both.

They found, as per a quote by Springer in the above-mentioned post, that “[w]hile SLOs may serve as an important pedagogical tool for teachers in encouraging goal-setting for students, the format and guidance for SLOs within the specific program did not lead to the proper identification of high value-added teachers.” That is, more precisely and as indicated in the actual study, SLOs were “not significantly correlated with a teacher’s value-added student test scores;” hence, “a teacher is no more likely to meet his or her SLO targets if [his/her] students have higher levels of achievement [over time].” This has huge implications, in particular regarding the still lacking evidence of validity surrounding SLOs.

New “Causal” Evidence in Support of VAMs

No surprise, really, but Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again. His newest article was recently published on the website of the Brookings Institution, titled “Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?” Not surprisingly, as a VAM advocate, in this piece he continues to advance a series of false claims about the wonderful potentials of VAMs, in this case as also yielding causal estimates whereby teachers can be seen as directly causing the growth measured by VAMs (see prior posts about Kane’s very public perspectives here and here).

The best part of this article is where Kane (potentially) seriously considers whether “the short list of control variables captured in educational data systems—prior achievement, student demographics, English language learner [ELL] status, eligibility for federally subsidized meals or programs for gifted and special education students—include the relevant factors by which students are sorted to teachers and schools” and, hence, work to control for bias as these are some of the factors that do (as per the research evidence) distort value-added scores.

The potential surrounding the exploration of this argument, however, quickly turns south as thereafter Kane pontificates using pure logic that “it is possible that school data systems contain the very data that teachers or principals are using to assign students to teachers.” In other words, he painfully attempts to assert his non-research-based argument that as long as principals and teachers use the aforementioned variables to sort students into classrooms, then controlling for said variables should indeed control for the biasing effects caused by the non-random assortment of students into classrooms (and teachers into classrooms, although he does not address that component either).

In addition, he asserts that, “[o]f course, there are many other unmeasured factors [or variables] influencing student achievement—such as student motivation or parental engagement [that cannot or cannot easily be observed]. But as long as those factors are also invisible [emphasis added] to those making teacher and program assignment decisions, our [i.e., VAM statisticians’] inability to control for them” more or less makes not controlling for these other variables inconsequential. In other words, in this article Kane asserts that as long as the “other things” principals and teachers use to non-randomly place students into classrooms are “invisible” to those making student placement decisions, these “other things” should not have to be statistically controlled, or factored out. We should otherwise be good to go, given that the aforementioned variables are already observable and available.
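
To see how much work the “invisible” assumption is doing, here is a minimal simulation sketch of my own (the parameter values and the sorting rule are made up for illustration, not drawn from Kane’s piece or our study): two teachers are equally effective, but the principal tends to assign more-motivated students – a factor the principal can see informally but the data system never records – to one of them. A value-added estimate that controls only for prior achievement then rewards one teacher and penalizes the other.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000  # students per teacher; large, so the gap below reflects bias, not noise

# Both teachers have IDENTICAL true effects of zero.
true_effect_A = true_effect_B = 0.0

# Each student has an observed prior score and an unrecorded motivation level.
prior = rng.normal(0, 1, 2 * n)
motivation = rng.normal(0, 1, 2 * n)

# Assumed sorting rule: the principal tends (imperfectly) to place
# more-motivated students with teacher B.
with_B = motivation + rng.normal(0, 1, 2 * n) > 0

# Current-year score depends on the teacher effect (zero for both), prior
# achievement, motivation, and noise.
teacher_effect = np.where(with_B, true_effect_B, true_effect_A)
score = teacher_effect + 0.7 * prior + 0.3 * motivation + rng.normal(0, 0.5, 2 * n)

# A simple value-added estimate: each teacher's mean residual after regressing
# current scores on prior scores only (motivation is "invisible" to the model).
slope, intercept = np.polyfit(prior, score, 1)
residual = score - (intercept + slope * prior)
print(f"estimated value-added, teacher A: {residual[~with_B].mean():+.2f}")
print(f"estimated value-added, teacher B: {residual[with_B].mean():+.2f}")
```

With these made-up numbers, teacher B comes out roughly 0.3 student-level standard deviations ahead of teacher A even though both are, by construction, identical; the gap disappears only if motivation is either recorded and controlled for or truly unrelated to how students are assigned.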

In a study I wrote with one of my current doctoral students on this very topic, recently published in the esteemed, peer-reviewed American Educational Research Journal (see the full study here), we set out to better determine how and whether the controls used by value-added researchers to eliminate bias might be sufficient, given what actually occurs in practice when students are placed into classrooms.

We found that both teachers and parents play a prodigious role in the student placement process, in almost nine out of ten schools (i.e., 90% of the time). Teachers and parents (although parents are also not mentioned in Kane’s article) provide both appreciated and sometimes unwelcome insights regarding what they perceive to be the best learning environments for their students or children, respectively. Their added insights typically revolve around, in the following order: students’ in-school behaviors, attitudes, and disciplinary records; students’ learning styles and students’ learning styles as matched with teachers’ teaching styles; students’ personalities and students’ personalities as matched with teachers’ personalities; students’ interactions with their peers and prior teachers; general teacher types (e.g., teachers who manage their classrooms in perceptibly better ways); and whether students had siblings in potential teachers’ classrooms prior.

These “other things” are not typically, if ever, controlled for in current VAMs, nor will they likely ever be. In addition, these factors serve as legitimate reasons for class changes during the school year, although whether this, too, is or could be captured in VAMs is highly tentative at best. Otherwise, prior academic achievement, special education needs, giftedness, and gender also influence placement decisions. These are the variables for which most current VAMs account or control, presumably effectively.

Kane, like other VAM statisticians, tends to (and in many ways has to, if he is to continue with his VAM work despite “the issues”) (over)simplify the serious complexities that come about when the random assignment of students to classrooms (and teachers to classrooms) is neither feasible nor realistic, or is outright opposed (as was also clearly evidenced in the above article by 98% of educators; see again here).

The random assignment of students to classrooms (and teachers to classrooms) very rarely happens. Rather, many observable and unobservable variables are used to make such classroom placement decisions, and these variables go well beyond whether students are eligible for free or reduced-price lunches or are English-language learners.

If only the real world surrounding our schools, and in particular the measurement and evaluation of our schools and the teachers within them, were as simple and straightforward as Kane and others continue to assume and argue, although much of the time without evidence other than his own or that of his colleagues at Harvard (i.e., 8/17, or 47%, of the articles cited in Kane’s piece). See also a recent post about this here. In this case, much published research evidence exists to clearly counter this logic and the many related claims herein (see also the other research not cited in his piece but cited in the study highlighted above and linked to again here).

School-Level Bias in the PVAAS Model in Pennsylvania

Research for Action (click here for more about the organization and its mission) just released an analysis of the state of Pennsylvania’s school building rating system – the 100-point School Performance Profile (SPP) used throughout the state to categorize and rank public schools (including charters), although a school can exceed 100 points by earning bonus points. The SPP is largely (40%) based on school-level value-added model (VAM) output derived via the state’s Pennsylvania Education Value-Added Assessment System (PVAAS), generally known as the Education Value-Added Assessment System (EVAAS) – the VAM that is “expressly designed to control for out-of-school [biasing] factors.” This model should sound familiar to followers, as it is the model with which I am most familiar, on which I conduct most of my research, and which is widely considered one of “the most popular” and one of “the best” VAMs in town (see recent posts about this model here, here, and here).

Research for Action’s report, titled “Pennsylvania’s School Performance Profile: Not the Sum of its Parts,” details some serious issues with bias in the state’s SPP. That is, bias (a highly controversial issue covered in the research literature and also on this blog; see recent posts about bias here, here, and here) does appear to exist in this state, particularly at the school level, for (1) subject areas less traditionally tested and, hence, not tested in consecutive grade levels, and (2) because the state combines growth measures with proficiency (i.e., “snapshot”) measures to evaluate schools, with the latter being significantly and negatively correlated with the characteristics of the student populations in the schools being evaluated.

First, “growth scores do not appear to be working as intended for writing and science,
the subjects that are not tested year-to-year.” The figure below “displays the correlation between poverty, PVAAS [growth scores], and proficiency scores in these subjects. The lines for both percent proficient and growth follow similar patterns. The correlation between poverty and proficiency rates is stronger, but PVAAS scores are also significantly and negatively correlated with poverty.”

[Figure: correlations between poverty, PVAAS growth scores, and proficiency rates in writing and science]

While we have covered on this blog before how VAMs, and more importantly this VAM (i.e., the PVAAS, TVAAS, EVAAS, etc.), can also be biased by the subject areas teachers teach (see prior posts here and here), this is the first piece of evidence of which I am aware illustrating that this VAM (and likely other VAMs) might also be biased for subject areas that are not consecutively tested (i.e., from one grade level to the next) and that therefore have no same-subject pretest scores – in which case VAM statisticians substitute other subject areas’ pretest scores (e.g., from mathematics and English/language arts, as scores from these tests are likely correlated with the missing pretest scores) instead. This, of course, has implications for schools (as illustrated in these analyses), but also likely for teachers (as logically generalized from these findings).

Second, researchers found that because five of the six SPP indicators, accounting for 90% of a school’s base score (including an extra-credit portion), rely entirely on test scores, “this reliance on test scores, despite the partial use of growth measures, results in a school rating system that favors more advantaged schools.” The finding here is that using growth or value-added measures in conjunction with proficiency (i.e., “snapshot”) indicators also biases total output against schools serving more disadvantaged students. This is what this looks like, noting in the figures below that bias exists for both the proficiency and growth measures, although the bias in the latter is smaller yet still evident.

[Figure: relationships between school poverty and math/reading proficiency rates (black lines) and PVAAS growth (red lines)]

“The black line in each graph shows the correlation between poverty and the percent of students scoring proficient or above in math and reading; the red line displays the
relationship between poverty and growth in PVAAS. As displayed…the relationship between poverty measures and performance is much stronger for the proficiency indicator.”

“This analysis shows a very strong negative correlation between SPP and poverty in both years. In other words, as the percent of a school’s economically disadvantaged population increases, SPP scores decrease.” That is, school performance declines sharply as schools’ aggregate levels of student poverty increase, if and when states use combinations of growth and proficiency to evaluate schools. This is not surprising, but a good reminder of how much poverty is (co)related with achievement captured by test scores when growth is not considered.
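
For readers who want to run this kind of bias check on their own state’s school-level data, the core analysis is straightforward. Here is a minimal sketch (the file and column names are hypothetical placeholders, not Research for Action’s actual variables): correlate each school’s poverty rate with its proficiency rate, its growth score, and its composite rating, and compare.

```python
import pandas as pd

# Hypothetical school-level file: one row per school, with a poverty measure
# and the rating components. Column names are illustrative placeholders.
schools = pd.read_csv("school_level_ratings.csv")

for measure in ("pct_proficient", "growth_score", "composite_score"):
    r = schools["pct_economically_disadvantaged"].corr(schools[measure])
    print(f"correlation of school poverty with {measure}: r = {r:+.2f}")
```

If a measure were truly independent of student poverty – as growth measures are often claimed to be – its correlation would hover near zero. The report instead finds significant negative correlations for both components, with proficiency the more strongly correlated of the two, and the composite SPP strongly negatively correlated with poverty in both years examined.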

Please read the full 16-page report, as it includes much more important information about both of these key findings than described herein. Otherwise, well done, Research for Action, for “adding value,” for lack of a better term, to the current literature surrounding VAM-based bias.

Harvard Economist Deming on VAM-Based Bias

David Deming – an Associate Professor of Education and Economics at Harvard – just published in the esteemed American Economic Review an article about VAM-based bias, in this case when VAMs are used to measure school versus teacher level effects.

Deming appropriately situated his study within the prior works on this topic, including the key works of Thomas Kane (Education and Economics at Harvard) and Raj Chetty (Economics at Harvard). These two, most notably, continue to advance assertions that using students’ prior test scores and other covariates (i.e., to statistically control for students’ demographic/background factors) minimizes VAM-based bias to negligible levels. Deming also situated his study relative to the notable works of Jesse Rothstein (Public Policy and Economics at the University of California, Berkeley), who continues to evidence that VAM-based bias really does exist. The research of these three key players, along with their scholarly disagreements, has also been highlighted in prior posts about VAM-based bias on this blog (see, for example, here and here).

In this study testing for bias, though, Deming used data from Charlotte-Mecklenburg, North Carolina – a data set derived from a district in which there was quasi-random assignment of students to schools (given a school choice initiative). With these data, Deming tested whether VAM-based bias was evident across a variety of common VAM approaches, from the least sophisticated (e.g., one year of prior test scores and no other covariates) to the most sophisticated (e.g., two or more years of prior test score data plus various covariates).

Overall, Deming failed to reject the hypothesis that school-level effects as measured using VAMs are unbiased, almost regardless of the VAM being used. In more straightforward terms, Deming found that school effects as measured using VAMs were rarely if ever biased when compared against his quasi-random benchmark. Hence, this work falls in line with prior works countering the claim that bias really does exist (Note: this is a correction from the prior post).

There are still, however, at least three reasons that could lead to bias in either direction (i.e., positive, overestimating school effects, or negative, underestimating school effects):

  • VAMs may be biased due to the non-random sorting of students into schools (and classrooms) “on unobserved determinants of achievement” (see also the work of Rothstein, here and here).
  • If “true” school effects vary over time (independent of error), then test-based forecasts based on prior cohorts’ test scores (as is common when measuring the difference between predictions and “actual” growth, when calculating value-added) may be poor predictors of future effectiveness.
  • When students self-select into schools, the impact of attending a school may be different for students who self-select in than for students who do not. The same thing likely holds true for classroom assignment practices, although that is my extrapolation, not Deming’s.

In addition, and in Deming’s overall conclusions that also pertain here, “many other important outcomes of schooling are not measured here. Schools and teachers [who] are good at increasing student achievement may or may not be effective along other important dimensions” (see also here).

For all of these reasons, “we should be cautious before moving toward policies that hold schools accountable for improving their ‘value added,’” given bias.

Can VAMs Be Trusted?

In a recent paper published in the peer-reviewed journal Education Finance and Policy, coauthors Cassandra Guarino (Indiana University – Bloomington), Mark Reckase (Michigan State University), and Jeffrey Wooldridge (Michigan State University) ask and then answer the following question: “Can Value-Added Measures of Teacher Performance Be Trusted?” While what I write below is taken from what I read via the official publication, I link here to the working paper that was published online via the Education Policy Center at Michigan State University (i.e., not for a fee).

From the abstract, authors “investigate whether commonly used value-added estimation strategies produce accurate estimates of teacher effects under a variety of scenarios. [They] estimate teacher effects [using] simulated student achievement data sets that mimic plausible types of student grouping and teacher assignment scenarios. [They] find that no one method accurately captures true teacher effects in all scenarios, and the potential for misclassifying teachers as high- or low-performing can be substantial.”

From elsewhere in the paper, in more specific terms, the authors use simulated data to “represent controlled conditions” that most closely match “the relatively simple conceptual model upon which value-added estimation strategies are based.” This is the strength of this research study, in that the authors’ findings represent best-case scenarios, whereas when working with real-world and real-life data, “conditions are [much] more complex.” Hence, working with various statistical estimators, controls, approaches, and the like using simulated data becomes “the best way to discover fundamental flaws and differences among them when they should be expected to perform at their best.”
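
To give a flavor of what such a simulation study involves, here is a bare-bones sketch of my own (not the authors’ actual data-generating process, estimators, or parameter values): generate students under known teacher effects, recover those effects with a simple lagged-score value-added regression, and count how often teachers land in the wrong quintile of the estimated distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
N_TEACHERS, CLASS_SIZE = 200, 25

# Known "true" teacher effects, in student test-score SD units (illustrative).
true_effects = rng.normal(0, 0.15, N_TEACHERS)

# Students randomly assigned to teachers (a best-case scenario): current score =
# persistence of the prior score + the teacher's true effect + noise.
teacher_ids = np.repeat(np.arange(N_TEACHERS), CLASS_SIZE)
prior = rng.normal(0, 1, N_TEACHERS * CLASS_SIZE)
current = 0.6 * prior + true_effects[teacher_ids] + rng.normal(0, 0.8, prior.size)

# A simple value-added estimator: each teacher's mean residual after
# regressing current scores on prior scores.
slope, intercept = np.polyfit(prior, current, 1)
residuals = current - (intercept + slope * prior)
estimated = np.array([residuals[teacher_ids == t].mean() for t in range(N_TEACHERS)])

def quintile(x: np.ndarray) -> np.ndarray:
    """Assign each value to a quintile (0 = bottom 20%, 4 = top 20%)."""
    return np.searchsorted(np.quantile(x, [0.2, 0.4, 0.6, 0.8]), x)

wrong = quintile(true_effects) != quintile(estimated)
print(f"teachers placed in the wrong quintile of estimated effects: {wrong.mean():.0%}")
```

Even in this idealized setup, with random assignment and a correctly specified achievement model, a nontrivial share of teachers lands in the wrong quintile simply because 25 noisy student scores are not enough to pin down a small teacher effect; the authors’ point is that matters only get worse once such tidy assumptions are relaxed.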

They found…

  • “No one [value-added] estimator performs well under all plausible circumstances, but some are more robust than others…[some] fare better than expected…[and] some of the most popular methods are neither the most robust nor ideal.” In other words, calculating value-added regardless of the sophistication of the statistical specifications and controls used is messy, and this messiness can seriously throw off the validity of the inferences to be drawn about teachers, even given the fanciest models and methodological approaches we currently have going (i.e., those models and model specifications being advanced via policy).
  • “[S]ubstantial proportions of teachers can be misclassified as ‘below average’ or ‘above average’ as well as in the bottom and top quintiles of the teacher quality distribution, even in [these] best-case scenarios.” This means that the misclassification errors we are seeing with real-world data, we are also seeing with simulated data. This leads us to even more concern about whether VAMs will ever be able to get it right, or in this case, counter the effects of the nonrandom assignment of students to classrooms and teachers to the same.
  • Researchers found that “even in the best scenarios and under the simplistic and idealized conditions imposed by [their] data-generating process, the potential for misclassifying above-average teachers as below average or for misidentifying the “worst” or “best” teachers remains nontrivial, particularly if teacher effects are relatively small. Applying the [most] commonly used [value-added approaches] results in misclassification rates that range from at least 7 percent to more than 60 percent, depending upon the estimator and scenario.” So even with a pretty perfect dataset, or a dataset much cleaner than those that come from actual children and their test scores in real schools, misclassification errors can impact teachers upwards of 60% of the time.

In sum, researchers conclude that while certain VAMs hold more promise than others, they may not be capable of overcoming the many obstacles presented by the non-random assignment of students to teachers (and teachers to classrooms).

In their own words, “it is clear that every estimator has an Achilles heel (or more than one area of potential weakness)” that can distort teacher-level output in highly consequential ways. Hence, “[t]he degree of error in [VAM] estimates…may make them less trustworthy for the specific purpose of evaluating individual teachers” than we might think.