Including Summers “Adds Considerable Measurement Error” to Value-Added Estimates

A new article titled “The Effect of Summer on Value-added Assessments of Teacher and School Performance” was recently released in the peer-reviewed journal Education Policy Analysis Archives. The article is authored by Gregory Palardy and Luyao Peng from the University of California, Riverside. 

Before we begin, though, here is some background so that you all understand the importance of the findings in this particular article.

In order to calculate teacher-level value-added, all states are currently using (at minimum) the large-scale standardized tests mandated by No Child Left Behind (NCLB) in 2002. These tests were mandated for use in the subject areas of mathematics and reading/language arts. However, because these tests are given only once per year, typically in the spring, statisticians calculate value-added by measuring actual versus predicted "growth" from spring to spring, over a 12-month span that includes the summer.

While many (including many policymakers) assume that value-added estimates are calculated from fall to spring, during time intervals in which students are under the same teachers' supervision and instruction, this is not true. The reality is that the pre- to post-test occasions actually span 12-month periods, including the summers that cause the nettlesome summer effects often observed via VAM-based estimates. Different students learn different things over the summer, and this is strongly associated (and correlated) with students' backgrounds and their out-of-school opportunities (e.g., travel, summer camps, summer schools). Likewise, because summers are the time periods over which teachers and schools tend to have little control over what students do, they are also the time periods during which research indicates achievement gaps are maintained or widen. More specifically, research indicates that students from relatively lower socio-economic backgrounds tend to suffer more from learning decay than their wealthier peers, although they learn at similar rates during the school year.
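To make the arithmetic concrete, here is a minimal sketch of how a spring-to-spring growth score folds summer change into the gain attributed to the current teacher. The scale-score numbers are invented for illustration only; they are not drawn from the study.

```python
# A minimal sketch (hypothetical numbers, not from the study) of how a
# spring-to-spring "growth" score folds summer change into the gain
# credited to the current teacher.

prior_spring = 450     # last spring's score, earned with the prior teacher
fall = 442             # fall score, after summer learning decay
current_spring = 470   # this spring's score, earned with the current teacher

spring_to_spring_gain = current_spring - prior_spring  # 20 points (what a 12-month VAM sees)
summer_change = fall - prior_spring                    # -8 points (decay, outside any teacher's control)
school_year_gain = current_spring - fall               # 28 points (the fall-to-spring school year)

print(f"Spring-to-spring gain credited to the current teacher: {spring_to_spring_gain}")
print(f"Portion that is actually summer change:                {summer_change}")
print(f"Gain over the fall-to-spring school year:              {school_year_gain}")
```

Because summer change differs systematically by students' backgrounds, the 12-month gain can under- or over-state what actually happened during the fall-to-spring school year.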

What these 12-month testing intervals also include are prior teachers' residual effects, because students testing in the spring, for example, finish out each school year (e.g., two months or so) with their prior teachers before entering the classrooms of the teachers for whom value-added is to be calculated the following spring. Teachers' residual effects, however, were not the focus of this particular study.

Nonetheless, via the research, we have always known that these summer effects (and prior or adjacent teachers' residual effects) are difficult if not impossible to statistically control. This in and of itself leads to much of the noise (fluctuations/lack of reliability, imprecision, and potential bias) we observe in the resulting value-added estimates. This is precisely what was the focus of this particular study.

In this study researchers examined "the effects of including the summer period on value-added assessments (VAA) of teacher and school performance at the [1st] grade [level]," as compared to VAM-based estimates derived from a fall-to-spring test administration within the same grade and year (i.e., using data from a nationally representative sample from the National Center for Education Statistics (NCES), with n = 5,034 children).

Researchers found that:

  • Approximately 40-62% of the variance in VAM-based estimates originates from the summer period, depending on the reading or math outcome.
  • When summer is omitted from VAM-based calculations using within-year pre/post-tests, approximately 51-61% of teachers change performance categories. In simpler terms, including summers in VAM-based estimates is indeed causing some of the errors and misclassification rates observed across studies (see the simulation sketch after this list).
  • Statistical controls for student and classroom/school variables (e.g., controlling for students' prior achievement) reduce summer effects considerably, yet 36-47% of teachers still fall into different quintiles when summers are included in the VAM-based estimates.
  • Findings also evidence that including summers within VAM-based calculations tends to bias VAM-based estimates against schools with higher relative concentrations of poverty, or rather higher relative concentrations of students who are eligible for the federal free-and-reduced lunch program.
  • Overall, results suggest that removing summer effects from VAM-based estimates may require twice-annual achievement assessments (i.e., fall and spring). If we want VAM-based estimates to be more accurate, we might have to double the number of tests we administer per year in each subject area for which teachers are to be held accountable using VAMs. However, "if twice-annual assessments are not conducted, controls for prior achievement seem to be the best method for minimizing summer effects."
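To illustrate the reclassification finding, here is a minimal simulation sketch. The parameters are assumed purely for illustration (within-year growth and summer change are given equal variance, and summer change is treated as unrelated to the teacher); they are not the study's data or model.

```python
# A minimal simulation sketch (assumed parameters, not the study's data) of how
# folding summer change into a spring-to-spring growth score can move teachers
# across performance quintiles relative to a fall-to-spring (within-year) score.
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 1000

within_year_growth = rng.normal(0, 1, n_teachers)      # fall-to-spring signal
summer_change = rng.normal(0, 1, n_teachers)           # summer change, unrelated to the teacher here
spring_to_spring = within_year_growth + summer_change  # what a 12-month measure actually captures

def quintile(scores):
    # Assign each teacher to a performance quintile (0 = bottom, 4 = top) by rank.
    cutoffs = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(cutoffs, scores)

moved = np.mean(quintile(within_year_growth) != quintile(spring_to_spring))
print(f"Teachers changing quintile when summer is included: {moved:.0%}")
```

Under these assumptions, summer accounts for roughly half of the spring-to-spring variance, and a large share of teachers land in a different quintile than they would under a within-year measure, which is the same mechanism the study describes.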

This is certainly something to consider in terms of trade-offs, specifically whether we really want to "double down" on the number of tests we already require our public school students to take (given the time that testing and test preparation already take away from students' learning activities), and whether we also want to "double down" on the increased costs of doing so. I should also note here, though, that using pre/post-tests within the same year is not as simple as it may seem either. See a forthcoming post about the potential artificial deflation/inflation of pre/post scores to manufacture artificial levels of growth.

To read the full study, click here.

*I should note that I am an Associate Editor for this journal, and I served as editor for this particular publication, seeing it through the full peer-review process.

Citation: Palardy, G. J., & Peng, L. (2015). The effects of including summer on value-added assessments of teachers and schools. Education Policy Analysis Archives, 23(92). doi:10.14507/epaa.v23.1997. Retrieved from http://epaa.asu.edu/ojs/article/view/1997

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video, but following up on another post about the potential impact videos like these can have, I thought now is an appropriate time to share it. "We can be the change," and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District, and more specifically with four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools following the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Economists Declare Victory for VAMs

On the popular economics site fivethirtyeight.com, authors use "hard numbers" to tell compelling stories, and this time the compelling story told is about value-added models and all of the wonders they are working, thanks to the "hard numbers" derived via model output, to reform the way "we" evaluate and hold teachers accountable for their effects.

In an article titled “The Science Of Grading Teachers Gets High Marks,” this site’s “quantitative editor” (?!?) – Andrew Flowers – writes about how “the science” behind using “hard numbers” to evaluate teachers’ effects is, fortunately for America and thanks to the efforts of (many/most) econometricians, gaining much-needed momentum.

Not to really anyone's surprise, the featured economics study of this post is…wait for it…the Chetty et al. study that has been the focus of much controversy and many prior posts on this blog (see for example here, here, here, and here). This is the study cited in President Obama's 2012 State of the Union address when he said that, "We know a good teacher can increase the lifetime income of a classroom by over $250,000," and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.'s study as providing evidence that "a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom."

These are the "hard numbers" that have since been duly critiqued by scholars from California to New York (see, for example, here, here, here, and here), but that's not mentioned in this post. What is mentioned, however, is the notable work of economist Jesse Rothstein, whose work I have also cited in prior posts (see, for example, here, here, here, and here), as he has countered Chetty et al.'s claims, not to mention added critical research to the topic of VAM-based bias.

What is also mentioned, not to really anyone's surprise again, is that Thomas Kane – a colleague of Chetty's at Harvard who has also been the source of prior VAMboozled! posts (see, for example, here, here, and here), and who also replicated Chetty's results as notably cited/used during the Vergara v. California case last summer – endorses Chetty's work throughout this same article. The article's author "reached out" to Kane "to get more perspective," although I, for one, question how random this implied casual reach really was… Recall a recent post about our "(Unfortunate) List of VAMboozlers?" Two of our five total honorees include Chetty and Kane – the same two "hard number" economists prominently featured in this piece.

Nonetheless, this article's "quantitative editor" (?!?) Flowers sides with them (i.e., Chetty and Kane), and ultimately declares victory for VAMs, writing that VAMs "accurately isolate a teacher's impact on students"…"[t]he implication[s] being, school administrators can legitimately use value-added scores to hire, fire and otherwise evaluate teacher performance."

This “cutting-edge science,” as per a quote taken from Chetty’s co-author Friedman (Brown University), captures it all: “It’s almost like we’re doing real, hard science…Well, almost. But by the standards of empirical social science — with all its limitations in experimental design, imperfect data, and the hard-to-capture behavior of individuals — it’s still impressive….[F]or what has been called the “credibility revolution” in empirical economics, it’s a win.”

New “Causal” Evidence in Support of VAMs

No surprise, really, but Thomas Kane, an economics professor from Harvard University who also directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again. His newest article was recently published on the website of the Brookings Institution, titled "Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?" Not surprisingly, as a VAM advocate, in this piece he continues to advance a series of false claims about the wonderful potential of VAMs, in this case as also yielding causal estimates whereby teachers can be seen as directly causing the growth measured by VAMs (see prior posts about Kane's very public perspectives here and here).

The best part of this article is where Kane (potentially) seriously considers whether "the short list of control variables captured in educational data systems—prior achievement, student demographics, English language learner [ELL] status, eligibility for federally subsidized meals or programs for gifted and special education students—include the relevant factors by which students are sorted to teachers and schools" and, hence, whether they work to control for bias, as these are some of the factors that do (as per the research evidence) distort value-added scores.

The potential surrounding the exploration of this argument, however, quickly turns south when Kane thereafter pontificates, using pure logic, that "it is possible that school data systems contain the very data that teachers or principals are using to assign students to teachers." In other words, he painfully attempts to assert his non-research-based argument that as long as principals and teachers use the aforementioned variables to sort students into classrooms, then controlling for said variables should indeed control for the biasing effects caused by the non-random assignment of students into classrooms (and of teachers to classrooms, although he does not address that component either).

In addition, he asserts that, "[o]f course, there are many other unmeasured factors [or variables] influencing student achievement—such as student motivation or parental engagement [that cannot or cannot easily be observed]. But as long as those factors are also invisible [emphasis added] to those making teacher and program assignment decisions, our [i.e., VAM statisticians'] inability to control for them" more or less makes not controlling for these other variables inconsequential. In other words, in this article Kane asserts that as long as the "other things" principals and teachers use to non-randomly place students into classrooms are "invisible" to the principals and teachers making student placement decisions, these "other things" should not have to be statistically controlled, or factored out. We should otherwise be good to go given the aforementioned variables that are already observable and available.

In a study I wrote with one of my current doctoral students on this very topic, recently published in the esteemed, peer-reviewed American Educational Research Journal (see the full study here), we set out to better determine whether and how the controls used by value-added researchers to eliminate bias might be sufficient given what actually occurs in practice when students are placed into classrooms.

We found that both teachers and parents play a prodigious role in the student placement process, in almost nine out of ten schools (i.e., 90% of the time). Teachers and parents (although parents are also not mentioned in Kane's article) provide both appreciated and sometimes unwelcome insights regarding what they perceive to be the best learning environments for their students or children, respectively. Their added insights typically revolve around, in the following order: students' in-school behaviors, attitudes, and disciplinary records; students' learning styles and how those styles match teachers' teaching styles; students' personalities and how those personalities match teachers' personalities; students' interactions with their peers and prior teachers; general teacher types (e.g., teachers who manage their classrooms in perceptibly better ways); and whether students previously had siblings in potential teachers' classrooms.

These "other things" are not typically, if ever, controlled for in current VAMs, nor will they likely ever be. In addition, these factors serve as legitimate reasons for class changes during the school year, although whether this, too, is or could be captured in VAMs is highly tentative at best. Other factors, namely prior academic achievement, special education needs, giftedness, and gender, also influence placement decisions. These are variables for which most current VAMs account or control, presumably effectively.

Kane, like other VAM statisticians, tends to (and in many ways has to, if he is to continue with his VAM work despite "the issues") (over)simplify the serious complexities that arise when random assignment of students to classrooms (and teachers to classrooms) is neither feasible nor realistic, or is outright opposed (as was also clearly evidenced in the aforementioned study by 98% of educators; see again here).

The random assignment of students to classrooms (and teachers to classrooms) very rarely happens. Rather, many observable and unobservable variables are used to make such classroom placement decisions, and these variables go well beyond whether students are eligible for free-and-reduced lunches or are English-language learners.

If only the real world surrounding our schools, and in particular the measurement and evaluation of our schools and the teachers within them, were as simple and straightforward as Kane and others continue to assume and argue, although much of the time without evidence other than his own or that of his colleagues at Harvard (i.e., 8/17, or 47%, of the articles cited in Kane's piece). See also a recent post about this here. In this case, much published research evidence exists to clearly counter this logic and the many related claims herein (see also the other research not cited in this piece but cited in the study highlighted above and linked to again here).

Same Model (EVAAS), Different State (Ohio), Same Problems (Bias and Data Errors)

Following up on my most recent post about "School-Level Bias in the PVAAS Model in Pennsylvania," bias also seems to be a problem in Ohio – a state that likewise uses "the best" and "most sophisticated" VAM (i.e., a version of the Education Value-Added Assessment System [EVAAS]; for more information click here) – as per an older (2013) article just sent to me following my prior post "Teachers' 'Value-Added' Ratings and [their] Relationship to Student Income Levels [being] Questioned."

The key finding? Ohio's "2011-12 value-added results show that districts, schools and teachers with large numbers of poor students tend to have lower value-added results than those that serve more-affluent ones." Such output continues to evidence how VAMs may not be "the great equalizer" after all. VAMs might not be the "true" measures they are assumed (and marketed/sold) to be.

Here are the state’s stats, as highlighted in this piece and taken directly from the Plain Dealer/StateImpact Ohio analysis:

  • Value-added scores were 2½ times higher on average for districts where the median family income is above $35,000 than for districts with income below that amount.
  • For low-poverty school districts, two-thirds had positive value-added scores — scores indicating students made more than a year’s worth of progress.
  • For high-poverty school districts, two-thirds had negative value-added scores — scores indicating that students made less than a year’s progress.
  • Almost 40 percent of low-poverty schools scored “Above” the state’s value-added target, compared with 20 percent of high-poverty schools.
  • At the same time, 25 percent of high-poverty schools scored “Below” state value-added targets while low-poverty schools were half as likely to score “Below.”

One issue likely causing this level of bias is that student background factors are not explicitly accounted for in the EVAAS model. Rather, the EVAAS uses students' prior test scores to control for these factors: "By including all of a student's testing history, each student serves as his or her own control," as per the analytics company – SAS – that now sells and runs the model calculations. As per a document available on the SAS website: "To the extent that SES/DEM [socioeconomic/demographic] influences persist over time, these influences are already represented in the student's data. This negates the need for SES/DEM adjustment."

Within the same document, SAS authors argue that adjusting value-added for the race or poverty level of a student adds the dangerous assumption that those students will not perform well, that it would create separate standards for measuring students who have the same history of scores on previous tests, and that adjusting for these issues can also hide whether students in poor areas are receiving worse instruction. "We recommend that these adjustments not be made," the brief reads. "Not only are they largely unnecessary, but they may be harmful."
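For readers who want to see what this debate looks like in model form, here is a minimal sketch with simulated data. It is not the EVAAS, just an illustration of the two specifications at issue: controlling only for prior scores (the "own control" argument) versus also adjusting for a poverty indicator such as free/reduced-lunch (FRL) eligibility. The setup assumes, purely for illustration, that poverty is related to growth and not only to prior achievement.

```python
# A minimal sketch (simulated data, not the EVAAS) contrasting a prior-score-only
# model with one that also adjusts for a poverty (FRL) indicator.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
frl = rng.binomial(1, 0.5, n)                    # 1 = eligible for free/reduced lunch
prior_score = rng.normal(0, 1, n) - 0.5 * frl    # prior achievement correlates with poverty
current_score = prior_score + 0.3 - 0.2 * frl + rng.normal(0, 0.5, n)  # poverty also dampens growth

def ols(y, X):
    # Ordinary least squares with an intercept; returns the coefficient vector.
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b1 = ols(current_score, prior_score)                          # Model 1: prior score only
b2 = ols(current_score, np.column_stack([prior_score, frl]))  # Model 2: prior score + FRL

print("Model 1 [intercept, prior]:     ", np.round(b1, 2))
print("Model 2 [intercept, prior, FRL]:", np.round(b2, 2))

# Under Model 1, the part of growth tied to poverty stays in the residual,
# which is exactly the part that gets attributed to teachers and schools.
resid1 = current_score - (b1[0] + b1[1] * prior_score)
print("Mean Model-1 residual, FRL vs. non-FRL:",
      round(resid1[frl == 1].mean(), 2), "vs.", round(resid1[frl == 0].mean(), 2))
```

In this toy setup, the prior-score-only residuals are systematically lower for FRL students; aggregated to the classroom or school level, that looks like lower "value added" in higher-poverty settings.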

See also another article, also from Ohio, about how a value-added "Glitch Cause[d the] State to Pull Back Teacher "Value Added" Student Growth Scores." From the intro: "An error by contractor SAS Institute Inc. forced the state to withdraw some key teacher performance measurements that it had posted online for teachers to review. The state's decision to take down the value added scores for teachers across the state on Wednesday has some educators questioning the future reliability of the scores and other state data…" (click here to read more).

University of Missouri’s Koedel on VAM Bias and “Leveling the Playing Field”

In a forthcoming study (to be published in the peer-reviewed journal Educational Policy), University of Missouri economists – Cory Koedel, Mark Ehlert, Eric Parsons, and Michael Podgursky – evidence how poverty should be included as a factor when evaluating teachers using VAMs. This study, also covered in a recent article in Education Week and another news brief published by the University of Missouri, highlights how doing this "could lead to a more 'effective and equitable' teacher-evaluation system."

Koedel, as cited in the University of Missouri brief, argues that using what he and his colleagues term a “proportional” system would “level the playing field” for teachers working with students from different income backgrounds. They evaluated three types of growth models/VAMs to evidence their conclusions; however, how they did this will not be available until the actual article is published.

While Koedel and I tend to disagree on VAMs as potential tools for teacher evaluation – he believes that such statistical and methodological tweaks will bring us to a level of VAM perfection I believe is likely impossible – the important takeaway from this piece is that VAMs are (more) biased when such background and resource-sensitive factors are excluded from VAM-based calculations.

Accordingly, while Koedel writes elsewhere (in a related Policy Brief) that using what they term a "proportional" evaluation system would "entirely mitigate this concern," we have evidence elsewhere that even with the most sophisticated and comprehensive controls available, VAM-based bias can never be "entirely" mitigated (see, for example, here and here). This is likely due to the fact that while Koedel, his colleagues, and others argue (often with solid evidence) that controlling for "observed student and school characteristics" helps to mitigate bias, there are still unobserved student and school characteristics that cannot be observed, quantified, and hence controlled for or factored out, and this will likely forever prevent bias's "entire mitigation."
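Here is a minimal simulated sketch of that point (an invented setup, not any actual VAM and not the Koedel et al. models): even after "controlling" for an observed prior-achievement variable, a classroom-average estimate of a teacher's effect still absorbs any unobserved factor that helped drive classroom placement.

```python
# A minimal sketch (simulated, hypothetical setup) of residual bias from an
# unobserved sorting factor that an observed control cannot remove.
import numpy as np

rng = np.random.default_rng(2)
n_classes, class_size = 200, 25
true_teacher_effect = rng.normal(0, 1, n_classes)

# Students are sorted into classrooms partly on an unobserved factor
# (e.g., motivation or parental engagement) that also raises growth.
unobserved_sorting = rng.normal(0, 1, n_classes)
observed_prior = rng.normal(0, 1, (n_classes, class_size))
growth = (true_teacher_effect[:, None]
          + 0.5 * observed_prior                 # observed; can be controlled for
          + 0.5 * unobserved_sorting[:, None]    # unobserved; cannot be controlled for
          + rng.normal(0, 1, (n_classes, class_size)))

# "Control" for the observed prior (using its known coefficient, for simplicity),
# then average the residuals by classroom to estimate each teacher's effect.
estimated_effect = (growth - 0.5 * observed_prior).mean(axis=1)

print("Correlation with the true teacher effect:      ",
      round(np.corrcoef(estimated_effect, true_teacher_effect)[0, 1], 2))
print("Correlation with the unobserved sorting factor:",
      round(np.corrcoef(estimated_effect, unobserved_sorting)[0, 1], 2))
```

The estimates track true quality reasonably well on average, but they also correlate substantially with the unobserved sorting factor, which is exactly the residual bias that no set of observed controls can entirely remove.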

Here, the question is whether we need sheer perfection. The answer is no, but when these models are applied in practice, and particularly to teachers who teach homogeneous groups of students in relatively more extreme situations (e.g., disproportionate numbers of students from high-needs, non-English-proficient backgrounds), bias still matters. While, as a whole, we might be less concerned about bias when such factors are included in VAMs, there will (likely forever) be cases where bias impacts individual teachers. This is where folks will have to rely on human judgment to interpret the "objective" numbers based on VAMs.

A Really Old Oldie but Still Very Relevant Goodie

Thanks to a colleague in Florida, I recently read an article about the "Problems of Teacher Measurement" published in 1917 in the Journal of Educational Psychology by B. F. Pittenger. As mentioned, it's always interesting to take a historical approach (hint here to policymakers), and in this case a historical view via the perspective of an author who wrote about the same topic of interest to followers here almost 100 years ago. Let's see how things have changed, or more specifically, how things have not changed.

Then, "they" had the same goals we still have today, if this isn't telling in and of itself. From 1917: "The current efforts of experimentalists in the field of teacher measurement are only attempts to extract from the consciousness of principals and supervisors these personal criteria of good teaching, and to assemble and condense them into a single objective schedule, thoroughly tested, by means of which every judge of teaching may make his [sic] estimates more accurate, and more consistent with those of other judges. There is nothing new about the entire movement except the attempt to objectify what already exists subjectively, and to unify and render universal what is now the scattered property of many men."

Policymakers continue to invest entirely in an ideal known, even then, to be (possibly forever) false. From 1917: "There are those who believe that the movement toward teacher measurement is a monstrous innovation, which threatens the holiest traditions of the educational profession by putting a premium upon mechanical methodology…the phrase 'teacher-measurement,' itself, no doubt, is in part responsible for this misunderstanding, as it suggests a mathematical exactness of procedure which is clearly impossible in this field [emphasis added]. Teacher measurement will probably never become more than a carefully controlled process of estimating a teacher's individual efficiency…[This is]…sufficiently convenient and euphonious, and has now been used widely enough, to warrant its continuation."

As for the methods “issues” in 1917? “However sympathetic one may be with the general plan of devising schedules for teacher measurement, it is difficult to justify many of the methods by which these investigators have attacked the problem. For example, all of them appear to have set up as their goal the construction of a schedule which can be applied to any teacher, whether in the elementary or high school, and irrespective of the grade or subject in which his teaching is being done. “Teaching is teaching,” is the evident assumption, “and the same wherever found.” But it may reasonably be maintained that different qualities and methods, at least in part, are requisite…In so far as the criteria of good teaching are the same in these very diverse situations, it seems probable that the comparative importance to be attached to each must differ.” Sound familiar?

On the use of multiple measures, in line with the current measurement standards of the profession, from 1917: "students of teacher measurement appear to have erred in that they have attempted too much. The writer is strongly of the opinion that, for the present at least, efforts to construct a schedule for teacher measurement should be confined to a single one of the three planes which have been enumerated. Doubtless in the end we shall want to know as much as possible about all three; and to combine in our final estimate of a teacher's merit all attainable facts as to her equipment, her classroom procedure, and the results which she achieves. But at present we should do wisely to project our investigations upon one plane at a time, and to make each of these investigations as thorough as it is possible to make it. Later, when we know the nature and comparative value of the various items necessary to adequate judgment upon all planes, there will be time and opportunity for putting together the different schedules into one." One-hundred years later…

On prior teachers’ effects: “we must keep constantly in mind the fact that the results which pupils achieve in any given subject are by no means the product of the labor of any single teacher. Earlier teachers, other contemporary teachers, and the environment external to the school, are all factors in determining pupil efficiency in any school subject. It has been urged that the influence of these complicating factors can be materially reduced by measuring only the change in pupil achievement which takes place under the guidance of a single teacher. But it must be remembered that this process only reduces these complications; it does not and cannot eliminate them.”

Finally, the supreme goal to be sought, then and now? "The plane of results (in the sense of changes wrought in pupils) would be the ideal plane upon which to build an estimate of a teacher's individual efficiency, if it were possible (1) to measure all of the results of teaching, and (2) to pick out from the body of measured results any single teacher's contribution. At present these desiderata are impossible to attain [emphasis added]…[but]…let us not make the mistake of assuming that the results that we can measure are the only results of teaching, or even that they are the most important part."

Likewise, “no one teacher can be given the entire blame or credit for the doings of the pupils in her classroom…the ‘classroom process’ should be regarded as including the activities of both teachers and pupils.” In the end, “The promotion, discharge, or constructive criticism of teachers cannot be reduced to mathematical formulae. The proper function of a scorecard for teacher measurement is not to substitute such a formula for a supervisor’s personal judgment, but to aid him in discovering and assembling all the data upon which intelligent judgment should be based.”

Those Who Can, Teach—Those Who Don’t Understand Teaching, Make Policy

There is an old adage that many in education have undoubtedly (and unfortunately) encountered at some point in their education careers: that "those who can, do—those who can't, teach." For decades now, researchers, including Dr. Thomas Good, the author of an article published in 2014 in Teachers College Record, have evidenced that, contrary to this belief, teachers can do, and some teachers do better than others. Dr. Good points out in his historical analysis, What Do We Know About How Teachers Influence Student Performance on Standardized Tests: And Why Do We Know so Little About Other Student Outcomes? that teachers matter and that teachers vary in their effects on student achievement. He provides evidence from several decades of scholarly research on teacher effectiveness to show that teachers do make a difference in student achievement as measured by large-scale standardized achievement tests.

Dr. Good is also quick to acknowledge that, despite the reiterated notion that teachers matter and thus should possess (and continue to be trained in) effective teaching qualities (e.g., be well versed in their content knowledge, have strong classroom management skills, hold appropriate expectations, etc.), "fad-driven" education reform policies (e.g., teacher evaluation policies that are based in large part on student achievement growth or teachers' "value-added") have gone too far and have actually overvalued the effects of teachers. He explains that simplistic reform efforts, such as Race to the Top and VAM-based teacher evaluation systems, overvalue teacher effects in terms of the actual levels of impact teachers have on student achievement. It has been well documented that teacher effects can only explain, on average, between 10% and 20% of the variation in student achievement scores (this will be further explored in forthcoming posts). This means that 80-90% of the variation in student achievement scores is the result of other factors that are completely outside of teachers' control (e.g., poverty, parental support, etc.). Regardless, new teacher evaluation systems that rely so heavily on VAM estimates ignore this very important fact.

Dr. Good also emphasizes that teaching is a complex practice and that attempting to isolate the variables that make up effective teaching and focusing on each one separately only oversimplifies the complexities of the teaching-achievement relationship. For one thing, VAMs and other popular evaluative practices, such as classroom observations, inhibit one's ability to recognize the patterns of effective teaching and instead promote a simplistic view of just some of the individual variables that actually matter when thinking about effective teaching. He provides the following example:

Many observational systems call for the demonstration of high expectations. However…expectations can be too high or too low, and the issue is for teachers to demonstrate appropriate expectations. How then does a classroom observer know and code if expectations are appropriate both for individual students and for the class as a whole?

The bottom line is that, like teaching, understanding the impact of teachers on student learning is very complex. Teachers matter, and they should be trained, treated, and valued as such. However, they are one of many factors that impact student learning and achievement over time, and this very critical point cannot be (though currently is) ignored by policy. VAM-based policies, particularly, place an exorbitantly over-estimated value on the impact of teachers on student achievement scores.

Post contributed by Jessica Holloway-Libell

Houston, We Have A Problem: New Research Published about the EVAAS

New VAM research was recently published in the peer-reviewed Education Policy Analysis Archives journal, titled "Houston, We Have a Problem: Teachers Find No Value in the SAS Education Value-Added Assessment System (EVAAS®)." This article was written by a former doctoral student of mine, Clarin Collins, who is now a researcher at a large non-profit. I asked her to write a guest post for you all summarizing the full study (linked again here). Here is what she wrote.

As someone who works in the field of philanthropy, completed a doctoral program more than two years ago, and recently became a new mom, I might seem an unlikely person to have worked on an academic publication, let alone to be writing about it here as a guest blogger. My motivation is simple: the teachers. Teachers continue to be at the crux of national education reform efforts, as they are blamed for the nation's failing education system and students' academic struggles. National and state legislation has been created and implemented as a purported remedy to "fix" this problem by holding teachers accountable for student progress as measured by achievement gains.

While countless researchers have highlighted the faults of teacher accountability systems and growth models (unfortunately falling on the deaf ears of those mandating such policies), very rarely are teachers asked how such policies play out in practice, or for their opinions, so that their voices are represented in all of this. The goal of this research, therefore, was first, to see how one such teacher evaluation policy is playing out in practice, and second, to give voice to marginalized teachers, those who are at the forefront of these new policy initiatives. That being said, while I encourage you to check out the full article [linked again here], I highlight key findings in this summary, using the words of teachers as often as possible to permit them, really, to speak for themselves.

In this study I examined the SAS Education Value-Added Assessment System (EVAAS) in practice, as perceived and experienced by teachers in the Southwest School District (SSD). SSD [a pseudonym] is using EVAAS for high-stakes consequences more than any other district or state in the country. I used a mixed-method design including a large-scale electronic survey to investigate the model’s reliability and validity; to determine whether teachers used the EVAAS data in formative ways as intended; to gather teachers’ opinions on EVAAS’s claimed benefits and statements; and to understand the unintended consequences that might have also occurred as a result of EVAAS use in SSD.

Results revealed that the EVAAS model produced split and inconsistent results among teacher participants, regardless of subject or grade level taught, undermining its reliability. As one teacher stated, "In three years, I was above average, below average and average." Teachers indicated that it was the students and their varying background demographics that biased their EVAAS results, and that much of what was demonstrated via their scores was beyond teachers' control. "[EVAAS] depends a lot on home support, background knowledge, current family situation, lack of sleep, whether parents are at home, in jail, etc. [There are t]oo many outside factors – behavior issues, etc." that apparently are not controlled or accounted for in the model.

Teachers reported dissimilar EVAAS and principal observation scores, reducing the criterion-related validity of both measures of teacher quality. Some even reported that principals changed their observation scores to match their EVAAS scores; “One principal told me one year that even though I had high [state standardized test] scores and high Stanford [test] scores, the fact that my EVAAS scores showed no growth, it would look bad to the superintendent.” Added another teacher, “I had high appraisals but low EVAAS, so they had to change the appraisals to match lower EVAAS scores.”

The majority of teachers disagreed with SAS's marketing claims, such as that EVAAS reports are easy to use to improve instruction and that EVAAS will ensure growth opportunities for all students. Teachers called the reports "vague" and "unclear" and were "not quite sure how to interpret" and use the data to inform their instruction. As one teacher explained, she looked at her EVAAS report "only to guess as to what to do for the next group in my class."

Many unintended consequences associated with the high-stakes use of EVAAS emerged through teachers' responses, which revealed, among other things, that teachers felt heightened pressure and competition, which they believed reduced morale and collaboration and encouraged cheating or teaching to the test in an attempt to raise EVAAS scores. Teachers made comments such as, "To gain the highest EVAAS score, drill and kill and memorization yields the best results, as does teaching to the test," and "When I figured out how to teach to the test, the scores went up," as well as, "EVAAS leaves room for me to teach to the test and appear successful."

Teachers realized this emphasis on test scores was detrimental for students, as one teacher wrote, “As a result of the emphasis on EVAAS, we teach less math, not more. Too much drill and kill and too little understanding [for the] love of math… Raising a generation of children under these circumstances seems best suited for a country of followers, not inventors, not world leaders.”

Teachers also admitted they are not collaborating to share best practices as much anymore: “Since the inception of the EVAAS system, teachers have become even more distrustful of each other because they are afraid that someone might steal a good teaching method or materials from them and in turn earn more bonus money. This is not conducive to having a good work environment, and it actually is detrimental to students because teachers are not willing to share ideas or materials that might help increase student learning and achievement.”

While I realize this body of work could simply add to “the shelves” along with those findings of other researchers striving to deflate and demystify this latest round of education reform, if nothing else, I hope the teachers who participated in this study know I am determined to let their true experiences, perceptions of their experiences, and voices be heard.

—–

Again, to find out more information including the statistics in support of the above assertions and findings, please click here to read the full study.

VAMs and Observations: Consistencies, Correlations, and Contortions

A research article was recently published in the peer-reviewed Education Policy Analysis Archives journal titled "The Stability of Teacher Performance and Effectiveness: Implications for Policies Concerning Teacher Evaluation." This article was published by professors/doctoral researchers from Baylor University, Texas A&M University, and the University of South Carolina. Researchers set out to explore the stability (or fluctuations) observed in 132 teachers' effectiveness ratings across 23 schools in South Carolina over time, using both observational and VAM-based output.

Researchers found, not surprisingly given prior research in this area, that neither teacher performance using value-added nor effectiveness using observations was highly stable over time. This is most problematic when “sound decisions about continued employment, tenure, and promotion are predicated on some degree of stability over time. It is imprudent to make such decisions of the performance and effectiveness of a teacher as “Excellent” one year and “Mediocre” the next.”

They also observed "a generally weak positive relationship between the two sets of ratings" (i.e., value-added estimates and observations), a relationship that has also been documented across many studies in the literature.

On both of these findings, really, we are well past the point of saturation. That is, we could not have more agreement across research studies on the following: (1) teacher-level value-added scores are highly unstable over time, and (2) these value-added scores do not align well with observational scores, as they should if both measures were appropriately capturing the "teacher effectiveness" construct.
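A minimal simulation sketch (with assumed signal-to-noise ratios, not the study's data) shows why the two patterns travel together: when the noise around a stable "true quality" signal is large, year-to-year value-added correlations and VAM-observation correlations are both weak, even though underlying quality never changes.

```python
# A minimal sketch (assumed signal-to-noise ratios, not the study's data) of why
# noisy estimates produce both low stability and weak VAM-observation agreement.
import numpy as np

rng = np.random.default_rng(3)
n_teachers = 132
true_quality = rng.normal(0, 1, n_teachers)

def noisy(signal, noise_sd):
    # A noisy measurement of the same underlying "true quality."
    return signal + rng.normal(0, noise_sd, len(signal))

vam_year1 = noisy(true_quality, 2.0)   # value-added estimate, year 1 (noise dominates)
vam_year2 = noisy(true_quality, 2.0)   # value-added estimate, year 2
obs_score = noisy(true_quality, 1.0)   # observation rating, with its own noise

r = lambda a, b: np.corrcoef(a, b)[0, 1]
print(f"VAM year-to-year stability: r = {r(vam_year1, vam_year2):.2f}")
print(f"VAM vs. observation rating: r = {r(vam_year1, obs_score):.2f}")
```

Neither measure in this sketch is "wrong" about true quality in expectation; the weak correlations simply reflect how much noise each carries, which is why single-year scores are a shaky basis for employment decisions.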

Another interesting finding from this study, discussed before but not yet at the point of saturation in the research literature like the prior two, is that (3) teacher performance ratings based on observational data are also markedly different across schools. At issue here is "that performance ratings may be school-specific." Or, as per a recent post on this blog, there is indeed much "Arbitrariness Inherent in Teacher Observations." This is also highly problematic in that where a teacher happens to be housed may determine his/her ratings, based not necessarily (or entirely) on his/her actual "quality" or "effectiveness" but on his/her location, his/her rater, and his/her rater's scoring approach, given differential tendencies toward leniency or severity. This might leave us with more of a luck-of-the-draw approach than an actually "objective" measurement of true teacher quality, contrary to current and popular (especially policy) beliefs.

Accordingly, and also per the research, this is not getting much better in that, as per the authors of this article as well as many other scholars: (1) "the variance in value-added scores that can be attributed to teacher performance rarely exceeds 10 percent"; (2) "gross" measurement errors come, first, from the tests being used to calculate value-added; (3) the ranges of teacher effectiveness scores are restricted, also given these tests and their limited stretch, depth, and instructional insensitivity (this was at the heart of a recent post demonstrating that "the entire range from the 15th percentile of effectiveness to the 85th percentile of [teacher] effectiveness [using the EVAAS] cover[ed] approximately 3.5 raw score points [given the tests used to measure value-added]"); (4) context or student, family, school, and community background effects simply cannot be controlled for, or factored out; and (5) this is especially true at the classroom/teacher level when students are not randomly assigned to classrooms (and teachers are not randomly assigned to those classrooms)…although random assignment will likely never happen merely for the sake of improving the sophistication and rigor of the value-added model over students' "best interests."