Special Issue of “Educational Researcher” (Paper #9 of 9): Amidst the “Blooming Buzzing Confusion”

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the last of these nine articles (#9 of 9), which is actually a commentary titled “Value Added: A Case Study in the Mismatch Between Education Research and Policy.” This commentary is authored by Stephen Raudenbush – Professor of Sociology and Public Policy Studies at the University of Chicago.

Like with the last two commentaries reviewed here and here, Raudenbush writes of the “Special Issue” that, in this topical area, “[r]esearchers want their work to be used, so we flirt with the idea that value-added research tells us how to improve schooling…[Luckily, perhaps] this volume has some potential to subdue this flirtation” (p. 138).

Raudenbush positions the research covered in this “Special Issue,” as well as the research on teacher evaluation and education in general, as being conducted amidst the “blooming buzzing confusion” (p. 138) surrounding the messy world through which we negotiate life. This is why “specific studies don’t tell us what to do, even if they sometimes have large potential for informing expert judgment” (p. 138).

With that being said, “[t]he hard question is how to integrate the new research on teachers with other important strands of research [e.g., effective schools research] in order to inform rather than distort practical judgment” (p. 138). Echoing Susan Moore Johnson’s sentiments, reviewed as article #6 here, this is appropriately hard if we are to augment versus undermine “our capacity to mobilize the ‘social capital’ of the school to strengthen the human capital of the teacher” (p. 138).

On this note, and “[i]n sum, recent research on value added tells us that, by using data from student perceptions, classroom observations, and test score growth, we can obtain credible evidence [albeit weakly related evidence, referring to the Bill & Melinda Gates Foundation’s MET studies] of the relative effectiveness of a set of teachers who teach similar kids [emphasis added] under similar conditions [emphasis added]…[Although] if a district administrator uses data like that collected in MET, we can anticipate that an attempt to classify teachers for personnel decisions will be characterized by intolerably high error rates [emphasis added]. And because districts can collect very limited information, a reliance on district-level data collection systems will [also] likely generate…distorted behavior[s]…in which teachers attempt to ‘game’ the comparatively simple indicators,” or system (p. 138-139).

Accordingly, “[a]n effective school will likely be characterized by effective ‘distributed’ leadership, meaning that expert teachers share responsibility for classroom observation, feedback, and frequent formative assessments of student learning. Intensive professional development combined with classroom follow-up generates evidence about teacher learning and teacher improvement. Such local data collection efforts [also] have some potential to gain credibility among teachers, a virtue that seems too often absent” (p. 140).

This might be at least a significant part of the solution.

“If the school is potentially rich in information about teacher effectiveness and teacher improvement, it seems to follow that key personnel decisions should be located firmly at the school level…This sense of collective efficacy [accordingly] seems to be a key feature of…highly effective schools” (p. 140).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here; and see the Review of Article #8, Part I – on a more research-based assessment of VAMs’ potentials here and Part II on “a modest solution” provided to us by Linda Darling-Hammond here.

Article #9 Reference: Raudenbush, S. W. (2015). Value added: A case study in the mismatch between education research and policy. Educational Researcher, 44(2), 138-141. doi:10.3102/0013189X15575345


Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of these nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these things are actually true (or even near true, nor will they likely ever be true) in educational practice, however. Hence the errors we continue to observe, errors that continue to prevent VAMs from being used for their intended purposes, even with the sophisticated statistics meant to mitigate error and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across the classrooms of the teachers whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include the following: VAM output is grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output is “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticism is impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Relatedly, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusion is that value-added methodology has “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether VAMs are used for teacher evaluation or not. Given that it is rare in this area of research that we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (p. 135-136) within her larger article here.

In the end, Darling-Hammond writes that “[t]rying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Including Summers “Adds Considerable Measurement Error” to Value-Added Estimates

A new article titled “The Effect of Summer on Value-added Assessments of Teacher and School Performance” was recently released in the peer-reviewed journal Education Policy Analysis Archives. The article is authored by Gregory Palardy and Luyao Peng from the University of California, Riverside. 

Before we begin, though, here is some background so that you all understand the importance of the findings in this particular article.

In order to calculate teacher-level value-added, all states are currently using (at minimum) the large-scale standardized tests mandated by No Child Left Behind (NCLB) in 2002. These tests were mandated for use in the subject areas of mathematics and reading/language arts. However, because these tests are given only once per year, typically in the spring, statisticians calculate value-added by measuring actual versus predicted “growth” from spring to spring, over a 12-month span that includes the summer.

While many (including many policymakers) assume that value-added estimates are calculated from fall to spring, during time intervals in which students are under the same teachers’ supervision and instruction, this is not true. The reality is that the pre- to post-test occasions actually span 12-month periods, including the summers that cause the nettlesome summer effects often observed via VAM-based estimates. Different students learn different things over the summer, and this is strongly associated (and correlated) with students’ backgrounds and out-of-school opportunities (e.g., travel, summer camps, summer schools). Likewise, because summers are the time periods over which teachers and schools tend to have little control over what students do, this is also the time period during which research indicates that achievement gaps hold steady or widen. More specifically, research indicates that students from relatively lower socio-economic backgrounds tend to suffer more from learning decay than their wealthier peers, although they learn at similar rates during the school year.
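To make this concrete, below is a minimal, purely illustrative sketch (in Python, with made-up scores and a toy prediction rule; it is not any state’s or vendor’s actual model) of why a spring-to-spring “growth” calculation folds the summer into the estimate attributed to the current teacher.

import numpy as np

# Hypothetical scale scores for a handful of students (illustrative only).
prior_spring = np.array([480.0, 510.0, 450.0, 530.0])    # spring of the prior grade
current_spring = np.array([502.0, 528.0, 458.0, 551.0])  # spring of the current grade, ~12 months later

# Toy stand-in for a statistical model's predicted scores given prior achievement.
predicted = 0.95 * prior_spring + 35.0

# A teacher's "value-added" is typically summarized from residuals like these;
# note that any summer learning or decay between the two springs is folded into
# the growth credited to (or held against) the current teacher.
residuals = current_spring - predicted
print(round(residuals.mean(), 2))

A fall-to-spring design, by contrast, would difference two scores collected under the same teacher, which is precisely the comparison Palardy and Peng examine.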

What these 12-month testing intervals also include are prior teachers’ residual effects, in that students tested in the spring, for example, finish out the school year (e.g., two months or so) with their prior teachers before entering the classrooms of the teachers for whom value-added is to be calculated the following spring (although teachers’ residual effects were not the focus of this particular study).

Nonetheless, via the research, we have always known that these summer effects (and prior or adjacent teachers’ residual effects) are difficult if not impossible to statistically control. This in and of itself leads to much of the noise (fluctuations/lack of reliability, imprecision, and potential biases) we observe in the resulting value-added estimates. This is precisely what was the focus of this particular study.

In this study researchers examined “the effects of including the summer period on value-added assessments (VAA) of teacher and school performance at the [1st] grade [level],” as compared to using VAM-based estimates derived from a fall-to-spring test administration within the same grade and year (i.e., using data from a nationally representative sample, via the National Center for Education Statistics (NCES), of n = 5,034 children).

Researchers found that:

  • Approximately 40-62% of the variance in VAM-based estimates originates from the summer period, depending on the reading or math outcome;
  • When summer is omitted from VAM-based calculations using within-year pre/post-tests, approximately 51-61% of teachers change performance categories. What this means in simpler terms is that including summers in VAM-based estimates is indeed causing some of the errors and misclassification rates being observed across studies.
  • Statistical controls for student and classroom/school variables reduce summer effects considerably (e.g., via controlling for students’ prior achievement), yet 36-47% of teachers still fall into different quintiles when summers are included in the VAM-based estimates.
  • Findings also evidence that including summers within VAM-based calculations tends to bias VAM-based estimates against schools with higher relative concentrations of poverty, or rather higher relative concentrations of students who are eligible for the federal free-and-reduced lunch program.
  • Overall, results suggest that removing summer effects from VAM-based estimates may require biannual achievement assessments (i.e., fall and spring). If we want VAM-based estimates to be more accurate, we might have to double the number of tests we administer per year in each subject area for which teachers are to be held accountable using VAMs. However, “if twice-annual assessments are not conducted, controls for prior achievement seem to be the best method for minimizing summer effects.”

This is certainly something to consider in terms of trade-offs, specifically in terms of whether we really want to “double down” on the number of tests we already require our public school students to take (also given the time that testing and test preparation already take away from students’ learning activities), and whether we also want to “double down” on the increased costs of doing so. I should also note here, though, that using pre/post-tests within the same year is not as simple as it may seem either. See another post, forthcoming, about the potential artificial deflation/inflation of pre/post scores to manufacture artificial levels of growth.

To read the full study, click here.

*I should note that I am an Associate Editor for this journal, and I served as editor for this particular publication, seeing it through the full peer-review process.

Citation: Palardy, G. J., & Peng, L. (2015). The effects of including summer on value-added assessments of teachers and schools. Education Policy Analysis Archives, 23(92). doi:10.14507/epaa.v23.1997. Retrieved from http://epaa.asu.edu/ojs/article/view/1997

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video, and following up on another post about the potential impact these videos should really have, I thought now was an appropriate time to share it. “We can be the change,” and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District, more specifically four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools following the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Economists Declare Victory for VAMs

On a popular economics site, fivethirtyeight.com, authors use “hard numbers” to tell compelling stories, and this time the compelling story told is about value-added models and all of the wonders they are working, thanks to the “hard numbers” derived via model output, to reform the way “we” evaluate and hold teachers accountable for their effects.

In an article titled “The Science Of Grading Teachers Gets High Marks,” this site’s “quantitative editor” (?!?) – Andrew Flowers – writes about how “the science” behind using “hard numbers” to evaluate teachers’ effects is, fortunately for America and thanks to the efforts of (many/most) econometricians, gaining much-needed momentum.

Not to really anyone’s surprise, the featured economics study of this post is…wait for it…the Chetty et al. study at the focus of much controversy and many prior posts on this blog (see for example here, here, here, and here). This is the study cited in President Obama’s 2012 State of the Union address when he said that, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.”

These are the “hard numbers” that have since been duly critiqued by scholars from California to New York (see, for example, here, here, here, and here), but that’s not mentioned in this post. What is mentioned, however, is the notable work of economist Jesse Rothstein, whose work I have also cited in prior posts (see, for example, here, here, here, and here), as he has countered Chetty et al.’s claims, not to mention added critical research to the topic of VAM-based bias.

What is also mentioned, again not to really anyone’s surprise, is that Thomas Kane – a colleague of Chetty’s at Harvard who has also been the source of prior VAMboozled! posts (see, for example, here, here, and here), and who also replicated Chetty’s results as notably cited/used during the Vergara v. California case last summer – endorses Chetty’s work throughout this same article. The article’s author “reached out” to Kane “to get more perspective,” although I, for one, question how random this implied casual reach really was… Recall a recent post about our “(Unfortunate) List of VAMboozlers”? Two of our five total honorees include Chetty and Kane – the same two “hard number” economists prominently featured in this piece.

Nonetheless, this article’s “quantitative editor” (?!?) Flowers sides with them (i.e., Chetty and Kane), and ultimately declares victory for VAMs, writing that VAMs “accurately isolate a teacher’s impact on students”…”[t]he implication[s] being, school administrators can legitimately use value-added scores to hire, fire and otherwise evaluate teacher performance.”

This “cutting-edge science,” as per a quote taken from Chetty’s co-author Friedman (Brown University), captures it all: “It’s almost like we’re doing real, hard science…Well, almost. But by the standards of empirical social science — with all its limitations in experimental design, imperfect data, and the hard-to-capture behavior of individuals — it’s still impressive….[F]or what has been called the “credibility revolution” in empirical economics, it’s a win.”

New “Causal” Evidence in Support of VAMs

No surprise, really, but Thomas Kane, an economics professor from Harvard University who also directed the $45-million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again. His newest article was recently published on the website of the Brookings Institution, titled “Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?” Not surprisingly, as a VAM advocate, in this piece he continues to advance a series of false claims about the wonderful potentials of VAMs, in this case as also yielding causal estimates whereby teachers can be seen as directly causing the growth measured by VAMs (see prior posts about Kane’s very public perspectives here and here).

The best part of this article is where Kane (potentially) seriously considers whether “the short list of control variables captured in educational data systems—prior achievement, student demographics, English language learner [ELL] status, eligibility for federally subsidized meals or programs for gifted and special education students—include the relevant factors by which students are sorted to teachers and schools” and, hence, whether these variables work to control for bias, as these are some of the factors that do (as per the research evidence) distort value-added scores.

The potential surrounding the exploration of this argument, however, quickly turns south, as thereafter Kane pontificates, using pure logic, that “it is possible that school data systems contain the very data that teachers or principals are using to assign students to teachers.” In other words, he painfully attempts to assert his non-research-based argument that, as long as principals and teachers use the aforementioned variables to sort students into classrooms, controlling for said variables should indeed control for the biasing effects caused by the non-random sorting of students into classrooms (and teachers into classrooms, although he does not address that component either).

In addition, he asserts that “[o]f course, there are many other unmeasured factors [or variables] influencing student achievement—such as student motivation or parental engagement [that cannot, or cannot easily, be observed]. But as long as those factors are also invisible [emphasis added] to those making teacher and program assignment decisions, our [i.e., VAM statisticians’] inability to control for them” more or less makes not controlling for these other variables inconsequential. In other words, in this article Kane asserts that as long as the “other things” principals and teachers use to non-randomly place students into classrooms are “invisible” to the principals and teachers making student placement decisions, these “other things” should not have to be statistically controlled for, or factored out. We should otherwise be good to go, given the aforementioned variables that are already observable and available.

In a study I wrote with one of my current doctoral students on this very topic, recently published in the esteemed, peer-reviewed American Educational Research Journal (see the full study here), we set out to better determine how, and whether, the controls used by value-added researchers to eliminate bias might be sufficient given what actually occurs in practice when students are placed into classrooms.

We found that both teachers and parents play a prodigious role in the student placement process in almost nine out of ten schools (i.e., 90% of the time). Teachers and parents (although parents are also not mentioned in Kane’s article) provide both appreciated and sometimes unwelcome insights regarding what they perceive to be the best learning environments for their students or children, respectively. Their added insights typically revolve around, in the following order: students’ in-school behaviors, attitudes, and disciplinary records; students’ learning styles and students’ learning styles as matched with teachers’ teaching styles; students’ personalities and students’ personalities as matched with teachers’ personalities; students’ interactions with their peers and prior teachers; general teacher types (e.g., teachers who manage their classrooms in perceptibly better ways); and whether students previously had siblings in potential teachers’ classrooms.

These “other things” are not typically, if ever, controlled for in current VAMs, nor will they likely ever be. In addition, these factors serve as legitimate reasons for class changes during the school year, although whether this, too, is or could be captured in VAMs is highly tentative at best. Otherwise, prior academic achievement, special education needs, giftedness, and gender also influence placement decisions. These are variables for which most current VAMs account or control, presumably effectively.

Kane, like other VAM statisticians, tends to (and in many ways has to, if he is to continue with his VAM work despite “the issues”) (over)simplify the serious complexities that come about when random assignment of students to classrooms (and teachers to classrooms) is neither feasible nor realistic, and is often outright opposed (as was also clearly evidenced in the above article by 98% of educators; see again here).

The random assignment of students to classrooms (and teachers to classrooms) very rarely happens. Rather, many observable and unobservable variables are used to make such classroom placement decisions, and these variables go well beyond whether students are eligible for free-and-reduced lunches or are English-language learners.

If only the real world surrounding our schools, and in particular the measurement and evaluation of our schools and the teachers within them, were as simple and straightforward as Kane and others continue to assume and argue, although much of the time without evidence other than his own or that of his colleagues at Harvard (i.e., 8/17, or 47%, of the articles cited in Kane’s piece). See also a recent post about this here. In this case, much published research evidence exists to clearly counter this logic and the many related claims herein (see also the other research not cited in his piece but cited in the study highlighted above and linked to again here).

Same Model (EVAAS), Different State (Ohio), Same Problems (Bias and Data Errors)

Following up on my most recent post about “School-Level Bias in the PVAAS Model in Pennsylvania,” bias also seems to be a problem in Ohio – a state that also uses “the best” and “most sophisticated” VAM (i.e., a version of the Education Value-Added Assessment System [EVAAS]; for more information click here) – as per an older (2013) article just sent to me following my prior post, “Teachers’ ‘Value-Added’ Ratings and [their] Relationship to Student Income Levels [being] Questioned.”

The key finding? Ohio’s “2011-12 value-added results show that districts, schools and teachers with large numbers of poor students tend to have lower value-added results than those that serve more-affluent ones.” Such output continues to evidence how using VAMs may not be “the great equalizer” after all. VAMs might not be the “true” measures they are assumed (and marketed/sold) to be.

Here are the state’s stats, as highlighted in this piece and taken directly from the Plain Dealer/StateImpact Ohio analysis:

  • Value-added scores were 2½ times higher on average for districts where the median family income is above $35,000 than for districts with income below that amount.
  • For low-poverty school districts, two-thirds had positive value-added scores — scores indicating students made more than a year’s worth of progress.
  • For high-poverty school districts, two-thirds had negative value-added scores — scores indicating that students made less than a year’s progress.
  • Almost 40 percent of low-poverty schools scored “Above” the state’s value-added target, compared with 20 percent of high-poverty schools.
  • At the same time, 25 percent of high-poverty schools scored “Below” state value-added targets while low-poverty schools were half as likely to score “Below.”

One issue likely causing this level of bias is that student background factors are not directly accounted for in the EVAAS model. Rather, the EVAAS uses students’ prior test scores to control for these factors: “By including all of a student’s testing history, each student serves as his or her own control,” as per the analytics company – SAS – that now sells and runs the model’s calculations. As per a document available on the SAS website: “To the extent that SES/DEM [socioeconomic/demographic] influences persist over time, these influences are already represented in the student’s data. This negates the need for SES/DEM adjustment.”

Within the same document, SAS authors write that adjusting value-added for the race or poverty level of a student adds the dangerous assumption that those students will not perform well. It would also create separate standards for measuring students who have the same history of scores on previous tests. And they add that adjusting for these issues can also hide whether students in poor areas are receiving worse instruction. “We recommend that these adjustments not be made,” the brief reads. “Not only are they largely unnecessary, but they may be harmful.”
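For readers who want to see the two positions side by side, here is a rough, hypothetical sketch (simulated data in Python; this is not SAS’s proprietary EVAAS specification) contrasting a model that, per SAS’s argument, relies only on prior test scores with one that, per the critics, also adjusts for a poverty indicator.

import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated students: prior test scores and a hypothetical poverty indicator.
prior = rng.normal(500, 50, n)
poverty = rng.binomial(1, 0.4, n)

# Simulated current scores in which poverty depresses growth beyond what prior
# scores alone capture (an assumption made purely to illustrate the debate).
current = 100 + 0.8 * prior - 8 * poverty + rng.normal(0, 20, n)

# "Each student serves as his or her own control": prior scores only.
X_prior_only = np.column_stack([np.ones(n), prior])
coef_prior_only, *_ = np.linalg.lstsq(X_prior_only, current, rcond=None)

# Critics' alternative: add the poverty indicator explicitly.
X_with_poverty = np.column_stack([np.ones(n), prior, poverty])
coef_with_poverty, *_ = np.linalg.lstsq(X_with_poverty, current, rcond=None)

print(coef_prior_only)    # intercept, prior-score slope
print(coef_with_poverty)  # intercept, prior-score slope, poverty adjustment

Whether the first model’s omission matters in practice is exactly what the Plain Dealer/StateImpact Ohio patterns above call into question.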

See also another article, also from Ohio, about how a value-added “Glitch Cause[d the] State to Pull Back Teacher ‘Value Added’ Student Growth Scores.” From the intro: “An error by contractor SAS Institute Inc. forced the state to withdraw some key teacher performance measurements that it had posted online for teachers to review. The state’s decision to take down the value added scores for teachers across the state on Wednesday has some educators questioning the future reliability of the scores and other state data…” (click here to read more).

University of Missouri’s Koedel on VAM Bias and “Leveling the Playing Field”

In a forthcoming study (to be published in the peer-reviewed journal Educational Policy), University of Missouri economists – Cory Koedel, Mark Ehlert, Eric Parsons, and Michael Podgursky – evidence how poverty should be included as a factor when evaluating teachers using VAMs. This study, also covered in a recent article in Education Week and another news brief published by the University of Missouri, highlights how doing this “could lead to a more ‘effective and equitable’ teacher-evaluation system.”

Koedel, as cited in the University of Missouri brief, argues that using what he and his colleagues term a “proportional” system would “level the playing field” for teachers working with students from different income backgrounds. They evaluated three types of growth models/VAMs to evidence their conclusions; however, how they did this will not be available until the actual article is published.

While Koedel and I tend to disagree on VAMs as potential tools for teacher evaluation – he believes that such statistical and methodological tweaks will bring us to a level of VAM perfection that I believe is likely impossible – the important takeaway from this piece is that VAMs are (more) biased when such background and resource-sensitive factors are excluded from VAM-based calculations.

Accordingly, while Koedel writes elsewhere (in a related Policy Brief) that using what they term a “proportional” evaluation system would “entirely mitigate this concern,” we have evidence elsewhere that, even with the most sophisticated and comprehensive controls available, VAM-based bias can never be “entirely” mitigated (see, for example, here and here). This is likely due to the fact that while Koedel, his colleagues, and others argue (often with solid evidence) that controlling for “observed student and school characteristics” helps to mitigate bias, there are still student and school characteristics that cannot be observed, quantified, and hence controlled for or factored out, and this will likely forever prevent bias’s “entire mitigation.”

Here, the question is whether we need sheer perfection. The answer is no, but when these models are applied in practice, particularly to teachers who teach homogeneous groups of students in relatively more extreme situations (e.g., disproportionate numbers of students from high-needs, non-English-proficient backgrounds), bias still matters. While, as a whole, we might be less concerned about bias when such factors are included in VAMs, there likely forever will be cases where bias impacts individual teachers. This is where folks will have to rely on human judgment to interpret the “objective” numbers based on VAMs.

A Really Old Oldie but Still Very Relevant Goodie

Thanks to a colleague in Florida, I recently read an article about the “Problems of Teacher Measurement” published in 1917 in the Journal of Educational Psychology by B. F. Pittenger. As mentioned, it’s always interesting to take a historical approach (hint here to policymakers), and in this case a historical view via the perspective of an author who wrote, almost 100 years ago, on the same topic of interest to followers here. Let’s see how things have changed, or more specifically, how things have not changed.

Then, “they” had the same goals we still have today, if this isn’t telling in and of itself. From 1917: “The current efforts of experimentalists in the field of teacher measurement are only attempts to extract from the consciousness of principals and supervisors these personal criteria of good teaching, and to assemble and condense them into a single objective schedule, thoroughly tested, by means of which every judge of teaching may make his [sic] estimates more accurate, and more consistent with those of other judges. There is nothing new about the entire movement except the attempt to objectify what already exists subjectively, and to unify and render universal what is now the scattered property of many men.”

Policymakers continue to invest entirely in an ideal that was known even then to be (possibly forever) false. From 1917: “There are those who believe that the movement toward teacher measurement is a monstrous innovation, which threatens the holiest traditions of the educational profession by putting a premium upon mechanical methodology…the phrase ‘teacher-measurement,’ itself, no doubt, is in part responsible for this misunderstanding, as it suggests a mathematical exactness of procedure which is clearly impossible in this field [emphasis added]. Teacher measurement will probably never become more than a carefully controlled process of estimating a teacher’s individual efficiency…[This is]…sufficiently convenient and euphonious, and has now been used widely enough, to warrant its continuation.”

As for the methods “issues” in 1917? “However sympathetic one may be with the general plan of devising schedules for teacher measurement, it is difficult to justify many of the methods by which these investigators have attacked the problem. For example, all of them appear to have set up as their goal the construction of a schedule which can be applied to any teacher, whether in the elementary or high school, and irrespective of the grade or subject in which his teaching is being done. “Teaching is teaching,” is the evident assumption, “and the same wherever found.” But it may reasonably be maintained that different qualities and methods, at least in part, are requisite…In so far as the criteria of good teaching are the same in these very diverse situations, it seems probable that the comparative importance to be attached to each must differ.” Sound familiar?

On the use of multiple measures, as currently in line with the current measurement standards of the profession, from 1917: “students of teacher measurement appear to have erred in that they have attempted too much. The writer is strongly of the opinion that, for the present at least, efforts to construct a schedule for teacher measurement should be confined to a single one of the three planes which have been enumerated. Doubtless in the end we shall want to know as much as possible about all three; and to combine in our final estimate of a teacher’s merit all attainable facts as to her equipment, her classroom procedure, and the results which she achieves. But at present we should do wisely to project our investigations upon one plane at a time, and to make each of these investigations as thorough as it is possible to make it. Later, when we know the nature and comparative value of the various items necessary to adequate judgment upon all planes, there will be time and opportunity for putting together the different schedules into one.” One-hundred years later…

On prior teachers’ effects: “we must keep constantly in mind the fact that the results which pupils achieve in any given subject are by no means the product of the labor of any single teacher. Earlier teachers, other contemporary teachers, and the environment external to the school, are all factors in determining pupil efficiency in any school subject. It has been urged that the influence of these complicating factors can be materially reduced by measuring only the change in pupil achievement which takes place under the guidance of a single teacher. But it must be remembered that this process only reduces these complications; it does not and cannot eliminate them.”

Finally, the supreme ideal to be sought, then and now? “The plane of results (in the sense of changes wrought in pupils) would be the ideal plane upon which to build an estimate of a teacher’s individual efficiency, if it were possible (1) to measure all of the results of teaching, and (2) to pick out from the body of measured results any single teacher’s contribution. At present these desiderata are impossible to attain [emphasis added]…[but]…let us not make the mistake of assuming that the results that we can measure are the only results of teaching, or even that they are the most important part.”

Likewise, “no one teacher can be given the entire blame or credit for the doings of the pupils in her classroom…the ‘classroom process’ should be regarded as including the activities of both teachers and pupils.” In the end, “The promotion, discharge, or constructive criticism of teachers cannot be reduced to mathematical formulae. The proper function of a scorecard for teacher measurement is not to substitute such a formula for a supervisor’s personal judgment, but to aid him in discovering and assembling all the data upon which intelligent judgment should be based.”

Those Who Can, Teach—Those Who Don’t Understand Teaching, Make Policy

There is an old adage that many in education have undoubtedly (and unfortunately) encountered at some point in their education careers: that “those who can, do—those who can’t, teach.” For decades now, researchers, including Dr. Thomas Good who is the author of an article published in 2014 in Teachers College Record, have evidenced that, contrary to this belief, teachers can do, and some teachers do better than others. Dr. Good points out in his historical analysis, What Do We Know About How Teachers Influence Student Performance on Standardized Tests: And Why Do We Know so Little About Other Student Outcomes? that teachers matter and that teachers vary in their effects on student achievement. He provides evidence from several decades of scholarly research on teacher effectiveness to show that teachers do make a difference in student achievement as measured by large-scale standardized achievement tests.

Dr. Good is also quick to acknowledge that, despite the reiterated notion that teachers matter and thus should possess (and continue to be trained in) effective teaching qualities (e.g., be well versed in their content knowledge, have strong classroom management skills, hold appropriate expectations, etc.), “fad-driven” education reform policies (e.g., teacher evaluation policies that are based in large part on student achievement growth or teachers’ “value-added”) have gone too far and have actually overvalued the effects of teachers. He explains that simplistic reform efforts, such as Race to the Top and VAM-based teacher evaluation systems, overvalue teacher effects in terms of the actual levels of impact teachers have on student achievement. It has been well documented that teacher effects can explain, on average, only 10-20% of the variation in student achievement scores (this will be further explored in forthcoming posts). This means that 80-90% of the variation in student achievement scores is the result of other factors that are completely outside of the teachers’ control (e.g., poverty, parental support, etc.). Regardless, new teacher evaluation systems that rely so heavily on VAM estimates ignore this very important fact.
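As a back-of-the-envelope illustration of what “explaining 10-20% of the variation” means (hypothetical numbers in Python, not Dr. Good’s data), consider the share of total score variance attributable to teachers when teacher effects are modest relative to everything else:

import numpy as np

rng = np.random.default_rng(42)
n_students = 10_000

# Assumed standard deviations: teacher effects ~0.35, everything else (poverty,
# prior achievement, home factors, measurement noise) ~1.0 on the same scale.
teacher_effect = rng.normal(0.0, 0.35, n_students)
other_factors = rng.normal(0.0, 1.0, n_students)
scores = teacher_effect + other_factors

share = teacher_effect.var() / scores.var()
print(f"Share of score variance attributable to teachers: {share:.0%}")  # roughly 10%

Under these assumed magnitudes, teachers account for roughly a tenth of the variation in scores, with everything else accounting for the rest; this is the arithmetic behind the 80-90% figure above.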

Dr. Good also emphasizes that teaching is a complex practice and that attempting to isolate the variables that make up effective teaching and focusing on each one separately only oversimplifies the complexities of the teaching-achievement relationship. For one thing, VAMs and other popular evaluative practices, such as classroom observations, inhibit one’s ability to recognize the patterns of effective teaching and instead promote a simplistic view of just some of the individual variables that actually matter when thinking about effective teaching. He provides the following example:

Many observational systems call for the demonstration of high expectations. However…expectations can be too high or too low, and the issue is for teachers to demonstrate appropriate expectations. How then does a classroom observer know and code if expectations are appropriate both for individual students and for the class as a whole?

The bottom line is that, like teaching, understanding the impact of teachers on student learning is very complex. Teachers matter, and they should be trained, treated, and valued as such. However, they are one of many factors that impact student learning and achievement over time, and this very critical point cannot be (though currently is) ignored by policy. VAM-based policies, particularly, place an exorbitantly over-estimated value on the impact of teachers on student achievement scores.

Post contributed by Jessica Holloway-Libell