Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of these nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond, Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these things is actually true (or near true, nor will any likely ever be true) in educational practice, however. Hence the errors we continue to observe, errors that continue to prevent VAMs from being used for their intended purposes, even with the sophisticated statistics meant to mitigate error and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs, as highlighted by Darling-Hammond per each of the three categories above, pertain to (1) the tests used to measure value-added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level, given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessing Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia, “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include the following: VAM output is grossly unstable, given the swings and variations observed in teacher classifications across time, and VAM output is “notoriously imprecise” (p. 133), given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying him or her as “below average,” “average,” or close to “above average”). In addition, practitioners, including administrators and teachers, are skeptical of these systems, and their (appropriate) skepticism is impacting the extent to which they use and value their value-added data; notably, they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Relatedly, given “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (to be highlighted in another post forthcoming here), offering a very sound set of solutions, whether VAMs are used for teacher evaluations or not. Given that it is rare in this area of research that we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (pp. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Including Summers “Adds Considerable Measurement Error” to Value-Added Estimates

A new article titled “The Effect of Summer on Value-added Assessments of Teacher and School Performance” was recently released in the peer-reviewed journal Education Policy Analysis Archives. The article is authored by Gregory Palardy and Luyao Peng from the University of California, Riverside. 

Before we begin, though, here is some background so that you all understand the importance of the findings in this particular article.

In order to calculate teacher-level value-added, all states are currently using (at minimum) the large-scale standardized tests mandated by No Child Left Behind (NCLB) in 2002. These tests were mandated for use in the subject areas of mathematics and reading/language arts. However, because these tests are given only once per year, typically in the spring, statisticians calculate value-added by measuring actual versus predicted “growth” (aka “value-added”) from spring to spring, over a 12-month span that includes the summer.

While many (including many policymakers) assume that value-added estimations are calculated from fall to spring, during time intervals in which students are under the same teachers’ supervision and instruction, this is not true. The reality is that the pre- to post-test occasions actually span 12-month periods, including the summers that often cause the nettlesome summer effects observed via VAM-based estimates. Different students learn different things over the summer, and this is strongly associated (and correlated) with students’ backgrounds and their out-of-school opportunities (e.g., travel, summer camps, summer schools). Likewise, because summers are the time periods over which teachers and schools tend to have little control over what students do, they are also the time periods during which research indicates that achievement gaps are maintained or widened. More specifically, research indicates that students from relatively lower socioeconomic backgrounds tend to suffer more from learning decay than their wealthier peers, although they learn at similar rates during the school year.

What these 12-month testing intervals also include are prior teachers’ residual effects, whereby students tested in the spring, for example, finish out every school year (e.g., two months or so) with their prior teachers before entering the classrooms of the teachers for whom value-added is to be calculated the following spring, although teachers’ residual effects were not of focus in this particular study.

Nonetheless, via the research, we have always known that these summer effects (and prior or adjacent teachers’ residual effects) are difficult if not impossible to statistically control for. This in and of itself leads to much of the noise (fluctuations/lack of reliability, imprecision, and potential bias) we observe in the resulting value-added estimates. This is precisely what was of focus in this particular study.

In this study, researchers examined “the effects of including the summer period on value-added assessments (VAA) of teacher and school performance at the [1st] grade [level],” as compared to using VAM-based estimates derived from a fall-to-spring test administration within the same grade and year (i.e., using data derived from a nationally representative sample via the National Center for Education Statistics (NCES), with n = 5,034 children).

Researchers found that:

  • Approximately 40-62% of the variance in VAM-based estimates originates from the summer period, depending on the reading or math outcome;
  • When summer is omitted from VAM-based calculations using within-year pre/post-tests, approximately 51-61% of teachers change performance categories. What this means in simpler terms is that including summers in VAM-based estimates is indeed causing some of the errors and misclassification rates being observed across studies;
  • Statistical controls for student and classroom/school variables (e.g., students’ prior achievement) reduce summer effects considerably, yet 36-47% of teachers still fall into different quintiles when summers are included in the VAM-based estimates;
  • Findings also evidence that including summers within VAM-based calculations tends to bias VAM-based estimates against schools with higher relative concentrations of poverty, or rather higher relative concentrations of students who are eligible for the federal free-and-reduced lunch program;
  • Overall, results suggest that removing summer effects from VAM-based estimates may require biannual achievement assessments (i.e., fall and spring). That is, if we want VAM-based estimates to be more accurate, we might have to double the number of tests we administer per year in each subject area for which teachers are to be held accountable using VAMs. However, “if twice-annual assessments are not conducted, controls for prior achievement seem to be the best method for minimizing summer effects.”
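To make the summer-effect mechanism concrete, here is a small illustrative simulation (my own sketch, not Palardy and Peng’s model; the class sizes, effect sizes, and noise levels are all invented for illustration). It compares teacher quintile rankings computed from within-year (fall-to-spring) gains against rankings computed from 12-month (spring-to-spring) gains that fold in background-driven summer learning:

```python
import numpy as np

rng = np.random.default_rng(42)
n_teachers, class_size = 200, 25

# "True" teacher effects: what an ideal fall-to-spring VAM would recover
teacher_effect = rng.normal(0.0, 1.0, n_teachers)

# Student background (e.g., SES) clusters by classroom and, in this toy
# model, drives summer learning but not school-year learning
class_background = rng.normal(0.0, 1.0, n_teachers)
background = class_background[:, None] + rng.normal(0.0, 1.0, (n_teachers, class_size))

school_year_gain = teacher_effect[:, None] + rng.normal(0.0, 2.0, (n_teachers, class_size))
summer_gain = background + rng.normal(0.0, 1.0, (n_teachers, class_size))

within_year = school_year_gain.mean(axis=1)                       # fall-to-spring
spring_to_spring = (school_year_gain + summer_gain).mean(axis=1)  # 12-month span

def quintile(scores):
    """Assign each teacher to a performance fifth (0 = bottom, 4 = top)."""
    cuts = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(cuts, scores)

moved = np.mean(quintile(within_year) != quintile(spring_to_spring))
print(f"Teachers changing quintile when summer is included: {moved:.0%}")
```

With these made-up parameters, a large share of teachers land in a different quintile depending solely on whether the summer is included; the point is the mechanism, not the exact figure.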

This is certainly something to consider in terms of trade-offs, specifically in terms of whether we really want to “double down” on the number of tests we already require our public school students to take (also given the time that testing and test preparation already take away from students’ learning activities), and whether we also want to “double down” on the increased costs of doing so. I should also note here, though, that using pre/post-tests within the same year is (also) not as simple as it may seem (either). See another post, forthcoming, about the potential artificial deflation/inflation of pre/post scores to manufacture artificial levels of growth.

To read the full study, click here.

*I should note that I am an Associate Editor for this journal, and I served as editor for this particular publication, seeing it through the full peer-reviewed process.

Citation: Palardy, G. J., & Peng, L. (2015). The effects of including summer on value-added assessments of teachers and schools. Education Policy Analysis Archives, 23(92). doi:10.14507/epaa.v23.1997. Retrieved from http://epaa.asu.edu/ojs/article/view/1997

An Oldie But Still Very Relevant Goodie: The First Documented Value-Added “Smack-Down”

When I first began researching VAMs, and more specifically the Education Value-Added Assessment System (EVAAS) developed by William Sanders in the state of Tennessee (the state we now know as VAMs’ “ground zero”), I came across a fabulous online debate (from before blogs like this and other social networking sources were really prevalent) all about this same system, then called the TVAAS (the Tennessee Value-Added Assessment System).

The discussants questioning the TVAAS? Renowned scholars including: Gene Glass — best known for his statistical work and for his development of “meta-analysis;” Michael Scriven — best known for his scholarly work in evaluation; Harvey Goldstein — best known for his knowledge of statistical modeling and their use on tests; Sherman Dorn — best known for his work on educational reforms and how we problematize our schools; Gregory Camilli — best known for his studies on the effects of educational programs and policies; and a few others with whom I am less familiar. The discussants defending their TVAAS? William Sanders — the TVAAS/EVAAS developer; Sandra P. Horn — Sanders’s colleague; and an unknown discussant representing the “TVAAS (Tennessee Value-Added Assessment System).”

While this was what could now easily be called the first value-added “smack-down” (I am honored to say I was part of the second, and the first so titled), it served as a foundational source for the first study I ever published on the topic of VAMs (a study published in 2008 in the highly esteemed Educational Researcher and titled Methodological Concerns About the Education Value-Added Assessment System [EVAAS]). I was just reminded, today, about this online debate (or debate made available online) that, although it took place in 1995, is still one of, if not the, best in-depth debates surrounding, and thorough analyses of, VAMs that has ever been done.

While it is long, it is certainly worth a read and review, as readers, too, should see in this debate so many issues still relevant and currently problematic, now 20 years later. You can see just how far we’ve really come since this VAM nonsense got started, as the issues debated here are still, for the most part, the issues that continue to go unresolved…

One of my favorite highlights I’ve pasted here, if I have not yet enticed you enough…it comes from a post written by Gene Glass on Friday, October 28, 1994. Gene writes:

“Dear Professor Sanders:

I like statistics; I made the better part of my living off of it for many years. But could we set it aside for just a minute while you answer a question or two for me?

I gather that [the TVAAS] is a means of measuring what it is that a particular teacher contributes to the basic skills learning of a class of students. Let me stipulate for the moment that for your sake all of the purely statistical considerations attendant to partialling out previous contributions of other teachers’ “additions of value” to this year’s teachers’ addition of value have been resolved perfectly–above reproach; no statistician who understands mixed models, covariance adjustment, and the like would question them. Let’s just pretend that this is true.

Now imagine–and it should be no strain on one’s imagination to do so–that we have Teacher A and Teacher B and each has had the pretest (September) achievement status of their students impeccably measured. But A has a class with average IQ of 115 and B has a class of average IQ 90. Let’s suppose that A and B teach to the very limit of their abilities all year long and that in the eyes of God, they are equally talented teachers. We would surely expect that A’s students will achieve much more on the posttest (June) than B’s. Anyone would assume so; indeed, we would be shocked if it were not so.

Question: Does your system of measuring and adjusting and assigning numbers to teachers take these circumstances into account so that A and B emerge with equal “added value” ratings?”

Sandra P. Horn’s answer? “Yes.”

Horn had to say yes in response to Gene’s question, however, or the method would have even then been exposed as entirely invalid. Students with higher levels of intelligence undoubtedly learn more than students with lower levels of intelligence, and if two classes differ greatly in IQ, one will make greater progress during the year. This growth can have nothing to do with the teacher, and this can be (and still is) observed, despite the sophisticated statistical controls meant to account for students’ prior achievements, and in this case their aptitudes.

American Statistical Association (ASA) Position Statement on VAMs

In my most recent post, about the Top 14 research-based articles about VAMs, I included a great research-based statement that was released just last week by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

It is short, accessible, easy to understand, and hard to dispute, so I wanted to be sure nobody missed it as this is certainly a must read for all of you following this blog, not to mention everybody else dealing/working with VAMs and their related educational policies. Likewise, this represents the current, research-based evidence and thinking of probably 90% of the educational researchers and econometricians (still) conducting research in this area.

Again, the ASA is the best statistical organization in the U.S. and likely one of, if not the, best statistical associations in the world. Some of the most important parts of their statement, taken directly from their full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that, included in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).
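As a hypothetical illustration of why reporting those confidence intervals and standard errors matters (all numbers below are invented; this is not from the ASA statement):

```python
# Hypothetical VAM estimate for one teacher, in student-level SD units,
# together with the standard error that reports often omit (made-up numbers)
estimate, se = 0.05, 0.10

z = 1.96  # multiplier for an approximate 95% confidence interval
low, high = estimate - z * se, estimate + z * se
print(f"Estimate: {estimate:+.2f}, 95% CI: [{low:.3f}, {high:.3f}]")

# The interval spans zero: the data cannot distinguish this seemingly
# "above average" point estimate from an average or below-average teacher
```

A point estimate reported without its interval would label this teacher “above average;” the interval shows the data support no such classification.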

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

Do give the position statement a read and use it as needed!

More Value-Added Problems in DC’s Public Schools

Over the past month I have posted two entries about what’s going on in DC’s public schools with the value-added-based teacher evaluation system developed and advanced by the former School Chancellor Michelle Rhee and carried on by the current School Chancellor Kaya Henderson.

The first post was about a bogus “research” study in which National Bureau of Economic Research (NBER)/University of Virginia and Stanford researchers advanced false claims that the system was indeed working and effective, despite the fact that (among other problems) 83% of the teachers in the study did not have student test scores available to measure their “value-added.” The second post was about a DC teacher’s experiences being evaluated under this system (as part of the aforementioned 83%) using almost solely his administrator’s and master educator’s observational scores. Demonstrated with data in this post was how error prone this part of the DC system also evidenced itself to be.

Adding to the value-added issues in DC, it was just revealed by DC public school officials (the day before winter break), and then in two Washington Post articles (see the first article here and the second here), that 44 DC public school teachers also received incorrect evaluation scores for the last academic year (2012-2013) because of technical errors in the ways the scores were calculated. One of the 44 teachers was fired as a result, although (s)he is now looking to be reinstated and compensated for the salary lost.

While “[s]chool officials described the errors as the most significant since the system launched a controversial initiative in 2009 to evaluate teachers in part on student test scores,” they also downplayed the situation as impacting only 44 teachers.

VAM formulas are certainly “subject to error,” and they are subject to error always, across the board, for teachers in general as well as for the 470 DC public school teachers with value-added scores based on student test scores. Put more accurately, just over 10% (n=470) of all DC teachers (n=4,000) were evaluated using their students’ test scores; this is an even smaller share than the 17% implied by the 83% figure mentioned above. And for about 10% of these teachers (n=44), calculation errors were found.

This is not a “minor glitch,” as written into a recent Huffington Post article covering the same story, which positions the teachers’ unions as almost irrational for “slamming the school system for the mistake and raising broader questions about the system.” It is a major glitch caused both by inappropriate “weightings” of teachers’ administrators’ and master educators’ observational scores, as well as by “a small technical error” that directly impacted the teachers’ value-added calculations. It is a major glitch with major implications about which others, including not just those from the unions but many (e.g., 90%) from the research community, are concerned. It is a major glitch that does warrant additional concern about this AND all of the other statistical and other errors not mentioned but prevalent in all value-added scores (e.g., the errors always found in large-scale standardized tests, particularly given their non-equivalent scales; the errors caused by missing data; the errors caused by small class sizes; the errors caused by summer learning loss/gains; the errors caused by other teachers’ simultaneous and carryover effects; the errors caused by parental and peer effects [see also this recent post about these]; etc.).

So what type of consequence is in store for those perpetuating such nonsense? This includes, particularly here, those charged with calculating and releasing value-added “estimates” (“estimates” because these are not, and should never be interpreted as, hard data), but also the reporters who report on these issues without understanding them or reading the research about them. I, for one, would like to see them held accountable for the “value” they, too, are to “add” to our thinking about these social issues, rather than detracting and distracting readers from the very real, research-based issues at hand.

Stanford Professor, Dr. Edward Haertel, on VAMs

In a recent speech and subsequent paper written by Dr. Edward Haertel – National Academy of Education member and Professor at Stanford University – he writes about VAMs and the extent to which VAMs, being based on student test scores, can be used to make reliable and valid inferences about teachers and teacher effectiveness. This is a must-read, particularly for those out there who are new to the research literature in this area. Dr. Haertel is certainly an expert here, actually one of the best we have, and in this piece he captures the major issues well.

Some of the issues highlighted include concerns about the tests used to model value-added and how their scales (falsely assumed to be as objective and equal as units on a measuring stick) complicate and distort VAM-based estimates. He also discusses the general issues with the tests almost if not always used when modeling value-added (i.e., the state-level tests mandated as per No Child Left Behind in 2002).

He discusses why VAM estimates are least trustworthy, and most volatile and error prone, when used to compare teachers who work in very different schools with very different student populations – students who do not attend schools in randomized patterns and who are rarely if ever randomly assigned to classrooms. The issues with bias, as highlighted by Dr. Haertel and also in a recent VAMboozled! post with a link to a new research article here, are probably the most major VAM-related problems/issues going. As captured in his words, “VAMs will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in” (Haertel, 2013, p. 12-13).

He reiterates issues with reliability, or a lack thereof. As per one research study he cites, researchers found that “a minimum of 10% of the teachers in the bottom fifth of the distribution one year were in the top fifth the next year, and conversely. Typically, only about a third of 1 year’s top performers were in the top category again the following year, and likewise, only about a third of 1 year’s lowest performers were in the lowest category again the following year. These findings are typical [emphasis added]…[While a] few studies have found reliabilities around .5 or a little higher…this still says that only half the variation in these value-added estimates is signal, and the remainder is noise [and/or error, which makes VAM estimates entirely invalid about half of the time]” (Haertel, 2013, p. 18).
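To see concretely what a reliability of roughly .5 implies for year-to-year classifications, consider this small simulation (my own illustration, not Haertel’s; it assumes each year’s VAM score is a stable “true” effect plus independent noise, with half the variance being signal):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reliability = 100_000, 0.5  # teachers; share of score variance that is signal

# Two years of scores for the same teachers, whose "true" effectiveness
# does not change between years; only the noise differs
true_effect = rng.normal(0.0, np.sqrt(reliability), n)
year1 = true_effect + rng.normal(0.0, np.sqrt(1 - reliability), n)
year2 = true_effect + rng.normal(0.0, np.sqrt(1 - reliability), n)

def fifth(scores):
    """Performance fifth for each teacher (0 = bottom, 4 = top)."""
    return np.searchsorted(np.quantile(scores, [0.2, 0.4, 0.6, 0.8]), scores)

q1, q2 = fifth(year1), fifth(year2)
stay_top = np.mean(q2[q1 == 4] == 4)       # top fifth again the next year
bottom_to_top = np.mean(q2[q1 == 0] == 4)  # bottom fifth one year, top the next
print(f"Top fifth repeating: {stay_top:.0%}; bottom-to-top jumps: {bottom_to_top:.0%}")
```

Under these assumptions, well under half of one year’s top performers repeat the next year, and a nontrivial share of teachers jump from the bottom fifth to the top fifth, even though nothing about their “true” effectiveness changed, consistent with the instability Haertel describes.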

Dr. Haertel also discusses the correlations between VAM estimates and teacher observational scores, between VAM estimates and student evaluation scores, and between VAM estimates taken from the same teachers at the same time but using different tests, all of which are also abysmally (and unfortunately) low, similar to those mentioned above.

His bottom line? “VAMs are complicated, but not nearly so complicated as the reality they are intended to represent” (Haertel, 2013, p. 12). They just do not measure well what so many believe they measure so very well.

Again, to find out more reasons and more in-depth explanations as to why, click here for the full speech and subsequent paper.

Random Assignment and Bias in VAM Estimates – Article Published in AERJ

“Nonsensical,” “impractical,” “unprofessional,” “unethical,” and even “detrimental” – these are just a few of the adjectives used by elementary school principals in Arizona to describe the use of randomized practices to assign students to teachers and classrooms. When asked whether principals might consider random assignment practices, one principal noted, “I prefer careful, thoughtful, and intentional placement [of students] to random. I’ve never considered using random placement. These are children, human beings.” Yet the value-added models (VAMs) being used in many states to measure the “value-added” by individual teachers to their students’ learning assume that any school is as likely as any other school, and any teacher is as likely as any other teacher, to be assigned any student who is as likely as any other student to have similar backgrounds, abilities, aptitudes, dispositions, motivations, and the like.

One of my doctoral students – Noelle Paufler – and I recently reported in the highly esteemed American Educational Research Journal the results of a survey administered to all public and charter elementary principals in Arizona (see the online publication of “The Random Assignment of Students into Elementary Classrooms: Implications for Value-Added Analyses and Interpretations”). We examined the various methods used to assign students to classrooms in their schools, the student background characteristics considered in nonrandom placements, and the roles teachers and parents play in the placement process. In terms of bias, the fundamental question here was whether the use of nonrandom student assignment practices might lead to biased VAM estimates, if the nonrandom student sorting practices went beyond that which is typically controlled for in most VAM models (e.g., academic achievement and prior demonstrated abilities, special education status, ELL status, gender, giftedness, etc.).

We found that, overwhelmingly, principals use various placement procedures through which administrators and teachers consider a variety of student background characteristics and student interactions to make placement decisions. In other words, student placements are decidedly nonrandom (contrary to the methodological assumptions VAM consumers often accept).

Principals frequently cited interactions between students, students’ peers, and previous teachers as justification for future placements. Principals stated that students were often matched with teachers based on their individual learning styles and respective teaching strengths. Parents also wielded considerable control over the placement process, with a majority of principals stating that parents made placement requests, most of which were honored.

In addition, in general, principal respondents were greatly opposed to using random student assignment methods in lieu of placement practices based on human judgment—practices they collectively agreed were in the best interest of students. Random assignment, even if necessary to produce unbiased VAM-based estimates, was deemed highly “nonsensical,” “impractical,” “unprofessional,” “unethical,” and even “detrimental” to student learning and teacher success.

The nonrandom assignment of students to classrooms has significant implications for the use of value-added models to estimate teacher effects on student learning using large-scale standardized test scores. Given the widespread use of nonrandom placement methods documented in this study, value-added researchers, policymakers, and educators should carefully consider the implications of placement decisions, as well as the validity of the inferences made using value-added estimates of teacher effectiveness.

Florida Newspaper Following “Suit”

On Thursday (November 14th), I wrote about what is happening in the Los Angeles courts regarding the LA Times’ controversial open public records request soliciting the student test scores of all Los Angeles Unified School District (LAUSD) teachers (see LA Times up to the Same Old Antics). Today I write about the same thing happening in the courts of Florida as per The Florida Times-Union’s suit against the state (see Florida teacher value-added data is public record, appeal court rules).

As the headline reads, “Florida’s controversial value-added teacher data are [to be] public record” and released to The Times-Union for the same reasons and purposes they are being released, again, to the LA Times. These (in many ways right) reasons include: (1) the data should not be exempt from public inspection, (2) these records, because they are public, should be open for public consumption, (3) parents as members of the public and direct consumers of these data should be granted access, and the list goes on.

What is in too many ways wrong, however, is that while the court wrote that the data are “only one part of a larger spectrum of criteria by which a public school teacher is evaluated,” the data will be consumed by a highly assuming, highly unaware, and highly uninformed public as the only source of data that count.

Worse, because the data are derived via complex analyses of “objective” test scores that yield (purportedly) hard numbers from which teacher-level value-added can be calculated using even more complex statistics, the public will trust the statisticians behind the scenes, because they are smart, and they will consume the numbers as true and valid because smart people constructed them.

The key here, though, is that they are, in fact, constructed. In every step of the process of constructing the value-added data, there are major issues that arise and major decisions that are made. Both cause major imperfections in the data that will in the end come out clean (i.e., as numbers), even though they are still super dirty on the inside.

As written by The Times-Union reporter, “Value-added is the difference between the learning growth a student makes in a teacher’s class and the statistically predicted learning growth the student should have earned based on previous performance.” It is just not as simple and as straightforward as that.
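To make the reporter’s definition concrete, here is a deliberately oversimplified sketch in Python. Everything here is my own illustrative assumption (the function name, the toy scores, and the single-predictor regression); real VAMs use far more elaborate statistical models, which is exactly part of the problem:

```python
from collections import defaultdict

def simple_value_added(prior, post, teacher):
    """Crude one-predictor 'VAM': regress post-test scores on prior-test
    scores across all students, then average each teacher's residuals
    (actual growth minus statistically predicted growth)."""
    n = len(prior)
    mx = sum(prior) / n
    my = sum(post) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(prior, post))
    var = sum((x - mx) ** 2 for x in prior)
    b = cov / var      # slope of the prediction line
    a = my - b * mx    # intercept
    resid = defaultdict(list)
    for x, y, t in zip(prior, post, teacher):
        resid[t].append(y - (a + b * x))  # actual minus predicted score
    return {t: sum(r) / len(r) for t, r in resid.items()}

# made-up scores for four students split across two teachers
prior = [50, 60, 50, 60]
post = [58, 68, 52, 62]
teacher = ["A", "A", "B", "B"]
print(simple_value_added(prior, post, teacher))  # → {'A': 3.0, 'B': -3.0}
```

Even in this toy version, the “effect” attributed to each teacher is entirely a function of how the prediction line is constructed, which foreshadows the list of problems below.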

Here are just some reasons why:

(1) Large-scale standardized achievement tests offer very narrow measures of what students have achieved, although they are assumed to measure the breadth and depth of an entire year of student learning.

(2) The 40 to 50 total tested items do not represent the hundreds of more complex items we value more.

(3) Test developers use statistical tools to remove the items that too many students answered correctly, making much of what we value not at all valued on the tests.

(4) Calculating value-added, or growth “upwards” over time, requires that the scales used to measure growth from one year to the next be composed of equal-interval units, but this is (to my knowledge) never the case.

(5) Otherwise, the standard statistical response is to norm the test scores before using them, but this then means that for every winner there must be a loser, or in this case that as some teachers get better, other teachers must get worse, which does not reflect reality.

(6) Value-added estimates then do not indicate whether a teacher is good or highly effective, as is also often assumed, but rather whether a teacher is purportedly better or worse than other teachers to whom they are compared, who teach entirely different sets of students who are not randomly assigned to their classrooms, but for whom they are to demonstrate “similar” levels of growth.

(7) Teachers assigned more “difficult-to-teach” students are then held accountable for demonstrating similar growth regardless of the numbers of, for example, English Language Learners (ELLs) or special education students in their classes (although statisticians argue they can “control for this,” despite recent research evidence to the contrary).

(8) This becomes more of a biasing issue when statisticians cannot effectively control for what happens (or what does not happen) over the summers, whereby in every current state-level value-added system these tests are administered annually, from the spring of year X to the spring of year Y, always capturing the summer months in the post-test scores and biasing scores dramatically, in one way or another, based on that which is entirely out of the control of the school.

(9) This forces the value-added statisticians to statistically assume that summer learning growth and decay matter the same for all students, despite the research-based fact that different types of students lose or gain variable levels of knowledge over the summer months.

(10) And this goes into (11), (12), and so forth.
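The norming point in reason (5) is worth seeing in miniature. In the (made-up) example below, every teacher’s raw class score improves by 10 points from one year to the next, yet the normed (z-scored) results are identical: normed scores always sum to zero, so someone must come out “below average” no matter how much everyone actually improves.

```python
def z_scores(xs):
    """Standardize scores: subtract the mean, divide by the standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [(x - m) / sd for x in xs]

year1 = z_scores([60, 70, 80])  # three hypothetical class averages
year2 = z_scores([70, 80, 90])  # every class gained 10 raw points

print(year1 == year2)          # the normed rankings are unchanged
print(round(sum(year2), 10))   # normed scores sum to zero by construction
```

This is the zero-sum property: norming guarantees relative winners and losers regardless of absolute improvement.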

But the numbers do not reflect any of this, now do they?

How Might a Test Measure Teachers’ Causal Effects?

A reader wrote a very good question (see the VAMmunition post) that I feel is worth “sharing out,” with a short but (hopefully) informative answer that will help others better understand some of “the issues.”

(S)he wrote: “[W]hat exactly would a test look like if it were, indeed, ‘designed to estimate teachers’ causal effects’? Moreover, how different would it be from today’s tests?”

Here is (most of) my response: While large-scale standardized tests are typically limited in both the number and types of items included, among other things, one could use a similar test, with more items and more “instructionally sensitive” items, to better capture a teacher’s causal effects. This would be done with the pre- and post-tests occurring in the same year, while students are being instructed by the same (albeit not only…) teacher. However, this does not happen in any value-added system at this point, as these tests are given once per year (typically spring to spring). Hence, student growth scores include prior and other teachers’ effects, as well as the differential learning gains/losses that also occur over the summers, during which students have little to no interaction with formal education systems or their teachers. This “biases” these measures of growth, big time!

The other necessary condition for doing this would be random assignment. If students were randomly assigned to classrooms (and teachers were randomly assigned to classrooms), this would help to make sure that indeed all students are similar at the outset, before what we might term the “treatment” (i.e., how effectively a teacher teaches for X amount of time). However, again, this rarely if ever happens in practice, as administrators and teachers (rightfully) see random assignment practices as great for experimental research purposes but bad for students and their learning! Regardless, some statisticians suggest that their sophisticated controls can “account” for nonrandom assignment practices, yet again the evidence suggests that no matter how sophisticated the controls are, they simply do not work here either.
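A toy simulation can show why nonrandom assignment matters so much. In the sketch below (all numbers and the naive mean-gain “estimator” are my own hypothetical assumptions, not anyone’s actual model), two teachers with identical true effects receive different estimates solely because one is assigned students who would have grown more regardless of who taught them:

```python
import random

random.seed(1)

def classroom_mean_gain(teacher_effect, unobserved_boost, n_students=500):
    """Mean test-score gain for one class: the teacher's true effect, plus an
    unobserved student-level growth factor, plus random noise per student."""
    gains = [teacher_effect + unobserved_boost + random.gauss(0, 5)
             for _ in range(n_students)]
    return sum(gains) / len(gains)

# Two equally effective teachers (true effect: 2 points of growth each).
va_random = classroom_mean_gain(2.0, 0.0)  # class assigned at random
va_sorted = classroom_mean_gain(2.0, 1.5)  # class of students who grow more anyway

print(f"random assignment: {va_random:.2f}, sorted assignment: {va_sorted:.2f}")
```

Unless the model can observe and remove that `unobserved_boost` (and by definition it cannot observe it), the second teacher looks better despite teaching no better, which is the bias problem in a nutshell.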

See, for example, the Hermann et al. (2013), the Newton et al. (2010), and the Rothstein (2009, 2010) citations here, in this blog, under the “VAM Readings” link. I also have an article coming out on this topic this month, co-authored with one of my doctoral students, in a highly esteemed peer-reviewed journal. Here is the reference if you want to keep an eye out for it. These references should (hopefully) explain all of this with greater depth and clarity: Paufler, N. A., & Amrein-Beardsley, A. (2013, October). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal. doi: 10.3102/0002831213508299