This one (should) explain itself (i.e., not reliable, and accordingly, not valid either). When high-stakes consequences are to be attached, this matters even more.
Following a post last month titled “New Empirical Evidence: Students’ ‘Persistent Economic Disadvantage’ More Likely to Bias Value-Added Estimates,” Matt Barnum — senior staff writer for The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown and covering America’s education system “in crisis” (see, also, a prior post about The 74 here) — followed up with a tweet via Twitter. He wrote: “Yes, though [bias caused by economic disadvantage] likely applies with equal or even more force to other measures of teacher quality, like observations.” I replied via Twitter that I disagreed with this statement in that I was unaware of research in support of his assertion, and Barnum sent me two articles to review thereafter.
I attempted to review both of these articles herein, although I quickly figured out that I had actually read and reviewed the first (2014) piece on this blog (see original post here, see also a 2014 Brookings Institution article summarizing this piece here). In short, in this study researchers found that the observational components of states’ contemporary teacher systems certainly “add” more “value” than their value-added counterparts, especially for (in)formative purposes. However, researchers found that observational bias also exists, as akin to value-added bias, whereas teachers who are non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students entering their classrooms with lower levels of prior achievement. Researchers concluded that because districts “do not have processes in place to address the possible biases in observational scores,” statistical adjustments might be made to offset said bias, as might external observers/raters be brought in to yield more “objective” observational assessments of teachers.
For the second study, and this post here, I gave this one a more thorough read (you can find the full study, pre-publication here). Using data from the Measures of Effective
Teaching (MET) Project, in which random assignment was used (or more accurately attempted), researchers also explored the extent to which students enrolled in teachers’ classrooms influence classroom observational scores.
They found, primarily, that:
- “[T]he context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ performance” as measured by teacher observations. More specifically, “ELA [English/language arts] teachers were more than twice as likely to be rated in the top performance quintile if [nearly randomly] assigned the highest achieving students compared with teachers assigned the low-est achieving students,” and “math teachers were more than 6 times as likely.” In addition, “approximately half of the teachers—48% in ELA and 54% in math—were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores”
- “[T]he intentional sorting of teachers to students has a significant influence on measured performance” as well. More specifically, results further suggest that “higher performing students [are, at least sometimes] endogenously sorted into the classes of higher performing teachers…Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher
characteristics would reveal more effective teacher performance, as measured by classroom observation scores, than may actually be true.”
So, the non-random assignment of teachers biases both the value-added and observational components written into America’s now “more objective” teacher evaluation systems, as (formerly) required of all states that were to comply with federal initiatives and incentives (e.g., Race to the Top). In addition, when those responsible for assigning students to classrooms (sub)consciously favor teachers with high, prior observational scores, this exacerbates the issues. This is especially important when observational (and value-added) data are to be used for high-stakes accountability systems in that the data yielded via really both measurement systems may be less likely to reflect “true” teaching effectiveness due to “true” bias. “Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality,” and vice-versa.
Citation Study #1: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf
Citation Study #2: Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293-317. doi:10.3102/0162373715616249 Retrieved from http://static.politico.com/58/5f/f14b2b144846a9b3365b8f2b0897/study-of-classroom-observations-of-teachers.pdf
The National Bureau of Economic Research (NBER) recently released a circulated but not-yet internally or externally reviewed study titled “The Gap within the Gap: Using Longitudinal Data to Understand Income Differences in Student Achievement.” Note that we have covered NBER studies such as this in the past in this blog, so in all fairness and like I have noted in the past, this paper should also be critically consumed, as well as my interpretations of the authors’ findings.
Nevertheless, this study is authored by Katherine Michelmore — Assistant Professor of Public Administration and International Affairs at Syracuse University, and Susan Dynarski — Professor of Public Policy, Education, and Economics at the University of Michigan, and this study is entirely relevant to value-added models (VAMs). Hence, below I cover their key highlights and takeaways, as I see them. I should note up front, however, that the authors did not directly examine how the new measure of economic disadvantage that they introduce (see below) actually affects calculations of teacher-level value-added. Rather, they motivate their analyses by saying that calculating teacher value-added is one application of their analyses.
The background to their study is as follows: “Gaps in educational achievement between high- and low-income children are growing” (p. 1), but the data that are used to capture “high- and low-income” in the state of Michigan (i.e., the state in which their study took place) and many if not most other states throughout the US, capture “income” demographics in very rudimentary, blunt, and often binary ways (i.e., “yes” for students who are eligible to receive federally funded free-or-reduced lunches and “no” for the ineligible).
Consequently, in this study the authors “leverage[d] the longitudinal structure of these data sets to develop a new measure of persistent economic disadvantage” (p. 1), all the while defining “persistent economic disadvantage” by the extent to which students were “eligible for subsidized meals in every grade since kindergarten” (p. 8). Students “who [were] never eligible for subsidized meals during those grades [were] defined as never [being economically] disadvantaged” (p. 8), and students who were eligible for subsidized meals for variable years were defined as “transitorily disadvantaged” (p. 8). This all runs counter, however, to the binary codes typically used, again, across the nation.
Appropriately, then, their goal (among other things) was to see how a new measure they constructed to better measure and capture “persistent economic disadvantage” might help when calculating teacher-level value-added. They accordingly argue (among other things) that, perhaps, not accounting for persistent disadvantage might subsequently cause more biased value-added estimates “against teachers of [and perhaps schools educating] persistently disadvantaged children” (p. 3). This, of course, also depends on how persistently disadvantaged students are (non)randomly assigned to teachers.
With statistics like the following as also reported in their report: “Students [in Michigan] [persistently] disadvantaged by 8th grade were six times more likely to be black and four times more likely to be Hispanic, compared to those who were never disadvantaged,” their assertions speak volumes not only to the importance of their findings for educational policy, but also for the teachers and schools still being evaluated using value-added scores and the researchers investigating, criticizing, promoting, or even trying to make these models better (if that is possible). In short, though, teachers who are disproportionately teaching in urban areas with more students akin to their equally disadvantaged peers, might realize relatively more biased value-added estimates as a result.
For value-added purposes, then, it is clear that the assumptions that controlling for student disadvantage by using such basal indicators of current economic disadvantage is overly simplistic, and just using test scores to also count for this economic disadvantage (i.e., as promoted in most versions of the Education Value-Added Assessment System (EVAAS)) is likely worse. More specifically, the assumption that economic disadvantage also does not impact some students more than others over time, or over the period of data being used to capture value-added (typically 3-5 years of students’ test score data), is also highly susceptible. “[T]hat children who are persistently disadvantaged perform worse than those who are disadvantaged in only some grades” (p. 14) also violates another fundamental assumption that teachers’ effects are consistent over time for similar students who learn at more or less consistent rates over time, regardless of these and other demographics.
The bottom line here, then, is that the indicator that should be used instead of our currently used proxies for current economic disadvantage is the number of grades students spend in economic disadvantage. If the value-added indicator does not effectively account for the “negative, nearly linear relationship between [students’ test] scores and the number of grades spent in economic disadvantage” (p. 18), while controlling for other student demographics and school fixed effects, value-added estimates will likely be (even) more biased against teachers who teach these students as a result.
Otherwise, teachers who teach students with persistent economic disadvantages will likely have it worse (i.e., in terms of bias) than teachers who teach students with current economic disadvantages, teachers who teach students with economically disadvantaged in their current or past histories will have it worse than teachers who teach students without (m)any prior economic disadvantages, and so on.
Citation: Michelmore, K., & Dynarski, S. (2016). The gap within the gap: Using longitudinal data to understand income differences in student achievement. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w22474
The American Prospect — a self-described “liberal intelligence” magazine — featured last week a question and answer, interview-based article with Jesse Rothstein — Professor of Economics at University of California – Berkeley — on “The Economic Consequences of Denying Teachers Tenure.” Rothstein is a great choice for this one in that indeed he is an economist, but one of a few, really, who is deep into the research literature and who, accordingly, has a balanced set of research-based beliefs about value-added models (VAMs), their current uses in America’s public schools, and what they can and cannot do (theoretically) to support school reform. He’s probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms indeed biases (or distorts) value-added estimations, pretty much despite the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here, and a follow-up to this study here.
In this article, though, the interviewer — Rachel Cohen — interviews Jesse primarily about how in California a higher court recently reversed the Vergara v. California decision that would have weakened teacher employment protections throughout the state (see also here). “In 2014, in Vergara v. California, a Los Angeles County Superior Court judge ruled that a variety of teacher job protections worked together to violate students’ constitutional right to an equal education. This past spring, in a 3–0 decision, the California Court of Appeals threw this ruling out.”
Here are the highlights in my opinion, by question and answer, although there is much more information in the full article here:
Cohen: “Your research suggests that even if we got rid of teacher tenure, principals still wouldn’t fire many teachers. Why?”
Rothstein: “It’s basically because in most cases, there’s just not actually a long list of [qualified] people lining up to take the jobs; there’s a shortage of qualified teachers to hire.” In addition, “Lots of schools recognize it makes more sense to keep the teacher employed, and incentivize them with tenure…”I’ve studied this, and it’s basically economics 101. There is evidence that you get more people interested in teaching when the job is better, and there is evidence that firing teachers reduces the attractiveness of the job.”
Cohen: Your research suggests that even if we got rid of teacher tenure, principals still wouldn’t fire many teachers. Why?
Rothstein: It’s basically because in most cases, there’s just not actually a long list of people lining up to take the jobs; there’s a shortage of qualified teachers to hire. If you deny tenure to someone, that creates a new job opening. But if you’re not confident you’ll be able to fill it with someone else, that doesn’t make you any better off. Lots of schools recognize it makes more sense to keep the teacher employed, and incentivize them with tenure.
Cohen: “Aren’t most teachers pretty bad their first year? Are we denying them a fair shot if we make tenure decisions so soon?”
Rothstein: “Even if they’re struggling, you can usually tell if things will turn out to be okay. There is quite a bit of evidence for someone to look at.”
Cohen: “Value-added models (VAM) played a significant role in the Vergara trial. You’ve done a lot of research on these tools. Can you explain what they are?”
Rothstein: “[The] value-added model is a statistical tool that tries to use student test scores to come up with estimates of teacher effectiveness. The idea is that if we define teacher effectiveness as the impact that teachers have on student test scores, then we can use statistics to try to then tell us which teachers are good and bad. VAM played an odd role in the trial. The plaintiffs were arguing that now, with VAM, we have these new reliable measures of teacher effectiveness, so we should use them much more aggressively, and we should throw out the job statutes. It was a little weird that the judge took it all at face value in his decision.”
Cohen: “When did VAM become popular?”
Rothstein: “I would say it became a big deal late in the [George W.] Bush administration. That’s partly because we had new databases that we hadn’t had previously, so it was possible to estimate on a large scale. It was also partly because computers had gotten better. And then VAM got a huge push from the Obama administration.”
Cohen: “So you’re skeptical of VAM.”
Rothstein: “I think the metrics are not as good as the plaintiffs made them out to be. There are bias issues, among others.”
Cohen: “During the Vergara trials you testified against some of Harvard economist Raj Chetty’s VAM research, and the two of you have been going back and forth ever since. Can you describe what you two are arguing about?”
Rothstein: “Raj’s testimony at the trial was very focused on his work regarding teacher VAM. After the trial, I really dug in to understand his work, and I probed into some of his assumptions, and found that they didn’t really hold up. So while he was arguing that VAM showed unbiased results, and VAM results tell you a lot about a teacher’s long-term outcomes, I concluded that what his approach really showed was that value-added scores are moderately biased, and that they don’t really tell us one way or another about a teacher’s long-term outcomes” (see more about this debate here).
Cohen: “Could VAM be improved?”
Rothstein: “It may be that there is a way to use VAM to make a better system than we have now, but we haven’t yet figured out how to do that. Our first attempts have been trying to use them in not very intelligent ways.”
Cohen: “It’s been two years since the Vergara trial. Do you think anything’s changed?”
Rothstein: “I guess in general there’s been a little bit of a political walk-back from the push for VAM. And this retreat is not necessarily tied to the research evidence; sometimes these things just happen. But I’m not sure the trial court opinion would have come out the same if it were held today.”
Again, see more from this interview, also about teacher evaluation systems in general, job protections, and the like in the full article here.
Citation: Cohen, R. M. (2016, August 4). Q&A: The economic consequences of eenying teachers tenure. The American Prospect. Retrieved from http://prospect.org/article/qa-economic-consequences-denying-teachers-tenure
I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 20 years ago (i.e., 1986) by – Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should as Professor Emeritus Haertel is one of the best on the topic of, and history behind VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.
Heartel wrote this article when at the time policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured on states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today — investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.
Here are some of the points I found of most “value:”
- On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” While “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” This, all, makes isolating the teacher effect nearly if not wholly impossible.
- On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or our, as well.
- On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
- On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the cri- teria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately ap- plied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
- On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific inter- actions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
- And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”
Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.
One of my doctoral students sent me a YouTube video I feel compelled to share with you all. It is an interview with one of my all time favorite and most admired academics — Stephen Jay Gould. Gould, who passed away at age 60 from cancer, was a paleontologist, evolutionary biologist, and scientist who spent most of his academic career at Harvard. He was “one of the most influential and widely read writers of popular science of his generation,” and he was also the author of one of my favorite books of all time: The Mismeasure of Man (1981).
In The Mismeasure of Man Gould examined the history of psychometrics and the history of intelligence testing (e.g., the methods of nineteenth century craniometry, or the physical measures of peoples’ skulls to “objectively” capture their intelligence). Gould examined psychological testing and the uses of all sorts of tests and measurements to inform decisions (which is still, as we know, uber-relevant today) as well as “inform” biological determinism (i.e., “the view that “social and economic differences between human groups—primarily races, classes, and sexes—arise from inherited, inborn distinctions and that society, in this sense, is an accurate reflection of biology). Gould also examined in this book the general use of mathematics and “objective” numbers writ large to measure pretty much anything, as well as to measure and evidence predetermined sets of conclusions. This book is, as I mentioned, one of the best. I highly recommend it to all.
In this seven-minute video, you can get a sense of what this book is all about, as also so relevant to that which we continue to believe or not believe about tests and what they really are or are not worth. Thanks, again, to my doctoral student for finding this as this is a treasure not to be buried, especially given Gould’s 2002 passing.
I recently re-read an article in full that is now 10 years old, or 10 years out, as published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” Unfortunately, I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as most of the time, this serves as a reminder of how things, most of the time, have not changed.
The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientists), Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).
At the point at which the authors wrote this article, besides the aforementioned data and data base issues, were issues with “multiple measures on the same student and multiple teachers instructing each student” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” Authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”
The assertions they advanced, accordingly and as relevant to these questions, follow:
- Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (stil) comes from influences outside of the teacher’s control. A similar statistic (i.e., 1% to 14%) was similarly and recently highlighted in the recent position statement on VAMs released by the American Statistical Association.
- Most VAMs focus exclusively on scores from standardized assessments, although I will take this one-step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here: Putting growth and value-added models on the map: A national overview).
- VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically insolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” The really only solution here is to use “implicit imputation of values for unobserved gains using the observed scores” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observe gains [together].”
- Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
- All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.
These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”
Since the passage of the Every Student Succeeds Act (ESSA) last January, in which the federal government handed back to states the authority to decide whether to evaluate teachers with or without students’ test scores, states have been dropping the value-added measure (VAM) or growth components (e.g., the Student Growth Percentiles (SGP) package) of their teacher evaluation systems, as formerly required by President Obama’s Race to the Top initiative. See my most recent post here, for example, about how legislators in Oklahoma recently removed VAMs from their state-level teacher evaluation system, while simultaneously increasing the state’s focus on the professional development of all teachers. Hawaii recently did the same.
Now, it seems that Massachusetts is the next at least moving in this same direction.
As per a recent article in The Boston Globe (here), similar test-based teacher accountability efforts are facing increased opposition, primarily from school district superintendents and teachers throughout the state. At issue is whether all of this is simply “becoming a distraction,” whether the data can be impacted or “biased” by other statistically uncontrollable factors, and whether all teachers can be evaluated in similar ways, which is an issue with “fairness.” Also at issue is “reliability,” whereby a 2014 study released by the Center for Educational Assessment at the University of Massachusetts Amherst, in which researchers examined student growth percentiles, found the “amount of random error was substantial.” Stephen Sireci, one of the study authors and UMass professor, noted that, instead of relying upon the volatile results, “You might as well [just] flip a coin.”
Damian Betebenner, a senior associate at the National Center for the Improvement of Educational Assessment Inc. in Dover, N.H. who developed the SGP model in use in Massachusetts, added that “Unfortunately, the use of student percentiles has turned into a debate for scapegoating teachers for the ills.” Isn’t this the truth, to the extent that policymakers got a hold of these statistical tools, after which they much too swiftly and carelessly singled out teachers for unmerited treatment and blame.
Regardless, and recently, stakeholders in Massachusetts lobbied the Senate to approve an amendment to the budget that would no longer require such test-based ratings in teachers’ professional evaluations, while also passing a policy statement urging the state to scrap these ratings entirely. “It remains unclear what the fate of the Senate amendment will be,” however. “The House has previously rejected a similar amendment, which means the issue would have to be resolved in a conference committee as the two sides reconcile their budget proposals in the coming weeks.”
Not surprisingly, Mitchell Chester, Massachusetts Commissioner for Elementary and Secondary Education, continues to defend the requirement. It seems that Chester, like others, is still holding tight to the default (yet still unsubstantiated) logic helping to advance these systems in the first place, arguing, “Some teachers are strong, others are not…If we are not looking at who is getting strong gains and those who are not we are missing an opportunity to upgrade teaching across the system.”
Recall that the peer-reviewed journal Educational Researcher (ER) – published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.
Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).
She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).
None of things are actual true (or near to true, nor will they likely ever be true) in educational practice, however. Hence, the errors we continue to observe that continue to prevent VAM use for their intended utilities, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.
Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added is that the tests are very narrow, focus on lower level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes’…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessing Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p.133).
With respect to (2) above, on the equivalence (or rather non-equivalence) of groups of student across teachers classrooms who are the ones whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is the one of most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of most debated issues surrounding VAMs and their levels of outcome bias, accordingly.
And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (e) individual student needs, health, and attendance” (p. 133).
“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).
Other “considerable challenges” include: VAM output are grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output are “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added causes “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Related, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.
Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well (p. 135).
Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, using VAMs for teacher evaluations or not. Given it’s rare in this area of research we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (p. 135-136) within her larger article here.
In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).
If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.
Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346
Recall the New York lawsuit pertaining to Long Island teacher Sheri Lederman? She just won in New York’s State Supreme court, and boy did she win big, also for the cause!
Sheri is a teacher, who by all accounts other than her 2013-2014 “ineffective” growth score of a 1/20, is a terrific 4th grade, 18-year veteran teacher. However, after receiving her “ineffective” growth rating and score, she along with her attorney and husband Bruce Lederman, sued the state of New York to challenge the state’s growth-based teacher evaluation system and Sheri’s individual score. See prior posts about Sheri’s case here, here, here and here.
The more specific goal of her case was to seek a judgment: (1) setting aside or vacating Sheri’s individual growth score and rating her as “ineffective,” and (2) declare that the New York endorsed and implemented growth measures in use was/is “arbitrary and capricious.” The “overall gist” was that Sheri contended that the system unfairly penalized teachers whose students consistently scored well and could not demonstrated growth upwards (e.g., teachers of gifted or other high achieving students). This concern/complaint is common elsewhere.
As per a State Supreme Court ruling, just released today as written by Acting Supreme Court Justice Judge Roger McDonough (May 10, 2016), and at 15 pages in length and available in full here, Sheri won her case. She won it against John King — the then New York State Education Department Commissioner and the now US Secretary of Education (who recently replaced Arne Duncan as US Secretary of Education). The Court concluded that Sheri (her husband, her team of experts, and other witnesses) effectively established that her growth score and rating for 2013-2014 was “arbitrary and capricious,” with “arbitrary and capricious” being defined as actions “taken without sound basis in reason or regard to the facts.”
More specifically, the Court’s conclusion was founded upon: (1) the convincing and detailed evidence of VAM bias against teachers at both ends of the spectrum (e.g. those with high-performing students or those with low-performing students); (2) the disproportionate effect of petitioner’s small class size and relatively large percentage of high-performing students; (3) the functional inability of high-performing students to demonstrate growth akin to lower-performing students; (4) the wholly unexplained swing in petitioner’s growth score from 14 [i.e., her growth score the year prior] to 1, despite the presence of statistically similar scoring students in her respective classes; and, most tellingly, (5) the strict imposition of rating constraints in the form of a “bell curve” that places teachers in four categories via pre-determined percentages regardless of whether the performance of students dramatically rose or dramatically fell from the previous year.”
As per an email I received earlier today from Bruce (i.e., Sheri’s husband/attorney who prosecuted her case), the Court otherwise “declined to make an overall ruling on the [New York growth] rating system in general because of new regulations in effect” [e.g., that the state’s growth model is currently under review]…[Nontheless, t]he decision should qualify as persuasive authority for other teachers challenging growth scores throughout the County [and Country]. [In addition, the] Court carefully recite[d] all our expert affidavits [i.e., from Professors Darling-Hammond, Pallas, Amrein-Beardsley, Sean Corcoran and Jesse Rothstein as well as Drs. Burris and Lindell].” Noted as well were the “absence of any meaningful’ challenge to [Sheri’s] experts’ conclusions, especially about the dramatic swings noticed between her, and potentially others’ scores, and the other ‘litany of expert affidavits submitted on [Sheris’] behalf].”
“It is clear that the evidence all of these amazing experts presented was a key factor in winning this case since the Judge repeatedly said both in Court and in the decision that we have a “high burden” to meet in this case.” [In addition,] [t]he Court wrote that the court “does not lightly enter into a critical analysis of this matter … [and] is constrained on this record, to conclude that [the] petitioner [i.e., Sheri] has met her high burden.”
To Bruce’s/our knowledge, this is the first time a judge has set aside an individual teacher’s VAM rating based upon such a presentation in court.
Thanks to all who helped in this endeavor. Onward!