Using VAMs “In Not Very Intelligent Ways:” A Q&A with Jesse Rothstein

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

The American Prospect — a self-described “liberal intelligence” magazine — featured last week a question and answer, interview-based article with Jesse Rothstein — Professor of Economics at University of California – Berkeley — on “The Economic Consequences of Denying Teachers Tenure.” Rothstein is a great choice for this one in that indeed he is an economist, but one of a few, really, who is deep into the research literature and who, accordingly, has a balanced set of research-based beliefs about value-added models (VAMs), their current uses in America’s public schools, and what they can and cannot do (theoretically) to support school reform. He’s probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms indeed biases (or distorts) value-added estimations, pretty much despite the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here, and a follow-up to this study here.

In this article, though, the interviewer — Rachel Cohen — interviews Jesse primarily about how in California a higher court recently reversed the Vergara v. California decision that would have weakened teacher employment protections throughout the state (see also here). “In 2014, in Vergara v. California, a Los Angeles County Superior Court judge ruled that a variety of teacher job protections worked together to violate students’ constitutional right to an equal education. This past spring, in a 3–0 decision, the California Court of Appeals threw this ruling out.”

Here are the highlights in my opinion, by question and answer, although there is much more information in the full article here:

Cohen: “Your research suggests that even if we got rid of teacher tenure, principals still wouldn’t fire many teachers. Why?”

Rothstein: “It’s basically because in most cases, there’s just not actually a long list of [qualified] people lining up to take the jobs; there’s a shortage of qualified teachers to hire.” In addition, “Lots of schools recognize it makes more sense to keep the teacher employed, and incentivize them with tenure…”I’ve studied this, and it’s basically economics 101. There is evidence that you get more people interested in teaching when the job is better, and there is evidence that firing teachers reduces the attractiveness of the job.”

Cohen: Your research suggests that even if we got rid of teacher tenure, principals still wouldn’t fire many teachers. Why?

Rothstein: It’s basically because in most cases, there’s just not actually a long list of people lining up to take the jobs; there’s a shortage of qualified teachers to hire. If you deny tenure to someone, that creates a new job opening. But if you’re not confident you’ll be able to fill it with someone else, that doesn’t make you any better off. Lots of schools recognize it makes more sense to keep the teacher employed, and incentivize them with tenure.

Cohen: “Aren’t most teachers pretty bad their first year? Are we denying them a fair shot if we make tenure decisions so soon?”

Rothstein: “Even if they’re struggling, you can usually tell if things will turn out to be okay. There is quite a bit of evidence for someone to look at.”

Cohen: “Value-added models (VAM) played a significant role in the Vergara trial. You’ve done a lot of research on these tools. Can you explain what they are?”

Rothstein: “[The] value-added model is a statistical tool that tries to use student test scores to come up with estimates of teacher effectiveness. The idea is that if we define teacher effectiveness as the impact that teachers have on student test scores, then we can use statistics to try to then tell us which teachers are good and bad. VAM played an odd role in the trial. The plaintiffs were arguing that now, with VAM, we have these new reliable measures of teacher effectiveness, so we should use them much more aggressively, and we should throw out the job statutes. It was a little weird that the judge took it all at face value in his decision.”

Cohen: “When did VAM become popular?”

Rothstein: “I would say it became a big deal late in the [George W.] Bush administration. That’s partly because we had new databases that we hadn’t had previously, so it was possible to estimate on a large scale. It was also partly because computers had gotten better. And then VAM got a huge push from the Obama administration.”

Cohen: “So you’re skeptical of VAM.”

Rothstein: “I think the metrics are not as good as the plaintiffs made them out to be. There are bias issues, among others.”

Cohen: “During the Vergara trials you testified against some of Harvard economist Raj Chetty’s VAM research, and the two of you have been going back and forth ever since. Can you describe what you two are arguing about?”

Rothstein: “Raj’s testimony at the trial was very focused on his work regarding teacher VAM. After the trial, I really dug in to understand his work, and I probed into some of his assumptions, and found that they didn’t really hold up. So while he was arguing that VAM showed unbiased results, and VAM results tell you a lot about a teacher’s long-term outcomes, I concluded that what his approach really showed was that value-added scores are moderately biased, and that they don’t really tell us one way or another about a teacher’s long-term outcomes” (see more about this debate here).

Cohen: “Could VAM be improved?”

Rothstein: “It may be that there is a way to use VAM to make a better system than we have now, but we haven’t yet figured out how to do that. Our first attempts have been trying to use them in not very intelligent ways.”

Cohen: “It’s been two years since the Vergara trial. Do you think anything’s changed?”

Rothstein: “I guess in general there’s been a little bit of a political walk-back from the push for VAM. And this retreat is not necessarily tied to the research evidence; sometimes these things just happen. But I’m not sure the trial court opinion would have come out the same if it were held today.”

Again, see more from this interview, also about teacher evaluation systems in general, job protections, and the like in the full article here.

Citation: Cohen, R. M. (2016, August 4). Q&A: The economic consequences of eenying teachers tenure. The American Prospect. Retrieved from http://prospect.org/article/qa-economic-consequences-denying-teachers-tenure

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

“The 74’s” Fact-Checking of the Democratic Platform

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

As we all likely know well by now, speakers for both parties during last and this weeks’ Republican and Democratic Conventions, respectively, spoke and in many cases spewed a number of exaggerated, misleading, and outright false claims about multiple areas of American public policy…educational policy included. Hence, many fact-checking journalists, websites, social mediaists, and the like, have since been trying to hold both parties accountable for their facts and make “the actual facts” more evident. For a funny video about all of this, actually, see HBO’s John Oliver’s most recent bit on “last week’s unsurprisingly surprising Republican convention” here (11 minutes) and some of their expressions of “feelings” as “facts.”

Fittingly, The 74 — an (allegedly) non-partisan, honest, and fact-based news site (ironically) covering America’s education system “in crisis,” and publishing articles “backed by investigation, expertise, and experience” and backed by Editor-in-Chief Campbell Brown — took on such a fact-checking challenge in an article senior staff writer Matt Burnum wrote: “Researchers: No Consensus Against Using Test Scores in Teacher Evaluations, Contra Democratic Platform.”

Apparently, what author Barnum actually did to justify the title and contents of his article, however, was (1) take the claim written into the 55-page “2016 Democratic Party Platform” document that: “We [the Democratic Party] oppose…the use of student test scores in teacher and principal evaluations, a practice which has been repeatedly rejected by researchers” (p. 33); then (2) generalize what being “repeatedly rejected by researchers” means, to inferring that a “consensus,” “wholesale,” and “categorical rejection” among researchers “that such scores should not be used whatsoever in evaluation” exists; then (3) proceed to ask a non-random or representative sample of nine researchers on the topic about whether, indeed, his deduced conclusion was true; to (4) ultimately claim that “the [alleged] suggestion that there is a scholarly consensus against using test scores in teacher evaluation is misleading.”

Misleading, rather, is Barnum’s framing of his entire piece, as Barnum twisted the original statement into something more alarmist, which apparently warranted his fact-checking, after which he engaged in a weak convenience-based investigation, with unsubstantiated findings ultimately making the headline of this subsequent article. It seems that those involved in reporting “the actual facts” also need some serious editing and fact-checking themselves in that, “The 74’s poll of just nine researchers [IS NOT] may not be a representative sample of expert opinion,” whatsoever.

Nonetheless, the nine respondents (also without knowledge of who was contacted but did not respond, i.e., a response rate) included: Dan Goldhaber — Adjunct Professor of Education and Economics at the University of Washington, Bothell; Kirabo Jackson — Associate Professor of Education and Economics at Northwestern University; Cory Koedel — Associate Professor of Economics and Public Policy at the University of Missouri; Matthew Kraft — Assistant Professor of Education and Economics at Brown University; Susan Moore Johnson — Professor of Teacher Policy at Harvard University; Jesse Rothstein — Professor of Public Policy and Economics at the University of California, Berkeley;  Matthew Steinberg — Assistant Professor of Educational Policy at the University of Pennsylvania; Katharine Strunk — Associate Professor of Educational Policy at the University of Southern California; Jim Wyckoff — Professor of Educational Policy at the University of Virginia. You can see what appear to be these researchers’ full responses to Barnum’s undisclosed solicitation at the bottom of this article, available again here, noting that the opinions of these nine are individually important as I too would value some of these nine as among (but not representative of) the experts in the area of research (see a fuller list of 37 such experts here, 2/3rds of whom are listed above).

Regardless, and assuming that Barnum’s original misinterpretation was correct, I think how Katharine Strunk put it is likely more representative of the group of researchers on this topic as a whole as based on the research: “I think the research suggests that we need multiple measures — test scores [depending on the extent to which evidence supports low- and more importantly high-stakes use], observations, and others – to rigorously and fairly evaluate teachers.” Likewise, how Jesse Rothstein framed his response, in my opinion, is another takeaway for those looking for what is more likely a more accurate and representative statement on this hypothetical consensus: “the weight of the evidence, and the weight of expert opinion, points to the conclusion that we haven’t figured out ways to use test scores in teacher evaluations that yield benefits greater than costs.”

With that being said, what is likely most the “fact” desired in this particular instance is that “the use of student test scores in teacher and principal evaluations, [IS] a practice which has been repeatedly rejected by researchers.” But it has also been disproportionately promoted by researchers with disciplinary backgrounds in economics (although this is not always the case), and disproportionately rejected so by those with disciplinary backgrounds in education, educational policy, educational measurement and statistics, and the like (although this is not always the case). The bottom line is that reaching a consensus in this area of research is much more difficult than Barnum and others might otherwise assume.

Should one really want to “factually” answer such a question, (s)he would have to more carefully: (1) define the problem and subsequent research question (e.g., the platform never claimed in the first place that said “consensus” existed), (2) engage in background research to (3) methodically define the population of researchers from which (4) the research sample is to be drawn to adequately represent the population, after which (5) an appropriate response rate is to be secured. If there are methodological weaknesses in any of these steps, the research exercise should likely stop, as Barnum should have during step #1 in this case here.

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

47 Teachers To Be Stripped of Tenure in Denver

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

As per a recent article by Chalkbeat Colorado, “Denver Public Schools [is] Set to Strip Nearly 50 Teachers of Tenure Protections after [two-years of consecutive] Poor Evaluations.” This will make Denver Public Schools — Colorado’s largest school district — the district with the highest relative proportion of teachers to lose tenure, which demotes teachers to probationary status, which also causes them to lose their due process rights.

  • The majority of the 47 teachers — 26 of them — are white. Another 14 are Latino, four are African-American, two are multi-racial and one is Asian.
  • Thirty-one of the 47 teachers set to lose tenure — or 66 percent — teach in “green” or “blue” schools, the two highest ratings on Denver’s color-coded School Performance Framework. Only three — or 6 percent — teach in “red” schools, the lowest rating.
  • Thirty-eight of the 47 teachers — or 81 percent — teach at schools where more than half of the students qualify for federally subsidized lunches, an indicator of poverty.

Elsewhere, in Douglas County 24, in Aurora 12, in Cherry Creek one, and in Jefferson County, the state’s second largest district, zero teachers teachers are set to lose their tenure status. This all occurred provided a sweeping educator effectiveness law — Senate Bill 191 — passed throughout Colorado six years ago. As per this law, “at least 50 percent of a teacher’s evaluation [must] be based on student academic growth.”

“Because this is the first year teachers can lose that status…[however]…officials said it’s difficult to know why the numbers differ from district to district.” This, of course, is an issue with fairness whereby a court, for example, could find that if a teacher is teaching in District X versus District Y, and (s)he had an different probability of losing tenure due only to the District in which (s)he taught, this could be quite easily argued as an arbitrary component of the law, not to mention an arbitrary component of its implementation. If I was advising these districts on these matters, I would certainly advise them to tread lightly.

However, apparently many districts throughout Colorado use a state-developed and endorsed model to evaluate their teachers, but Denver uses its own model; hence, this would likely take some of the pressure off of the state, should this end up in court, and place it more so upon the district. That is, the burden of proof would likely rest on Denver Public School officials to evidence that they are no only complying with the state law but that they are doing so in sound, evidence-based, and rational/reasonable ways.

Citation: Amar, M. (2016, July 15). Denver Public Schools set to strip nearly 50 teachers of tenure protections after poor evaluations. Chalkbeat Colorado. Retrieved from http://www.chalkbeat.org/posts/co/2016/07/14/denver-public-schools-set-to-strip-nearly-50-teachers-of-tenure-protections-after-poor-evaluations/#.V5Yryq47Tof

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

One Score and Seven Policy Iterations Ago…

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 20 years ago (i.e., 1986) by – Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should as Professor Emeritus Haertel is one of the best on the topic of, and history behind VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Heartel wrote this article when at the time policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured on states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today — investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” While “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” This, all, makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or our, as well.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the cri- teria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately ap- plied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific inter- actions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

New Book on Market-Based, Educational Reforms

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

For those of you looking for a good read, you may want to check out this new book: “Learning from the Federal Market‐Based Reforms: Lessons for ESSA [the Every Student Succeeds Act]” here.

As Larry Cuban put it, the book’s editors have a “cast of all-star scholars” in this volume, and in Gloria Ladson-Billings words, the editors “assembled some of the nation’s best minds” to examine the evidence on today’s market-based reforms as well as more promising, equitable ones. For full disclosure, I have a chapter in this book about using value-added models (VAMs) to measure and evaluate teacher education programs (see below), although I am not making any royalties from book sales.

If interested, you can purchase the book at a  reduced price of $30 (from $40) per paperback thru 7/31/17, using the following discount code at checkout: LFMBR30350.  Here, again, is the link.

ABOUT THE BOOK: Over the past twenty years, educational policy has been characterized by top‐down, market‐focused policies combined with a push toward privatization and school choice. The new Every Student Succeeds Act continues along this path, though with decision‐making authority now shifted toward the states. These market‐based reforms have often been touted as the most promising response to the challenges of poverty and educational disenfranchisement. But has this approach been successful? Has learning improved? Have historically low‐scoring schools “turned around” or have the reforms had little effect? Have these narrow conceptions of schooling harmed the civic and social purposes of education in a democracy?

This book presents the evidence. Drawing on the work of the nation’s most prominent researchers, the book explores the major elements of these reforms, as well as the social, political, and educational contexts in which they take place. It examines the evidence supporting the most common school improvement strategies: school choice; reconstitutions, or massive personnel changes; and school closures. From there, it presents the research findings cutting across these strategies by addressing the evidence on test score trends, teacher evaluation, “miracle” schools, the Common Core State Standards, school choice, the newly emerging school improvement industry, and re‐segregation, among others.

The weight of the evidence indisputably shows little success and no promise for these reforms. Thus, the authors counsel strongly against continuing these failed policies. The book concludes with a review of more promising avenues for educational reform, including the necessity of broader societal investments for combatting poverty and adverse social conditions. While schools cannot single‐handedly overcome societal inequalities, important work can take place within the public school system, with evidence‐based interventions such as early childhood education, detracking, adequate funding and full‐service community schools—all intended to renew our nation’s commitment to democracy and equal educational opportunity.

CONTENTS BY SECTION AND CHAPTER

Foreword, Jeannie Oakes

SECTION I: INTRODUCTION: THE FOUNDATIONS OF MARKET BASED REFORM

  • Purposes of Education: The Language of Schooling, Mike Rose.
  • The Political Context, Janelle Scott.
  • Historical Evolution of Test‐Based Reforms, Harvey Kantor and Robert Lowe.
  • Predictable Failure of Test‐Based Accountability, Heinrich Mintrop and Gail Sunderman.

SECTION II: TEST‐BASED SANCTIONS: WHAT THE EVIDENCE SAYS

  • Transformation & Reconstitution, Betty Malen and Jennifer King Rice.
  • Turnarounds, Tina Trujillo and Michelle Valladares.
  • Restart/Conversion, Gary Miron and Jessica Urschel.
  • Closures, Ben Kirshner, Erica Van Steenis, Kristen Pozzoboni, and Matthew Gaertner.

SECTION III: FALSE PROMISES

  • Miracle School Myth, P. L. Thomas.
  • Has Test‐Based Accountability Worked? Committee on Incentives and Test‐Based Accountability in Public Education (Michael Hout & Stuart Elliott, Eds.).
  • The Effectiveness of Test‐Based Reforms. Kevin Welner and William Mathis.
  • Value Added Models: Teacher, Principal and School Evaluations, American Statistical Association.
  • The Problems with the Common Core, Stan Karp.
  • Reform and Re‐Segregation, Gary Orfield.
  • English Language Learners. Angela Valenzuela and Brendan Maxcy.
  • Racial Disproportionality: Discipline, Anne Gregory, Russell Skiba, and Pedro Noguera.
  • School Choice, Christopher Lubienski and Sarah Theule Lubienski.
  • The Privatization Industry, Patricia Burch and Jahni Smith.
  • Virtual Education, Michael Barbour.

SECTION IV: EFFECTIVE REFORMS

  • Addressing Poverty, David Berliner.
  • Racial Segregation & Achievement, Richard Rothstein.
  • Adequate Funding, Michael Rebell.
  • Early Childhood Education, Steven Barnett.
  • De‐Tracking, Kevin Welner and Carol Corbett Burris.
  • Class Size, Diane Whitmore Schanzenbach.
  • School–Community Partnerships, Linda Valli, Amanda Stefanski, and Reuben Jacobson.
  • Community Organizing for Grassroots Support, Mark Warren.
  • Teacher Education, Audrey Amrein‐Beardsley, Joshua Barnett, and Tirupalavanam Ganesh.

SECTION V: CONCLUSION

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

Center on the Future of American Education, on America’s “New and Improved” Teacher Evaluation Systems

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

Thomas Toch — education policy expert and research fellow at Georgetown University, and founding director of the Center on the Future of American Education — just released, as part of the Center, a report titled: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. He sent this to me for my thoughts, and I decided to summarize my thoughts here, with thanks and all due respect to the author, as clearly we are on different sides of the spectrum in terms of the literal “value” America’s new teacher evaluation systems might in fact “add” to the reformation of America’s public schools.

While quite a long and meaty report, here are some of the points I think that are important to address publicly:

First, is it true that using prior teacher evaluation systems (which were almost if not entirely based on teacher observational systems) yielded for “nearly every teacher satisfactory ratings”? Indeed, this is true. However, what we have seen since 2009, when states began to adopt what were then (and in many ways still are) viewed as America’s “new and improved” or “strengthened” teacher evaluation systems, is that for 70% of America’s teachers, these teacher evaluation systems are still based only on the observational indicators being used prior, because for only 30% of America’s teachers are value-added estimates calculable. As also noted in this report, it is for these 70% that “the superficial teacher [evaluation] practices of the past” (p. 2) will remain the same, although I disagree with this particular adjective, especially when these measures are used for formative purposes. While certainly imperfect, these are not simply “flimsy checklists” of no use or value. There is, indeed, much empirical research to support this assertion.

Likewise, these observational systems have not really changed since 2009, or 1999 for that matter and not that they could change all that much; but, they are not in their “early stages” (p. 2) of development. Indeed, this includes the Danielson Framework explicitly propped up in this piece as an exemplar, regardless of the fact it has been used across states and districts for decades and it is still not functioning as intended, especially when summative decisions about teacher effectiveness are to be made (see, for example, here).

Hence, in some states and districts (sometimes via educational policy) principals or other observers are now being asked, or required to deliberately assign to teachers’ lower observational categories, or assign approximate proportions of teachers per observational category used. Whereby the instrument might not distribute scores “as currently needed,” one way to game the system is to tell principals, for example, that they should only allot X% of teachers as per the three-to-five categories most often used across said instruments. In fact, in an article one of my doctoral students and I have forthcoming, we have termed this, with empirical evidence, the “artificial deflation” of observational scores, as externally being persuaded or required. Worse is that this sometimes signals to the greater public that these “new and improved” teacher evaluation systems are being used for more discriminatory purposes (i.e., to actually differentiate between good and bad teachers on some sort of discriminating continuum), or that, indeed, there is a normal distribution of teachers, as per their levels of effectiveness. While certainly there is some type of distribution, no evidence exists whatsoever to suggest that those who fall on the wrong side of the mean are, in fact, ineffective, and vice versa. It’s all relative, seriously, and unfortunately.

Related, the goal here is really not to “thoughtfully compare teacher performances,” but to evaluate teachers as per a set of criteria against which they can be evaluated and judged (i.e., whereby criterion-referenced inferences and decisions can be made). Inversely, comparing teachers in norm-referenced ways, as (socially) Darwinian and resonate with many-to-some, does not necessarily work, either or again. This is precisely what the authors of The Widget Effect report did, after which they argued for wide-scale system reform, so that increased discrimination among teachers, and reduced indifference on the part of evaluating principals, could occur. However, as also evidenced in this aforementioned article, the increasing presence of normal curves illustrating “new and improved” teacher observational distributions does not necessarily mean anything normal.

And were these systems not used often enough or “rarely” prior, to fire teachers? Perhaps, although there are no data to support such assertions, either. This very argument was at the heart of the Vergara v. California case (see, for example, here) — that teacher tenure laws, as well as laws protecting teachers’ due process rights, were keeping “grossly ineffective” teachers teaching in the classroom. Again, while no expert on either side could produce for the Court any hard numbers regarding how many “grossly ineffective” teachers were in fact being protected but such archaic rules and procedures, I would estimate (as based on my years of experience as a teacher) that this number is much lower than many believe it (and perhaps perpetuate it) to be. In fact, there was only one teacher whom I recall, who taught with me in a highly urban school, who I would have classified as grossly ineffective, and also tenured. He was ultimately fired, and quite easy to fire, as he also knew that he just didn’t have it.

Now to be clear, here, I do think that not just “grossly ineffective” but also simply “bad teachers” should be fired, but the indicators used to do this must yield valid inferences, as based on the evidence, as critically and appropriately consumed by the parties involved, after which valid and defensible decisions can and should be made. Whether one calls this due process in a proactive sense, or a wrongful termination suit in a retroactive sense, what matters most, though, is that the evidence supports the decision. This is the very issue at the heart of many of the lawsuits currently ongoing on this topic, as many of you know (see, for example, here).

Finally, where is the evidence, I ask, for many of the declaration included within and throughout this report. A review of the 133 endnotes included, for example, include only a very small handful of references to the larger literature on this topic (see a very comprehensive list of these literature here, here, and here). This is also highly problematic in this piece, as only the usual suspects (e.g., Sandi Jacobs, Thomas Kane, Bill Sanders) are cited to support the assertions advanced.

Take, for example, the following declaration: “a large and growing body of state and local implementation studies, academic research, teacher surveys, and interviews with dozens of policymakers, experts, and educators all reveal a much more promising picture: The reforms have strengthened many school districts’ focus on instructional quality, created a foundation for making teaching a more attractive profession, and improved the prospects for student achievement” (p. 1). Where is the evidence? There is no such evidence, and no such evidence published in high-quality, scholarly peer-reviewed journals of which I am aware. Again, publications released by the National Council on Teacher Quality (NCTQ) and from the Measures of Effective Teaching (MET) studies, as still not externally reviewed and still considered internal technical reports with “issues”, don’t necessarily count. Accordingly, no such evidence has been introduced, by either side, in any court case in which I am involved, likely, because such evidence does not exist, again, empirically and at some unbiased, vetted, and/or generalizable level. While Thomas Kane has introduced some of his MET study findings in the cases in Houston and New Mexico, these might be  some of the easiest pieces of evidence to target, accordingly, given the issues.

Otherwise, the only thing I can say from reading this piece that with which I agree, as that which I view, given the research literature as true and good, is that now teachers are being observed more often, by more people, in more depth, and in perhaps some cases with better observational instruments. Accordingly, teachers, also as per the research, seem to appreciate and enjoy the additional and more frequent/useful feedback and discussions about their practice, as increasingly offered. This, I would agree is something that is very positive that has come out of the nation’s policy-based focus on its “new and improved” teacher evaluation systems, again, as largely required by the federal government, especially pre-Every Student Succeeds Act (ESSA).

Overall, and in sum, “the research reveals that comprehensive teacher-evaluation models are stronger than the sum of their parts.” Unfortunately again, however, this is untrue in that systems based on multiple measures are entirely limited by the indicator that, in educational measurement terms, performs the worst. While such a holistic view is ideal, in measurement terms the sum of the parts is entirely limited by the weakest part. This is currently the value-added indicator (i.e., with the lowest levels of reliability and, related, issues with validity and bias) — the indicator at issue within this particular blog, and the indicator of the most interest, as it is this indicator that has truly changed our overall approaches to the evaluation of America’s teachers. It has yet to deliver, however, especially if to be used for high-stakes consequential decision-making purposes (e.g., incentives, getting rid of “bad apples”).

Feel free to read more here, as publicly available: Grading the Teachers: A Report on Teacher Evaluation Reform in Public Education. See also other claims regarding the benefits of said systems within (e.g., these systems as foundations for new teacher roles and responsibilities, smarter employment decisions, prioritizing classrooms, increased focus on improved standards). See also the recommendations offered, some with which I agree on the observational side (e.g., ensuring that teachers receive multiple observations during a school year by multiple evaluators), and none with which I agree on the value-added side (e.g., use at least two years of student achievement data in teacher evaluation ratings–rather, researchers agree that three years of value-added data are needed, as based on at least four years of student-level test data). There are, of course, many other recommendations included. You all can be the judges of those.

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

The Late Stephen Jay Gould on IQ Testing (with Implications for Testing Today)

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

One of my doctoral students sent me a YouTube video I feel compelled to share with you all. It is an interview with one of my all time favorite and most admired academics — Stephen Jay Gould. Gould, who passed away at age 60 from cancer, was a paleontologist, evolutionary biologist, and scientist who spent most of his academic career at Harvard. He was “one of the most influential and widely read writers of popular science of his generation,” and he was also the author of one of my favorite books of all time: The Mismeasure of Man (1981).

In The Mismeasure of Man Gould examined the history of psychometrics and the history of intelligence testing (e.g., the methods of nineteenth century craniometry, or the physical measures of peoples’ skulls to “objectively” capture their intelligence). Gould examined psychological testing and the uses of all sorts of tests and measurements to inform decisions (which is still, as we know, uber-relevant today) as well as “inform” biological determinism (i.e., “the view that “social and economic differences between human groups—primarily races, classes, and sexes—arise from inherited, inborn distinctions and that society, in this sense, is an accurate reflection of biology). Gould also examined in this book the general use of mathematics and “objective” numbers writ large to measure pretty much anything, as well as to measure and evidence predetermined sets of conclusions. This book is, as I mentioned, one of the best. I highly recommend it to all.

In this seven-minute video, you can get a sense of what this book is all about, as also so relevant to that which we continue to believe or not believe about tests and what they really are or are not worth. Thanks, again, to my doctoral student for finding this as this is a treasure not to be buried, especially given Gould’s 2002 passing.

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

Five “Indisputable” Reasons Why VAMs are Good?

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

Just this week, in Education Week — the field’s leading national newspaper covering K–12 education — a blogger by the name of Matthew Lynch published a piece explaining his “Five Indisputable [emphasis added] Reasons Why You Should Be Implementing Value-Added Assessment.”

I’m going to try to stay aboveboard with my critique of this piece, as best I can, as by the title alone you all can infer there are certainly pieces (mainly five) to be seriously criticized about the author’s indisputable take on value-added (and by default value-added models (VAMs)). I examine each of these assertions below, but I will say overall and before we begin, that pretty much everything that is included in this piece is hardly palatable, and tolerable considering that Education Week published it, and by publishing it they quasi-endorsed it, even if in an independent blog post that they likely at minimum reviewed, then made public.

First, the five assertions, along with a simple response per assertion:

1. Value-added assessment moves the focus from statistics and demographics to asking of essential questions such as, “How well are students progressing?”

In theory, yes – this is generally true (see also my response about the demographics piece replicated in assertion #3 below). The problem here, though, as we all should know by now, is that once we move away from the theory in support of value-added, this theory more or less crumbles. The majority of the research on this topic explains and evidences the reasons why. Is value-added better than what “we” did before, however, while measuring student achievement once per year without taking growth over time into consideration? Perhaps, but if it worked as intended, for sure!

2. Value-added assessment focuses on student growth, which allows teachers and students to be recognized for their improvement. This measurement applies equally to high-performing and advantaged students and under-performing or disadvantaged students.

Indeed, the focus is on growth (see my response about growth in assertion #1 above). What the author of this post does not understand, however, is that his latter conclusion is likely THE most controversial issue surrounding value-added, and on this all topical researchers likely agree. In fact, authors of the most recent review of what is actually called “bias” in value-added estimates, as published in the peer-reviewed Economics Education Review (see a pre-publication version of this manuscript here), concluded that because of potential bias (i.e., “This measurement [does not apply] equally to high-performing and advantaged students and under-performing or disadvantaged students“), that all value-added modelers should control for as many student-level (and other) demographic variables to help to minimize this potential, also given the extent to which multiple authors’ evidence of bias varies wildly (from negligible to considerable).

3. Value-added assessment provides results that are tied to teacher effectiveness, not student demographics; this is a much more fair accountability measure.

See my comment immediately above, with general emphasis added to this overly simplistic take on the extent to which VAMs yield “fair” estimates, free from the biasing effects (never to always) caused by such demographics. My “fairest” interpretation of the current albeit controversial research surrounding this particular issue is that bias does not exist across teacher-level estimates, but it certainly occurs when teachers are non-randomly assigned highly homogenous sets of students who are gifted, who are English Language Learners (ELLs), who are enrolled in special education programs, who disproportionately represent racial minority groups, who disproportionately come from lower socioeconomic backgrounds, and who have been retained in grade prior.

4. Value-added assessment is not a stand-alone solution, but it does provide rich data that helps educators make data-driven decisions.

This is entirely false. There is no research evidence, still to date, that teachers use these data to make instructional decisions. Accordingly, no research is linked to or cited here (as well as elsewhere). Now, if the author is talking about naive “educators,” in general, who make consequential decisions as based on poor (i.e., the oppostie of “rich”) data, this assertion would be true. This “truth,” in fact, is at the core of the lawsuits ongoing across the nation regarding this matter (see, for example, here), with consequences ranging from tagging a teacher’s file for receiving a low value-added score to teacher termination.

5. Value-added assessment assumes that teachers matter and recognizes that a good teacher can facilitate student improvement. Perhaps we have only value-added assessment to thank for “assuming” [sic] this. Enough said…

Or not…

Lastly, the author professes to be a “professor,” pretty much all over the place (see, again, here), although he is currently an associate professor. There is a difference, and folks who respect the difference typically make the distinction explicit and known, especially in an academic setting or context. See also here, however, given his expertise (or the lack thereof) in value-added or VAMs, about what he writes here as “indisputable.”

Perhaps most important here, though, is that his falsely inflated professional title implies, especially to a naive or uncritical public, that what he has to say, again without any research support, demands some kind of credibility and respect. Unfortunately, this is just not the case; hence, we are again reminded of the need for general readers to be critical in their consumption of such pieces. I would have thought Education Week would have played a larger role than this, rather than just putting this stuff “out there,” even if for simple debate or discussion.

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

I recently re-read an article in full that is now 10 years old, or 10 years out, as published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” Unfortunately, I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as most of the time, this serves as a reminder of how things, most of the time, have not changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientists),  Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the point at which the authors wrote this article, besides the aforementioned data and data base issues, were issues with “multiple measures on the same student and multiple teachers instructing each student” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” Authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (stil) comes from influences outside of the teacher’s control. A similar statistic (i.e., 1% to 14%) was similarly and recently highlighted in the recent position statement on VAMs released by the American Statistical Association.
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one-step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here: Putting growth and value-added models on the map: A national overview).
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically insolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” The really only solution here is to use “implicit imputation of values for unobserved gains using the observed scores” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observe gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

Teacher Protests Turned to Riots in Mexico

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit

For those of you who have not yet heard about what has been happening recently in our neighboring country Mexico, a protest surrounding the country’s new US inspired, test-based reforms to improve teacher quality, as based on teachers’ own test performance, as been ongoing since last weekend. Teachers are to pass tests themselves, this time, and if they cannot pass the tests after three attempts, they are to be terminated/replaced (i.e., three strikes, they are to be out). The strikes are occurring primarily in Oaxaca, southern Mexico, and they have thus far led to nine deaths, including the death of one journalist, upwards of 100 injuries, approximately 20 arrests, and the “en masse” termination of many teachers for striking.

As per an article available here, “a massive strike organized by a radical wing of the country’s largest teachers union [the National Coordinator of Education Workers (or CNTE)] turned into a violent confrontation with police” starting last weekend. In Mexico, as it has been in our country’s decade’s past, the current but now prevailing assumption is that the nation’s “failing” education system is the fault of teachers who, as many argue, are those to be directly (and perhaps solely) blamed for their students’ poor relative performance. They are also to be blamed for not “causing” student performance throughout Mexico to improve.

Hence, Mexico is to hold teachers more accountable for what which they do, or more arguably that which they are purportedly not doing or doing well, and this is the necessary action being pushed by Mexico’s President Enrique Peña Nieto. Teacher-level standardized tests are to be used to measure teachers’ competency, instructional approaches, etc., teacher performance reviews are to be used as well, and those who fail to measurably perform are to be let go. Thereafter, the country’s educational situation is to, naturally, improve. This, so goes the perpetual logic. Although this is “an evaluation system that’s completely without precedent in the history of Mexican education.” See also here about how this logic is impacting other countries across the world, as per the Global Education Reform Movement (GERM).

“Here is a viral video (in Spanish) of a teacher explaining why the mandatory tests are so unwelcome: because Mexico is a huge, diverse country (sound familiar?) and you can’t hold teachers in the capital to the same standards as, say, those in the remote mountains of Chiapas. (He also says, to much audience approval, that Peña Nieto, who has the reputation of a lightweight, probably wouldn’t be able to meet the standards he’s imposing on teachers himself.)…And it’s true that some of the teachers in rural areas might not have the same academic qualifications—particularly in a place like Oaxaca, which for all its tourist delights of its capital is one of Mexico’s poorest states, with a large indigenous population and substandard infrastructure.”

Teachers in other Mexican cities are beginning to mobilize, in solidarity, although officially still at this point, these new educational policies are “not subject to negotiation.”

ShareTweet about this on TwitterShare on FacebookEmail this to someoneShare on Google+Share on LinkedInShare on Reddit