Teacher Evaluation and Accountability Alternatives, for A New Year

At the beginning of December I wrote about Diane Ravitch’s really nice piece, published in the Huffington Post, on what she views as a much better paradigm for teacher evaluation and accountability. Diane Ravitch has now posted another piece on similar alternatives, although this one was written by teachers themselves.

I thought this was more than appropriate, especially given a New Year is upon us, and while it might very well be wishful thinking, perhaps at least some of our state policy makers might be willing to think in new ways about what really could be new and improved teacher evaluation systems. Cheers to that!

The main point here, though, is that alternatives do, indeed, exist. Likewise, it’s not that teachers do not want to be held accountable for, and evaluated on, that which they do, but they do want whatever systems are in place (formal or informal) to be appropriate, professional, and fair. How about that for a policy-based resolution?

This is from Diane’s post: The Wisdom of Teachers: A New Vision of Accountability.

Anyone who criticizes the current regime of test-based accountability is inevitably asked: What would you replace it with? Test-based accountability fails because it is based on a lack of trust in professionals. It fails because it confuses measurement with instruction. No doctor ever said to a sick patient, “Go home, take your temperature hourly, and call me in a month.” Measurement is not a treatment or a cure. It is measurement. It doesn’t close gaps: it measures them.

Here is a sound alternative approach to accountability, written by a group of teachers whose collective experience is 275 years in the classroom. Over 900 teachers contributed ideas to the plan. It is a new vision that holds all actors responsible for the full development and education of children, acknowledging that every child is a unique individual.

Its key features:

  • Shared responsibility, not blame
  • Educate the whole child
  • Full and adequate funding for all schools, with less emphasis on standardized testing
  • Teacher autonomy and professionalism
  • A shift from evaluation to support
  • Recognition that in education one size does not fit all

Houston, We Have A Problem: New Research Published about the EVAAS

New VAM research was recently published in the peer-reviewed journal Education Policy Analysis Archives, titled “Houston, We Have a Problem: Teachers Find No Value in the SAS Education Value-Added Assessment System (EVAAS®).” This article was published by a former doctoral student of mine, turned researcher now at a large non-profit: Clarin Collins. I asked her to write a guest post for you all summarizing the full study (linked again here). Here is what she wrote.

As someone who works in the field of philanthropy, completed a doctoral program more than two years ago, and recently became a new mom, you might question why I worked on an academic publication and am writing about it here as a guest blogger? My motivation is simple: the teachers. Teachers continue to be at the crux of the national education reform efforts as they are blamed for the nation’s failing education system and student academic struggles. National and state legislation has been created and implemented as believed remedies to “fix” this problem by holding teachers accountable for student progress as measured by achievement gains.

While countless researchers have highlighted the faults of teacher accountability systems and growth models (unfortunately to fall on the deaf ears of those mandating such policies), very rarely are teachers asked how such policies play out in practice, or for their opinions, as representing their voices in all of this. The goal of this research, therefore, was first, to see how one such teacher evaluation policy is playing out in practice and second, to give voice to marginalized teachers, those who are at the forefront of these new policy initiatives. That being said, while I encourage you to check out the full article [linked again here], I highlight key findings in this summary, using the words of teachers as often as possible to permit them, really, to speak for themselves.

In this study I examined the SAS Education Value-Added Assessment System (EVAAS) in practice, as perceived and experienced by teachers in the Southwest School District (SSD). SSD [a pseudonym] is using EVAAS for high-stakes consequences more than any other district or state in the country. I used a mixed-method design including a large-scale electronic survey to investigate the model’s reliability and validity; to determine whether teachers used the EVAAS data in formative ways as intended; to gather teachers’ opinions on EVAAS’s claimed benefits and statements; and to understand the unintended consequences that might have also occurred as a result of EVAAS use in SSD.

Results revealed that the reliability of the EVAAS model produced split and inconsistent results among teacher participants regardless of subject or grade-level taught. As one teacher stated, “In three years, I was above average, below average and average.” Teachers indicated that it was the students and their varying background demographics who biased their EVAAS results, and much that was demonstrated via their scores was beyond the control of teachers. “[EVAAS] depends a lot on home support, background knowledge, current family situation, lack of sleep, whether parents are at home, in jail, etc. [There are t]oo many outside factors – behavior issues, etc.” that apparently are not controlled or accounted for in the model.

Teachers reported dissimilar EVAAS and principal observation scores, reducing the criterion-related validity of both measures of teacher quality. Some even reported that principals changed their observation scores to match their EVAAS scores; “One principal told me one year that even though I had high [state standardized test] scores and high Stanford [test] scores, the fact that my EVAAS scores showed no growth, it would look bad to the superintendent.” Added another teacher, “I had high appraisals but low EVAAS, so they had to change the appraisals to match lower EVAAS scores.”

The majority of teachers disagreed with SAS’s marketing claims such as EVAAS reports are easy to use to improve instruction, and EVAAS will ensure growth opportunities for all students. Teachers called the reports “vague” and “unclear” and were “not quite sure how to interpret” and use the data to inform their instruction. As one teacher explained, she looked at her EVAAS report “only to guess as to what to do for the next group in my class.”

Many unintended consequences associated with the high-stakes use of EVAAS emerged through teachers’ responses, which revealed, among other things, that teachers felt heightened pressure and competition, which they believed reduced morale and collaboration and encouraged cheating or teaching to the test in an attempt to raise EVAAS scores. Teachers made comments such as, “To gain the highest EVAAS score, drill and kill and memorization yields the best results, as does teaching to the test,” and “When I figured out how to teach to the test, the scores went up,” as well as, “EVAAS leaves room for me to teach to the test and appear successful.”

Teachers realized this emphasis on test scores was detrimental for students, as one teacher wrote, “As a result of the emphasis on EVAAS, we teach less math, not more. Too much drill and kill and too little understanding [for the] love of math… Raising a generation of children under these circumstances seems best suited for a country of followers, not inventors, not world leaders.”

Teachers also admitted they are not collaborating to share best practices as much anymore: “Since the inception of the EVAAS system, teachers have become even more distrustful of each other because they are afraid that someone might steal a good teaching method or materials from them and in turn earn more bonus money. This is not conducive to having a good work environment, and it actually is detrimental to students because teachers are not willing to share ideas or materials that might help increase student learning and achievement.”

While I realize this body of work could simply add to “the shelves” along with those findings of other researchers striving to deflate and demystify this latest round of education reform, if nothing else, I hope the teachers who participated in this study know I am determined to let their true experiences, perceptions of their experiences, and voices be heard.


Again, to find out more information including the statistics in support of the above assertions and findings, please click here to read the full study.

VAM and Observational (Co)Relationships, Again

In the peer-reviewed, open-access journal on which I serve as an Associate Editor*, Education Policy Analysis Archives, a VAM-related article was recently published titled “Sorting out the signal: Do multiple measures of teachers’ effectiveness provide consistent information to teachers and principals?”

In this article, authors Katharine Strunk (Associate Professor at USC), Tracey Weinstein (StudentsFirst), and Reino Makkonen (WestEd Senior Policy Associate) examine the relationships between VAMs and observational scores in the Los Angeles Unified School District (LAUSD). If the purpose of this study sounds familiar, it should, as a good number of researchers have also set out to explore the same correlations in the past, of course using different data. See VAMboozled! posts about such studies, for example, here, here, and here.

In this study researchers “find moderate [positive] correlations between value-added and observation-based measures, indicating that teachers will receive similar but not entirely consistent signals from each performance measure.” The specific correlations they observed range from r = 0.18 (in mathematics) to r = 0.24 (in English/language arts [ELA]), which most others classifying such correlation coefficients would consider negligible to small, respectively, and at best.
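To put these coefficients in perspective, squaring a correlation gives the proportion of variance the two measures share. The short sketch below does that arithmetic for the two values reported above. The function and variable names are illustrative only, and r-squared is a standard interpretive aid, not a statistic taken from the article itself.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    return r ** 2


# Correlations between VAM and observation scores as reported in the post.
reported = {"mathematics": 0.18, "ELA": 0.24}

shared = {subject: shared_variance(r) for subject, r in reported.items()}

for subject, share in shared.items():
    print(f"{subject}: r = {reported[subject]:.2f}, shared variance = {share:.1%}")
```

In other words, the two measures overlap on roughly 3 percent (mathematics) and 6 percent (ELA) of their variance, leaving the rest unexplained by whatever the two have in common.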

Again, similar “negligible” and “small” correlation coefficients have really been found time and time again, consistently making these types of correlation coefficients the most often observed, and hence most supportive of the assertion that VAMs and observational scores are not nearly as highly correlated as they should be… IF they are both in fact effectively measuring at least some of the same thing: teacher effectiveness.

While researchers in this article spin these correlations differently than many if not most would, writing, for example, that the low to moderate [see comments about this classification above] sized correlations they observed “means that, while the two measures give the same general signal about effectiveness on average, they may also provide teachers and administrators with unique information about their levels of effectiveness,” others might replace their term “unique” with a more appropriate adjective like “uncertain.” These correlations might, rather, “provide teachers and administrators with [more uncertain] information about [teachers’] levels of effectiveness” than what might be expected.

That, too, might be a more reasonable statement. As a colleague wrote in an email to me about this article, “they think they have something–but little was found.” I could not agree more. Hence, I suppose it depends on which side of this particular issue each VAM researcher/author stands; although, the sizes of the correlation coefficients are indeed consistent in terms of their negligible to small ranges across studies, which was also found here (but not titled as such).

Hence, again, it is really how researchers define and interpret these correlations, literally for better or worse, that varies. NOT the sizes of the correlations.

The authors ultimately conclude that “[o]verall, unadjusted observation-based measures and VAMs provide teachers with a modestly [added emphasis for word choice again] consistent, but not identical, signal of effectiveness.” While really nobody is looking for identical or perfect (co)relationships here, again, the correlations that are observed are for sure far from modest or pragmatically useful.

*Full disclosure: I served as the editor on this piece, managing the peer review and revision process through to publication. Here, I comment on this piece not as associate editor, but as a consumer of this research now that this manuscript has made it through the blind review process to publication.

US Secretary of Education Duncan “Loves Him Some VAM Sauce”

US Secretary of Education “Arne [Duncan] loves him some VAM sauce, and it is a love that simply refuses to die,” writes Peter Greene in a recent Huffington Post post. Duncan’s (simple) mind loves it because, indeed, the plan is overly simplistic. All that the plan requires are two simple ingredients: “1) A standardized test that reliably and validly measures how much students know 2) A super-sciency math algorithm that will reliably and validly strip out all influences except that of the teacher.”

Sticking with the cooking metaphor, however, Greene writes, “VAM is no spring chicken, and perhaps when it was fresh and young some affection for it could be justified. After all, lots of folks, including non-reformy folks, like the idea of recognizing and rewarding teachers for being excellent. But how would we identify these pillars of excellence? That was the puzzler for ages until VAM jumped up to say, ‘We can do it! With Science!!’ We’ll give some tests and then use super-sciency math to filter out every influence that’s Not a Teacher and we’ll know exactly how much learnin’ that teacher poured into that kid.”

“Unfortunately, we don’t have either,” and we likely never will. Why this is the case is also highlighted in this post, with Greene explicitly citing three main sources for support: the recent oppositional statement released by the National Association of Secondary School Principals, the oppositional statement released this past summer by the American Statistical Association, and our mostly-oppositional blog VAMboozled! (by Audrey Amrein-Beardsley). Hopefully getting the research into the hands of educational practitioners, school board members, the general public, and the like is indeed “adding value” in the purest sense of this phrase’s meaning. I sure hope so!

Anyhow, in this post Greene also illustrates and references a nice visual (with a side of sarcasm) explaining the complexity behind VAMs in pretty clear terms. I also paste this illustration here, which Greene references as originally coming from a blog post from Daniel Katz, Ph.D. but I have seen similar versions elsewhere and prior (e.g., a New York Times article here).


Greene ultimately asks why Duncan is still so fixated on a policy that is disproportionately loaded and ever-increasingly rejected and unsupported.

Greene’s answer: “[I]f Duncan were to admit that his beloved VAM is a useless tool…then all his other favorite [reform-based] programs would collapse” around him… “Why do we give the Big Test? To measure teacher effectiveness. How do we rank and evaluate our schools? By looking at teacher effectiveness. How do we find the teachers that we are going to move around so that every classroom has a great teacher? With teacher effectiveness ratings. How do we institute merit pay and a career ladder? By looking at teacher effectiveness. How do we evaluate every single program instituted in any school? By checking to see how it affects teacher effectiveness. How do we prove that centralized planning (such as Common Core) is working? By looking at teacher effectiveness. How do we prove that corporate involvement at every stage is a Good Thing? By looking at teacher effectiveness. And by ‘teacher effectiveness,’ we always mean VAM (because we [i.e., far-removed educational reformers] don’t know any other way, at all).”

If Duncan’s “magic VAM sauce is a sham and a delusion and a big bowl of nothing,” his career would literally fold in.

To read more from Greene, do click here to read his post in full.

On Rating The Effectiveness of Colleges of Education Using VAMs

A VAMboozled! follower – Terry Ward (El Paso, TX), a retired writer and statistician married to a veteran and also current Title I music teacher – sent this to me via email after reading a recent article in The New York Times. The article, released around Thanksgiving, was about how the “U.S. Wants Teacher Training Programs to Track How [College of Education] Graduates’ Students Perform.” This would be done, of course, using in part value-added models (VAMs).

I thought it important to share with you all what Terry wrote in response, with his permission:

It has recently been proposed that colleges of education be rated and evaluated on the basis of how the students of their graduates perform on standardized tests. As they say, however, the devil is in the details. Let’s look at how this might work in the case of my wife, a teacher of some 40+ years’ experience who graduated from a college of education over 40 years ago.

Problem 1: Standardized student scores are problematic enough for individual teachers. Remember, the American Statistical Association (ASA) estimates that the teacher influence explains somewhere between one and fourteen percent of the variation in test scores. The college is even further removed from the student taking the test, so the question becomes: how small must the college’s contribution be?

Problem 2: What is the decay function for college influence? Simply put, my wife graduated with her initial teaching degree over forty years ago. Any influence of the college upon her teaching is, therefore, minimal to non-existent. One assumes such an influence fades over time, so what is the shape for this decay and how will the US Department of Education (DOE) evaluators measure it? What is the half-life of educational influence?

Problem 3: If we assume that the easiest year to measure college influence is the first year of teaching, how might the DOE extract college of education factors from the basic issues of first-year inexperience in real-world teaching?

Problem 4: What happens when additional schooling is factored in? My wife has a Master’s degree. Does that influence her teaching and how is the DOE evaluation to split between her very distant B.A. degree (40 years ago) and the slightly more recent Master’s degree (30 years ago)?

Problem 5: What of non-degree but still credentialed education? My wife is also a graduate of a National Writing Project summer writing camp with graduate hours in writing and writing pedagogy. Who is to get credit if this has (also) improved her students’ test scores? And, how is the DOE to determine or tell who is to get their appropriate credit?

Problem 6: When a teacher changes schools (perhaps to teach impoverished youth), her student scores are likely to change dramatically. Does the DOE propose to re-evaluate such a teacher’s college experience and downgrade them as well given where a teacher decides to teach?

I am sure the reader can come up with other absurd problems with the DOE proposal. I am simply reminded of the old saying that “for every complex problem, there exists a good sounding simple solution that is completely wrong!” This certainly seems to be the case here.

Rethinking Value-Added Models (VAMs): A (Short) YouTube Version

Following up on a recent post about a recently released review of my book “Rethinking Value-Added Models in Education: Critical Perspectives on Tests and Assessment-Based Accountability,” I thought it important to share with you all a condensed, video- and cartoon-based version of the (very general) points highlighted within my book.

This YouTube video, also titled “Rethinking Value-Added Models in Education,” was created by one of my former doctoral students and one of the most amazing artists I know – Dr. Taryl Hargens.

Do give it a watch, and of course feel free to share out with others!!

VAM Updates from An Important State to Watch: Tennessee

The state of Tennessee, the state in which our beloved education-based VAMs were born (see here and here), has been one state we have been following closely on VAMboozled! throughout the last year’s blog posts.

Throughout this period we have heard from concerned administrators and teachers in Tennessee (see, for example, here and here). We have written about how something called subject area bias also exists, unheard of in the VAM-related literature until a Tennessee administrator sent us a lead and we analyzed Tennessee’s data (see here and here, and also an article written by my former graduate student, now Dr. Jessica Holloway-Libell, forthcoming in the esteemed Teachers College Record). We followed closely the rise and recent fall of the career of Tennessee’s Education Commissioner Kevin Huffman (see, for example, here and here, respectively). And we have watched how the Tennessee Board of Education and other leaders in the state have met, attempted to rescind, and actually rescinded some of the policy requirements that tie teachers to their VAM scores, again as determined by teachers’ students’ performance as calculated by the familiar Tennessee Value-Added Assessment System (TVAAS), and its all-too-familiar mother-ship, the Education Value-Added Assessment System (EVAAS).

Now, following the (in many ways celebrated) exit of Commissioner Huffman, it seems the state is taking an even more reasonable stance towards VAMs and their use(s) for teacher accountability…at least for now.

As per a recent article in the Tennessee Education Report (see also an article in The Tennessean here), Governor Bill Haslam announced this week that “he will be proposing changes to the state’s teacher evaluation process in the 2015 legislative session,” the most significant change being “to reduce the weight of value-added data on teacher evaluations during the transition [emphasis added] to a new test for Tennessee students.” New tests are to be developed in 2016, which are unlikely to be part of the Common Core and are, rather, to be significantly informed by teachers in consultation with Measurement Inc.

Anyhow, as per Governor Haslam’s press release (as also cited in this article), he intends to do the following three things:

  1. Adjust the weighting of student growth data in a teacher’s evaluation so that the new state assessments in ELA [English/language arts] and math will count 10 percent of the overall evaluation in the first year of administration (2016), 20 percent in year two (2017) and 35 percent in year three (2018). Currently 35 percent of an educator’s evaluation is comprised of student achievement data based on student growth;
  2. Lower the weight of student achievement growth for teachers in non-tested grades and subjects from 25 percent to 15 percent;
  3. And make explicit local school district discretion in both the qualitative teacher evaluation model that is used for the observation portion of the evaluation as well as the specific weight student achievement growth in evaluations will play in personnel decisions made by the district.
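To see what the phased weights in the first point could mean for an individual teacher, the sketch below computes a simple weighted composite for each year. Everything beyond the three weights is an assumption made for illustration: the common 1-to-5 scale, the example scores, the function name, and the simplification that all weight not given to growth goes to the other measures combined. It is not the state's actual scoring formula.

```python
# Growth (VAM) weights for the three phase-in years named in the proposal.
GROWTH_WEIGHT_BY_YEAR = {2016: 0.10, 2017: 0.20, 2018: 0.35}


def composite_score(growth_score: float, other_score: float, growth_weight: float) -> float:
    """Weighted average of a growth score and all other measures combined.

    Assumes both scores sit on the same scale and that whatever weight is not
    given to growth goes to the other measures (an illustrative simplification).
    """
    return growth_weight * growth_score + (1.0 - growth_weight) * other_score


# Hypothetical teacher: low growth score (2), high score on everything else (4).
composites = {
    year: composite_score(2.0, 4.0, weight)
    for year, weight in GROWTH_WEIGHT_BY_YEAR.items()
}

for year, score in composites.items():
    weight = GROWTH_WEIGHT_BY_YEAR[year]
    print(f"{year}: growth weight {weight:.0%}, composite {score:.2f}")
```

Under these assumed scores, the same teacher's composite slides from about 3.8 to about 3.3 over the three years without any change in the underlying measures, purely because the growth weight climbs back to 35 percent.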

Obviously, the latter two points (i.e., #2 and #3) demonstrate steps in the right direction: #2 to be a bit more reasonable about whether teachers who don’t actually teach students in the tested subject areas should be held accountable for out-of-subject scores (although I’d vote for 0% weight here) and #3 to hand over to districts more local discretion and control over how their teacher evaluations are conducted (although VAMs still must be a part).

I’m less optimistic about the first intended change, however, as “the proposal does not go as far as some have proposed” (e.g., the American Statistical Association (ASA) as per some of their key points in their position statement on VAMs). This first change still supports what is still a “heroic” assumption that VAMs do work, and in this case will get better over time (i.e., 2016, 2017, 2018) with “new and improved tests,” so that the weights in place now (i.e., 35%) might be more appropriate then, and hence reached, or just reached regardless of appropriateness at that time…

The Nation’s High School Principals’ (Working) Position Statement on VAMs

The Board of Directors of the National Association of Secondary School Principals (NASSP) officially released a working position statement on VAMs that was also recently referenced in an article in Education Week here (“Principals’ Group Latest to Criticize ‘Value Added’ for Teacher Evaluations”) and a Washington Post post here (“Principals Reject ‘Value-Added’ Assessment that Links Test Scores to Educators’ Jobs”).

I have pasted this statement below, and also link to it here. The position’s highlights follow, as also summarized in the above links and the position statement itself:

  • “[T]est-score-based algorithms for measuring teacher quality aren’t appropriate.”
  • “[T]he timing for using [VAMs] comes at a terrible time, just as schools adjust to demands from the Common Core State Standards and other difficult new expectations for K-12 students.”
  • “Principals are concerned that the new evaluation systems are eroding trust and are detrimental to building a culture of collaboration and continuous improvement necessary to successfully raise student performance to college and career-ready levels.”
  • “Value-added systems, the statement concludes, should be used to measure school improvement and help determine the effectiveness of some programs and instructional methods; they could even be used to tailor professional development. But they shouldn’t be used to make ‘key personnel decisions’ about individual teachers.”
  • “[P]rincipals often don’t use value-added data even where it exists, largely because a lot of them don’t trust it.”
  • The position statement also quotes Mel Riddile, a former National Principal of the Year and chief architect of the NASSP statement, who says: “We are using value-added measurement in a way that the science does not yet support. We have to make it very clear to policymakers that using a flawed measurement both misrepresents student growth and does a disservice to the educators who live the work each day.”

See also two other great blog posts re: the potential impact the NASSP’s working statement might/should have, also, on America’s current VAM-situation. The first external post comes from the blog “curmudgucation” and discusses in great detail the highlights of the NASSP’s post. The second external post comes from a guest post on Diane Ravitch’s blog.

Below, again, is the full post as per the website of the NASSP:


To determine the efficacy of the use of data from student test scores, particularly in the form of Value-Added Measures (VAMs), to evaluate and to make key personnel decisions about classroom teachers.


Currently, a number of states either are adopting or have adopted new or revamped teacher evaluation systems, which are based in part on data from student test scores in the form of value-added measures (VAM). Some states mandate that up to fifty percent of the teacher evaluation must be based on data from student test scores. States and school districts are using the evaluation systems to make key personnel decisions about retention, dismissal and compensation of teachers and principals.

At the same time, states have also adopted and are implementing new, more rigorous college- and career-ready standards. These new standards are intended to raise the bar from having every student earn a high school diploma to the much more ambitious goal of having every student be on-target for success in post-secondary education and training.

The assessments accompanying these new standards depart from the old, much less expensive, multiple-choice style tests to assessments that include constructed responses. These new assessments demand higher-order thinking and up to a two-year increase in expected reading and writing skills. Not surprisingly, the newness of the assessments combined with increased rigor has resulted in significant drops in the number of students reaching “proficient” levels on assessments aligned to the new standards.

Herein lies the challenge for principals and school leaders. New teacher evaluation systems demand the inclusion of student data at a time when scores on new assessments are dropping. The fears accompanying any new evaluation system have been magnified by the inclusion of data that will get worse before it gets better. Principals are concerned that the new evaluation systems are eroding trust and are detrimental to building a culture of collaboration and continuous improvement necessary to successfully raise student performance to college and career-ready levels.

Specific questions have arisen about using value-added measurement (VAM) to retain, dismiss, and compensate teachers. VAMs are statistical measures of student growth. They employ complex algorithms to figure out how much teachers contribute to their students’ learning, holding constant factors such as demographics. And so, at first glance, it would appear reasonable to use VAMs to gauge teacher effectiveness. Unfortunately, policy makers have acted on that impression over the consistent objections of researchers who have cautioned against this inappropriate use of VAM.

In a 2014 report, the American Statistical Association urged states and school districts against using VAM systems to make personnel decisions. A statement accompanying the report pointed out the following:

  • “VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  • VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  • Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  • VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  • Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.”

Another peer-reviewed study funded by the Gates Foundation and published by the American Educational Research Association (AERA) stated emphatically, “Value-Added Performance Measures Do Not Reflect the Content or Quality of Teachers’ Instruction.” The study found that “state tests and these measures of evaluating teachers don’t really seem to be associated with the things we think of as defining good teaching.” It further found that some teachers who were highly rated on student surveys, classroom observations by principals and other indicators of quality had students who scored poorly on tests. The opposite also was true. “We need to slow down or ease off completely for the stakes for teachers, at least in the first few years, so we can get a sense of what do these things measure, what does it mean,” the researchers admonished. “We’re moving these systems forward way ahead of the science in terms of the quality of the measures.”

Researcher Bruce Baker cautions against using VAMs even when test scores count less than fifty percent of a teacher’s final evaluation.  Using VAM estimates in a parallel weighting system with other measures like student surveys and principal observations “requires that VAM be considered even in the presence of a likely false positive. NY legislation prohibits a teacher from being rated highly if their test-based effectiveness estimate is low. Further, where VAM estimates vary more than other components, they will quite often be the tipping point – nearly 100% of the decision even if only 20% of the weight.”
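Baker's point about a component's nominal weight versus its actual influence can be illustrated with a small calculation. The sketch below is not Baker's own analysis: it assumes the components are uncorrelated, uses made-up standard deviations, and the function name is hypothetical. Under those assumptions, each component's share of the composite's variance is its squared weighted spread divided by the total.

```python
def variance_share(weights, sds):
    """Share of composite-score variance contributed by each component.

    Assumes the components are uncorrelated (an illustrative simplification),
    so the composite variance is the sum of (weight * sd) ** 2 terms.
    """
    contributions = [(w * sd) ** 2 for w, sd in zip(weights, sds)]
    total = sum(contributions)
    return [c / total for c in contributions]


# Nominal weights: 20% VAM, 80% all other measures combined.
weights = [0.20, 0.80]

# Hypothetical spreads: the other measures have SD 1; VAM varies 1x, 4x, 10x as much.
for vam_sd in (1.0, 4.0, 10.0):
    vam_share, other_share = variance_share(weights, [vam_sd, 1.0])
    print(f"VAM SD = {vam_sd:>4}: VAM drives {vam_share:.0%} of composite variance")
```

With equal spreads the VAM component drives only a small share of the differences between teachers, but once it varies four times as much as the other measures it accounts for half of the composite's variance despite its 20 percent nominal weight, and at ten times as much it dominates.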

Stanford’s Edward Haertel takes the objection to using VAMs for personnel decisions one step further: “Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. The information they provide is simply not good enough to use in that way. It is not just that the information is noisy. Much more serious is the fact that the scores may be systematically biased for some teachers and against others, and major potential sources of bias stem from the way our school system is organized. No statistical manipulation can assure fair comparisons of teachers working in very different schools, with very different students, under very different conditions.”

Still other researchers believe that VAM is flawed at its very foundation. Linda Darling-Hammond et al. point out that the use of test scores via VAMs assumes “that student learning is measured by a given test, is influenced by the teacher alone, and is independent from the growth of classmates and other aspects of the classroom context. None of these assumptions is well supported by current evidence.” Other factors, including class size, instructional time, home support, peer culture, and summer learning loss, also affect student achievement. Darling-Hammond points out that VAMs are inconsistent from class to class and year to year. VAMs are based on the false assumption that students are randomly assigned to teachers. VAMs cannot account for the fact that “some teachers may be more effective at some forms of instruction…and less effective in others.”

Guiding Principles

  • As instructional leader, “the principal’s role is to lead the school’s teachers in a process of learning to improve teaching, while learning alongside them about what works and what doesn’t.”
  • The teacher evaluation system should aid the principal in creating a collaborative culture of continuous learning and incremental improvement in teaching and learning.
  • Assessment for learning is critical to continuous improvement of teachers.
  • Data from student test scores should be used by schools to move students to mastery and a deep conceptual understanding of key concepts as well as to inform instruction, target remediation, and to focus review efforts.
  • NASSP supports recommendations for the use of “multiple measures” to evaluate teachers as indicated in the 2014 “Standards for Educational and Psychological Testing” measurement standards released by leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).


  • Successful teacher evaluation systems should employ “multiple classroom observations across the year by expert evaluators looking to multiple sources of data, and they provide meaningful feedback to teachers.”
  • Districts and States should encourage the use of Peer Assistance and Review (PAR) programs, which use expert mentor teachers to support novice teachers and struggling veteran teachers, and which have been proven to be an effective system for improving instruction.
  • States and Districts should allow the use of teacher-constructed portfolios of student learning, which are being successfully used as a part of teacher evaluation systems in a number of jurisdictions.
  • VAMs should be used by principals to measure school improvement and to determine the effectiveness of programs and instructional methods.
  • VAMs should be used by principals to target professional development initiatives.
  • VAMs should not be used to make key personnel decisions about individual teachers.
  • States and Districts should provide ongoing training for Principals in the appropriate use of student data and VAMs.
  • States and Districts should make student data and VAMs available to principals at a time when decisions about school programs are being made.
  • States and Districts should provide the resources and time principals need in order to make the best use of data.

A (Great) Review of My Book on Value-Added Models (VAMs)…by Rachael Gabriel

My book, “Rethinking Value-Added Models in Education: Critical Perspectives on Tests and Assessment-Based Accountability,” was just externally reviewed by Rachael Gabriel, an Assistant Professor at the University of Connecticut, in Education Review: A Multilingual Journal of Book Reviews. To read the (fantastic!!) review, click here or read what I’ve pasted below directly from Dr. Gabriel’s review. For those still interested in the book, you can order it on Amazon here. To see other reviews as well, click here.


Dr. Amrein-Beardsley’s recent book, Rethinking Value-Added Models in Education (Routledge, 2014), is the single most comprehensive resource on the uses and abuses of Value-added Measurement (VAM) in U.S. education policy. As the centerpiece of several new generation teacher evaluation policies, VAM has been the subject of a firestorm of media attention, position statements, journal articles, policy briefs and a wide range of scholarly debate.

Oddly, or perhaps thankfully, for all its dramatic heralding, only twelve states have recently adopted VAM as part of their teacher evaluation or compensation policies – due in no small part to the criticism Amrein-Beardsley and others have made available to the voting public (e.g., Amrein-Beardsley & Collins, 2012; Baker et al., 2010; Baker, Oluwole, & Greene, 2013). Though only twelve states have VAM written into their teacher evaluation policies, others use it at the district or school level, for purposes as varied as program evaluation, teacher compensation, educational research and legal challenges. Therefore, this book isn’t just for the hundreds of thousands of teachers who live and work in those twelve unlucky states (e.g., New York, Florida, Tennessee, Ohio, etc.), it’s about a measurement tool that captured America’s imagination – convincing us that we could accurately and reliably measure a teacher’s impact on their students – against her better judgment.

There is something for everyone in this text: from definitions for those with no background in statistics to thorough discussions of the reigning positions and debates between researchers. It therefore serves as a primer on education policy as well as a deep dive into the specifics of VAM. The text rises far above other options for understanding VAM because it successfully combines technical detail with social activism. This not only displays the range of Amrein-Beardsley’s thinking about the subject, but also the depth of her understanding of VAM’s technical merits and social implications.

For example, Amrein-Beardsley uses the introductory chapter to establish the place of VAM in the history of social policy and recent education policy. VAM is framed as one example in a long history of social engineering experiments, in line with what she calls the “Measure & Punish (M&P) Theory of Change” that characterizes most contemporary education reform efforts. Within the introduction, Amrein-Beardsley also crafts a refreshingly unapologetic statement of her own position on the issue – linking it to her background as a mathematics teacher and educational researcher. This transparency and context allow her to relinquish claims to neutral objectivity and replace them with clearly argued, but passionate outrage, which makes the story of VAM both compelling and urgent.

Historicizing VAM and its surrounding controversies allows the reader to assume some analytic distance, though these debates are still very real and very raw. Within the 211 pages, she addresses VAM as both a statistical tool with specific features and assumptions, and a policy tool with socially constructed meanings and implications. This dual treatment, and the belief that data does not and never could “speak for itself” constructs a version of VAM in which the tool takes on a life of its own, with people and policies positioned and defined in response to its divisive construction (Gabriel & Lester, 2013). This characterization of VAM is also what makes this text so readable. VAM itself becomes a fascinating character in a larger story about education policy. Its role in recent policies, public debates and lawsuits is nothing short of operatic in quality. The sweeping nature of the story of VAM mixed with details on its technical merits creates a text that is encyclopedic in scope: with entire chapters devoted to discussions of VAM’s assumptions, reliability and biases.

Amrein-Beardsley’s emotion as a writer (sometimes anger, sometimes passion) simmers tangibly below the surface of her prose, and explodes once in a while in a long, complex, exclamation-pointed sentence. It seems that when you know as much as she does about VAM, and have to face the general ignorance and perverted rhetoric on the subject, neutrality isn’t an option. This treatment of the topic as one of urgency and consequence is not only rooted in a moral sense of outrage, but a deeply personal sense that the public and its teachers are being misled and mischaracterized. As Amrein-Beardsley points out, the fact that the public has been “VAMboozled” is not unique in the history of education or other social policy. It is, however, avoidable, if the public has access to a fuller understanding of what’s involved and what’s at stake when policies include VAM. This book represents one of many steps towards such access.

As such, each chapter reads like the transcript of a TED Talk, with multi-part lists (e.g., “the eight things that are now evident about VAM”), explained with tight examples and logic checks along the way. The use of multi-part lists, extensive citations and the added feature of a list of “top ten assertions” at the end of each chapter allow readers to pick and choose the sections of greatest interest. This is important because the combined depth and breadth of information about VAM can make some chapters feel skimmable. For example, those who are most interested in how sorting bias occurs are not likely to need the explicit definitions of terms like reliability and validity. Those who are interested in the history of VAM may not need the context in the larger scheme of education reform, or the history of government involvement in social services. Still, since so many arguments about VAM are based on popular, vague and/or mixed understandings of its foundational concepts and tools of logic, this thorough treatment is a welcome addition to the literature. The end-of-chapter assertion boxes also demonstrate Amrein-Beardsley’s unique ability to lay it all out there: the good, bad and the normally obfuscated “truth” about VAM.

The crowning feature of this book is its final chapter, in which alternatives and solutions to VAM are presented. Though Amrein-Beardsley buys into the current fascination with using “multiple measures” for determining teacher effectiveness, she is quick to point out that combining the strengths and weaknesses of imperfect indicators “does not increase overall levels of reliability and validity” (p. 211). Instead, she advocates for the strategic pursuit of face validity, in the absence of any plausible construct validity, by suggesting that individuals’ professional judgment and local definitions of effectiveness be weighed as heavily as statistical markers. Her unique solution, one that has not yet been empirically validated or even widely discussed in policy circles, is essentially a panel of supervisors and peers evaluating and rating based on local definitions of effectiveness, and informed by local and/or research-based measures. She argues that multiple local stakeholders are the most powerful tool for interpreting observable indicators of effectiveness.

For those things that we cannot see – the outcomes or outputs of effective teaching (e.g., test scores, graduation rates, etc.) – both local and externally validated tools should be used. This proposal is unique in its insistence on multiplicity and its faith in the importance of human judgment. People (not a single representative of states or districts as collectives) should be responsible for selecting and designing (not one or the other) the measures of effectiveness. And people (not observation tools or even single observers) should be responsible for discussing and analyzing the inputs and processes of effectiveness in teaching.

By way of a summary of this suggested solution, Amrein-Beardsley writes:

This solution does not rely solely on mathematics and the allure of numbers or grandeur of objectivity that too often comes along with numerical representation, especially in the social sciences. This solution does not trust the test scores too often (and wrongly) used to assess teacher quality, simply because the test output is already available (and paid for) and these data can be represented numerically, mathematically, and hence objectively. This solution does not marginalize human judgment, but rather embraces human judgment for what it is worth, as positioned and operationalized within a more professional, democratically-based, and sound system of judgment, decision-making, and support.

Current teacher evaluation policies that claim to rely on “multiple measures of effectiveness” may still only require one observer’s opinion, and the combination of one-of-each type of measure for student achievement, student growth and/or other outcomes. This combination of single probes stands in contrast to Amrein-Beardsley’s proposed combination of multiple probes for each aspect of an evaluation (inputs, processes, outputs).

Also unique to Amrein-Beardsley’s solution is the faith in individual teachers and supervisors to act as professionals – a faith that many researchers question as study after study points out the weak correlations between principal ratings and student achievement scores. For the most part, media outlets and even researchers have chalked up discrepancies to human error – blaming individual judgment for any discrepancy between observation scores and VAM ratings. Far from being surprised or discouraged by these discrepancies, Amrein-Beardsley suggests just the opposite: that VAM ratings are the data point that needs revision and oversight, and the professional judgment of educators is what should be held as the gold standard.

This faith that individuals act in the best interest of students and the profession is also reflected in how Amrein-Beardsley recounts the stories of Houston area teachers who were fired based on VAM ratings. Their value as professionals is never called into question. Rather the value of VAM as a policy tool is put on trial, and it loses miserably. This respect for teachers and belief in people-not-numbers pervades other examples and arguments across the eight chapters. It is clear that Amrein-Beardsley is writing to an audience she believes in, and one that deserves to be equipped with all the facts.

The depth and breadth that characterize this text are in themselves a testament to Amrein-Beardsley’s belief in the power of people in plural. She sets out to define and explain a topic that is known for being shrouded in political spectacle (Gabriel & Allington, 2011). Despite the incredible volume of reporting and editorializing VAM has seen in the news, she examines still-unexamined assumptions at length, assuming that there are those in the world who want and need to have them explained. In other words, she dares to believe that people want and need to know the whole story, and that knowing this could change something about how we view teachers, testing, measurement and policy.

Finally, this belief in people is underscored by the book’s dedication page. A page in the front matter shows a picture of the Cambodian orphanage to which proceeds from the book will be donated. If nothing else good comes of VAM’s ignominious presence in education research literature, it is good that we have the whole story laid out here. And it is good that what we invest in reading and sharing this text will mean something beyond its audience of US readers. From Cambodia to Houston, this text represents one more step in a mission to bring a collective intelligence, compassion and truth to bear when issues of policy become threats to equity, despite our better judgment.

  • Amrein-Beardsley, A., & Collins, C. (2012). The SAS Education Value-Added Assessment System (SAS® EVAAS®) in the Houston Independent School District (HISD): Intended and unintended consequences. Education Policy Analysis Archives, 20(12). Retrieved from http://epaa.asu.edu/ojs/article/view/1096
  • Baker, E., Barton, P., Darling-Hammond, L., Haertel, E., Ladd, H., Linn, R., et al. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute. Retrieved from http://www.epi.org/publication/bp278/
  • Baker, B., Oluwole, J., & Greene, P. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21. Retrieved from http://epaa.asu.edu/ojs/article/view/1298
  • Gabriel, R., & Lester, J. N. (2013). The romance quest of education reform: A discourse analysis of The LA Times’ reports on value-added measurement teacher effectiveness. Teachers College Record, 115(12). Retrieved from http://www.tcrecord.org/library/abstract.asp?contentid=17252
  • Gabriel, R., & Allington, R. (2011, April). Teacher effectiveness research and the spectacle of effectiveness policy. Paper presented at the annual convention of the American Educational Research Association (AERA), New Orleans, LA.

The 24 Articles Published about VAMs in All American Educational Research Association (AERA) Journals

For some time now on VAMboozled!, we have made available a set of reading lists for you all to read and consume as you wish. These lists include what I consider to be the “Top 15” suggested research articles (here), the “Top 25” suggested research articles (here), all suggested research articles, books, etc. (here), and also, as pertinent for this post, a list of all 24 VAM articles ever published in all peer-reviewed journals sponsored by the esteemed American Educational Research Association (AERA) here.

It seems this idea has caught on…

AERA recently released their list of articles, with links to said articles, here. However, they referenced only the VAM articles published in AERA journals since 2009. VAMboozled!’s list of 24 (again here) includes all articles ever published in AERA journals without a time constraint, given that the first articles on this topic were released in 2003.

Here’s how AERA justified their post: “Over the past decade, the use of value‐added models (VAM) in teacher and administrator evaluation has grown nationally, while becoming one of education’s most controversial issues. Research evidence on the reliability and validity of VAM, and the consequences of using such indicators in educator evaluation, is still accumulating. In recent years, AERA’s journals have examined many aspects of VAM…” and these articles are, again, published here.

Enjoy (or not) should you take the time to peruse.