VAMs Are Never “Accurate, Reliable, and Valid”

The Educational Researcher (ER) journal is the highly esteemed, flagship journal of the American Educational Research Association. It may sound familiar in that many of what I view to be the best research articles published about value-added models (VAMs) have appeared in ER (see my full reading list on this topic here), but more specific to this post, the recent “AERA Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs” was also published in this journal (see also a prior post about this position statement here).

After this position statement was published, however, many critiqued AERA and the authors of the piece for going too easy on VAMs, as well as on VAM proponents and users, and for not taking a firmer stance against VAMs given the current research. The lightest of these critiques, for example, authored by Brookings Institution affiliate Michael Hansen and the University of Washington Bothell’s Dan Goldhaber, was highlighted here, after which Boston College’s Dr. Henry Braun responded, also here. Some believed even this response to be too, let’s say, collegial or symbiotic.

Just this month, however, ER released a critique of this same position statement, as authored by Steven Klees, a Professor at the University of Maryland. Klees wrote, essentially, that the AERA Statement “only alludes to the principal problem with [VAMs]…misspecification.” To isolate the contributions of teachers to student learning is not only “very difficult,” but “it is impossible—even if all the technical requirements in the [AERA] Statement [see here] are met.”

Rather, Klees wrote, “[f]or proper specification of any form of regression analysis…All confounding variables must be in the equation, all must be measured correctly, and the correct functional form must be used. As the 40-year literature on input-output functions that use student test scores as the dependent variable make clear, we never even come close to meeting these conditions…[Hence, simply] adding relevant variables to the model, changing how you measure them, or using alternative functional forms will always yield significant differences in the rank ordering of teachers’…contributions.”

Therefore, Klees argues “that with any VAM process that made its data available to competent researchers, those researchers would find that reasonable alternative specifications would yield major differences in rank ordering. Misclassification is not simply a ‘significant risk’— major misclassification is rampant and inherent in the use of VAM.”
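Klees’s point about specification sensitivity is easy to illustrate with a toy simulation. The sketch below uses entirely hypothetical data (the variable names, coefficients, and seed are mine, not any district’s data or Klees’s own analysis): two “reasonable” regression specifications, differing only in whether a class-level confounder is included, yield visibly different teacher rank orderings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 20, 25

true_effect = rng.normal(0, 1, n_teachers)   # hypothetical "true" teacher effects
class_ses = rng.normal(0, 1, n_teachers)     # class-level confounder (e.g., SES)

# Flat student-level data set: each teacher observed with n_students students
t_idx = np.repeat(np.arange(n_teachers), n_students)
prior = rng.normal(0, 1, n_teachers * n_students)
ses = class_ses[t_idx] + rng.normal(0, 0.5, n_teachers * n_students)
post = (0.7 * prior + 0.8 * ses + true_effect[t_idx]
        + rng.normal(0, 1, n_teachers * n_students))

def teacher_ranks(covariate_list):
    """OLS of post-test on covariates plus teacher dummies; return teacher ranks."""
    dummies = np.eye(n_teachers)[t_idx]
    X = np.column_stack(covariate_list + [dummies])
    beta, *_ = np.linalg.lstsq(X, post, rcond=None)
    effects = beta[len(covariate_list):]
    return np.argsort(np.argsort(effects))   # rank order of estimated effects

rank_a = teacher_ranks([prior])        # specification A: prior score only
rank_b = teacher_ranks([prior, ses])   # specification B: prior score + SES

shifts = np.abs(rank_a - rank_b)
print("largest rank shift:", shifts.max())
print("teachers moving 3+ places:", int((shifts >= 3).sum()))
```

Because specification A omits a variable that is correlated with teacher assignment, each teacher’s estimated effect absorbs some of the confounder, and the rank ordering moves when the confounder is added. This is the omitted-variable mechanism behind the “reasonable alternative specifications” problem Klees describes.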
Klees concludes: “The bottom line is that regardless of technical sophistication, the use of VAM is never [and, perhaps, never will be] ‘accurate, reliable, and valid’ and will never yield ‘rigorously supported inferences’” as expected and desired.
Citation: Klees, S. J. (2016). VAMs Are Never “Accurate, Reliable, and Valid.” Educational Researcher, 45(4), 267. doi: 10.3102/0013189X16651081

No More EVAAS for Houston: School Board Tie Vote Means Non-Renewal

Recall from prior posts (here, here, and here) that seven teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking HISD to federal court over how their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), are being used, and allegedly abused, in a district that has tied more high-stakes consequences to value-added output than any other district or state in the nation. The case, Houston Federation of Teachers, et al. v. Houston ISD, is ongoing.

Just announced, however, is that the HISD school board, via a 3-3 split vote late last Thursday night, elected to no longer pay an annual $680K to SAS Institute Inc. to calculate the district’s EVAAS value-added estimates. As per an HFT press release (below), HISD “will not be renewing the district’s seriously flawed teacher evaluation system, [which is] good news for students, teachers and the community, [although] the school board and incoming superintendent must work with educators and others to choose a more effective system.”


Apparently, HISD had been holding onto the EVAAS, despite the research surrounding the EVAAS in general and in Houston in particular, because the district has received (and is still set to receive) over $4 million in federal grant funds that require value-added estimates as a component of its evaluation and accountability system(s).

While this means that the federal government is still largely in favor of the use of value-added models (VAMs) in terms of its funding priorities, despite its prior authorization of the Every Student Succeeds Act (ESSA) (see here and here), it also means that HISD might have to find another growth model or VAM to remain in compliance with the feds.

Regardless, during the Thursday night meeting a board member noted that HISD has been kicking this EVAAS can down the road for 5 years. “If not now, then when?” the board member asked. “I remember talking about this last year, and the year before. We all agree that it needs to be changed, but we just keep doing the same thing.” A member of the community said to the board: “VAM hasn’t moved the needle [see a related post about this here]. It hasn’t done what you need it to do. But it has been very expensive to this district.” He then listed the other things on which HISD could spend (and could have spent) its annual $680K EVAAS estimate costs.

Soon thereafter, the HISD school board called for a vote, which ended in a 3-3 tie. Because the tie meant the renewal effort failed, the board effectively rejected the effort to continue with the EVAAS. What this means for the related and aforementioned lawsuit is still indeterminate at this point.

The Danielson Framework: Evidence of Un/Warranted Use

The US Department of Education’s statistics, research, and evaluation arm — the Institute of Education Sciences — recently released a study (here) about the validity of the observational ratings from the Danielson Framework for Teaching, as used, with some minor adaptations (see Box 1 on page 1), for 713 teachers in the second largest school district in Nevada — Washoe County School District (Reno). The district is to use these data, along with student growth ratings, to inform decisions about teachers’ tenure, retention, and pay-for-performance, in compliance with the state’s still-current teacher evaluation system. The study was authored by researchers out of the Regional Educational Laboratory (REL) West at WestEd — a nonpartisan, nonprofit research, development, and service organization.

As many of you know, principals in many districts across the US, as per the Danielson Framework, use a four-point rating scale to rate teachers on 22 teaching components meant to measure four different dimensions, or “constructs,” of teaching.
In this study, researchers found that principals did not discriminate much among the four constructs and 22 components (i.e., the four domains were not statistically distinct from one another, and the ratings of the 22 components seemed to measure the same universal, cohesive trait). Principals did, however, discriminate among the teachers they observed to be more generally effective versus highly effective (i.e., the universal trait of overall “effectiveness”), as captured by the two highest categories on the scale. Hence, the analyses support the use of the overall scale versus the sub-components or items in and of themselves. Put differently, in the authors’ words, “the analysis does not support interpreting the four domain scores [or indicators] as measurements of distinct aspects of teaching; instead, the analysis supports using a single rating, such as the average over all [sic] components of the system to summarize teacher effectiveness” (p. 12).
In addition, principals also (still) rarely identified teachers as minimally effective or ineffective, with approximately 10% of ratings falling into the lowest two of the four categories on the Danielson scale. This was also true across all but one of the 22 aforementioned Danielson components (see Figures 1-4, pp. 7-8; see also Figure 5, p. 9).
I emphasize the word “still” in that this negative skew (i.e., a score distribution whose mass is concentrated toward the right, or highest, end of the scale) is one of the main reasons we as a nation became increasingly focused on “more objective” indicators of teacher effectiveness, namely teachers’ direct impacts on student learning and achievement via value-added measures (VAMs). Via “The Widget Effect” report (here), authors argued that it was more or less impossible for so many teachers to perform at such high levels, especially given the extent to which students in other industrialized nations were outscoring students in the US on international exams. Thereafter, US policymakers who got a hold of this report, among others, used it to make advancements towards, and research-based arguments for, “new and improved” teacher evaluation systems with “more objective” VAMs as key components.

In addition, and as directly related to VAMs, in this study researchers also found that each rating from each of the four domains, as well as the average of all ratings, “correlated positively with student learning [gains, as derived via the Nevada Growth Model, as based on the Student Growth Percentiles (SGP) model; for more information about the SGP model see here and here; see also p. 6 of this report here], in reading and in math, as would be expected if the ratings measured teacher effectiveness in promoting student learning” (p. i). Of course, this would only be expected if one agrees that the VAM estimate is the core indicator around which all other such indicators should revolve, but I digress…

Anyhow, by calculating standard correlation coefficients between teachers’ growth scores and the four Danielson domain scores, researchers found that “in all but one case” [i.e., the correlation between Domain 4 and growth in reading], the correlations were positive and statistically significant. Indeed this is true, although, in line with what is increasingly becoming a saturated finding in the literature (see similar findings about the Marzano observational framework here; see similar findings from other studies here, here, and here; see also other studies as cited by the authors of this study on pp. 13-14 here), the magnitude and practical significance of these correlations ranged only from “very weak” (e.g., r = .18) to “moderate” (e.g., r = .45, .46, and .48). See their Table 2 (p. 13) with all relevant correlation coefficients illustrated below.
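To put such magnitudes in perspective, the square of a correlation coefficient gives the proportion of variance the two measures share; a quick back-of-the-envelope calculation for the coefficients just cited:

```python
# Shared variance (r squared) for the correlation magnitudes cited above
for r in (0.18, 0.45, 0.46, 0.48):
    print(f"r = {r:.2f}  ->  shared variance = {r ** 2:.1%}")
```

Even the “moderate” correlations, then, leave roughly three quarters or more of the variance in growth scores unexplained by the observational ratings.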

[Table 2 (p. 13): correlation coefficients between teachers’ Danielson domain ratings and their student growth scores in reading and math]

Regardless, “[w]hile th[is] study takes place in one school district, the findings may be of interest to districts and states that are using or considering using the Danielson Framework” (p. i), especially those that intend to use this particular instrument for summative, and sometimes consequential, purposes, in that the Framework’s factor structure does not hold up for such uses unless, possibly, it is used as a generalized discriminator. Even then, however, the evidence of validity is still quite weak to support further generalized inferences and decisions.

So, those of you in states, districts, and schools: do make these findings known, especially if this framework is being used for similar purposes without such evidence in its support.

Citation: Lash, A., Tran, L., & Huang, M. (2016). Examining the validity of ratings from a classroom observation instrument for use in a district’s teacher evaluation system (REL 2016–135). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory West.

What ESSA Means for Teacher Evaluation and VAMs

Within a prior post, I wrote in some detail about what the Every Student Succeeds Act (ESSA) means for the U.S., as well as for states’ teacher evaluation systems as per the federally mandated adoption and use of growth and value-added models (VAMs) across the U.S., after President Obama signed ESSA into law in December.

Diane Ravitch recently covered, in her own words, what ESSA means for teacher evaluation systems as well, in what she called Part II of a nine-part series on all key sections of ESSA (see Parts I-IX here). I thought Part II was important to share with you all, especially given that this particular post captures what followers of this blog are most interested in, although I do recommend that you also see what ESSA means for other areas of educational progress and reform (e.g., the Common Core, teacher education, charter schools) in her Parts I-IX.

Here is what she captured in her Part II post, however, copied and pasted here from her original post:

The stakes attached to testing: will teachers be evaluated by test scores, as Duncan demanded and as the American Statistical Association rejected? Will teachers be fired because of ratings based on test scores?

Short Answer:

The federal mandate on teacher evaluation linked to test scores, as created in the waivers, is eliminated in ESSA.

States are allowed to use federal funds to continue these programs, if they choose, or completely change their strategy, but they will no longer be required to include these policies as a condition of receiving federal funds. In fact, the Secretary is explicitly prohibited from mandating any aspect of a teacher evaluation system, or mandating a state conduct the evaluation altogether, in section 1111(e)(1)(B)(iii)(IX) and (X), section 2101(e), and section 8401(d)(3) of the new law.

Long Answer:

Chairman Alexander has long been an advocate of the concept, as he calls it, of “paying teachers more for teaching well.” As governor of Tennessee he created the first teacher evaluation system in the nation, and believes to this day that the “Holy Grail” of education reform is finding fair ways to pay teachers more for teaching well.

But he opposed the idea of creating or continuing a federal mandate and requiring states to follow a Washington-based model of how to establish these types of systems.

Teacher evaluation is complicated work and the last thing local school districts and states need is to send their evaluation system to Washington, D.C., to see if a bureaucrat in Washington thinks they got it right.

ESSA ends the waiver requirements in August 2016, so states or districts that choose to end their teacher evaluation systems may do so. Otherwise, states can make changes to their teacher evaluation systems, or start over and build a new system. The decision is left to states and school districts to work out.

The law does continue a separate, competitive funding program, the Teacher and School Leader Incentive Fund, to allow states, school districts, or non-profits or for-profits in partnership with a state or school district to apply for competitive grants to implement teacher evaluation systems to see if the country can learn more about effective and fair ways of linking student performance to teacher performance.

Some Lawmakers Reconsidering VAMs in the South

A few weeks ago in Education Week, Stephen Sawchuk and Emmanuel Felton wrote a post on its Teacher Beat blog about lawmakers, particularly in the southern states, who are beginning to reconsider, via legislation, the role of test scores and value-added measures in their states’ teacher evaluation systems. Perhaps the tides are turning.

I tweeted this one out, but I also pasted this (short) one below to make sure you all, especially those of you teaching and/or residing in states like Georgia, Oklahoma, Louisiana, Tennessee, and Virginia, did not miss it.

Southern Lawmakers Reconsidering Role of Test Scores in Teacher Evaluations

After years of fierce debates over the effectiveness and fairness of the methodology, several southern lawmakers are looking to minimize the weight placed on so-called value-added measures, derived from how much students’ test scores change, in teacher-evaluation systems.

In part because these states are home to some of the weakest teachers unions in the country, southern policymakers were able to push past arguments that the state tests were ill-suited for teacher-evaluation purposes and that the systems would punish teachers for working in the toughest classrooms. States like Louisiana, Georgia, and Tennessee became some of the earliest and strongest adopters of the practice. But in the past few weeks, lawmakers from Baton Rouge, La., to Atlanta have introduced bills to limit the practice.

In February, the Georgia Senate unanimously passed a bill that would reduce the student-growth component from 50 percent of a teacher’s evaluation down to 30 percent. Earlier this week, nearly 30 individuals signed up to speak on behalf of the bill at a State House hearing.

Similarly, Louisiana House Bill 479 would reduce student-growth weight from 50 percent to 35 percent. Tennessee House Bill 1453 would reduce the weight of student-growth data through the 2018-2019 school year and would require the state Board of Education to produce a report evaluating the policy’s ongoing effectiveness. Lawmakers in Florida, Kentucky, and Oklahoma have introduced similar bills, according to the Southern Regional Education Board’s 2016 educator-effectiveness bill tracker.

By and large, states adopted these test-score-centric teacher-evaluation systems to attain waivers from No Child Left Behind’s requirement that all students be proficient by 2014. To get a waiver, states had to adopt systems that evaluated teachers “in significant part, based on student growth.” That has looked very different from state to state, ranging from 20 percent in Utah to 50 percent in states like Alaska, Tennessee, and Louisiana.

No Child Left Behind’s replacement, the Every Student Succeeds Act, doesn’t require states to have a teacher-evaluation system at all, but, as my colleague Stephen Sawchuk reported, the nation’s state superintendents say they remain committed to maintaining systems that regularly review teachers.

But, as Sawchuk reported, Steven Staples, Virginia’s state superintendent, signaled that his state may move away from its current system where student test scores make up 40 percent of a teacher’s evaluation:

“What we’ve found is that through our experience [with the NCLB waivers], we have had some unintended outcomes. The biggest one is that there’s an over-reliance on a single measure; too many of our divisions defaulted to the statewide standardized test … and their feedback was that because that was a focus [of the federal government], they felt they needed to emphasize that, ignoring some other factors. It also drove a real emphasis on a summative, final evaluation. And it resulted in our best teachers running away from our most challenged.”

Some state lawmakers appear to be absorbing a similar message.

Kane’s (Ironic) Calls for an Education Equivalent to the Food and Drug Administration (FDA)

Thomas Kane, an economics professor at Harvard University who also directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, has been the source of multiple posts on this blog (see, for example, here, here, and here). He consistently backs VAMs, with his Harvard affiliation in tow, as per a series of exaggerated but, as he argues, research-based claims.

However, much of the work that Kane cites he himself has written (sometimes with coauthors). Moreover, the strong majority of these works have not been published in peer-reviewed journals, but rather as technical reports or book chapters, often in his own books (see, for example, Kane’s curriculum vitae (CV) here). This includes the technical reports derived via his aforementioned MET studies, now in technical report and book form, completed three years ago in 2013, and still not externally vetted and published in any such journal. Lack of publication of the MET studies might be due to some of the methodological issues within these particular studies, however (see published and unpublished, (in)direct criticisms of these studies here, here, and here). Although Kane does also cite some published studies authored by others in support of VAMs, the studies he cites are primarily, if not only, authored by econometricians (e.g., Chetty, Friedman, and Rockoff) and, accordingly, are largely unrepresentative of the larger literature surrounding VAMs.

With that being said, and while I do try my best to stay aboveboard as an academic who takes my scholarly and related blogging activities seriously, sometimes it is hard, let’s say, not to “go there” when it is more than deserved. Now is one of those times, for what I believe is a fair assessment of one example of Kane’s unpublished and externally un-vetted work.

Last week on National Public Radio (NPR), Kane was interviewed by Eric Westervelt in a series titled “There Is No FDA For Education. Maybe There Should Be.” Ironically, I made this same claim in a 2009 article I authored and published in Educational Leadership. I began the piece noting that “The value-added assessment model is one over-the-counter product that may be detrimental to your health.” I ended the article noting that “We need to take our education health as seriously as we take our physical health…[hence, perhaps] the FDA approach [might] also serve as a model to protect the intellectual health of the United States. [It] might [also] be a model that legislators and education leaders follow when they pass legislation or policies whose benefits and risks are unknown” (Amrein-Beardsley, 2009).

Never did I foresee, however, how ironically my 2009 calls for an approach similar to the one Kane now recommends would come to apply, precisely because of academics like Kane. In other words, it is ironic that Kane is now calling for “rigorous vetting” of educational research, as he claims is done with medical research, when those rules apparently do not apply to his own work or to the scholarly works he perpetually cites in favor of VAMs.

Take also, for example, some of the more specific claims Kane expressed in this particular NPR interview.

Kane claims that “The point of education research is to identify effective interventions for closing the achievement gap,” although elsewhere in the same interview he claims that “American education research [has also] mostly languished in an echo chamber for much of the last half century.” While NPR’s Westervelt notes that Kane is making a “pretty scathing and strong indictment” of America’s education system, what Kane does not understand writ large is that the very solutions for which he advocates – using VAM-based measurements to fire and hire bad and good teachers, respectively – are really no different from the “stronger accountability” measures upon which we have relied for the last 40 years (since the minimum competency testing era) within this alleged “echo chamber.” Kane himself, then, IS one of the key perpetrators and perpetuators of the very “echo chamber” of which he speaks, and which he also scorns and indicts. Kane simply does not realize (or acknowledge) that this “echo chamber” persists because of those who continue to preserve the stronger-accountability logic of educational reform, and who continue to reinvent “new and improved” accountability measures and policies to support this logic, accordingly.

Kane also claims that he does not “point fingers at the school officials out there” for not being able to reform our schools for the better, and in this particular case, for closing the achievement gap. But unfortunately, Kane does. Recall the article Kane authored and that the New York Daily News published, titled “Teachers Must Look in the Mirror,” in which Kane insults New York’s administrators, writing that the fact that 96% of teachers in New York were given the two highest ratings last year – “effective” or “highly effective” – was “a sure sign that principals have not been honest to date.” Enough said.

The only thing on which I do agree with Kane, as per this NPR interview, is that we as an academic community can certainly do a better job of translating research into action, although I would add that only externally reviewed, published research is worthy of actually being turned into action in practice. Hence my repeated critiques on this blog of research studies that are sometimes not even internally vetted or reviewed before being released to the public, then the media, then the masses as “truth.” Recall, for example, when the Chetty et al. studies (mentioned prior, and also often cited by Kane) were cited in President Obama’s 2012 State of the Union address, even though the Chetty et al. studies had not been internally reviewed by their sponsoring agency – the National Bureau of Economic Research (NBER) – prior to their public release? This is also regardless of the fact that the federal government’s “What Works Clearinghouse” – also positioned in this NPR interview by Kane as the closest entity we have so far to an educational FDA – noted grave concerns about this study at the same time (as have others since; see, for example, here, here, here, and here).

Now this IS bad practice, although on this, Kane is also guilty.

To put it lightly, and in sum, I’m confused by Kane’s calls for an external monitoring/quality entity akin to the FDA. While I could not agree more with his statement of need, as I also argued in my 2009 piece mentioned prior, it is because of educational academics like Kane that I have to continue making such calls for monitoring and regulation.

Reference: Amrein-Beardsley, A. (2009). Buyers be-aware: What you don’t know can hurt you. Educational Leadership, 67(3), 38-42.

VAMs: A Global Perspective by Tore Sørensen

Tore Bernt Sørensen is a PhD student at the University of Bristol in England, an emerging global educational policy scholar, and a future colleague whom I am to meet this summer during an internationally situated talk on VAMs. Just last week he released a paper published by Education International (Belgium) in which he discusses VAMs and their use(s) globally. It is rare that I read, or have the opportunity to write about, what is happening with VAMs worldwide; hence, I am taking this opportunity to share with you all some of the global highlights from his article. I have also attached his article to this post here for those of you who want to give the full document a thorough read (see also the article’s full reference below).

First is that the US is “leading” the world in terms of its adoption of VAMs as an educational policy tool. While I knew this beforehand, given my prior attempts to explore what was happening in the universe of VAMs outside of the US, as per Sørensen, our nation’s ranking in this case still stands. In fact, “in the US the use of VAM as a policy instrument to evaluate schools and teachers has been taken exceptionally far [emphasis added] in the last 5 years, [while] most other high-income countries remain [relatively more] cautious towards the use of VAM;” this, “as reflected in OECD [Organisation for Economic Co-operation and Development] reports on the [VAM] policy instrument” (p. 1).

The second country most exceptionally using VAMs, so far, is England, where the national school inspection system, run by the Office for Standards in Education, Children’s Services and Skills (OFSTED), now has VAM as its central standard and accountability indicator.

These two nations are the most invested in VAMs thus far, primarily because they have similar histories with the school effectiveness movement that emerged in the 1970s. In addition, both countries are highly engaged in what Pasi Sahlberg, in his 2011 book Finnish Lessons, termed the Global Educational Reform Movement (GERM). GERM, in place since the 1980s, has “radically altered education sectors throughout the world with an agenda of evidence-based policy based on the [same] school effectiveness paradigm…[as it]…combines the centralised formulation of objectives and standards, and [the] monitoring of data, with the decentralisation to schools concerning decisions around how they seek to meet standards and maximise performance in their day-to-day running” (p. 5).

“The Chilean education system has [also recently] been subject to one of the more radical variants of GERM and there is [now] an interest [also there] in calculating VAM scores for teachers” (p. 6). In Denmark and Sweden, state authorities have begun to compare predicted versus actual performance of schools, not teachers, while taking into consideration “the contextual factors of parents’ educational background, gender, and student origin” (i.e., “context value added”) (p. 7). In Uganda and Delhi, in “partnership” with an England-based international school development company, ARK, officials are looking to gear up their data systems so they can run VAM trials and analyses to assess their schools’ effects, and also likely continue to scale up and out.

The World Bank is also backing such international moves, as is the US-based Pearson testing corporation via its Learning Curve Project, which relies on input from some of the most prominent VAM advocates, including Eric Hanushek (see prior posts on Hanushek here and here) and Raj Chetty (see prior posts on Chetty here and here), to promote itself as a player in the universe of VAMs. This makes sense, “[c]onsidering Pearson’s aspirations to be a global education company… particularly in low-income countries” (p. 7). On that note, also as per Sørensen, “education systems in low-income countries might prove [most] vulnerable in the coming years as international donors and for-profit enterprises appear to be endorsing VAM as a means to raise school and teacher quality” in such educationally struggling nations (p. 2).

See also a related blog post about Sørensen’s piece here, as written by him on the Education in Crisis blog, which is also sponsored by Education International. In this piece he also discusses the use of data for political purposes, as is too often the case with VAMs when “the use of statistical tools as policy instruments is taken too far…towards bounded rationality in education policy.”

In short, “VAM, if it has any use at all, must expose the misleading use of statistical mumbo jumbo that effectively #VAMboozles [thanks for the shout out!!] teachers, schools and society. This could help to spark some much needed reflection on the basic propositions of school effectiveness, the negative effects of putting too much trust in numbers, and lead us to start holding policy-makers to account for their misuse of data in policy formation.”

Reference: Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels, Belgium: Education International.

You Are Invited to Participate in the #HowMuchTesting Debate!

As the scholarly debate about the extent and purpose of educational testing rages on, the American Educational Research Association (AERA) wants to hear from you. During a key session at its Centennial Conference this spring in Washington, DC, titled How Much Testing and for What Purpose? Public Scholarship in the Debate about Educational Assessment and Accountability, prominent educational researchers will respond to questions and concerns raised by YOU: parents, students, teachers, community members, and the public at large.

Hence, any and all of you with an interest in testing, value-added modeling, educational assessment, educational accountability policies, and the like are invited to post your questions, concerns, and comments using the hashtag #HowMuchTesting on Twitter, Facebook, Instagram, Google+, or the social media platform of your choice, as these are the posts to which AERA’s panelists will respond.

Organizers are interested in all #HowMuchTesting posts, but they are particularly interested in video-recorded questions and comments of 30-45 seconds in duration, so that you can ask your own questions rather than having them read by a moderator. In addition, in order to provide ample time for the panel of experts to prepare for the discussion, comments and questions posted by March 17 have the best chances for inclusion in the debate.

Thank you all in advance for your contributions!!

To read more about this session, from the session’s organizer, click here.

Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)

David C. Berliner, Regents’ Professor Emeritus here at Arizona State University (ASU), who also just happens to be my former albeit forever mentor, recently took up research on the use of test scores to evaluate teachers, for example, using value-added models (VAMs). While David is world-renowned for his research in educational psychology, and more specific to this case, his expertise on effective teaching behaviors and how to capture and observe them, he has also now ventured into the VAM-related debates.

Accordingly, he recently presented his newest and soon-to-be-published research on using standardized tests to evaluate teachers, something he aptly termed in the title of his presentation “A Policy Fiasco.” He delivered his speech to an audience in Melbourne, Australia, and you can click here for the full video-taped presentation. Because the whole presentation takes about one hour to watch (although I must say the full hour is well worth it), I highlight below his key points. These should certainly be of interest to you all as followers of this blog, and hopefully to others.

Of main interest are his 14 reasons, “big and small,” for his judgment that “assessing teacher competence using standardized achievement tests is nearly worthless.”

Here are his fourteen reasons:

  1. “When using standardized achievement tests as the basis for inferences about the quality of teachers, and the institutions from which they came, it is easy to confuse the effects of sociological variables on standardized test scores” and the effects teachers have on those same scores. Sociological variables (e.g., chronic absenteeism) continue to distort even the best attempts to disentangle them from the very instructional variables of interest. These sociological variables, which we also term biasing variables, must not be inappropriately dismissed as purportedly statistically “controlled for.”
  2. In law, we do not hold people accountable for the actions of others, for example, when a child kills another child and the parents are not charged as guilty. Hence, “[t]he logic of holding [teachers and] schools of education responsible for student achievement does not fit into our system of law or into the moral code subscribed to by most western nations.” Relatedly, should medical schools, or doctors for that matter, be held accountable for the health of their patients? One of the best parts of his talk, in fact, is about the medical field and the corollaries Berliner draws between doctors and medical schools, and teachers and colleges of education, respectively (around the 19-25 minute mark of his video presentation).
  3. Professionals are often held harmless for their lower success rates with clients who have observable difficulties in meeting the demands and the expectations of the professionals who attend to them. In medicine again, for example, when working with impoverished patients, “[t]here is precedent for holding [doctors] harmless for their lowest success rates with clients who have observable difficulties in meeting the demands and expectations of the [doctors] who attend to them, but the dispensation we offer to physicians is not offered to teachers.”
  4. There are other quite acceptable sources of data, besides tests, for judging the efficacy of teachers and teacher education programs. “People accept the fact that treatment and medicine may not result in the cure of a disease. Practicing good medicine is the goal, whether or not the patient gets better or lives. It is equally true that competent teaching can occur independent of student learning or of the achievement test scores that serve as proxies for said learning. A teacher can literally ‘save lives’ and not move the metrics used to measure teacher effectiveness.”
  5. Reliance on standardized achievement test scores as the source of data about teacher quality will inevitably promote confusion between “successful” instruction and “good” instruction. “Successful” instruction gets test scores up. “Good” instruction leaves lasting impressions, fosters further interest by the students, makes them feel competent in the area, etc. Good instruction is hard to measure, but remains the goal of our finest teachers.
  6. Relatedly, teachers affect individual students greatly, but they affect standardized achievement test scores very little. All can think of how their own teachers impacted their lives in ways that cannot be captured on a standardized achievement test. Standardized achievement test scores are much more related to home, neighborhood, and cohort than they are to teachers’ instructional capabilities. In more contemporary terms, this is also due to the fact that large-scale standardized tests have (still) never been validated to measure student growth over time, nor have they been validated to attribute that growth to teachers. “Teachers have huge effects, it’s just that the tests are not sensitive to them.”
  7. Teachers’ effects on standardized achievement test scores fade quickly, becoming barely discernible after a few years. So we might not want to worry overly about most teachers’ effects on their students—good or bad—as they are hard to detect on tests after two or so years. To use these ephemeral effects to then hold teacher education programs accountable seems even more problematic.
  8. Observational measures of teacher competency and achievement-test measures of teacher competency do not correlate well. This suggests nothing more than that one or both of these measures, and likely the latter, are malfunctioning in their capacities to measure the teacher effectiveness construct. See other Vamboozled posts about this here, here, and here.
  9. Different standardized achievement tests, both purporting to measure reading, mathematics, or science at the same grade level, will give different estimates of teacher competency. That is because different test developers have different visions of what it means to be competent in each of these subject areas. Thus one achievement test in these subject areas could find a teacher exemplary, but another test of those same subject areas would find the teacher lacking. What then? Have we an unstable teacher or an ill-defined subject area?
  10. Tests can be administered early or late in the fall, early or late in the spring, and the dates they are given influence the judgments about whether a teacher is performing well or poorly. Teacher competency should not be determined by minor differences in the date of testing, but that happens frequently.
  11. No standardized achievement tests have provided proof that their items are instructionally sensitive. If test items cannot “react to good instruction,” how can one claim that those items are “tapping good instruction”?
  12. Teacher effects show up more dramatically on teacher-made tests than on standardized achievement tests because the former are based on the enacted curriculum, while the latter are based on the desired curriculum. The closer a test is to the classroom, the more instructionally sensitive it becomes; teacher-made tests are roughly seven times more instructionally sensitive than standardized achievement tests.
  13. The opt-out testing movement invalidates inferences about teachers and schools that can be made from standardized achievement test results. It’s not bad to remove these kids from taking these tests, and perhaps it is even necessary in our over-tested schools, but the tests, and the VAM estimates derived via these tests, are far less valid when that happens. This is because the students who opt out are likely different in significant ways from those who do take the tests. This severely limits the validity of the claims that are made.
  14. Assessing new teachers with standardized achievement tests is likely to yield many false negatives. That is, the assessments would identify teachers early in their careers as ineffective in improving test scores, which is, in fact, often the case for new teachers. Two or three years later that could change. Perhaps the last thing we want to do in a time of teacher shortage is discourage new teachers while they acquire their skills.

Doug Harris on the “Every Student Succeeds Act” and Reasons to Continue with VAMs Regardless

Two weeks ago I wrote a post about the passage of the “Every Student Succeeds Act” (ESSA), and its primary purpose to reduce “the federal footprint and restore local control, while empowering parents and education leaders to hold schools accountable for effectively teaching students” within their states. More specific to VAMs, I wrote about how ESSA will allow states to decide how to weight their standardized test scores and decide whether and how to evaluate teachers with or without said scores.

Doug Harris, an Associate Professor of Economics at Tulane in Louisiana and, as I’ve written prior, “a ‘cautious’ but quite active proponent of VAMs” (see another two posts about Harris here and here) recently took to an Education Week blog to respond to ESSA, as well. His post titled, “NCLB 3.0 Could Improve the Use of Value-Added and Other Measures” can be read with a subscription. For those of you without a subscription, however, I’ve highlighted some of his key arguments below, along with my responses to those I felt were most pertinent here.

In his post, Harris argues that one element of ESSA “clearly reinforces value-added–the continuation of annual testing.” He equates these (i.e., value-added and testing) as synonyms, with which I would certainly disagree. The latter precedes the former, and without the latter the former would not exist. In other words, what matters is what one does with the test scores; test scores are not, by default, the same thing as VAMs.

In addition, while the continuation of annual testing is written into ESSA, the use of VAMs is not, and their use for teacher evaluations, or not, is now left to the states to decide. Hence, it is true that ESSA “means the states will no longer have to abide by [federal/NCLB] waivers and there is a good chance that some states will scale back aggressive evaluation and accountability [measures].”

Harris discusses the controversies surrounding VAMs, as argued by both VAM proponents and critics, and he contends in the end that “[i]t is difficult to weigh these pros and cons… because we have so little evidence on the [formative] effects of using value-added measures on teaching and learning.” I would argue that we have plenty of evidence on the formative effects of using VAMs, given the evidence dating back now almost thirty years (e.g., from Tennessee). This is one area of research where I do not believe we need more evidence to establish the lack of formative effects, or rather of instructionally informative benefits to be drawn from VAM use. Again, this is largely because VAM estimates are based on tests so far removed from the realities of the classroom that neither the tests nor the VAM estimates derived from them are “instructionally sensitive.” Should Harris argue for value-added using “instructionally sensitive” tests, perhaps this suggestion for future research might carry more weight or potential.

Harris also discusses some “other ways to use value-added” should states still decide to do so (e.g., within a flagging system whereby VAMs, as the more “objective” measure, could be used to raise red flags for individual teachers who might require further investigation using other, less “objective” measures). Given the millions in taxpayer revenues it would require to do even this, again for only the roughly 30% of all teachers who are primarily elementary school teachers of primary subject areas, cost should certainly be of note here. This suggestion also overlooks VAMs’ still-prevalent methodological and measurement issues, and how these issues should likely prevent VAMs from playing even a primary role as the key criterion used to flag teachers. From an educational measurement standpoint, this is simply not warranted.

Harris continues that “The point is that it would be a shame if we went back to a no-feedback world, and even more of a shame if we did so because of an over-stated connection between evaluation, accountability, and value-added.” In my professional opinion, he is incorrect on both of these points: (1) Teachers are not (really) using the feedback they are getting from their VAM reports, as described prior, for a variety of reasons including transparency, usability, actionability, etc.; hence, we are nowhere nearer to some utopian “feedback world” than we were pre-VAMs. All of the related research points to observational feedback, actually, as the most useful source of formative feedback. (2) There is nothing “over-stated” about the “connection between evaluation, accountability, and value-added.” Rather, what he terms possibly “over-stated” is actually real, in practice, and over-abused rather than merely over-stated. This is not just a matter of semantics.

Finally, the one point upon which we agree, with some distinction, is that ESSA “is also an opportunity because it requires schools to move beyond test scores in accountability. This will mean a reduced focus on value-added to student test scores, but that’s OK if the new measures provide a more complete assessment of the value [defined in much broader terms] that schools provide.” ESSA, then, offers the nation “a chance to get value-added [defined in much broader terms] right and correct the problems with how we measure teacher and school performance.” If we define “value-added” in much broader terms, even if test scores are not a part of the “value-added” construct, this would likely be a step in the right direction.