No More EVAAS for Houston: School Board Tie Vote Means Non-Renewal

Recall from prior posts (here, here, and here) that seven teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking HISD to federal court over how their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), are being used, and allegedly abused, in a district that has tied more high-stakes consequences to value-added output than any other district or state in the nation. The case, Houston Federation of Teachers, et al. v. Houston ISD, is ongoing.

Just announced, however, is that the HISD school board, in a 3-3 split vote late last Thursday night, elected to no longer pay an annual $680K to SAS Institute Inc. to calculate the district’s EVAAS value-added estimates. As per an HFT press release (below), HISD “will not be renewing the district’s seriously flawed teacher evaluation system, [which is] good news for students, teachers and the community, [although] the school board and incoming superintendent must work with educators and others to choose a more effective system.”


Apparently, HISD was holding onto the EVAAS, despite the research surrounding the EVAAS in general and in Houston in particular, because the district has received (and is still set to receive) over $4 million in federal grant funds that require value-added estimates as a component of its evaluation and accountability system(s).

While this means that the federal government is still largely in favor of the use of value-added models (VAMs) in terms of its funding priorities, despite its prior authorization of the Every Student Succeeds Act (ESSA) (see here and here), this also means that HISD might have to find another growth model or VAM to remain in compliance with the feds.

Regardless, during the Thursday night meeting a board member noted that HISD has been kicking this EVAAS can down the road for five years. “If not now, then when?” the board member asked. “I remember talking about this last year, and the year before. We all agree that it needs to be changed, but we just keep doing the same thing.” A member of the community said to the board: “VAM hasn’t moved the needle [see a related post about this here]. It hasn’t done what you need it to do. But it has been very expensive to this district.” He then listed the other things on which HISD could spend (and could have spent) its annual $680K in EVAAS costs.

Soon thereafter, the HISD school board called for a vote, and it ended up being a 3-3 tie. Because of the 3-3 tie vote, the school board rejected the effort to continue with the EVAAS. What this means for the related and aforementioned lawsuit is still indeterminate at this point.

The Danielson Framework: Evidence of Un/Warranted Use

The US Department of Education’s statistics, research, and evaluation arm, the Institute of Education Sciences, recently released a study (here) about the validity of the Danielson Framework for Teaching‘s observational ratings as used for 713 teachers, with some minor adaptations (see box 1 on page 1), in the second largest school district in Nevada, the Washoe County School District (Reno). The district is to use these data, along with student growth ratings, to inform decisions about teachers’ tenure and retention and its pay-for-performance system, in compliance with the state’s still-current teacher evaluation system. The study was authored by researchers out of the Regional Educational Laboratory (REL) West at WestEd, a nonpartisan, nonprofit research, development, and service organization.

As many of you know, principals in many districts throughout the US, as per the Danielson Framework, use a four-point rating scale to rate teachers on 22 teaching components meant to measure four different dimensions or “constructs” of teaching.
In this study, researchers found that principals did not discriminate much among the four constructs and 22 components (i.e., the four domains were not statistically distinct from one another, and the ratings of the 22 components seemed to measure the same, cohesive trait). Rather, principals discriminated among the teachers they observed to be more generally effective and highly effective (i.e., a single trait of overall “effectiveness”), as captured by the two highest categories on the scale. Hence, the analyses support the use of the overall scale rather than the sub-components or items in and of themselves. Put differently, and in the authors’ words, “the analysis does not support interpreting the four domain scores [or indicators] as measurements of distinct aspects of teaching; instead, the analysis supports using a single rating, such as the average over all [sic] components of the system to summarize teacher effectiveness” (p. 12).
In addition, principals also (still) rarely identified teachers as minimally effective or ineffective, with approximately 10% of ratings falling into the lowest two of the four categories on the Danielson scale. This was also true across all but one of the 22 aforementioned Danielson components (see Figures 1-4, pp. 7-8; see also Figure 5, p. 9).
I emphasize the word “still” in that this negative skew (that is, a distribution in which the mass of teachers’ scores is concentrated toward the high end of the scale) is one of the main reasons we as a nation became increasingly focused on “more objective” indicators of teacher effectiveness, namely indicators of teachers’ direct impacts on student learning and achievement via value-added measures (VAMs). In “The Widget Effect” report (here), authors argued that it was more or less impossible to have so many teachers perform at such high levels, especially given the extent to which students in other industrialized nations were outscoring students in the US on international exams. Thereafter, US policymakers who got a hold of this report, among others, used it to advance, and to make research-based arguments for, “new and improved” teacher evaluation systems with the “more objective” VAMs as key components.

In addition, and as directly related to VAMs, in this study researchers also found that each rating from each of the four domains, as well as the average of all ratings, “correlated positively with student learning [gains, as derived via the Nevada Growth Model, as based on the Student Growth Percentiles (SGP) model; for more information about the SGP model see here and here; see also p. 6 of this report here], in reading and in math, as would be expected if the ratings measured teacher effectiveness in promoting student learning” (p. i). Of course, this would only be expected if one agrees that the VAM estimate is the core indicator around which all other such indicators should revolve, but I digress…

Anyhow, researchers found, by calculating standard correlation coefficients between teachers’ growth scores and the four Danielson domain scores, that “in all but one case” [the exception being the correlation between Domain 4 and growth in reading] said correlations were positive and statistically significant. Indeed this is true, although the magnitude and practical significance of the correlations they observed, in line with what is increasingly becoming a saturated finding in the literature (see similar findings about the Marzano observational framework here; see similar findings from other studies here, here, and here; see also other studies as cited by the authors of this study on pp. 13-14 here), range from “very weak” (e.g., r = .18) to “moderate” (e.g., r = .45, .46, and .48). See their Table 2 (p. 13), with all relevant correlation coefficients, illustrated below.

[Table 2 (p. 13): correlations between teachers’ Danielson domain ratings and their students’ growth in math and reading]
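To make the practical significance of coefficients in this range concrete, below is a minimal sketch, in Python and with entirely hypothetical data, of the kind of calculation behind a table like this: standard Pearson correlations between teachers’ domain ratings and their students’ growth scores, along with the variance explained (r squared) that such correlations imply. The variable names, sample size, and values are illustrative assumptions, not the study’s actual data.

```python
# Minimal sketch (hypothetical data): Pearson correlations between Danielson domain
# ratings and teacher-level student growth, as in a Table 2-style analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_teachers = 100  # assumed sample size, for illustration only

# Hypothetical domain ratings (1-4 scale, averaged across each domain's components)
domains = {f"Domain {d}": rng.uniform(2.5, 4.0, n_teachers) for d in range(1, 5)}

# Hypothetical teacher-level growth scores (e.g., median student growth percentiles)
growth_math = rng.uniform(20, 80, n_teachers)

for name, ratings in domains.items():
    r, p = stats.pearsonr(ratings, growth_math)
    print(f"{name} vs. math growth: r = {r:.2f}, r^2 = {r**2:.2f}, p = {p:.3f}")
```

The r-squared figure is the point of contention: an r of .18 corresponds to roughly 3% of the variance in growth scores being associated with the observational rating, and even an r of .48 corresponds to only about 23%, which is why correlations of this size are characterized as “very weak” to “moderate.”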

Regardless, “[w]hile th[is] study takes place in one school district, the findings may be of interest to districts and states that are using or considering using the Danielson Framework” (p. i), especially those that intend to use this particular instrument for summative and sometimes consequential purposes, in that the Framework’s factor structure does not hold up for such uses, unless, possibly, it is used as a generalized discriminator. Even then, however, the evidence of validity is still quite weak for supporting more generalized inferences and decisions.

So, those of you in states, districts, and schools, do make these findings known, especially if this framework is being used for similar purposes without evidence to support such use.

Citation: Lash, A., Tran, L., & Huang, M. (2016). Examining the validity of ratings from a classroom observation instrument for use in a district’s teacher evaluation system (REL 2016–135). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory West. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/west/pdf/REL_2016135.pdf

What ESSA Means for Teacher Evaluation and VAMs

In a prior post, I wrote in some detail about what the Every Student Succeeds Act (ESSA) means for the U.S., as well as for states’ teacher evaluation systems built around the federally mandated adoption and use of growth and value-added models (VAMs), after President Obama signed ESSA into law in December.

Diane Ravitch recently covered, in her own words, what ESSA means for teacher evaluation systems as well, in what she called Part II of a nine-part series on all key sections of ESSA (see Parts I-IX here). I thought Part II was important to share with you all, especially given that this particular post captures what followers of this blog are most interested in, although I also recommend that you see what ESSA means for other areas of educational progress and reform, in terms of the Common Core, teacher education, charter schools, etc., in her Parts I-IX.

Here is what she captured in her Part II post, copied and pasted from her original:

The stakes attached to testing: will teachers be evaluated by test scores, as Duncan demanded and as the American Statistical Association rejected? Will teachers be fired because of ratings based on test scores?

Short Answer:

The federal mandate on teacher evaluation linked to test scores, as created in the waivers, is eliminated in ESSA.

States are allowed to use federal funds to continue these programs, if they choose, or completely change their strategy, but they will no longer be required to include these policies as a condition of receiving federal funds. In fact, the Secretary is explicitly prohibited from mandating any aspect of a teacher evaluation system, or mandating a state conduct the evaluation altogether, in section 1111(e)(1)(B)(iii)(IX) and (X), section 2101(e), and section 8401(d)(3) of the new law.

Long Answer:

Chairman Alexander has been a long advocate of the concept, as he calls it, of “paying teachers more for teaching well.” As governor of Tennessee he created the first teacher evaluation system in the nation, and believes to this day that the “Holy Grail” of education reform is finding fair ways to pay teachers more for teaching well.

But he opposed the idea of creating or continuing a federal mandate and requiring states to follow a Washington-based model of how to establish these types of systems.

Teacher evaluation is complicated work and the last thing local school districts and states need is to send their evaluation system to Washington, D.C., to see if a bureaucrat in Washington thinks they got it right.

ESSA ends the waiver requirements in August 2016, so states or districts that choose to end their teacher evaluation system may. Otherwise, states can make changes to their teacher evaluation systems, or start over and create a new system. The decision is left to states and school districts to work out.

The law does continue a separate, competitive funding program, the Teacher and School Leader Incentive Fund, to allow states, school districts, or non-profits or for-profits in partnership with a state or school district to apply for competitive grants to implement teacher evaluation systems to see if the country can learn more about effective and fair ways of linking student performance to teacher performance.

Some Lawmakers Reconsidering VAMs in the South

A few weeks ago in Education Week, Stephen Sawchuk and Emmanuel Felton wrote a post in its Teacher Beat blog about lawmakers, particularly in southern states, who are beginning to reconsider, via legislation, the role of test scores and value-added measures in their states’ teacher evaluation systems. Perhaps the tides are turning.

I tweeted this one out, but I also pasted this (short) one below to make sure you all, especially those of you teaching and/or residing in states like Georgia, Oklahoma, Louisiana, Tennessee, and Virginia, did not miss it.

Southern Lawmakers Reconsidering Role of Test Scores in Teacher Evaluations

After years of fierce debates over the effectiveness and fairness of the methodology, several southern lawmakers are looking to minimize the weight placed on so-called value-added measures, derived from how much students’ test scores changed, in teacher-evaluation systems.

In part because these states are home to some of the weakest teachers unions in the country, southern policymakers were able to push past arguments that the state tests were ill suited for teacher-evaluation purposes and that the system would punish teachers for working in the toughest classrooms. States like Louisiana, Georgia, and Tennessee became some of the earliest and strongest adopters of the practice. But in the past few weeks, lawmakers from Baton Rouge, La., to Atlanta have introduced bills to limit the practice.

In February, the Georgia Senate unanimously passed a bill that would reduce the student-growth component from 50 percent of a teacher’s evaluation down to 30 percent. Earlier this week, nearly 30 individuals signed up to speak on behalf of the bill at a State House hearing.

Similarly, Louisiana House Bill 479 would reduce student-growth weight from 50 percent to 35 percent. Tennessee House Bill 1453 would reduce the weight of student-growth data through the 2018-2019 school year and would require the state Board of Education to produce a report evaluating the policy’s ongoing effectiveness. Lawmakers in Florida, Kentucky, and Oklahoma have introduced similar bills, according to the Southern Regional Education Board’s 2016 educator-effectiveness bill tracker.

By and large, states adopted these test-score-centric teacher-evaluation systems to attain waivers from No Child Left Behind’s requirement that all students be proficient by 2014. To get a waiver, states had to adopt systems that evaluated teachers “in significant part, based on student growth.” That has looked very different from state to state, ranging from 20 percent in Utah to 50 percent in states like Alaska, Tennessee, and Louisiana.

No Child Left Behind’s replacement, the Every Student Succeeds Act, doesn’t require states to have a teacher-evaluation system at all, but, as my colleague Stephen Sawchuk reported, the nation’s state superintendents say they remain committed to maintaining systems that regularly review teachers.

But, as Sawchuk reported, Steven Staples, Virginia’s state superintendent, signaled that his state may move away from its current system where student test scores make up 40 percent of a teacher’s evaluation:

“What we’ve found is that through our experience [with the NCLB waivers], we have had some unintended outcomes. The biggest one is that there’s an over-reliance on a single measure; too many of our divisions defaulted to the statewide standardized test … and their feedback was that because that was a focus [of the federal government], they felt they needed to emphasize that, ignoring some other factors. It also drove a real emphasis on a summative, final evaluation. And it resulted in our best teachers running away from our most challenged.”

Some state lawmakers appear to be absorbing a similar message.

Kane’s (Ironic) Calls for an Education Equivalent to the Food and Drug Administration (FDA)

Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, has been the source of multiple posts on this blog (see, for example, here, here, and here). He consistently backs VAMs, with his Harvard affiliation in tow, via a series of exaggerated but, as he argues, research-based claims.

However, much of the work that Kane cites he himself has written on the topic (sometimes with coauthors). Likewise, the strong majority of these works have not been published in peer-reviewed journals, but rather as technical reports or book chapters, often in his own books (see, for example, Kane’s curriculum vitae (CV) here). This includes the technical reports derived via his aforementioned MET studies, now in technical report and book form, completed three years ago in 2013, yet still not externally vetted and published in any such journal. Lack of publication of the MET studies might be due to some of the methodological issues within these particular studies, however (see published and unpublished, (in)direct criticisms of these studies here, here, and here). Although Kane does also cite some published studies authored by others, again, in support of VAMs, the studies Kane cites are primarily/only authored by econometricians (e.g., Chetty, Friedman, and Rockoff) and, accordingly, largely unrepresentative of the larger literature surrounding VAMs.

With that being said, and while I do try my best to stay aboveboard as an academic who takes my scholarly and related blogging activities seriously, sometimes it is hard to, let’s say, not “go there” when more than deserved. Now is one of those times for what I believe is a fair assessment of one example of Kane’s unpublished and externally un-vetted works.

Last week on National Public Radio (NPR), Kane was interviewed by Eric Westervelt in a series titled “There Is No FDA For Education. Maybe There Should Be.” Ironically, in 2009 I made this claim in an article that I authored and that was published in Educational Leadership. I began the piece noting that “The value-added assessment model is one over-the-counter product that may be detrimental to your health.” I ended the article noting that “We need to take our education health as seriously as we take our physical health…[hence, perhaps] the FDA approach [might] also serve as a model to protect the intellectual health of the United States. [It] might [also] be a model that legislators and education leaders follow when they pass legislation or policies whose benefits and risks are unknown” (Amrein-Beardsley, 2009).

Never did I foresee, however, how much my 2009 calls for an approach similar to the one Kane called for during this interview would ironically apply now, and precisely because of academics like Kane. In other words, it is ironic that Kane is now calling for “rigorous vetting” of educational research, as he claims is done with medical research, when those rules apparently do not apply to his own work or to the scholarly works he perpetually cites in favor of VAMs.

Take also, for example, some of the more specific claims Kane expressed in this particular NPR interview.

Kane claims that “The point of education research is to identify effective interventions for closing the achievement gap.” Although elsewhere in the same interview Kane claims that “American education research [has also] mostly languished in an echo chamber for much of the last half century.” While NPR’s Westervelt criticizes Kane for making a “pretty scathing and strong indictment” of America’s education system, what Kane does not understand writ large is that the very solutions for which Kane advocates – using VAM-based measurements to fire and hire bad and good teachers, respectively – are really no different than the “stronger accountability” measures upon which we have relied for the last 40 years (since the minimum competency testing era) within this alleged “echo chamber.” Kane himself, then, IS one of the key perpetrators and perpetuators of the very “echo chamber” of which he speaks, and also scorns and indicts. Kane simply does not realize (or acknowledge) that this “echo chamber” continues in its persistence because of those who continue to preserve a stronger accountability for educational reform logic, and who continue to reinvent “new and improved” accountability measures and policies to support this logic, accordingly.

Kane also claims that he does not “point fingers at the school officials out there” for not being able to reform our schools for the better, and in this particular case, for closing the achievement gap. But unfortunately, Kane does. Recall the article Kane authored and that the New York Daily News published, titled “Teachers Must Look in the Mirror,” in which Kane insults New York’s administrators, writing that the fact that 96% of teachers in New York were given the two highest ratings last year – “effective” or “highly effective” – was “a sure sign that principals have not been honest to date.” Enough said.

The only thing on which I do agree with Kane, as per this NPR interview, is that we as an academic community can certainly do a better job of translating research into action, although I would add that only externally reviewed, published research is worthy of actually being turned into action, in practice. Hence my repeated critiques on this blog of research studies that are sometimes not even internally vetted or reviewed before being released to the public, and then the media, and then the masses as “truth.” Recall, for example, when the Chetty et al. studies (mentioned prior as also often cited by Kane) were cited in President Obama’s 2012 State of the Union address, regardless of the fact that the Chetty et al. studies had not even been internally reviewed by their sponsoring agency – the National Bureau of Economic Research (NBER) – prior to their public release? This is also regardless of the fact that the federal government’s “What Works Clearinghouse” – also positioned in this NPR interview by Kane as the closest entity we have so far to an educational FDA – noted grave concerns about this study at the same time (as have others since; see, for example, here, here, here, and here).

Now this IS bad practice, although on this, Kane is also guilty.

To put it lightly, and in sum, I’m confused by Kane’s calls for such an external monitoring/quality entity akin to the FDA. While I could not agree more with his statement of need, as I also argued in my 2009 piece mentioned prior, it is because of educational academics like Kane that I have to continue to make such claims and calls for such monitoring and regulation.

Reference: Amrein-Beardsley, A. (2009). Buyers be-aware: What you don’t know can hurt you. Educational Leadership, 67(3), 38-42.

VAMs: A Global Perspective by Tore Sørensen

Tore Bernt Sørensen is a PhD student at the University of Bristol in England, an emerging global educational policy scholar, and a future colleague whom I am to meet this summer during an internationally situated talk on VAMs. Just last week he released a paper published by Education International (Belgium) in which he discusses VAMs and their use(s) globally. It is rare that I read or have the opportunity to write about what is happening with VAMs worldwide; hence, I am taking this opportunity to share with you all some of the global highlights from his article. I have also attached his article to this post here for those of you who want to give the full document a thorough read (see also the article’s full reference below).

First is that the US is “leading” the world in terms of its adoption of VAMs as an educational policy tool. While I knew this before, given my prior attempts to explore what was happening in the universe of VAMs outside of the US, as per Sørensen, our nation’s ranking in this case still stands. In fact, “in the US the use of VAM as a policy instrument to evaluate schools and teachers has been taken exceptionally far [emphasis added] in the last 5 years, [while] most other high-income countries remain [relatively more] cautious towards the use of VAM;” this, “as reflected in OECD [Organisation for Economic Co-operation and Development] reports on the [VAM] policy instrument” (p. 1).

The second country most exceptionally using VAMs, so far, is England. England’s national school inspection system, run by the Office for Standards in Education, Children’s Services and Skills (OFSTED), for example, now has VAM as its central standards and accountability indicator.

These two nations are the most invested in VAMs, thus far, primarily because they have similar histories with the school effectiveness movement that emerged in the 1970s. In addition, both countries are highly engaged in what Pasi Sahlberg, in his 2011 book Finnish Lessons, termed the Global Educational Reform Movement (GERM). GERM, in place since the 1980s, has “radically altered education sectors throughout the world with an agenda of evidence-based policy based on the [same] school effectiveness paradigm…[as it]…combines the centralised formulation of objectives and standards, and [the] monitoring of data, with the decentralisation to schools concerning decisions around how they seek to meet standards and maximise performance in their day-to-day running” (p. 5).

“The Chilean education system has [also recently] been subject to one of the more radical variants of GERM and there is [now] an interest [also there] in calculating VAM scores for teachers” (p. 6). In Denmark and Sweden, state authorities have begun to compare the predicted versus actual performance of schools, not teachers, while taking into consideration “the contextual factors of parents’ educational background, gender, and student origin” (i.e., “context value added”) (p. 7). In Uganda and Delhi, in “partnership” with ARK, an England-based international school development company, authorities are looking to gear up their data systems so they can run VAM trials and analyses to assess their schools’ effects, and also likely continue to scale up and out.
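To give a sense of what the “context value added” comparisons described above might look like in practice, here is a minimal sketch, in Python and with made-up school-level data, of the predicted-versus-actual logic: regress schools’ results on contextual factors and treat the difference between actual and predicted performance as a school’s “context value added.” The variables, model, and data are illustrative assumptions only, not the Danish or Swedish authorities’ actual procedures.

```python
# Minimal sketch (hypothetical data): school-level "context value added" as the gap
# between a school's actual average result and the result predicted from its context.
import numpy as np

rng = np.random.default_rng(1)
n_schools = 50  # assumed number of schools, for illustration only

# Hypothetical contextual factors per school
parent_education = rng.normal(12, 2, n_schools)      # mean years of parental schooling
share_female = rng.uniform(0.4, 0.6, n_schools)      # gender composition
share_immigrant = rng.uniform(0.0, 0.5, n_schools)   # student origin

# Hypothetical actual average exam score per school
actual = 30 + 2.0 * parent_education - 10.0 * share_immigrant + rng.normal(0, 3, n_schools)

# Ordinary least squares prediction of each school's score from its context
X = np.column_stack([np.ones(n_schools), parent_education, share_female, share_immigrant])
coef, *_ = np.linalg.lstsq(X, actual, rcond=None)
predicted = X @ coef

# "Context value added": how far each school sits above or below its contextual prediction
cva = actual - predicted
for i in np.argsort(cva)[::-1][:3]:
    print(f"School {i:2d}: actual {actual[i]:.1f}, predicted {predicted[i]:.1f}, CVA {cva[i]:+.1f}")
```

Note that this is a school-level descriptive comparison, which is considerably more modest than attaching such residuals to individual teachers for consequential decisions.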

The US-based World Bank is also backing such international moves, as is the Pearson testing corporation via its Learning Curve Project, which is relying on input from some of the most prominent VAM advocates, including Eric Hanushek (see prior posts on Hanushek here and here) and Raj Chetty (see prior posts on Chetty here and here), to promote itself as a player in the universe of VAMs. This makes sense, “[c]onsidering Pearson’s aspirations to be a global education company… particularly in low-income countries” (p. 7). On that note, also as per Sørensen, “education systems in low-income countries might prove [most] vulnerable in the coming years as international donors and for-profit enterprises appear to be endorsing VAM as a means to raise school and teacher quality” in such educationally struggling nations (p. 2).

See also a related blog post about Sørensen’s piece here, as written by him on the Education in Crisis blog, which is also sponsored by Education International. In this piece he also discusses the use of data for political purposes, as is too often the case with VAMs when “the use of statistical tools as policy instruments is taken too far…towards bounded rationality in education policy.”

In short, “VAM, if it has any use at all, must expose the misleading use of statistical mumbo jumbo that effectively #VAMboozles [thanks for the shout out!!] teachers, schools and society. This could help to spark some much needed reflection on the basic propositions of school effectiveness, the negative effects of putting too much trust in numbers, and lead us to start holding policy-makers to account for their misuse of data in policy formation.”

Reference: Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels, Belgium: Education International. Retrieved from http://download.ei-ie.org/Docs/WebDepot/2016_EI_VAM_EN_final_Web.pdf

You Are Invited to Participate in the #HowMuchTesting Debate!

As the scholarly debate about the extent and purpose of educational testing rages on, the American Educational Research Association (AERA) wants to hear from you. During a key session at its Centennial Conference this spring in Washington DC, titled How Much Testing and for What Purpose? Public Scholarship in the Debate about Educational Assessment and Accountability, prominent educational researchers will respond to questions and concerns raised by YOU: parents, students, teachers, community members, and the public at large.

Hence, any and all of you with an interest in testing, value-added modeling, educational assessment, educational accountability policies, and the like are invited to post your questions, concerns, and comments using the hashtag #HowMuchTesting on Twitter, Facebook, Instagram, Google+, or the social media platform of your choice, as these are the posts to which AERA’s panelists will respond.

Organizers are interested in all #HowMuchTesting posts, but they are particularly interested in video-recorded questions and comments of 30-45 seconds in duration, so that you can ask your own questions rather than having them read by a moderator. In addition, in order to provide ample time for the panel of experts to prepare for the discussion, comments and questions posted by March 17 have the best chances for inclusion in the debate.

Thank you all in advance for your contributions!!

To read more about this session, from the session’s organizer, click here.

Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)

David C. Berliner, Regents’ Professor Emeritus here at Arizona State University (ASU), who also just happens to be my former albeit forever mentor, recently took up research on the use of test scores to evaluate teachers, for example, using value-added models (VAMs). While David is world-renowned for his research in educational psychology and, more specific to this case, for his expertise on effective teaching behaviors and how to capture and observe them, he has also now ventured into the VAM-related debates.

Accordingly, he recently presented his newest, soon-to-be-published research on using standardized tests to evaluate teachers, something he aptly termed in the title of his presentation “A Policy Fiasco.” He delivered his speech to an audience in Melbourne, Australia, and you can click here for the full video-taped presentation; however, given that the whole presentation takes about one hour to watch (though I must say watching the full hour is well worth it), I highlight below his key points. These should certainly be of interest to you all as followers of this blog, and hopefully others.

Of main interest are his 14 reasons, “big and small,” for [his] judgment that assessing teacher competence using standardized achievement tests is nearly worthless.

Here are his fourteen reasons:

  1. “When using standardized achievement tests as the basis for inferences about the quality of teachers, and the institutions from which they came, it is easy to confuse the effects of sociological variables on standardized test scores” and the effects teachers have on those same scores. Sociological variables (e.g., chronic absenteeism) continue to distort even the best attempts to disentangle them from the very instructional variables of interest. These, what we also term biasing variables, are important not to inappropriately dismiss as purportedly statistically “controlled for.”
  2. In law, we do not hold people accountable for the actions of others; for example, when a child kills another child, the parents are not charged as guilty. Hence, “[t]he logic of holding [teachers and] schools of education responsible for student achievement does not fit into our system of law or into the moral code subscribed to by most western nations.” Relatedly, should medical schools, or doctors for that matter, be held accountable for the health of their patients? One of the best parts of his talk, in fact, is about the medical field and the corollaries Berliner draws between doctors and medical schools, and teachers and colleges of education, respectively (around the 19-25 minute mark of his video presentation).
  3. Professionals are often held harmless for their lower success rates with clients who have observable difficulties in meeting the demands and the expectations of the professionals who attend to them. In medicine again, for example, when working with impoverished patients, “[t]here is precedent for holding [doctors] harmless for their lowest success rates with clients who have observable difficulties in meeting the demands and expectations of the [doctors] who attend to them, but the dispensation we offer to physicians is not offered to teachers.”
  4. There are other quite acceptable sources of data, besides tests, for judging the efficacy of teachers and teacher education programs. People accept the fact that treatment and medicine may not result in the cure of a disease; practicing good medicine is the goal, whether or not the patient gets better or lives. It is equally true that competent teaching can occur independent of student learning or of the achievement test scores that serve as proxies for said learning. A teacher can literally “save lives” and not move the metrics used to measure teacher effectiveness.
  5. Reliance on standardized achievement test scores as the source of data about teacher quality will inevitably promote confusion between “successful” instruction and “good” instruction. “Successful” instruction gets test scores up. “Good” instruction leaves lasting impressions, fosters further interest by the students, makes them feel competent in the area, etc. Good instruction is hard to measure, but remains the goal of our finest teachers.
  6. Related, teachers affect individual students greatly, but affect standardized achievement test scores very little. All can think of how their own teachers impacted their lives in ways that cannot be captured on a standardized achievement test. Standardized achievement test scores are much more related to home, neighborhood, and cohort than they are to teachers’ instructional capabilities. In more contemporary terms, this is also due to the fact that large-scale standardized tests have (still) never been validated to measure student growth over time, nor have they been validated to attribute that growth to teachers. “Teachers have huge effects, it’s just that the tests are not sensitive to them.”
  7. Teachers’ effects on standardized achievement test scores fade quickly, becoming barely discernible after a few years. So we might not want to overly worry about most teachers’ effects on their students, good or bad, as they are hard to detect on tests after two or so years. To use these ephemeral effects to then hold teacher education programs accountable seems even more problematic.
  8. Observational measures of teacher competency and achievement-test-based measures of teacher competency do not correlate well. This suggests nothing more than that one or both of these measures, and likely the latter, are malfunctioning in their capacities to measure the teacher effectiveness construct. See other VAMboozled posts about this here, here, and here.
  9. Different standardized achievement tests, both purporting to measure reading, mathematics, or science at the same grade level, will give different estimates of teacher competency. That is because different test developers have different visions of what it means to be competent in each of these subject areas. Thus one achievement test in these subject areas could find a teacher exemplary, but another test of those same subject areas would find the teacher lacking. What then? Have we an unstable teacher or an ill-defined subject area?
  10. Tests can be administered early or late in the fall, early or late in the spring, and the dates they are given influence the judgments about whether a teacher is performing well or poorly. Teacher competency should not be determined by minor differences in the date of testing, but that happens frequently.
  11. No standardized achievement tests have provided proof that their items are instructionally sensitive. If test items do not, because they cannot, “react to good instruction,” how can one make the claim that the test items are “tapping good instruction”?
  12. Teacher effects show up more dramatically on teacher-made tests than on standardized achievement tests because the former are based on the enacted curriculum, while the latter are based on the desired curriculum. Tests become roughly seven times more instructionally sensitive the closer they are to the classroom (i.e., teacher-made tests).
  13. The opt-out testing movement invalidates inferences about teachers and schools that can be made from standardized achievement test results. It’s not bad to remove these kids from taking these tests, and perhaps it is even necessary in our over-tested schools, but the tests, and the VAM estimates derived via these tests, are far less valid when that happens. This is because the students who opt out are likely different in significant ways from those who do take the tests. This severely limits the validity claims that are made.
  14. Assessing new teachers with standardized achievement tests is likely to yield many false negatives. That is, the assessments would identify teachers early in their careers as ineffective in improving test scores, which is, in fact, often the case for new teachers. Two or three years later that could change. Perhaps the last thing we want to do in a time of teacher shortage is discourage new teachers while they acquire their skills.

Doug Harris on the “Every Student Succeeds Act” and Reasons to Continue with VAMs Regardless

Two weeks ago I wrote a post about the passage of the “Every Student Succeeds Act” (ESSA), and its primary purpose to reduce “the federal footprint and restore local control, while empowering parents and education leaders to hold schools accountable for effectively teaching students” within their states. More specific to VAMs, I wrote about how ESSA will allow states to decide how to weight their standardized test scores and decide whether and how to evaluate teachers with or without said scores.

Doug Harris, an Associate Professor of Economics at Tulane in Louisiana and, as I’ve written prior, “a ‘cautious’ but quite active proponent of VAMs” (see another two posts about Harris here and here), recently took to an Education Week blog to respond to ESSA as well. His post, titled “NCLB 3.0 Could Improve the Use of Value-Added and Other Measures,” can be read with a subscription. For those of you without a subscription, however, I’ve highlighted some of his key arguments below, along with my responses to those I felt were most pertinent here.

In his post, Harris argues that one element of ESSA “clearly reinforces value-added–the continuation of annual testing.” He treats these (i.e., value-added and testing) as synonyms, with which I would certainly disagree. The latter precedes the former, and without the latter the former would not exist. In other words, what matters is what one does with the test scores; test scores do not, by default, mean the same thing as VAMs.

In addition, while the continuation of annual testing is written into ESSA, the use of VAMs is not, and their use for teacher evaluations, or not, is now left to each state to decide. Hence, it is true that ESSA “means the states will no longer have to abide by [federal/NCLB] waivers and there is a good chance that some states will scale back aggressive evaluation and accountability [measures].”

Harris discusses the controversies surrounding VAMs, as argued by both VAM proponents and critics, and he contends in the end that “[i]t is difficult to weigh these pros and cons… because we have so little evidence on the [formative] effects of using value-added measures on teaching and learning.” I would argue that we have plenty of evidence on the formative effects of using VAMs, given the evidence dating back now almost thirty years (e.g., from Tennessee). This is one area of research where I do not believe we need more evidence to establish the lack of formative effects, or rather of instructionally informative benefits, to be drawn from VAM use. Again, this is largely because VAM estimates are based on tests that are so far removed from the realities of the classroom that the tests, as well as the VAM estimates derived thereafter, are not “instructionally sensitive.” Should Harris argue for value-added using “instructionally sensitive” tests, perhaps this suggestion for future research might carry more weight or potential.

Harris also discusses some “other ways to use value-added” should states still so decide (e.g., within a flagging system whereby VAMs, as the more “objective” measure, could be used to raise red flags for individual teachers who might require further investigation using other, less “objective” measures). Given the millions in taxpayer revenues it would require to do even this, again for only the roughly 30% of all teachers who are primarily elementary school teachers of primary subject areas, cost should certainly be of note here. This suggestion also overlooks VAMs’ still-prevalent methodological and measurement issues, and how these issues should likely prevent VAMs from even playing a primary role as the key criterion used to flag teachers. This is also not warranted from an educational measurement standpoint.

Harris continues that “The point is that it would be a shame if we went back to a no-feedback world, and even more of a shame if we did so because of an over-stated connection between evaluation, accountability, and value-added.” In my professional opinion, he is incorrect on both of these points: (1) Teachers are not (really) using the feedback they are getting from their VAM reports, as described prior, and for a variety of reasons including transparency, usability, actionability, etc.; hence, we are nowhere nearer to some utopian “feedback world” than we were pre-VAMs. All of the related research points to observational feedback, actually, as the most useful when it comes to garnering and using formative feedback. (2) There is nothing “over-stated” in terms of the “connection between evaluation, accountability, and value-added.” Rather, what he terms as possibly “over-stated” is real in practice, and over-abused rather than merely over-stated. This is not just a matter of semantics.

Finally, the one point upon which we agree, with some distinction, is that ESSA “is also an opportunity because it requires schools to move beyond test scores in accountability. This will mean a reduced focus on value-added to student test scores, but that’s OK if the new measures provide a more complete assessment of the value [defined in much broader terms] that schools provide.” ESSA, then, offers the nation “a chance to get value-added [defined in much broader terms] right and correct the problems with how we measure teacher and school performance.” If we define “value-added” in much broader terms, even if test scores are not a part of the “value-added” construct, this would likely be a step in the right direction.

Another Take on New Mexico’s Ruling on the State’s Teacher Evaluation/VAM System

John Thompson, a historian and teacher, wrote a post just published in Diane Ravitch’s blog (here) in which he took a closer look at the New Mexico court decision of which I was a part and which I covered a few weeks ago (here). This is the case in which state District Judge David K. Thomson, who presided over the five-day teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data.

Historian and teacher John Thompson adds another, independent take on this ruling, again here, having also read through Judge Thomson’s entire ruling. Here’s what he wrote:

New Mexico District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. As Audrey Amrein-Beardsley explains, the state “can proceed with ‘developing’ and ‘improving’ its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.”

This is wonderful news. As the American Federation of Teachers observes, “Superintendents, principals, parents, students and the teachers have spoken out against a system that is so rife with errors that in some districts as many as 60 percent of evaluations were incorrect. It is telling that the judge characterizes New Mexico’s system as a ‘policy experiment’ and says that it seems to be a ‘Beta test where teachers bear the burden for its uneven and inconsistent application.’”

A close reading of the ruling makes it clear that this case is an even greater victory over the misuse of test-driven accountability than even the jubilant headlines suggest. It shows that Judge Thomson made the right ruling on the key issues for the right reasons, and he seems to be predicting that other judges will be following his legal logic. Litigation over value-added teacher evaluations is being conducted in 14 states, and the legal battleground is shifting to the place where corporate reformers are weakest. No longer are teachers being forced to prove that there is no rational basis for defending the constitutionality of value-added evaluations. Now, the battleground is shifting to the actual implementation of those evaluations and how they violate state laws.

Judge Thomson concludes that the state’s evaluation systems don’t “resemble at all the theory” they were based on. He agreed with the district superintendent who compared it to the Wizard of Oz, where “the guy is behind the curtain and pulling levers and it is loud.” Some may say that the Wizard’s behavior is “understandable,” but that is not the judge’s concern. The Court must determine whether the consequences are assessed by a system that is “objective and uniform.” Clearly, it has been impossible in New Mexico and elsewhere for reformers to meet the requirements they mandated, and that is the legal terrain where VAM proponents must now fight.

The judge thus concludes, “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate.”

The state of New Mexico counters by citing cases in Florida and Tennessee as precedents. But, Judge Thomson writes that those cases ruled against early challenges based on equal protection or constitutional issues, as they have also cited practical concerns in implementation. He writes of the Florida (Cook) case, “The language in the Cook case could be lifted from the Court findings in this case.” That state’s judge decided “‘The unfairness of this system is not lost on this Court.’” Judge Thomson also argues, “The (Florida) Court in fact seemed to predict the type of legal challenge that could result …‘The individual plaintiffs have a separate remedy to challenge an evaluation on procedural due process grounds if an evaluation is actually used to deprive the teacher of an evaluation right.’”

The question in Florida and Tennessee had been whether there was “a conceivable rational basis” for proceeding with the teacher evaluation policy experiment. Below are some of the more irrational results of those evaluations. The facts in the New Mexico case may be somewhat more absurd than those in other places that have implemented VAMs but, given the inherent flaws in those evaluations, I doubt they are qualitatively worse. In fact, Audrey Amrein-Beardsley testified about a similar outcome in Houston, which was as awful as the New Mexico travesties and led to about one-fourth of the teachers subject to those evaluations being placed on “growth plans.”

As has become common across the nation, New Mexico teachers have been evaluated on students who aren’t in the teachers’ classrooms. They have been held accountable for test results from subjects that the teacher didn’t teach. Science teachers might be evaluated on a student taught in 2011, based on how that student scored in 2013.

The judge cited testimony regarding a case where 50% of the teachers rated Minimally Effective had missing data due to reassignment to a wrong group. One year, a district questioned the state’s data, and immediately it saw an unexplained 11% increase in effective teachers. The next year, also without explanation, the state’s original numbers on effectiveness were reduced by 6%.

One teacher taught 160 students but was evaluated on scores of 73 of them and was then placed on a plan for improvement. Because of the need to quantify the effectiveness of teachers in Group B and Group C, who aren’t subject to state End of Instruction tests, there are 63 different tests being used in one district to generate high-stakes data. And, when changing tests to the Common Core PARCC test, the state has to violate scientific protocol, and mix and match test score results in an indefensible manner. Perhaps just as bad, in 2014-15, 76% of teachers were still being evaluated on less than three years of data.

The Albuquerque situation seems exceptionally important because it serves 25% of the state’s students, and it is the type of high-poverty system where value-added evaluations are likely to be most unreliable and invalid. It had 1728 queries about data and 28% of its teachers ranked below the Effective level. The judge noted that if you teach a core subject, you are twice as likely as a French teacher to be judged Ineffective. But, that was not the most shocking statistic. In Albuquerque, Group A elementary teachers (where VAMs play a larger role) are five times more likely to be rated below Effective than their colleagues in Group B. In Roswell, Group B teachers are three times more likely to be rated below Effective than Group C teachers.

Curiously, VAM advocate Tom Kane testified, but he did so in a way that made it unclear whether he saw himself as a witness for the defense or the plaintiffs. When asked about Amrein-Beardsley’s criticism of using tests that weren’t designed for evaluating teachers, Kane countered that the Gates Foundation MET study used random samples and concluded that differing tests could be used in a way that was “useful in evaluating teachers” and valid predictors of student achievement. Kane also replied that he could estimate the state’s error rate “on average,” but he couldn’t estimate error rates for individual teachers. He did not address the judge’s real concern about whether New Mexico’s use of VAMs was uniform and objective.

I am not a lawyer but I have years of experience as a legal historian. Although I have long been disappointed that the legal profession did not condemn value-added evaluations as a violation of our democracy’s fundamental principles, I also knew that the first wave of lawsuits challenging VAMs would face an uphill battle. Using teachers as guinea pigs in a risky experiment, where non-educators imposed their untested opinions on public schools, was always bad policy. Along with their other sins, value-added evaluations would mean collective punishment of some teachers merely for teaching in schools and classes where it is harder to meet dubious test score growth targets. But, many officers of the court might decide that they did not have the grounds to overrule new teacher evaluation laws. They might have to hold their noses while ruling in favor of laws that make a mockery of our tenets of fairness in a constitutional democracy.

During the last few years, rather than force those who would destroy the hard-earned legal rights of teachers to meet the legal standard of “strict scrutiny,” those who would fire teachers without proving that their data was reliable and valid have mostly had to show that their policies were not irrational. Now that their policies are being implemented, reformers must defend the ways that their VAMs are actually being used. Corporate reformers and the Duncan administration were able to coerce almost all of the states into writing laws requiring quantitative components in teacher evaluations. Not surprisingly, it has often proven impossible to implement their schemes in a rational manner.

In theory, corporate reformers could have won if they required the high-stakes use of flawed metrics while maintaining the message discipline that they are famous for. School administrators could have been trained to say that they were merely enforcing the law when they assessed consequences based on metrics. Their job would have been to recite the standard soundbite when firing teachers – saying that their metrics may or may not reflect the actual performance of the teacher in question – but the law required that practice. Life’s not fair, they could have said, and whether or not the individual teacher was being unfairly sacrificed, the administrators who enforced the law were just following orders. It was the will of the lawmakers that the firing of the teachers with the lowest VAMs – regardless of whether the metric reflected actual effectiveness – would make schools more like corporations, so practitioners would have to accept it. But, this is one more case where reformers ignored the real world, did not play out the education policy and legal chess game, and did not anticipate that rulings such as Judge Thomson’s would soon be coming.

In the real world, VAM advocates had to claim that VAM results represented the actual effectiveness of teachers and that, somehow, their scheme would someday improve schools. This liberated teachers and administrators to fight back in the courts. Moreover, top-down reformers set out to impose the same basic system on every teacher, in every type of class and school, in our diverse nation. When this top-down micromanaging met reality, proponents of test-driven evaluations had to play so many statistical games, create so many made-up metrics, and improvise in so many bizarre ways, that the resulting mess would be legally indefensible.

And, that is why the cases in Florida and Tennessee might soon be seen as the end of the beginning of the nation’s repudiation of value-added evaluations. The New Mexico case, along with the renewal of the federal ESEA and the departure of Arne Duncan, is clearly the beginning of the end. Had VAM proponents objectively briefed attorneys on the strengths and weaknesses of their theories, they could have thought through the inevitable legal process. On the other hand, I doubt that Kane and his fellow economists knew enough about education to be able to anticipate the inevitable, unintended results of their theories on schools. In numerous conversations with VAM true believers, rarely have I met one who seemed to know enough about the nuts and bolts about schools to be able to brief legal advisors, much less anticipate the inevitable results that would eventually have to be defended in court.