New Mexico’s “New, Bait and Switch” Schemes

“A Concerned New Mexico Parent” sent me another blog entry to help you all stay apprised of the ongoing “situation” in New Mexico with its New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here. In this one, (s)he responds to an editorial recently released in support of the newest version of New Mexico’s teacher evaluation system. The editorial, titled “Teacher evals have evolved but tired criticisms of them have not,” was published in the Albuquerque Journal and written by the Albuquerque Journal Editorial Board.

(S)he writes:

The editorial seems to contain and promote many of the “talking points” provided by NMPED with their latest release of teacher evaluations. Hence, I would like to present a few observations on the editorial.

NMPED and the Albuquerque Journal Editorial Board both underscore the point that teachers are still primarily being (and should primarily continue to be) evaluated on the basis of their own students’ test scores (i.e., using a value-added model (VAM)), but it is actually not that simple. Rather, the new statewide teacher evaluation formula is shown here on their website, with one notable difference being that the state’s “new system” now replaces the previous district-by-district variations that produced 217 scoring categories for teachers (see here for details).

Accordingly, it now appears that NMPED has kept the same 50% student achievement, 25% observations, and 25% multiple measures division as before. The “new” VAM, however, requires a minimum of three years of data for proper use. Without three years of data, NMPED is to use what it calls graduated considerations or “NMTEACH” steps to change the percentages used in the evaluation formulas by teacher type.

A small footnote on the NMTEACH website devoted to teacher evaluations explains these graduated considerations whereby “Each category is weighted according to the amount of student achievement data available for the teacher. Improved student achievement is worth from 0% to 50%; classroom observations are worth 25% to 50%; planning, preparation and professionalism is worth 15% to 40%; and surveys and/or teacher attendance is worth 10%.” In other words, student achievement represents between 0% and 50% of the total; observations between 25% and 50%; planning, preparation, and professionalism between 15% and 40%; and surveys and/or teacher attendance 10%.
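To make the weighting arithmetic concrete, here is a minimal sketch (in Python) of how a composite rating could be computed under the two extremes of the quoted ranges: a teacher with three full years of achievement data (50/25/15/10) versus a teacher with none (0/50/40/10). The component scores are invented for illustration; only the weight ranges come from the NMTEACH footnote quoted above, and this is not NMPED’s actual computation.

```python
# Illustrative sketch of an NMTEACH-style composite rating under the quoted weight ranges.
# Component scores are hypothetical; only the weights reflect the footnote above.

def composite_score(scores, weights):
    """Weighted sum of component scores (weights must sum to 1.0)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(scores[k] * weights[k] for k in weights)

scores = {
    "student_achievement": 62,  # hypothetical VAM-based score
    "observations": 78,         # hypothetical classroom observation score
    "planning_prof": 85,        # planning, preparation, and professionalism
    "surveys_attendance": 90,   # surveys and/or teacher attendance
}

# Teacher with three full years of achievement data (50/25/15/10 split).
full_data_weights = {"student_achievement": 0.50, "observations": 0.25,
                     "planning_prof": 0.15, "surveys_attendance": 0.10}

# Teacher with no usable achievement data (0/50/40/10, per the graduated considerations).
no_data_weights = {"student_achievement": 0.00, "observations": 0.50,
                   "planning_prof": 0.40, "surveys_attendance": 0.10}

print(round(composite_score(scores, full_data_weights), 2))  # achievement drives half the rating
print(round(composite_score(scores, no_data_weights), 2))    # achievement drives none of it
```

The same teacher, with identical component scores, lands in a noticeably different place depending only on which set of weights applies, which is exactly the substitution problem described below.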

The graduated considerations (Steps) are shown below, as per their use when student achievement data are missing and substitutions are needed:

[Image: NMTEACH graduated considerations (Steps) table]

Also, the NMTEACH “Steps” provide for the use of just one year of data (Step 2 is used for 1–2 years of data). I do not see how NMPED can calculate “student improvement” based on just one year’s worth of data.

Hence, this data substitution problem is likely massive. For example, for Category A teachers, 45 of the 58 formulas formerly used will require Step 1 substitutions. For Category B teachers, 112 of 117 prior formulas will require data substitution (Step 1), and all Category C teachers will require data substitution at the Step 1 level.

The reason this presents a huge data problem is that the state’s prior teacher evaluation system did not require the use of so much end-of-course (EOC) data, and so those tests have not been given for three years. Simultaneously, and for Group C teachers, NMPED also introduced a new evaluation assessment plus software called iStation that is also in its first year of use.

Thus, for a typical Category B teacher, the evaluation will be based on 50% observation, 40% planning, preparation, and professionalism, and 10% on attendance.

Amazingly, none of this relates to student achievement, and it looks identical to the former administrator-based teacher evaluation system!

Such a “bait-and-switch” scheme will be occurring for most teachers in the state.

Further, in a small case study I performed on a local New Mexico school (here), I found that not one single teacher in a seven-year period had “good” data for three consecutive years. This also has major implications here given the state’s notorious issues with its data, data management, and the like.

Notwithstanding, the Editorial Board also notes that “The evaluations consider only student improvement, not proficiency.” However, as noted above, little actual student achievement data is available for the strong majority of teachers’ evaluations; hence, how much this will actually count versus how much it may appear to the public to count are two very different things.

Regardless, the Editorial Board thereafter proclaims that “The evaluations only rate teachers’ effect on their students over a school year…” Even the simple phrase “school year” is problematic, however.

The easiest way to explain this is to imagine a student in a dual language program (a VERY common situation in New Mexico). Let’s follow his timeline of instruction and testing:

  • August 2015: The student begins the fourth grade with teachers A1 and A2.
  • March 2016: Seven months into the year the student is tested with test #1 at the 4th-grade level.
  • March 2016 – May 2016: The student finishes fourth grade with teachers A1 and A2.
  • June 2016 – August 2016: Summer vacation — no tests (i.e., differential summer learning and decay occurs).
  • August 2016: The student begins the fifth grade with teachers B1 and B2.
  • March 2017: Seven months into the year the student is tested with test #2 at the 5th-grade level.
  • March 2017 – May 2017: The student finishes fifth grade with teachers B1 and B2.
  • October 2017: A teacher receives a score based on this student’s improvement (along with that of other students like him, although coming from different A-level teachers) from test #1 to test #2.

To simplify: the test improvement is based on tests given before he has completed the grade level of interest, covering material taught by four teachers at two different grade levels over the span of one calendar year [this is something that is known in the literature as prior teachers’ residual effects].
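To put rough numbers on this, here is a minimal sketch (with hypothetical dates and scores, not from the source) of the gain-score arithmetic the timeline implies: the “improvement” credited in October 2017 is just the difference between two tests given a calendar year apart, a window that spans the end of fourth grade, summer, and most of fifth grade.

```python
from datetime import date

# Hypothetical record for the student in the timeline above; scores are invented for illustration.
test_1 = {"date": date(2016, 3, 15), "grade": 4, "score": 710}  # given ~2 months before grade 4 ends
test_2 = {"date": date(2017, 3, 15), "grade": 5, "score": 742}  # given ~2 months before grade 5 ends

gain = test_2["score"] - test_1["score"]
window_days = (test_2["date"] - test_1["date"]).days

# Instruction inside the measurement window actually came from four teachers:
teachers_in_window = ["A1", "A2",   # end of grade 4 (March-May 2016)
                      "B1", "B2"]   # most of grade 5 (August 2016-March 2017)

# Yet the October 2017 evaluation attributes the whole gain to the grade 5 teacher of record.
credited_teacher = "B1"
print(f"Gain of {gain} points over {window_days} days, credited solely to teacher {credited_teacher}.")
```

Whatever teachers A1 and A2 did (or did not do) in the last two months of fourth grade, and whatever happened over the summer, is folded into the number attributed to the fifth-grade teacher.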

And it gets worse. The NMPED requires that a student be assigned to only one teacher. According to the NMTEACH FAQ, in the case of team-teaching, “Students are assigned to one teacher. That teacher would get credit. A school could change teacher assignment each snapshot and thus both teachers would get counted automatically.”

I can only assume the Editorial Board members are brighter than I am because I cannot parse out the teacher evaluation values for my sample student.

Nevertheless, the Editorial Board also gushes with praise regarding the use of teacher attendance as an evaluation tool. This is just morally wrong.

Leave is not “granted” to teachers by some benevolent overlord. It is earned and is part of the union contract between teachers and the state. Imagine a job where you are told that you have two weeks vacation time but, of course, you can only take two days of it or you might be fired. Absurd, right? Well, apparently not if you are NMPED.

This is one of the major issues in the ongoing lawsuit, where, as I recall, one of the plaintiffs was penalized for taking time off for the apparently frivolous task of cancer treatment! NMPED should be ashamed of themselves!

The Editorial Board also praises the new, “no lag time” aspect of the evaluation system. In the past, teacher evaluations were presented at the end of the school year, before student scores were available. Now that the evaluations depend upon student scores, the evaluations appear early in the next school year. As noted in the timeline above, the lag time is still present, contrary to what they assert. Further, these evaluations now come mid-term, after the school year has started and teacher assignments have been made.

In the end, and again per its title, the Editorial Board claims that “Teacher evals have evolved but tired criticisms of them have not.”

The evals have not evolved but have rather devolved to something virtually identical to the former observation and administration-based evaluations. The tired criticisms are tired precisely because they have never been adequately answered by NMPED.

~A Concerned New Mexico Parent

A New Book about VAMs “On Trial”

I recently heard about a new book written by Mark Paige — J.D. and Ph.D., assistant professor of public policy at the University of Massachusetts-Dartmouth, and a former school law attorney — and published by Rowman & Littlefield. The book is about, as per the secondary part of its title, “Understanding Value-Added Models [VAMs] in the Law of Teacher Evaluation.” For those of you who might be interested in reading more, see more on this book, including information about how to purchase it, here, and also via Amazon here.

Clearly, this book should prove very relevant given the ongoing court cases across the country (see a prior post on these cases here) regarding teachers and the systems being used to evaluate them, especially when those systems are heavily (or extremely) reliant upon VAM-based estimates for consequential decision-making purposes (e.g., teacher tenure, pay, and termination). While I have not yet read the book, I just ordered my copy the other day. I suggest you do the same, again, should you be interested in further or better understanding the federal and state law pertinent to these cases.

Notwithstanding, I also requested that the author of this book — Mark Paige — write a guest post so that you too could find out more. Here is what he wrote:

Many of us have been following VAMs in legal circles. Several courts have faced the issue of VAMs as they relate to employment law matters. These cases have tested a chief selling point (pardon [or underscore] the business reference) of VAMs: that they will effectuate, for example, teacher termination with greater ease because nobody besides the advanced statisticians and econometricians can argue with the numbers they derive. In other words, if a teacher’s VAM rating is bad, then the teacher must be bad. It’s supposed to be as simple as that. How can a court deny that reality?

Of course, as we [should] already know, VAMs are anything but certain. Bluntly stated: VAMs are a statistical “hot mess.” The American Statistical Association, among many others, warned in no uncertain terms that VAMs cannot – and should not – be trusted to make significant employment decisions. Of course, that has not stopped many policymakers from a full-throated adoption of their use in employment and evaluation decisions. Talk about hubris.

Accordingly, I recently completed this book, again, that focuses squarely on the intersection of VAMs and the law. Its full title is “Building a Better Teacher: Understanding Value-Added Models in the Law of Teacher Evaluation” (Rowman & Littlefield, 2016). Again, I provide a direct link to the book along with its description here.

To offer a bit of a sneak preview, though, I draw many conclusions throughout the book, but one of two important take-aways is this: VAMs may actually complicate the effectuation of a teacher’s termination. Here’s one way: because VAMs are so statistically infirm, they invite plaintiff-side attorneys to attack any underlying negative decision based on these models. See, for example, Sheri Lederman’s recent New York State Supreme Court decision, here. [See also a related post in this blog here].

In other words, the evidence upon which districts or states rely to make significant decisions is untrustworthy (or arbitrary) and, therefore, so is any decision based, even if only in part, on VAMs. Thus, VAMs may actually strengthen a teacher’s case. This, of course, is quite apart from the fact that VAM use results in firing good teachers based on poor information, thereby contributing to the teacher shortages and lower morale (among many other parades of horribles) being reported across the nation, now perhaps more than ever.

The second important take-away is this, especially given that followers of this blog include many educators and administrators facing a barrage of criticisms that only “de-professionalize” them: Courts have, over time, consistently deferred to the professional judgment of administrators (and their assessment of effective teaching). The members of that august institution – the judiciary – actually believe that educators know best about teaching, and that years of accumulated experience and knowledge have actual and also court-relevant value. That may come as a startling revelation to those who consistently diminish the education profession, or to those who at least feel like they and their efforts are consistently being diminished.

To be sure, the system of educator evaluation is not perfect. Our schools continue to struggle to offer equal and equitable educational opportunities to all students, especially those in the nation’s highest-needs schools. But what this book ultimately concludes is that the continued use of VAMs will not, ahem, add any value to these efforts.

To reach author Mark Paige via email, please contact him at mpaige@umassd.edu. To reach him via Twitter: @mpaigelaw

New Mexico Is “At It Again”

“A Concerned New Mexico Parent” sent me yet another blog entry for you all to stay apprised of the ongoing “situation” in New Mexico and the continuous escapades of the New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here, but in this one (s)he writes what follows:

Well, the NMPED is at it again.

They just released the teacher evaluation results for the 2015-2016 school year. And the report and media press releases are really something.

Readers of this blog are familiar with my earlier documentation of the myriad varieties of scoring formulas used by New Mexico to evaluate its teachers. If I recall, I found something like 200 variations in scoring formulas [see his/her prior post on this here with an actual variation count at n=217].

However, a recent article published in the Albuquerque Journal indicates that, now according to the NMPED, “only three types of test scores are [being] used in the calculation: Partnership for Assessment of Readiness for College and Careers [PARCC], end-of-course exams, and the [state’s new] Istation literacy test.” [Recall from another article released last January that New Mexico’s Secretary of Education Hanna Skandera is also the head of the governing board for the PARCC test].

Further, the Albuquerque Journal article author reports that the “PED also altered the way it classifies teachers, dropping from 107 options to three. Previously, the system incorporated many combinations of criteria such as a teacher’s years in the classroom and the type of standardized test they administer.”

The new statewide evaluation plan is also available in more detail here, although I should add that there has been no published notification of the radical changes in this plan. It was simply and quietly posted on NMPED’s public website.

Important to note, though, is that for Group B teachers (all levels), the many variations documented previously have all been replaced by end-of-course (EOC) exams. Also note that for Group A teachers (all levels) the percentage assigned to the PARCC test has been reduced from 50% to 35%. (Oh, how the mighty have fallen …). The remaining 15% of the Group A score is to be composed of EOC exam scores.

There are only two small problems with this NMPED simplification.

First, in many districts, no EOC exams were given to Group B teachers in the 2015-2016 school year, and none were given in the previous year either. Any EOC scores that might exist were from a solitary administration of EOC exams three years previously.

Second, for Group A teachers whose scores formerly relied solely on the PARCC test for 50% of their score, no EOC exams were ever given.

Thus, NMPED has replaced their policy of evaluating teachers on the basis of students they don’t teach to this new policy of evaluating teachers on the basis of tests they never administered!

Well done, NMPED (not…)

Luckily, NMPED still cannot make any consequential decisions based on these data, again, until NMPED proves to the court that the consequential decisions that they would still very much like to make (e.g., employment, advancement and licensure decisions) are backed by research evidence. I know, interesting concept…

Deep Pockets, Corporate Reform, and Teacher Education

A colleague whom I have never formally met, but with whom I’ve had some interesting email exchanges over the past few months — James D. Kirylo, Professor of Teaching and Learning in Louisiana — recently sent me an email I read and appreciated; hence, I asked him to turn it into a blog post. He responded with a guest post he has titled “Deep Pockets, Corporate Reform, and Teacher Education,” pasted below. Do give this a read, and a social media share, as this one is deserving of some legs.

Here is what he wrote:

Money is power. Money is influence. Money shapes direction. Notwithstanding its influential nature in the electoral process, one only needs to see how bags of dough from the mega-rich one-percenters—largely led by Bill Gates—have bought their way into K-12 education in their attempt to corporatize it (see, for example, here).

This corporatization works to defund public education, grossly blames teachers for all that ails society, is obsessed with testing, and aims to privatize.  And next on the corporatized docket: teacher education programs.

In a recent piece by Valerie Strauss, “Gates Foundation Puts Millions of Dollars into New Education Focus: Teacher Preparation,” she sketches how Gates is awarding $35 million to a three-year project called Teacher Preparation Transformation Centers, funneled through five different projects, one of which is the Texas Tech-based University-School Partnerships for the Renewal of Educator Preparation (U.S. Prep) National Center.

A framework that will guide this “renewal” of educator preparation comes from the National Institute for Excellence in Teaching (NIET), along with the peddling of its programs, The System for Teacher and Student Advancement (TAP) and the Student and Best Practices Center (BPC). Yet again, coming from another guy with oodles of money, leading the charge at NIET is Lowell Milken, who is NIET Chairman and TAP founder (see, for example, here).

The state of Louisiana serves as an example of how NIET is already working overtime in chipping its way into K-12 education. One can spend hours at the Louisiana Department of Education (LDE) website and view the various links on how TAP is applying a full-court press in hyping its brand (see, for example, here).

And now that TAP has entered the K-12 door in Louisiana, the brand is squiggling its way into teacher education preparation programs, namely through the Texas Tech-based U.S. Prep National Center. This Gates Foundation-backed project involves five teacher education programs in the country (Southern Methodist University, University of Houston, Jackson State University, and the University of Memphis), including one in Louisiana (Southeastern Louisiana University) (see more information about this here).

Therefore, teacher educators must be “trained” to use TAP in order to “rightly” inculcate the prescription to teacher candidates.

TAP: Four Elements of Success

TAP principally plugs four Elements of Success: Multiple Career Paths (for educators as career, mentor, and master teachers); Ongoing Applied Professional Growth (through weekly cluster meetings, follow-up support in the classroom, and coaching); Instructionally Focused Accountability (through multiple classroom observations and evaluations utilizing a research-based instrument and rubric that identifies effective teaching practices); and Performance-Based Compensation (based on multiple measures of performance, including student achievement gains and teachers’ instructional practices).

And according to the TAP literature, the elements of success “…were developed based upon scientific research, as well as best practices from the fields of education, business, and management” (see, for example, here). Recall, perhaps, that No Child Left Behind (NCLB) was also based on “scientific-based” research. Enough said. It is also interesting to note their use of the words “business” and “management” when referring to educating our children. Regardless, “The ultimate goal of TAP is to raise student achievement” so students will presumably be better equipped to compete in the global society (see, for example, here). 

While each element is worthy of discussion, a brief comment is in order on the first element, Multiple Career Paths, and the fourth element, Performance-Based Compensation. Regarding the former, TAP has created a mini-hierarchy within already-hierarchical school systems (which most are) by identifying three potential sets of teachers, to reiterate from the above: a “career” teacher, a “mentor” teacher, and a “master” teacher. A “career” teacher as opposed to what? As opposed to a “temporary” teacher, a Teach For America (TFA) teacher, a substitute teacher? But, of course, according to TAP, as opposed to a “mentor” teacher and a “master” teacher.

This certainly begs the question: Why in the world would any parent want their child to be taught by a “career” teacher as opposed to a “mentor” teacher or better yet a “master” teacher? Wouldn’t we want “master” teachers in all our classrooms? To analogize, I would rather have a “master” doctor performing heart surgery on me than a “lowly” career doctor. Indeed, words, language, and concepts matter.

With respect to the latter, the notion of having an ultimate goal of raising student achievement is perhaps little more than a euphemism for raising test scores, cultivating a test-centric way of doing things.

Achievement and VAM

That is, instead of focusing on learning, opportunity, developmentally appropriate practices, and falling in love with learning, “achievement” is the goal of TAP. Make no mistake, this is far from an argument about semantics. And this “achievement,” linked to student growth and then to merit pay, relies heavily on a VAM-aligned rubric.

Yet, there are multiple problems with VAM, an instrument that has been used in K-12 education since 2011. Among many other outstanding sources, one may simply want to check out this cleverly named blog here, “VAMboozled,” or see what Diane Ravitch has said about VAMs (among other places, see, for example, here), not to mention the well-visited site produced by Mercedes Schneider here. Finally, see the 2015 position statement issued by the American Educational Research Association (AERA) regarding VAMs here, as well as a similar statement issued by the American Statistical Association (ASA) here.

Back to the Gates Foundation and the Texas Tech-based U.S. Prep National Center, though. To restate, at the aforementioned university in Louisiana (though likely in the other four recruited institutions as well), TAP will be the chief vehicle that drives this process, and teacher education programs will be used as the host to prop up the brand.

With presumably some very smart, well-educated, talented, and experienced professionals at the respective teacher education sites, how is it possible that they capitulated to being the samples in the petri dish that will only work to enculturate the continuation of corporate reform, which will predictably lead to what Hofstra University Professor Alan Singer calls the “McDonaldization of Teacher Education“?

Strauss puts the question this way, “How many times do educators need to attempt to reinvent the wheel just because someone with deep pockets wants to try when the money could almost certainly be more usefully spent somewhere else?” I ask this same question, in this case, here.

Another Take on New Mexico’s Ruling on the State’s Teacher Evaluation/VAM System

John Thompson, a historian and teacher, wrote a post just published in Diane Ravitch’s blog (here) in which he took a closer look at the New Mexico court decision of which I was a part and which I covered a few weeks ago (here). This is the case in which state District Judge David K. Thomson, who presided over the five-day teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data.

Historian/Teacher John Thompson adds another, and also independent take on this ruling, again here, also having read through Judge Thomson’s entire ruling. Here’s what he wrote:

New Mexico District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. As Audrey Amrein-Beardsley explains, the state “can proceed with ‘developing’ and ‘improving’ its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.”

This is wonderful news. As the American Federation of Teachers observes, “Superintendents, principals, parents, students and the teachers have spoken out against a system that is so rife with errors that in some districts as many as 60 percent of evaluations were incorrect. It is telling that the judge characterizes New Mexico’s system as a ‘policy experiment’ and says that it seems to be a ‘Beta test where teachers bear the burden for its uneven and inconsistent application.’”

A close reading of the ruling makes it clear that this case is an even greater victory over the misuse of test-driven accountability than even the jubilant headlines suggest. It shows that Judge Thomson made the right ruling on the key issues for the right reasons, and he seems to be predicting that other judges will be following his legal logic. Litigation over value-added teacher evaluations is being conducted in 14 states, and the legal battleground is shifting to the place where corporate reformers are weakest. No longer are teachers being forced to prove that there is no rational basis for defending the constitutionality of value-added evaluations. Now, the battleground is shifting to the actual implementation of those evaluations and how they violate state laws.

Judge Thomson concludes that the state’s evaluation systems don’t “resemble at all the theory” they were based on. He agreed with the district superintendent who compared it to the Wizard of Oz, where “the guy is behind the curtain and pulling levers and it is loud.” Some may say that the Wizard’s behavior is “understandable,” but that is not the judge’s concern. The Court must determine whether the consequences are assessed by a system that is “objective and uniform.” Clearly, it has been impossible in New Mexico and elsewhere for reformers to meet the requirements they mandated, and that is the legal terrain where VAM proponents must now fight.

The judge thus concludes, “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate.”

The state of New Mexico counters by citing cases in Florida and Tennessee as precedents. But, Judge Thomson writes that those cases ruled against early challenges based on equal protection or constitutional issues, even as they also cited practical concerns about implementation. He writes of the Florida (Cook) case, “The language in the Cook case could be lifted from the Court findings in this case.” That state’s judge decided “‘The unfairness of this system is not lost on this Court.’” Judge Thomson also argues, “The (Florida) Court in fact seemed to predict the type of legal challenge that could result …‘The individual plaintiffs have a separate remedy to challenge an evaluation on procedural due process grounds if an evaluation is actually used to deprive the teacher of an evaluation right.’”

The question in Florida and Tennessee had been whether there was “a conceivable rational basis” for proceeding with the teacher evaluation policy experiment. Below are some of the more irrational results of those evaluations. The facts in the New Mexico case may be somewhat more absurd than those in other places that have implemented VAMs but, given the inherent flaws in those evaluations, I doubt they are qualitatively worse. In fact, Audrey Amrein-Beardsley testified about a similar outcome in Houston, which was as awful as the New Mexico travesties and led to about one-fourth of the teachers subject to those evaluations being placed on “growth plans.”

As has become common across the nation, New Mexico teachers have been evaluated on students who aren’t in the teachers’ classrooms. They have been held accountable for test results from subjects that the teacher didn’t teach. Science teachers might be evaluated on a student taught in 2011, based on how that student scored in 2013.

The judge cited testimony regarding a case where 50% of the teachers rated Minimally Effective had missing data due to reassignment to a wrong group. One year, a district questioned the state’s data, and immediately it saw an unexplained 11% increase in effective teachers. The next year, also without explanation, the state’s original numbers on effectiveness were reduced by 6%.

One teacher taught 160 students but was evaluated on scores of 73 of them and was then placed on a plan for improvement. Because of the need to quantify the effectiveness of teachers in Group B and Group C, who aren’t subject to state End of Instruction tests, there are 63 different tests being used in one district to generate high-stakes data. And, when changing tests to the Common Core PARCC test, the state has to violate scientific protocol, and mix and match test score results in an indefensible manner. Perhaps just as bad, in 2014-15, 76% of teachers were still being evaluated on less than three years of data.

The Albuquerque situation seems exceptionally important because it serves 25% of the state’s students, and it is the type of high-poverty system where value-added evaluations are likely to be most unreliable and invalid. It had 1728 queries about data and 28% of its teachers ranked below the Effective level. The judge noted that if you teach a core subject, you are twice as likely as a French teacher to be judged Ineffective. But, that was not the most shocking statistic. In Albuquerque, Group A elementary teachers (where VAMs play a larger role) are five times more likely to be rated below Effective than their colleagues in Group B. In Roswell, Group B teachers are three times more likely to be rated below Effective than Group C teachers.

Curiously, VAM advocate Tom Kane testified, but he did so in a way that made it unclear whether he saw himself as a witness for the defense or the plaintiffs. When asked about Amrein-Beardsley’s criticism of using tests that weren’t designed for evaluating teachers, Kane countered that the Gates Foundation MET study used random samples and concluded that differing tests could be used in a way that was “useful in evaluating teachers” and valid predictors of student achievement. Kane also replied that he could estimate the state’s error rate “on average,” but he couldn’t estimate error rates for individual teachers. He did not address the judge’s real concern about whether New Mexico’s use of VAMs was uniform and objective.

I am not a lawyer but I have years of experience as a legal historian. Although I have long been disappointed that the legal profession did not condemn value-added evaluations as a violation of our democracy’s fundamental principles, I also knew that the first wave of lawsuits challenging VAMs would face an uphill battle. Using teachers as guinea pigs in a risky experiment, where non-educators imposed their untested opinions on public schools, was always bad policy. Along with their other sins, value-added evaluations would mean collective punishment of some teachers merely for teaching in schools and classes where it is harder to meet dubious test score growth targets. But, many officers of the court might decide that they did not have the grounds to overrule new teacher evaluation laws. They might have to hold their noses while ruling in favor of laws that make a mockery of our tenets of fairness in a constitutional democracy.

During the last few years, rather than force those who would destroy the hard-earned legal rights of teachers to meet the legal standard of “strict scrutiny,” those who would fire teachers without proving that their data was reliable and valid have mostly had to show that their policies were not irrational. Now that their policies are being implemented, reformers must defend the ways that their VAMs are actually being used. Corporate reformers and the Duncan administration were able to coerce almost all of the states into writing laws requiring quantitative components in teacher evaluations. Not surprisingly, it has often proven impossible to implement their schemes in a rational manner.

In theory, corporate reformers could have won if they required the high-stakes use of flawed metrics while maintaining the message discipline that they are famous for. School administrators could have been trained to say that they were merely enforcing the law when they assessed consequences based on metrics. Their job would have been to recite the standard soundbite when firing teachers – saying that their metrics may or may not reflect the actual performance of the teacher in question – but the law required that practice. Life’s not fair, they could have said, and whether or not the individual teacher was being unfairly sacrificed, the administrators who enforced the law were just following orders. It was the will of the lawmakers that the firing of the teachers with the lowest VAMs – regardless of whether the metric reflected actual effectiveness – would make schools more like corporations, so practitioners would have to accept it. But, this is one more case where reformers ignored the real world, did not play out the education policy and legal chess game, and did not anticipate that rulings such as Judge Thomson’s would soon be coming.

In the real world, VAM advocates had to claim that their results represented the actual effectiveness of teachers and that, somehow, their scheme would someday improve schools. This liberated teachers and administrators to fight back in the courts. Moreover, top-down reformers set out to impose the same basic system on every teacher, in every type of class and school, in our diverse nation. When this top-down micromanaging met reality, proponents of test-driven evaluations had to play so many statistical games, create so many made-up metrics, and improvise in so many bizarre ways, that the resulting mess would be legally indefensible.

And, that is why the cases in Florida and Tennessee might soon be seen as the end of the beginning of the nation’s repudiation of value-added evaluations. The New Mexico case, along with the renewal of the federal ESEA and the departure of Arne Duncan, is clearly the beginning of the end. Had VAM proponents objectively briefed attorneys on the strengths and weaknesses of their theories, they could have thought through the inevitable legal process. On the other hand, I doubt that Kane and his fellow economists knew enough about education to be able to anticipate the inevitable, unintended results of their theories on schools. In numerous conversations with VAM true believers, rarely have I met one who seemed to know enough about the nuts and bolts about schools to be able to brief legal advisors, much less anticipate the inevitable results that would eventually have to be defended in court.

Brookings’ Critique of AERA Statement on VAMs, and Henry Braun’s Rebuttal

Two weeks ago I published a post about the newly released “American Educational Research Association (AERA) Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.”

In this post I also included a summary of the AERA Council’s eight key, and very important, points about VAMs and VAM use. I also noted that I contributed to this piece in one of its earliest forms. More importantly, however, the person who managed the statement’s external review and also assisted the AERA Council in producing the final statement before it was officially released was Boston College’s Dr. Henry Braun, Boisi Professor of Education and Public Policy and Educational Research, Measurement, and Evaluation.

Just this last week, the Brookings Institution published a critique of the AERA statement for, in my opinion, no other apparent reason than just to be critical. The critique, written by Brookings affiliate Michael Hansen and the University of Washington Bothell’s Dan Goldhaber, is titled “Response to AERA Statement on Value-Added Measures: Where Are the Cautionary Statements on Alternative Measures?”

Accordingly, I invited Dr. Henry Braun to respond, and he graciously agreed:

In a recent posting, Michael Hansen and Dan Goldhaber complain that the AERA statement on the use of VAMs does not take a similarly critical stance with respect to “alternative measures”. True enough! The purpose of the statement is to provide a considered, research-based discussion of the issues related to the use of value-added scores for high-stakes evaluation. It culminates in a set of eight requirements to be met before such use should be made.

The AERA statement does not stake out an extreme position. First, it is grounded in the broad research literature on drawing causal inferences from observational data subject to strong selection (i.e., the pairings of teachers and students is highly non-random), as well as empirical studies of VAMs in different contexts. Second, the requirements are consistent with the AERA, American Psychological Association (APA), and National Council on Measurement in Education (NCME) Standards for Educational and Psychological Testing. Finally, its cautions are in line with those expressed in similar statements released by the Board on Testing and Assessment of the National Research Council and by the American Statistical Association.

Hansen and Goldhaber are certainly correct when they assert that, in devising an accountability system for educators, a comparative perspective is essential. One should consider the advantages and disadvantages of different indicators, which ones to employ, how to combine them and, most importantly, consider both the consequences for educators and the implications for the education system as a whole. Nothing in the AERA statement denies the importance of subjecting all potential indicators to scrutiny. Indeed, it states: “Justification should be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process.” Of course, guidelines for designing evaluation systems would constitute a challenge of a different order!

In this context, it must be recognized that rankings based on VAM scores and ratings based on observational protocols will necessarily have different psychometric and statistical properties. Moreover, they both require a “causal leap” to justify their use: VAM scores are derived directly from student test performance, but require a way of linking to the teacher of record. Observational ratings are based directly on a teacher’s classroom performance, but require a way of linking back to her students’ achievement or progress.

Thus, neither approach is intrinsically superior to the other. But the singular danger with VAM scores, being the outcome of a sophisticated statistical procedure, is that they are seen by many as providing a gold standard against which other indicators should be judged. Both the AERA and ASA statements offer a needed corrective, by pointing out the path that must be traversed before an indicator based on VAM scores approaches the status of a gold standard. Though the requirements listed in the AERA statement may be aspirational, they do offer signposts against which we can judge how far we have come along that path.

Henry Braun, Lynch School of Education, Boston College

Why Gene Glass is No Longer a Measurement Specialist

One of my mentors – Dr. Gene Glass (formerly at ASU and now at Boulder) wrote a letter earlier this week on his blog, titled “Why I Am No Longer a Measurement Specialist.” This is a must read for all of you following the current policy trends not only surrounding teacher-level accountability, but also high-stakes testing in general.

Gene – one of the most well-established and well-known measurement specialists in and outside of the field of education, and world renowned for developing “meta-analysis” – writes:

I was introduced to psychometrics in 1959. I thought it was really neat. By 1960, I was programming a computer on a psychometrics research project funded by the Office of Naval Research. In 1962, I entered graduate school to study educational measurement under the top scholars in the field.

My mentors – both those I spoke with daily and those whose works I read – had served in WWII. Many did research on human factors — measuring aptitudes and talents and matching them to jobs. Assessments showed who were the best candidates to be pilots or navigators or marksmen. We were told that psychometrics had won the war; and of course, we believed it.

The next wars that psychometrics promised it could win were the wars on poverty and ignorance. The man who led the Army Air Corps effort in psychometrics started a private research center. (It exists today, and is a beneficiary of the millions of dollars spent on Common Core testing.) My dissertation won the 1966 prize in Psychometrics awarded by that man’s organization. And I was hired to fill the slot recently vacated by the world’s leading psychometrician at the University of Illinois. Psychometrics was flying high, and so was I.

Psychologists of the 1960s & 1970s were saying that just measuring talent wasn’t enough. Talents had to be matched with the demands of tasks to optimize performance. Measure a learning style, say, and match it to the way a child is taught. If Jimmy is a visual learner, then teach Jimmy in a visual way. Psychometrics promised to help build a better world. But twenty years later, the promises were still unfulfilled. Both talent and tasks were too complex to yield to this simple plan. Instead, psychometricians grew enthralled with mathematical niceties. Testing in schools became a ritual without any real purpose other than picking a few children for special attention.

Around 1980, I served for a time on the committee that made most of the important decisions about the National Assessment of Educational Progress. The project was under increasing pressure to “grade” the NAEP results: Pass/Fail; A/B/C/D/F; Advanced/Proficient/Basic. Our committee held firm: such grading was purely arbitrary, and worse, would only be used politically. The contract was eventually taken from our organization and given to another that promised it could give the nation a grade, free of politics. It couldn’t.

Measurement has changed along with the nation. In the last three decades, the public has largely withdrawn its commitment to public education. The reasons are multiple: those who pay for public schools have less money, and those served by the public schools look less and less like those paying taxes.

The degrading of public education has involved impugning its effectiveness, cutting its budget, and busting its unions. Educational measurement has been the perfect tool for accomplishing all three: cheap and scientific looking.

International tests have purported to prove that America’s schools are inefficient or run by lazy incompetents. Paper-and-pencil tests seemingly show that kids in private schools – funded by parents – are smarter than kids in public schools. We’ll get to the top, so the story goes, if we test a teacher’s students in September and June and fire that teacher if the gains aren’t great enough.

There has been resistance, of course. Teachers and many parents understand that children’s development is far too complex to capture with an hour or two taking a standardized test. So resistance has been met with legislated mandates. The test company lobbyists convince politicians that grading teachers and schools is as easy as grading cuts of meat. A huge publishing company from the UK has spent $8 million in the past decade lobbying Congress. Politicians believe that testing must be the cornerstone of any education policy.

The results of this cronyism between corporations and politicians have been chaotic. Parents see the stress placed on their children and report them sick on test day. Educators, under pressure they see as illegitimate, break the rules imposed on them by governments. Many teachers put their best judgment and best lessons aside and drill children on how to score high on multiple-choice tests. And too many of the best teachers exit the profession.

When measurement became the instrument of accountability, testing companies prospered and schools suffered. I have watched this happen for several years now. I have slowly withdrawn my intellectual commitment to the field of measurement. Recently I asked my dean to switch my affiliation from the measurement program to the policy program. I am no longer comfortable being associated with the discipline of educational measurement.

Gene V Glass
Arizona State University
National Education Policy Center
University of Colorado Boulder

Out with the Old, In with the New: Proposed Ohio Budget Bill to Revise the Teacher Evaluation System (Again)

Here is another post from VAMboozled!’s new team member – Noelle Paufler, Ph.D. – on Ohio’s “new and improved” teacher evaluation system, redesigned just three years after Ohio’s last attempt.

The Ohio Teacher Evaluation System (OTES) can hardly be considered “old” in its third year of implementation, and yet Ohio Budget Bill (HB64) proposes new changes to the system for the 2015-2016 school year. In a recent blog post, Plunderbund (aka Greg Mild) highlights the latest revisions to the OTES as proposed in HB64. (This post is also featured here on Diane Ravitch’s blog.)

Plunderbund outlines several key concerns with the budget bill including:

  • Student Learning Objectives (SLOs): In place of SLOs, teachers who are assigned to grade levels, courses, or subjects for which value-added scores are unavailable (i.e., via state standardized tests or vendor assessments approved by the Ohio Department of Education [ODE]) are to be evaluated “using a method of attributing student growth,” per HB64, Section 3319.111 (B) (2).
  • Attributed Student Growth: The value-added results of an entire school or district are to be attributed to teachers who otherwise do not have individual value-added scores for evaluation purposes. In this scenario, teachers are to be evaluated based upon the performance of students they may not have met in subject areas they do not directly teach.
  • Timeline: If enacted, the budget bill does not require the ODE to finalize the revised evaluation framework until October 31, 2015. Although the OTES has just now been fully implemented in most districts across the state, school boards would need to quickly revise teacher evaluation processes, forms, and software to comply with the new requirements well after the school year is already underway.

As Plunderbund notes, these newly proposed changes resurrect a series of long-standing questions of validity and credibility with regard to the OTES. The proposed use of “attributed student growth” to evaluate teachers who are assigned to non-tested grade levels or subject areas has raised, and should raise, concerns among all teachers. This proposal presumes that an essentially two-tiered evaluation system can validly measure the effectiveness of some teachers based on presumably proximal outcomes (their individual students’ scores on state or approved vendor assessments) and of others based on (at best) distal outcomes using attributed student growth. While the dust has scarcely settled with regard to OTES implementation, Plunderbund compellingly argues that this new wave of proposed changes would result in more confusion, frustration, and chaos among teachers, as well as disruptions to student learning.

To learn more, read Plunderbund’s full critique of the proposed changes; again, click here.

Is this Thing On? Amplifying the Call to Stop the Use of Test Data for Educator Evaluations (At Least for Now)

I invited a colleague of mine and now member of the VAMboozled! team – Kimberly Kappler Hewitt (Assistant Professor, University of North Carolina, Greensboro) – to write another guest post for you all (see her first post here). She wrote another, this time capturing what three leading professional organizations have to say on the use of VAMs and tests in general for purposes of teacher accountability. Here’s what she wrote:

Within the last year, three influential organizations—reflecting researchers, practitioners, and philanthropic sectors—have called for a moratorium on the current use of student test score data for educator evaluations, including the use of value-added models (VAMs).

In April of 2014, the American Statistical Association (ASA) released a position statement that was highly skeptical of the use of VAMs for educator evaluation. ASA declared that “Attaching too much importance to a single item of quantitative information is counterproductive—in fact, it can be detrimental to the goal of improving quality.” To be clear, the ASA stopped short of outright condemning the use of VAM for educator evaluation, and declared that its statement was designed to provide guidance, not prescription. Instead, ASA outlined the possibilities and limitations of VAM and called into question how it is currently being (mis)used for educator evaluation.

In June of 2014, the Gates Foundation, the largest American philanthropic education funder, released “A Letter to Our Partners: Let’s Give Students and Teachers Time.” The letter was written by Vicki Phillips, Director of Education, College Ready, in which she (on behalf of the Foundation) called for a two-year moratorium on the use of test scores for educator evaluation. She explained that “teachers need time to develop lessons, receive more training, get used to the new tests, and offer their feedback.”

Similarly, the Association for Supervision and Curriculum Development (ASCD), which is arguably the leading international educator organization, comprising 125,000 members in more than 130 nations, also recently released a policy brief that calls for a two-year moratorium on high-stakes use of state tests—including their use for educator evaluations. ASCD also explicitly acknowledged that “reliance on high-stakes standardized tests to evaluate students, educators, or schools is antithetical to a whole child education. It is also counter to what constitutes good educational practice.”

While the call to halt the current use of test scores for educator evaluation is echoed across all three of these organizations, there are important nuances to their messages. The Gates Foundation, for example, makes it clear that the foundation supports the use of student test data for educator evaluation even as it declares the need for a two-year moratorium, the purpose of which is to allow teachers the time to adjust to the new Common Core Standards and related tests:

The Gates Foundation is an ardent supporter of fair teacher feedback and evaluation systems that include measures of student gains. We don’t believe student assessments should ever be the sole measure of teaching performance, but evidence of a teacher’s impact on student learning should be part of a balanced evaluation that helps all teachers learn and improve.

The Gates Foundation cautions, though, about the risk of moving too quickly to tie test scores to teacher evaluation:

Applying assessment scores to evaluations before these pieces are developed would be like measuring the speed of a runner based on her time—without knowing how far she ran, what obstacles were in the way, or whether the stopwatch worked!

I wonder what the stopwatch symbolizes in the simile: Does the Gates Foundation have questions about the measurement mechanism itself (VAM or another student growth measure), or is Gates simply arguing for more time in order for educators to be “ready” for the race they are expected to run?

While the Gates call for a moratorium is oriented toward increasing the possibility of realizing the positive potential of policies regarding the use of student test data for educator evaluation by providing more time to prepare educators for them, the ASA, on the other hand, is concerned about the potential negative effects of such policies. The ASA, in its attempt to provide guidance, identified problems with the current use of VAM for educator evaluation and raised important questions about the potential effects of high-stakes use of VAM for educator evaluation:

A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Over-reliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Similarly to ASA, ASCD is concerned with the negative effects of current accountability practices, including “over testing, a narrowing of the curriculum, and a de-emphasis of untested subjects and concepts—the arts, civics, and social and emotional skills, among many others.” While ASCD is clear that it is not calling for a moratorium on testing, it is calling for a moratorium on accountability consequences linked to state tests: “States can and should still administer standardized assessments and communicate the results and what they mean to districts, schools, and families, but without the threat of punitive sanctions that have distorted their importance.” ASCD goes further than ASA and Gates in calling for a complete revamp of accountability practices, including policies regarding teacher accountability:

We need a pause to replace the current system with a new vision. Policymakers and the public must immediately engage in an open and transparent community decision-making process about the best ways to use test scores and to develop accountability systems that fully support a broader, more accurate definition of college, career, and citizenship readiness that ensures equity and access for all students.

So…are policymakers listening? Are these influential organizations able to amplify the voices of researchers and practitioners across the country who also want a moratorium on misguided teacher accountability practices? Let’s hope so.

Playing Fair: Factors that Influence VAM for Special Education Teachers

As you all know, value-added models (VAMs) are intended to measure a teacher’s effectiveness. By tracking students’ learning over time and attributing part of that growth to educators, VAMs attempt to isolate the teacher’s impact on student achievement. VAMs focus on individual student progress from one testing period to the next, sometimes without considering past learning, peer influence, family environment, or individual ability, depending on the model.
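For readers unfamiliar with the mechanics, here is a minimal sketch of one common value-added approach (a simple covariate-adjustment model with invented data), in which current scores are regressed on prior scores and a teacher’s “value added” is the average residual of his or her students. Operational VAMs are far more elaborate, so treat this only as an illustration of the basic logic; the teacher IDs and scores are hypothetical.

```python
# Minimal covariate-adjustment VAM sketch; all data invented for illustration only.
import numpy as np

# Each row: (prior-year score, current-year score, teacher id)
students = [
    (700, 712, "T1"), (680, 695, "T1"), (720, 718, "T1"),
    (705, 704, "T2"), (690, 688, "T2"), (730, 727, "T2"),
]

prior = np.array([s[0] for s in students], dtype=float)
current = np.array([s[1] for s in students], dtype=float)
teachers = np.array([s[2] for s in students])

# Fit current = a + b * prior by ordinary least squares.
b, a = np.polyfit(prior, current, 1)
residuals = current - (a + b * prior)  # how far each student landed above/below expectation

# A teacher's "value added" is the mean residual of his or her students.
for t in sorted(set(teachers)):
    print(t, round(residuals[teachers == t].mean(), 2))
```

Even this toy version makes the concerns below easy to see: with co-teaching, pull-out services, and multiple teachers of record, deciding whose residuals these are is anything but straightforward, and with only a handful of students per teacher, a few noisy scores can swing the estimate substantially.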

Teachers, administrators and experts have debated VAM reliability and validity, but not often mentioned is the controversy regarding the use of VAMs for teachers of special education students. Why is this so controversial? Because students with disabilities are often educated in general education classrooms, but generally score lower on standardized tests – tests that they often should not be taking in the first place. Accordingly, holding teachers of special education students accountable for their performance is uniquely problematic. For example, many special education students are in mainstream classrooms, with co-teaching provided by both special and general education teachers; hence, special education programs can present challenges to VAMs that are meant to measure straightforward progress.

Co-teaching Complexities

Research like “Co-Teaching: An Illustration of the Complexity of Collaboration in Special Education” outlines some of the specific challenges that teachers of special education can face when co-teaching is involved. But essentially, co-teaching is a partnership between a general and a special education teacher, who jointly instruct a group of students, including those with special needs and disabilities. The intent is to provide special education students with access to the general curriculum while receiving more specialized instruction to support their learning.

Accordingly, collaboration is key to successful co-teaching. Teams that demonstrate lower levels of collaboration tend to struggle more, while successful co-teaching teams share their expertise to motivate students. However, special education teachers often report differences in teaching styles that lead to conflict; they often feel relegated to the role of classroom assistant, rather than full teaching partner. This also has implications for VAMs.

For example, student outcomes from co-teaching vary. A 2002 study by Rea, McLaughlin and Walther-Thomas found that students with learning disabilities in co-taught classes had better attendance and report card grades, but no better performance on standardized tests. Another report showed that test scores for students with and without disabilities were not affected by co-teaching (Idol, 2006).

A 2014 study by the Carnegie Foundation for the Advancement of Teaching points out another issue that can make co-teaching more difficult in special education settings: it can be difficult to determine value-added because it is hard to separate each teacher’s contributions. The authors also assert that calculating value-added would be more accurate if the models used more detailed data about disability status, services rendered, and past and present accommodations made, but many states do not collect these data (Buzick, 2014), and even if they did, there is no real certainty that this would work.

Likewise, inclusion brings special education students into the general classroom, eliminating boundaries between special education students and general education peers. However, special education teachers often voice opposition to general education inclusion as it relates to VAMs.

According to “Value-Added Modeling: Challenges for Measuring Special Education Teacher Quality” (Lawson, 2014) some of the specific challenges cited include:

  • When students with disabilities spend the majority of their day in general education classrooms, special education teacher effectiveness is distorted.
  • Quality special education instruction can be hindered by poor general education instruction.
  • Students may be pulled out of class for additional services, which makes it difficult to maintain progress and pace.
  • Multiple teachers often provide instruction to special education students, so each teacher’s impact is difficult to assess.
  • When special education teachers assist general education classrooms, their impact is not measured by VAMs.

And along with the complexities involved with teaching students with disabilities, special education teachers also deal with a number of constraints that impact instructional time and affect VAMs. Special education teachers also deal with more paperwork, including Individualized Education Plans (IEPs) that take time to write and review. In addition, they must handle extensive curriculum and lesson planning, manage parent communication, keep up with special education laws and coordinate with general education teachers. While their priority may be to fully support each student’s learning and achievement, it’s not always possible. In addition, not everything special education teachers do can be appropriately captured using tests.

These are but a few reasons that special education teachers should question the fairness of VAMs.

***

This is a guest post from Umari Osgood who works at Bisk Education and writes on behalf of University of St. Thomas online programs.