Florida’s Superintendents Have Officially “Lost Confidence” in the State’s Teacher and School Evaluation System

In an article written by Valerie Strauss and featured yesterday in The Washington Post, school superintendents throughout the state of Florida have officially declared to the state that they have “lost confidence” in Florida’s teacher and school evaluation system (click here for the original article). More specifically, state school superintendents submitted a letter to the state asserting that “they ‘have lost confidence’ in the system’s accuracy and are calling for a suspension and a review of the system” (click here also for their original letter).

This is the evaluation model instituted by Florida’s former governor Jeb Bush, whom we all also identify as the brother of former President George W. Bush and son of former President George H. W. Bush. Former President GW Bush was also the architect of what most now agree is the failed No Child Left Behind (NCLB) Act of 2001, which former President GW Bush put forth as federal policy given his “educational reforms” in the state of Texas in the late 1990s. Beyond this brotherly association, however, Jeb is also currently a candidate for the 2016 Republican presidential nomination, and via his campaign he is also (like his brother, pre-NCLB) touting his “educational reforms” in the state of Florida, hoping to take these reforms also to the White House.

It is Jeb Bush’s “educational reforms” that, without research evidence in support, also became models of “educational reform” for other states around the country, including states like New Mexico. In New Mexico, the state’s current evaluation system, as based on the Florida model, is at the core of a potentially significant lawsuit (see prior posts about this lawsuit here, here, here, and here). Not surprisingly, one of Jeb Bush’s protégés – Hanna Skandera, who is currently the state of New Mexico’s Secretary of its Public Education Department (PED) – took Bush’s model to even more of an extreme, making this particular model one of the most arbitrary and capricious of all such models currently in place throughout the U.S. (as also defined and detailed more fully here).

Hence, it should certainly be of interest to those in New Mexico (and other states) that the teacher and school evaluation model upon which their state model was built is apparently falling apart, at least for now.

“The superintendents [in Florida] are calling for the state to suspend the system for a year – meaning that the scores from this spring’s administration of the [state’s] exams will not be used in [the state-level] evaluations [of teachers or schools].” The state school superintendents are also calling for “a full review” of the system, which certainly seems warranted in Florida, New Mexico, and any other state for that matter, in which similar evaluation models are being paid for using taxpayer revenues, without much if any research evidence to support that they are working as intended, and sold, and paid for.

No Longer in Washington State are Test Scores to be Tied to Teachers’ Evaluations

You might recall from a prior post that U.S. Secretary of Education Arne Duncan pulled the state of Washington’s No Child Left Behind (NCLB) waiver because the state did not meet the U.S. Department of Education’s timeline for tying teacher evaluations to student test scores (i.e., using growth or value-added modeling). Washington was the first to lose its waiver, but hopefully not the last.

Teachers in Washington achieved yet another victory in this regard.

As per a recent article in The Seattle Times (see also a related post here), it looks like “After four months of negotiations, a five-day [teachers] strike and one final all-night [negotiations] talk, the Seattle teachers union and Seattle Public Schools reached a tentative contract agreement,” that, if it is ultimately finalized, will:

  • Remove test scores from playing any role in teacher evaluations;
  • Give teachers significantly more say in how often students are tested;
  • Secure for teachers pay increases of 9.5 percent over three years, in addition to the state cost-of-living adjustment of 4.8 percent over two years;
  • Pay teachers for the longer school days they are to teach (longer school days are forthcoming);
  • Lower special-education student-teacher ratios;
  • Lower teacher specialists’ caseloads;
  • Allow for guaranteed recesses for students; etc.

As also noted in this article, this is certainly “groundbreaking,” especially given the federal politics and policies surrounding, as most pertinent to followers of this blog, using test scores to evaluate teachers (see bullet #1 above).

Hopefully other states can, and will, follow suit…

New Mexico’s Teacher Evaluation Lawsuit Underway: Day Two

Following up on my most recent post, about the “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System,” I testified yesterday for four hours regarding the state’s new teacher evaluation model. As noted in both articles linked to below, I positioned this state’s model as one of the most “arbitrary and capricious” systems I’ve seen across all states currently working with or implementing such systems, again, as incentivized via President Obama’s Race to the Top program and as required if states were to receive (and are to keep) the waivers excusing them from meeting the previous requirement written into President George W. Bush’s No Child Left Behind Act of 2001, that all children would be academically proficient by the year 2014.

I positioned this teacher evaluation model as one of the most “arbitrary and capricious” systems I’ve seen across all states in that this state, unlike any other I have ever seen, has worked very hard to make 100% of its teachers value-added eligible, while essentially pulling off-the-shelf, criterion- and norm-referenced tests and also developing (yet not satisfactorily vetting or validating) a series of end-of-course exams (EoCs) to do this. This also includes early childhood teachers using, for example, the norm-referenced DIBELS. Let us just say, for purposes of brevity, that this (and many other of the state’s educational measurement actions in this regard) defies the Standards for Educational & Psychological Testing.

Nonetheless, the day’s testimony also included testimony from one high school science teacher who expressed his concerns about his evaluation outcomes, as well as the evaluation outcomes of his high school science colleagues. The day ended with two hours of testimony given by the state’s model developer – Pete Goldschmidt – who is now an Associate Professor at California State University Northridge. Time ran out, however, before cross-examination.

For more information, though, I will direct you all to two articles that capture the main events or highlights of the day.

The author of the first article titled, “Experts differ on test-based evaluations at NM hearing,” fairly captures the events of the day, as well as the professional and collegial agreements and disagreements expressed to the judge by both Pete Goldschmidt and me.

The author of the second article titled, “Professor’s testimony: Teacher eval system ‘not ready for prime time,’” however, was less fair. Nonetheless, for purposes of transparency, I include both articles for you all here, to also see how polarizing this topic can be (as many of us already know). I will say, though, that I liked the picture included in this latter article. Very Santa Fe-ish 😉

Day three of this trial will occur in Santa Fe next Thursday, during which Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, and who has also been the subject of prior posts (see, for example, here, here and here) will testify. Also to testify will be Matthew Montaño, Educator Quality Division, New Mexico Public Education Department (PED).

Will keep you posted, again…

New Mexico’s Teacher Evaluation Lawsuit Underway

You might recall, from a post last March, that the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs are more specifically charging that the state’s current teacher evaluation system is unfair, error-ridden, harming teachers, and depriving students of high-quality educators (see the actual lawsuit here).

Well, testimonies started yesterday in Santa Fe, and as one of the expert witnesses on the plaintiffs’ side, I was there to witness the first day of examinations. While I will not comment on my impressions at this point, because I will be testifying this Monday and would like to save all of my comments until I’m on the stand, I will say it was quite an interesting day indeed, for both sides.

What I do feel comfortable sharing at this point, though, is an article that The New Mexican reporter Robert Nott wrote, as he too attended the full day in court. His article, essentially about the state of New Mexico “Getting it Right,” captures the gist of the day. I say this duly noting that only witnesses on the plaintiffs’ side were examined, and also cross-examined yesterday. Plaintiffs’ witnesses will continue this Monday, and defendants’ witnesses will continue thereafter, also this Monday, and likely one more day to be scheduled thereafter.

But as for the highlights, as per Nott’s article:

  • “Joel Boyd, [a highly respected] superintendent of the Santa Fe Public Schools, testified that ‘glaring errors’ have marred the state’s ratings of teachers in his district.” He testified that “We should pause and get it right,” also testifying that “the state agency has not proven itself capable of identifying either effective or ineffective teachers.” Last year when Boyd challenged his district’s 1,000 or so teachers’ rankings, New Mexico’s Public Education Department (PED) “ultimately yielded and increased numerous individual teacher rankings…[which caused]…the district’s overall rating [to improve] by 17 percentage points.”
  • State Senator Bill Soules, who is also a recently retired teacher, testified that “his last evaluation included data from 18 students he did not teach. ‘Who are those 18 students who I am being evaluated on?’ he asked the judge.”
  • One of the defendant’s attorneys later defended the state’s data, stating “education department records show that there were only 712 queries from districts regarding the accuracy of teacher evaluation results in 2014-15. Of those, the state changed just 31 ratings after reviewing challenges.” State Senator Soules responded, however, that “a [i.e., one] query may include many teachers.” For example, Albuquerque Public Schools (APS) purportedly put in one single query that included “hundreds, if not thousands” of questions about that district’s set of teacher evaluations.

In fact, most if not all of the witnesses who testified not only argued, but evidenced, how the state used flawed data in their personal, or their schools’/districts’ teachers’ general evaluations, leading to incorrect results.

Plaintiffs and their witnesses also argued, and evidenced, that “the system does not judge teachers by the same standards. Language arts teachers, as well as educators working in subjects without standardized tests, are rated by different measures than those teaching the core subjects of math, science and English.” This, as both the plaintiffs’ witnesses and lawyers also argued, makes this an arbitrary and capricious system, or rather one that is not “objective” as per the state’s legislative requirements.

In the words of Shane Youtz, one of the plaintiffs’ two attorneys, “You have a system that is messed up…Frankly, the PED doesn’t know what it is doing with the data and the formula, and they are just changing things ad hoc.”

“Attorneys for the Public Education Department countered that, although no evaluation system is perfect, this one holds its educators to a high standard and follows national trends in utilizing student test scores when possible.”

Do stay tuned….

“Efficiency” as a Constitutional Mandate for Texas’s Educational System

The Texas Constitution requires that the state “establish and make suitable provision for the support and maintenance of an efficient system of public free schools,” as the “general diffusion of knowledge [is]…essential to the preservation of the liberties and rights of the people.” Following this notion, The George W. Bush Institute’s Education Reform Initiative recently released its first set of reports as part of its The Productivity for Results Series: “A Legal Lever for Enhancing Productivity.” The report was authored by an affiliate of The New Teacher Project (TNTP) – the non-profit organization founded by the controversial former Chancellor of Washington DC’s public schools Michelle Rhee; an unknown and apparently unaffiliated “education researcher” named Krishanu Sengupta; and Sandy Kress, the “key architect of No Child Left Behind [under the presidential leadership of George W. Bush] who later became a lobbyist for Pearson, the testing company” (see, for example, here).

Authors of this paper review the economic and education research (although if you look through the references the strong majority of pieces come from economics research, which makes sense as this is an economically driven venture) to identify characteristics that typify enterprises that are efficient. More specifically, the authors use the principles of x-efficiency set out in the work of the highly respected Henry Levin that require efficient organizations, in this case as (perhaps inappropriately) applied to schools, to have: 1) A clear objective function with measurable outcomes; 2) Incentives that are linked to success on the objective function; 3) Efficient access to useful information for decisions; 4) Adaptability to meet changing conditions; and 5) Use of the most productive technology consistent with cost constraints.

The authors also advance another series of premises, as related to this view of x-efficiency and its application to education/schools in Texas: (1) that “if Texas is committed to diffusing knowledge efficiently, as mandated by the state constitution, it should ensure that the system for putting effective teachers in classrooms and effective materials in the hands of teachers and students is characterized by the principles that undergird an efficient enterprise, such as those of x-efficiency;” (2) this system must include value-added measurement systems (i.e., VAMs), as deemed throughout this paper as not only constitutional but also rational and in support of x-efficiency; (3) given “rational policies for teacher training, certification, evaluation, compensation, and dismissal are key to an efficient education system;” (4) “the extent to which teacher education programs prepare their teachers to achieve this goal should [also] be [an] important factor;” (5) “teacher evaluation systems [should also] be properly linked to incentives…[because]…in x-efficient enterprises, incentives are linked to success in the objective function of the organization;” (6) which is contradictory with current, less x-efficient teacher compensation systems that link incentives to time on the job, or tenure, rather than to “the success of the organization’s function;” and (7), in the end, “x-efficient organizations have efficient access to useful information for decisions, and by not linking teacher evaluations to student achievement, [education] systems [such as the one in Texas will] fail to provide the necessary information to improve or dismiss teachers.”

The two districts highlighted in this report as being most x-efficient in Texas include, to no surprise: “Houston [which] adds a value-added system to reward teachers, with student performance data counting for half of a teacher’s overall rating. HISD compares students’ academic growth year to year, under a commonly used system called EVAAS.” We’ve discussed not only this system but also its use in Houston often on this blog (see, for example, here, here, and here). Teachers in Houston who consistently perform poorly can be fired for “insufficient student academic growth as reflected by value added scores…In 2009, before EVAAS became a factor in terminations, 36 of 12,000 teachers were fired for performance reasons, or .3%, a number so low the Superintendent [Terry Grier] himself called the dismissal system into question. From 2004-2009, the district fired or did not renew 365 teachers, 140 for “performance reasons,” including poor discipline management, excessive absences, and a lack of student progress. In 2011, 221 teacher contracts were not renewed, multiple for “significant lack of student progress attributable to the educator,” as well as “insufficient student academic growth reflected by [SAS EVAAS] value-added scores….In the 2011-12 school year, 54% of the district’s low-performing teachers were dismissed.” That’s “progress,” right?!?

Anyhow, for those of you who have not heard, this same (controversial) Superintendent, who pushed this system throughout his district, is retiring (see, for example, here).

The other district of (dis)honorable mention was Dallas Independent School District; it also uses a VAM called the Classroom Effectiveness Index (CEI), although I know less about this system as I have never examined or researched it myself, nor have I read really anything about it. But in 2012, the district’s Board “decided not to renew 259 contracts due to poor performance, five times more than the previous year.” The “progress” such x-efficiency brings…

What is still worrisome to the authors, though, is that “[w]hile some districts appear to be increasing their efforts to eliminate ineffective teachers, the percentage of teachers dismissed for any reason, let alone poor performance, remains well under one percent in the state’s largest districts.” Related (and I preface this by noting that this next argument is one of the most over-cited and hyper-utilized by organizations backing “initiatives” or “reforms” such as these), this “falls well below the five to eight percent that Hanushek calculates would elevate achievement to internationally competitive levels.” “Calculations by Eric Hanushek of Stanford University show that removing the bottom five percent of teachers in the United States and replacing them with teachers of average effectiveness would raise student achievement in the U.S. 0.4 standard deviations, to the level of student achievement in Canada. Replacing the bottom eight percent would raise student achievement to the level of Finland, a top performing country on international assessments.” As Linda Darling-Hammond, also of Stanford, would argue, we cannot simply “fire our way to Finland.” Sorry Eric! But this is based on econometric predictions, and no evidence exists whatsoever that this is in fact a valid inference. Nonetheless, it is cited over and over again by the same/similar folks (such as the authors of this piece) to justify their currently trendy educational reforms.

The major point here, though, is that “if Texas wanted to remove (or improve) the bottom five to eight percent of its teachers, the current evaluation system would not be able to identify them;” hence, the state desperately needs a VAM-based system to do this. Again, no research to counter this or really any claim is included in this piece; only the primarily economics-based literatures were selected in support.

In the end, though, the authors conclude that “While the Texas Constitution has established a clear objective function for the state school system and assessments are in place to measure the outcome, it does not appear that the Texas education system shares the other four characteristics of x-efficient enterprises as identified by Levin. Given the constitutional mandate for efficiency and the difficult economic climate, it may be a good time for the state to remedy this situation…[Likewise] the adversity and incentives may now be in place for Texas to focus on improving the x-efficiency of its school system.”

As I know and very much respect Henry Levin (see, for example, an interview I conducted with him a few years ago, with the shorter version here and the longer version here), I’d be curious to know what his response might be to the authors’ use of his x-efficiency framework to frame such neo-conservative (and again trendy) initiatives and reforms. Perhaps I will email him…

A Labor Day Tribute to NY Teacher Sheri Lederman

As per the U.S. Department of Labor, “Labor Day, the first Monday in September, is a creation of the labor movement and is dedicated to the social and economic achievements of American workers. It constitutes a yearly national tribute to the contributions workers have made to the strength, prosperity, and well-being of our country.”

There is no better day than today, then, to honor Sheri Lederman – the Long Island, New York teacher who by all accounts other than her 1 out of 20 VAM score is a terrific 4th grade and now 18-year veteran teacher, and who along with her attorney husband, Bruce Lederman, is suing the state of New York to challenge the state’s teacher evaluation system…and mainly its value-added component (see prior posts about Sheri’s case here, here, and here). I’m serving as one of the expert witnesses, also, on this case.

Anyhow, Al Jazeera America just released a fabulous 3:41 video about Sheri, her story, and VAMs in general. Columbia Professor Aaron Pallas is featured speaking on Sheri’s behalf, and a study of mine is featured at marker 2:41 (i.e., the national map illustrating what’s happening across the nation; see also “Putting growth and value-added models on the map: A national overview”).

Do give this a watch. It’s well worth honoring and respecting Sheri’s achievements, thus far, and especially today.


Special Issue of “Educational Researcher” (Paper #2 of 9): VAMs’ Measurement Errors, Issues with Retroactive Revisions, and (More) Problems with Using Test Scores

Recall from a prior post that the peer-reviewed journal titled Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou – Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University – and Matthew Springer – Assistant Professor of Public Policy also at Vanderbilt.

As written into the article’s abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

  • “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
  • “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78).
  • “[R]esources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error” (p. 78).
  • “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs) and despite the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed, or researched.
  • Related, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it’s rare; hence, it should not significantly impact teachers’ value-added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
  • In terms of teachers versus proctors or other teachers monitoring students when they take large-scale standardized tests (that are used across all states to calculate value-added estimates), researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant than I do in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but where this is still allowed, this finding might also be relevant. “An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know” (p. 83).
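
For those who like to see the arithmetic behind the class-size point in the second bullet above, here is a rough sketch of my own (not the authors’ model, and not how any state actually computes its scores). It assumes that a teacher’s estimate is simply the average of noisy student-level results, that both teachers are truly average (a true effect of zero), that student-level noise has a standard deviation of 1, and that a teacher is deemed “ineffective” when the estimate falls below a cutoff of -0.2. All of these numbers, and the function name, are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_flagged(n_students, cutoff=-0.2, student_sd=1.0, true_effect=0.0):
    """Chance that a teacher's estimate falls below `cutoff` purely by noise.

    Toy assumption: the estimate is the mean of `n_students` independent
    student results, so its standard error shrinks with the square root
    of the number of tested students.
    """
    standard_error = student_sd / sqrt(n_students)
    return NormalDist(mu=true_effect, sigma=standard_error).cdf(cutoff)


# Two equally effective (truly average) teachers, differing only in class size.
p_small = prob_flagged(25)    # teacher with 25 tested students
p_large = prob_flagged(100)   # teacher with 100 tested students
ratio = p_small / p_large

print(f"25 students:  {p_small:.3f}")
print(f"100 students: {p_large:.3f}")
print(f"ratio:        {ratio:.1f}")
```

Under these made-up inputs the teacher with 25 students is flagged roughly 16% of the time versus roughly 2% for the teacher with 100 students, or about seven times as often, even though the two are identical in true effectiveness. A different cutoff or noise level would give a different ratio; the point is only that the gap comes from the number of students tested, not from the teachers.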

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS) used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee and many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with EVAAS practice whereby teachers’ EVAAS scores change retroactively, as more data become available, to get at more “precision” even though teachers’ scores can change one or two years well after the initial score is registered (and used for whatever purposes).

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).
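
To make the two EVAAS concerns above a bit more concrete, here is another small sketch of my own. It is not the EVAAS model (which is proprietary and far more complex); it only shows what a generic “95% confident this teacher differs from average” rule looks like mechanically, using a large-sample normal approximation for the critical value, and how a retroactively revised estimate can flip a teacher’s classification. The function name, estimates, and standard errors are all hypothetical.

```python
from statistics import NormalDist


def differs_from_average(estimate, standard_error, confidence=0.95):
    """Flag a teacher whose estimate is 'significantly' different from 0.

    Uses the usual test statistic (estimate / standard error) and a
    two-sided critical value, about 1.96 for 95% confidence under a
    normal approximation.
    """
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    t_stat = estimate / standard_error
    return abs(t_stat) > critical


# Hypothetical teacher: same school year, scored twice.
initial_flag = differs_from_average(estimate=-0.30, standard_error=0.10)
revised_flag = differs_from_average(estimate=-0.15, standard_error=0.10)

print("Flagged on initial estimate:", initial_flag)
print("Flagged on revised estimate:", revised_flag)
```

In this made-up case the teacher is flagged as below average on the initial estimate but not on the revised one, which is exactly the situation the authors ask about: what happens to the teacher who was sanctioned on the first number?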


If interested, see the Review of Article #1 – the introduction to the special issue here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86. doi:10.3102/0013189X15574904