Victory in Court: Consequences Attached to VAMs Suspended Throughout New Mexico

Great news for New Mexico and New Mexico’s approximately 23,000 teachers, and potentially great news for states and teachers elsewhere, in terms of the precedent this sets!

Late yesterday, state District Judge David K. Thomson, who presided over the ongoing teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set, for now, for April) that the system is reliable, valid, fair, uniform, and the like.

As you all likely recall, last year the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), is unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here).

Thereafter, one scheduled day of testimonies turned into five in Santa Fe, which ran from the end of September through the beginning of October (each of which I covered here, here, here, here, and here). I served as the expert witness for the plaintiffs’ side, along with other witnesses including lawmakers (e.g., a state senator) and educators (e.g., teachers, superintendents) who made various (and very articulate) claims about the state’s teacher evaluation system on the stand. Thomas Kane served as the expert witness for the defendant’s side, along with other witnesses including lawmakers and educators who made counterclaims about the system, some of which backfired, unfortunately for the defense, primarily during cross-examination.

See articles released about this ruling this morning in the Santa Fe New Mexican (“Judge suspends penalties linked to state’s teacher eval system”) and the Albuquerque Journal (“Judge curbs PED teacher evaluations”). See also the AFT’s press release, written by AFT President Randi Weingarten, here. Click here for the full 77-page Order written by Judge Thomson (see also, below, five highlights I pulled from this Order).

The Santa Fe New Mexican’s reporter, though, provided the most detailed information about Judge Thomson’s Order, writing, for example, that the “ruling by state District Judge David Thomson focused primarily on the complicated combination of student test scores used to judge teachers. The ruling [therefore] prevents the Public Education Department [PED] from denying teachers licensure advancement or renewal, and it strikes down a requirement that poorly performing teachers be placed on growth plans.” In addition, the Judge noted that “the teacher evaluation system varies from district to district, which goes against a state law calling for a consistent evaluation plan for all educators.”

The PED continues to stand by its teacher evaluation system, calling the court challenge “frivolous” and “a legal PR stunt,” all the while noting that Judge Thomson’s decision “won’t affect how the state conducts its teacher evaluations.” Indeed it will, for now and until the state’s teacher evaluation system is vetted and validated, and “the court” is “assured” that the system can actually be used to take the “consequential actions” against teachers “required” by the state’s PED.

Here are some other highlights that I took directly from Judge Thomson’s ruling, capturing what I viewed as his major areas of concern about the state’s system (click here, again, to read Judge Thomson’s full Order):

  • Validation Needed: “The American Statistical Association says ‘estimates from VAM should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAM are used for high stake[s] purposes’” (p. 1). These are the measures, assumptions, limitations, and the like that are to be made transparent in this state.
  • Uniformity Required: “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate” (p. 2)…with the existing statutory framework for teacher evaluations for licensure purposes requiring “that the teacher be evaluated for ‘competency’ against a ‘highly objective uniform statewide standard of evaluation’ to be developed by PED” (p. 4). “It is the term ‘highly objective uniform’ that is the subject matter of this suit” (p. 4), whereby the state and no other “party provided [or could provide] the Court a total calculation of the number of available district-specific plans possible given all the variables” (p. 54). See also the Judge’s points #78-#80 (starting on page 70) for some of the factors that helped to “establish a clear lack of statewide uniformity among teachers” (p. 70).
  • Transparency Missing: “The problem is that it is not easy to pull back the curtain, and the inner workings of the model are not easily understood, translated or made accessible” (p. 2). “Teachers do not find the information transparent or accurate” and “there is no evidence or citation that enables a teacher to verify the data that is the content of their evaluation” (p. 42). In addition, “[g]iven the model’s infancy, there are no real studies to explain or define the [s]tate’s value-added system…[hence, the consequences and decisions]…that are to be made using such system data should be examined and validated prior to making such decisions” (p. 12).
  • Consequences Halted: “Most significant to this Order, [VAMs], in this [s]tate and others, are being used to make consequential decisions…This is where the rubber hits the road [as per]…teacher employment impacts. It is also where, for purposes of this proceeding, the PED departs from the statutory mandate of uniformity requiring an injunction” (p. 9). In addition, it should be noted that indeed “[t]here are adverse consequences to teachers short of termination” (p. 33) including, for example, “a finding of ‘minimally effective’ [that] has an impact on teacher licenses” (p. 41). These, too, are to be halted under this injunction Order.
  • Clarification Required: “[H]ere is what this [O]rder is not: This [O]rder does not stop the PED’s operation, development and improvement of the VAM in this [s]tate, it simply restrains the PED’s ability to take consequential actions…until a trial on the merits is held” (p. 2). In addition, “[a] preliminary injunction differs from a permanent injunction, as does the factors for its issuance…’ The objective of the preliminary injunction is to preserve the status quo [minus the consequences] pending the litigation of the merits. This is quite different from finally determining the cause itself” (p. 74). Hence, “[t]he court is simply enjoining the portion of the evaluation system that has adverse consequences on teachers” (p. 75).

The PED also argued that “an injunction would hurt students because it could leave in place bad teachers.” As per Judge Thomson, “That is also a faulty argument. There is no evidence that temporarily halting consequences due to the errors outlined in this lengthy Opinion more likely results in retention of bad teachers than in the firing of good teachers” (p. 75).

Finally, given my involvement in this lawsuit and given the team with whom I was/am still so fortunate to work (see picture below), including all of those who testified as part of the team and whose testimonies clearly proved critical in Judge Thomson’s final Order, I want to thank everyone for all of their time, energy, and efforts in this case, thus far, on behalf of the educators attempting to (still) do what they love to do — teach and serve students in New Mexico’s public schools.


Left to right: (1) Stephanie Ly, President of AFT New Mexico; (2) Dan McNeil, AFT Legal Department; (3) Ellen Bernstein, ATF President; (4) Shane Youtz, Attorney at Law; and (5) me 😉

New Mexico’s Teacher Evaluation Lawsuit Underway

You might recall, from a post last March, that the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs are more specifically charging that the state’s current teacher evaluation system is unfair, error-ridden, harming teachers, and depriving students of high-quality educators (see the actual lawsuit here).

Well, testimonies started yesterday in Santa Fe, and as one of the expert witnesses on the plaintiffs’ side, I was there to witness the first day of examinations. While I will not comment on my impressions at this point, because I will be testifying this Monday and would like to save all of my comments until I’m on the stand, I will say it was quite an interesting day indeed, for both sides.

What I do feel comfortable sharing at this point, though, is an article that The New Mexican reporter Robert Nott wrote, as he too attended the full day in court. His article, essentially about the state of New Mexico “Getting it Right,” captures the gist of the day. I say this duly noting that only witnesses on the plaintiffs’ side were examined, and also cross-examined, yesterday. Plaintiffs’ witnesses will continue this Monday, with defendants’ witnesses to follow, also beginning this Monday and likely requiring one more day to be scheduled thereafter.

But as for the highlights, as per Nott’s article:

  • “Joel Boyd, [a highly respected] superintendent of the Santa Fe Public Schools, testified that ‘glaring errors’ have marred the state’s ratings of teachers in his district.” He stated that “We should pause and get it right,” also testifying that “the state agency has not proven itself capable of identifying either effective or ineffective teachers.” Last year, when Boyd challenged his district’s 1,000 or so teachers’ rankings, New Mexico’s Public Education Department (PED) “ultimately yielded and increased numerous individual teacher rankings…[which caused]…the district’s overall rating [to improve] by 17 percentage points.”
  • State Senator Bill Soules, who is also a recently retired teacher, testified that “his last evaluation included data from 18 students he did not teach. ‘Who are those 18 students who I am being evaluated on?’ he asked the judge.”
  • One of the defendant’s attorneys later defended the state’s data, stating “education department records show that there were only 712 queries from districts regarding the accuracy of teacher evaluation results in 2014-15. Of those, the state changed just 31 ratings after reviewing challenges.” State Senator Soules responded, however, that “a [i.e., one] query may include many teachers.” For example, Albuquerque Public Schools (APS) purportedly put in one single query that included “hundreds, if not thousands” of questions about that district’s set of teacher evaluations.

In fact, most if not all of the witnesses who testified not only argued, but evidenced, how the state used flawed data in their own evaluations, or in those of their schools’ and districts’ teachers more generally, leading to incorrect results.

Plaintiffs and their witnesses also argued, and evidenced, that “the system does not judge teachers by the same standards. Language arts teachers, as well as educators working in subjects without standardized tests, are rated by different measures than those teaching the core subjects of math, science and English.” This, as both the plaintiffs’ witnesses and lawyers also argued, makes this an arbitrary and capricious system, or rather one that is not “objective” as per the state’s legislative requirements.

In the words of Shane Youtz, one of the plaintiffs’ two attorneys, “You have a system that is messed up…Frankly, the PED doesn’t know what it is doing with the data and the formula, and they are just changing things ad hoc.”

“Attorneys for the Public Education Department countered that, although no evaluation system is perfect, this one holds its educators to a high standard and follows national trends in utilizing student test scores when possible.”

Do stay tuned….

NY Teacher Lederman’s Day in Court

Do you recall the case of Sheri Lederman? The Long Island teacher who, apparently by all accounts other than her composite growth (or value-added) score, is a terrific 4th-grade teacher and 18-year veteran, yet who received a score of 1 out of 20 after scoring 14 out of 20 the year prior (see prior posts here, here, and here; see also here and here)?

With her husband, attorney Bruce Lederman, leading her case, she is suing the state of New York (the state in which Governor Cuomo is pushing to now have teachers’ value-added scores count for approximately 50% of their total evaluations) to challenge the state’s teacher evaluation system. She is also being fully supported by her students, her principal, her superintendent, and a series of VAM experts including: Linda Darling-Hammond (Stanford), Aaron Pallas (Columbia University Teachers College), Carol Burris (Educator and Principal of the Year from New York), Brad Lindell (Long Island Research Consultant), and me (Arizona State University) (see their/our expert witness affidavits here). See also an affidavit more recently submitted by Jesse Rothstein (Berkeley) here, as well as the full document explaining the entire case – the Memorandum of Law – here.

Well, the Ledermans had their day in court this past Wednesday (August 12, 2015).

It was apparent in the hearing that the Judge had carefully read all the papers beforehand, and he was fully familiar with the issues. As per Bruce Lederman, “[t]he issue that seemed to catch the Judge’s attention the most was whether it was rational to have a system which decides in advance that 7% of teachers will be ineffective, regardless of actual results. The Judge asked numerous questions about whether it was fair to use a bell curve,” whereby, when a bell curve is used to distribute teachers’ growth or value-added scores, there will always be a set of “ineffective” teachers, regardless of whether they are in fact truly “ineffective.” This occurs not naturally but via the statistical manipulation needed to fit all scores to a normal distribution, spreading the scores out in order to make relative distinctions and categorizations (e.g., highly effective, effective, ineffective), the validity of which is highly uncertain (see, for example, a prior post here). Hence, “[t]he Judge pressed the lawyer representing New York’s Education Department very hard on this particular issue,” but the state’s lawyer did not (most likely because she could not) give the Judge a satisfactory explanation, justification, or rationale.
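
For those who want to see the arithmetic of a forced distribution, here is a minimal sketch in Python, using purely hypothetical numbers (not New York’s actual model): even if every teacher’s true effect is positive and tightly clustered, cutting the ranked distribution at a fixed percentile labels a fixed share of teachers “ineffective” every year, by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" growth effects for 1,000 teachers: all positive and
# tightly clustered, i.e., every teacher here is genuinely effective.
true_effects = rng.normal(loc=0.5, scale=0.05, size=1000)

# A forced distribution ranks teachers against one another and labels the
# bottom 7% "ineffective," no matter what the absolute scores look like.
cutoff = np.percentile(true_effects, 7)
labels = np.where(true_effects <= cutoff, "ineffective", "effective")

# ~0.07 every single year, by construction, not by any teacher's doing:
print((labels == "ineffective").mean())
```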

For more information on the case, see here the video that I feel best captures the case, thanks to CBS news in Albany. For another video see here, compliments of NBC news in Albany. See also two additional articles, here and here, with the latter including the photo of Sheri and Bruce Lederman pasted below.


EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video out, but following up on another post about the potential impact these videos should really have, I thought now an appropriate time to share. “We can be the change,” and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District, more specifically four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools, post the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

The Multiple Teacher Evaluation System(s) in New Mexico, from a Concerned New Mexico Parent

A “concerned New Mexico parent” who wrote a prior post for this blog here, wrote another for you all below, about the sheer number of different teacher evaluation systems, or variations, now in place in his/her state of New Mexico. (S)he writes:

Readers of this blog are well aware of the limitations of VAMs for evaluating teachers. However, many readers may not be aware that there are actually many system variations used to evaluate teachers. In the state of New Mexico, for example, 217 different variations are used to evaluate the many and diverse types of teachers teaching in the state [and likely all other states].

But. Is there any evidence that they are valid? NO. Is there any evidence that they are equivalent? NO. Is there any evidence that this is fair? NO.

The New Mexico Public Education Department (NMPED) provides a framework for teacher evaluations, and the final teacher evaluation should be weighted as follows: Improved Student Achievement (50%), Teacher Observations (25%), and Multiple Measures (25%).
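
For concreteness, here is a minimal sketch of the arithmetic behind such a weighted composite, assuming each component has already been placed on a common 0-100 scale; the component names and scores below are illustrative, not NMPED’s.

```python
# Prototype weights per the NMPED framework described above:
weights = {"student_achievement": 0.50,   # improved student achievement
           "observations": 0.25,          # teacher observations
           "multiple_measures": 0.25}     # multiple measures

def composite(scores):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    return sum(weights[k] * scores[k] for k in weights)

# Illustrative scores only:
print(composite({"student_achievement": 60,
                 "observations": 85,
                 "multiple_measures": 75}))  # 70.0
```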

Every school district in New Mexico is required to submit a detailed evaluation plan of specifically what measures will be used to satisfy the overall NMPED 50-25-25 percentage framework, after which NMPED approves all plans.

The exact details of any district’s educator effectiveness plan can be found on the NMTEACH website, as every public and charter school plan is posted here.

There are massive differences, however, in how groups of teachers are graded across districts, which distorts most everything about the system(s), including the extent to which similar (and different) teachers might be similarly (and fairly) evaluated and assessed.

Even within districts, there are massive differences in how grade level (elementary, middle, high school) teachers are evaluated.

And even something as seemingly simple as evaluating K-2 teachers involves 42 different variations in scoring.

Table 1 below shows the number of different scales used to calculate teacher effectiveness for each group of teachers and each grade level, for example, at the state level.

New Mexico divides all teachers into three categories — group A teachers have scores based on the statewide test (mathematics, English/language arts (ELA)), group B teachers (e.g., music or history) do not have a corresponding statewide test, and group C teachers teach grades K-2. Table 1 shows the number of scales used by New Mexico school districts for each teacher group, further broken down by grade level. For example, as illustrated, there are 42 different scales used to evaluate elementary-level Group A teachers in New Mexico. The column marked “Unique (one-offs)” indicates the number of scales that are completely unique for a given teacher group and grade level. For example, as illustrated, there are 11 unique scales used to grade Group B high school teachers; for each of these eleven scales, only one district, one grade level, and one teacher group is evaluated within the entire state.

Depending on the size of the school district, a unique scale may grade as few as a dozen teachers! In addition, there are 217 scales used statewide, with 99 of these scales being one-offs!

Table 1: New Mexico Teacher Evaluation System(s)

Group                                 Grade   Scales Used    Unique (one-offs)
Group A (SBA-based)                   All        58               15
  (e.g., 5th-grade English teacher)   Elem       42               10
                                      MS         37                2
                                      HS         37                3
Group B (non-SBA)                     All       117               56
  (e.g., Elem music teacher)          Elem       67               37
                                      MS         62                8
                                      HS         61               11
Group C (grades K-2)                  All        42               28
                                      Elem       42               28
TOTAL                                           217 variants      99 one-offs

The table above highlights the spectacular absurdity of the New Mexico Teacher Evaluation System.

(The complete listings of all variants for the three groups are contained here (in Table A for Group A), here (in Table B for Group B), and here (in Table C for Group C). The abbreviations and notes for these tables are listed here (in Table D).)

By approving all of these different formulas, all things considered, NMPED is also making the following nonsensical claims:

NMPED Claim: The prototype 50-25-25 percentage split has some validity.

There is no evidence to support this division between student achievement measures, observation, and multiple measures at all. It simply represents what NMPED could politically “get away with” in terms of a formula. Why not 60-20-20 or 57-23-20 or 46-18-36, etcetera? The NMPED prototype scale has no proven validity whatsoever.

NMPED Claim: All 217 formulas are equivalent to evaluate teachers.

This claim by NMPED is absurd on its face. Is there any evidence that they have cross-validated the tests? There is no evidence that any of these scales are valid or accurate measures of “teacher effectiveness.” Nor is there any evidence whatsoever that they are equivalent.

Further, if the formulas are equivalent (as NMPED claims), why is New Mexico wasting money on technology for administering SBA tests or End-of-Course exams? Why not use an NMPED-approved formula that includes tests like Discovery, MAPS, DIBELS, or Star that are already being used?

NMPED Claim: Teacher Attendance and Student Surveys are interchangeable.

According to the approved plans, many districts assign 10% to Teacher Attendance while other districts assign 10% to Student Surveys. Both variants have been approved by NMPED.

Mathematically (i.e., in terms of the proportions either is to be allotted), they appear to be interchangeable. If that is so, why is NMPED also specifically trying to enforce Teacher Attendance as an element of the evaluation scale? Why did Hanna Skandera proclaim to the press that this measure improved New Mexico education? (For typical news coverage on this topic, for example, see here.)

The use of teacher attendance appears to be motivated by union-busting rather than any mathematical rationale.

NMPED Claim: All observation methods are equivalent.

NMPED allows for three very different observation methods to be used for 40% of the final score. Each method is somewhat complicated and involves different observers.

There is no indication that NMPED has evaluated the reliability or validity of these three very different observation methods, or tested their results for equivalence. They simply assert that they are equivalent.

NMPED Claim: These formulas will be used to rate teachers.

These formulas are the worst kind of statistical jiggery-pokery (to use a newly current phrase). NMPED presents a seemingly rational, scientific number to the public using invalid and unvalidated mathematical manipulations and then determines teachers’ careers based on the completely bogus New Mexico teacher evaluation system(s).

Conclusion: Not only is the emperor naked, he has a closet containing 217 equivalent outfits at home!

Splits, Rotations, and Other Consequences of Teaching in a High-Stakes Environment in an Urban School

An Arizona teacher who teaches in a very urban, high-needs school writes about the realities of teaching in her school, under the pressures that come along with high-stakes accountability and with a teacher workforce and an administration both operating in chaos. This is a must read, as she also talks about two unintended consequences of educational reform in her school about which I’ve never heard before: splits and rotations. Both seem to occur at all costs simply to stay afloat during “rough” times, but both also likely have deleterious effects on students in such schools, as well as on the teachers being held accountable for the students “they” teach.

She writes:

Last academic year (2012-2013) a new system for evaluating teachers was introduced into my school district. And it was rough. Teachers were dropping like flies. Some were stressed to the point of requiring medical leave. Others were labeled ineffective based on a couple classroom observations and were asked to leave. By mid-year, the school was down five teachers. And there were a handful of others who felt it was just a matter of time before they were labeled ineffective and asked to leave, too.

The situation became even worse when the long-term substitutes who had been brought in to cover those teacher-less classrooms began to leave also. Those students with no contracted teacher and no substitute began getting “split”. “Splitting” is what the administration of a school does in a desperate effort to put kids somewhere. And where the students go doesn’t seem to matter. A class roster is printed, and the first five students on the roster go to teacher A. The second five students go to teacher B, and so on. Grade-level isn’t even much of a consideration. Fourth graders get split to fifth grade classrooms. Sixth graders get split to 5th and 7th grade classrooms. And yes, even 7th and 8th graders get split to 5th grade classrooms. Was it difficult to have another five students in my class? Yes. Was it made more difficult that they weren’t even of the same grade level I was teaching? Yes. This went on for weeks…

And then the situation became even worse. As it became more apparent that the revolving door of long-term substitutes was out of control, the administration began “The Rotation.” “The Rotation” was a plan that used the contracted teachers (who remained!) as substitutes in those teacher-less classrooms. And so once or twice a week, I (and others) would get an email from the administration alerting me that it was my turn to substitute during prep time. Was it difficult to sacrifice 20-40% of weekly prep time (time used to do essential work like planning lessons, gathering materials, grading, calling parents, etc.)? Yes. Was it difficult to teach in a classroom that had a different teacher, literally, every hour without coordinated lessons? Yes.

Despite this absurd scenario, in October 2013, I received a letter from my school district indicating how I fared in this inaugural year of the teacher evaluation system. It wasn’t good. Fifty percent of my performance label was based on school test scores (not on the test scores of my homeroom students). How well can students perform on tests when they don’t have a consistent teacher?

So when I think about accountability, I wonder now what it is I was actually held accountable for? An ailing, urban school? An ineffective leadership team who couldn’t keep a workforce together? Or was I just held accountable for not walking away from a no-win situation?

Coincidentally, this 2013-2014 academic year has, in many ways, mirrored 2012-2013. The upside is that this year, only 10% of my evaluation is based on school-wide test scores (the other 40% will be my homeroom students’ test scores). This year, I have a fighting chance to receive a good label. One more year of an unfavorable performance label and the district will have to, by law, do something about me. Ironically, if it comes to that point, the district can replace me with a long-term substitute, who is not subject to the same evaluation system that I am. Moreover, that long-term substitute doesn’t have to hold a teaching certificate. Further, that long-term substitute will cost the district a lot less money in benefits (i.e., healthcare, retirement system contributions).

I should probably start looking for a job—maybe as a long-term substitute.

Out with the Old, In with the New: Proposed Ohio Budget Bill to Revise the Teacher Evaluation System (Again)

Here is another post from VAMboozled!’s new team member – Noelle Paufler, Ph.D. – on Ohio’s “new and improved” teacher evaluation system, redesigned three years out from Ohio’s last attempt.

The Ohio Teacher Evaluation System (OTES) can hardly be considered “old” in its third year of implementation, and yet Ohio Budget Bill (HB64) proposes new changes to the system for the 2015-2016 school year. In a recent blog post, Plunderbund (aka Greg Mild) highlights the latest revisions to the OTES as proposed in HB64. (This post is also featured here on Diane Ravitch’s blog.)

Plunderbund outlines several key concerns with the budget bill including:

  • Student Learning Objectives (SLOs): In place of SLOs, teachers who are assigned to grade levels, courses, or subjects for which value-added scores are unavailable (i.e., via state standardized tests or vendor assessments approved by the Ohio Department of Education [ODE]) are to be evaluated “using a method of attributing student growth,” per HB64, Section 3319.111 (B) (2).
  • Attributed Student Growth: The value-added results of an entire school or district are to be attributed to teachers who otherwise do not have individual value-added scores for evaluation purposes. In this scenario, teachers are to be evaluated based upon the performance of students they may not have met in subject areas they do not directly teach.
  • Timeline: If enacted, the budget bill would not require the ODE to finalize the revised evaluation framework until October 31, 2015. Although the OTES has just now been fully implemented in most districts across the state, school boards would need to quickly revise teacher evaluation processes, forms, and software to comply with the new requirements well after the school year is already underway.

As Plunderbund notes, these newly proposed changes resurrect a series of long-standing questions of validity and credibility with regards to OTES. The proposed use of “attributed student growth” to evaluate teachers who are assigned to non-tested grade levels or subject areas has raised, and should raise, concerns among all teachers. This proposal presumes that an essentially two-tiered evaluation system can validly measure the effectiveness of some teachers based on presumably proximal outcomes (their individual students’ scores on state or approved vendor assessments) and others based on distal outcomes (at best) using attributed student growth. While the dust has scarcely settled with regards to OTES implementation, Plunderbund compellingly argues that this new wave of proposed changes would result in more confusion, frustration, and chaos among teachers and disruptions to student learning.

To learn more, read Plunderbund’s full critique of the proposed changes; again, click here.

The (Relentless) Will to Quantify

An article was just published in the esteemed, peer-reviewed journal Teachers College Record titled, “The Will to Quantify: The “Bottom Line” in the Market Model of Education Reform” and authored by Leo Casey – Executive Director of the Albert Shanker Institute. I will summarize its key points here (1) for those of you without the time to read the whole article (albeit worth reading) and (2) just in case the link above does not work for those of you out there without subscriptions to Teachers College Record.

In this article, Casey reviews the case of New York and the state department of education’s policy attempts to use New York teachers’ value-added data to reform the state’s public schools, “in the image and likeness of competitive businesses.” Casey interrogates this state’s history given the current, market-based, corporate reform environment surrounding (and swallowing) America’s public schools within New York, but also beyond.

Recall that New York is one of our states to watch, especially since the election of Governor Cuomo into the Governor’s office (see prior posts about New York here, here, and here). According to Casey, as demonstrated in this article, this is the state that best illustrates how “[t]he market model of education reform has become a prisoner to a Nietzschean will to quantify, in which the validity and reliability of the actual numbers is irrelevant.”

In New York, using the state’s large-scale standardized tests in English/language arts and mathematics, grades 3 through 8, teachers’ value-added data reports were first developed for approximately 18,000 teachers throughout the state for three school years: 2007-2010. The scores were constructed with all assurances that they “would not be used for [teacher] evaluation purposes,” while the state department specifically identified tenure decisions and annual rating processes as two areas where teachers’ value-added scores “would play no role.” At that time, the department of education also took a “firm position” that these reports would not be disclosed or shared outside of the school community (i.e., with the public).

Soon thereafter, however, the department of education, “acting unilaterally,” began to use the scores in tenure decisions and began to, after a series of Freedom of Information requests, release the scores to the media, who in turn released the scores to the public at large. By February of 2012, teachers’ value-added scores had been published by all major New York media.

Recall these articles, primarily about the worst teachers in New York (see, for example, here, here, and here), and recall the story of Pascale Mauclair – a sixth-grade teacher in Queens who was “pilloried” in the New York Post as the city’s “worst teacher” based solely on her value-added reports. After a more thorough investigation, however, “Mauclair proved to be an excellent teacher who had the unqualified support of her school, one of the best in the city: her principal declared without hesitation or qualification that she would put her own child in Mauclair’s class, and her colleagues met Mauclair with a standing ovation when she returned to the school after the Post’s attack. Mauclair’s undoing had been her own dedication to teaching students with the greatest needs. As a teacher of English as a Second Language, she had taken on the task of teaching small self-contained classes of recent immigrants for the last five years.”

Nonetheless, the state department of education continued (and continues) to produce data for New York teachers “with a single year of test score data, and sample sizes as low as 10…When students did not have a score in a previous year, scores were statistically “imputed” to them in order to produce a basis for making a growth measure.”

These scores had, and often continue to have (also across states), “average confidence intervals of 60 to 70 percentiles for a single-year estimate. On a distribution that went from 0 to 99, the average margins of error in the [New York scores] were, by the [state department of education’s] own calculations, 35 percentiles for Math and 53 percentiles for English Language Arts. One-third of all [scores], the [department] conceded, were completely unreliable—that is, so imprecise as to not warrant any confidence in them. The sheer magnitude of these numbers takes us into the realm of the statistically surreal.” Yet the state continues to this day in its efforts to use these data despite the gross statistical and consequential human errors present.
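
To make those margins concrete, here is a minimal sketch, assuming the quoted “margins of error” are the half-widths of the confidence intervals, of how little a single-year estimate pins down a teacher’s rank on the 0-99 percentile scale:

```python
def plausible_range(point_estimate, margin):
    """Clip estimate +/- margin of error to the 0-99 percentile scale."""
    return (max(0, point_estimate - margin), min(99, point_estimate + margin))

# Average margins of error reported for the New York scores, applied to a
# teacher whose point estimate sits at the 50th percentile:
print(plausible_range(50, 35))  # math: (15, 85), i.e., most of the scale
print(plausible_range(50, 53))  # ELA: (0, 99), i.e., the entire scale
```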

This is, in the words of Casey, “a demonstration of [extreme] professional malpractice in the realm of testing.” Yet, as Casey explains, for educational reformers like Governor Cuomo, as well as for “Michael Bloomberg, Joel Klein, and a cohort of similarly minded education reformers across the United States, the fundamental problem with American public education is that it has been organized as a monopoly that is not subject to the discipline of the marketplace. The solution to all that ails public schools, therefore, is to remake them in the image and likeness of a competitive business. Just as private businesses rise and fall on their ability to compete in the marketplace, as measured by the ‘bottom line’ of their profit balance sheet, schools need to live or die on their ability to compete with each other, based on an educational ‘bottom line.’ If ‘bad’ schools die and new ‘good’ schools are created in their stead, the productivity of education improves. But to undertake this transformation and to subject schools to market discipline, an educational ‘bottom line’ must be established. Standardized testing and value-added measures of performance based on standardized testing provide that ‘bottom line.’”

Finally, some of the key findings taken from other studies Casey cited in this piece are also good to keep in mind:

  • “A 2010 U.S. Department of Education study found that value-added measures in general have disturbingly high rates of error, with the use of three years of test data producing a 25% error rate in classifying teachers as above average, average, and below average and one year of test data yielding a 35% error rate.” Nothing much has changed in terms of error rates here, so this study still stands as one of the capstone pieces on this topic (see also the simulation sketch after this list).
  • “New York University Professor Sean Corcoran has shown that it is hard to isolate a specific teacher effect from classroom factors and school effects using value-added measures, and that in a single year of test scores, it is impossible to distinguish the teacher’s impact. The fewer the years of data and the smaller the sample (the number of student scores), the more imprecise the value-added estimates.”
  • Also recall that “the tests in question [are/were] designed for another purpose: the measure of student performance, not teacher or school performance.” That is, these tests were designed and validated to measure student achievement, BUT they were never designed or validated for their current purposes/uses: to measure teacher effects on student achievement.
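
As referenced in the first bullet above, here is a minimal simulation sketch of how measurement error alone misclassifies teachers across the three categories; the noise level is an assumption chosen for illustration, not any study’s actual parameter:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
true_effect = rng.normal(size=n)             # teachers' "true" effectiveness
noise_sd = 1.0                               # assumed single-year noise level
estimate = true_effect + rng.normal(scale=noise_sd, size=n)

def terciles(x):
    """0 = below average, 1 = average, 2 = above average (relative thirds)."""
    return np.digitize(x, np.quantile(x, [1 / 3, 2 / 3]))

# Share of teachers whose estimated category differs from their true category:
print(f"{(terciles(true_effect) != terciles(estimate)).mean():.0%} misclassified")
```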

Mirror, Mirror on the Wall…

No surprise, again, but Thomas Kane, an economics professor from Harvard University who also directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again (redundancy intended). I just posted about one of his recent articles published on the website of the Brookings Institution titled “Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?” after which I received another of his articles, this time published by the New York Daily News titled “Teachers Must Look in the Mirror.”

Embracing a fabled metaphor, while not to position teachers as the wicked queens or to position Kane as Snow White, let us ask ourselves the classic question: “Who is the fairest one of all?” as we critically review yet another fairytale authored by Harvard’s Kane. He has, after all, “carefully studied the best systems for rating teachers” (see other prior posts about Kane’s public perspectives on VAMs here and here).

In this piece, Kane continues to advance a series of phantasmal claims about the potentials of VAMs, this time in the state of New York, where Governor Andrew Cuomo intends to move the state’s teacher evaluation system up to one based 50% on teachers’ value-added, or 100% on value-added in cases where a teacher rated as “ineffective” on his/her value-added score can be rated as “ineffective” overall. Here, value-added could be used to trump all else (see prior posts about this here and here).

According to Kane, Governor Cuomo “picked the right fight.” The state’s new system “will finally give schools the tools they need to manage and improve teaching.” Perhaps the magic mirror would agree with such a statement, but research would evidence it vain.

As I have noted prior, there is absolutely no evidence, thus far, indicating that such systems have any (in)formative use or value. These data are first and foremost designed for summative, or summary, purposes; they are not designed for formative use. Accordingly, the data that come from such systems — besides the data that come from the observational components still being built into these systems, which have existed and been used for decades past — are not transparent, are difficult to understand, and are therefore challenging to use. Likewise, such data are not instructionally sensitive, and they are untimely in that test-based results typically come back to teachers well after their students have moved on to subsequent grade levels.

What about Kane’s claims against tenure: “The tenure process is the place to start. It’s the most important decision a principal makes. One poor decision can burden thousands of future students, parents, colleagues and supervisors.” This is quite an effect, considering that these new and improved teacher evaluation systems, as based (in this case largely) on VAMs, typically hold accountable only those elementary-level teachers who teach mathematics and reading/language arts. Even an elementary teacher with a 40-year career and an average of 30 students per class would directly impact (or burden) 1,200 students, maximum. This is not to say such impact is inconsequential, but is it as consequential as Kane’s sensational numbers imply? What about the thousands of parents, colleagues, and supervisors also to be burdened by one poor decision? Fair and objective? This particular mirror thinks not.

Granted, I am not making any claims about tenure, as I think all would agree that sometimes tenure can support, keeping with the metaphor, bad apples. Rather, I take issue with the exaggerations, including also that “Traditionally, principals have used much too low a standard, promoting everyone but the very worst teachers.” We must all check our assumptions here about how we define “the very worst teachers” and how many of them really lurk in the shadows of America’s now not-so-enchanted forests. There is no evidence to support this claim, either, just conjecture.

As for the solution, “Under the new law, the length of time it will take to earn tenure will be lengthened from three to four years.” Yes, that arbitrary, one-year extension will certainly help… Likewise, tenure decisions will now be made better using classroom observations (the data that have, according to Kane in this piece, been used for years to make all of these aforementioned bad decisions) and our new, fair and objective, test-based measures, which (a point Kane does not acknowledge) can only be used for about 30% of all teachers in America’s public schools. Nonetheless, “Student achievement gains [are to serve as] the bathroom scale, [and] classroom observations [are to serve] as the mirror.”

Kane continues, scripting, “Although the use of test scores has received all the attention, [one of] the most consequential change[s] in the law has been overlooked: One of a teacher’s observers must now be drawn from outside his or her school — someone whose only role is to comment on teaching.” Those from inside the school were only commenting on one’s beauty and fairness prior, I suppose, as “The fact that 96% of teachers were given the two highest ratings last year — being deemed either “effective” or “highly effective” — is a sure sign that principals have not been honest to date.”

All in all, perhaps somebody else should be taking a long hard “Look in the Mirror,” as this new law will likely do everything but “[open] the door to a renewed focus on instruction and excellence in teaching” despite the best efforts of “union leadership,” although I might add to Kane’s list many adorable little researchers who have also “carefully studied the best systems for rating teachers” and more or less agree on their intended and unintended results in…the end.

Vanderbilt Researchers on Performance Pay, VAMs, and SLOs

Do higher paychecks translate into higher student test scores? That is the question two researchers at Vanderbilt – Ryan Balch (a recent Graduate Research Assistant at Vanderbilt’s National Center on Performance Incentives) and Matthew Springer (Assistant Professor of Public Policy and Education and Director of Vanderbilt’s National Center on Performance Incentives) – attempted to answer in a recent study of the REACH pay-for-performance program in Austin, Texas (a nationally recognized performance program model with $62.3 million in federal support). The study, published in Education Economics Review, can be found here, albeit for a $19.95 fee; hence, I’ll do my best to explain this study’s contents so you all can save your money, unless of course you too want to dig deeper.

As background (and as explained on the first page of the full paper), the theory behind performance pay is that tying teacher pay to teacher performance provides “strong incentives” to improve outcomes of interest. “It can help motivate teachers to higher levels of performance and align their behaviors and interests with institutional goals.” I should note, however, that there is very mixed evidence from over 100 years of research on performance pay regarding whether it has ever worked. Economists tend to believe it works while educational researchers tend to disagree.

Regardless, in this study, as per a ResearchNews@Vanderbilt post highlighting it, researchers found that teacher-level growth in student achievement in mathematics and reading in schools in which teachers were given monetary performance incentives was significantly higher during the first year of the program’s implementation (2007-2008) than was the same growth in the nearest-matched neighborhood schools where teachers were not given performance incentives. Similar gains were maintained the following year, yet (as per the full report) no additional growth or loss was noted otherwise.

As per the full report as well, researchers more specifically found that students who were enrolled in the REACH program made between 0.13 and 0.17 standard deviations greater gains in mathematics, and (although not as evident or highlighted in the text of the actual report, but within a related table) between 0.05 and 0.10 standard deviations greater gains in reading, although the latter gains were also less significant in statistical terms. Curious…

While the method by which schools were matched was well detailed, and inter-school descriptive statistics were presented to help readers determine whether in fact the schools sampled for this study were comparable (albeit without statistics that would help us determine whether the noted inter-school differences were statistically significant enough to matter), the statistics comparing the teachers in REACH schools versus the non-REACH teachers to whom they were compared were completely missing. Hence, it is impossible to even begin to determine whether the matching methodology used actually yielded comparable samples down to the teacher level – the heart of this research study. This is a fatal flaw that, in my opinion, should have prevented this study from being published, at least as is, as without this information we have no guarantees that the teachers within these schools were indeed comparable.
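
For readers curious what the missing teacher-level check might look like, here is a minimal sketch using standardized mean differences on synthetic data; the covariates, the group flag, and the |SMD| < 0.10 rule of thumb are my assumptions for illustration, not the study’s:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
teachers = pd.DataFrame({
    "reach": rng.integers(0, 2, size=400),           # 1 = REACH-school teacher
    "years_experience": rng.normal(10, 5, size=400),
    "prior_value_added": rng.normal(0, 1, size=400),
})

def standardized_mean_diff(df, col, group="reach"):
    """(treated mean - control mean) / pooled SD; |SMD| < 0.10 ~ balanced."""
    treated = df.loc[df[group] == 1, col]
    control = df.loc[df[group] == 0, col]
    pooled_sd = ((treated.var() + control.var()) / 2) ** 0.5
    return (treated.mean() - control.mean()) / pooled_sd

for col in ["years_experience", "prior_value_added"]:
    print(col, round(standardized_mean_diff(teachers, col), 3))
```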

Regardless, researchers also examined teachers’ Student Learning Objectives (SLOs) – the incentive program’s “primary measure of individual teacher performance” given so many teachers are still VAM-ineligible (see a prior post about SLOs, here). They examined whether SLO scores correlated with VAM scores, for those teachers who had both.

They found, as per a quote by Springer in the above-mentioned post, that “[w]hile SLOs may serve as an important pedagogical tool for teachers in encouraging goal-setting for students, the format and guidance for SLOs within the specific program did not lead to the proper identification of high value-added teachers.” That is, more precisely and as indicated in the actual study, SLOs were “not significantly correlated with a teacher’s value-added student test scores;” hence, “a teacher is no more likely to meet his or her SLO targets if [his/her] students have higher levels of achievement [over time].” This has huge implications, particularly regarding the still-lacking evidence of validity surrounding SLOs.
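
The underlying check the authors describe is straightforward to express; here is a minimal sketch with randomly generated, illustrative data (not the study’s):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
vam = rng.normal(size=200)            # teachers' value-added estimates
slo = rng.integers(0, 2, size=200)    # 1 = met SLO target, 0 = did not

# Point-biserial correlation between SLO attainment and value-added:
r, p = pearsonr(vam, slo)
print(f"r = {r:.2f}, p = {p:.3f}")    # near zero here by construction; the
                                      # study likewise found no significant link
```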