Bias in Teacher Observations, As Well

Following a post last month titled “New Empirical Evidence: Students’ ‘Persistent Economic Disadvantage’ More Likely to Bias Value-Added Estimates,” Matt Barnum — senior staff writer for The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown and covering America’s education system “in crisis” (see, also, a prior post about The 74 here) — followed up with a tweet via Twitter. He wrote: “Yes, though [bias caused by economic disadvantage] likely applies with equal or even more force to other measures of teacher quality, like observations.” I replied via Twitter that I disagreed with this statement in that I was unaware of research in support of his assertion, and Barnum sent me two articles to review thereafter.

I attempted to review both of these articles herein, although I quickly figured out that I had actually read and reviewed the first (2014) piece on this blog (see original post here, see also a 2014 Brookings Institution article summarizing this piece here). In short, in this study researchers found that the observational components of states’ contemporary teacher systems certainly “add” more “value” than their value-added counterparts, especially for (in)formative purposes. However, researchers found that observational bias also exists, akin to value-added bias, in that teachers who are non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students entering their classrooms with lower levels of prior achievement. Researchers concluded that because districts “do not have processes in place to address the possible biases in observational scores,” statistical adjustments might be made to offset said bias, as might external observers/raters be brought in to yield more “objective” observational assessments of teachers.

For the second study, and this post here, I gave this one a more thorough read (you can find the full study, pre-publication here). Using data from the Measures of Effective Teaching (MET) Project, in which random assignment was used (or more accurately attempted), researchers also explored the extent to which students enrolled in teachers’ classrooms influence classroom observational scores.

They found, primarily, that:

  1. “[T]he context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ performance” as measured by teacher observations. More specifically, “ELA [English/language arts] teachers were more than twice as likely to be rated in the top performance quintile if [nearly randomly] assigned the highest achieving students compared with teachers assigned the lowest achieving students,” and “math teachers were more than 6 times as likely.” In addition, “approximately half of the teachers—48% in ELA and 54% in math—were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores.”
  2. “[T]he intentional sorting of teachers to students has a significant influence on measured performance” as well. More specifically, results further suggest that “higher performing students [are, at least sometimes] endogenously sorted into the classes of higher performing teachers…Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would reveal more effective teacher performance, as measured by classroom observation scores, than may actually be true.”
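To put the top-two-quintile percentages quoted in the first finding in proportion, here is a minimal sketch that simply divides one share by the other. The variable and function names are illustrative assumptions of mine, not anything from the study, and the resulting ratios are separate from the “more than twice” and “more than 6 times” figures, which concern the top quintile alone.

```python
# Shares of teachers rated in the top two performance quintiles, as quoted
# above from Steinberg & Garrett (2016). Names are illustrative only.
top_two_share = {
    "ELA": {"highest": 0.48, "lowest": 0.37},
    "math": {"highest": 0.54, "lowest": 0.18},
}


def rating_ratio(shares):
    """Ratio of the top-two-quintile share for teachers assigned the
    highest-achieving students to the share for teachers assigned the
    lowest-achieving students."""
    return shares["highest"] / shares["lowest"]


ratios = {subject: rating_ratio(s) for subject, s in top_two_share.items()}

for subject, ratio in ratios.items():
    print(f"{subject}: {ratio:.2f}x as likely to be rated in the top two quintiles")
```

On these quoted figures, the gap tied to incoming student achievement is roughly 1.3 to 1 in ELA and 3 to 1 in math.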

So, the non-random assignment of teachers biases both the value-added and observational components written into America’s now “more objective” teacher evaluation systems, as (formerly) required of all states that were to comply with federal initiatives and incentives (e.g., Race to the Top). In addition, when those responsible for assigning students to classrooms (sub)consciously favor teachers with high, prior observational scores, this exacerbates the issues. This is especially important when observational (and value-added) data are to be used for high-stakes accountability systems, in that the data yielded via both measurement systems may be less likely to reflect “true” teaching effectiveness due to “true” bias. “Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality,” and vice-versa.

Citation Study #1: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from

Citation Study #2: Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293-317. doi:10.3102/0162373715616249  Retrieved from


The “Value-Added” of Teacher Preparation Programs: New Research

The journal Economics of Education Review recently published a study titled “Teacher Quality Differences Between Teacher Preparation Programs: How Big? How Reliable? Which Programs Are Different?” The study was authored by researchers at the University of Texas – Austin, Duke University, and Tulane. The pre-publication version of this piece can be found here.

As the title implies, the purpose of the study was to “evaluate statistical methods for estimating teacher quality differences between TPPs [teacher preparation programs].” Needless to say, this research is particularly relevant, here, given “Sixteen US states have begun to hold teacher preparation programs (TPPs) accountable for teacher quality, where quality is estimated by teacher value-added to student test scores.” The federal government continues to support and advance these initiatives, as well (see, for example, here).

But this research study is also particularly important because, while researchers found that “[t]he most convincing estimates [of TPP quality] [came] from a value-added model where confidence intervals [were] widened” (that is, the extent to which measurement errors were permitted was dramatically increased, and also widened further using statistical corrections), even when using these statistical techniques and accommodations they found that it was still “rarely possible to tell which TPPs, if any, [were] better or worse than average.”

They therefore concluded that “[t]he potential benefits of TPP accountability may be too small to balance the risk that a proliferation of noisy TPP estimates will encourage arbitrary and ineffective policy actions” in response. More specifically, and in their own words, they found that:

  1. Differences between TPPs. While most of [their] results suggest that real differences between TPPs exist, the differences [were] not large [or large enough to make or evidence the differentiation between programs as conceptualized and expected]. [Their] estimates var[ied] a bit with their statistical methods, but averaging across plausible methods [they] conclude[d] that between TPPs the heterogeneity [standard deviation (SD) was] about .03 in math and .02 in reading. That is, a 1 SD increase in TPP quality predict[ed] just [emphasis added] a [very small] .03 SD increase in student math scores and a [very small] .02 SD increase in student reading scores.
  2. Reliability of TPP estimates. Even if the [above-mentioned] differences between TPPs were large enough to be of policy interest, accountability could only work if TPP differences could be estimated reliably. And [their] results raise doubts that they can. Every plausible analysis that [they] conducted suggested that TPP estimates consist[ed] mostly of noise. In some analyses, TPP estimates appeared to be about 50% noise; in other analyses, they appeared to be as much as 80% or 90% noise…Even in large TPPs the estimates were mostly noise [although]…[i]t is plausible [although perhaps not probable]…that TPP estimates would be more reliable if [researchers] had more than one year of data…[although states smaller than the one in this study — Texas]…would require 5 years to accumulate the amount of data that [they used] from one year of data.
  3. Notably Different TPPs. Even if [they] focus[ed] on estimates from a single model, it remains hard to identify which TPPs differ from the average…[Again,] TPP differences are small and estimates of them are uncertain.
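To make the “mostly noise” finding concrete, here is a minimal sketch of what such noise shares imply. It assumes, and this is my assumption rather than a statement of the authors’ exact method, that “X% noise” means X% of the variance in TPP estimates is estimation error under a classical measurement-error model (estimate = truth + independent error). The function names are hypothetical; the .03 heterogeneity SD in math is the figure quoted above.

```python
import math


def reliability(noise_share):
    """Share of estimate variance reflecting true TPP differences,
    ASSUMING 'X% noise' means X% of the variance is estimation error."""
    if not 0.0 <= noise_share < 1.0:
        raise ValueError("noise_share must be in [0, 1)")
    return 1.0 - noise_share


def estimate_sd(true_sd, noise_share):
    """SD of the noisy estimates implied by a true SD and a noise share,
    since var(estimate) = var(truth) / reliability under this model."""
    return true_sd / math.sqrt(reliability(noise_share))


TRUE_SD_MATH = 0.03  # heterogeneity SD in math, as quoted above

for noise in (0.5, 0.8, 0.9):
    rel = reliability(noise)
    print(
        f"noise={noise:.0%}  reliability={rel:.2f}  "
        f"corr(estimate, truth)={math.sqrt(rel):.2f}  "
        f"SD of estimates={estimate_sd(TRUE_SD_MATH, noise):.3f}"
    )
```

Under this reading, the spread of the reported TPP estimates would overstate the spread of true TPP quality, and increasingly so as the noise share rises, which is one way to see why ranking programs off such estimates is risky.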

In conclusion, that researchers found “that there are only small teacher quality differences between TPPs” might seem surprising, but it really is not, given that the outcome variables they used to measure and assess TPP effects were students’ test scores. In short, students’ test scores are three times removed from the primary unit of analysis in studies like these. That is, (1) the TPP is to be measured by the effectiveness of its teacher graduates, and (2) teacher graduates are to be measured by their purported impacts on their students’ test scores, while (3) students’ test scores are to be used, and have only been validated, for measuring student learning and achievement. These test scores have not been validated to assess and measure, in the inverse, teachers’ causal impacts on said achievements or TPPs’ impacts, via their teachers, on said achievements.

If this sounds confusing, it is, and also highly nonsensical, but this is also a reason why this is so difficult to do, and as evidenced in this study, improbable to do this well or as theorized in that TPP estimates are sensitive to error, insensitive given error, and, accordingly, highly uncertain and invalid.

Citation: von Hippel, P. T., Bellows, L., Osborne, C., Lincove, J. A., & Mills, N. (2016). Teacher quality differences between teacher preparation programs: How big? How reliable? Which programs are different? Economics of Education Review, 53, 31–45. doi:10.1016/j.econedurev.2016.05.002

VAM-Based Chaos Reigns in Florida, as Caused by State-Mandated Teacher Turnovers

The state of Florida is another one of our states to watch in that, even since the passage of the Every Student Succeeds Act (ESSA) last January, the state is still moving forward with using its VAMs for high-stakes accountability reform. See my most recent post about one district in Florida here, after the state ordered it to dismiss a good number of its teachers as per their low VAM scores when this school year started. After realizing this also caused or contributed to a teacher shortage in the district, the district scrambled to hire Kelly Services contracted substitute teachers to replace them, after which the district also put administrators back into the classroom to help alleviate the bad situation turned worse.

In a recent post released by The Ledger, teachers from the same Polk County School District (size = 100K students) added much needed details and also voiced concerns about all of this in the article that author Madison Fantozzi titled “Polk teachers: We are more than value-added model scores.”

Throughout this piece Fantozzi covers the story of Elizabeth Keep, a teacher who was “plucked from” the middle school in which she taught for 13 years, after which she was involuntarily placed at a district high school “just days before she was to report back to work.” She was one of 35 teachers moved from five schools in need of reform as based on schools’ value-added scores, although this was clearly done with no real concern or regard of the disruption this would cause these teachers, not to mention the students on the exiting and receiving ends. Likewise, and according to Keep, “If you asked students what they need, they wouldn’t say a teacher with a high VAM score…They need consistency and stability.” Apparently not. In Keep’s case, she “went from being the second most experienced person in [her middle school’s English] department…where she was department chair and oversaw the gifted program, to a [new, and never before] 10th- and 11th-grade English teacher” at the new high school to which she was moved.

As background, when Polk County School District officials presented turnaround plans to the State Board of Education last July, school board members “were most critical of their inability to move ‘unsatisfactory’ teachers out of the schools and ‘effective’ teachers in.” One board member, for example, expressed finding it “horrendous” that the district was “held hostage” by the extent to which the local union was protecting teachers from being moved as per their value-added scores. Referring to the union, and its interference in this “reform,” he accused the union of “shackling” the district and preventing its intended reforms. Note that the “effective” teachers who are to replace the “ineffective” ones can earn up to $7,500 in bonuses per year to help “turn around” the schools into which they enter.

Likewise, the state’s Commissioner of Education concurred saying that she also “wanted ‘unsatisfactory’ teachers out and ‘highly effective’ teachers in,” again, with effectiveness being defined by teachers’ value-added or lack thereof, even though (1) the teachers targeted only had one or two years of the three years of value-added data required by state statute, and even though (2) the district’s senior director of assessment, accountability and evaluation noted that, in line with a plethora of other research findings, teachers being evaluated using the state’s VAM have a 51% chance of changing their scores from one year to the next. This lack of reliability, as we know it, should outright prevent any such moves in that without some level of stability, valid inferences from which valid decisions are to be made cannot be drawn. It’s literally impossible.

Nonetheless, state board of education members “unanimously… threatened to take [all of the district’s poor performing schools] over or close them in 2017-18 if district officials [didn’t] do what [the Board said].” See also other tales of similar districts in the article available, again, here.

In Keep’s case, “her ‘unsatisfactory’ VAM score [that caused the district to move her, as] paired with her ‘highly effective’ in-class observations by her administrators brought her overall district evaluation to ‘effective’…[although she also notes that]…her VAM scores fluctuate because the state has created a moving target.” Regardless, Keep was notified “five days before teachers were due back to their assigned schools Aug. 8 [after which she was] told she had to report to a new school with a different start time that [also] disrupted her 13-year routine and family that shares one car.”

VAM-based chaos reigns, especially in Florida.

U.S. Department of Education: Value-Added Not Good for Evaluating Schools and Principals

Just this month, the Institute of Education Sciences (IES) wing of the U.S. Department of Education released a report about using value-added models (VAMs) for measuring school principals’ performance. The study, conducted by researchers at Mathematica Policy Research and titled “Can Student Test Scores Provide Useful Measures of School Principals’ Performance?”, can be found online here, with my summary of the study findings highlighted next and herein.

Before the passage of the Every Student Succeeds Act (ESSA), 40 states had written into their state statutes, as incentivized by the federal government, requirements to use growth in student achievement for annual principal evaluation purposes. More states had written growth/value-added models (VAMs) into statute for teacher evaluation purposes, which we have covered extensively via this blog, but this report pertains only to school and/or principal evaluation purposes. Since the passage of ESSA, and the reduction in the federal government’s control over state-level policies, states have much more liberty to decide whether to continue using student achievement growth for either purpose. This paper is positioned within this reasoning, and more specifically to help states decide whether or to what extent they might (or might not) continue to move forward with using growth/VAMs for school and principal evaluation purposes.

Researchers, more specifically, assessed (1) reliability – or the consistency or stability of these ratings over time, which is important “because only stable parts of a rating have the potential to contain information about principals’ future performance; unstable parts reflect only transient aspects of their performance;” and (2) one form of multiple evidences of validity – the predictive validity of these principal-level measures, with predictive validity defined as “the extent to which ratings from these measures accurately reflect principals’ contributions to student achievement in future years.” In short, “A measure could have high predictive validity only if [emphasis added] it was highly stable between consecutive years [i.e., reliability]…and its stable part was strongly related to principals’ contributions to student achievement” over time (i.e., predictive validity).

Researchers used principal-level value-added (unadjusted and adjusted for prior achievement and other potentially biasing demographic variables) to more directly examine “the extent to which student achievement growth at a school differed from average growth statewide for students with similar prior achievement and background characteristics.” Also important to note is that the data they used to examine school-level value-added came from Pennsylvania, which is one of a handful of states that uses the popular and proprietary (and controversial) Education Value-Added Assessment System (EVAAS) statewide.

Here are the researchers’ key findings, taken directly from the study’s summary (again, for more information see the full manuscript here).

  • The two performance measures in this study that did not account for students’ past achievement—average achievement and adjusted average achievement—provided no information for predicting principals’ contributions to student achievement in the following year.
  • The two performance measures in this study that accounted for students’ past achievement—school value-added and adjusted school value-added—provided, at most, a small amount of information for predicting principals’ contributions to student achievement in the following year. This was due to instability and inaccuracy in the stable parts.
  • Averaging performance measures across multiple recent years did not improve their accuracy for predicting principals’ contributions to student achievement in the following year. In simpler terms, a principal’s average rating over three years did not predict his or her future contributions more accurately than did a rating from the most recent year only. This is more of a statistical finding than one that has direct implications for policy and practice (except for silly states who might, despite findings like those presented in this study, decide that they can use one year to do this not at all well instead of three years to do this not at all well).

Their bottom line? “…no available measures of principal [/school] performance have yet been shown to accurately identify principals [/schools] who will contribute successfully to student outcomes in future years,” especially if based on students’ test scores, although the researchers also assert that “no research has ever determined whether non-test measures, such as measures of principals’ leadership practices, [have successfully or accurately] predict[ed] their future contributions” either.

The researchers follow up with a highly cautionary note: “the value-added measures will make plenty of mistakes when trying to identify principals [/schools] who will contribute effectively or ineffectively to student achievement in future years. Therefore, states and districts should exercise caution when using these measures to make major decisions about principals. Given the inaccuracy of the test-based measures, state and district leaders and researchers should also make every effort to identify nontest measures that can predict principals’ future contributions to student outcomes [instead].”

Citation: Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from

New Mexico’s “New, Bait and Switch” Schemes

“A Concerned New Mexico Parent” sent me another blog entry for you all to help you all stay apprised of the ongoing “situation” in New Mexico with its New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here, but in this one (s)he writes a response to an editorial that was recently released in support of the newest version of New Mexico’s teacher evaluation system. The editorial was titled: “Teacher evals have evolved but tired criticisms of them have not,” and it was published in the Albuquerque Journal, as also written by the Albuquerque Journal Editorial Board themselves.

(S)he writes:

The editorial seems to contain and promote many of the “talking points” provided by NMPED with their latest release of teacher evaluations. Hence, I would like to present a few observations on the editorial.

NMPED and the Albuquerque Journal Editorial Board both underscore the point that teachers are still primarily being (and should primarily continue to be) evaluated on the basis of their own students’ test scores (i.e., using a value-added model (VAM)), but it is actually not that simple. Rather, the new statewide teacher evaluation formula is shown here on their website, with one notable difference being that the state’s “new system” now replaces the previously district-wide variations that produced 217 scoring categories for teachers (see here for details).

Accordingly, it now appears that NMPED has kept the same 50% student achievement, 25% observations, and 25% multiple measures division as before. The “new” VAM, however, requires a minimum of three years of data for proper use. Without three years of data, NMPED is to use what it calls graduated considerations or “NMTEACH” steps to change the percentages used in the evaluation formulas by teacher type.

A small footnote on the NMTEACH website devoted to teacher evaluations explains these graduated considerations whereby “Each category is weighted according to the amount of student achievement data available for the teacher. Improved student achievement is worth from 0% to 50%; classroom observations are worth 25% to 50%; planning, preparation and professionalism is worth 15% to 40%; and surveys and/or teacher attendance is worth 10%.” In other words, student achievement represents between 0% and 50% of the total, classroom observations between 25% and 50%, planning, preparation and professionalism between 15% and 40%, and surveys and/or teacher attendance 10%.
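As a rough illustration of how these ranges work, the sketch below encodes only the two ends of each quoted range: the full weighting when student achievement data are available and the weighting when none are. The intermediate steps are not reproduced because their exact percentages are not given here. The component scores, their 0 to 100 scale, and all names are hypothetical values chosen for illustration, not NMPED data or NMPED’s actual formula.

```python
# Weights in percent at the two ends of the ranges quoted from the NMTEACH
# footnote. Intermediate steps are omitted because they are not listed here.
WEIGHTS = {
    "full_achievement_data": {
        "student_achievement": 50,
        "observations": 25,
        "planning_preparation_professionalism": 15,
        "surveys_or_attendance": 10,
    },
    "no_achievement_data": {
        "student_achievement": 0,
        "observations": 50,
        "planning_preparation_professionalism": 40,
        "surveys_or_attendance": 10,
    },
}


def composite(component_scores, weights):
    """Weighted average of component scores; weights must total 100%."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(component_scores[name] * w for name, w in weights.items()) / 100


# HYPOTHETICAL component scores on an assumed 0-100 scale.
example_scores = {
    "student_achievement": 40,
    "observations": 80,
    "planning_preparation_professionalism": 90,
    "surveys_or_attendance": 100,
}

for label, weights in WEIGHTS.items():
    print(label, composite(example_scores, weights))
```

With these made-up scores, the same teacher’s composite shifts from 63.5 to 86.0 purely because the achievement weight drops to zero and is redistributed to observations and planning, preparation and professionalism.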

The graduated considerations (Steps) are shown below, as per their use when substitutions are needed when student achievement data are missing:


Also, the NMTEACH “Steps” provide for the use of one year of data (Step 2 is used for 1-2 years of data.) I do not see how NMPED can calculate “student improvement” based on just one year’s worth of data.

Hence, this data substitution problem is likely massive. For example, for Category A teachers, 45 of the 58 formulas formerly used will require Step 1 substitutions. For Category B teachers, 112 of 117 prior formulas will require data substitution (Step 1), and all Category C teachers will require data substitution at the Step 1 level.

The reason that this presents a huge data problem is that the state’s prior teacher evaluation system did not require the use of so much end-of-course (EOC) data, and so the tests were not given for three years. Simultaneously and for Group C teachers, NMPED also introduced a new evaluation assessment plus software called iStation that is also in its first year of use.

Thus, for a typical Category B teacher, the evaluation will be based on 50% observation, 40% planning, preparation, and professionalism, and 10% on attendance.

Amazingly, none of this relates to student achievement, and it looks identical to the former administrator-based teacher evaluation system!

Such a “bait-and-switch” scheme will be occurring for most teachers in the state.

Further, in a small case-study I performed on a local New Mexico school (here), I found that not one single teacher in a seven-year period had “good” data for three consecutive years. This also has major implications here given the state’s notorious issues with their data, data management, and the like.

Notwithstanding, the Editorial Board also notes that “The evaluations consider only student improvement, not proficiency.” However, as noted above, little actual student achievement data are available for the strong majority of teachers’ evaluations; hence, the rate at which this will actually count and the rate at which it will appear to count to the public are two very different things.

Regardless, the Editorial Board thereafter proclaims that “The evaluations only rate teachers’ effect on their students over a school year…” Even the simple phrase “school year” is also problematic, however.

The easiest way to explain this is to imagine a student in a dual language program (a VERY common situation in New Mexico). Let’s follow his timeline of instruction and testing:

  • August 2015: The student begins the fourth grade with teachers A1 and A2.
  • March 2016: Seven months into the year the student is tested with test #1 at the 4th-grade level.
  • March 2016 – May 2016: The student finishes fourth grade with Teachers A1 and A2
  • June 2016 – Aug 2016: Summer vacation — no tests (i.e., differential summer learning and decay occurs)
  • August 2016: The student begins the fifth grade with teachers B1 and B2.
  • March 2017: Seven months into the year the student is tested with test #2 at the 5th-grade level.
  • March 2017 – May 2017: The student finishes fifth grade with Teachers B1 and B2
  • October 2017: A teacher receives a score based on this student’s improvement (along with other students like him, although coming from different A level teachers) from test#1 to test#2

To simplify, the test improvement is based on a test given before he has completed the grade level of interest with material taught by four teachers at two different grade levels over the span of one calendar year [this is something that is known in the literature as prior teachers’ residual effects].

And it gets worse. The NMPED requires that a student be assigned to only one teacher. According to the NMTEACH FAQ, in the case of team-teaching, “Students are assigned to one teacher. That teacher would get credit. A school could change teacher assignment each snapshot and thus both teachers would get counted automatically.”

I can only assume the Editorial Board members are brighter than I am because I cannot parse out the teacher evaluation values for my sample student.

Nevertheless, the Editorial Board also gushes with praise regarding the use of teacher attendance as an evaluation tool. This is just morally wrong.

Leave is not “granted” to teachers by some benevolent overlord. It is earned and is part of the union contract between teachers and the state. Imagine a job where you are told that you have two weeks vacation time but, of course, you can only take two days of it or you might be fired. Absurd, right? Well, apparently not if you are NMPED.

This is one of the major issues in the ongoing lawsuit, where as I recall, one of the plaintiffs was penalized for taking time off for the apparently frivolous task of cancer treatment! NMPED should be ashamed of themselves!

The Editorial Board also praises the new, “no lag time” aspect of the evaluation system. In the past, teacher evaluations were presented at the end of the school year before student scores were available. Now that the evaluations depend upon student scores, the evaluations appear early in the next school year. As noted in the timeline above, the lag time is still present contrary to what they assert. Further, these evaluations now come mid-term after the school-year has started and teacher assignments have been made.

In the end, and again in the title, the Editorial Board claims that the “Teacher evals have evolved but tired criticisms of them have not.”

The evals have not evolved but have rather devolved to something virtually identical to the former observation and administration-based evaluations. The tired criticisms are tired precisely because they have never been adequately answered by NMPED.

~A Concerned New Mexico Parent

New Mexico Lawsuit Update

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last fall, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), was unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here). Again, I’m serving as the expert witness on the side of the plaintiffs in this suit.

As you all likely also recall, in December of 2015, State District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state could proceed with “developing” and “improving” its teacher evaluation system, but the state was not to make any consequential decisions about New Mexico’s teachers using the data the state collected until the state (and/or others external to the state) could evidence to the court during another trial (initially set for April 2016, then postponed to October 2016, and likely to be postponed again) that the system is reliable, valid, fair, uniform, and the like (see prior post on this ruling here).

Well, many of you have (since these prior posts) written requesting updates regarding this lawsuit, and here is one as released jointly by the AFT and ATF. This accurately captures the current and ongoing situation:

September 23, 2016

Many of you will remember the classic Christmas program, Rudolph the Red Nose Reindeer, and how the terrible and menacing abominable snowman became harmless once his teeth were removed. This is how you should view the PED evaluation you recently received – a harmless abominable snowman.  

The math is still wrong, the methodology deeply flawed, but the preliminary injunction achieved by our union, removed the teeth from PED’s evaluations, and so there is no reason to worry. As explained below, we will continue to fight these evaluations and will not rest until the PED institutes an evaluation system that is fair, meaningful, and consistently applied.

For all of you, who just got arbitrarily labeled by the PED in your summative evaluations, just remember, like the abominable snowman, these labels have no teeth, and your career is safe.

2014-2015 Evaluations

These evaluations, as you know, were the subject of our lawsuit filed in 2014. As a result of the Court’s order, the preliminary injunction, no negative consequences can result from your value-added scores.

In an effort to comply with the Court’s order, the PED announced in May it would be issuing new regulations.  This did not happen, and it did not happen in June, in July, in August, or in September. The bottom line is the PED still has not issued new regulations – though it still promises that those regulations are coming soon. So much for accountability.

The trial on the old regulations, scheduled for October 24, has been postponed based upon the PED’s repetitive assertions that new regulations would be issued.

In addition, we have repeatedly asked the PED to provide their data, which they finally did, however it lacked the codebook necessary to meaningfully interpret the data. We view this as yet another stall tactic.

Soon, we will petition the Court for an order compelling PED to produce the documents it promised months ago. Our union’s lawyers and expert witnesses will use this data to critically analyze the PED’s claims and methodology … again.

2015-2016 Evaluations

Even though the PED has condensed the number of ways an educator can be evaluated in a false attempt to satisfy the Courts, the fact remains that value-added models are based on false math and highly inaccurate data. In addition to the PED’s information we have requested for the 2014-2015 evaluations, we have requested all data associated with the current 2015-2016 evaluations.

If our experts determine the summative evaluation scores are again, “based on fundamentally, and irreparably, flawed methodology which is further plagued by consistent and appalling data errors,” we will also challenge the 2015-2016 evaluations. If the PED ever releases new regulations, and we determine that they violate statute (again), we will challenge those regulations, as well.

Rest assured our union will not stop challenging the PED until we are satisfied they have adopted an evaluation system that is respectful of students and educators. We will keep you updated as we learn more information, including the release of new regulations and the rescheduled trial date.

In Solidarity,

Stephanie Ly                                   Ellen Bernstein
President, AFT NM                         President, ATF

A New Book about VAMs “On Trial”

I recently heard about a new book that was written by Mark Paige — J.D. and Ph.D., assistant professor of public policy at the University of Massachusetts-Dartmouth, and a former school law attorney — and published by Rowman & Littlefield. The book is about, as per the secondary part of its title “Understanding Value-Added Models [VAMs] in the Law of Teacher Evaluation.” See more on this book, including information about how to purchase it, for those of you who might be interested in reading more, here, and also via Amazon here.

Clearly, this book should prove very relevant given the ongoing court cases across the country (see a prior post on these cases here) regarding teachers and the systems being used to evaluate them, especially when those systems rely heavily on VAM-based estimates for consequential decision-making purposes (e.g., teacher tenure, pay, and termination). While I have not yet read the book, I just ordered my copy the other day. I suggest you do the same, again, should you be interested in further or better understanding the federal and state law pertinent to these cases.

Notwithstanding, I also requested that the author of this book — Mark Paige — write a guest post so that you too could find out more. Here is what he wrote:

Many of us have been following VAMs in legal circles. Several courts have faced the issue of VAMs as they relate to employment law matters. These cases have tested a chief selling point (pardon [or underscore] the business reference) of VAMs: that they will effectuate, for example, teacher termination with greater ease because nobody besides the advanced statisticians and econometricians can argue with the numbers derived. In other words, if a teacher’s VAM rating is bad, then the teacher must be bad. It’s to be as simple as that. How can a court deny that reality?

Of course, as we [should] already know, VAMs are anything but certain. Bluntly stated: VAMs are a statistical “hot mess.” The American Statistical Association, among many others, warned in no uncertain terms that VAMs cannot – and should not – be trusted to make significant employment decisions. Of course, that has not stopped many policymakers from a full-throated adoption of their use in employment and evaluation decisions. Talk about hubris.

Accordingly, I recently completed this book, again, that focuses squarely on the intersection of VAMs and the law. Its full title is “Building a Better Teacher: Understanding Value-Added Models in the Law of Teacher Evaluation” (Rowman & Littlefield, 2016). Again, I provide a direct link to the book along with its description here.

To offer a bit of a sneak preview, though, I draw many conclusions throughout the book, but one of two important take-aways is this: VAMs may actually complicate the effectuation of a teacher’s termination. Here’s one way: because VAMs are so statistically infirm, they invite plaintiff-side attorneys to attack any underlying negative decision based on these models. See, for example, the recent New York State Supreme Court decision in Sheri Lederman’s case, here. [See also a related post in this blog here].

In other words, the evidence upon which districts or states rely to make significant decisions is untrustworthy (or arbitrary) and, therefore, so is any decision based, even if only in part, on VAMs. Thus, VAMs may actually strengthen a teacher’s case. This, of course, is quite apart from the fact that VAM use results in firing good teachers based on poor information, thereby contributing to the teacher shortages and lower morale (among many other parades of horribles) being reported across the nation, now likely more than ever.

The second important take-away is this, especially given followers of this blog include many educators and administrators facing a barrage of criticisms that only “de-professionalize” them: Courts have, over time, consistently deferred to the professional judgment of administrators (and their assessment of effective teaching). The members of that august institution – the judiciary – actually believe that educators know best about teaching, and that years of accumulated experience and knowledge have actual and also court-relevant value. That may come as a startling revelation to those who consistently diminish the education profession, or those who at least feel like they and their efforts are consistently being diminished.

To be sure, the system of educator evaluation is not perfect. Our schools continue to struggle to offer equal and equitable educational opportunities to all students, especially those in the nation’s highest needs schools. But what this book ultimately concludes is that the continued use of VAMs will not, hu-hum, add any value to these efforts.

To reach author Mark Paige via email, please contact him at To reach him via Twitter: @mpaigelaw

New Mexico Is “At It Again”

“A Concerned New Mexico Parent” sent me yet another blog entry for you all to stay apprised of the ongoing “situation” in New Mexico and the continuous escapades of the New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here, but in this one (s)he writes what follows:

Well, the NMPED is at it again.

They just released the teacher evaluation results for the 2015-2016 school year. And, the report and media press releases are something.

Readers of this blog are familiar with my earlier documentation of the myriad varieties of scoring formulas used by New Mexico to evaluate its teachers. If I recall, I found something like 200 variations in scoring formulas [see his/her prior post on this here with an actual variation count at n=217].

However, a recent article published in the Albuquerque Journal indicates that, now according to the NMPED, “only three types of test scores are [being] used in the calculation: Partnership for Assessment of Readiness for College and Careers [PARCC], end-of-course exams, and the [state’s new] Istation literacy test.” [Recall from another article released last January that New Mexico’s Secretary of Education Hanna Skandera is also the head of the governing board for the PARCC test].

Further, the Albuquerque Journal article author reports that the “PED also altered the way it classifies teachers, dropping from 107 options to three. Previously, the system incorporated many combinations of criteria such as a teacher’s years in the classroom and the type of standardized test they administer.”

The new state-wide evaluation plan is also available in more detail here. Although I should also add that there has been no published notification of the radical changes in this plan. It was just simply and quietly posted on NMPED’s public website.

Important to note, though, is that for Group B teachers (all levels), the many variations documented previously have all been replaced by end-of-course (EOC) exams. Also note that for Group A teachers (all levels) the percentage assigned to the PARCC test has been reduced from 50% to 35%. (Oh, how the mighty have fallen …). The remaining 15% of the Group A score is to be composed of EOC exam scores.

There are only two small problems with this NMPED simplification.

First, in many districts, no EOC exams were given to Group B teachers in the 2015-2016 school year, and none were given in the previous year either. Any EOC scores that might exist were from a solitary administration of EOC exams three years previously.

Second, for Group A teachers whose scores formerly relied solely on the PARCC test for 50% of their score, no EOC exams were ever given.

Thus, NMPED has replaced their policy of evaluating teachers on the basis of students they don’t teach with this new policy of evaluating teachers on the basis of tests they never administered!

Well done, NMPED (not…)

Luckily, NMPED still cannot make any consequential decisions based on these data, again, until NMPED proves to the court that the consequential decisions that they would still very much like to make (e.g., employment, advancement and licensure decisions) are backed by research evidence. I know, interesting concept…

A Case of VAM-Based Chaos in Florida

Within a recent post, I wrote about my recent “silence,” explaining that, since the federal government’s (January 1, 2016) passage of the Every Student Succeeds Act (ESSA), which no longer requires teachers to be evaluated by their students’ test scores using VAMs (see prior posts on this here and here), “crazy” VAM-related events have apparently subsided. While I noted in the post that this did not mean that certain states and districts are not still drinking (and overdosing on) the VAM-based Kool-Aid, what I did not note is that I get many of the stories I cover on this blog via Google Alerts. This is where I have noticed a significant decline in VAM-related stories. Clearly, however, the news outlets often covered via Google Alerts don’t include district-level stories, so to cover these we must continue to rely on our followers (i.e., teachers, administrators, parents, students, school board members, etc.) to keep the stories coming.

Coincidentally, Billy Townsend, who is running for a school board seat in Polk County, Florida (district size = 100K students), sent me one such story. As an edublogger himself, he actually sent me three blog posts (see post #1, post #2, and post #3 listed by order of relevance) capturing what is happening in his district, again, as situated under the state of Florida’s ongoing, VAM-based nonsense. I’ve summarized the situation below as based on his three posts.

In short, the state ordered the district to dismiss a good number of its teachers as per their VAM scores when this school year started. “[T]his has been Florida’s [educational reform] model for nearly 20 years [actually since 1979, so 35 years]: Choose. Test. Punish. Stigmatize. Segregate. Turnover.” Because the district already had a massive teacher shortage as well, however, these teachers were replaced with Kelly Services contracted substitute teachers. Thereafter, district leaders decided that this was not “a good thing,” and they decided that administrators and “coaches” would temporarily replace the substitute teachers to make the situation “better.” While, of course, the substitutes’ replacements did not have VAM scores themselves, they were nonetheless deemed fit to teach and clearly more fit to teach than the teachers who were terminated as based on their VAM scores.

According to one teacher who wrote anonymously about her terminated colleagues, including one of the district’s “best” teachers: “She knew our kids well. She understood how to reach them, how to talk to them. Because she ‘looked like them’ and was from their neighborhood, she [also] had credibility with the students and parents. She was professional, always did what was best for students. She had coached several different sports teams over the past decade. Her VAM score just wasn’t good enough.”

Consequently, this has turned into a “chaotic reality for real kids and adults” throughout the county’s schools, and the district and state apparently realized this by “threaten[ing] all of [the district’s] teachers with some sort of ethics violation if they talk about what’s happening” throughout the district. While “[t]he repetition of stories that sound just like this from [the district’s] schools is numbing and heartbreaking at the same time,” the state, district, and school board, apparently, have “no interest” in such stories.

Put simply, and put well as this aligns with our philosophy here: “Let’s [all] consider what [all of this] really means: [Florida] legislators do not want to hear from you if you are communicating a real experience from your life at a school — whether you are a teacher, parent, or student. Your experience doesn’t matter. Only your test score.”

Isn’t that the unfortunate truth. Hence, and with reference to the introduction above, please do keep these relatively more invisible stories coming so that we can share them with the nation and make them more visible and accessible. VAMs, again, are alive and well, just perhaps in more undisclosed ways, like within districts, as is the case here.

Houston Education and Civil Rights Summit (Friday, Oct. 14 to Saturday, Oct. 15)

For those of you interested, and perhaps close to Houston, Texas, I will be presenting my research on the Houston Independent School District’s (now hopefully past) use of the Education Value-Added Assessment System for more high-stakes, teacher-level consequences than anywhere else in the nation.

As you may recall from prior posts (see, for example, here, here, and here), seven teachers in the district, with the support of the Houston Federation of Teachers (HFT), are taking the district to federal court over how their value-added scores are/were being used, and allegedly abused. The case, Houston Federation of Teachers, et al. v. Houston ISD, is still ongoing; although, also as per a prior post, the school board just this past June, in a 3:3 split vote, elected to no longer pay an annual $680K to SAS Institute Inc. to calculate the district’s EVAAS estimates. Hence, by non-renewing this contract it appears, at least for the time being, that the district is free from its prior history using the EVAAS for high-stakes accountability. See also this post here for an analysis of Houston’s test scores post EVAAS implementation, as compared to other districts in the state of Texas. Apparently, all of the time and energy invested did not pay off for the district, or more importantly for the teachers and students located within its boundaries.

Anyhow, those presenting and attending the conference–the Houston Education and Civil Rights Summit, as also sponsored and supported by United Opt Out National–will prioritize and focus on the “continued challenges of public education and the teaching profession [that] have only been exacerbated by past and current policies and practices,”  as well as “the shifting landscape of public education and its impact on civil and human rights and civil society.”

As mentioned, I will be speaking, alongside two featured speakers: Samuel Abrams–the Director of the National Center for the Study of Privatization in Education (NCSPE) and an instructor in Columbia’s Teachers College, and Julian Vasquez Heilig–Professor of Educational Leadership and Policy Studies at California State Sacramento and creator of the blog Cloaking Inequality. For more information about these and other speakers, many of whom are practitioners, see  the conference website available, again, here.

When is it? Friday, October 14, 2016 at 4:00 PM through to Saturday, October 15, 2016 at 8:00 PM (CDT).

Where is it? Houston Hilton Post Oak – 2001 Post Oak Blvd, Houston, TX 77056

Hope to see you there!