“You Are More Than Your EVAAS Score!”

Justin Parmenter is a seventh-grade language arts teacher in Charlotte, North Carolina. Via his blog — Notes from the Chalkboard — he writes “musings on public education.” You can subscribe to his blog at the bottom of any of his blog pages, one of which I copied and pasted below for all of you following this blog (now at 43K followers!!).

His recent post is titled “Take heart, NC teachers. You are more than your EVAAS score!” and serves as a solid reminder of what teachers’ value-added scores (namely, in this case, teachers’ Education Value-Added Assessment System (EVAAS) scores) cannot tell you, us, or pretty much anybody about anyone’s worth as a teacher. Do give it a read, and do give him a shout out by sharing this with others.

*****

Last night an email from the SAS corporation hit the inboxes of teachers all across North Carolina.  I found it suspicious and forwarded it to spam.

EVAAS is a tool that SAS claims shows how effective educators are by measuring precisely what value each teacher adds to their students’ learning.  Each year teachers board an emotional roller coaster as they prepare to find out whether they are great teachers, average teachers, or terrible teachers–provided they can figure out their logins.

NC taxpayers spend millions of dollars on this tool, and SAS founder and CEO James Goodnight is the richest person in North Carolina, worth nearly $10 billion.  However, over the past few years, more and more research has shown that value-added ratings like EVAAS are highly unstable and are unable to account for the many factors that influence our students and their progress. Lawsuits have sprung up from Texas to Tennessee, charging, among other things, that use of this data to evaluate teachers and make staffing decisions violates teachers’ due process rights, since SAS refuses to reveal the algorithms it uses to calculate scores.

By coincidence, the same day I got the email from SAS, I also got this email from the mother of one of my 7th grade students:

Photos attached provided evidence that the student was indeed reading at the dinner table.

The student in question had never thought of himself as a reader.  That has changed this year–not because of any masterful teaching on my part, but just because he had the right book in front of him at the right time.

Here’s my point:  We need to remember that EVAAS can’t measure the most important ways teachers are adding value to our students’ lives.  Every day we are turning students into lifelong independent readers. We are counseling them through everything from skinned knees to school shootings.  We are mediating their conflicts. We are coaching them in sports. We are finding creative ways to inspire and motivate them. We are teaching them kindness and empathy.  We are doing so much more than helping them pass a standardized test at the end of the year.

So if you figure out your EVAAS login today, NC teachers, take heart.  You are so much more than your EVAAS score!

A North Carolina Teacher’s Guest Post on His/Her EVAAS Scores

A teacher from the state of North Carolina recently emailed me for my advice regarding how to help him/her read and understand his/her recently received Education Value-Added Assessment System (EVAAS) value-added scores. You likely recall that the EVAAS is the model I cover most on this blog, in part because it is the system I have researched the most; it is also the proprietary system adopted by multiple states (e.g., Ohio, North Carolina, and South Carolina) and districts across the country, for which taxpayers continue to pay big $. Of late, this is also the value-added model (VAM) of sole interest in the recent lawsuit that teachers won in Houston (see here).

You might also recall that the EVAAS is the system developed by the now late William Sanders (see here), who ultimately sold it to SAS Institute Inc., which now holds all rights to the VAM (see also prior posts about the EVAAS here, here, here, here, here, and here). It is also important to note that this teacher teaches in North Carolina, where SAS Institute Inc. is located and where its CEO, James Goodnight, is considered the richest man in the state; as a major Grand Old Party (GOP) donor, he helps to set all of the state’s education policy, as the state is also dominated by Republicans. All of this also means that it is unlikely the EVAAS will go anywhere unless there is honest and open dialogue about the shortcomings of the data.

Hence, the attempt here is to begin at least some of that honest and open dialogue. Accordingly, here is what this teacher wrote in response to my request that (s)he write a guest post:

***

SAS Institute Inc. claims that the EVAAS enables teachers to “modify curriculum, student support and instructional strategies to address the needs of all students.”  My goal this year is to see whether these claims are actually possible or true. I’d like to dig deep into the data made available to me — for which my state pays over $3.6 million per year — in an effort to see what these data say about my instruction.

For starters, here is what my EVAAS-based growth looks like over the past three years:

As you can see, three years ago I met my expected growth, but my growth measure was slightly below zero. The year after that I knocked it out of the park. This past year I was right in the middle of my prior two years of results. Notice the volatility [aka an issue with VAM-based reliability, or consistency, or a lack thereof; see, for example, here].

Notwithstanding, SAS Institute Inc. makes the following recommendations in terms of how I should approach my data:

Reflecting on Your Teaching Practice: Learn to use your Teacher reports to reflect on the effectiveness of your instructional delivery.

The Teacher Value Added report displays value-added data across multiple years for the same subject and grade or course. As you review the report, you’ll want to ask these questions:

  • Looking at the Growth Index for the most recent year, were you effective at helping students to meet or exceed the Growth Standard?
  • If you have multiple years of data, are the Growth Index values consistent across years? Is there a positive or negative trend?
  • If there is a trend, what factors might have contributed to that trend?
  • Based on this information, what strategies and instructional practices will you replicate in the current school year? What strategies and instructional practices will you change or refine to increase your success in helping students make academic growth?

Yet my growth index values are not consistent across years, as also noted above. Rather, my “trends” are baffling to me.  When I compare those three instructional years in my mind, nothing stands out to me in terms of differences in instructional strategies that would explain the fluctuations in growth measures, either.

So let’s take a closer look at my data for last year (i.e., 2016-2017).  I teach 7th grade English/language arts (ELA), so my numbers are based on my students’ grade 7 reading scores in the table below.

What jumps out for me here is the contradiction in “my” data for achievement Levels 3 and 4 (achievement levels start at Level 1 and top out at Level 5, with Levels 3 and 4 considered proficient/middle of the road).  There is moderate evidence that my grade 7 students who scored a Level 4 on the state reading test exceeded the Growth Standard.  But there is also moderate evidence that my same grade 7 students who scored Level 3 did not meet the Growth Standard.  At the same time, the percentage of my students demonstrating proficiency on the same reading test (by scoring at least a Level 3) increased from 71% in 2015-2016 (when I exceeded expected growth) to 76% in 2016-2017 (when my growth declined significantly). This makes no sense, right?

Hence, and after considering my data above, the question I’m left with is actually really important:  Are the instructional strategies I’m using for my students whose achievement levels are in the middle working, or are they not?

I’d love to hear from other teachers on their interpretations of these data.  A tool that costs taxpayers this much money and impacts teacher evaluations in so many states should live up to its claims of being useful for informing our teaching.

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest and thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written in Audrey’s prior post) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful, the points are colored by quintile of the simulated teachers’ value-added scores (the lowest 20%, the next 20%, and so on).

In this figure you can see, by looking at the blue “least squares line,” that, “on average,” as a simulated teacher’s value-added estimate increases, the average of that teacher’s observational estimate increases. However, there is a lot of variability (scatter) around the line. Given this variability, we can make statements about averages, such as “on average” teachers in the top 20% for VAM scores will likely have higher observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in the observational scores. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R squared). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to make statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) of VAM have a very large spread in their observational scores, with 95% of the scores in the top quintile falling between the 7th and 98th percentiles for their observational scores. So, if we observe a VAM for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than that their observational score is likely to be in the top 90%. Similarly, if we observe a VAM in the bottom 20%, we cannot say much more than that their observational score is likely to be somewhere in the bottom 90%. That’s not saying a lot, in terms of precision or in terms of practice.
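For readers who want to reproduce something like this first scatterplot, here is a minimal sketch in Python (Dr. Kapitula’s original simulations were written in R and are available from her directly; see the note at the end of this post). The sample size, score scales, and seed below are my own assumptions, not hers.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical sample size; the true (population) correlation is set to r = 0.28.
n, true_r = 1000, 0.28
cov = [[1.0, true_r], [true_r, 1.0]]
vam, obs = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Quintile membership by simulated VAM score (0 = lowest 20%, ..., 4 = highest 20%).
quintile = np.digitize(vam, np.quantile(vam, [0.2, 0.4, 0.6, 0.8]))

observed_r = np.corrcoef(vam, obs)[0, 1]
print(f"observed r = {observed_r:.2f}; r squared = {observed_r ** 2:.1%}")  # roughly 8% shared variance

# Spread of observational percentiles among top-quintile VAM teachers,
# illustrating how imprecise individual-level predictions are.
obs_percentile = 100 * (obs.argsort().argsort() + 1) / n
top_quintile = obs_percentile[quintile == 4]
print("top VAM quintile, observational percentiles (2.5th, 97.5th):",
      np.percentile(top_quintile, [2.5, 97.5]).round(1))

plt.scatter(vam, obs, c=quintile, cmap="viridis", s=12)
plt.xlabel("Simulated value-added (VAM) estimate")
plt.ylabel("Simulated observational estimate")
plt.title("True correlation r = 0.28, colored by VAM quintile")
plt.show()
```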

I ran the second scatterplot to test how bias that impacts only a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I again simulated two values for a population of teachers: a teacher’s value-added score and a teacher’s observational score. I then inserted a group of teachers (as Audrey described) who represent 20% of the population and teach a disproportionate number of students who come from relatively lower socioeconomic, higher racial minority, etc. backgrounds. I assume this group is measured with negative bias (by one standard deviation, on average) on both indicators and has a moderate correlation between the two indicators of r = 0.50; for the other 80% of the population, the two indicators are assumed to be uncorrelated (r = 0).

What you can see is that bias impacting only a certain group of teachers on the two indicators can, by itself, produce an observed correlation overall. In other words, a correlation present in just one group of teachers (i.e., in this case, the teachers scoring lowest on both their value-added and observational indicators) can be relatively stronger than the “weak” correlation observed on average or overall.
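Here is a comparable Python sketch of this second scenario. The 20/80 split, the one-standard-deviation negative bias, and the within-group correlations (r = 0.50 and r = 0) follow the description above; the sample size, the bivariate-normal construction, and the seed are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
n_biased = int(0.20 * n)  # 20% of teachers assumed to be measured with negative bias

def bivariate(size, r, mean=0.0):
    """Draw (VAM, observational) pairs with true correlation r and a common mean shift."""
    cov = [[1.0, r], [r, 1.0]]
    return rng.multivariate_normal([mean, mean], cov, size=size).T

# 80% of teachers: VAM and observational scores uncorrelated (r = 0), no bias.
vam_rest, obs_rest = bivariate(n - n_biased, r=0.0)
# 20% of teachers: both indicators shifted down by one standard deviation,
# with a moderate within-group correlation of r = 0.50.
vam_biased, obs_biased = bivariate(n_biased, r=0.50, mean=-1.0)

vam = np.concatenate([vam_rest, vam_biased])
obs = np.concatenate([obs_rest, obs_biased])

print("overall r:", round(np.corrcoef(vam, obs)[0, 1], 2))
print("r within the unbiased 80%:", round(np.corrcoef(vam_rest, obs_rest)[0, 1], 2))
print("r within the biased 20%:", round(np.corrcoef(vam_biased, obs_biased)[0, 1], 2))
```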

Another possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quintiles on VAM have a different linear relationship with the observational score. For example, in the plot there is not a constant slope: teachers who are in the first quintile on VAM are assumed to have a correlation of r = 0.50 with observational scores, teachers in the second quintile a correlation of r = 0.20, and teachers in the other quintiles no correlation at all. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e., a very small chance that a correlation of this size would be observed by random chance alone if the true correlation were zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in their mean observational scores depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not very informative, as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson correlations (the standard correlation coefficient) measure the overall strength of linear relationships between X and Y, but if X and Y have a non-linear relationship (as illustrated above), this statistic can be very misleading.
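A rough Python sketch of this third, non-linear scenario follows. One caveat: the construction below controls the within-quintile correlations directly (0.50, 0.20, then zero), so the overall Pearson correlation it produces will not reproduce Dr. Kapitula’s r = 0.24 exactly, since that value depends on precisely how the quintile-specific relationships are built; the sketch is only meant to show the general phenomenon of a relationship concentrated in the bottom quintiles yielding a small yet statistically significant overall correlation.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 5000  # hypothetical sample size
vam = rng.standard_normal(n)

# Quintile membership by VAM score (0 = lowest 20%, ..., 4 = highest 20%).
quintile = np.digitize(vam, np.quantile(vam, [0.2, 0.4, 0.6, 0.8]))

# Assumed within-quintile correlations with the observational score:
# r = 0.50 in the bottom quintile, r = 0.20 in the second, zero elsewhere.
within_r = {0: 0.50, 1: 0.20, 2: 0.0, 3: 0.0, 4: 0.0}

obs = np.empty(n)
for q, r in within_r.items():
    idx = quintile == q
    z = (vam[idx] - vam[idx].mean()) / vam[idx].std()  # standardize VAM within the quintile
    obs[idx] = r * z + np.sqrt(1 - r ** 2) * rng.standard_normal(idx.sum())

overall_r, p_value = pearsonr(vam, obs)
print(f"overall Pearson r = {overall_r:.3f}, p = {p_value:.2g}")
for q in range(5):
    idx = quintile == q
    print(f"quintile {q + 1}: within-quintile r = {pearsonr(vam[idx], obs[idx])[0]:.2f}")
```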

Note also that for all of these simulations very small p-values are observed (e.g., p-values < 0.0000001), which, again, means these correlations are statistically significant, or that the probability of observing correlations this large by chance, if the true correlation were zero, is nearly 0%. What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between ‘Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).

At the end of the day, r = 0.28 is still a “weak” correlation. In addition, it might be “weak” on average, but much stronger, and statistically and practically significant, for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above), who typically teach the highest-needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

New Mexico’s “New, Bait and Switch” Schemes

“A Concerned New Mexico Parent” sent me another blog entry to help you all stay apprised of the ongoing “situation” in New Mexico with its New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here; in this one (s)he writes a response to an editorial that was recently released in support of the newest version of New Mexico’s teacher evaluation system. The editorial was titled “Teacher evals have evolved but tired criticisms of them have not,” and it was published in the Albuquerque Journal, written by the Albuquerque Journal Editorial Board itself.

(S)he writes:

The editorial seems to contain and promote many of the “talking points” provided by NMPED with their latest release of teacher evaluations. Hence, I would like to present a few observations on the editorial.

NMPED and the Albuquerque Journal Editorial Board both underscore the point that teachers are still primarily being (and should primarily continue to be) evaluated on the basis of their own students’ test scores (i.e., using a value-added model (VAM)), but it is actually not that simple. Rather, the new statewide teacher evaluation formula is shown here on their website, with one notable difference being that the state’s “new system” now replaces the previous district-by-district variations that produced 217 scoring categories for teachers (see here for details).

Accordingly, it now appears that NMPED has kept the same 50% student achievement, 25% observations, and 25% multiple measures division as before. The “new” VAM, however, requires a minimum of three years of data for proper use. Without three years of data, NMPED is to use what it calls graduated considerations or “NMTEACH” steps to change the percentages used in the evaluation formulas by teacher type.

A small footnote on the NMTEACH website devoted to teacher evaluations explains these graduated considerations, whereby “Each category is weighted according to the amount of student achievement data available for the teacher. Improved student achievement is worth from 0% to 50%; classroom observations are worth 25% to 50%; planning, preparation and professionalism is worth 15% to 40%; and surveys and/or teacher attendance is worth 10%.” In other words, student achievement represents between 0% and 50% of the total, classroom observations between 25% and 50%, planning, preparation, and professionalism between 15% and 40%, and surveys and/or teacher attendance 10%.
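To make the arithmetic of these graduated considerations concrete, here is a minimal Python sketch of how such a re-weighting works when student achievement data are missing (the case described for the Category B teacher below). The step labels and weights in the sketch are illustrative placeholders that fall within the quoted ranges; they are not NMPED’s actual NMTEACH step weights, which appear in the table that follows.

```python
# Illustrative only: these step weights are hypothetical placeholders chosen to fall
# within the ranges quoted above (achievement 0%-50%, observations 25%-50%,
# planning/preparation/professionalism 15%-40%, surveys/attendance 10%).
# NMPED's actual NMTEACH step weights appear in the table below.
STEP_WEIGHTS = {
    "full_achievement_data": {"achievement": 0.50, "observation": 0.25, "planning": 0.15, "attendance": 0.10},
    "no_achievement_data":   {"achievement": 0.00, "observation": 0.50, "planning": 0.40, "attendance": 0.10},
}

def composite_score(component_scores: dict, step: str) -> float:
    """Weighted composite (0-100 scale) for a given graduated-consideration step."""
    weights = STEP_WEIGHTS[step]
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(w * component_scores.get(part, 0.0) for part, w in weights.items())

# A hypothetical teacher with no usable achievement data: the composite is driven
# entirely by observations, planning/preparation/professionalism, and attendance.
scores = {"observation": 80.0, "planning": 70.0, "attendance": 95.0}
print(composite_score(scores, "no_achievement_data"))  # 0.50*80 + 0.40*70 + 0.10*95 = 77.5
```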

The graduated considerations (Steps) are shown below, as per their use when substitutions are needed because student achievement data are missing:

[Table: NMTEACH graduated considerations (Steps)]

Also, the NMTEACH “Steps” provide for the use of one year of data (Step 2 is used for 1-2 years of data.) I do not see how NMPED can calculate “student improvement” based on just one year’s worth of data.

Hence, this data substitution problem is likely massive. For example, for Category A teachers, 45 of the 58 formulas formerly used will require Step 1 substitutions. For Category B teachers, 112 of 117 prior formulas will require data substitution (Step 1), and all Category C teachers will require data substitution at the Step 1 level.

The reason that this presents a huge data problem is that the state’s prior teacher evaluation system did not require the use of so much end-of-course (EOC) data, and so the tests were not given for three years. Simultaneously, for Group C teachers, NMPED also introduced a new evaluation assessment plus software, called iStation, that is also in its first year of use.

Thus, for a typical Category B teacher, the evaluation will be based on 50% observation, 40% planning, preparation, and professionalism, and 10% on attendance.

Amazingly, none of this relates to student achievement, and it looks identical to the former administrator-based teacher evaluation system!

Such a “bait-and-switch” scheme will be occurring for most teachers in the state.

Further, in a small case-study I performed on a local New Mexico school (here), I found that not one single teacher in a seven-year period had “good” data for three consecutive years. This also has major implications here given the state’s notorious issues with their data, data management, and the like.

Notwithstanding, the Editorial Board also notes that “The evaluations consider only student improvement, not proficiency.” However, as noted above, little actual student achievement data is available for the strong majority of teachers’ evaluations; hence, the rate at which student achievement will actually count and the rate at which it will appear to the public to count are two very different things.

Regardless, the Editorial Board thereafter proclaims that “The evaluations only rate teachers’ effect on their students over a school year…” Even the simple phrase “school year” is also problematic, however.

The easiest way to explain this is to imagine a student in a dual language program (a VERY common situation in New Mexico). Let’s follow his timeline of instruction and testing:

  • August 2015: The student begins the fourth grade with teachers A1 and A2.
  • March 2016: Seven months into the year, the student is tested with test #1 at the 4th-grade level.
  • March 2016 – May 2016: The student finishes fourth grade with teachers A1 and A2.
  • June 2016 – August 2016: Summer vacation; no tests (i.e., differential summer learning and decay occurs).
  • August 2016: The student begins the fifth grade with teachers B1 and B2.
  • March 2017: Seven months into the year, the student is tested with test #2 at the 5th-grade level.
  • March 2017 – May 2017: The student finishes fifth grade with teachers B1 and B2.
  • October 2017: A teacher receives a score based on this student’s improvement from test #1 to test #2 (along with that of other students like him, although coming from different grade 4 [A-level] teachers).

To simplify, the test improvement is based on a test given before he has completed the grade level of interest with material taught by four teachers at two different grade levels over the span of one calendar year [this is something that is known in the literature as prior teachers’ residual effects].

And it gets worse. The NMPED requires that a student be assigned to only one teacher. According to the NMTEACH FAQ, in the case of team-teaching, “Students are assigned to one teacher. That teacher would get credit. A school could change teacher assignment each snapshot and thus both teachers would get counted automatically.”

I can only assume the Editorial Board members are brighter than I am because I cannot parse out the teacher evaluation values for my sample student.

Nevertheless, the Editorial Board also gushes with praise regarding the use of teacher attendance as an evaluation tool. This is just morally wrong.

Leave is not “granted” to teachers by some benevolent overlord. It is earned and is part of the union contract between teachers and the state. Imagine a job where you are told that you have two weeks vacation time but, of course, you can only take two days of it or you might be fired. Absurd, right? Well, apparently not if you are NMPED.

This is one of the major issues in the ongoing lawsuit where, as I recall, one of the plaintiffs was penalized for taking time off for the apparently frivolous task of cancer treatment! NMPED should be ashamed of themselves!

The Editorial Board also praises the new, “no lag time” aspect of the evaluation system. In the past, teacher evaluations were presented at the end of the school year before student scores were available. Now that the evaluations depend upon student scores, the evaluations appear early in the next school year. As noted in the timeline above, the lag time is still present contrary to what they assert. Further, these evaluations now come mid-term after the school-year has started and teacher assignments have been made.

In the end, and again in the title, the Editorial Board claims that the “Teacher evals have evolved but tired criticisms of them have not.”

The evals have not evolved but have rather devolved to something virtually identical to the former observation and administration-based evaluations. The tired criticisms are tired precisely because they have never been adequately answered by NMPED.

~A Concerned New Mexico Parent

A New Book about VAMs “On Trial”

I recently heard about a new book written by Mark Paige — J.D. and Ph.D., assistant professor of public policy at the University of Massachusetts-Dartmouth, and a former school law attorney — and published by Rowman & Littlefield. The book is about, as per the secondary part of its title, “Understanding Value-Added Models [VAMs] in the Law of Teacher Evaluation.” For those of you who might be interested in reading more, see more on this book, including information about how to purchase it, here, and also via Amazon here.

Clearly, this book should prove very relevant given the ongoing court cases across the country (see a prior post on these cases here) regarding teachers and the systems being used to evaluate them, especially when those systems are heavily reliant upon VAM-based estimates for consequential decision-making purposes (e.g., teacher tenure, pay, and termination). While I have not yet read the book, I just ordered my copy the other day. I suggest you do the same, again, should you be interested in further or better understanding the federal and state law pertinent to these cases.

Notwithstanding, I also requested that the author of this book — Mark Paige — write a guest post so that you too could find out more. Here is what he wrote:

Many of us have been following VAMs in legal circles. Several courts have faced the issue of VAMs as they relate to employment law matters. These cases have tested a chief selling point (pardon [or underscore] the business reference) of VAMs: that they will effectuate, for example, teacher termination with greater ease because nobody besides the advanced statisticians and econometricians can argue with the numbers they derive. In other words, if a teacher’s VAM rating is bad, then the teacher must be bad. It is supposed to be as simple as that. How could a court deny that reality?

Of course, as we [should] already know, VAMs are anything but certain. Bluntly stated: VAMs are a statistical “hot mess.” The American Statistical Association, among many others, warned in no uncertain terms that VAMs cannot – and should not – be trusted to make significant employment decisions. Of course, that has not stopped many policymakers from a full-throated adoption of their use in employment and evaluation decisions. Talk about hubris.

Accordingly, I recently completed this book, which focuses squarely on the intersection of VAMs and the law. Its full title is “Building a Better Teacher: Understanding Value-Added Models in the Law of Teacher Evaluation” (Rowman & Littlefield, 2016). Again, I provide a direct link to the book along with its description here.

To offer a bit of a sneak preview, though, I draw many conclusions throughout the book, but one of two important take-aways is this: VAMs may actually complicate the effectuation of a teacher’s termination. Here’s one way: because VAMs are so statistically infirm, they invite plaintiff-side attorneys to attack any underlying negative decision based on these models. See, for example, Sheri Lederman’s recent New York State Supreme Court decision, here. [See also a related post in this blog here].

In other words, the evidence upon which districts or states rely to make significant decisions is untrustworthy (or arbitrary) and, therefore, so is any decision based, even if in part, on VAMs. Thus, VAMs may actually strengthen a teacher’s case. This, of course, is quite apart from the fact that VAM use results in firing good teachers based on poor information, thereby contributing to the teacher shortages and lower morale (among many other parades of horribles) being reported across the nation, now more likely than ever.

The second important take-away is this, especially given followers of this blog include many educators and administrators facing a barrage of criticisms that only “de-professionalize” them: Courts have, over time, consistently deferred to the professional judgment of administrators (and their assessment of effective teaching). The members of that august institution – the judiciary – actually believe that educators know best about teaching, and that years of accumulated experience and knowledge have actual and also court-relevant value. That may come as a startling revelation to those who consistently diminish the education profession, or those who at least feel like they and their efforts are consistently being diminished.

To be sure, the system of educator evaluation is not perfect. Our schools continue to struggle to offer equal and equitable educational opportunities to all students, especially those in the nation’s highest needs schools. But what this book ultimately concludes is that the continued use of VAMs will not, hu-hum, add any value to these efforts.

To reach author Mark Paige via email, please contact him at mpaige@umassd.edu. To reach him via Twitter: @mpaigelaw

New Mexico Is “At It Again”

“A Concerned New Mexico Parent” sent me yet another blog entry for you all to stay apprised of the ongoing “situation” in New Mexico and the continuous escapades of the New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here, but in this one (s)he writes what follows:

Well, the NMPED is at it again.

They just released the teacher evaluation results for the 2015-2016 school year. And, the report and media press releases are something.

Readers of this blog are familiar with my earlier documentation of the myriad varieties of scoring formulas used by New Mexico to evaluate its teachers. If I recall, I found something like 200 variations in scoring formulas [see his/her prior post on this here with an actual variation count at n=217].

However, a recent article published in the Albuquerque Journal indicates that, now according to the NMPED, “only three types of test scores are [being] used in the calculation: Partnership for Assessment of Readiness for College and Careers [PARCC], end-of-course exams, and the [state’s new] Istation literacy test.” [Recall from another article released last January that New Mexico’s Secretary of Education Hanna Skandera is also the head of the governing board for the PARCC test].

Further, the Albuquerque Journal article author reports that the “PED also altered the way it classifies teachers, dropping from 107 options to three. Previously, the system incorporated many combinations of criteria such as a teacher’s years in the classroom and the type of standardized test they administer.”

The new state-wide evaluation plan is also available in more detail here. I should also add that there has been no published notification of the radical changes in this plan. It was simply and quietly posted on NMPED’s public website.

Important to note, though, is that for Group B teachers (all levels), the many variations documented previously have all been replaced by end-of-course (EOC) exams. Also note that for Group A teachers (all levels) the percentage assigned to the PARCC test has been reduced from 50% to 35%. (Oh, how the mighty have fallen …). The remaining 15% of the Group A score is to be composed of EOC exam scores.

There are only two small problems with this NMPED simplification.

First, in many districts, no EOC exams were given to Group B teachers in the 2015-2016 school year, and none were given in the previous year either. Any EOC scores that might exist were from a solitary administration of EOC exams three years previously.

Second, for Group A teachers whose scores formerly relied solely on the PARCC test for 50% of their score, no EOC exams were ever given.

Thus, NMPED has replaced their policy of evaluating teachers on the basis of students they don’t teach to this new policy of evaluating teachers on the basis of tests they never administered!

Well done, NMPED (not…)

Luckily, NMPED still cannot make any consequential decisions based on these data, again, until NMPED proves to the court that the consequential decisions that they would still very much like to make (e.g., employment, advancement and licensure decisions) are backed by research evidence. I know, interesting concept…

Deep Pockets, Corporate Reform, and Teacher Education

A colleague whom I have never formally met, but with whom I’ve had some interesting email exchanges over the past few months — James D. Kirylo, Professor of Teaching and Learning in Louisiana — recently sent me an email I read and appreciated; hence, I asked him to turn it into a blog post. He responded with a guest post he has titled “Deep Pockets, Corporate Reform, and Teacher Education,” pasted below. Do give this a read, and a social media share, as this one is deserving of some legs.

Here is what he wrote:

Money is power. Money is influence. Money shapes direction. Notwithstanding its influential nature in the electoral process, one only needs to see how bags of dough from the mega-rich one-percenters—largely led by Bill Gates—have bought their way into K-12 education in an attempt to corporatize it (see, for example, here).

This corporatization works to defund public education, grossly blames teachers for all that ails society, is obsessed with testing, and aims to privatize.  And next on the corporatized docket: teacher education programs.

In a recent piece by Valerie Strauss, “Gates Foundation Puts Millions of Dollars into New Education Focus: Teacher Preparation,” she sketches how Gates is awarding $35 million to a three-year project called Teacher Preparation Transformation Centers funneled through five different projects, one of which is the Texas Tech based University-School Partnerships for the Renewal of Educator Preparation (U.S. Prep) National Center.

A framework that will guide this “renewal” of educator preparation comes from the National Institute for Excellence in Teaching (NIET), along with the peddling of its programs, The System for Teacher and Student Advancement (TAP) and Student and Best Practices Center (BPC). Yet again coming from another guy with oodles of money, leading the charge at NIET is Lowell Milken, who is its chairman and TAP’s founder (see, for example, here).

The state of Louisiana serves as an example on how NIET is already working overtime in chipping its way into K-12 education. One can spend hours at the Louisiana Department of Education (LDE) website and view the various links on how TAP is applying a full-court-press in hyping its brand (see, for example, here).  

And now that TAP has entered the K-12 door in Louisiana, the brand is now squiggling its way into teacher education preparation programs, namely through the Texas Tech based U.S. Prep National Center. This Gates Foundation backed project involves five teacher education programs in the country (Southern Methodist University, the University of Houston, Jackson State University, and the University of Memphis, plus one in Louisiana, Southeastern Louisiana University; see more information about this here).

Therefore, teacher educators must be “trained” to use TAP in order to “rightly” inculcate the prescription to teacher candidates.

TAP: Four Elements of Success

TAP principally plugs four Elements of Success: Multiple Career Paths (for educators as career, mentor, and master teachers); Ongoing Applied Professional Growth (through weekly cluster meetings, follow-up support in the classroom, and coaching); Instructionally Focused Accountability (through multiple classroom observations and evaluations utilizing a research-based instrument and rubric that identifies effective teaching practices); and Performance-Based Compensation (based on multiple measures of performance, including student achievement gains and teachers’ instructional practices).

And according to the TAP literature, the elements of success “…were developed based upon scientific research, as well as best practices from the fields of education, business, and management” (see, for example, here). Recall, perhaps, that No Child Left Behind (NCLB) was also based on “scientific-based” research. Enough said. It is also interesting to note their use of the words “business” and “management” when referring to educating our children. Regardless, “The ultimate goal of TAP is to raise student achievement” so students will presumably be better equipped to compete in the global society (see, for example, here). 

While each element is worthy of discussion, a brief comment is in order on the first element, Multiple Career Paths, and the fourth element, Performance-Based Compensation. Regarding the former, TAP has created a mini-hierarchy within already-hierarchical school systems (which most are) in identifying three potential sets of teachers, to reiterate from the above: a “career” teacher, a “mentor” teacher, and a “master” teacher. A “career” teacher as opposed to what? As opposed to a “temporary” teacher, a Teach For America (TFA) teacher, a substitute teacher? But, of course, according to TAP, as opposed to a “mentor” teacher and a “master” teacher.

This certainly raises the question: Why in the world would any parent want their child to be taught by a “career” teacher as opposed to a “mentor” teacher or, better yet, a “master” teacher? Wouldn’t we want “master” teachers in all our classrooms? To analogize, I would rather have a “master” doctor performing heart surgery on me than a “lowly” career doctor. Indeed, words, language, and concepts matter.

With respect to the latter, the notion of having an ultimate goal of raising student achievement is perhaps little more than a euphemism for raising test scores, cultivating a test-centric way of doing things.

Achievement and VAM

That is, instead of focusing on learning, opportunity, developmentally appropriate practices, and falling in love with learning, “achievement” is the goal of TAP. Make no mistake, this is far from an argument over semantics. And this “achievement,” linked to student growth and then to merit pay, relies heavily on a VAM-aligned rubric.

Yet there are multiple problems with VAMs, instruments that have been used in K-12 education since 2011. Among many other outstanding sources, one may simply want to check out this cleverly named blog, “VAMboozled,” here, or see what Diane Ravitch has said about VAMs (among other places, see, for example, here), not to mention the well-visited site produced by Mercedes Schneider here. Finally, see the 2015 position statement issued by the American Educational Research Association (AERA) regarding VAMs here, as well as a similar statement issued by the American Statistical Association (ASA) here.

Back to the Gates Foundation and the Texas Tech based (U.S. Prep) National Center, though. To restate, at the aforementioned university in Louisiana (though likely in the other four recruited institutions, as well), TAP will be the chief vehicle that drives this process, and teacher education programs will be used as the host to prop up the brand.

With presumably some very smart, well-educated, talented, and experienced professionals at the respective teacher education sites, how is it possible that they capitulated to being the samples in the petri dish that will only work to enculturate the continuation of corporate reform, which will predictably lead to what Hofstra University Professor Alan Singer calls the “McDonaldization of Teacher Education”?

Strauss puts the question this way, “How many times do educators need to attempt to reinvent the wheel just because someone with deep pockets wants to try when the money could almost certainly be more usefully spent somewhere else?” I ask this same question, in this case, here.

Another Take on New Mexico’s Ruling on the State’s Teacher Evaluation/VAM System

John Thompson, a historian and teacher, wrote a post just published in Diane Ravitch’s blog (here) in which he took a closer look at the New Mexico court decision of which I was a part and which I covered a few weeks ago (here). This is the case in which state District Judge David K. Thomson, who presided over the five-day teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data.

Historian/teacher John Thompson adds another, independent take on this ruling, again here, having also read through Judge Thomson’s entire ruling. Here’s what he wrote:

New Mexico District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. As Audrey Amrein-Beardsley explains, the state “can proceed with ‘developing’ and ‘improving’ its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.”

This is wonderful news. As the American Federation of Teachers observes, “Superintendents, principals, parents, students and the teachers have spoken out against a system that is so rife with errors that in some districts as many as 60 percent of evaluations were incorrect. It is telling that the judge characterizes New Mexico’s system as a ‘policy experiment’ and says that it seems to be a ‘Beta test where teachers bear the burden for its uneven and inconsistent application.’”

A close reading of the ruling makes it clear that this case is an even greater victory over the misuse of test-driven accountability than even the jubilant headlines suggest. It shows that Judge Thomson made the right ruling on the key issues for the right reasons, and he seems to be predicting that other judges will be following his legal logic. Litigation over value-added teacher evaluations is being conducted in 14 states, and the legal battleground is shifting to the place where corporate reformers are weakest. No longer are teachers being forced to prove that there is no rational basis for defending the constitutionality of value-added evaluations. Now, the battleground is shifting to the actual implementation of those evaluations and how they violate state laws.

Judge Thomson concludes that the state’s evaluation systems don’t “resemble at all the theory” they were based on. He agreed with the district superintendent who compared it to the Wizard of Oz, where “the guy is behind the curtain and pulling levers and it is loud.” Some may say that the Wizard’s behavior is “understandable,” but that is not the judge’s concern. The Court must determine whether the consequences are assessed by a system that is “objective and uniform.” Clearly, it has been impossible in New Mexico and elsewhere for reformers to meet the requirements they mandated, and that is the legal terrain where VAM proponents must now fight.

The judge thus concludes, “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate.”

The state of New Mexico counters by citing cases in Florida and Tennessee as precedents. But Judge Thomson writes that those cases ruled against early challenges based on equal protection or other constitutional issues, even as they also cited practical concerns about implementation. He writes of the Florida (Cook) case, “The language in the Cook case could be lifted from the Court findings in this case.” That state’s judge decided that “‘The unfairness of this system is not lost on this Court.’” Judge Thomson also argues, “The (Florida) Court in fact seemed to predict the type of legal challenge that could result …‘The individual plaintiffs have a separate remedy to challenge an evaluation on procedural due process grounds if an evaluation is actually used to deprive the teacher of an evaluation right.’”

The question in Florida and Tennessee had been whether there was “a conceivable rational basis” for proceeding with the teacher evaluation policy experiment. Below are some of the more irrational results of those evaluations. The facts in the New Mexico case may be somewhat more absurd than those in other places that have implemented VAMs but, given the inherent flaws in those evaluations, I doubt they are qualitatively worse. In fact, Audrey Amrein-Beardsley testified about a similar outcome in Houston, which was as awful as the New Mexico travesties and led to about one-fourth of the teachers subject to those evaluations being placed on “growth plans.”

As has become common across the nation, New Mexico teachers have been evaluated on students who aren’t in the teachers’ classrooms. They have been held accountable for test results from subjects that the teacher didn’t teach. Science teachers might be evaluated on a student taught in 2011, based on how that student scored in 2013.

The judge cited testimony regarding a case where 50% of the teachers rated Minimally Effective had missing data due to reassignment to a wrong group. One year, a district questioned the state’s data, and immediately it saw an unexplained 11% increase in effective teachers. The next year, also without explanation, the state’s original numbers on effectiveness were reduced by 6%.

One teacher taught 160 students but was evaluated on scores of 73 of them and was then placed on a plan for improvement. Because of the need to quantify the effectiveness of teachers in Group B and Group C, who aren’t subject to state End of Instruction tests, there are 63 different tests being used in one district to generate high-stakes data. And, when changing tests to the Common Core PARCC test, the state has to violate scientific protocol, and mix and match test score results in an indefensible manner. Perhaps just as bad, in 2014-15, 76% of teachers were still being evaluated on less than three years of data.

The Albuquerque situation seems exceptionally important because it serves 25% of the state’s students, and it is the type of high-poverty system where value-added evaluations are likely to be most unreliable and invalid. It had 1728 queries about data and 28% of its teachers ranked below the Effective level. The judge noted that if you teach a core subject, you are twice as likely as a French teacher to be judged Ineffective. But, that was not the most shocking statistic. In Albuquerque, Group A elementary teachers (where VAMs play a larger role) are five times more likely to be rated below Effective than their colleagues in Group B. In Roswell, Group B teachers are three times more likely to be rated below Effective than Group C teachers.

Curiously, VAM advocate Tom Kane testified, but he did so in a way that made it unclear whether he saw himself as a witness for the defense or the plaintiffs. When asked about Amrein-Beardsley’s criticism of using tests that weren’t designed for evaluating teachers, Kane countered that the Gates Foundation MET study used random samples and concluded that differing tests could be used in a way that was “useful in evaluating teachers” and valid predictors of student achievement. Kane also replied that he could estimate the state’s error rate “on average,” but he couldn’t estimate error rates for individual teachers. He did not address the judge’s real concern about whether New Mexico’s use of VAMs was uniform and objective.

I am not a lawyer but I have years of experience as a legal historian. Although I have long been disappointed that the legal profession did not condemn value-added evaluations as a violation of our democracy’s fundamental principles, I also knew that the first wave of lawsuits challenging VAMs would face an uphill battle. Using teachers as guinea pigs in a risky experiment, where non-educators imposed their untested opinions on public schools, was always bad policy. Along with their other sins, value-added evaluations would mean collective punishment of some teachers merely for teaching in schools and classes where it is harder to meet dubious test score growth targets. But, many officers of the court might decide that they did not have the grounds to overrule new teacher evaluation laws. They might have to hold their noses while ruling in favor of laws that make a mockery of our tenets of fairness in a constitutional democracy.

During the last few years, rather than force those who would destroy the hard-earned legal rights of teachers to meet the legal standard of “strict scrutiny,” those who would fire teachers without proving that their data was reliable and valid have mostly had to show that their policies were not irrational. Now that their policies are being implemented, reformers must defend the ways that their VAMs are actually being used. Corporate reformers and the Duncan administration were able to coerce almost all of the states into writing laws requiring quantitative components in teacher evaluations. Not surprisingly, it has often proven impossible to implement their schemes in a rational manner.

In theory, corporate reformers could have won if they required the high-stakes use of flawed metrics while maintaining the message discipline that they are famous for. School administrators could have been trained to say that they were merely enforcing the law when they assessed consequences based on metrics. Their job would have been to recite the standard soundbite when firing teachers – saying that their metrics may or may not reflect the actual performance of the teacher in question – but the law required that practice. Life’s not fair, they could have said, and whether or not the individual teacher was being unfairly sacrificed, the administrators who enforced the law were just following orders. It was the will of the lawmakers that the firing of the teachers with the lowest VAMs – regardless of whether the metric reflected actual effectiveness – would make schools more like corporations, so practitioners would have to accept it. But, this is one more case where reformers ignored the real world, did not play out the education policy and legal chess game, and did not anticipate that rulings such as Judge Thomson’s would soon be coming.

In the real world, VAM advocates had to claim that VAM results represented the actual effectiveness of teachers and that, somehow, their scheme would someday improve schools. This liberated teachers and administrators to fight back in the courts. Moreover, top-down reformers set out to impose the same basic system on every teacher, in every type of class and school, in our diverse nation. When this top-down micromanaging met reality, proponents of test-driven evaluations had to play so many statistical games, create so many made-up metrics, and improvise in so many bizarre ways, that the resulting mess would be legally indefensible.

And, that is why the cases in Florida and Tennessee might soon be seen as the end of the beginning of the nation’s repudiation of value-added evaluations. The New Mexico case, along with the renewal of the federal ESEA and the departure of Arne Duncan, is clearly the beginning of the end. Had VAM proponents objectively briefed attorneys on the strengths and weaknesses of their theories, they could have thought through the inevitable legal process. On the other hand, I doubt that Kane and his fellow economists knew enough about education to be able to anticipate the inevitable, unintended results of their theories on schools. In numerous conversations with VAM true believers, rarely have I met one who seemed to know enough about the nuts and bolts of schools to be able to brief legal advisors, much less anticipate the inevitable results that would eventually have to be defended in court.

Brookings’ Critique of AERA Statement on VAMs, and Henry Braun’s Rebuttal

Two weeks ago I published a post about the newly released “American Educational Research Association (AERA) Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.”

In this post I also included a summary of the AERA Council’s eight key, and very important, points about VAMs and VAM use. I also noted that I contributed to this piece in one of its earliest forms. More importantly, however, the person who managed the statement’s external review and also assisted the AERA Council in producing the final statement before it was officially released was Boston College’s Dr. Henry Braun, Boisi Professor of Education and Public Policy and Educational Research, Measurement, and Evaluation.

Just this last week, the Brookings Institution published a critique of the AERA statement for, in my opinion, no other apparent reason than just being critical. The critique was written by Brookings affiliate Michael Hansen and the University of Washington Bothell’s Dan Goldhaber, and is titled “Response to AERA statement on Value-Added Measures: Where are the Cautionary Statements on Alternative Measures?”

Accordingly, I invited Dr. Henry Braun to respond, and he graciously agreed:

In a recent posting, Michael Hansen and Dan Goldhaber complain that the AERA statement on the use of VAMs does not take a similarly critical stance with respect to “alternative measures”. True enough! The purpose of the statement is to provide a considered, research-based discussion of the issues related to the use of value-added scores for high-stakes evaluation. It culminates in a set of eight requirements to be met before such use should be made.

The AERA statement does not stake out an extreme position. First, it is grounded in the broad research literature on drawing causal inferences from observational data subject to strong selection (i.e., the pairings of teachers and students is highly non-random), as well as empirical studies of VAMs in different contexts. Second, the requirements are consistent with the AERA, American Psychological Association (APA), and National Council on Measurement in Education (NCME) Standards for Educational and Psychological Testing. Finally, its cautions are in line with those expressed in similar statements released by the Board on Testing and Assessment of the National Research Council and by the American Statistical Association.

Hansen and Goldhaber are certainly correct when they assert that, in devising an accountability system for educators, a comparative perspective is essential. One should consider the advantages and disadvantages of different indicators, which ones to employ, how to combine them and, most importantly, consider both the consequences for educators and the implications for the education system as a whole. Nothing in the AERA statement denies the importance of subjecting all potential indicators to scrutiny. Indeed, it states: “Justification should be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process.” Of course, guidelines for designing evaluation systems would constitute a challenge of a different order!

In this context, it must be recognized that rankings based on VAM scores and ratings based on observational protocols will necessarily have different psychometric and statistical properties. Moreover, they both require a “causal leap” to justify their use: VAM scores are derived directly from student test performance, but require a way of linking to the teacher of record. Observational ratings are based directly on a teacher’s classroom performance, but require a way of linking back to her students’ achievement or progress.

Thus, neither approach is intrinsically superior to the other. But the singular danger with VAM scores, being the outcome of a sophisticated statistical procedure, is that they are seen by many as providing a gold standard against which other indicators should be judged. Both the AERA and ASA statements offer a needed corrective, by pointing out the path that must be traversed before an indicator based on VAM scores approaches the status of a gold standard. Though the requirements listed in the AERA statement may be aspirational, they do offer signposts against which we can judge how far we have come along that path.

Henry Braun, Lynch School of Education, Boston College

Why Gene Glass is No Longer a Measurement Specialist

One of my mentors, Dr. Gene Glass (formerly at ASU and now at Boulder), wrote a letter earlier this week on his blog, titled “Why I Am No Longer a Measurement Specialist.” This is a must read for all of you following the current policy trends not only surrounding teacher-level accountability, but also high-stakes testing in general.

Gene, one of the most well-established and well-known measurement specialists in and outside of the field of education, world renowned for developing “meta-analysis,” writes:

I was introduced to psychometrics in 1959. I thought it was really neat. By 1960, I was programming a computer on a psychometrics research project funded by the Office of Naval Research. In 1962, I entered graduate school to study educational measurement under the top scholars in the field.

My mentors – both those I spoke with daily and those whose works I read – had served in WWII. Many did research on human factors — measuring aptitudes and talents and matching them to jobs. Assessments showed who were the best candidates to be pilots or navigators or marksmen. We were told that psychometrics had won the war; and of course, we believed it.

The next wars that psychometrics promised it could win were the wars on poverty and ignorance. The man who led the Army Air Corps effort in psychometrics started a private research center. (It exists today, and is a beneficiary of the millions of dollars spent on Common Core testing.) My dissertation won the 1966 prize in Psychometrics awarded by that man’s organization. And I was hired to fill the slot recently vacated by the world’s leading psychometrician at the University of Illinois. Psychometrics was flying high, and so was I.

Psychologists of the 1960s & 1970s were saying that just measuring talent wasn’t enough. Talents had to be matched with the demands of tasks to optimize performance. Measure a learning style, say, and match it to the way a child is taught. If Jimmy is a visual learner, then teach Jimmy in a visual way. Psychometrics promised to help build a better world. But twenty years later, the promises were still unfulfilled. Both talent and tasks were too complex to yield to this simple plan. Instead, psychometricians grew enthralled with mathematical niceties. Testing in schools became a ritual without any real purpose other than picking a few children for special attention.

Around 1980, I served for a time on the committee that made most of the important decisions about the National Assessment of Educational Progress. The project was under increasing pressure to “grade” the NAEP results: Pass/Fail; A/B/C/D/F; Advanced/Proficient/Basic. Our committee held firm: such grading was purely arbitrary, and worse, would only be used politically. The contract was eventually taken from our organization and given to another that promised it could give the nation a grade, free of politics. It couldn’t.

Measurement has changed along with the nation. In the last three decades, the public has largely withdrawn its commitment to public education. The reasons are multiple: those who pay for public schools have less money, and those served by the public schools look less and less like those paying taxes.

The degrading of public education has involved impugning its effectiveness, cutting its budget, and busting its unions. Educational measurement has been the perfect tool for accomplishing all three: cheap and scientific looking.

International tests have purported to prove that America’s schools are inefficient or run by lazy incompetents. Paper-and-pencil tests seemingly show that kids in private schools – funded by parents – are smarter than kids in public schools. We’ll get to the top, so the story goes, if we test a teacher’s students in September and June and fire that teacher if the gains aren’t great enough.

There has been resistance, of course. Teachers and many parents understand that children’s development is far too complex to capture with an hour or two taking a standardized test. So resistance has been met with legislated mandates. The test company lobbyists convince politicians that grading teachers and schools is as easy as grading cuts of meat. A huge publishing company from the UK has spent $8 million in the past decade lobbying Congress. Politicians believe that testing must be the cornerstone of any education policy.

The results of this cronyism between corporations and politicians have been chaotic. Parents see the stress placed on their children and report them sick on test day. Educators, under pressure they see as illegitimate, break the rules imposed on them by governments. Many teachers put their best judgment and best lessons aside and drill children on how to score high on multiple-choice tests. And too many of the best teachers exit the profession.

When measurement became the instrument of accountability, testing companies prospered and schools suffered. I have watched this happen for several years now. I have slowly withdrawn my intellectual commitment to the field of measurement. Recently I asked my dean to switch my affiliation from the measurement program to the policy program. I am no longer comfortable being associated with the discipline of educational measurement.

Gene V Glass
Arizona State University
National Education Policy Center
University of Colorado Boulder