New York Times Article on “Grading Teachers by the Test”

Last Tuesday, Eduardo Porter – writer of the Economic Scene column for The New York Times – wrote an excellent article, from an economics perspective, about what is happening amid our current obsession in educational policy with “Grading Teachers by the Test.” Do give the article a full read; it’s well worth it. Then consider some of what I see as Porter’s strongest points to ponder, below.

Porter writes about what economists often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the value (mis)placed on any measure, and the distortion (i.e., artificial inflation or deflation, depending on the desired direction of the measure) that often, if not always, comes about as a result.

Remember when the Federal Aviation Administration began holding airlines “accountable” for their on-time arrivals? Airlines met the “new and improved” FAA standards merely by padding their estimated times of arrival (ETAs) on the back end. Airlines did little if anything else of instrumental value. As cited by Porter, the same thing happens when U.S. hospitals “do whatever it takes to keep patients alive at least 31 days after an operation, to beat Medicare’s 30-day survival yardstick.” This is Goodhart’s Law at play.

In education we more commonly refer to Goodhart’s Law’s interdisciplinary cousin, Campbell’s Law, which states that “the more any quantitative social indicator (or even some qualitative indicator) is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” In his 1976 paper, Campbell wrote that “achievement tests may well be valuable indicators of general school achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways” (click here for more information). This is the main point of this New York Times piece. See also a classic article on test score pollution here and a fantastic book all about this here.

My colleague Jesse Rothstein – Associate Professor of Economics at the University of California, Berkeley – is cited as saying, “We don’t know how big a deal this is…[but]…It is one of [our] main concerns.” I would add that while we cannot quantify how often this actually occurs, nor will we ever likely be able to do so, we can state with 100% certainty that this is, indeed, a very big deal. Porter agrees, citing multiple research studies in support (including one of mine here and another by a former PhD student of mine here).

See also a research study I published with a set of colleagues in 2010, titled “Cheating in the first, second, and third degree: Educators’ responses to high-stakes testing,” in which we got as close as anyone of whom I am aware to understanding not only the frequencies but also the variety of ways in which educators distort, sometimes consciously and sometimes subconsciously, their students’ test scores. They also often engage in such practices believing that their test-preparation and other test-related practices are in their students’ best interests. Note, however, that this study took place before even more consequences were attached to students’ test scores, given the federal government’s more recent fascination with holding teachers accountable for their quantifiable-by-student-test-scores “value-added.”

It is a fascinating phenomenon, really, albeit a highly unfortunate one given the consequences that those external to America’s public education system (e.g., educational policymakers) continue to tie to teachers’ test-based measures, now more than ever before.

As underscored by Porter, “American education has embarked upon a nationwide experiment in incentive design. Prodded by the Education Department, most states have set up evaluation systems for teachers built on the gains of their students on standardized tests, alongside more traditional criteria like evaluations from principals.” All of this has occurred despite the research evidence, some of which is over 20 years old, and despite the profession’s calls (see, for example, here) to stop what, thanks to Goodhart’s and Campbell’s Laws, can often be understood as noise and nonsense.

Help Florida Teacher Luke Flint “Tell His Story” about His VAM Scores

This is a great (although unfortunate) YouTube video capturing Indian River County, Florida teacher Luke Flint’s “Story” about the VAM scores he just received from the state as based on the state’s value-added formula.

This is a must-watch, and a must-share, as his “Story” has the potential to “add value” in the best of ways – that is, in terms of further informing debates about how these VAMs actually “work” in practice.

Same Model (EVAAS), Different State (Ohio), Same Problems (Bias and Data Errors)

Following up on my most recent post about “School-Level Bias in the PVAAS Model in Pennsylvania”: in Ohio – a state that also uses “the best” and “most sophisticated” VAM (i.e., a version of the Education Value-Added Assessment System [EVAAS]; for more information click here) – the same problem appears, as per an older (2013) article just sent to me following my prior post “Teachers’ ‘Value-Added’ Ratings and [their] Relationship to Student Income Levels [being] Questioned.”

The key finding? Ohio’s “2011-12 value-added results show that districts, schools and teachers with large numbers of poor students tend to have lower value-added results than those that serve more-affluent ones.” Such output continues to evidence that VAMs may not be “the great equalizer” after all. VAMs might not be the “true” measures they are assumed (and marketed/sold) to be.

Here are the state’s stats, as highlighted in this piece and taken directly from the Plain Dealer/StateImpact Ohio analysis:

  • Value-added scores were 2½ times higher on average for districts where the median family income is above $35,000 than for districts with income below that amount.
  • For low-poverty school districts, two-thirds had positive value-added scores — scores indicating students made more than a year’s worth of progress.
  • For high-poverty school districts, two-thirds had negative value-added scores — scores indicating that students made less than a year’s progress.
  • Almost 40 percent of low-poverty schools scored “Above” the state’s value-added target, compared with 20 percent of high-poverty schools.
  • At the same time, 25 percent of high-poverty schools scored “Below” state value-added targets while low-poverty schools were half as likely to score “Below.”

One issue likely causing this level of bias is that student background factors are not directly accounted for in the EVAAS model. Rather, the EVAAS uses students’ prior test scores to control for these factors. “By including all of a student’s testing history, each student serves as his or her own control,” as per the analytics company – SAS – that now sells and runs the model’s calculations. As per a document available on the SAS website: “To the extent that SES/DEM [socioeconomic/demographic] influences persist over time, these influences are already represented in the student’s data. This negates the need for SES/DEM adjustment.”

Within the same document, SAS authors also write that adjusting value-added for the race or poverty level of a student adds the dangerous assumption that those students will not perform well, that it would create separate standards for measuring students who have the same history of scores on previous tests, and that adjusting for these issues can also hide whether students in poor areas are receiving worse instruction. “We recommend that these adjustments not be made,” the brief reads. “Not only are they largely unnecessary, but they may be harmful.”
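For readers who like to see the mechanics of this dispute, here is a minimal sketch – simulated, hypothetical data only, not the actual EVAAS model – of the statistical point at issue: if socioeconomic status (SES) influences students’ growth and not just their starting level, then controlling only for prior test scores leaves a “value-added” residual that still tracks SES.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
ses = rng.normal(0, 1, n)                      # hypothetical SES index
prior = 0.8 * ses + rng.normal(0, 1, n)        # prior score reflects SES

# The contested assumption: SES affects growth, not just the prior level
current = 0.7 * prior + 0.3 * ses + rng.normal(0, 1, n)

# EVAAS-style control: regress current scores on prior scores only
X = np.column_stack([np.ones(n), prior])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
residual = current - X @ beta                  # the "value-added" residual

# If SES influences growth, the residual still correlates with SES
r = np.corrcoef(residual, ses)[0, 1]
print(f"correlation of value-added residual with SES: {r:.2f}")
```

If SES truly affected only the starting level (set the `0.3 * ses` term to zero), the residual would be uncorrelated with SES and SAS’s “each student serves as his or her own control” logic would hold; the disagreement is over whether that assumption matches reality.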

See also another article, also from Ohio, about how a value-added “Glitch Cause[d the] State to Pull Back Teacher ‘Value Added’ Student Growth Scores.” From the intro: “An error by contractor SAS Institute Inc. forced the state to withdraw some key teacher performance measurements that it had posted online for teachers to review. The state’s decision to take down the value added scores for teachers across the state on Wednesday has some educators questioning the future reliability of the scores and other state data…” (click here to read more).

School-Level Bias in the PVAAS Model in Pennsylvania

Research for Action (click here for more about the organization and its mission) just released an analysis of the state of Pennsylvania’s school building rating system – the 100-point School Performance Profile (SPP) used throughout the state to categorize and rank public schools (including charters), although a school can exceed 100 points by earning bonus points. The SPP is largely (40%) based on school-level value-added model (VAM) output derived via the state’s Pennsylvania Education Value-Added Assessment System (PVAAS) – a version of the Education Value-Added Assessment System (EVAAS), the VAM that is “expressly designed to control for out-of-school [biasing] factors.” This model should sound familiar to followers, as it is the model with which I am most familiar and on which I conduct most of my research, and it is widely considered one of “the most popular” and one of “the best” VAMs in town (see recent posts about this model here, here, and here).

Research for Action‘s report, titled “Pennsylvania’s School Performance Profile: Not the Sum of its Parts,” details some serious issues with bias in the state’s SPP. That is, bias (a highly controversial issue covered in the research literature and also on this blog; see recent posts about bias here, here, and here) does also appear to exist in this state, particularly at the school level, (1) for subject areas less traditionally tested and, hence, not tested in consecutive grade levels, and (2) because the state combines growth measures with proficiency (i.e., “snapshot”) measures to evaluate schools, the latter being significantly negatively correlated with the poverty levels of the student populations in the schools being evaluated.

First, “growth scores do not appear to be working as intended for writing and science, the subjects that are not tested year-to-year.” The figure below “displays the correlation between poverty, PVAAS [growth scores], and proficiency scores in these subjects. The lines for both percent proficient and growth follow similar patterns. The correlation between poverty and proficiency rates is stronger, but PVAAS scores are also significantly and negatively correlated with poverty.”

Figure 1

Whereas we have covered before on this blog how VAMs, and more importantly this VAM (i.e., the PVAAS, TVAAS, EVAAS, etc.), can also be biased by the subject areas teachers teach (see prior posts here and here), this is the first piece of evidence of which I am aware illustrating that this VAM (and likely other VAMs) might also be biased for subject areas that are not tested in consecutive grades and that therefore lack pretest scores – cases in which VAM statisticians substitute pretest scores from other subject areas (e.g., from mathematics and English/language arts, as scores from these tests are likely correlated with the missing pretest scores) instead. This, of course, has implications for schools (as illustrated in these analyses) but also likely for teachers (as logically generalized from these findings).

Second, researchers found that because five of the six SPP indicators, accounting for 90% of a school’s base score (including an extra-credit portion), rely entirely on test scores, “this reliance on test scores, despite the partial use of growth measures, results in a school rating system that favors more advantaged schools.” The finding here is that using growth or value-added measures in conjunction with proficiency (i.e., “snapshot”) indicators also biases total output against schools serving more disadvantaged students. This is what this looks like, noting in the figures below that bias exists for both the proficiency and growth measures, although bias in the latter is smaller yet still evident.

Figure 2

“The black line in each graph shows the correlation between poverty and the percent of students scoring proficient or above in math and reading; the red line displays the relationship between poverty and growth in PVAAS. As displayed…the relationship between poverty measures and performance is much stronger for the proficiency indicator.”

“This analysis shows a very strong negative correlation between SPP and poverty in both years. In other words, as the percent of a school’s economically disadvantaged population increases, SPP scores decrease.” That is, school performance declines sharply as schools’ aggregate levels of student poverty increase, if and when states use combinations of growth and proficiency to evaluate schools. This is not surprising, but a good reminder of how much poverty is (co)related with achievement captured by test scores when growth is not considered.
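As a toy illustration of this second finding – all numbers below are hypothetical and simulated, not Pennsylvania’s actual data – one can simulate schools in which proficiency is strongly tied to poverty but growth only weakly so, then combine the two into an SPP-style composite; the composite inherits much of the proficiency measure’s bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500                                    # hypothetical schools
poverty = rng.uniform(0, 1, n)             # share of students in poverty

# Hypothetical scores: proficiency strongly tied to poverty,
# growth only weakly tied (the pattern the report describes)
proficiency = 80 - 40 * poverty + rng.normal(0, 8, n)
growth = -2 * poverty + rng.normal(0, 3, n)

def zscore(x):
    return (x - x.mean()) / x.std()

# SPP-style composite that leans mostly on test-based indicators
composite = 0.6 * zscore(proficiency) + 0.4 * zscore(growth)

r_prof = np.corrcoef(poverty, proficiency)[0, 1]
r_growth = np.corrcoef(poverty, growth)[0, 1]
r_comp = np.corrcoef(poverty, composite)[0, 1]
print(f"proficiency vs poverty: {r_prof:.2f}")
print(f"growth vs poverty:      {r_growth:.2f}")
print(f"composite vs poverty:   {r_comp:.2f}")
```

Under these (assumed) weights, the composite’s correlation with poverty lands between the two components but much closer to the proficiency measure’s, which is the mechanism by which mixing “snapshot” indicators into a growth-based rating disadvantages high-poverty schools.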

Please read the full 16-page report as it includes much more important information about both of these key findings than described herein. Otherwise, well done Research for Action for “adding value,” for the lack of a better term, to the current literature surrounding VAM-based bias.

New Mexico UnEnchanted

After a great meeting with students, parents, teachers, school board members, state leaders, and the like in New Mexico, I thought I’d follow up with some reflections from a state grown unenchanted with its attempts at education reform (as based on high-stakes testing and on using value-added models [VAMs] to hold teachers accountable for students’ test performance). I thought I would also follow up with a public statement of thanks to my hosts from New Mexico State University’s Borderlands Writing Project, as well as to all of those with whom I met. To watch the video of my keynote presentation, click here, here, and here (i.e., Parts 1, 2, and 3, respectively).

As mentioned in my last post, New Mexico is now certainly one state to watch — not only because of the students who are opting out of taking the new “and improved” common core tests across the state, but also because of the many others who are organizing in protest, and in protection of the teachers as well as the students whose learning opportunities have been hijacked by high-stakes testing across the state. This is all occurring thanks to the “leadership” of Hanna Skandera — former Florida Deputy Commissioner of Education under former Governor Jeb Bush, and now head of the New Mexico Public Education Department.

Unlike states I have researched in the past, this is one state I can honestly say has gone high-stakes silly. State leadership is currently administering common core tests on computers, with technical quirks now causing students to test for even more days than expected. At the same time, the state is requiring students to take end-of-course exams (EoCs) in almost all subject areas (e.g., x 10), including physical education, music, art, social studies, government, and the like. Meanwhile, many teachers are also giving final exams, as these are the exams teachers value and “trust” the most because they give teachers the best and most useful feedback about what they taught and how their students did. All of this, including test preparation, is to take, in some cases, three months of time. No joke, and no wonder so many involved have literally had it.

In addition, unlike states I have researched in the past, this is also a state that requires teachers to sign a contractual document stating that they are not, then or ever, to “diminish the significance or importance of the tests” (see, for example, slide 7 here). Yes – this means that teachers are not to speak negatively about these tests in their classrooms or in public; if they do, they could be found in violation of the contracts they themselves signed, or had to sign, to get or hold their teaching positions. Say something disparaging and somebody with authority finds out? They could be fired.

Last Friday night at my main presentation, in fact, a few teachers actually came up to me whispering their impressions in my ear for fear of being “found out.” If that’s not an indicator of something gone awry in a state, I don’t know what is. Rumor also has it that Hanna Skandera has requested the names and license numbers of any teachers who have helped or encouraged students to protest the state’s “new” PARCC test(s). As per one teacher, “this is a quelling of free speech and professional communication.”

In terms of the state’s value-added model (VAM), it is very difficult to find out information about the actual model being used to hold teachers accountable across the state. I met one teacher who teaches an extreme set of students, and who by others’ accounts is one of the best teachers in her district, yet she was labeled “highly ineffective”; hence, even without direct evidence, I’d wager the VAM in use in this state is probably not much different from others. Otherwise, I cannot comment further given all that is not publicly available (or easily found) about the model — which is problematic in and of itself, given this information should be made public and transparent, as should information about how the model is actually functioning.

What I did find very interesting, however, was an article about a set of “five Los Alamos physicists, statisticians and math experts” who responded to some of New Mexico’s VAM critics who charged that “it would take a rocket scientist to figure out the complex formula” being used to measure teachers’ value-added. The article highlights how these Los Alamos scientists — Los Alamos, New Mexico is known for the proportion of intellectual scientists who reside in the area — could not figure the model out either, even after one year of trying. These scientists did find that the state’s VAM (1) is not a “true measure of a school’s worth;” (2) “varies” at levels unexpected of a model that should otherwise yield consistent results; (3) seems to be “biased” by class size, household income, and focused programming that occurs outside of teachers’ effects; (4) is so complicated that it could not serve as a “good communication tool;” and (5) would not likely “pass peer review in the scientific community.”

Finally, as for the opt-out movement, which was also a source of discussion throughout my visit given New Mexico students’ activism in this regard, a related post that readers might also find of interest was recently featured on Gene Glass’s “Education in Two Worlds” blog. His words should add fodder for the students, parents, teachers, and school board members in New Mexico who support the movement, to keep it going. See Dr. Glass’s full post below.

Can the “Opt Out” Movement Succeed?
There’s a movement growing across the nation. It’s called “Opt Out,” and it means refusal to subject oneself or one’s children to the rampant standardized testing that has gripped public schools. The tentacles of the accountability testing movement have reached into every quarter of America’s public schools. And the audience for this information is composed of politicians attempting to bust unions, taxpayers hoping to replace high-salary teachers with low-salary teachers, and Realtors dodging red-lining laws while steering clients to the “best schools.” Those urging parents and students to refuse to be tested cite the illegitimacy of these motives and the increasing amount of time for learning that is being given over to assessing learning.

At present, the Opt Out movement is small — a few thousand students in Colorado, several hundred in New Mexico, and smatterings of ad hoc parent groups in the East. Some might view these small numbers as no threat to the accountability assessment industry. But the threat is more serious than it appears.

Politicians and others want to rank schools and school districts according to their test score averages. Or they want to compare teachers according to their test score gains (Value Added Measurement) and pressure the low scorers or worse. It only takes a modest amount of Opting Out to thwart these uses of the test data. If 10% of the parents at the school say “No” to the standardized test, how do the statisticians adjust or correct for those missing data? Which 10% opted out? The highest scorers? The lowest? A scattering of high and low scorers? And would any statistical sleight of hand to correct for “missing data” stand up in court against a teacher who was fired or a school that was taken over by the state for a “turn around”? I don’t think so.

Gene V Glass
Arizona State University
National Education Policy Center
University of Colorado Boulder
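Glass’s missing-data point is easy to demonstrate. In the toy simulation below (entirely hypothetical numbers, not any real school’s data), roughly 10% of students at one school opt out, but not at random; the tested-only average then drifts away from the true school average, and nothing in the remaining data reveals by how much.

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.normal(500, 50, 1000)         # one hypothetical school

full_mean = scores.mean()                  # what the state thinks it measures

# ~10% opt out overall, but concentrated among higher scorers
p_optout = np.where(scores > np.percentile(scores, 75), 0.30, 0.033)
tested = scores[rng.random(scores.size) > p_optout]

print(f"opted out:   {scores.size - tested.size} of {scores.size}")
print(f"full mean:   {full_mean:.1f}")
print(f"tested mean: {tested.mean():.1f}")
```

Flip the assumption so that low scorers opt out instead and the bias flips sign; since the state cannot observe which pattern occurred, any “correction” rests on an untestable guess, which is exactly the courtroom vulnerability Glass describes.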

NY Congressman’s Letter to Duncan on Behalf of NY Teacher Lederman

In the fall I posted a piece on a “‘Highly Regarded’ Teacher Suing New York over [Her] ‘Ineffective’ VAM Scores.” This post was about Sheri Lederman, a 17-year veteran and 4th-grade teacher, recognized by her district superintendent as having a “flawless” teaching record and as being “highly regarded as an educator.” Yet she received an “ineffective” rating this last year.

She is now suing the state over the VAM scores that placed her in the “ineffective” category – the worst possible category in the state. She was assigned a 1 out of 20, with 20 being the highest score in terms of “added value.”

Her case is on its way to the courthouse in the state of New York this week, and all fingers and toes are crossed for a positive outcome, given her particular case could also help to set a precedent for the (many) others in (many) other states.

Two days ago and on her behalf, a New York Congressman mailed to Secretary Arne Duncan a letter (attached here) explaining Sheri’s story and asking Secretary Duncan to not only hear Sheri’s case but also answer a series of questions about the VAMs Duncan has incentivized throughout the country (primarily via Race to the Top funding attached to “growth” [aka value-added] models).

These questions are pasted below, although again, I encourage those who are interested to read the full letter (attached again here) as well; both provide a sense of what this congressman, like others, is now questioning.

  1. Does the U.S. Department of Education believe that VAMs accurately measure a teacher’s contribution to student growth?
  2. What form of guidance has the U.S. Department of Education provided to states that are using VAM to measure teacher effectiveness?
  3. Has the U.S. Department of Education issued any specific guidance to states on how states and Local Education Authorities should handle appeals of teacher evaluation scores generated by VAMs?
  4. Has the U.S. Department of Education identified covariates of interest that should be included in a model that attempts to isolate teacher contribution to student growth?
  5. Has the U.S. Department of Education identified covariates that should not be included in the model that attempts to isolate teacher contribution?
  6. What alternatives to VAM have been found to be suitable to measure teacher effectiveness?

I will keep you posted…

Unfortunate Updates from Tennessee’s New “Leadership”

Did I write in a prior post (from December, 2014) that “following the (in many ways celebrated) exit of Commissioner Huffman, it seems the state [of Tennessee] is taking an even more reasonable stance towards VAMs and their use(s) for teacher accountability?” I did, and I was apparently wrong…

Things in Tennessee — the state in which our beloved education-based VAMs were born (see here and here) and on which we have kept constant watch — have turned south, once again, thanks to new “leadership.” I use this term loosely to describe the author of the letter below, as featured in a recent article in The Tennessean. The letter’s author is the state’s new Commissioner of the Tennessee Department of Education — Candice McQueen — “a former Tennessee classroom teacher and higher education leader.”

Pasted below is what she wrote. You be the judge, and question her claims, particularly those I underlined for emphasis below.

This week legislators debated Governor Haslam’s plan to respond to educator feedback regarding Tennessee’s four-year-old teacher evaluation system. There’s been some confusion around the details of the plan, so I’m writing to clarify points and share the overall intent.

Every student has individual and unique needs, and each student walks through the school doors on the first day at a unique point in their education journey. It’s critical that we understand not only the grade the child earns at the end of the year, but also how much the child has grown over the course of the year.

Tennessee’s Value-Added Assessment System, or TVAAS, provides educators vital information about our students’ growth.

From TVAAS we learn if our low-achieving students are getting the support they need to catch up to their peers, and we also learn if our high-achieving students are being appropriately challenged so that they remain high-achieving. By analyzing this information, teachers can make adjustments to their instruction and get better at serving the unique needs of every child, every year.

We know that educators often have questions about how TVAAS is calculated and we are working hard to better support teachers’ and leaders’ understanding of this growth measure so that they can put the data to use to help their students learn (check out our website for some of these resources).

While we are working on improving resources, we have also heard concerns from teachers about being evaluated during a transition to a new state assessment. In response to this feedback from teachers, Governor Haslam proposed legislation to phase in the impact of the new TNReady test over the next three years.

Tennessee’s teacher evaluation system, passed as part of the First to the Top Act by the General Assembly in 2010 with support from the Tennessee Education Association, has always included multiple years of student growth data. Student growth data makes up 35 percent of a teacher’s evaluation and includes up to three years of data when available.

Considering prior and current year’s data together paints a more complete picture of a teacher’s impact on her students’ growth.

More data protects teachers from an oversized impact of a tough year (which teachers know happen from time to time). The governor’s plan maintains this more complete picture for teachers by continuing to look at both current and prior year’s growth in a teacher’s evaluation.

Furthermore, the governor’s proposal offers teachers more ownership and choice during this transition period.

If teachers are pleased with the growth score they earn from the new TNReady test, they can opt to have that score make up the entirety of their 35 percent growth measure. However, teachers may also choose to include growth scores they earned from prior TCAP years, reducing the impact of the new assessment.

The purpose of all of this is to support our students’ learning. Meeting the unique needs of each learner is harder than rocket science, and to achieve that we need feedback and information about how our students are learning, what is working well, and where we need to make adjustments. One of these critical pieces of feedback is information about our students’ growth.

My Presentation, Tomorrow (Friday Night) in Las Cruces, New Mexico

For just over one year now, I have been blogging about what is occurring in my neighbor state – New Mexico – under the leadership of Hanna Skandera (the acting head of the New Mexico Public Education Department and a Jeb Bush protégé; see a prior post here). This state is increasingly becoming another “state to watch” in terms of its teacher accountability policies as based on value-added models (VAMs), or what in New Mexico they term value-added scores (VASs).

Accordingly, the state is another in which a recent lawsuit is bringing to light some of the issues with these models, particularly in terms of data errors, missingness, and incompleteness, and in terms of the Common Core-aligned Partnership for Assessment of Readiness for College and Careers (PARCC) tests that are now being used for consequential decision-making purposes, right out of the gate. For more about this lawsuit, click here. For more information about the PARCC tests, and protests against them – a reported 950 students in Las Cruces walked out of classes 10 days ago in protest, and 800 students “opted out” with parental support – click here and here.

For those of you who live in New Mexico, I am heading there for a series of presentations and meetings tomorrow (Friday, March 13) with an assortment of folks, including school board members, parents, students, members of the media, and academics, to discuss what is happening in New Mexico. My visit is being sponsored by New Mexico State University’s Borderlands Writing Project.

Here is the information for my main talk, tomorrow (Friday, March 13) night from 7:00-9:00 p.m., on “Critical Perspectives on Assessment-Based Accountability,” in the Health and Social Services Auditorium on the NMSU campus in Las Cruces. Attached is the press release as well as the advertising flyer. After my presentation there will be breakout sessions, also open to the public: (1) for parents, on making decisions about the tests; (2) on taking the test, allowing people to try out a practice test and provide feedback; and (3) on next steps – a session for teachers, students, and people who want to help create positive change.

Hope to see you there!

New Mexico’s Teacher Evaluation Lawsuit: Four Teachers’ Individual Cases

Regarding a prior post about the recently filed “Lawsuit in New Mexico Challenging State’s Teacher Evaluation System” – filed by the American Federation of Teachers (AFT) and charging that the state’s current teacher evaluation system is unfair and error-ridden, harms teachers, and deprives students of high-quality educators (see the actual lawsuit here) – the author of an article recently released in The Washington Post takes “A closer look at four New Mexico teachers’ evaluations.”

Emma Brown writes that the state believes this system supports the “aggressive changes” needed “to produce real change for students” and that “these evaluations are an essential tool to support the teachers and students of New Mexico.” Teachers, on the other hand (and in general terms), believe that the new evaluations “are arbitrary and offer little guidance as to how to improve.”

Highlighted further in this piece, though, are four specific teachers’ evaluations taken from this state’s system, along with each teacher’s explanation of the problems as she or he sees them. The first, a veteran teacher with 36 years of “excellent evaluations,” scored ineffective for missing too much work, although she had been approved for and put on a six-month leave after a serious injury caused by a fall. She took four of the six months, but her “teacher attendance” score dropped her to the bottom of the teacher rankings. She has since retired.

The second, a 2nd-grade teacher and also a veteran with 27 years of experience, received 50% of her “teacher attendance” points, also given a family-related illness, and just 8 out of 50 “student achievement” points. She argues that her students, because most of them are well above average, had difficulties demonstrating growth. In other words, her argument rests on the real concern (and a very real concern in terms of the current research) that “ceiling effects” prevented her students from growing upwards enough when compared to other “similar” students who are also expected to demonstrate “a full year’s worth of growth.” She is also retiring in a few months, “in part because she is so frustrated with the evaluation system.”
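Her “ceiling effects” argument can be illustrated with a small simulation (hypothetical numbers, not her district’s data): when a test tops out at a maximum score, students who start near that maximum cannot demonstrate the growth they actually make, so their measured growth is artificially compressed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
ability = rng.normal(70, 15, n)                # hypothetical true achievement
true_gain = rng.normal(10, 3, n)               # everyone truly grows ~10 points

prior = np.clip(ability, 0, 100)               # both tests top out at 100
current = np.clip(ability + true_gain, 0, 100)
measured_gain = current - prior                # what the growth model sees

near_ceiling = prior > 85                      # students already near the top
print(f"measured gain, near ceiling: {measured_gain[near_ceiling].mean():.1f}")
print(f"measured gain, others:       {measured_gain[~near_ceiling].mean():.1f}")
```

Every simulated student grows by about the same amount, yet the students near the ceiling appear to grow far less; a teacher assigned a classroom of high achievers would look ineffective for reasons entirely outside her control.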

The third, a middle-school teacher, scored 23 out of 70 “value-added” points, even though he had switched from teaching language arts to teaching social studies at the middle-school level. This teacher apparently did not have the three years of data needed (not to mention in the same subject area) to calculate his “value-added” score, nor does he have “any idea” where his score came from or how it was calculated. Accordingly, his score “doesn’t give him any information about how to get better,” which speaks to the general issue that these scores offer teachers little guidance as to how to improve. This is an issue familiar across most, if not all, such models.

The fourth, an alternative high school mathematics and science teacher of pregnant and parenting teens, many of whom have learning or emotional disabilities, received 24 of 70 “student achievement” points, which she argues are based on tests that are “unvetted and unreliable,” especially given the types of students she teaches. As per her claim: “There are things I am being evaluated on which I do not and cannot control…Each year my school graduates 30 to 60 students, each of whom is either employed or enrolled in post-secondary training/education. This is the measure of our success, not test scores.”

This is certainly a state to watch, as the four New Mexico teachers highlighted in this article have unique and important cases, all of which may help set precedent in this state as well as in others. Do stay tuned…

University of Missouri’s Koedel on VAM Bias and “Leveling the Playing Field”

In a forthcoming study (to be published in the peer-reviewed journal Educational Policy), University of Missouri economists – Cory Koedel, Mark Ehlert, Eric Parsons, and Michael Podgursky – present evidence that poverty should be included as a factor when evaluating teachers using VAMs. This study, also covered in a recent article in Education Week and in a news brief published by the University of Missouri, highlights how doing so “could lead to a more ‘effective and equitable’ teacher-evaluation system.”

Koedel, as cited in the University of Missouri brief, argues that using what he and his colleagues term a “proportional” system would “level the playing field” for teachers working with students from different income backgrounds. They evaluated three types of growth models/VAMs to support their conclusions; however, the details of how they did this will not be available until the actual article is published.

While Koedel and I tend to disagree about VAMs as potential tools for teacher evaluation – he believes that such statistical and methodological tweaks can bring us to a level of VAM precision that I believe is likely impossible – the important takeaway from this piece is that VAMs are (more) biased when such background- and resource-sensitive factors are excluded from VAM-based calculations.

Accordingly, while Koedel writes elsewhere (in a related Policy Brief) that using what they term a “proportional” evaluation system would “entirely mitigate this concern,” we have evidence elsewhere that even with the most sophisticated and comprehensive controls available, VAM-based bias can never be “entirely” mitigated (see, for example, here and here). This is likely because, while Koedel, his colleagues, and others argue (often with solid evidence) that controlling for “observed student and school characteristics” helps to mitigate bias, there remain unobserved student and school characteristics that cannot be observed, quantified, and hence controlled for or factored out, and this will likely forever prevent bias’s “entire mitigation.”
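To make this omitted-variable point concrete, here is a minimal, purely illustrative simulation. Everything in it – the numbers, the “home support” factor, the toy definition of “value-added” as a mean regression residual – is my own assumption for illustration, not anything taken from Koedel et al.’s study or from any state’s actual model. It shows two equally effective teachers: a regression that controls for everything observed (prior score, poverty) still produces a “value-added” gap between them, because their classes differ on a characteristic the model never sees.

```python
# Toy illustration of bias from UNOBSERVED characteristics in a
# value-added-style estimate. All quantities are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # students per teacher (large, so noise is negligible)

def make_class(unobserved_mean):
    """One teacher's class; the teacher's true effect is zero."""
    poverty = rng.binomial(1, 0.5, n).astype(float)  # observed covariate
    support = rng.normal(unobserved_mean, 1.0, n)    # UNOBSERVED covariate
    prior = rng.normal(0.0, 1.0, n)                  # observed prior score
    score = prior - 0.5 * poverty + 0.5 * support + rng.normal(0.0, 1.0, n)
    return poverty, prior, score

# Teacher A's students have more unobserved home support than Teacher B's
# (nonrandom assignment), even though both teachers add exactly nothing.
pA, priorA, sA = make_class(unobserved_mean=+0.5)
pB, priorB, sB = make_class(unobserved_mean=-0.5)

# Pooled regression controls for every OBSERVED factor...
X = np.column_stack([
    np.ones(2 * n),
    np.concatenate([priorA, priorB]),
    np.concatenate([pA, pB]),
])
y = np.concatenate([sA, sB])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# ...yet the teacher-level "value-added" (mean residual) still separates
# the two teachers, because it absorbs the unobserved support gap.
va_A = resid[:n].mean()
va_B = resid[n:].mean()
print(round(va_A - va_B, 1))  # ≈ 0.5, despite identical true effects of 0
```

Adding more observed controls (a “proportional” poverty adjustment, school characteristics, and so on) shrinks this kind of gap only to the extent that the observed variables proxy for the unobserved one; whatever variation they do not capture remains in the teacher-level residual, which is the point being made above.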

Here, the question is whether we need sheer perfection. The answer is no, but when these models are applied in practice, and particularly for teachers who teach homogeneous groups of students in relatively more extreme situations (e.g., disproportionate numbers of students from high-needs, non-English-proficient backgrounds), bias still matters. While, as a whole, we might be less concerned about bias when such factors are included in VAMs, there will likely forever be cases where bias impacts individual teachers. This is where folks will have to rely on human judgment to interpret the “objective” numbers based on VAMs.