My last post here laid out “The ‘Top Ten’ Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers…” really anywhere, but in that post specifically in the state of Nevada. That post pertained to what were then the ongoing legislative negotiations in Nevada, and to testimony that I submitted under that title.
Well, it looks like those in Nevada who, as detailed more fully in another post here, were “trying to eliminate — or at least reduce — the role [students’] standardized tests play[ed] in evaluations of teachers, saying educators [were] being unfairly judged on factors outside of their control,” lost their legislative fight.
As per their proposed AB320, the state would have eliminated large-scale standardized test results as a mandated teacher evaluation measure, but the state would have allowed local assessments to account for 20% of a teacher’s total evaluation.
On Friday, however, the Nevada Independent released an article about how the state, instead, passed a “compromised bill.” Accordingly, large-scale standardized test scores are still to be used to evaluate teachers, although they are now to count for 40% versus 50% of Nevada teachers’ overall evaluation scores. This is clearly a loss given the bill was passed as “something [so] much closer to the system already in place” (i.e., moving from 50% to 40%).
This is all unfortunate, also given this outcome seemed to come down to a vote that fell along party lines (i.e., in favor of the 40% “compromise”), and this was ultimately signed by Nevada’s Republican Governor Sandoval, who also had the authority to see AB320 through (i.e., not in its revised form).
Apparently, Nevada will continue to put up a good fight. Hopefully in the future, the state will also fall in line with what seems to be trending across other states (e.g., Connecticut, Texas), in which legislators are removing such misinformed, arbitrary, and commonsensical (i.e., without research evidence and support) mandates and requirements.
Last Thursday was a BIG day in terms of value-added models (VAMs). For those of you who missed it, US Magistrate Judge Smith ruled, in Houston Federation of Teachers (HFT) et al. v. Houston Independent School District (HISD), that the Houston teacher plaintiffs have legitimate claims that their EVAAS value-added estimates, as used (and abused) in HISD, violated their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). See post here: “A Big Victory in Court in Houston.” On the same day, “we” won another court case, Texas State Teachers Association v. Texas Education Agency, in which The Honorable Lora J. Livingston ruled that the state was to remove all student growth requirements from all state-level teacher evaluation systems. In other words, and in the name of increased local control, teachers throughout Texas will no longer be required to be evaluated using their students’ test scores. See prior post here: “Another Big Victory in Court in Texas.”
Also last Thursday (it was a BIG day, like I said), I testified, again, regarding a similar provision (hopefully) being passed in the state of Nevada. As per a prior post here, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” More specifically, as per AB320 the state would eliminate statewide, standardized test results as a mandated teacher evaluation measure but allow local assessments to account for 20% of a teacher’s total evaluation. AB320 is still in work session. It has the votes in committee and on the floor, thus far.
The National Council on Teacher Quality (NCTQ), unsurprisingly (see here and here), submitted testimony against AB320 that can be read here, and I submitted testimony (I think, quite effectively 😉 ) refuting their “research-based” testimony, and also making explicit what I termed “The ‘Top Ten’ Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers” here. I have also pasted my submission below, in case anybody wants to forward/share any of my main points with others, especially others in similar positions looking to impact state or local educational policies in similar ways.
May 4, 2017
Dear Assemblywoman Miller:
Re: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers
While I understand that the National Council on Teacher Quality (NCTQ) submitted a letter expressing its opposition to Assembly Bill (AB) 320, it should be officially noted that, counter to that which the NCTQ wrote into its “research-based” letter, the American Statistical Association (ASA), the American Educational Research Association (AERA), the National Academy of Education (NAE), and other large-scale, highly esteemed, professional educational and educational research/measurement associations disagree with the assertions the NCTQ put forth. Indeed, the NCTQ is not a nonpartisan research and policy organization as claimed, but one of only a small handful of partisan operations still in existence and still pushing forward what is increasingly becoming dismissed as America’s ideal teacher evaluation systems (e.g., announced today, Texas dropped its policy requirement that standardized test scores be used to evaluate teachers; Connecticut moved in the same policy direction last month).
Accordingly, these aforementioned and highly esteemed organizations have all released statements cautioning all against the use of students’ large-scale, state-level standardized tests to evaluate teachers, primarily for the following research-based reasons, which I have limited to ten for obvious purposes:
- The ASA evidenced that teacher effects account for only 1-14% of the variance in their students’ large-scale standardized test scores. This means that the other 86%-99% of the variance is due to factors outside of any teacher’s control (e.g., out-of-school and student-level variables). That teachers’ effects, as measured by large-scale standardized tests (and not including other teacher effects that cannot be measured using large-scale standardized tests), account for so little variance makes using them to evaluate teachers wholly irrational and unreasonable.
- Large-scale standardized tests have always been, and continue to be, developed to assess levels of student achievement, but not levels of growth in achievement over time, and definitely not growth in achievement that can be attributed back to a teacher (i.e., in terms of his/her effects). Put differently, these tests were never designed to estimate teachers’ effects; hence, using them in this regard is also psychometrically invalid and indefensible.
- Large-scale standardized tests, when used to evaluate teachers, often yield unreliable or inconsistent results. Teachers who should be (more or less) consistently effective are, accordingly, being classified in sometimes highly inconsistent ways year-to-year. As per the current research, a teacher evaluated using large-scale standardized test scores as effective one year has a 25% to 65% chance of being classified as ineffective the following year(s), and vice versa. This makes the probability of a teacher being identified as effective, as based on students’ large-scale test scores, no different than the flip of a coin (i.e., random).
- The estimates derived via teachers’ students’ large-scale standardized test scores are also invalid. Very limited evidence exists to support that teachers whose students yield high large-scale standardized test scores are also effective using at least one other correlated criterion (e.g., teacher observational scores, student satisfaction survey data), and vice versa. That these “multiple measures” don’t map onto each other, also given the error prevalent in all of the “multiple measures” being used, decreases the degree to which all measures, students’ test scores included, can yield valid inferences about teachers’ effects.
- Large-scale standardized tests are often biased when used to measure teachers’ purported effects over time. More specifically, teachers who teach inordinate proportions of English Language Learners (ELLs), special education students, students who receive free or reduced lunches, students retained in grade, and gifted students are often evaluated not as per their true effects but as per group effects that bias their estimates upwards or downwards given these mediating factors. The same thing holds true with teachers who teach English/language arts versus mathematics, in that mathematics teachers typically yield more positive test-based effects (which defies logic and commonsense).
- Related, large-scale standardized test estimates are fraught with measurement errors that negate their usefulness. These errors are caused by inordinate amounts of inaccurate and missing data that cannot be replaced or disregarded; student variables that cannot be statistically “controlled for;” current and prior teachers’ effects on the same tests that also prevent their use for making determinations about single teachers’ effects; and the like.
- Using large-scale standardized tests to evaluate teachers is unfair. Issues of fairness arise when these test-based indicators impact some teachers more than others, sometimes in consequential ways. Typically, as is true across the nation, only teachers of mathematics and English/language arts in certain grade levels (e.g., grades 3-8 and once in high school) can be measured or held accountable using students’ large-scale test scores. Across the nation, this leaves approximately 60-70% of teachers as test-based ineligible.
- Large-scale standardized test-based estimates are typically of very little formative or instructional value. Related, no research to date evidences that using tests for said purposes has improved teachers’ instruction or student achievement as a result. As per UCLA Professor Emeritus James Popham: The farther the test moves away from the classroom level (e.g., a test developed and used at the state level) the worse the test gets in terms of its instructional value and its potential to help promote change within teachers’ classrooms.
- Large-scale standardized test scores are being used inappropriately to make consequential decisions, although they do not have the reliability, validity, fairness, etc. to satisfy that for which they are increasingly being used, especially at the teacher-level. This is becoming increasingly recognized by US court systems as well (e.g., in New York and New Mexico).
- The unintended consequences of such test score use for teacher evaluation purposes are continuously going unrecognized (e.g., by states that pass such policies; these are consequences states should acknowledge in advance of adopting such policies), given research has evidenced, for example, that teachers are choosing not to teach certain types of students whom they deem as the most likely to hinder their potential positive effects. Principals are also stacking teachers’ classes to make sure certain teachers are more likely to demonstrate positive effects, or vice versa, to protect or penalize certain teachers, respectively. Teachers are leaving/refusing assignments to grades in which test-based estimates matter most, and some are leaving teaching altogether out of discontent or in professional protest.
Note that the two studies the NCTQ used to substantiate their “research-based” letter would not support the claims included. For example, their statement that “According to the best-available research, teacher evaluation systems that assign between 33 and 50 percent of the available weight to student growth ‘achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than measured by a single test’” is false. First, the actual “best-available” research comes from over 10 years of peer-reviewed publications on this topic, including over 500 peer-reviewed articles. Second, what the authors of the Measures of Effective Teaching (MET) Studies found was that the percentages to be assigned to student test scores were arbitrary at best, because their attempts to empirically determine such a percentage failed. This fact the authors also made explicit in their report; that is, they also noted that the percentages they suggested were not empirically supported.
This week in Nevada “Lawmakers Mull[ed] Dropping Student Test Scores from Teacher Evaluations,” as per a recent article in The Nevada Independent (see here). This would be quite a move from 2011 when the state (as backed by state Republicans, not backed by federal Race to the Top funds, and as inspired by Michelle Rhee) passed into policy a requirement that 50% of all Nevada teachers’ evaluations were to rely on said data. The current percentage rests at 20%, but it is to double next year to 40%.
Nevada is one of a still uncertain number of states looking to retract the weight and purported “value-added” of such measures. Note also that last week Connecticut dropped some of its test-based components of its teacher evaluation system (see here). All of this is occurring, of course, post the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states must no longer set up teacher-evaluation systems based in significant part on their students’ test scores.
Accordingly, Nevada’s “Democratic lawmakers are trying to eliminate, or at least reduce, the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” The Democratic Assembly Speaker, for example, said that “he’s always been troubled that teachers are rated on standardized test scores,” more specifically noting: “I don’t think any single teacher that I’ve talked to would shirk away from being held accountable…[b]ut if they’re going to be held accountable, they want to be held accountable for things that … reflect their actual work.” I’ve never met a teacher who would disagree with this statement.
Anyhow, this past Monday the state’s Assembly Education Committee heard public testimony on these matters and three bills “that would alter the criteria for how teachers’ effectiveness is measured.” These three bills are as follows:
- AB212 would prohibit the use of student test scores in evaluating teachers, while
- AB320 would eliminate statewide [standardized] test results as a measure but allow local assessments to account for 20 percent of the total evaluation.
- AB312 would ensure that teachers in overcrowded classrooms not be penalized for certain evaluation metrics deemed out of their control given the student-to-teacher ratio.
Many presented testimony in support of these bills over an extended period of time on Tuesday. I was also invited to speak, during which I “cautioned lawmakers against being ‘mesmerized’ by the promised objectivity of standardized tests. They have their own flaws, [I] argued, estimating that 90-95 percent of researchers who are looking at the effects of high-stakes testing agree that they’re not moving the dial [really whatsoever] on teacher performance.”
Lawmakers have until the end of tomorrow (i.e., Friday) to pass these bills out of committee. Otherwise, they will die.
Of course, I will keep you posted, but things are currently looking “very promising,” especially for AB320.
“A Concerned New Mexico Parent” sent me another blog entry for you all to review. In this post (s)he explains and illustrates another statistical shenanigan the New Mexico Public Education Department (NMPED) recently pulled to promote the state’s value-added approach to reform (see this parent’s prior posts here and here).
The New Mexico Public Education Department (NMPED) should be ashamed of themselves.
In their explanation of the state’s NMTEACH teacher evaluation system, cutely titled “NMTEACH 101,” they present a PowerPoint slide that is numbing in its deceptiveness.
The entire presentation is available on their public website here (click on “NMTEACH101” under the “Teachers” heading at the top of the website to view the 34-slide presentation in its entirety).
Of particular interest to us, though, is the “proof” NMPED illustrates on slide 11 about the value of their value-added model (VAM) as related to students’ college-readiness. The slide is shown here:
Apparently we, as an unassuming public, are to believe that NMPED has longitudinal data showing how a VAM score from grades 3 through 12 (cor)relates to the percent of New Mexico students attending college at age 20. [This is highly unlikely, now also knowing a bit about this state’s data].
But even if we assume that such an unlikely longitudinal data set exists, we should still be disconcerted by the absolutely minimal effect of “Normalized Teacher Value Added” illustrated on the x-axis. This variable is clearly normalized so that each value represents a standard deviation (SD), with a range from -1.5 SD to +1.5 SD, which represents a fairly significant range of values. In layman’s terms, this should cover the range from minimally effective to exemplary teachers.
So at first glance, the regression line (or slope) appears impressive. But after a second and more critical glance, we notice that the range of improvement is from roughly 36% to 37.8%, a decidedly less impressive result.
In other words, by choosing to present and distort both the x- and y-axes this way, NMPED manages to make a statistical mountain out of what is literally a statistical molehill of change!
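To put rough numbers on this point, here is a minimal sketch in Python. It uses only the approximate values read off the slide as described above (about 36% to 37.8% college attendance across -1.5 SD to +1.5 SD) and assumes the plotted line is straight. The function name and dictionary keys are illustrative only and are not anything NMPED uses.

```python
def slope_summary(x_min, x_max, y_min, y_max):
    """Summarize a straight line running from (x_min, y_min) to (x_max, y_max).

    x values are in standard deviations (SD) of teacher value-added;
    y values are percentages of students attending college.
    """
    x_span = x_max - x_min  # width of the x-axis range, in SD
    y_span = y_max - y_min  # total change in y, in percentage points
    return {
        "total_change_pct_points": y_span,
        "change_per_sd_pct_points": y_span / x_span,
        # Fraction of a full 0-100% axis that the change would occupy.
        "share_of_full_axis": y_span / 100.0,
    }


# Approximate values as read off the slide (assumed, not exact data).
summary = slope_summary(-1.5, 1.5, 36.0, 37.8)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, the whole three-SD span corresponds to about 1.8 percentage points, or roughly 0.6 points per SD. On an axis running from 0% to 100%, the line would rise across less than 2% of the plot's height.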
Shame on NMPED, again!
Within a recent post, I wrote about my recent “silence,” explaining that, apparently, post the federal government’s (January 1, 2016) passage of the Every Student Succeeds Act (ESSA), which no longer requires teachers to be evaluated by their students’ test scores using VAMs (see prior posts on this here and here), “crazy” VAM-related events have apparently subsided. While I noted in the post that this also did not mean that certain states and districts are not still drinking (and overdosing on) the VAM-based Kool-Aid, what I did not note is that the way I get many of the stories I cover on this blog is via Google Alerts. This is where I have noticed a significant decline in VAM-related stories. Clearly, however, the news outlets often covered via Google Alerts don’t include district-level stories, so to cover these we must continue to rely on our followers (i.e., teachers, administrators, parents, students, school board members, etc.) to keep the stories coming.
Coincidentally, Billy Townsend, who is running for a school board seat in Polk County, Florida (district size = 100K students), sent me one such story. As an edublogger himself, he actually sent me three blog posts (see post #1, post #2, and post #3 listed by order of relevance) capturing what is happening in his district, again, as situated under the state of Florida’s ongoing, VAM-based nonsense. I’ve summarized the situation below as based on his three posts.
In short, the state ordered the district to dismiss a good number of its teachers as per their VAM scores when this school year started. “[T]his has been Florida’s [educational reform] model for nearly 20 years [actually since 1979, so 35 years]: Choose. Test. Punish. Stigmatize. Segregate. Turnover.” Because the district already had a massive teacher shortage as well, however, these teachers were replaced with Kelly Services contracted substitute teachers. Thereafter, district leaders decided that this was not “a good thing,” and they decided that administrators and “coaches” would temporarily replace the substitute teachers to make the situation “better.” While, of course, the substitutes’ replacements did not have VAM scores themselves, they were nonetheless deemed fit to teach and clearly more fit to teach than the teachers who were terminated as based on their VAM scores.
According to one teacher who anonymously wrote about her terminated teacher colleagues, and one of the district’s “best” teachers: “She knew our kids well. She understood how to reach them, how to talk to them. Because she ‘looked like them’ and was from their neighborhood, she [also] had credibility with the students and parents. She was professional, always did what was best for students. She had coached several different sports teams over the past decade. Her VAM score just wasn’t good enough.”
Consequently, this has turned into a “chaotic reality for real kids and adults” throughout the county’s schools, and the district and state apparently realized this by “threaten[ing] all of [the district’s] teachers with some sort of ethics violation if they talk about what’s happening” throughout the district. While “[t]he repetition of stories that sound just like this from [the district’s] schools is numbing and heartbreaking at the same time,” the state, district, and school board, apparently, “has no interest” in such stories.
Put simply, and put well as this aligns with our philosophy here: “Let’s [all] consider what [all of this] really means: [Florida] legislators do not want to hear from you if you are communicating a real experience from your life at a school — whether you are a teacher, parent, or student. Your experience doesn’t matter. Only your test score.”
Isn’t that the unfortunate truth? Hence, and with reference to the introduction above, please do keep these relatively more invisible stories coming so that we can share them out with the nation and make such stories more visible and accessible. VAMs, again, are alive and well, just perhaps in more undisclosed ways, like within districts as is the case here.
In one of my most recent posts I wrote about how Virginia SGP, aka parent Brian Davison, won in court against the state of Virginia, requiring them to release teachers’ Student Growth Percentile (SGP) scores. Virginia SGP is a very vocal promoter of the use of SGPs to evaluate teachers’ value-added (although many do not consider the SGP model to be a value-added model (VAM); see general differences between VAMs and SGPs here). Regardless, he sued the state of Virginia to release teachers’ SGP scores so he could make them available to all via the Internet. He did this, more specifically, so parents and perhaps others throughout the state would be able to access and then potentially use the scores to make choices about who should and should not teach their kids. See other posts about this story here and here.
Those of us who are familiar with Virginia SGP and the research literature writ large know that, unfortunately, there’s much that Virginia SGP does not understand about the now loads of research surrounding VAMs as defined more broadly (see multiple research article links here). Likewise, Virginia SGP, as evidenced below, rides most of his research-based arguments on select sections of a small handful of research studies (e.g., those written by economists Raj Chetty and colleagues, and Thomas Kane as part of Kane’s Measures of Effective Teaching (MET) studies) that do not represent the general research on the topic. He simultaneously ignores/rejects the research studies that empirically challenge his research-based claims (e.g., that there is no bias in VAM-based estimates, and that because Chetty, Friedman, and Rockoff “proved this,” it must be true, despite the research studies that have presented evidence otherwise (see for example here, here, and here)).
Nonetheless, given that his winning this case in Virginia is still noteworthy, and followers of this blog should be aware of this particular case, I invited Virginia SGP to write a guest post so that he could tell his side of the story. As we have exchanged emails in the past, which I must add have become less abrasive/inflamed as time has passed, I recommend that readers read and also critically consume what is written below. Let’s hope that we might have some healthy and honest dialogue on this particular topic in the end.
From Virginia SGP:
I’d like to thank Dr. Amrein-Beardsley for giving me this forum.
My school district recently announced its teacher of the year. John Tuck teaches in a school with 70%+ FRL students compared to a district average of ~15% (don’t ask me why we can’t even those #’s out). He graduated from an ordinary school with a degree in liberal arts. He has only a bachelor’s degree and is not a National Board Certified Teacher (NBCT). He is in his ninth year of teaching, specializing in math and science for 5th graders. Despite the ordinary background, Tuck gets amazing student growth. He mentors, serves as principal in the summer, and leads the school’s leadership committees. In Dallas, TX, he could have risen to the top of the salary scale already, but in Loudoun County, VA, he only makes $55K compared to a top salary of $100K for Step 30 teachers. Tuck is not rewarded for his talent or efforts largely because Loudoun eschews all VAMs and merit-based promotion.
This is largely why I enlisted the assistance of Arizona State law school graduate Lin Edrington in seeking the Virginia Department of Education’s (VDOE) VAM (SGP) data via a Freedom of Information Act (FOIA) suit (see pertinent files here).
VAMs are not perfect. There are concerns about validity when switching from paper to computer tests. There are serious concerns about reliability when VAMs are computed with small sample sizes or are based on classes not taught by the rated teacher (as appeared to occur in New Mexico, Florida, and possibly New York). Improper uses of VAMs give reformers a bad name. This was not the case in Virginia. SGPs were only to be used when appropriate with 2+ years of data and 40+ scores recommended.
I am a big proponent of VAMs based on my reviews of the research. We have the Chetty/Friedman/Rockoff (CFR) studies, of course, including their recent paper showing virtually no bias (Table 6). The following briefing presented by Professor Friedman at our trial gives a good layman’s overview of their high level findings. When teachers are transferred to a completely new school but their VAMs remain consistent, that is very convincing to me. I understand some point to the cautionary statement of the ASA suggesting districts apply VAMs carefully and explicitly state their limitations. But the ASA definitely recommends VAMs for analyzing larger samples including schools or district policies, and CFR believe their statement failed to consider updated research.
To me, the MET studies provided some of the most convincing evidence. Not only are high VAMs on state standardized tests correlated to higher achievement on more open-ended short-answer and essay-based tests of critical thinking, but students of high-VAM teachers are more likely to enjoy class (Table 14). This points to VAMs measuring inspiration, classroom discipline, the ability to communicate concepts, subject matter knowledge and much more. If a teacher engages a disinterested student, their low scores will certainly rise along with their VAMs. CFR and others have shown this higher achievement carries over into future grades and success later in life. VAMs don’t just measure the ability to identify test distractors, but the ability of teachers to inspire.
So why exactly did the Richmond City Circuit Court force the release of Virginia’s SGPs? VDOE applied for and received a No Child Left Behind (NCLB) waiver like many other states. But in court testimony provided in December of 2014, VDOE acknowledged that districts were not complying with the waiver by not providing the SGP data to teachers or using SGPs in teacher evaluations despite “assurances” to the US Department of Education (USDOE). When we initially received a favorable verdict in January of 2015, instead of trying to comply with NCLB waiver requirements, my district of Loudoun County Public Schools (LCPS) laughed. LCPS refused to implement SGPs or even discuss them.
There was no dispute that the largest Virginia districts had committed fraud when I discussed these facts with the US Attorney’s office and lawyers from the USDOE in January of 2016, but the USDOE refused to support a False Claims Act suit. And while nearly every district stridently refused to use VAMs [i.e., SGPs], the Virginia Secretary of Education was falsely claiming in high profile op-eds that Virginia was using “progress and growth” in the evaluation of schools. Yet, VDOE never used the very measure (SGPs) that the ESEA [i.e., NCLB] waivers required to measure student growth. The irony is that if these districts had used SGPs for just 1% of their teachers’ evaluations after the December of 2014 hearing, their teachers’ SGPs would be confidential today. I could only find one county that utilized SGPs, and their teachers’ SGPs are exempt. Sometimes fraud doesn’t pay.
My overall goals are threefold:
- Hire more Science Technology Engineering and Mathematics (STEM) majors to get kids excited about STEM careers and effectively teach STEM concepts
- Use growth data to evaluate policies, administrators, and teachers. Share the insights from the best teachers and provide professional development to ineffective ones
- Publish private sector equivalent pay so young people know how much teachers really earn (pensions often add 15-18% to their salaries). We can then recruit more STEM teachers and better overall teaching candidates
What has this lawsuit and activism cost me? A lot. I ate $5K of the cost of the VDOE SGP suit even after the award[ing] of fees. One local school board member has banned me from commenting on his “public figure” Facebook page (which I see as a free speech violation), both because I questioned his denial of SGPs and because of some other conflicts of interest I saw, although these were only indirectly related to this particular case. The judge in the case even sanctioned me $7K just for daring to hold him accountable. And after criticizing LCPS for violating the Family Educational Rights and Privacy Act (FERPA) by coercing kids who fail Virginia’s Standards of Learning tests (SOLs) to retake them, I was banned from my kids’ school for being a “safety threat.”
Note that I am a former Naval submarine officer and have held Department of Defense (DOD) clearances for 20+ years. I attended a meeting this past Thursday with LCPS officials in which they [since] acknowledged I was no safety threat. I served in the military, and along with many I have fought for the right to free speech.
Accordingly, I am no shrinking violet. Despite having LCPS attorneys sanction perjury, the Republican Commonwealth Attorney refused to prosecute and then illegally censored me in public forums. So the CA will soon have to sign a consent order acknowledging violating my constitutional rights (he effectively admitted as much already). And a federal civil rights complaint against the schools for their retaliatory ban is being drafted as we speak. All of this resulted from my efforts to have public data released and hold LCPS officials accountable to state and federal laws. I have promised that the majority of any potential financial award will be used to fund other whistle blower cases, [against] both teachers and reformers. I have a clean background and administrators still targeted me. Imagine what they would do to someone who isn’t willing to bear these costs!
In the end, I encourage everyone to speak out based on your beliefs. Support your case with facts not anecdotes or hastily conceived opinions. And there are certainly efforts we can all support like those of Dr. Darling-Hammond. We can hold an honest debate, but please remember that schools don’t exist to employ teachers/principals. Schools exist to effectively educate students.
A few weeks ago in Education Week, Stephen Sawchuk and Emmanuel Felton wrote a post in its Teacher Beat blog about lawmakers, particularly in the southern states, who are beginning to reconsider, via legislation, the role of test scores and value-added measures in their states’ teacher evaluation systems. Perhaps the tides are turning.
I tweeted this one out, but I also pasted this (short) one below to make sure you all, especially those of you teaching and/or residing in states like Georgia, Oklahoma, Louisiana, Tennessee, and Virginia, did not miss it.
Southern Lawmakers Reconsidering Role of Test Scores in Teacher Evaluations
After years of fierce debates over effectiveness and fairness of the methodology, several southern lawmakers are looking to minimize the weight placed on so-called value-added measures, derived from how much students’ test scores changed, in teacher-evaluation systems.
In part because these states are home to some of the weakest teachers unions in the country, southern policymakers were able to push past arguments that the state tests were ill-suited for teacher-evaluation purposes and that the system would punish teachers for working in the toughest classrooms. States like Louisiana, Georgia, and Tennessee became some of the earliest and strongest adopters of the practice. But in the past few weeks, lawmakers from Baton Rouge, La., to Atlanta have introduced bills to limit the practice.
In February, the Georgia Senate unanimously passed a bill that would reduce the student-growth component from 50 percent of a teacher’s evaluation down to 30 percent. Earlier this week, nearly 30 individuals signed up to speak on behalf of the bill at a State House hearing.
Similarly, Louisiana House Bill 479 would reduce student-growth weight from 50 percent to 35 percent. Tennessee House Bill 1453 would reduce the weight of student-growth data through the 2018-2019 school year and would require the state Board of Education to produce a report evaluating the policy’s ongoing effectiveness. Lawmakers in Florida, Kentucky, and Oklahoma have introduced similar bills, according to the Southern Regional Education Board’s 2016 educator-effectiveness bill tracker.
By and large, states adopted these test-score-centric teacher-evaluation systems to attain waivers from No Child Left Behind’s requirement that all students be proficient by 2014. To get a waiver, states had to adopt systems that evaluated teachers “in significant part, based on student growth.” That has looked very different from state to state, ranging from 20 percent in Utah to 50 percent in states like Alaska, Tennessee, and Louisiana.
No Child Left Behind’s replacement, the Every Student Succeeds Act, doesn’t require states to have a teacher-evaluation system at all, but, as my colleague Stephen Sawchuk reported, the nation’s state superintendents say they remain committed to maintaining systems that regularly review teachers.
But, as Sawchuk reported, Steven Staples, Virginia’s state superintendent, signaled that his state may move away from its current system where student test scores make up 40 percent of a teacher’s evaluation:
“What we’ve found is that through our experience [with the NCLB waivers], we have had some unintended outcomes. The biggest one is that there’s an over-reliance on a single measure; too many of our divisions defaulted to the statewide standardized test … and their feedback was that because that was a focus [of the federal government], they felt they needed to emphasize that, ignoring some other factors. It also drove a real emphasis on a summative, final evaluation. And it resulted in our best teachers running away from our most challenged.”
Some state lawmakers appear to be absorbing a similar message.
Last November, I published a post about “Houston’s ‘Split’ Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation.” Thereafter, I engaged some of my former doctoral students to further explore some data from Houston Independent School District (HISD), and what we collectively found and wrote up was just published in the highly esteemed Teachers College Record journal (Amrein-Beardsley, Collins, Holloway-Libell, & Paufler, 2016). To view the full commentary, please click here.
In this commentary we discuss HISD’s highest-stakes use of its Education Value-Added Assessment System (EVAAS) data – the value-added system HISD pays for at an approximate rate of $500,000 per year. This district has used its EVAAS data for more consequential purposes (e.g., teacher merit pay and termination) than any other state or district in the nation; hence, HISD is well known for its “big use” of “big data” to reform and inform improved student learning and achievement throughout the district.
We note in this commentary, however, that as per the evidence, and more specifically the recent release of Texas’s large-scale standardized test scores, perhaps attaching such high-stakes consequences to teachers’ EVAAS output in Houston is not working as district leaders have, now for years, intended. See, for example, the recent test-based evidence comparing the state of Texas v. HISD, illustrated below.
“Perhaps the district’s EVAAS system is not as much of an “educational-improvement and performance-management model that engages all employees in creating a culture of excellence” as the district suggests (HISD, n.d.a). Perhaps, as well, we should “ponder the specific model used by HISD—the aforementioned EVAAS—and [EVAAS modelers’] perpetual claims that this model helps teachers become more “proactive [while] making sound instructional choices;” helps teachers use “resources more strategically to ensure that every student has the chance to succeed;” or “provides valuable diagnostic information about [teachers’ instructional] practices” so as to ultimately improve student learning and achievement (SAS Institute Inc., n.d.).
The bottom line, though, is that “Even the simplest evidence presented above should at the very least make us question this particular value-added system, as paid for, supported, and applied in Houston for some of the biggest and baddest teacher-level consequences in town.” See, again, the full text and another, similar graph in the commentary, linked here.
Amrein-Beardsley, A., Collins, C., Holloway-Libell, J., & Paufler, N. A. (2016). Everything is bigger (and badder) in Texas: Houston’s teacher value-added system. [Commentary]. Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=18983
Houston Independent School District (HISD). (n.d.a). ASPIRE: Accelerating Student Progress Increasing Results & Expectations: Welcome to the ASPIRE Portal. Retrieved from http://portal.battelleforkids.org/Aspire/home.html
SAS Institute Inc. (n.d.). SAS® EVAAS® for K–12: Assess and predict student performance with precision and reliability. Retrieved from www.sas.com/govedu/edu/k12/evaas/index.html
A retired Massachusetts principal named Linda Murdock published a post on her blog, titled “Murdock’s EduCorner,” about her experiences, as a principal, with “value-added,” or more specifically in her state the use of Student Growth Percentile (SGP) scores to estimate said “value-added.” It’s certainly worth reading because, as I continue to find, what we keep finding in the research on value-added models (VAMs) is also being realized by practitioners in the schools required to use value-added output such as this. In this case, for example, while Murdock does not discuss the technical terms we use in the research (e.g., reliability, validity, and bias), she discusses these in pragmatic, real terms (e.g., year-to-year fluctuations, lack of relationship of SGP scores and other indicators of teacher effectiveness, and the extent to which certain sets of students can hinder teachers’ demonstrated growth or value-added, respectively). Hence, do give her post a read here, and also pasted in full below. Do also pay special attention to the bulleted sections in which she discusses these and other issues on a case-by-case basis.
At the end of the last school year, I was chatting with two excellent teachers, and our conversation turned to the new state-mandated teacher evaluation system and its use of student “growth scores” (“Student Growth Percentiles” or “SGPs” in Massachusetts) to measure a teacher’s “impact on student learning.”
“Guess we didn’t have much of an impact this year,” said one teacher.
The other teacher added, “It makes you feel about this high,” showing a tiny space between her thumb and forefinger.
Throughout the school, comments were similar — indicating that a major “impact” of the new evaluation system is demoralizing and discouraging teachers. (How do I know, by the way, that these two teachers are excellent? I know because I worked with them as their principal – being in their classrooms, observing and offering feedback, talking to parents and students, and reviewing products demonstrating their students’ learning – all valuable ways of assessing a teacher’s “impact”.)
According to the Massachusetts Department of Elementary and Secondary Education (“DESE”), the new evaluation system’s goals include promoting the “growth and development of leaders and teachers,” and recognizing “excellence in teaching and leading.” The DESE website indicates that the DESE considers a teacher’s median SGP as an appropriate measure of that teacher’s “impact on student learning”:
“ESE has confidence that SGPs are a high quality measure of student growth. While the precision of a median SGP decreases with fewer students, median SGP based on 8-19 students still provides quality information that can be included in making a determination of an educator’s impact on students.”
Given the many concerns about the use of “value-added measurement” tools (such as SGPs) in teacher evaluation, this confidence is difficult to understand, particularly as applied to real teachers in real schools. Considerable research notes the imprecision and variability of these measures as applied to the evaluation of individual teachers. On the other side, experts argue that use of an “imperfect measure” is better than past evaluation methods. Theories aside, I believe that the actual impact of this “measure” on real people in real schools is important.
As a principal, when I first heard of SGPs I was curious. I wondered whether the data would actually filter out other factors affecting student performance, such as learning disabilities, English language proficiency, or behavioral challenges, and I wondered if the data would give me additional information useful in evaluating teachers.
Unfortunately, I found that SGPs did not provide useful information about student growth or learning, and median SGPs were inconsistent and not correlated with teaching skill, at least for the teachers with whom I was working. In two consecutive years of SGP data from our Massachusetts elementary school:
- One 4th grade teacher had median SGPs of 37 (ELA) and 36 (math) in one year, and 61.5 and 79 the next year. The first year’s class included students with disabilities and the next year’s did not.
- Two 4th grade teachers who co-teach their combined classes (teaching together, all students, all subjects) had widely differing median SGPs: one teacher had SGPs of 44 (ELA) and 42 (math) in the first year and 40 and 62.5 in the second, while the other teacher had SGPs of 61 and 50 in the first year and 41 and 45 in the second.
- A 5th grade teacher had median SGPs of 72.5 and 64 for two math classes in the first year, and 48.5, 26, and 57 for three math classes in the following year. The second year’s classes included students with disabilities and English language learners, but the first year’s did not.
- Another 5th grade teacher had median SGPs of 45 and 43 for two ELA classes in the first year, and 72 and 64 in the second year. The first year’s classes included students with disabilities and students with behavioral challenges while the second year’s classes did not.
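The year-to-year swings in the bullets above can be made concrete with a little arithmetic. What follows is only a minimal sketch in Python, not anything Murdock or the DESE used: it subtracts each first-year median SGP from the second-year one for the teachers whose medians are reported per subject. The labels (`4th grade teacher`, `co-teacher A`, `co-teacher B`) and the helper name `year_to_year_change` are made up for illustration; the numbers are the ones reported above.

```python
# Median SGPs as reported in the bullets above, stored as (year 1, year 2).
# The teacher labels are shorthand invented for this sketch only.
reported = {
    ("4th grade teacher", "ELA"): (37, 61.5),
    ("4th grade teacher", "math"): (36, 79),
    ("co-teacher A", "ELA"): (44, 40),
    ("co-teacher A", "math"): (42, 62.5),
    ("co-teacher B", "ELA"): (61, 41),
    ("co-teacher B", "math"): (50, 45),
}


def year_to_year_change(pairs):
    """Return second-year minus first-year median SGP for each entry."""
    return {key: year2 - year1 for key, (year1, year2) in pairs.items()}


changes = year_to_year_change(reported)

# Print one line per teacher and subject, with an explicit sign.
for (teacher, subject), delta in changes.items():
    print(f"{teacher:18s} {subject:5s} {delta:+.1f}")
```

On these reported figures, the changes run from a 20-point drop to a 43-point gain, and the two co-teachers, who taught the same students together, move in opposite directions in math.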
As an experienced observer/evaluator, I found that median SGPs did not correlate with teachers’ teaching skills but varied with class composition. Stronger teachers had the same range of SGPs in their classes as teachers with weaker skills, and median SGPs for a new teacher with a less challenging class were higher than median SGPs for a highly skilled veteran teacher with a class that included English language learners.
Furthermore, SGP data did not provide useful information regarding student growth. In analyzing students’ SGPs, I noticed obvious general patterns: students with disabilities had lower SGPs than students without disabilities, English language learners had lower SGPs than students fluent in English, students who had some kind of trauma that year (e.g., parents’ divorce) had lower SGPs, and students with behavioral/social issues had lower SGPs. SGPs were correlated strongly with test performance: in one year, for example, the median ELA SGP for students in the “Advanced” category was 88, compared with 51.5 for “Proficient” students, 19.5 for “Needs Improvement,” and 5 for the “Warning” category.
There were also wide swings in student SGPs, not explainable except perhaps by differences in student performance on particular test days. One student with disabilities had an SGP of 1 in the first year and 71 in the next, while another student had SGPs of 4 in ELA and 94 in math in 4th grade and SGPs of 50 in ELA and 4 in math in 5th grade, both with consistent district test scores.
So how does this “information” impact real people in a real school? As a principal, I found that it added nothing to what I already knew about the teaching and learning in my school. Using these numbers for teacher evaluation does, however, negatively impact schools: it demoralizes and discourages teachers, and it has the potential to affect class and teacher assignments.
In real schools, student and teacher assignments are not random. Students are grouped for specific purposes, and teachers are assigned classes for particular reasons. Students with disabilities and English language learners are often grouped to allow specialists, such as the speech/language teacher or the ELL teacher, to work more effectively with them. Students with behavioral issues are sometimes placed in special classes, and are often assigned to teachers who work particularly well with them. Leveled classes (AP, honors, remedial) create different student combinations, and teachers are assigned particular classes based on the administrator’s judgment of which teachers will do the best with which classes. For example, I would assign new or struggling teachers less challenging classes so I could work successfully with them on improving their skills.
In the past, when I told a teacher that he/she had a particularly challenging class, because he/she could best work with these students, he/she generally cheerfully accepted the challenge, and felt complimented on his/her skills. Now, that teacher could be concerned about the effect of that class on his/her evaluation. Teachers may be reluctant to teach lower level courses, or to work with English language learners or students with behavioral issues, and administrators may hesitate to assign the most challenging classes to the most skilled teachers.
In short, in my experience, the use of this type of “value-added” measurement provides no useful information and has a negative impact on real teachers and real administrators in real schools. If “data” is not only not useful, but actively harmful, to those who are supposedly benefitting from using it, what is the point? Why is this continuing?