Tennessee’s Trout/Taylor Value-Added Lawsuit Dismissed

As you may recall, one of 15 important lawsuits pertaining to teacher value-added estimates across the nation (Florida n=2, Louisiana n=1, Nevada n=1, New Mexico n=4, New York n=3, Tennessee n=3, and Texas n=1 – see more information here) was situated in Knox County, Tennessee.

Filed in February of 2015, with legal support provided by the Tennessee Education Association (TEA), Knox County teachers Lisa Trout and Mark Taylor charged that they were denied monetary bonuses after their Tennessee Value-Added Assessment System (TVAAS — the original Education Value-Added Assessment System (EVAAS)) teacher-level value-added scores were miscalculated. The lawsuit also contested the reasonableness, rationality, and arbitrariness of the TVAAS system, as per its intended and actual uses in this case, but also in Tennessee writ large. On this case, Jesse Rothstein (University of California – Berkeley) and I served as the Plaintiffs’ expert witnesses.

Unfortunately, however, last week (February 17, 2016) the Plaintiffs’ team received a Court order, written by U.S. District Judge Harry S. Mattice Jr., dismissing their claims. While the Court had substantial questions about the reliability and validity of the TVAAS, it determined that the State satisfied the very low threshold of the “rational basis test” at legal issue. I should note here, however, that all of the evidence the Plaintiffs’ lawyers collected via their “extensive discovery,” including the affidavits both Jesse and I submitted on the Plaintiffs’ behalf, was unfortunately not considered in Judge Mattice’s ruling on the motion to dismiss. This, perhaps, makes sense given some of the assertions made by the Court, discussed below.

Ultimately, the Court found that the TVAAS-based, teacher-level value-added policy at issue was “rationally related to a legitimate government interest.” As per the Court order itself, Judge Mattice wrote that “While the court expresses no opinion as to whether the Tennessee Legislature has enacted sound public policy, it finds that the use of TVAAS as a means to measure teacher efficacy survives minimal constitutional scrutiny. If this policy proves to be unworkable in practice, plaintiffs are not to be vindicated by judicial intervention but rather by democratic process.”

Otherwise, as per an article in the Knoxville News Sentinel, Judge Mattice was “not unsympathetic to the teachers’ claims,” for example, given that the TVAAS measures “student growth — not teacher performance — using an algorithm that is not fail proof.” He nonetheless noted in the Court order that the “TVAAS algorithms have been validated for their accuracy in measuring a teacher’s effect on student growth,” even if minimal. He also wrote that the test scores used in the TVAAS (and other models) “need not be validated for measuring teacher effectiveness merely because they are used as an input in a validated statistical model that measures teacher effectiveness.” This is, unfortunately, untrue. Nonetheless, he continued to write that even though the rational basis test “might be a blunt tool, a rational policymaker could conclude that TVAAS is ‘capable of measuring some marginal impact that teachers can have on their own students…[and t]his is all the Constitution requires.’”

In the end, Judge Mattice concluded in the Court order that, overall, “It bears repeating that Plaintiff’s concerns about the statistical imprecision of TVAAS are not unfounded. In addressing Plaintiffs’ constitutional claims, however, the Court’s role is extremely limited. The judiciary is not empowered to second-guess the wisdom of the Tennessee legislature’s approach to solving the problems facing public education, but rather must determine whether the policy at issue is rationally related to a legitimate government interest.”

It is too early to know whether the Plaintiffs’ team will appeal, although Judge Mattice dismissed the federal constitutional claims within the lawsuit “with prejudice.” As per an article in the Knoxville News Sentinel, this means that “it cannot be resurrected with new facts or legal claims or in another court. His decision can be appealed, though, to the 6th Circuit U.S. Court of Appeals.”

Everything is Bigger (and Badder) in Texas: Houston’s Teacher Value-Added System

Last November, I published a post about “Houston’s ‘Split’ Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation.” Thereafter, I engaged some of my former doctoral students to further explore some data from the Houston Independent School District (HISD), and what we collectively found and wrote up was just published in the highly esteemed Teachers College Record journal (Amrein-Beardsley, Collins, Holloway-Libell, & Paufler, 2016). To view the full commentary, please click here.

In this commentary we discuss HISD’s highest-stakes use of its Education Value-Added Assessment System (EVAAS) data – the value-added system HISD pays for at an approximate rate of $500,000 per year. This district has used its EVAAS data for more consequential purposes (e.g., teacher merit pay and termination) than any other state or district in the nation; hence, HISD is well known for its “big use” of “big data” to reform and inform improved student learning and achievement throughout the district.

We note in this commentary, however, that as per the evidence, and more specifically the recent release of Texas’s large-scale standardized test scores, perhaps attaching such high-stakes consequences to teachers’ EVAAS output in Houston is not working as district leaders have, now for years, intended. See, for example, the recent test-based evidence comparing the state of Texas v. HISD, illustrated below.

Figure 1

“Perhaps the district’s EVAAS system is not as much of an ‘educational-improvement and performance-management model that engages all employees in creating a culture of excellence’ as the district suggests (HISD, n.d.a). Perhaps, as well, we should ponder the specific model used by HISD—the aforementioned EVAAS—and [EVAAS modelers’] perpetual claims that this model helps teachers become more ‘proactive [while] making sound instructional choices;’ helps teachers use ‘resources more strategically to ensure that every student has the chance to succeed;’ or ‘provides valuable diagnostic information about [teachers’ instructional] practices’ so as to ultimately improve student learning and achievement (SAS Institute Inc., n.d.).”

The bottom line, though, is that “[e]ven the simplest evidence presented above should at the very least make us question this particular value-added system, as paid for, supported, and applied in Houston for some of the biggest and baddest teacher-level consequences in town.” See, again, the full text and another, similar graph in the commentary, linked here.



Amrein-Beardsley, A., Collins, C., Holloway-Libell, J., & Paufler, N. A. (2016). Everything is bigger (and badder) in Texas: Houston’s teacher value-added system. [Commentary]. Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=18983

Houston Independent School District (HISD). (n.d.a). ASPIRE: Accelerating Student Progress Increasing Results & Expectations: Welcome to the ASPIRE Portal. Retrieved from http://portal.battelleforkids.org/Aspire/home.html

SAS Institute Inc. (n.d.). SAS® EVAAS® for K–12: Assess and predict student performance with precision and reliability. Retrieved from www.sas.com/govedu/edu/k12/evaas/index.html

Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS

Recall from a prior post that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking their district to federal court to fight for their rights as professionals, which, they allege, have been violated via their value-added scores, derived via the Education Value-Added Assessment System (EVAAS). The case, Houston Federation of Teachers, et al. v. Houston ISD, is to officially begin in court early this summer.

More specifically, the teachers are arguing that EVAAS output is inaccurate; that the EVAAS is unfair; that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach; that the EVAAS system fails to control for student-level factors that impact how well teachers perform but that are outside of teachers’ control (e.g., parental effects); that the EVAAS is incomprehensible and hence very difficult if not impossible to actually use to improve instruction (i.e., it is not actionable); and, accordingly, that teachers’ due process rights are being violated because teachers do not have adequate opportunities to change as a result of their EVAAS results.

The EVAAS is the one value-added model (VAM) on which I’ve conducted most of my research, including in this district (see, for example, here, here, here, and here); hence, Jesse Rothstein – Professor of Public Policy and Economics at the University of California – Berkeley, who also conducts extensive research on VAMs – and I are serving as the expert witnesses in this case.

What was recently released regarding this case is a summary of the contents of our affidavits, as interpreted by the authors of the attached “EVAAS Litigation Update,” in which the authors declare, with our and others’ research in support, that “Studies Declare EVAAS ‘Flawed, Invalid and Unreliable.’” Here are the twelve key highlights, again, as summarized by the authors of this report and re-summarized, by me, below:

  1. Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  2. The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, similarly sophisticated VAM using the same HISD data, he found that the two models “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness, from which valid interpretations and assertions can be made.
  3. EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  4. EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average.”
  5. Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  6. The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  7. Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects’ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mediated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved tests.
  8. There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  9. Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  10. HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from them and even though they are paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’ (AERA Statement, 2015).”
  11. EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  12. Related, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc. which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.” Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students within HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”
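Several of the statistical concerns above, particularly #3 (volatility) and #4 (overstated precision), can be illustrated with a toy simulation. To be clear, this is not the EVAAS model, whose code is proprietary (see #12); it is a minimal sketch assuming a simple setup in which each teacher has a fixed “true” effect observed each year with classroom-level noise, and all parameter values below are hypothetical.

```python
import random
import statistics

random.seed(42)

N_TEACHERS = 1000
TRUE_SD = 0.10   # hypothetical spread of "true" teacher effects (test SD units)
NOISE_SD = 0.20  # hypothetical classroom-level noise per year (sampling error, etc.)

# Each teacher has a fixed "true" effect; each year we observe it with noise.
true_effects = [random.gauss(0, TRUE_SD) for _ in range(N_TEACHERS)]
year1 = [t + random.gauss(0, NOISE_SD) for t in true_effects]
year2 = [t + random.gauss(0, NOISE_SD) for t in true_effects]

def corr(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Year-to-year correlation of the *estimates* falls far below 1.0 even
# though the underlying "true" effects never changed at all.
r = corr(year1, year2)
print("year-to-year correlation of estimates:", round(r, 2))

def quintile(score, scores):
    """Which fifth of the distribution a score lands in (0-4)."""
    return sum(s <= score for s in scores) * 5 // (len(scores) + 1)

# Share of teachers whose quintile ranking changes between years.
q1 = [quintile(s, year1) for s in year1]
q2 = [quintile(s, year2) for s in year2]
moved = sum(a != b for a, b in zip(q1, q2)) / N_TEACHERS
print("share of teachers changing quintile:", round(moved, 2))
```

Under these assumed (hypothetical) noise levels, most teachers change effectiveness quintiles from one year to the next purely by chance, which is the kind of instability the affidavits describe.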

Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation

States of attention on this blog, and often of (dis)honorable mention as per their state-level policies bent on value-added models (VAMs), include Florida, New York, Tennessee, and New Mexico. As for a quick update about the latter state of New Mexico, we are still waiting to hear the final decision from the judge who recently heard the state-level lawsuit still pending on this matter in New Mexico (see prior posts about this case here, here, here, here, and here).

Another locale of great interest, though, is the Houston Independent School District. This is the seventh largest urban school district in the nation, and the district that has tied more high-stakes consequences to its value-added output than any other district/state in the nation. These “initiatives” were “led” by soon-to-resign/retire Superintendent Terry Grier who, during his time in Houston (2009-2015), implemented some of the harshest consequences ever attached to teacher-level value-added output, as per the district’s use of the Education Value-Added Assessment System (EVAAS) (see other posts about the EVAAS here, here, and here; see other posts about Houston here, here, and here).

In fact, the EVAAS is still used throughout Houston today to evaluate all EVAAS-eligible teachers, and also to “reform” the district’s historically low-performing schools, by tying teachers’ purported value-added performance to teacher improvement plans, merit pay, nonrenewal, and termination (e.g., 221 Houston teachers were terminated “in large part” due to their EVAAS scores in 2011). However, given pending litigation (i.e., this is the district in which the American and Houston Federation of Teachers (AFT/HFT) are currently suing the district for its wrongful use of, and over-emphasis on, this particular VAM; see here), Superintendent Grier and the district have recoiled on some of the high-stakes consequences they formerly attached to the EVAAS. This particular lawsuit is to commence this spring/summer.

Nonetheless, my most recent post about Houston was about some of its future school board candidates, who were invited by The Houston Chronicle to respond to Superintendent Grier’s teacher evaluation system. For the most part, those who responded did so unfavorably, especially as the evaluation system was/is disproportionately reliant on teachers’ EVAAS data and high-stakes use of these data in particular (see here).

Most recently, however, as per a “split” decision registered by Houston’s current school board (i.e., 4:3, and without any new members elected last November), Superintendent Grier received a $98,600 bonus for his “satisfactory evaluation” as the school district’s superintendent. See more from the full article published in The Houston Chronicle. As per the same article, Superintendent “Grier’s base salary is $300,000, plus $19,200 for car and technology allowances. He also is paid for unused leave time.”

More importantly, take a look at the two figures below, taken from actual district reports (see references below), highlighting Houston’s performance (declining, on average, in blue) as compared to the state of Texas (maintaining, on average, in black), to determine for yourself whether Superintendent Grier, indeed, deserved such a bonus (not to mention salary).

Another question to ponder is whether the district’s use of the EVAAS value-added system, especially since Superintendent Grier’s arrival in 2009, is actually reforming the school district as he and other district leaders have for so long now intended.

Figure 1

Figure 1. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR tests, 2012-2015 (HISD, 2015a)

Figure 2

Figure 2. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR End-of-Course (EOC) tests, 2012-2015 (HISD, 2015b)


Houston Independent School District (HISD). (2015a). State of Texas Assessments of Academic Readiness (STAAR) performance, grades 3-8, spring 2015. Retrieved here.

Houston Independent School District (HISD). (2015b). State of Texas Assessments of Academic Readiness (STAAR) end-of-course results, spring 2015. Retrieved here.

Houston Board Candidates Respond to their Teacher Evaluation System

For a recent article in the Houston Chronicle, the newspaper sent 12 current candidates for the Houston Independent School District (HISD) School Board a series of questions about HISD, to which seven candidates responded. The seven candidates’ responses are of specific interest here in that HISD is the district well-known for attaching more high-stakes consequences to value-added output (e.g., teacher termination) than others (see for example here, here, and here). The seven candidates’ responses are of general interest in that the district uses the popular and (in)famous Education Value-Added Assessment System (EVAAS) for said purposes (see also here, here, and here). Accordingly, what these seven candidates have to say about the EVAAS and/or HISD’s teacher evaluation system might also be a sign of things to come, perhaps for the better, throughout HISD.

The questions are: (1) Do you support HISD’s current teacher evaluation system, which includes student test scores? Why or why not? What, if any, changes would you make? And (2) Do you support HISD’s current bonus system based on student test scores? Why or why not? What, if any, changes would you make? To see candidate names, their background information, their responses to other questions, etc. please read in full the article in the Houston Chronicle.

Here are the seven candidates’ responses to question #1:

  • I do not support the current teacher evaluation system. Teachers’ performance should not rely on the current formula using the evaluation system with the amount of weight placed on student test scores. Too many obstacles outside the classroom affect student learning today that are unfair in this system. Other means of support such as a community school model must be put in place to support the whole student, supporting student learning in the classroom (Fonseca).
  • No, I do not support the current teacher evaluation system, EVAAS, because it relies on an algorithm that no one understands. Testing should be diagnostic, not punitive. Teachers must have the freedom to teach basic math, reading, writing and science and not only teach to the test, which determines if they keep a job and/or get bonuses. Teachers should be evaluated on student growth. For example, did the third-grade teacher raise his/her non-reading third-grader to a higher level than that student read when he/she came into the teacher’s class? Did the teacher take time to figure out what non-educational obstacles the student had in order to address those needs so that the student began learning? Did the teacher coach the debate team and help the students become more well-rounded, and so on? Standardized tests in a vacuum indicate nothing (Jones).
  • I remember the time when teachers practically never revised test scores. Tests can be one of the best tools to help a child identify strengths and weakness. Students’ scores were filed, and no one ever checked them out from the archives. When student scores became part of their evaluation, teachers began to look into data more often. It is a magnificent tool for student and teacher growth. Having said that, I also believe that many variables that make a teacher great are not measured in his or her evaluation. There is nothing on character education for which teachers are greatly responsible. I do not know of a domain in the teacher’s evaluation that quite measures the art of teaching. Data is about the scientific part of teaching, but the art of teaching has to be evaluated by an expert at every school; we call them principals (Leal).
  • Student test scores were not designed to be used for this purpose. The use of students’ test scores to evaluate teachers has been discredited by researchers and statisticians. EVAAS and other value-added models are deeply flawed and should not be major components of a teacher evaluation system. The existing research indicates that 10-14 percent of students’ test scores are attributable to teacher factors. Therefore, I would support using student test scores (a measure of student achievement) as no more than 10-14 percent of teachers’ evaluations (McCoy).
  • No, I do not support the current teacher evaluation system, which includes student test scores, for the following reasons: 1) High-stakes decisions should not be made based on the basis of value-added scores alone. 2) The system is meant to assess and predict student performance with precision and reliability, but the data revealed that the EVAAS system is inconsistent and has consistent problems. 3) The EVAAS reports do not match the teachers’ “observation” PDAS scores [on the formal evaluation]; therefore, data is manipulated to show a relationship. 4) Most importantly, teachers cannot use the information generated as a formative tool because teachers receive the EVAAS reports in the summer or fall after the students leave their classroom. 5) Very few teachers realized that there was an HISD-sponsored professional development training linked to the EVAAS system to improve instruction. Changes that I will make are to make recommendations and confer with other board members to revamp the system or identify a more equitable system (McCullough).
  • The current teacher evaluation system should be reviewed and modified. While I believe we should test, it should only be a diagnostic measure of progress and indicator of deficiency for the purpose of aligned instruction. There should not be any high stakes attached for the student or the teacher. That opens the door for restricting teaching-to-test content and stifles the learning potential. If we have to have it, make it 5 percent. The classroom should be based on rich academic experiences, not memorization regurgitation (Skillern-Jones).
  • I support evaluating teachers on how well their students perform and grow, but I do not support high-stakes evaluation of teachers using a value-added test score that is based on the unreliable STAAR test. Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes decisions or compared across dissimilar student populations or schools. If we had a reliable test of student learning, I would support the use of value-added growth measures in a low-stakes fashion where measures of student growth are part of an integrated analysis of a teacher’s overall performance and practices. I strongly believe that teachers should be evaluated with an integrated set of measures that show what teachers do and what happens as a result. These measures may include meaningful evidence of student work and learning, pedagogy, classroom management, knowledge of content and even student surveys. Evaluators should be appropriately trained, and teachers should have regular evaluations with frequent feedback from strong mentors and professional development to strengthen their content knowledge and practice (Stipeche).

Here are the seven candidates’ responses to question #2:

  • I do not support the current bonus system based on student test scores as, again, teachers do not currently have support to affect what happens outside the classroom. Until we provide support, we cannot base teacher performance or bonuses on a heavy weight of test scores (Fonseca).
  • No, I do not support the current bonus system. Teachers who grow student achievement should receive bonuses, not just teachers whose students score well on tests. For example, a teacher who closes the educational achievement gap with a struggling student should earn a bonus before a teacher who has students who are not challenged and for whom learning is relatively easy. Teachers who grow their students in extracurricular activities should earn a bonus before a teacher that only focuses on education. Teachers that choose to teach in struggling schools should earn a bonus over a teacher that teaches in a school with non-struggling students. Teachers who work with their students in UIL participation, history fairs, debate, choir, student government and like activities should earn a bonus over a teacher who does not (Jones).
  • Extrinsic incentives killed creativity. I knew that from my counseling background, but in 2011 or 2010, Dr. Grier sent an email to school administrators with a link of a TED Talks video that contradicts any notion of giving monetary incentives to promote productivity in the classroom: http://www.ted.com/talks/dan_pink_on_motivation?language=en. Give incentives for perfect attendance or cooperation among teachers selected by teachers (Leal).
  • No. Student test scores were not designed to be used for this purpose. All teachers need salary increases (McCoy).
  • No, I do not support HISD’s current bonus system based on student test scores. Student test scores should be a diagnostic tool used to identify instructional gaps and improve student achievement. Not as a measure to reward teachers, because the process is flawed. I would work collaboratively to identify another system to reward teachers (McCullough).
  • The current bonus program does, in fact, reward teachers whose students make significant academic gains. It leaves out those teachers who have students at the top of the achievement scale. By formulaic measures, it is flawed and the system, according to its creators, is being misused and misapplied. It would be beneficial overall to consider measures to expand the teacher population of recipients as well as to undertake measures to simplify the process if we keep it. I think a better focus would be to see how we can increase overall teacher salaries in a meaningful and impactful way to incentivize performance and longevity (Skillern-Jones).
  • No. I do not support the use of EVAAS in this manner. More importantly, ASPIRE has not closed the achievement gap nor dramatically improved the academic performance of all students in the district (Stipeche).

No responses, or no responses of any general substance, were received from Daniels, Davila, McKinzie, Smith, and Williams.

“Value-Less” Value-Added Data

Peter Greene, a veteran English teacher in Pennsylvania, a state that uses its own version of the Education Value-Added Assessment System (EVAAS), wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread claims being made, and most lacking research support: that the data derived via value-added models (VAMs) can be used to inform and improve what teachers do in their classrooms.

Greene sententiously critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on– well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist– you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no — all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line– we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions along with instructional videos that seem to assume that the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) with explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”

“Efficiency” as a Constitutional Mandate for Texas’s Educational System

The Texas Constitution requires that the state “establish and make suitable provision for the support and maintenance of an efficient system of public free schools,” as the “general diffusion of knowledge [is]…essential to the preservation of the liberties and rights of the people.” Following this notion, The George W. Bush Institute’s Education Reform Initiative recently released the first set of reports in its Productivity for Results series: “A Legal Lever for Enhancing Productivity.” The report was authored by an affiliate of The New Teacher Project (TNTP) – the non-profit organization founded by the controversial former Chancellor of Washington DC’s public schools Michelle Rhee; an unknown and apparently unaffiliated “education researcher” named Krishanu Sengupta; and Sandy Kress, the “key architect of No Child Left Behind [under the presidential leadership of George W. Bush] who later became a lobbyist for Pearson, the testing company” (see, for example, here).

Authors of this paper review the economic and education research (although if you look through the references, a strong majority of the pieces come from economics research, which makes sense as this is an economically driven venture) to identify characteristics that typify efficient enterprises. More specifically, the authors use the principles of x-efficiency set out in the work of the highly respected Henry Levin, which require efficient organizations, in this case as (perhaps inappropriately) applied to schools, to have: 1) a clear objective function with measurable outcomes; 2) incentives linked to success on the objective function; 3) efficient access to useful information for decisions; 4) adaptability to meet changing conditions; and 5) use of the most productive technology consistent with cost constraints.

The authors also advance another series of premises, as related to this view of x-efficiency and its application to education/schools in Texas: (1) that “if Texas is committed to diffusing knowledge efficiently, as mandated by the state constitution, it should ensure that the system for putting effective teachers in classrooms and effective materials in the hands of teachers and students is characterized by the principles that undergird an efficient enterprise, such as those of x-efficiency;” (2) that this system must include value-added measurement systems (i.e., VAMs), deemed throughout this paper not only constitutional but also rational and supportive of x-efficiency; (3) that “rational policies for teacher training, certification, evaluation, compensation, and dismissal are key to an efficient education system;” (4) that “the extent to which teacher education programs prepare their teachers to achieve this goal should [also] be [an] important factor;” (5) that “teacher evaluation systems [should also] be properly linked to incentives…[because]…in x-efficient enterprises, incentives are linked to success in the objective function of the organization;” (6) that this is contradicted by current, less x-efficient teacher compensation systems that link incentives to time on the job, or tenure, rather than to “the success of the organization’s function;” and (7) that, in the end, “x-efficient organizations have efficient access to useful information for decisions, and by not linking teacher evaluations to student achievement, [education] systems [such as the one in Texas will] fail to provide the necessary information to improve or dismiss teachers.”

The two districts highlighted in this report as being most x-efficient in Texas include, to no surprise: “Houston [which] adds a value-added system to reward teachers, with student performance data counting for half of a teacher’s overall rating. HISD compares students’ academic growth year to year, under a commonly used system called EVAAS.” We’ve discussed not only this system but also its use in Houston often on this blog (see, for example, here, here, and here). Teachers in Houston who consistently perform poorly can be fired for “insufficient student academic growth as reflected by value added scores…In 2009, before EVAAS became a factor in terminations, 36 of 12,000 teachers were fired for performance reasons, or .3%, a number so low the Superintendent [Terry Grier] himself called the dismissal system into question. From 2004-2009, the district fired or did not renew 365 teachers, 140 for “performance reasons,” including poor discipline management, excessive absences, and a lack of student progress. In 2011, 221 teacher contracts were not renewed, multiple for “significant lack of student progress attributable to the educator,” as well as “insufficient student academic growth reflected by [SAS EVAAS] value-added scores….In the 2011-12 school year, 54% of the district’s low-performing teachers were dismissed.” That’s “progress,” right?!?

Anyhow, for those of you who have not heard, this same (controversial) Superintendent, who pushed this system throughout his district, is retiring (see, for example, here).

The other district of (dis)honorable mention was Dallas Independent School District, which also uses a VAM, called the Classroom Effectiveness Index (CEI), although I know less about this system, as I have never examined or researched it myself, nor have I read much of anything about it. But in 2012, the district’s Board “decided not to renew 259 contracts due to poor performance, five times more than the previous year.” The “progress” such x-efficiency brings…

What is still worrisome to the authors, though, is that “[w]hile some districts appear to be increasing their efforts to eliminate ineffective teachers, the percentage of teachers dismissed for any reason, let alone poor performance, remains well under one percent in the state’s largest districts.” Relatedly, and I preface this by noting that this next argument is one of the most over-cited and hyper-utilized by organizations backing “initiatives” or “reforms” such as these, this “falls well below the five to eight percent that Hanushek calculates would elevate achievement to internationally competitive levels”: “Calculations by Eric Hanushek of Stanford University show that removing the bottom five percent of teachers in the United States and replacing them with teachers of average effectiveness would raise student achievement in the U.S. 0.4 standard deviations, to the level of student achievement in Canada. Replacing the bottom eight percent would raise student achievement to the level of Finland, a top performing country on international assessments.” As Linda Darling-Hammond, also of Stanford, would argue, we cannot simply “fire our way to Finland.” Sorry, Eric! But this is based on econometric predictions, and no evidence exists whatsoever that this is in fact a valid inference. Nonetheless, it is cited over and over again by the same/similar folks (such as the authors of this piece) to justify their currently trendy educational reforms.
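For what it’s worth, the mechanics behind this kind of extrapolation are easy to reproduce, and they show just how much rides on assumed parameters. What follows is a minimal sketch, not Hanushek’s actual calculation: the assumption that teacher effects are normally distributed with a standard deviation of 0.2 student-level standard deviations is purely illustrative, as is the function name.

```python
from math import sqrt, pi, exp

def one_year_gain(sigma_teacher, z_cut):
    """Mean achievement gain (in student-level SDs) for one cohort-year
    from replacing the bottom slice of a normal teacher-effect
    distribution (everything below z_cut) with exactly-average teachers.

    By the truncated-normal mean, E[effect | effect < cut] = -sigma * phi(z_cut) / p,
    where p is the fraction removed, so the population-wide gain
    p * (0 - E[effect | effect < cut]) reduces to sigma * phi(z_cut).
    """
    phi = exp(-z_cut ** 2 / 2) / sqrt(2 * pi)  # standard normal density at the cut
    return sigma_teacher * phi

# Hypothetical parameters: teacher effects ~ N(0, 0.2 student SDs);
# cut the bottom 5% (z = -1.645) or the bottom 8% (z = -1.405).
print(round(one_year_gain(0.2, -1.645), 3))
print(round(one_year_gain(0.2, -1.405), 3))
```

Under these assumptions, the one-year gain is on the order of 0.02 to 0.03 student-level standard deviations; getting from there to the 0.4-SD, Canada-to-Finland figures requires strong further assumptions about persistence and accumulation across years and cohorts, which is precisely where the econometric prediction becomes contestable.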

The major point here, though, is that “if Texas wanted to remove (or improve) the bottom five to eight percent of its teachers, the current evaluation system would not be able to identify them;” hence, the state desperately needs a VAM-based system to do this. Again, no research to counter this or really any other claim is included in this piece; only primarily economics-based literature was selected in support.

In the end, though, the authors conclude that “While the Texas Constitution has established a clear objective function for the state school system and assessments are in place to measure the outcome, it does not appear that the Texas education system shares the other four characteristics of x-efficient enterprises as identified by Levin. Given the constitutional mandate for efficiency and the difficult economic climate, it may be a good time for the state to remedy this situation…[Likewise] the adversity and incentives may now be in place for Texas to focus on improving the x-efficiency of its school system.”

As I know and very much respect Henry Levin (see, for example, an interview I conducted with him a few years ago, with the shorter version here and the longer version here), I’d be curious to know what his response might be to the authors’ use of his x-efficiency framework to frame such neo-conservative (and again trendy) initiatives and reforms. Perhaps I will email him…

Special Issue of “Educational Researcher” (Paper #2 of 9): VAMs’ Measurement Errors, Issues with Retroactive Revisions, and (More) Problems with Using Test Scores

Recall from a prior post that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of these nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou – Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University – and Matthew Springer – Assistant Professor of Public Policy, also at Vanderbilt.

As written into the article’s abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

  • “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
  • “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78).
  • “[R]esources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error” (p. 78).
  • “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs) and despite the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed, or researched.
  • Related, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it’s rare; hence, it should not significantly impact teachers’ value-added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
  • In terms of teachers versus proctors or other teachers monitoring students when they take large-scale standardized tests (that are used across all states to calculate value-added estimates), researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant than I do, in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but for those schools that still allow this, this finding might also be relevant. “An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know” (p. 83).
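The class-size point in the second bullet above is easy to illustrate with a quick simulation. To be clear, this is a hedged sketch, not Ballou and Springer’s model: every teacher below is truly average (a true effect of exactly zero), and the unit-variance student noise and the −0.2 “ineffective” cutoff are illustrative assumptions, so any flag produced is pure measurement error.

```python
import numpy as np

rng = np.random.default_rng(2015)

def flag_rate(class_size, n_teachers=20_000, cutoff=-0.2):
    """Fraction of truly average teachers whose estimated value-added
    (the mean of their students' noisy scores) falls below the cutoff."""
    # Student scores = true teacher effect (0) + unit-variance noise.
    scores = rng.normal(0.0, 1.0, size=(n_teachers, class_size))
    estimates = scores.mean(axis=1)        # per-teacher value-added estimate
    return float((estimates < cutoff).mean())

rate_25 = flag_rate(25)    # teachers with 25 tested students
rate_100 = flag_rate(100)  # teachers with 100 tested students
print(rate_25, rate_100, rate_25 / rate_100)
```

With these numbers, the teacher with 25 tested students is several times more likely to be deemed ineffective than the equally (in)effective teacher with 100, simply because the standard error of a 25-student mean is twice that of a 100-student mean.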

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS), used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee, and in many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with the EVAAS practice whereby teachers’ EVAAS scores change retroactively, as more data become available, in pursuit of more “precision,” even though teachers’ scores can change one or two years after the initial score is registered (and used for whatever purposes).

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).
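For readers unfamiliar with the decision rule being critiqued above, the conventional t-test classification works roughly as follows. This is a generic sketch under my own assumptions, not SAS’s actual EVAAS implementation; the function name and the numbers are hypothetical.

```python
def classify_teacher(estimate, std_error, z_crit=1.96):
    """Classify a value-added estimate relative to 'average' (zero) using
    a conventional two-sided test at roughly 95% confidence: a teacher is
    flagged only when the estimate sits more than z_crit standard errors
    away from zero."""
    t = estimate / std_error
    if t >= z_crit:
        return "above average"
    if t <= -z_crit:
        return "below average"
    return "not detectably different from average"

# Same point estimate, different precision (e.g., fewer tested students):
# only the more precisely measured teacher is flagged at all.
print(classify_teacher(-0.25, 0.10))  # below average
print(classify_teacher(-0.25, 0.20))  # not detectably different from average
```

Note how the verdict hinges on the standard error as much as on the estimate itself, which is one reason retroactive revisions are so troublesome: a revision that nudges either the estimate or its standard error can move a teacher across the 95% threshold years after the original score was used.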


If interested, see the Review of Article #1 – the introduction to the special issue here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86. doi:10.3102/0013189X15574904

It’s a VAM Shame…

It’s a VAM shame to see how VAMs have been used, erroneously yet as assuredly perfect-to-near-perfect indicators of educational quality, to influence educational policy. A friend and colleague of mine just sent me a PowerPoint that William L. Sanders – the developer of the Tennessee Value-Added Assessment System (TVAAS) now more popularly known as the Education Value-Added Assessment System (EVAAS®) and “arguably the most ardent supporter and marketer of [for-profit] value-added” (Harris, 2011; see also prior links about Sanders and his T/EVAAS model here, here, here, and here) – presented to the Tennessee Board of Education back in 2013.

The simple and straightforward (and hence policymaker-friendly) PowerPoint titled “Teacher Characteristics and Effectiveness” consists of seven total slides with figures illustrating three key points: teacher value-added as calculated using the TVAAS model does not differ by (1) years of teaching experience, (2) teachers’ education level, and (3) teacher salary. In other words, and as translated into simpler terms but also terms that have greatly influenced (and continue to influence) educational policy: (1) years of teacher experience do not matter, (2) advanced degrees do not matter, and (3) teacher salaries do not matter.

While it’s difficult to determine how this particular presentation influenced educational policy in Tennessee (see, for example, here), at a larger scale these are the three key policy trends that have since directed (and continue to direct) state policy initiatives in particular. What is trending in educational policy is to evaluate teachers only by their teacher-level value-added. At the same time, this “research” supports simultaneous calls to dismantle teachers’ traditional salary schedules that reward teachers for their years of experience (which matter, as per other research) and advanced degrees (on which other research is mixed).

This “research” evidence is certainly convenient when calls for budget cuts are politically in order. But this “research” is also more than unfortunate in that the underlying assumption in support of all of this is that VAMs are perfect-to-near-perfect indicators of educational quality; hence, their output data can and should be trusted. Likewise, all of the figures illustrated in this and many other similar PowerPoints can be wholly trusted because they are based on VAMs.

Despite the plethora of methodological and pragmatic issues with VAMs, highlighted here within the first post I ever published on this blog and also duly noted by the American Statistical Association as well as other associations (e.g., the National Association of Secondary School Principals (NASSP), the National Academy of Education), these VAMs are being used to literally change and set bunkum educational policy, because so many care not to be bothered with the truth, however inconvenient.

Like I wrote, it’s a VAM shame…

Evidence of Grade and Subject-Level Bias in Value-Added Measures: Article Published in TCR

One of my most recent posts was about William Sanders — developer of the Tennessee Value-Added Assessment System (TVAAS), which is now more popularly known as the Education Value-Added Assessment System (EVAAS®) — and his forthcoming 2015 James Bryant Conant Award — one of the nation’s most prestigious education honors, which will be awarded to him next month by the Education Commission of the States (ECS).

Sanders is to be honored for his “national leader[ship] in value-added assessments, [as] his [TVAAS/EVAAS] work has [informed] key policy discussion[s] in states across the nation.”

Ironically, this was announced the same week that one of my former doctoral students — Jessica Holloway-Libell, who is soon to be an Assistant Professor at Kansas State University — had a paper published in the esteemed Teachers College Record about this very model. Her paper titled, “Evidence of Grade and Subject-Level Bias in Value-Added Measures” can be accessed (at least for the time being) here.

You might also recall this topic, though, as we posted her two initial drafts of this article over one year ago, here and here. Both posts followed analyses she conducted after a VAMboozled follower emailed us expressing his suspicions about grade- and subject-area bias in his district in Tennessee, in which he was (and still is) a school administrator. The question he posed was whether his suspicions were correct, and whether this was happening elsewhere in his state, using Sanders’ TVAAS/EVAAS model.

Jessica found it was.

More specifically, Jessica found that:

  1. Teachers of students in 4th and 8th grades were much more likely to receive positive value-added scores than teachers of students in other grades (e.g., 5th, 6th, and 7th grades); hence, 4th- and 8th-grade teachers are ostensibly better teachers in Tennessee, per the TVAAS/EVAAS model.
  2. Mathematics teachers (theoretically throughout Tennessee) are, overall, more effective than Tennessee’s English/language arts teachers, regardless of school district; hence, mathematics teachers are ostensibly better than English/language arts teachers in Tennessee, per the TVAAS/EVAAS model.

Being a former mathematics teacher myself, and accordingly subject-area biased, I’d like to support the second claim as being true. But the fact of the matter is that the counterclaim, that the model rather than the teachers is driving these differences, is almost certainly true instead.

It’s not that either or any set of these teachers is in fact better; it’s that Sanders’ TVAAS/EVAAS model, the model for which Sanders is receiving this esteemed award, is yielding biased output. It is doing this for whatever reason (e.g., measurement error, test construction), but this just adds to the list of other problems (see, for example, here, here, and here) and, quite frankly, the reasons why this model, not to mention its master creator, is undeserving of really any award, except for a Bunkum, perhaps.