Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

I recently re-read an article in full that is now 10 years old, or 10 years out, as published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” Unfortunately, I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as most of the time, this serves as a reminder of how things, most of the time, have not changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientists),  Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the point at which the authors wrote this article, besides the aforementioned data and data base issues, were issues with “multiple measures on the same student and multiple teachers instructing each student” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” Authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (stil) comes from influences outside of the teacher’s control. A similar statistic (i.e., 1% to 14%) was similarly and recently highlighted in the recent position statement on VAMs released by the American Statistical Association.
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one-step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here: Putting growth and value-added models on the map: A national overview).
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically insolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” The really only solution here is to use “implicit imputation of values for unobserved gains using the observed scores” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observe gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”

No More EVAAS for Houston: School Board Tie Vote Means Non-Renewal

Recall from prior posts (here, here, and here) that seven teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking HISD to federal court over how their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), are being used, and allegedly abused, while this district that has tied more high-stakes consequences to value-added output than any other district/state in the nation. The case, Houston Federation of Teachers, et al. v. Houston ISD, is ongoing.

But just announced is that the HISD school board, in a 3:3 split vote late last Thursday night, elected to no longer pay an annual $680K to SAS Institute Inc. to calculate the district’s EVAAS value-added estimates. As per an HFT press release (below), HISD “will not be renewing the district’s seriously flawed teacher evaluation system, [which is] good news for students, teachers and the community, [although] the school board and incoming superintendent must work with educators and others to choose a more effective system.”

here

Apparently, HISD was holding onto the EVAAS, despite the research surrounding the EVAAS in general and in Houston, in that they have received (and are still set to receive) over $4 million in federal grant funds that has required them to have value-added estimates as a component of their evaluation and accountability system(s).

While this means that the federal government is still largely in favor of the use of value-added model (VAMs) in terms of its funding priorities, despite their prior authorization of the Every Student Succeeds Act (ESSA) (see here and here), this also means that HISD might have to find another growth model or VAM to still comply with the feds.

Regardless, during the Thursday night meeting a board member noted that HISD has been kicking this EVAAS can down the road for 5 years. “If not now, then when?” the board member asked. “I remember talking about this last year, and the year before. We all agree that it needs to be changed, but we just keep doing the same thing.” A member of the community said to the board: “VAM hasn’t moved the needle [see a related post about this here]. It hasn’t done what you need it to do. But it has been very expensive to this district.” He then listed the other things on which HISD could spend (and could have spent) its annual $680K EVAAS estimate costs.

Soon thereafter, the HISD school board called for a vote, and it ended up being a 3-3 tie. Because of the 3-3 tie vote, the school board rejected the effort to continue with the EVAAS. What this means for the related and aforementioned lawsuit is still indeterminate at this point.

Pennsylvania Governor Rejects “Teacher Performance” v. Teacher Seniority Bill

Yesterday, the Governor of Pennsylvania vetoed the “Protecting Excellent Teachers Act” bill that would lessen the role of seniority for teachers throughout the state. Simultaneously, the bill would increase the role of “observable” teacher effects, via teachers’ “performance ratings” as determined at least in part via the use of value-added model (VAM) estimates (i.e., using the popular Education Value-Added Assessment System (EVAAS)). These “performance ratings” at issue are to be used for increased consequential purposes (e.g., teacher terminations/layoffs, even if solely for economic reasons).

Governor Wolff is reported as saying that “the state should spend its time investing in improving teachers and performance standards, not paving the way for layoffs. In his veto message, he noted that the evaluation system was designed to identify a teacher’s weaknesses and then provide the opportunity to improve.” He is quoted as adding, “Teachers who do not improve after being given the opportunity and tools to do so are the ones who should no longer be in the classroom…This [emphasis added] is the system we should be using to remove ineffective teachers.”

The bill, passed by both the House and Senate, and supported by the state School Boards Association among others, is apparently bound to resurface, however. Also because Republicans are charging the Governor with “resisting reform at the same time he wants more funding for education.” Increased funding is not going to happen without increased accountability, apparently, and according to Republican leaders.
Read more here, as per the article originally printed in The Philadelphia Inquirer.

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) – published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of things are actual true (or near to true, nor will they likely ever be true) in educational practice, however. Hence, the errors we continue to observe that continue to prevent VAM use for their intended utilities, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added is that the tests are very narrow, focus on lower level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes’…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessing Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p.133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of groups of student across teachers classrooms who are the ones whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is the one of most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (e) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include: VAM output are grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output are “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added causes “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Related, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, using VAMs for teacher evaluations or not. Given it’s rare in this area of research we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (p. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Ohio Bill to Review Value-Added Component of Schools’ A-F Report Cards

Ohio state legislators just last week introduced a bill to review the value-added measurements required when evaluating schools as per the state’s A-F school report cards (as based on Florida’s A-F school report card model). The bill is to be introduced by political members of the Republican side of the House who, more specifically, want officials and/or others to review how the state comes up with their school report card grades, with emphasis on the state’s specific value-added (i.e., Education Value-Added Assessment System (EVAAS)) component.

According to one article here, “especially confusing” with Ohio’s school reports cards is the school-level value added section. At the school level, value-added means essentially the same thing — the measurement of how well a school purportedly grew its students from one year to the next, when students’ growth in test scores over time are aggregated beyond the classroom and to the school-wide level. While value-added estimates are still to count for 35-50% of a teacher’s individual evaluation throughout the state, this particular bill has to do with school-level value-added only.

While most in the House, Democrats included, seem to be in favor of the idea of reviewing the value-added component (e.g., citing parent/user confusion, lack of transparency, common questions posed to the state and others about this specific component that they cannot answer), at least one Democrat is questioning Republicans’ motives (e.g., charging that Republicans might have ulterior motives to not hold charter schools accountable using VAMs and to simultaneously push conservative agendas further).

Regardless, that lawmakers in at least the state of Ohio are now admitting that they have too little understanding of how the value-added system works, and also works in practice, seems to be a step in the right direction. Let’s just hope the intentions of those backing the bill are in the right place, as also explained here. Perhaps the fact that the whole bill is one paragraph in length speaks to the integrity and forthrightness of the endeavor — perhaps not.

Otherwise, the Vice President for Ohio policy and advocacy for the Thomas B. Fordham Institute — a strong supporter of value added — is quoted as saying that “it makes sense to review the measurement…There are a lot of myths and misconceptions out there, and the more people know, the more people will understand the important role looking at student growth plays in the accountability system.”  One such “myth” he cites is that, “[t]here are measures on our state report card that correlate with demographics, but value added isn’t one of them.” In fact, and rather, we have evidence directly from the state of Ohio contradicting this claim that he calls a “myth” — that, indeed, bias is alive and well in Ohio (as well as elsewhere), especially when VAM-based estimates are aggregated at the school level (see a post with figures illustrating bias in Ohio here).

On that note, I just hope that whomever they invite for this forthcoming review, if the bill is passed, is well-informed, very knowledgeable of the literature surrounding value-added in general but also in breadth and depth, and is not representing a vendor or any particular think tank, philanthropic, or other entity with a clear agenda. Balance, at minimum for this review, is key.

Alleged Violation of Protective Order in Houston Lawsuit, Overruled

Many of you will recall a post I made public in January including “Houston Lawsuit Update[s], with Summar[ies] of Expert Witnesses’ Findings about the EVAAS” (Education Value-Added Assessment System sponsored by SAS Institute Inc.). What you might not have recognized since, however, was that I pulled the post down a few weeks after I posted it. Here’s the back story.

In January 2016, the Houston Federation of Teachers (HFT) published an “EVAAS Litigation Update,” which summarized a portion of Dr. Jesse Rothstein’s expert report in which he conclude[d], among other things, that teachers do not have the ability to meaningfully verify their EVAAS scores. He wrote that “[a]t most, a teacher could request information about which students were assigned to her, and could read literature — mostly released by SAS, and not the product of an independent investigation — regarding the properties of EVAAS estimates.” On January 10, 2016, I posted the post: “Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS” summarizing what I considered to be the twelve key highlights of HFT’s “EVAAS Litigation Update,” in which I highlighted Rothstein’s above conclusions.

Lawyers representing SAS Institute Inc. charged that this post, along with the more detailed “EVAAS Litigation Update” I summarized within the post (authored by the Houston Federation of Teachers (HFT) to keep their members in Houston up-to-date on the progress of this lawsuit) violated a protective order that was put in place to protect SAS’s EVAAS computer source code. Even though there is/was nothing in the “EVAAS Litigation Update” or the blog post that disclosed the source code, SAS objected to both as disclosing conclusions that, SAS said, could not have been reached in the absence of a review of the source code. They threatened HFT, its lawyers, and its experts (myself and Dr. Rothstein) with monetary sanctions. HFT went to court in order to get the court’s interpretation of the protective order and to see if a Judge agreed with SAS’s position. In the meantime, I removed the prior post (which is now back up here).

The great news is that the Judge found in HFT’s favor. He found that neither the “EVAAS Litigation Update” nor the related blog post violated the protective order. Further, he found that “we” have the right to share other updates on the Houston lawsuit, which is still pending, as long as the updates do not violate the protective order still in place. This includes discussion of the conclusions or findings of experts, provided that the source code is not disclosed, either explicitly or by necessary implication.

In more specific terms, as per his ruling in his Court Order, the judge ruled that SAS Institute Inc.’s lawyers “interpret[ed] the protective order too broadly in this instance. Rothstein’s opinion regarding the inability to verify or replicate a teacher’s EVAAS score essentially mimics the allegations of HFT’s complaint. The Litigation Update made clear that Rothstein confirmed this opinion after review of the source code; but it [was] not an opinion ‘that could not have been made in the absence of [his] review’ of the source code. Rothstein [also] testified by affidavit that his opinion is not based on anything he saw in the source code, but on the extremely restrictive access permitted by SAS.” He added that “the overly broad interpretation urged by SAS would inhibit legitimate discussion about the lawsuit, among both the union’s membership and the public at large.” That, also in his words, would be an “unfortunate result” that should, in the future, be avoided.

Here, again, are the 12 key highlights of the EVAAS Litigation Update:
  • Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  • The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, albeit sophisticated VAM using data from HISD both times, he found that results “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness from which valid interpretations and assertions can be made.
  • EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  • EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average.”
  • Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  • The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  • Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects‘ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mediated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved.
  • There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  • Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  • HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from them and even though they are paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’ (AERA Statement, 2015).
  • EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  • Related, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc. which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.”Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students with- in HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”

Tennessee’s Trout/Taylor Value-Added Lawsuit Dismissed

As you may recall, one of 15 important lawsuits pertaining to teacher value-added estimates across the nation (Florida n=2, Louisiana n=1, Nevada n=1, New Mexico n=4, New York n=3, Tennessee n=3, and Texas n=1 – see more information here) was situated in Knox County, Tennessee.

Filed in February of 2015, with legal support provided by the Tennessee Education Association (TEA), Knox County teacher Lisa Trout and Mark Taylor charged that they were denied monetary bonuses after their Tennessee Value-Added Assessment System (TVAAS — the original Education Value-Added Assessment System (EVAAS)) teacher-level value-added scores were miscalculated. This lawsuit was also to contest the reasonableness, rationality, and arbitrariness of the TVAAS system, as per its intended and actual uses in this case, but also in Tennessee writ large. On this case, Jesse Rothstein (University of California – Berkeley) and I were serving as the Plaintiffs’ expert witnesses.

Unfortunately, however, last week (February 17, 2016) the Plaintiffs’ team received a Court order written by U.S. District Judge Harry S. Mattice Jr. dismissing their claims. While the Court had substantial questions about the reliability and validity of the TVAAS, the Court determined that the State satisfied the very low threshold of the “rational basis test,” at legal issue. I should note here, however, that all of the evidence that the lawyers for the Plaintiffs collected via their “extensive discovery,” including the affidavits both Jesse and I submitted on Plaintiffs’ behalves, were unfortunately not considered in Judge Mattice’s motion to dismiss. This, perhaps, makes sense given some of the assertions made by the Court, forthcoming.

Ultimately, the Court found that the TVAAS-based, teacher-level value-added policy at issue was “rationally related to a legitimate government interest.” As per the Court order itself, Judge Mattice wrote that “While the court expresses no opinion as to whether the Tennessee Legislature has enacted sound public policy, it finds that the use of TVAAS as a means to measure teacher efficacy survives minimal constitutional scrutiny. If this policy proves to be unworkable in practice, plaintiffs are not to be vindicated by judicial intervention but rather by democratic process.”

Otherwise, as per an article in the Knoxville News Sentinel, Judge Mattice was “not unsympathetic to the teachers’ claims,” for example, given the TVAAS measures “student growth — not teacher performance — using an algorithm that is not fail proof.” He inversely noted, however, in the Court order that the “TVAAS algorithms have been validated for their accuracy in measuring a teacher’s effect on student growth,” even if minimal. He also wrote that the test scores used in the TVAAS (and other models) “need not be validated for measuring teacher effectiveness merely because they are used as an input in a validated statistical model that measures teacher effectiveness.” This is, unfortunately, untrue. Nonetheless, he continued to write that even though the rational basis test “might be a blunt tool, a rational policymaker could conclude that TVAAS is ‘capable of measuring some marginal impact that teachers can have on their own students…[and t]his is all the Constitution requires.”

In the end, Judge Mattice concluded in the Court order that, overall, “It bears repeating that Plaintiff’s concerns about the statistical imprecision of TVAAS are not unfounded. In addressing Plaintiffs’ constitutional claims, however, the Court’s role is extremely limited. The judiciary is not empowered to second-guess the wisdom of the Tennessee legislature’s approach to solving the problems facing public education, but rather must determine whether the policy at issue is rationally related to a legitimate government interest.”

It is too early to know whether the prosecution team will appeal, although Judge Mattice dismissed the federal constitutional claims within the lawsuit “with prejudice.” As per an article in the Knoxville News Sentinel, this means that “it cannot be resurrected with new facts or legal claims or in another court. His decision can be appealed, though, to the 6th Circuit U.S. Court of Appeals.”

Everything is Bigger (and Badder) in Texas: Houston’s Teacher Value-Added System

Last November, I published a post about “Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation.” Thereafter, I engaged some of my former doctoral students to further explore some data from Houston Independent School District (HISD), and what we collectively found and wrote up was just published in the highly-esteemed Teachers College Record journal (Amrein-Beardsley, Collins, Holloway-Libell, & Paufler, 2016). To view the full commentary, please click here.

In this commentary we discuss HISD’s highest-stakes use of its Education Value-Added Assessment System (EVAAS) data – the value-added system HISD pays for at an approximate rate of $500,000 per year. This district has used its EVAAS data for more consequential purposes (e.g., teacher merit pay and termination) than any other state or district in the nation; hence, HISD is well known for its “big use” of “big data” to reform and inform improved student learning and achievement throughout the district.

We note in this commentary, however, that as per the evidence, and more specifically the recent release of the Texas’s large-scale standardized test scores, that perhaps attaching such high-stakes consequences to teachers’ EVAAS output in Houston is not working as district leaders have, now for years, intended. See, for example, the recent test-based evidence comparing the state of Texas v. HISD, illustrated below.

Figure 1

“Perhaps the district’s EVAAS system is not as much of an “educational-improvement and performance-management model that engages all employees in creating a culture of excellence” as the district suggests (HISD, n.d.a). Perhaps, as well, we should “ponder the specific model used by HISD—the aforementioned EVAAS—and [EVAAS modelers’] perpetual claims that this model helps teachers become more “proactive [while] making sound instructional choices;” helps teachers use “resources more strategically to ensure that every student has the chance to succeed;” or “provides valuable diagnostic information about [teachers’ instructional] practices” so as to ultimately improve student learning and achievement (SAS Institute Inc., n.d.).

The bottom line, though, is that “Even the simplest evidence presented above should at the very least make us question this particular value-added system, as paid for, supported, and applied in Houston for some of the biggest and baddest teacher-level consequences in town.” See, again, the full text and another, similar graph in the commentary, linked  here.

*****

References:

Amrein-Beardsley, A., Collins, C., Holloway-Libell, J., & Paufler, N. A. (2016). Everything is bigger (and badder) in Texas: Houston’s teacher value-added system. [Commentary]. Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=18983

Houston Independent School District (HISD). (n.d.a). ASPIRE: Accelerating Student Progress Increasing Results & Expectations: Welcome to the ASPIRE Portal. Retrieved from http://portal.battelleforkids.org/Aspire/home.html

SAS Institute Inc. (n.d.). SAS® EVAAS® for K–12: Assess and predict student performance with precision and reliability. Retrieved from www.sas.com/govedu/edu/k12/evaas/index.html

Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS

Recall from a prior post that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) are taking their district to federal court to fight for their rights as professionals, and how their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), have allegedly violated them. The case, Houston Federation of Teachers, et al. v. Houston ISD, is to officially begin in court early this summer.

More specifically, the teachers are arguing that EVAAS output are inaccurate, the EVAAS is unfair, that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach, that the EVAAS system fails to control for student-level factors that impact how well teachers perform but that are outside of teachers’ control (e.g., parental effects), that the EVAAS is incomprehensible and hence very difficult if not impossible to actually use to improve upon their instruction (i.e., actionable), and, accordingly, that teachers’ due process rights are being violated because teachers do not have adequate opportunities to change as a results of their EVAAS results.

The EVAAS is the one value-added model (VAM) on which I’ve conducted most of my research, also in this district (see, for example, here, here, here, and here); hence, I along with Jesse Rothstein – Professor of Public Policy and Economics at the University of California – Berkeley, who also conducts extensive research on VAMs – are serving as the expert witnesses in this case.

What was recently released regarding this case is a summary of the contents of our affidavits, as interpreted by authors of the attached “EVAAS Litigation UPdate,” in which the authors declare, with our and others’ research in support, that “Studies Declare EVAAS ‘Flawed, Invalid and Unreliable.” Here are the twelve key highlights, again, as summarized by the authors of this report and re-summarized, by me, below:

  1. Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  2. The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, albeit sophisticated VAM using data from HISD both times, he found that results “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness from which valid interpretations and assertions can be made.
  3. EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  4. EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average.”
  5. Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  6. The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  7. Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects‘ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mediated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved.
  8. There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  9. Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  10. HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from them and even though they are paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’ (AERA Statement, 2015).
  11. EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  12. Related, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc. which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.”Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students with- in HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”

Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation

States of attention on this blog, and often of (dis)honorable mention as per their state-level policies bent on value-added models (VAMs), include Florida, New York, Tennessee, and New Mexico. As for a quick update about the latter state of New Mexico, we are still waiting to hear the final decision from the judge who recently heard the state-level lawsuit still pending on this matter in New Mexico (see prior posts about this case here, here, here, here, and here).

Another locale of great interest, though, is the Houston Independent School District. This is the seventh largest urban school district in the nation, and the district that has tied more high-stakes consequences to their value-added output than any other district/state in the nation. These “initiatives” were “led” by soon-to-resign/retire Superintendent Terry Greir who, during his time in Houston (2009-2015), implemented some of the harshest consequences ever attached to teacher-level value-added output, as per the district’s use of the Education Value-Added Assessment System (EVAAS) (see other posts about the EVAAS here, here, and here; see other posts about Houston here, here, and here).

In fact, the EVAAS is still used throughout Houston today to evaluate all EVAAS-eligible teachers, to also “reform” the district’s historically low-performing schools, by tying teachers’ purported value-added performance to teacher improvement plans, merit pay, nonrenewal, and termination (e.g., 221 Houston teachers were terminated “in large part” due to their EVAAS scores in 2011). However, pending litigation (i.e., this is the district in which the American and Houston Federation of Teachers (AFT/HFT) are currently suing the district for their wrongful use of, and over-emphasis on this particular VAM; see here), Superintendent Grier and the district have recoiled on some of the high-stakes consequences they formerly attached to the EVAAS  This particular lawsuit is to commence this spring/summer.

Nonetheless, my most recent post about Houston was about some of its future school board candidates, who were invited by The Houston Chronicle to respond to Superintendent Grier’s teacher evaluation system. For the most part, those who responded did so unfavorably, especially as the evaluation systems was/is disproportionately reliant on teachers’ EVAAS data and high-stakes use of these data in particular (see here).

Most recently, however, as per a “split” decision registered by Houston’s current school board (i.e., 4:3, and without any new members elected last November), Superintendent Grier received a $98,600 bonus for his “satisfactory evaluation” as the school district’s superintendent. See more from the full article published in The Houston Chronicle. As per the same article, Superintendent “Grier’s base salary is $300,000, plus $19,200 for car and technology allowances. He also is paid for unused leave time.”

More importantly, take a look at the two figures below, taken from actual district reports (see references below), highlighting Houston’s performance (declining, on average, in blue) as compared to the state of Texas (maintaining, on average, in black), to determine for yourself whether Superintendent Grier, indeed, deserved such a bonus (not to mention salary).

Another question to ponder is whether the district’s use of the EVAAS value-added system, especially since Superintendent Grier’s arrival in 2009, is actually reforming the school district as he and other district leaders have for so long now intended (e.g., since his Superintendent appointment in 2009).

Figure 1

Figure 1. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR tests, 2012-2015 (HISD, 2015a)

Figure 2

Figure 2. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR End-of-Course (EOC) tests, 2012-2015 (HISD, 2015b)

References:

Houston Independent School District (HISD). (2015a). State of Texas Assessments of Academic Readiness (STAAR) performance, grades 3-8, spring 2015. Retrieved here.

Houston Independent School District (HISD). (2015b). State of Texas Assessments of Academic Readiness (STAAR) end-of-course results, spring 2015. Retrieved here.