This one (should) explain itself (i.e., not reliable, and accordingly, not valid either). When high-stakes consequences are to be attached, this matters even more.
The ongoing lawsuit in New Mexico has, once again (see here and here), been postponed to October of 2017 due to what are largely (and pretty much only) state (i.e., Public Education Department (PED)) delays. Whether the delays are deliberate is uncertain but being involved in this case… The (still) good news is that the preliminary injunction granted to teachers last fall (see here) still holds so that teachers cannot (or are not to) face consequences as based on the state’s teacher evaluation system.
For more information, this is the email the American Federation of Teachers – New Mexico (AFT NM) and the Albuquerque Teachers Federation (ATF) sent out to all of their members yesterday:
Yesterday, both AFT NM/ATF and PED returned to court to address the ongoing legal battle against the PED evaluation system. Our lawyer proposed that we set a court date ASAP. The PED requested a date for next fall citing their busy schedule as the reason. As a result, the court date is now late October 2017.
While we are relieved to have a final court date set, we are dismayed at the amount of time that our teachers have to wait for the final ruling.
In a statement to the press, ATF President Ellen Bernstein reflected on the current state of our teachers in regards to the evaluation system, “Even though they know they can’t be harmed in their jobs right now, it bothers them in the core of their being, and nothing I can say can take that away…It’s a cloud over everybody.”
AFT NM President Stephanie Ly, said, “It is a shame our educators still don’t have a legitimate evaluation system. The PED’s previous abusive evaluation system was thankfully halted through an injunction by the New Mexico courts, and the PED has yet to create an evaluation system that uniformly and fairly evaluates educators, and have shown no signs to remedy this situation. The PED’s actions are beyond the pale, and it is simply unjust.”
While we await trial, we want to thank our members who sent in their evaluation data to help our case. Remind your colleagues that they may still advance in licensure by completing a dossier; the PED’s arbitrary rating system cannot negatively affect a teacher’s ability to advance thanks to the injunction won by AFT NM/ATF last fall. That injunction will stay in place until a ruling is issued on our case next October.
The state of Florida is another one of our states to watch in that, even since the passage of the Every Student Succeeds Act (ESSA) last January, the state is still moving forward with using its VAMs for high-stakes accountability reform. See my most recent post about one district in Florida here, after the state ordered it to dismiss a good number of its teachers as per their low VAM scores when this school year started. After realizing this also caused or contributed to a teacher shortage in the district, the district scrambled to hire Kelly Services contracted substitute teachers to replace them, after which the district also put administrators back into the classroom to help alleviate the bad situation turned worse.
In a recent post released by The Ledger, teachers from the same Polk County School District (size = 100K students) added much needed details and also voiced concerns about all of this in the article that author Madison Fantozzi titled “Polk teachers: We are more than value-added model scores.”
Throughout this piece Fantozzi covers the story of Elizabeth Keep, a teacher who was “plucked from” the middle school in which she taught for 13 years, after which she was involuntarily placed at a district high school “just days before she was to report back to work.” She was one of 35 teachers moved from five schools in need of reform as based on schools’ value-added scores, although this was clearly done with no real concern or regard of the disruption this would cause these teachers, not to mention the students on the exiting and receiving ends. Likewise, and according to Keep, “If you asked students what they need, they wouldn’t say a teacher with a high VAM score…They need consistency and stability.” Apparently not. In Keep’s case, she “went from being the second most experienced person in [her middle school’s English] department…where she was department chair and oversaw the gifted program, to a [new, and never before] 10th- and 11th-grade English teacher” at the new high school to which she was moved.
As background, when Polk County School District officials presented turnaround plans to the State Board of Education last July, school board members “were most critical of their inability to move ‘unsatisfactory’ teachers out of the schools and ‘effective’ teachers in.” One board member, for example, expressed finding it “horrendous” that the district was “held hostage” by the extent to which the local union was protecting teachers from being moved as per their value-added scores. Referring to the union, and its interference in this “reform,” he accused the union of “shackling” the district and preventing its intended reforms. Note that the “effective” teachers who are to replace the “ineffective” ones can earn up to $7,500 in bonuses per year to help “turnaround” the schools into which they enter.
Likewise, the state’s Commissioner of Education concurred saying that she also “wanted ‘unsatisfactory’ teachers out and ‘highly effective’ teachers in,” again, with effectiveness being defined by teachers’ value-added or lack thereof, even though (1) the teachers targeted only had one or two years of the three years of value-added data required by state statute, and even though (2) the district’s senior director of assessment, accountability and evaluation noted that, in line with a plethora of other research findings, teachers being evaluated using the state’s VAM have a 51% chance of changing their scores from one year to the next. This lack of reliability, as we know it, should outright prevent any such moves in that without some level of stability, valid inferences from which valid decisions are to be made cannot be drawn. It’s literally impossible.
Nonetheless, state board of education members “unanimously… threatened to take [all of the district’s poor performing schools] over or close them in 2017-18 if district officials [didn’t] do what [the Board said].” See also other tales of similar districts in the article available, again, here.
In Keep’s case, “her ‘unsatisfactory’ VAM score [that caused the district to move her, as] paired with her ‘highly effective’ in-class observations by her administrators brought her overall district evaluation to ‘effective’…[although she also notes that]…her VAM scores fluctuate because the state has created a moving target.” Regardless, Keep was notified “five days before teachers were due back to their assigned schools Aug. 8 [after which she was] told she had to report to a new school with a different start time that [also] disrupted her 13-year routine and family that shares one car.”
VAM-based chaos reigns, especially in Florida.
Just this month, the Institute of Education Sciences (IES) wing of the U.S. Department of Education released a report about using value-added models (VAMs) for measuring school principals’ performance. The study, conducted by researchers at Mathematica Policy Research and titled “Can Student Test Scores Provide Useful Measures of School Principals’ Performance?” can be found online here, with my summary of the study findings highlighted next.
Before the passage of the Every Student Succeeds Act (ESSA), 40 states had written into their state statutes, as incentivized by the federal government, the use of growth in student achievement for annual principal evaluation purposes. More states had written growth/value-added models (VAMs) into their statutes for teacher evaluation purposes, which we have covered extensively via this blog, but this pertains only to school and/or principal evaluation purposes. Since the passage of ESSA, and the reduction in the federal government’s control over state-level policies, states now have much more liberty to decide whether to continue using student achievement growth for either purpose. This paper is positioned within this reasoning, and more specifically to help states decide whether or to what extent they might (or might not) continue to move forward with using growth/VAMs for school and principal evaluation purposes.
Researchers, more specifically, assessed (1) reliability – or the consistency or stability of these ratings over time, which is important “because only stable parts of a rating have the potential to contain information about principals’ future performance; unstable parts reflect only transient aspects of their performance;” and (2) one form of multiple evidences of validity – the predictive validity of these principal-level measures, with predictive validity defined as “the extent to which ratings from these measures accurately reflect principals’ contributions to student achievement in future years.” In short, “A measure could have high predictive validity only if [emphasis added] it was highly stable between consecutive years [i.e., reliability]…and its stable part was strongly related to principals’ contributions to student achievement” over time (i.e., predictive validity).
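One textbook way to see why reliability caps validity comes from classical test theory, where a measure's correlation with any outcome cannot exceed the square root of the measure's own reliability. The sketch below is only an illustration of that general bound; the function name and the reliability values are hypothetical and are not taken from the Mathematica study, which uses its own decomposition into stable and unstable parts.

```python
import math


def validity_ceiling(reliability: float) -> float:
    """Upper bound on a measure's correlation with any criterion.

    Classical test theory: r_xy <= sqrt(reliability of x).
    `reliability` is the year-to-year (test-retest) correlation, in [0, 1].
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return math.sqrt(reliability)


# Hypothetical reliabilities, for illustration only.
for rel in (0.09, 0.25, 0.64):
    print(f"reliability={rel:.2f} -> validity can be at most {validity_ceiling(rel):.2f}")
```

Note that the bound is only a ceiling: a perfectly stable measure can still have low predictive validity if its stable part is unrelated to principals' actual contributions.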
Researchers used principal-level value-added (unadjusted and adjusted for prior achievement and other potentially biasing demographic variables) to more directly examine “the extent to which student achievement growth at a school differed from average growth statewide for students with similar prior achievement and background characteristics.” Also important to note is that the data they used to examine school-level value-added came from Pennsylvania, which is one of a handful of states that uses the popular and proprietary (and controversial) Education Value-Added Assessment System (EVAAS) statewide.
Here are the researchers’ key findings, taken directly from the study’s summary (again, for more information see the full manuscript here).
- The two performance measures in this study that did not account for students’ past achievement—average achievement and adjusted average achievement—provided no information for predicting principals’ contributions to student achievement in the following year.
- The two performance measures in this study that accounted for students’ past achievement—school value-added and adjusted school value-added—provided, at most, a small amount of information for predicting principals’ contributions to student achievement in the following year. This was due to instability and inaccuracy in the stable parts.
- Averaging performance measures across multiple recent years did not improve their accuracy for predicting principals’ contributions to student achievement in the following year. In simpler terms, a principal’s average rating over three years did not predict his or her future contributions more accurately than did a rating from the most recent year only. This is more of a statistical finding than one that has direct implications for policy and practice (except for silly states who might, despite findings like those presented in this study, decide that they can use one year to do this not at all well instead of three years to do this not at all well).
Their bottom line? “…no available measures of principal [/school] performance have yet been shown to accurately identify principals [/schools] who will contribute successfully to student outcomes in future years,” especially if based on students’ test scores, although the researchers also assert that “no research has ever determined whether non-test measures, such as measures of principals’ leadership practices, [have successfully or accurately] predict[ed] their future contributions” either.
The researchers follow up with a highly cautionary note: “the value-added measures will make plenty of mistakes when trying to identify principals [/schools] who will contribute effectively or ineffectively to student achievement in future years. Therefore, states and districts should exercise caution when using these measures to make major decisions about principals. Given the inaccuracy of the test-based measures, state and district leaders and researchers should also make every effort to identify nontest measures that can predict principals’ future contributions to student outcomes [instead].”
Citation: Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from http://ies.ed.gov/ncee/pubs/2016002/pdf/2016002.pdf
Since the passage of the Every Student Succeeds Act (ESSA) last January, in which the federal government handed back to states the authority to decide whether to evaluate teachers with or without students’ test scores, states have been dropping the value-added measure (VAM) or growth components (e.g., the Student Growth Percentiles (SGP) package) of their teacher evaluation systems, as formerly required by President Obama’s Race to the Top initiative. See my most recent post here, for example, about how legislators in Oklahoma recently removed VAMs from their state-level teacher evaluation system, while simultaneously increasing the state’s focus on the professional development of all teachers. Hawaii recently did the same.
Now, it seems that Massachusetts is next, or is at least moving in this same direction.
As per a recent article in The Boston Globe (here), similar test-based teacher accountability efforts are facing increased opposition, primarily from school district superintendents and teachers throughout the state. At issue is whether all of this is simply “becoming a distraction,” whether the data can be impacted or “biased” by other statistically uncontrollable factors, and whether all teachers can be evaluated in similar ways, which is an issue with “fairness.” Also at issue is “reliability,” whereby a 2014 study released by the Center for Educational Assessment at the University of Massachusetts Amherst, in which researchers examined student growth percentiles, found the “amount of random error was substantial.” Stephen Sireci, one of the study authors and UMass professor, noted that, instead of relying upon the volatile results, “You might as well [just] flip a coin.”
Damian Betebenner, a senior associate at the National Center for the Improvement of Educational Assessment Inc. in Dover, N.H. who developed the SGP model in use in Massachusetts, added that “Unfortunately, the use of student percentiles has turned into a debate for scapegoating teachers for the ills.” Isn’t this the truth, to the extent that policymakers got a hold of these statistical tools, after which they much too swiftly and carelessly singled out teachers for unmerited treatment and blame.
Regardless, and recently, stakeholders in Massachusetts lobbied the Senate to approve an amendment to the budget that would no longer require such test-based ratings in teachers’ professional evaluations, while also passing a policy statement urging the state to scrap these ratings entirely. “It remains unclear what the fate of the Senate amendment will be,” however. “The House has previously rejected a similar amendment, which means the issue would have to be resolved in a conference committee as the two sides reconcile their budget proposals in the coming weeks.”
Not surprisingly, Mitchell Chester, Massachusetts Commissioner for Elementary and Secondary Education, continues to defend the requirement. It seems that Chester, like others, is still holding tight to the default (yet still unsubstantiated) logic helping to advance these systems in the first place, arguing, “Some teachers are strong, others are not…If we are not looking at who is getting strong gains and those who are not we are missing an opportunity to upgrade teaching across the system.”
Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.
Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).
She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).
None of these things are actually true (or near to true, nor will they likely ever be true) in educational practice, however. Hence, the errors we continue to observe that continue to prevent VAM use for their intended utilities, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.
Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added, in that the tests are very narrow, focus on lower level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessing Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).
With respect to (2) above, on the equivalence (or rather non-equivalence) of groups of students across teachers’ classrooms who are the ones whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.
And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).
“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).
Other “considerable challenges” include: VAM output are grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output are “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added causes “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Relatedly, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.
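To make the imprecision point above more concrete, here is a minimal sketch of how a margin of error around a single estimate translates into a wide band of percentile ranks. It assumes teacher effects are normally distributed and that the estimate's standard error is expressed in standard-deviation units; the function name and the standard error of 0.4 are hypothetical choices for illustration, not Corcoran's method or his data.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_interval(percentile: float, se: float, z_crit: float = 1.96):
    """Approximate 95% interval, in percentile ranks, around a point estimate.

    percentile: the teacher's estimated rank (0-100, exclusive).
    se: ASSUMED standard error of the estimate, in teacher-effect SD units.
    """
    z = STD_NORMAL.inv_cdf(percentile / 100)        # rank -> z-score
    low = STD_NORMAL.cdf(z - z_crit * se) * 100     # lower bound -> rank
    high = STD_NORMAL.cdf(z + z_crit * se) * 100    # upper bound -> rank
    return low, high


low, high = percentile_interval(43, se=0.4)  # se=0.4 is hypothetical
print(f"Estimated 43rd percentile; plausible range {low:.0f}th to {high:.0f}th")
```

Under these assumptions, even a moderate standard error yields an interval that spans several rating bands, which is the practical meaning of “notoriously imprecise.”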
Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).
Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, using VAMs for teacher evaluations or not. Given that it’s rare in this area of research that we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (pp. 135-136) within her larger article here.
In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).
If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.
Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346
Recall the New York lawsuit pertaining to Long Island teacher Sheri Lederman? She just won in New York’s State Supreme Court, and boy did she win big, also for the cause!
Sheri is a teacher, who by all accounts other than her 2013-2014 “ineffective” growth score of a 1/20, is a terrific 4th grade, 18-year veteran teacher. However, after receiving her “ineffective” growth rating and score, she along with her attorney and husband Bruce Lederman, sued the state of New York to challenge the state’s growth-based teacher evaluation system and Sheri’s individual score. See prior posts about Sheri’s case here, here, here and here.
The more specific goal of her case was to seek a judgment: (1) setting aside or vacating Sheri’s individual growth score and her rating as “ineffective,” and (2) declaring that the New York endorsed and implemented growth measures in use were/are “arbitrary and capricious.” The “overall gist” was that Sheri contended that the system unfairly penalized teachers whose students consistently scored well and could not demonstrate growth upwards (e.g., teachers of gifted or other high achieving students). This concern/complaint is common elsewhere.
As per a State Supreme Court ruling, just released today as written by Acting Supreme Court Justice Roger McDonough (May 10, 2016), 15 pages in length and available in full here, Sheri won her case. She won it against John King, the then New York State Education Department Commissioner and now US Secretary of Education (who recently replaced Arne Duncan in that role). The Court concluded that Sheri (with her husband, her team of experts, and other witnesses) effectively established that her growth score and rating for 2013-2014 were “arbitrary and capricious,” with “arbitrary and capricious” being defined as actions “taken without sound basis in reason or regard to the facts.”
More specifically, the Court’s conclusion was founded upon: “(1) the convincing and detailed evidence of VAM bias against teachers at both ends of the spectrum (e.g. those with high-performing students or those with low-performing students); (2) the disproportionate effect of petitioner’s small class size and relatively large percentage of high-performing students; (3) the functional inability of high-performing students to demonstrate growth akin to lower-performing students; (4) the wholly unexplained swing in petitioner’s growth score from 14 [i.e., her growth score the year prior] to 1, despite the presence of statistically similar scoring students in her respective classes; and, most tellingly, (5) the strict imposition of rating constraints in the form of a “bell curve” that places teachers in four categories via pre-determined percentages regardless of whether the performance of students dramatically rose or dramatically fell from the previous year.”
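The fifth point is easy to demonstrate. The sketch below rates teachers purely by rank, filling four categories to pre-determined shares, and shows that every rating stays the same even when all students (and so all teachers' growth scores) improve dramatically. The category shares, teacher labels, and scores are hypothetical and are not New York's actual percentages or data.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical category shares; NOT New York's actual percentages.
SHARES: Sequence[Tuple[str, float]] = (
    ("ineffective", 0.10),
    ("developing", 0.20),
    ("effective", 0.60),
    ("highly effective", 0.10),
)


def forced_ratings(scores: Dict[str, float],
                   shares: Sequence[Tuple[str, float]] = SHARES) -> Dict[str, str]:
    """Assign ratings by rank only, filling each category to its preset share."""
    ranked = sorted(scores, key=scores.get)  # lowest growth score first
    n = len(ranked)
    ratings: Dict[str, str] = {}
    cumulative, start = 0.0, 0
    for label, share in shares:
        cumulative += share
        end = round(cumulative * n)
        for teacher in ranked[start:end]:
            ratings[teacher] = label
        start = end
    for teacher in ranked[start:]:  # guard against rounding leftovers
        ratings[teacher] = shares[-1][0]
    return ratings


last_year = {f"Teacher {i}": float(i) for i in range(10)}
# Every teacher's growth score rises by the same large amount.
this_year = {name: score + 50.0 for name, score in last_year.items()}

print(forced_ratings(last_year) == forced_ratings(this_year))
```

Because only rank order matters, someone must always land in the bottom category, no matter how well or how poorly students performed overall.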
As per an email I received earlier today from Bruce (i.e., Sheri’s husband/attorney who prosecuted her case), the Court otherwise “declined to make an overall ruling on the [New York growth] rating system in general because of new regulations in effect [e.g., that the state’s growth model is currently under review]…[Nonetheless, t]he decision should qualify as persuasive authority for other teachers challenging growth scores throughout the County [and Country]. [In addition, the] Court carefully recite[d] all our expert affidavits [i.e., from Professors Darling-Hammond, Pallas, Amrein-Beardsley, Sean Corcoran and Jesse Rothstein as well as Drs. Burris and Lindell].” Noted as well were the “absence of any meaningful” challenge to [Sheri’s] experts’ conclusions, especially about the dramatic swings noticed between her, and potentially others’, scores, and the other “litany of expert affidavits submitted on [Sheri’s] behalf.”
“It is clear that the evidence all of these amazing experts presented was a key factor in winning this case since the Judge repeatedly said both in Court and in the decision that we have a “high burden” to meet in this case.” [In addition,] [t]he Court wrote that the court “does not lightly enter into a critical analysis of this matter … [and] is constrained on this record, to conclude that [the] petitioner [i.e., Sheri] has met her high burden.”
To Bruce’s/our knowledge, this is the first time a judge has set aside an individual teacher’s VAM rating based upon such a presentation in court.
Thanks to all who helped in this endeavor. Onward!
In a prior post, about whether the state of “Alabama is the New, New Mexico,” I wrote about a draft bill in Alabama to be called the Rewarding Advancement in Instruction and Student Excellence (RAISE) Act of 2016. This has since been renamed the Preparing and Rewarding Educational Professionals (PREP) Bill (to be Act) of 2016. The bill was introduced by its sponsoring Republican Senator Marsh last Tuesday, and its public hearing is scheduled for tomorrow. I review the bill below, and attach it here for others who are interested in reading it in full.
First, the bill is to “provide a procedure for observing and evaluating teachers, principals, and assistant principals on performance and student achievement…[using student growth]…to isolate the effect and impact of a teacher on student learning, controlling for pre-existing characteristics of a student including, but not limited to, prior achievement.” Student growth is still one of the bill’s key components, with growth set at a 25% weight, and this is still written into this bill regardless of the fact that the new federal Every Student Succeeds Act (ESSA) no longer requires teacher-level growth as a component of states’ educational reform legislation. In other words, states are no longer required to do this, but apparently the state/Senator Marsh still wants to move forward in this regard, regardless (and regardless of the research evidence). The student growth model is to be selected by October 1, 2016. On this my offer (as per my prior post) still stands. I would be more than happy to help the state negotiate this contract, pro bono, and much more wisely than so many other states and districts have negotiated similar contracts thus far (e.g., without asking for empirical evidence as a continuous contractual deliverable).
Second, and related, nothing is written about the ongoing research and evaluation of the state system, that is absolutely necessary in order to ensure the system is working as intended, especially before any types of consequential decisions are to be made (e.g., school bonuses, teachers’ denial of tenure, teacher termination, teacher termination due to a reduction in force). All that is mentioned is that things like stakeholder perceptions, general outcomes, and local compliance with the state will be monitored. Without evidence in hand, in advance and preferably as externally vetted and validated, the state will very likely be setting itself up for some legal trouble.
Third, to measure growth the state is set to use student performance data on state tests, as well as data derived via the ACT Aspire examination, American College Test (ACT), and “any number of measures from the department developed list of preapproved options for governing boards to utilize to measure student achievement growth.” As mentioned in my prior post about Alabama (linked to again here), this is precisely what has gotten the whole state of New Mexico wrapped up in its ongoing lawsuit, which it is quasi-losing. While providing districts with menus of off-the-shelf and other assessment options might make sense to policymakers, any self-respecting researcher should know why this is entirely inappropriate. The best research study explaining why doing just this will set any state up for lawsuits comes from Brown University’s John Papay, in his highly esteemed and highly cited article “Different tests, different answers: The stability of teacher value-added estimates across outcome measures.” The title of this article alone should explain why offering up such tests in such casual ways makes way for legal recourse.
Fourth, the bill is to increase the number of years it will take Alabama teachers to earn tenure, requiring that teachers teach for at least five years, including at least three consecutive years rated “satisfies expectations,” “exceeds expectations,” or “significantly exceeds expectations” via the state’s evaluation system, prior to earning tenure. Clearly the state does not understand the current issues with value-added/growth levels of reliability, or consistency, or lack thereof, that are altogether preventing such consistent classifications of teachers over time. On the contrary, what is consistently evident across all growth models is that estimates are very inconsistent from year to year, which will likely thwart what the bill has written into it here as such a theoretically simple proposition. For example, the common statistic still cited in this regard is that a teacher classified as “adding value” has a 25% to 50% chance of being classified as “subtracting value” the following year(s), and vice versa. This sometimes makes the probability of a teacher being consistently identified as (in)effective, from year to year, no different than the flip of a coin, and this is true even when there are at least three years of data (which is standard practice and is also written into this bill as a minimum requirement).
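To make the arithmetic behind this concern concrete, here is a minimal sketch, not a model of any actual growth system. It assumes (purely for illustration) that a teacher's classification flips from one year to the next with a fixed probability, independently each year, and it uses the 25% and 50% figures cited above as the two ends of that range. The function name and the independence assumption are mine, not the bill's.

```python
def chance_of_stable_rating(flip_probability: float, consecutive_years: int) -> float:
    """Probability that a classification stays the same across consecutive years.

    Illustrative assumption: each year-to-year transition flips the
    classification independently with the same probability.
    """
    if not 0.0 <= flip_probability <= 1.0:
        raise ValueError("flip_probability must be between 0 and 1")
    if consecutive_years < 1:
        raise ValueError("consecutive_years must be at least 1")
    # Three consecutive years involve two year-to-year transitions.
    transitions = consecutive_years - 1
    return (1.0 - flip_probability) ** transitions


if __name__ == "__main__":
    for flip_probability in (0.25, 0.50):
        stable = chance_of_stable_rating(flip_probability, consecutive_years=3)
        print(f"flip chance {flip_probability:.0%}: "
              f"chance of the same rating three years running = {stable:.1%}")
```

Under these simplified assumptions, the chance of holding the same classification for three consecutive years falls somewhere between roughly one in four and a bit better than one in two, which is the sense in which a three-consecutive-year requirement may turn largely on chance.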
Unless the state plans on “artificially conflating” scores, by manufacturing and forcing the oft-unreliable growth data to fit or correlate with teachers’ observational data (two observations per year are to be required), and/or survey data (student surveys are to be used for teachers of students in grades three and above), such consistency has thus far proven impossible (see a recent post here about how “artificial conflation” is one of the fundamental and critical points of litigation in another lawsuit in Houston). Relatedly, this bill is also to allow a governing board to evaluate a tenured teacher as per his/her similar classifications every other year, and is to subject a tenured teacher who received a rating of below or significantly below expectations for two consecutive evaluations to personnel action.
Fifth and finally, the bill is also to use an allocated $5,000,000 to recruit teachers, $3,000,000 to mentor teachers, and $10,000,000 to reward the top 10% of schools in the state as per their demonstrated performance. The last of these is the most consequential as per the state’s use of its planned growth data at the school level (see other teacher-level consequences to be attached to growth output above); hence, the comments about needing empirical evidence to justify such allocations prior to the distribution of such monies are important to underscore again. Likewise, given that levels of bias in value-added/growth output are worse at the school versus teacher levels, I would also caution the state against rewarding schools, again in this regard, for what might not really be schools’ causal impacts on student growth over time, after all. See, for example, here, here, and here.
As I also mentioned in my prior post on Alabama, for those of you who have access to educational leaders there, do send them this post too, so they might be a bit more proactive, and appropriately more careful and cautious, before going down what continues to demonstrate itself as a poor educational policy path. While I do embrace my professional responsibility as a public scholar to be called to court to testify about all of this when such high-stakes consequences are ultimately, yet inappropriately, based upon invalid inferences, I’d much rather be proactive in this regard and save states and states’ taxpayers their time and money, respectively.
As you may recall, one of 15 important lawsuits pertaining to teacher value-added estimates across the nation (Florida n=2, Louisiana n=1, Nevada n=1, New Mexico n=4, New York n=3, Tennessee n=3, and Texas n=1 – see more information here) was situated in Knox County, Tennessee.
Filed in February of 2015, with legal support provided by the Tennessee Education Association (TEA), Knox County teachers Lisa Trout and Mark Taylor charged that they were denied monetary bonuses after their Tennessee Value-Added Assessment System (TVAAS, the original Education Value-Added Assessment System (EVAAS)) teacher-level value-added scores were miscalculated. This lawsuit was also to contest the reasonableness, rationality, and arbitrariness of the TVAAS system, as per its intended and actual uses in this case, but also in Tennessee writ large. On this case, Jesse Rothstein (University of California – Berkeley) and I were serving as the Plaintiffs’ expert witnesses.
Unfortunately, however, last week (February 17, 2016) the Plaintiffs’ team received a Court order written by U.S. District Judge Harry S. Mattice Jr. dismissing their claims. While the Court had substantial questions about the reliability and validity of the TVAAS, the Court determined that the State satisfied the very low threshold of the “rational basis test” at legal issue. I should note here, however, that all of the evidence that the lawyers for the Plaintiffs collected via their “extensive discovery,” including the affidavits both Jesse and I submitted on Plaintiffs’ behalf, was unfortunately not considered in Judge Mattice’s ruling on the motion to dismiss. This, perhaps, makes sense given some of the assertions made by the Court, forthcoming.
Ultimately, the Court found that the TVAAS-based, teacher-level value-added policy at issue was “rationally related to a legitimate government interest.” As per the Court order itself, Judge Mattice wrote that “While the court expresses no opinion as to whether the Tennessee Legislature has enacted sound public policy, it finds that the use of TVAAS as a means to measure teacher efficacy survives minimal constitutional scrutiny. If this policy proves to be unworkable in practice, plaintiffs are not to be vindicated by judicial intervention but rather by democratic process.”
Otherwise, as per an article in the Knoxville News Sentinel, Judge Mattice was “not unsympathetic to the teachers’ claims,” for example, given the TVAAS measures “student growth — not teacher performance — using an algorithm that is not fail proof.” He conversely noted, however, in the Court order that the “TVAAS algorithms have been validated for their accuracy in measuring a teacher’s effect on student growth,” even if minimal. He also wrote that the test scores used in the TVAAS (and other models) “need not be validated for measuring teacher effectiveness merely because they are used as an input in a validated statistical model that measures teacher effectiveness.” This is, unfortunately, untrue. Nonetheless, he continued to write that even though the rational basis test “might be a blunt tool, a rational policymaker could conclude that TVAAS is ‘capable of measuring some marginal impact that teachers can have on their own students…[and t]his is all the Constitution requires.’”
In the end, Judge Mattice concluded in the Court order that, overall, “It bears repeating that Plaintiff’s concerns about the statistical imprecision of TVAAS are not unfounded. In addressing Plaintiffs’ constitutional claims, however, the Court’s role is extremely limited. The judiciary is not empowered to second-guess the wisdom of the Tennessee legislature’s approach to solving the problems facing public education, but rather must determine whether the policy at issue is rationally related to a legitimate government interest.”
It is too early to know whether the Plaintiffs’ team will appeal, although Judge Mattice dismissed the federal constitutional claims within the lawsuit “with prejudice.” As per an article in the Knoxville News Sentinel, this means that “it cannot be resurrected with new facts or legal claims or in another court. His decision can be appealed, though, to the 6th Circuit U.S. Court of Appeals.”
A retired Massachusetts principal, Linda Murdock, wrote a post on her blog, “Murdock’s EduCorner,” about her experiences as a principal with “value-added,” or more specifically, in her state, the use of Student Growth Percentile (SGP) scores to estimate said “value-added.” It’s certainly worth reading, as one thing I continue to find is that what we see in the research on value-added models (VAMs) is also being realized by practitioners in the schools required to use value-added output such as this. In this case, for example, while Murdock does not discuss the technical terms we use in the research (e.g., reliability, validity, and bias), she discusses these in pragmatic, real terms (e.g., year-to-year fluctuations, lack of relationship between SGP scores and other indicators of teacher effectiveness, and the extent to which certain sets of students can hinder teachers’ demonstrated growth or value-added, respectively). Hence, do give her post a read here, and also pasted in full below. Do also pay special attention to the bulleted sections in which she discusses these and other issues on a case-by-case basis.
At the end of the last school year, I was chatting with two excellent teachers, and our conversation turned to the new state-mandated teacher evaluation system and its use of student “growth scores” (“Student Growth Percentiles” or “SGPs” in Massachusetts) to measure a teacher’s “impact on student learning.”
“Guess we didn’t have much of an impact this year,” said one teacher.
The other teacher added, “It makes you feel about this high,” showing a tiny space between her thumb and forefinger.
Throughout the school, comments were similar — indicating that a major “impact” of the new evaluation system is demoralizing and discouraging teachers. (How do I know, by the way, that these two teachers are excellent? I know because I worked with them as their principal – being in their classrooms, observing and offering feedback, talking to parents and students, and reviewing products demonstrating their students’ learning – all valuable ways of assessing a teacher’s “impact”.)
According to the Massachusetts Department of Elementary and Secondary Education (“DESE”), the new evaluation system’s goals include promoting the “growth and development of leaders and teachers,” and recognizing “excellence in teaching and leading.” The DESE website indicates that the DESE considers a teacher’s median SGP as an appropriate measure of that teacher’s “impact on student learning”:
“ESE has confidence that SGPs are a high quality measure of student growth. While the precision of a median SGP decreases with fewer students, median SGP based on 8-19 students still provides quality information that can be included in making a determination of an educator’s impact on students.”
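The point about precision decreasing with fewer students can be illustrated with a small simulation. This is a minimal sketch under loud assumptions: each student's SGP is treated as an independent random draw from 1 to 99, which ignores the non-random class assignments discussed later in this post, and the function name, trial count, and seed are all hypothetical choices made only for illustration.

```python
import random
import statistics


def spread_of_median_sgp(class_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the class median SGP across simulated classes.

    Illustrative assumption: every student's SGP is an independent,
    uniformly random integer from 1 to 99.
    """
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        simulated_class = [rng.randint(1, 99) for _ in range(class_size)]
        medians.append(statistics.median(simulated_class))
    return statistics.pstdev(medians)


if __name__ == "__main__":
    for class_size in (8, 19, 100):
        spread = spread_of_median_sgp(class_size)
        print(f"class of {class_size:>3}: median SGP varies by about ±{spread:.1f} points")
```

Even in this idealized setting, where nothing but chance separates one class from another, the median for a class of 8 moves around considerably more than the median for a class of 19 or 100.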
Given the many concerns about the use of “value-added measurement” tools (such as SGPs) in teacher evaluation, this confidence is difficult to understand, particularly as applied to real teachers in real schools. Considerable research notes the imprecision and variability of these measures as applied to the evaluation of individual teachers. On the other side, experts argue that use of an “imperfect measure” is better than past evaluation methods. Theories aside, I believe that the actual impact of this “measure” on real people in real schools is important.
As a principal, when I first heard of SGPs I was curious. I wondered whether the data would actually filter out other factors affecting student performance, such as learning disabilities, English language proficiency, or behavioral challenges, and I wondered if the data would give me additional information useful in evaluating teachers.
Unfortunately, I found that SGPs did not provide useful information about student growth or learning, and median SGPs were inconsistent and not correlated with teaching skill, at least for the teachers with whom I was working. In two consecutive years of SGP data from our Massachusetts elementary school:
- One 4th grade teacher had median SGPs of 37 (ELA) and 36 (math) in one year, and 61.5 and 79 the next year. The first year’s class included students with disabilities and the next year’s did not.
- Two 4th grade teachers who co-teach their combined classes (teaching together, all students, all subjects) had widely differing median SGPs: one teacher had SGPs of 44 (ELA) and 42 (math) in the first year and 40 and 62.5 in the second, while the other teacher had SGPs of 61 and 50 in the first year and 41 and 45 in the second.
- A 5th grade teacher had median SGPs of 72.5 and 64 for two math classes in the first year, and 48.5, 26, and 57 for three math classes in the following year. The second year’s classes included students with disabilities and English language learners, but the first year’s did not.
- Another 5th grade teacher had median SGPs of 45 and 43 for two ELA classes in the first year, and 72 and 64 in the second year. The first year’s classes included students with disabilities and students with behavioral challenges while the second year’s classes did not.
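The size of these swings is easier to see when tabulated. The following is a minimal sketch that uses only the median SGPs reported above for the three 4th grade teachers, for whom both years can be paired by subject. The dictionary layout, labels, and function name are hypothetical and are not part of any state reporting system.

```python
# Median SGPs reported above, as (first year, second year) per subject.
reported_median_sgps = {
    "4th grade teacher": {"ELA": (37, 61.5), "math": (36, 79)},
    "co-teacher A": {"ELA": (44, 40), "math": (42, 62.5)},
    "co-teacher B": {"ELA": (61, 41), "math": (50, 45)},
}


def year_to_year_changes(records):
    """Return the second-year minus first-year median SGP for each teacher and subject."""
    return {
        teacher: {
            subject: second_year - first_year
            for subject, (first_year, second_year) in subjects.items()
        }
        for teacher, subjects in records.items()
    }


changes = year_to_year_changes(reported_median_sgps)

if __name__ == "__main__":
    for teacher, subjects in changes.items():
        for subject, change in subjects.items():
            print(f"{teacher:<18} {subject:<5} change: {change:+.1f} points")
```

The same records also show the two co-teachers, who taught the same combined group of students, starting the first year 17 points apart in ELA (44 versus 61).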
As an experienced observer/evaluator, I found that median SGPs did not correlate with teachers’ teaching skills but varied with class composition. Stronger teachers had the same range of SGPs in their classes as teachers with weaker skills, and median SGPs for a new teacher with a less challenging class were higher than median SGPs for a highly skilled veteran teacher with a class that included English language learners.
Furthermore, SGP data did not provide useful information regarding student growth. In analyzing students’ SGPs, I noticed obvious general patterns: students with disabilities had lower SGPs than students without disabilities, English language learners had lower SGPs than students fluent in English, students who had some kind of trauma that year (e.g., parents’ divorce) had lower SGPs, and students with behavioral/social issues had lower SGPs. SGPs were correlated strongly with test performance: in one year, for example, the median ELA SGP for students in the “Advanced” category was 88, compared with 51.5 for “Proficient” students, 19.5 for “Needs Improvement,” and 5 for the “Warning” category.
There were also wide swings in student SGPs, not explainable except perhaps by differences in student performance on particular test days. One student with disabilities had an SGP of 1 in the first year and 71 in the next, while another student had SGPs of 4 in ELA and 94 in math in 4th grade and SGPs of 50 in ELA and 4 in math in 5th grade, both with consistent district test scores.
So how does this “information” impact real people in a real school? As a principal, I found that it added nothing to what I already knew about the teaching and learning in my school. Using these numbers for teacher evaluation does, however, negatively impact schools: it demoralizes and discourages teachers, and it has the potential to affect class and teacher assignments.
In real schools, student and teacher assignments are not random. Students are grouped for specific purposes, and teachers are assigned classes for particular reasons. Students with disabilities and English language learners are often grouped to allow specialists, such as the speech/language teacher or the ELL teacher, to work more effectively with them. Students with behavioral issues are sometimes placed in special classes, and are often assigned to teachers who work particularly well with them. Leveled classes (AP, honors, remedial) create different student combinations, and teachers are assigned particular classes based on the administrator’s judgment of which teachers will do the best with which classes. For example, I would assign new or struggling teachers less challenging classes so I could work successfully with them on improving their skills.
In the past, when I told a teacher that he/she had a particularly challenging class, because he/she could best work with these students, he/she generally cheerfully accepted the challenge, and felt complimented on his/her skills. Now, that teacher could be concerned about the effect of that class on his/her evaluation. Teachers may be reluctant to teach lower level courses, or to work with English language learners or students with behavioral issues, and administrators may hesitate to assign the most challenging classes to the most skilled teachers.
In short, in my experience, the use of this type of “value-added” measurement provides no useful information and has a negative impact on real teachers and real administrators in real schools. If “data” is not only not useful, but actively harmful, to those who are supposedly benefitting from using it, what is the point? Why is this continuing?