A Win in New Jersey: Tests to Now Account for 5% of Teachers’ Evaluations

Phil Murphy, the Governor of New Jersey, is keeping his campaign promise to parents, students, and educators, according to a news article just posted by the New Jersey Education Association (NJEA; see here). As per the New Jersey Commissioner of Education, Dr. Lamont Repollet, who was a classroom teacher himself, Partnership for Assessment of Readiness for College and Careers (PARCC) test scores throughout New Jersey will now account for just 5% of a teacher’s evaluation, down from the 30% mandated for approximately five years prior by both Murphy’s and Repollet’s predecessors.

At last, the New Jersey Department of Education and the Murphy administration have “shown their respect for the research.” Because state law continues to require that standardized test scores play some role in teacher evaluation, the decrease to 5% is a victory, perhaps with a repeal of that law forthcoming.

“Today’s announcement is another step by Gov. Murphy toward keeping a campaign promise to rid New Jersey’s public schools of the scourge of high-stakes testing. While tens of thousands of families across the state have already refused to subject their children to PARCC, schools are still required to administer it and educators are still subject to its arbitrary effects on their evaluation. By dramatically lowering the stakes for the test, Murphy is making it possible for educators and students alike to focus more time and attention on real teaching and learning.” Indeed, “this is a victory of policy over politics, powered by parents and educators.”

Way to go New Jersey!

Much of the Same in Louisiana

As I wrote in a recent post: “…it seems that the residual effects of the federal government’s former [teacher evaluation reform policies and] efforts are still dominating states’ actions with regards to educational accountability.” In other words, many states are still moving forward, more specifically in terms of their continued reliance on value-added models (VAMs) for increased teacher accountability purposes, regardless of the passage of the Every Student Succeeds Act (ESSA).

Relatedly, three articles were recently published online (here, here, and here) about how, in Louisiana, the state’s old and controversial teacher evaluation system, based on VAMs, is resuming after a four-year hiatus. It was put on hold while the state was in the process of adopting the Common Core.

This, of course, has serious implications for the approximately 50,000 teachers throughout the state, or rather the proportion of them who are now VAM-eligible, believed to be around 15,000 (i.e., approximately 30%, which is in line with trends in other states).

The state’s system has also been partly adjusted: whereas 50% of a teacher’s evaluation was formerly to be based on growth in student achievement over time using VAMs, the new system reduces this percentage to 35%, and teachers of mathematics, English, science, and social studies are now also to be held accountable using VAMs. The other 50% of these teachers’ evaluation scores is to be based on observations, with the remaining 15% based on student learning targets (a.k.a. student learning objectives (SLOs)).

Evaluation system output is to be used to keep teachers from earning tenure, or to cause teachers to lose the tenure they might already have.

Among other controversies and issues of contention noted in these articles (see again here, here, and here), one of note (highlighted here) is that now, “even after seven years,” the state is still “unable to truly explain or provide the actual mathematical calculation or formula” used to link test scores with teacher ratings. “This obviously lends to the distrust of the entire initiative among the education community.”

A spokeswoman for the state, however, countered the transparency charge noting that the VAM formula has been on the state’s department of education website, “and updated annually, since it began in 2012.” She did not provide a comment about how to adequately explain the model, perhaps because she could not either.

Just because the formula might be available does not mean it is understandable and, accordingly, usable. This we have come to know from administrators, teachers, and yes, the state-level administrators in charge of these models (and their adoption and implementation) for years. This is, indeed, one of the most common criticisms of VAMs around.

VAM-Based Chaos Reigns in Florida, as Caused by State-Mandated Teacher Turnovers

The state of Florida is another one of our states to watch in that, even since the passage of the Every Student Succeeds Act (ESSA) last January, the state is still moving forward with using its VAMs for high-stakes accountability reform. See my most recent post about one district in Florida here, after the state ordered it to dismiss a good number of its teachers, as per their low VAM scores, when this school year started. After realizing this also caused or contributed to a teacher shortage in the district, the district scrambled to hire substitute teachers contracted through Kelly Services to replace them, after which the district also put administrators back into the classroom to help alleviate a bad situation turned worse.

In a recent article released by The Ledger, teachers from the same Polk County School District (size = 100K students) added much-needed details and also voiced concerns about all of this, in a piece that author Madison Fantozzi titled “Polk teachers: We are more than value-added model scores.”

Throughout this piece Fantozzi covers the story of Elizabeth Keep, a teacher who was “plucked from” the middle school in which she taught for 13 years, after which she was involuntarily placed at a district high school “just days before she was to report back to work.” She was one of 35 teachers moved from five schools in need of reform as based on the schools’ value-added scores, although this was clearly done with no real concern or regard for the disruption it would cause these teachers, not to mention the students on the exiting and receiving ends. According to Keep, “If you asked students what they need, they wouldn’t say a teacher with a high VAM score…They need consistency and stability.” Apparently not. In Keep’s case, she “went from being the second most experienced person in [her middle school’s English] department…where she was department chair and oversaw the gifted program, to a [new, and never before] 10th- and 11th-grade English teacher” at the high school to which she was moved.

As background, when Polk County School District officials presented turnaround plans to the State Board of Education last July, state board members “were most critical of their inability to move ‘unsatisfactory’ teachers out of the schools and ‘effective’ teachers in.” One board member, for example, expressed finding it “horrendous” that the district was “held hostage” by the extent to which the local union was protecting teachers from being moved as per their value-added scores. Referring to the union, and its interference in this “reform,” he accused it of “shackling” the district and preventing its intended reforms. Note that the “effective” teachers who are to replace the “ineffective” ones can earn up to $7,500 in bonuses per year to help “turn around” the schools into which they enter.

Likewise, the state’s Commissioner of Education concurred, saying that she also “wanted ‘unsatisfactory’ teachers out and ‘highly effective’ teachers in,” again, with effectiveness being defined by teachers’ value-added or lack thereof, even though (1) the teachers targeted had only one or two of the three years of value-added data required by state statute, and even though (2) the district’s senior director of assessment, accountability and evaluation noted that, in line with a plethora of other research findings, teachers evaluated using the state’s VAM have a 51% chance of seeing their scores change from one year to the next. This lack of reliability, as we know it, should outright prevent any such moves: without some level of stability, the valid inferences from which valid decisions are to be made cannot be drawn. It is literally impossible.

Nonetheless, state board of education members “unanimously… threatened to take [all of the district’s poor-performing schools] over or close them in 2017-18 if district officials [didn’t] do what [the Board said].” See also other tales of similar districts in the article available, again, here.

In Keep’s case, “her ‘unsatisfactory’ VAM score [that caused the district to move her, as] paired with her ‘highly effective’ in-class observations by her administrators brought her overall district evaluation to ‘effective’…[although she also notes that]…her VAM scores fluctuate because the state has created a moving target.” Regardless, Keep was notified “five days before teachers were due back to their assigned schools Aug. 8 [after which she was] told she had to report to a new school with a different start time that [also] disrupted her 13-year routine and family that shares one car.”

VAM-based chaos reigns, especially in Florida.

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 30 years ago (i.e., in 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured by states’ standardized tests. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair for 20+ years.

Here are some of the points I found of most “value”:

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” Yet “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot, and likely never will, be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly, if not wholly, impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot, and likely never will, be “controlled” or “factored” in or out, either.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond, Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these things is actually true (or even near to true, nor will they likely ever be true) in educational practice, however. Hence the errors we continue to observe, errors that continue to prevent VAMs from being used for their intended purposes, even with the sophisticated statistics meant to mitigate those errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.
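To make concrete what a model of this general type tries to do, here is a deliberately bare-bones sketch in Python with simulated data. It is my own illustration, not any state’s actual model, and every name and number in it is made up; real systems (e.g., EVAAS or AIR’s growth models) are far more elaborate.

```python
# A minimal, hypothetical value-added sketch: regress current scores on prior
# scores plus teacher indicators, then read the teacher coefficients as
# "value added." Illustration only, not any state's actual model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, n_students_per = 20, 25

# Simulate students: a prior score, a (hypothetical) true teacher effect,
# and lots of unexplained noise -- the "everything else" a VAM cannot see.
teachers = np.repeat([f"t{i:02d}" for i in range(n_teachers)], n_students_per)
true_effect = np.repeat(rng.normal(0, 0.15, n_teachers), n_students_per)
prior = rng.normal(0, 1, teachers.size)
current = 0.7 * prior + true_effect + rng.normal(0, 0.8, teachers.size)

df = pd.DataFrame({"teacher": teachers, "prior": prior, "current": current})

# Teacher fixed effects, holding prior achievement "constant."
# (Estimates are relative to the omitted reference teacher, t00.)
model = smf.ols("current ~ prior + C(teacher)", data=df).fit()
vam_estimates = model.params.filter(like="C(teacher)")
print(vam_estimates.sort_values().round(2))
```

The point of conditions (1) through (3) above is that everything such a model cannot see lands in its error term, and in real data that “noise” is anything but random.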

Other pervasive and perpetual issues surrounding VAMs, as highlighted by Darling-Hammond per each of the three categories above, pertain to (1) the tests used to measure value-added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level, given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced, the multistate consortia, “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if that is even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include the following: VAM output is grossly unstable, given the swings and variations observed in teacher classifications across time, and VAM output is “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”; see also the toy illustration following this paragraph). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticism is impacting the extent to which they use and value their value-added data; they note that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Relatedly, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.
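To give a feel for the kind of imprecision Corcoran documented, here is a toy simulation of my own (not Corcoran’s actual analysis), which simply assumes a year-to-year reliability of roughly 0.5 for the value-added estimates, in line with the instability estimates commonly cited in this literature:

```python
# Toy illustration of imprecision: with reliability ~0.5, teachers whose
# ESTIMATED value-added lands near the 43rd percentile have "true" effects
# spanning a very wide range of percentiles. Illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000

true = rng.normal(0, 1, n)                 # true teacher effects
noise = rng.normal(0, 1, n)                # estimation error (reliability = 0.5)
estimate = true + noise

est_pct = stats.rankdata(estimate) / n * 100
true_pct = stats.rankdata(true) / n * 100

near_43 = (est_pct > 42) & (est_pct < 44)  # teachers estimated near the 43rd percentile
lo, hi = np.percentile(true_pct[near_43], [2.5, 97.5])
print(f"95% of their TRUE percentiles fall between {lo:.0f} and {hi:.0f}")
```

The exact range depends on the reliability one assumes, but the general lesson holds: a single point estimate like “the 43rd percentile” is consistent with “true” effectiveness anywhere from well below to well above average.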

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given that it is rare in this area of research that we can focus on actual solutions, this section is a must-read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (pp. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Kane Is At It, Again: “Statistically Significant” Claims Exaggerated to Influence Policy

In a recent post, I critiqued a fellow academic and value-added model (VAM) supporter, Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation. Kane has been the source of multiple posts on this blog (see also here, here, and here) as he is a very public figure, very often backing, albeit often in non-peer-reviewed technical reports and documents, series of exaggerated, “research-based” claims. In that prior post, I more specifically critiqued the overstated claims he made in a recent National Public Radio (NPR) interview titled: “There Is No FDA For Education. Maybe There Should Be.”

Well, a colleague recently emailed me another such document authored by Kane (and co-written with four colleagues), titled “Teaching Higher: Educators’ Perspectives on Common Core Implementation.” While this one is quite methodologically sound (i.e., as assessed via a thorough read of the main text of the document, including all footnotes and appendices), it is Kane’s set of claims, again, that is of concern, especially knowing that this report, even though it too has not yet been externally vetted or reviewed, will likely have a policy impact. The main goal of this report is clearly (although not made explicit) to endorse, promote, and in many ways save the Common Core State Standards (CCSS). I emphasize the word save in that clearly, and especially since the passage of the Every Student Succeeds Act (ESSA), many states have rejected the still highly controversial Common Core. I should also note that the researchers clearly conducted this study with a priori conclusions in mind (i.e., that the Common Core should be saved/promoted); hence, future peer review of this piece may be out of the question, as the bias evident in the sets of findings would certainly be a “methodological issue,” again, likely preventing a peer-reviewed publication (see, for example, the a priori conclusion that “[this] study highlights an important advantage of having a common set of standards and assessments across multiple states” in the abstract (p. 3)).

First I will comment on the findings regarding the Common Core, as related to value-added models (VAMs). Next, I will comment on Section III of the report, about “Which [Common Core] Implementation Strategies Helped Students Succeed?” (p. 17). This is where Kane and colleagues “link[ed] teachers’ survey responses [about the Common Core] to their students’ test scores on the 2014–2015 PARCC [Partnership for Assessment of Readiness for College and Careers] and SBAC [Smarter Balanced Assessment Consortium] assessments [both of which are aligned to the Common Core Standards]… This allowed [Kane et al.] to investigate which strategies and which of the [Common Core-related] supports [teachers] received were associated with their performance on PARCC and SBAC,” controlling for a variety of factors including teachers’ prior value-added (p. 17).

With regards to the Common Core sections, Kane et al. make claims like: “Despite the additional work, teachers and principals in the five states [that have adopted the Common Core: Delaware, Maryland, Massachusetts, New Mexico, and Nevada] have largely embraced [emphasis added] the new standards” (p. 3). They mention nowhere, however, the mediating set of influences interfering with such a claim, and likely leading to it entirely or at least in part: that many teachers across the nation have been forced, by prior federal and current state mandates (e.g., in New Mexico), to “embrace the new standards.” Rather, Kane et al. imply throughout the document that this “embracement” is a sure sign that teachers and principals are literally taking the Common Core into their open arms. The same interference is at play with their similar claim that “Teachers in the five study states have made major changes [emphasis in the original] in their lesson plans and instructional materials to meet the CCSS” (p. 3). Compliance is certainly an intervening factor, again, likely contaminating and distorting the validity of both of these claims (which are two of the four total claims highlighted in the document’s abstract (p. 3)).

Elsewhere, Kane et al. claim that “The new standards and assessments represent a significant challenge for teachers and students” (p. 6), along with an accompanying figure they use to illustrate how proficiency (i.e., the percent of students labeled as proficient) on these five states’ prior tests has decreased, indicating more rigor or a more “significant challenge for teachers and students” thanks to the Common Core. What they completely ignore again, however, is that the cut scores used to define “proficiency” are arbitrary per state, as was their approach to define “proficiency” across states in comparison (see footnote four). What we also know from years of research on such tests is that whenever a state introduces a “new and improved” test (e.g., the PARCC and SBAC tests), which is typically tied to “new and improved standards” (e.g., the Common Core), lower “proficiency” rates are observed. This has happened countless times across states, and certainly prior to the introduction of the PARCC and SBAC tests. Thereafter, the state typically responds with the same types of claims, that “The new standards and assessments represent a significant challenge for teachers and students.” These claims are meant to signal to the public that at last “we” are holding our teachers and students accountable for their teaching and learning, but thereafter, again, proficiency cut scores are arbitrarily redefined (among other things), and then five or ten years later “new and improved” tests and standards are needed again. In other words, this claim is nothing new and it should not be interpreted as such, but it should rather be interpreted as aligned with Einstein’s definition of insanity (i.e., repeating the same behaviors over and over again in the hopes that different results will ultimately materialize) as this is precisely what we as a nation have been doing since the minimum competency era in the early 1980s.

Otherwise, Kane et al.’s other two claims were related to “Which [Common Core] Implementation Strategies Helped Students Succeed” (p. 17), as mentioned. They assert first that “In mathematics, [they] identified three markers of successful implementation: more professional development days, more classroom observations with explicit feedback tied to the Common Core, and the inclusion of Common Core-aligned student outcomes in teacher evaluations. All were associated with statistically significantly [emphasis added] higher student performance on the PARCC and [SBAC] assessments in mathematics” (p. 3, see also p. 20). They assert second that “In English language arts, [they] did not find evidence for or against any particular implementation strategies” (p. 3, see also p. 20).

What is highly problematic about these claims is that the three implementation strategies noted, again, as significantly associated with teachers’ students’ test-based performance on the PARCC and SBAC mathematics assessments were “statistically significant” (as determined by standard p, or “probability,” values, which indicate how unlikely the observed results would be if there were actually no effect), but they were not practically significant at all. There IS a difference: “statistically significant” findings may not be “practically significant,” or in this case “policy relevant,” at all. Many misinterpret “statistical significance” as an indicator of strength or importance; it is not. Practical significance is.

As per the American Statistical Association’s (ASA) recently released “Statement on P-Values,” statistical significance “is not equivalent to scientific, human, or economic significance…Any effect, no matter how tiny, can produce a small p-value [i.e., “statistical significance”] if the sample size or measurement precision is high enough” (p. 10); hence, one must always check for practical significance when making claims about statistical significance. Kane et al. make such claims here, but in a similarly inflated vein.
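The ASA’s point is easy to demonstrate with a toy simulation (mine, not a reanalysis of Kane et al.’s data): give two groups a difference of just 0.02 standard deviations, make the samples large enough, and the difference becomes “statistically significant” even though it is practically meaningless.

```python
# A tiny effect becomes "statistically significant" once samples are large
# enough. Toy demonstration of the ASA's point; not Kane et al.'s data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500_000                                    # very large samples
group_a = rng.normal(0.00, 1, n)
group_b = rng.normal(0.02, 1, n)               # a 0.02 SD difference -- trivial in practice

t_stat, p_value = stats.ttest_ind(group_a, group_b)
effect_size = group_b.mean() - group_a.mean()  # in SD units, since SD = 1

print(f"p-value: {p_value:.2e}")               # tiny p-value: "statistically significant"
print(f"effect size: {effect_size:.3f} SD")    # but practically negligible
```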

As their Table 6 shows (p. 20), the regression coefficients related to these three areas of “statistically significant” influence on teachers’ students’ test-based performance on the new PARCC and SBAC mathematics tests (i.e., more professional development days, more classroom observations with explicit feedback tied to the Common Core, and the inclusion of Common Core-aligned student outcomes in teacher evaluations) yielded the following coefficients, respectively: 0.045 (p < 0.01), 0.044 (p < 0.05), and 0.054 (p < 0.01). They then use as an example the 0.044 (p < 0.05) coefficient (as related to more classroom observations with explicit feedback tied to the Common Core) and explain that “a difference of one standard deviation in the observation and feedback index was associated with an increase of 0.044 standard deviations in students’ mathematics test scores—roughly the equivalent of 1.4 scale score points on the PARCC assessment and 4.1 scale score points on the SBAC.”

In order to generate a sizable and policy-relevant improvement in test scores (e.g., by half of a standard deviation), the observation and feedback index would have to jump by roughly 11 standard deviations (0.5 / 0.044 ≈ 11.4)! In addition, given that scale score points do not equal raw or actual test items (e.g., scale score-to-actual test item relationships are typically in the neighborhood of 4 or 5 scale score points to 1 actual test item), this likely also means that Kane’s interpretations (i.e., that mathematics scores rose by roughly the equivalent of 1.4 scale score points on the PARCC and 4.1 scale score points on the SBAC) actually amount to roughly a quarter to a third of a test item in mathematics on the PARCC and about four-fifths of, or one, test item on the SBAC. This hardly “Provides New Evidence on Strategies Related to Improved Student Performance,” unless you define improved student performance as something as little as a quarter of a test item.
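For those who want to check the arithmetic behind these two points, here it is spelled out, using the 0.044 coefficient from Table 6, the scale score gains Kane et al. report, and the rough (assumed) ratio of four to five scale score points per test item noted above:

```python
# Back-of-the-envelope arithmetic behind the claims above. The 4-5 scale
# score points per item ratio is a rough assumption, not a published figure.
coef = 0.044                    # SD gain in math scores per 1 SD of the observation/feedback index
target_gain_sd = 0.5            # a "sizable, policy relevant" improvement, per the text

index_sds_needed = target_gain_sd / coef
print(f"Index SDs needed for a 0.5 SD gain: {index_sds_needed:.1f}")   # ~11.4

# Translating the reported gains into (approximate) test items.
parcc_points, sbac_points = 1.4, 4.1          # scale score gains reported by Kane et al.
for points_per_item in (4, 5):
    print(f"Assuming {points_per_item} points/item: "
          f"PARCC ~{parcc_points / points_per_item:.2f} items, "
          f"SBAC ~{sbac_points / points_per_item:.2f} items")
```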

This is also not, as Kane et al. claim, “a moderately sizeable effect” (p. 21). These numbers should not even be reported, much less emphasized as policy relevant/significant, unless perhaps they were equivalent to at least 0.25 standard deviations on the test (as a quasi-standard/accepted minimum). Likewise, the same argument can be made about the other two coefficients derived via these mathematics tests. See also the similar claims they asserted (e.g., that “students perform[ed] better when teachers [were] being evaluated based on student achievement” (p. 21)).

Because the abstract (and possibly the conclusions) are the sections of this paper likely to have the most educational/policy impact, especially when people do not read the full text, footnotes, and appendices of the entire document, this is irresponsible, and in many ways contemptible. This is also precisely the reason why, again, Kane’s calls for a Food and Drug Administration (FDA) type of entity for education are so ironic (as explained in my prior post here).

Yong Zhao’s Stand-Up Speech

Yong Zhao — Professor in the Department of Educational Methodology, Policy, and Leadership at the University of Oregon — was a featured speaker at the recent annual conference of the Network for Public Education (NPE). He spoke about “America’s Suicidal Quest for Outcomes,” as in, test-based outcomes.

I strongly recommend you take almost an hour (i.e., 55 minutes) out of your busy day, sit back, and watch what is the closest thing to a stand-up speech I’ve ever seen. Zhao offers a poignant but also very entertaining and funny take on America’s public schools, surrounded by America’s public school politics and situated in America’s pop culture. The full transcription of Zhao’s speech is also available here, as made available by Mercedes Schneider, for any and all who wish to read it: Yong_Zhao NPE Transcript

Zhao speaks of democracy, and embraces the freedom of speech in America (v. China) that permits him to speak out. He explains why he pulled his son out of public school, thanks to No Child Left Behind (NCLB), yet he criticizes G. W. Bush for causing his son to live in his basement since college graduation. Hence, his son’s “readiness” to leave the basement is much more important than any other performance “readiness” measure being written into the plethora of educational policies surrounding “readiness” (e.g., career and college readiness, pre-school readiness).

Zhao uses what happened to Easter Island’s Rapa Nui civilization, which led to their extinction, as an analogy for what may happen to us post Race to the Top, given both sets of people are/were driven by false hopes of the gods raining prosperity down on them, should they successfully compete for success and praise. As the Rapa Nui built monumental statues in their race to “the top” (literally), the unintended consequences that came about as a result (e.g., the exploitation of their natural resources) destroyed their civilization. Zhao argues the same thing is happening in our country, with test scores being the most sought-after monuments, again, despite the consequences.

Zhao calls for mandatory lists of the side effects that come along with standardized testing, similar to something I wrote years ago in an article titled “Buyer, Be Aware: The Value-Added Assessment Model is One Over-the-Counter Product that May Be Detrimental to Your Health.” In this article I pushed for a Food and Drug Administration (FDA) approach to educational research that would serve as a model to protect the intellectual health of the U.S.: a simple approach that legislators and education leaders would have to follow when passing legislation or educational policies whose benefits and risks are known, or unknown.

Otherwise, he calls on all educators (and educational policymakers) to continuously ask themselves one question when test scores rise: “What did you give up to achieve this rise in scores?” When you choose something, what do you lose?

Do give it a watch!

New York’s VAM, by the American Institutes for Research (AIR)

A colleague of mine, Stephen Caldas, Professor of Educational Leadership at Manhattanville College, one of the “heavyweights” who recently visited New York to discuss the state’s teacher evaluation system, and who, according to Chalkbeat New York, “once called New York’s evaluation system ‘psychometrically indefensible,’” wrote me with a critique of New York’s VAM, which I decided to post for you all here.

His critique is of the 2013-2014 Growth Model for Educator Evaluation Technical Report, produced by the American Institutes for Research (AIR), which “describes the models used to measure student growth for the purpose of educator evaluation in New York State for the 2013-2014 School Year” (p. 1).

Here’s what he wrote:

I’ve analyzed this tech report, which for many would be a great sedative prior to sleeping. It’s the latest in a series of three reports by AIR paid for by the New York State Education Department. The truth of how good the growth models used by AIR really are, however, is buried deep in the report, in Table 11 (p. 31) and Table 20 (p. 44), both of which are recreated here.

[Table 11 and Table 20 from the report are recreated here.]

These tables give us indicators of how well the growth models predict current-year student English/language arts (ELA) and mathematics (MATH) scores by grade level and subject (i.e., the dependent variables). At the secondary level, an additional outcome, or dependent variable, is predicted: the number of Regents Exams a student passed for the first time in the current year. The unadjusted models only included prior academic achievement as predictor variables, and are shown for comparison purposes only. The adjusted models were the models actually used by the state to make the predictions that fed into teacher and principal effectiveness scores. In addition to using prior student achievement as a predictor, the adjusted prediction models included these additional predictor variables: student and school-level poverty status, student and school-level socio-economic status (SES), student and school-level English language learner (ELL) status, and scores on the New York State English as a Second Language Achievement Test (the NYSESLAT). These tables report a statistic called “Pseudo R-squared,” or just “R-squared,” and this statistic shows us the predictive power of the overall models.

To help interpret these numbers: if one observed a “1.0” (which one won’t), it would mean that the model was “100%” perfect (with no prediction error). One obtains the “percentage of perfect” (if you will) by moving the decimal point two places to the right. The difference between this percentage and 100 is then the “error,” or “e.”

With this knowledge, one can see in the adjusted ELA 8th grade model (Table 11) that the predictor variables altogether explain “74%” of the variance in current-year student ELA 8th grade scores (R-squared = 0.74). Conversely, this same model has 26% error (and this is one of the best models illustrated in the report). In other words, this particular prediction model cannot account for 26% of the variance in current ELA 8th grade scores, “all other things considered” (i.e., considering the predictor variables that are so highly correlated with test scores in the first place).

The prediction models at the secondary level are much, MUCH worse. If one looks at Table 20, one sees that in the worst model (adjusted ELA Common Core) the predictor variables together explain only 45% of the variance in student ELA Common Core test scores. Thus, this prediction model cannot account for 55% of the variance in these scores!!

While these are not terrible R-squared values for social science research, they are horrific values for a model used to make individual-level predictions at the teacher or school level with any degree of precision. Quite frankly, such predictions simply cannot be precise given these huge quantities of error. The chances that these models would precisely (with no error) predict a teacher’s or school’s ACTUAL student test scores are slim to none. Yet the results of these imprecise growth models can contribute up to 40% of a teacher’s effectiveness rating.

This high level of imprecision would explain why teachers like Sheri Lederman of Long Island, who is apparently a terrific fourth grade educator based on all kinds of data besides her most recent VAM scores, received an “ineffective” rating based on this flawed growth model (see prior posts here and here). She clearly has a solid basis for her lawsuit against the state of New York in which she claims her score was “arbitrary and capricious.”

This kind of information about all of the prediction error in these growth models needs to be in an executive summary at the front of these technical reports. The interpretation of this error should be in PLAIN LANGUAGE for the taxpayers who foot the bill for these reports, the policymakers who need to understand their findings, and the educators who suffer the consequences of such imprecision in measurement.
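To put rough numbers on the imprecision Caldas describes, here is a back-of-the-envelope calculation (mine, not AIR’s), assuming approximately normal prediction errors and expressing everything in student standard-deviation units:

```python
# What those R-squared values imply for individual-level prediction error,
# assuming roughly normal residuals. Scores are in student standard-deviation
# units. Back-of-the-envelope only.
import math

for label, r_squared in [("ELA grade 8 (adjusted), Table 11", 0.74),
                         ("ELA Common Core (adjusted), Table 20", 0.45)]:
    residual_sd = math.sqrt(1 - r_squared)        # typical size of a prediction error
    interval_halfwidth = 1.96 * residual_sd       # 95% prediction interval, +/- this much
    print(f"{label}: R^2 = {r_squared:.2f}, "
          f"typical error ~{residual_sd:.2f} SD, "
          f"95% interval ~+/-{interval_halfwidth:.2f} SD")
```

In other words, even the best model in the report cannot pin an individual student’s predicted score down to better than about plus or minus one full standard deviation. Aggregating predictions across a classroom shrinks this error somewhat, but it does not make it disappear, which is precisely Caldas’s point about using these models to rate individual teachers and schools.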

“Insanity Reigns” in New York

As per an article in Capitol Confidential, two weeks ago New York’s Governor Cuomo, the source of many posts, especially lately (see, for example, here, here, here, here, and here), was questioned about the school districts throughout New York that were requesting delays in implementing the state’s new teacher evaluation program. Cuomo was also questioned about students in his state who were opting out of the state’s tests.

In response, Cuomo “stressed that the tests used in the evaluations don’t affect the students’ grades.” In his direct words, “[t]he grades are meaningless to the students.”

Yet the tests are to be used to evaluate how effective New York’s teachers are? So, the tests are meaningless to students throughout the state, but the state is to use them to evaluate the effectiveness of students’ teachers throughout the state regardless? The tests won’t count for measuring student knowledge (ostensibly what the tests are designed to measure) but they will be used to evaluate teachers (which the tests were not designed to measure)?

In fact, the tests, as per Cuomo, “won’t count at all for the students…for at least the next five years.” Hence, students “can opt out if they want to.” Conversely, if a student decides to take the test, the student should consider it “practice” because, again, “the score doesn’t count.” Nor will it count for some time.

In other words, those of a colleague who sent me this article: “Cuomo’s answer to parents who are on the fence about opting out: ‘oh, it’s just practice.’ He expects that when parents hear that testing is low stakes for their kids they will not opt out, but once kids hear that the tests don’t count for them, how hard do you think they are going to try? Low stakes for students, high stakes for the teacher. Insanity reigns!”

This all brings into light the rarely questioned assumption about how the gains that students make on “meaningless” tests actually indicate how much “value” a teacher “adds” to or detracts from his/her students.

What is interesting to point out here is that with No Child Left Behind (NCLB), Governor turned President George W. Bush’s brainchild, the focus was entirely on student-level accountability (i.e., a student must pass a certain test or face the consequences). The goal was that 100% of America’s public school students would be academically proficient in reading and mathematics by 2014 – yes, last year.

When that clearly did not work as politically intended, the focus changed to teacher accountability, thanks to President Obama, his U.S. Secretary of Education Arne Duncan, and their 2009 Race to the Top competition. Approximately $4.35 billion in taxpayer revenues later, we now have educational policies focused on teacher, but no longer student, accountability, with similar results (or the lack thereof).

The irony here is that, for the most part, the students taking these tests are no longer to be held accountable for their performance, but their teachers are to be held accountable for their students’ performance instead, and regardless. Accordingly, across the country we now have justifiably nervous teachers who, without telling their students that their professional lives are on the line (which is true in many cases), or otherwise lying to their students (e.g., that their grades on these tests will be used to place them into college, which is false in all cases), could face serious consequences because of students who, as per Cuomo, don’t have to care about their test performance (e.g., for at least the next five years).

While VAMs certainly have a number of serious issues with which we must contend, this is another that is not often mentioned, made transparent, or discussed. Yet teachers across the country are living out this reality, in practice, every time they prepare their students for these tests.

So I suppose, within the insanity, we have Cuomo to thank for his comments here, as these alone make yet another reality behind VAMs all too apparent.

Unfortunate Updates from Tennessee’s New “Leadership”

Did I write in a prior post (from December, 2014) that “following the (in many ways celebrated) exit of Commissioner Huffman, it seems the state [of Tennessee] is taking an even more reasonable stance towards VAMs and their use(s) for teacher accountability?” I did, and I was apparently wrong…

Things in Tennessee, the state in which our beloved education-based VAMs were born (see here and here) and on which we have kept constant watch, have turned south, once again, thanks to new “leadership.” I use this term loosely to describe the author of the letter below, as featured in a recent article in The Tennessean. The letter’s author is the state’s new Commissioner of the Tennessee Department of Education, Candice McQueen, “a former Tennessee classroom teacher and higher education leader.”

Pasted below is what she wrote. You be the judge, and question her claims, particularly those I underlined for emphasis below.

This week legislators debated Governor Haslam’s plan to respond to educator feedback regarding Tennessee’s four-year-old teacher evaluation system. There’s been some confusion around the details of the plan, so I’m writing to clarify points and share the overall intent.

Every student has individual and unique needs, and each student walks through the school doors on the first day at a unique point in their education journey. It’s critical that we understand not only the grade the child earns at the end of the year, but also how much the child has grown over the course of the year.

Tennessee’s Value-Added Assessment System, or TVAAS, provides educators vital information about our students’ growth.

From TVAAS we learn if our low-achieving students are getting the support they need to catch up to their peers, and we also learn if our high-achieving students are being appropriately challenged so that they remain high-achieving. By analyzing this information, teachers can make adjustments to their instruction and get better at serving the unique needs of every child, every year.

We know that educators often have questions about how TVAAS is calculated and we are working hard to better support teachers’ and leaders’ understanding of this growth measure so that they can put the data to use to help their students learn (check out our website for some of these resources).

While we are working on improving resources, we have also heard concerns from teachers about being evaluated during a transition to a new state assessment. In response to this feedback from teachers, Governor Haslam proposed legislation to phase in the impact of the new TNReady test over the next three years.

Tennessee’s teacher evaluation system, passed as part of the First to the Top Act by the General Assembly in 2010 with support from the Tennessee Education Association, has always included multiple years of student growth data. Student growth data makes up 35 percent of a teacher’s evaluation and includes up to three years of data when available.

Considering prior and current year’s data together paints a more complete picture of a teacher’s impact on her student’s growth.

More data protects teachers from an oversized impact of a tough year (which teachers know happen from time to time). The governor’s plan maintains this more complete picture for teachers by continuing to look at both current and prior year’s growth in a teacher’s evaluation.

Furthermore, the governor’s proposal offers teachers more ownership and choice during this transition period.

If teachers are pleased with the growth score they earn from the new TNReady test, they can opt to have that score make up the entirety of their 35 percent growth measure. However, teachers may also choose to include growth scores they earned from prior TCAP years, reducing the impact of the new assessment.

The purpose of all of this is to support our student’s learning. Meeting the unique needs of each learner is harder than rocket science, and to achieve that we need feedback and information about how our students are learning, what is working well, and where we need to make adjustments. One of these critical pieces of feedback is information about our students’ growth.