Five “Indisputable” Reasons Why VAMs are Good?

Just this week, in Education Week — the field’s leading national newspaper covering K–12 education — a blogger by the name of Matthew Lynch published a piece explaining his “Five Indisputable [emphasis added] Reasons Why You Should Be Implementing Value-Added Assessment.”

I’m going to try to stay aboveboard with my critique of this piece, as best I can, though by the title alone you can infer there are certainly pieces (namely, five) of the author’s indisputable take on value-added (and, by default, value-added models (VAMs)) to be seriously criticized. I examine each of these assertions below, but I will say, overall and before we begin, that pretty much everything included in this piece is hardly palatable, and barely tolerable considering that Education Week published it; by publishing it they quasi-endorsed it, even if in an independent blog post that they likely, at minimum, reviewed before making public.

First, the five assertions, along with a simple response per assertion:

1. Value-added assessment moves the focus from statistics and demographics to asking of essential questions such as, “How well are students progressing?”

In theory, yes – this is generally true (see also my response about the demographics piece replicated in assertion #3 below). The problem here, though, as we all should know by now, is that once we move away from the theory in support of value-added, this theory more or less crumbles. The majority of the research on this topic explains and evidences the reasons why. Is value-added better than what “we” did before, however – measuring student achievement once per year without taking growth over time into consideration? Perhaps; if it worked as intended, for sure.

2. Value-added assessment focuses on student growth, which allows teachers and students to be recognized for their improvement. This measurement applies equally to high-performing and advantaged students and under-performing or disadvantaged students.

Indeed, the focus is on growth (see my response about growth in assertion #1 above). What the author of this post does not understand, however, is that his latter conclusion is likely THE most controversial issue surrounding value-added, and on this all topical researchers likely agree. In fact, the authors of the most recent review of what is actually called “bias” in value-added estimates, as published in the peer-reviewed Economics of Education Review (see a pre-publication version of this manuscript here), concluded that because of potential bias (i.e., “This measurement [does not apply] equally to high-performing and advantaged students and under-performing or disadvantaged students”), all value-added modelers should control for as many student-level (and other) demographic variables as possible to help minimize this potential, also given the extent to which multiple authors’ evidence of bias varies wildly (from negligible to considerable).

3. Value-added assessment provides results that are tied to teacher effectiveness, not student demographics; this is a much more fair accountability measure.

See my comment immediately above, with general emphasis added to this overly simplistic take on the extent to which VAMs yield “fair” estimates, free from the biasing effects (reported as occurring never to always) caused by such demographics. My “fairest” interpretation of the current albeit controversial research surrounding this particular issue is that bias does not exist across all teacher-level estimates, but it certainly occurs when teachers are non-randomly assigned highly homogenous sets of students who are gifted, who are English Language Learners (ELLs), who are enrolled in special education programs, who disproportionately represent racial minority groups, who disproportionately come from lower socioeconomic backgrounds, and who have previously been retained in grade.

4. Value-added assessment is not a stand-alone solution, but it does provide rich data that helps educators make data-driven decisions.

This is entirely false. There is no research evidence, still to date, that teachers use these data to make instructional decisions. Accordingly, no research is linked to or cited here (or elsewhere). Now, if the author is talking about naive “educators,” in general, who make consequential decisions based on poor (i.e., the opposite of “rich”) data, this assertion would be true. This “truth,” in fact, is at the core of the lawsuits ongoing across the nation regarding this matter (see, for example, here), with consequences ranging from tagging a teacher’s file for receiving a low value-added score to teacher termination.

5. Value-added assessment assumes that teachers matter and recognizes that a good teacher can facilitate student improvement.

Perhaps we have only value-added assessment to thank for “assuming” [sic] this. Enough said…

Or not…

Lastly, the author professes to be a “professor,” pretty much all over the place (see, again, here), although he is currently an associate professor. There is a difference, and folks who respect the difference typically make the distinction explicit and known, especially in an academic setting or context. See also here, however, regarding his expertise (or the lack thereof) in value-added or VAMs, and hence regarding what he writes here as “indisputable.”

Perhaps most important here, though, is that his falsely inflated professional title implies, especially to a naive or uncritical public, that what he has to say, again without any research support, commands some kind of credibility and respect. Unfortunately, this is just not the case; hence, we are again reminded of the need for general readers to be critical in their consumption of such pieces. I would have thought Education Week would have played a larger role in this than just putting this stuff “out there,” even if for simple debate or discussion.

Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

I recently re-read an article in full that is now 10 years old, or 10 years out, published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” Unfortunately, I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as most of the time this serves as a reminder of how things have not changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientist), Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the point at which the authors wrote this article, besides the aforementioned data and database issues, there were issues with “multiple measures on the same student and multiple teachers instructing each student” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” The authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (still) comes from influences outside of the teacher’s control (see also the illustrative sketch following this list). A similar statistic (i.e., 1% to 14%) was recently highlighted in the position statement on VAMs released by the American Statistical Association.
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here: Putting growth and value-added models on the map: A national overview.
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically unsolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” Really, the only solution here is to use “implicit imputation of values for unobserved gains using the observed scores,” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observe[d] gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.
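To make the scale of these variance percentages concrete, here is a minimal simulation sketch in Python. It is purely illustrative: the 10% teacher share, the number of teachers, and the class size are assumptions chosen for demonstration, not parameters from McCaffrey et al.’s models.

    import numpy as np

    rng = np.random.default_rng(42)

    # Illustrative assumption: teacher effects explain ~10% of total score
    # variance (within McCaffrey et al.'s 3%-20% range); everything else
    # (prior achievement, home factors, measurement error, etc.) explains ~90%.
    n_teachers, class_size = 1000, 25
    teacher_effects = rng.normal(0, np.sqrt(0.10), n_teachers)                  # var = 0.10
    other_influences = rng.normal(0, np.sqrt(0.90), (n_teachers, class_size))   # var = 0.90

    scores = teacher_effects[:, None] + other_influences  # one row of scores per classroom

    total_var = scores.var()
    # The raw variance of class means overstates the teacher share by the
    # sampling noise of each mean, so subtract the within-class noise term.
    between_var = scores.mean(axis=1).var() - scores.var(axis=1, ddof=1).mean() / class_size

    print(f"Estimated teacher share of score variance: {between_var / total_var:.1%}")
    # Prints roughly 10%, leaving ~90% of the variability attributable to
    # influences outside of the teacher's control.

Even under these favorable, made-up conditions (no bias, no missing data), the bulk of the variation in scores sits outside the teacher’s reach, which is the point both McCaffrey et al. and the American Statistical Association were making.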

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these conditions is actually true (or near to true), nor will any of them likely ever be true, in educational practice, however. Hence the errors we continue to observe, errors that continue to prevent VAMs from being used for their intended purposes, even with the sophisticated statistics meant to mitigate those errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs, as highlighted by Darling-Hammond per each of the three categories above, pertain to (1) the tests used to measure value-added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level, given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia, “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include: VAM output is grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output is “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticism is impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Relatedly, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given how rare it is in this area of research that we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (pp. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Kane Is At It, Again: “Statistically Significant” Claims Exaggerated to Influence Policy

In a recent post, I critiqued a fellow academic and value-added model (VAM) supporter — Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation. Kane has been the source of multiple posts on this blog (see also here, here, and here) as he is a very public figure, very often backing, albeit often in non-peer-reviewed technical reports and documents, series of exaggerated, “research-based” claims. In this prior post, I more specifically critiqued the overstated claims he made in a recent National Public Radio (NPR) interview titled: “There Is No FDA For Education. Maybe There Should Be.”

Well, a colleague recently emailed me another such document authored by Kane (and co-written with four colleagues), titled: “Teaching Higher: Educators’ Perspectives on Common Core Implementation.” While this one is quite methodologically sound (i.e., as assessed via a thorough read of the main text of the document, including all footnotes and appendices), it is Kane’s set of claims, again, that are of concern, especially knowing that this report, even though it too has not yet been externally vetted or reviewed, will likely have a policy impact. The main goal of this report is clearly (although not made explicit) to endorse, promote, and in many ways save the Common Core State Standards (CCSS). I emphasize the word save in that clearly, and especially since the passage of the Every Student Succeeds Act (ESSA), many states have rejected the still highly controversial Common Core. I also should note that the researchers clearly conducted this study with similar a priori conclusions in mind (i.e., that the Common Core should be saved/promoted); hence, future peer review of this piece may be out of the question, as the bias evident in the sets of findings would certainly be a “methodological issue,” again, likely preventing a peer-reviewed publication (see, for example, the a priori conclusion that “[this] study highlights an important advantage of having a common set of standards and assessments across multiple states” in the abstract (p. 3)).

First I will comment on the findings regarding the Common Core, as related to value-added models (VAMs). Next, I will comment on Section III of the report, about “Which [Common Core] Implementation Strategies Helped Students Succeed?” (p. 17). This is where Kane and colleagues “link[ed] teachers’ survey responses [about the Common Core] to their students’ test scores on the 2014–2015 PARCC [Partnership for Assessment of Readiness for College and Careers] and SBAC [Smarter Balanced Assessment Consortium] assessments [both of which are aligned to the Common Core Standards]… This allowed [Kane et al.] to investigate which strategies and which of the [Common Core-related] supports [teachers] received were associated with their performance on PARCC and SBAC,” controlling for a variety of factors including teachers’ prior value-added (p. 17).

With regards to the Common Core sections, Kane et al. lay claims like: “Despite the additional work, teachers and principals in the five states [that have adopted the Common Core = Delaware, Maryland, Massachusetts, New Mexico, and Nevada] have largely embraced [emphasis added] the new standards” (p. 3). They mention nowhere, however, the mediating set of influences interfering with such a claim, influences that likely lead to this claim entirely or at least in part – that many teachers across the nation have been forced, by prior federal and current state mandates (e.g., in New Mexico), to “embrace the new standards.” Rather, Kane et al. imply throughout the document that this “embracement” is a sure sign that teachers and principals are literally taking the Common Core into and within their open arms. The same interference is at play with their similar claim that “Teachers in the five study states have made major changes [emphasis in the original] in their lesson plans and instructional materials to meet the CCSS” (p. 3). Compliance is certainly an intervening factor, again, likely contaminating and distorting the validity of both of these claims (which are two of the four total claims highlighted throughout the document (p. 3)).

Elsewhere, Kane et al. claim that “The new standards and assessments represent a significant challenge for teachers and students” (p. 6), along with an accompanying figure they use to illustrate how proficiency (i.e., the percent of students labeled as proficient) on these five states’ prior tests has decreased, indicating more rigor or a more “significant challenge for teachers and students” thanks to the Common Core. What they completely ignore again, however, is that the cut scores used to define “proficiency” are arbitrary per state, as was their approach to define “proficiency” across states in comparison (see footnote four). What we also know from years of research on such tests is that whenever a state introduces a “new and improved” test (e.g., the PARCC and SBAC tests), which is typically tied to “new and improved standards” (e.g., the Common Core), lower “proficiency” rates are observed. This has happened countless times across states, and certainly prior to the introduction of the PARCC and SBAC tests. Thereafter, the state typically responds with the same types of claims, that “The new standards and assessments represent a significant challenge for teachers and students.” These claims are meant to signal to the public that at last “we” are holding our teachers and students accountable for their teaching and learning, but thereafter, again, proficiency cut scores are arbitrarily redefined (among other things), and then five or ten years later “new and improved” tests and standards are needed again. In other words, this claim is nothing new and it should not be interpreted as such, but it should rather be interpreted as aligned with Einstein’s definition of insanity (i.e., repeating the same behaviors over and over again in the hopes that different results will ultimately materialize) as this is precisely what we as a nation have been doing since the minimum competency era in the early 1980s.

Otherwise, Kane et al.’s other two claims were related to “Which [Common Core] Implementation Strategies Helped Students Succeed” (p. 17), as mentioned. They assert first that “In mathematics, [they] identified three markers of successful implementation: more professional development days, more classroom observations with explicit feedback tied to the Common Core, and the inclusion of Common Core-aligned student outcomes in teacher evaluations. All were associated with statistically significantly [emphasis added] higher student performance on the PARCC and [SBAC] assessments in mathematics” (p. 3, see also p. 20). They assert second that “In English language arts, [they] did not find evidence for or against any particular implementation strategies” (p. 3, see also p. 20).

What is highly problematic about these claims is that the three correlated implementation strategies noted, again as significantly associated with teachers’ students’ test-based performance on the PARCC and SBAC mathematics assessments, were “statistically significant” (as determined by standard p, or “probability,” values used to flag findings unlikely to have occurred by chance alone). But they were not really practically significant, at all. There IS a difference whereby “statistically significant” findings may not be “practically significant,” or in this case “policy relevant,” at all. While many misinterpret “statistical significance” as an indicator of strength or importance, it is not. Practical significance is.

As per the American Statistical Association’s (ASA) recently released “Statement on P-Values,” statistical significance “is not equivalent to scientific, human, or economic significance…Any effect, no matter how tiny, can produce a small p-value [i.e., “statistical significance”] if the sample size or measurement precision is high enough” (p. 10); hence, one must always check for practical significance when making claims about statistical significance, which Kane et al. do not do here; rather, they make such claims in a similarly inflated vein.
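To illustrate the ASA’s point with made-up numbers (these are not Kane et al.’s data; the 0.044 figure is simply borrowed from the coefficient discussed below as an assumed “true” effect), here is a minimal sketch showing that an effect this small becomes comfortably “statistically significant” once the sample is large enough, while remaining practically negligible.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical illustration: a "true" difference of 0.044 standard
    # deviations between two large groups of student test scores.
    n = 40_000
    treated = rng.normal(loc=0.044, scale=1.0, size=n)
    control = rng.normal(loc=0.000, scale=1.0, size=n)

    t_stat, p_value = stats.ttest_ind(treated, control)
    effect_size_sd = treated.mean() - control.mean()  # scores are already in SD units

    print(f"p-value: {p_value:.2e}")                # far below 0.05 -> "statistically significant"
    print(f"effect size: {effect_size_sd:.3f} SD")  # still ~0.04 SD -> practically negligible

Statistical significance here says only that the tiny difference is unlikely to be exactly zero; it says nothing about whether the difference is large enough to matter for policy.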

As their Table 6 shows (p. 20), the regression coefficients related to these three areas of “statistically significant” influence on teachers’ students’ test-based performance on the new PARCC and SBAC mathematics tests (i.e., more professional development days, more classroom observations with explicit feedback tied to the Common Core, and the inclusion of Common Core-aligned student outcomes in teacher evaluations) yielded the following coefficients, respectively: 0.045 (p < 0.01), 0.044 (p < 0.05), and 0.054 (p < 0.01). They then use as an example the 0.044 (p < 0.05) coefficient (as related to more classroom observations with explicit feedback tied to the Common Core) and explain that “a difference of one standard deviation in the observation and feedback index was associated with an increase of 0.044 standard deviations in students’ mathematics test scores—roughly the equivalent of 1.4 scale score points on the PARCC assessment and 4.1 scale score points on the SBAC.”

In order to generate a sizable and policy-relevant improvement in test scores (e.g., by half of a standard deviation), the observation and feedback index would have to jump up by roughly 11 standard deviations! In addition, given that scale score points do not equal raw or actual test items (e.g., scale score-to-actual test item relationships are typically in the neighborhood of 4 or 5 scale score points to 1 actual test item), this likely also means that Kane’s interpretations (i.e., mathematics scores were roughly the equivalent of 1.4 scale score points on the PARCC and 4.1 scale score points on the SBAC) actually mean 1/4th or 1/5th of a test item in mathematics on the PARCC and 4/5ths of, or one, test item on the SBAC. This hardly “Provides New Evidence on Strategies Related to Improved Student Performance,” unless you define improved student performance as something as little as 1/5th of a test item.
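For readers who want to trace the arithmetic in the paragraph above, here is a small sketch reproducing it. The 0.044 coefficient and the 1.4/4.1 scale-score figures come from the report as quoted; the 4-to-5-scale-points-per-item ratio and the 0.5 standard deviation benchmark are the approximations used in this post, not figures from Kane et al.

    # Back-of-the-envelope arithmetic behind the paragraph above.
    coefficient = 0.044   # SD gain in math scores per 1 SD of the observation/feedback index
    target_gain = 0.50    # a "sizable," policy-relevant improvement, in SD units

    index_shift_needed = target_gain / coefficient
    print(f"Index shift needed for a 0.5 SD gain: ~{index_shift_needed:.1f} SDs")  # ~11.4

    # Converting scale score points to (approximate) test items, assuming the
    # rough 4-to-5-scale-points-per-item relationship described in the text.
    for test, scale_points in [("PARCC", 1.4), ("SBAC", 4.1)]:
        low, high = scale_points / 5, scale_points / 4
        print(f"{test}: {scale_points} scale points ~ {low:.2f} to {high:.2f} test items")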

Nor is this what Kane et al. claim it to be, i.e., “a moderately sizeable effect” (p. 21)! These numbers should not even be reported, much less emphasized as policy relevant/significant, unless perhaps equivalent to at least 0.25 standard deviations on the test (as a quasi-standard/accepted minimum). Likewise, the same argument can be made about the other coefficients derived via these mathematics tests. See also similar claims that they asserted (e.g., that “students perform[ed] better when teachers [were] being evaluated based on student achievement” (p. 21)).

Because the abstract (and possibly the conclusions section) are the parts of this paper likely to have the most educational/policy impact, especially when people do not read all of the text, footnotes, and appendices of this entire document, this is irresponsible, and in many ways contemptible. This is also precisely the reason why, again, Kane’s calls for a Food and Drug Administration (FDA) type of entity for education are so ironic (as explained in my prior post here).

New Mexico’s Teacher Evaluation Trial Postponed Until October, w/Preliminary Injunction Still in Place

Last December in New Mexico, a Judge granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data as based on the state’s value-added model (VAM). More specifically, Judge David K. Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (which was set for April of 2016) that the system is reliable, valid, fair, uniform, and the like. See more details regarding Judge Thomson’s ruling in a previous post here: “Consequences Attached to VAMs Suspended Throughout New Mexico.” See more details about this specific lawsuit, sponsored by the American Federation of Teachers (AFT) New Mexico and the Albuquerque Teachers Federation (ATF), in a previous post here: “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” This is one of the cases on which I am continuing to serve as an expert witness.

Yesterday, however, and given another state-level lawsuit that is also ongoing regarding the state’s teacher evaluation system, although this one is sponsored by the National Education Association (NEA), Judge Thomson (apparently along with Judge Francis Mathew) pushed both the AFT-NM/ATF and NEA trials back to October of 2016, yielding a six-month delay for the AFT-NM/ATF hearing.

According to an article published this morning in the Santa Fe New Mexican, “To date, the [New Mexico] Public Education Department [PED] has been unsuccessful in its efforts to stop either suit or combine them;” hence, yesterday in court the state requested that the court postpone both hearings so that the state could introduce its new teacher evaluation system, on March 15 of 2016, along with its specifics and rules, as also based on the state’s new Partnership for Assessment of Readiness for College and Careers (PARCC) test data. Recall that the state’s Secretary of Education, Hanna Skandera, “is new chair of PARCC test board.” It is also anticipated, however, that the state’s new system is to still “rely heavily” (i.e., 50% weight) on VAMs. See also a related post about “New Mexico Chang[ing] its Teacher Evaluation System, But Not Really.”

This window of time is also to allow for the public forums needed to review the state’s new system, but also to allow time for “the acrimony to be resolved without trials.” The preliminary injunction granted by Judge Thomson in December, though, still remains in place. See also a related article, also published this morning, in the Albuquerque Journal.

Stephanie Ly, president of the AFT-NM, said she is not happy with the trial being postponed. She called this a “stalling tactic” to give the [state] education department more time to compile student achievement data that the plaintiffs have been requesting. “We had no option but to agree because they are withholding data,” she said.

Ly and ATF President Ellen Bernstein also responded yesterday via a joint statement, pasted in full below:

March 7, 2016

Contact: John Dyrcz — 505-554-8679

“The Public Education Department and Secretary Skandera have once again willfully delayed the AFT NM/ATF lawsuit against the current value added model [VAM] evaluation system due to their purposeful refusal to reveal the data being used to evaluate our educators in New Mexico.

“In addition to this stall tactic, and during a status hearing this morning in the First District Court, lawyers for the PED revealed that new rules and regulations were to be unveiled on March 15 by the PED, and would ‘rely heavily’ on VAM as a method of evaluation for educators.

“New Mexico educators will not cease in our fight against the abusive policies of this administration. Allowing PED or districts to terminate employees based on VAM and student test scores is completely unacceptable, it is unacceptable to allow PED or districts to refuse licensure advancement based upon VAM scores, and it is unacceptable for PED or districts to place New Mexico educators on growth plans based on faulty data.

“High-performing education systems have policies in place which respect and support their educators and use evaluations not as punitive measures but as opportunities for improvement. Educators, unions, and administrators should oversee the evaluation process to ensure it is thorough and of high quality, as well as fair and reliable. Educators, unions, and administrators should be involved in developing, implementing and monitoring the system to ensure it reflects good teaching well, that it operates effectively, that it is tied to useful learning opportunities for teachers, and that it produces valid results.

“It is well known the PED is in a current state of crisis with several high-level staff members abandoning the Department, an on-going whistle-blower lawsuit…the failure to produce meaningful changes to education in New Mexico during her six years as Secretary, and Skandera’s constant changes to the rules is a desperate attempt to right a sinking ship,” said Ly and Bernstein.

Tennessee’s Trout/Taylor Value-Added Lawsuit Dismissed

As you may recall, one of 15 important lawsuits pertaining to teacher value-added estimates across the nation (Florida n=2, Louisiana n=1, Nevada n=1, New Mexico n=4, New York n=3, Tennessee n=3, and Texas n=1 – see more information here) was situated in Knox County, Tennessee.

Filed in February of 2015, with legal support provided by the Tennessee Education Association (TEA), Knox County teachers Lisa Trout and Mark Taylor charged that they were denied monetary bonuses after their Tennessee Value-Added Assessment System (TVAAS — the original Education Value-Added Assessment System (EVAAS)) teacher-level value-added scores were miscalculated. This lawsuit was also to contest the reasonableness, rationality, and arbitrariness of the TVAAS system, as per its intended and actual uses in this case, but also in Tennessee writ large. On this case, Jesse Rothstein (University of California – Berkeley) and I were serving as the Plaintiffs’ expert witnesses.

Unfortunately, however, last week (February 17, 2016) the Plaintiffs’ team received a Court order written by U.S. District Judge Harry S. Mattice Jr. dismissing their claims. While the Court had substantial questions about the reliability and validity of the TVAAS, the Court determined that the State satisfied the very low threshold of the “rational basis test” at legal issue. I should note here, however, that all of the evidence that the lawyers for the Plaintiffs collected via their “extensive discovery,” including the affidavits both Jesse and I submitted on the Plaintiffs’ behalves, was unfortunately not considered in Judge Mattice’s ruling on the motion to dismiss. This, perhaps, makes sense given some of the assertions made by the Court, forthcoming.

Ultimately, the Court found that the TVAAS-based, teacher-level value-added policy at issue was “rationally related to a legitimate government interest.” As per the Court order itself, Judge Mattice wrote that “While the court expresses no opinion as to whether the Tennessee Legislature has enacted sound public policy, it finds that the use of TVAAS as a means to measure teacher efficacy survives minimal constitutional scrutiny. If this policy proves to be unworkable in practice, plaintiffs are not to be vindicated by judicial intervention but rather by democratic process.”

Otherwise, as per an article in the Knoxville News Sentinel, Judge Mattice was “not unsympathetic to the teachers’ claims,” for example, given the TVAAS measures “student growth — not teacher performance — using an algorithm that is not fail proof.” He conversely noted, however, in the Court order that the “TVAAS algorithms have been validated for their accuracy in measuring a teacher’s effect on student growth,” even if minimal. He also wrote that the test scores used in the TVAAS (and other models) “need not be validated for measuring teacher effectiveness merely because they are used as an input in a validated statistical model that measures teacher effectiveness.” This is, unfortunately, untrue. Nonetheless, he continued to write that even though the rational basis test “might be a blunt tool, a rational policymaker could conclude that TVAAS is ‘capable of measuring some marginal impact that teachers can have on their own students…[and t]his is all the Constitution requires.’”

In the end, Judge Mattice concluded in the Court order that, overall, “It bears repeating that Plaintiff’s concerns about the statistical imprecision of TVAAS are not unfounded. In addressing Plaintiffs’ constitutional claims, however, the Court’s role is extremely limited. The judiciary is not empowered to second-guess the wisdom of the Tennessee legislature’s approach to solving the problems facing public education, but rather must determine whether the policy at issue is rationally related to a legitimate government interest.”

It is too early to know whether the Plaintiffs’ team will appeal, although Judge Mattice dismissed the federal constitutional claims within the lawsuit “with prejudice.” As per an article in the Knoxville News Sentinel, this means that “it cannot be resurrected with new facts or legal claims or in another court. His decision can be appealed, though, to the 6th Circuit U.S. Court of Appeals.”

In Schools, Teacher Quality Matters Most

Education Next — a non-peer-reviewed journal with a mission to “steer a steady course, presenting the facts as best they can be determined…[while]…partak[ing] of no program, campaign, or ideology,” although these last claims are certainly of controversy (see, for example, here and here) — just published an article titled “In Schools, Teacher Quality Matters Most” as part of the journal’s series commemorating the 50th anniversary of James Coleman’s (and colleagues’) groundbreaking 1966 report, “Equality of Educational Opportunity.”

For background, the purpose of The Coleman Report was to assess the equal educational opportunities provided to children of different race, color, religion, and national origin. The main finding was that what we know today as students of color (although African American students were of primary focus in this study), who are (still) often denied equal educational opportunities due to a variety of factors, are largely and unequally segregated across America’s public schools, especially as segregated from their white and wealthier peers. These disparities were most notable via achievement measures, and what we know today as “the achievement gap.” Accordingly, Coleman et al. argued that equal opportunities for students in said schools mattered (and continue to matter) much more for these traditionally marginalized and segregated students than for those who were/are whiter and more economically fortunate. In addition, Coleman argued that out-of-school influences also mattered much more than in-school influences on said achievement. On this point, though, The Coleman Report was of great controversy, and (mis)interpreted as (still) supporting arguments that students’ teachers and schools do not matter as much as students’ families and backgrounds do.

Hence, the Education Next article of focus in this post takes this up, 50 years later, and post the advent of value-added models (VAMs) as better measures than those to which Coleman and his colleagues had access. The article is authored by Dan Goldhaber — a Professor at the University of Washington Bothell, Director of the National Center for Analysis of Longitudinal Data in Education Research (CALDER), and a Vice-President at the American Institutes for Research (AIR). AIR is one of the nation’s largest VAM consulting/contract firms, and Goldhaber is, accordingly, perhaps one of the field’s most vocal proponents of VAMs and their capacities to both measure and increase teachers’ noteworthy effects (see, for example, here); hence, it makes sense that he writes about said teacher effects in this article, and in this particular journal (see, for example, Education Next’s Editorial and Editorial Advisory Board members here).

Here is his key claim.

Goldhaber argues that The Coleman Report’s “conclusions about the importance of teacher quality, in particular, have stood the test of time, which is noteworthy, [especially] given that today’s studies of the impacts of teachers [now] use more-sophisticated statistical methods and employ far better data” (i.e., VAMs). Accordingly, Goldhaber’s primary conclusion is that “the main way that schools affect student outcomes is through the quality of their teachers.”

Note that Goldhaber does not offer much evidence in this article, other than evidence (not formally cited) provided by some of his econometric friends (e.g., Raj Chetty). Likewise, Goldhaber cites none of the literature coming from educational statistics, even though recent estimates [1] suggest that approximately 83% of articles written since 1893 (the year in which the first article about VAMs was ever published, in the Journal of Political Economy) on this topic have been published in educational journals, and 14% have been published in economics journals (3% have been published in education finance journals). Hence, what we are clearly observing as per the literature on this topic are severe slants in perspective, especially when articles such as these are written by econometricians, versus educational researchers and statisticians, given that the former often marginalize the research of their education, discipline-based colleagues.

Likewise, Goldhaber does not cite or situate any of his claims within the recent report released by the American Statistical Association (ASA), in which it is written that “teachers account for about 1% to 14% of the variability in test scores.” While teacher effects do matter, they do not matter nearly as much as many, including many/most VAM proponents including Goldhaber, would like us to naively accept and believe. The truth of the matter is that teachers do indeed matter, in many ways including their impacts on students’ affects, motivations, desires, aspirations, senses of efficacy, and the like, all of which are not estimated on the large-scale standardized tests that continue to matter and that are always the key dependent variables across these and all VAM-based studies today. As Coleman argued 50 years ago, and as recently verified by the ASA, students’ out-of-school and out-of-classroom environments matter more, as per these dependent variables or measures.

I think I’ll take ASA’s “word” on this, also as per Coleman’s research 50 years prior.

*****

[1] Reference removed as the manuscript is currently under blind peer-review. Email me if you have any questions at audrey.beardsley@asu.edu

You Are Invited to Participate in the #HowMuchTesting Debate!

As the scholarly debate about the extent and purpose of educational testing rages on, the American Educational Research Association (AERA) wants to hear from you. During a key session at its Centennial Conference this spring in Washington DC, titled How Much Testing and for What Purpose? Public Scholarship in the Debate about Educational Assessment and Accountability, prominent educational researchers will respond to questions and concerns raised by YOU: parents, students, teachers, community members, and the public at large.

Hence, any and all of you with an interest in testing, value-added modeling, educational assessment, educational accountability policies, and the like are invited to post your questions, concerns, and comments using the hashtag #HowMuchTesting on Twitter, Facebook, Instagram, Google+, or the social media platform of your choice, as these are the posts to which AERA’s panelists will respond.

Organizers are interested in all #HowMuchTesting posts, but they are particularly interested in video-recorded questions and comments of 30 – 45 seconds in duration so that you can ask your own questions, rather than having them read by a moderator. In addition, in order to provide ample time for the panel of experts to prepare for the discussion, comments and questions posted by March 17 have the best chance of inclusion in the debate.

Thank you all in advance for your contributions!!

To read more about this session, from the session’s organizer, click here.

New York Teacher Sheri Lederman’s Lawsuit Update

Recall the New York lawsuit pertaining to Long Island teacher Sheri Lederman? The teacher who, by all accounts other than her recent (2013-2014) growth score of 1 out of 20, is a terrific 4th grade, 18-year veteran teacher. She, along with her attorney and husband Bruce Lederman, is suing the state of New York to challenge the state’s growth-based teacher evaluation system. See prior posts about Sheri’s case here, here, and here. I, along with Linda Darling-Hammond (Stanford), Aaron Pallas (Columbia University Teachers College), Carol Burris (Executive Director of the Network for Public Education Foundation), Brad Lindell (Long Island Research Consultant), Sean Corcoran (New York University), and Jesse Rothstein (University of California – Berkeley), are serving as part of Sheri’s team.

Bruce Lederman just emailed me with an update, and some links re: this update (below), and he gave me permission to share all of this with you.

The judge hearing this case recently asked the lawyers on both sides of Sheri’s case to brief the court by the end of this month (February 29, 2016) on a new issue, positioned and pushed back into the court by the New York State Education Department (NYSED). The issue to be heard pertains to the state’s new “moratorium” or “emergency regulations” related to the state’s high-stakes use of its growth scores, all of which is likely related to the political reaction to the opt-out movement throughout the state of New York, the publicity pertaining to the Lederman lawsuit in and of itself, and the federal government’s adoption of the recent Every Student Succeeds Act (ESSA) given its specific provision that now permits states to decide whether (and if so how) to use teachers’ students’ test scores to hold teachers accountable for their levels of growth (in New York) or value-added.

While the federal government did not abolish such practices via its ESSA, the federal government did hand back to the states all power and authority over this matter. Accordingly, this does not mean growth models/VAMs are going to simply disappear, as states do still have the power and authority to move forward with their prior and/or their new teacher evaluation systems, based, in small or large part, on growth models/VAMs. As also quite evident since President Obama’s signing of the ESSA, some states are continuing to move forward in this regard, and regardless of the ESSA, in some cases at even higher speeds than before, in support of what some state policymakers still apparently believe (despite the research) are the accountability measures that will still help them to (symbolically) support educational reform in their states. See, for example, prior posts about the state of Alabama, here, New Mexico, here, and Texas, here, which is still moving forward with its plans introduced pre-ESSA. See prior posts about New York here, here, and here, the state in which, just one year ago, Governor Cuomo was promoting increased use of New York’s growth model and publicly proclaiming that it was “baloney” that more teachers were not being found “ineffective,” after which Cuomo pushed through the New York budget process amendments to the law increasing the weight of teachers’ growth scores to an approximate 50% weight in many cases.

Nonetheless, as per this case in New York, state Attorney General Eric Schneiderman, on behalf of the NYSED, offered to settle this lawsuit out of court by giving Sheri some accommodation on her aforementioned 2013-2014 score of 1 out of 20, if Sheri and Bruce dropped the challenge to the state’s VAM-based teacher evaluation system. Sheri and Bruce declined, for a number of reasons, including that under the state’s recent “moratorium,” the state’s growth model is still set to be used throughout the state of New York for the next four years, with teachers’ annual performance reviews based in part on growth scores reported to parents, newspapers (on an aggregate basis), and the like. While, again, high stakes are not to be attached to the growth output for four years, the scores will still “count.”

Hence, Sheri and Bruce believe that because they have already “convincingly” shown that the state’s growth model does not “rationally” work for teacher evaluation purposes, and that teacher evaluations as based on the state’s growth model actually violate state law since teachers like Sheri are not capable of getting perfect scores (which is “irrational”), they will continue with this case, also on behalf of New York teachers and principals who are “demoralized” by the system, as well as New York taxpayers who are paying millions, “if not tens of millions of dollars,” for the system’s (highly) unreliable and inaccurate results.

As per Bruce’s email: “Spending the next 4 years studying a broken system is a terrible idea and terrible waste of taxpayer $$s. Also, if [NYSED] recognizes that Sheri’s 2013-14 score of 1 out of 20 is wrong [which they apparently recognize given their offer to settle this suit out of court], it’s sad and frustrating that [NYSED] still wants to fight her score unless she drops her challenge to the evaluation system in general.”

“We believe our case is already responsible for the new administrative appeal process in NY, and also partly responsible for Governor Cuomo’s apparent reversal on his stand about teacher evaluations. However, at this point we will not settle and allow important issues to be brushed under the carpet. Sheri and I are committed to pressing ahead with our case.”

To read more about this case via a Politico New York article click here (registration required). To hear more from Bruce Lederman about this case via WCNY-TV, Syracuse, click here. The pertinent section of this interview starts at 22:00 minutes and ends at 36:21. It’s well worth listening to!

New Mexico to Change its Teacher Evaluation System, But Not Really

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” In December 2015, state District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court that the system is reliable, valid, fair, uniform, and the like (see prior post on this ruling here).

Late Friday afternoon, New Mexico’s Public Education Department (PED) announced that they are accordingly changing their NMTEACH teacher evaluation system, and they will be issuing new regulations. Their primary goal is as follows: To (1) “Address major liabilities resulting from litigation” as these liabilities specifically pertain to the former NMTEACH system’s (a) Uniformity, (b) Transparency, and (c) Clarity. On the surface, this is gratifying to the extent that the state is attempting to, at least theoretically, please the court. But we, and especially those in New Mexico, might refrain from celebrating too soon, given that when one reads the PED announcement here, one will see this is yet another example of the state’s futile attempts to keep in place a very top-down teacher evaluation system. Note, however, that a uniform teacher evaluation system is also required under state law, although the governor has the right to change state statute should those at the state (including the governor, state superintendent, and PED) decide to eventually work with local districts and schools regarding the construction of a better teacher evaluation system for the state.

As per the PED’s subsequent goals, accordingly, things do not look much different than they did in the past, especially given why and what got the state involved in such litigation in the first place. While the state also intends to (2) Simplify processes for districts/charters and also for the PED, which is more or less fair, the state is also to (3) Establish a timeline for providing districts and schools more current data, in that currently such data are delayed by one school year, and these data are (still) needed for the state’s Pay for Performance plans (which were considered one high-stakes consequence at issue in Judge Thomson’s ruling). A tertiary goal is also to deliver such data in a more timely fashion to teacher preparation programs, which is something also of great controversy, as (uninformed) policymakers continue to believe that colleges of education should also be held accountable for the test scores of their graduates’ students (see why this is problematic, for example, here). In the state’s final expressed goal, they make it explicit that (4) “Moving the timeline enhances the understanding that this system isn’t being used for termination decisions.” While this is certainly good, at least for now, the performance pay program is still something that is of concern. As are the state’s continued attempts to (still) use students’ test scores to evaluate teachers, and the state’s perpetual beliefs that the data errors also exposed by the lawsuit were the fault of the school districts, not the state, which Judge Thomson also noted.

Regardless, here is the state’s “Legal Rationale,” and here is also where things go a bit more awry. As re-positioned by the state/PED, they write that “the NEA and AFT recently advanced lawsuits set on eliminating any meaningful teacher evaluation [emphasis added to highlight the language that the state is using to distort the genuine purposes of these lawsuits]. These lawsuits have exposed that the flexibility provided to local authorities has created confusion and complexity. Judge Thomson used this complexity when granting an injunction in the AFT case—citing a confusing array of classifications, tags, assessments, graduated considerations, etc. Judge Thomson made clear that he views this local authority as a threat to the statutorily required uniformity of the system [emphasis added given Judge Thomson said nothing of this sort, in terms of devaluing local authority or control; rather, he emphasized that the state’s menu of options was arbitrary and not uniform, especially given the consequences the state was requiring districts to enforce].” This, again, pertains to what is written in the current state statute in terms of a uniform teacher evaluation system.

Accordingly, and unfortunately, the state’s proposed changes would: “Provide a single plan that all districts and charters would use, providing greater uniformity,” and “Simplify the model from 107 possible classifications to three.” See three other moves detailed in the PED announcement here (e.g., moving data delivery dates, eliminating all but three tests, and the fall 2016 date which all of this is to become official).

Relatedly, see below a visual of what the state’s “new and improved” teacher evaluation system, in response to said litigation, is to look like. Unfortunately, again, it really does not look much different than it did prior except, perhaps, in the proposed reductions of testing options. See also the full document from which all of this came here.

[Image: diagram of the state’s proposed NMTEACH teacher evaluation system]

Nonetheless, we will have to wait to see if this, again, will please the court, and Judge Thomson’s ruling that the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court that the system is reliable, valid, etc.

And as for what the President of the American Federation of Teachers (AFT) New Mexico – Stephanie Biondo-Ly – had to say in response, see her press release below. See also an article in the Las Cruces – Sun Times here, in which President Ly is cited as “denounc[ing] the changes and call[ing] them attempts to obscure deficiencies in the [state’s] evaluation system.” From her original press release, she also wrote: “We are troubled…that once again, these changes are being implemented from the top down and if the secretary [Hanna Skandera] and her [PED] staff were serious about improving student outcomes and producing a fair evaluation system, they would have involved teachers, principals, and superintendents in the process.”
