The ACT Testing Corporation (Unsurprisingly) Against America’s Opt-Out Movement

The Research and Evaluation section/division of the ACT testing corporation (ACT, Inc., the nonprofit also famously known for developing the college-entrance ACT test) recently released a policy issue brief titled “Opt-Outs: What Is Lost When Students Do Not Test.” What an interesting read, especially given ACT’s position and perspective as a testing company that is itself likely being impacted by America’s opt-out-of-testing movement. Should it not be a rule that people writing on policy issues disclose all potential conflicts of interest? They did not here…

Regardless, last year throughout the state of New York, approximately 20% of students opted out of statewide testing. In the state of Washington more than 25% of students opted out. Large and significant numbers of students also opted out in Colorado, Florida, Oregon, Maine, Michigan, New Jersey, and New Mexico. Students are opting out, primarily because of community, parent, and student concerns about the types of tests being administered, the length and number of the tests administered, the time that testing and testing preparation takes away from classroom instruction, and the like.

Because many states also rely on ACT tests for statewide, not just college-entrance, purposes, clearly this is of concern to ACT, Inc. But rather than the corporation rightfully positioning itself on this matter as a company with clear vested interests, ACT Issue Brief author Michelle Croft frames the piece as a genuine plea to help others understand why they should reject the opt-out movement, not opt out their own children, generally help to curb and reduce the nation’s opt-out movement, and the like, given the movement’s purportedly negative effects.

Here are some of the reasons ACT’s Croft gives in support of not opting out, along with my research-informed commentaries per reason:

  • Scores on annual statewide achievement tests can provide parents, students, educators, and policymakers with valuable information—but only if students participate. What Croft does not note here is that such large scale standardized test scores, without taking into account growth over time (an argument that actually works in favor of VAMs), are so highly correlated with student test-takers’ demographics that test scores do not often tell us much that we would not have known otherwise from what student demographics alone tell us. This is a very true, and also very unfortunate reality, whereby with a small set of student demographics we can actually predict with great (albeit imperfect) certainty students’ test scores without students taking the tests. In other words, if 100% of students opted out, we could still use some of even our most rudimentary statistical techniques to determine what students’ scores would have been regardless; hence, this claim is false.
  • Statewide test scores are one of the most readily available forms of data used by educators to help inform instruction. This is also patently false. Teachers, on average and as per the research, do not use the “[i]ndividual student data [derived via these tests] to identify general strengths and weaknesses, [or to] identify students who may need additional support” for many reasons, including the fact that test scores often come back to teachers after their tested students have moved on to the next grade level. This is especially true when these tests are compared with tests administered not at the state level but at the district, school, or classroom levels, which yield data that are much more instructionally useful. What Croft does not note is that many research studies, and researchers, have evidenced that the types of tests at the source of the opt-out movement are also the least instructionally useful (see a prior post on this topic here). Accordingly, Croft’s claim here also contradicts recent research written by some of the luminaries in the field of educational measurement, who collectively support the design of more instructionally useful and sensitive tests in general, to combat the perpetual claims like these surrounding large scale standardized tests (see here).
  • Statewide test scores allow parents and educators to see how students measure up to statewide academic standards intended for all students in the state…[by providing] information about a student’s, school’s, or district’s standing compared to others in the state (or across states, if the assessment is used by more than one). See my first argument about student-level demographics, as the same holds true here. Whether these tests are better indicators of what students learned or of students’ demographics is certainly up for debate, and unfortunately most of the research evidence supports the latter (unless, perhaps, VAMs or growth models are used to measure large scale growth over time).
  • Another benefit…is that the data gives parents an indicator of school quality that can help in selecting a school for their children. See my prior argument, again, especially in that test scores are also highly correlated with property/house values; hence, with great certainty one can pick a school just by picking a home one can afford or a neighborhood in which one would like to live, regardless of test scores, as the test scores of the surrounding schools will ultimately reveal themselves to match said property/house values.
  • While grades are important, they [are not as objective as large-scale test scores because they] can also be influenced by a variety of factors unrelated to student achievement, such as grade inflation, noncognitive factors separate from achievement (such as attendance and timely completion of assignments), unintentional bias, or unawareness of performance expectations in subsequent grades (e.g., what it means to be prepared for college). Large-scale standardized tests, of course, are not subject to such biases and unrelated influences, we are to assume and accept as an objective truth.
  • Opt-outs threaten the overall accuracy—and therefore the usefulness—of the data provided. Indeed, this is true, and also one of the arguably positive side-effects of the opt-out movement, whereby without large enough samples of students participating in such tests, the extent to which test companies and others can draw generalizable conclusions about, in this case, larger student populations is statistically limited. Given the fact that we have been relying on large-scale standardized tests to reform America’s education system for more than 30 years now, yet we continue to face an “educational crisis” across America’s public schools, perhaps test-based reform policies are not the solution that testing companies like ACT, Inc. continue to argue they are. While perpetuating this argument in favor of reform is financially wise and lucrative, all at the taxpayer’s expense, little to no research exists to support that using such large scale test-based information helps to reform or improve much of anything.
  • Student assessment data allows for rigorous examination of programs and policies to ensure that resources are allocated towards what works. The one thing large scale standardized tests do help us do, especially as researchers and program evaluators, is help us examine and assess large-scale programs’ and other reform efforts’ impacts. Whether students should have to take tests for just this purpose, however, may also not be worth the nation’s and states’ financial and human resources and investments. With this, most scholars also agree, but more so now when VAMs are used for such large-scale research and evaluation purposes. VAMs are, indeed, a step in the right direction when we are talking about large-scale research.
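
The first commentary above argues that scores can be predicted from demographics with rudimentary statistical techniques. A minimal sketch of that mechanic, in Python, is below. Everything in it is an assumption for illustration: the data are synthetic, the single `demographic_index` predictor is hypothetical, and the strength of the relationship is built in by construction. The output therefore shows how such a prediction is made, not how well it would work on real students.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """Share of score variance reproduced by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic, hypothetical data: NOT real students or real test scores.
rng = random.Random(0)
demographic_index = [rng.uniform(0, 100) for _ in range(500)]
scores = [80 - 0.4 * d + rng.gauss(0, 6) for d in demographic_index]

intercept, slope = fit_line(demographic_index, scores)
fit_quality = r_squared(demographic_index, scores, intercept, slope)

# Predict a score for a student who never sat the test.
predicted_score = intercept + slope * 35.0
print(f"slope={slope:.2f}, R^2={fit_quality:.2f}, predicted={predicted_score:.1f}")
```

In practice one would fit on prior years' data with several demographic predictors; the single-predictor version is used here only to keep the arithmetic visible.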

Author Croft, on behalf of ACT, then makes a series of recommendations to states regarding such large scale testing, again, to help curb the opt out movement. Here are their four recommendations, again, alongside my research-informed commentaries per recommendation:

  • Districts should reduce unnecessary testing. Interesting, here, is that states are not listed as an additional entity that should reduce unnecessary testing. See my prior comments, especially the one regarding the most instructionally useful tests being at the classroom, school, and/or district levels.
  • Educators and policymakers should improve communication with parents about the value gained from having all students take the assessments. Unfortunately, I would not start with the list provided in this piece. Perhaps this blog post will, however, help present a fairer interpretation of their recommendations and the research-based truths surrounding them.
  • Policymakers should discourage opting out…States that allow opt-outs should avoid creating laws, policies, or communications that suggest an endorsement of the practice. Such a recommendation is remiss, in my opinion, given the vested interests of the company making this recommendation.
  • Policymakers should support appropriate uses of test scores. I think we can all agree with this one, although large-scale test scores should not be used and promoted for accountability purposes, as also suggested herein, given the research does not support that doing this actually works either. For a great, recent post on this, click here.

In the end, all of these recommendations, as well as the reasons that the opt-out movement should be thwarted, are coming via an Issue Brief authored and sponsored by a large-scale testing company. This fact, in and of itself, calls into question everything they position as a set of disinterested recommendations and reasons. This is unfortunate for ACT, Inc., given its role as the author and sponsor of this piece.

Some Lawmakers Reconsidering VAMs in the South

A few weeks ago in Education Week, Stephen Sawchuk and Emmanuel Felton wrote a post in its Teacher Beat blog about lawmakers, particularly in the southern states, who are beginning to reconsider, via legislation, the role of test scores and value-added measures in their states’ teacher evaluation systems. Perhaps the tides are turning.

I tweeted this one out, but I also pasted this (short) one below to make sure you all, especially those of you teaching and/or residing in states like Georgia, Oklahoma, Louisiana, Tennessee, and Virginia, did not miss it.

Southern Lawmakers Reconsidering Role of Test Scores in Teacher Evaluations

After years of fierce debates over effectiveness and fairness of the methodology, several southern lawmakers are looking to minimize the weight placed on so called value-added measures, derived from how much students’ test scores changed, in teacher-evaluation systems.

In part because these states are home to some of the weakest teachers unions in the country, southern policymakers were able to push past arguments that the state tests were ill suited for teacher-evaluation purposes and that the system would punish teachers for working in the toughest classrooms. States like Louisiana, Georgia, and Tennessee became some of the earliest and strongest adopters of the practice. But in the past few weeks, lawmakers from Baton Rouge, La., to Atlanta have introduced bills to limit the practice.

In February, the Georgia Senate unanimously passed a bill that would reduce the student-growth component from 50 percent of a teacher’s evaluation down to 30 percent. Earlier this week, nearly 30 individuals signed up to speak on behalf of the bill at a State House hearing.

Similarly, Louisiana House Bill 479 would reduce student-growth weight from 50 percent to 35 percent. Tennessee House Bill 1453 would reduce the weight of student-growth data through the 2018-2019 school year and would require the state Board of Education to produce a report evaluating the policy’s ongoing effectiveness. Lawmakers in Florida, Kentucky, and Oklahoma have introduced similar bills, according to the Southern Regional Education Board’s 2016 educator-effectiveness bill tracker.

By and large, states adopted these test-score centric teacher-evaluation systems to attain waivers from No Child Left Behind’s requirement that all students be proficient by 2014. To get a waiver, states had to adopt systems that evaluated teachers “in significant part, based on student growth.” That has looked very different from state to state, ranging from 20 percent in Utah to 50 percent in states like Alaska, Tennessee, and Louisiana.

No Child Left Behind’s replacement, the Every Student Succeeds Act, doesn’t require states to have a teacher-evaluation system at all, but, as my colleague Stephen Sawchuk reported, the nation’s state superintendents say they remain committed to maintaining systems that regularly review teachers.

But, as Sawchuk reported, Steven Staples, Virginia’s state superintendent, signaled that his state may move away from its current system where student test scores make up 40 percent of a teacher’s evaluation:

“What we’ve found is that through our experience [with the NCLB waivers], we have had some unintended outcomes. The biggest one is that there’s an over-reliance on a single measure; too many of our divisions defaulted to the statewide standardized test … and their feedback was that because that was a focus [of the federal government], they felt they needed to emphasize that, ignoring some other factors. It also drove a real emphasis on a summative, final evaluation. And it resulted in our best teachers running away from our most challenged.”

Some state lawmakers appear to be absorbing a similar message.

Contradictory Data Complicate Definitions of, and Attempts to Capture, “Teacher Effects”

A researcher from Brown University, Matthew Kraft, and his student recently released a “working paper” in which I think you all will (and should) be interested. The study is about “…Teacher Effects on Complex Cognitive Skills and Social-Emotional Competencies,” that is, those effects that are beyond “just” test scores (see also a related article on this working piece released in The Seventy Four). This one is 64 pages long, but here are the (condensed) highlights as I see them.

The researchers use data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) Project “to estimate teacher effects on students’ performance on cognitively demanding open-ended tasks in math and reading, as well as their growth mindset, grit, and effort in class.” They find “substantial variation in teacher effects on complex task performance and social-emotional measures. [They] also find weak relationships between teacher effects on state standardized tests, complex tasks, and social-emotional competencies” (p. 1).

More specifically, researchers found that: (1) “teachers who are most effective at raising student performance on standardized tests are not consistently the same teachers who develop students’ complex cognitive abilities and social-emotional competencies” (p. 7); (2) “While teachers who add the most value to students’ performance on state tests in math do also appear to strengthen their analytic and problem-solving skills, teacher effects on state [English/language arts] tests are only moderately correlated with open-ended tests in reading” (p. 7); and (3) “[T]eacher effects on social-emotional measures are only weakly correlated with effects on state achievement tests and more cognitively demanding open-ended tasks” (p. 7).

The ultimate finding, then, is that “teacher effectiveness differs across specific abilities” and definitions of what it means to be an effective teacher (p. 7). Likewise, the authors concluded that essentially all current teacher evaluation systems are not mapping onto similar definitions of teacher quality or effectiveness, as “we” have theorized and continue to theorize. This holds even though the measures included within the MET studies are/were some of the best available, given the multiple sources of data MET researchers included (e.g., traditional tests, indicators capturing complex cognitive skills and social-emotional competencies, observations, student surveys).

Hence, attaching high-stakes consequences to data, especially when multiple data yield contradictory findings as based on how one might define effective teaching or its most important components (e.g., test scores v. affective, socio-emotional effects), is (still) not yet warranted, whereby an effective teacher here might not be an effective teacher there, even if defined similarly in like schools, districts, or states. As per Kraft, “while high-stakes decisions may be an important part of teacher evaluation systems, we need to decide on what we value most when making these decisions rather than just using what we measure by default because it is easier.” Ignoring such complexities will not make some of the high-stakes decisions that some states still want to attach to such data, derived via multiple measures and sources, any more standard, uniform, or defensible, given that data defined differently even within standard definitions of “effective teaching” continue to contradict one another.

Accordingly, “[q]uestions remain about whether those teachers and schools that are judged as effective by state standardized tests [and the other measures] are also developing the skills necessary to succeed in the 21st century economy” (p. 36). Likewise, it is not necessarily the case that teachers defined as high value-added teachers, using these common indicators, are indeed high value-added teachers given they are the same teachers defined as low value-added teachers when different aspects of “effective teaching” are also examined.
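
To make concrete why weakly correlated effects imply that “an effective teacher here might not be an effective teacher there,” here is a minimal simulation sketch in Python. It is not a reanalysis of the MET data: the teacher effects are randomly generated, and the correlation values (0.2 for “weak,” 0.9 for “strong”) and the top-20-percent cutoff are illustrative assumptions, not figures from the working paper.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def top_group(values, fraction=0.2):
    """Indices of the teachers ranked in the top `fraction` on one measure."""
    k = int(len(values) * fraction)
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:k])


def simulate_effects(target_r, n_teachers=2000, seed=1):
    """Two hypothetical teacher-effect measures correlated at roughly target_r."""
    rng = random.Random(seed)
    test_effects = [rng.gauss(0, 1) for _ in range(n_teachers)]
    noise_weight = (1 - target_r ** 2) ** 0.5
    other_effects = [target_r * t + noise_weight * rng.gauss(0, 1)
                     for t in test_effects]
    return test_effects, other_effects


def overlap_rate(target_r):
    """Share of top-group teachers on one measure who are also top on the other."""
    test_effects, other_effects = simulate_effects(target_r)
    top_on_tests = top_group(test_effects)
    top_on_other = top_group(other_effects)
    return len(top_on_tests & top_on_other) / len(top_on_tests)


for target_r in (0.2, 0.9):  # illustrative "weak" and "strong" correlations
    print(f"r={target_r}: overlap of top groups = {overlap_rate(target_r):.2f}")
```

The lower the correlation between the two measures, the smaller the share of teachers who land in the top group on both, which is the pattern the authors describe in words.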

Ultimately, this further complicates “our” current definitions of effective teaching, especially when those definitions are constructed in policy arenas oft-removed from the realities of America’s public schools.

Reference: Kraft, M. A., & Grace, S. (2016). Teaching for tomorrow’s economy? Teacher effects on complex cognitive skills and social-emotional competencies. Providence, RI: Brown University. Working paper.

Victory in New Mexico’s Lawsuit, Again

My most recent post about the state of New Mexico (here) included an explanation of a New Mexico Judge’s ruling to postpone New Mexico’s state-wide teacher evaluation trial until October 2016, with the state’s December 2015 preliminary injunction (described here) in place until (at least) then.

New Mexico’s Public Education Department (PED) recently, however, also tried to appeal the Judge’s December 2015 injunction, taking it to New Mexico’s Court of Appeals for an emergency review of the injunction order.

The state and its PED lost, again. Here is the court order, which essentially says that the appeal was denied, and pasted below is the press release, released by the American Federation of Teachers New Mexico and Albuquerque Teachers Federation (i.e., the plaintiffs in this case).

Here, too, is an article just released in the Santa Fe New Mexican about this ruling and how the “Appeals court reject[ed the state’s] request to intervene in [this] teacher evaluation case.”


Court Denies Request from Public Education Department; Keeps Case in District Court

March 16, 2016

Contact: John Dyrcz

Albuquerque – American Federation of Teachers New Mexico (AFT NM) President Stephanie Ly and Albuquerque Teachers Federation (ATF) President Ellen Bernstein released the following statement:

“We are not surprised by today’s decision of the New Mexico Court of Appeals denying the New Mexico Public Education Department’s request for an interlocutory – or emergency – review of District Court Judge David Thomson’s injunction order. The December 2015 injunction preventing the PED from using its faulty evaluation system to penalize educators was well reasoned and the product of a fair and lengthy series of hearings over four months.

“We have maintained throughout this process that while the PED has every right to pursue all legal options under our judicial system, these frequent attempts at disrupting the progress of this case are nothing more than an attempt to stall the momentum of our efforts to seek relief for New Mexico’s education community.

“With this order, the case returns to Judge Thomson for final testimony from our expert witnesses, and we are pleased that the temporary injunction granted in December of 2015 will remain in place until at least October of 2016, when AFT NM and ATF will seek to make the injunction permanent,” said Ly and Bernstein.

Alleged Violation of Protective Order in Houston Lawsuit, Overruled

Many of you will recall a post I made public in January including “Houston Lawsuit Update[s], with Summar[ies] of Expert Witnesses’ Findings about the EVAAS” (Education Value-Added Assessment System sponsored by SAS Institute Inc.). What you might not have recognized since, however, was that I pulled the post down a few weeks after I posted it. Here’s the back story.

In January 2016, the Houston Federation of Teachers (HFT) published an “EVAAS Litigation Update,” which summarized a portion of Dr. Jesse Rothstein’s expert report in which he conclude[d], among other things, that teachers do not have the ability to meaningfully verify their EVAAS scores. He wrote that “[a]t most, a teacher could request information about which students were assigned to her, and could read literature — mostly released by SAS, and not the product of an independent investigation — regarding the properties of EVAAS estimates.” On January 10, 2016, I published the post “Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS,” summarizing what I considered to be the twelve key highlights of HFT’s “EVAAS Litigation Update,” in which I highlighted Rothstein’s above conclusions.

Lawyers representing SAS Institute Inc. charged that this post, along with the more detailed “EVAAS Litigation Update” I summarized within the post (authored by the Houston Federation of Teachers (HFT) to keep their members in Houston up-to-date on the progress of this lawsuit) violated a protective order that was put in place to protect SAS’s EVAAS computer source code. Even though there is/was nothing in the “EVAAS Litigation Update” or the blog post that disclosed the source code, SAS objected to both as disclosing conclusions that, SAS said, could not have been reached in the absence of a review of the source code. They threatened HFT, its lawyers, and its experts (myself and Dr. Rothstein) with monetary sanctions. HFT went to court in order to get the court’s interpretation of the protective order and to see if a Judge agreed with SAS’s position. In the meantime, I removed the prior post (which is now back up here).

The great news is that the Judge found in HFT’s favor. He found that neither the “EVAAS Litigation Update” nor the related blog post violated the protective order. Further, he found that “we” have the right to share other updates on the Houston lawsuit, which is still pending, as long as the updates do not violate the protective order still in place. This includes discussion of the conclusions or findings of experts, provided that the source code is not disclosed, either explicitly or by necessary implication.

In more specific terms, as per his ruling in his Court Order, the judge ruled that SAS Institute Inc.’s lawyers “interpret[ed] the protective order too broadly in this instance. Rothstein’s opinion regarding the inability to verify or replicate a teacher’s EVAAS score essentially mimics the allegations of HFT’s complaint. The Litigation Update made clear that Rothstein confirmed this opinion after review of the source code; but it [was] not an opinion ‘that could not have been made in the absence of [his] review’ of the source code. Rothstein [also] testified by affidavit that his opinion is not based on anything he saw in the source code, but on the extremely restrictive access permitted by SAS.” He added that “the overly broad interpretation urged by SAS would inhibit legitimate discussion about the lawsuit, among both the union’s membership and the public at large.” That, also in his words, would be an “unfortunate result” that should, in the future, be avoided.

Here, again, are the 12 key highlights of the EVAAS Litigation Update:
  • Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  • The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, albeit sophisticated VAM using data from HISD both times, he found that results “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness from which valid interpretations and assertions can be made.
  • EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  • EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average.”
  • Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  • The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  • Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects’ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mediated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved tests.
  • There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  • Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  • HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from them and even though they are paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’” (AERA Statement, 2015).
  • EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  • Related, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc. which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.” Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students within HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”
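
Two of the highlights above (overstated precision, and the pull toward the middle category for teachers with few linked students) rest on the same piece of arithmetic, which can be sketched in Python as follows. This is emphatically not the EVAAS calculation, which the source describes as proprietary. It is a generic textbook-style rule, assumed here for illustration, that labels a teacher as different from average only when the estimate plus or minus `z` standard errors excludes zero. The function name, the `se_scale` parameter, and all numbers are hypothetical.

```python
import math


def classify(estimate, student_sd, n_students, z=2.0, se_scale=1.0):
    """Label a teacher by whether estimate +/- z*SE excludes zero (the average).

    se_scale < 1 mimics a model that understates its standard errors,
    i.e., one that overstates its own precision.
    """
    standard_error = se_scale * student_sd / math.sqrt(n_students)
    lower = estimate - z * standard_error
    upper = estimate + z * standard_error
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "no detectable difference"


# Same hypothetical estimate (+3 points) and student spread (SD = 10),
# with only the number of linked students changing.
for n_students in (8, 25, 60, 120):
    print(n_students, classify(3.0, 10.0, n_students))

# Same teacher with 25 students, but with standard errors understated by half.
print("understated SE:", classify(3.0, 10.0, 25, se_scale=0.5))
```

Under these made-up inputs, the identical estimate is labeled “no detectable difference” with 8 or 25 linked students and “above average” with 60 or 120, while halving the standard error moves the 25-student teacher out of the middle category. That is the general mechanism behind both concerns, independent of how any particular VAM computes its estimates.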

Kane’s (Ironic) Calls for an Education Equivalent to the Food and Drug Administration (FDA)

Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, has been the source of multiple posts on this blog (see, for example, here, here, and here). He consistently backs VAMs, with his Harvard affiliation in tow, via a series of claims that are exaggerated but, as he argues, research-based.

However, much of the work that Kane cites on the topic he himself has written (sometimes with coauthors). Likewise, the strong majority of these works have not been published in peer-reviewed journals, but rather as technical reports or book chapters, often in his own books (see, for example, Kane’s curriculum vitae (CV) here). This includes the technical reports derived via his aforementioned MET studies, now in technical report and book form, which were completed three years ago in 2013 and have still not been externally vetted and published in any such journal. Lack of publication of the MET studies might be due to some of the methodological issues within these particular studies, however (see published and unpublished, (in)direct criticisms of these studies here, here, and here). Although Kane does also cite some published studies authored by others, again, in support of VAMs, the studies Kane cites are primarily/only authored by econometricians (e.g., Chetty, Friedman, and Rockoff) and, accordingly, largely unrepresentative of the larger literature surrounding VAMs.

With that being said, and while I do try my best to stay aboveboard as an academic who takes my scholarly and related blogging activities seriously, sometimes it is hard to, let’s say, not “go there” when more than deserved. Now is one of those times for what I believe is a fair assessment of one example of Kane’s unpublished and externally un-vetted works.

Last week on National Public Radio (NPR), Kane was interviewed by Eric Westervelt in a series titled “There Is No FDA For Education. Maybe There Should Be.” Ironically, in 2009 I made this claim in an article that I authored and that was published in Educational Leadership. I began the piece noting that “The value-added assessment model is one over-the-counter product that may be detrimental to your health.” I ended the article noting that “We need to take our education health as seriously as we take our physical health…[hence, perhaps] the FDA approach [might] also serve as a model to protect the intellectual health of the United States. [It] might [also] be a model that legislators and education leaders follow when they pass legislation or policies whose benefits and risks are unknown” (Amrein-Beardsley, 2009).

Never did I foresee, however, how much my 2009 calls for such an approach, similar to Kane’s recent calls during this interview, would ironically apply now, and precisely because of academics like Kane. In other words, what is ironic is that Kane is now calling for “rigorous vetting” of educational research, as he claims is being done with medical research, but those rules apparently do not apply to his own work or to the scholarly works he perpetually cites in favor of VAMs.

Take also, for example, some of the more specific claims Kane expressed in this particular NPR interview.

Kane claims that “The point of education research is to identify effective interventions for closing the achievement gap.” Although elsewhere in the same interview Kane claims that “American education research [has also] mostly languished in an echo chamber for much of the last half century.” While NPR’s Westervelt criticizes Kane for making a “pretty scathing and strong indictment” of America’s education system, what Kane does not understand writ large is that the very solutions for which Kane advocates – using VAM-based measurements to fire and hire bad and good teachers, respectively – are really no different than the “stronger accountability” measures upon which we have relied for the last 40 years (since the minimum competency testing era) within this alleged “echo chamber.” Kane himself, then, IS one of the key perpetrators and perpetuators of the very “echo chamber” of which he speaks, and also scorns and indicts. Kane simply does not realize (or acknowledge) that this “echo chamber” continues in its persistence because of those who continue to preserve a stronger accountability for educational reform logic, and who continue to reinvent “new and improved” accountability measures and policies to support this logic, accordingly.

Kane also claims that he does not “point fingers at the school officials out there” for not being able to reform our schools for the better, and in this particular case, for closing the achievement gap. But unfortunately, Kane does. Recall the article Kane authored and that the New York Daily News published, titled “Teachers Must Look in the Mirror,” in which Kane insults New York’s administrators, writing that the fact that 96% of teachers in New York were given the two highest ratings last year – “effective” or “highly effective” – was “a sure sign that principals have not been honest to date.” Enough said.

The only thing with which I do agree with Kane as per this NPR interview is that we as an academic community can certainly do a better job of translating research into action, although I would add that only externally-reviewed published research is that which is worthy of actually turning into action, in practice. Hence, my repeated critiques of research studies on this blog, that are sometimes not even internally vetted or reviewed before being released to the public, and then the media, and then the masses as “truth.” Recall, for example, when the Chetty et al. studies (mentioned prior as also often cited by Kane) were cited in President Obama’s 2012 State of the Union address, regardless of the fact that the Chetty et al. studies had not even been internally reviewed by their sponsoring agency – the National Bureau of Economic Research (NBER) – prior to their public release? This is also regardless of the fact that the federal government’s “What Works Clearinghouse” – also positioned in this NPR interview by Kane as the closest entity we have so far to an educational FDA – noted grave concerns about this study at the same time (as have others since; see, for example, here, here, here, and here).

Now this IS bad practice, although on this, Kane is also guilty.

To put it lightly, and in sum, I’m confused by Kane’s calls for such an external monitoring/quality entity akin to the FDA. While I could not agree more with his statement of need, as I also argued in my 2009 piece mentioned prior, it is because of educational academics like Kane that I have to continue to make such claims and calls for such monitoring and regulation.

Reference: Amrein-Beardsley, A. (2009). Buyers be-aware: What you don’t know can hurt you. Educational Leadership, 67(3), 38-42.

New Mexico’s Teacher Evaluation Trial Postponed Until October, w/Preliminary Injunction Still in Place

Last December in New Mexico, a judge granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data as based on the state’s value-added model (VAM). More specifically, Judge David K. Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (which was set for April of 2016) that the system is reliable, valid, fair, uniform, and the like. See more details regarding Judge Thomson’s ruling in a previous post here: “Consequences Attached to VAMs Suspended Throughout New Mexico.” See more details about this specific lawsuit, sponsored by the American Federation of Teachers (AFT) New Mexico and the Albuquerque Teachers Federation (ATF), in a previous post here: “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” This is one of the cases on which I am continuing to serve as an expert witness.

Yesterday, however, and given another state-level lawsuit that is also ongoing regarding the state’s teacher evaluation system, although this one is sponsored by the National Education Association (NEA), Judge Thomson (apparently along with Judge Francis Mathew) pushed both the AFT-NM/ATF and NEA trials back to October of 2016, yielding a six-month delay for the AFT-NM/ATF hearing.

According to an article published this morning in the Santa Fe New Mexican, “To date, the [New Mexico] Public Education Department [PED] has been unsuccessful in its efforts to stop either suit or combine them;” hence, yesterday in court the state requested that the court postpone both hearings so that the state could introduce its new teacher evaluation system, on March 15 of 2016, along with its specifics and rules, as also based on the state’s new Partnership for the Assessment of Readiness for College and Careers (PARCC) test data. Recall that the state’s Secretary of Education – Hanna “Skandera is new chair of PARCC test board.” It is also anticipated, however, that the state’s new system is to still “rely heavily” (i.e., 50% weight) on VAMs. See also a related post about “New Mexico Chang[ing] its Teacher Evaluation System, But Not Really.”

This window of time is also to allow for the public forums needed to review the state’s new system, but also to allow time for “the acrimony to be resolved without trials.” The preliminary injunction granted by Judge Thomson in December, though, still remains in place. See also a related article, also published this morning, in the Albuquerque Journal.

Stephanie Ly, president of the AFT-NM, said she is not happy with the trial being postponed. She called this a “stalling tactic” to give the [state] education department more time to compile student achievement data that the plaintiffs have been requesting. “We had no option but to agree because they are withholding data,” she said.

Ly and ATF President Ellen Bernstein also responded yesterday via a joint statement, pasted in full below:

March 7, 2016

Contact: John Dyrcz — 505-554-8679

“The Public Education Department and Secretary Skandera have once again willfully delayed the AFT NM/ATF lawsuit against the current value added model [VAM] evaluation system due to their purposeful refusal to reveal the data being used to evaluate our educators in New Mexico.

“In addition to this stall tactic, and during a status hearing this morning in the First District Court, lawyers for the PED revealed that new rules and regulations were to be unveiled on March 15 by the PED, and would ‘rely heavily’ on VAM as a method of evaluation for educators.

“New Mexico educators will not cease in our fight against the abusive policies of this administration. Allowing PED or districts to terminate employees based on VAM and student test scores is completely unacceptable, it is unacceptable to allow PED or districts to refuse licensure advancement based upon VAM scores, and it is unacceptable for PED or districts to place New Mexico educators on growth plans based on faulty data.

“High-performing education systems have policies in place which respect and support their educators and use evaluations not as punitive measures but as opportunities for improvement. Educators, unions, and administrators should oversee the evaluation process to ensure it is thorough and of high quality, as well as fair and reliable. Educators, unions, and administrators should be involved in developing, implementing and monitoring the system to ensure it reflects good teaching well, that it operates effectively, that it is tied to useful learning opportunities for teachers, and that it produces valid results.

“It is well known the PED is in a current state of crisis with several high-level staff members abandoning the Department, an on-going whistle-blower lawsuit…the failure to produce meaningful changes to education in New Mexico during her six years as Secretary, and Skandera’s constant changes to the rules is a desperate attempt to right a sinking ship,” said Ly and Bernstein.

Alabama’s “New” Accountability System: Part II

In a prior post, about whether the state of “Alabama is the New, New Mexico,” I wrote about a draft bill in Alabama to be called the Rewarding Advancement in Instruction and Student Excellence (RAISE) Act of 2016. This has since been renamed the Preparing and Rewarding Educational Professionals (PREP) Bill (to be Act) of 2016. The bill was introduced by its sponsoring Republican Senator Marsh last Tuesday, and its public hearing is scheduled for tomorrow. I review the bill below, and attach it here for others who are interested in reading it in full.

First, the bill is to “provide a procedure for observing and evaluating teachers, principals, and assistant principals on performance and student achievement…[using student growth]…to isolate the effect and impact of a teacher on student learning, controlling for pre-existing characteristics of a student including, but not limited to, prior achievement.” Student growth is still one of the bill’s key components, with growth set at a 25% weight, and this is still written into this bill regardless of the fact that the new federal Every Student Succeeds Act (ESSA) no longer requires teacher-level growth as a component of states’ educational reform legislation. In other words, states are no longer required to do this, but apparently the state/Senator Marsh still wants to move forward in this regard, regardless (and regardless of the research evidence). The student growth model is to be selected by October 1, 2016. On this my offer (as per my prior post) still stands. I would be more than happy to help the state negotiate this contract, pro bono, and much more wisely than so many other states and districts have negotiated similar contracts thus far (e.g., without asking for empirical evidence as a continuous contractual deliverable).

Second, and related, nothing is written about the ongoing research and evaluation of the state system, that is absolutely necessary in order to ensure the system is working as intended, especially before any types of consequential decisions are to be made (e.g., school bonuses, teachers’ denial of tenure, teacher termination, teacher termination due to a reduction in force). All that is mentioned is that things like stakeholder perceptions, general outcomes, and local compliance with the state will be monitored. Without evidence in hand, in advance and preferably as externally vetted and validated, the state will very likely be setting itself up for some legal trouble.

Third, to measure growth the state is set to use student performance data on state tests, as well as data derived via the ACT Aspire examination, American College Test (ACT), and “any number of measures from the department developed list of preapproved options for governing boards to utilize to measure student achievement growth.” As mentioned in my prior post about Alabama (linked to again here), this is precisely what has gotten the whole state of New Mexico wrapped up in, and quasi-losing, its ongoing lawsuit. While providing districts with menus of off-the-shelf and other assessment options might make sense to policymakers, any self-respecting researcher should know why this is entirely inappropriate. To read more about this, the best research study explaining why doing just this will set any state up for lawsuits comes from Brown University’s John Papay in his highly esteemed and highly cited “Different tests, different answers: The stability of teacher value-added estimates across outcome measures” article. The title of this research article alone should explain enough why simply positioning and offering up such tests in such casual ways makes way for legal recourse.

Fourth, the bill is to increase the number of years it will take Alabama teachers to earn tenure, requiring that teachers teach for at least five years, of which at least three consecutive years of “satisfies expectations,” “exceeds expectations,” or “significantly exceeds expectations” are demonstrated via the state’s evaluation system prior to earning tenure. Clearly the state does not understand the current issues with value-added/growth levels of reliability, or consistency, or lack thereof, that are altogether preventing such consistent classifications of teachers over time. Inversely, what is consistently evident across all growth models is that estimates are very inconsistent from year to year, which will likely thwart what the bill has written into it here, as such a theoretically simple proposition. For example, the common statistic still cited in this regard is that a teacher classified as “adding value” has a 25% to 50% chance of being classified as “subtracting value” the following year(s), and vice versa. This sometimes makes the probability of a teacher being consistently identified as (in)effective, from year-to-year, no different than the flip of a coin, and this is true when there are at least three years of data (which is standard practice and is also written into this bill as a minimum requirement).
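To make the coin-flip arithmetic a bit more concrete, here is a minimal sketch. It assumes, purely for illustration, that a teacher’s classification flips from one year to the next with a fixed probability and independently each year; real growth estimates need not behave this way, and the function name is hypothetical.

```python
def prob_consistent(flip_prob: float, years: int) -> float:
    """Chance a teacher keeps the same classification for `years` consecutive years.

    Illustrative assumption only: the classification flips between adjacent
    years with probability `flip_prob`, independently each year.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be between 0 and 1")
    if years < 1:
        raise ValueError("years must be at least 1")
    # The first year fixes the classification; each later year must not flip.
    return (1.0 - flip_prob) ** (years - 1)


# The 25% and 50% figures are the range cited in the paragraph above.
for flip_prob in (0.25, 0.50):
    p = prob_consistent(flip_prob, years=3)
    print(f"flip probability {flip_prob:.2f}: {p:.4f} chance of three consistent years")
```

Under these simplified assumptions, the chance of three consistent consecutive ratings falls somewhere between roughly 56% (at a 25% flip rate) and 25% (at a 50% flip rate), which is in the neighborhood of, or worse than, a coin flip.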

Unless the state plans on “artificially conflating” scores, by manufacturing and forcing the oft-unreliable growth data to fit or correlate with teachers’ observational data (two observations per year are to be required), and/or survey data (student surveys are to be used for teachers of students in grades three and above), such consistency is thus far impossible unless deliberately manipulated (see a recent post here about how “artificial conflation” is one of the fundamental and critical points of litigation in another lawsuit in Houston). Related, this bill is also to allow a governing board to evaluate a tenured teacher as per his/her similar classifications every other year, and is to subject a tenured teacher who received a rating of below or significantly below expectations for two consecutive evaluations to personnel action.

Fifth and finally, the bill is also to use an allocated $5,000,000 to recruit teachers, $3,000,000 to mentor teachers, and $10,000,000 to reward the top 10% of schools in the state as per their demonstrated performance. The latter of these is the most consequential as per the state’s use of its planned growth data at the school level (see other teacher-level consequences to be attached to growth output above); hence, the comments about needing empirical evidence to justify such allocations prior to the distributions of such monies are important to underscore again. Likewise, given levels of bias in value-added/growth output are worse at the school versus teacher levels, I would also caution the state against rewarding schools, again in this regard, for what might not really be schools’ causal impacts on student growth over time, after all. See, for example, here, here, and here.

As I also mentioned in my prior post on Alabama, for those of you who have access to educational leaders there, do send them this post too, so they might be a bit more proactive, and appropriately more careful and cautious, before going down what continues to demonstrate itself as a poor educational policy path. While I do embrace my professional responsibility as a public scholar to be called to court to testify about all of this when such high-stakes consequences are ultimately, yet inappropriately based upon invalid inferences, I’d much rather be proactive in this regard and save states and states’ taxpayers their time and money, respectively.

VAMs: A Global Perspective by Tore Sørensen

Tore Bernt Sørensen is a PhD student currently studying at the University of Bristol in England. He is an emerging global educational policy scholar and a future colleague whom I am to meet this summer during an internationally-situated talk on VAMs. Just last week he released a paper published by Education International (Belgium) in which he discusses VAMs, and their use(s) globally. It is rare that I read or have the opportunity to write about what is happening with VAMs worldwide; hence, I am taking this opportunity to share with you all some of the global highlights from his article. I have also attached his article to this post here for those of you who want to give the full document a thorough read (see also the article’s full reference below).

First is that the US is “leading” the world in terms of its adoption of VAMs as an educational policy tool. While I already knew this, given my prior attempts to explore what was happening in the universe of VAMs outside of the US, as per Sørensen our nation’s ranking in this case still holds. In fact, “in the US the use of VAM as a policy instrument to evaluate schools and teachers has been taken exceptionally far [emphasis added] in the last 5 years, [while] most other high-income countries remain [relatively more] cautious towards the use of VAM;” this, “as reflected in OECD [Organisation for Economic Co-operation and Development] reports on the [VAM] policy instrument” (p. 1).

The second country most exceptionally using VAMs, so far, is England. England’s national school inspection system, run by the Office for Standards in Education, Children’s Services and Skills (OFSTED), for example, now has VAM as its central standard and accountability indicator.

These two nations are the most invested in VAMs, thus far, primarily because they have similar histories with the school effectiveness movement that emerged in the 1970s. In addition, both countries are also highly engaged in what Pasi Sahlberg in his 2011 book Finnish Lessons termed the Global Educational Reform Movement (GERM). GERM, in place since the 1980s, has “radically altered education sectors throughout the world with an agenda of evidence-based policy based on the [same] school effectiveness paradigm…[as it]…combines the centralised formulation of objectives and standards, and [the] monitoring of data, with the decentralisation to schools concerning decisions around how they seek to meet standards and maximise performance in their day-to-day running” (p. 5).

“The Chilean education system has [also recently] been subject to one of the more radical variants of GERM and there is [now] an interest [also there] in calculating VAM scores for teachers” (p. 6). In Denmark and Sweden state authorities have begun to compare predicted versus actual performance of schools, not teachers, while taking into consideration “the contextual factors of parents’ educational background, gender, and student origin” (i.e., “context value added”) (p. 7). In Uganda and Delhi, in “partnership” with an England-based, international school development company ARK, they are looking to gear up their data systems so they can run VAM trials and analyses to assess their schools’ effects, and also likely continue to scale up and out.
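For readers unfamiliar with what comparing “predicted versus actual performance” means in practice, the basic idea can be sketched as follows. This is a deliberately simplified illustration, not the method any of these countries actually uses: it assumes a single made-up context index per school (real systems use several contextual factors), the numbers are invented, and the function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("context values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def context_value_added(context, actual):
    """Actual minus predicted score per school (the regression residual)."""
    intercept, slope = fit_line(context, actual)
    return [y - (intercept + slope * x) for x, y in zip(context, actual)]


# Invented data: one context index per school and each school's mean score.
context = [0.2, 0.4, 0.6, 0.8]
actual = [48.0, 55.0, 59.0, 68.0]

residuals = context_value_added(context, actual)
for school, r in enumerate(residuals, start=1):
    label = "above" if r > 0 else "below"
    print(f"school {school}: {r:+.2f} ({label} its predicted score)")
```

A school’s “value added” in this sketch is simply how far its actual score sits above or below what its context alone would predict, which is also why the choice of contextual factors matters so much to the result.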

The US-based World Bank is also backing such international moves, as is the US-based Pearson testing corporation via its Learning Curve Project, which is relying on the input from some of the most prominent VAM advocates including Eric Hanushek (see prior posts on Hanushek here and here) and Raj Chetty (see prior posts on Chetty here and here) to promote itself as a player in the universe of VAMs. This makes sense, “[c]onsidering Pearson’s aspirations to be a global education company… particularly in low-income countries” (p. 7). On that note, also as per Sørensen, “education systems in low-income countries might prove [most] vulnerable in the coming years as international donors and for-profit enterprises appear to be endorsing VAM as a means to raise school and teacher quality” in such educationally struggling nations (p. 2).

See also a related blog post about Sørensen’s piece here, as written by him on the Education in Crisis blog, which is also sponsored by Education International. In this piece he also discusses the use of data for political purposes, as is too often the case with VAMs when “the use of statistical tools as policy instruments is taken too far…towards bounded rationality in education policy.”

In short, “VAM, if it has any use at all, must expose the misleading use of statistical mumbo jumbo that effectively #VAMboozles [thanks for the shout out!!] teachers, schools and society. This could help to spark some much needed reflection on the basic propositions of school effectiveness, the negative effects of putting too much trust in numbers, and lead us to start holding policy-makers to account for their misuse of data in policy formation.”

Reference: Sørensen, T. B. (2016). Value-added measurement or modelling (VAM). Brussels, Belgium: Education International. Retrieved from