The Nation’s “Best Test” Scores Released: Test-Based Policies (Evidently) Not Working

From Diane Ravitch’s Blog (click here for direct link):

Sometimes events happen that seem to be disconnected, but after a few days or weeks, the pattern emerges. Consider this: On October 2, [U.S.] Secretary of Education Arne Duncan announced that he was resigning and planned to return to Chicago. Former New York Commissioner of Education John King, who is a clone of Duncan in terms of his belief in testing and charter schools, was designated to take Duncan’s place. On October 23, the Obama administration held a surprise news conference to declare that testing was out of control and should be reduced to not more than 2% of classroom time [see prior link on this announcement here]. Actually, that wasn’t a true reduction, because 2% translates into between 18 and 24 hours of testing, which is a staggering amount of annual testing for children in grades 3-8 and no different from the status quo in most states.
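For those wondering where the 18-to-24-hour figure comes from, the arithmetic is simple. Here is a quick sketch using my own assumed figures (a 180-day school year and 5 to 6.5 instructional hours per day, which are not from the administration’s announcement):

```python
# Rough arithmetic behind the proposed 2% testing cap.
# The school-year figures below are my own assumptions, not official numbers.
days_per_year = 180                # a typical U.S. school year
hours_per_day = [5.0, 6.5]        # low and high estimates of instructional hours

testing_hours = [0.02 * days_per_year * h for h in hours_per_day]
for h, t in zip(hours_per_day, testing_hours):
    print(f"{days_per_year * h:.0f} instructional hours/year -> {t:.1f} hours of testing")
```

At 5 hours a day, 2% of the year is 18 hours of testing; at 6.5 hours a day, it is about 23 hours. Hence the “cap” simply ratifies the 18-24 hours already common in most states.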

Disconnected events?

Not at all. Here comes the pattern-maker: the federal testing program called the National Assessment of Educational Progress [NAEP] released its every-other-year report card in reading and math, and the results were dismal. There would be many excuses offered, many rationales, but the bottom line is this: the NAEP scores are an embarrassment to the Obama administration (and the George W. Bush administration that preceded it).

For nearly 15 years, Presidents Bush and Obama and the Congress have bet billions of dollars—both federal and state—on a strategy of testing, accountability, and choice. They believed that if every student was tested in reading and mathematics every year from grades 3 to 8, test scores would go up and up. In those schools where test scores did not go up, the principals and teachers would be fired and replaced. Where scores didn’t go up for five years in a row, the schools would be closed. Thousands of educators were fired, and thousands of public schools were closed, based on the theory that sticks and carrots, rewards and punishments, would improve education.

But the 2015 NAEP scores released today by the National Assessment Governing Board (a federal agency) showed that Arne Duncan’s $4.35 billion Race to the Top program had flopped. It also showed that George W. Bush’s No Child Left Behind was as phony as the “Texas education miracle” of 2000, which Bush touted as proof of his education credentials.

NAEP is an audit test. It is given every other year to samples of students in every state and in about 20 urban districts. No one can prepare for it, and no one gets a grade. NAEP measures the rise or fall of average scores for states in fourth grade and eighth grade in reading and math and reports them by race, gender, disability status, English language ability, economic status, and a variety of other measures.

The 2015 NAEP scores showed no gains nationally in either grade in either subject. In mathematics, scores declined in both grades, compared to 2013. In reading, scores were flat in grade 4 and lower in grade 8. Usually the Secretary of Education presides at a press conference where he points with pride to increases in certain grades or in certain states. Two years ago, Arne Duncan boasted about the gains made in Tennessee, which had won $500 million in Duncan’s Race to the Top competition. This year, Duncan had nothing to boast about.

In his Race to the Top program, Duncan made testing the primary purpose of education. Scores had to go up every year, because the entire nation was “racing to the top.” Only 12 states and jurisdictions won a share of the $4.35 billion that Duncan was given by Congress: Tennessee and Delaware were the first to win, in 2010. In the next round, the following won multi-millions of federal dollars to double down on testing: Maryland, Massachusetts, the District of Columbia, Florida, Georgia, Hawaii, New York, North Carolina, Ohio, and Rhode Island.

Tennessee, Duncan’s showcase state in 2013, made no gains in reading or mathematics in either fourth or eighth grade. The black-white test score gap was as large in 2015 as it had been in 1998, before either NCLB or the Race to the Top.

The results in mathematics were bleak across the nation, in both grades 4 and 8. The declines nationally were only 1 or 2 points, but they were significant in a national assessment on the scale of NAEP.

In fourth grade mathematics, the only jurisdictions to report gains were the District of Columbia, Mississippi, and the Department of Defense schools. Sixteen states had significant declines in their math scores, and thirty-three were flat in relation to 2013 scores. The scores in Tennessee (the $500 million winner) were flat.

In eighth grade, the lack of progress in mathematics was universal. Twenty-two states had significantly lower scores than in 2013, while 30 states or jurisdictions had flat scores. Pennsylvania, Kansas, and Florida (a Race to the Top winner) were the biggest losers, each dropping six points. Among the states that declined by four points were Race to the Top winners Ohio, North Carolina, and Massachusetts. Maryland, Hawaii, New York, and the District of Columbia lost two points. The scores in Tennessee were flat.

The District of Columbia made gains in fourth grade reading and mathematics, but not in eighth grade. It continues to have the largest score gap (56 points) between white and black students of any urban district in the nation. That is more than double the average of the other 20 urban districts. The state with the biggest achievement gap between black and white students is Wisconsin; it is also the state where black students have the lowest scores, lower than their peers in states like Mississippi and South Carolina. Wisconsin has invested heavily in vouchers and charter schools, which Governor Scott Walker intends to increase.

The best single word to describe NAEP 2015 is stagnation. Contrary to President George W. Bush’s law, many children have been left behind by the strategy of test-and-punish. Contrary to the Obama administration’s Race to the Top program, the mindless reliance on standardized testing has not brought us closer to some mythical “Top.”

No wonder Arne Duncan is leaving Washington. There is nothing to boast about, and the next set of NAEP results won’t be published until 2017. The program that he claimed would transform American education has not raised test scores, but it has demoralized educators and created teacher shortages. Disgusted with the testing regime, experienced teachers leave, and enrollments in teacher education programs fall. One can only dream about what the Obama administration might have accomplished had it spent that $5 billion in discretionary dollars to encourage states and districts to develop and implement realistic plans for desegregation of their schools, or had it invested the same amount of money in the arts.

The past dozen or so years have been a time when “reformers” like Arne Duncan, Michelle Rhee, Joel Klein, and Bill Gates proudly claimed that they were disrupting school systems and destroying the status quo. Now the “reformers” have become the status quo, and we have learned that disruption is not good for children or education.

Time is running out for this administration, and it is not likely that there will be any meaningful change of course in education policy. One can only hope that the next administration learns important lessons from the squandered resources and failure of NCLB and Race to the Top.

New York’s Ahern v. King: Court Orders Case to Move Forward

In Ahern v. King, filed on April 16, 2014, the Syracuse Teachers Association (and its President Kevin Ahern, along with a set of teachers employed by the Syracuse School District) sued the state of New York (and John King, then Commissioner of the New York State Department of Education and now Arne Duncan’s designated successor as U.S. Secretary of Education, along with others representing the state) regarding its teacher evaluation system.

They alleged that the state’s method of evaluating teachers is unfair to those who teach economically disadvantaged students. More specifically, and as per an article in Education Week to which I referred in a post a couple of weeks ago, they alleged that the state’s value-added model (which is actually a growth model) fails to “fully account for poverty in the student-growth formula used for evaluation, penalizing some teachers.” They also alleged that “the state imposed regulations without public comment.”

Well, the decision on the state’s motion to dismiss is in. The Court rejected the state’s attempt to dismiss this case; hence, the case will move forward. For the full report, see the official Decision and Order attached.

As per the plaintiffs’ charge regarding controlling for student demographic variables, the Decision and Order notes that the growth model used by the state was developed, under contract, by the American Institutes for Research (AIR). In AIR’s official report/court exhibit, AIR acknowledged the need to control for students’ demographic variables to moderate potential bias; hence, “if a student’s family was participating in any one of a number of economic assistance programs, that student would have been ‘flagged,’ and his or her growth would have been compared with students who were similarly ‘flagged’…If a student’s economic status was unknown, the student was treated as if he or she was not economically disadvantaged” (p. 7).

Nothing in the document explicated, however, whether the model output was indeed biased, or more specifically whether the individual plaintiffs’ value-added scores were biased, as they charged.

What was explicated was that the defendants attempted to dismiss this case given a four-month statute of limitations. Plaintiffs, however, prevailed in that the state never made a “definitive decision” that was “readily ascertainable” to teachers on this matter (i.e., about controlling for students’ demographic factors within the model); hence, the statute-of-limitations clock never really started running.

As per the Court, the state’s definition of “similar students” to whom other students are to be compared (as mentioned prior) was indeterminate until December of 2013. This was evidenced by the state’s Board of Regents’ June 2013 approval of an “enhanced list of characteristics used to define” such students, and by the state’s December 2013 publication of AIR’s technical report in which the methods were (finally) made public. Hence, the Court found that the plaintiffs’ claims fell within the four-month statute of limitations on which the state was relying in its motion to dismiss, because the state had not “properly promulgated” its methodology for measuring student growth before then (and arguably still has not).

So, ultimately, the Court found that the defendants failed to establish their entitlement to dismissal of the specific teachers’ claims on statute-of-limitations grounds (as explained more fully in the Decision and Order). The Court denied the state of New York’s motion to dismiss, and this one will move forward in New York.

This is great news for those in New York, especially considering that this state, more than most others, is one in which education “leaders” have attached high-stakes consequences to such teacher evaluation output. As per state law, teacher evaluation scores (based in large part on value-added, or growth in this particular case) can be used as “a significant factor for employment decisions including but not limited to, promotion, retention, tenure determination, termination, and supplemental compensation.”

Special Issue of “Educational Researcher” (Paper #2 of 9): VAMs’ Measurement Errors, Issues with Retroactive Revisions, and (More) Problems with Using Test Scores

Recall from a prior post that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou, Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University, and Matthew Springer, Assistant Professor of Public Policy, also at Vanderbilt.

As written in the article’s abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

  • “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
  • “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78).
  • “[R]esources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error” (p. 78).
  • “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome, as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs), and despite the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed, or researched.
  • Relatedly, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it is rare; hence, it should not significantly impact teachers’ value-added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
  • In terms of teachers versus proctors or other teachers monitoring students when they take the large-scale standardized tests (that are used across all states to calculate value-added estimates), the researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant than I do, in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations, this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but for those schools that still allow this, this finding might also be relevant. “An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know” (p. 83).
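The second bullet above, about the teacher of 25 versus the teacher of 100, is easy to verify with a simple simulation of my own (the cutoff and the spread of student gains below are hypothetical values I chose for illustration, not Ballou and Springer’s numbers):

```python
import random

# Two teachers with identical true effectiveness (zero) are scored on their
# students' average gains; the smaller class yields a noisier score, so that
# teacher is flagged "ineffective" far more often purely by chance.
# STUDENT_SD and CUTOFF are my own hypothetical values.
random.seed(0)
STUDENT_SD = 1.0     # assumed spread of individual students' score gains
CUTOFF = -0.25       # hypothetical threshold for an "ineffective" rating
TRIALS = 10000

def p_flagged(n_students):
    """Chance an average teacher falls below the cutoff purely by noise."""
    flagged = 0
    for _ in range(TRIALS):
        class_mean = sum(random.gauss(0, STUDENT_SD) for _ in range(n_students)) / n_students
        flagged += class_mean < CUTOFF
    return flagged / TRIALS

p_small, p_large = p_flagged(25), p_flagged(100)
print(p_small, p_large)   # the 25-student teacher is flagged several times as often
```

The exact ratio depends on where the cutoff sits, but the direction never changes: with a quarter as many tested students, the standard error of the class mean doubles, and the “ineffective” label lands many times more often on the identical teacher.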

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS) used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee, and in many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with the EVAAS practice whereby teachers’ EVAAS scores change retroactively as more data become available, in pursuit of more “precision,” even though teachers’ scores can change one or two years after the initial score is registered (and used for whatever purposes).

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).


If interested, see the Review of Article #1 (the introduction to the special issue) here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86. doi:10.3102/0013189X15574904

New Research Study: Controlling for Student Background Variables Matters

An article about the “Sensitivity of Teacher Value-Added Estimates to Student and Peer Control Variables” was recently published in the peer-reviewed Journal of Research on Educational Effectiveness. While this article is not open-access, here is a link to the earlier version released by Mathematica, which nearly mirrors the final published article.

In this study, researchers Matthew Johnson, Stephen Lipscomb, and Brian Gill, all of whom are associated with Mathematica, examined the sensitivity and precision of various VAMs, the extent to which their estimates vary given whether modelers include student- and peer-level background characteristics as control variables, and the extent to which their estimates vary given whether modelers include 1+ years of students’ prior achievement scores, also as control variables. They did this while examining state data, compared with what they called District X, a district within the state with three times as many African-American students, twice as many students receiving free or reduced-price lunch, and generally more special education students than the state average. While the state data included more students, the district data included additional control variables, supporting the researchers’ analyses.

Here are the highlights, with thanks to lead author Matthew Johnson for edits and clarifications.

  • Different VAMs produced similar results overall [when using the same data], almost regardless of specifications. “[T]eacher estimates are highly correlated across model specifications. The correlations [they] observe[d] in the state and district data range[d] from 0.90 to 0.99 relative to [their] baseline specification.”

This has been evidenced in the literature before, when the same models are applied to the same datasets taken from the same sets of tests at the same time. Hence, many critics argue that similar results come about when using different VAMs on the same data because, when using the same fallible, large-scale standardized test data, even the most sophisticated models are processing “garbage in” and “garbage out.” When the tests inserted into the same VAM vary, however, even if the tests are measuring the same constructs (e.g., mathematics learning in grade X), things go haywire. For more information about this, please see Papay (2010) here.

  • However, “even correlation coefficients above 0.9 do not preclude substantial amounts of potential misclassification of teachers across performance categories.” The researchers also found that, even with such consistencies, 26% of teachers rated in the bottom quintile were placed in higher performance categories under an alternative model.
  • Modeling choices impacted the rankings of teachers in District X in “meaningful” ways, given District X’s varying student demographics compared with those in the state overall. In other words, VAMs that do not include all relevant student characteristics can penalize teachers in districts that serve students who are more disadvantaged than statewide averages.
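To see how a correlation above 0.9 and a roughly 26% reclassification rate can coexist, here is a quick simulation of my own (hypothetical normally distributed scores, not the study’s data):

```python
import random

# Generate two model specifications' scores for the same teachers,
# correlated at RHO, then count how many teachers in the bottom quintile
# under model A land in a higher category under model B.
# All values here are simulated for illustration, not the study's data.
random.seed(1)
RHO = 0.95
N = 50000

scores_a, scores_b = [], []
for _ in range(N):
    a = random.gauss(0, 1)
    b = RHO * a + (1 - RHO ** 2) ** 0.5 * random.gauss(0, 1)  # corr(a, b) = RHO
    scores_a.append(a)
    scores_b.append(b)

cut_a = sorted(scores_a)[N // 5]   # 20th-percentile cutoff under each model
cut_b = sorted(scores_b)[N // 5]
bottom_a = [i for i in range(N) if scores_a[i] < cut_a]
moved = sum(scores_b[i] >= cut_b for i in bottom_a) / len(bottom_a)
print(f"Reclassified out of the bottom quintile: {moved:.0%}")
```

Even at a 0.95 correlation, a substantial share of bottom-quintile teachers escapes that label under the alternative specification, because most teachers sit near a cutoff where a little model-to-model noise flips the category.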

See an article related to whether VAMs can include all relevant student characteristics, also given the non-random assignment of students to teachers (and teachers to classrooms) here.


Original Version Citation: Johnson, M., Lipscomb, S., & Gill, B. (2013). Sensitivity of teacher value-added estimates to student and peer control variables. Princeton, NJ: Mathematica Policy Research.

Published Version Citation: Johnson, M., Lipscomb, S., & Gill, B. (2015). Sensitivity of teacher value-added estimates to student and peer control variables. Journal of Research on Educational Effectiveness, 8(1), 60-83. doi:10.1080/19345747.2014.967898

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video before, but following up on another post about the potential impact such videos can really have, I thought now would be an appropriate time to share it. “We can be the change,” and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District, more specifically four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools following the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Bias in School-Level Value-Added, Related to High vs. Low Attrition

In a 2013 study titled “Re-testing PISA Students One Year Later: On School Value Added Estimation Using OECD-PISA” (Organisation for Economic Co‑operation and Development-Programme for International Student Assessment), researchers Bratti and Checchi explored a unique PISA international test score data set from Italy.

“[I]n two regions of North Italy (Valle d’Aosta and the autonomous province of Trento) the PISA 2009 test was re-administered to the same students one year later.” Hence, the authors had the unique opportunity to analyze what happens with school-level value-added when the same students were retested in two adjacent years, using a very strong standardized achievement test (i.e., the PISA).

Researchers found that “cross-sectional measures of school value added based on PISA…tend to be very volatile over time whenever there is a high year-to-year attrition in the student population.” Some of this volatility can be mitigated when longitudinal measures of school value-added take students’ prior test scores into account; even so, higher consistency (less volatility) tends to be evident in schools with little attrition/transition, while lower consistency (higher volatility) tends to be evident in schools with much attrition/transition.

Researchers observed that the correlation was “as high as 0.92 in Trento and is close to zero in Valle d’Aosta” when the VAM was not used to control for past test scores. When a more sophisticated VAM was used (accounting for students’ prior performance and school fixed effects), however, researchers found that the “coefficient [was] much higher for Valle d’Aosta than for Trento.” So, the correlations flip-flopped based on model specifications, with the more advanced specifications yielding the “better” or “more accurate” value-added output.

Researchers attribute this to panel attrition in that “in Trento only 8% of the students who were originally tested in 2009 dropped out or changed school in 2010, [but] the percentage [rose] to about 21% in Valle d’Aosta” at the same time.

Likewise, “[i]n educational settings characterized by high student attrition, this will lead to very volatile measures of VA.” Inversely, “in settings characterized by low student attrition (drop-out or school changes), longitudinal and cross-sectional measures of school VA turn out to be very correlated.”
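To make the attrition mechanism concrete, here is a toy simulation of my own (the school, student, and test-noise parameters are assumptions for illustration, not the authors’ data):

```python
import random

# Each school's value-added (VA) is estimated twice from its students'
# average scores. Between the two measurements, a fraction of students
# leaves and is replaced ("attrition"), injecting fresh noise into the
# second estimate. All parameters below are hypothetical.
random.seed(2)

def va_correlation(attrition, n_schools=2000, n_students=50):
    year1_va, year2_va = [], []
    for _ in range(n_schools):
        school_effect = random.gauss(0, 0.1)                 # stable school quality
        abilities = [random.gauss(0, 1) for _ in range(n_students)]
        stay = int(n_students * (1 - attrition))
        # retained students keep their abilities; leavers are replaced by new students
        abilities2 = abilities[:stay] + [random.gauss(0, 1) for _ in range(n_students - stay)]
        # each measurement adds independent test noise on top of ability
        year1_va.append(sum(school_effect + a + random.gauss(0, 0.5) for a in abilities) / n_students)
        year2_va.append(sum(school_effect + a + random.gauss(0, 0.5) for a in abilities2) / n_students)
    mx, my = sum(year1_va) / n_schools, sum(year2_va) / n_schools
    cov = sum((x - mx) * (y - my) for x, y in zip(year1_va, year2_va))
    var_x = sum((x - mx) ** 2 for x in year1_va)
    var_y = sum((y - my) ** 2 for y in year2_va)
    return cov / (var_x * var_y) ** 0.5

low_attrition = va_correlation(attrition=0.08)    # roughly Trento's 8% attrition
high_attrition = va_correlation(attrition=0.21)   # roughly Valle d'Aosta's 21%
print(low_attrition, high_attrition)
```

With these assumed parameters, the low-attrition schools’ two VA estimates are noticeably more correlated than the high-attrition schools’, for exactly the reason the authors describe: the more of a school’s tested students who turn over, the less the second measurement has in common with the first.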

Splits, Rotations, and Other Consequences of Teaching in a High-Stakes Environment in an Urban School

An Arizona teacher who teaches in a very urban, high-needs school writes about the realities of teaching there under the pressures that come along with high-stakes accountability, and with a teacher workforce and an administration both operating in chaos. This is a must-read, as she also describes two unintended consequences of educational reform in her school about which I had never heard before: splits and rotations. Both seem to occur at all costs simply to stay afloat during “rough” times, but both also likely have deleterious effects on the students in such schools, as well as on the teachers being held accountable for the students “they” teach.

She writes:

Last academic year (2012-2013) a new system for evaluating teachers was introduced into my school district. And it was rough. Teachers were dropping like flies. Some were stressed to the point of requiring medical leave. Others were labeled ineffective based on a couple classroom observations and were asked to leave. By mid-year, the school was down five teachers. And there were a handful of others who felt it was just a matter of time before they were labeled ineffective and asked to leave, too.

The situation became even worse when the long-term substitutes who had been brought in to cover those teacher-less classrooms began to leave also. Those students with no contracted teacher and no substitute began getting “split”. “Splitting” is what the administration of a school does in a desperate effort to put kids somewhere. And where the students go doesn’t seem to matter. A class roster is printed, and the first five students on the roster go to teacher A. The second five students go to teacher B, and so on. Grade-level isn’t even much of a consideration. Fourth graders get split to fifth grade classrooms. Sixth graders get split to 5th and 7th grade classrooms. And yes, even 7th and 8th graders get split to 5th grade classrooms. Was it difficult to have another five students in my class? Yes. Was it made more difficult that they weren’t even of the same grade level I was teaching? Yes. This went on for weeks…

And then the situation became even worse. As it became more apparent that the revolving door of long-term substitutes was out of control, the administration began “The Rotation.” “The Rotation” was a plan that used the contracted teachers (who remained!) as substitutes in those teacher-less classrooms. And so once or twice a week, I (and others) would get an email from the administration alerting me that it was my turn to substitute during prep time. Was it difficult to sacrifice 20-40 % of weekly prep time (that is used to do essential work like plan lessons, gather materials, grade, call parents, etc…) Yes. Was it difficult to teach in a classroom that had a different teacher, literally, every hour without coordinated lessons? Yes.

Despite this absurd scenario, in October 2013, I received a letter from my school district indicating how I fared in this inaugural year of the teacher evaluation system. It wasn’t good. Fifty percent of my performance label was based on school test scores (not on the test scores of my homeroom students). How well can students perform on tests when they don’t have a consistent teacher?

So when I think about accountability, I wonder now what it is I was actually held accountable for? An ailing, urban school? An ineffective leadership team who couldn’t keep a workforce together? Or was I just held accountable for not walking away from a no-win situation?

Coincidentally, this 2013-2014 academic year has, in many ways, mirrored the 2012-2013. The upside is that this year, only 10% of my evaluation is based on school-wide test scores (the other 40% will be my homeroom students’ test scores). This year, I have a fighting chance to receive a good label. One more year of an unfavorable performance label and the district will have to, by law, do something about me. Ironically, if it comes to that point, the district can replace me with a long-term substitute, who is not subject to the same evaluation system that I am. Moreover, that long-term substitute doesn’t have to hold a teaching certificate. Further, that long-term substitute will cost the district a lot less money in benefits (i.e. healthcare, retirement system contributions).

I should probably start looking for a job—maybe as a long-term substitute.

Teacher Won’t be Bullied by Alhambra (AZ) School Officials

Lisa Elliott, a National Board Certified Teacher (NBCT) who has devoted her 18-year professional career to the Alhambra Elementary School District — a Title I district (i.e., one in which at least 40% of the students come from low-income families) located in the Phoenix/Glendale area — expresses in this video how she refuses to be bullied by her district’s misuse of standardized test scores.

Approximately nine months ago she was asked to resign her teaching position by the district’s interim superintendent – Dr. Michael Rivera – due to her students’ low test scores for the 2013-2014 school year, despite her students exceeding expectations on other indicators of learning and achievement. She “respectfully declined” to submit her resignation letter for a number of reasons, including that her “children are more than a test score.” Unfortunately, however, other excellent teachers in her district simply left…

The (Relentless) Will to Quantify

An article was just published in the esteemed, peer-reviewed journal Teachers College Record titled “The Will to Quantify: The ‘Bottom Line’ in the Market Model of Education Reform,” authored by Leo Casey – Executive Director of the Albert Shanker Institute. I will summarize its key points here (1) for those of you without the time to read the whole article (albeit worth reading) and (2) just in case the link above does not work for those of you without subscriptions to Teachers College Record.

In this article, Casey reviews the case of New York and the state education department’s policy attempts to use New York teachers’ value-added data to reform the state’s public schools “in the image and likeness of competitive businesses.” Casey interrogates this state’s history within the current, market-based, corporate reform environment surrounding (and swallowing) America’s public schools, in New York and beyond.

Recall that New York is one of our states to watch, especially since Governor Cuomo’s election to office (see prior posts about New York here, here, and here). According to Casey, as demonstrated in this article, this is the state that best shows how “[t]he market model of education reform has become a prisoner to a Nietzschean will to quantify, in which the validity and reliability of the actual numbers is irrelevant.”

In New York, using the state’s large-scale standardized tests in English/language arts and mathematics, grades 3 through 8, teachers’ value-added data reports were first developed for approximately 18,000 teachers throughout the state for three school years: 2007-2010. The scores were constructed with assurances that they “would not be used for [teacher] evaluation purposes,” with the state department specifically identifying tenure decisions and annual rating processes as two areas where teachers’ value-added scores “would play no role.” At that time the department of education also took a “firm position” that these reports would not be disclosed or shared outside of the school community (i.e., with the public).

Soon thereafter, however, the department of education, “acting unilaterally,” began to use the scores in tenure decisions and, after a series of Freedom of Information requests, began to release the scores to the media, who in turn released them to the public at large. By February of 2012, teachers’ value-added scores had been published by all major New York media.

Recall these articles, primarily about the worst teachers in New York (see, for example, here, here, and here), and recall the story of Pascale Mauclair – a sixth-grade teacher in Queens who was “pilloried” in the New York Post as the city’s “worst teacher” based solely on her value-added reports. After a more thorough investigation, however, “Mauclair proved to be an excellent teacher who had the unqualified support of her school, one of the best in the city: her principal declared without hesitation or qualification that she would put her own child in Mauclair’s class, and her colleagues met Mauclair with a standing ovation when she returned to the school after the Post’s attack. Mauclair’s undoing had been her own dedication to teaching students with the greatest needs. As a teacher of English as a Second Language, she had taken on the task of teaching small self-contained classes of recent immigrants for the last five years.”

Nonetheless, the state department of education continued (and continues) to produce data for New York teachers “with a single year of test score data, and sample sizes as low as 10…When students did not have a score in a previous year, scores were statistically ‘imputed’ to them in order to produce a basis for making a growth measure.”

These scores had, and often continue to have (also across states), “average confidence intervals of 60 to 70 percentiles for a single-year estimate. On a distribution that went from 0 to 99, the average margins of error in the [New York scores] were, by the [state department of education’s] own calculations, 35 percentiles for Math and 53 percentiles for English Language Arts. One-third of all [scores], the [department] conceded, were completely unreliable—that is, so imprecise as to not warrant any confidence in them. The sheer magnitude of these numbers takes us into the realm of the statistically surreal.” Yet the state continues to this day in its efforts to use these data despite the gross statistical and consequential human errors present.
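The instability behind margins of error this wide can be illustrated with a small simulation. All of the numbers below are hypothetical (the true-effect and noise magnitudes are assumptions for illustration, not the state’s actual parameters); the sketch simply shows how a one-year estimate from a small class can fling the very same average teacher across huge swaths of the percentile distribution:

```python
import random

random.seed(0)

# Assumed magnitudes, for illustration only: true teacher effects with a
# standard deviation of 5 scale-score points, individual student noise
# with a standard deviation of 30 points, and classes of 10 students.
TEACHER_SD, STUDENT_SD, N_STUDENTS, N_TEACHERS = 5, 30, 10, 1000

def one_year_estimate(true_effect):
    # A single-year estimate: the mean of 10 noisy student residuals.
    return sum(true_effect + random.gauss(0, STUDENT_SD)
               for _ in range(N_STUDENTS)) / N_STUDENTS

# Build a distribution of estimates for 1,000 teachers.
true_effects = [random.gauss(0, TEACHER_SD) for _ in range(N_TEACHERS)]
estimates = [one_year_estimate(t) for t in true_effects]

# Re-estimate one exactly-average teacher (true effect = 0) 200 times
# and record where each estimate lands as a percentile rank.
ranks = []
for _ in range(200):
    e = one_year_estimate(0.0)
    ranks.append(100 * sum(x < e for x in estimates) / N_TEACHERS)

print(min(ranks), max(ranks))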

This is, in the words of Casey, “a demonstration of [extreme] professional malpractice in the realm of testing.” Yet for education reformers like Governor Cuomo, as well as “Michael Bloomberg, Joel Klein, and a cohort of similarly minded education reformers across the United States, the fundamental problem with American public education is that it has been organized as a monopoly that is not subject to the discipline of the marketplace. The solution to all that ails public schools, therefore, is to remake them in the image and likeness of a competitive business. Just as private businesses rise and fall on their ability to compete in the marketplace, as measured by the ‘bottom line’ of their profit balance sheet, schools need to live or die on their ability to compete with each other, based on an educational ‘bottom line.’ If ‘bad’ schools die and new ‘good’ schools are created in their stead, the productivity of education improves. But to undertake this transformation and to subject schools to market discipline, an educational ‘bottom line’ must be established. Standardized testing and value-added measures of performance based on standardized testing provide that ‘bottom line.’”

Finally, some key findings from other studies Casey cites in this piece are also worth keeping in mind:

  • “A 2010 U.S. Department of Education study found that value-added measures in general have disturbingly high rates of error, with the use of three years of test data producing a 25% error rate in classifying teachers as above average, average, and below average and one year of test data yielding a 35% error rate.” Nothing much has changed in terms of error rates since, so this study still stands as one of the capstone pieces on this topic.
  • “New York University Professor Sean Corcoran has shown that it is hard to isolate a specific teacher effect from classroom factors and school effects using value-added measures, and that in a single year of test scores, it is impossible to distinguish the teacher’s impact. The fewer the years of data and the smaller the sample (the number of student scores), the more imprecise the value-added estimates.”
  • Also recall that “the tests in question [are/were] designed for another purpose: the measure of student performance, not teacher or school performance.” That is, these tests were designed and validated to measure student achievement, BUT they were never designed or validated for their current purposes/uses: to measure teacher effects on student achievement.
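The classification error rates cited in the first bullet above can be reproduced, in spirit, with a small simulation. All magnitudes here are hypothetical assumptions chosen for illustration (they are not the federal study’s parameters); the point is simply that when student-level noise dwarfs the spread of true teacher effects, sorting teachers into “above average” and “below average” from one year of scores goes wrong strikingly often:

```python
import random

random.seed(1)

# Assumed magnitudes, for illustration: true teacher effects with SD 5
# points, student noise with SD 30 points, classes of 25 students.
TEACHER_SD, STUDENT_SD, CLASS_SIZE, N = 5, 30, 25, 2000

true = [random.gauss(0, TEACHER_SD) for _ in range(N)]
# A one-year estimate is the true effect plus the noise of a class mean.
noisy = [t + random.gauss(0, STUDENT_SD / CLASS_SIZE ** 0.5) for t in true]

# How often does the one-year estimate put a teacher in the wrong half
# (above vs. below average) relative to his or her true effect?
misclassified = sum((t > 0) != (e > 0) for t, e in zip(true, noisy))
rate = round(100 * misclassified / N)
print(rate)  # percent of teachers placed in the wrong half
```

With these assumed magnitudes the misclassification rate lands in the same neighborhood as the one-year error rates the studies report; shrinking the noise (more years of data, larger classes) shrinks it, which is exactly the sample-size point Corcoran makes.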

Playing Fair: Factors that Influence VAM for Special Education Teachers

As you all know, value-added models (VAMs) are intended to measure a teacher’s effectiveness. By comparing students’ actual test scores to the scores predicted for them, and attributing the difference (the “value added”) to their educators, VAMs attempt to isolate the teacher’s impact on student achievement. VAMs focus on individual student progress from one testing period to the next, sometimes without considering past learning, peer influence, family environment or individual ability, depending on the model.
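To make the basic mechanics concrete, here is a deliberately minimal sketch of the gain-score intuition behind VAMs, using made-up numbers (the scores, the prediction rule, and its coefficients are all hypothetical; real models fit a regression over thousands of students and, depending on the model, add further controls):

```python
# Hypothetical prior-year and current-year scale scores for one
# teacher's five students.
prior = [410, 450, 470, 500, 530]
current = [430, 455, 495, 505, 550]

# Predict each current score from the prior score with a simple linear
# rule (coefficients assumed, for illustration only).
slope, intercept = 1.0, 10.0
predicted = [slope * p + intercept for p in prior]

# The "value added" is the average gap between actual and predicted
# scores, attributed to the teacher.
residuals = [c - p for c, p in zip(current, predicted)]
value_added = sum(residuals) / len(residuals)
print(round(value_added, 1))  # → 5.0
```

Even in this toy version, notice how much the estimate depends on what the model chooses to control for: change the prediction rule, and the same five students yield a different “teacher effect.”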

Teachers, administrators and experts have debated VAM reliability and validity, but not often mentioned is the controversy regarding the use of VAMs for teachers of special education students. Why is this so controversial? Because students with disabilities are often educated in general education classrooms yet generally score lower on standardized tests – tests that they often should not be taking in the first place. Accordingly, holding teachers of special education students accountable for their performance is uniquely problematic. For example, many special education students are in mainstream classrooms, with co-teaching provided by both special and general education teachers; hence, special education programs can present challenges to VAMs that are meant to measure straightforward progress.

Co-teaching Complexities

Research like “Co-Teaching: An Illustration of the Complexity of Collaboration in Special Education” outlines some of the specific challenges that teachers of special education can face when co-teaching is involved. But essentially, co-teaching is a partnership between a general and a special education teacher, who jointly instruct a group of students, including those with special needs and disabilities. The intent is to provide special education students with access to the general curriculum while receiving more specialized instruction to support their learning.

Accordingly, collaboration is key to successful co-teaching. Teams that demonstrate lower levels of collaboration tend to struggle more, while successful co-teaching teams share their expertise to motivate students. However, special education teachers often report differences in teaching styles that lead to conflict; they often feel relegated to the role of classroom assistant, rather than full teaching partner. This also has implications for VAMs.

For example, student outcomes from co-teaching vary. A 2002 study by Rea, McLaughlin and Walther-Thomas found that students with learning disabilities in co-taught classes had better attendance and report card grades, but no better performance on standardized tests. Another report showed that test scores for students with and without disabilities were not affected by co-teaching (Idol, 2006).

A 2014 study by the Carnegie Foundation for the Advancement of Teaching points out another issue that can make co-teaching more difficult in special education settings: it can be difficult to determine value-added because it is hard to separate the co-teachers’ contributions. The authors also assert that calculating value-added would be more accurate if the models used more detailed data about disability status, services rendered, and past and present accommodations, but many states do not collect these data (Buzick, 2014) – and even if they did, there is no real certainty that this would work.

Likewise, inclusion brings special education students into the general classroom, eliminating boundaries between special education students and general education peers. However, special education teachers often voice opposition to general education inclusion as it relates to VAMs.

According to “Value-Added Modeling: Challenges for Measuring Special Education Teacher Quality” (Lawson, 2014) some of the specific challenges cited include:

  • When students with disabilities spend the majority of their day in general education classrooms, special education teacher effectiveness is distorted.
  • Quality special education instruction can be hindered by poor general education instruction.
  • Students may be pulled out of class for additional services, which makes it difficult to maintain progress and pace.
  • Multiple teachers often provide instruction to special education students, so each teacher’s impact is difficult to assess.
  • When special education teachers assist general education classrooms, their impact is not measured by VAMs.

And along with the complexities involved in teaching students with disabilities, special education teachers deal with a number of constraints that cut into instructional time and affect VAM estimates. They handle more paperwork, including Individualized Education Plans (IEPs) that take time to write and review. In addition, they must manage extensive curriculum and lesson planning, handle parent communication, keep up with special education laws and coordinate with general education teachers. While their priority may be to fully support each student’s learning and achievement, that is not always possible. And not everything special education teachers do can be appropriately captured using tests.

These are but a few reasons that special education teachers should question the fairness of VAMs.


This is a guest post from Umari Osgood who works at Bisk Education and writes on behalf of University of St. Thomas online programs.