Special Issue of “Educational Researcher” (Paper #6 of 9): VAMs as Tools for “Egg-Crate” Schools

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of these nine articles (#6 of 9), which is actually an essay, titled “Will VAMS Reinforce the Walls of the Egg-Crate School?” This essay is authored by Susan Moore Johnson – Professor of Education at Harvard and somebody whom I had the privilege of interviewing in the past as an esteemed member of the National Academy of Education (see interviews here and here).

In this article, Moore Johnson argues that when policymakers use VAMs to evaluate, reward, or dismiss teachers, they may be perpetuating an egg-crate model, which is (referencing Tyack (1974) and Lortie (1975)) a metaphor for the compartmentalized school structure in which teachers (and students) work, most often in isolation. This model ultimately undermines the efforts of all involved in the work of schools to build capacity school wide, and to excel as a school given educators’ individual and collective efforts.

Contrary to the primary logic supporting VAM use, however, “teachers are not inherently effective or ineffective” on their own. Rather, their collective effectiveness is related to their professional development, which may be stunted when they work alone, “without the benefit of ongoing collegial influence” (p. 119). VAMs then, and unfortunately, can cause teachers and administrators to (hyper)focus “on identifying, assigning, and rewarding or penalizing individual [emphasis added] teachers for their effectiveness in raising students’ test scores [which] depends primarily on the strengths of individual teachers” (p. 119). What comes along with this, then, is a series of interrelated egg-crate behaviors including, but not limited to, increased competition, lack of collaboration, and increased independence versus interdependence, all of which can, in effect, lead to decreased morale and decreased effectiveness.

Inversely, students are much “better served when human resources are deliberately organized to draw on the strengths of all teachers on behalf of all students, rather than having students subjected to the luck of the draw in their classroom assignment[s]” (p. 119). Likewise, “changing the context in which teachers work could have important benefits for students throughout the school, whereas changing individual teachers without changing the context [as per VAMs] might not [work nearly as well] (Lohr, 2012)” (p. 120). Teachers learning from their peers, working in teams, teaching in teams, co-planning, collaborating, learning via mentoring by more experienced teachers, learning by mentoring, and the like should be much more valued, as warranted by the research, yet they are not, given the very nature of VAM use.

Hence, there are unintended consequences that can also come along with the (hyper)use of individual-level VAMs. These include, but are not limited to: (1) Teachers who are more likely to “literally or figuratively ‘close their classroom door’ and revert to working alone…[This]…affect[s] current collaboration and shared responsibility for school improvement, thus reinforcing the walls of the egg-crate school” (p. 120); (2) Due to bias, or that teachers might be unfairly evaluated given the types of students non-randomly assigned into their classrooms, teachers might avoid teaching high-needs students if teachers perceive themselves to be “at greater risk” of teaching students they cannot grow; (3) This can perpetuate isolative behaviors, as well as behaviors that encourage teachers to protect themselves first, and above all else; (4) “Therefore, heavy reliance on VAMS may lead effective teachers in high-need subjects and schools to seek safer assignments, where they can avoid the risk of low VAMS scores[; (5) M]eanwhile, some of the most challenging teaching assignments would remain difficult to fill and likely be subject to repeated turnover, bringing steep costs for students” (p. 120); While (6) “using VAMS to determine a substantial part of the teacher’s evaluation or pay [also] threatens to sidetrack the teachers’ collaboration and redirect the effective teacher’s attention to the students on his or her roster” (pp. 120-121) versus students, for example, on other teachers’ rosters who might also benefit from other teachers’ content-area or other expertise. Likewise, (7) “Using VAMS to make high-stakes decisions about teachers also may have the unintended effect of driving skillful and committed teachers away from the schools that need them most and, in the extreme, causing them to leave the profession” in the end (p. 121).

I should add, though, and in all fairness given the Review of Paper #3 – on VAMs’ potentials here, many of these aforementioned assertions are somewhat hypothetical in the sense that they are based on the grander literature surrounding teachers’ working conditions, versus the direct, unintended effects of VAMs, given no research yet exists to examine the above, or other unintended effects, empirically. “There is as yet no evidence that the intensified use of VAMS interferes with collaborative, reciprocal work among teachers and principals or sets back efforts to move beyond the traditional egg-crate structure. However, the fact that we lack evidence about the organizational consequences of using VAMS does not mean that such consequences do not exist” (p. 123).

The bottom line is that we do not want to prevent the school organization from becoming “greater than the sum of its parts…[so that]…the social capital that transforms human capital through collegial activities in schools [might increase] the school’s overall instructional capacity and, arguably, its success” (p. 118). Hence, as Moore Johnson argues, we must adjust the focus “from the individual back to the organization, from the teacher to the school” (p. 118), and from the egg-crate back to a much more holistic and realistic model capturing what it means to be an effective school, and what it means to be an effective teacher as an educational professional within one. “[A] school would do better to invest in promoting collaboration, learning, and professional accountability among teachers and administrators than to rely on VAMS scores in an effort to reward or penalize a relatively small number of teachers” (p. 122).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; and see the Review of Article #5 – on teachers’ perceptions of observations and student growth here.

Article #6 Reference: Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117-126. doi:10.3102/0013189X15573351

Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of these nine articles (#4 of 9) here, titled “Make Room Value-Added: Principals’ Human Capital Decisions and the Emergence of Teacher Observation Data.” This one is authored by Ellen Goldring, Jason A. Grissom, Christine Neumerski, Marisa Cannata, Mollie Rubin, Timothy Drake, and Patrick Schuermann, all of whom are associated with Vanderbilt University.

This article is primarily about (1) the extent to which the data generated by “high-quality observation systems” can inform principals’ human capital decisions (e.g., teacher hiring, contract renewal, assignment to classrooms, professional development), and (2) the extent to which principals are relying less on test scores derived via value-added models (VAMs), when making the same decisions, and why. Here are some of their key (and most important, in my opinion) findings:

  • Principals across all school systems revealed major hesitations and challenges regarding the use of VAM output for human capital decisions. Barriers preventing VAM use included the timing of data availability (e.g., the fall), which is well after human capital decisions are made (p. 99).
  • VAM output are too far removed from the practice of teaching (p. 99), and this lack of instructional sensitivity impedes, if not entirely prevents, their actual versus hypothetical use for school/teacher improvement.
  • “Principals noted they did not really understand how value-added scores were calculated, and therefore they were not completely comfortable using them” (p. 99). Likewise, principals reported that because teachers did not understand how the systems worked, teachers did not use VAM output data either (p. 100).
  • VAM output are not transparent when used to determine compensation, and especially when used to evaluate teachers teaching nontested subject areas. In districts that use school-wide VAM output to evaluate teachers in nontested subject areas, in fact, principals reported regularly ignoring VAM output altogether (pp. 99-100).
  • “Principals reported that they perceived observations to be more valid than value-added measures” (p. 100); hence, principals reported using observational output much more, again, in terms of making human capital decisions and making such decisions “valid” (p. 100).
  • “One noted exception to the use of value-added scores seemed to be in the area of assigning teachers to particular grades, subjects, and classes. Many principals mentioned they use value-added measures to place teachers in tested subjects and with students in grade levels that ‘count’ for accountability purpose…some principals [also used] VAM [output] to move ineffective teachers to untested grades, such as K-2 in elementary schools and 12th grade in high schools” (p. 100).

Of special note here is also the following finding: “In half of the systems [in which researchers investigated these systems], there [was] a strong and clear expectation that there be alignment between a teacher’s value-added growth score and observation ratings…Sometimes this was a state directive and other times it was district-based. In some systems, this alignment is part of the principal’s own evaluation; principals receive reports that show their alignment” (p. 101). In other words, principals are being evaluated and held accountable given the extent to which their observations of their teachers match their teachers’ VAM-based data. If misalignment is noticed, it is not to be the fault of either measure (e.g., in terms of measurement error); rather, it is to be the fault of the principal, who is critiqued for inaccuracy and therefore (inversely) incentivized to skew the observational data (the only data over which the supervisor has control) to artificially match VAM-based output. This clearly distorts validity, or rather the validity of the inferences that are to be made using such data. Appropriately, principals also “felt uncomfortable [with this] because they were not sure if their observation scores should align primarily…with the VAM” output (p. 101).

“In sum, the use of observation data is important to principals for a number of reasons: It provides a ‘bigger picture’ of the teacher’s performance, it can inform individualized and large group professional development, and it forms the basis of individualized support for remediation plans that serve as the documentation for dismissal cases. It helps principals provide specific and ongoing feedback to teachers. In some districts, it is beginning to shape the approach to teacher hiring as well” (p. 102).

The only significant weakness, again in my opinion, with this piece is that the authors write that these observational data, at focus in this study, are “new,” thanks to recent federal initiatives. They write, for example, that “data from structured teacher observations—both quantitative and qualitative—constitute a new [emphasis added] source of information principals and school systems can utilize in decision making” (p. 96). They are also “beginning to emerge [emphasis added] in the districts…as powerful engines for principal data use” (p. 97). I would beg to differ, as these systems have not changed much over time, pre and post these federal initiatives, as claimed (without evidence or warrant) by these authors herein. See, for example, Table 1 on p. 98 of the article to determine whether what they have included within the list of components of such new and “complex, elaborate teacher observation systems” is actually new or much different than most of the observational systems in use prior. As an aside, one such system in use and of issue in this examination is one with which I am familiar, in use in the Houston Independent School District. Click here to also see whether this system is more “complex” or “elaborate” over and above such systems prior.

Also recall that one of the key reports that triggered the current call for VAMs, as the “more objective” measures needed to measure and therefore improve teacher effectiveness, was based on data that suggested that “too many teachers” were being rated as satisfactory or above. The observational systems in use then are essentially the same observational systems still in use today (see “The Widget Effect” report here). This is in stark contradiction to the authors’ claims throughout this piece, for example, when they write that “Structured teacher observations, as integral components of teacher evaluations, are poised to be a very powerful lever for changing principal leadership and the influence of principals on schools, teachers, and learning.” This counters all that came from “The Widget Effect” report here.


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; and see the Review of Article #3 – on VAMs’ potentials here.

Article #4 Reference: Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. doi:10.3102/0013189X15575031

Victory in Court: Consequences Attached to VAMs Suspended Throughout New Mexico

Great news for New Mexico and New Mexico’s approximately 23,000 teachers, and great news for states and teachers potentially elsewhere, in terms of setting precedent!

Late yesterday, state District Judge David K. Thomson, who presided over the ongoing teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), is unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here).

Thereafter, one scheduled day of testimonies turned into five in Santa Fe, running from the end of September through the beginning of October (each of which I covered here, here, here, here, and here). I served as the expert witness for the plaintiffs’ side, along with other witnesses including lawmakers (e.g., a state senator) and educators (e.g., teachers, superintendents) who made various (and very articulate) claims about the state’s teacher evaluation system on the stand. Thomas Kane served as the expert witness for the defendant’s side, along with other witnesses including lawmakers and educators who made counter claims about the system, some of which backfired, unfortunately for the defense, primarily during cross-examination.

See articles released about this ruling this morning in the Santa Fe New Mexican (“Judge suspends penalties linked to state’s teacher eval system”) and the Albuquerque Journal (“Judge curbs PED teacher evaluations”). See also the AFT’s press release, written by AFT President Randi Weingarten, here. Click here for the full 77-page Order written by Judge Thomson (see also, below, five highlights I pulled from this Order).

The journalist of the Santa Fe New Mexican, though, provided the most detailed information about Judge Thomson’s Order, writing, for example, that the “ruling by state District Judge David Thomson focused primarily on the complicated combination of student test scores used to judge teachers. The ruling [therefore] prevents the Public Education Department [PED] from denying teachers licensure advancement or renewal, and it strikes down a requirement that poorly performing teachers be placed on growth plans.” In addition, the Judge noted that “the teacher evaluation system varies from district to district, which goes against a state law calling for a consistent evaluation plan for all educators.”

The PED continues to stand by its teacher evaluation system, calling the court challenge “frivolous” and “a legal PR stunt,” all the while noting that Judge Thomson’s decision “won’t affect how the state conducts its teacher evaluations.” Indeed it will, for now and until the state’s teacher evaluation system is vetted and validated, and “the court” is “assured” that the system can actually be used to take the “consequential actions” against teachers “required” by the state’s PED.

Here are some other highlights that I took directly from Judge Thomson’s ruling, capturing what I viewed as his major areas of concern about the state’s system (click here, again, to read Judge Thomson’s full Order):

  • Validation Needed: “The American Statistical Association says ‘estimates from VAM should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAM are used for high stake[s] purposes” (p. 1). These are the measures, assumptions, limitations, and the like that are to be made transparent in this state.
  • Uniformity Required: “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate” (p. 2)…with the existing statutory framework for teacher evaluations for licensure purposes requiring “that the teacher be evaluated for ‘competency’ against a ‘highly objective uniform statewide standard of evaluation’ to be developed by PED” (p. 4). “It is the term ‘highly objective uniform’ that is the subject matter of this suit” (p. 4), whereby the state and no other “party provided [or could provide] the Court a total calculation of the number of available district-specific plans possible given all the variables” (p. 54). See also the Judge’s points #78-#80 (starting on page 70) for some of the factors that helped to “establish a clear lack of statewide uniformity among teachers” (p. 70).
  • Transparency Missing: “The problem is that it is not easy to pull back the curtain, and the inner workings of the model are not easily understood, translated or made accessible” (p. 2). “Teachers do not find the information transparent or accurate” and “there is no evidence or citation that enables a teacher to verify the data that is the content of their evaluation” (p. 42). In addition, “[g]iven the model’s infancy, there are no real studies to explain or define the [s]tate’s value-added system…[hence, the consequences and decisions]…that are to be made using such system data should be examined and validated prior to making such decisions” (p. 12).
  • Consequences Halted: “Most significant to this Order, [VAMs], in this [s]tate and others, are being used to make consequential decisions…This is where the rubber hits the road [as per]…teacher employment impacts. It is also where, for purposes of this proceeding, the PED departs from the statutory mandate of uniformity requiring an injunction” (p. 9). In addition, it should be noted that indeed “[t]here are adverse consequences to teachers short of termination” (p. 33) including, for example, “a finding of ‘minimally effective’ [that] has an impact on teacher licenses” (p. 41). These, too, are to be halted under this injunction Order.
  • Clarification Required: “[H]ere is what this [O]rder is not: This [O]rder does not stop the PED’s operation, development and improvement of the VAM in this [s]tate, it simply restrains the PED’s ability to take consequential actions…until a trial on the merits is held” (p. 2). In addition, “[a] preliminary injunction differs from a permanent injunction, as does the factors for its issuance…’ The objective of the preliminary injunction is to preserve the status quo [minus the consequences] pending the litigation of the merits. This is quite different from finally determining the cause itself” (p. 74). Hence, “[t]he court is simply enjoining the portion of the evaluation system that has adverse consequences on teachers” (p. 75).

The PED also argued that “an injunction would hurt students because it could leave in place bad teachers.” As per Judge Thomson, “That is also a faulty argument. There is no evidence that temporarily halting consequences due to the errors outlined in this lengthy Opinion more likely results in retention of bad teachers than in the firing of good teachers” (p. 75).

Finally, given my involvement in this lawsuit and given the team with whom I was/am still so fortunate to work (see picture below), including all of those who testified as part of the team and whose testimonies clearly proved critical in Judge Thomson’s final Order, I want to thank everyone for all of their time, energy, and efforts in this case, thus far, on behalf of the educators attempting to (still) do what they love to do — teach and serve students in New Mexico’s public schools.


Left to right: (1) Stephanie Ly, President of AFT New Mexico; (2) Dan McNeil, AFT Legal Department; (3) Ellen Bernstein, ATF President; (4) Shane Youtz, Attorney at Law; and (5) me 😉

Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation

States of attention on this blog, and often of (dis)honorable mention as per their state-level policies bent on value-added models (VAMs), include Florida, New York, Tennessee, and New Mexico. As for a quick update about the latter state of New Mexico, we are still waiting to hear the final decision from the judge who recently heard the state-level lawsuit still pending on this matter in New Mexico (see prior posts about this case here, here, here, here, and here).

Another locale of great interest, though, is the Houston Independent School District. This is the seventh largest urban school district in the nation, and the district that has tied more high-stakes consequences to its value-added output than any other district/state in the nation. These “initiatives” were “led” by soon-to-resign/retire Superintendent Terry Grier who, during his time in Houston (2009-2015), implemented some of the harshest consequences ever attached to teacher-level value-added output, as per the district’s use of the Education Value-Added Assessment System (EVAAS) (see other posts about the EVAAS here, here, and here; see other posts about Houston here, here, and here).

In fact, the EVAAS is still used throughout Houston today to evaluate all EVAAS-eligible teachers, and to also “reform” the district’s historically low-performing schools, by tying teachers’ purported value-added performance to teacher improvement plans, merit pay, nonrenewal, and termination (e.g., 221 Houston teachers were terminated “in large part” due to their EVAAS scores in 2011). However, pending litigation (i.e., this is the district in which the American Federation of Teachers (AFT) and the Houston Federation of Teachers (HFT) are currently suing the district for its wrongful use of, and overemphasis on, this particular VAM; see here), Superintendent Grier and the district have recoiled on some of the high-stakes consequences they formerly attached to the EVAAS. This particular lawsuit is to commence this spring/summer.

Nonetheless, my most recent post about Houston was about some of its future school board candidates, who were invited by The Houston Chronicle to respond to Superintendent Grier’s teacher evaluation system. For the most part, those who responded did so unfavorably, especially as the evaluation system was/is disproportionately reliant on teachers’ EVAAS data and the high-stakes use of these data in particular (see here).

Most recently, however, as per a “split” decision registered by Houston’s current school board (i.e., 4:3, and without any new members elected last November), Superintendent Grier received a $98,600 bonus for his “satisfactory evaluation” as the school district’s superintendent. See more from the full article published in The Houston Chronicle. As per the same article, Superintendent “Grier’s base salary is $300,000, plus $19,200 for car and technology allowances. He also is paid for unused leave time.”

More importantly, take a look at the two figures below, taken from actual district reports (see references below), highlighting Houston’s performance (declining, on average, in blue) as compared to the state of Texas (maintaining, on average, in black), to determine for yourself whether Superintendent Grier, indeed, deserved such a bonus (not to mention salary).

Another question to ponder is whether the district’s use of the EVAAS value-added system, especially since Superintendent Grier’s arrival in 2009, is actually reforming the school district as he and other district leaders have for so long now intended.

Figure 1

Figure 1. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR tests, 2012-2015 (HISD, 2015a)

Figure 2

Figure 2. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR End-of-Course (EOC) tests, 2012-2015 (HISD, 2015b)


Houston Independent School District (HISD). (2015a). State of Texas Assessments of Academic Readiness (STAAR) performance, grades 3-8, spring 2015. Retrieved here.

Houston Independent School District (HISD). (2015b). State of Texas Assessments of Academic Readiness (STAAR) end-of-course results, spring 2015. Retrieved here.

The Nation’s “Best Test” Scores Released: Test-Based Policies (Evidently) Not Working

From Diane Ravitch’s Blog (click here for direct link):

Sometimes events happen that seem to be disconnected, but after a few days or weeks, the pattern emerges. Consider this: On October 2, [U.S.] Secretary of Education Arne Duncan announced that he was resigning and planned to return to Chicago. Former New York Commissioner of Education John King, who is a clone of Duncan in terms of his belief in testing and charter schools, was designated to take Duncan’s place. On October 23, the Obama administration held a surprise news conference to declare that testing was out of control and should be reduced to not more than 2% of classroom time [see prior link on this announcement here]. Actually, that wasn’t a true reduction, because 2% translates into between 18-24 hours of testing, which is a staggering amount of annual testing for children in grades 3-8 and not different from the status quo in most states.

Disconnected events?

Not at all. Here comes the pattern-maker: the federal testing program called the National Assessment of Educational Progress [NAEP] released its every-other-year report card in reading and math, and the results were dismal. There would be many excuses offered, many rationales, but the bottom line: the NAEP scores are an embarrassment to the Obama administration (and the George W. Bush administration that preceded it).

For nearly 15 years, Presidents Bush and Obama and the Congress have bet billions of dollars—both federal and state—on a strategy of testing, accountability, and choice. They believed that if every student was tested in reading and mathematics every year from grades 3 to 8, test scores would go up and up. In those schools where test scores did not go up, the principals and teachers would be fired and replaced. Where scores didn’t go up for five years in a row, the schools would be closed. Thousands of educators were fired, and thousands of public schools were closed, based on the theory that sticks and carrots, rewards and punishments, would improve education.

But the 2015 NAEP scores released today by the National Assessment Governing Board (a federal agency) showed that Arne Duncan’s $4.35 billion Race to the Top program had flopped. It also showed that George W. Bush’s No Child Left Behind was as phony as the “Texas education miracle” of 2000, which Bush touted as proof of his education credentials.

NAEP is an audit test. It is given every other year to samples of students in every state and in about 20 urban districts. No one can prepare for it, and no one gets a grade. NAEP measures the rise or fall of average scores for states in fourth grade and eighth grade in reading and math and reports them by race, gender, disability status, English language ability, economic status, and a variety of other measures.

The 2015 NAEP scores showed no gains nationally in either grade in either subject. In mathematics, scores declined in both grades, compared to 2013. In reading, scores were flat in grade 4 and lower in grade 8. Usually the Secretary of Education presides at a press conference where he points with pride to increases in certain grades or in certain states. Two years ago, Arne Duncan boasted about the gains made in Tennessee, which had won $500 million in Duncan’s Race to the Top competition. This year, Duncan had nothing to boast about.

In his Race to the Top program, Duncan made testing the primary purpose of education. Scores had to go up every year, because the entire nation was “racing to the top.” Only 12 states won a share of the $4.35 billion that Duncan was given by Congress: Tennessee and Delaware were first to win, in 2010. The next round, the following states won multi-millions of federal dollars to double down on testing: Maryland, Massachusetts, the District of Columbia, Florida, Georgia, Hawaii, New York, North Carolina, Ohio, and Rhode Island.

Tennessee, Duncan’s showcase state in 2013, made no gains in reading or mathematics, in either fourth grade or eighth grade. The black-white test score gap was as large in 2015 as it had been in 1998, before either NCLB or the Race to the Top.

The results in mathematics were bleak across the nation, in both grades 4 and 8. The declines nationally were only 1 or 2 points, but they were significant in a national assessment on the scale of NAEP.

In fourth grade mathematics, the only jurisdictions to report gains were the District of Columbia, Mississippi, and the Department of Defense schools. Sixteen states had significant declines in their math scores, and thirty-three were flat in relation to 2013 scores. The scores in Tennessee (the $500 million winner) were flat.

In eighth grade, the lack of progress in mathematics was universal. Twenty-two states had significantly lower scores than in 2013, while thirty states or jurisdictions had flat scores. Pennsylvania, Kansas, and Florida (a Race to the Top winner) were the biggest losers, each dropping six points. Among the states that declined by four points were Race to the Top winners Ohio, North Carolina, and Massachusetts. Maryland, Hawaii, New York, and the District of Columbia lost two points. The scores in Tennessee were flat.

The District of Columbia made gains in fourth grade reading and mathematics, but not in eighth grade. It continues to have the largest score gap (56 points) between white and black students of any urban district in the nation. That is more than double the average of the other 20 urban districts. The state with the biggest achievement gap between black and white students is Wisconsin; it is also the state where black students have the lowest scores, lower than their peers in states like Mississippi and South Carolina. Wisconsin has invested heavily in vouchers and charter schools, which Governor Scott Walker intends to increase.

The best single word to describe NAEP 2015 is stagnation. Contrary to President George W. Bush’s law, many children have been left behind by the strategy of test-and-punish. Contrary to the Obama administration’s Race to the Top program, the mindless reliance on standardized testing has not brought us closer to some mythical “Top.”

No wonder Arne Duncan is leaving Washington. There is nothing to boast about, and the next set of NAEP results won’t be published until 2017. The program that he claimed would transform American education has not raised test scores, but it has demoralized educators and created teacher shortages. Disgusted with the testing regime, experienced teachers leave and enrollments in teacher education programs fall. One can only dream about what the Obama administration might have accomplished had it spent that $4.35 billion in discretionary dollars to encourage states and districts to develop and implement realistic plans for desegregation of their schools, or had it invested the same amount of money in the arts.

The past dozen or so years have been a time when “reformers” like Arne Duncan, Michelle Rhee, Joel Klein, and Bill Gates proudly claimed that they were disrupting school systems and destroying the status quo. Now the “reformers” have become the status quo, and we have learned that disruption is not good for children or education.

Time is running out for this administration, and it is not likely that there will be any meaningful change of course in education policy. One can only hope that the next administration learns important lessons from the squandered resources and failure of NCLB and Race to the Top.

“Value-Less” Value-Added Data

Peter Greene, a veteran English teacher in Pennsylvania, a state using its own version of the Education Value-Added Assessment System (EVAAS), wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread, yet least research-supported, claims about using the data derived via value-added models (VAMs) to inform and improve what teachers do in their classrooms.

Greene incisively critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on – well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist – you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no – all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line – we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions, along with instructional videos that seem to assume the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) with explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”
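For readers unfamiliar with the mechanics Greene is mocking, the core of the growth calculation he describes can be sketched in a few lines. This is a deliberately simplified illustration, not the actual (proprietary) PVAAS/EVAAS model; all student names and scores below are hypothetical:

```python
# Simplified sketch of growth-model attribution, NOT the actual PVAAS
# algorithm (which is proprietary). All student data here are hypothetical.

def growth_residuals(students):
    """For each student, compare actual vs. projected score; the mean
    residual is what systems like PVAAS credit to (or blame on) the teacher."""
    residuals = {
        name: actual - projected
        for name, (projected, actual) in students.items()
    }
    teacher_effect = sum(residuals.values()) / len(residuals)
    return residuals, teacher_effect

# Hypothetical class: projected ("alternate universe") vs. actual scores.
students = {
    "A": (1500, 1520),  # beat the projection: "showed growth"
    "B": (1450, 1440),  # missed the projection: "no growth"
    "C": (1600, 1600),  # met the projection exactly
}

residuals, effect = growth_residuals(students)
print(residuals)         # {'A': 20, 'B': -10, 'C': 0}
print(round(effect, 2))  # 3.33
```

The point of the sketch is the last step: whatever the gap between projected and actual scores, models like these attribute the average of those gaps to the teacher, regardless of what else might explain it.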

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video out before, and following up on another post about the potential impact such videos can really have, I thought now is an appropriate time to share it. “We can be the change,” and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District, more specifically four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools, post the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Article on the “Heroic” Assumptions Surrounding VAMs Published (and Currently Available) in TCR

My former doctoral student and I wrote a paper about the “heroic” assumptions surrounding VAMs. We titled it “‘Truths’ Devoid of Empirical Proof: Underlying Assumptions Surrounding Value-Added Models in Teacher Evaluation,” and it was just published in the esteemed Teachers College Record (TCR). It is also open and accessible for one week, for free, here. I have also pasted the abstract below for more information.


Despite the overwhelming and research-based concerns regarding value-added models (VAMs), VAM advocates, policymakers, and supporters continue to hold strong to VAMs’ purported, yet still largely theoretical strengths and potentials. Those advancing VAMs have, more or less, adopted and promoted an agreed-upon, albeit “heroic,” set of assumptions, without independent, peer-reviewed research in support. These “heroic” assumptions transcend promotional, policy, media, and research-based pieces, but they have never been fully investigated, explicated, or made explicit as a set or whole. These assumptions, though often violated, are often ignored in order to promote VAM adoption and use, and also to sell for-profits’ and sometimes non-profits’ VAM-based systems to states and districts. The purpose of this study was to make obvious the assumptions that have been made within the VAM narrative and that, accordingly, have often been accepted without challenge. Ultimately, sources for this study included 470 distinctly different written pieces, from both traditional and non-traditional sources. The results of this analysis suggest that the preponderance of sources propagating unfounded assertions are fostering a sort of VAM echo chamber that seems impenetrable by even the most rigorous and trustworthy empirical evidence.

Yong Zhao’s Stand-Up Speech

Yong Zhao — Professor in the Department of Educational Methodology, Policy, and Leadership at the University of Oregon — was a featured speaker at the recent annual conference of the Network for Public Education (NPE). He spoke about “America’s Suicidal Quest for Outcomes,” as in, test-based outcomes.

I strongly recommend you take almost an hour (i.e., 55 minutes) out of your busy day, sit back, and watch what is the closest thing to a stand-up speech I’ve ever seen. Zhao offers a poignant, but also very entertaining and funny, take on America’s public schools, surrounded by America’s public school politics and situated in America’s pop culture. The full transcription of Zhao’s speech, made available by Mercedes Schneider, is also posted here for any and all who wish to read it: Yong_Zhao NPE Transcript

Zhao speaks of democracy, and embraces the freedom of speech in America (v. China) that permits him to speak out. He explains why he pulled his son out of public school, thanks to No Child Left Behind (NCLB), yet he jokingly blames G. W. Bush for causing his son to live in his basement since college graduation. Hence, for Zhao, his son’s “readiness” to leave the basement is much more important than any other performance “readiness” measure being written into the plethora of educational policies (e.g., career and college readiness, pre-school readiness).

Zhao uses the extinction of Easter Island’s Rapa Nui civilization as an analogy for what may happen to us post-Race to the Top, given that both sets of people are/were driven by false hopes of the gods raining down prosperity on them, should they successfully compete for success and praise. As the Rapa Nui built monumental statues in their race to “the top” (literally), the unintended consequences that came about as a result (e.g., the exploitation of their natural resources) destroyed their civilization. Zhao argues the same thing is happening in our country, with test scores being the most sought-after monuments, again, despite the consequences.

Zhao calls for mandatory lists of the side effects that come along with standardized testing, similar to something I wrote years ago in an article titled “Buyer, Be Aware: The Value-Added Assessment Model is One Over-the-Counter Product that May Be Detrimental to Your Health.” In this article I pushed for a Food and Drug Administration (FDA) approach to educational research that would serve as a model to protect the intellectual health of the U.S.: a simple approach that legislators and education leaders would have to follow when passing legislation or educational policies whose benefits and risks are known, or unknown.

Finally, he calls on all educators (and educational policymakers) to continuously ask themselves one question when test scores rise: “What did you give up to achieve this rise in scores?” When you choose something, what do you lose?

Do give it a watch!

Mirror, Mirror on the Wall…

No surprise, again, but Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again (redundancy intended). I just posted about one of his recent articles published on the website of the Brookings Institution titled “Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?” after which I received another of his articles, this time published by the New York Daily News titled “Teachers Must Look in the Mirror.”

Embracing a fabled metaphor, while not meaning to position teachers as the wicked queen or Kane as Snow White, let us ask ourselves the classic question: “Who is the fairest one of all?” as we critically review yet another fairytale authored by Harvard’s Kane. He has, after all, “carefully studied the best systems for rating teachers” (see other prior posts about Kane’s public perspectives on VAMs here and here).

In this piece, Kane continues to advance a series of phantasmal claims about the potentials of VAMs, this time in the state of New York, where Governor Andrew Cuomo intends to move the state’s teacher evaluation system to one based 50% on teachers’ value-added, or effectively 100% on value-added in cases where a teacher rated as “ineffective” on his/her value-added score can be rated as “ineffective” overall. Here, value-added could be used to trump all else (see prior posts about this here and here).
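The rating rule just described (50% value-added normally, but value-added alone deciding the outcome for a teacher rated “ineffective” on that measure) can be made concrete with a small sketch; the cutoffs and 0-1 scale here are hypothetical simplifications, not New York’s actual scoring bands:

```python
# Hypothetical sketch of the weighting rule described above: value-added
# nominally counts for half the overall rating, but an "ineffective"
# value-added rating trumps everything else. Cutoffs are illustrative only.

def overall_rating(vam_score, observation_score, ineffective_cutoff=0.3):
    """Scores on a 0-1 scale (an assumption made for illustration)."""
    if vam_score < ineffective_cutoff:
        # Value-added alone decides -- effectively a 100% weight.
        return "ineffective"
    combined = 0.5 * vam_score + 0.5 * observation_score
    return "effective" if combined >= 0.5 else "developing"

print(overall_rating(0.2, 0.9))  # "ineffective": strong observations cannot save it
print(overall_rating(0.6, 0.5))  # "effective"
```

Note how, under such a rule, the observation score is irrelevant whenever the value-added score falls below the cutoff, which is precisely the sense in which value-added “trumps all else.”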

According to Kane, Governor Cuomo “picked the right fight.” The state’s new system “will finally give schools the tools they need to manage and improve teaching.” Perhaps the magic mirror would agree with such a statement, but research would prove it vain.

As I have noted previously, there is absolutely no evidence, thus far, indicating that such systems have any (in)formative use or value. These data are first and foremost designed for summative, or summary, purposes; they are not designed for formative use. Accordingly, the data that come from such systems (aside from the data produced by the observational components still built into these systems, which have existed and been used for decades) are not transparent, are difficult to understand, and are therefore challenging to use. Likewise, such data are not instructionally sensitive, and they are untimely in that test-based results typically come back to teachers well after their students have moved on to subsequent grade levels.

What about Kane’s claims against tenure? He writes: “The tenure process is the place to start. It’s the most important decision a principal makes. One poor decision can burden thousands of future students, parents, colleagues and supervisors.” This is quite an effect, considering that these new and improved teacher evaluation systems, as based (in this case largely) on VAMs, typically hold accountable only teachers at the elementary level who teach mathematics and reading/language arts. Even an elementary teacher with a career spanning 40 years, with an average of 30 students per class, would directly impact (or burden) 1,200 students, maximum. This is not to say this is inconsequential, but is it as consequential as Kane’s sensational numbers imply? What about the thousands of parents, colleagues, and supervisors also to be burdened by one poor decision? Fair and objective? This particular mirror thinks not.
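The upper bound used above is simple arithmetic, and worth making explicit since it drives the whole comparison; the figures (a 40-year career, 30 students per class) are the illustrative ones from the text, not empirical data:

```python
# Back-of-the-envelope upper bound on one elementary teacher's direct impact.
# Figures are the illustrative ones used in the text, not empirical data.

years_in_career = 40       # an unusually long teaching career
students_per_class = 30    # one self-contained elementary class per year

max_students_impacted = years_in_career * students_per_class
print(max_students_impacted)  # 1200 -- far short of "thousands"
```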

Granted, I am not making any claims about tenure, as I think all would agree that tenure can sometimes protect, keeping with the metaphor, bad apples. Rather, I take issue with the exaggerations, including also that “Traditionally, principals have used much too low a standard, promoting everyone but the very worst teachers.” We must all check our assumptions here about how we define “the very worst teachers” and how many of them really lurk in the shadows of America’s now not-so-enchanted forests. There is no evidence to support this claim, either, just conjecture.

As for the solution: “Under the new law, the length of time it will take to earn tenure will be lengthened from three to four years.” Yes, that arbitrary, one-year extension will certainly help… Likewise, tenure decisions will now be made better using classroom observations (the data that have, according to Kane in this piece, been used for years to make all of these aforementioned bad decisions) and our new “fair and objective,” test-based measures, which (a fact Kane does not note) can only be used for about 30% of all teachers in America’s public schools. Nonetheless, “Student achievement gains [are to serve as] the bathroom scale, [and] classroom observations [are to serve] as the mirror.”

Kane continues, scripting: “Although the use of test scores has received all the attention, [one of] the most consequential change[s] in the law has been overlooked: One of a teacher’s observers must now be drawn from outside his or her school — someone whose only role is to comment on teaching.” Those from inside the school were only commenting on one’s beauty and fairness prior, I suppose, as “The fact that 96% of teachers were given the two highest ratings last year — being deemed either ‘effective’ or ‘highly effective’ — is a sure sign that principals have not been honest to date.”

All in all, perhaps somebody else should be taking a long hard “Look in the Mirror,” as this new law will likely do everything but “[open] the door to a renewed focus on instruction and excellence in teaching” despite the best efforts of “union leadership,” although I might add to Kane’s list many adorable little researchers who have also “carefully studied the best systems for rating teachers” and more or less agree on their intended and unintended results in…the end.