Recommended Reading — 50 Myths and Lies that Threaten America’s Public Schools: The Real Crisis in Education

A few months ago, two of the most renowned scholars in the field of education—Drs. David Berliner and Gene Glass, who are both mentors of mine here at Arizona State — wrote a book that is sure to stir up some engaging dialogue regarding public education. The two teamed up with a group of our PhD students as well as PhD students from the University of Colorado-Boulder in order to tackle what they have deemed the 50 Myths and Lies that Threaten America’s Public Schools: The Real Crisis in Education. While the book covers a wide range of topics including everything from charter schools, to bullying, to English acquisition programs, and even sex education, there are several chapters (or myths) that would likely be of particular interest to the readers of this blog (i.e., on teacher accountability and VAMs — see post forthcoming).

Eight of the other myths deal directly with the lies often told about teachers, including the way in which they should be evaluated based on student test scores (via VAMs). These myths, again, each deconstructed in this book, include the following:

  • Teachers are the most important influence in a child’s education.
  • Teachers in the United States are well-paid.
  • Merit pay is a good way to increase the performance of teachers.
  • Teachers should be evaluated on the basis of the performance of their students.
  • Rewarding and punishing schools for the performance of their students will improve our nation’s schools. (See forthcoming post about this myth specifically.)
  • Teachers in schools that serve the poor are not very talented.
  • Teach for America teachers are well trained, highly qualified, and get amazing results.
  • Subject matter knowledge is the most important asset a teacher can possess.
  • Teachers’ unions are responsible for much poor school performance.
  • Incompetent teachers cannot be fired if they have tenure.
  • Judging teacher education programs by means of the scores that their teachers’ students get on state tests is a good way to judge the quality of the teacher education program.

At a time when corporate reformists are more interested in making money off of education than actually attempting to improve educational quality for all students, books like this are critical in helping us separate the truth from the perpetual myths that have become the bedrock of the reform movement. Berliner and Glass do not hold back: they name names where possible and provide enough research to support their claims, but not so much as to slow them down. Instead, they call upon decades’ worth of educational research, the trusty work of their students, and a bit of logic and humor to pack a powerful punch against those most responsible for spreading the myths and, often, flat-out lies about America’s students, teachers, and schools. This book is for teachers, parents, policymakers, school administrators, and concerned citizens alike. I definitely recommend that you add it to your reading list!

Will the Obama Administration Continue to Stay the Course?

According to a “recent” post by Diane Ravitch, President Obama “recently” selected Robert Gordon, an economist and “early proponent of VAMs,” for a top policy post in the US Department of Education. “This is a very important position” as “he will be the person in charge of the agency that basically decides what is working, what is not, and which way to go next with [educational] policy.”

Diane writes: “When he worked in the Office of Management and Budget, Gordon helped to develop the priorities for the controversial Race to the Top program. Before joining the Obama administration, he worked for Joel Klein in the New York City Department of Education. An economist, Gordon was [also] lead author of an influential paper in 2006 that helped to put value-added-measurement at the top of the “reformers” policy agenda. That paper, called “Identifying Effective Teachers Using Performance on the Job,” was co-authored by Thomas J. Kane and Douglas O. Staiger. Kane became the lead adviser to the Gates Foundation in developing its “Measures of Effective Teaching,” which has spent hundreds of millions of dollars trying to develop the formula for the teacher who can raise test scores consistently. Gordon went on to Obama’s Office of Management and Budget, which is the U.S. government’s lead agency for determining budget priorities.”

Here is the abstract from this paper:

Traditionally, policymakers have attempted to improve the quality of the teaching force by raising minimum credentials for entering teachers. Recent research, however, suggests that such paper qualifications have little predictive power in identifying effective teachers. We propose federal support to help states measure the effectiveness of individual teachers—based on their impact on student achievement, subjective evaluations by principals and peers, and parental evaluations. States would be given considerable discretion to develop their own measures, as long as student achievement impacts (using so-called “value-added” measures) are a key component. The federal government would pay for bonuses to highly rated teachers willing to teach in high-poverty schools. In return for federal support, schools would not be able to offer tenure to new teachers who receive poor evaluations during their first two years on the job without obtaining district approval and informing parents in the schools. States would open further the door to teaching for those who lack traditional certification but can demonstrate success on the job [Gordon’s wife worked for TFA]. This approach would facilitate entry into teaching by those pursuing other careers. The new measures of teacher performance would also provide key data for teachers and schools to use in their efforts to improve their performance.

Diane writes again: “Much has happened since Gordon, Kane, and Staiger speculated about how to identify effective teachers by performance measures such as student test scores. We now have evidence that these measures are fraught with error and instability. We now have numerous examples where teachers are evaluated based on the scores of students they never taught. We have numerous examples of teachers rated highly effective one year, but ineffective the next year, showing that what mattered most was the composition of their class, not their quality or effectiveness. Just recently, the American Statistical Association said: “Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.” In a joint statement, the National Academy of Education and the American Educational Research Association warned about the defects and limitations of VAM and showed that most of the factors that determine test scores are beyond the control of teachers. Numerous individual scholars have taken issue with the naive belief that teacher quality can be established by the test scores of their students, even when the computer matches as many variables as it can find.”

So, will the Obama administration continue to stay its now well-established course? Diane, having met Gordon and “knowing him to be a very smart person,” is betting, and hoping, that he might come to his senses, given the research that has burgeoned since his initial report was released in 2006. We all sincerely hope that Gordon might just help the Obama administration change course, not stay it.

TIME Magazine Needs a TIME Out

In its paper version next week, TIME Magazine will release an article titled “Rotten Apples: It’s Nearly Impossible to Fire a Bad Teacher.” As the title foreshadows, the article is about how “really difficult” it is to fire a bad teacher, and how a group of Silicon Valley investors, along with Campbell Brown (the award-winning news anchor who recently joined “the cause” in New York, as discussed here previously), want to change that.


The article summarizes, or I should say celebrates, the Vergara v. California trial, the case in which nine public school students (emphasis added, as these were not necessarily these students’ ideas) challenged California’s “ironclad tenure system,” arguing that their rights to a good education had been violated by state-level job protections making it “too difficult” to fire bad teachers. Not surprisingly, behind the students stood one of approximately six Silicon Valley technology magnates, David Welch, who financed and ultimately won the case (see prior posts about this case here and here). Not surprisingly, again, the author of this article also comes from Silicon Valley. I wonder if they all know each other…

Anyhow, as summarized (and celebrated) in this piece, the Vergara judge found that (1) “[t]enure and other job protections make it harder to fire teachers and therefore effectively work to keep bad ones in the classroom,” and (2) “[b]ad teachers ‘substantially undermine’ a child’s education.” This last point, which according to the judge “shock[ed] the conscience,” came almost entirely thanks to the testimony of Thomas Kane (see prior posts about his Bill and Melinda Gates funded research here and here) and the testimony of Raj Chetty, Kane’s colleague at Harvard, who has also been the subject of many prior posts (see here and here), but whose research underlying his testimony has recently (and seriously) been scrutinized, given that bias in his study undermined this key assertion.

Research recently released by Jesse Rothstein provides evidence that the claims Chetty made during Vergara were actually false, because Chetty (and his study colleagues) improperly masked the bias underlying their main causal assertion, again, that “bad teachers ‘substantially undermine’ a child’s education.” It was not bad teachers that were the root cause of student failure, as Chetty (and Kane) argued (and continue to argue); rather, many other influences beyond “bad teachers” account for student success (or failure) (click here to read more).

Nevertheless, the case (and this article) were built on this false premise. As per one of the plaintiffs’ attorneys, “The fact that [they, as in Chetty and Kane] could show how students were actually harmed by bad teachers–that changed the argument.”

The article proceeds to describe how, “predictably,” many teacher unions and the like dismissed pretty much everything associated with the lawsuit, except of course the key assertions advanced by the judge. The only other luminaries of note included “U.S. Secretary of Education Arne Duncan and former D.C. chancellor of schools Michelle Rhee,” who both praised the judge’s decision for challenging the “broken status quo.” The aforementioned, self-proclaimed “education reformer” Campbell Brown heralded it as “the most important civil rights suit in decades,” after which she helped to file two similar cases in New York (as discussed, again, here).

As the author of this TIME piece then asserts, this is a war “not [to] be won incrementally, through painstaking compromise with multiple stakeholders, but through sweeping decisions–judicial and otherwise–made possible by the tactical [emphasis added] application of vast personal fortunes,” thanks to the Silicon Valley et al. magnates. “It is a reflection of our politics that no one elected these men [emphasis added] to take on the knotty problem of fixing our public schools, but here they are anyway, fighting for what they firmly believe is in the public interest.” This is, indeed, a “war on teacher tenure” that, funded by this “latest batch of tech tycoons…follows in the footsteps of a long line of older magnates, from the Carnegies and Rockefellers to Walmart’s Waltons, who have also funneled their fortunes into education-reform projects built on private-sector management strategies.”

The article then goes into a biography of Welch (the aforementioned, “bushy eyebrowed” backer of the Vergara case) as if he were Christopher Columbus incarnate. Welch “didn’t think much about how the system actually functioned, or malfunctioned,” until his own children were born in the ’90s and went on to have “some public experiences and some private-school experiences.” Another education expert such an experience makes. He also had a conversation with one superintendent, who wanted more control over his workforce and thereby inspired Welch to lead the war on teacher tenure, because “children [were presumably] being harmed by these laws.”

“[S]tudents who are stuck in classrooms with bad teachers receive an education that is substantially inferior to that of students who are in classrooms with good teachers. Laws that keep bad teachers in the classroom…therefore violate the equal-protection clause of the state constitution….[and] poor and minority students, who are more likely to be in classrooms with bad teachers, endure a disproportionate burden, making the issue a matter of civil rights as well.”

To see two other articles written about this TIME piece, please click here and here. As per Diane Ravitch in the latter link: “More: tenure is due process, the right to a hearing, not a guarantee of a lifetime job. Are there bad apples in teaching? Undoubtedly, just as there are bad apples in medicine, the law, business, and even TIME magazine. There are also bad apples in states where teachers have no tenure. Will abolishing tenure increase the supply of great teachers? Surely we should look to those states where teachers do not have tenure to see how that worked out. Sadly, there is no evidence for the hope, wish, belief, that eliminating due process produces a surge of great teachers.”

“Accountability that Sticks” v. “Accountability with Sticks”

Michael Fullan, Professor Emeritus at the University of Toronto and former Dean of the Ontario Institute for Studies in Education (OISE), gave a presentation he titled “Accountability that Sticks,” which is, as he noted, “a preposition away from accountability with sticks.”

In his speech he said: “Firing teachers and closing schools if student test scores and graduation rates do not meet a certain bar is not an effective way to raise achievement across a district or a state…Linking student achievement to teacher appraisal, as sensible as it might seem on the surface, is a non-starter…It’s a wrong policy [emphasis added]…[and] Its days are numbered.”

He noted that teacher evaluation is “the biggest factor that most policies get wrong…Teacher appraisal, even if you get it right – which the federal government doesn’t do – is the wrong driver. It will never be intensive enough.”

He then spoke about how, at least in the state of California, things look more promising than they have in the past, from his view working with local districts throughout the state. He noted that “[g]rowing numbers of superintendents, teachers and parents…are rejecting punitive measures…in favor of [what he called] more collaborative, humane and effective approaches to supporting teachers and improving student achievement.”

If the goal is to improve teaching, then, what’s the best way to do this according to Fullan? “[A] culture of collaboration is the most powerful tool for improving what happens in classrooms and across districts…This is the foundation. You reinforce it with selective professional development and teacher appraisal.”

In addition, “[c]ollaboration requires a positive school climate – teachers need to feel respected and listened to, school principals need to step back, and the tone has to be one of growth and improvement, not degradation.” Accordingly, “New Local Control and Accountability Plans [emphasis added], created individually by districts, could be used by teachers and parents to push for ways to create collaborative cultures” and cultures of community-based and community-respected accountability.

This will help “talented schools” improve “weak teachers” and help prevent the attrition of “talented teachers” from “weak schools.”

To read more about his speech, as highlighted by Jane Meredith Adams on EdSource, click here.

VAMs and Observations: Consistencies, Correlations, and Contortions

A research article was recently published in the peer-reviewed journal Education Policy Analysis Archives, titled “The Stability of Teacher Performance and Effectiveness: Implications for Policies Concerning Teacher Evaluation.” The article was written by professors/doctoral researchers from Baylor University, Texas A&M, and the University of South Carolina. The researchers set out to explore the stability (or fluctuations) observed in the effectiveness ratings of 132 teachers across 23 schools in South Carolina over time, using both observational and VAM-based output.

Researchers found, not surprisingly given prior research in this area, that neither teacher performance (using value-added) nor effectiveness (using observations) was highly stable over time. This is most problematic when “sound decisions about continued employment, tenure, and promotion are predicated on some degree of stability over time. It is imprudent to make such decisions [when] the performance and effectiveness of a teacher [is rated] ‘Excellent’ one year and ‘Mediocre’ the next.”

They also observed “a generally weak positive relationship between the two sets of ratings [i.e., value-added estimates and observations],” a finding that has also been reported across many studies in the literature.

On both of these findings, really, we are well past the point of saturation. That is, we could not have more agreement across research studies on the following: (1) that teacher-level value-added scores are highly unstable over time, and (2) that these value-added scores do not align well with observational scores, as they should if both measures were appropriately capturing the “teacher effectiveness” construct.

Another interesting finding from this study, discussed before but not yet at the point of saturation in the research literature like the prior two, is (3) that teacher performance ratings based on observational data also differ markedly across schools. At issue here is “that performance ratings may be school-specific.” Or, as per a recent post on this blog, there is indeed much “Arbitrariness Inherent in Teacher Observations.” This is also highly problematic in that where a teacher happens to be housed may determine his/her ratings, based not necessarily (or entirely) on his/her actual “quality” or “effectiveness” but on his/her location, his/her rater, and that rater’s scoring approach, given raters’ differential tendencies toward leniency or severity. This might leave us with more of a luck-of-the-draw approach than an actually “objective” measurement of true teacher quality, contrary to current and popular (especially policy) beliefs.

Accordingly, and also per the research, this is not getting much better. As per the authors of this article, as well as many other scholars: (1) “the variance in value-added scores that can be attributed to teacher performance rarely exceeds 10 percent”; (2) “gross” measurement errors come, first, from the tests being used to calculate value-added; (3) the ranges of teacher effectiveness scores are restricted, given these tests’ limited stretch and depth and their instructional insensitivity (this was also at the heart of a recent post demonstrating that “the entire range from the 15th percentile of effectiveness to the 85th percentile of [teacher] effectiveness [using the EVAAS] cover[ed] approximately 3.5 raw score points [given the tests used to measure value-added]”); (4) context, or student, family, school, and community background effects, simply cannot be controlled for, or factored out; and (5) this is especially true at the classroom/teacher level when students are not randomly assigned to classrooms (and teachers are not randomly assigned to teach those classrooms), although random assignment will likely never happen for the sake of improving the sophistication and rigor of the value-added model over students’ “best interests.”

Rothstein, Chetty et al., and VAM-Based Bias

Recall the Chetty et al. study at the focus of many posts on this blog (see, for example, here, here, and here)? The study was cited in President Obama’s 2012 State of the Union address when Obama said, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.” Well, this study is at the source of a new, and very interesting, VAM-based debate, again.

This time, new research conducted by Berkeley Associate Professor of Economics Jesse Rothstein provides evidence that puts the aforementioned Chetty et al. results in another, more appropriate light. While Rothstein and others have critiqued the Chetty et al. study before (see prior reviews here, here, here, and here), what Rothstein recently found (in his working, not-yet-peer-reviewed study here) is that by using “teacher switching” statistical procedures, Chetty et al. masked evidence of bias in their prior study. While Chetty et al. have repeatedly claimed bias was not an issue (see, for example, a series of emails on this topic here), it seems indeed it was.

While Rothstein replicated Chetty et al.’s overall results using a similar dataset, Rothstein did not replicate Chetty et al.’s findings when it came to bias. As mentioned, Chetty et al. used a process of “teacher switching” to test for bias in their study, and by doing so claimed, with evidence, that bias did not exist in their value-added output. Rothstein found that when “teacher switching” is appropriately controlled, however, “bias accounts for about 20% of the variance in [VAM] scores.” This makes suspect, more now than before, Chetty et al.’s prior assertions that their model, and their findings, were immune to bias.

What this means, as per Rothstein, is that “teacher switching [the process used by Chetty et al.] is correlated with changes in students’ prior grade scores that bias the key coefficient toward a finding of no bias.” Hence, there was a reason Chetty et al. did not find bias in their value-added estimates: they did not use the proper statistical controls to account for bias in the first place. When properly controlled, or adjusted, the estimates yield “evidence of moderate bias”; hence, “[t]he association between [value-added] and long-run outcomes is not robust and quite sensitive to controls.”

This has major implications in that it makes suspect the causal statements also made by Chetty et al. and repeated by President Obama, the Vergara v. California judge, and others: that “high value-added” teachers caused students to ultimately realize higher long-term incomes, fewer pregnancies, etc., X years down the road. If Chetty et al. did not appropriately control for bias, which again Rothstein argues with evidence they did not, it is likely that students would have realized these outcomes almost if not entirely regardless of their teachers, or of what “value” their teachers purportedly “added” to their learning X years prior.

In other words, students were likely not randomly assigned to classrooms in either the Chetty et al. or the Rothstein datasets (making these datasets comparable). So, if the statistical controls used did not effectively account for the non-random assignment of students to classrooms, teachers may have been assigned high value-added scores not necessarily because they were high value-added teachers, but because they were non-randomly assigned higher-performing, higher-aptitude students in the first place and as a whole. Thereafter, they were given credit for the aforementioned long-term outcomes, regardless.
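The mechanism described above can be made concrete with a hypothetical sketch. The numbers below are invented; this is not Chetty et al.'s or Rothstein's model. In the simulation, every teacher has the same true effect (zero), yet a naive gain-score estimate rewards whichever teacher was handed the stronger class, purely because assignment tracks students' unobserved aptitude.

```python
# Hypothetical sketch of how non-random (tracked) assignment can masquerade
# as teacher "value added." Every teacher here has the SAME true effect
# (zero), so any spread in estimated scores is pure assignment bias.
import random
import statistics

random.seed(7)

N_STUDENTS = 1000
CLASS_SIZE = 25

# Unobserved aptitude drives this year's gains (and, in reality, prior scores).
aptitude = sorted(random.gauss(0, 1) for _ in range(N_STUDENTS))

# Tracked assignment: students grouped into classes by aptitude, not at random.
classes = [aptitude[i:i + CLASS_SIZE] for i in range(0, N_STUDENTS, CLASS_SIZE)]

def naive_vam(class_aptitudes):
    """Mean observed gain: zero teacher effect + aptitude spillover + noise."""
    gains = [0.5 * a + random.gauss(0, 1) for a in class_aptitudes]
    return statistics.mean(gains)

scores = [naive_vam(c) for c in classes]
print(f"lowest-track teacher's 'value added':  {scores[0]:+.2f}")
print(f"highest-track teacher's 'value added': {scores[-1]:+.2f}")
```

Under random assignment, the two printed scores would differ only by noise; under tracking, the gap is large and systematic, even though no teacher is actually better than any other.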

If the name Jesse Rothstein sounds familiar, it should. I have referenced his research in prior posts here, here, and here, as he is well known in the area of VAM research, in particular for a series of papers in which he provided evidence that students who are assigned to classrooms in non-random ways can produce biased teacher-level value-added scores. If random assignment were the norm (i.e., where students are randomly assigned to classrooms and, ideally, teachers are randomly assigned to teach those classrooms of randomly assigned students), teacher-level bias would not be so problematic. However, given research I also recently conducted on this topic (see here), random assignment (at least in the state of Arizona) occurs 2% of the time, at best. Principals otherwise outright reject the notion, as random assignment is not viewed as in “students’ best interests,” regardless of whether randomly assigning students to classrooms might mean “more accurate” value-added output as a result.

So it seems, we either get the statistical controls right (which I doubt is possible) or we randomly assign (which I highly doubt is possible). Otherwise, we are left wondering whether value-added analyses will ever work as per their intended (and largely ideal) purposes, especially when it comes to evaluating and holding accountable America’s public school teachers for their effectiveness.


In case you’re interested, Chetty et al. have responded to Rothstein’s critique. Their full response can be accessed here. Not surprisingly, they first highlight that Rothstein (along with another set of their colleagues at Harvard) replicated their results. Chetty et al. focus first and foremost on the claim that “value-added (VA) measures of teacher quality show very consistent properties across different settings.” What they dismiss, however, is whether the main concerns raised by Rothstein threaten the validity of their methods and their conclusions. They also dismiss the fact that Rothstein addressed Chetty et al.’s counterpoints before they published them, in Appendix B of his paper, given that Chetty et al. shared their concerns with Rothstein prior to his study’s release.

Nonetheless, the concerns Chetty et al. attempt to counter are whether their “teacher-switching” approach was invalid and whether the “exclusion of teachers with missing [value-added] estimates biased the[ir] conclusion[s]” as well. The extent to which missing data bias value-added estimates has also been discussed previously, as statisticians often force the assumption in their analyses that missing data are “missing at random” (MAR), which is a difficult (although for some, like Chetty et al., necessary) assumption to swallow (see, for example, the Braun 2004 reference here).

Five Nashville Teachers Face Termination

Three months ago, Tennessee Schools Director Jesse Register announced he was to fire 63 Tennessee teachers, of 195 total who for two consecutive years scored lowest (i.e., a 1 on a scale of 1 to 5) in terms of their overall “value-added” (based 35% on EVAAS, 15% on related “student achievement,” and 50% on observational data). These 63 were the ones who three months ago were still “currently employed,” given that the other 132 apparently left voluntarily or retired (which is a nice-sized cost savings, so let’s be sure not to forget about the economic motivators behind all of this as well). To see a better breakdown of these numbers, click here.

This was to be the first time in Tennessee that its controversial, “new and improved” teacher evaluation system would be used to take deliberate action against those deemed the “lowest-performing” teachers, as “objectively” identified in the classroom; although officials at that time did not expect to have a “final number” to be terminated until fall.

Well, fall is here, and it seems this final number is officially five: three middle school teachers, one elementary school teacher, and one high school teacher, all teaching in metro Nashville.

The majority of these teachers come from Neely’s Bend, “one of 14 Nashville schools on the state’s priority list for operating at the bottom 5 percent in performance statewide.” Some of these teachers were evaluated even though the principal who evaluated them is “no longer there.” Another is a computer instructor being terminated based on the school’s overall “school-level value-added.” This is problematic in and of itself, given that teacher-level and, in this case, school-level bias seem to go hand in hand with the use of these models and grossly interfere with accusations that these teachers “caused” low performance (see a recent post about this here).

This is not to say these teachers were not indeed the lowest performing; maybe they were. But I, for one, would love to talk to these teachers and take a look at their actual data, EVAAS and observational data included. Based on prior experiences working with such individuals, there may be more to this than meets the eye. Hence, if anybody knows these folks, do let them know I’d like to better understand their stories.

Otherwise, all of this effort to ultimately terminate five of a total of 5,685 certified teachers in the district (0.09%) seems awfully inefficient, costly, and quite frankly absurd, given that this is a “new and improved” system meant to be much better than a prior system that likely yielded a similar termination rate (not including, however, those who left voluntarily beforehand).

Perhaps an ulterior motive is, indeed, the cost-savings realized given the mere “new and improved” threat.

State Tests: Instructional Sensitivity and (Small) Differences between Extreme Teachers

As per Standard 1.2. of the newly released 2014 Standards for Educational and Psychological Testing authored by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME):

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation. (p. 23)

After a series of conversations I have had, most recently with a colleague named Harris Zwerling, a researcher for the Pennsylvania State Education Association, I asked him to submit a guest blog about what he has observed working with Pennsylvania Value-Added Assessment System (PVAAS, aka the Education Value-Added Assessment System [EVAAS], administered by SAS) data in his state, as per this standard. Here is what he wrote:

Recent Vamboozled posts have focused on one of the qualities of the achievement tests used as the key measures in value-added models (VAMs) and other “growth” models. Specifically, the instructional sensitivity of current student achievement tests has been questioned (see, for example, here). Evidence of whether the achievement tests in use can actually detect the influence of instruction on student test performance should, accordingly, be a central part of validity evidence required for VAMs [as per the specific standard also mentioned above]. After all, it is achievement test scores that provide the bases for VAM-based estimates. In addition, while many have noted that the statistical methods employed to generate value-added estimates will force a distribution of “teacher effectiveness” scores, it is not clear that the distinctions generated and measured have educational (or clinical) meaning.

This concern can be illustrated with an example taken from comments I previously made while serving on a Gates Foundation-funded stakeholders committee tasked with making recommendations for the redesign of Pennsylvania’s teacher and principal evaluation systems (see the document to which I refer here). Let me note that what follows is only suggestive of the possible limitations of some of Pennsylvania’s state achievement tests (the PSSAs) and is not offered as an example of an actual analysis of their instructional sensitivity. The value-added calculations, which I converted to raw scores, were performed by a research team from Mathematica Policy Research (for more information click here).

From my comments in this report:

“…8th grade Reading provided the smallest teacher effects from the three subjects reported in Table III.1. The difference the [Final Report] estimates comparing the teacher at the 15th percentile of effectiveness to the average teacher (50th percentile) is -22 scaled score points on the [8th] grade PSSA Reading test…[referring] to the 2010 PSSA Technical Manual raw score table…for the 8th grade Reading test, that would be a difference of approximately 2 raw score points, or the equivalent of 2 multiple choice (MC) questions (1 point apiece) or half credit on one OE [open-ended] question. (There are 40 MC questions and 4 (3 point) OE questions on the Reading test.) The entire range from the 15th percentile of effectiveness to the 85th percentile of effectiveness…covers approximately 3.5 raw score points [emphasis added]” (p. 9)…“[T]he range of teacher effectiveness covering the 5th to the 95th percentiles (73 scaled score points) represents approximately a 5.5 point change in the raw score (i.e., 5.5 of 52 total possible points) [emphasis added]” (p. 9).

The 5th to 95th percentiles corresponds to the effectiveness range used in the PVAAS.

“It would be difficult to see how the range of teacher effects on the 8th grade Reading test would be considered substantively meaningful if this result holds up in a larger and random sample. In the case of this subject and grade level test, the VAM does not appear to be able to detect a range of teacher effectiveness that is substantially more discriminating than those reported for more traditional forms of evaluation” (pp. 9-10).
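To make the substantive smallness of these differences concrete, here is a minimal arithmetic sketch. It uses only the raw-score figures quoted above; the scaled-to-raw conversion itself comes from the PSSA raw score tables and is not reproduced here.

```python
# Raw-score composition of the 8th grade PSSA Reading test, per the quote above:
# 40 multiple-choice questions (1 point each) + 4 open-ended questions (3 points each).
TOTAL_RAW_POINTS = 40 * 1 + 4 * 3  # 52

def share_of_scale(raw_point_span, total=TOTAL_RAW_POINTS):
    """Fraction of the test's raw-score scale that a span of points covers."""
    return raw_point_span / total

# 15th-to-85th percentile "effectiveness" range: ~3.5 raw points
print(round(share_of_scale(3.5) * 100, 1))  # 6.7 (% of the raw-score scale)

# 5th-to-95th percentile range: ~5.5 raw points
print(round(share_of_scale(5.5) * 100, 1))  # 10.6 (% of the raw-score scale)
```

In other words, the entire 5th-to-95th percentile spread of estimated "teacher effectiveness" corresponds to roughly a tenth of the available raw-score points.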

Among other issues, researchers have considered which scaling properties are necessary for measuring growth (see, for example, here), whether the tests’ scale properties met the assumptions of the statistical models being used (see, for example, here), if growth in student achievement is scale dependent (see, for example, here), and even if tests that were vertically scaled could meet the assumptions required by regression-based models (see, for example, here).

Another issue raised by researchers is the fundamental question of whether state tests can be put to more than one use, that is, as a valid measure of both student progress and teacher performance, or whether the multiple uses would inevitably lead to behaviors that would invalidate the tests for either purpose (see, for example, here). All of this suggests that before racing ahead to implement value-added components of teacher evaluation systems, states have an obligation to assure all involved that these psychometric issues have been explicitly addressed and that the tests used are properly validated for use in value-added measurements of “teacher effectiveness.”

So…How many states have validated their student achievement tests for use as measures of teachers’ performance? How many have produced evidence of their tests’ instructional sensitivity, or of the educational significance of the distinctions made by the value-added models they use?

One major vendor of value-added measures (i.e., SAS, as in SAS-EVAAS) has long held that the tests need only meet three conditions: 1) sufficient “stretch” in the scales “to ensure that progress could be measured for low- and high-achieving students,” 2) that “the test is highly related to the academic standards,” and 3) that “the scales are sufficiently reliable from one year to the next” (see, for example, here). These assertions should be subject to independent psychometric verification, but they have not been. Rather, and notably, nowhere in the literature SAS has produced regarding its VAMs does the company indicate that it requires or conducts the tests that researchers have designed for determining instructional sensitivity. It nonetheless asserts that it can provide accurate measures of individual teacher effectiveness.

To date, Pennsylvania’s Department of Education (PDE) apparently has ignored the fact that the Data Recognition Corporation (DRC), the vendor of its state tests, has avoided making any claims that its tests are valid for use as measures of teacher performance (see, for example, here). In DRC’s most recent Technical Report chapter on validity, for example, the company lists the main purposes of the PSSA, the second of which is to “Provid[e] information on school and district accountability” (p. 283). The omission of teacher accountability here certainly stands out.

To date, again, none of the publicly released DRC technical reports for Pennsylvania’s state tests provides any indication that instructional sensitivity has been examined, either. Nor do the reports provide a basis for others to make their own judgments. In fact, they make it clear that historically the PSSA exams were designed for school-level accountability and only later moved toward measuring individual student mastery of Pennsylvania’s academic standards. Various PSSA tests sample an entire academic standard with as few as one or two questions (see, for example, Appendix C here), which aligns with the evidence presented in the first section of this post.

In sum, given that VAMs by design force a distribution of “teacher effectiveness” even if the measured differences are substantively quite small, and much smaller than lay observers or, in particular, educational policymakers might otherwise believe (in the case of the 8th grade PSSA Reading test, 5.5 of 52 total possible points covers the entire range from the bottom to the top categories of “teacher effectiveness”), it is essential that both instructional sensitivity and educational significance be explicitly established. Otherwise, untrained users of the data may labor under the misapprehension that the differences observed provide a solid basis for making such (low- and high-stakes) personnel decisions.
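The forced-distribution point can be illustrated with a small, entirely hypothetical sketch: percentile ranking hands out the full range of effectiveness labels no matter how small the underlying spread is. The "teacher effect" values below are invented for illustration only.

```python
# 100 hypothetical teachers whose entire measured spread is under 5 raw-score
# points (cf. the 5.5-of-52-point range discussed above): 0.0, 0.05, ..., 4.95.
effects = [i / 20 for i in range(100)]

# Percentile ranking forces a full 1-5 "effectiveness" quintile distribution
# regardless of how substantively small the differences are.
ranked = sorted(effects)
quintile_labels = [rank * 5 // len(ranked) + 1 for rank in range(len(ranked))]

print(sorted(set(quintile_labels)))  # [1, 2, 3, 4, 5]
print(max(effects) - min(effects))   # 4.95
```

The ranking machinery produces "least effective" through "most effective" categories here by construction, which is exactly why the educational meaning of the underlying differences has to be established separately.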

This is a completely separate concern from the fact that, at best, measured “teacher effects” may explain only 1% to 15 or 20% of the variance in student test scores, a figure dwarfed by the variance explained by factors beyond the control of teachers. To date, SAS has not publicly reported the amount of variance explained by its PVAAS models for any of Pennsylvania’s state tests. I believe every vendor of value-added models should report this information for every achievement test being used as a measure of “teacher effectiveness.” This, too, would provide essential context for understanding the substantive meaning of reported scores.

A Major VAMbarrassment in New Orleans

Tulane University’s Cowen Institute for Education Initiatives, in what is now being called a “high-profile embarrassment,” is apologizing for releasing a prominent report based on faulty research. The report, which at one point seemed to have successfully VAMboozled folks representing numerous Louisiana governmental agencies and education non-profits, many of which emerged after Hurricane Katrina, has since been pulled from its once prominent placement on the institute’s website.

It seems that on October 1, 2014, the institute released a report titled “Beating-the-Odds: Academic Performance and Vulnerable Student Populations in New Orleans Public High Schools.” The report was celebrated widely (see, for example, here) in that it “proved” that students in post-Hurricane Katrina (largely charter) schools, despite the disadvantages they faced prior to the school-based reforms triggered by Katrina, were now “beating the odds” while “posting better test scores and graduation rates than predicted by their populations,” thanks to these reforms. Demographics were no longer predicting students’ educational destinies, and the institute had the VAM-based evidence to prove it.

Institute researchers also “found” that New Orleans charter schools, in which over 80% of all New Orleans students are now educated, were substantively helping students “beat the odds,” accordingly. Let’s just say the leaders of the charter movement were all over this one, and were also allegedly involved.

To some, however, the report’s initial findings (the findings that have now been retracted) did not come as much of a surprise. It seems the Cowen Institute and the city’s leading charter school incubator, New Schools for New Orleans, literally share office space.

To read more about this, as per the research of Kristen Buras (Associate Professor at Georgia State), click here. Mercedes Schneider, in a recent post she wrote about this, noted that the “Cowen Institute at Tulane University has been promoting the New Orleans Charter Miracle [emphasis added]” and has consistently tried “to sell the ‘transformed’ post-Katrina education system in New Orleans” since 2007 (two years post-Katrina). Thanks also go out to Schneider who, before the report was taken down, downloaded it; the report can still be accessed on her post, or directly here: Beating-the-Odds, for those who want to dive into this further.

Anyhow, as per another news article recently released about this mess, the Cowen Institute’s Executive Director John Ayers removed the report because the research within it was “inaccurate,” and institute “[o]fficials determined the report’s methodology was flawed, making its conclusions inaccurate.” The report is not to be reissued. The institute also intends to “thoroughly examine and strengthen [its] internal protocols” to make sure that this does not happen again. The released report was not appropriately internally reviewed, as per the institute’s official response, although external review would certainly have been more appropriate here. Similar situations capturing why internal and, more importantly, external review are so very important have been the focus of prior blog posts here, here, and here.

In this report, which listed Debra Vaughan (the institute’s Director of Research) and Patrick Sims (the institute’s Senior Research Analyst) as the lead researchers, the authors used what they called a VAM. But what they did in terms of analyses was certainly not akin to an advanced or “sophisticated” VAM (not that using a VAM would have yielded entirely more accurate and/or less flawed results); instead, they used a simpler regression approach to reach (or confirm) their conclusions. While Ayers “would not say what piece of the methodology was flawed,” we can all be quite certain it was the so-called “VAM” that was the cause of this serious case of VAMbarrassment.

As background, and from a related article about the Cowen Institute and one of its new affiliates – Douglas Harris, who has also written a book about VAMs (one that positioned VAMs in a more positive light than I did in mine), but who is not listed as a direct or affiliated author on the report – the situation in New Orleans post-Katrina is as follows:

Before the storm hit in August 2005, New Orleans public schools were like most cities and followed the 100-year-old “One Best System.” A superintendent managed all schools in the school district, which were governed by a locally-elected school board. Students were taught by certified and unionized teachers and attended schools mainly based on where they lived.

But that all changed when the hurricane shuttered most schools and scattered students around the nation. That opened the door for alternative forms of public education, such as charter schools that shifted control from the Orleans Parish School Board into the hands of parents and a state agency, the Recovery School District.

In the 2012-13 school year, 84 percent of New Orleans public school students attended charter schools…New Orleans [currently] leads the nation in the percentage of public school students enrolled in charter schools, with the next-highest percentages in Washington D.C. and Detroit (41 percent in each).

Again…Does the Test Matter?

Following up on my most recent post about whether “The Test Matters,” a colleague of mine from the University of Southern California (USC) – Morgan Polikoff, whom I also referenced on this blog this past summer as per a study he and Andrew Porter (University of Pennsylvania) released about the “surprisingly weak” correlations among VAMs and other teacher quality indicators – wrote me an email. In this email he sent another study, very similar to the one referenced above, titled “Does the Test Matter?” (see full reference below).

Given this is a book chapter included within a volume capturing many of the Bill & Melinda Gates Measures of Effective Teaching (MET) studies – studies that have also been the sources of prior posts here and here – I thought it all the more appropriate to share it with you, as book chapters are sometimes hard to access and find.

Like the authors of “The Test Matters,” Polikoff used MET data to investigate whether large-scale standardized state tests “differ in the extent to which they reflect the content or quality of teachers’ instruction” (i.e., tests’ instructional sensitivity). He also investigated whether differences in instructional sensitivity affect recommendations made in the MET reports for creating multiple-measure evaluation systems.

Polikoff found that “state tests indeed vary considerably in their correlations with observational and student survey measures of effective teaching.” These correlations, as in the previous article/post cited above, were statistically significant, positive, and fell in the very weak range in mathematics (0.15 < r < 0.19) and the very weak range in English/language arts (0.07 < r < 0.15). These, and other noted correlations that approach zero (i.e., r = 0, or no correlation at all), all, as per Polikoff, indicate “weak sensitivity to instructional quality.”
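For a rough sense of scale (my arithmetic, not Polikoff's): squaring a correlation gives the share of variance the two measures share, so even the top of these reported ranges implies very little overlap between the test-based measures and the observational/survey measures.

```python
# Shared variance (r squared, in percent) implied by the reported correlation
# ranges. Working in integer hundredths keeps the arithmetic exact.
for r_hundredths in (7, 15, 19):
    r = r_hundredths / 100
    shared_pct = r_hundredths ** 2 / 100  # equals (r ** 2) * 100, exactly
    print(f"r = {r}: ~{shared_pct}% of variance shared")
```

Even at r = 0.19, the two kinds of measures share well under 4% of their variance.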

Put differently, while this indicates that the extent to which state tests are instructionally sensitive appears slightly more favorable in mathematics versus English/language arts, we might otherwise conclude that the state tests used in at least the MET partner states “are only weakly sensitive to pedagogical quality, as judged by the MET study instruments” – the instruments that in many ways are “the best” we have thus far to offer in terms of teacher evaluation (i.e., observational, and student survey instruments).

But why does instructional sensitivity matter? Because the inferences made from these state test results, independently or, more likely, post-VAM calculation, “rely on the assumption that [state test] results accurately reflect the instruction received by the students taking the test. This assumption is at the heart of [investigations] of instructional sensitivity.” See another related (and still troublesome) post about instructional sensitivity here.

Polikoff, M. S. (2014). Does the Test Matter? Evaluating teachers when tests differ in their sensitivity to instruction. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.). Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project (pp. 278-302). San Francisco, CA: Jossey-Bass.