Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) – published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of things are actual true (or near to true, nor will they likely ever be true) in educational practice, however. Hence, the errors we continue to observe that continue to prevent VAM use for their intended utilities, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added is that the tests are very narrow, focus on lower level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes’…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessing Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p.133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of groups of student across teachers classrooms who are the ones whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is the one of most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (e) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include: VAM output are grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output are “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added causes “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Related, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, using VAMs for teacher evaluations or not. Given it’s rare in this area of research we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (p. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

In Schools, Teacher Quality Matters Most

Education Next — a non peer-reviewed journal with a mission to “steer a steady course, presenting the facts as best they can be determined…[while]…partak[ing] of no program, campaign, or ideology,” although these last claims are certainly of controversy (see, for example, here and here) — just published an article titled “In Schools, Teacher Quality Matters Most” as part of the journal’s series commemorating the 50th anniversary of James Coleman’s (and colleagues’) groundbreaking 1966 report, “Equality of Educational Opportunity.”

For background, the purpose of The Coleman Report was to assess the equal educational opportunities provided to children of different race, color, religion, and national origin. The main finding was that what we know today as students of color (although African American students were of primary focus in this study), who are (still) often denied equal educational opportunities due to a variety of factors, are largely and unequally segregated across America’s public schools, especially as segregated from their white and wealthier peers. These disparities were most notable via achievement measures, and what we know today as “the achievement gap.” Accordingly, Coleman et al. argued that equal opportunities for students in said schools mattered (and continue to matter) much more for these traditionally marginalized and segregated students than for those who were/are whiter and more economically fortunate. In addition, Coleman argued that out-of-school influences also mattered much more than in-school influences on said achievement. On this point, though, The Coleman Report was of great controversy, and (mis)interpreted as (still) supporting arguments that students’ teachers and schools do/don’t matter as much as students’ families and backgrounds do.

Hence, the Education Next article of focus in this post takes this up, 50 years later, and post the advent of value-added models (VAMs) as better measures than those to which Coleman and his colleagues had access. The article is authored by Dan Goldhaber — a Professor at the University of Washington Bothell, Director of the National Center for Analysis of Longitudinal Data in Education Research (CALDER), and a Vice-President at the American Institutes of Research (AIR). AIR is one of our largest VAM consulting/contract firms, and Goldabher is, accordingly, perhaps one of the field’s most vocal proponents of VAMs and their capacities to both measure and increase teachers’ noteworthy effects (see, for example here); hence, it makes sense he writes about said teacher effects in this article, and in this particular journal (see, for example, Education Next’s Editorial and Editorial Advisory Board members here).

Here is his key claim.

Goldhaber argues that The Coleman Report’s “conclusions about the importance of teacher quality, in particular, have stood the test of time, which is noteworthy, [especially] given that today’s studies of the impacts of teachers [now] use more-sophisticated statistical methods and employ far better data” (i.e., VAMs). Accordingly, Goldhaber’s primary conclusion is that “the main way that schools affect student outcomes is through the quality of their teachers.”

Note that Goldhaber does not offer in this article much evidence, other than evidence not cited or provided by some of his econometric friends (e.g., Raj Chetty). Likewise, Goldhaber cites none of the literature coming from educational statistics, even though recent estimates [1] suggest that approximately 83% of articles written since 1893 (the year in which the first article about VAMs was ever published, in the Journal of Political Economy) on this topic have been published in educational journals, and 14% have been published in economics journals (3% have been published in education finance journals). Hence, what we are clearly observing as per the literature on this topic are severe slants in perspective, especially when articles such as these are written by econometricians, versus educational researchers and statisticians, who often marginalize the research of their education, discipline-based colleagues.

Likewise, Goldhaber does not cite or situate any of his claims within the recent report released by the American Statistical Association (ASA), in which it is written that “teachers account for about 1% to 14% of the variability in test scores.” While teacher effects do matter, they do not matter nearly as much as many, including many/most VAM proponents including Goldhaber, would like us to naively accept and believe. The truth of the matter is is that teachers do indeed matter, in many ways including their impacts on students’ affects, motivations, desires, aspirations, senses of efficacy, and the like, all of which are not estimated on the large-scale standardized tests that continue to matter and that are always the key dependent variables across these and all VAM-based studies today. As Coleman argued 50 years ago, as recently verified by the ASA, students’ out-of-school and out-of-classroom environments matter more, as per these dependent variables or measures.

I think I’ll take ASA’s “word” on this, also as per Coleman’s research 50 years prior.


[1] Reference removed as the manuscript is currently under blind peer-review. Email me if you have any questions at audrey.beardsley@asu.edu

New York Teacher Sheri Lederman’s Lawsuit Update

Recall the New York lawsuit pertaining to Long Island teacher Sheri Lederman? The teacher who by all accounts other than her recent (2013-2014) 1 out of 20 growth score is a terrific 4th grade, 18 year veteran teacher. She, along with her attorney and husband Bruce Lederman, are suing the state of New York to challenge the state’s growth-based teacher evaluation system. See prior posts about Sheri’s case herehere and here. I, along with Linda Darling-Hammond (Stanford), Aaron Pallas (Columbia University Teachers College), Carol Burris (Executive Director of the Network for Public Education Foundation), Brad Lindell (Long Island Research Consultant), Sean Corcoran (New York University) and Jesse Rothstein (University of California – Berkeley) are serving as part of Sheri’s team.

Bruce Lederman just emailed me with an update, and some links re: this update (below), and he gave me permission to share all of this with you.

The judge hearing this case recently asked the lawyers on both sides of Sheri’s case to brief the court by the end of this month (February 29, 2016) on a new issue, positioned and pushed back into the court by the New York State Education Department (NYSED). The issue to be heard pertains to the state’s new “moratorium” or “emergency regulations” related to the state’s high-stakes use of its growth scores, all of which is likely related to the political reaction to the opt-out movement throughout the state of New York, the publicity pertaining to the Lederman lawsuit in and of itself, and the federal government’s adoption of the recent Every Student Succeeds Act (ESSA) given its specific provision that now permits states to decide whether (and if so how) to use teachers’ students’ test scores to hold teachers accountable for their levels of growth (in New York) or value-added.

While the federal government did not abolish such practices via its ESSA, the federal government did hand back to the states all power and authority over this matter. Accordingly, this does not mean growth models/VAMs are going to simply disappear, as states do still have the power and authority to move forward with their prior and/or their new teacher evaluation systems, based in small or large part, on growth models/VAMs. As also quite evident since President Obama’s signing of the ESSA, some states are continuing to move forward in this regard, and regardless of the ESSA, in some cases at even higher speeds than before, in support of what some state policymakers still apparently believe (despite the research) are the accountability measures that will still help them to (symbolically) support educational reform in their states. See, for example, prior posts about the state of Alabama, here, New Mexico, here, and Texas, here, which is still moving forward with its plans introduced pre-ESSA. See prior posts about New York here, here, and here, the state in which also just one year ago Governor Cuomo was promoting increased use of New York’s growth model and publicly proclaiming that it was “baloney” that more teachers were not being found “ineffective,” after which Cuomo pushed through the New York budget process amendments to the law increasing the weight of teachers’ growth scores to an approximate 50% weight in many cases.

Nonetheless, as per this case in New York, state Attorney General Eric Schneiderman, on behalf of the NYSED, offered to settle this lawsuit out of court by giving Sheri some accommodation on her aforementioned 2013-2014 score of 1 out of 20, if Sheri and Bruce dropped the challenge to the state’s VAM-based teacher evaluation system. Sheri and Bruce declined, for a number or reasons, including that under the state’s recent “moratorium,” the state’s growth model is still set to be used throughout the state of New York for the next four years, with teachers’ annual performance reviews based in part on growth scores reported to parents, newspapers (on an aggregate basis), and the like. While, again, high-stakes are not to be attached to the growth output for four years, the scores will still “count.”

Hence, Sheri and Bruce believe that because they have already “convincingly” shown that the state’s growth model does not “rationally” work for teacher evaluation purposes, and that teacher evaluations as based on the state’s growth model actually violate state law since teachers like Sheri are not capable of getting perfect scores (which is “irrational”), they will continue with this case, also on behalf of New York teachers and principals who are “demoralized” by the system, as well as New York taxpayers who are paying (millions “if not tens of millions of dollars” for the system’s (highly) unreliable and inaccurate results.

As per Bruce’s email: “Spending the next 4 years studying a broken system is a terrible idea and terrible waste of taxpayer $$s. Also, if [NYSED] recognizes that Sheri’s 2013-14 score of 1 out of 20 is wrong [which they apparently recognize given their offer to settle this suit out of court], it’s sad and frustrating that [NYSED] still wants to fight her score unless she drops her challenge to the evaluation system in general.”

“We believe our case is already responsible for the new administrative appeal process in NY, and also partly responsible for Governor Cuomos’ apparent reversal on his stand about teacher evaluations. However, at this point we will not settle and allow important issues to be brushed under the carpet. Sheri and I are committed to pressing ahead with our case.”

To read more about this case via a Politico New York article click here (registration required). To hear more from Bruce Lederman about this case via WCNY-TV, Syracuse, click here. The pertinent section of this interview starts at 22:00 minutes and ends at 36:21. It’s well worth listening!

Report on the Stability of Student Growth Percentile (SGP) “Value-Added” Estimates

The Student Growth Percentiles (SGPs) model, which is loosely defined by value-added model (VAM) purists as a VAM, uses students’ level(s) of past performance to determine students’ normative growth over time, as compared to his/her peers. “SGPs describe the relative location of a student’s current score compared to the current scores of students with similar score histories” (Castellano & Ho, p. 89). Students are compared to themselves (i.e., students serve as their own controls) over time; therefore, the need to control for other variables (e.g., student demographics) is less necessary, although this is of debate. Nonetheless, the SGP model was developed as a “better” alternative to existing models, with the goal of providing clearer, more accessible, and more understandable results to both internal and external education stakeholders and consumers. For more information about the SGP please see prior posts here and here. See also an original source about the SGP here.

Related, in a study released last week, WestEd researchers conducted an “Analysis of the stability of teacher-level growth scores [derived] from the student growth percentile [SGP] model” in one, large school district in Nevada (n=370 teachers). The key finding they present is that “half or more of the variance in teacher scores from the [SGP] model is due to random or otherwise unstable sources rather than to reliable information that could predict future performance. Even when derived by averaging several years of teacher scores, effectiveness estimates are unlikely to provide a level of reliability desired in scores used for high-stakes decisions, such as tenure or dismissal. Thus, states may want to be cautious in using student growth percentile [SGP] scores for teacher evaluation.”

Most importantly, the evidence in this study should make us (continue to) question the extent to which “the learning of a teacher’s students in one year will [consistently] predict the learning of the teacher’s future students.” This is counter to the claims continuously made by VAM proponents, including folks like Thomas Kane — economics professor from Harvard University who directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation. While faint signals of what we call predictive validity might be observed across VAMs, what folks like Kane overlook or avoid is that very often these faint signals do not remain constant over time. Accordingly, the extent to which we can make stable predictions is limited.

Worse is when folks falsely assume that said predictions will remain constant over time, and they make high-stakes decisions about teachers unaware of the lack of stability present, in typically 25-59% of teachers’ value-added (or in this case SGP) scores (estimates vary by study and by analyses using one to three years of data — see, for example, the studies detailed in Appendix A of this report; see also other research on this topic here, here, and here). Nonetheless, researchers in this study found that in mathematics, 50% of the variance in teachers’ value-added scores were attributable to differences among teachers, and the other 50% was random or unstable. In reading, 41% of the variance in teachers’ value-added scores were attributable to differences among teachers, and the other 59% was random or unstable.

In addition, using a 95% confidence interval (which is very common in educational statistics) researchers found that in mathematics, a teacher’s true score would span 48 points, “a margin of error that covers nearly half the 100 point score scale,” whereby “one would be 95 percent confident that the true math score of a teacher who received a score of 50 [would actually fall] between 26 and 74.” For reading, a teacher’s true score would span 44 points, whereby one would be 95 percent confident that the true reading score of a teacher who received a score of 50 would actually fall between 38 and 72. The stability of these scores would increase with three years of data, which has also been found by other researchers on this topic. However, they too have found that such error rates persist to an extent that still prohibits high-stakes decision making.

In more practical terms, what this also means is that a teacher who might be considered highly ineffective might be terminated, even though the following year (s)he could have been observed to be highly effective. Inversely, teachers who are awarded tenure might be observed as ineffective one, two, and/or three years following, not because their true level(s) of effectiveness change, but because of the error in the estimates that causes such instabilities to occur. Hence, examinations of the the stability of such estimates over time provides essential evidence of the validity, and in this case predictive validity, of the interpretations and uses of such scores over time. This is particularly pertinent when high-stakes decisions are to be based on (or in large part on) such scores, especially given some researchers are calling for reliability coefficients of .85 or higher to make such decisions (Haertel, 2013; Wasserman & Bracken, 2003).

In the end, researchers’ overall conclusion is that SGP-derived “growth scores alone may not be sufficiently stable to support high-stakes decisions.” Likewise, relying on the extant research on this topic, the overall conclusion can be broadened in that neither SGP- or VAM-based growth scores may be sufficiently stable to support high-stakes decisions. In other words, it is not just the SGP model that is yielding such issues with stability (or a lack thereof). Again, see the other literature in which researchers situated their findings in Appendix A. See also other similar studies here, here, and here.

Accordingly, those who read this report, and consequently seek to find a better or more stable model that yields more stable estimates, will unfortunately but likely fail in their search.


Castellano, K. E., & Ho, A. D. (2013). A practitioner’s guide to growth models. Washington, DC: Council of Chief State School Officers.

Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

Lash, A., Makkonen, R., Tran, L., & Huang, M. (2016). Analysis of the stability of teacher-level growth scores [derived] from the student growth percentile [SGP] model. (16–104). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory West.

Wasserman, J. D., & Bracken, B. A. (2003). Psychometric characteristics of assessment procedures. In I. B. Weiner, J. R. Graham, & J. A. Naglieri (Eds.), Handbook of psychology:
Assessment psychology (pp. 43–66). Hoboken, NJ: John Wiley & Sons.

Special Issue of “Educational Researcher” (Paper #7 of 9): VAMs Situated in Appropriate Ecologies

Recall that the peer-reviewed journal Educational Researcher (ER) – recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#7 of 9), which is actually a commentary titled “The Value in Value-Added Depends on the Ecology.” This commentary is authored by Henry Braun – Professor of Education and Public Policy, Educational Research, Measurement, and Evaluation at Boston College (also the author of a previous post on this site here).

In this article Braun, importantly, makes explicit the assumptions on which this special issue of ER is based; that is, on assumptions that (1) too many students in America’s public schools are being inadequately educated, (2) evaluation systems as they currently exist “require radical overhaul,” and (3) it is therefore essential to use student test performance with low- and high-stakes attached to improve that which educators do (or don’t do) to adequately address the first assumption. There are counterarguments Braun also offers to readers on each of these assumptions (see p. 127), but more importantly he makes evident that the focus of this special issue is situated otherwise, as in line with current education policies. This special issue, overall, then “raise[s] important questions regarding the potential for high-stakes, test-driven educator accountability systems to contribute to raising student achievement” (p. 127).

Given this context, the “value-added” provided within this special issue, again according to Braun, is that the authors of each of the five main research articles included report on how VAM output actually plays out in practice, given “careful consideration to how the design and implementation of teacher evaluation systems could be modified to enhance the [purportedly, see comments above] positive impact of accountability and mitigate the negative consequences” at the same time (p. 127). In other words, if we more or less agree to the aforementioned assumptions, also given the educational policy context influence, perpetuating, or actually forcing these assumptions, these articles should help others better understand VAMs’ and observational systems’ potentials and perils in practice.

At the same time, Braun encourages us to note that “[t]he general consensus is that a set of VAM scores does contain some useful information that meaningfully differentiates among teachers, especially in the tails of the distribution [although I would argue bias has a role here]. However, individual VAM scores do suffer from high variance and low year-to-year stability as well as an undetermined amount of bias [which may be greater in the tails of the distribution]. Consequently, if VAM scores are to be used for evaluation, they should not be given inordinate weight and certainly not treated as the “gold standard” to which all other indicators must be compared” (p. 128).

Likewise, it’s important to note that IF consequences are to be attached to said indicators of teacher evaluation (i.e., VAM and observational data), there should be validity evidence made available and transparent to warrant the inferences and decisions to be made, and the validity evidence “should strongly support a causal [emphasis added] argument” (p. 128). However, both indicators still face major “difficulties in establishing defensible causal linkage[s]” as theorized, and desired (p. 128); hence, this prevents validity in inference. What does not help, either, is when VAM scores are given precedence over other indicators OR when principals align teachers’ observational scores with the same teachers’ VAM scores given the precedence often given to (what are often viewed as the superior, more objective) VAM-based measures. This sometimes occurs given external pressures (e.g., applied by superintendents) to artificially conflate, in this case, levels of agreement between indicators (i.e., convergent validity).

Related, in the section Braun titles his “Trio of Tensions,” (p. 129) he notes that (1) [B]oth accountability and improvement are undermined, as attested to by a number of the articles in this issue. In the current political and economic climate, [if possible] it will take thoughtful and inspiring leadership at the state and district levels to create contexts in which an educator evaluation system constructively fulfills its roles with respect to both public accountability and school improvement” (p. 129-130); (2) [T]he chasm between the technical sophistication of the various VAM[s] and the ability of educators to appreciate what these models are attempting to accomplish…sow[s] further confusion…[hence]…there must be ongoing efforts to convey to various audiences the essential issues—even in the face of principled disagreements among experts on the appropriate roles(s) for VAM[s] in educator evaluations” (p. 130); and finally (3) [H]ow to balance the rights of students to an adequate education and the rights of teachers to fair evaluations and due process [especially for]…teachers who have value-added scores and those who teach in subject-grade combinations for which value-added scores are not feasible…[must be addressed; this] comparability issue…has not been addressed but [it] will likely [continue to] rear its [ugly] head” (p. 130).

In the end, Braun argues for another “Trio,” but this one including three final lessons: (1) “although the concerns regarding the technical properties of VAM scores are not misplaced, they are not necessarily central to their reputation among teachers and principals. [What is central is]…their links to tests of dubious quality, their opaqueness in an atmosphere marked by (mutual) distrust, and the apparent lack of actionable information that are largely responsible for their poor reception” (p. 130); (2) there is a “very substantial, multiyear effort required for proper implementation of a new evaluation system…[related, observational] ratings are not a panacea. They, too, suffer from technical deficiencies and are the object of concern among some teachers because of worries about bias” (p. 130); and (3) “legislators and policymakers should move toward a more ecological approach [emphasis added; see also the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here] to the design of accountability systems; that is, “one that takes into account the educational and political context for evaluation, the behavioral responses and other dynamics that are set in motion when a new regime of high-stakes accountability is instituted, and the long-term consequences of operating the system” (p. 130).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; and see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here.

Article #7 Reference: Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127-131. doi:10.3102/0013189X15576341

Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)

David C. Berliner, Regents’ Professor Emeritus here at Arizona State University (ASU), who also just happens to be my former albeit forever mentor, recently took up research on the use of test scores to evaluate teachers, for example, using value-added models (VAMs). While David is world-renowned for his research in educational psychology, and more specific to this case, his expertise on effective teaching behaviors and how to capture and observe them, he has also now ventured into the VAM-related debates.

Accordingly, he recently presented his newest and soon-to-be-forthcoming published research on using standardized tests to evaluate teachers, something he aptly termed in the title of his presentation “A Policy Fiasco.” He delivered his speech to an audience in Melbourne, Australia, and you can click here for the full video-taped presentation; however, given the whole presentation takes about one hour to watch, although I must say watching the full hour is well worth it, I highlight below what are his highlights and key points. These should certainly be of interest to you all as followers of this blog, and hopefully others.

Of main interest are his 14 reasons, “big and small’ for [his] judgment that assessing teacher competence using standardized achievement tests is nearly worthless.”

Here are his fourteen reasons:

  1. “When using standardized achievement tests as the basis for inferences about the quality of teachers, and the institutions from which they came, it is easy to confuse the effects of sociological variables on standardized test scores” and the effects teachers have on those same scores. Sociological variables (e.g., chronic absenteeism) continue to distort others’ even best attempts to disentangle them from the very instructional variables of interest. This, what we also term as biasing variables, are important not to inappropriately dismiss, as purportedly statistically “controlled for.”
  2. In law, we do not hold people accountable for the actions of others, for example, when a child kills another child and the parents are not charged as guilty. Hence, “[t]he logic of holding [teachers and] schools of education responsible for student achievement does not fit into our system of law or into the moral code subscribed to by most western nations.” Related, should medical school or doctors, for that matter, be held accountable for the health of their patients? One of the best parts of his talk, in fact, is about the medical field and the corollaries Berliner draws between doctors and medical schools, and teachers and colleges of education, respectively (around the 19-25 minute mark of his video presentation).
  3. Professionals are often held harmless for their lower success rates with clients who have observable difficulties in meeting the demands and the expectations of the professionals who attend to them. In medicine again, for example, when working with impoverished patients, “[t]here is precedent for holding [doctors] harmless for their lowest success rates with clients who have observable difficulties in meeting the demands and expectations of the [doctors] who attend to them, but the dispensation we offer to physicians is not offered to teachers.”
  4. There are other quite acceptable sources of data, besides tests, for judging the efficacy of teachers and teacher education programs. “People accept the fact that treatment and medicine may not result in the cure of a disease. Practicing good medicine is the goal, whether or not the patient gets better or lives. It is equally true that competent teaching can occur independent of student learning or of the achievement test scores that serve as proxies for said learning. A teacher can literally “save lives” and not move the metrics used to measure teacher effectiveness.
  5. Reliance on standardized achievement test scores as the source of data about teacher quality will inevitably promote confusion between “successful” instruction and “good” instruction. “Successful” instruction gets test scores up. “Good” instruction leaves lasting impressions, fosters further interest by the students, makes them feel competent in the area, etc. Good instruction is hard to measure, but remains the goal of our finest teachers.
  6. Related, teachers affect individual students greatly, but affect standardized achievement test scores very little. All can think of how their own teachers impacted their lives in ways that cannot be captured on a standardized achievement test.  Standardized achievement test scores are much more related to home, neighborhood and cohort than they are to teachers’ instructional capabilities. In more contemporary terms, this is also due the fact that large-scale standardized tests have (still) never been validated to measure student growth over time, nor have they been validated to attribute that growth to teachers. “Teachers have huge effects, it’s just that the tests are not sensitive to them.”
  7. Teacher’s effects on standardized achievement test scores fade quickly, barely discernable after a few years. So we might not want to overly worry about most teachers’ effects on their students—good or bad—as they are hard to detect on tests after two or so years. To use these ephemeral effects to then hold teacher education programs accountable seems even more problematic.
  8. Observational measures of teacher competency and achievement tests of teacher competency do not correlate well. This suggest nothing more than that one or both of these measures, and likely the latter, are malfunctioning in their capacities to measure the teacher effectiveness construct. See other Vamboozled posts about this here, here, and here.
  9. Different standardized achievement tests, both purporting to measure reading, mathematics, or science at the same grade level, will give different estimates of teacher competency. That is because different test developers have different visions of what it means to be competent in each of these subject areas. Thus one achievement test in these subject areas could find a teacher exemplary, but another test of those same subject areas would find the teacher lacking. What then? Have we an unstable teacher or an ill-defined subject area?
  10. Tests can be administered early or late in the fall, early or late in the spring, and the dates they are given influence the judgments about whether a teacher is performing well or poorly. Teacher competency should not be determined by minor differences in the date of testing, but that happens frequently.
  11. No standardized achievement tests have provided proof that their items are instructionally sensitive. If test items do not, because they cannot “react to good instruction,” how can one make a claim that the test items are “tapping good instruction?”
  12. Teacher effects show up more dramatically on teacher made tests than on standardized achievement tests because the former are based on the enacted curriculum, while the latter are based on the desired curriculum. You get seven times more instructionally sensitive tests the closer the test is to the classroom (i.e., teacher made tests).
  13. The opt-out testing movement invalidates inferences about teachers and schools that can be made from standardized achievement test results. Its not bad to remove these kids from taking these tests, and perhaps it is even necessary in our over-tested schools, but the tests and the VAM estimates derived via these tests, are far less valid when that happens. This is because the students who opt out are likely different in significant ways from those who do take the tests. This severely limits the validity claims that are made.
  14. Assessing new teachers with standardized achievement tests is likely to yield many false negatives. That is, the assessments would identify teachers early in their careers as ineffective in improving test scores, which is, in fact, often the case for new teachers. Two or three years later that could change. Perhaps the last thing we want to do in a time of teacher shortage is discourage new teachers while they acquire their skills.

Victory in Court: Consequences Attached to VAMs Suspended Throughout New Mexico

Great news for New Mexico and New Mexico’s approximately 23,000 teachers, and great news for states and teachers potentially elsewhere, in terms of setting precedent!

Late yesterday, state District Judge David K. Thomson, who presided over the ongoing teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), is unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here).

Thereafter, one scheduled day of testimonies turned into five in Santa Fe, that ran from the end of September through the beginning of October (each of which I covered here, here, here, here, and here). I served as the expert witness for the plaintiff’s side, along with other witnesses including lawmakers (e.g., a state senator) and educators (e.g., teachers, superintendents) who made various (and very articulate) claims about the state’s teacher evaluation system on the stand. Thomas Kane served as the expert witness for the defendant’s side, along with other witnesses including lawmakers and educators who made counter claims about the system, some of which backfired, unfortunately for the defense, primarily during cross-examination.

See articles released about this ruling this morning in the Santa Fe New Mexican (“Judge suspends penalties linked to state’s teacher eval system”) and the Albuquerque Journal (“Judge curbs PED teacher evaluations).” See also the AFT’s press release, written by AFT President Randi Weingarten, here. Click here for the full 77-page Order written by Judge Thomson (see also, below, five highlights I pulled from this Order).

The journalist of the Santa Fe New Mexican, though, provided the most detailed information about Judge Thomson’s Order, writing, for example, that the “ruling by state District Judge David Thomson focused primarily on the complicated combination of student test scores used to judge teachers. The ruling [therefore] prevents the Public Education Department [PED] from denying teachers licensure advancement or renewal, and it strikes down a requirement that poorly performing teachers be placed on growth plans.” In addition, the Judge noted that “the teacher evaluation system varies from district to district, which goes against a state law calling for a consistent evaluation plan for all educators.”

The PED continues to stand by its teacher evaluation system, calling the court challenge “frivolous” and “a legal PR stunt,” all the while noting that Judge Thomson’s decision “won’t affect how the state conducts its teacher evaluations.” Indeed it will, for now and until the state’s teacher evaluation system is vetted, and validated, and “the court” is “assured” that the system can actually be used to take the “consequential actions” against teachers, “required” by the state’s PED.

Here are some other highlights that I took directly from Judge Thomson’s ruling, capturing what I viewed as his major areas of concern about the state’s system (click here, again, to read Judge Thomson’s full Order):

  • Validation Needed: “The American Statistical Association says ‘estimates from VAM should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAM are used for high stake[s] purposes” (p. 1). These are the measures, assumptions, limitations, and the like that are to be made transparent in this state.
  • Uniformity Required: “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate” (p. 2)…with the existing statutory framework for teacher evaluations for licensure purposes requiring “that the teacher be evaluated for ‘competency’ against a ‘highly objective uniform statewide standard of evaluation’ to be developed by PED” (p. 4). “It is the term ‘highly objective uniform’ that is the subject matter of this suit” (p. 4), whereby the state and no other “party provided [or could provide] the Court a total calculation of the number of available district-specific plans possible given all the variables” (p. 54). See also the Judge’s points #78-#80 (starting on page 70) for some of the factors that helped to “establish a clear lack of statewide uniformity among teachers” (p. 70).
  • Transparency Missing: “The problem is that it is not easy to pull back the curtain, and the inner workings of the model are not easily understood, translated or made accessible” (p. 2). “Teachers do not find the information transparent or accurate” and “there is no evidence or citation that enables a teacher to verify the data that is the content of their evaluation” (p. 42). In addition, “[g]iven the model’s infancy, there are no real studies to explain or define the [s]tate’s value-added system…[hence, the consequences and decisions]…that are to be made using such system data should be examined and validated prior to making such decisions” (p. 12).
  • Consequences Halted: “Most significant to this Order, [VAMs], in this [s]tate and others, are being used to make consequential decisions…This is where the rubber hits the road [as per]…teacher employment impacts. It is also where, for purposes of this proceeding, the PED departs from the statutory mandate of uniformity requiring an injunction” (p. 9). In addition, it should be noted that indeed “[t]here are adverse consequences to teachers short of termination” (p. 33) including, for example, “a finding of ‘minimally effective’ [that] has an impact on teacher licenses” (p. 41). These, too, are to be halted under this injunction Order.
  • Clarification Required: “[H]ere is what this [O]rder is not: This [O]rder does not stop the PED’s operation, development and improvement of the VAM in this [s]tate, it simply restrains the PED’s ability to take consequential actions…until a trial on the merits is held” (p. 2). In addition, “[a] preliminary injunction differs from a permanent injunction, as does the factors for its issuance…’ The objective of the preliminary injunction is to preserve the status quo [minus the consequences] pending the litigation of the merits. This is quite different from finally determining the cause itself” (p. 74). Hence, “[t]he court is simply enjoining the portion of the evaluation system that has adverse consequences on teachers” (p. 75).

The PED also argued that “an injunction would hurt students because it could leave in place bad teachers.” As per Judge Thomson, “That is also a faulty argument. There is no evidence that temporarily halting consequences due to the errors outlined in this lengthy Opinion more likely results in retention of bad teachers than in the firing of good teachers” (p. 75).

Finally, given my involvement in this lawsuit and given the team with whom I was/am still so fortunate to work (see picture below), including all of those who testified as part of the team and whose testimonies clearly proved critical in Judge Thomson’s final Order, I want to thank everyone for all of their time, energy, and efforts in this case, thus far, on behalf of the educators attempting to (still) do what they love to do — teach and serve students in New Mexico’s public schools.


Left to right: (1) Stephanie Ly, President of AFT New Mexico; (2) Dan McNeil, AFT Legal Department; (3) Ellen Bernstein, ATF President; (4) Shane Youtz, Attorney at Law; and (5) me 😉

Including Summers “Adds Considerable Measurement Error” to Value-Added Estimates

A new article titled “The Effect of Summer on Value-added Assessments of Teacher and School Performance” was recently released in the peer-reviewed journal Education Policy Analysis Archives. The article is authored by Gregory Palardy and Luyao Peng from the University of California, Riverside. 

Before we begin, though, here is some background so that you all understand the importance of the findings in this particular article.

In order to calculate teacher-level value added, all states are currently using (at minimum) the large-scale standardized tests mandated by No Child Left Behind (NCLB) in 2002. These tests were mandated for use in the subject areas of mathematics and reading/language arts. However, because these tests are given only once per year, typically in the spring, to calculate value-added statisticians measure actual versus predicted “growth” (aka “value-added”) from spring-to-spring, over a 12-month span, which includes summers.

While many (including many policymakers) assume that value-added estimations are calculated from fall to spring during time intervals under which students are under the same teachers’ supervision and instruction, this is not true. The reality is that the pre- to post-test occasions actually span 12-month periods, including the summers that often cause the nettlesome summer effects often observed via VAM-based estimates. Different students learn different things over the summer, and this is strongly associated (and correlated) with student’s backgrounds, and this is strongly associated (and correlated) with students’ out-of-school opportunities (e.g., travel, summer camps, summer schools). Likewise, because summers are the time periods over which teachers and schools tend to have little control over what students do, this is also the time period during which research  indicates that achievement gaps maintain or widen. More specifically, research indicates that indicates that students from relatively lower socio-economic backgrounds tend to suffer more from learning decay than their wealthier peers, although they learn at similar rates during the school year.

What these 12-month testing intervals also include are prior teachers’ residual effects, whereas students testing in the spring, for example, finish out every school year (e.g., two months or so) with their prior teachers before entering the classrooms of the teachers for whom value-added is to be calculated the following spring, although teachers’ residual effects were not of focus in this particular study.

Nonetheless, via the research, we have always known that these summer (and prior or adjacent teachers’ residual effects) are difficult if not impossible to statistically control. This in and of itself leads to much of the noise (fluctuations/lack of reliability, imprecision, and potential biases) we observe in the resulting value-added estimates. This is precisely what was of focus in this particular study.

In this study researchers examined “the effects of including the summer period on value-added assessments (VAA) of teacher and school performance at the [1st] grade [level],” as compared to using VAM-based estimates derived from a fall-to-spring test administration within the same grade and same year (i.e., using data derived via a nationally representative sample via the National Center for Education Statistics (NCES) with an n=5,034 children).

Researchers found that:

  • Approximately 40-62% of the variance in VAM-based estimates originates from the summer period, depending on the reading or math outcome;
  • When summer is omitted from VAM-based calculations using within year pre/post-tests, approximately 51-61% of the teachers change performance categories. What this means in simpler terms is that including summers in VAM-based estimates is indeed causing some of the errors and misclassification rates being observed across studies.
  • Statistical controls to control for student and classroom/school variables reduces summer effects considerably (e.g., via controlling for students’ prior achievement), yet 36-47% of teachers still fall into different quintiles when summers are included in the VAM-based estimates.
  • Findings also evidence that including summers within VAM-based calculations tends to bias VAM-based estimates against schools with higher relative concentrations of poverty, or rather higher relative concentrations of students who are eligible for the federal free-and-reduced lunch program.
  • Overall, results suggest that removing summer effects from VAM-based estimates may require biannual achievement assessments (i.e., fall and spring). If we want VAM-based estimates to be more accurate, we might have to double the number of tests we administer per year in each subject area for which teachers are to be held accountable using VAMs. However, “if twice-annual assessments are not conducted, controls for prior achievement seem to be the best method for minimizing summer effects.”

This is certainly something to consider in terms of trade-offs, specifically in terms of whether we really want to “double-down” on the number of tests we already require our public students to take (also given the time that testing and test preparation already takes away from students’ learning activities), and whether we also want to “double-down” on the increased costs of doing so. I should also note here, though, that using pre/post-tests within the same year is (also) not as simple as it may seem (either). See another post forthcoming about the potential artificial deflation/inflation of pre/post scores to manufacture artificial levels of growth.

To read the full study, click here.

*I should note that I am an Associate Editor for this journal, and I served as editor for this particular publication, seeing it through the full peer-reviewed process.

Citation: Palardy, G. J., & Peng, L. (2015). The effects of including summer on value-added assessments of teachers and schools. Education Policy Analysis Archives, 23(92). doi:10.14507/epaa.v23.1997 Retrieved from http://epaa.asu.edu/ojs/article/view/1997

The Forgotten VAM: The A-F School Grading System

Here is another post from our “Concerned New Mexico Parent” (see prior posts from him/her here and here). This one is about New Mexico’s A-F School Grading System and how it is not only contradictory, within and beyond itself, but how it also provides little instrumental value to the public as an invalid indicator of the “quality” of any school.

(S)he writes:

  1. What do you call a high school that has only 38% of its students proficient in reading and 35% of its students proficient in mathematics?
  2. A school that needs help improving their student scores.
  3. What does the New Mexico Public Education Department (NMPED) call this same high school?
  4. A top-rated “A” school, of course.

Readers of this blog are familiar with the VAMs being used to grade teachers. Many states have implemented analogous formulas to grade entire schools. This “forgotten” VAM suffers from all of the familiar problems of the teacher formulas — incomprehensibility, lack of transparency, arbitrariness, and the like.

The first problem with the A-F Grading System is inherent in its very name. The “A-F” terminology implies that this one static assessment is an accurate representation of a school’s quality. As you will see, it is nothing of the sort.

The second problem with the A-F Grading System is that is is composed of benchmarks that are arbitrarily weighted and scored by the NMPED using VAM methodologies.

Thirdly, the “collapsing of the data” from a numeric score to a grade (corresponding to a range of values) causes valuable information to be lost.

Table 1 shows the range of values for reading and mathematics proficiencies for each of the five A-F grade categories for New Mexico schools.

Table 1: Ranges and Median of Reading and Mathematics Proficiencies by A-F School Grade

School A-F Grade Number of Schools

Proficiency Range


Mathematics Proficiency Range


A 86

37.90 – 94.00


31.50 – 95.70


B 237

16.90 – 90.90


4.90 – 90.90


C 177

0.00 – 83.80


0.00 – 76.20


D 21

4.50 – 64.60


2.20 – 70.00


F 88

7.80 – 52.30


3.30 – 40.90


For example, to earn an A rating, a school can have between 37.9% and 94.0% of its students proficient in reading. In other words, a school can have roughly two-thirds of its students fail reading proficiency yet be rated as an “A” school!

The median value listed shows the point which splits the group in half — one-half of the scores are below the median value. Thus, an “A” school median of 66.2% indicates that one-half of the “A” schools have a reading proficiency below 66.2%. In other words, in one-half of the “A” schools 1/3 or more of their students are NOT proficient in reading!

Amazingly, the figures for mathematics are even worse, the minimum proficiency for a B rating is only 4.9% proficient! Scandalous!

Obviously, and contrary to popular and press perceptions, the A-F Grading System has nothing to do with the actual or current quality of the school!

A few case studies will highlight further absurdities of the New Mexico A-F School Grading System next.

Case Study 1 – Highest “A”   vs. Lowest “A” High School

Logan High School, Logan, New Mexico received the lowest reading proficiency of any “A” school, and the Albuquerque Institute of Math and Science received the highest reading proficiency score.

These two schools have both received an “A” rating. The Albuquerque Institute had a reading proficiency of 94% and a mathematics proficiency rating of 93%. Logan HS had a reading proficiency of only 38% and a mathematics proficiency rating of only 35%!

How is that possible?

First, much of the A-F VAM, like the teacher VAM is based on multi-year growth calculations and predictions. Logan has plenty of opportunity for growth whereas the Math Academy has “maxed” out most of its scores. Thus, the Albuquerque Institute is penalized in a manner analogous to Gifted and Talented teachers when teacher-level VAM is used. With already excellent scores, there is little, if any, room for improvement.

Second, Logan has an emphasis on shop/trade classes which yields a very high college and CAREER readiness score for the VAM calculation.

Also, a final factor is that the NMPED-defined range for an “A” extends from 75 to 100 points, and Logan barely made it into the A grouping.

Thus, a proficiency score of only 37.9% is no deterrent to an A score for Logan High.

Case Study 2: Hanging on by a Thread

As noted above, any school that scores between 75 and 100 points is considered an “A” school.

This statistical oddity was very beneficial to Hagerman High (Hagerman, NM) in their 2014 School Grade Report Card. They fell 5.99 points overall from the previous year’s score, but they managed to still receive an “A” score since their resulting 2014 score was exactly 75.01.

With this one one-hundredth of a point, they are in the same “A” grade category as the Albuquerque Institute of Math and Science (rated best in New Mexico by NMPED) and the Cottonwood Classical Preparatory School of Albuquerque (rated best in New Mexico by US News).

Case Study 3: A Tale of Two Ranking Systems

This inaccuracy and arbitrariness of any A-F School Grading System was also apparent in a recent Albuquerque Journal News article (May 14, 2015) which reported on the most recent US News ratings of high schools nationwide.

The Journal reported on the top 12 high schools in New Mexico as rated by US News. It is not surprising that most were NMPED A-rated schools. What was unusual is that the 3rd and 5th US News highest rated schools in New Mexico (South Valley Academy and Albuquerque High, both in Albuquerque) were actually rated as B schools by the NMPED A-F School Grading System.

According to NMPED data, I tabulated at least forty-four (44) high schools that were rated as “A” schools with higher NMPED scores than South Valley Academy which had an NMPED score of 71.4.

None of these 44 higher NMPED scoring schools were rated above South Valley Academy by US News.

Case Study 4: Punitive Grading

Many school districts and school boards throughout New Mexico have adopted policies that prohibit punitive grading based on behavior. It is no longer possible to lower a student’s grade just because of their behavior. The grade should reflect classroom assessment only.

NMPED ignores this policy in the context of the A-F School Grading System. Schools were graded down one letter grade if they did not achieve 95% participation rates.

One such school was Mills Elementary in the Hobbs Municipal Schools District. Only 198 students were tested; they fell 11 short of the 95% mark and were penalized one “grade”-level. Their grade was reduced from a “D” to an “F” In fact, Mills Elementary proficiency scores were higher than the A-rated Logan High School discussed earlier.

The likely explanation is that Hobbs has a highly transient population with both seasonal farm laborers and oil-field workers predominating in the local economy.

For more urban schools, it will be interesting to see how the NMPED policy of punitive grading will play out with the increasingly popular Opt-Out movement.


It is apparent that the NMPED’s A-F School Grading System rates schools deceptively using VAM-augmented data and provides little of any value to the public as to the “quality” of a school. By presenting it in the form of an “NMPED School Grade Report Card” the state seeks to hide its arbitrary nature.

Such a useless grade should certainly not be used to declare a school a “failure” and in need of radical reform.

Florida’s Superintendents Have Officially “Lost Confidence” in the State’s Teacher and School Evaluation System

In an article written by Valerie Strauss and featured yesterday in The Washington Post, school superintendents throughout the state of Florida have officially declared to the state that they have “lost confidence” in Florida’s teacher and school evaluation system (click here for the original article). More specifically, state school superintendents submitted a letter to the state asserting that “they ‘have lost confidence’ in the system’s accuracy and are calling for a suspension and a review of the system” (click here also for their original letter).

This is the evaluation model instituted by Florida’s former governor Jeb Bush, whom we all also identify as the brother of former President George W. Bush and son of former President George H. W. Bush. Former President GW Bush was also the architect of what most now agree is the failed No Child Left Behind (NCLB) Act of 2001, that former President GW Bush put forth as federal policy given his “educational reforms” in the state of Texas in the late 1990s. Beyond this brotherly association, however, Jeb is also currently a running candidate for the 2016 Republican presidential nomination, and via his campaign he is also (like his brother, pre-NCLB) touting his “educational reforms” in the state of Florida, hoping to take these reforms also to the White House.

It is Jeb Bush’s “educational reforms” that, without research evidence in support, also became models of “educational reform” for other states around the country, including states like New Mexico. In New Mexico, the state in which its current evaluation system, as based on the Florida model, is at the core of a potentially significant lawsuit (see prior posts about this lawsuit here, here, here, and here). Not surprisingly, one of Jeb Bush’s protégés – Hanna Skandera, who is currently the state of New Mexico’s Secretary of its Public Education Department (PED) – took Bush’s model to even more of an extreme, making this particular model one of the most arbitrary and capricious of all such models currently in place throughout the U.S. (as also defined and detailed more fully here).

Hence, it should certainly be of interest to those in New Mexico (and other states) that the teacher and school evaluation model upon which their state model was built, is now apparently falling apart, at least for now.

“The superintendents [in Florida] are calling for the state to suspend the system for a year – meaning that the scores from this spring’s administration of the [state’s] exams will not be used in [the state-level] evaluations [of teachers or schools].” The state school superintendents are also calling for “a full review” of the system, which certainly seems warranted in Florida, New Mexico, and any other state for that matter, in which similar evaluation models are being paid for using taxpayer revenues, without much if any research evidence to support that they are working as intended, and sold, and paid for.