New Mexico to Change its Teacher Evaluation System, But Not Really

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” In December 2015, state District Judge David K. Thomson granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court that the system is reliable, valid, fair, uniform, and the like (see prior post on this ruling here).

Late Friday afternoon, New Mexico’s Public Education Department (PED) announced that it is accordingly changing its NMTEACH teacher evaluation system and will be issuing new regulations. The PED’s primary goal is as follows: To (1) “Address major liabilities resulting from litigation” as these liabilities specifically pertain to the former NMTEACH system’s (a) Uniformity, (b) Transparency, and (c) Clarity. On the surface, this is gratifying to the extent that the state is attempting, at least theoretically, to please the court. But we, and especially those in New Mexico, might refrain from celebrating too soon…given that, when one reads the PED announcement here, one will see that this is yet another example of the state’s futile attempts to keep in place a very top-down teacher evaluation system. Note, however, that a uniform teacher evaluation system is also required under state law, although state statute could be changed should those at the state (including the governor, state superintendent, and PED) eventually decide to work with local districts and schools on the construction of a better teacher evaluation system for the state.

As per the PED’s subsequent goals, accordingly, things do not look much different from what they did in the past, especially given why and what got the state involved in such litigation in the first place. The state also intends to (2) Simplify processes for districts/charters and also for the PED, which is more or less fair. The state is also to (3) Establish a timeline for providing districts and schools more current data, in that such data are currently delayed by one school year, and these data are (still) needed for the state’s Pay for Performance plans (which were considered one of the high-stakes consequences at issue in Judge Thomson’s ruling). A tertiary goal is to deliver such data in a more timely fashion to teacher preparation programs, which is also of great controversy, as (uninformed) policymakers continue to believe that colleges of education should be held accountable for the test scores of their graduates’ students (see why this is problematic, for example, here). In the state’s final expressed goal, they make it explicit that (4) “Moving the timeline enhances the understanding that this system isn’t being used for termination decisions.” While this is certainly good, at least for now, the performance pay program is still of concern. As are the state’s continued attempts to (still) use students’ test scores to evaluate teachers, and the state’s perpetual belief that the data errors also exposed by the lawsuit were the fault of the school districts, not the state, which Judge Thomson also noted.

Regardless, here is the state’s “Legal Rationale,” and here is also where things go a bit more awry. As re-positioned by the state/PED, they write that “the NEA and AFT recently advanced lawsuits set on eliminating any meaningful teacher evaluation [emphasis added to highlight the language that the state is using to distort the genuine purposes of these lawsuits]. These lawsuits have exposed that the flexibility provided to local authorities has created confusion and complexity. Judge Thomson used this complexity when granting an injunction in the AFT case—citing a confusing array of classifications, tags, assessments, graduated considerations, etc. Judge Thomson made clear that he views this local authority as a threat to the statutorily required uniformity of the system [emphasis added given Judge Thomson said nothing of this sort, in terms of devaluing local authority or control; rather, he emphasized that the state’s menu of options was arbitrary and not uniform, especially given the consequences the state was requiring districts to enforce].” This, again, pertains to what is written in the current state statute in terms of a uniform teacher evaluation system.

Accordingly, and unfortunately, the state’s proposed changes would: “Provide a single plan that all districts and charters would use, providing greater uniformity,” and “Simplify the model from 107 possible classifications to three.” See three other moves detailed in the PED announcement here (e.g., moving data delivery dates, eliminating all but three tests, and the fall 2016 date by which all of this is to become official).

Related, see a visual of what the state’s “new and improved” teacher evaluation system, in response to said litigation, is to look like. Unfortunately, again, it really does not look much different from what came before except, perhaps, in the proposed reductions of testing options. See also the full document from which all of this came here.

[Figure: the PED’s visual of the proposed NMTEACH teacher evaluation system]

Nonetheless, we will have to wait to see if this, again, will please the court and satisfy Judge Thomson’s ruling that the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court that the system is reliable, valid, etc.

And as for what the President of the American Federation of Teachers (AFT) New Mexico – Stephanie Biondo-Ly – had to say in response, see her press release below. See also an article in the Las Cruces Sun-News here, in which President Ly is cited as “denounc[ing] the changes and call[ing] them attempts to obscure deficiencies in the [state’s] evaluation system.” From her original press release, she also wrote: “We are troubled…that once again, these changes are being implemented from the top down and if the secretary [Hanna Skandera] and her [PED] staff were serious about improving student outcomes and producing a fair evaluation system, they would have involved teachers, principals, and superintendents in the process.”


Report on the Stability of Student Growth Percentile (SGP) “Value-Added” Estimates

The Student Growth Percentile (SGP) model, which is loosely defined by value-added model (VAM) purists as a VAM, uses students’ level(s) of past performance to determine students’ normative growth over time, as compared to their peers. “SGPs describe the relative location of a student’s current score compared to the current scores of students with similar score histories” (Castellano & Ho, 2013, p. 89). Students are compared to themselves (i.e., students serve as their own controls) over time; therefore, the need to control for other variables (e.g., student demographics) is less necessary, although this is of debate. Nonetheless, the SGP model was developed as a “better” alternative to existing models, with the goal of providing clearer, more accessible, and more understandable results to both internal and external education stakeholders and consumers. For more information about the SGP, please see prior posts here and here. See also an original source about the SGP here.
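To make that normative comparison concrete, here is a minimal sketch of the SGP idea in Python. It is an illustration only, not the operational SGP model (which, as I understand it, is estimated via quantile regression on students’ full score histories); the simulated scores and the peer-matching window below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
prior = rng.normal(500, 100, n)                 # prior-year scale scores (invented)
current = 0.7 * prior + rng.normal(150, 80, n)  # current-year scale scores (invented)

def sgp(prior_score, current_score, prior, current, window=10.0):
    """Percentile rank of a student's current score among 'academic peers':
    students whose prior scores fall within +/- window points."""
    peers = np.abs(prior - prior_score) <= window
    return 100.0 * np.mean(current[peers] <= current_score)

# A student who scored 500 last year and 520 this year:
print(round(sgp(500, 520, prior, current), 1))
```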

Related, in a study released last week, WestEd researchers conducted an “Analysis of the stability of teacher-level growth scores [derived] from the student growth percentile [SGP] model” in one large school district in Nevada (n = 370 teachers). The key finding they present is that “half or more of the variance in teacher scores from the [SGP] model is due to random or otherwise unstable sources rather than to reliable information that could predict future performance. Even when derived by averaging several years of teacher scores, effectiveness estimates are unlikely to provide a level of reliability desired in scores used for high-stakes decisions, such as tenure or dismissal. Thus, states may want to be cautious in using student growth percentile [SGP] scores for teacher evaluation.”

Most importantly, the evidence in this study should make us (continue to) question the extent to which “the learning of a teacher’s students in one year will [consistently] predict the learning of the teacher’s future students.” This is counter to the claims continuously made by VAM proponents, including folks like Thomas Kane – the Harvard University economics professor who directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation. While faint signals of what we call predictive validity might be observed across VAMs, what folks like Kane overlook or avoid is that very often these faint signals do not remain constant over time. Accordingly, the extent to which we can make stable predictions is limited.

Worse is when folks falsely assume that said predictions will remain constant over time and make high-stakes decisions about teachers unaware of the instability present in, typically, 25-59% of teachers’ value-added (or, in this case, SGP) scores (estimates vary by study and by analyses using one to three years of data — see, for example, the studies detailed in Appendix A of this report; see also other research on this topic here, here, and here). Nonetheless, researchers in this study found that in mathematics, 50% of the variance in teachers’ value-added scores was attributable to differences among teachers, while the other 50% was random or unstable. In reading, 41% of the variance in teachers’ value-added scores was attributable to differences among teachers, while the other 59% was random or unstable.
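To make the “percent of variance” language concrete, here is a hedged simulation sketch (my construction, not the study’s code): each teacher’s yearly score is modeled as a stable teacher component plus equal-variance yearly noise, which reproduces the report’s roughly 50/50 split and shows why averaging three years of data helps but does not cure the problem. All numbers are illustrative, not the report’s data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_years = 370, 3
stable = rng.normal(0, 1.0, n_teachers)            # persistent teacher differences
noise = rng.normal(0, 1.0, (n_teachers, n_years))  # random/unstable year effects
scores = stable[:, None] + noise                   # observed yearly growth scores

within = scores.var(axis=1, ddof=1).mean()                        # unstable variance
between = np.var(scores.mean(axis=1), ddof=1) - within / n_years  # stable variance

print(round(between / (between + within), 2))            # 1-year reliability, ~0.50
print(round(between / (between + within / n_years), 2))  # 3-year average, ~0.75
```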

In addition, using a 95% confidence interval (which is very common in educational statistics), researchers found that in mathematics, a teacher’s true score would span 48 points, “a margin of error that covers nearly half the 100 point score scale,” whereby “one would be 95 percent confident that the true math score of a teacher who received a score of 50 [would actually fall] between 26 and 74.” For reading, a teacher’s true score would span 44 points, whereby one would be 95 percent confident that the true reading score of a teacher who received a score of 50 would actually fall between 28 and 72. The stability of these scores would increase with three years of data, which has also been found by other researchers on this topic. However, they too have found that such error rates persist to an extent that still prohibits high-stakes decision making.
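For readers who want the arithmetic, below is a minimal sketch assuming the standard measurement-error formula, SEM = SD × sqrt(1 − reliability), with a 95% interval of ±1.96 × SEM. The scale SD below is an assumption chosen so that a reliability near .50 reproduces the report’s roughly 48-point math span; it is not a figure taken from the report.

```python
import math

def ci95(score, scale_sd, reliability):
    """95% confidence interval around an observed score, given the score
    scale's SD and the score's reliability."""
    sem = scale_sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    half = 1.96 * sem
    return score - half, score + half

# Illustrative: reliability ~0.50 on a 100-point scale with an assumed SD of ~17.3
low, high = ci95(50, 17.3, 0.50)
print(round(low), round(high))  # ~26 and ~74: a ~48-point span
```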

In more practical terms, what this also means is that a teacher who might be considered highly ineffective might be terminated, even though the following year (s)he could have been observed to be highly effective. Inversely, teachers who are awarded tenure might be observed as ineffective one, two, and/or three years following, not because their true level(s) of effectiveness changed, but because of the error in the estimates that causes such instabilities to occur. Hence, examinations of the stability of such estimates over time provide essential evidence of the validity, and in this case predictive validity, of the interpretations and uses of such scores over time. This is particularly pertinent when high-stakes decisions are to be based on (or in large part on) such scores, especially given some researchers are calling for reliability coefficients of .85 or higher to make such decisions (Haertel, 2013; Wasserman & Bracken, 2003).
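To see why a .85 threshold matters in practice, one can compare the 95% band widths implied by different reliabilities, reusing the same illustrative SEM arithmetic and the same assumed scale SD as in the sketch above.

```python
import math

scale_sd = 17.3  # same illustrative scale SD as above, an assumption
for reliability in (0.50, 0.85):
    sem = scale_sd * math.sqrt(1.0 - reliability)
    width = 2 * 1.96 * sem  # full width of the 95% band
    print(reliability, round(width, 1))  # ~48 points vs. ~26 points
```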

In the end, the researchers’ overall conclusion is that SGP-derived “growth scores alone may not be sufficiently stable to support high-stakes decisions.” Likewise, relying on the extant research on this topic, the overall conclusion can be broadened: neither SGP- nor VAM-based growth scores may be sufficiently stable to support high-stakes decisions. In other words, it is not just the SGP model that is yielding such issues with stability (or a lack thereof). Again, see the other literature in which the researchers situated their findings in Appendix A. See also other similar studies here, here, and here.

Accordingly, those who read this report and consequently seek a better model that yields more stable estimates will, unfortunately, likely fail in their search.


Castellano, K. E., & Ho, A. D. (2013). A practitioner’s guide to growth models. Washington, DC: Council of Chief State School Officers.

Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

Lash, A., Makkonen, R., Tran, L., & Huang, M. (2016). Analysis of the stability of teacher-level growth scores [derived] from the student growth percentile [SGP] model (REL 2016–104). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory West.

Wasserman, J. D., & Bracken, B. A. (2003). Psychometric characteristics of assessment procedures. In I. B. Weiner, J. R. Graham, & J. A. Naglieri (Eds.), Handbook of psychology: Assessment psychology (pp. 43–66). Hoboken, NJ: John Wiley & Sons.

Special Issue of “Educational Researcher” (Paper #7 of 9): VAMs Situated in Appropriate Ecologies

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#7 of 9), which is actually a commentary titled “The Value in Value-Added Depends on the Ecology.” This commentary is authored by Henry Braun – Professor of Education and Public Policy, Educational Research, Measurement, and Evaluation at Boston College (also the author of a previous post on this site here).

In this article Braun, importantly, makes explicit the assumptions on which this special issue of ER is based; that is, the assumptions that (1) too many students in America’s public schools are being inadequately educated, (2) evaluation systems as they currently exist “require radical overhaul,” and (3) it is therefore essential to use student test performance, with low and high stakes attached, to improve that which educators do (or don’t do) to adequately address the first assumption. Braun also offers readers counterarguments to each of these assumptions (see p. 127), but more importantly he makes evident that the focus of this special issue is situated otherwise, in line with current education policies. This special issue, overall, then “raise[s] important questions regarding the potential for high-stakes, test-driven educator accountability systems to contribute to raising student achievement” (p. 127).

Given this context, the “value-added” provided within this special issue, again according to Braun, is that the authors of each of the five main research articles included report on how VAM output actually plays out in practice, giving “careful consideration to how the design and implementation of teacher evaluation systems could be modified to enhance the [purportedly, see comments above] positive impact of accountability and mitigate the negative consequences” at the same time (p. 127). In other words, if we more or less agree to the aforementioned assumptions, also given the educational policy context influencing, perpetuating, or actually forcing these assumptions, these articles should help others better understand VAMs’ and observational systems’ potentials and perils in practice.

At the same time, Braun encourages us to note that “[t]he general consensus is that a set of VAM scores does contain some useful information that meaningfully differentiates among teachers, especially in the tails of the distribution [although I would argue bias has a role here]. However, individual VAM scores do suffer from high variance and low year-to-year stability as well as an undetermined amount of bias [which may be greater in the tails of the distribution]. Consequently, if VAM scores are to be used for evaluation, they should not be given inordinate weight and certainly not treated as the “gold standard” to which all other indicators must be compared” (p. 128).

Likewise, it’s important to note that IF consequences are to be attached to said indicators of teacher evaluation (i.e., VAM and observational data), there should be validity evidence made available and transparent to warrant the inferences and decisions to be made, and the validity evidence “should strongly support a causal [emphasis added] argument” (p. 128). However, both indicators still face major “difficulties in establishing defensible causal linkage[s]” as theorized and desired (p. 128); hence, valid inference is impeded. What does not help, either, is when VAM scores are given precedence over other indicators, OR when principals align teachers’ observational scores with the same teachers’ VAM scores because of the precedence often given to (what are often viewed as the superior, more objective) VAM-based measures. This sometimes occurs given external pressures (e.g., applied by superintendents) to artificially conflate, in this case, levels of agreement between indicators (i.e., convergent validity).

Related, in the section Braun titles his “Trio of Tensions” (p. 129), he notes that (1) “[B]oth accountability and improvement are undermined, as attested to by a number of the articles in this issue. In the current political and economic climate, [if possible] it will take thoughtful and inspiring leadership at the state and district levels to create contexts in which an educator evaluation system constructively fulfills its roles with respect to both public accountability and school improvement” (pp. 129-130); (2) “[T]he chasm between the technical sophistication of the various VAM[s] and the ability of educators to appreciate what these models are attempting to accomplish…sow[s] further confusion…[hence]…there must be ongoing efforts to convey to various audiences the essential issues—even in the face of principled disagreements among experts on the appropriate role(s) for VAM[s] in educator evaluations” (p. 130); and finally (3) “[H]ow to balance the rights of students to an adequate education and the rights of teachers to fair evaluations and due process [especially for]…teachers who have value-added scores and those who teach in subject-grade combinations for which value-added scores are not feasible…[must be addressed; this] comparability issue…has not been addressed but [it] will likely [continue to] rear its [ugly] head” (p. 130).

In the end, Braun argues for another “Trio,” but this one including three final lessons: (1) “although the concerns regarding the technical properties of VAM scores are not misplaced, they are not necessarily central to their reputation among teachers and principals. [What is central is]…their links to tests of dubious quality, their opaqueness in an atmosphere marked by (mutual) distrust, and the apparent lack of actionable information that are largely responsible for their poor reception” (p. 130); (2) there is a “very substantial, multiyear effort required for proper implementation of a new evaluation system…[related, observational] ratings are not a panacea. They, too, suffer from technical deficiencies and are the object of concern among some teachers because of worries about bias” (p. 130); and (3) “legislators and policymakers should move toward a more ecological approach [emphasis added; see also the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here] to the design of accountability systems; that is, “one that takes into account the educational and political context for evaluation, the behavioral responses and other dynamics that are set in motion when a new regime of high-stakes accountability is instituted, and the long-term consequences of operating the system” (p. 130).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; and see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here.

Article #7 Reference: Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127-131. doi:10.3102/0013189X15576341

How Measurement Fails Doctors and Teachers: NY Times Op-ed

In case you missed it, click here for the full op-ed, “How Measurement Fails Doctors and Teachers,” published in The New York Times on Saturday.

It’s well worth the read, especially given the comparisons that the author – Robert M. Wachter, MD, Professor and Interim Chair of the Department of Medicine at the University of California, San Francisco – makes between medicine and education, in terms of how measurement systems have in many ways worked to hurt, not help improve, both professions.

Is Alabama the New, New Mexico?

In Alabama, the Grand Old Party (GOP) has put forth a draft bill, ultimately to be enacted as the Rewarding Advancement in Instruction and Student Excellence (RAISE) Act of 2016. The purpose of the act will be to…wait for it…use test scores to grade teachers and pay them annual bonuses (i.e., “supplements”) as per their performance. More specifically, the bill is to “provide a procedure for observing and evaluating teachers” to help make “significant differentiation[s] in pay, retention, promotion, dismissals, and other staffing decisions, including transfers, placements, and preferences in the event of reductions in force, [as] primarily [based] on evaluation results.” Related, Alabama districts may no longer use teachers’ “seniority, degrees, or credentials as a basis for determining pay or making the retention, promotion, dismissal, and staffing decisions.” Genius!

Accordingly, Larry Lee – whose blog is based on the foundation that “education is everyone’s business” – sent me this bill to review, critique, and help make everyone’s business. I attach it here for others who are interested, but I also summarize and critique its most relevant (but also contemptible) issues below.

Eligible Alabama teachers are (after a staggered period of time) to be primarily evaluated (i.e., for up to 45% of a teacher’s total evaluation score) on the extent to which they purportedly cause student growth in achievement, with student growth being defined as the teachers’ purported impacts on “[t]he change in achievement for an individual student between two or more points in time.” Teachers are also to be observed at least twice per year (i.e., for up to 45% of a teacher’s total evaluation score) by their appropriate and appropriately trained evaluators/supervisors, and an unnamed and undefined set of parent and student surveys is to be used to evaluate teachers (i.e., for up to 15% of a teacher’s total evaluation score).
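As a purely hypothetical sketch of the weighted composite the bill implies: the weights below are the bill’s stated maximums, the 0-100 indicator scores are invented, and note that the stated maximums sum to 105%, so any actual plan would have to scale at least one component down.

```python
# Hypothetical composite under the RAISE Act's stated maximum weights.
# Indicator scores (0-100) are invented for illustration.
weights = {"student_growth": 0.45, "observations": 0.45, "surveys": 0.15}
scores = {"student_growth": 62.0, "observations": 80.0, "surveys": 75.0}

# The maximums sum to 1.05, so normalize to keep the composite on a 0-100 scale.
total_weight = sum(weights.values())
composite = sum(weights[k] * scores[k] for k in weights) / total_weight
print(round(composite, 1))  # 71.6 under these invented numbers
```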

Again, no real surprises here as the adoption of such measures is common among states like Alabama (and New Mexico), but when these components are explained in more detail is where things really go awry.

“For grade levels and subjects for which student standardized assessment data is not available and for teachers for whom student standardized assessment data is not available, the [state’s] department [of education] shall establish a list of preapproved options for governing boards to utilize to measure student growth.” This is precisely what got the whole state of New Mexico wrapped up in, and currently losing, its ongoing lawsuit (see my most recent post on this here). While providing districts with menus of preapproved assessment options might make sense to policymakers, any self-respecting researcher or even assessment commoner should know why this is entirely inappropriate. To read more about why doing just this will set any state up for lawsuits, the best research study comes from Brown University’s John Papay, in his highly esteemed and highly cited article “Different tests, different answers: The stability of teacher value-added estimates across outcome measures.” The title of this research article alone should explain why simply positioning and offering up such tests in such casual (and quite careless) ways makes way for legal recourse.
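A toy simulation may help illustrate Papay’s core finding: when two tests capture overlapping but not identical content, the same teachers earn noticeably different value-added rankings depending on which test is chosen. Everything below is invented for illustration; it is not Papay’s model or data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
true_effect = rng.normal(0, 1, n)  # teachers' "true" effects (invented)

# Each test adds its own content-specific signal/noise of equal variance.
vam_test_a = true_effect + rng.normal(0, 1, n)
vam_test_b = true_effect + rng.normal(0, 1, n)

def ranks(x):
    return np.argsort(np.argsort(x))  # 0 = lowest-ranked teacher

rho = np.corrcoef(ranks(vam_test_a), ranks(vam_test_b))[0, 1]
print(round(rho, 2))  # ~0.5: many teachers change ranks across the two tests
```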

Otherwise, the only test mentioned that is also to be used to measure teachers’ purported impacts on student growth is the ACT Aspire – the ACT test corporation’s “college and career readiness” test that is aligned to and connected with its more familiar college-entrance ACT. This, too, was one of the sources of the aforementioned lawsuit in New Mexico, in terms of what we call content validity: states cannot simply pull in tests that are not adequately aligned with a state’s curriculum (e.g., I could find no information here about the alignment of the ACT Aspire to Alabama’s curriculum, which is also highly problematic, as this information should definitely be available) and that have not been validated for such purposes (i.e., to measure teachers’ impacts on student growth).

Regardless of the tests, however, all of the secondary measures to be used to evaluate Alabama teachers (e.g., student and parent survey scores, observational scores) are also to be “correlated with impacts on student achievement results.” We have increasingly seen this becoming the case across the nation as well: state/district leaders are not simply assessing whether these indicators are independently correlated, which they should be if they all, in fact, help to measure our construct of interest (i.e., teacher effectiveness); rather, state/district leaders are manufacturing and forcing these correlations via what I have termed “artificial conflation” strategies (see also a recent post here about how this is one of the fundamental and critical points of litigation in Houston).
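Below is a toy demonstration, of my own construction and not drawn from any district’s records, of how such forced alignment manufactures agreement that the raw data do not support.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
vam = rng.normal(size=n)               # VAM scores (invented)
obs = 0.3 * vam + rng.normal(size=n)   # observation scores, modestly related

print(round(np.corrcoef(vam, obs)[0, 1], 2))  # honest correlation, ~0.3

# "Aligned" ratings: each observation score is nudged toward the teacher's
# VAM score, as if under pressure from above.
aligned = 0.8 * vam + 0.2 * obs
print(round(np.corrcoef(vam, aligned)[0, 1], 2))  # manufactured, ~0.97
```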

The state is apparently also set on going “all in” on evaluating their principals in many of the same ways, although I did not critique those sections for this particular post.

Most importantly, though, for those of you who have access to such leaders in Alabama, do send them this post so they might be a bit more proactive, and appropriately more careful and cautious, before going down this poor educational policy path. While I do embrace my professional responsibility as a public scholar to be called to court to testify about all of this when such high-stakes consequences are ultimately, yet inappropriately, based upon invalid inferences, I’d much rather be proactive in this regard and save states and states’ taxpayers their time and money, respectively.

Accordingly, I see the state is also to put out a request for proposals to retain an external contractor to help it measure said student growth and teachers’ purported impacts on it. I would also be more than happy to help the state negotiate this contract, much more wisely than so many other states and districts have negotiated similar contracts thus far (e.g., without asking for reliability and validity evidence as a contractual deliverable)…should this poor educational policy actually come to fruition.

Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS

Recall from a prior post that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking their district to federal court to fight for their rights as professionals, rights they argue have been violated given how their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), have been used. The case, Houston Federation of Teachers, et al. v. Houston ISD, is to officially begin in court early this summer.

More specifically, the teachers are arguing that EVAAS output is inaccurate, that the EVAAS is unfair, that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach, that the EVAAS fails to control for student-level factors that impact how well teachers perform but that are outside of teachers’ control (e.g., parental effects), that the EVAAS is incomprehensible and hence very difficult if not impossible to actually use to improve upon their instruction (i.e., it is not actionable), and, accordingly, that teachers’ due process rights are being violated because teachers do not have adequate opportunities to change as a result of their EVAAS results.

The EVAAS is the one value-added model (VAM) on which I’ve conducted most of my research, also in this district (see, for example, here, here, here, and here); hence, Jesse Rothstein – Professor of Public Policy and Economics at the University of California, Berkeley, who also conducts extensive research on VAMs – and I are serving as the expert witnesses in this case.

What was recently released regarding this case is a summary of the contents of our affidavits, as interpreted by the authors of the attached “EVAAS Litigation Update,” in which the authors declare, with our and others’ research in support, that “Studies Declare EVAAS ‘Flawed, Invalid and Unreliable.’” Here are the twelve key highlights, as summarized by the authors of this report and re-summarized by me below:

  1. Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  2. The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, albeit sophisticated, VAM using the same HISD data, he found that the two models “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness from which valid interpretations and assertions can be made.
  3. EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  4. EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average” (see a toy illustration of this point in the sketch after this list).
  5. Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  6. The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] [more] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  7. Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects’ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mediated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved tests.
  8. There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  9. Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  10. HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from the district, even though it is paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’ (AERA Statement, 2015).”
  11. EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  12. Related, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc., which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.” Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students within HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”
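As a toy illustration of point 4 above (my own construction, not Rothstein’s analysis), understating the standard error of teachers’ estimates inflates the share of truly average teachers flagged as statistically distinguishable from average.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
true_se = 1.0
estimates = rng.normal(0.0, true_se, n)  # all teachers truly average

for reported_se in (1.0, 0.6):  # 0.6 = an understated (overconfident) SE
    flagged = np.mean(np.abs(estimates / reported_se) > 1.96)
    print(reported_se, round(100 * flagged, 1), "% flagged as distinguishable")
# With an honest SE, ~5% are flagged by chance; with the understated SE, ~24%.
```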

Deep Pockets, Corporate Reform, and Teacher Education

A colleague whom I have never formally met, but with whom I’ve had some interesting email exchanges over the past few months — James D. Kirylo, Professor of Teaching and Learning in Louisiana — recently sent me an email I read and appreciated; hence, I asked him to turn it into a blog post. He responded with a guest post he has titled “Deep Pockets, Corporate Reform, and Teacher Education,” pasted below. Do give this a read, and a social media share, as this one is deserving of some legs.

Here is what he wrote:

Money is power. Money is influence. Money shapes direction. Notwithstanding its influential nature in the electoral process, one only needs to see how bags of dough from the mega-rich one-percenters—largely led by Bill Gates—have bought their way into K-12 education in the attempt to corporatize it (see, for example, here).

This corporatization works to defund public education, grossly blames teachers for all that ails society, is obsessed with testing, and aims to privatize.  And next on the corporatized docket: teacher education programs.

In a recent piece by Valerie Strauss, “Gates Foundation Puts Millions of Dollars into New Education Focus: Teacher Preparation,” she sketches how Gates is awarding $35 million to a three-year project called Teacher Preparation Transformation Centers, funneled through five different projects, one of which is the Texas Tech-based University-School Partnerships for the Renewal of Educator Preparation (U.S. Prep) National Center.

A framework that will guide this “renewal” of educator preparation comes from the National Institute for Excellence in Teaching (NIET), along with the peddling of its programs, The System for Teacher and Student Advancement (TAP) and the Student and Best Practices Center (BPC). Yet again coming from another guy with oodles of money, leading the charge at NIET is Lowell Milken, its chairman and TAP’s founder (see, for example, here).

The state of Louisiana serves as an example of how NIET is already working overtime in chipping its way into K-12 education. One can spend hours at the Louisiana Department of Education (LDE) website and view the various links on how TAP is applying a full-court press in hyping its brand (see, for example, here).

And now that TAP has entered the K-12 door in Louisiana, the brand is now squiggling its way into teacher education preparation programs, namely through the Texas Tech-based U.S. Prep National Center. This Gates Foundation-backed project involves five teacher education programs in the country (Southern Methodist University, the University of Houston, Jackson State University, and the University of Memphis), including one in Louisiana (Southeastern Louisiana University) (see more information about this here).

Therefore, teacher educators must be “trained” to use TAP in order to “rightly” inculcate the prescription to teacher candidates.

TAP: Four Elements of Success

TAP principally plugs four Elements of Success: Multiple Career Paths (for educators as career, mentor, and master teachers); Ongoing Applied Professional Growth (through weekly cluster meetings, follow-up support in the classroom, and coaching); Instructionally Focused Accountability (through multiple classroom observations and evaluations utilizing a research-based instrument and rubric that identifies effective teaching practices); and Performance-Based Compensation (based on multiple measures of performance, including student achievement gains and teachers’ instructional practices).

And according to the TAP literature, the elements of success “…were developed based upon scientific research, as well as best practices from the fields of education, business, and management” (see, for example, here). Recall, perhaps, that No Child Left Behind (NCLB) was also based on “scientific-based” research. Enough said. It is also interesting to note their use of the words “business” and “management” when referring to educating our children. Regardless, “The ultimate goal of TAP is to raise student achievement” so students will presumably be better equipped to compete in the global society (see, for example, here). 

While each element is worthy of discussion, a brief comment is in order on the first element, Multiple Career Paths, and the fourth element, Performance-Based Compensation. Regarding the former, TAP has created a mini-hierarchy within already-hierarchical school systems (which most are) by identifying three potential sets of teachers, to reiterate from the above: a “career” teacher, a “mentor” teacher, and a “master” teacher. A “career” teacher as opposed to what? As opposed to a “temporary” teacher, a Teach For America (TFA) teacher, a substitute teacher? But, of course, according to TAP, as opposed to a “mentor” teacher and a “master” teacher.

This certainly raises the question: Why in the world would any parent want their child to be taught by a “career” teacher as opposed to a “mentor” teacher, or better yet a “master” teacher? Wouldn’t we want “master” teachers in all our classrooms? To analogize, I would rather have a “master” doctor performing heart surgery on me than a “lowly” career doctor. Indeed, words, language, and concepts matter.

With respect to the latter, the notion of having an ultimate goal of raising student achievement is perhaps little more than a euphemism for raising test scores, one that cultivates a test-centric way of doing things.

Achievement and VAM

That is, instead of focusing on learning, opportunity, developmentally appropriate practices, and falling in love with learning, “achievement” is the goal of TAP. Make no mistake, this is far from an argument over semantics. And this “achievement,” linked through student growth to merit pay, relies heavily on a VAM-aligned rubric.

Yet there are multiple problems with VAM, an instrument that has been used in K-12 education since 2011. Among many other outstanding sources, one may simply want to check out this cleverly named blog here, “VAMboozled,” or see what Diane Ravitch has said about VAMs (among other places, see, for example, here), not to mention the well-visited site produced by Mercedes Schneider here. Finally, see the 2015 position statement issued by the American Educational Research Association (AERA) regarding VAMs here, as well as a similar statement issued by the American Statistical Association (ASA) here.

Back to the Gates Foundation and the Texas Tech-based U.S. Prep National Center, though. To restate, at the aforementioned university in Louisiana (though likely at the other four recruited institutions, as well), TAP will be the chief vehicle that will drive this process, and teacher education programs will be used as the host to prop up the brand.

With presumably some very smart, well-educated, talented, and experienced professionals at the respective teacher education sites, how is it possible that they capitulated to serve as samples in the petri dish that will only work to enculturate the continuation of corporate reform, which will predictably lead to what Hofstra University professor Alan Singer calls the “McDonaldization of Teacher Education”?

Strauss puts the question this way, “How many times do educators need to attempt to reinvent the wheel just because someone with deep pockets wants to try when the money could almost certainly be more usefully spent somewhere else?” I ask this same question, in this case, here.

Special Issue of “Educational Researcher” (Paper #6 of 9): VAMs as Tools for “Egg-Crate” Schools

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#6 of 9), which is actually an essay here, titled “Will VAMS Reinforce the Walls of the Egg-Crate School?” This essay is authored by Susan Moore Johnson – Professor of Education at Harvard and somebody whom, in the past, I had the privilege of interviewing as an esteemed member of the National Academy of Education (see interviews here and here).

In this article, Moore Johnson argues that when policymakers use VAMs to evaluate, reward, or dismiss teachers, they may be perpetuating an egg-crate model, which is (referencing Tyack (1974) and Lortie (1975)) a metaphor for the compartmentalized school structure in which teachers (and students) work, most often in isolation. This model ultimately undermines the efforts of all involved in the work of schools to build capacity schoolwide, and to excel as a school given educators’ individual and collective efforts.

Contrary to the primary logic supporting VAM use, however, “teachers are not inherently effective or ineffective” on their own. Rather, their collective effectiveness is related to their professional development, which may be stunted when they work alone, “without the benefit of ongoing collegial influence” (p. 119). VAMs, then, and unfortunately, can cause teachers and administrators to (hyper)focus “on identifying, assigning, and rewarding or penalizing individual [emphasis added] teachers for their effectiveness in raising students’ test scores [which] depends primarily on the strengths of individual teachers” (p. 119). What comes along with this, then, is a series of interrelated egg-crate behaviors including, but not limited to, increased competition, lack of collaboration, increased independence versus interdependence, and the like, all of which can lead to decreased morale and, in effect, decreased effectiveness.

Inversely, students are much “better served when human resources are deliberately organized to draw on the strengths of all teachers on behalf of all students, rather than having students subjected to the luck of the draw in their classroom assignment[s]” (p. 119). Likewise, “changing the context in which teachers work could have important benefits for students throughout the school, whereas changing individual teachers without changing the context [as per VAMs] might not [work nearly as well] (Lohr, 2012)” (p. 120). Teachers learning from their peers, working in teams, teaching in teams, co-planning, collaborating, learning via mentoring by more experienced teachers, learning by mentoring, and the like should be much more valued, as warranted via the research, yet they are not valued given the very nature of VAM use.

Hence, there are also unintended consequences that can also come along with the (hyper)use of individual-level VAMs. These include, but are not limited to: (1) Teachers who are more likely to “literally or figuratively ‘close their classroom door’ and revert to working alone…[This]…affect[s] current collaboration and shared responsibility for school improvement, thus reinforcing the walls of the egg-crate school” (p. 120); (2) Due to bias, or that teachers might be unfairly evaluated given the types of students non-randomly assigned into their classrooms, teachers might avoid teaching high-needs students if teachers perceive themselves to be “at greater risk” of teaching students they cannot grow; (3) This can perpetuate isolative behaviors, as well as behaviors that encourage teachers to protect themselves first, and above all else; (4) “Therefore, heavy reliance on VAMS may lead effective teachers in high-need subjects and schools to seek safer assignments, where they can avoid the risk of low VAMS scores[; (5) M]eanwhile, some of the most challenging teaching assignments would remain difficult to fill and likely be subject to repeated turnover, bringing steep costs for students” (p. 120); While (6) “using VAMS to determine a substantial part of the teacher’s evaluation or pay [also] threatens to sidetrack the teachers’ collaboration and redirect the effective teacher’s attention to the students on his or her roster” (p. 120-121) versus students, for example, on other teachers’ rosters who might also benefit from other teachers’ content area or other expertise. Likewise (7) “Using VAMS to make high-stakes decisions about teachers also may have the unintended effect of driving skillful and committed teachers away from the schools that need them most and, in the extreme, causing them to leave the profession” in the end (p. 121).

I should add, though, and in all fairness given the Review of Paper #3 – on VAMs’ potentials here, many of these aforementioned assertions are somewhat hypothetical in the sense that they are based on the grander literature surrounding teachers’ working conditions, versus the direct, unintended effects of VAMs, given no research yet exists to examine the above, or other unintended effects, empirically. “There is as yet no evidence that the intensified use of VAMS interferes with collaborative, reciprocal work among teachers and principals or sets back efforts to move beyond the traditional egg-crate structure. However, the fact that we lack evidence about the organizational consequences of using VAMS does not mean that such consequences do not exist” (p. 123).

The bottom line is that we do not want to prevent the school organization from becoming “greater than the sum of its parts…[so that]…the social capital that transforms human capital through collegial activities in schools [might increase] the school’s overall instructional capacity and, arguably, its success” (p. 118). Hence, as Moore Johnson argues, we must adjust the focus “from the individual back to the organization, from the teacher to the school” (p. 118), and from the egg-crate back to a much more holistic and realistic model capturing what it means to be an effective school, and what it means to be an effective teacher as an educational professional within one. “[A] school would do better to invest in promoting collaboration, learning, and professional accountability among teachers and administrators than to rely on VAMS scores in an effort to reward or penalize a relatively small number of teachers” (p. 122).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; and see the Review of Article #5 – on teachers’ perceptions of observations and student growth here.

Article #6 Reference: Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117-126. doi:10.3102/0013189X15573351