One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Unfortunately, though not surprisingly, this article was written decades ago (in 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, as they still are now, trying to hold teachers accountable for their students’ learning as measured by states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” Yet “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot, and likely never will, be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or out, either.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Then, as now, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

Houston Lawsuit Update, with Summary of Expert Witnesses’ Findings about the EVAAS

Recall from a prior post that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT), are taking their district to federal court to fight for their rights as professionals, which their value-added scores, derived via the Education Value-Added Assessment System (EVAAS), have allegedly violated. The case, Houston Federation of Teachers, et al. v. Houston ISD, is to officially begin in court early this summer.

More specifically, the teachers are arguing that EVAAS output are inaccurate, that the EVAAS is unfair, that teachers are being evaluated via the EVAAS using tests that do not match the curriculum they are to teach, that the EVAAS system fails to control for student-level factors that impact how well teachers perform but that are outside of teachers’ control (e.g., parental effects), that the EVAAS is incomprehensible and hence very difficult if not impossible to actually use to improve upon their instruction (i.e., it is not actionable), and, accordingly, that teachers’ due process rights are being violated because teachers do not have adequate opportunities to change as a result of their EVAAS results.

The EVAAS is the one value-added model (VAM) on which I’ve conducted most of my research, also in this district (see, for example, here, here, here, and here); hence, Jesse Rothstein (Professor of Public Policy and Economics at the University of California, Berkeley, who also conducts extensive research on VAMs) and I are serving as the expert witnesses in this case.

What was recently released regarding this case is a summary of the contents of our affidavits, as interpreted by the authors of the attached “EVAAS Litigation Update,” in which the authors declare, with our and others’ research in support, that “Studies Declare EVAAS ‘Flawed, Invalid and Unreliable.’” Here are the twelve key highlights, again, as summarized by the authors of this report and re-summarized, by me, below:

  1. Large-scale standardized tests have never been validated for their current uses. In other words, as per my affidavit, “VAM-based information is based upon large-scale achievement tests that have been developed to assess levels of student achievement, but not levels of growth in student achievement over time, and not levels of growth in student achievement over time that can be attributed back to students’ teachers, to capture the teachers’ [purportedly] causal effects on growth in student achievement over time.”
  2. The EVAAS produces different results from another VAM. When, for this case, Rothstein constructed and ran an alternative, albeit sophisticated VAM using data from HISD both times, he found that results “yielded quite different rankings and scores.” This should not happen if these models are indeed yielding indicators of truth, or true levels of teacher effectiveness from which valid interpretations and assertions can be made.
  3. EVAAS scores are highly volatile from one year to the next. Rothstein, when running the actual data, found that while “[a]ll VAMs are volatile…EVAAS growth indexes and effectiveness categorizations are particularly volatile due to the EVAAS model’s failure to adequately account for unaccounted-for variation in classroom achievement.” In addition, volatility is “particularly high in grades 3 and 4, where students have relatively few[er] prior [test] scores available at the time at which the EVAAS scores are first computed.”
  4. EVAAS overstates the precision of teachers’ estimated impacts on growth. As per Rothstein, “This leads EVAAS to too often indicate that teachers are statistically distinguishable from the average…when a correct calculation would indicate that these teachers are not statistically distinguishable from the average.”
  5. Teachers of English Language Learners (ELLs) and “highly mobile” students are substantially less likely to demonstrate added value, as per the EVAAS, and likely most/all other VAMs. This, what we term as “bias,” makes it “impossible to know whether this is because ELL teachers [and teachers of highly mobile students] are, in fact, less effective than non-ELL teachers [and teachers of less mobile students] in HISD, or whether it is because the EVAAS VAM is biased against ELL [and these other] teachers.”
  6. The number of students each teacher teaches (i.e., class size) also biases teachers’ value-added scores. As per Rothstein, “teachers with few linked students—either because they teach small classes or because many of the students in their classes cannot be used for EVAAS calculations—are overwhelmingly [emphasis added] likely to be assigned to the middle effectiveness category under EVAAS (labeled “no detectable difference [from average], and average effectiveness”) than are teachers with more linked students.”
  7. Ceiling effects are certainly an issue. Rothstein found that in some grades and subjects, “teachers whose students have unusually high prior year scores are very unlikely to earn high EVAAS scores, suggesting that ‘ceiling effects’ in the tests are certainly relevant factors.” While EVAAS and HISD have previously acknowledged such problems with ceiling effects, they apparently believe these effects are being mitigated with the new and improved tests recently adopted throughout the state of Texas. Rothstein, however, found that these effects persist even given the new and improved tests.
  8. There are major validity issues with “artificial conflation.” This is a term I recently coined to represent what is happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators, for example, to align their observational ratings of teachers’ effectiveness with value-added scores, with the latter being the “objective measure” around which all else should revolve, or align; hence, the conflation of the one to match the other, even if entirely invalid. As per my affidavit, “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s Standards for Educational and Psychological Testing” (American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014).
  9. Teaching-to-the-test is of perpetual concern. Both Rothstein and I, independently, noted concerns about how “VAM ratings reward teachers who teach to the end-of-year test [more than] equally effective teachers who focus their efforts on other forms of learning that may be more important.”
  10. HISD is not adequately monitoring the EVAAS system. According to HISD, EVAAS modelers keep the details of their model secret, even from them and even though they are paying an estimated $500K per year for district teachers’ EVAAS estimates. “During litigation, HISD has admitted that it has not performed or paid any contractor to perform any type of verification, analysis, or audit of the EVAAS scores. This violates the technical standards for use of VAM that AERA specifies, which provide that if a school district like HISD is going to use VAM, it is responsible for ‘conducting the ongoing evaluation of both intended and unintended consequences’ and that ‘monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use’” (AERA Statement, 2015).
  11. EVAAS lacks transparency. AERA emphasizes the importance of transparency with respect to VAM uses. For example, as per the AERA Council who wrote the aforementioned AERA Statement, “when performance levels are established for the purpose of evaluative decisions, the methods used, as well as the classification accuracy, should be documented and reported” (AERA Statement, 2015). However, and in contrast to meeting AERA’s requirements for transparency, in this district and elsewhere, as per my affidavit, the “EVAAS is still more popularly recognized as the ‘black box’ value-added system.”
  12. Relatedly, teachers lack opportunities to verify their own scores. This part is really interesting. “As part of this litigation, and under a very strict protective order that was negotiated over many months with SAS [i.e., SAS Institute Inc. which markets and delivers its EVAAS system], Dr. Rothstein was allowed to view SAS’ computer program code on a laptop computer in the SAS lawyer’s office in San Francisco, something that certainly no HISD teacher has ever been allowed to do. Even with the access provided to Dr. Rothstein, and even with his expertise and knowledge of value-added modeling, [however] he was still not able to reproduce the EVAAS calculations so that they could be verified.” Dr. Rothstein added, “[t]he complexity and interdependency of EVAAS also presents a barrier to understanding how a teacher’s data translated into her EVAAS score. Each teacher’s EVAAS calculation depends not only on her students, but also on all other students within HISD (and, in some grades and years, on all other students in the state), and is computed using a complex series of programs that are the proprietary business secrets of SAS Incorporated. As part of my efforts to assess the validity of EVAAS as a measure of teacher effectiveness, I attempted to reproduce EVAAS calculations. I was unable to reproduce EVAAS, however, as the information provided by HISD about the EVAAS model was far from sufficient.”
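
As a rough illustration of highlight 6 above, here is a minimal Python sketch of why a teacher with few linked students tends to land in the middle category. This is not the EVAAS model, whose code is proprietary; it is a simplified stand-in. It assumes that a teacher's estimate is an average gain across linked students, that the standard error of that average shrinks with the square root of the number of students, and that a teacher is labeled above or below average only when the estimate divided by its standard error passes a cutoff of 2. The function name, the cutoff, and all numbers are hypothetical.

```python
import math


def classify(estimated_gain, student_sd, n_students, threshold=2.0):
    """Label a teacher from a simple gain index (hypothetical rule, not EVAAS).

    estimated_gain: average gain of the teacher's linked students (test points)
    student_sd:     assumed standard deviation of individual student gains
    n_students:     number of students linked to the teacher
    threshold:      assumed cutoff on the index (estimate / standard error)
    """
    # Standard error of a mean: fewer students means a larger standard error.
    standard_error = student_sd / math.sqrt(n_students)
    index = estimated_gain / standard_error
    if index >= threshold:
        return "above average"
    if index <= -threshold:
        return "below average"
    return "no detectable difference"


# The same estimated gain of 3 points, with different numbers of linked students.
for n in (10, 25, 150):
    print(n, classify(3.0, 15.0, n))
```

Under these assumptions, the same 3-point gain is labeled "no detectable difference" with 10 or 25 linked students and "above average" with 150. The same arithmetic relates to highlight 4: if the standard error fed into such an index is understated, the index is inflated, and more teachers appear distinguishable from average than a correct calculation would support.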

Special Issue of “Educational Researcher” (Paper #6 of 9): VAMs as Tools for “Egg-Crate” Schools

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#6 of 9), which is actually an essay, here, titled “Will VAMS Reinforce the Walls of the Egg-Crate School?” This essay is authored by Susan Moore Johnson, Professor of Education at Harvard and somebody whom I had the privilege of interviewing in the past as an esteemed member of the National Academy of Education (see interviews here and here).

In this article, Moore Johnson argues that when policymakers use VAMs to evaluate, reward, or dismiss teachers, they may be perpetuating an egg-crate model, which is (referencing Tyack (1974) and Lortie (1975)) a metaphor for the compartmentalized school structure in which teachers (and students) work, most often in isolation. This model ultimately undermines the efforts of all involved in the work of schools to build capacity school wide, and to excel as a school given educators’ individual and collective efforts.

Contrary to the primary logic supporting VAM use, however, “teachers are not inherently effective or ineffective” on their own. Rather, their collective effectiveness is related to their professional development that may be stunted when they work alone, “without the benefit of ongoing collegial influence” (p. 119). VAMs then, and unfortunately, can cause teachers and administrators to (hyper)focus “on identifying, assigning, and rewarding or penalizing individual [emphasis added] teachers for their effectiveness in raising students’ test scores [which] depends primarily on the strengths of individual teachers” (p. 119). What comes along with this, then, is a series of interrelated egg-crate behaviors including, but not limited to, increased competition, lack of collaboration, increased independence versus interdependence, and the like, all of which can lead to decreased morale and decreased effectiveness.

Conversely, students are much “better served when human resources are deliberately organized to draw on the strengths of all teachers on behalf of all students, rather than having students subjected to the luck of the draw in their classroom assignment[s]” (p. 119). Likewise, “changing the context in which teachers work could have important benefits for students throughout the school, whereas changing individual teachers without changing the context [as per VAMs] might not [work nearly as well] (Lohr, 2012)” (p. 120). Teachers learning from their peers, working in teams, teaching in teams, co-planning, collaborating, learning via mentoring by more experienced teachers, learning by mentoring, and the like should be much more valued, as warranted via the research, yet they are not valued given the very nature of VAM use.

Hence, there are unintended consequences that can also come along with the (hyper)use of individual-level VAMs. These include, but are not limited to: (1) Teachers who are more likely to “literally or figuratively ‘close their classroom door’ and revert to working alone…[This]…affect[s] current collaboration and shared responsibility for school improvement, thus reinforcing the walls of the egg-crate school” (p. 120); (2) Due to bias, that is, the possibility that teachers might be unfairly evaluated given the types of students non-randomly assigned into their classrooms, teachers might avoid teaching high-needs students if teachers perceive themselves to be “at greater risk” of teaching students they cannot grow; (3) This can perpetuate isolative behaviors, as well as behaviors that encourage teachers to protect themselves first, and above all else; (4) “Therefore, heavy reliance on VAMS may lead effective teachers in high-need subjects and schools to seek safer assignments, where they can avoid the risk of low VAMS scores[; (5) M]eanwhile, some of the most challenging teaching assignments would remain difficult to fill and likely be subject to repeated turnover, bringing steep costs for students” (p. 120); while (6) “using VAMS to determine a substantial part of the teacher’s evaluation or pay [also] threatens to sidetrack the teachers’ collaboration and redirect the effective teacher’s attention to the students on his or her roster” (pp. 120-121) versus students, for example, on other teachers’ rosters who might also benefit from other teachers’ content area or other expertise. Likewise, (7) “Using VAMS to make high-stakes decisions about teachers also may have the unintended effect of driving skillful and committed teachers away from the schools that need them most and, in the extreme, causing them to leave the profession” in the end (p. 121).

I should add, though, and in all fairness given the Review of Paper #3 – on VAMs’ potentials here, many of these aforementioned assertions are somewhat hypothetical in the sense that they are based on the grander literature surrounding teachers’ working conditions, versus the direct, unintended effects of VAMs, given no research yet exists to examine the above, or other unintended effects, empirically. “There is as yet no evidence that the intensified use of VAMS interferes with collaborative, reciprocal work among teachers and principals or sets back efforts to move beyond the traditional egg-crate structure. However, the fact that we lack evidence about the organizational consequences of using VAMS does not mean that such consequences do not exist” (p. 123).

The bottom line is that we do not want to prevent the school organization from becoming “greater than the sum of its parts…[so that]…the social capital that transforms human capital through collegial activities in schools [might increase] the school’s overall instructional capacity and, arguably, its success” (p. 118). Hence, as Moore Johnson argues, we must adjust the focus “from the individual back to the organization, from the teacher to the school” (p. 118), and from the egg-crate back to a much more holistic and realistic model capturing what it means to be an effective school, and what it means to be an effective teacher as an educational professional within one. “[A] school would do better to invest in promoting collaboration, learning, and professional accountability among teachers and administrators than to rely on VAMS scores in an effort to reward or penalize a relatively small number of teachers” (p. 122).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; and see the Review of Article #5 – on teachers’ perceptions of observations and student growth here.

Article #6 Reference: Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117-126. doi:10.3102/0013189X15573351

Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#4 of 9) here, titled “Make Room Value-Added: Principals’ Human Capital Decisions and the Emergence of Teacher Observation Data.” This one is authored by Ellen Goldring, Jason A. Grissom, Christine Neumerski, Marisa Cannata, Mollie Rubin, Timothy Drake, and Patrick Schuermann, all of whom are associated with Vanderbilt University.

This article is primarily about (1) the extent to which the data generated by “high-quality observation systems” can inform principals’ human capital decisions (e.g., teacher hiring, contract renewal, assignment to classrooms, professional development), and (2) the extent to which principals are relying less on test scores derived via value-added models (VAMs), when making the same decisions, and why. Here are some of their key (and most important, in my opinion) findings:

  • Principals across all school systems revealed major hesitations and challenges regarding the use of VAM output for human capital decisions. Barriers preventing VAM use included the timing of data availability (e.g., the fall), which is well after human capital decisions are made (p. 99).
  • VAM output are too far removed from the practice of teaching (p. 99), and this lack of instructional sensitivity impedes, if not entirely prevents their actual versus hypothetical use for school/teacher improvement.
  • “Principals noted they did not really understand how value-added scores were calculated, and therefore they were not completely comfortable using them” (p. 99). Likewise, principals reported that because teachers did not understand how the systems worked either, teachers did not use VAM output data (p. 100).
  • VAM output are not transparent when used to determine compensation, and especially when used to evaluate teachers teaching nontested subject areas. In districts that use school-wide VAM output to evaluate teachers in nontested subject areas, in fact, principals reported regularly ignoring VAM output altogether (pp. 99-100).
  • “Principals reported that they perceived observations to be more valid than value-added measures” (p. 100); hence, principals reported using observational output much more, again, in terms of human capital decisions and making such decisions “valid” (p. 100).
  • “One noted exception to the use of value-added scores seemed to be in the area of assigning teachers to particular grades, subjects, and classes. Many principals mentioned they use value-added measures to place teachers in tested subjects and with students in grade levels that ‘count’ for accountability purpose…some principals [also used] VAM [output] to move ineffective teachers to untested grades, such as K-2 in elementary schools and 12th grade in high schools” (p. 100).

Of special note here is also the following finding: “In half of the systems [in which researchers investigated these systems], there [was] a strong and clear expectation that there be alignment between a teacher’s value-added growth score and observation ratings…Sometimes this was a state directive and other times it was district-based. In some systems, this alignment is part of the principal’s own evaluation; principals receive reports that show their alignment” (p. 101). In other words, principals are being evaluated and held accountable given the extent to which their observations of their teachers match their teachers’ VAM-based data. If misalignment is noticed, it is not to be the fault of either measure (e.g., in terms of measurement error); it is to be the fault of the principal, who is critiqued for inaccuracy and therefore (inversely) incentivized to skew their observational data (the only data over which the supervisor has control) to artificially match VAM-based output. This clearly distorts validity, or rather the validity of the inferences that are to be made using such data. Appropriately, principals also “felt uncomfortable [with this] because they were not sure if their observation scores should align primarily…with the VAM” output (p. 101).

“In sum, the use of observation data is important to principals for a number of reasons: It provides a ‘bigger picture’ of the teacher’s performance, it can inform individualized and large group professional development, and it forms the basis of individualized support for remediation plans that serve as the documentation for dismissal cases. It helps principals provide specific and ongoing feedback to teachers. In some districts, it is beginning to shape the approach to teacher hiring as well” (p. 102).

The only significant weakness, again in my opinion, with this piece is that the authors write that these observational data, at focus in this study, are “new,” thanks to recent federal initiatives. They write, for example, that “data from structured teacher observations—both quantitative and qualitative—constitute a new [emphasis added] source of information principals and school systems can utilize in decision making” (p. 96). They are also “beginning to emerge [emphasis added] in the districts…as powerful engines for principal data use” (p. 97). I would beg to differ, as these systems have not changed much over time, pre and post these federal initiatives, as (without evidence or warrant) claimed by these authors herein. See, for example, Table 1 on p. 98 of the article to see if what they have included within the list of components of such new and “complex, elaborate teacher observation systems” is actually new or much different than most of the observational systems in use prior. As an aside, one such system in use and of issue in this examination is one with which I am familiar, in use in the Houston Independent School District. Click here to also see if this system is also more “complex” or “elaborate” over and above such systems prior.

Also recall that one of the key reports that triggered the current call for VAMs, as the “more objective” measures needed to measure and therefore improve teacher effectiveness, was based on data that suggested that “too many teachers” were being rated as satisfactory or above. The observational systems in use then are essentially the same observational systems still in use today (see “The Widget Effect” report here). This is in stark contradiction to authors’ claims throughout this piece, for example, when they write “Structured teacher observations, as integral components of teacher evaluations, are poised to be a very powerful lever for changing principal leadership and the influence of principals on schools, teachers, and learning.” This counters all that is and all that came from “The Widget Effect” report here.


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; and see the Review of Article #3 – on VAMs’ potentials here.

Article #4 Reference: Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. doi:10.3102/0013189X15575031

“Value-Less” Value-Added Data

Peter Greene, a veteran teacher of English in Pennsylvania, a state using its own version of the Education Value-Added Assessment System (EVAAS), wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread claims being made, and most lacking research support, about using the data derived via value-added models (VAMs) to inform and improve what teachers do in their classrooms.

Greene sententiously critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on– well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist– you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no — all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line– we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions along with instructional videos that seem to assume that the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) to explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”
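
For readers who want to see just how little the growth designation Greene describes conveys, here is a minimal sketch of the rule as he states it: a student “has shown growth” only if the actual score beats the predicted score. The function name, the student labels, and every score below are hypothetical, and how PVAAS actually produces the predicted score is not explained in his post, so this illustrates only the comparison itself.

```python
def showed_growth(actual_score, predicted_score):
    """Return True only if the actual score beats the predicted score,
    which is the rule as Greene describes it."""
    return actual_score > predicted_score


# Hypothetical students: (label, predicted score, actual score).
# These numbers are invented for illustration only.
students = [("A", 1500, 1512), ("B", 1480, 1480), ("C", 1530, 1499)]

for label, predicted, actual in students:
    gap = actual - predicted
    print(label, gap, showed_growth(actual, predicted))
```

All the rule yields is a single yes or no per student. It says nothing about which content the student struggled with, which is exactly Greene’s complaint.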

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video, and now, following up on another post about the potential impact these videos should really have, I thought this an appropriate time to do so. “We can be the change,” and social media can help.

My former doctoral student and I put together this video, after conducting a study with teachers in the Houston Independent School District and more specifically four teachers whose contracts were not renewed due in large part to their EVAAS scores in the summer of 2011. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools, post the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Mirror, Mirror on the Wall…

No surprise, again, but Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again (redundancy intended). I just posted about one of his recent articles published on the website of the Brookings Institution titled “Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?” after which I received another of his articles, this time published by the New York Daily News titled “Teachers Must Look in the Mirror.”

Embracing a fabled metaphor, while not to position teachers as the wicked queens or to position Kane as Snow White, let us ask ourselves the classic question: “Who is the fairest one of all?” as we critically review yet another fairytale authored by Harvard’s Kane. He has, after all, “carefully studied the best systems for rating teachers” (see other prior posts about Kane’s public perspectives on VAMs here and here).

In this piece, Kane continues to advance a series of phantasmal claims about the potentials of VAMs, this time in the state of New York where Governor Andrew Cuomo intends to take the state’s teacher evaluation system up to a system based 50% on teachers’ value-added, or 100% on value-added in cases where a teacher rated as “ineffective” in his/her value-added score can be rated as “ineffective” overall. Here, value-added could be used to trump all else (see prior posts about this here and here).

According to Kane, Governor Cuomo “picked the right fight.” The state’s new system “will finally give schools the tools they need to manage and improve teaching.” Perhaps the magic mirror would agree with such a statement, but research would evidence it vain.

As I have noted previously, there is absolutely no evidence, thus far, indicating that such systems have any (in)formative use or value. These data are first and foremost designed for summative, or summary, purposes; they are not designed for formative use. Accordingly, the data that come from such systems (besides the data that come from the observational components still being built into these systems, components that have existed and been used for decades) are not transparent, are difficult to understand, and are therefore challenging to use. Likewise, such data are not instructionally sensitive, and they are untimely in that test-based results typically come back to teachers well after their students have moved on to subsequent grade levels.

What about Kane’s claims against tenure? “The tenure process is the place to start. It’s the most important decision a principal makes. One poor decision can burden thousands of future students, parents, colleagues and supervisors.” This is quite an effect, considering that these new and improved teacher evaluation systems, as based (in this case largely) on VAMs, typically apply only to teachers at the elementary level who teach mathematics and reading/language arts. Even an elementary teacher with a career spanning 40 years with an average of 30 students per class would directly impact (or burden) 1,200 students, maximum. This is not to say this is inconsequential, but is it as consequential as Kane’s sensational numbers imply? What about the thousands of parents, colleagues, and supervisors also to be burdened by one poor decision? Fair and objective? This particular mirror thinks not.
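
For anyone who wants to check the 1,200 figure, or try other career lengths and class sizes, here is a back-of-the-envelope sketch. The function name is hypothetical, and the one-class-per-year default is my assumption about a typical self-contained elementary classroom, not something taken from Kane’s article.

```python
def students_reached(career_years, students_per_class, classes_per_year=1):
    """Rough upper bound on the students one teacher directly teaches.

    classes_per_year=1 is an assumption (one self-contained elementary
    class each year); it is not a figure from Kane's piece.
    """
    return career_years * students_per_class * classes_per_year


# The figures used in the paragraph above: 40 years, 30 students per class.
print(students_reached(40, 30))  # 1200
```

Even with generous inputs, the total for a single elementary teacher stays far below the “thousands” implied.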

Granted, I am not making any claims about tenure, as I think all would agree that sometimes tenure can support, keeping with the metaphor, bad apples. Rather, I take issue with the exaggerations, including also that “Traditionally, principals have used much too low a standard, promoting everyone but the very worst teachers.” We must all check our assumptions here about how we define “the very worst teachers” and how many of them really lurk in the shadows of America’s now not-so-enchanted forests. There is no evidence to support this claim, either, just conjecture.

As for the solution, “Under the new law, the length of time it will take to earn tenure will be lengthened from three to four years.” Yes, that arbitrary, one-year extension will certainly help… Likewise, tenure decisions will now be made better using classroom observations (the data that have, according to Kane in this piece, been used for years to make all of these aforementioned bad decisions) and our new fair and objective, test-based measures, which, not according to Kane, can only be used for about 30% of all teachers in America’s public schools. Nonetheless, “Student achievement gains [are to serve as] the bathroom scale, [and] classroom observations [are to serve] as the mirror.”

Kane continues, scripting, “Although the use of test scores has received all the attention, [one of] the most consequential change[s] in the law has been overlooked: One of a teacher’s observers must now be drawn from outside his or her school — someone whose only role is to comment on teaching.” Those from inside the school were only commenting on one’s beauty and fairness prior, I suppose, as “The fact that 96% of teachers were given the two highest ratings last year — being deemed either “effective” or “highly effective” — is a sure sign that principals have not been honest to date.”

All in all, perhaps somebody else should be taking a long hard “Look in the Mirror,” as this new law will likely do everything but “[open] the door to a renewed focus on instruction and excellence in teaching” despite the best efforts of “union leadership,” although I might add to Kane’s list many adorable little researchers who have also “carefully studied the best systems for rating teachers” and more or less agree on their intended and unintended results in…the end.

Help Florida Teacher Luke Flint “Tell His Story” about His VAM Scores

This is a great (although unfortunate) YouTube video capturing Indian River County, Florida teacher Luke Flint’s “Story” about the VAM scores he just received from the state as based on the state’s value-added formula.

This is a must watch, and a must share, as his “Story” has potential to “add value” in the best of ways, that is, in terms of further informing debates about how these VAMs actually “work” in practice.

New Mexico’s Teacher Evaluation Lawsuit: Four Teachers’ Individual Cases

Regarding a prior post about a recently filed “Lawsuit in New Mexico Challenging State’s Teacher Evaluation System,” filed by the American Federation of Teachers (AFT) and charging that the state’s current teacher evaluation system is unfair, error-ridden, harming teachers, and depriving students of high-quality educators (see the actual lawsuit here), the author of an article recently released in The Washington Post takes “A closer look at four New Mexico teachers’ evaluations.”

Emma Brown writes that the state believes this system supports the “aggressive changes” needed “to produce real change for students” and “these evaluations are an essential tool to support the teachers and students of New Mexico.” Teachers, on the other hand (and in general terms), believe that the new evaluations “are arbitrary and offer little guidance as to how to improve.”

Highlighted further in this piece, though, are four specific teachers’ evaluations taken from this state’s system along with each teacher’s explanations of the problems as they see them. The first, a veteran teacher with 36 years of “excellent evaluations,” scored ineffective for missing too much work, although she was approved for and put on a six-month leave after a serious injury caused by a fall. She took four of the six months, but her “teacher attendance” score dropped her to the bottom of the teacher rankings. She has since retired.

The second teacher, a 2nd-grade teacher and also a veteran with 27 years of experience, received 50% of her “teacher attendance” points, also given a family-related illness, but she also received 8 out of 50 “student achievement” points. She argues that her students, because most of them are well above average, had difficulties demonstrating growth. In other words, her argument rests on the real concern (and a very real concern in terms of the current research) that “ceiling effects” are/were preventing her students from growing upwards, enough, when compared to other “similar” students who are also to demonstrate “a full year’s worth of growth.” She is also retiring in a few months “in part because she is so frustrated with the evaluation system.”
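
To make the “ceiling effect” concern concrete, here is a toy sketch. It assumes a hypothetical test capped at 100 points and two made-up students who learn the same amount over the year; none of these numbers come from New Mexico’s system, and the function name is mine.

```python
MAX_SCORE = 100  # hypothetical test ceiling, not a New Mexico figure


def observed_growth(prior_score, true_gain, max_score=MAX_SCORE):
    """Growth the test can register when scores are capped at max_score.

    Assumes prior_score is at or below max_score.
    """
    observed_now = min(prior_score + true_gain, max_score)
    return observed_now - prior_score


# Two hypothetical students who both learn 15 points' worth of material.
print(observed_growth(prior_score=60, true_gain=15))  # 15
print(observed_growth(prior_score=95, true_gain=15))  # 5
```

The student who started near the top registers only a third of the growth of the other, even though both learned the same amount. A classroom full of such students would look like a low-growth classroom on paper.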

The third teacher, a middle-school teacher, scored 23 out of 70 “value-added” points, even though he switched from teaching language arts to teaching social studies at the middle-school level. This teacher apparently did not have the three years needed (not to mention in the same subject area) to calculate his “value-added,” nor does he have “any idea” where his score came from or how it was calculated. Accordingly, his score “doesn’t give him any information about how to get better,” which falls under the general issue that these scores are apparently offering teachers little guidance as to how to improve. This is an issue familiar across most if not all such models.

The fourth teacher, an alternative high school mathematics and science teacher of pregnant and parenting teens, many of whom have learning or emotional disabilities, received 24 of 70 “student achievement” points, which, she argues, are based on tests that are “unvetted and unreliable,” especially given the types of students she teaches. As per her claim: “There are things I am being evaluated on which I do not and cannot control…Each year my school graduates 30 to 60 students, each of whom is either employed, or enrolled in post-secondary training/education. This is the measure of our success, not test scores.”

This is certainly a state to watch, as the four New Mexico teachers highlighted in this article certainly have unique and important cases, all of which may be used to help set precedent in this state as well as others. Do stay tuned…

Teacher Evaluation and Accountability Alternatives, for A New Year

At the beginning of December I wrote a post about Diane Ravitch’s really nice piece published in the Huffington Post about what she views as a much better paradigm for teacher evaluation and accountability. Diane Ravitch posted another piece on similar alternatives, although this one was written by teachers themselves.

I thought this was more than appropriate, especially given a New Year is upon us, and while it might very well be wishful thinking, perhaps at least some of our state policy makers might be willing to think in new ways about what really could be new and improved teacher evaluation systems. Cheers to that!

The main point here, though, is that alternatives do, indeed, exist. Likewise, it’s not that teachers do not want to be held accountable for, and evaluated on, that which they do, but they do want whatever systems are in place (formal or informal) to be appropriate, professional, and fair. How about that for a policy-based resolution?

This is from Diane’s post: The Wisdom of Teachers: A New Vision of Accountability.

Anyone who criticizes the current regime of test-based accountability is inevitably asked: What would you replace it with? Test-based accountability fails because it is based on a lack of trust in professionals. It fails because it confuses measurement with instruction. No doctor ever said to a sick patient, “Go home, take your temperature hourly, and call me in a month.” Measurement is not a treatment or a cure. It is measurement. It doesn’t close gaps: it measures them.

Here is a sound alternative approach to accountability, written by a group of teachers whose collective experience is 275 years in the classroom. Over 900 teachers contributed ideas to the plan. It is a new vision that holds all actors responsible for the full development and education of children, acknowledging that every child is a unique individual.

Its key features:

  • Shared responsibility, not blame
  • Educate the whole child
  • Full and adequate funding for all schools, with less emphasis on standardized testing
  • Teacher autonomy and professionalism
  • A shift from evaluation to support
  • Recognition that in education one size does not fit all