The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft — an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here) — and one of his co-authors on that piece, Matthew P. Steinberg — an Assistant Professor of Education Policy at the University of Pennsylvania — just published another article in this same journal on “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and freely accessible, at least for now, article here; see also its original and what should be its enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matters, matters differently for different types of teachers within and across school districts and states, and matters also in that so often these weights are arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the consequences attached to these measures, given the arbitrary and political constructions they represent, can be career changing professionally and life changing personally. How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).
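To see why this happens mechanically, here is a minimal, hypothetical R sketch (an illustration only, not a reproduction of Steinberg and Kraft’s MET-based simulations): teachers’ normalized VAM scores center near the middle of the distribution, their observation scores cluster near the top of the rubric, and the same fixed “proficient” cutoff is applied to composites built with two different VAM weights.

    # Illustrative only (hypothetical values, not MET data): more weight on a
    # norm-referenced measure (VAM) lowers the share of teachers who clear a
    # fixed "proficient" cutoff, because that measure has the lower mean.
    set.seed(123)
    n <- 10000

    vam <- pnorm(rnorm(n))   # normalized VAM scores: percentile-like, centered near 0.5
    obs <- rbeta(n, 8, 2)    # observation scores: clustered near the top, mean near 0.8

    composite <- function(w_vam) w_vam * vam + (1 - w_vam) * obs

    cutoff <- 0.6                       # arbitrary "proficient" threshold
    mean(composite(0.25) >= cutoff)     # share rated proficient when VAM counts for 25%
    mean(composite(0.50) >= cutoff)     # share rated proficient when VAM counts for 50% (lower)

With these made-up numbers, the proficiency rate falls as the VAM weight rises simply because the norm-referenced measure has the lower mean; no teacher’s underlying practice changes.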

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Thereafter, some states are celebrated over others (e.g., by the National Council on Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously when, as evidenced herein, this is (unfortunately) due more to manipulation than to true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It’s not necessarily that they vary in reality; rather, the manipulation of the weights on the back end causes such variation and then leads to, quite literally, delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also raises a related and important issue of fairness, whereby equally effective teachers may be five-or-so times as likely (e.g., in states like New Mexico) to be rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).

Finally, it should also be noted, as an important caveat here, that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences” (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Relatedly, please recall that one of the MET researchers’ goals was to determine which weights for each of the multiple measures were empirically defensible. MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might say that this weight here or that weight there is empirically defensible should be asked for the evidence in support.

Citations:

Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Educational Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–396. doi:10.3102/0013189X17726752 Retrieved from http://journals.sagepub.com/doi/abs/10.3102/0013189X17726752

New Mexico’s “New, Bait and Switch” Schemes

“A Concerned New Mexico Parent” sent me another blog entry to help you all stay apprised of the ongoing “situation” in New Mexico with the New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here; in this one, (s)he responds to an editorial recently released in support of the newest version of New Mexico’s teacher evaluation system. The editorial, titled “Teacher evals have evolved but tired criticisms of them have not,” was published in the Albuquerque Journal and written by the Albuquerque Journal Editorial Board themselves.

(S)he writes:

The editorial seems to contain and promote many of the “talking points” provided by NMPED with their latest release of teacher evaluations. Hence, I would like to present a few observations on the editorial.

NMPED and the Albuquerque Journal Editorial Board both underscore the point that teachers are still primarily being (and should primarily continue to be) evaluated on the basis of their own students’ test scores (i.e., using a value-added model (VAM)), but it is actually not that simple. Rather, the new statewide teacher evaluation formula is shown here on their website, with one notable difference being that the state’s “new system” now replaces the previous district-wide variations that produced 217 scoring categories for teachers (see here for details).

Accordingly, it now appears that NMPED has kept the same 50% student achievement, 25% observations, and 25% multiple measures division as before. The “new” VAM, however, requires a minimum of three years of data for proper use. Without three years of data, NMPED is to use what it calls graduated considerations or “NMTEACH” steps to change the percentages used in the evaluation formulas by teacher type.

A small footnote on the NMTEACH website devoted to teacher evaluations explains these graduated considerations whereby “Each category is weighted according to the amount of student achievement data available for the teacher. Improved student achievement is worth from 0% to 50%; classroom observations are worth 25% to 50%; planning, preparation and professionalism is worth 15% to 40%; and surveys and/or teacher attendance is worth 10%.” In other words, student achievement represents between 0% and 50% of the total, classroom observations between 25% and 50%, planning, preparation, and professionalism between 15% and 40%, and surveys and/or teacher attendance 10%.

The graduated considerations (Steps) are shown below, as per their use when substitutions are needed when student achievement data are missing:

[Table: NMTEACH graduated considerations (Steps) and the weight substitutions they prescribe]

Also, the NMTEACH “Steps” provide for the use of one year of data (Step 2 is used for 1–2 years of data). I do not see how NMPED can calculate “student improvement” based on just one year’s worth of data.

Hence, this data substitution problem is likely massive. For example, for Category A teachers, 45 of the 58 formulas formerly used will require Step 1 substitutions. For Category B teachers, 112 of 117 prior formulas will require data substitution (Step 1), and all Category C teachers will require data substitution at the Step 1 level.

The reason that this presents a huge data problem is that the state’s prior teacher evaluation system did not require the use of so much end-of-course (EOC) data, and so the tests were not given for three years. Simultaneously, and for Group C teachers, NMPED also introduced a new evaluation assessment plus software called iStation that is also in its first year of use.

Thus, for a typical Category B teacher, the evaluation will be based on 50% observation, 40% planning, preparation, and professionalism, and 10% on attendance.
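To make that arithmetic concrete, here is a minimal sketch in R with hypothetical component scores; the 15%/10% split of the standard 25% “multiple measures” portion is an assumption that mirrors the graduated-consideration minimums quoted above.

    # Hypothetical component scores on a common 0-100 scale.
    scores <- c(achievement = 70, observation = 85,
                planning    = 80, attendance  = 95)

    # Standard NMTEACH split: 50% achievement, 25% observation, 25% "multiple
    # measures" (assumed here to be 15% planning/prep/professionalism + 10%
    # surveys and/or attendance, mirroring the footnote's minimums).
    standard   <- c(achievement = 0.50, observation = 0.25,
                    planning    = 0.15, attendance  = 0.10)

    # Category B substitution when no achievement data exist:
    # 50% observation, 40% planning/prep/professionalism, 10% attendance.
    category_b <- c(achievement = 0.00, observation = 0.50,
                    planning    = 0.40, attendance  = 0.10)

    sum(scores * standard)     # summative rating with student achievement data
    sum(scores * category_b)   # summative rating with no student achievement data at all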

Amazingly, none of this relates to student achievement, and it looks identical to the former administrator-based teacher evaluation system!

Such a “bait-and-switch” scheme will be occurring for most teachers in the state.

Further, in a small case-study I performed on a local New Mexico school (here), I found that not one single teacher in a seven-year period had “good” data for three consecutive years. This also has major implications here given the state’s notorious issues with their data, data management, and the like.

Notwithstanding, the Editorial Board also notes that “The evaluations consider only student improvement, not proficiency.” However, as noted above, little actual student achievement data is available for the strong majority of teachers’ evaluations; hence, the rate by which this will actually count and the rate by which it will merely appear to the public to count are two very distinctly different things.

Regardless, the Editorial Board thereafter proclaims that “The evaluations only rate teachers’ effect on their students over a school year…” Even the simple phrase “school year” is also problematic, however.

The easiest way to explain this is to imagine a student in a dual language program (a VERY common situation in New Mexico). Let’s follow his timeline of instruction and testing:

  • August 2015: The student begins the fourth grade with teachers A1 and A2.
  • March 2016: Seven months into the year the student is tested with test #1 at the 4th-grade level.
  • March 2016 – May 2016: The student finishes fourth grade with Teachers A1 and A2
  • June 2016 – Aug 2016: Summer vacation — no tests (i.e., differential summer learning and decay occurs)
  • August 2016: The student begins the fifth grade with teachers B1 and B2.
  • March 2017: Seven months into the year the student is tested with test #2 at the 5th-grade level.
  • March 2017 – May 2017: The student finishes fifth grade with Teachers B1 and B2
  • October 2017: A teacher receives a score based on this student’s improvement (along with other students like him, although coming from different A-level teachers) from test #1 to test #2.

To simplify, the test improvement is based on a test given before he has completed the grade level of interest with material taught by four teachers at two different grade levels over the span of one calendar year [this is something that is known in the literature as prior teachers’ residual effects].

And it gets worse. The NMPED requires that a student be assigned to only one teacher. According to the NMTEACH FAQ, in the case of team-teaching, “Students are assigned to one teacher. That teacher would get credit. A school could change teacher assignment each snapshot and thus both teachers would get counted automatically.”

I can only assume the Editorial Board members are brighter than I am because I cannot parse out the teacher evaluation values for my sample student.

Nevertheless, the Editorial Board also gushes with praise regarding the use of teacher attendance as an evaluation tool. This is just morally wrong.

Leave is not “granted” to teachers by some benevolent overlord. It is earned and is part of the union contract between teachers and the state. Imagine a job where you are told that you have two weeks vacation time but, of course, you can only take two days of it or you might be fired. Absurd, right? Well, apparently not if you are NMPED.

This is one of the major issues in the ongoing lawsuit, where, as I recall, one of the plaintiffs was penalized for taking time off for the apparently frivolous task of cancer treatment! NMPED should be ashamed of themselves!

The Editorial Board also praises the new, “no lag time” aspect of the evaluation system. In the past, teacher evaluations were presented at the end of the school year before student scores were available. Now that the evaluations depend upon student scores, the evaluations appear early in the next school year. As noted in the timeline above, the lag time is still present contrary to what they assert. Further, these evaluations now come mid-term after the school-year has started and teacher assignments have been made.

In the end, and again in the title, the Editorial Board claims that the “Teacher evals have evolved but tired criticisms of them have not.”

The evals have not evolved but have rather devolved to something virtually identical to the former observation and administration-based evaluations. The tired criticisms are tired precisely because they have never been adequately answered by NMPED.

~A Concerned New Mexico Parent

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written nearly 30 years ago (i.e., in 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and the history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, as they still are now, trying to hold teachers accountable for their students’ learning as measured by states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today — investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair for decades.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” and “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” Yet “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or out, as well.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

I recently re-read an article in full that is now 10 years old, or 10 years out, as published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” Unfortunately, I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as this serves, most of the time, as a reminder of how things have not changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientist), Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the time the authors wrote this article, besides the aforementioned data and database issues, there were issues with “multiple measures on the same student and multiple teachers instructing each student” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” The authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (still) comes from influences outside of the teacher’s control. A similar statistic (i.e., 1% to 14%) was recently highlighted in the position statement on VAMs released by the American Statistical Association. (A small simulation of what such a variance share means in practice is sketched just after this list.)
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here (Putting growth and value-added models on the map: A national overview).
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically insolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” The only real solution here is to use “implicit imputation of values for unobserved gains using the observed scores,” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observed gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.
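To make concrete what a “share of the variance” claim like the 3% to 20% figure means operationally, here is a minimal, hypothetical R sketch (not McCaffrey et al.’s models, and it assumes the lme4 package is installed): scores are simulated so that teachers account for 10% of the total variance, and that share is then recovered as the intraclass correlation from a simple random-intercept model.

    library(lme4)  # assumes the lme4 package is installed

    set.seed(42)
    n_teachers <- 200
    n_students <- 25   # students per teacher

    # Simulate scores in which teachers contribute 10% of the total variance.
    teacher_effect <- rnorm(n_teachers, sd = sqrt(0.10))
    dat <- data.frame(
      teacher = factor(rep(seq_len(n_teachers), each = n_students)),
      score   = rep(teacher_effect, each = n_students) +
                rnorm(n_teachers * n_students, sd = sqrt(0.90))
    )

    # Intraclass correlation = teacher variance / total variance.
    fit <- lmer(score ~ 1 + (1 | teacher), data = dat)
    vc  <- as.data.frame(VarCorr(fit))
    vc$vcov[1] / sum(vc$vcov)   # recovers a teacher share close to 0.10

Everything left over, here 90% of the variance, sits at the student level, which is the sense in which most of what drives test scores lies outside any individual teacher’s control.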

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”

Special Issue of “Educational Researcher” (Paper #6 of 9): VAMs as Tools for “Egg-Crate” Schools

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#6 of 9), which is actually an essay here, titled “Will VAMS Reinforce the Walls of the Egg-Crate School?” This essay is authored by Susan Moore Johnson – Professor of Education at Harvard and somebody whom I had the privilege of interviewing in the past as an esteemed member of the National Academy of Education (see interviews here and here).

In this article, Moore Johnson argues that when policymakers use VAMs to evaluate, reward, or dismiss teachers, they may be perpetuating an egg-crate model, which is (referencing Tyack (1974) and Lortie (1975)) a metaphor for the compartmentalized school structure in which teachers (and students) work, most often in isolation. This model ultimately undermines the efforts of all involved in the work of schools to build capacity school wide, and to excel as a school given educators’ individual and collective efforts.

Contrary to the primary logic supporting VAM use, however, “teachers are not inherently effective or ineffective” on their own. Rather, their collective effectiveness is related to their professional development that may be stunted when they work alone, “without the benefit of ongoing collegial influence” (p. 119). VAMs then, and unfortunately, can cause teachers and administrators to (hyper)focus “on identifying, assigning, and rewarding or penalizing individual [emphasis added] teachers for their effectiveness in raising students’ test scores [which] depends primarily on the strengths of individual teachers” (p. 119). What comes along with this, then, is a series of interrelated egg-crate behaviors including, but not limited to, increased competition, lack of collaboration, increased independence versus interdependence, and the like, all of which can lead to decreased morale and decreased effectiveness overall.

Inversely, students are much “better served when human resources are deliberately organized to draw on the strengths of all teachers on behalf of all students, rather than having students subjected to the luck of the draw in their classroom assignment[s]” (p. 119). Likewise, “changing the context in which teachers work could have important benefits for students throughout the school, whereas changing individual teachers without changing the context [as per VAMs] might not [work nearly as well] (Lohr, 2012)” (p. 120). Teachers learning from their peers, working in teams, teaching in teams, co-planning, collaborating, learning via mentoring by more experienced teachers, learning by mentoring, and the like should be much more valued, as warranted via the research, yet they are not valued given the very nature of VAM use.

Hence, there are also unintended consequences that can come along with the (hyper)use of individual-level VAMs. These include, but are not limited to: (1) Teachers become more likely to “literally or figuratively ‘close their classroom door’ and revert to working alone…[This]…affect[s] current collaboration and shared responsibility for school improvement, thus reinforcing the walls of the egg-crate school” (p. 120); (2) Due to bias, or the possibility that teachers might be unfairly evaluated given the types of students non-randomly assigned into their classrooms, teachers might avoid teaching high-needs students if they perceive themselves to be “at greater risk” of teaching students they cannot grow; (3) This can perpetuate isolative behaviors, as well as behaviors that encourage teachers to protect themselves first, and above all else; (4) “Therefore, heavy reliance on VAMS may lead effective teachers in high-need subjects and schools to seek safer assignments, where they can avoid the risk of low VAMS scores[; (5) M]eanwhile, some of the most challenging teaching assignments would remain difficult to fill and likely be subject to repeated turnover, bringing steep costs for students” (p. 120); While (6) “using VAMS to determine a substantial part of the teacher’s evaluation or pay [also] threatens to sidetrack the teachers’ collaboration and redirect the effective teacher’s attention to the students on his or her roster” (pp. 120-121) versus students, for example, on other teachers’ rosters who might also benefit from other teachers’ content-area or other expertise. Likewise, (7) “Using VAMS to make high-stakes decisions about teachers also may have the unintended effect of driving skillful and committed teachers away from the schools that need them most and, in the extreme, causing them to leave the profession” in the end (p. 121).

I should add, though, and in all fairness given the Review of Paper #3 – on VAMs’ potentials here, many of these aforementioned assertions are somewhat hypothetical in the sense that they are based on the grander literature surrounding teachers’ working conditions, versus the direct, unintended effects of VAMs, given no research yet exists to examine the above, or other unintended effects, empirically. “There is as yet no evidence that the intensified use of VAMS interferes with collaborative, reciprocal work among teachers and principals or sets back efforts to move beyond the traditional egg-crate structure. However, the fact that we lack evidence about the organizational consequences of using VAMS does not mean that such consequences do not exist” (p. 123).

The bottom line is that we do not want to prevent the school organization from becoming “greater than the sum of its parts…[so that]…the social capital that transforms human capital through collegial activities in schools [might increase] the school’s overall instructional capacity and, arguably, its success” (p. 118). Hence, as Moore Johnson argues, we must adjust the focus “from the individual back to the organization, from the teacher to the school” (p. 118), and from the egg-crate back to a much more holistic and realistic model capturing what it means to be an effective school, and what it means to be an effective teacher as an educational professional within one. “[A] school would do better to invest in promoting collaboration, learning, and professional accountability among teachers and administrators than to rely on VAMS scores in an effort to reward or penalize a relatively small number of teachers” (p. 122).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; and see the Review of Article #5 – on teachers’ perceptions of observations and student growth here.

Article #6 Reference: Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117-126. doi:10.3102/0013189X15573351

Special Issue of “Educational Researcher” (Paper #2 of 9): VAMs’ Measurement Errors, Issues with Retroactive Revisions, and (More) Problems with Using Test Scores

Recall from a prior post that the peer-reviewed journal titled Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou – Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University – and Matthew Springer – Assistant Professor of Public Policy also at Vanderbilt.

As written in the article’s abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

  • “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
  • “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78). (A simple simulation of this class-size effect is sketched just after this list.)
  • “[R]esources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error” (p. 78).
  • “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs) and despite the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed, or researched.
  • Related, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it’s rare; hence, it should not significantly impact teachers’ value-added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
  • In terms of teachers versus proctors or other teachers monitoring students when they take large-scale standardized tests (that are used across all states to calculate value-added estimates), researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant than I do in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but for those who still allow this, this finding might also be relevant. “An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know” (p. 83).
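To see how class size alone can drive a disparity of the kind described above, here is a rough, hypothetical R sketch (a back-of-the-envelope normal approximation, not Ballou and Springer’s analysis): two teachers with identical true effects are judged by their class-average gain scores, and the smaller class falls below a fixed “ineffective” cutoff far more often.

    # Two teachers with identical true effects (zero); only class size differs.
    set.seed(7)
    sims       <- 100000
    sd_student <- 1       # student-level noise in gain scores
    cutoff     <- -0.2    # arbitrary "ineffective" threshold on the class-mean gain

    # The sampling error of a class mean shrinks with the square root of class size.
    mean_25  <- rnorm(sims, mean = 0, sd = sd_student / sqrt(25))
    mean_100 <- rnorm(sims, mean = 0, sd = sd_student / sqrt(100))

    p25  <- mean(mean_25  < cutoff)   # chance the 25-student teacher looks "ineffective" (~0.16)
    p100 <- mean(mean_100 < cutoff)   # chance the 100-student teacher looks "ineffective" (~0.02)
    c(p25 = p25, p100 = p100, ratio = p25 / p100)

With these made-up values the 25-student teacher is roughly seven times as likely to fall below the cutoff; the exact ratio depends entirely on the cutoff and noise assumptions, which is the point.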

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS) used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee and many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with EVAAS practice whereby teachers’ EVAAS scores change retroactively, as more data become available, to get at more “precision” even though teachers’ scores can change one or two years well after the initial score is registered (and used for whatever purposes).

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86. doi:10.3102/0013189X15574904

Splits, Rotations, and Other Consequences of Teaching in a High-Stakes Environment in an Urban School

An Arizona teacher who teaches in a very urban, high-needs school writes about the realities of teaching in her school, under the pressures that come along with high-stakes accountability and a teacher workforce working under an administration, both of which are operating in chaos. This is a must read, as she also talks about two unintended consequences of educational reform in her school about which I’ve never heard before: splits and rotations. Both seem to occur at all costs simply to stay afloat during “rough” times, but both also likely have deleterious effects on students in such schools, as well as on teachers being held accountable for the students “they” teach.

She writes:

Last academic year (2012-2013) a new system for evaluating teachers was introduced into my school district. And it was rough. Teachers were dropping like flies. Some were stressed to the point of requiring medical leave. Others were labeled ineffective based on a couple classroom observations and were asked to leave. By mid-year, the school was down five teachers. And there were a handful of others who felt it was just a matter of time before they were labeled ineffective and asked to leave, too.

The situation became even worse when the long-term substitutes who had been brought in to cover those teacher-less classrooms began to leave also. Those students with no contracted teacher and no substitute began getting “split”. “Splitting” is what the administration of a school does in a desperate effort to put kids somewhere. And where the students go doesn’t seem to matter. A class roster is printed, and the first five students on the roster go to teacher A. The second five students go to teacher B, and so on. Grade-level isn’t even much of a consideration. Fourth graders get split to fifth grade classrooms. Sixth graders get split to 5th and 7th grade classrooms. And yes, even 7th and 8th graders get split to 5th grade classrooms. Was it difficult to have another five students in my class? Yes. Was it made more difficult that they weren’t even of the same grade level I was teaching? Yes. This went on for weeks…

And then the situation became even worse. As it became more apparent that the revolving door of long-term substitutes was out of control, the administration began “The Rotation.” “The Rotation” was a plan that used the contracted teachers (who remained!) as substitutes in those teacher-less classrooms. And so once or twice a week, I (and others) would get an email from the administration alerting me that it was my turn to substitute during prep time. Was it difficult to sacrifice 20-40% of weekly prep time (that is used to do essential work like plan lessons, gather materials, grade, call parents, etc.)? Yes. Was it difficult to teach in a classroom that had a different teacher, literally, every hour without coordinated lessons? Yes.

Despite this absurd scenario, in October 2013, I received a letter from my school district indicating how I fared in this inaugural year of the teacher evaluation system. It wasn’t good. Fifty percent of my performance label was based on school test scores (not on the test scores of my homeroom students). How well can students perform on tests when they don’t have a consistent teacher?

So when I think about accountability, I wonder now what it is I was actually held accountable for? An ailing, urban school? An ineffective leadership team who couldn’t keep a workforce together? Or was I just held accountable for not walking away from a no-win situation?

Coincidentally, this 2013-2014 academic year has, in many ways, mirrored the 2012-2013. The upside is that this year, only 10% of my evaluation is based on school-wide test scores (the other 40% will be my homeroom students’ test scores). This year, I have a fighting chance to receive a good label. One more year of an unfavorable performance label and the district will have to, by law, do something about me. Ironically, if it comes to that point, the district can replace me with a long-term substitute, who is not subject to the same evaluation system that I am. Moreover, that long-term substitute doesn’t have to hold a teaching certificate. Further, that long-term substitute will cost the district a lot less money in benefits (i.e. healthcare, retirement system contributions).

I should probably start looking for a job—maybe as a long-term substitute.

Out with the Old, In with the New: Proposed Ohio Budget Bill to Revise the Teacher Evaluation System (Again)

Here is another post from VAMboozled!’s new team member – Noelle Paufler, Ph.D. – on Ohio’s “new and improved” teacher evaluation system, redesigned three years out from Ohio’s last attempt.

The Ohio Teacher Evaluation System (OTES) can hardly be considered “old” in its third year of implementation, and yet Ohio Budget Bill (HB64) proposes new changes to the system for the 2015-2016 school year. In a recent blog post, Plunderbund (aka Greg Mild) highlights the latest revisions to the OTES as proposed in HB64. (This post is also featured here on Diane Ravitch’s blog.)

Plunderbund outlines several key concerns with the budget bill including:

  • Student Learning Objectives (SLOs): In place of SLOs, teachers who are assigned to grade levels, courses, or subjects for which value-added scores are unavailable (i.e., via state standardized tests or vendor assessments approved by the Ohio Department of Education [ODE]) are to be evaluated “using a method of attributing student growth,” per HB64, Section 3319.111 (B) (2).
  • Attributed Student Growth: The value-added results of an entire school or district are to be attributed to teachers who otherwise do not have individual value-added scores for evaluation purposes. In this scenario, teachers are to be evaluated based upon the performance of students they may not have met in subject areas they do not directly teach.
  • Timeline: If enacted, the budget bill would not require the ODE to finalize the revised evaluation framework until October 31, 2015. Although the OTES has just now been fully implemented in most districts across the state, school boards would need to quickly revise teacher evaluation processes, forms, and software to comply with the new requirements well after the school year is already underway.

As Plunderbund notes, these newly proposed changes resurrect a series of long-standing questions of validity and credibility with regards to the OTES. The proposed use of “attributed student growth” to evaluate teachers who are assigned to non-tested grade levels or subject areas has raised, and should raise, concerns among all teachers. This proposal presumes that an essentially two-tiered evaluation system can validly measure the effectiveness of some teachers based on presumably proximal outcomes (their individual students’ scores on state or approved vendor assessments) and others based on distal outcomes (at best) using attributed student growth. While the dust has scarcely settled with regards to OTES implementation, Plunderbund compellingly argues that this new wave of proposed changes would result in more confusion, frustration, and chaos among teachers and disruptions to student learning.

To learn more and to read Plunderbund’s full critique of the proposed changes, again, click here.

A Really Old Oldie but Still Very Relevant Goodie

Thanks to a colleague in Florida, I recently read an article about the “Problems of Teacher Measurement” published in 1917 in the Journal of Educational Psychology by B. F. Pittenger. As mentioned, it’s always interesting to take a historical approach (hint here to policymakers), and in this case a historical view via the perspective of an author writing, almost 100 years ago, on the same topic of interest to followers here. Let’s see how things have changed, or more specifically, how things have not changed.

Then, “they” had the same goals we still have today, if this isn’t telling in and of itself. From 1917: “The current efforts of experimentalists in the field of teacher measurement are only attempts to extract from the consciousness of principals and supervisors these personal criteria of good teaching, and to assemble and condense them into a single objective schedule, thoroughly tested, by means of which every judge of teaching may make his [sic] estimates more accurate, and more consistent with those of other judges. There is nothing new about the entire movement except the attempt to objectify what already exists subjectively, and to unify and render universal what is now the scattered property of many men.”

Policymakers continue to invest entirely in an ideal known even then to be (possibly forever) false. From 1917: “There are those who believe that the movement toward teacher measurement is a monstrous innovation, which threatens the holiest traditions of the educational profession by putting a premium upon mechanical methodology…the phrase ‘teacher-measurement,’ itself, no doubt, is in part responsible for this misunderstanding, as it suggests a mathematical exactness of procedure which is clearly impossible in this field [emphasis added]. Teacher measurement will probably never become more than a carefully controlled process of estimating a teacher’s individual efficiency…[This is]…sufficiently convenient and euphonious, and has now been used widely enough, to warrant its continuation.”

As for the methods “issues” in 1917? “However sympathetic one may be with the general plan of devising schedules for teacher measurement, it is difficult to justify many of the methods by which these investigators have attacked the problem. For example, all of them appear to have set up as their goal the construction of a schedule which can be applied to any teacher, whether in the elementary or high school, and irrespective of the grade or subject in which his teaching is being done. “Teaching is teaching,” is the evident assumption, “and the same wherever found.” But it may reasonably be maintained that different qualities and methods, at least in part, are requisite…In so far as the criteria of good teaching are the same in these very diverse situations, it seems probable that the comparative importance to be attached to each must differ.” Sound familiar?

On the use of multiple measures, as currently in line with the current measurement standards of the profession, from 1917: “students of teacher measurement appear to have erred in that they have attempted too much. The writer is strongly of the opinion that, for the present at least, efforts to construct a schedule for teacher measurement should be confined to a single one of the three planes which have been enumerated. Doubtless in the end we shall want to know as much as possible about all three; and to combine in our final estimate of a teacher’s merit all attainable facts as to her equipment, her classroom procedure, and the results which she achieves. But at present we should do wisely to project our investigations upon one plane at a time, and to make each of these investigations as thorough as it is possible to make it. Later, when we know the nature and comparative value of the various items necessary to adequate judgment upon all planes, there will be time and opportunity for putting together the different schedules into one.” One-hundred years later…

On prior teachers’ effects: “we must keep constantly in mind the fact that the results which pupils achieve in any given subject are by no means the product of the labor of any single teacher. Earlier teachers, other contemporary teachers, and the environment external to the school, are all factors in determining pupil efficiency in any school subject. It has been urged that the influence of these complicating factors can be materially reduced by measuring only the change in pupil achievement which takes place under the guidance of a single teacher. But it must be remembered that this process only reduces these complications; it does not and cannot eliminate them.”

Finally, the supreme goal to be sought, then and now? “The plane of results (in the sense of changes wrought in pupils) would be the ideal plane upon which to build an estimate of a teacher’s individual efficiency, if it were possible (1) to measure all of the results of teaching, and (2) to pick out from the body of measured results any single teacher’s contribution. At present these desiderata are impossible to attain [emphasis added]…[but]…let us not make the mistake of assuming that the results that we can measure are the only results of teaching, or even that they are the most important part.”

Likewise, “no one teacher can be given the entire blame or credit for the doings of the pupils in her classroom…the ‘classroom process’ should be regarded as including the activities of both teachers and pupils.” In the end, “The promotion, discharge, or constructive criticism of teachers cannot be reduced to mathematical formulae. The proper function of a scorecard for teacher measurement is not to substitute such a formula for a supervisor’s personal judgment, but to aid him in discovering and assembling all the data upon which intelligent judgment should be based.”

ACT’s Dan Wright on a Simple VAM Experiment

A VAMboozled! follower, Dan Wright, who is also a statistician at ACT (the nonprofit company famously known for developing the college-entrance ACT test), wrote me an email a few weeks ago about his informed and statistical take on VAMs. I invited him to write a post for you all here. His response, with my appreciation, follows:

“I am part cognitive scientist, part statistician, and I work for ACT, Inc.  About a year ago I was asked whether value-added models (VAMs) provide good estimates of teacher effectiveness.  I found lots of papers that examined if different statistical models gave similar answers to each other and lots showing certain groups tended to have lower scores, but all of these papers seemed to bypass the question I was interested in: do VAMs accurately estimate teacher effectiveness?  I began looking at this and Audrey invited me to write a guest blog post describing what I found.

The difficulty is that in the real world we don’t really know how effective a teacher is, that is, in “objective” terms, so we do not have much with which to compare our VAM estimates. Hence, a popular alternative for statisticians deciding whether an estimation method is good or bad is to simulate data using particular values, and then see if the method produces estimates that are similar to the values expected.

The trick is how to create the simulation data to do this. It is necessary to have some model for how the data arise. Hence, I will use a simple model to demonstrate the problem. It is usually best to start with simple models, make sure the statistical procedures work, and then progress to more complex models. On that note, I have a paper under review that goes into more detail with other models and more simulation studies. Email me if you want a copy.

Anyhow, the model I used to create the data for this simulation starts with three variables. The first encapsulates everything about the students that is unique to them, including ability, effort, grit, and even the home environment (called AB in the code below). The second encapsulates all the neighborhood and environmental factors that can influence everything from financial spending in schools to which teachers are hired (called NE). This value is unique to each teacher (so it is simpler than real data where teachers are nested in schools). The third is teacher effectiveness (called TE). I created it from the neighborhood variable plus some random variation. These three variables would be unmeasured in a real data set (of course, some elements of them may be measured, but not all elements of them), but in a computer simulation their values are known.

In addition, there are two sets of test scores. The first are scores from before the student has encountered the teacher (called PRE). They are created by adding the student ability variable, the neighborhood variable, and some random variation. The second set of test scores are from after the student has encountered the teacher (called POST). They are created by adding the first set of test scores, the student ability variable, one-fifth of the teacher effectiveness variable (less than the other effects since the impact of a single teacher is usually estimated at about 10% or less of student achievement), and some random variation.

Again, however, this model is simpler than real educational data. Accordingly, there are no complications like missing values, teachers nested within schools, etc.  Also, I will use a very simple VAM, just using the first set of scores to predict the second set of scores, but allowing for random variation by teachers and by students.  Given the importance placed on the results of VAMs in many different countries and in many industries (not just in education), this method should work…Right?

Theories about causation actually suggest that there will be a problem.  I won’t go into the details about this (again, email me for my paper if you want more information), but using the first set of scores in the VAM allows information to flow between the teacher effectiveness variable and the second set of test scores through the other unmeasured variables. This messes up trying to measure the effect of teachers on the final scores.

But let’s see if a simulation supports that this is problematic as well.  The simulation used the freeware R and the code is below so that you can repeat it, if you want, or change the values.  Download R onto your computer (follow the instructions on http://cran.us.r-project.org/), open it, copy the code, and paste it into R. The # sign means R ignores the remainder of the line so I have used that to make comments. If you think another set of numbers are more realistic, put those in. Or construct the data in some other way. One nice thing about simulations is you can keep trying all sorts of things.

[Screenshot of the original R simulation code]
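Since the original screenshot of the code is not reproduced here, below is a minimal reconstruction in R of the simulation as described above, assuming the lme4 package for the random-effects VAM (one reading of “random variation by teachers and by students”). Only the name AB2POST comes from the original post; the other names and the coefficient values are placeholders. AB2POST is set to 2 here, which, as noted in the next paragraph, pushes the correlation down and makes the problem easy to see, so this sketch will produce a larger negative correlation than the -.13 reported below for the original values.

    library(lme4)  # assumes lme4 is installed for the random-effects VAM

    set.seed(1)
    n_teachers <- 100
    n_students <- 20   # students per teacher

    AB2POST <- 2       # weight of student ability in the post-test score (placeholder value)

    teacher <- factor(rep(seq_len(n_teachers), each = n_students))

    NE <- rnorm(n_teachers)                # neighborhood/environment, one value per teacher
    TE <- NE + rnorm(n_teachers)           # true teacher effectiveness (NE plus random variation)
    AB <- rnorm(n_teachers * n_students)   # student ability, effort, home environment

    PRE  <- AB + NE[teacher] + rnorm(n_teachers * n_students)   # scores before the teacher
    POST <- PRE + AB2POST * AB + 0.2 * TE[teacher] +            # scores after the teacher
            rnorm(n_teachers * n_students)

    # The simple VAM: prior scores predict later scores, with random
    # variation by teacher (the residual captures student-level variation).
    fit    <- lmer(POST ~ PRE + (1 | teacher))
    est_TE <- ranef(fit)$teacher[, 1]      # estimated teacher effects

    cor(est_TE, TE)   # negative: more effective teachers tend to get lower estimates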

For the values used, the correlation is -.13.  A negative correlation means the teachers who are more effective tend to have lower estimated teacher effectiveness scores.  That’s not good.  That’s bad!  Don’t get hung up with this specific number, though, as it moves around depending on how the data are constructed. The correlation can go up (e.g., change any one, but not two, of the first four values to -1) and down (e.g., increase the size of AB2POST to 2).

Given this, there is a conclusion, as well as a recommendation.  The conclusion is that value-added estimates can be very inaccurate for what seem to be highly commonsensical and plausible models for how the data could arise, and where they are bad is predicted from theories of causation.  The recommendation is that those promoting and those critical of VAMs should write down plausible models for how they think the data in which they are interested arose and see if the statistical procedures used perform well.

Dan Wright

*The personal opinions expressed do not necessarily represent the views and opinions of ACT, Inc.