Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#4 of 9) here, titled “Make Room Value-Added: Principals’ Human Capital Decisions and the Emergence of Teacher Observation Data.” This one is authored by Ellen Goldring, Jason A. Grissom, Christine Neumerski, Marisa Cannata, Mollie Rubin, Timothy Drake, and Patrick Schuermann, all of whom are associated with Vanderbilt University.

This article is primarily about (1) the extent to which the data generated by “high-quality observation systems” can inform principals’ human capital decisions (e.g., teacher hiring, contract renewal, assignment to classrooms, professional development), and (2) the extent to which principals are relying less on test scores derived via value-added models (VAMs), when making the same decisions, and why. Here are some of their key (and most important, in my opinion) findings:

  • Principals across all school systems revealed major hesitations and challenges regarding the use of VAM output for human capital decisions. Barriers preventing VAM use included the timing of data availability (e.g., the fall), which is well after human capital decisions are made (p. 99).
  • VAM output are too far removed from the practice of teaching (p. 99), and this lack of instructional sensitivity impedes, if not entirely prevents, their actual versus hypothetical use for school/teacher improvement.
  • “Principals noted they did not really understand how value-added scores were calculated, and therefore they were not completely comfortable using them” (p. 99). Likewise, principals reported that because teachers did not understand how the systems worked either, teachers did not use VAM output data either (p. 100).
  • VAM output are not transparent when used to determine compensation, and especially when used to evaluate teachers teaching nontested subject areas. In districts that use school-wide VAM output to evaluate teachers in nontested subject areas, in fact, principals reported regularly ignoring VAM output altogether (pp. 99-100).
  • “Principals reported that they perceived observations to be more valid than value-added measures” (p. 100); hence, principals reported relying much more on observational output, again, when making human capital decisions and making such decisions “valid” (p. 100).
  • “One noted exception to the use of value-added scores seemed to be in the area of assigning teachers to particular grades, subjects, and classes. Many principals mentioned they use value-added measures to place teachers in tested subjects and with students in grade levels that ‘count’ for accountability purpose…some principals [also used] VAM [output] to move ineffective teachers to untested grades, such as K-2 in elementary schools and 12th grade in high schools” (p. 100).

Of special note here is also the following finding: “In half of the systems [in which researchers investigated these systems], there [was] a strong and clear expectation that there be alignment between a teacher’s value-added growth score and observation ratings…Sometimes this was a state directive and other times it was district-based. In some systems, this alignment is part of the principal’s own evaluation; principals receive reports that show their alignment” (p. 101). In other words, principals are being evaluated and held accountable given the extent to which their observations of their teachers match their teachers’ VAM-based data. If misalignment is noticed, it is not to be the fault of either measure (e.g., in terms of measurement error); it is to be the fault of the principal, who is critiqued for inaccuracy and is therefore (inversely) incentivized to skew the observational data (the only data over which the principal has control) to artificially match VAM-based output. This clearly distorts validity, or rather the validity of the inferences that are to be made using such data. Appropriately, principals also “felt uncomfortable [with this] because they were not sure if their observation scores should align primarily…with the VAM” output (p. 101).

“In sum, the use of observation data is important to principals for a number of reasons: It provides a ‘bigger picture’ of the teacher’s performance, it can inform individualized and large group professional development, and it forms the basis of individualized support for remediation plans that serve as the documentation for dismissal cases. It helps principals provide specific and ongoing feedback to teachers. In some districts, it is beginning to shape the approach to teacher hiring as well” (p. 102).

The only significant weakness of this piece, again in my opinion, is that the authors write that the observational data at focus in this study are “new,” thanks to recent federal initiatives. They write, for example, that “data from structured teacher observations—both quantitative and qualitative—constitute a new [emphasis added] source of information principals and school systems can utilize in decision making” (p. 96). They are also “beginning to emerge [emphasis added] in the districts…as powerful engines for principal data use” (p. 97). I would beg to differ: these systems have not changed much over time, pre and post these federal initiatives, despite what the authors claim (without evidence or warrant). See, for example, Table 1 on p. 98 of the article to see if what they have included within the list of components of such new and “complex, elaborate teacher observation systems” is actually new or much different from most of the observational systems in use prior. As an aside, one such system at issue in this examination is one with which I am familiar, in use in the Houston Independent School District. Click here to also see if this system is more “complex” or “elaborate” than such systems prior.

Also recall that one of the key reports that triggered the current call for VAMs, as the “more objective” measures needed to measure and therefore improve teacher effectiveness, was based on data that suggested that “too many teachers” were being rated as satisfactory or above. The observational systems in use then are essentially the same observational systems still in use today (see “The Widget Effect” report here). This is in stark contradiction to the authors’ claims throughout this piece, for example, when they write “Structured teacher observations, as integral components of teacher evaluations, are poised to be a very powerful lever for changing principal leadership and the influence of principals on schools, teachers, and learning.” This counters all that is in, and all that came from, “The Widget Effect” report here.


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; and see the Review of Article #3 – on VAMs’ potentials here.

Article #4 Reference: Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. doi:10.3102/0013189X15575031

1 thought on “Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations”

  1. Long. From page 98, describing Table 1, “Many Elements Included in Teacher Observation Systems,” notice these two elements: “Score by using a detailed instructional rubric, evaluating teachers on a series of indicators using a scaled rating system” and “Provide evidence tied to scoring.” I find this interesting.

    These researchers seem to be engaged in a spin-off of the deeply flawed, Gates-funded Measures of Effective Teaching study. This study is also funded by Gates.

    I agree that there is nothing “emerging” about the issues in teacher evaluation treated in this study. There has been a long propaganda campaign to make VAM the “objective” measure against which all other measures are judged “valid” (or not).

    Principals have been inundated with arbitrary demands that are all about metrics. Their own evaluations hinge on rating teachers on scores that look objective and can be rationalized as impartial and unbiased, as if scores are the “trustworthy deciders” of the merit of the teacher, and the contribution of the teacher to students and the schools.

    The contradictory, often absurd, consequences of seeking “objective” measures with scores aggrandized can be seen in the quandaries of principals who must judge teachers of “untested” subjects, estimated to be about 70% of the teaching workforce.

    Principals know it is “wrong” to assign these teachers a “distributed score,” a fancy term for a score stolen because it is readily available, typically a school-wide reading or math score or a composite of those scores. The principal can do this, but must pretend it is OK, and not a case of making the evaluation a sham for a whole lot of teachers.

    This article is useful in documenting how thoroughly the evaluation schemes pushed into policies by armchair experts are ill-suited to humane and wise judgments for all teachers, but especially absurd and unjust for the “teachers of non-tested subjects.”

    I find no discussion of the rubric-based observation schemes in wide use (most of them one-size-fits-all) or the issues in scoring them. For example, the proprietary scheme from Charlotte Danielson, the FFT or Framework for Teaching, has been widely used for all grades and subjects in the absence of any studies that establish its validity for these uses. It was used in the MET Project, where observers were trained to make observations and offer ratings of teacher-made videos, with no high stakes attached to the ratings in that case.

    But observation protocols in real classrooms do have high stakes and are not often the subject of questioning about content and grade-level relevance, or validity apart from VAM.

    I have email correspondence from Charlotte Danielson about these matters from 2013. I raised questions about the few studies of “validity” cited on the website, and related matters. Danielson acknowledged that “all the validation studies have consisted of correlations between teacher performance and student ‘value-added’ gains on standardized tests.”… “there is no comparable assurance of ‘predictive’ validity for the many teachers (as you point out) whose students are not assessed by those measures.” She also said: “The FFT is intended to apply to all disciplines, K-12. That is grounded in the simple fact that teaching, in whatever context, requires the same basic tasks, namely, knowing one’s subject, knowing one’s students, having clear outcomes, establishing a culture for learning, etc., etc. All of the details of how each of those things is done, naturally, is highly grade-level and discipline-specific and requires, as you correctly suggest, expertise on the part of those teaching in these settings. But your larger point is a good one: I don’t know of any systematic studies looking at this issue.”

    Well, the wheels grind slowly, but look what is in the works, funded by The Helmsley Charitable Trust (caps are in the database):
    PROJECT TITLE: FRAMEWORK FOR TEACHING CLUSTERS REFINEMENT, TRAINING RESOURCES, AND LONG-TERM STEWARDSHIP OF THE FRAMEWORK FOR TEACHING. Funding to support the refinement and public release of the content-specific versions of Charlotte Danielson’s Framework for Teaching.

    I wonder what these content-specific versions will look like, especially since relatively new national standards have been published for science, technology, engineering (STEM), the arts (including media arts), social studies, and physical education. There are also standards for “social emotional learning,” the hot new focus of USDE grants and embedded in the “Every Student Succeeds Act.” Some of the standards are grade-specific, some are organized by grade spans.
    Some of these standards are connected to the Common Core, others are not. Indeed, only a few states, about eight, are implementing the Common Core without modification.

    Where does all of this end? I do not know. The surveillance of teachers under the banner of accountability is really getting out of hand.
