U.S. Department of Education: Value-Added Not Good for Evaluating Schools and Principals

Just this month, the Institute of Education Sciences (IES) wing of the U.S. Department of Education released a report about using value-added models (VAMs) for measuring school principals’ performance. The study, conducted by researchers at Mathematica Policy Research and titled “Can Student Test Scores Provide Useful Measures of School Principals’ Performance?”, can be found online here; my summary of the study’s findings follows.

Before the passage of the Every Student Succeeds Act (ESSA), 40 states had written into their state statutes, as incentivized by the federal government, the use of growth in student achievement for annual principal evaluation purposes. More states had written growth/value-added models (VAMs) into statute for teacher evaluation purposes, which we have covered extensively via this blog, but this post pertains only to school and/or principal evaluation purposes. Since the passage of ESSA, and the reduction in the federal government’s control over state-level policies, states have much more liberty to decide whether to continue using student achievement growth for either purpose. This paper is positioned within this context, and more specifically is meant to help states decide whether, or to what extent, they might (or might not) continue to move forward with using growth/VAMs for school and principal evaluation purposes.

Researchers, more specifically, assessed (1) reliability – or the consistency or stability of these ratings over time, which is important “because only stable parts of a rating have the potential to contain information about principals’ future performance; unstable parts reflect only transient aspects of their performance;” and (2) one form of multiple evidences of validity – the predictive validity of these principal-level measures, with predictive validity defined as “the extent to which ratings from these measures accurately reflect principals’ contributions to student achievement in future years.” In short, “A measure could have high predictive validity only if [emphasis added] it was highly stable between consecutive years [i.e., reliability]…and its stable part was strongly related to principals’ contributions to student achievement” over time (i.e., predictive validity).
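To make the idea of “stability” a bit more concrete, here is a minimal sketch of one common way to quantify it: the correlation between the same principals’ ratings in two consecutive years. To be clear, this is my own illustration, not the researchers’ actual method; the function name `pearson` and the five hypothetical ratings per year are assumptions made up for the example.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ratings for the same five principals in two consecutive years
# (illustrative numbers only, not data from the study).
year1 = [1.0, 2.0, 3.0, 4.0, 5.0]
year2 = [2.0, 1.0, 4.0, 3.0, 5.0]

# Year-to-year correlation as a rough stability (reliability) estimate.
stability = pearson(year1, year2)
print(f"year-to-year correlation: {stability:.2f}")
```

A value near 1 would suggest that ratings barely move from year to year, while a value near 0 would suggest that they are mostly transient noise. Per the researchers’ reasoning quoted above, high stability is necessary but not sufficient for high predictive validity, as the stable part must also relate to principals’ actual contributions.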

Researchers used principal-level value-added (unadjusted and adjusted for prior achievement and other potentially biasing demographic variables) to more directly examine “the extent to which student achievement growth at a school differed from average growth statewide for students with similar prior achievement and background characteristics.” Also important to note is that the data they used to examine school-level value-added came from Pennsylvania, which is one of a handful of states that use the popular and proprietary (and controversial) Education Value-Added Assessment System (EVAAS) statewide.

Here are the researchers’ key findings, taken directly from the study’s summary (again, for more information see the full manuscript here).

  • The two performance measures in this study that did not account for students’ past achievement—average achievement and adjusted average achievement—provided no information for predicting principals’ contributions to student achievement in the following year.
  • The two performance measures in this study that accounted for students’ past achievement—school value-added and adjusted school value-added—provided, at most, a small amount of information for predicting principals’ contributions to student achievement in the following year. This was due to instability and inaccuracy in the stable parts.
  • Averaging performance measures across multiple recent years did not improve their accuracy for predicting principals’ contributions to student achievement in the following year. In simpler terms, a principal’s average rating over three years did not predict his or her future contributions more accurately than did a rating from the most recent year only. This is more of a statistical finding than one that has direct implications for policy and practice (except for silly states who might, despite findings like those presented in this study, decide that they can use one year to do this not at all well instead of three years to do this not at all well).

Their bottom line? “…no available measures of principal [/school] performance have yet been shown to accurately identify principals [/schools] who will contribute successfully to student outcomes in future years,” especially if based on students’ test scores, although the researchers also assert that “no research has ever determined whether non-test measures, such as measures of principals’ leadership practices, [have successfully or accurately] predict[ed] their future contributions” either.

The researchers follow up with a highly cautionary note: “the value-added measures will make plenty of mistakes when trying to identify principals [/schools] who will contribute effectively or ineffectively to student achievement in future years. Therefore, states and districts should exercise caution when using these measures to make major decisions about principals. Given the inaccuracy of the test-based measures, state and district leaders and researchers should also make every effort to identify nontest measures that can predict principals’ future contributions to student outcomes [instead].”

Citation: Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from http://ies.ed.gov/ncee/pubs/2016002/pdf/2016002.pdf

New Mexico Is “At It Again”

“A Concerned New Mexico Parent” sent me yet another blog entry for you all to stay apprised of the ongoing “situation” in New Mexico and the continuous escapades of the New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here, but in this one (s)he writes what follows:

Well, the NMPED is at it again.

They just released the teacher evaluation results for the 2015-2016 school year. And, the report and media press releases are something.

Readers of this blog are familiar with my earlier documentation of the myriad varieties of scoring formulas used by New Mexico to evaluate its teachers. If I recall, I found something like 200 variations in scoring formulas [see his/her prior post on this here with an actual variation count at n=217].

However, a recent article published in the Albuquerque Journal indicates that, now according to the NMPED, “only three types of test scores are [being] used in the calculation: Partnership for Assessment of Readiness for College and Careers [PARCC], end-of-course exams, and the [state’s new] Istation literacy test.” [Recall from another article released last January that New Mexico’s Secretary of Education Hanna Skandera is also the head of the governing board for the PARCC test].

Further, the Albuquerque Journal article author reports that the “PED also altered the way it classifies teachers, dropping from 107 options to three. Previously, the system incorporated many combinations of criteria such as a teacher’s years in the classroom and the type of standardized test they administer.”

The new state-wide evaluation plan is also available in more detail here. Although I should also add that there has been no published notification of the radical changes in this plan. It was just simply and quietly posted on NMPED’s public website.

Important to note, though, is that for Group B teachers (all levels), the many variations documented previously have all been replaced by end-of-course (EOC) exams. Also note that for Group A teachers (all levels) the percentage assigned to the PARCC test has been reduced from 50% to 35%. (Oh, how the mighty have fallen …). The remaining 15% of the Group A score is to be composed of EOC exam scores.

There are only two small problems with this NMPED simplification.

First, in many districts, no EOC exams were given to Group B teachers in the 2015-2016 school year, and none were given in the previous year either. Any EOC scores that might exist were from a solitary administration of EOC exams three years previously.

Second, for Group A teachers whose scores formerly relied solely on the PARCC test for 50% of their score, no EOC exams were ever given.

Thus, NMPED has replaced its policy of evaluating teachers on the basis of students they don’t teach with this new policy of evaluating teachers on the basis of tests they never administered!

Well done, NMPED (not…)

Luckily, NMPED still cannot make any consequential decisions based on these data, again, until NMPED proves to the court that the consequential decisions that they would still very much like to make (e.g., employment, advancement and licensure decisions) are backed by research evidence. I know, interesting concept…

Why So Silent? Did You Think I Have Left You for Good?

You might recognize the title of this post from one of my all-time favorite Broadway shows: The Phantom Of The Opera – Masquerade/Why So Silent. I thought I would use it here, to explain my recent and notable silence on the topic of value-added models (VAMs).

First, I recently returned from summer break, although I still occasionally released blog posts when important events related to VAMs and their (ab)uses for teacher evaluation purposes occurred. More importantly, though, the frequency with which said important events have happened has, relatively, fortunately, and significantly declined.

Yes, the so-far-so-good news is that schools, school districts, and states are apparently not nearly as active in pursuing the use of VAMs for stronger teacher accountability purposes for educational reform. Likewise, schools, school districts, and states are not nearly as prone to making really silly (and stupid) decisions with these models, especially without the research supporting such decisions.

This is very much due to the federal government’s recent (December 10, 2015) passage of the Every Student Succeeds Act (ESSA), which no longer requires teachers to be evaluated by their students’ test scores, for example, using VAMs (see prior posts on this here and here).

While there are states, districts, and schools that are still moving forward with VAMs and their original high-stakes teacher evaluation plans as largely based on VAMs (e.g., New Mexico, Tennessee, Texas), many others have really begun to rethink the importance and vitality of VAMs as part of their teacher evaluation systems for educational reform (e.g., Alabama, Georgia, Oklahoma). This, of course, is primarily at the state level. Certainly, there are districts out there representing both sides of the same continuum.

Accordingly, however, I have had multiple conversations with colleagues and others regarding what I might do with this blog should people stop seriously investing and riding their teacher/educational reform efforts on VAMs. While I don’t think that this will ever happen, there is honestly nothing I would like more (as an academic) than to close this blog down, should educational policymakers, politicians, philanthropists, and others focus on new and entirely different, non-Draconian ways to reform America’s public schools. We shall see how it goes.

But for now, why have I been relatively so silent? The VAM as we currently know it, in use and implementation, might very well be turning into our VAMtom of the Profession 😉

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 30 years ago (i.e., 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured on states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” While “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” This, all, makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or out, as well.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

Center on the Future of American Education, on America’s “New and Improved” Teacher Evaluation Systems

Thomas Toch — education policy expert and research fellow at Georgetown University, and founding director of the Center on the Future of American Education — just released, as part of the Center, a report titled: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. He sent this to me for my thoughts, and I decided to summarize my thoughts here, with thanks and all due respect to the author, as clearly we are on different sides of the spectrum in terms of the literal “value” America’s new teacher evaluation systems might in fact “add” to the reformation of America’s public schools.

While quite a long and meaty report, here are some of the points I think that are important to address publicly:

First, is it true that using prior teacher evaluation systems (which were almost if not entirely based on teacher observational systems) yielded for “nearly every teacher satisfactory ratings”? Indeed, this is true. However, what we have seen since 2009, when states began to adopt what were then (and in many ways still are) viewed as America’s “new and improved” or “strengthened” teacher evaluation systems, is that for 70% of America’s teachers, these teacher evaluation systems are still based only on the observational indicators being used prior, because for only 30% of America’s teachers are value-added estimates calculable. As also noted in this report, it is for these 70% that “the superficial teacher [evaluation] practices of the past” (p. 2) will remain the same, although I disagree with this particular adjective, especially when these measures are used for formative purposes. While certainly imperfect, these are not simply “flimsy checklists” of no use or value. There is, indeed, much empirical research to support this assertion.

Likewise, these observational systems have not really changed since 2009, or 1999 for that matter and not that they could change all that much; but, they are not in their “early stages” (p. 2) of development. Indeed, this includes the Danielson Framework explicitly propped up in this piece as an exemplar, regardless of the fact it has been used across states and districts for decades and it is still not functioning as intended, especially when summative decisions about teacher effectiveness are to be made (see, for example, here).

Hence, in some states and districts (sometimes via educational policy) principals or other observers are now being asked, or required, to deliberately assign teachers to lower observational categories, or to assign approximate proportions of teachers to each observational category used. Where the instrument might not distribute scores “as currently needed,” one way to game the system is to tell principals, for example, that they should allot only X% of teachers to each of the three-to-five categories most often used across said instruments. In fact, in an article one of my doctoral students and I have forthcoming, we have termed this, with empirical evidence, the “artificial deflation” of observational scores, as externally persuaded or required. Worse is that this sometimes signals to the greater public that these “new and improved” teacher evaluation systems are being used for more discriminatory purposes (i.e., to actually differentiate between good and bad teachers on some sort of discriminating continuum), or that, indeed, there is a normal distribution of teachers, as per their levels of effectiveness. While certainly there is some type of distribution, no evidence exists whatsoever to suggest that those who fall on the wrong side of the mean are, in fact, ineffective, and vice versa. It’s all relative, seriously, and unfortunately.

Related, the goal here is really not to “thoughtfully compare teacher performances,” but to evaluate teachers as per a set of criteria against which they can be evaluated and judged (i.e., whereby criterion-referenced inferences and decisions can be made). Inversely, comparing teachers in norm-referenced ways, as (socially) Darwinian and resonant with many-to-some, does not necessarily work, either or again. This is precisely what the authors of The Widget Effect report did, after which they argued for wide-scale system reform, so that increased discrimination among teachers, and reduced indifference on the part of evaluating principals, could occur. However, as also evidenced in this aforementioned article, the increasing presence of normal curves illustrating “new and improved” teacher observational distributions does not necessarily mean anything normal.

And were these systems not used often enough or “rarely” prior, to fire teachers? Perhaps, although there are no data to support such assertions, either. This very argument was at the heart of the Vergara v. California case (see, for example, here): that teacher tenure laws, as well as laws protecting teachers’ due process rights, were keeping “grossly ineffective” teachers teaching in the classroom. Again, while no expert on either side could produce for the Court any hard numbers regarding how many “grossly ineffective” teachers were in fact being protected by such archaic rules and procedures, I would estimate (as based on my years of experience as a teacher) that this number is much lower than many believe it (and perhaps perpetuate it) to be. In fact, there was only one teacher whom I recall, who taught with me in a highly urban school, whom I would have classified as grossly ineffective, and also tenured. He was ultimately fired, and quite easy to fire, as he also knew that he just didn’t have it.

Now to be clear, here, I do think that not just “grossly ineffective” but also simply “bad teachers” should be fired, but the indicators used to do this must yield valid inferences, as based on the evidence, as critically and appropriately consumed by the parties involved, after which valid and defensible decisions can and should be made. Whether one calls this due process in a proactive sense, or a wrongful termination suit in a retroactive sense, what matters most, though, is that the evidence supports the decision. This is the very issue at the heart of many of the lawsuits currently ongoing on this topic, as many of you know (see, for example, here).

Finally, where is the evidence, I ask, for many of the declarations included within and throughout this report? A review of the 133 endnotes included, for example, reveals only a very small handful of references to the larger literature on this topic (see a very comprehensive list of this literature here, here, and here). This is also highly problematic in this piece, as only the usual suspects (e.g., Sandi Jacobs, Thomas Kane, Bill Sanders) are cited to support the assertions advanced.

Take, for example, the following declaration: “a large and growing body of state and local implementation studies, academic research, teacher surveys, and interviews with dozens of policymakers, experts, and educators all reveal a much more promising picture: The reforms have strengthened many school districts’ focus on instructional quality, created a foundation for making teaching a more attractive profession, and improved the prospects for student achievement” (p. 1). Where is the evidence? There is no such evidence, and no such evidence published in high-quality, scholarly peer-reviewed journals of which I am aware. Again, publications released by the National Council on Teacher Quality (NCTQ) and from the Measures of Effective Teaching (MET) studies, as still not externally reviewed and still considered internal technical reports with “issues”, don’t necessarily count. Accordingly, no such evidence has been introduced, by either side, in any court case in which I am involved, likely, because such evidence does not exist, again, empirically and at some unbiased, vetted, and/or generalizable level. While Thomas Kane has introduced some of his MET study findings in the cases in Houston and New Mexico, these might be some of the easiest pieces of evidence to target, accordingly, given the issues.

Otherwise, the only thing in this piece with which I agree, and which I view, given the research literature, as true and good, is that teachers are now being observed more often, by more people, in more depth, and in perhaps some cases with better observational instruments. Accordingly, teachers, also as per the research, seem to appreciate and enjoy the additional and more frequent/useful feedback and discussions about their practice, as increasingly offered. This, I would agree, is something very positive that has come out of the nation’s policy-based focus on its “new and improved” teacher evaluation systems, again, as largely required by the federal government, especially pre-Every Student Succeeds Act (ESSA).

Overall, and in sum, “the research reveals that comprehensive teacher-evaluation models are stronger than the sum of their parts.” Unfortunately again, however, this is untrue in that systems based on multiple measures are entirely limited by the indicator that, in educational measurement terms, performs the worst. While such a holistic view is ideal, in measurement terms the sum of the parts is entirely limited by the weakest part. This is currently the value-added indicator (i.e., with the lowest levels of reliability and, related, issues with validity and bias) — the indicator at issue within this particular blog, and the indicator of the most interest, as it is this indicator that has truly changed our overall approaches to the evaluation of America’s teachers. It has yet to deliver, however, especially if to be used for high-stakes consequential decision-making purposes (e.g., incentives, getting rid of “bad apples”).

Feel free to read more here, as publicly available: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. See also other claims regarding the benefits of said systems within (e.g., these systems as foundations for new teacher roles and responsibilities, smarter employment decisions, prioritizing classrooms, increased focus on improved standards). See also the recommendations offered, some with which I agree on the observational side (e.g., ensuring that teachers receive multiple observations during a school year by multiple evaluators), and none with which I agree on the value-added side (e.g., use at least two years of student achievement data in teacher evaluation ratings; rather, researchers agree that three years of value-added data are needed, as based on at least four years of student-level test data). There are, of course, many other recommendations included. You all can be the judges of those.

Five “Indisputable” Reasons Why VAMs are Good?

Just this week, in Education Week — the field’s leading national newspaper covering K–12 education — a blogger by the name of Matthew Lynch published a piece explaining his “Five Indisputable [emphasis added] Reasons Why You Should Be Implementing Value-Added Assessment.”

I’m going to try to stay aboveboard with my critique of this piece, as best I can, as by the title alone you all can infer there are certainly pieces (mainly five) to be seriously criticized about the author’s indisputable take on value-added (and by default value-added models (VAMs)). I examine each of these assertions below, but I will say overall and before we begin, that pretty much everything that is included in this piece is hardly palatable or tolerable, considering that Education Week published it, and by publishing it they quasi-endorsed it, even if in an independent blog post that they likely at minimum reviewed, then made public.

First, the five assertions, along with a simple response per assertion:

1. Value-added assessment moves the focus from statistics and demographics to asking of essential questions such as, “How well are students progressing?”

In theory, yes – this is generally true (see also my response about the demographics piece replicated in assertion #3 below). The problem here, though, as we all should know by now, is that once we move away from the theory in support of value-added, this theory more or less crumbles. The majority of the research on this topic explains and evidences the reasons why. Is value-added better than what “we” did before, however, while measuring student achievement once per year without taking growth over time into consideration? Perhaps, but if it worked as intended, for sure!

2. Value-added assessment focuses on student growth, which allows teachers and students to be recognized for their improvement. This measurement applies equally to high-performing and advantaged students and under-performing or disadvantaged students.

Indeed, the focus is on growth (see my response about growth in assertion #1 above). What the author of this post does not understand, however, is that his latter conclusion is likely THE most controversial issue surrounding value-added, and on this all topical researchers likely agree. In fact, authors of the most recent review of what is actually called “bias” in value-added estimates, as published in the peer-reviewed Economics of Education Review (see a pre-publication version of this manuscript here), concluded that, because of potential bias (i.e., “This measurement [does not apply] equally to high-performing and advantaged students and under-performing or disadvantaged students”), all value-added modelers should control for as many student-level (and other) demographic variables as possible to help minimize this potential, also given the extent to which multiple authors’ evidence of bias varies wildly (from negligible to considerable).

3. Value-added assessment provides results that are tied to teacher effectiveness, not student demographics; this is a much more fair accountability measure.

See my comment immediately above, with general emphasis added to this overly simplistic take on the extent to which VAMs yield “fair” estimates, free from the biasing effects (never to always) caused by such demographics. My “fairest” interpretation of the current albeit controversial research surrounding this particular issue is that bias does not exist across all teacher-level estimates, but it certainly occurs when teachers are non-randomly assigned highly homogeneous sets of students who are gifted, who are English Language Learners (ELLs), who are enrolled in special education programs, who disproportionately represent racial minority groups, who disproportionately come from lower socioeconomic backgrounds, and who have previously been retained in grade.

4. Value-added assessment is not a stand-alone solution, but it does provide rich data that helps educators make data-driven decisions.

This is entirely false. There is no research evidence, still to date, that teachers use these data to make instructional decisions. Accordingly, no research is linked to or cited here (as well as elsewhere). Now, if the author is talking about naive “educators,” in general, who make consequential decisions as based on poor (i.e., the opposite of “rich”) data, this assertion would be true. This “truth,” in fact, is at the core of the lawsuits ongoing across the nation regarding this matter (see, for example, here), with consequences ranging from tagging a teacher’s file for receiving a low value-added score to teacher termination.

5. Value-added assessment assumes that teachers matter and recognizes that a good teacher can facilitate student improvement.

Perhaps we have only value-added assessment to thank for “assuming” [sic] this. Enough said…

Or not…

Lastly, the author professes to be a “professor,” pretty much all over the place (see, again, here), although he is currently an associate professor. There is a difference, and folks who respect the difference typically make the distinction explicit and known, especially in an academic setting or context. See also here, however, given his expertise (or the lack thereof) in value-added or VAMs, about what he writes here as “indisputable.”

Perhaps most important here, though, is that his falsely inflated professional title implies, especially to a naive or uncritical public, that what he has to say, again without any research support, demands some kind of credibility and respect. Unfortunately, this is just not the case; hence, we are again reminded of the need for general readers to be critical in their consumption of such pieces. I would have thought Education Week would have played a larger role than this, rather than just putting this stuff “out there,” even if for simple debate or discussion.

The Danielson Framework: Evidence of Un/Warranted Use

The US Department of Education’s statistics, research, and evaluation arm, the Institute of Education Sciences, recently released a study (here) about the validity of the Danielson Framework for Teaching’s observational ratings as used for 713 teachers, with some minor adaptations (see box 1 on page 1), in the second largest school district in Nevada, Washoe County School District (Reno). This district is to use these data, along with student growth ratings, to inform decisions about teachers’ tenure, retention, and pay-for-performance, in compliance with the state’s still current teacher evaluation system. The study was authored by researchers out of the Regional Educational Laboratory (REL) West at WestEd, a nonpartisan, nonprofit research, development, and service organization.

As many of you know, principals in many districts throughout the US, as per the Danielson Framework, use a four-point rating scale to rate teachers on 22 teaching components meant to measure four different dimensions or “constructs” of teaching.

In this study, researchers found that principals did not discriminate as much among the individual four constructs and 22 components (i.e., the four domains were not statistically distinct from one another and the ratings of the 22 components seemed to measure the same or universal cohesive trait). Rather, principals did discriminate among the teachers they observed to be more generally effective and highly effective (i.e., the universal trait of overall “effectiveness”), as captured by the two highest categories on the scale. Hence, analyses support the use of the overall scale versus the sub-components or items in and of themselves. Put differently, and in the authors’ words, “the analysis does not support interpreting the four domain scores [or indicators] as measurements of distinct aspects of teaching; instead, the analysis supports using a single rating, such as the average over all [sic] components of the system to summarize teacher effectiveness” (p. 12).

In addition, principals also (still) rarely identified teachers as minimally effective or ineffective, with approximately 10% of ratings falling into the lowest two of the four categories on the Danielson scale. This was also true across all but one of the 22 aforementioned Danielson components (see Figures 1-4, p. 7-8; see also Figure 5, p. 9).

I emphasize the word “still” in that this negative skew (that is, what would be an illustrated distribution of, in this case, the proportion of teachers receiving all scores, whereby the mass of the distribution would be concentrated toward the right side of the figure) is one of the main reasons we as a nation became increasingly focused on “more objective” indicators of teacher effectiveness, focused on teachers’ direct impacts on student learning and achievement via value-added measures (VAMs). Via “The Widget Effect” report (here), authors argued that it was more or less impossible to have so many teachers perform at such high levels, especially given the extent to which students in other industrialized nations were outscoring students in the US on international exams. Thereafter, US policymakers who got a hold of this report, among others, used it to make advancements towards, and research-based arguments for, “new and improved” teacher evaluation systems with key components being the “more objective” VAMs.

In addition, and as directly related to VAMs, in this study researchers also found that each rating from each of the four domains, as well as the average of all ratings, “correlated positively with student learning [gains, as derived via the Nevada Growth Model, as based on the Student Growth Percentiles (SGP) model; for more information about the SGP model see here and here; see also p. 6 of this report here], in reading and in math, as would be expected if the ratings measured teacher effectiveness in promoting student learning” (p. i). Of course, this would only be expected if one agrees that the VAM estimate is the core indicator around which all other such indicators should revolve, but I digress…

Anyhow, by calculating standard correlation coefficients between teachers’ growth scores and the four Danielson domain scores, researchers found that “in all but one case” [i.e., the correlation coefficient between Domain 4 and growth in reading], said correlations were positive and statistically significant. Indeed this is true, although, as aligned with what is increasingly becoming a saturated finding in the literature (see similar findings about the Marzano observational framework here; see similar findings from other studies here, here, and here; see also other studies as cited by authors of this study on p. 13-14 here), the magnitude and practical significance of the correlations they observed range from “very weak” (e.g., r = .18) to “moderate” (e.g., r = .45, .46, and .48). See their Table 2 (p. 13) with all relevant correlation coefficients illustrated below.

[Screenshot of Table 2 (p. 13) from the report; image not available]
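
To put these magnitudes in perspective, here is a minimal illustrative sketch (not an analysis taken from the report) that squares each of the correlation coefficients quoted above. For a simple linear relationship, the squared correlation is the proportion of variance the two measures share. The function name `shared_variance` is hypothetical, and only the four example values quoted in the text are used, not the full Table 2.

```python
# Example correlation coefficients quoted in the text above
# (the "very weak" and "moderate" cases, not the full Table 2).
quoted_r = [0.18, 0.45, 0.46, 0.48]


def shared_variance(r: float) -> float:
    """Return r squared: the proportion of variance two measures share
    under a simple linear relationship."""
    return r * r


for r in quoted_r:
    # Print each coefficient next to its squared value as a percentage.
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r) * 100:.1f}%")
```

Even the largest coefficient quoted (r = .48) corresponds to less than a quarter of the variance being shared between the observational ratings and the growth scores, and the weakest (r = .18) to roughly 3%.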

Regardless, “[w]hile th[is] study takes place in one school district, the findings may be of interest to districts and states that are using or considering using the Danielson Framework” (p. i), especially those that intend to use this particular instrument for summative and sometimes consequential purposes, in that the Framework’s factor structure does not hold up for such purposes unless, possibly, the instrument is used as a generalized discriminator. Even then, however, the evidence of validity is still too weak to support further generalized inferences and decisions.

So, those of you in states, districts, and schools, do make these findings known, especially if this framework is being used for similar purposes without evidence in support of such use.

Citation: Lash, A., Tran, L., & Huang, M. (2016). Examining the validity of ratings from a classroom observation instrument for use in a district’s teacher evaluation system (REL 2016–135). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory West. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/west/pdf/REL_2016135.pdf

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

As with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these things is actually true (or near to true, nor will they likely ever be true) in educational practice, however. Hence the errors we continue to observe, which continue to prevent VAMs from being used for their intended purposes, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added, in that the tests are very narrow, focus on lower level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessing Readiness for College and Careers (PARCC) and Smarter Balanced, multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms, whose teachers are the ones whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is, accordingly, one of the most debated issues surrounding VAMs and their levels of outcome bias.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include: VAM output is grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output is “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Relatedly, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether VAMs are used for teacher evaluations or not. Given that it is rare in this area of research that we can focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (p. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Special Issue of “Educational Researcher” (Paper #7 of 9): VAMs Situated in Appropriate Ecologies

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#7 of 9), which is actually a commentary titled “The Value in Value-Added Depends on the Ecology.” This commentary is authored by Henry Braun – Professor of Education and Public Policy, Educational Research, Measurement, and Evaluation at Boston College (also the author of a previous post on this site here).

In this article Braun, importantly, makes explicit the assumptions on which this special issue of ER is based; that is, on assumptions that (1) too many students in America’s public schools are being inadequately educated, (2) evaluation systems as they currently exist “require radical overhaul,” and (3) it is therefore essential to use student test performance with low- and high-stakes attached to improve that which educators do (or don’t do) to adequately address the first assumption. There are counterarguments Braun also offers to readers on each of these assumptions (see p. 127), but more importantly he makes evident that the focus of this special issue is situated otherwise, as in line with current education policies. This special issue, overall, then “raise[s] important questions regarding the potential for high-stakes, test-driven educator accountability systems to contribute to raising student achievement” (p. 127).

Given this context, the “value-added” provided within this special issue, again according to Braun, is that the authors of each of the five main research articles included report on how VAM output actually plays out in practice, giving “careful consideration to how the design and implementation of teacher evaluation systems could be modified to enhance the [purportedly, see comments above] positive impact of accountability and mitigate the negative consequences” at the same time (p. 127). In other words, if we more or less agree to the aforementioned assumptions, also given the educational policy context influencing, perpetuating, or actually forcing these assumptions, these articles should help others better understand VAMs’ and observational systems’ potentials and perils in practice.

At the same time, Braun encourages us to note that “[t]he general consensus is that a set of VAM scores does contain some useful information that meaningfully differentiates among teachers, especially in the tails of the distribution [although I would argue bias has a role here]. However, individual VAM scores do suffer from high variance and low year-to-year stability as well as an undetermined amount of bias [which may be greater in the tails of the distribution]. Consequently, if VAM scores are to be used for evaluation, they should not be given inordinate weight and certainly not treated as the “gold standard” to which all other indicators must be compared” (p. 128).

Likewise, it’s important to note that IF consequences are to be attached to said indicators of teacher evaluation (i.e., VAM and observational data), there should be validity evidence made available and transparent to warrant the inferences and decisions to be made, and the validity evidence “should strongly support a causal [emphasis added] argument” (p. 128). However, both indicators still face major “difficulties in establishing defensible causal linkage[s]” as theorized, and desired (p. 128); hence, this prevents validity in inference. What does not help, either, is when VAM scores are given precedence over other indicators OR when principals align teachers’ observational scores with the same teachers’ VAM scores given the precedence often given to (what are often viewed as the superior, more objective) VAM-based measures. This sometimes occurs given external pressures (e.g., applied by superintendents) to artificially inflate, in this case, levels of agreement between indicators (i.e., convergent validity).

Relatedly, in the section Braun titles his “Trio of Tensions” (p. 129), he notes that (1) “[B]oth accountability and improvement are undermined, as attested to by a number of the articles in this issue. In the current political and economic climate, [if possible] it will take thoughtful and inspiring leadership at the state and district levels to create contexts in which an educator evaluation system constructively fulfills its roles with respect to both public accountability and school improvement” (p. 129-130); (2) “[T]he chasm between the technical sophistication of the various VAM[s] and the ability of educators to appreciate what these models are attempting to accomplish…sow[s] further confusion…[hence]…there must be ongoing efforts to convey to various audiences the essential issues—even in the face of principled disagreements among experts on the appropriate role(s) for VAM[s] in educator evaluations” (p. 130); and finally (3) “[H]ow to balance the rights of students to an adequate education and the rights of teachers to fair evaluations and due process [especially for]…teachers who have value-added scores and those who teach in subject-grade combinations for which value-added scores are not feasible…[must be addressed; this] comparability issue…has not been addressed but [it] will likely [continue to] rear its [ugly] head” (p. 130).

In the end, Braun argues for another “Trio,” but this one including three final lessons: (1) “although the concerns regarding the technical properties of VAM scores are not misplaced, they are not necessarily central to their reputation among teachers and principals. [What is central is]…their links to tests of dubious quality, their opaqueness in an atmosphere marked by (mutual) distrust, and the apparent lack of actionable information that are largely responsible for their poor reception” (p. 130); (2) there is a “very substantial, multiyear effort required for proper implementation of a new evaluation system…[related, observational] ratings are not a panacea. They, too, suffer from technical deficiencies and are the object of concern among some teachers because of worries about bias” (p. 130); and (3) “legislators and policymakers should move toward a more ecological approach [emphasis added; see also the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here] to the design of accountability systems;” that is, “one that takes into account the educational and political context for evaluation, the behavioral responses and other dynamics that are set in motion when a new regime of high-stakes accountability is instituted, and the long-term consequences of operating the system” (p. 130).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; and see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here.

Article #7 Reference: Braun, H. (2015). The value in value-added depends on the ecology. Educational Researcher, 44(2), 127-131. doi:10.3102/0013189X15576341

Special Issue of “Educational Researcher” (Paper #6 of 9): VAMs as Tools for “Egg-Crate” Schools

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#6 of 9), which is actually an essay here, titled “Will VAMS Reinforce the Walls of the Egg-Crate School?” This essay is authored by Susan Moore Johnson – Professor of Education at Harvard and somebody whom I had the privilege of interviewing in the past as an esteemed member of the National Academy of Education (see interviews here and here).

In this article, Moore Johnson argues that when policymakers use VAMs to evaluate, reward, or dismiss teachers, they may be perpetuating an egg-crate model, which is (referencing Tyack (1974) and Lortie (1975)) a metaphor for the compartmentalized school structure in which teachers (and students) work, most often in isolation. This model ultimately undermines the efforts of all involved in the work of schools to build capacity school wide, and to excel as a school given educators’ individual and collective efforts.

Contrary to the primary logic supporting VAM use, however, “teachers are not inherently effective or ineffective” on their own. Rather, their collective effectiveness is related to their professional development that may be stunted when they work alone, “without the benefit of ongoing collegial influence” (p. 119). VAMs then, and unfortunately, can cause teachers and administrators to (hyper)focus “on identifying, assigning, and rewarding or penalizing individual [emphasis added] teachers for their effectiveness in raising students’ test scores [which] depends primarily on the strengths of individual teachers” (p. 119). What comes along with this, then, is a series of interrelated egg-crate behaviors including, but not limited to, increased competition, lack of collaboration, increased independence versus interdependence, and the like, all of which can lead to decreased morale and, in effect, decreased effectiveness.

Inversely, students are much “better served when human resources are deliberately organized to draw on the strengths of all teachers on behalf of all students, rather than having students subjected to the luck of the draw in their classroom assignment[s]” (p. 119). Likewise, “changing the context in which teachers work could have important benefits for students throughout the school, whereas changing individual teachers without changing the context [as per VAMs] might not [work nearly as well] (Lohr, 2012)” (p. 120). Teachers learning from their peers, working in teams, teaching in teams, co-planning, collaborating, learning via mentoring by more experienced teachers, learning by mentoring, and the like should be much more valued, as warranted via the research, yet they are not valued given the very nature of VAM use.

Hence, there are unintended consequences that can also come along with the (hyper)use of individual-level VAMs. These include, but are not limited to: (1) Teachers who are more likely to “literally or figuratively ‘close their classroom door’ and revert to working alone…[This]…affect[s] current collaboration and shared responsibility for school improvement, thus reinforcing the walls of the egg-crate school” (p. 120); (2) Due to bias, or that teachers might be unfairly evaluated given the types of students non-randomly assigned into their classrooms, teachers might avoid teaching high-needs students if teachers perceive themselves to be “at greater risk” of teaching students they cannot grow; (3) This can perpetuate isolative behaviors, as well as behaviors that encourage teachers to protect themselves first, and above all else; (4) “Therefore, heavy reliance on VAMS may lead effective teachers in high-need subjects and schools to seek safer assignments, where they can avoid the risk of low VAMS scores[; (5) M]eanwhile, some of the most challenging teaching assignments would remain difficult to fill and likely be subject to repeated turnover, bringing steep costs for students” (p. 120); while (6) “using VAMS to determine a substantial part of the teacher’s evaluation or pay [also] threatens to sidetrack the teachers’ collaboration and redirect the effective teacher’s attention to the students on his or her roster” (p. 120-121) versus students, for example, on other teachers’ rosters who might also benefit from other teachers’ content area or other expertise. Likewise, (7) “Using VAMS to make high-stakes decisions about teachers also may have the unintended effect of driving skillful and committed teachers away from the schools that need them most and, in the extreme, causing them to leave the profession” in the end (p. 121).

I should add, though, and in all fairness given the Review of Paper #3 – on VAMs’ potentials here, many of these aforementioned assertions are somewhat hypothetical in the sense that they are based on the grander literature surrounding teachers’ working conditions, versus the direct, unintended effects of VAMs, given no research yet exists to examine the above, or other unintended effects, empirically. “There is as yet no evidence that the intensified use of VAMS interferes with collaborative, reciprocal work among teachers and principals or sets back efforts to move beyond the traditional egg-crate structure. However, the fact that we lack evidence about the organizational consequences of using VAMS does not mean that such consequences do not exist” (p. 123).

The bottom line is that we do not want to prevent the school organization from becoming “greater than the sum of its parts…[so that]…the social capital that transforms human capital through collegial activities in schools [might increase] the school’s overall instructional capacity and, arguably, its success” (p. 118). Hence, as Moore Johnson argues, we must adjust the focus “from the individual back to the organization, from the teacher to the school” (p. 118), and from the egg-crate back to a much more holistic and realistic model capturing what it means to be an effective school, and what it means to be an effective teacher as an educational professional within one. “[A] school would do better to invest in promoting collaboration, learning, and professional accountability among teachers and administrators than to rely on VAMS scores in an effort to reward or penalize a relatively small number of teachers” (p. 122).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; and see the Review of Article #5 – on teachers’ perceptions of observations and student growth here.

Article #6 Reference: Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117-126. doi:10.3102/0013189X15573351