Learning from What Doesn’t Work in Teacher Evaluation

One of my doctoral students — Kevin Close — and I just had a study published in the practitioner journal Phi Delta Kappan that I wanted to share with all of you, especially before the study is no longer open access or free (see the full study as currently available here). As the title indicates, the study is about how states, school districts, and schools can “Learn from What Doesn’t Work in Teacher Evaluation,” based on an analysis that the two of us conducted of all documents pertaining to the four teacher evaluation and value-added model (VAM)-centered lawsuits in which I have been directly involved, and which I have also covered on this blog. These lawsuits include Lederman v. King in New York (see here), American Federation of Teachers et al. v. Public Education Department in New Mexico (see here), Houston Federation of Teachers v. Houston Independent School District in Texas (see here), and Trout v. Knox County Board of Education in Tennessee (see here).

Via this analysis, we set out to comb through the legal documents to identify the strongest objections to VAMs as teacher measurement and accountability strategies, as also recognized by the courts in these lawsuits. “The lessons to be learned from these cases are both important and timely” given that “[u]nder the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work.”

The most pertinent, and also most common, issues across these cases were as follows:

(1) Inconsistencies in teachers’ VAM-based estimates from one year to the next that are sometimes “wildly different.” Across these lawsuits, issues with reliability were very evident, in that teachers classified as “effective” one year were either theorized or demonstrated to have around a 25%-59% chance of being classified as “ineffective” the next year, or vice versa, with other permutations also possible. As per our profession’s Standards for Educational and Psychological Testing, reliability should instead be observed, whereby VAM estimates of teacher effectiveness are more or less consistent over time, from one year to the next, regardless of the types of students and, perhaps, the subject areas that teachers teach.
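
To make the reliability concern above more concrete, here is a minimal, hypothetical simulation (in Python, and not drawn from any of the court documents or from our study) of how a modest year-to-year correlation among VAM estimates translates into frequent reclassification across an “effective”/“ineffective” cut point. The correlation of 0.35 and the bottom-quintile cut are assumptions chosen purely for illustration.

```python
import numpy as np

# Minimal simulation: two years of VAM estimates for the same teachers,
# correlated at an assumed year-to-year correlation (rho). The value of
# rho is an assumption for illustration only, roughly in the range
# reported in the literature on VAM stability.
rng = np.random.default_rng(0)
n_teachers = 100_000
rho = 0.35

year1 = rng.standard_normal(n_teachers)
year2 = rho * year1 + np.sqrt(1 - rho**2) * rng.standard_normal(n_teachers)

# Classify the bottom 20% of teachers as "ineffective" each year.
ineffective_y1 = year1 < np.quantile(year1, 0.20)
ineffective_y2 = year2 < np.quantile(year2, 0.20)

# Share of teachers rated "effective" in year 1 but "ineffective" in year 2.
switched = (~ineffective_y1 & ineffective_y2).mean() / (~ineffective_y1).mean()
print(f"P(ineffective in year 2 | effective in year 1) = {switched:.2f}")
```

Even with no change in the underlying teachers, a sizable share of those rated “effective” in year one lands in the “ineffective” category in year two, which is the kind of instability at issue in these lawsuits.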

(2) Bias in teachers’ VAM-based estimates was also of note, whereby documents suggested or evidenced that bias, or rather biased estimates of teachers’ actual effects, does indeed exist (although this was also the area of most contention and dispute). Specific to VAMs, because teachers are not randomly assigned the students they teach, whether their students are, on average, more or less motivated, smart, knowledgeable, or capable can bias students’ test-based data, and teachers’ test-based data when aggregated. Court documents, although again not without counterarguments, suggested that VAM-based estimates are sometimes biased, especially when relatively homogeneous sets of students (e.g., English Language Learners (ELLs), gifted and special education students, free-or-reduced-lunch-eligible students) are non-randomly concentrated into schools, purposefully placed into classrooms, or both. Research suggests that this sometimes happens regardless of the sophistication of the statistical controls used to block said bias.
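
As a rough illustration of this non-random assignment problem (again, a hypothetical sketch, not a model taken from any of these cases), the Python snippet below simulates classrooms in which an unmeasured, classroom-level advantage, such as home academic support, both boosts test-score growth and varies across teachers. Every teacher’s true effect is set to zero, yet a naive gain-score “VAM” attributes the classroom differences to the teachers.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 200, 25

# Unmeasured classroom-level advantage (e.g., home support), which is not
# evenly spread across teachers because students are not randomly assigned.
classroom_advantage = rng.normal(0, 0.3, n_teachers)

# Every teacher's TRUE effect is zero in this simulation.
true_teacher_effect = np.zeros(n_teachers)

# Student gains = true teacher effect + classroom advantage + student noise.
gains = (true_teacher_effect[:, None]
         + classroom_advantage[:, None]
         + rng.normal(0, 1.0, (n_teachers, class_size)))

# A naive "VAM": each teacher's estimated effect is the mean classroom gain.
estimated_effect = gains.mean(axis=1)

print("SD of estimated teacher effects:", round(estimated_effect.std(), 2))
print("Correlation with unmeasured advantage:",
      round(np.corrcoef(estimated_effect, classroom_advantage)[0, 1], 2))
```

The estimated effects spread out and track the unmeasured advantage, which is precisely the kind of bias that statistical controls are supposed to, but do not always, remove.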

(3) The gaming mechanisms in play within teacher evaluation systems in which VAMs play a key role, or carry significant evaluative weight, were also of legal concern and dispute. That administrators sometimes inflate the observational ratings of teachers whom they want to protect, thereby offsetting the weight the VAMs sometimes carry, was of note, as was the inverse: that administrators also sometimes lower teachers’ ratings to better align them with their “more objective” VAM counterparts. “So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their ‘subjective’ classroom ratings were in sync with the VAM’s ‘objective’ scores.” Both sets of behavior distort the validity (or “truthfulness”) of any teacher evaluation system and violate the same, aforementioned Standards for Educational and Psychological Testing, which call for VAM scores and observation ratings to be kept separate. One indicator should never be adjusted to offset or to fit the other.

(4) Transparency, or the lack thereof, was also a common issue across cases. Transparency, which can be defined as the extent to which something is accessible and readily capable of being understood, pertains to whether VAM-based estimates are accessible and make sense to those on the receiving end. “Not only should [teachers] have access to [their VAM-based] information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.” In no case was this more legally pertinent than in Houston Federation of Teachers v. Houston Independent School District in Texas. Here, the presiding judge ruled that teachers did have “legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well.”

In the main article (here) we also discuss what states are now doing to (hopefully) improve their teacher evaluation systems, in terms of using multiple measures to evaluate teachers more holistically. We emphasize the (in)formative versus the summative and high-stakes functions of such systems, as well as the importance of allowing teachers to take ownership of such systems’ development and implementation. I will leave you all to read the full article (here) for these details.

In sum, though, when rethinking states’ teacher evaluation systems, especially given the new liberties afforded to states via the Every Student Succeeds Act (ESSA), educators, education leaders, policymakers, and the like would do well to look to the past for guidance on what not to do — and what to do better. These legal cases can certainly inform such efforts.

Reference: Close, K., & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan, 100(1), 15-19. Retrieved from http://www.kappanonline.org/learning-from-what-doesnt-work-in-teacher-evaluation/

A “Next Generation” Vision for School, Teacher, and Student Accountability

Within a series of prior posts (see, for example, here and here), I have written about what the Every Student Succeeds Act (ESSA), passed in December of 2015, means for the U.S., or more specifically for states’ school and teacher evaluation systems, given the federal government’s prior mandates requiring the use of growth and value-added models (VAMs).

Related, states were recently (this past May) required to submit to the federal government their revised school and teacher evaluation plans, post-ESSA, indicating how those plans have changed, or not. I have a doctoral student currently gathering updated teacher evaluation data, state by state, and our preliminary findings indicate that “things” have not (yet) changed much post-ESSA, at least at the teacher level of focus in this study, and except in a few states (e.g., Connecticut, Oklahoma). Nonetheless, states still have the liberty to change what they do on both ends (i.e., school and teacher accountability).

A colleague recently shared with me a study titled “Next Generation Accountability: A Vision for School Improvement Under ESSA” that warrants coverage here, in hopes that states are still “out there” trying to reform their school and teacher evaluation systems, of course, for the better. While the document was drafted by folks from the aforementioned state of Oklahoma, who are also affiliated with the Learning Policy Institute, it is important to note that the document was also vetted by some “heavy hitters” in this line of research including, but not limited to, David C. Berliner (Arizona State University), Peter W. Cookson Jr. (American Institutes for Research (AIR)), Linda Darling-Hammond (Stanford University), and William A. Firestone (Rutgers University).

As per ESSA, states are to have increased opportunities “to develop innovative strategies for advancing equity, measuring success, and developing cycles of continuous improvement” while using “multiple measures to assess school and student performance” (p. iii). Likewise, the authors of this report state that “A broader spectrum of indicators, going well beyond a summary of annual test performance, seems necessary to account transparently for performance and assign responsibility for improvement.”

Here are some of their more specific recommendations that I found of value for blog followers:

  • The continued use of a single composite indicator to reduce and then sort teachers or schools by their overall effectiveness or performance (e.g., using teacher “effectiveness” categories or school A–F letter grades) is myopic, to say the least. This is because doing so (a) misses all that truly “matters,” including the multidimensional concepts and (non)cognitive competencies we want students to know and be able to do, which are not captured by large-scale tests; and (b) inhibits the usefulness of what may be informative, stand-alone data (i.e., as taken from “multiple measures” individually) once these data are reduced and collapsed so that they can be used for hierarchical categorizations and rankings. It also (c) very much trivializes the multiple causes of low achievement, which are also important and in much greater need of attention.
  • Accordingly, “Next Generation” accountability systems should include “a broad palette of functionally significant indicators to replace [such] single composite indicators [as this] will likely be regarded as informational rather than controlling, thereby motivating stakeholders to action” (p. ix). Stakeholders should be defined in the following terms…
  • “Next Generation” accountability systems should incorporate principles of “shared accountability,” whereby educational responsibility and accountability should be “distributed across system components and not foisted upon any one group of actors or stakeholders” (p. ix). “[E]xerting pressure on stakeholders who do not have direct control over [complex educational] elements is inappropriate and worse, harmful” (p. ix). Accordingly, the goal of “shared accountability” is to “create an accountability environment in which all participants [including governmental organizations] recognize their obligations and commitments in relation to each other” (p. ix) and their collective educational goals.
  • To facilitate this, “Next Generation” information systems should be designed and implemented in order to serve the “dual reporting needs of compliance with federal mandates and the particular improvement needs of a state’s schools,” while also addressing “the different information needs of state, district, school site leadership, teachers, and parents” (p. ix). Data may include, at minimum, data on school resources, processes, outcomes, and other nuanced indicators, and this information must be made transparent and accessible in order for all types of data users to be responsive, holistically and individually (e.g., at school or classroom levels). The formative functions of such “Next Generation” informational systems accordingly take priority, at least initially, until informational data can be used, with priority, to “identify and transform schools in catastrophic failure” (p. ix).
  • Related, all test- or other educational measurement-related components of states’ “Next Generation” statutes and policies should adhere to the Standards for Educational and Psychological Testing, and more specifically their definitions of reliability, validity, bias, fairness, and the like. Statutes and policies should also be written “in the least restrictive and prescriptive terms possible to allow for [continuous] corrective action and improvement” (p. x).
  • Finally, “Next Generation” accountability systems should adhere to the following five essentials: “(a) state, district, and school leaders must create a system-wide culture grounded in “learning to improve;” (b) learning to improve using [the aforementioned informational systems also] necessitates the [overall] development of [students’] strong pedagogical data-literacy skills; (c) resources in addition to funding—including time, access to expertise, and collaborative opportunities—should be prioritized for sustaining these ongoing improvement efforts; (d) there must be a coherent structure of state-level support for learning to improve, including the development of a strong Longitudinal Data System (LDS) infrastructure; and (e) educator labor market policy in some states may need adjustment to support the above elements” (p. x).

To read more, please access the full report here.

In sum, “Next Generation” accountability systems aim at “a loftier goal—universal college and career readiness—a goal that current accountability systems were not designed to achieve. To reach this higher level, next generation accountability must embrace a wider vision, distribute trustworthy performance information, and build support infrastructure, while eliciting the assent, support, and enthusiasm of citizens and educators” (p. vii).

As briefly noted prior, “a few states have been working to put more supportive, humane accountability systems in place, but others remain stuck in a compliance mindset that undermines their ability to design effective accountability systems” (p. vii). Perhaps (or perhaps likely) this is because, for the past decade or so, states invested so much time, effort, and money in “reforming” their prior teacher evaluation systems as formerly required by the federal government. This included investments in states’ growth models or VAMs, onto which many/most states seem to be holding firm.

Hence, while it seems that the residual effects of the federal government’s former efforts are still dominating states’ actions with regard to educational accountability, hopefully some states can at least begin to lead the way to what will likely yield the educational reform…still desired…

Nevada (Potentially) Dropping Students’ Test Scores from Its Teacher Evaluation System

This week in Nevada, “Lawmakers Mull[ed] Dropping Student Test Scores from Teacher Evaluations,” as per a recent article in The Nevada Independent (see here). This would be quite a move away from 2011, when the state (as backed by state Republicans, not backed by federal Race to the Top funds, and as inspired by Michelle Rhee) passed into policy a requirement that 50% of all Nevada teachers’ evaluations rely on said data. The current percentage rests at 20%, but it is set to double to 40% next year.

Nevada is one of a still uncertain number of states looking to retract the weight and purported “value-added” of such measures. Note also that last week Connecticut dropped some of the test-based components of its teacher evaluation system (see here). All of this is occurring, of course, post the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states are no longer required to set up teacher evaluation systems based in significant part on their students’ test scores.

Accordingly, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” The Democratic Assembly Speaker, for example, said that “he’s always been troubled that teachers are rated on standardized test scores,” more specifically noting: “I don’t think any single teacher that I’ve talked to would shirk away from being held accountable…[b]ut if they’re going to be held accountable, they want to be held accountable for things that … reflect their actual work.” I’ve never met a teacher who would disagree with this statement.

Anyhow, this past Monday the state’s Assembly Education Committee heard public testimony on these matters and three bills “that would alter the criteria for how teachers’ effectiveness is measured.” These three bills are as follows:

  • AB212 would prohibit the use of student test scores in evaluating teachers;
  • AB320 would eliminate statewide [standardized] test results as a measure but allow local assessments to account for 20 percent of the total evaluation; and
  • AB312 would ensure that teachers in overcrowded classrooms are not penalized on certain evaluation metrics deemed out of their control given the student-to-teacher ratio.

Many presented testimony in support of these bills over an extended period of time on Tuesday. I was also invited to speak, during which I “cautioned lawmakers against being ‘mesmerized’ by the promised objectivity of standardized tests. They have their own flaws, [I] argued, estimating that 90-95 percent of researchers who are looking at the effects of high-stakes testing agree that they’re not moving the dial [really whatsoever] on teacher performance.”

Lawmakers have until the end of tomorrow (i.e., Friday) to pass these bills out of committee. Otherwise, they will die.

Of course, I will keep you posted, but things are currently looking “very promising,” especially for AB320.

David Berliner on The Purported Failure of America’s Schools

My primary mentor, David Berliner (Regents Professor at Arizona State University (ASU)), wrote a blog post yesterday for the Equity Alliance Blog (also at ASU) on “The Purported Failure of America’s Schools, and Ways to Make Them Better” (click here to access the original blog post). See other posts about David’s scholarship on this blog here, here, and here. See also one of this blog’s best posts, which David also wrote, here, about “Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs).”

In sum, for many years David has been writing “about the lies told about the poor performance of our students and the failure of our schools and teachers.” For example, he wrote one of the education profession’s all-time classics and best sellers: The Manufactured Crisis: Myths, Fraud, And The Attack On America’s Public Schools (1995). If you have not read it, you should! All educators should read this book, in my opinion, but also in the opinion of many other iconic educational scholars throughout the U.S. (Paufler, Amrein-Beardsley, & Hobson, under revision for publication).

While the title of this book accurately captures its contents, more specifically it “debunks the myths that test scores in America’s schools are falling, that illiteracy is rising, and that better funding has no benefit. It shares the good news about public education.” I’ve found the contents of this book to still be my best defense when others with whom I interact attack America’s public schools, attacks that are often misinformed and perpetuated by many American politicians and journalists.

In this blog post David, once again, debunks many of these myths surrounding America’s public schools using more up-to-date data from international tests, our country’s National Assessment of Educational Progress (NAEP), state-level SAT and ACT scores, and the like. He reminds us of how student characteristics “strongly influence the [test] scores obtained by the students” at any school and, accordingly, “strongly influence” or bias these scores when used in any aggregate form (e.g., to hold teachers, schools, districts, and states accountable for their students’ performance).

He reminds us that “in the US, wealthy children attending public schools that serve the wealthy are competitive with any nation in the world…[but in]…schools in which low-income students do not achieve well, [that are not competitive with many nations in the world] we find the common correlates of poverty: low birth weight in the neighborhood, higher than average rates of teen and single parenthood, residential mobility, absenteeism, crime, and students in need of special education or English language instruction.” These societal factors explain poor performance much more (i.e., more variance explained) than any school-level or, as pertinent to this blog, teacher-level factor (e.g., teacher quality as measured by large-scale standardized test scores).

In this post, David reminds us of much, much more that we need to remember, and often recall, in defense of our public schools and in support of their futures (e.g., research-based notes to help “fix” some of our public schools).

Again, please do visit the original blog post here to read more.

Center on the Future of American Education, on America’s “New and Improved” Teacher Evaluation Systems

Thomas Toch — education policy expert and research fellow at Georgetown University, and founding director of the Center on the Future of American Education — just released, as part of the Center, a report titled Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. He sent it to me for my thoughts, and I decided to summarize those thoughts here, with thanks and all due respect to the author, as clearly we are on different sides of the spectrum in terms of the literal “value” America’s new teacher evaluation systems might in fact “add” to the reformation of America’s public schools.

While it is quite a long and meaty report, here are some of the points that I think are important to address publicly:

First, is it true that prior teacher evaluation systems (which were almost entirely, if not entirely, based on teacher observations) yielded “satisfactory ratings” for nearly every teacher? Indeed, this is true. However, what we have seen since 2009, when states began to adopt what were then (and in many ways still are) viewed as America’s “new and improved” or “strengthened” teacher evaluation systems, is that for 70% of America’s teachers, these evaluation systems are still based only on the observational indicators used prior, because value-added estimates are calculable for only 30% of America’s teachers. As also noted in this report, it is for these 70% that “the superficial teacher [evaluation] practices of the past” (p. 2) will remain the same, although I disagree with this particular adjective, especially when these measures are used for formative purposes. While certainly imperfect, these are not simply “flimsy checklists” of no use or value. There is, indeed, much empirical research to support this assertion.

Likewise, these observational systems have not really changed since 2009, or 1999 for that matter (not that they could change all that much); but they are not in their “early stages” (p. 2) of development. Indeed, this includes the Danielson Framework, explicitly propped up in this piece as an exemplar, regardless of the fact that it has been used across states and districts for decades and is still not functioning as intended, especially when summative decisions about teacher effectiveness are to be made (see, for example, here).

Hence, in some states and districts (sometimes via educational policy), principals or other observers are now being asked, or required, to deliberately assign teachers to lower observational categories, or to assign approximate proportions of teachers per observational category used. Where the instrument might not distribute scores “as currently needed,” one way to game the system is to tell principals, for example, that they should allot only X% of teachers to each of the three-to-five categories most often used across said instruments. In fact, in an article one of my doctoral students and I have forthcoming, we have termed this, with empirical evidence, the “artificial deflation” of observational scores, as externally persuaded or required. Worse is that this sometimes signals to the greater public that these “new and improved” teacher evaluation systems are being used for more discriminatory purposes (i.e., to actually differentiate between good and bad teachers on some sort of discriminating continuum), or that, indeed, there is a normal distribution of teachers as per their levels of effectiveness. While certainly there is some type of distribution, no evidence exists whatsoever to suggest that those who fall on the wrong side of the mean are, in fact, ineffective, and vice versa. It’s all relative, seriously, and unfortunately.

Related, the goal here is really not to “thoughtfully compare teacher performances,” but to evaluate teachers as per a set of criteria against which they can be evaluated and judged (i.e., whereby criterion-referenced inferences and decisions can be made). Inversely, comparing teachers in norm-referenced ways, however (socially) Darwinian and however resonant with some, does not necessarily work either. This is precisely what the authors of The Widget Effect report did, after which they argued for wide-scale system reform, so that increased discrimination among teachers, and reduced indifference on the part of evaluating principals, could occur. However, as also evidenced in the aforementioned article, the increasing presence of normal curves illustrating “new and improved” teacher observational distributions does not necessarily mean anything normal.

And were these systems rarely used prior to fire teachers? Perhaps, although there are no data to support such assertions, either. This very argument was at the heart of the Vergara v. California case (see, for example, here) — that teacher tenure laws, as well as laws protecting teachers’ due process rights, were keeping “grossly ineffective” teachers teaching in the classroom. Again, while no expert on either side could produce for the Court any hard numbers regarding how many “grossly ineffective” teachers were in fact being protected by such archaic rules and procedures, I would estimate (based on my years of experience as a teacher) that this number is much lower than many believe it (and perhaps perpetuate it) to be. In fact, I recall only one teacher, who taught with me in a highly urban school, whom I would have classified as grossly ineffective and who was also tenured. He was ultimately fired, and was quite easy to fire, as he also knew that he just didn’t have it.

Now to be clear, here, I do think that not just “grossly ineffective” but also simply “bad teachers” should be fired, but the indicators used to do this must yield valid inferences, as based on the evidence, as critically and appropriately consumed by the parties involved, after which valid and defensible decisions can and should be made. Whether one calls this due process in a proactive sense, or a wrongful termination suit in a retroactive sense, what matters most, though, is that the evidence supports the decision. This is the very issue at the heart of many of the lawsuits currently ongoing on this topic, as many of you know (see, for example, here).

Finally, where is the evidence, I ask, for many of the declarations included within and throughout this report? A review of the 133 endnotes, for example, reveals only a very small handful of references to the larger literature on this topic (see a very comprehensive list of this literature here, here, and here). This is also highly problematic in this piece, as only the usual suspects (e.g., Sandi Jacobs, Thomas Kane, Bill Sanders) are cited to support the assertions advanced.

Take, for example, the following declaration: “a large and growing body of state and local implementation studies, academic research, teacher surveys, and interviews with dozens of policymakers, experts, and educators all reveal a much more promising picture: The reforms have strengthened many school districts’ focus on instructional quality, created a foundation for making teaching a more attractive profession, and improved the prospects for student achievement” (p. 1). Where is the evidence? There is no such evidence, and none published in high-quality, scholarly, peer-reviewed journals of which I am aware. Again, publications released by the National Council on Teacher Quality (NCTQ) and from the Measures of Effective Teaching (MET) studies, as still not externally reviewed and still considered internal technical reports with “issues,” don’t necessarily count. Accordingly, no such evidence has been introduced, by either side, in any court case in which I am involved, likely because such evidence does not exist, again, empirically and at some unbiased, vetted, and/or generalizable level. While Thomas Kane has introduced some of his MET study findings in the cases in Houston and New Mexico, these might be some of the easiest pieces of evidence to target, accordingly, given the issues.

Otherwise, the only thing in this piece with which I agree, and which I view as true and good given the research literature, is that teachers are now being observed more often, by more people, in more depth, and, in perhaps some cases, with better observational instruments. Accordingly, teachers, also as per the research, seem to appreciate and enjoy the additional and more frequent/useful feedback and discussions about their practice, as increasingly offered. This, I would agree, is something very positive that has come out of the nation’s policy-based focus on its “new and improved” teacher evaluation systems, again, as largely required by the federal government, especially pre-Every Student Succeeds Act (ESSA).

Overall, and in sum, “the research reveals that comprehensive teacher-evaluation models are stronger than the sum of their parts.” Unfortunately, however, this is untrue, in that systems based on multiple measures are limited by the indicator that, in educational measurement terms, performs the worst. While such a holistic view is ideal, in measurement terms the sum of the parts is constrained by the weakest part. This is currently the value-added indicator (i.e., the one with the lowest levels of reliability and, relatedly, issues with validity and bias) — the indicator at issue within this particular blog, and the indicator of most interest, as it is this indicator that has truly changed our overall approaches to the evaluation of America’s teachers. It has yet to deliver, however, especially if it is to be used for high-stakes, consequential decision-making purposes (e.g., incentives, getting rid of “bad apples”).
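
To illustrate this measurement point in rough terms, here is a minimal, hypothetical simulation of a two-part composite in which a fairly reliable observation score is combined, 50/50, with a much noisier value-added score. The reliabilities and weights are assumptions for illustration only, not figures from the report or from this blog; the point is simply that the noisy component drags the composite’s year-to-year reliability well below that of the stronger indicator on its own.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# A teacher's stable "true" effectiveness, plus two observed indicators:
# an observation score with modest noise and a VAM score with heavy noise.
# The implied reliabilities are assumptions for illustration only.
true_skill = rng.standard_normal(n)

def two_years(error_sd):
    """Return the same indicator measured in two consecutive years."""
    return (true_skill + rng.normal(0, error_sd, n),
            true_skill + rng.normal(0, error_sd, n))

obs_y1, obs_y2 = two_years(error_sd=0.5)   # fairly reliable indicator
vam_y1, vam_y2 = two_years(error_sd=2.0)   # noisy indicator

def reliability(y1, y2):
    """Year-to-year correlation as a rough proxy for reliability."""
    return np.corrcoef(y1, y2)[0, 1]

# A 50/50 composite, similar in spirit to policies weighting VAMs heavily.
comp_y1 = 0.5 * obs_y1 + 0.5 * vam_y1
comp_y2 = 0.5 * obs_y2 + 0.5 * vam_y2

print("Observation reliability:", round(reliability(obs_y1, obs_y2), 2))
print("VAM reliability:        ", round(reliability(vam_y1, vam_y2), 2))
print("Composite reliability:  ", round(reliability(comp_y1, comp_y2), 2))
```

In this particular setup the composite’s reliability lands near the midpoint of the two indicators, i.e., well below the observation score alone, which is the sense in which a weak but heavily weighted indicator constrains the whole system.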

Feel free to read more here, as publicly available: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. See also other claims regarding the benefits of said systems within (e.g., these systems as foundations for new teacher roles and responsibilities, smarter employment decisions, prioritizing classrooms, increased focus on improved standards). See also the recommendations offered, some with which I agree on the observational side (e.g., ensuring that teachers receive multiple observations during a school year by multiple evaluators), and none with which I agree on the value-added side (e.g., using at least two years of student achievement data in teacher evaluation ratings; rather, researchers agree that three years of value-added data are needed, as based on at least four years of student-level test data). There are, of course, many other recommendations included. You all can be the judges of those.

Special Issue of “Educational Researcher” (Paper #9 of 9): Amidst the “Blooming Buzzing Confusion”

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the last of the nine articles (#9 of 9), which is actually a commentary titled “Value Added: A Case Study in the Mismatch Between Education Research and Policy.” This commentary is authored by Stephen Raudenbush – Professor of Sociology and Public Policy Studies at the University of Chicago.

Like with the last two commentaries reviewed here and here, Raudenbush writes of the “Special Issue” that, in this topical area, “[r]esearchers want their work to be used, so we flirt with the idea that value-added research tells us how to improve schooling…[Luckily, perhaps] this volume has some potential to subdue this flirtation” (p. 138).

Raudenbush positions the research covered in this “Special Issue,” as well as the research on teacher evaluation and education in general, as being conducted amidst the “blooming buzzing confusion” (p. 138) surrounding the messy world through which we negotiate life. This is why “specific studies don’t tell us what to do, even if they sometimes have large potential for informing expert judgment” (p. 138).

With that being said, “[t]he hard question is how to integrate the new research on teachers with other important strands of research [e.g., effective schools research] in order to inform rather than distort practical judgment” (p. 138). Echoing Susan Moore Johnson’s sentiments, reviewed as article #6 here, this is appropriately hard if we are to augment versus undermine “our capacity to mobilize the ‘social capital’ of the school to strengthen the human capital of the teacher” (p. 138).

On this note, and “[i]n sum, recent research on value added tells us that, by using data from student perceptions, classroom observations, and test score growth, we can obtain credible evidence [albeit weakly related evidence, referring to the Bill & Melinda Gates Foundation’s MET studies] of the relative effectiveness of a set of teachers who teach similar kids [emphasis added] under similar conditions [emphasis added]…[Although] if a district administrator uses data like that collected in MET, we can anticipate that an attempt to classify teachers for personnel decisions will be characterized by intolerably high error rates [emphasis added]. And because districts can collect very limited information, a reliance on district-level data collection systems will [also] likely generate…distorted behavior[s]…in which teachers attempt to ‘game’ the comparatively simple indicators,” or system (p. 138-139).

Accordingly, “[a]n effective school will likely be characterized by effective ‘distributed’ leadership, meaning that expert teachers share responsibility for classroom observation, feedback, and frequent formative assessments of student learning. Intensive professional development combined with classroom follow-up generates evidence about teacher learning and teacher improvement. Such local data collection efforts [also] have some potential to gain credibility among teachers, a virtue that seems too often absent” (p. 140).

This might be at least a significant part of the solution.

“If the school is potentially rich in information about teacher effectiveness and teacher improvement, it seems to follow that key personnel decisions should be located firmly at the school level…This sense of collective efficacy [accordingly] seems to be a key feature of…highly effective schools” (p. 140).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here; and see the Review of Article #8, Part I – on a more research-based assessment of VAMs’ potentials here and Part II on “a modest solution” provided to us by Linda Darling-Hammond here.

Article #9 Reference: Raudenbush, S. W. (2015). Value added: A case study in the mismatch between education research and policy. Educational Researcher, 44(2), 138-141. doi:10.3102/0013189X15575345

Special Issue of “Educational Researcher” (Paper #8 of 9, Part II): A Modest Solution Offered by Linda Darling-Hammond

One of my prior posts was about the peer-reviewed journal Educational Researcher (ER)’s “Special Issue” on VAMs and the commentary titled “Can Value-Added Add Value to Teacher Evaluation?” contributed to the “Special Issue” by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

In this post, I noted that Darling-Hammond “added” a lot of “value” in one particular section of her commentary, in which she offered a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given that it is rare in this area of research to focus on actual solutions, and given that this section is a must-read, I paste this small section here for you all to read (and bookmark, especially if you are currently grappling with how to develop good evaluation systems that must meet external mandates requiring VAMs).

Here is Darling-Hammond’s “Modest Proposal” (p. 135-136):

What if, instead of insisting on the high-stakes use of a single approach to VAM as a significant percentage of teachers’ ratings, policymakers were to acknowledge the limitations that have been identified and allow educators to develop more thoughtful approaches to examining student learning in teacher evaluation? This might include sharing with practitioners honest information about imprecision and instability of the measures they receive, with instructions to use them cautiously, along with other evidence that can help paint a more complete picture of how students are learning in a teacher’s classroom. An appropriate warning might alert educators to the fact that VAM ratings based on state tests are more likely to be informative for students already at grade level, and least likely to display the gains of students who are above or below grade level in their knowledge and skills. For these students, other measures will be needed.

What if teachers could create a collection of evidence about their students’ learning that is appropriate for the curriculum and students being taught and targeted to goals the teacher is pursuing for improvement? In a given year, one teacher’s evidence set might include gains on the vertically scaled Developmental Reading Assessment she administers to students, plus gains on the English language proficiency test for new English learners, and rubric scores on the beginning and end of the year essays her grade level team assigns and collectively scores.

Another teacher’s evidence set might include the results of the AP test in Calculus with a pretest on key concepts in the course, plus pre- and posttests on a unit regarding the theory of limits which he aimed to improve this year, plus evidence from students’ mathematics projects using trigonometry to estimate the distance of a major landmark from their home. VAM ratings from a state test might be included when appropriate, but they would not stand alone as though they offered incontrovertible evidence about teacher effectiveness.

Evaluation ratings would combine the evidence from multiple sources in a judgment model, as Massachusetts’ plan does, using a matrix to combine and evaluate several pieces of student learning data, and then integrate that rating with those from observations and professional contributions. Teachers receive low or high ratings when multiple indicators point in the same direction. Rather than merely tallying up disparate percentages and urging administrators to align their observations with inscrutable VAM scores, this approach would identify teachers who warrant intervention while enabling pedagogical discussions among teachers and evaluators based on evidence that connects what teachers do with how their students learn. A number of studies suggest that teachers become more effective as they receive feedback from standards-based observations and as they develop ways to evaluate their students’ learning in relation to their practice (Darling-Hammond, 2013).

If the objective is not just to rank teachers and slice off those at the bottom, irrespective of accuracy, but instead to support improvement while providing evidence needed for action, this modest proposal suggests we might make more headway by allowing educators to design systems that truly add value to their knowledge of how students are learning in relation to how teachers are teaching.

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here; and see the Review of Article #8, Part I – on a more research-based assessment of VAMs’ potentials here.

Article #8, Part II Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these things is actually true (or near to true, nor will they likely ever be true) in educational practice, however. Hence the errors we continue to observe, which continue to prevent VAMs from being used for their intended utilities, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level, given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include the following: VAM output are grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output are “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticism is impacting the extent to which they use and value their value-added data; they note that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Related, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.
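
To give a rough sense of the imprecision point above (Corcoran’s percentile-range example), here is a minimal, hypothetical simulation, not Corcoran’s actual method or data. It assumes, purely for illustration, that true teacher effects and single-year estimation noise are of roughly equal size, and then asks where the true effectiveness of a teacher whose observed VAM score lands near the 43rd percentile might plausibly rank.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Assumed (for illustration only): true effects and single-year noise
# are of equal magnitude, i.e., a reliability of about 0.5.
true_effect = rng.normal(0, 1.0, n)
observed = true_effect + rng.normal(0, 1.0, n)

# Percentile ranks on each scale (fraction of teachers ranked below).
obs_rank = observed.argsort().argsort() / n
true_rank = true_effect.argsort().argsort() / n

# Among teachers whose OBSERVED score sits near the 43rd percentile,
# look at the spread of their TRUE percentile ranks.
near_43rd = np.abs(obs_rank - 0.43) < 0.01
low, high = np.percentile(true_rank[near_43rd], [16, 84])
print(f"Observed rank ~43rd percentile; middle 68% of true ranks: "
      f"{low:.0%} to {high:.0%}")
```

Under these assumed conditions, the middle two-thirds of plausible true ranks spans roughly the 20th to the 70th percentile, which is in the spirit of, though not identical to, Corcoran’s 15th-to-71st-percentile example.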

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given that it is rare in this area of research that we can focus on actual solutions, this section is a must-read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (p. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#4 of 9) here, titled “Make Room Value-Added: Principals’ Human Capital Decisions and the Emergence of Teacher Observation Data.” This one is authored by Ellen Goldring, Jason A. Grissom, Christine Neumerski, Marisa Cannata, Mollie Rubin, Timothy Drake, and Patrick Schuermann, all of whom are associated with Vanderbilt University.

This article is primarily about (1) the extent to which the data generated by “high-quality observation systems” can inform principals’ human capital decisions (e.g., teacher hiring, contract renewal, assignment to classrooms, professional development), and (2) the extent to which principals are relying less on test scores derived via value-added models (VAMs), when making the same decisions, and why. Here are some of their key (and most important, in my opinion) findings:

  • Principals across all school systems revealed major hesitations and challenges regarding the use of VAM output for human capital decisions. Barriers preventing VAM use included the timing of data availability (e.g., the fall), which is well after human capital decisions are made (p. 99).
  • VAM output are too far removed from the practice of teaching (p. 99), and this lack of instructional sensitivity impedes, if not entirely prevents, their actual versus hypothetical use for school/teacher improvement.
  • “Principals noted they did not really understand how value-added scores were calculated, and therefore they were not completely comfortable using them” (p. 99). Likewise, principals reported that because teachers did not understand how the systems worked either, teachers did not use VAM output data either (p. 100).
  • VAM output are not transparent when used to determine compensation, and especially when used to evaluate teachers teaching nontested subject areas. In districts that use school-wide VAM output to evaluate teachers in nontested subject areas, in fact, principals reported regularly ignoring VAM output altogether (p. 99-100).
  • “Principals reported that they perceived observations to be more valid than value-added measures” (p. 100); hence, principals reported using observational output much more, again, in terms of making human capital decisions and making such decisions “valid” (p. 100).
  • “One noted exception to the use of value-added scores seemed to be in the area of assigning teachers to particular grades, subjects, and classes. Many principals mentioned they use value-added measures to place teachers in tested subjects and with students in grade levels that ‘count’ for accountability purpose…some principals [also used] VAM [output] to move ineffective teachers to untested grades, such as K-2 in elementary schools and 12th grade in high schools” (p. 100).

Of special note here is also the following finding: “In half of the systems [in which researchers investigated these systems], there [was] a strong and clear expectation that there be alignment between a teacher’s value-added growth score and observation ratings…Sometimes this was a state directive and other times it was district-based. In some systems, this alignment is part of the principal’s own evaluation; principals receive reports that show their alignment” (p. 101). In other words, principals are being evaluated and held accountable given the extent to which their observations of their teachers match their teachers’ VAM-based data. If misalignment is noticed, it is not taken to be the fault of either measure (e.g., in terms of measurement error); rather, it is taken to be the fault of the principal, who is critiqued for inaccuracy and therefore (inversely) incentivized to skew the observational data (the only data over which the supervisor has control) to artificially match the VAM-based output. This clearly distorts validity, or rather the validity of the inferences that are to be made using such data. Appropriately, principals also “felt uncomfortable [with this] because they were not sure if their observation scores should align primarily…with the VAM” output (p. 101).

“In sum, the use of observation data is important to principals for a number of reasons: It provides a “bigger picture” of the teacher’s performance, it can inform individualized and large group professional development, and it forms the basis of individualized support for remediation plans that serve as the documentation for dismissal cases. It helps principals provides specific and ongoing feedback to teachers. In some districts, it is beginning to shape the approach to teacher hiring as well” (p. 102).

The only significant weakness with this piece, again in my opinion, is that the authors write that these observational data, at focus in this study, are “new,” thanks to recent federal initiatives. They write, for example, that “data from structured teacher observations—both quantitative and qualitative—constitute a new [emphasis added] source of information principals and school systems can utilize in decision making” (p. 96). They are also “beginning to emerge [emphasis added] in the districts…as powerful engines for principal data use” (p. 97). I would beg to differ, as these systems have not changed much over time, before or after these federal initiatives, contrary to what is (without evidence or warrant) claimed by these authors herein. See, for example, Table 1 on p. 98 of the article to see whether what they have included within the list of components of such new and “complex, elaborate teacher observation systems” is actually new or much different than most of the observational systems in use prior. As an aside, one such system in use and at issue in this examination is one with which I am familiar, in use in the Houston Independent School District. Click here to also see if this system is more “complex” or “elaborate” than such systems prior.

Also recall that one of the key reports that triggered the current call for VAMs, as the “more objective” measures needed to measure and thereafter improve teacher effectiveness, was based on data suggesting that “too many teachers” were being rated as satisfactory or above. The observational systems in use then are essentially the same observational systems still in use today (see “The Widget Effect” report here). This stands in stark contradiction to the authors’ claims throughout this piece, for example, when they write that “Structured teacher observations, as integral components of teacher evaluations, are poised to be a very powerful lever for changing principal leadership and the influence of principals on schools, teachers, and learning.” Such claims run counter to all that is in, and all that came from, “The Widget Effect” report here.

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; and see the Review of Article #3 – on VAMs’ potentials here.

Article #4 Reference: Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. doi:10.3102/0013189X15575031

Special Issue of “Educational Researcher” (Paper #3 of 9): Exploring VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#3 of 9) here, titled “Exploring the Potential of Value-Added Performance Measures to Affect the Quality of the Teacher Workforce,” as authored by Dan Goldhaber – Professor at the University of Washington Bothell, Director of the National Center for Analysis of Longitudinal Data in Education Research (CALDER), and a Vice-President at the American Institutes for Research (AIR). AIR is one of our largest VAM consulting/contract firms, and Goldhaber is, accordingly, perhaps one of the field’s most vocal proponents of VAMs, also self-described as an “advocate of using value-added measurements carefully to inform some high-stakes decisions” (see original reference here). Hence, it makes sense that he writes about VAMs’ potentials herein.

Here’s what he has to add to the conversation, specifically about “the various mechanisms through which the use of value added might affect teacher quality and…what we know empirically about the potential of each mechanism” (p. 87).

Most importantly in this piece, and in my opinion, Goldhaber discusses the “[s]everal [which turns out to be two] studies that simulate the effects of using value-added estimates for high-stakes purposes [and] suggest there may be significant student achievement benefits” (p. 88). Here are the two passages offered in support of these benefits, as defined and claimed:

  • “There is evidence that high value-added teachers are perceived to engage in better teaching practices, and they are valued by principals as reflected in formal evaluations (Harris, Ingle, & Rutledge, 2014)” (p. 88). Contrary to Goldhaber’s interpretation of this claim, however, these authors actually found “that some principals give high value-added teachers low ratings,” which implies that the opposite also occurs (i.e., inconsistencies in ratings), and “that teacher value-added measures and informal principal evaluations are positively, but weakly [emphasis added], correlated.” Both of these actual results put a different spin on the study than Goldhaber’s interpretation suggests (see the cited study here).
  • “Perhaps most importantly, value added is also associated with long-term schooling (e.g., high school graduation and college-going), labor market (e.g., earnings), and nonacademic outcomes (e.g., teen pregnancy) (Chetty, Friedman, & Rockoff, 2014)” (p. 88). As you all likely recall, this study is of much controversy (see prior posts on this study here, here, here, and here.)

Otherwise, Goldhaber explores “the various mechanisms through which the use of value added might [emphasis in the original] affect teacher quality and describe[s] what we know empirically about the potential of [emphasis added] each mechanism.” The word “might” (with or without emphasis added) is notably used throughout this manuscript, as is the word “assumption,” albeit less often, which leaves us with not much more than a clear impression that most of what is offered in this piece is still conjecture.

I write this even though some of the other research cited in this piece is peripherally related, for example, given what we know from labor economics. “We know” that “teachers who believe they will be effective and rewarded for their effectiveness are more likely to see teaching as a desirable profession” (p. 89). But do we really know this? Little mention is made of our current reality, in which we witness real and deleterious effects, for example, as teacher educators working with potential/future teachers who almost daily express serious concerns about joining a profession with very little autonomy, not much respect, and a stark increase in draconian measures used to hold them accountable for that which they do, or do not, do well. No mention is made, either, of the prospective teachers who have now chosen not to enter teacher education at all, pre-profession, for similar reasons. “On the other hand, the use of value-added performance measures might lead to positive changes in the perception of teachers, making teaching a more prestigious profession and hence leading more people to pursue a teaching career” (p. 89). Hmm…

Nonetheless, these conjectures are categorized into sections about how VAMs might help us to (1) change the supply of people who opt into pursuing a teaching career and who are selected into the labor market, (2) change the effectiveness of those currently teaching, and (3) change which teachers elect to, or are permitted to, stay in teaching. Unfortunately again, however, there’s not much else in terms of research-based statements (other than the two articles briefly mentioned in this manuscript, bulleted above) that Goldhaber “adds” in terms of “value” regarding the “Potential of Value-Added Performance Measures.”

I write this with some regret, in that it would be fine with me if this thing actually worked and, more importantly, helped any of the three above desired outcomes come to fruition, or helped teachers improve their professional practice, professional selves, and the like. Indeed, in theory, this should work, but it doesn’t…yet. I write “yet” here with serious reservations about whether VAMs will ever accomplish that which they have been tasked to do, largely via educational policies.

Relatedly, and on this point we agree, “teacher pay incentives is one area that we know a good deal about, based on analysis of actual policy variation, and the results are not terribly promising…experiments generally show performance bonuses, a particular form of pay for performance, have no significant student achievement effects, whether the bonus is rewarded at the individual teacher level” (p. 89). We disagree, though, again on Goldhaber’s conjectures for the future, that “there are several reasons why it is premature to write off pay for performance entirely…” (p. 89; see also a prior post here on a study Goldhaber (overly) cites in support of some of his later claims).

In the end (which is actually near the beginning of the manuscript), Goldhaber notes that VAMs are “distinct” as compared to classroom observations because they offer “an objective measure that does not rely on human interpretation of teacher practices, and by design, [they offer] a system in which teachers are evaluated relative to one another rather than relative to an absolute standard (i.e., it creates a distribution in which teachers can be ranked). It is also a more novel measure” (p. 88). As just stated, in theory, this should work, but it just doesn’t…yet.

In the actual end (actually in terms of Goldhaber’s conclusions) he suggests we “Take a Leap of Faith?” (p. 90). I, for one, am not jumping.

*****

If interested, see the Review of Article #1 – the introduction to the special issue here and the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here.

Article #3 Reference: Goldhaber, D. (2015). Exploring the potential of value-added performance measures to affect the quality of the teacher workforce. Educational Researcher, 44(2), 87-95. doi:10.3102/0013189X15574905