The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft — an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here) — and his co-author Matthew P. Steinberg — an Assistant Professor of Education Policy at the University of Pennsylvania — just published another article in the same journal: “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and, at least for now, freely accessible article here; see also its original and what should be its enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matters; it matters differently for different types of teachers within and across school districts and states; and it matters in that these weights are so often arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the consequences attached to these measures, given the arbitrary and political constructions they represent, can be both career changing (professionally) and life changing (personally). How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).
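Steinberg and Kraft’s core finding lends itself to a simple illustration. The sketch below is a hypothetical simulation (the score distributions, weights, and proficiency cutoff are my own assumptions, not the authors’ actual MET data or models): it shows how shifting weight from a higher-mean observational measure to a norm-referenced, lower-mean VAM lowers the share of teachers who clear a fixed proficiency threshold.

```python
import random

random.seed(1)

# Hypothetical distributions (assumptions, not MET data):
# observation scores skew high; VAM scores are normalized around the middle.
teachers = [
    {"obs": min(100, max(0, random.gauss(80, 8))),
     "vam": min(100, max(0, random.gauss(50, 15)))}
    for _ in range(10_000)
]

def proficiency_rate(vam_weight, cutoff=60):
    """Share of teachers whose weighted composite meets the cutoff."""
    obs_weight = 1 - vam_weight
    passing = sum(
        1 for t in teachers
        if obs_weight * t["obs"] + vam_weight * t["vam"] >= cutoff
    )
    return passing / len(teachers)

for w in (0.1, 0.3, 0.5):
    print(f"VAM weight {w:.0%}: {proficiency_rate(w):.1%} rated proficient")
```

Because the VAM scores are centered near the middle of the scale while observation scores skew high, every point of weight moved onto the VAM pulls the composite mean down, and with it the proficiency rate.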

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Thereafter, some states are celebrated over others (e.g., by the National Council on Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously when, as evidenced herein, this is (unfortunately) due more to manipulation than to true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It is not necessarily that teacher effects vary in reality; rather, the manipulation of the weights on the back end causes such variation and then leads, quite literally, to such delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also raises a related and important issue of fairness, whereby equally effective teachers may be five-or-so times more likely (e.g., in states like New Mexico) to be rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).

Finally, it should also be noted as an important caveat here that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences” (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Related, please recall that via the MET studies one of the researchers’ goals was to determine which weights per multiple measure were empirically defensible. MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might say that this weight here or that weight there is empirically defensible should be asked for the evidence in support.

Citations:

Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Educational Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–396. doi:10.3102/0013189X17726752 Retrieved from http://journals.sagepub.com/doi/abs/10.3102/0013189X17726752

The Elephant in the Room – Fairness

While VAMs have many issues pertaining, fundamentally, to their levels of reliability, validity, and bias, they are also fundamentally unfair. This is one thing that is so very important but so rarely discussed when those external to VAM-based metrics and their use debate, mainly, the benefits of VAMs.

Issues of “fairness” arise when a test, or more likely its summative (i.e., summary and sometimes consequential) and formative (i.e., informative) uses, impact some more than others in unfair yet often important ways. In terms of VAMs, the main issue here is that VAM-based estimates can be produced for only approximately 30-40% of all teachers across America’s public schools. The other 60-70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers), cannot altogether be evaluated or “held accountable” using teacher- or individual-level VAM data.

Put differently, what VAM-based data provide, in general, “are incredibly imprecise and inconsistent measures of supposed teacher effectiveness for only a tiny handful [30-40%] of teachers in a given school” (see reference here). But this is often entirely overlooked, not only in the debates surrounding VAM use (and abuse) but also in the discussions surrounding how many taxpayer-derived funds are still being used to support such a (purportedly) reformatory overhaul of America’s public education system. The fact of the matter is that VAMs directly impact only a minority of America’s teachers, albeit a sizable one.

While some states and districts are rushing into adopting “multiple measures” to alleviate at least some of these issues with fairness, what state and district leaders don’t entirely understand is that this, too, is grossly misguided. Should any of these states and districts also tie serious consequences to such output (e.g., merit pay, performance plans, teacher termination, denial of tenure), or tie serious consequences to measures of growth derived via any of the varieties of “multiple assessments” that can be pulled from increasingly prevalent multiple assessment “menus,” these states and districts are also setting themselves up for lawsuits…no joke! Starting with the basic psychometrics, and moving on to the (entire) lack of research in support of using more “off-the-shelf” tests to help alleviate issues with fairness, would be the (easy) approach to take in a court of law as, really, doing any of this is entirely wrong.

School-level value-added is also being used to address the issue of “fairness,” just less frequently now than before, given the aforementioned “multiple assessment” trends. Regardless, many states and districts also continue to attribute a school-level aggregate score to teachers who do not primarily teach reading/language arts or mathematics, primarily in grades 3-8. That’s right: a majority of teachers receive a value-added score that is based on students whom they do not teach. This also calls for legal recourse, also in that this has been a contested issue within all of the lawsuits in which I’ve thus far been engaged.

Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers from racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of those relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

The authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise, it’s also possibly prejudicial.

While study authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington, DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Miami-Dade, Florida’s Recent “Symbolic” and “Artificial” Teacher Evaluation Moves

Last spring, Eduardo Porter — writer of the Economic Scene column for The New York Times — wrote an excellent article, from an economics perspective, about our current obsession in educational policy with “Grading Teachers by the Test” (see also my prior post about this article here, although you should give the article a full read; it’s well worth it). In short, Porter wrote about what economists often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the great (e.g., high-stakes) value (mis)placed on any measure, and the distortion (i.e., artificial inflation or deflation, depending on the desired direction of the measure) that often-to-always comes about as a result.

Well, it’s happened again, this time in Miami-Dade, Florida, where the Miami-Dade district’s teachers are saying it’s now “getting harder to get a good evaluation” (see the full article here). Apparently, teachers’ evaluation scores, from last year to this year, are being “dragged down,” primarily given their students’ performances on tests (including tests of subject areas and of students whom they do not teach).

“In the weeks after teacher evaluations for the 2015-16 school year were distributed, Miami-Dade teachers flooded social media with questions and complaints. Teachers reported similar stories of being evaluated based on test scores in subjects they don’t teach and not being able to get a clear explanation from school administrators. In dozens of Facebook posts, they described feeling confused, frustrated and worried. Teachers risk losing their jobs if they get a series of low evaluations, and some stand to gain pay raises and a bonus of up to $10,000 if they get top marks.”

As per the figure also included in this article, see the illustration below of how this is occurring; that is, how it is becoming more difficult for teachers to get “good” overall evaluation scores but also, and more importantly, how it is becoming more common for districts to simply reset cut scores and thereby artificially shift teachers’ overall evaluation ratings.

[Figure omitted; see the full Miami Herald article for the accompanying illustration.]

“[Teachers in] Miami-Dade say the problems with the evaluation system have been exacerbated this year as the number of points needed to get the “highly effective” and “effective” ratings has continued to increase. While it took 85 points on a scale of 100 to be rated a highly effective teacher for the 2011-12 school year, for example, it now takes 90.4.”

This, as mentioned prior, is something called “artificial deflation,” whereby the quality of teaching is likely not changing nearly to the extent the data might illustrate. Rather, what is happening behind the scenes (e.g., the manipulation of cut scores) is giving the impression that the overall teacher evaluation system is in fact becoming better, more rigorous, aligned with policymakers’ “higher standards,” etc.
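The arithmetic behind this kind of deflation is simple. In the sketch below, the 85 and 90.4 cut scores come from the article; the example teacher’s composite score of 88 is hypothetical.

```python
def rating(score, highly_effective_cut):
    """Classify a 0-100 composite evaluation score against a movable cut score."""
    return "highly effective" if score >= highly_effective_cut else "lower rating"

# Hypothetical teacher whose measured performance is identical in both years.
teacher_score = 88.0

print(rating(teacher_score, 85.0))   # 2011-12 threshold -> "highly effective"
print(rating(teacher_score, 90.4))   # current threshold -> "lower rating"
```

Same score, different label: the apparent decline in teacher quality is produced entirely by moving the cut score, not by any change in the teaching.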

This is something in the educational policy arena that we also call “symbolic policy,” whereby nothing really instrumental or material is happening; everything is a facade, concealing a less pleasant or creditable reality that nothing, in fact, has changed.

Citation: Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. The Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html#storylink=cpy

New Mexico: Holding Teachers Accountable for Missing More Than 3 Days of Work

One state that seems to still be going strong after the passage of last January’s Every Student Succeeds Act (ESSA) — via which the federal government removed (or significantly relaxed) its former mandates that all states adopt and use growth and value-added models (VAMs) to hold their teachers accountable (see here) — is New Mexico.

This should be of no surprise to followers of this blog, especially those who have not only recognized the decline in posts via this blog post-ESSA (see a post about this decline here), but also those who have noted that “New Mexico” is the state most often mentioned in said posts post-ESSA (see for example here, here, and here).

Well, apparently now (and post revisions likely caused by the ongoing lawsuit regarding New Mexico’s teacher evaluation system, of which attendance is/was a part; see for example here, here, and here), teachers are also to be penalized if they miss more than three days of work.

As per a recent article in the Santa Fe New Mexican (here), and as per its title, these new teacher attendance regulations, to be factored into teachers’ performance evaluations, have clearly caught schools “off guard.”

“The state has said that including attendance in performance reviews helps reduce teacher absences, which saves money for districts and increases students’ learning time.” In fact, effective this calendar year, 5 percent of a teacher’s evaluation is to be made up of teacher attendance. New Mexico Public Education Department spokesman Robert McEntyre clarified that “teachers can miss up to three days of work without being penalized.” He added that “Since attendance was first included in teacher evaluations, it’s estimated that New Mexico schools are collectively saving $3.5 million in costs for substitute teachers and adding 300,000 hours of instructional time back into [their] classrooms.”
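As a rough illustration of how an attendance component might enter a composite evaluation (the 5 percent weight and three-day allowance come from the article; the linear penalty below is my own hypothetical, as the state’s actual scoring rule is not described):

```python
def attendance_component(days_missed, allowance=3, weight=0.05):
    """Hypothetical scoring rule: full credit up to the allowance,
    then a linear penalty, contributing at most 5% of the evaluation."""
    if days_missed <= allowance:
        score = 1.0
    else:
        score = max(0.0, 1.0 - 0.1 * (days_missed - allowance))
    return weight * score

print(attendance_component(2))   # within the three-day allowance: full 5% credit
print(attendance_component(8))   # five days over the allowance: reduced credit
```

Under any such rule, the attendance component can move a teacher’s overall evaluation by at most 5 percentage points, but for teachers near a rating threshold even that margin can change the summative label.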

“The new guidelines also do not dock teachers for absences covered by the federal Family and Medical Leave Act, or absences because of military duty, jury duty, bereavement, religious leave or professional development programs.” Reported to me only anecdotally (i.e., I could not find evidence of this elsewhere), the new guidelines might also dock teachers for engaging in professional development or overseeing extracurricular events such as debate team performances. If anybody has anything to add on this end, especially as evidence of this, please do comment below.

A Case of VAM-Based Chaos in Florida

Within a recent post, I wrote about my recent “silence,” explaining that, apparently, post the federal government’s (January 1, 2016) passage of the Every Student Succeeds Act (ESSA) — which no longer requires teachers to be evaluated by their students’ test scores using VAMs (see prior posts on this here and here) — “crazy” VAM-related events have apparently subsided. While I noted in that post that this did not mean that certain states and districts are not still drinking (and overdosing on) the VAM-based Kool-Aid, what I did not note is that one of the ways by which I get many of the stories I cover on this blog is via Google Alerts, and this is where I have noticed a significant decline in VAM-related stories. Clearly, however, the news outlets often covered via Google Alerts don’t include district-level stories, so to cover these we must continue to rely on our followers (i.e., teachers, administrators, parents, students, school board members, etc.) to keep the stories coming.

Coincidentally — Billy Townsend, who is running for a school board seat in Polk County, Florida (district size = 100K students) — sent me one such story. As an edublogger himself, he actually sent me three blog posts (see post #1, post #2, and post #3 listed by order of relevance) capturing what is happening in his district, again, as situated under the state of Florida’s ongoing, VAM-based, nonsense. I’ve summarized the situation below as based on his three posts.

In short, the state ordered the district to dismiss a good number of its teachers as per their VAM scores when this school year started. “[T]his has been Florida’s [educational reform] model for nearly 20 years [actually since 1979, so closer to 40 years]: Choose. Test. Punish. Stigmatize. Segregate. Turnover.” Because the district already had a massive teacher shortage, however, these teachers were replaced with Kelly Services contracted substitute teachers. Thereafter, district leaders decided that this was not “a good thing,” and they decided that administrators and “coaches” would temporarily replace the substitute teachers to make the situation “better.” While, of course, the substitutes’ replacements did not have VAM scores themselves, they were nonetheless deemed fit to teach, and clearly more fit to teach than the teachers who were terminated based on their VAM scores.

According to one teacher who anonymously wrote about her terminated teacher colleagues, and one of the district’s “best” teachers: “She knew our kids well. She understood how to reach them, how to talk to them. Because she ‘looked like them’ and was from their neighborhood, she [also] had credibility with the students and parents. She was professional, always did what was best for students. She had coached several different sports teams over the past decade. Her VAM score just wasn’t good enough.”

Consequently, this has turned into a “chaotic reality for real kids and adults” throughout the county’s schools, and the district and state apparently responded by “threaten[ing] all of [the district’s] teachers with some sort of ethics violation if they talk about what’s happening” throughout the district. While “[t]he repetition of stories that sound just like this from [the district’s] schools is numbing and heartbreaking at the same time,” the state, district, and school board, apparently, “ha[ve] no interest” in such stories.

Put simply, and put well as this aligns with our philosophy here: “Let’s [all] consider what [all of this] really means: [Florida] legislators do not want to hear from you if you are communicating a real experience from your life at a school — whether you are a teacher, parent, or student. Your experience doesn’t matter. Only your test score.”

Isn’t that the unfortunate truth? Hence, and with reference to the introduction above, please do keep these relatively more invisible stories coming so that we can share them with the nation and make them more visible and accessible. VAMs, again, are alive and well, just perhaps in more undisclosed ways, like within districts as is the case here.

47 Teachers To Be Stripped of Tenure in Denver

As per a recent article by Chalkbeat Colorado, “Denver Public Schools [is] Set to Strip Nearly 50 Teachers of Tenure Protections after [two years of consecutive] Poor Evaluations.” This will make Denver Public Schools — Colorado’s largest school district — the district with the highest relative proportion of teachers to lose tenure; losing tenure demotes teachers to probationary status and causes them to lose their due process rights. As per the article:

  • The majority of the 47 teachers — 26 of them — are white. Another 14 are Latino, four are African-American, two are multi-racial and one is Asian.
  • Thirty-one of the 47 teachers set to lose tenure — or 66 percent — teach in “green” or “blue” schools, the two highest ratings on Denver’s color-coded School Performance Framework. Only three — or 6 percent — teach in “red” schools, the lowest rating.
  • Thirty-eight of the 47 teachers — or 81 percent — teach at schools where more than half of the students qualify for federally subsidized lunches, an indicator of poverty.

Elsewhere, 24 teachers in Douglas County, 12 in Aurora, one in Cherry Creek, and zero in Jefferson County — the state’s second largest district — are set to lose their tenure status. This all occurred under a sweeping educator effectiveness law — Senate Bill 191 — passed in Colorado six years ago. As per this law, “at least 50 percent of a teacher’s evaluation [must] be based on student academic growth.”

“Because this is the first year teachers can lose that status…[however]…officials said it’s difficult to know why the numbers differ from district to district.” This, of course, is an issue with fairness whereby a court, for example, could find that if a teacher teaching in District X versus District Y had a different probability of losing tenure due only to the district in which (s)he taught, this could quite easily be argued as an arbitrary component of the law, not to mention an arbitrary component of its implementation. If I were advising these districts on these matters, I would certainly advise them to tread lightly.

However, apparently many districts throughout Colorado use a state-developed and endorsed model to evaluate their teachers, while Denver uses its own model; hence, this would likely take some of the pressure off of the state, should this end up in court, and place it more so upon the district. That is, the burden of proof would likely rest on Denver Public Schools officials to evidence that they are not only complying with the state law but also doing so in sound, evidence-based, and rational/reasonable ways.

Citation: Amar, M. (2016, July 15). Denver Public Schools set to strip nearly 50 teachers of tenure protections after poor evaluations. Chalkbeat Colorado. Retrieved from http://www.chalkbeat.org/posts/co/2016/07/14/denver-public-schools-set-to-strip-nearly-50-teachers-of-tenure-protections-after-poor-evaluations/#.V5Yryq47Tof

Center on the Future of American Education, on America’s “New and Improved” Teacher Evaluation Systems

Thomas Toch — education policy expert and research fellow at Georgetown University, and founding director of the Center on the Future of American Education — just released, as part of the Center, a report titled: Grading the Graders: A Report on Teacher Evaluation Reform in Public Education. He sent this to me for my thoughts, and I decided to summarize my thoughts here, with thanks and all due respect to the author, as clearly we are on different sides of the spectrum in terms of the literal “value” America’s new teacher evaluation systems might in fact “add” to the reformation of America’s public schools.

While quite a long and meaty report, here are some of the points I think are important to address publicly:

First, is it true that prior teacher evaluation systems (which were almost if not entirely based on teacher observational systems) yielded for “nearly every teacher satisfactory ratings”? Indeed, this is true. However, what we have seen since 2009, when states began to adopt what were then (and in many ways still are) viewed as America’s “new and improved” or “strengthened” teacher evaluation systems, is that for 70% of America’s teachers these systems are still based only on the observational indicators used prior, because value-added estimates are calculable for only the other 30% of America’s teachers. As also noted in this report, it is for these 70% that “the superficial teacher [evaluation] practices of the past” (p. 2) will remain the same, although I disagree with this particular adjective, especially when these measures are used for formative purposes. While certainly imperfect, these are not simply “flimsy checklists” of no use or value. There is, indeed, much empirical research to support this assertion.

Likewise, these observational systems have not really changed since 2009, or 1999 for that matter (not that they could change all that much); but they are not in their “early stages” (p. 2) of development. Indeed, this includes the Danielson Framework explicitly propped up in this piece as an exemplar, regardless of the fact that it has been used across states and districts for decades and is still not functioning as intended, especially when summative decisions about teacher effectiveness are to be made (see, for example, here).

Hence, in some states and districts (sometimes via educational policy), principals or other observers are now being asked, or required, to deliberately assign teachers to lower observational categories, or to assign approximate proportions of teachers to each observational category used. Where the instrument might not distribute scores “as currently needed,” one way to game the system is to tell principals, for example, that they should allot only X% of teachers to each of the three-to-five categories most often used across said instruments. In fact, in an article one of my doctoral students and I have forthcoming, we have termed this, with empirical evidence, the “artificial deflation” of observational scores, as externally persuaded or required. Worse is that this sometimes signals to the greater public that these “new and improved” teacher evaluation systems are being used for more discriminatory purposes (i.e., to actually differentiate between good and bad teachers on some sort of discriminating continuum), or that, indeed, there is a normal distribution of teachers as per their levels of effectiveness. While certainly there is some type of distribution, no evidence exists whatsoever to suggest that those who fall on the wrong side of the mean are, in fact, ineffective, and vice versa. It’s all relative, seriously, and unfortunately.

Related, the goal here is really not to “thoughtfully compare teacher performances,” but to evaluate teachers as per a set of criteria against which they can be evaluated and judged (i.e., whereby criterion-referenced inferences and decisions can be made). Inversely, comparing teachers in norm-referenced ways, as (socially) Darwinian and resonant with many-to-some, does not necessarily work either. This is precisely what the authors of The Widget Effect report did, after which they argued for wide-scale system reform so that increased discrimination among teachers, and reduced indifference on the part of evaluating principals, could occur. However, as also evidenced in the aforementioned article, the increasing presence of normal curves illustrating “new and improved” teacher observational distributions does not necessarily mean anything is normal.

And were these systems not used often enough, or used only “rarely,” to fire teachers prior? Perhaps, although there are no data to support such assertions, either. This very argument was at the heart of the Vergara v. California case (see, for example, here) — that teacher tenure laws, as well as laws protecting teachers’ due process rights, were keeping “grossly ineffective” teachers teaching in the classroom. Again, while no expert on either side could produce for the Court any hard numbers regarding how many “grossly ineffective” teachers were in fact being protected by such archaic rules and procedures, I would estimate (as based on my years of experience as a teacher) that this number is much lower than many believe (and perhaps perpetuate) it to be. In fact, there was only one teacher whom I recall, who taught with me in a highly urban school, whom I would have classified as grossly ineffective, and who was also tenured. He was ultimately fired, and quite easy to fire, as he also knew that he just didn’t have it.

Now, to be clear, I do think that not just "grossly ineffective" but also simply "bad" teachers should be fired. But the indicators used to do this must yield valid inferences, based on evidence that is critically and appropriately consumed by the parties involved, after which valid and defensible decisions can and should be made. Whether one calls this due process in a proactive sense, or a wrongful termination suit in a retroactive sense, what matters most is that the evidence supports the decision. This is the very issue at the heart of many of the lawsuits currently ongoing on this topic, as many of you know (see, for example, here).

Finally, where is the evidence, I ask, for many of the declarations included within and throughout this report? A review of the 133 endnotes, for example, reveals only a very small handful of references to the larger literature on this topic (see very comprehensive lists of this literature here, here, and here). This is also highly problematic, as only the usual suspects (e.g., Sandi Jacobs, Thomas Kane, Bill Sanders) are cited to support the assertions advanced.

Take, for example, the following declaration: "a large and growing body of state and local implementation studies, academic research, teacher surveys, and interviews with dozens of policymakers, experts, and educators all reveal a much more promising picture: The reforms have strengthened many school districts' focus on instructional quality, created a foundation for making teaching a more attractive profession, and improved the prospects for student achievement" (p. 1). Where is the evidence? There is no such evidence, at least none published in high-quality, scholarly, peer-reviewed journals of which I am aware. Again, publications released by the National Council on Teacher Quality (NCTQ) and from the Measures of Effective Teaching (MET) studies, as still not externally reviewed and still considered internal technical reports with "issues," don't necessarily count. Accordingly, no such evidence has been introduced, by either side, in any court case in which I am involved, likely because such evidence does not exist, again, at any unbiased, vetted, and/or generalizable level. While Thomas Kane has introduced some of his MET study findings in the cases in Houston and New Mexico, these might be some of the easiest pieces of evidence to target, accordingly, given these issues.

Otherwise, the only claim in this piece with which I agree, as that which I view, given the research literature, as true and good, is that teachers are now being observed more often, by more people, in more depth, and, in perhaps some cases, with better observational instruments. Accordingly, teachers, also as per the research, seem to appreciate and enjoy the additional and more frequent/useful feedback and discussions about their practice now being offered. This, I would agree, is something very positive that has come out of the nation's policy-based focus on its "new and improved" teacher evaluation systems, again, as largely required by the federal government, especially pre-Every Student Succeeds Act (ESSA).

Overall, and in sum, "the research reveals that comprehensive teacher-evaluation models are stronger than the sum of their parts." Unfortunately, however, this is untrue: while such a holistic view is ideal, in educational measurement terms a composite based on multiple measures is entirely limited by its weakest indicator. That indicator is currently value-added (i.e., the one with the lowest levels of reliability and, relatedly, issues with validity and bias), the indicator at issue within this particular blog, and the indicator of the most interest, as it is this indicator that has truly changed our overall approaches to the evaluation of America's teachers. It has yet to deliver, however, especially if it is to be used for high-stakes, consequential decision-making purposes (e.g., incentives, getting rid of "bad apples").

Feel free to read more here, as publicly available: Grading the Teachers: A Report on Teacher Evaluation Reform in Public Education. See also the other claims made within about the benefits of said systems (e.g., these systems as foundations for new teacher roles and responsibilities, smarter employment decisions, prioritized classrooms, and an increased focus on improved standards). See also the recommendations offered, some of which I agree with on the observational side (e.g., ensuring that teachers receive multiple observations during a school year by multiple evaluators), and none of which I agree with on the value-added side (e.g., using at least two years of student achievement data in teacher evaluation ratings; rather, researchers agree that three years of value-added data are needed, based on at least four years of student-level test data). There are, of course, many other recommendations included. You all can be the judges of those.

Five “Indisputable” Reasons Why VAMs are Good?

Just this week, in Education Week — the field’s leading national newspaper covering K–12 education — a blogger by the name of Matthew Lynch published a piece explaining his “Five Indisputable [emphasis added] Reasons Why You Should Be Implementing Value-Added Assessment.”

I'm going to try to stay aboveboard with my critique of this piece, as best I can, although by the title alone you can infer that there are certainly claims (mainly five) to be seriously criticized about the author's "indisputable" take on value-added (and, by default, value-added models (VAMs)). I examine each of these assertions below, but I will say, overall and before we begin, that pretty much everything included in this piece is hardly palatable, and barely tolerable considering that Education Week published it. By publishing it, Education Week quasi-endorsed it, even if as an independent blog post that its editors likely, at minimum, reviewed before making public.

First, the five assertions, along with a simple response per assertion:

1. Value-added assessment moves the focus from statistics and demographics to asking of essential questions such as, “How well are students progressing?”

In theory, yes, this is generally true (see also my response about the demographics piece in assertion #3 below). The problem, though, as we all should know by now, is that once we move away from the theory in support of value-added, this theory more or less crumbles. The majority of the research on this topic explains and evidences the reasons why. Is value-added better than what "we" did before, that is, measuring student achievement once per year without taking growth over time into consideration? Perhaps, but only if it worked as intended.

2. Value-added assessment focuses on student growth, which allows teachers and students to be recognized for their improvement. This measurement applies equally to high-performing and advantaged students and under-performing or disadvantaged students.

Indeed, the focus is on growth (see my response in assertion #1 above). What the author of this post does not understand, however, is that his latter conclusion is likely THE most controversial issue surrounding value-added, and on this nearly all topical researchers agree. In fact, the authors of the most recent review of what is actually called "bias" in value-added estimates, as published in the peer-reviewed Economics of Education Review (see a pre-publication version of this manuscript here), concluded that because of potential bias (i.e., this measurement does NOT apply "equally to high-performing and advantaged students and under-performing or disadvantaged students"), all value-added modelers should control for as many student-level (and other) demographic variables as possible to help minimize this potential, also given the extent to which multiple authors' evidence of bias varies wildly (from negligible to considerable).

3. Value-added assessment provides results that are tied to teacher effectiveness, not student demographics; this is a much more fair accountability measure.

See my comment immediately above, with general emphasis added regarding this overly simplistic take on the extent to which VAMs yield "fair" estimates, free from the biasing effects (which occur never to always) caused by such demographics. My "fairest" interpretation of the current, albeit controversial, research surrounding this particular issue is that bias may not exist across teacher-level estimates on average, but it certainly occurs when teachers are non-randomly assigned highly homogenous sets of students who are gifted, who are English Language Learners (ELLs), who are enrolled in special education programs, who disproportionately represent racial minority groups, who disproportionately come from lower socioeconomic backgrounds, and/or who have previously been retained in grade.

4. Value-added assessment is not a stand-alone solution, but it does provide rich data that helps educators make data-driven decisions.

This is entirely false. There is no research evidence, still to date, that teachers use these data to make instructional decisions. Accordingly, no research is linked to or cited here (or elsewhere). Now, if the author is talking about naive "educators," in general, who make consequential decisions based on poor (i.e., the opposite of "rich") data, this assertion would be true. This "truth," in fact, is at the core of the lawsuits ongoing across the nation regarding this matter (see, for example, here), with consequences ranging from the tagging of a teacher's file after a low value-added score to teacher termination.

5. Value-added assessment assumes that teachers matter and recognizes that a good teacher can facilitate student improvement.

Perhaps we have only value-added assessment to thank for "assuming" [sic] this. Enough said…

Or not…

Lastly, the author professes to be a "professor," pretty much all over the place (see, again, here), although he is currently an associate professor. There is a difference, and folks who respect the difference typically make the distinction explicit and known, especially in an academic setting or context. See also here, however, to assess his expertise (or the lack thereof) in value-added or VAMs, given what he writes here as "indisputable."

Perhaps most important here, though, is that his falsely inflated professional title implies, especially to a naive or uncritical public, that what he has to say, again without any research support, deserves some kind of credibility and respect. Unfortunately, this is just not the case; hence, we are again reminded of the need for general readers to be critical in their consumption of such pieces. I would have thought Education Week would have played a larger gatekeeping role here, rather than just putting this stuff "out there," even if for simple debate or discussion.

Massachusetts Also Moving To Remove Growth Measures from State’s Teacher Evaluation Systems

Since the passage of the Every Student Succeeds Act (ESSA) in December of 2015, in which the federal government handed back to states the authority to decide whether to evaluate teachers with or without students' test scores, states have been dropping the value-added model (VAM) or growth components (e.g., the Student Growth Percentiles (SGP) package) of their teacher evaluation systems, as formerly required under President Obama's Race to the Top initiative. See my most recent post here, for example, about how legislators in Oklahoma recently removed VAMs from their state-level teacher evaluation system, while simultaneously increasing the state's focus on the professional development of all teachers. Hawaii recently did the same.

Now, it seems that Massachusetts is the next state, at least, moving in this same direction.

As per a recent article in The Boston Globe (here), similar test-based teacher accountability efforts are facing increased opposition, primarily from school district superintendents and teachers throughout the state. At issue is whether all of this is simply "becoming a distraction," whether the data can be impacted or "biased" by other statistically uncontrollable factors, and whether all teachers can be evaluated in similar ways, which is an issue of "fairness." Also at issue is "reliability": a 2014 study released by the Center for Educational Assessment at the University of Massachusetts Amherst, in which researchers examined student growth percentiles, found that the "amount of random error was substantial." Stephen Sireci, a UMass professor and one of the study's authors, noted that, instead of relying upon such volatile results, "You might as well [just] flip a coin."
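The "flip a coin" intuition is easy to illustrate with a toy simulation. This is a hypothetical sketch, not a reproduction of the UMass study: I assume, purely for illustration, a year-to-year reliability of 0.3 for a growth score, and ask how often a teacher rated above the median in one year lands on the same side of the median the next year.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
reliability = 0.3  # hypothetical year-to-year reliability, assumed for illustration

# Stable teacher effectiveness plus fresh noise each year.
truth = rng.normal(size=n)
year1 = np.sqrt(reliability) * truth + np.sqrt(1 - reliability) * rng.normal(size=n)
year2 = np.sqrt(reliability) * truth + np.sqrt(1 - reliability) * rng.normal(size=n)

# How often does the above/below-median classification agree across years?
above1 = year1 > np.median(year1)
above2 = year2 > np.median(year2)
agreement = np.mean(above1 == above2)
print(f"Year-to-year agreement on above/below median: {agreement:.2f}")
```

Under this assumption the two years agree only around 60% of the time, which is much closer to a coin flip (50%) than to a stable rating, and that is before any consequences are attached to which side of the line a teacher lands on.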

Damian Betebenner, a senior associate at the National Center for the Improvement of Educational Assessment Inc. in Dover, N.H., who developed the SGP model in use in Massachusetts, added that, "Unfortunately, the use of student percentiles has turned into a debate for scapegoating teachers for the ills." Isn't this the truth, in that policymakers got hold of these statistical tools, after which they much too swiftly and carelessly singled out teachers for unmerited treatment and blame.

Regardless, and recently, stakeholders in Massachusetts lobbied the Senate to approve an amendment to the budget that would no longer require such test-based ratings in teachers’ professional evaluations, while also passing a policy statement urging the state to scrap these ratings entirely. “It remains unclear what the fate of the Senate amendment will be,” however. “The House has previously rejected a similar amendment, which means the issue would have to be resolved in a conference committee as the two sides reconcile their budget proposals in the coming weeks.”

Not surprisingly, Mitchell Chester, Massachusetts Commissioner for Elementary and Secondary Education, continues to defend the requirement. It seems that Chester, like others, is still holding tight to the default (yet still unsubstantiated) logic helping to advance these systems in the first place, arguing, “Some teachers are strong, others are not…If we are not looking at who is getting strong gains and those who are not we are missing an opportunity to upgrade teaching across the system.”