Mirror, Mirror on the Wall…

No surprise, again, but Thomas Kane, an economics professor from Harvard University who also directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, is publicly writing in support of VAMs, again (redundancy intended). I just posted about one of his recent articles published on the website of the Brookings Institution titled “Do Value-Added Estimates Identify Causal Effects of Teachers and Schools?” after which I received another of his articles, this time published by the New York Daily News titled “Teachers Must Look in the Mirror.”

Embracing a fabled metaphor, and without meaning to position teachers as the wicked queen or Kane as Snow White, let us ask ourselves the classic question: “Who is the fairest one of all?” as we critically review yet another fairytale authored by Harvard’s Kane. He has, after all, “carefully studied the best systems for rating teachers” (see other prior posts about Kane’s public perspectives on VAMs here and here).

In this piece, Kane continues to advance a series of phantasmal claims about the potentials of VAMs, this time in the state of New York, where Governor Andrew Cuomo intends to move the state’s teacher evaluation system to one based 50% on teachers’ value-added, or effectively 100% on value-added in cases where a teacher rated as “ineffective” on his/her value-added score can be rated as “ineffective” overall. Here, value-added could be used to trump all else (see prior posts about this here and here).

According to Kane, Governor Cuomo “picked the right fight.” The state’s new system “will finally give schools the tools they need to manage and improve teaching.” Perhaps the magic mirror would agree with such a statement, but the research would show such claims to be made in vain.

As I have noted prior, there is absolutely no evidence, thus far, indicating that such systems have any (in)formative use or value. These data are first and foremost designed for summative, or summary, purposes; they are not designed for formative use. Accordingly, the data that come from such systems (beyond the data that come from the observational components being built into these systems, components that have existed and been used for decades) are not transparent, are difficult to understand, and are therefore challenging to use. Likewise, such data are not instructionally sensitive, and they are untimely in that test-based results typically come back to teachers well after their students have moved on to subsequent grade levels.

What about Kane’s claims against tenure: “The tenure process is the place to start. It’s the most important decision a principal makes. One poor decision can burden thousands of future students, parents, colleagues and supervisors.” This is quite an effect, considering that these new and improved teacher evaluation systems, based (in this case largely) on VAMs, typically hold accountable only teachers at the elementary level who teach mathematics and reading/language arts. Even an elementary teacher with a career spanning 40 years, averaging 30 students per class, would directly impact (or burden) 1,200 students, maximum (40 years x 30 students per year). This is not to say such an impact is inconsequential, but is it as consequential as Kane’s sensational numbers imply? What about the thousands of parents, colleagues, and supervisors also to be burdened by one poor decision? Fair and objective? This particular mirror thinks not.

Granted, I am not making any claims about tenure, as I think all would agree that tenure can sometimes protect, keeping with the metaphor, bad apples. Rather, I take issue with the exaggerations, including the claim that “Traditionally, principals have used much too low a standard, promoting everyone but the very worst teachers.” We must all check our assumptions here about how we define “the very worst teachers” and how many of them really lurk in the shadows of America’s now not-so-enchanted forests. There is no evidence to support this claim, either; just conjecture.

As for the solution, “Under the new law, the length of time it will take to earn tenure will be lengthened from three to four years.” Yes, that arbitrary, one-year extension will certainly help… Likewise, tenure decisions will now supposedly be made better using classroom observations (the data that have, according to Kane in this piece, been used for years to make all of the aforementioned bad decisions) and our new, fair and objective, test-based measures, which, although Kane does not note this, can only be used for about 30% of all teachers in America’s public schools. Nonetheless, “Student achievement gains [are to serve as] the bathroom scale, [and] classroom observations [are to serve] as the mirror.”

Kane continues, scripting, “Although the use of test scores has received all the attention, [one of] the most consequential change[s] in the law has been overlooked: One of a teacher’s observers must now be drawn from outside his or her school — someone whose only role is to comment on teaching.” Those from inside the school were only commenting on one’s beauty and fairness prior, I suppose, as “The fact that 96% of teachers were given the two highest ratings last year — being deemed either “effective” or “highly effective” — is a sure sign that principals have not been honest to date.”

All in all, perhaps somebody else should be taking a long hard “Look in the Mirror,” as this new law will likely do everything but “[open] the door to a renewed focus on instruction and excellence in teaching” despite the best efforts of “union leadership,” although I might add to Kane’s list many adorable little researchers who have also “carefully studied the best systems for rating teachers” and more or less agree on their intended and unintended results in…the end.

Vanderbilt Researchers on Performance Pay, VAMs, and SLOs

Do higher paychecks translate into higher student test scores? That is the question two researchers at Vanderbilt – Ryan Balch (a recent Graduate Research Assistant at Vanderbilt’s National Center on Performance Incentives) and Matthew Springer (Assistant Professor of Public Policy and Education and Director of Vanderbilt’s National Center on Performance Incentives) – attempted to answer in a recent study of the REACH pay-for-performance program in Austin, Texas (a nationally recognized performance program model with $62.3 million in federal support). The study, published in Economics of Education Review, can be found here, but for a $19.95 fee; hence, I’ll do my best to explain this study’s contents so you all can save your money, unless of course you too want to dig deeper.

As background (and as explained on the first page of the full paper), the theory behind performance pay is that tying teacher pay to teacher performance provides “strong incentives” to improve outcomes of interest. “It can help motivate teachers to higher levels of performance and align their behaviors and interests with institutional goals.” I should note, however, that there is very mixed evidence from over 100 years of research on performance pay regarding whether it has ever worked. Economists tend to believe it works while educational researchers tend to disagree.

Regardless, in this study, as per a ResearchNews@Vanderbilt post highlighting it, the researchers found that teacher-level growth in student achievement in mathematics and reading was significantly higher during the first year of the program’s implementation (2007-2008) in schools in which teachers were given monetary performance incentives than in the nearest matched neighborhood schools where teachers were not given such incentives. Similar gains were maintained the following year, yet (as per the full report) no additional growth or loss was noted otherwise.

As per the full report as well, the researchers more specifically found that students who were enrolled in the REACH program made gains in mathematics that were 0.13 to 0.17 standard deviations greater than those of comparison students, and (although not as evident or highlighted in the text of the actual report, but within a related table) gains in reading that were 0.05 to 0.10 standard deviations greater, although the reading gains were also less statistically significant. Curious…
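For readers less familiar with effect sizes expressed in standard deviation units, here is a minimal sketch of the general idea behind a claim like “0.13 to 0.17 standard deviations greater gains.” The column names are hypothetical, and the actual published estimates come from regression-adjusted models, not from this simple difference:

```python
# A minimal sketch (not Balch and Springer's actual model) of a gain expressed in
# standard deviation units. Column names ("pre_score", "post_score", "in_reach")
# are hypothetical.
import pandas as pd

def standardized_gain_difference(students: pd.DataFrame) -> float:
    """Difference in mean test-score gains between REACH and comparison students,
    divided by the standard deviation of gains (a simple effect-size analogue)."""
    gains = students["post_score"] - students["pre_score"]
    reach_gains = gains[students["in_reach"]]
    comparison_gains = gains[~students["in_reach"]]
    return (reach_gains.mean() - comparison_gains.mean()) / gains.std(ddof=1)
```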

While the method by which schools were matched was well detailed, and inter-school descriptive statistics were presented to help readers determine whether the schools sampled for this study were comparable (although statistics that would help us determine whether the noted inter-school differences were statistically significant were not presented), the statistics comparing the teachers in REACH schools with the non-REACH teachers to whom they were compared were missing entirely. Hence, it is impossible to even begin to determine whether the matching methodology actually yielded comparable samples down to the teacher level – the heart of this research study. This is a fatal flaw that, in my opinion, should have prevented this study from being published, at least as is, because without this information we have no guarantee that the teachers within these schools were indeed comparable.
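To make concrete what is missing, here is a minimal sketch, again with hypothetical column names, of the kind of teacher-level balance check that could have been reported: standardized mean differences between REACH and comparison teachers on observable characteristics.

```python
# A minimal sketch, with hypothetical column names, of a teacher-level balance
# check: standardized mean differences between REACH and comparison teachers.
import pandas as pd

def standardized_mean_differences(teachers: pd.DataFrame, covariates: list) -> pd.Series:
    """(mean_REACH - mean_comparison) / pooled SD for each covariate.
    Values beyond roughly |0.10|-|0.25| are commonly read as signs of imbalance."""
    reach = teachers[teachers["in_reach"]]
    comparison = teachers[~teachers["in_reach"]]
    smds = {}
    for cov in covariates:
        pooled_sd = ((reach[cov].var(ddof=1) + comparison[cov].var(ddof=1)) / 2) ** 0.5
        smds[cov] = (reach[cov].mean() - comparison[cov].mean()) / pooled_sd
    return pd.Series(smds)

# Hypothetical usage:
# standardized_mean_differences(teachers, ["years_experience", "advanced_degree", "prior_value_added"])
```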

Regardless, researchers also examined teachers’ Student Learning Objectives (SLOs) – the incentive program’s “primary measure of individual teacher performance” given so many teachers are still VAM-ineligible (see a prior post about SLOs, here). They examined whether SLO scores correlated with VAM scores, for those teachers who had both.

They found, as per a quote by Springer in the above-mentioned post, that “[w]hile SLOs may serve as an important pedagogical tool for teachers in encouraging goal-setting for students, the format and guidance for SLOs within the specific program did not lead to the proper identification of high value-added teachers.” That is, more precisely and as indicated in the actual study, SLOs were “not significantly correlated with a teacher’s value-added student test scores;” hence, “a teacher is no more likely to meet his or her SLO targets if [his/her] students have higher levels of achievement [over time].” This has huge implications, in particular regarding the still lacking evidence of validity surrounding SLOs.
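For readers curious what such a check looks like in practice, here is a minimal sketch of testing whether SLO scores and value-added scores correlate, using a hypothetical data layout and column names (an illustration of the general idea, not the authors’ actual analysis):

```python
# A minimal sketch (hypothetical data layout, not the authors' code) of the check
# behind the finding that SLO scores were "not significantly correlated" with
# teachers' value-added scores.
import pandas as pd
from scipy import stats

def slo_vam_correlation(teachers: pd.DataFrame):
    """Pearson correlation and p-value between SLO scores and value-added scores,
    restricted to teachers who have both measures."""
    both = teachers.dropna(subset=["slo_score", "vam_score"])
    r, p = stats.pearsonr(both["slo_score"], both["vam_score"])
    return r, p

# A near-zero, non-significant r would mean that meeting one's SLO targets says
# little about one's value-added, which is the validity concern raised above.
```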

New York Times Article on “Grading Teachers by the Test”

Last Tuesday, Eduardo Porter – writer of the Economic Scene column for The New York Times – wrote an excellent article, from an economics perspective, about our current obsession in educational policy with “Grading Teachers by the Test.” Do give the article a full read; it’s well worth it, and also consider here some of what I see as Porter’s strongest points to ponder.

Porter writes about what economists often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the value (mis)placed on any measure, and the distortion (i.e., artificial inflation or deflation, depending on the desired direction of the measure) that almost always comes about as a result.

Remember when the Federal Aviation Administration was to hold airlines “accountable” for their on-time arrivals? Airlines met the “new and improved” FAA standards by merely padding the estimated times of arrival (ETAs) on the back end; they did little if anything else of instrumental value. As cited by Porter, the same thing happens when U.S. hospitals “do whatever it takes to keep patients alive at least 31 days after an operation, to beat Medicare’s 30-day survival yardstick.” This is Goodhart’s Law at play.

In education we more commonly refer to Goodhart’s Law’s interdisciplinary cousin, Campbell’s Law, which states that “the more any quantitative social indicator (or even some qualitative indicator) is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” In his 1976 paper, Campbell wrote that “achievement tests may well be valuable indicators of general school achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways” (click here for more information). This is the main point of this New York Times piece. See also a classic article on test score pollution here and a fantastic book all about this here.

My colleague Jesse Rothstein – Associate Professor of Economics at the University of California, Berkeley – is cited as saying, “We don’t know how big a deal this is…[but]…It is one of [our] main concerns.” I would add that while we cannot quantify this or how often it actually occurs, nor will we ever likely be able to do so, we can state with 100% certainty that this is, indeed, a very big deal. Porter agrees, citing multiple research studies in support (including one of mine here and another of my former PhD student’s here).

See also a research study I published with a set of colleagues in 2010, titled “Cheating in the first, second, and third degree: Educators’ responses to high-stakes testing,” in which we got as close as anyone of whom I am aware to understanding not only the frequencies but also the varieties of ways in which educators distort, sometimes consciously and sometimes subconsciously, their students’ test scores. Educators also oftentimes engage in such practices believing that their test-preparation and other test-related practices are in their students’ best interests. Note, however, that this study took place before even more consequences were attached to students’ test scores, given the federal government’s more recent fascination with holding teachers accountable for their quantifiable-by-student-test-scores “value-added.”

It is a fascinating phenomenon, really, albeit highly unfortunate given the consequences that those external to America’s public education system (e.g., educational policymakers) continue to tie to teachers’ test scores, regardless, and now more than ever before.

As underscored by Porter, “American education has embarked upon a nationwide experiment in incentive design. Prodded by the Education Department, most states have set up evaluation systems for teachers built on the gains of their students on standardized tests, alongside more traditional criteria like evaluations from principals.” All of this has occurred despite the research evidence, some of which is over 20 years old, and despite the profession’s calls (see, for example, here) to stop what, thanks to Goodhart’s and Campbell’s Laws, can often be understood as noise and nonsense.

Unfortunate Updates from Tennessee’s New “Leadership”

Did I write in a prior post (from December, 2014) that “following the (in many ways celebrated) exit of Commissioner Huffman, it seems the state [of Tennessee] is taking an even more reasonable stance towards VAMs and their use(s) for teacher accountability?” I did, and I was apparently wrong…

Things in Tennessee — the state in which our beloved education-based VAMs were born (see here and here) and on which we have kept constant watch — have gone south, once again, thanks to new “leadership.” I use this term loosely to describe the author of the letter below, as featured in a recent article in The Tennessean. The letter’s author is the state’s new Commissioner of the Tennessee Department of Education — Candice McQueen — “a former Tennessee classroom teacher and higher education leader.”

Pasted below is what she wrote. You be the judge and question her claims, particularly those I underlined for emphasis below.

This week legislators debated Governor Haslam’s plan to respond to educator feedback regarding Tennessee’s four-year-old teacher evaluation system. There’s been some confusion around the details of the plan, so I’m writing to clarify points and share the overall intent.

Every student has individual and unique needs, and each student walks through the school doors on the first day at a unique point in their education journey. It’s critical that we understand not only the grade the child earns at the end of the year, but also how much the child has grown over the course of the year.

Tennessee’s Value-Added Assessment System, or TVAAS, provides educators vital information about our students’ growth.

From TVAAS we learn if our low-achieving students are getting the support they need to catch up to their peers, and we also learn if our high-achieving students are being appropriately challenged so that they remain high-achieving. By analyzing this information, teachers can make adjustments to their instruction and get better at serving the unique needs of every child, every year.

We know that educators often have questions about how TVAAS is calculated and we are working hard to better support teachers’ and leaders’ understanding of this growth measure so that they can put the data to use to help their students learn (check out our website for some of these resources).

While we are working on improving resources, we have also heard concerns from teachers about being evaluated during a transition to a new state assessment. In response to this feedback from teachers, Governor Haslam proposed legislation to phase in the impact of the new TNReady test over the next three years.

Tennessee’s teacher evaluation system, passed as part of the First to the Top Act by the General Assembly in 2010 with support from the Tennessee Education Association, has always included multiple years of student growth data. Student growth data makes up 35 percent of a teacher’s evaluation and includes up to three years of data when available.

Considering prior and current year’s data together paints a more complete picture of a teacher’s impact on her student’s growth.

More data protects teachers from an oversized impact of a tough year (which teachers know happen from time to time). The governor’s plan maintains this more complete picture for teachers by continuing to look at both current and prior year’s growth in a teacher’s evaluation.

Furthermore, the governor’s proposal offers teachers more ownership and choice during this transition period.

If teachers are pleased with the growth score they earn from the new TNReady test, they can opt to have that score make up the entirety of their 35 percent growth measure. However, teachers may also choose to include growth scores they earned from prior TCAP years, reducing the impact of the new assessment.

The purpose of all of this is to support our student’s learning. Meeting the unique needs of each learner is harder than rocket science, and to achieve that we need feedback and information about how our students are learning, what is working well, and where we need to make adjustments. One of these critical pieces of feedback is information about our students’ growth.

An Average of 60-80 Days of 180 School Days (≈ 40%) on State Testing in Florida

“In Florida, which tests students more frequently than most other states, many schools this year will dedicate on average 60 to 80 days out of the 180-day school year to standardized testing.” So it is written in an article recently released in the New York Times titled “States Listen as Parents Give Rampant Testing an F.”

At issue, as per a seriously concerned set of parents, are the following:

  • The aforementioned focus on state standardized testing, as highlighted above, is being defined as a serious educational issue/concern. In the words of the article’s author, “Parents railed at a system that they said was overrun by new tests coming from all levels — district, state and federal.”
  • Related, parents took issue with the new Common Core tests and a “state mandate that students use computers for [these, and other] standardized tests,” which has “made the situation worse because computers are scarce and easily crash” and because “the state did not give districts extra money for computers or technology help.”
  • Related, “Because schools do not have computers for every student, tests are staggered throughout the day, which translates to more hours spent administering tests and less time teaching. Students who are not taking tests often occupy their time watching movies. The staggered test times also mean computer labs are not available for other students.”
  • In addition, parents “wept as they described teenagers who take Xanax to cope with test stress, children who refuse to go to school and teachers who retire rather than promote a culture that seems to value testing over learning.” One father cried confessing that he planned to pull his second grader from school because, in his words, “Teaching to a test is destroying our society,” as a side effect of also destroying the public education system.

Regardless, former Governor Jeb Bush — a possible presidential contender and one of the first governors to introduce high-stakes testing into “his” state of Florida — “continues to advocate test-based accountability through his education foundation. Former President George W. Bush, his brother, introduced similar measures as [former] governor of Texas and, as [former] president, embraced No Child Left Behind, the law that required states to develop tests to measure progress.”

While the testing craze existed in many ways and places prior to NCLB, NCLB and the Bush brothers’ insistent reliance on high-stakes tests to reform America’s public education system have really brought the U.S. to where it is today; that is, relying even more on more and more tests and the use of VAMs to better measure growth in between the tests administered.

Likewise, and as per the director of FairTest, “The numbers and consequences of these tests have driven public opinion over the edge, and politicians are scrambling to figure out how to deal with that.” In the state of Florida in particular, “[d]espite continued support in the Republican-dominated State Legislature for high-stakes testing,” these are just some of the signs that “Florida is headed for a showdown with opponents of an education system that many say is undermining its original mission: to improve student learning, help teachers and inform parents.”

Doug Harris on the (Increased) Use of Value-Added in Louisiana

Thus far, there have been four total books written about value-added models (VAMs) in education: one (2005) scholarly, edited book that was published prior to our heightened policy interest in VAMs; one (2012) that is less scholarly but more of a field guide on how to use VAM-based data; my recent (2014) scholarly book; and another recent (2011) scholarly book written by Doug Harris. Doug is an Associate Professor of Economics at Tulane in Louisiana. He is also, as I’ve written prior, “a ‘cautious’ but quite active proponent of VAMs.”

There is quite an interesting history surrounding these latter two books, given that Harris and I have quite different views on VAMs and their potentials in education. To read more about our differing opinions, you can read a review of Harris’s book I wrote for Teachers College Record, and another review a former doctoral student and I wrote for Education Review, to which he responded in his (and his book’s) defense, to which we also responded (with a “rebuttal to a rebuttal,” if you will). What was ultimately titled a “Value-Added Smackdown” in a blog post featured in Education Week, let’s just say, got a little out of hand, with the “smackdown” ending up focusing almost solely on our claim that Harris believed, and we disagreed with, the notion that “value-added [was and still is] good enough to be used for [purposes of] educational accountability.” We asserted then, and I continue to assert now, that “value-added is not good enough to be attaching any sort of consequences much less any such decisions to its output. Value-added may not even be good enough even at the most basic, pragmatic level.”

Harris continues to disagree…

Just this month he released a technical report to his state’s school board (i.e., the Louisiana Board of Elementary and Secondary Education (BESE)), in which he (unfortunately) evidenced that he has not (yet) changed his scholarly stripes… even given the most recent research about VAMs’ increasingly apparent methodological, statistical, and pragmatic limitations (see, for example, here, here, here, here, and here), and the recent position statement released by the American Statistical Association underscoring the key points being evidenced across (most) educational research studies. See also the 24 articles published about VAMs in all American Educational Research Association (AERA) journals here, along with open-access links to the actual articles.

In this report Harris makes “Recommendations to Improve the Louisiana System of Accountability for Teachers, Leaders, Schools, and Districts,” the main one being that the state focus “more on student learning or growth—[by] specifically, calculating the predicted test scores and rewarding schools based on how well students do compared with those predictions.” The five recommendations (of his six total) that pertain to our interests here, in more detail and in support, include the following:

1. “Focus more on student growth [i.e., value-added] in order to better measure the performance of schools.” Not that there is any research evidence in support, but “The state should [also] aim for a 50-50 split between growth and achievement levels [i.e., not based on value-added].” Doing this at the school accountability level “would also improve alignment with teacher accountability, which includes student growth [i.e., teacher-level value-added] as 50% of the evaluation.”

2. “Reduce uneven incentives and avoid “incentive cliffs” by increasing [school performance score] points more gradually as students move to higher performance levels,” notwithstanding the fact that no research to date has evidenced that such incentives incentivize much of anything intended, at least in education. Regardless, and despite the research, “Giving more weight to achievement growth [will help to create] more even [emphasis added] incentives (see Recommendation #1).”

3. Related, “Create a larger number of school letter grades [to] create incentives for all schools to improve,” by adding +/- extensions to the school letter grades, because “[i]f there were more categories, the next [school letter grade] level would always be within reach…. This way all schools will have an incentive to improve, whereas currently only those who are at the high end of the B-D categories have much incentive.” If only the real world of education worked as informed by simple idioms, like those simplifying the theories supporting incentives (e.g., the carrot dangled just beyond the mule’s reach will make the mule pull the cart harder).

5. “Eliminate the first over-ride provision in the teacher accountability system, which automatically places teachers who are “Ineffective” on either measure in the “Ineffective” performance category.” With this recommendation I fully agree, as Louisiana is one of the most extreme states when it comes to attaching consequences to problematic data, although I don’t think Harris would agree with my “problematic” classification. This change would mean that “teachers who appear highly effective on one measure could not end up in the “Ineffective” category,” which for this state would certainly be a step in the right direction. Harris’s accompanying assertion, however, that doing this would also help prevent principals from saving truly ineffective teachers (e.g., by countering teachers’ value-added scores with artificially inflated or allegedly fake observational scores), I find insulting on behalf of principals as professionals.

6. “Commission a full-scale third party evaluation of the entire accountability system focused on educator responses and student outcomes.” With this recommendation, I also fully agree under certain conditions: (1) the external evaluator is indeed external to the system and has no conflicts of interest, including financial (even prior to payment for the external review), (2) that which the external evaluator is to investigate is informed by the research in terms of the framing of the questions that need to be asked, (3) as also recommended by Harris, that perspectives of those involved (e.g., principals and teachers) are included in the evaluation design, and (4) all parties formally agree to releasing all data regardless of what (positive or negative) the external evaluator might evidence and find.

Harris’s additional details and “other, more modest recommendations” include the following:

  • Keep “value-added out of the principal [evaluation] measure,” but “the state should consider calculating principal value-added measures and issuing reports that describe patterns of variation (e.g., variation in performance overall [in] certain kinds of schools) both for the state as a whole and specific districts.” This reminds me of the time that value-added measures for teachers were to be used only for descriptive purposes. While noble as a recommendation, we know from history what policymakers can do once the data are made available.
  • “Additional Teacher Accountability Recommendations” start on page 11 of this report, although all of these (unfortunately, again) focus on value-added model twists and tweaks (e.g., how to adjust for ceiling effects for schools and teachers with disproportionate numbers of gifted/high-achieving students, how to watch and account for bias) to make the teacher value-added model even better.

Harris concludes that “With these changes, Louisiana would have one of the best accountability systems in the country. Rather than weakening accountability, these recommendations [would] make accountability smarter and make it more likely to improve students’ academic performance.” Following these recommendations would “make the state a national leader.” While Harris cites 20 years of failed attempts in Louisiana and in states across the country as the reason America’s public education system has not improved its public school students’ academic performance, I’d argue it’s more like 40 years of failed attempts, because Harris’s (and so many others’) accountability-bent logic is seriously flawed.

Houston, We Have A Problem: New Research Published about the EVAAS

New VAM research was recently published in the peer-reviewed journal Education Policy Analysis Archives, titled “Houston, We Have a Problem: Teachers Find No Value in the SAS Education Value-Added Assessment System (EVAAS®).” This article was published by a former doctoral student of mine — Clarin Collins — now a researcher at a large non-profit. I asked her to write a guest post for you all summarizing the full study (linked again here). Here is what she wrote.

As someone who works in the field of philanthropy, completed a doctoral program more than two years ago, and recently became a new mom, why would I work on an academic publication and write about it here as a guest blogger? My motivation is simple: the teachers. Teachers continue to be at the crux of the national education reform efforts, as they are blamed for the nation’s failing education system and student academic struggles. National and state legislation has been created and implemented as believed remedies to “fix” this problem by holding teachers accountable for student progress as measured by achievement gains.

While countless researchers have highlighted the faults of teacher accountability systems and growth models (unfortunately falling on the deaf ears of those mandating such policies), very rarely are teachers asked how such policies play out in practice, or asked for their opinions, so that their voices are represented in all of this. The goal of this research, therefore, was first, to see how one such teacher evaluation policy is playing out in practice and second, to give voice to marginalized teachers, those who are at the forefront of these new policy initiatives. That being said, while I encourage you to check out the full article [linked again here], I highlight key findings in this summary, using the words of teachers as often as possible to permit them, really, to speak for themselves.

In this study I examined the SAS Education Value-Added Assessment System (EVAAS) in practice, as perceived and experienced by teachers in the Southwest School District (SSD). SSD [a pseudonym] is using EVAAS for high-stakes consequences more than any other district or state in the country. I used a mixed-method design including a large-scale electronic survey to investigate the model’s reliability and validity; to determine whether teachers used the EVAAS data in formative ways as intended; to gather teachers’ opinions on EVAAS’s claimed benefits and statements; and to understand the unintended consequences that might have also occurred as a result of EVAAS use in SSD.

Results revealed that, in terms of reliability, the EVAAS model produced split and inconsistent results among teacher participants regardless of subject or grade level taught. As one teacher stated, “In three years, I was above average, below average and average.” Teachers indicated that it was the students and their varying background demographics that biased their EVAAS results, and that much of what was demonstrated via their scores was beyond teachers’ control. “[EVAAS] depends a lot on home support, background knowledge, current family situation, lack of sleep, whether parents are at home, in jail, etc. [There are t]oo many outside factors – behavior issues, etc.” that apparently are not controlled or accounted for in the model.

Teachers reported dissimilar EVAAS and principal observation scores, reducing the criterion-related validity of both measures of teacher quality. Some even reported that principals changed their observation scores to match their EVAAS scores; “One principal told me one year that even though I had high [state standardized test] scores and high Stanford [test] scores, the fact that my EVAAS scores showed no growth, it would look bad to the superintendent.” Added another teacher, “I had high appraisals but low EVAAS, so they had to change the appraisals to match lower EVAAS scores.”

The majority of teachers disagreed with SAS’s marketing claims such as EVAAS reports are easy to use to improve instruction, and EVAAS will ensure growth opportunities for all students. Teachers called the reports “vague” and “unclear” and were “not quite sure how to interpret” and use the data to inform their instruction. As one teacher explained, she looked at her EVAAS report “only to guess as to what to do for the next group in my class.”

Many unintended consequences associated with the high-stakes use of EVAAS emerged through teachers’ responses, which revealed, among other things, that teachers felt heightened pressure and competition, which they believed reduced morale and collaboration and encouraged cheating or teaching to the test in an attempt to raise EVAAS scores. Teachers made comments such as, “To gain the highest EVAAS score, drill and kill and memorization yields the best results, as does teaching to the test,” and “When I figured out how to teach to the test, the scores went up,” as well as, “EVAAS leaves room for me to teach to the test and appear successful.”

Teachers realized this emphasis on test scores was detrimental for students, as one teacher wrote, “As a result of the emphasis on EVAAS, we teach less math, not more. Too much drill and kill and too little understanding [for the] love of math… Raising a generation of children under these circumstances seems best suited for a country of followers, not inventors, not world leaders.”

Teachers also admitted they are not collaborating to share best practices as much anymore: “Since the inception of the EVAAS system, teachers have become even more distrustful of each other because they are afraid that someone might steal a good teaching method or materials from them and in turn earn more bonus money. This is not conducive to having a good work environment, and it actually is detrimental to students because teachers are not willing to share ideas or materials that might help increase student learning and achievement.”

While I realize this body of work could simply add to “the shelves” along with those findings of other researchers striving to deflate and demystify this latest round of education reform, if nothing else, I hope the teachers who participated in this study know I am determined to let their true experiences, perceptions of their experiences, and voices be heard.

—–

Again, to find out more information including the statistics in support of the above assertions and findings, please click here to read the full study.
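As an aside in my own words (not Collins’): here is a minimal sketch, with a hypothetical data layout, of how one might quantify the kind of year-to-year inconsistency her teachers describe (e.g., “above average, below average and average” across three years), namely the share of adjacent-year pairs in which a teacher’s EVAAS effectiveness category changes.

```python
# A minimal sketch, with a hypothetical data layout, of quantifying year-to-year
# inconsistency: the share of adjacent-year pairs in which a teacher's EVAAS
# effectiveness category changes.
import pandas as pd

def category_instability(scores: pd.DataFrame) -> float:
    """`scores` has one row per teacher-year with columns
    ['teacher_id', 'year', 'evaas_category']."""
    scores = scores.sort_values(["teacher_id", "year"])
    changed = (
        scores.groupby("teacher_id")["evaas_category"]
        .apply(lambda s: (s != s.shift()).iloc[1:])  # compare each year to the prior one
    )
    return float(changed.mean())  # e.g., 0.5 would mean categories change half the time
```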

The Nation’s High School Principals’ (Working) Position Statement on VAMs

The Board of Directors of the National Association of Secondary School Principals (NASSP) officially released a working position statement on VAMs, which was also recently referenced in an article in Education Week here (“Principals’ Group Latest to Criticize ‘Value Added’ for Teacher Evaluations“) and a Washington Post post here (“Principals Reject ‘Value-Added’ Assessment that Links Test Scores to Educators’ Jobs“).

I have pasted this statement below, and also link to it here. The position’s highlights follow, as also summarized in the above links and in the position statement itself:

  • “[T]est-score-based algorithms for measuring teacher quality aren’t appropriate.”
  • “[T]he timing for using [VAMs] comes at a terrible time, just as schools adjust to demands from the Common Core State Standards and other difficult new expectations for K-12 students.”
  • “Principals are concerned that the new evaluation systems are eroding trust and are detrimental to building a culture of collaboration and continuous improvement necessary to successfully raise student performance to college and career-ready levels.”
  • “Value-added systems, the statement concludes, should be used to measure school improvement and help determine the effectiveness of some programs and instructional methods; they could even be used to tailor professional development. But they shouldn’t be used to make “key personnel decisions” about individual teachers.”
  • “[P]rincipals often don’t use value-added data even where it exists, largely because a lot of them don’t trust it.”
  • The position statement also quotes Mel Riddile, a former National Principal of the Year and chief architect of the NASSP statement, who says: “We are using value-added measurement in a way that the science does not yet support. We have to make it very clear to policymakers that using a flawed measurement both misrepresents student growth and does a disservice to the educators who live the work each day.”

See also two other great blog posts re: the potential impact the NASSP’s working statement might/should have on America’s current VAM situation. The first external post comes from the blog “curmudgucation” and discusses in great detail the highlights of the NASSP’s post. The second external post comes from a guest post on Diane Ravitch’s blog.

Below, again, is the full post as per the website of the NASSP:

Purpose

To determine the efficacy of the use of data from student test scores, particularly in the form of Value-Added Measures (VAMs), to evaluate and to make key personnel decisions about classroom teachers.

Issue

Currently, a number of states either are adopting or have adopted new or revamped teacher evaluation systems, which are based in part on data from student test scores in the form of value-added measures (VAM). Some states mandate that up to fifty percent of the teacher evaluation must be based on data from student test scores. States and school districts are using the evaluation systems to make key personnel decisions about retention, dismissal and compensation of teachers and principals.

At the same time, states have also adopted and are implementing new, more rigorous college- and career standards. These new standards are intended to raise the bar from having every student earn a high school diploma to the much more ambitious goal of having every student be on-target for success in post-secondary education and training.

The assessments accompanying these new standards depart from the old, much less expensive, multiple-choice style tests to assessments, which include constructed responses. These new assessments demand higher-order thinking and up to a two-year increase in expected reading and writing skills. Not surprisingly, the newness of the assessments combined with increased rigor has resulted in significant drops in the number of students reaching “proficient” levels on assessments aligned to the new standards.

Herein lies the challenge for principals and school leaders. New teacher evaluation systems demand the inclusion of student data at a time when scores on new assessments are dropping. The fears accompanying any new evaluation system have been magnified by the inclusion of data that will get worse before it gets better. Principals are concerned that the new evaluation systems are eroding trust and are detrimental to building a culture of collaboration and continuous improvement necessary to successfully raise student performance to college and career-ready levels.

Specific questions have arisen about using value-added measurement (VAM) to retain, dismiss, and compensate teachers. VAMs are statistical measures of student growth. They employ complex algorithms to figure out how much teachers contribute to their students’ learning, holding constant factors such as demographics. And so, at first glance, it would appear reasonable to use VAMs to gauge teacher effectiveness. Unfortunately, policy makers have acted on that impression over the consistent objections of researchers who have cautioned against this inappropriate use of VAM.

In a 2014 report, the American Statistical Association urged states and school districts against using VAM systems to make personnel decisions. A statement accompanying the report pointed out the following:

  • “VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  • VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  • Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  • VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  • Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.”

Another peer-reviewed study funded by the Gates Foundation and published by the American Educational Research Association (AERA) stated emphatically, “Value-Added Performance Measures Do Not Reflect the Content or Quality of Teachers’ Instruction.” The study found that “state tests and these measures of evaluating teachers don’t really seem to be associated with the things we think of as defining good teaching.” It further found that some teachers who were highly rated on student surveys, classroom observations by principals and other indicators of quality had students who scored poorly on tests. The opposite also was true. “We need to slow down or ease off completely for the stakes for teachers, at least in the first few years, so we can get a sense of what do these things measure, what does it mean,” the researchers admonished. “We’re moving these systems forward way ahead of the science in terms of the quality of the measures.”

Researcher Bruce Baker cautions against using VAMs even when test scores count less than fifty percent of a teacher’s final evaluation.  Using VAM estimates in a parallel weighting system with other measures like student surveys and principal observations “requires that VAM be considered even in the presence of a likely false positive. NY legislation prohibits a teacher from being rated highly if their test-based effectiveness estimate is low. Further, where VAM estimates vary more than other components, they will quite often be the tipping point – nearly 100% of the decision even if only 20% of the weight.”

Stanford’s Edward Haertel takes the objection for using VAMs for personnel decisions one step further: “Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. The information they provide is simply not good enough to use in that way. It is not just that the information is noisy. Much more serious is the fact that the scores may be systematically biased for some teachers and against others, and major potential sources of bias stem from the way our school system is organized. No statistical manipulation can assure fair comparisons of teachers working in very different schools, with very different students, under very different conditions.”

Still other researchers believe that VAM is flawed at its very foundation. Linda Darling-Hammond et al. point out that the use of test scores via VAMs assumes “that student learning is measured by a given test, is influenced by the teacher alone, and is independent from the growth of classmates and other aspects of the classroom context. None of these assumptions is well supported by current evidence.” Other factors including class size, instructional time, home support, peer culture, summer learning loss impact student achievement. Darling-Hammond points out that VAMs are inconsistent from class to class and year to year. VAMs are based on the false assumption that students are randomly assigned to teachers. VAMs cannot account for the fact that “some teachers may be more effective at some forms of instruction…and less effective in others.”

Guiding Principles

  • As instructional leader, “the principal’s role is to lead the school’s teachers in a process of learning to improve teaching, while learning alongside them about what works and what doesn’t.”
  • The teacher evaluation system should aid the principal in creating a collaborative culture of continuous learning and incremental improvement in teaching and learning.
  • Assessment for learning is critical to continuous improvement of teachers.
  • Data from student test scores should be used by schools to move students to mastery and a deep conceptual understanding of key concepts as well as to inform instruction, target remediation, and to focus review efforts.
  • NASSP supports recommendations for the use of “multiple measures” to evaluate teachers as indicated in the 2014 “Standards for Educational and Psychological Testing” measurement standards released by leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).

Recommendations

  • Successful teacher evaluation systems should employ “multiple classroom observations across the year by expert evaluators looking to multiple sources of data, and they provide meaningful feedback to teachers.”
  • Districts and States should encourage the use of Peer Assistance and Review (PAR) programs, which use expert mentor teachers to support novice teachers and struggling veteran teachers and have been proven to be an effective system for improving instruction.
  • States and Districts should allow the use of teacher-constructed portfolios of student learning, which are being successfully used as a part of teacher evaluation systems in a number of jurisdictions.
  • VAMs should be used by principals to measure school improvement and to determine the effectiveness of programs and instructional methods.
  • VAMs should be used by principals to target professional development initiatives.
  • VAMs should not to be used to make key personnel decisions about individual teachers.
  • States and Districts should provide ongoing training for Principals in the appropriate use of student data and VAMs.
  • States and Districts should make student data and VAMs available to principals at a time when decisions about school programs are being made.
  • States and Districts should provide resources and time principals need in order to make the best use of data.
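—–

One aside, in my words and not the NASSP’s, on the weighting point Bruce Baker makes in the statement above (that where VAM estimates vary more than other components, they can become “nearly 100% of the decision even if only 20% of the weight”): here is a minimal sketch, using made-up numbers rather than any state’s actual data, of how a component with a 20% nominal weight can dominate a composite simply because it is noisier than the other components.

```python
# Made-up numbers, not any state's data: how a 20%-weighted but noisier VAM
# component can dominate a weighted composite of evaluation measures.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
observation = rng.normal(0, 1.0, n)  # principal observation scores (modest spread)
survey = rng.normal(0, 1.0, n)       # student survey scores (modest spread)
vam = rng.normal(0, 5.0, n)          # VAM estimates (much wider spread)

composite = 0.40 * observation + 0.40 * survey + 0.20 * vam

# Because the components are independent here, each one's share of the composite's
# variance is (weight^2 * component variance) / composite variance.
shares = {
    "observation": 0.40**2 * observation.var() / composite.var(),
    "survey": 0.40**2 * survey.var() / composite.var(),
    "vam": 0.20**2 * vam.var() / composite.var(),
}
print(shares)  # the VAM component supplies most of the composite's variance
```

In this toy example the VAM component, despite its 20% weight, supplies roughly three-quarters of the composite’s variance, which is exactly the “tipping point” dynamic Baker describes.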

Surveys + Observations for Measuring Value-Added

Following up on a recent post about the promise of Using Student Surveys to Evaluate Teachers using a more holistic definition of a teacher’s value added, I just read a chapter written by Ronald Ferguson — the creator of the Tripod student survey instrument and Tripod’s lead researcher — along with Charlotte Danielson — the creator of the Framework for Teaching and founder of The Danielson Group (see a prior post about this instrument here). Both instruments are “research-based,” both are used nationally and internationally, both are (increasingly being) used as key indicators to evaluate teachers across the U.S., and both were used throughout the Bill & Melinda Gates Foundation’s ($43 million worth of) Measures of Effective Teaching (MET) studies.

The chapter, titled “How Framework for Teaching and Tripod 7Cs Evidence Distinguish Key Components of Effective Teaching,” was recently published in a book all about the MET studies, titled “Designing Teacher Evaluation Systems: New Guidance from the Measures of Effective Teaching Project,” edited by Thomas Kane, Kerri Kerr, and Robert Pianta. The chapter is about whether and how data derived via the Tripod student survey instrument (i.e., as built on the 7Cs: challenging students, control of the classroom, teacher caring, teachers confer with students, teachers captivate their students, teachers clarify difficult concepts, teachers consolidate students’ concerns) align with the data derived via Danielson’s Framework for Teaching, to collectively capture teacher effectiveness.

Another purpose of this chapter is to examine how both indicators also align with teacher-level value-added. Ferguson (and Danielson) find that:

  • Their two measures (i.e., the Tripod and the Framework for Teaching) are more reliable (and likely more valid) than value-added measures. The over-time, teacher-level classroom correlations cited in this chapter are r = 0.38 for value-added (which is comparable with the correlations noted in plentiful studies elsewhere), r = 0.42 for the Danielson Framework, and r = 0.61 for the Tripod student survey component. These “clear correlations,” while not strong, particularly in terms of value-added, do indicate there is some common signal that the indicators are capturing, some more strongly than others (as should be obvious given the above numbers).
  • Contrary to what some (softies) might think, classroom management, not caring (i.e., the extent to which teachers care about their students and what their students learn and achieve), is the strongest predictor of a teacher’s value-added. However, the correlation (i.e., the strongest of the bunch) is still quite “weak” at an approximate r = 0.26, even though it is statistically significant. Caring, rather, is the strongest predictor of whether students are happy in their classrooms with their teachers.
  • In terms of “predicting” teacher-level value-added, the 7Cs that also matter “most” next to classroom management (although none of the coefficients are as strong as we might expect [i.e., r < 0.26]) include the extent to which teachers challenge their students and have control over their classrooms.
  • Value-added in general is more highly correlated with teachers at the extremes in terms of their student survey and observational composite indicators.

In the end, the authors of this chapter do not disclose the actual correlations between their two measures and value-added (although from the appendix one can infer that the correlation between value-added and Tripod output is around r = 0.45, based on an unadjusted r-squared). I should mention that this is a HUGE shortcoming of the chapter, one that would not have passed peer review had the chapter been submitted to a journal for publication. The authors do mention that “the conceptual overlap between the frameworks is substantial and that empirical patterns in the data show similarities.” Unfortunately again, however, they do not quantify the strength of said “similarities.” This leaves us to assume that, since they were not reported, the actual strengths of the similarities empirically observed were likely low (as is also evidenced in many other studies, although not as often with student survey indicators as with observational indicators).
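For the record, the inference above is simple arithmetic: in a one-predictor regression, the correlation is the square root of the unadjusted R-squared, so an R-squared of roughly 0.20 (a back-calculated value consistent with r ≈ 0.45, not a figure reported explicitly in the chapter) implies the following:

```python
# The arithmetic behind the inference above: for a one-predictor regression,
# |r| equals the square root of the unadjusted R-squared. The 0.20 is an
# approximate, back-calculated value implied by r of about 0.45.
r_squared = 0.20
r = r_squared ** 0.5
print(round(r, 2))  # 0.45
```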

The final conclusion the authors of this chapter make is that educators should “cross-walk” the two frameworks (i.e., the Tripod and the Danielson Framework) and use both when reflecting on teaching. I must say I’m concerned about this recommendation as well, mainly because it will cost states and districts more $$$, and the returns or “added value” (using the grandest definition of this term) of engaging in such an approach lack the evidence one would need to adequately justify it.

Rothstein, Chetty et al., and VAM-Based Bias

Recall the Chetty et al. study that has been the focus of many posts on this blog (see for example here, here, and here)? The study was cited in President Obama’s 2012 State of the Union address, when Obama said, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.” Well, this study is at the center of a new, and very interesting, VAM-based debate, again.

This time, new research conducted by Berkeley Associate Professor of Economics Jesse Rothstein provides evidence that puts the aforementioned Chetty et al. results in a different light. While Rothstein and others have written critiques of the Chetty et al. study before (see prior reviews here, here, here, and here), what Rothstein recently found (in his working, not-yet-peer-reviewed study here) is that the “teacher switching” statistical procedures Chetty et al. used masked evidence of bias in their prior study. While Chetty et al. have repeatedly claimed bias was not an issue (see for example a series of emails on this topic here), it seems indeed it was.

While Rothstein replicated Chetty et al.’s overall results using a similar dataset, Rothstein did not replicate Chetty et al.’s findings when it came to bias. As mentioned, Chetty et al. used a process of “teacher switching” to test for bias in their study, and by doing so found, with evidence, that bias did not exist in their value-added output. Rothstein found that when “teacher switching” is appropriately controlled, however, “bias accounts for about 20% of the variance in [VAM] scores.” This makes suspect, more now than before, Chetty et al.’s prior assertions that their model, and their findings, were immune to bias.

What this means, as per Rothstein, is that “teacher switching [the process used by Chetty et al.] is correlated with changes in students’ prior grade scores that bias the key coefficient toward a finding of no bias.” Hence, there was a reason Chetty et al. did not find bias in their value-added estimates: they did not use the proper statistical controls to detect bias in the first place. When properly controlled, or adjusted, the estimates yield “evidence of moderate bias;” hence, “[t]he association between [value-added] and long-run outcomes is not robust and quite sensitive to controls.”
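For readers who want to see the mechanics of the disagreement, here is a minimal sketch, with hypothetical column names and not the actual code of either party, of the teacher-switching comparison at issue: regress cohort-to-cohort changes in mean scores on changes in mean teacher value-added, and then add changes in the same cohorts’ prior-grade scores as a control. Rothstein’s point is that the value-added coefficient moves once prior-score changes are controlled, which is what he reads as evidence of moderate bias.

```python
# A minimal sketch (hypothetical column names; not Chetty et al.'s or Rothstein's
# actual code) of the teacher-switching comparison. `cells` holds one row per
# school-grade-year with year-over-year changes in cohort mean scores
# ('d_mean_score'), in mean teacher value-added ('d_mean_va'), and in the same
# cohorts' prior-grade mean scores ('d_mean_prior_score').
import pandas as pd
import statsmodels.formula.api as smf

def switching_coefficients(cells: pd.DataFrame):
    naive = smf.ols("d_mean_score ~ d_mean_va", data=cells).fit()
    controlled = smf.ols("d_mean_score ~ d_mean_va + d_mean_prior_score", data=cells).fit()
    # If the coefficient on d_mean_va shrinks once prior-score changes are
    # controlled, that is the pattern read here as evidence of bias.
    return naive.params["d_mean_va"], controlled.params["d_mean_va"]
```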

This has major implications in the sense that this makes suspect the causal statements also made by Chetty et al. and repeated by President Obama, the Vergara v. California judge, and others – that “high value-added” teachers caused students to ultimately realize higher long-term incomes, fewer pregnancies, etc. X years down the road. If Chetty et al. did not appropriately control for bias, which again Rothstein argues with evidence they did not, it is likely that students would have realized these “things” almost if not entirely regardless of their teachers or what “value” their teachers purportedly “added” to their learning X years prior.

In other words, students were likely not randomly assigned to classrooms in either the Chetty et al. or the Rothstein datasets (making these datasets comparable). So if the statistical controls used did not effectively “control for” the nonrandom assignment of students into classrooms, teachers may have been assigned high value-added scores not necessarily because they were high value-added teachers, but because they were non-randomly assigned higher-performing, higher-aptitude, etc. students in the first place and as a whole. Thereafter, they were given credit for the aforementioned long-term outcomes, regardless.

If the name Jesse Rothstein sounds familiar, it should. I have referenced his research in prior posts here, here, and here, as he is well known in the area of VAM research, in particular for a series of papers in which he provided evidence that students who are assigned to classrooms in non-random ways can yield biased, teacher-level value-added scores. If random assignment were the norm (i.e., whereby students are randomly assigned to classrooms and, ideally, teachers are randomly assigned to teach those classrooms of randomly assigned students), teacher-level bias would not be so problematic. However, given research I also recently conducted on this topic (see here), random assignment (at least in the state of Arizona) occurs 2% of the time, at best. Principals otherwise outright reject the notion, as random assignment is not viewed as being in “students’ best interests,” regardless of whether randomly assigning students to classrooms might mean “more accurate” value-added output as a result.

So it seems, we either get the statistical controls right (which I doubt is possible) or we randomly assign (which I highly doubt is possible). Otherwise, we are left wondering whether value-added analyses will ever work as per their intended (and largely ideal) purposes, especially when it comes to evaluating and holding accountable America’s public school teachers for their effectiveness.

—–

In case you’re interested, Chetty et al. have responded to Rothstein’s critique. Their full response can be accessed here. Not surprisingly, they first highlight that Rothstein (and another set of their colleagues at Harvard) replicated their results. That “value-added (VA) measures of teacher quality show very consistent properties across different settings” is what Chetty et al. focus on first and foremost. What they dismiss, however, is whether the main concerns raised by Rothstein threaten the validity of their methods and their conclusions. They also dismiss the fact that Rothstein already addressed their counterpoints, in Appendix B of his paper, given that Chetty et al. had shared their concerns with him prior to his study’s release.

Nonetheless, the concerns Chetty et al. attempt to counter are whether their “teacher-switching” approach was invalid, and whether the “exclusion of teachers with missing [value-added] estimates biased the[ir] conclusion[s]” as well. The extent to which missing data bias value-added estimates has also been discussed prior, namely when statisticians force the assumption in their analyses that missing data are “missing at random” (MAR), which is a difficult (although for some, like Chetty et al., necessary) assumption to swallow (see, for example, the Braun 2004 reference here).