Special Issue of “Educational Researcher” (Paper #6 of 9): VAMs as Tools for “Egg-Crate” Schools

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#6 of 9), which is actually an essay, titled “Will VAMS Reinforce the Walls of the Egg-Crate School?” This essay is authored by Susan Moore Johnson – Professor of Education at Harvard and somebody whom I had the privilege of interviewing in the past as an esteemed member of the National Academy of Education (see interviews here and here).

In this article, Moore Johnson argues that when policymakers use VAMs to evaluate, reward, or dismiss teachers, they may be perpetuating an egg-crate model, which is (referencing Tyack (1974) and Lortie (1975)) a metaphor for the compartmentalized school structure in which teachers (and students) work, most often in isolation. This model ultimately undermines the efforts of all involved in the work of schools to build capacity school-wide, and to excel as a school given educators’ individual and collective efforts.

Contrary to the primary logic supporting VAM use, however, “teachers are not inherently effective or ineffective” on their own. Rather, their collective effectiveness is related to their professional development, which may be stunted when they work alone, “without the benefit of ongoing collegial influence” (p. 119). VAMs, then, and unfortunately, can cause teachers and administrators to (hyper)focus “on identifying, assigning, and rewarding or penalizing individual [emphasis added] teachers for their effectiveness in raising students’ test scores [which] depends primarily on the strengths of individual teachers” (p. 119). What comes along with this is a series of interrelated egg-crate behaviors including, but not limited to, increased competition, lack of collaboration, and increased independence versus interdependence, all of which can lead to decreased morale and, in turn, decreased effectiveness.

Conversely, students are much “better served when human resources are deliberately organized to draw on the strengths of all teachers on behalf of all students, rather than having students subjected to the luck of the draw in their classroom assignment[s]” (p. 119). Likewise, “changing the context in which teachers work could have important benefits for students throughout the school, whereas changing individual teachers without changing the context [as per VAMs] might not [work nearly as well] (Lohr, 2012)” (p. 120). Teachers learning from their peers, working and teaching in teams, co-planning, collaborating, being mentored by more experienced teachers, and mentoring others should be valued much more, as the research warrants, yet they are not, given the very nature of VAM use.

Hence, there are unintended consequences that can also come along with the (hyper)use of individual-level VAMs. These include, but are not limited to, the following:

  1. Teachers are more likely to “literally or figuratively ‘close their classroom door’ and revert to working alone…[This]…affect[s] current collaboration and shared responsibility for school improvement, thus reinforcing the walls of the egg-crate school” (p. 120).
  2. Due to bias, or the possibility that teachers might be unfairly evaluated given the types of students non-randomly assigned into their classrooms, teachers might avoid teaching high-needs students if they perceive themselves to be “at greater risk” of teaching students they cannot grow.
  3. This can perpetuate isolative behaviors, as well as behaviors that encourage teachers to protect themselves first, and above all else.
  4. “Therefore, heavy reliance on VAMS may lead effective teachers in high-need subjects and schools to seek safer assignments, where they can avoid the risk of low VAMS scores” (p. 120).
  5. “[M]eanwhile, some of the most challenging teaching assignments would remain difficult to fill and likely be subject to repeated turnover, bringing steep costs for students” (p. 120).
  6. “[U]sing VAMS to determine a substantial part of the teacher’s evaluation or pay [also] threatens to sidetrack the teachers’ collaboration and redirect the effective teacher’s attention to the students on his or her roster” (p. 120-121), versus students, for example, on other teachers’ rosters who might also benefit from that teacher’s content-area or other expertise.
  7. Likewise, “[u]sing VAMS to make high-stakes decisions about teachers also may have the unintended effect of driving skillful and committed teachers away from the schools that need them most and, in the extreme, causing them to leave the profession” in the end (p. 121).

I should add, though, and in all fairness given the Review of Paper #3 – on VAMs’ potentials here – that many of these assertions are somewhat hypothetical, in the sense that they are based on the grander literature surrounding teachers’ working conditions rather than on the direct, unintended effects of VAMs, given that no research yet exists to examine the above, or other unintended effects, empirically. “There is as yet no evidence that the intensified use of VAMS interferes with collaborative, reciprocal work among teachers and principals or sets back efforts to move beyond the traditional egg-crate structure. However, the fact that we lack evidence about the organizational consequences of using VAMS does not mean that such consequences do not exist” (p. 123).

The bottom line is that we do not want to prevent the school organization from becoming “greater than the sum of its parts…[so that]…the social capital that transforms human capital through collegial activities in schools [might increase] the school’s overall instructional capacity and, arguably, its success” (p. 118). Hence, as Moore Johnson argues, we must adjust the focus “from the individual back to the organization, from the teacher to the school” (p. 118), and from the egg-crate back to a much more holistic and realistic model capturing what it means to be an effective school, and what it means to be an effective teacher as an educational professional within one. “[A] school would do better to invest in promoting collaboration, learning, and professional accountability among teachers and administrators than to rely on VAMS scores in an effort to reward or penalize a relatively small number of teachers” (p. 122).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; and see the Review of Article #5 – on teachers’ perceptions of observations and student growth here.

Article #6 Reference: Moore Johnson, S. (2015). Will VAMS reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117-126. doi:10.3102/0013189X15573351

New Research Study: Controlling for Student Background Variables Matters

An article about the “Sensitivity of Teacher Value-Added Estimates to Student and Peer Control Variables” was recently published in the peer-reviewed Journal of Research on Educational Effectiveness. While this article is not open-access, here is a link to the earlier version released by Mathematica, which nearly mirrors the final published article.

In this study, researchers Matthew Johnson, Stephen Lipscomb, and Brian Gill, all of whom are associated with Mathematica, examined the sensitivity and precision of various VAMs, the extent to which their estimates vary depending on whether modelers include student- and peer-level background characteristics as control variables, and the extent to which their estimates vary depending on whether modelers include one or more years of students’ prior achievement scores, also as control variables. They did this while examining state data, as compared with what they called District X – a district within the state with three times as many African-American students, twice as many students receiving free or reduced-price lunch, and generally more special education students than the state average. While the state data included more students, the district data included additional control variables, supporting the researchers’ analyses.

Here are the highlights, with thanks to lead author Matthew Johnson for edits and clarifications.

  • Different VAMs produced similar results overall [when using the same data], almost regardless of specifications. “[T]eacher estimates are highly correlated across model specifications. The correlations [they] observe[d] in the state and district data range[d] from 0.90 to 0.99 relative to [their] baseline specification.”

This has been evidenced in the literature before, when the same models are applied to the same datasets taken from the same sets of tests at the same time. Hence, many critics argue that similar results come about when using different VAMs on the same data because, when using the same fallible, large-scale standardized test data, even the most sophisticated models are processing “garbage in” and “garbage out.” When the tests inserted into the same VAM vary, however, even if the tests are measuring the same constructs (e.g., mathematics learning in grade X), things go haywire. For more information about this, please see Papay (2010) here.

  • However, “even correlation coefficients above 0.9 do not preclude substantial amounts of potential misclassification of teachers across performance categories.” The researchers also found that, even with such consistencies, 26% of teachers rated in the bottom quintile were placed in higher performance categories under an alternative model (see the illustrative sketch just after this list).
  • Modeling choices impacted the rankings of teachers in District X in “meaningful” ways, given District X’s varying student demographics compared with those in the state overall. In other words, VAMs that do not include all relevant student characteristics can penalize teachers in districts that serve students who are more disadvantaged than statewide averages.
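To illustrate why a correlation above 0.9 still leaves room for this kind of misclassification, here is a minimal simulation sketch. It is my illustration only, not the authors’ code or data; the number of teachers, the noise levels, and the quintile cutoffs are all assumptions chosen simply to produce a correlation in the reported range.

```python
import numpy as np

# Minimal, illustrative sketch (not the study's code or data): simulate teacher
# estimates from two VAM specifications that correlate above 0.9, then count how
# many teachers in the bottom quintile under model A land in a higher quintile
# under model B. All numbers here are invented for illustration.
rng = np.random.default_rng(0)
n_teachers = 5000
true_effect = rng.normal(0, 1, n_teachers)

# Each specification adds its own noise; the noise scale is chosen so that the
# two sets of estimates correlate at roughly 0.9 or above.
score_a = true_effect + rng.normal(0, 0.3, n_teachers)
score_b = true_effect + rng.normal(0, 0.3, n_teachers)
print(f"correlation between specifications: {np.corrcoef(score_a, score_b)[0, 1]:.2f}")

# Quintile ratings under each specification (0 = bottom quintile).
quintile_a = np.digitize(score_a, np.quantile(score_a, [0.2, 0.4, 0.6, 0.8]))
quintile_b = np.digitize(score_b, np.quantile(score_b, [0.2, 0.4, 0.6, 0.8]))

bottom_a = quintile_a == 0
moved_up = np.mean(quintile_b[bottom_a] > 0)
print(f"bottom-quintile teachers (model A) rated higher by model B: {moved_up:.0%}")
```

Even with estimates this strongly correlated, a sizable share of “bottom” teachers changes categories simply because the performance cut points sit where the estimates are dense, which is the authors’ point about misclassification.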

See an article related to whether VAMs can include all relevant student characteristics, also given the non-random assignment of students to teachers (and teachers to classrooms) here.

*****

Original Version Citation: Johnson, M., Lipscomb, S., & Gill, B. (2013). Sensitivity of teacher value-added estimates to student and peer control variables. Princeton, NJ: Mathematica Policy Research.

Published Version Citation: Johnson, M., Lipscomb, S., & Gill, B. (2015). Sensitivity of teacher value-added estimates to student and peer control variables. Journal of Research on Educational Effectiveness, 8(1), 60-83. doi:10.1080/19345747.2014.967898

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video out before, but following up on another post about the potential impact these videos can really have, I thought now would be an appropriate time to share it. “We can be the change,” and social media can help.

My former doctoral student and I put together this video after conducting a study with teachers in the Houston Independent School District – more specifically, four teachers whose contracts were not renewed in the summer of 2011, due in large part to their EVAAS scores. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools following the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Harvard Economist Deming on VAM-Based Bias

David Deming – an Associate Professor of Education and Economics at Harvard – just published, in the esteemed American Economic Review, an article about VAM-based bias, in this case when VAMs are used to measure school-level rather than teacher-level effects.

Deming appropriately situated his study within the prior works on this topic, including the key works of Thomas Kane (Education and Economics at Harvard) and Raj Chetty (Economics at Harvard). These two, most notably, continue to advance assertions that using students’ prior test scores and other covariates (i.e., to statistically control for students’ demographic/background factors) minimizes VAM-based bias to negligible levels. Deming also situated his study within the notable works of Jesse Rothstein (Public Policy and Economics at the University of California, Berkeley), who continues to show evidence that VAM-based bias really does exist. The research of these three key players, along with their scholarly disagreements, has also been highlighted in prior posts about VAM-based bias on this blog (see, for example, here and here).

In this study to test for bias, though, Deming used data from Charlotte-Mecklenburg, North Carolina – a data set derived from a district in which there was quasi-random assignment of students to schools (given a school choice initiative). With these data, Deming tested whether VAM-based bias was evident across a variety of common VAM approaches, from the least sophisticated VAM (e.g., one year of prior test scores and no other covariates) to the most (e.g., two or more years of prior test score data plus various covariates).

Overall, Deming failed to reject the hypothesis that school-level effects as measured using VAMs are unbiased, almost regardless of the VAM being used. In more straightforward terms, Deming found that school effects as measured using VAMs were rarely if ever biased when compared to his quasi-randomized samples. Hence, this work falls in line with prior works countering assertions that bias really does exist (Note: this is a correction from the prior post).
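For readers who want the mechanics of this kind of test made concrete, here is a minimal simulation sketch of the forecast-bias logic. It is my illustration only, not Deming’s actual specification or data; the model, the sorting strength, and every parameter below are invented for the example.

```python
import numpy as np

# Minimal sketch of the forecast-bias logic (not Deming's model or data):
# Step 1: estimate each school's "value added" from an observational sample in
#         which students sort into schools on their prior scores, using a
#         prior-score control plus a school-level adjustment.
# Step 2: ask whether those estimates forecast the outcomes of (quasi-)randomly
#         assigned students one-for-one; a coefficient near 1 suggests the
#         school effects are not badly biased. All parameters are illustrative.
rng = np.random.default_rng(1)
n_schools, per_school = 200, 200
school_effect = rng.normal(0, 0.2, n_schools)

def simulate(sorting: float):
    """Student prior scores and outcomes; `sorting` sets how strongly
    higher-prior students cluster in higher-effect schools."""
    prior = rng.normal(sorting * school_effect[:, None], 1.0, (n_schools, per_school))
    outcome = school_effect[:, None] + 0.7 * prior + rng.normal(0, 0.5, (n_schools, per_school))
    return prior, outcome

# Step 1: VAM estimation on the sorted, observational sample.
prior_obs, out_obs = simulate(sorting=2.0)
prior_dm = prior_obs - prior_obs.mean(axis=1, keepdims=True)   # within-school demeaning
out_dm = out_obs - out_obs.mean(axis=1, keepdims=True)
beta = (prior_dm * out_dm).sum() / (prior_dm ** 2).sum()       # prior-score coefficient
vam = (out_obs - beta * prior_obs).mean(axis=1)                # school value-added estimates

# Step 2: forecast test on a (quasi-)randomly assigned validation sample.
prior_r, out_r = simulate(sorting=0.0)
adjusted = (out_r - beta * prior_r).mean(axis=1)
forecast_coef = np.polyfit(vam, adjusted, 1)[0]
print(f"forecast coefficient (near 1 suggests unbiased school effects): {forecast_coef:.2f}")
```

Here the coefficient comes out near 1 because the sorting in the simulated data runs entirely through the observed prior score, which the model controls for; if students instead sorted on determinants the model does not observe, the coefficient would drift away from 1, which is exactly the first caveat listed below.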

There are still, however, at least three reasons that could lead to bias in either direction (i.e., positive, in favor of school effects, or negative, underestimating school effects):

  • VAMs may be biased due to the non-random sorting of students into schools (and classrooms) “on unobserved determinants of achievement” (see also the work of Rothstein, here and here).
  • If “true” school effects vary over time (independent of error), then test-based forecasts based on prior cohorts’ test scores (as is common when measuring the difference between predictions and “actual” growth, when calculating value-added) may be poor predictors of future effectiveness.
  • When students self-select into schools, the impact of attending a school may be different for students who self-select in than for students who do not. The same thing likely holds true for classroom assignment practices, although that is my extrapolation, not Deming’s.

In addition, and in Deming’s overall conclusions that also pertain here, “many other important outcomes of schooling are not measured here. Schools and teachers [who] are good at increasing student achievement may or may not be effective along other important dimensions” (see also here).

For all of these reasons, “we should be cautious before moving toward policies that hold schools accountable for improving their ‘value added’” given bias.

No Teacher Is An Island

This week in The Shanker Blog, authors Alan Daly (Professor, University of California, San Diego) and Kara Finnigan (Associate Professor, University of Rochester) published a piece titled: No Teacher Is An Island: The Role Of Social Relations In Teacher Evaluation.

They discuss, as largely based on their research and expertise in social network analyses, the roles of social interactions when examining student outcomes (i.e., student outcomes that are to be directly attributed to teacher effects using value-added models).

They also discuss three major assumptions surrounding the use of value-added measures to assess teacher quality. The first assumption is that growth in student achievement is the result of (really only) interaction(s) among teacher knowledge/training/experience, teachers’ abilities to teach, students’ prior performance levels, and student demographics. Once that assumption is agreed to, the second assumption is that all of these variables can be captured (well), or controlled for (well), using a quantitative or numerical measure. It is then assumed, more generally, that “a teacher’s ability to ‘add-value’ [can be appropriately captured as] a very individualistic undertaking determined almost exclusively by the human capital (i.e., training, knowledge, and skills) of the individual teacher and some basic characteristics of the student.”

As they explain in this piece, these assumptions overlook recent research, as well as reality. They also provide two real-world examples (with graphics to help illustrate how these interactions actually look, which I also advise readers to examine here). The first real-world example captures a teacher who “enters a grade level or department in which trust is low and teachers do not share or collaborate around effective practices, innovative ideas or instructional resources, all of which have been shown to support student achievement.” The second real-world example captures a teacher who “enters a department in which teachers actively collaborate, exchange ideas, develop common assessments and reflect on practice – in short, a faculty that operates as a professional learning community.”

Even though these teachers might be teaching two miles from one another, as they are in the case used to illustrate this point, “the first teacher is ‘disadvantaged’ because he/she was not able to learn from colleagues and, as a result, [appears to be] less equipped to provide effective instruction to students. In contrast, in the second scenario, a similarly skilled teacher, one who has benefited from rich exchanges with peers, [appears to have the capacity] to add more ‘value’ based on increased access to effective instructional practices and support from colleagues, as well as many other relational resources such as emotional support or mentorship.”

While these two teachers, with very different professional (and likely personal) realities, vary greatly, the value-added models used to evaluate them will not really vary at all; nor can the models capture all that interacts with their effectiveness, every single day of every year they teach.

In the models, such teachers will vary only by the types of schools in which they teach, largely given the varying backgrounds of the students they teach and the “prior performance” numerically captured in the model (as mentioned). This, it is assumed, effectively captures all of these other “things,” or data nuances (and nuisances), as is often perceived.

This all continues to occur despite “the social milieu” that always surrounds teachers’ professional practice, which these authors argue in this article “play[s] a crucial role,” and whose omission, in their view, might be the most significant shortcoming of many/most/all value-added models.

Do read more here.


Morality, Validity, and the Design of Instructionally Sensitive Tests by David Berliner

A recent article in Education Week was written by David C. Berliner, Regents Professor Emeritus at Arizona State University. As pertinent to our purposes here, I have included his piece in full below, as well as a direct link to the piece here.

Please pay particular attention to the points about (1) “instructional sensitivity” and why this relates to the tests we currently use to calculate value-added, (2) attribution and whether valid inferences can be made about teachers as isolated from other competing variables, and (3) teachers’ actual potential to have “effects” on test scores versus “effects” on students’ lives otherwise.

Moral Reasons for Using Appropriate Tests to Evaluate Teachers and Schools

The first reason for caring about how sensitive our standardized tests are to instruction is moral. If the tests we use to judge the effects of instruction on student learning are not sensitive to differences in the instructional skills of teachers, then teachers will be seen as less powerful than they might actually be in affecting student achievement. This would not be fair. Thus, instructionally insensitive tests give rise to concerns about fairness, a moral issue.

Additionally, we need to be concerned about whether the scores obtained on instructionally insensitive tests are consequentially used, for example, to judge a teacher’s performance, with the possibility of the teacher being fired or rewarded. If that is the case, then we move from the moral issue of fairness in trying to assess the contributions of teachers to student achievement, to the psychometric issue of test validity: What inference can we make about teachers from the scores students get on a typical standardized test?

Validity Reasons for Using Appropriate Tests to Evaluate Teachers and Schools

What does a change in a student’s test score over the course of a year actually mean? To whom or to what do we attribute the changes that occur? If the standardized tests we use are not sensitive to instruction by teachers, yet still show growth in achievement over a year, the likely causes of such growth will be attributed to other influences on our nation’s students. These would be school factors other than teachers – say, qualities of the peer group, or the textbook, or the principal’s leadership. Or such changes might be attributed to outside-of-school factors, such as parental involvement in schooling and homework, income and social class of the neighborhood in which the child lives, and so forth.

Currently, all the evidence we have is that teachers are not particularly powerful sources of influence on aggregate measures of student achievement such as mean scores of classrooms on standardized tests. Certainly teachers do, occasionally and to some extent, affect the test scores of everyone in a class (Pedersen, Faucher, & Eaton, 1978; Barone, 2001). And teachers can make a school or a district look like a great success based on average student test scores (Casanova, 2010; Kirp, 2013). But exceptions do not negate the rule.

Teachers Account for Only a Little Variance in Students’ Test Scores

Teachers are not powerful forces in accounting for the variance we see in the achievement test scores of students in classrooms, grades, schools, districts, states and nations. Teachers, it turns out, affect individuals a lot more than they affect aggregate test scores, say, the means of classrooms, schools or districts.

A consensus is that outside-of-school factors account for about 60% of the variance in student test scores, while schools account for about 20% of that variance (Haertel, 2013; Borman and Dowling, 2012; Coleman et al., 1966). Further, about half of the variance accounted for by schools is attributed to teachers. So, on tests that may be insensitive to instruction, teachers appear to account for about 10% of the variance we see in student achievement test scores (American Statistical Association, 2014). Thus outside-of-school factors appear 6 times more powerful than teachers in affecting student achievement.

How Instructionally Sensitive Tests Might Help

What would teacher effects on student achievement test scores be were tests designed differently? We don’t know because we have no information about the sensitivity of the tests currently used to detect teacher differences in instructional competence. Teachers judged excellent might be able to screen items for instructional sensitivity during test design. That might be helpful. Even better, I think, might be cognitive laboratories, in which teachers judged to be excellent provide instruction to students on curriculum units appropriate for a grade. The test items showing pre-post gains–items empirically found to be sensitive to instruction–could be chosen for the tests, while less sensitive items would be rejected.

Would the percent of variance attributed to teachers be greater if the tests used to judge teachers were more sensitive to instruction? I think so. Would the variance accounted for by teachers be a lot greater? I doubt that. But if the variance accounted for by teachers went up from 10% to 15%, then teacher effects would be estimated to be 50% greater than currently. And that is the estimate of teachers’ effects over just one year. Over twelve years, teachers clearly can play an influential role on aggregate data, as well as continuing to be a powerful force on their students, at an individual level. In sum, only with instructionally sensitive tests can we be fair to teachers and make valid inferences about their contributions to student growth. 
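To make the variance arithmetic in Berliner’s piece explicit, here is a minimal worked sketch. It is my illustration, not Berliner’s; the percentages are simply the rough consensus figures he cites above.

```python
# Minimal worked version of the variance arithmetic above (my illustration,
# not Berliner's; the percentages are the rough consensus figures he cites).
outside_school = 0.60    # share of test-score variance from outside-of-school factors
school = 0.20            # share attributable to schools
teacher = school / 2     # about half of the school share is attributed to teachers

print(f"teacher share of variance: {teacher:.0%}")                          # ~10%
print(f"outside-of-school vs. teacher: {outside_school / teacher:.0f}x")    # ~6x

# Berliner's hypothetical: if instructionally sensitive tests raised the teacher
# share from 10% to 15%, the estimated teacher effect would be
# (0.15 - 0.10) / 0.10 = 50% larger than it is now.
print(f"relative increase, 10% -> 15%: {(0.15 - 0.10) / 0.10:.0%}")
```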

Reference: Berliner, D. C. (2014). Morality, validity, and the design of instructionally sensitive tests. Education Week. Retrieved from http://blogs.edweek.org/edweek/assessing_the_assessments/2014/06/morality_validity_and_the_design_of_instructionally_sensitive_tests.html

VAMs and the “Dummies In Charge:” A Clever “Must Read”

Peter Greene wrote a very clever, poignant, and to-the-point article about VAMs, titled “VAMs for Dummies,” in The Blog of The Huffington Post. While I already tweeted, facebooked, and shared it – everything short of printing this one out for my files – I thought it imperative that I also share it with you. Also – Greene gave us at VAMboozled a wonderful shout-out directing readers here to find out more. So Peter, even though I’ve never met you, thanks for the kudos and keep it coming. This is a fabulous piece!

Click here to read his piece in full. I’ve also pasted it below, mainly because this one is a keeper. See also a link to a “nifty” 3-minute video on VAMs below.

If you don’t spend every day with your head stuck in the reform toilet, receiving the never-ending education swirly that is school reformy stuff, there are terms that may not be entirely clear to you. One is VAM — Value-Added Measure.

VAM is a concept borrowed from manufacturing. If I take one dollar’s worth of sheet metal and turn it into a lovely planter that I can sell for ten dollars, I’ve added nine dollars of value to the metal.

It’s a useful concept in manufacturing management. For instance, if my accounting tells me that it costs me ten dollars in labor to add five dollars of value to an object, I should plan my going-out-of-business sale today.

And a few years back, when we were all staring down the NCLB law requiring that 100 percent of our students be above average by this year, it struck many people as a good idea — let’s check instead to see if teachers are making students better. Let’s measure if teachers have added value to the individual student.

There are so many things wrong with this conceptually, starting with the idea that a student is like a piece of manufacturing material and continuing on through the reaffirmation of the school-is-a-factory model of education. But there are other problems as well.

1) Back in the manufacturing model, I knew how much value my piece of metal had before I started working my magic on it. We have no such information for students.

2) The piece of sheet metal, if it just sits there, will still be a piece of sheet metal. If anything, it will get rusty and less valuable. But a child, left to its own devices, will still get older, bigger, and smarter. A child will add value on its own, out of thin air. Almost like it was some living, breathing sentient being and not a piece of raw manufacturing material.

3) All pieces of sheet metal are created equal. Any that are too not-equal get thrown in the hopper. On the assembly line, each piece of metal is as easy to add value to as the last. But here we have one more reformy idea predicated on the idea that children are pretty much identical.

How to solve these three big problems? Call the statisticians!

This is the point at which that horrifying formula that pops up in these discussions appears. Or actually, a version of it, because each state has its own special sauce when it comes to VAM. In Pennsylvania, our special VAM sauce is called PVAAS [i.e., the EVAAS in Pennsylvania]. I went to a state training session about PVAAS in 2009 and wrote about it for my regular newspaper gig. Here’s what I said about how the formula works at the time:

PVAAS uses a thousand points of data to project the test results for students. This is a highly complex model that three well-paid consultants could not clearly explain to seven college-educated adults, but there were lots of bars and graphs, so you know it’s really good. I searched for a comparison and first tried “sophisticated guess;” the consultant quickly corrected me–“sophisticated prediction.” I tried again–was it like a weather report, developed by comparing thousands of instances of similar conditions to predict the probability of what will happen next? Yes, I was told. That was exactly right. This makes me feel much better about PVAAS, because weather reports are the height of perfect prediction.

Here’s how it’s supposed to work. The magic formula will factor in everything from your socio-economics through the trends over the past X years in your classroom, throw in your pre-testy thing if you like, and will spit out a prediction of how Johnny would have done on the test in some neutral universe where nothing special happened to Johnny. Your job as a teacher is to get your real Johnny to do better on The Test than Alternate Universe Johnny would.

See? All that’s required for VAM to work is believing that the state can accurately predict exactly how well your students would have done this year if you were an average teacher. How could anything possibly go wrong??

And it should be noted — all of these issues occur in the process before we add refinements such as giving VAM scores based on students that the teacher doesn’t even teach. There is no parallel for this in the original industrial VAM model, because nobody anywhere could imagine that it’s not insanely ridiculous.

If you want to know more, the interwebs are full of material debunking this model, because nobody — I mean nobody — believes in it except politicians and corporate privateers. So you can look at anything from this nifty three minute video to the awesome blog Vamboozled by Audrey Amrein-Beardsley.

This is one more example of a feature of reformy stuff that is so top-to-bottom stupid that it’s hard to understand. But whether you skim the surface, look at the philosophical basis, or dive into the math, VAM does not hold up. You may be among the people who feel like you don’t quite get it, but let me reassure you — when I titled this “VAM for Dummies,” I wasn’t talking about you. VAM is always and only for dummies; it’s just that right now, those dummies are in charge.
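For readers who want the prediction-versus-actual mechanics Greene describes made concrete, here is a minimal sketch. It is an illustration only – not PVAAS, EVAAS, or any state’s actual formula – and the variables, weights, and data are invented for the example.

```python
import numpy as np

# Minimal sketch of the logic Greene describes: the model predicts how each
# student "would have done" with an average teacher, and the teacher's
# value-added is the average gap between actual and predicted scores.
# Not PVAAS or any state's actual formula; all variables, weights, and data
# below are invented for illustration.
rng = np.random.default_rng(42)

n_students = 28
prior_score = rng.normal(500, 50, n_students)      # last year's test scores
low_income = rng.integers(0, 2, n_students)        # a crude demographic control

# "Alternate Universe Johnny": the predicted score under an average teacher.
predicted = 0.9 * prior_score + 60 - 10 * low_income

# What the real students actually scored this year (prediction plus classroom
# noise plus whatever the teacher did or did not add).
actual = predicted + rng.normal(5, 20, n_students)

value_added = np.mean(actual - predicted)
print(f"teacher 'value added': {value_added:+.1f} points versus the model's prediction")
```

Everything then rides on how well that prediction stands in for a universe that never happened, which is precisely the “how could anything possibly go wrong” Greene is poking at.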


American Statistical Association (ASA) Position Statement on VAMs

In my most recent post, about the Top 14 research-based articles about VAMs, I highlighted a great research-based statement that was released just last week by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

It is short, accessible, easy to understand, and hard to dispute, so I wanted to be sure nobody missed it, as this is certainly a must-read for all of you following this blog, not to mention everybody else dealing/working with VAMs and their related educational policies. Likewise, it represents the current, research-based evidence and thinking of probably 90% of the educational researchers and econometricians (still) conducting research in this area.

Again, the ASA is the best statistical organization in the U.S. and likely one of, if not the, best statistical associations in the world. Some of the most important parts of their statement, taken directly from the full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that, included in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

Do give the position statement a read and use it as needed!

Correction: Make the “Top 13” VAM Articles the “Top 14”

As per my most recent post earlier today, about the Top 13 research-based articles about VAMs, lo and behold, another great research-based statement was just this week released by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

So, let’s make the Top 13 the Top 14 and call it a day. I say “day” deliberately; this is such a hot and controversial topic it is often hard to keep up with the literature in this area, on literally a daily basis.

As per this outstanding statement released by the ASA – the best statistical organization in the U.S. and one of, if not the, best statistical associations in the world – some of the most important parts of their statement, taken directly from the full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that, included in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

If you’re going to choose one article to read and review this week or this month – one that is thorough and gets to the key points – this is the one I recommend you read…at least for now!

A VAM Shame, again, from Florida

Another teacher from Florida wrote a blog post for Diane Ravitch, and I just came across it and am re-posting it here. Be sure to give it a good read, as you will see what is happening in her state right now and why it is a VAM shame!

She writes:

I conducted a very unscientific study and concluded that I might possibly have the worst VAM score at my school. Today I conducted a slightly more scientific analysis and now I can confidently proclaim myself to be the worst teacher at my school, the 14th worst teacher in Dade County, and the 146th worst (out of 120,000) in the state of Florida! There were 4,800 pages of teachers ranked highest to lowest on the Florida Times Union website and my VAM was on page 4,795. Gosh damn! That’s a bad VAM! I always feared I might end up at the low end of the spectrum due to the fact that I teach gifted students that score high already and have no room to grow, but 146th out of 120,000?!?! That’s not “needs improvement.” That’s “you really stink and should immediately have your teaching license revoked before you do any more harm to innocent children” bad. That’s “your odds are so bad you better hope you don’t get eaten by a shark or struck by lightning” bad. This is the reason I don’t play the lotto or gamble in Vegas. And to think some other Florida teacher had the nerve to write a blog post declaring herself to be one of the worst teachers in the state and her VAM was only -3%! Negative 3 percent is the best you got honey? I’ll meet your negative 3 percent and raise you another negative 146 percentage points! (Actually I enjoyed her blog post [see also our coverage of this teacher’s story here] and I hope more teachers come out of their VAM closets soon).

Speaking of coming out of the VAM closet, I managed to hunt down the emails of about ten other bottom dwellers as posted by the Florida Times Union. I was attempting to conduct a minor survey of what types of teachers end up getting slammed by VAM. Did they have anything in common? What types of students did they teach? As of this moment, none of them have returned my emails. I really wanted to get in touch with “The Worst Teacher in the State of Florida” according to VAM. After a little cyber-stalking, it turns out she’s my teaching twin. She also teaches ninth grade world history to gifted students in a pre-IB program. The runner-up for “Worst Teacher in the State of Florida” teaches at an arts magnet school. Are we really to believe that teachers selected to teach in an IB program or magnet school are the very worst the state of Florida has to offer? Let me tell you a little something about teaching gifted students. They are the first kids to nark out a bad teacher because they don’t think anyone is good enough to teach them. First they’ll let you know to your face that they’re smarter than you and you stink at teaching. Then they’ll tell their parents and the gifted guidance counselor who will nark you out to the Principal. If you suck as a gifted teacher, you won’t last long.

I don’t want to ignore the poor teachers that get slammed by VAM on the opposite end of the spectrum either. Although there appeared to be many teachers of high achievers who scored poorly under VAM, there also seemed to be an abundance of special education teachers as well.  These poor educators are often teaching children with horrible disabilities who will never show any learning gains on a standardized test. Do we really want to create a system that penalizes and fires the teachers whose positions we struggle the hardest to fill? Is it any wonder that teachers who teach the very top performers and teachers who teach the lowest performers would come out looking the worst in an algorithm measuring learning gains? I suck at math and this was immediately obvious to me.

Another interesting fact garnered from my amateur and cursory analysis of Florida VAM data is that high school teachers overwhelmingly populated the bottom of the VAM rankings. Of the 148 teachers who scored lower than me, 136 were high school teachers, ten were middle school teachers, and only two were elementary school teachers. All of this directly contradicts the testimony of Ms. Kathy Hebda, Deputy Chancellor for Educator Quality, in front of the Florida lawmakers last year regarding the Florida VAM.

“Hebda presented charts to the House K-12 Education Subcommittee that show almost zero correlation between teachers’ evaluation scores and the percentages of their students who are poor, nonwhite, gifted, disabled or English language learners. Teachers similarly didn’t get any advantage or disadvantage based on what grade levels they teach.

“Those things didn’t seem to factor in,” Hebda said. “You can’t tell for a teacher’s classroom by the way the value-added scores turned out whether she had zero percent students on free and reduced price lunch or 100 percent.”

Hebda’s 2013 testimony in two public hearings was intended to assure policymakers that everything was just swell with VAM, as an affirmation that the merit pay provision of the 2011 Student Success Act (SB736) was going to be ready for prime time in the scheduled 2015 roll-out. No wonder the FLDOE didn’t want the actual VAM data released, as the data completely contradict Hebda’s assurances that “the model did its job.”

I certainly have been a little disappointed with the media coverage of the FLDOE losing its lawsuit and being forced to release Florida teacher VAM data this week.  The Florida Times Union considers this data to be a treasure trove of information but they haven’t dug very deep into the data they fought so hard to procure. The Miami Herald barely acknowledged that anything noteworthy happened in education news this week.  You would think some other journalist would have thought to cover a story about “The Worst Teacher in Florida.” I write this blog to cover teacher stories that major media outlets don’t seem interested in telling (that, and I am trying to stave off early dementia while on maternity leave).  One journalist bothered to dig up the true story behind the top ten teachers in Florida. But no one has bothered telling the stories of the bottom ten. Those are the teachers who are most likely to be fired and have their teaching licenses revoked by the state. Let those stories be told. Let the public see what kinds of teachers they are at risk of losing to this absurd excuse of an “objective measure of teacher effectiveness” before it’s too late.