VAMmunition

While “Top Ten Lists” have become a recurrent trend in periodicals, magazines, blogs, and the like, the “Top Ten List” presented here should satisfy the needs of this blog’s readers, and hopefully other educators, for VAMmunition, or rather, the ammunition practitioners need to protect themselves against the unfair implementation and use of VAMs (i.e., value-added models).

Likewise, as “Top Ten Lists” typically serve reductionistic purposes, in the sense that they often reduce highly complex phenomena into easy-to-understand, easy-to-interpret, and easy-to-use strings of information, this approach is more than suitable here because those who are trying to ward off the unfair implementation and use of VAMs do not have the VAMmunition they need to defend themselves in research-based ways. Hopefully this list will satisfy at least some of these needs.

Accordingly, I present here the “Top Ten Bits of VAMmunition”: research-based reasons, listed in no particular order, that all public school educators should be able to use to defend themselves against VAMs.

  1. VAM estimates should not be used to assess teacher effectiveness. The standardized achievement tests on which VAM estimates are based have always been, and continue to be, developed to assess levels of student achievement, not levels of growth in student achievement, and not growth in achievement that can be attributed to teacher effectiveness. The tests on which VAM estimates are based (among other issues) were never designed to estimate teachers’ causal effects.
  2. VAM estimates are often unreliable. Teachers who should be (more or less) consistently effective are being classified in sometimes highly inconsistent ways over time. A teacher classified as “adding value” has a 25 to 50% chance of being classified as “subtracting value” the following year(s), and vice versa. This sometimes makes the probability of a teacher being identified as effective no different than the flip of a coin.
  3. VAM estimates are often invalid. Without adequate reliability, as reliability is a qualifying condition for validity, valid VAM-based interpretations are even more difficult to defend. Likewise, very limited evidence exists to support that teachers who post high or low value-added scores are also rated as effective or ineffective on at least one other correlated criterion (e.g., teacher observational scores, teacher satisfaction surveys). The correlations being demonstrated across studies are not nearly high enough to support valid interpretation or use.
  4. VAM estimates can be biased. Teachers of certain students who are almost never randomly assigned to classrooms have more difficulties demonstrating value-added than their comparably effective peers. Estimates for teachers who teach inordinate proportions of English Language Learners (ELLs), special education students, students who receive free or reduced lunches, and students retained in grade are more adversely impacted by bias. While bias can present itself in terms of reliability (e.g., when teachers post consistently high or low levels of value-added over time), the illusion of consistency can sometimes be due, rather, to teachers being consistently assigned more homogeneous sets of students.
  5. Related, VAM estimates are fraught with measurement errors that negate their levels of reliability and validity, and contribute to issues of bias. These errors are caused by inordinate amounts of inaccurate or missing data that cannot be easily replaced or disregarded; variables that cannot be statistically “controlled for;” differential summer learning gains and losses and prior teachers’ residual effects that also cannot be “controlled for;” the effects of teaching in non-traditional, non-isolated, and non-insular classrooms; and the like.
  6. VAM estimates are unfair. Issues of fairness arise when test-based indicators and their inference-based uses impact some more than others in consequential ways. With VAMs, only teachers of mathematics and reading/language arts with pre- and post-test data in certain grade levels (e.g., grades 3-8) are typically being held accountable. Across the nation, this is leaving approximately 60-70% of teachers, including entire campuses of teachers (e.g., early elementary and high school teachers), as VAM-ineligible.
  7. VAM estimates are non-transparent. Estimates must be made transparent in order to be understood, so that they can ultimately be used to “inform” change and progress in “[in]formative” ways. However, the teachers and administrators who are to use VAM estimates accordingly do not typically understand the VAMs or VAM estimates being used to evaluate them, at least not well enough to promote such change.
  8. Related, VAM estimates are typically of no informative, formative, or instructional value. No research to date suggests that VAM-use has improved teachers’ instruction or student learning and achievement.
  9. VAM estimates are being used inappropriately to make consequential decisions. VAM estimates do not have enough consistency, accuracy, or depth to satisfy the purposes with which VAMs are increasingly being tasked, for example, to help make high-stakes decisions about whether teachers receive merit pay, are awarded/denied tenure, or are retained or inversely terminated. While proponents argue that because of VAMs’ imperfections, VAM estimates should not be used in isolation of other indicators, the fact of the matter is that VAMs are so imperfect they should not be used for much of anything unless largely imperfect decisions are desired.
  10. The unintended consequences of VAM use are continuously going unrecognized, although research suggests they continue to exist. For example, teachers are choosing not to teach certain students, including those whom teachers deem the most likely to hinder their potential to demonstrate value-added. Principals are stacking classes to make sure certain teachers are more likely to demonstrate “value-added,” or vice versa, to protect or penalize certain teachers, respectively. Teachers are leaving/refusing assignments to grades in which VAM-based estimates matter most, and some teachers are leaving teaching altogether out of discontent or in protest. About the seriousness of these and other unintended consequences, weighed against VAMs’ intended consequences or the lack thereof, proponents and others simply do not seem to give a VAM.
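
To make point 2 above more concrete, here is a minimal simulation sketch in Python. It is not drawn from any of the studies behind this list. It assumes a toy model in which each teacher has a fixed true effect and each year's VAM estimate adds independent, normally distributed noise. The parameter `reliability` (a hypothetical name) stands in for the year-to-year correlation of the estimates, and both function names are illustrative only.

```python
import math
import random


def simulated_flip_rate(reliability, n_teachers=20000, seed=0):
    """Share of teachers rated above average in year 1 who are rated below average in year 2.

    Toy model (an assumption, not a real VAM): estimate = true effect + yearly noise.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    rng = random.Random(seed)
    # Scale so each yearly estimate has total variance 1:
    # var(true effect) = reliability, var(noise) = 1 - reliability.
    true_sd = math.sqrt(reliability)
    noise_sd = math.sqrt(1.0 - reliability)
    above_year1 = 0
    flipped = 0
    for _ in range(n_teachers):
        true_effect = rng.gauss(0.0, true_sd)
        year1 = true_effect + rng.gauss(0.0, noise_sd)
        year2 = true_effect + rng.gauss(0.0, noise_sd)
        if year1 > 0:
            above_year1 += 1
            if year2 < 0:
                flipped += 1
    return flipped / above_year1


def expected_flip_rate(reliability):
    """Closed form for the same toy model: arccos(r) / pi for bivariate normal estimates."""
    return math.acos(reliability) / math.pi


if __name__ == "__main__":
    for r in (0.0, 0.3, 0.5, 0.7, 0.9):
        print(f"reliability={r:.1f}  "
              f"simulated={simulated_flip_rate(r):.3f}  "
              f"expected={expected_flip_rate(r):.3f}")
```

Under these assumptions, a year-to-year correlation of 0 gives a 50% flip rate (a coin toss), 0.5 gives one in three, and about 0.7 gives roughly 25%, which spans the range cited in point 2.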

 

12 thoughts on “VAMmunition”

  1. I’ve heard argument 1 quite a bit, but what I’ve never heard is the obvious followup point: what exactly would a test look like if it were, indeed, “designed to estimate teachers’ causal effects”? Moreover, how different would it be from today’s tests? I assume no one knows what that is, or else they would have mentioned it by now.

    I’m not sure it makes any sense as a criticism anyway. By analogy, a timed mile isn’t “designed” to estimate the causal effect of a track coach. But who cares? We could still, if we wanted, try to estimate the causal effect of track coaches by measuring how well their track team improves at a timed mile.

    If Coach 1 and Coach 2 both start out with a team that can run 6 minute miles, but Coach 1’s team runs 4.5 minute miles by the end of the year whereas Coach 2’s team only improves to 5.5 minute miles, and if you also statistically control for other demographic factors (even though this shouldn’t really be necessary given that both teams had the same ability at the outset), then it doesn’t seem too crazy to suggest that maybe Coach 1’s training program and ability to inspire his runners are better than Coach 2’s.

    • We do know what the answer to this one is, actually, and it is more about “the same abilit[ies] at the outset” as you mentioned in your analogy than anything else, including the actual tests. While large-scale standardized tests are typically limited in both the number and types of items included, among other things, one could use a similar test with more items and more “instructionally sensitive” items to capture a teacher’s causal effects quite simply, actually. This would be done with the pre- and post-tests occurring in the same year while students are being instructed by the same (albeit not only…) teacher. However, this does not happen in any value-added system at this point as these tests are given once per year (typically spring to spring). Hence, student growth scores include prior teachers’ effects, as well as the differential learning gains/losses that also occur over the summers during which students have little to no interactions with formal education systems, or their teachers. This “biases” these measures of growth, big time! The other necessary condition would be random assignment. If students were randomly assigned to classrooms (and teachers were randomly assigned to classrooms), this would help to make sure that indeed all students are similar at the outset. However, again, this rarely if ever happens in practice as administrators and teachers (rightfully) see such assignment practices as great for experimental research purposes but bad for students and their learning! Regardless, some statisticians suggest that their sophisticated controls can “account” for non-random assignment practices, yet, again, evidence suggests that no matter how sophisticated the controls are, they simply do not work either.

      See, for example, the Newton et al. and the Hermann et al. citations here, in this blog, in the Readings link. I also have an article about this, written with one of my doctoral students, coming out this month in a highly esteemed peer-reviewed journal. Here is the reference if you want to keep an eye out for it as well. These three should (hopefully) explain all of this with greater depth and clarity than here: Paufler, N. A. & Amrein-Beardsley, A. (2013, October). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal. doi: 10.3102/0002831213508299

    • To understand why educational standards and standardized testing and the “grading” of students are completely invalid and any results “vain and illusory”, having no epistemological and ontological reality, read and understand Noel Wilson’s “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700

      Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)

      1. A quality cannot be quantified. Quantity is a sub-category of quality. It is illogical to judge/assess a whole category by only a part (sub-category) of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as one dimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing we are lacking much information about said interactions.

      2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).

      3. Wilson identifies four “frames of reference” each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think college professor who “knows” the students’ capabilities and grades them accordingly), the General Frame (think standardized testing that claims to have a “scientific” basis), the Specific Frame (think of learning by objective, like computer-based learning, getting a correct answer before moving on to the next screen), and the Responsive Frame (think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback). Each category has its own sources of error and more error in the process is caused when the assessor confuses and conflates the categories.

      4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”

      In other words, all the logical errors involved in the process render any conclusions invalid.

      5. The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, that is, errorless or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker not the test maker. In doing so he identifies thirteen sources of “error”, any one of which renders the test making/giving/disseminating of results invalid. A basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.

      6. Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or, to put it in more mundane terms, crap in, crap out.

      7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”

      In other words it measures “’something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students as the social rewards for some are not available to others who “don’t make the grade (sic).” Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?

      My answer is NO!!!!!

      One final note with Wilson channeling Foucault and his concept of subjectivization:

      “So the mark [grade/test score] becomes part of the story about yourself and with sufficient repetitions becomes true: true because those who know, those in authority, say it is true; true because the society in which you live legitimates this authority; true because your cultural habitus makes it difficult for you to perceive, conceive and integrate those aspects of your experience that contradict the story; true because in acting out your story, which now includes the mark and its meaning, the social truth that created it is confirmed; true because if your mark is high you are consistently rewarded, so that your voice becomes a voice of authority in the power-knowledge discourses that reproduce the structure that helped to produce you; true because if your mark is low your voice becomes muted and confirms your lower position in the social hierarchy; true finally because that success or failure confirms that mark that implicitly predicted the now self-evident consequences. And so the circle is complete.”

      In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.

    • What if coach 2’s team had the flu going around during the time of the big meet? What if coach 2’s star player had a tragic loss in his/her family? What if some members of coach 2’s team became homeless or didn’t have anything to eat the morning of the big meet because the parents were out of work for too long, or what if some runners were injured? Can you “statistically control for other demographic factors” such as these? Besides, as the system stands, neither coach had the team when they ran a 6 minute mile; that would have been a year prior to the race by which we assess these coaches. Are you aware of how many developmental changes kids could undergo in one year? Please tell me: how do we “control” for these unforeseen variables for every individual student, every year?

  2. Wow-thanks for putting this crucial information in such an easy-to-follow format. I will definitely be using this to back me up when I attend an upcoming forum on high-stakes standardized testing.

  3. What is the difference between VAM and Student Growth Percentiles (SGPs), and do SGPs have any usefulness? What about Student Growth Objectives (SGOs)?

  4. We are a group of volunteers and starting a new scheme in our community. Your website offered us with valuable information to work on. You’ve done an impressive job and our whole community will be thankful to you.

  5. VAM and SGP’s fall under the realm of “Doing the Wrong Things Righter”

    Doing the Wrong Thing Righter

    The proliferation of educational assessments, evaluations and canned programs belongs in the category of what systems theorist Russ Ackoff describes as “doing the wrong thing righter. The righter we do the wrong thing,” he explains, “the wronger we become. When we make a mistake doing the wrong thing and correct it, we become wronger. When we make a mistake doing the right thing and correct it, we become righter. Therefore, it is better to do the right thing wrong than the wrong thing right.”

    Our current neglect of instructional issues is the result of assessment policies that waste resources to do the wrong things, e.g., canned curriculum and standardized testing, right. Instructional central planning and student control don’t, and can’t, work. But that never stops people trying.

    The result is that each effort to control the uncontrollable does further damage, provoking more efforts to get things in order. So the function of management/administration becomes control rather than creation of resources. When Peter Drucker lamented that so much of management consists in making it difficult for people to work, he meant it literally. Inherent in obsessive command and control is the assumption that human beings can’t be trusted on their own to do what’s needed. Hierarchy and tight supervision are required to tell them what to do. So, fear-driven, hierarchical organizations turn people into untrustworthy opportunists. Doing the right thing instructionally requires less centralized assessment, less emphasis on evaluation and less fussy interference, not more. The way to improve controls is to eliminate most and reduce all.

    Former Green Beret Master Sergeant Donald Duncan (Viet Nam) put it well when he noted in Sir! No Sir! that:

    “I was doing it right but I wasn’t doing right.”

    And from one of America’s premier writers:

    “The mass of men [and women] serves the state [education powers that be] thus, not as men mainly, but as machines, with their bodies. They are the standing army, and the militia, jailors, constables, posse comitatus, [administrators and teachers], etc. In most cases there is no free exercise whatever of the judgment or of the moral sense; but they put themselves on a level with wood and earth and stones; and wooden men can perhaps be manufactured that will serve the purpose as well. Such command no more respect than men of straw or a lump of dirt.”- Henry David Thoreau [1817-1862], American author and philosopher

  6. A little something to include in the database on VAM, published at Substance News: “‘Value-added’ doesn’t add up… Evidence Presented in the Case Against Growth Models for High Stakes Purposes”

    Jim Horn and Denise Wilburn – October 09, 2013

    The following article quotes liberally from The Mismeasure of Education, and it represents an overview of the research-based critiques of value-added modeling, or growth models. We offer it here to Chicago educators with the hope that it may serve to inspire and inform the restoration of fairness, reliability, and validity to the teacher evaluation process and the assessment of children and schools.

    In the fall of 2009, before the final guidance was issued to all the cash-strapped states lined up for some of the $3.4 billion in Race to the Top grants, the Board on Testing and Assessment issued a 17-page letter to Arne Duncan, citing the National Research Council’s response to the RTTT draft plan. BOTA cited reasons to applaud the DOEd’s efforts, but the main purpose of the letter was to voice, in unequivocal language, the NRC’s concern regarding the use of value-added measures, or growth models, for high stakes purposes specifically related to the evaluation of teachers:

    BOTA has significant concerns that the Department’s proposal places too much emphasis on measures of growth in student achievement (1) that have not yet been adequately studied for the purposes of evaluating teachers and principals and (2) that face substantial practical barriers to being successfully deployed in an operational personnel system that is fair, reliable, and valid (p. 8).

    In 1992 when Dr. William Sanders sold value-added measurement to Tennessee politicians as the most reliable, valid, and fair way to measure student academic growth and the impact that teachers, schools, and districts have on student achievement, the idea seemed reasonable, more fair, and even scientific to some. But since 1992, leading statisticians and testing experts who have scrutinized value-added models have concluded that these assessment systems for measuring test score growth do not meet the reliability, validity, and fairness standards established by respected national organizations, such as the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (Amrein-Beardsley, 2008). Nonetheless, value-added modeling for high-stakes decision making now consumes significant portions of state education budgets, even though there is no oversight agency to make sure that students and teachers are protected:

    Who protects [students and teachers in America’s schools] from assessment models that could do as much harm as good? Who protects their well-being and ensures that assessment models are safe, wholesome, and effective? Who guarantees that assessment models honestly and accurately inform the public about student progress and teacher effectiveness? Who regulates the assessment industry? (Amrein-Beardsley, 2008, p. 72)

    If value-added measures do not meet the highest standards established for reliable, valid and fair measurement, then the resulting high stakes decisions made based on these value-added measures are also unreliable, invalid and unfair. Therefore, legislators, policymakers, and administrators who require high-stakes decisions based on value-added measures are equally culpable and liable for the wrongful termination of educators mismeasured with metrics derived from tests that were never intended to do anything but give a ballpark idea of how students are progressing on academic benchmarks for subject matter concepts at each grade level.

    Is value-added measurement reliable? Is it consistent in its measurement and free of measurement errors from year to year? As early as 1995, the Tennessee Office of Education Accountability (OEA) reported “unexplained variability” in Tennessee’s value-added scores and called for an outside evaluation of all components of the Tennessee Value-Added Assessment System (TVAAS), including the achievement tests used in calculating the value-added scores. The outside evaluators, Bock, Wolfe & Fisher (1996), questioned the reliability of the system for high-stakes decisions based on how the achievement tests were constructed. These experts recognized that test makers are engaged in a very difficult and imprecise science when they attempt to rank order learning concepts by difficulty. This difficulty is compounded when they attempt to link those concepts across a grade-level continuum so that students successfully build subject matter knowledge and skill from grade to grade. An extra layer of content design imprecision is added when test makers create multiple forms of the test at each grade level to represent those rank-ordered test items. Bock, Wolfe, & Fisher found that variation in test construction was, in part, responsible for the “unexplained variability” in Tennessee’s state test results.

    Other highly respected researchers (Ballou, 2002; Lockwood, 2006; McCaffrey & Lockwood, 2008; Briggs, Weeks & Wiley, 2008) have weighed in on the issue of reliability of value-added measures based on questionable achievement test construction. As an invited speaker to the National Research Council workshop on value-added methodology and accountability in 2010, Ballou pointedly went to the heart of the test quality matter when he acknowledged the “most neglected” question among economists concerned with accountability measures:

    The question of what achievement tests measure and how they measure it is probably the [issue] most neglected by economists…. If tests do not cover enough of what teachers actually teach (a common complaint), the most sophisticated statistical analysis in the world still will not yield good estimates of value-added unless it is appropriate to attach zero weight to learning that is not covered by the test. (National Research Council and National Academy of Education, 2010, p. 27).

    In addition to these test issues, the reliability of the teacher effect estimates in high-stakes applications is compromised by a number of other recurring problems: 1) the timing of the test administration and summer learning loss (Papay, 2011); 2) missing student data (Bock & Wolfe, 1996; Fisher, 1996; McCaffrey et al, 2003; Braun, 2005; National Research Council, 2010); 3) student data poorly linked to teachers (Dunn, Kadane & Garrow, 2003; Baker et al, 2010); and 4) inadequate sample size of students due to classroom arrangements or other school logistical and demographic issues (Ballou, 2005; McCaffrey, Sass, Lockwood, & Mihaly, 2009).

    Growth models such as the Sanders VAM use multiple years of data in order to reduce the degree of potential error in gauging teacher effect. Sanders justifies this practice by claiming that a teacher’s effect on her students’ learning will persist into the future and, therefore, can be measured with consistency. However, research conducted by McCaffrey, Lockwood, Koretz, Louis, & Hamilton (2004) and subsequently by Jacob, Lefgren, and Sims (2008) shatters this bedrock assumption. These researchers found that “only about one-fifth of the test score gain from a high value-added teacher remains after a single year…. After two years, about one-eighth of the original gain persists” (p. 33). Too many uncontrolled factors impact the stability and sensitivity of value-added measurement for making high-stakes personnel decisions for teachers. In fact, Schochet and Chiang (2010) found that the error rates for distinguishing teachers from the average teaching performance using three years of data were about 26 percent. They concluded that more than 1 in 4 teachers who are truly average in performance will be erroneously identified for special treatment, and more than 1 in 4 teachers who differ from average performance by 3 months of student learning in math or 4 months in reading will be overlooked (p. 35). Schochet and Chiang also found that to reduce error in the variance in teachers’ effect scores to 12 percent, rather than 26 percent, ten years of data would be required for each teacher (p. 35). When we consider the stakes for altering school communities and students’ and teachers’ lives, the utter impracticality of reducing error to what may be argued as an acceptable level makes value-added modeling for high-stakes decisions simply unacceptable.
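
    The arithmetic behind the three-year and ten-year figures can be illustrated with a rough sketch. This is not Schochet and Chiang’s actual model. It assumes, hypothetically, that yearly estimation errors are independent and normally distributed, so that the standard error of a teacher’s multi-year average shrinks with the square root of the number of years. The function name and its parameters are illustrative only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def projected_error_rate(base_error, base_years, years):
    """Project a misclassification rate from `base_years` of data to `years` of data.

    Simplifying assumption (not from the cited study): independent, normal
    yearly errors, so the standard error scales with 1 / sqrt(years).
    """
    if not 0.0 < base_error < 0.5:
        raise ValueError("base_error must be strictly between 0 and 0.5")
    if base_years <= 0 or years <= 0:
        raise ValueError("year counts must be positive")
    # Distance between the teacher and the classification cutoff,
    # measured in standard errors, implied by the observed error rate.
    z_base = -STD_NORMAL.inv_cdf(base_error)
    # More years of data shrink the standard error, stretching that distance.
    z_new = z_base * math.sqrt(years / base_years)
    return STD_NORMAL.cdf(-z_new)


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(f"{n:>2} years of data: projected error rate = "
              f"{projected_error_rate(0.26, 3, n):.3f}")
```

    Under these simplifying assumptions, starting from the 26 percent rate at three years, the projection for ten years lands close to the 12 percent figure quoted above, which shows how slowly additional years of data reduce the error.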

    Beyond the reliability problems summarized above, growth models also have serious validity problems caused by the nonrandom placement of students from classroom to classroom, from school to school within districts, and from system to system. Nonrandom placement of students further erodes Sanders’ causal claims for teacher effects on achievement, as well as his claim that the impact of student characteristics on student achievement is irrelevant. For a teacher’s value-added scores to be valid, she must have “an equal chance of being assigned any of the students in the district of the appropriate grade and subject”; otherwise, “a teacher might be disadvantaged [her scores might be biased] by placement in a school serving a particular population” year after year (Ballou, 2005, p. 5). In Tennessee, no educational policy or administrative rule requires schools to randomly assign teachers or students to classrooms, so no teacher has an equal chance of being assigned any given student within a school, within a district, or across the state. To underscore the effect of nonrandom placement of disadvantaged students on teacher effect estimates, Kupermintz (2003) found in reexamining Tennessee value-added data that “schools with more than 90% minority enrollment tend to exhibit lower cumulative average gains” and school systems’ data showed “even stronger relations between average gains and the percentage of students eligible for free or reduced-price lunch” (p. 295). Value-added models like TVAAS that assume random assignment of students to teachers’ classrooms “yield misleading [teacher effect] estimates, and policies that use these estimates in hiring, firing, and compensation decisions may reward and punish teachers for the students they are assigned as much as for their actual effectiveness in the classroom” (Rothstein, 2010, p. 177). In his value-added primer, Henry Braun (2005) stated clearly and boldly that the “fundamental concern is that, if making causal attributions (of teacher effect on student achievement performance) is the goal, then no statistical model, however complex, and no method of analysis, however sophisticated, can fully compensate for the lack of randomization” (p. 8).
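
The effect of nonrandom placement can be shown with a toy calculation. It compares raw expected classroom gains for two teachers who have exactly the same true effect; it is not the TVAAS model or any published VAM. The function name, the gain values, and the classroom mixes are hypothetical and were chosen only to make the arithmetic visible.

```python
def expected_mean_gain(
    teacher_effect: float,
    share_advantaged: float,
    advantaged_gain: float,
    disadvantaged_gain: float,
) -> float:
    """Expected average classroom gain for one teacher.

    Assumption: each student's gain is the teacher's true effect plus a
    background component that depends only on the student's group.
    """
    background = (
        share_advantaged * advantaged_gain
        + (1.0 - share_advantaged) * disadvantaged_gain
    )
    return teacher_effect + background


TRUE_EFFECT = 5.0          # identical for both teachers (hypothetical units)
ADVANTAGED_GAIN = 3.0      # hypothetical background gain
DISADVANTAGED_GAIN = -3.0  # hypothetical background gain

# Nonrandom placement: teacher A gets 90% advantaged students, teacher B 10%.
gain_a = expected_mean_gain(TRUE_EFFECT, 0.9, ADVANTAGED_GAIN, DISADVANTAGED_GAIN)
gain_b = expected_mean_gain(TRUE_EFFECT, 0.1, ADVANTAGED_GAIN, DISADVANTAGED_GAIN)

# Random placement: both teachers draw from the same 50/50 population.
gain_a_random = expected_mean_gain(TRUE_EFFECT, 0.5, ADVANTAGED_GAIN, DISADVANTAGED_GAIN)
gain_b_random = expected_mean_gain(TRUE_EFFECT, 0.5, ADVANTAGED_GAIN, DISADVANTAGED_GAIN)

print(f"Nonrandom placement gap: {gain_a - gain_b:.2f}")
print(f"Random placement gap:    {gain_a_random - gain_b_random:.2f}")
```

In this sketch the two teachers are equally effective by construction, so any gap under nonrandom placement comes entirely from the students they were assigned, which is the bias Ballou and Rothstein describe.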

    Any system of assessment that claims to measure teacher and school effectiveness must be fair in its application to all teachers and to all schools. Because teaching is a contextually embedded, nonlinear activity that cannot be accurately assessed by using a linear, context-independent value-added model, it is unfair to use such a model at this time. The consensus among VAM researchers is against the use of growth models for high-stakes purposes. Any assessment system that can misidentify 26 percent or more of teachers as above or below average when they are neither is unfair when used for decisions of dismissal, merit pay, granting or revoking tenure, closing a school, retaining students, or withholding resources for poor performance. When almost two-thirds of teachers, those who do not teach subjects in which standardized tests are administered, will be rated based on the test score gains of other teachers in their schools, then the assessment system has led to unfair and unequal treatment (Gonzalez, 2012). When the assessment system intensifies teaching to the test, narrowing of curriculum, avoidance of the neediest students, reduction of teacher collaboration, or the widespread demoralization of teachers (Baker, E. L., et al., 2010), then it has unfair and regressive effects. Any assessment system whose proprietary status limits access by the scholarly community to validate its findings and interpretations is antithetical to the review process upon which knowledge claims are based. An unfair assessment system is unacceptable for high-stakes decision-making. In August 2013, the Tennessee State Board of Education adopted a new teacher licensing policy that ties teacher license renewal to value-added scores. However, the implementation of this policy was delayed after a very important presentation was made public by the Tennessee Education Association (TEA). In that presentation, delivered by TEA attorney Rick Colbert and based on value-added data that individual teachers shared for additional analysis, the TEA demonstrated that 43 percent of the teachers who would have lost their licenses due to declining value-added scores in one year had higher scores the following year, with 20 percent of those teachers scoring high enough in the following year to retain their licenses. The presentation may be viewed on YouTube: http://www.youtube.com/watch?v=l1BWGiqhHac

    After 20 years of using value-added assessment in Tennessee, educational achievement does not reflect an added value in the information gained from an expensive investment in the value-added assessment system. With $326,000,000 spent for assessment, the TVAAS, and other costs related to accountability since 1992, the State’s student achievement levels remain in the bottom quarter nationally (State Collaborative on Reforming Education, 2010, p. 7). Tennessee received a D on K–12 achievement when compared to other states based on NAEP achievement levels and gains, poverty gaps, graduation rates, and Advanced Placement test scores (Quality Counts, 2011, p. 46). The Public Education Finances reports (U.S. Census Bureau) rank Tennessee’s per-pupil spending 47th for both 1992 and 2009. When state legislators and policymakers were led to believe in 1992 that the teacher is the single most important factor in improving student academic performance, they found reason to justify lowering education spending as a priority and increasing accountability. Finally, the evidence from twenty years of review and analysis by leading national experts in educational measurement and accountability leads to the same conclusion when trying to answer Dr. Sanders’ original question: Can student test data be used to determine teacher effectiveness? The answer: No, not with enough certainty to make high-stakes personnel decisions. In turn, when we ask the larger social science question (Flyvbjerg, 2001): Is the use of value-added modeling and high-stakes testing a desirable social policy for improving learning conditions and learning for all students? The answer must be an unequivocal “no,” and it must remain so until assessments measure various levels of learning at the highest levels of reliability and validity, and with the conscious purpose of equality in educational opportunity for all students.

    We have wasted much time, money, and effort to find out what we already knew: effective teachers and schools make a difference in student learning and students’ lives. What the TVAAS and the EVAAS do not tell us, and what supporters of growth models seem oddly incurious about, is what, how, or why teachers make a difference. While test data and value-added analysis may highlight strengths and/or areas of needed intervention in school programs or subgroups of the student population, we can only know the “what,” “how,” and “why” of effective teaching through careful observation by knowledgeable observers in classrooms where effective teachers engage students in varied levels of learning across multiple contexts. And while this kind of knowing may be too much to ask of any set of algorithms developed so far for deployment in schools, it is not at all alien to great educators who have been asking these questions and doing this kind of knowledge sharing since Socrates, at least.

References

    Amrein-Beardsley, A. (2008). Methodological concerns about the Education Value-Added Assessment System. Educational Researcher, 37(2), 65-75. doi: 10.3102/0013189X08316420

    Baker, A., Xu, D., & Detch, E. (1995). The measure of education: A review of the Tennessee value added assessment system. Nashville, TN: Comptroller of the Treasury, Office of Education Accountability Report.

    Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shephard, L. A. (2010, August 29). Problems with the use of student test scores to evaluate teachers (Briefing Paper #278). Washington, DC: Economic Policy Institute.

    Ballou, D. (2002). Sizing up test scores. Education Next. Retrieved from http://www.educationnext.org

    Ballou, D. (2005). Value-added assessment: Lessons from Tennessee. Retrieved from http://dpi.state.nc.us/docs/superintendents/quarterly/2010-11/20100928/ballou-lessons.pdf

    Bock, R., & Wolfe, R. (1996, Jan. 23). Audit and review of the Tennessee value-added assessment system (TVAAS): Preliminary report. Nashville, TN: Comptroller of the Treasury, Office of Education Accountability Report.

    Braun, H. I. (2005). Using student progress to evaluate teachers (Policy Information Perspective). Retrieved from Educational Testing Service, Policy Information Center website: http://www.ets.org/Media/Research/pdf/PICVAM.pdf

    Briggs, D. C., Weeks, J. P. & Wiley, E. (2008, April). The sensitivity of value-added modeling to the creation of a vertical scale score. Paper presented at the National Conference on Value-Added Modeling, Madison, WI. Retrieved from http://academiclanguag.wceruw.org/news/events/VAM%20Conference%20Final%20Papers/SensitivityOfVAM_BriggsWeeksWiley.pdf

    Dunn, M., Kadane, J., & Garrow, J. (2003). Comparing harm done by mobility and class absence: Missing students and missing data. Journal of Educational and Behavioral Statistics, 28, 269–288.

    Fisher, T. (1996, January). A review and analysis of the Tennessee value-added assessment system. Nashville, TN: Tennessee Comptroller of the Treasury, Office of Education Accountability Report.

    Flyvbjerg, B. (2001). Making social science matter: Why social inquiry fails and how to make it succeed again. Cambridge: Cambridge University Press.

    Gonzalez, T. (2012, July 17). TN education reform hits bump in teacher evaluation. The Tennessean. Retrieved from http://www.wbir.com/news/article/226990/0/TN-education-reform-hits-bump-in-teacher-evaluation

    Jacob, B. A., Lefgren, L., & Sims, D. P. (2008, June). The persistence of teacher-induced learning gains (Working Paper 14065). Retrieved from the National Bureau of Economic Research website: http://www.nber.org/papers/w14065

    Kupermintz, H. (2003). Teacher effects and teacher effectiveness: A validity investigation of the Tennessee Value Added Assessment System. Educational Evaluation and Policy Analysis, 25(3), 287-298.

    Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V., & Martinez, F. (2006). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Retrieved from The Rand Corporation website: http://www.rand.org/content/dam/rand/pubs/reports/2009/RAND_RP1269.pdf

    McCaffrey, D. F. & Lockwood, J. R. (2008, November). Value-added models: Analytic Issues. Paper presented at the National Research Council and the National Academy of Education, Board of Testing and Accountability Workshop on Value-Added Modeling, Washington DC.

    McCaffrey, D. F., Lockwood, J. R., Koretz, D. M. & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability. Retrieved from The Rand Corporation website: http://www.rand.org/pubs/monographs/MG158.html

    McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for Value-Added Modeling of Teacher Effects. Journal of Educational and Behavioral Statistics, 29(1), 67-101.

    McCaffrey, D. F., Sass, T. R., Lockwood, J. R. & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4), 572-606.

    National Academy of Sciences. (2009). Letter report to the U. S. Department of Education on the Race to the Top fund. Washington, DC: National Academies of Sciences. Retrieved from http://www.nap.edu/catalog.php?record_id=12780

    National Research Council and National Academy of Education. (2010). Getting Value Out of Value-Added: Report of a Workshop. Committee on Value-Added Methodology for Instructional Improvement, Program Evaluation, and Educational Accountability, Henry Braun, Naomi Chudowsky, and Judith Koenig, Editors. Center for Education, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.

    Papay, J. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163-193.

    Quality counts, 2011: Uncertain forecast. (2011, January 13). Education Week. Retrieved from http://www.edweek.org/ew/toc/2011/01/13/index.html

    Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. The Quarterly Journal of Economics, 125(1), 175-214.

    Schochet, P. Z. & Chiang, H. S. (2010). Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

    State Collaborative on Reforming Education. (2010). The state of education in Tennessee (Annual Report). Retrieved from http://www.tnscore.org/wp-content/uploads/2010/06/Score-2010-Annual-Report-Full.pdf

    U. S. Census Bureau. (2011). Public Education Finances: 2009 (G09-ASPEF). Washington, DC: U.S. Government Printing Office.
