Yesterday, the Council of the American Education Research Association (AERA), which was founded in 1916 and is the largest national professional organization devoted to the scientific study of education, publicly released its “AERA Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.” Below is a summary of the AERA Council’s key points, noting for transparency that I contributed to these points in June of 2014, before the final statement was externally reviewed, revised, and vetted for public release.
As per the introduction: “The purpose of this statement is to inform those using or considering the use of value-added models (VAM) about their scientific and technical limitations in the evaluation of educators [as well as programs that prepare teachers].” The purpose of this statement is also to stress “the importance of any educator evaluation system meeting the highest standards of practice in statistics and measurement,” well before VAM output is to carry any “high-stakes, dispositive weight in [such teacher or other] evaluations” (p. 1).
As per the main body of the statement, the AERA Council highlights eight very important technical requirements that must be met prior to such evaluative use. These eight technical requirements should be officially recognized by all states, and/or used by any of you out there to help inform your states regarding what they can and cannot, or should and should not, do when using VAMs, given that “[a]ny material departure from these [eight] requirements should preclude [VAM] use” (p. 2).
Here are AERA’s eight technical requirements for the use of VAM:
- “VAM scores must only be derived from students’ scores on assessments that meet professional standards of reliability and validity for the purpose to be served…Relevant evidence should be reported in the documentation supporting the claims and proposed uses of VAM results, including evidence that the tests used are a valid measure of growth [emphasis added] by measuring the actual subject matter being taught and the full range of student achievement represented in teachers’ classrooms” (p. 3).
- “VAM scores must be accompanied by separate lines of evidence of reliability and validity that support each [and every] claim and interpretative argument” (p. 3).
- “VAM scores must be based on multiple years of data from sufficient numbers of students…[Related,] VAM scores should always be accompanied by estimates of uncertainty to guard against [simplistic] overinterpretation[s] of [simple] differences” (p. 3).
- “VAM scores must only be calculated from scores on tests that are comparable over time…[In addition,] VAM scores should generally not be employed across transitions [to new, albeit different tests over time]” (AERA Council, 2015, p. 3).
- “VAM scores must not be calculated in grades or for subjects where there are not standardized assessments that are accompanied by evidence of their reliability and validity…When standardized assessment data are not available across all grades (K–12) and subjects (e.g., health, social studies) in a state or district, alternative measures (e.g., locally developed assessments, proxy measures, observational ratings) are often employed in those grades and subjects to implement VAM. Such alternative assessments should not be used unless they are accompanied by evidence of reliability and validity as required by the AERA, APA, and NCME Standards for Educational and Psychological Testing” (p. 3).
- “VAM scores must never be used alone or in isolation in educator or program evaluation systems…Other measures of practice and student outcomes should always be integrated into judgments about overall teacher effectiveness” (p. 3).
- “Evaluation systems using VAM must include ongoing monitoring for technical quality and validity of use…Ongoing monitoring is essential to any educator evaluation program and especially important for those incorporating indicators based on VAM that have only recently been employed widely. If authorizing bodies mandate the use of VAM, they, together with the organizations that implement and report results, are responsible for conducting the ongoing evaluation of both intended and unintended consequences. The monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use within a given evaluation system” (AERA Council, 2015, p. 3).
- “Evaluation reports and determinations based on VAM must include statistical estimates of error associated with student growth measures and any ratings or measures derived from them…There should be transparency with respect to VAM uses and the overall evaluation systems in which they are embedded. Reporting should include the rationale and methods used to estimate error and the precision associated with different VAM scores. Also, their reliability from year to year and course to course should be reported. Additionally, when cut scores or performance levels are established for the purpose of evaluative decisions, the methods used, as well as estimates of classification accuracy, should be documented and reported. Justification should [also] be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process…Dissemination should [also] include accessible formats that are widely available to the public, as well as to professionals” (pp. 3–4).
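To make concrete why the Council insists on uncertainty estimates alongside VAM scores (requirements three and eight above), here is a minimal, purely illustrative sketch. It is not AERA’s method or any real VAM; the teacher labels, class sizes, and gain scores are all invented. It simply treats a “teacher effect” as the mean of student residual gain scores and reports a standard error and a 95% confidence interval, showing how much wider the interval is for a small class:

```python
import math
import random
import statistics

random.seed(42)

def vam_estimate_with_ci(gains, z=1.96):
    """Return (point estimate, standard error, 95% CI) for the mean gain.

    A hypothetical stand-in for a VAM score: the mean of student
    residual gain scores, with a normal-approximation interval.
    """
    n = len(gains)
    mean = statistics.fmean(gains)
    se = statistics.stdev(gains) / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)

# Two hypothetical teachers with the same true effect (+0.10) and the
# same student-level noise, but very different class sizes.
small_class = [random.gauss(0.10, 1.0) for _ in range(15)]
large_class = [random.gauss(0.10, 1.0) for _ in range(150)]

for label, gains in [("15 students", small_class), ("150 students", large_class)]:
    est, se, (lo, hi) = vam_estimate_with_ci(gains)
    print(f"{label}: estimate={est:+.2f}, SE={se:.2f}, 95% CI=({lo:+.2f}, {hi:+.2f})")
```

The point of the sketch is the Council’s point: a bare point estimate invites overinterpretation of what may be noise, especially with few students or few years of data, which is why reported scores should carry their error estimates.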
As per the conclusion: “The standards of practice in statistics and testing set a high technical bar for properly aggregating student assessment results for any purpose, especially those related to drawing inferences about teacher, school leader, or educator preparation program effectiveness” (p. 4). Accordingly, the AERA Council recommends that VAMs “not be used without sufficient evidence that this technical bar has been met in ways that support all claims, interpretative arguments, and uses (e.g., rankings, classification decisions)” (p. 4).
CITATION: AERA Council. (2015). AERA statement on use of value-added models (VAM) for the evaluation of educators and educator preparation programs. Educational Researcher, X(Y), 1-5. doi:10.3102/0013189X15618385. Retrieved from http://edr.sagepub.com/content/early/2015/11/10/0013189X15618385.full.pdf+html
First, thanks for all of your efforts to make VAM visible as a big problem, within and beyond AERA.
AERA’s long delay in taking a position on the abuses of VAM in K-12 education over the last 15 years is unsatisfying. It seems politically expedient, a way of “protecting” fans of VAM.
The press release (Nov. 11, 2015) actually gives credence to VAM and the whole mission of “measuring teacher IMPACTS on student learning outcomes.” Is no one at AERA paying attention to this horrible language, or to the gaping hole left for continuing abuse of VAM in this message?
…“While VAM may be superior to some other models of measuring teacher impacts on student learning outcomes, ‘it does not mean that they are ready for use in educator or program evaluation. There are potentially serious negative consequences in the context of evaluation that can result from the use of VAM based on incomplete or flawed data, as well as from the misinterpretation or misuse of the VAM results.’”
So, because VAM has migrated into higher education, suddenly AERA discovers the need to say “not in my territory, not in teacher education, not in my program evaluations”…but VAM may still be SUPERIOR for K-12 teacher evaluations?
Dear colleagues in research, the use of VAM for teacher, principal, and school evaluations has been common in K-12 education for fifteen years. Aided and abetted by your collective silence, countless schools have been closed. Able and committed principals and teachers have been fired. Many others have been demeaned by the aggrandizement of test scores and quixotic ratings from VAM and so-called alternative measures of teaching effectiveness (including the notoriously invalid SLOs, teacher observation protocols, and student surveys).
Researchers and scholars in teacher education are now themselves “at risk” of being VAMed. Your “value-added” status will be determined by the student scores produced by graduates of your programs. Teacher education programs will then be stack-ranked by the long reach of Bill Gates’s “teacher quality” initiative, or a version of ALEC’s A to E grading scheme, or USDE’s HEDI metrics, including the production of “more than a year’s worth of growth” for a highly effective rating.
The red flags needed to be raised the very first day, when VAM algorithms from William Sanders migrated from measuring the productivity of seeds, sows, and cows into education, where they became measures of the productivity of teachers, with scores on standardized tests the essential fuel for the algorithms.
VAMboozled, with the wonderful “OO” eyes spinning, is one of the few repositories of research citations showing the problems with VAM. The first on my list dates to 1971, one of the little exercises by then-fledgling economist Eric Hanushek, who has hammered educators since then in more than 500 articles bemoaning the fact that the work of educators does not meet his economic and statistical assumptions.
Being VAMed means you will be victimized by flawed policies and metrics. AERA still seems to think there is something worthwhile to be learned from feeding test scores to VAM.
I do not.
“VAM scores must only be derived from students’ scores on assessments that meet professional standards of reliability and validity for the purpose to be served…”
“…unless they are accompanied by evidence of reliability and validity as required by the AERA, APA, and NCME Standards for Educational and Psychological Testing” (p. 3).
Noel Wilson has already shown us the COMPLETE INVALIDITY of those student test scores in his never refuted nor rebutted 1997 dissertation, “Educational Standards and the Problem of Error,” found at: http://epaa.asu.edu/ojs/article/view/577/700. And in his essay review of those standards, A Little Less than Valid: An Essay Review, he has shown that the “standards” as delineated in the AERA, APA, and NCME Standards for Educational and Psychological Testing are themselves fundamentally unsound in their conceptual (epistemological and ontological) underpinnings and intellectually bankrupt.
Considering what Wilson has proven, it only remains to reject any educational standardized testing* results for anything, VAM included.
*By educational standardized testing I mean tests where there are right/wrong answers and students are ranked, separated, and sorted according to the number of correct answers. I am not referring to student diagnostic tests, where there are no right/wrong answers and which are used to aid in the diagnosis of a learning, emotional, or other disability.
While it may someday be possible to meet all of the 8 requirements, the cost of doing so vs. the small benefit that would be realized indicates, in my mind at least, that the entire idea of VAM should be abandoned. Spending that much effort to identify a small number of underperforming teachers, while not being able to answer the question of why and how they are underperforming, seems absurd.
…“evidence that the tests used are a valid measure of growth [emphasis added] by measuring the actual subject matter being taught and the full range of student achievement represented in teachers’ classrooms.”
“Growth” means what? It has come to mean a gain in test scores, effectively denying that a more ample and multi-faceted concept of growth has any legitimacy in education.
A related point is that tests in use lack “instructional sensitivity,” and in one sense that will always be true. Why? Learning is almost always in some degree dependent on recall or some spillover from prior grades and experiences beyond school. Any current grade level and subject designation is filled with hauntings. In theory, only if the student enters the classroom as a blank slate can the influence of the current teacher be perfectly mapped.
The argument against VAM is technically powerful. Whether it will become politically persuasive is another matter.