Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)

David C. Berliner, Regents’ Professor Emeritus here at Arizona State University (ASU), who also just happens to be my former albeit forever mentor, recently took up research on the use of test scores to evaluate teachers, for example, using value-added models (VAMs). While David is world-renowned for his research in educational psychology, and more specific to this case, his expertise on effective teaching behaviors and how to capture and observe them, he has also now ventured into the VAM-related debates.

Accordingly, he recently presented his newest, soon-to-be-published research on using standardized tests to evaluate teachers, something he aptly termed in the title of his presentation “A Policy Fiasco.” He delivered his speech to an audience in Melbourne, Australia, and you can click here for the full videotaped presentation. The whole presentation takes about one hour to watch, and I must say the full hour is well worth it, but I highlight his key points below. These should certainly be of interest to you all as followers of this blog, and hopefully to others.

Of main interest are his 14 reasons, “big and small for [his] judgment that assessing teacher competence using standardized achievement tests is nearly worthless.”

Here are his fourteen reasons:

  1. “When using standardized achievement tests as the basis for inferences about the quality of teachers, and the institutions from which they came, it is easy to confuse the effects of sociological variables on standardized test scores” and the effects teachers have on those same scores. Sociological variables (e.g., chronic absenteeism) continue to distort even the best attempts to disentangle them from the very instructional variables of interest. These, which we also term biasing variables, should not be inappropriately dismissed as purportedly statistically “controlled for.”
  2. In law, we do not hold people accountable for the actions of others; for example, when a child kills another child, the parents are not charged as guilty. Hence, “[t]he logic of holding [teachers and] schools of education responsible for student achievement does not fit into our system of law or into the moral code subscribed to by most western nations.” Relatedly, should medical schools, or doctors for that matter, be held accountable for the health of their patients? One of the best parts of his talk, in fact, is about the medical field and the parallels Berliner draws between doctors and medical schools, and teachers and colleges of education, respectively (around the 19-25 minute mark of his video presentation).
  3. Professionals are often held harmless for their lower success rates with clients who have observable difficulties in meeting the demands and the expectations of the professionals who attend to them. In medicine again, for example, when working with impoverished patients, “[t]here is precedent for holding [doctors] harmless for their lowest success rates with clients who have observable difficulties in meeting the demands and expectations of the [doctors] who attend to them, but the dispensation we offer to physicians is not offered to teachers.”
  4. There are other quite acceptable sources of data, besides tests, for judging the efficacy of teachers and teacher education programs. “People accept the fact that treatment and medicine may not result in the cure of a disease. Practicing good medicine is the goal, whether or not the patient gets better or lives. It is equally true that competent teaching can occur independent of student learning or of the achievement test scores that serve as proxies for said learning. A teacher can literally “save lives” and not move the metrics used to measure teacher effectiveness.
  5. Reliance on standardized achievement test scores as the source of data about teacher quality will inevitably promote confusion between “successful” instruction and “good” instruction. “Successful” instruction gets test scores up. “Good” instruction leaves lasting impressions, fosters further interest by the students, makes them feel competent in the area, etc. Good instruction is hard to measure, but remains the goal of our finest teachers.
  6. Relatedly, teachers affect individual students greatly, but affect standardized achievement test scores very little. All can think of how their own teachers impacted their lives in ways that cannot be captured on a standardized achievement test. Standardized achievement test scores are much more related to home, neighborhood and cohort than they are to teachers’ instructional capabilities. In more contemporary terms, this is also due to the fact that large-scale standardized tests have (still) never been validated to measure student growth over time, nor have they been validated to attribute that growth to teachers. “Teachers have huge effects, it’s just that the tests are not sensitive to them.”
  7. Teachers’ effects on standardized achievement test scores fade quickly, and are barely discernible after a few years. So we might not want to overly worry about most teachers’ effects on their students—good or bad—as they are hard to detect on tests after two or so years. To use these ephemeral effects to then hold teacher education programs accountable seems even more problematic.
  8. Observational measures of teacher competency and achievement tests of teacher competency do not correlate well. This suggests nothing more than that one or both of these measures, and likely the latter, are malfunctioning in their capacities to measure the teacher effectiveness construct. See other Vamboozled posts about this here, here, and here.
  9. Different standardized achievement tests, both purporting to measure reading, mathematics, or science at the same grade level, will give different estimates of teacher competency. That is because different test developers have different visions of what it means to be competent in each of these subject areas. Thus one achievement test in these subject areas could find a teacher exemplary, but another test of those same subject areas would find the teacher lacking. What then? Have we an unstable teacher or an ill-defined subject area?
  10. Tests can be administered early or late in the fall, early or late in the spring, and the dates they are given influence the judgments about whether a teacher is performing well or poorly. Teacher competency should not be determined by minor differences in the date of testing, but that happens frequently.
  11. No standardized achievement tests have provided proof that their items are instructionally sensitive. If test items do not, because they cannot, “react to good instruction,” how can one make a claim that the test items are “tapping good instruction”?
  12. Teacher effects show up more dramatically on teacher-made tests than on standardized achievement tests because the former are based on the enacted curriculum, while the latter are based on the desired curriculum. Tests get seven times more instructionally sensitive the closer they are to the classroom (i.e., teacher-made tests).
  13. The opt-out testing movement invalidates inferences about teachers and schools that can be made from standardized achievement test results. It’s not bad to remove these kids from taking these tests, and perhaps it is even necessary in our over-tested schools, but the tests, and the VAM estimates derived via these tests, are far less valid when that happens. This is because the students who opt out are likely different in significant ways from those who do take the tests. This severely limits the validity claims that are made.
  14. Assessing new teachers with standardized achievement tests is likely to yield many false negatives. That is, the assessments would identify teachers early in their careers as ineffective in improving test scores, which is, in fact, often the case for new teachers. Two or three years later that could change. Perhaps the last thing we want to do in a time of teacher shortage is discourage new teachers while they acquire their skills.

6 thoughts on “Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)”

  1. I just read an essay by Berliner from 2005 on poverty and education which is in a link from the current post on Daniel Katz’s blog about Hillary’s statement. It’s the most comprehensive analysis I’ve read on the subject. You’re so lucky to have had him as a mentor!

  2. Great post! I especially appreciate #14. I work with some excellent early career teachers, and I hope they stick around despite the negative influences beating down teacher efficacy in the profession today.

  3. ESSA is chock full of calls for measures, assessments, evaluations of teacher effectiveness, including measures of student “growth.” These calls are especially intense in Title II as the primary means of determining whether prospective teachers, including student teachers and those in extended residencies, should be eligible for certification.
    ESSA is, in my opinion, a congressional thumb in the eye to scholars in education and all coursework for teachers that is not strictly nuts and bolts and test prep. A clear expression of this disdain can be read in the doublespeak about charter academies for teacher preparation, with the authorizer in chief the governor, and five stipulations about what these academies are supposed to do (Title II, section 2002, 4A), with all those requirements removed in 4B if prospective teachers pass “state-approved content area examinations.” What is removed? Think Relay writ large. No need for faculty in those “academies” to have advanced degrees or do academic research, no course credit requirements, no restrictions on academies via accrediting bodies, and so on.

  4. It seems crazy to measure teachers’ effectiveness on students when that personal effect diminishes quickly. Certainly, a teacher who cannot help individual students find the intrinsic motivation within themselves to learn the material may not appear to be doing the student a service, but so much of that intrinsic motivation comes from places OTHER than the classroom. Teachers should not be “punished” on evaluations due to influences that are clearly out of their control (parental influences, societal influences, peer influences, etc.). We need a level playing field for teachers…the good ones should be rewarded based on FAIR standards over which they have some degree of control, and the poor ones should be given a chance to improve, based on their appropriate influence on the students, or find another career choice. In either event, FAIRness is the key.

  5. And another reason not to use VAM stats: the scores are so unstable for a given teacher, over time, as to defy belief. I and David Rubinstein have looked at the published value-added measurements of teachers in NYPS in the same schools, teaching the same subjects at the same grade levels, under almost identical conditions, or at different grade levels, and found that there is almost no correlation between them.
    Whatever VAM measures, it’s not teacher quality. It’s more like a random number generator.

  6. You make some really good points about turning away from standardized tests as a teacher evaluation method. Since some of these tests can be administered near the end of the school year, I imagine that a lot of students don’t try as well because they know it won’t be a part of their grade, which I think fits in with your tenth point. In fact, because of all of these problems, I think that schools should move towards a whole new evaluation method completely. It might be a good idea to introduce evaluation software that can let the school evaluate exactly what they want, and not just test scores.
