“The Worst Popular Idea Out There”

Featured on the blog “Big Education Ape,” David B. Cohen recently wrote a nice summary of the current thinking about VAMs, with a heck-of-a way of capturing them that also serves as the title of this post: “The Worst Popular Idea Out There.” That statement alone inspired me to post, below, some of the contents of his piece. Hopefully, his post will resonate and sound familiar, but for those who are new to VAMs and/or this blog, this is a nice, short summary of, again, the current thinking about VAMs (see also the original and longer post by Cohen here).

David writes “about [this] evaluation method [as it] stands out as the worst popular idea out there – using value-added measurement (VAM) of student test scores as part of a teacher evaluation. The research evidence showing problems with VAM in teacher evaluation is solid, consistent, and comes from multiple fields and disciplines…The evidence comes from companies, universities, and governmental studies…the anecdotal evidence is rather damning as well: how many VAM train-wrecks do we need to see?…[Teachers all agree] that an effective teacher needs to be able to show student learning, as part of an analytical and reflective architecture of accomplished teaching. It doesn’t mean that student learning happens for every student on the same timeline, showing up on the same types of assessments, but effective teachers take all assessments and learning experiences into account in the constant effort to plan and improve good instruction. [While VAMs] have a certain intuitive appeal, because they claim the ability to predict the trajectory of student test scores,” they just do not work in the ways theorized and intended.

Using Student Surveys to Evaluate Teachers

The technology section of The New York Times released an article yesterday called “Grading Teachers, With Data From Class.” It’s about using student-level survey data, or what students themselves have to say about the effectiveness of their teachers, to supplement (or perhaps trump) value-added and other test-based data when evaluating teacher effectiveness.

I recommend this article to you all in that it’s pretty much right on in terms of using “multiple measures” to measure pretty much anything educational these days, including teacher effectiveness. Likewise, such an approach aligns with the 2014 “Standards for Educational and Psychological Testing” measurement standards recently released by the leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).

Some of the benefits of using student surveys to help measure teacher effectiveness:

  • Student surveys typically yield data that are of more formative use to teachers than most other measures, including data generated via value-added models (VAMs) and many observational systems.
  • These data represent students’ perceptions and opinions. This is important as these data come directly from students in teachers’ classrooms, and students are the most direct “consumers” of (in)effective teaching.
  • In this article in particular, the survey instrument described is open-source. This is definitely of “added value”; rarely are products offered to big (and small) money districts, more or less, for free.
  • This helps with current issues of fairness, or the lack thereof (given that only about 30% of current PreK-12 teachers can be evaluated using students’ test scores). Survey data can apply to virtually all teachers, provided all teachers agree that the more generalized items pertain to them and the subject areas they teach (e.g., physical education). One thing to note, however, is that issues typically arise when these survey data are to come from young children. Our littlest ones are typically happy with most any teacher and do not really have the capacity to differentiate among teacher effectiveness items or sub-factors; hence, these surveys do not typically yield very useful data for either formative (informative) or summative (summary) purposes in the lowest grade levels. Whether student surveys are appropriate for students in such grades is, accordingly, highly questionable.

Some things to consider and some major notes of caution when using student surveys to help measure teacher effectiveness:

  • Response rates are always an issue when valid inferences are to be drawn from such survey data. Too often, folks draw assertions and conclusions they believe to be valid from samples of respondents that are too small and not representative of the population (in this case, the students) who were initially solicited for their responses. Response rates cannot be overlooked; inadequate response rates can and should void the resulting data entirely.
  • There is a rapidly growing market for student-level survey systems such as these, and some vendors are rushing to satisfy the demand without conducting the research necessary to support the claims they are simultaneously marketing. Consumers need to make sure the survey instruments themselves (as well as the online/paper administration systems that often come along with them) are functioning appropriately and accordingly yielding reliable, accurate, and useful data. These instruments are very difficult to construct and validate, so serious attention should be paid to the actual research supporting marketers’ claims. Consumers should continue to ask for the research evidence, as such research is often incomplete or not done at all when tools are needed ASAP. District-level researchers should be more than capable of examining the evidence before any contracts are signed.
  • Relatedly, districts should not necessarily develop such instruments on their own. Not that district personnel are not capable but, as stated, validation research is a long, arduous, and very necessary process. And typically, the instruments already available (especially if free) do a decent job capturing the general teacher effectiveness construct. This too can be debated, however (e.g., in terms of universal and/or too many items, and halo effects).
  • Many in higher education have experience with both developing and using student-level survey data, and much can be learned from the wealth of research and information on using such systems to evaluate college instructor/professor effectiveness. This research certainly applies here. Accordingly, there is much research about how such survey data can be gamed and manipulated by instructors (e.g., via the use of external incentives/disincentives) and can be biased by respondent or instructor background variables (e.g., charisma, attractiveness, gender and race as compared to the gender and race of the teacher or instructor, grade expected or earned in the class, overall grade point average, perceived course difficulty or the lack thereof), and the like. This literature should be consulted so that all users of such student-level survey data are aware of the potential pitfalls when using and consuming such output. This research can also help future consumers be proactive in ensuring, as best they can, that results yield inferences that are as valid as possible.
  • On that note, all educational measurements and measurement systems are imperfect. This is precisely why the standards of the profession call for “multiple measures”: the strengths of each measure hopefully help to offset the weaknesses of the others. This should yield a more holistic assessment of the construct of interest, which in this case is teacher effectiveness. However, the extent to which these data holistically capture teacher effectiveness also needs to be continuously researched and assessed.
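To make the response-rate caution above concrete: even setting aside nonresponse bias, a low response rate in a small class produces estimates too imprecise to support high-stakes inferences. The sketch below is purely illustrative (the class size and respondent counts are invented), using the standard margin-of-error formula for a proportion with a finite-population correction:

```python
import math

def margin_of_error(n_respondents, n_population, p=0.5, z=1.96):
    """Approximate 95% margin of error for a survey proportion,
    with a finite-population correction for a small class.
    Note: this captures sampling imprecision only; it says nothing
    about nonresponse bias (unrepresentative respondents)."""
    se = math.sqrt(p * (1 - p) / n_respondents)
    fpc = math.sqrt((n_population - n_respondents) / (n_population - 1))
    return z * se * fpc

# Hypothetical class of 30 students
high_rr = margin_of_error(27, 30)  # 90% response rate
low_rr = margin_of_error(9, 30)    # 30% response rate
print(f"27/30 respond: ±{high_rr:.2f}")
print(f" 9/30 respond: ±{low_rr:.2f}")
```

With 27 of 30 students responding, the margin of error on an item proportion is roughly ±6 percentage points; with only 9 of 30, it balloons to roughly ±28 points, wide enough to make most teacher-to-teacher comparisons meaningless even before representativeness is considered.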

I hope this helps, and please do respond with comments if you all have anything else to add for the good of the group. I should also add that this is an incomplete list of both the strengths and drawbacks of such approaches; the aforementioned research literature, particularly as it represents 30+ years of using student-level surveys in higher education, should be consulted if more information is needed and desired.


VAMs in U.S. Healthcare: A Parody

Remember the AZ teacher who has written some great posts for us at VAMboozled (see here and here)? She’s at it again. Read this one for an (unfortunately) humorous parody on the topic of VAMs and how they might also be used to evaluate America’s doctors.

We have a real problem in this country. People are dying. They are dying from heart and blood pressure related illnesses. They are dying from diabetes. In 2010, close to 70,000 people in the U.S. died from diabetes. For heart and blood-pressure related illnesses, the news is even worse: over 700,000 people died. Healthcare in this country is going down the tubes. And when you compare American healthcare with the healthcare in countries like Finland or some Asian countries, the problem is made even clearer.

How did things get so out of control? And what can be done to fix the problem?

The answer lies in the research: doctors. Research has shown that a doctor’s intervention is the single most important factor in whether a patient lives or dies. Additionally, the research has shown that the quality of a doctor impacts patients’ health outcomes. The solution to our healthcare woes, then, is our doctors. Imagine a country where we have a high-quality doctor in each and every doctor’s office and hospital!

But how might we do this? And how might we ensure that every patient in the United States has access to a high-quality doctor?

Fortunately, we need not look far. The United States education system has, for some time now, been “successfully” using an evaluation system to ensure that every student in America has a high-quality teacher. A new system would not need to be created, then—it could simply be modeled after the existing teacher evaluation system as based on VAMs!

This is how it would work. Upon a patient’s initial visit to a doctor’s office, the patient would be pre-tested. This pretest would be comprised of standard blood work (lipid profile, glucose, etc.) and a check of blood pressure. After nine months, the patient would be post-tested. The post-test would be comprised of, again, the same standard blood work and a check of blood pressure. The results of the pre- and post-tests would then be plugged into a sophisticated formula that controls for most of those factors not within the doctor’s immediate control (e.g., patient diet, number of office visits, type of insurance plan, exercise, etc.). The result would then accurately indicate how much “value” the doctor “added” to the patient’s health.
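The “sophisticated formula” at the heart of most VAMs is, in essence, a covariate-adjusted regression: predict each patient’s post-test from the pre-test (and any controls), and treat the doctor’s average residual as his or her “value added.” The sketch below is a deliberately minimal caricature in the spirit of the parody; all names and numbers are invented, and real VAMs are far more elaborate (and no less problematic for it):

```python
# Toy "value-added" estimate: regress post-test on pre-test,
# then average each doctor's residuals. All data are invented.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# (doctor, pre-test, post-test) -- say, systolic blood pressure
records = [
    ("Dr. A", 140, 128), ("Dr. A", 150, 135),
    ("Dr. B", 145, 146), ("Dr. B", 155, 158),
]
a, b = fit_line([r[1] for r in records], [r[2] for r in records])

value_added = {}
for doc, pre, post in records:
    resid = post - (a + b * pre)  # actual minus predicted post-test
    value_added.setdefault(doc, []).append(resid)
value_added = {d: sum(v) / len(v) for d, v in value_added.items()}
# A lower-than-predicted post-test (blood pressure) yields a negative
# residual, which in this toy scheme counts as "added value."
```

Note what the toy makes obvious: the “effect” is just whatever residual variation the model fails to explain, attributed wholesale to the doctor, with two patients per doctor standing in for an entire caseload.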

Then, once we know which doctors are adding to patient health and which are not, insurance companies, hospitals, and medical practices could decide which doctors to offer monetary bonuses, contracts, and special certifications, or, conversely, whose contracts and certifications to renege on. Additionally, doctors might receive labels (highly effective, effective, developing, or ineffective) that could be housed in state databases and perhaps advertised in searchable data sets online. This would streamline the vetting process for those (insurance companies, hospitals, and/or medical practices) interested in offering a doctor employment, as well as for members of the public, so that they too might have access to “the best” information about doctors’ quality of care.

The only real problem here would be that only about 30% of doctors would be eligible for ratings, as the tests used are not “standard” across all doctors and all patients, depending on their needs and conditions. But those other tests shouldn’t count anyway, as they are not standardized and are accordingly more subjective.

Not to fret, however, as statisticians could use the actual scores for the eligible 30% to make hospital-level value-added assertions about the others for whom these standardized data were not available. Because the value-added-ineligible doctors ultimately contribute to the effects of the value-added-eligible doctors (even though the ineligible may never come into contact with the eligible doctors’ patients), the ineligible are still contributing to the community’s overall effects.

Hence, implementing a value-added based evaluation system to hold doctors accountable for their effectiveness might just be the key to solving our health problem in the U.S. High-quality doctors will become more high-quality if held accountable for their performance, and THIS will better ensure the health and well-being of our nation.