Jesse Rothstein on Teacher Evaluation and Teacher Tenure

Last week, the Washington Post’s Wonkblog released a piece by Max Ehrenfreund titled “Teacher tenure has little to do with student achievement, economist says.” For those of you who do not know Jesse Rothstein, he’s an Associate Professor of Economics at the University of California – Berkeley, and he is one of the leading researchers/economists conducting research on teacher evaluation and accountability policies writ large, as well as on the value-added models (VAMs) being used for such purposes. He’s probably most famous for a study he conducted in 2009 showing that the non-random, purposeful sorting of students into classrooms indeed biases (or distorts) value-added estimates, pretty much despite the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here.

Anyhow, in this piece Ehrenfreund discusses teacher evaluation and teacher tenure with Rothstein. Some of the key take-aways from the interview for this audience follow, but do read the full piece, linked again here, if so inclined:

Rothstein, on teacher evaluation:

  • In terms of evaluating teachers, “[t]here’s no perfect method. I think there are lots of methods that give you some information, and there are lots of problems with any method. I think there’s been a tendency in thinking about methods to prioritize cheap methods over methods that might be more expensive. In particular, there’s been a tendency to prioritize statistical computations based on student test scores, because all you need is one statistician and the test score data. Classroom observation requires having lots of people to sit in the back of lots and lots of classrooms and make judgments.”
  • Why the interest in value-added? “I think that’s a complicated question. It seems scientific, in a way that other methods don’t. Partly it has to do with the fact that it’s cheap, and it seems like an easy answer.”
  • What about the fantabulous study Raj Chetty and his Harvard colleagues (Friedman and Rockoff) conducted about teachers’ value-added (which has been the source of many prior posts herein)? “I don’t think anybody disputes that good teachers are important, that teachers matter. I have some methodological concerns about that study, but in any case, even if you take it at face value, what it tells you is that higher value-added teachers’ students earn more on average.”
  • What are the alternatives? “We could double teachers’ salaries. I’m not joking about that. The standard way that you make a profession a prestigious, desirable profession, is you pay people enough to make it attractive. The fact that that doesn’t even enter the conversation tells you something about what’s wrong with the conversation around these topics. I could see an argument that says it’s just not worth it, that it would cost too much. The fact that nobody even asks the question tells me that people are only willing to consider cheap solutions.”

Rothstein, on teacher tenure:

  • “Getting good teachers in front of classrooms is tricky,” and it will likely “still be a challenge without tenure, possibly even harder. There are only so many people willing to consider teaching as a career, and getting rid of tenure could eliminate one of the job’s main attractions.”
  • Likewise, “there are certainly some teachers in urban, high-poverty settings that are not that good, and we ought to be figuring out ways to either help them get better or get them out of the classroom. But it’s important to keep in mind that that’s only one of several sources of the problem.”
  • “Even if you give the principal the freedom to fire lots of teachers, they won’t do it very often, because they know the alternative is worse.” The alternative being replacing an ineffective teacher with an even less effective teacher. Contrary to what is oft-assumed, highly qualified teachers are not knocking down the doors to teach in such schools.
  • Teacher tenure is “really a red herring” in the sense that debating tenure ultimately misleads and distracts others from the more relevant and important issues at hand (e.g., recruiting strong teachers into such schools). Tenure “just doesn’t matter that much. If you got rid of tenure, you would find that the principals don’t really fire very many people anyway” (see also point above).

Can Today’s Tests Yield Instructionally Useful Data?

The answer is no, or at best not yet.

Some heavy hitters in the academy just released an article that might be of interest to you all. In the article the authors discuss whether “today’s standardized achievement tests [actually] yield instructionally useful data.”

The authors include W. James Popham, Professor Emeritus from the University of California, Los Angeles; David Berliner, Regents’ Professor Emeritus at Arizona State University; Neal Kingston, Professor at the University of Kansas; Susan Fuhrman, current President of Teachers College, Columbia University; Steven Ladd, Superintendent of Elk Grove Unified School District in California; Jeffrey Charbonneau, National Board Certified Teacher in Washington and the 2013 US National Teacher of the Year; and Madhabi Chatterji, Associate Professor at Teachers College, Columbia University.

These authors explored some of the challenges and promises in terms of using and designing standardized achievement tests and other educational tests that are “instructionally useful.” This was the focus of a recent post about whether Pearson’s tests are “instructionally sensitive” and what University of Texas – Austin’s Associate Professor Walter Stroup versus Pearson’s Senior Vice President had to say on this topic.

In this study, the authors deliberate more specifically on the consequences of using inappropriately designed tests for decision-making purposes, particularly when tests are insensitive to instruction. Here, the authors underscore serious issues related to validity, ethics, and consequences, all of which they appropriately elevate to speak out, particularly against the use of current, large-scale standardized achievement tests for evaluating teachers and schools.

The authors also make recommendations for local policy contexts, offering recommendations to support (1) the design of more instructionally sensitive large-scale tests as well as (2) the design of other smaller scale tests that can also be more instructionally sensitive, and just better. These include but are not limited to classroom tests as typically created, controlled, and managed by teachers, as well as district tests as sometimes created, controlled, and managed by district administrators.

Such tests might help to create more, but also better, comprehensive educational evaluation systems, the authors ultimately argue, although this would, of course, require more professional development to help teachers (and others, including district personnel) develop more instructionally sensitive, and accordingly more useful, tests. As they also note, this would also require that “validation studies…be undertaken to ensure validity in interpretations of results within the larger accountability policy context where schools and teachers are evaluated.”

This is especially important if tests are to be used for low- and high-stakes decision-making purposes. Yet this is something that is way too often forgotten when it comes to test use, and in particular test abuse. All should really take heed here.

Charter Schools’ Value-Added in Ohio

On the 10th Period blog, Stephen Dyer, an Education Policy Fellow at Innovation Ohio, wrote about charter schools’ versus traditional schools’ value-added. Click here to read the full blog post, and also to view Dyer’s illustrative graphs explaining the headline: that “Charter Value Added Grades [are] Not Much Better” than the value-added grades of their comparable public schools.

First, it is important to note that the state of Ohio uses the Education Value-Added Assessment System (EVAAS), of interest in many prior posts on this blog. Second, it is important to note that there are flaws in all of these data, so consume these findings with a critical eye: very few people agree that value-added data are yielding valid results, or rather results from which valid inferences can be drawn. Even if the self-reported “best” value-added system is being used in the state of Ohio, this does not mean that these results (even though they support public schools) are indeed accurate, much less informative.

That said, as Dyer stated, at a more macro level (i.e., the district/school versus teacher level) “VAM holds more promise, is less swayed by demographics than raw test scores, and is better philosophically. Though it still needs a lot of work.” Using VAM output at the macro level might be okay, largely again if used only for descriptive purposes. Because this is a school-level analysis, other researchers would also be more inclined to agree.

While there are certainly some sampling issues in this analysis, as also acknowledged by Dyer, in that charter schools in general have fewer students, making some analyses (e.g., analyses of gifted students) impossible, Dyer’s main findings follow:

  • “Districts still get higher percentages of As and Bs on all the value added categories. Meanwhile, Charters get higher percentages of Ds and Fs than districts do.”
  • In one value added category (VAM among the lowest scoring 20% of students), charters got 1% more As than districts.
  • Otherwise, charters “fail at a significantly higher level in all these categories than the districts from which they receive their children and money.”
  • Overall, “Charters do a little bit better than their raw scores would indicate. But it’s still nothing to write home about.”

It is also important to note that “every Ohio school district lost money and children to Charter Schools last year (only Ohio’s tiny Lake Erie island districts did not).” If I were a parent in Ohio, I for one would pause before making such a decision given the above, even given the limitations. If I were a policymaker in Ohio? I’d really rethink this year’s budget, given that last year’s came in at $914 million.

The Arbitrariness Inherent in Teacher Observations

In a recent article released in The Journal News, a newspaper serving many suburban New York counties, another common problem is highlighted: districts that have adopted the same teacher observational system (in this case as mandated by the state) are scoring what are likely to be very similar teachers very differently. While one of the best school districts not only in the state but in the nation apparently has no “highly effective” teachers on staff, a neighboring district apparently has a staff 99% filled with “highly effective” teachers.

The “believed to be” model developer, Charlotte Danielson, is cited as stating that “[s]aying 99 percent of your teachers are highly effective is laughable.” I don’t know if I completely agree with her statement, and I do have to admit I question her perspective on this one, and all of her comments throughout this article for that matter, as she is the one who is purportedly offering up her “valid” Framework for Teaching for such observational purposes. Perhaps she’s displacing blame, arguing that it’s the subjectivity of the scorers, rather than the subjectivity inherent in her system, that is to blame for the stark discrepancies.

As per Danielson: “The local administrators know who they are evaluating and are often influenced by personal bias…What it also means is that they might have set the standards too low.” As per the Superintendent of the District with 99% highly effective teachers: The state’s “flawed” evaluation model forced districts to “bump up” the scores so “effective” teachers wouldn’t end up with a rating of “developing.” The Superintendent adds that it is possible under the state’s system to be rated “effective” across domains and still end up rated as “developing,” which means teachers may be in need of intervention/improvement, or may be eligible for an expedited hearing process that could lead to their termination. Rather it may have been the case that the scores were inflated to save effective teachers from what the district viewed as an ineffective set of consequences attached to the observational system (i.e., intervention or termination).

Danielson is also cited as saying that teachers should live in “effective” and only [occasionally] visit “highly effective.” She also notes that if her system contradicts teachers’ value-added scores, this too should “raise red flags” about the quality of the teacher, although she does not (in this article) pay any respect or regard to the issues inherent not only in value-added measures but also in her observational system.

What is most important in this article, though, is that reading through it illustrates well the arbitrariness of how all of the measures being mandated and used to evaluate teachers are actually being used. Take, for example, the other note herein that the state department’s intent seems to be that 70%-80% of teachers should “fall in the middle” as “developing” or “effective.” Mandating a priori that 70%-80% of teachers hang around average, regardless of how those teachers actually perform, could not be more arbitrary.

In the end, teacher evaluation systems are highly flawed, highly subjective, and highly prone to error and the like, and for people who just don’t “get it” to be passing policies to the contrary is nonsensical and absurd. These flaws are not as important when evaluation system data are used for formative, or informative, purposes, where data consumers have more freedom to take the data for what they are worth. When summary, or summative, decisions are to be made based on these data, regardless of whether low or high stakes are attached to the decision, this is where things really go awry.

The War Report

On September 14 (tomorrow, or for those of you reading this on Sunday, today), I will be interviewed by Dr. James Miller on an online radio show called The War Report. Do drop in. See all show details in the flyer below.

Principals’ Perspectives on Value-Added

Principals are not using recent teacher evaluation data, including data from value-added assessment systems, student surveys, and other student achievement indicators, to inform decisions about hiring, placements, and professional development, according to findings from a research study recently released by researchers at Vanderbilt University.

The data most often used by principals? Data collected via their direct observations of their teachers in practice.

Education Week’s Denisa Superville also covered this study here, writing that principals are most likely to use classroom-observation data to inform such decisions, rather than the data yielded via VAMs and other student test scores. Of least relevance were data derived via parent surveys.

Reasons for not using value-added data specifically? “[A]ccess to the data, the availability of value-added measures when decisions are being made, a lack of understanding of the statistical models used in the evaluation systems, and the absence of training in using [value-added] data.”

Moving forward, “the researchers recommend that districts clarify their expectations for how principals should use data and what data sources should be used for specific human-resources decisions. They recommend training for principals on using value-added estimates, openly encouraging discussions about data use, and clarifying the roles of value-added estimates and observation scores.”

If this is to happen, hopefully such efforts will be informed by the research community, in order to help districts and administrators more critically consume value-added data in particular, for that which they can and cannot do.

Note: This study has not yet been peer-reviewed, so please consume this information with that in mind.

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week, the Texas Observer published an article titled “Mute the Messenger” about University of Texas – Austin’s Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests, as supported by Pearson, were flawed as per their purposes to measure teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. It was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of, Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert, Professor Emeritus W. James Popham) a test is able to differentiate between well-taught and poorly taught students, versus merely able to differentiate between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school that students bring with them to the schoolhouse door).

Test developers like Pearson seem to focus on the former, claiming that their tests are indeed sensitive to instruction. Testing/measurement academics, and especially practitioners, seem to focus on the latter: tests are sensitive to instruction, yes, but not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert, Regents Professor David Berliner) sensitive to instruction but, more importantly, sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. Here seems to be where this now very heated and polarized argument between Pearson and Associate Professor Stroup stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT), while defending its tests as “instructionally sensitive.” This approach examines things like p-values (essentially, the proportions of students who respond to items correctly) and item-discrimination indices (to see if test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] certain things otherwise). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” From what I know about Dr. Stroup’s actual claims, this trivializes his overall arguments. Tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g., as per items’ aforementioned p-values). This alone makes tests less “instructionally sensitive” in practice. In other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if using IRT deems an item “too easy,” even if the item is taught well (i.e., “instructionally sensitive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now, more than before, held accountable for teaching these items), and (2) more “instructionally sensitive” in the eyes of test developers, in that the fewer students who get test items correct, the better the items are at discriminating between those who know (or are taught) certain things and students who don’t know (or are not taught) certain things otherwise.

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., “instructionally sensitive”), “too many” students answer such items correctly. Sometimes these items yield p-values greater than p = 0.80, or 80% correct.

When you need a test and its outcome score data to fit around the bell curve, you cannot have such items, or too many of them, on the final test. In the simplest of terms, for every item with a p-value of 80% you would need another with a p-value of 20% to balance items out, or to keep the overall mean of the test around p = 0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around such a mean; otherwise the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) certain things otherwise. Items with high p-values do not always distribute scores well enough, because “too many students” answering them correctly reduces the variation (or spread of scores) needed.
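A back-of-the-envelope sketch makes the spread argument concrete. Under the simplest possible model (independent items, every student facing the same per-item probability of success — a deliberate oversimplification), the score variance an item contributes is p(1 − p), which peaks at p = 0.50 and shrinks as items get very easy or very hard. The item difficulties below are invented:

```python
def item_variance(p):
    # Bernoulli variance: how much an item with proportion-correct p can
    # spread students' total scores under this toy independence model.
    return p * (1 - p)

easy_test     = [0.85, 0.90, 0.80, 0.88]   # items "too many" students get right
balanced_test = [0.45, 0.55, 0.50, 0.50]   # items hanging around p = 0.50

spread_easy = sum(item_variance(p) for p in easy_test)
spread_balanced = sum(item_variance(p) for p in balanced_test)

print(f"easy test score variance:     {spread_easy:.3f}")
print(f"balanced test score variance: {spread_balanced:.3f}")
```

The balanced test spreads scores roughly twice as widely here, which is exactly why high p-value items tend to be removed, regardless of their instructional value.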

The counter-item in this case is another item also meant to capture statistics and/or data display, but that is much more difficult, largely because it’s rarely taught because it rarely matters in the real world. Take, for example, the box and whisker plot. If you don’t know what this is, which is in and of itself telling in this example, see them described and illustrated here. Often, this item IS found on such tests because this item IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) certain things otherwise.
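For those unfamiliar with it, a box and whisker plot is drawn from a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. A minimal sketch using only Python’s standard library (the scores are invented):

```python
import statistics

scores = [52, 61, 64, 68, 70, 73, 75, 79, 84, 91, 98]  # invented test scores

# quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
q1, median, q3 = statistics.quantiles(scores, n=4)

summary = {
    "min": min(scores),     # end of the lower whisker
    "Q1": q1,               # bottom of the box
    "median": median,       # line inside the box
    "Q3": q3,               # top of the box
    "max": max(scores),     # end of the upper whisker
}
print(summary)
```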

Because this item is not as often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test“), and because this item doesn’t really matter in the real world, it becomes an item that is more useful for the test, as well as the overall functioning of the test, than it is an item that is useful for the students tested on it.

A side bar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of the resulting doctoral students’ scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned the best as per our IRT analysis? The box and whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many similar items, had a lower relative p-value, and accordingly helped to increase the difficulty of the test and discriminate results to assert a purported “instructional sensitivity,” regardless of whether the item was actually valued, and more importantly, valued in instruction.

Thanks to IRT, the items left on such tests are often not the items taught by teachers, or perhaps not taught by teachers well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes, the tests are aligned with the standards, as such items are in the standards, but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.

“The Worst Popular Idea Out There”

Featured on the blog titled the “Big Education Ape,” David B. Cohen recently wrote a nice summary re: the current thinking about VAMs, with a heck-of-a way of capturing them that also serves as the title of this post: “The Worst Popular Idea Out There.” That statement alone inspired me to post, below, some of the contents of his piece. Hopefully, his post will resonate and sound familiar; to those who are new to VAMs and/or this blog, this is a nice, short summary of, again, the current thinking about VAMs (see also the original and longer post by Cohen here).

David writes “about [this] evaluation method [as it] stands out as the worst popular idea out there – using value-added measurement (VAM) of student test scores as part of a teacher evaluation. The research evidence showing problems with VAM in teacher evaluation is solid, consistent, and comes from multiple fields and disciplines…The evidence comes from companies, universities, and governmental studies…the anecdotal evidence is rather damning as well: how many VAM train-wrecks do we need to see?…[Teachers all agree] that an effective teacher needs to be able to show student learning, as part of an analytical and reflective architecture of accomplished teaching. It doesn’t mean that student learning happens for every student on the same timeline, showing up on the same types of assessments, but effective teachers take all assessments and learning experiences into account in the constant effort to plan and improve good instruction. [While VAMs] have a certain intuitive appeal, because they claim the ability to predict the trajectory of student test scores,” they just do not work in the ways theorized and intended.

Using Student Surveys to Evaluate Teachers

The technology section of The New York Times released an article yesterday called “Grading Teachers, With Data From Class.” It’s about using student-level survey data, or what students themselves have to say about the effectiveness of their teachers, to supplement (or perhaps trump) value-added and other test-based data when evaluating teacher effectiveness.

I recommend this article to you all in that it’s pretty much right on in terms of using “multiple measures” to measure pretty much anything educational these days, including teacher effectiveness. Likewise, such an approach aligns with the 2014 “Standards for Educational and Psychological Testing” measurement standards recently released by the leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).

Some of the benefits of using student surveys to help measure teacher effectiveness:

  • Student-level data based on such surveys typically yield data that are of more formative use to teachers than most other data, including data generated via value-added models (VAMs) and many observational systems.
  • These data represent students’ perceptions and opinions. This is important as these data come directly from students in teachers’ classrooms, and students are the most direct “consumers” of (in)effective teaching.
  • In this article in particular, the survey instrument described is open-source. This is definitely of “added value”; rare is it that products are offered to big (and small) money districts, more or less, for free.
  • This helps with current issues of fairness, or the lack thereof (where only about 30% of current PreK-12 teachers can be evaluated using students’ test scores). Survey data can apply to virtually all teachers, if all teachers agree that the more generalized items pertain to them and the subject areas they teach (e.g., physical education). One thing to note, however, is that issues typically arise when these survey data are to come from young children. Our littlest ones are typically happy with most any teacher and do not really have the capacity to differentiate among teacher effectiveness items or sub-factors; hence, these surveys do not typically yield very useful data for either formative (informative) or summative (summary) purposes in the lowest grade levels. Whether student surveys are appropriate for students in such grades is, accordingly, highly questionable.

Some things to consider and some major notes of caution when using student surveys to help measure teacher effectiveness:

  • Response rates are always an issue when valid inferences are to be drawn from such survey data. Too often folks draw assertions and conclusions they believe to be valid from samples of respondents that are too small and not representative of the population (in this case, the students who were initially solicited for their responses). Response rates cannot be overlooked; if response rates are inadequate, this can and should void the data entirely.
  • There is a rapidly growing market for student-level survey systems such as these, and some are rushing to satisfy the demand without conducting the research necessary to make the claims they are simultaneously marketing. Consumers need to make sure such survey instruments themselves (as well as the online/paper administration systems that often come along with them) are functioning appropriately, and accordingly yielding reliable, good, accurate, useful, etc. data. These instruments are very difficult to construct and validate, so serious attention should be paid to the actual research supporting marketers’ claims. Consumers should continue to ask for the research evidence, as such research is often incomplete or not done when tools are needed ASAP. District-level researchers should be more than capable of examining the evidence before any contracts are signed.
  • Related, districts should not necessarily do this on their own. Not that district personnel are not capable, but as stated, validation research is a long, arduous, but also very necessary process. And typically, the instruments available (especially if for free) do a decent job capturing the general teacher effectiveness construct. This too can be debated, however (e.g., in terms of universal and/or too many items and halo effects).
  • Many in higher education have experience with both developing and using student-level survey data, and much can be learned from the wealth of research and information on using such systems to evaluate college instructor/professor effectiveness. This research certainly applies here. Accordingly, there is much research about how such survey data can be gamed and manipulated by instructors (e.g., via the use of external incentives/disincentives), can be biased by respondent or student background variables (e.g., charisma, attractiveness, gender and race as compared to the gender and race of the teacher or instructor, grade expected or earned in the class, overall grade point average, perceived course difficulty or the lack thereof), and the like. This literature should be consulted, so that all users of such student-level survey data are aware of the potential pitfalls when using and consuming such output. Accordingly, this research can help future consumers be proactive in terms of ensuring, as best they can, that results yield inferences as valid as possible.
  • On that note, all educational measurements and measurement systems are imperfect. This is precisely why the standards of the profession call for “multiple measures”: the strengths of each measure should help to offset the weaknesses of the others. This should yield a more holistic assessment of the construct of interest, which in this case is teacher effectiveness. However, the extent to which these data holistically capture teacher effectiveness also needs to be continuously researched and assessed.

I hope this helps, and please do respond with comments if you have anything else to add for the good of the group. I should also add that this is an incomplete list of both the strengths and drawbacks of such approaches; the aforementioned research literature, particularly as it represents 30+ years of using student-level surveys in higher education, should be consulted if more information is needed or desired.

VAMs in U.S. Healthcare: A Parody


Remember the AZ teacher who has written some great posts for us at VAMboozled (see here and here)? She’s at it again. Read this one for an (unfortunately) humorous parody on the topic of VAMs and how they might also be used to evaluate America’s doctors.

We have a real problem in this country. People are dying. They are dying from heart and blood pressure related illnesses. They are dying from diabetes. In 2010, close to 70,000 people in the U.S. died from diabetes. For heart and blood-pressure related illnesses, the news is even worse: over 700,000 people died. Healthcare in this country is going down the tubes. And when you compare American healthcare with the healthcare in countries like Finland or some Asian countries, the problem is made even clearer.

How did things get so out of control? And what can be done to fix the problem?

The answer lies in the research: doctors. Research has shown that a doctor’s intervention is the single most important factor in whether a patient lives or dies. Additionally, the research has shown that the quality of a doctor impacts patients’ health outcomes. The solution to our healthcare woes, then, is our doctors. Imagine a country where we have a high-quality doctor in each and every doctor’s office and hospital!

But how might we do this? And how might we ensure that every patient in the United States has access to a high-quality doctor?

Fortunately, we need not look far. The United States education system has, for some time now, been “successfully” using an evaluation system to ensure that every student in America has a high-quality teacher. A new system would not need to be created, then—it could simply be modeled after the existing teacher evaluation system as based on VAMs!

This is how it would work. Upon a patient’s initial visit to a doctor’s office, the patient would be pre-tested. This pre-test would consist of standard blood work (lipid profile, glucose, etc.) and a check of blood pressure. After nine months, the patient would be post-tested. The post-test would again consist of the same standard blood work and a check of blood pressure. The results of the pre- and post-tests would then be plugged into a sophisticated formula that controls for most of the factors not within the doctor’s immediate control (e.g., patient diet, number of office visits, type of insurance plan, exercise, etc.). The result would then accurately indicate how much “value” the doctor “added” to the patient’s health.
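For the statistically inclined, the kind of computation the parody describes can be sketched in a few lines. This is a minimal, illustrative sketch using entirely fabricated data, not any state’s actual VAM: regress patients’ post-test scores on their pre-test scores plus a control variable, then treat each doctor’s average residual as his or her “value added.”

```python
# Illustrative value-added sketch: post-test regressed on pre-test and a
# control, with each doctor's mean residual taken as "value added."
# All data below are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_doctors = 300, 10
doctor = rng.integers(0, n_doctors, n_patients)   # which doctor each patient saw
pre = rng.normal(120, 15, n_patients)             # e.g., baseline blood pressure
diet = rng.normal(0, 1, n_patients)               # a stand-in "control" variable
quality = rng.normal(0, 2, n_doctors)             # unobserved true doctor quality
post = 0.9 * pre - 3.0 * diet + quality[doctor] + rng.normal(0, 5, n_patients)

# OLS of post-test on [intercept, pre-test, control]
X = np.column_stack([np.ones(n_patients), pre, diet])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
residuals = post - X @ beta

# Doctor-level "value added" = mean residual among that doctor's patients
vam = np.array([residuals[doctor == d].mean() for d in range(n_doctors)])
print(np.round(vam, 2))
```

Even in this toy version, the estimates for doctors with few patients are noisy, which hints at why real VAM rankings bounce around from year to year.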

Then, once we know which doctors are adding to patient health and which are not, insurance companies, hospitals, and medical practices could decide which doctors to offer monetary bonuses, contracts, and special certifications, or, conversely, which contracts and certifications to renege on. Additionally, doctors might receive labels (highly effective, effective, developing, or ineffective) that could be housed in state databases and perhaps advertised in searchable datasets online, both to streamline the vetting process for those (insurance companies, hospitals, and/or medical practices) interested in offering a doctor employment and to give members of the public access to “the best” information about doctors’ quality of care.

The only real problem here would be that only about 30% of the doctors would be eligible for ratings as the tests used are not “standard” across all doctors and all patients, depending on their needs and conditions. But these other tests shouldn’t count anyway as they are not standardized and accordingly more subjective. 

Not to fret, however, as statisticians could use the actual scores for the eligible 30% to make hospital-level value-added assertions about the others for whom these standardized data were not available. Because the value-added ineligible doctors ultimately contribute to the effects of the value-added eligible doctors, even though the ineligible may never come into contact with the eligible doctors’ patients, the ineligible are still contributing to the community’s overall effects.

Hence, implementing a value-added based evaluation system to hold doctors accountable for their effectiveness might just be the key to solving our health problem in the U.S. High-quality doctors will become more high-quality if held accountable for their performance, and THIS will better ensure the health and well-being of our nation.
