Is Combining Different Tests to Measure Value-Added Valid?

A few days ago, on Diane Ravitch’s blog, a reader posted the following comment, which Diane sent to me for a response:

“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SB [Smarter Balanced] Common Core [assessment] to be given this spring. Is this statistically valid? Help!”

Here’s what I wrote in response, which I’m sharing here with you all because this practice is becoming increasingly common across the country; hence, it is becoming a question of increasingly high priority and interest:

“We have three issues here with equating different tests SIMPLY because these tests assess the same students on the same things at around the same time.

First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.

Second, doing this ASSUMES that all standardized tests are vertically scaled, whereby scales increase linearly as students progress through the grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry, without any real theoretical or other foundation in support.

Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates range WIDELY across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”
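On the second point, about “norming”: here is a minimal sketch, using entirely made-up scores on two hypothetical tests, of what this kind of rescaling amounts to in practice (the numbers are illustrative assumptions, not real data). Standardizing each test forces a common mean and standard deviation, but it does nothing to establish that the two tests measure the same things on the same scale, let alone that they can be equated.

```python
import numpy as np

# Entirely hypothetical scale scores for the same seven students on two
# different standardized tests; illustrative numbers only, not real data.
test_a = np.array([182.0, 195.0, 201.0, 210.0, 224.0, 238.0, 251.0])
test_b = np.array([410.0, 445.0, 430.0, 470.0, 505.0, 520.0, 560.0])

def norm(scores):
    """Rescale scores to a common metric: mean 0, standard deviation 1."""
    return (scores - scores.mean()) / scores.std()

z_a, z_b = norm(test_a), norm(test_b)

# After "norming," both sets of scores share (approximately) the same mean
# and standard deviation...
print(z_a.mean(), z_a.std())   # ~0.0, 1.0
print(z_b.mean(), z_b.std())   # ~0.0, 1.0

# ...but matching these two moments says nothing about whether the tests
# measure the same content, are vertically scaled, or can be treated as
# interchangeable outcomes in a value-added model.
```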

See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.

Please Refrain from “Think[ing] of VAMs Like an Oak Tree”

It happened again. In the Tampa Bay Times, a journalist encouraged his readers to, as per the title of his article, “Think of VAMs Like an Oak Tree,” as folks in Florida are now beginning to interpret and consume Florida teachers’ “value-added” data. It even seems that folks there are “pass[ing] around the University of Wisconsin’s ‘[O]ak [T]ree [A]nalogy’” to help others understand what is, unfortunately, a very over-simplistic and overly optimistic version of the very complex realities surrounding VAMs.

He, and others, obviously missed the memo.

So, I am redirecting current and future readers to Stanford Professor Edward Haertel’s deconstruction of the “Oak Tree Analogy,” so that we all might better spread the word about this faulty analogy.

I have also re-pasted Professor Haertel’s critique below:

The Value-Added Research Center’s ‘Oak Tree’ analogy is helpful in conveying the theory [emphasis added] behind value-added models. To compare the two gardeners, we adjust away various influences that are out of the gardeners’ control, and then, as with value added, we just assume that whatever is left over must have been due to the gardener.  But, we can draw some important lessons from this analogy in addition to those highlighted in the presentation.

In the illustration, the overall effect of rainfall was an 8-inch difference in annual growth (+3 inches for one gardener’s location; -5 for the other). Effects of soil and temperature, in one direction or the other, were 5 inches and 13 inches. But the estimated effect of the gardeners themselves was only a 4-inch difference. 

As with teaching, the value-added model must sort out a small “signal” from a much larger amount of “noise” in estimating the effects of interest. It follows that the answer obtained may depend critically on just what influences are adjusted for. Why adjust for soil condition? Couldn’t a skillful gardener aerate the soil or amend it with fertilizer? If we adjust only for rainfall and temperature then Gardener B wins. If we add in the soil adjustment, then Gardener A wins. Teasing apart precisely those factors for which teachers justifiably should be held accountable versus those beyond their control may be well-nigh impossible, and if some adjustments are left out, the results will change. 

Another message comes from the focus on oak tree height as the outcome variable.  The savvy gardener might improve the height measure by removing lower limbs to force growth in just one direction, just as the savvy teacher might improve standardized test scores by focusing instruction narrowly on tested content. If there are stakes attached to these gardener comparisons, the oak trees may suffer.

The oak tree height analogy also highlights another point. Think about the problem of measuring the exact height of a tree—not a little sketch on a PowerPoint slide, but a real tree. How confidently could you say how tall it was to the nearest inch?  Where, exactly, would you put your tape measure? Would you measure to the topmost branch, the topmost twig, or the topmost leaf? On a sunny day, or at a time when the leaves and branches were heavy with rain?

The oak tree analogy does not discuss measurement error. But one of the most profound limitations of value-added models, when used for individual decision making, is their degree of error, referred to technically as low reliability. Simply put, if we compare the same two gardeners again next year, it’s anyone’s guess which of the two will come out ahead.
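To make Professor Haertel’s point about adjustments concrete, here is a minimal sketch in Python. The growth figures and adjustment values below are hypothetical numbers chosen purely for illustration (they are not the figures from the video); the point is simply that the choice of which influences to “adjust away” can flip which gardener “wins.”

```python
# Hypothetical annual growth (in inches) for two gardeners' oak trees, plus
# hypothetical estimates of the factors outside each gardener's control.
# All numbers are made up for illustration; they are not the video's figures.
raw_growth = {"A": 20, "B": 24}
adjustments = {
    "A": {"rainfall": +3, "temperature": +1, "soil": -12},
    "B": {"rainfall": -2, "temperature": -1, "soil": +3},
}

def adjusted_growth(gardener, factors):
    """Subtract the estimated effect of each chosen factor from raw growth."""
    return raw_growth[gardener] - sum(adjustments[gardener][f] for f in factors)

for factors in (["rainfall", "temperature"], ["rainfall", "temperature", "soil"]):
    a = adjusted_growth("A", factors)
    b = adjusted_growth("B", factors)
    winner = "A" if a > b else "B"
    print(f"adjusting for {factors}: A={a}, B={b} -> Gardener {winner} 'wins'")

# adjusting for ['rainfall', 'temperature']: A=16, B=27 -> Gardener B 'wins'
# adjusting for ['rainfall', 'temperature', 'soil']: A=28, B=24 -> Gardener A 'wins'
```

In a real value-added model, of course, the adjustments are themselves estimated from noisy data rather than handed to us, which only compounds the problem Haertel describes.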

VAMs at the Value-Added Research Center (VARC)

Following up on our last post, which included Professor Haertel’s analysis of the “Oak Tree” video produced and disseminated by the Value-Added Research Center (VARC), affiliated with the Wisconsin Center for Education Research at the University of Wisconsin-Madison, I thought I would share, as also requested by the same VAMboozled! reader, a bit more about VARC and what I know about this organization and its VAM.

Dr. Robert H. Meyer founded VARC in 2004 and currently serves as VARC’s Research Director. Accordingly, VARC’s value-added model is also known as Meyer’s model, just as the EVAAS® is also known as Sanders’s model.

Like with the EVAAS®, VARC has a mission to perform ground-breaking work on value-added systems, as well as to conduct value-added research to evaluate the effectiveness of teachers (and schools/districts) and of educational programs and policies. Unlike with the EVAAS®, however, VARC describes its methods as transparent. There is actually more information about the inner workings of the EVAAS® model on the SAS website and in other publications than there is about the VARC model and its methods, although this is likely due to the relative youth of the VARC model, as VARC is currently in year three of model development and implementation (VARC, 2012c).

Nonetheless, VARC has a “research-based philosophy,” and VARC officials have stated that one of their missions is to publish VARC work in peer-reviewed, academic journals (Meyer, 2012). VARC has ostensibly made publishing in externally reviewed journals a priority, possibly because of the presence of academics within VARC, as well as its affiliation with the University of Wisconsin-Madison. However, very few studies have been published to date about the model and its effectiveness, again likely given its infancy. Instead (like with the EVAAS®), the Center has disproportionately produced and disseminated technical reports, white papers, and presentations, all of which (like with the EVAAS®) also seem to be disseminated for marketing and other informational purposes, including the securing of additional contracts. Unfortunately, a commonality across the two models is that they both seem bent on implementation before validation.

Regardless, VARC defines its methods as “collaborative,” given that VARC researchers have worked with school districts, mainly in Milwaukee and Madison, to better build and situate the value-added model within the realities of districts and schools (VARC, 2012c). As well, VARC defines its value-added model as “fair,” though what this means remains unclear. Otherwise, and again, little is known about the VARC model itself, including its strengths and weaknesses.

But I would bet some serious cash that this model has the same or similar issues as all other VAMs. To review these general but major issues, please click here to (re)read the very first post on VAMboozled! (October 30, 2013).

Otherwise, here are some additional specifics:

  • The VARC model uses generally accepted research methods (e.g., hierarchical linear modeling) to purportedly measure and evaluate the contributions that teachers (and schools/districts) make to student learning and achievement over time (see the simplified sketch following this list).
  • VARC compares individual students to students who are like them by adjusting the statistical models using the student background factors noted below. Unlike the EVAAS®, however, VARC does make modifications for student background variables that are outside of a teacher’s (or school’s/district’s) direct control.
  • VARC’s controls include up to approximately 30 variables, including the standard race, gender, ethnicity, level of poverty, English language proficiency, and special education status variables. VARC also uses other variables when available, including, for example, student attendance, suspension, and retention records and the like. For this and other reasons, and according to Meyer, this helps to make the VARC model “arguably one of the best in the country in terms of attention to detail.”
  • Then (like with the EVAAS®), whether students’ growth scores, aggregated at the teacher (or school/district) level, statistically exceed, meet, or fall below their growth projections (i.e., land above or below one standard deviation from the mean) helps to determine teachers’ (or schools’/districts’) value-added scores and subsequent rankings and categorizations. Again, these are relatively determined, depending on where other teachers (or schools/districts) ultimately land, and they are based on the same assumption that average effectiveness is defined by the teacher (or school/district) population.
  • Like with the EVAAS®, VARC also does this work with publicly subsidized monies, although, in contrast to SAS®, VARC is a non-profit organization.
  • Given my best estimates, VARC is currently operating 25 projects exceeding a combined $28 million (i.e., $28,607,000) in federal (e.g., from the U.S. Department of Education, Institute of Education Sciences, National Science Foundation), private (e.g., from Battelle for Kids, The Joyce Foundation, The Walton Foundation), and state and district funding.
  • VARC is currently contracting with the state departments of education in Minnesota, New York, North Dakota, South Dakota, and Wisconsin. VARC is also contracting with large school districts in Atlanta, Chicago, Dallas, Fort Lauderdale, Los Angeles, Madison, Milwaukee, Minneapolis, New York City, Tampa/St. Petersburg, and Tulsa.
  • Funding for the 25 projects currently in operation ranges from a $30,000 project at the shortest-term, smallest-scale end to a $4.2 million project at the longer-term, larger-scale end.
  • Across the grants that have been funded, regardless of type, the VARC projects currently in operation are funded at an average of $335,000 per year, with an average funding level of just under $1.4 million per grant.
  • It is also evident that VARC is expanding its business rapidly across the nation. In 2004, when the center was first established, VARC was working with fewer than 100,000 students across the country. By 2010 this number had increased 16-fold; VARC was by then working with data from approximately 1.6 million students in total.
  • VARC also delivers its sales pitches in similar ways, although those affiliated with VARC do not seem to overstate their advertising claims quite like those affiliated with the EVAAS® do.
  • Additionally, VARC officials are greatly focused on the use of value-added estimates for data-informed decision-making: “All teachers should [emphasis added] be able to deeply understand and discuss the impact of changes in practice and curriculum for themselves and their students.”
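To give a concrete, if greatly simplified, sense of what an HLM-style, covariate-adjusted value-added model looks like (as referenced in the first bullet above), here is a minimal sketch run on simulated data. To be clear, this is not VARC’s actual model or code: the single prior-score control, the particular background indicators, the random-intercept specification, and the ±1 standard deviation cut points are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate a toy data set: 40 teachers x 25 students each, with a prior-year
# score, two background indicators, and a small built-in "teacher effect."
n_teachers, n_per = 40, 25
n = n_teachers * n_per
teacher = np.repeat(np.arange(n_teachers), n_per)
true_effect = rng.normal(0, 2, n_teachers)[teacher]
prior = rng.normal(500, 50, n)
frl = rng.binomial(1, 0.4, n)   # hypothetical poverty indicator
ell = rng.binomial(1, 0.2, n)   # hypothetical English learner indicator
post = (50 + 0.9 * prior - 8 * frl - 6 * ell
        + true_effect + rng.normal(0, 15, n))

df = pd.DataFrame(dict(teacher=teacher, prior=prior, frl=frl, ell=ell, post=post))

# A simple hierarchical (mixed-effects) model: students nested within teachers,
# current score regressed on prior score plus background controls, with a
# random intercept per teacher standing in for that teacher's "value added."
fit = smf.mixedlm("post ~ prior + frl + ell", data=df, groups=df["teacher"]).fit()

# Pull out the estimated teacher effects (the random intercepts) and sort them
# into categories relative to +/- 1 standard deviation, mirroring the
# "exceeds / meets / falls below expected growth" style of reporting.
effects = pd.Series({t: re.iloc[0] for t, re in fit.random_effects.items()})
sd = effects.std()
labels = pd.cut(effects, bins=[-np.inf, -sd, sd, np.inf],
                labels=["below expected", "as expected", "above expected"])
print(labels.value_counts())
```

Published descriptions suggest VARC’s actual model is far more elaborate (many more covariates, multiple years and subjects, shrinkage of noisy estimates, and so on); the sketch above is only meant to make the bullets less abstract.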

Comparing Oak Trees’ “Apples to Apples,” by Stanford’s Edward Haertel

A VAMboozled! follower posted this comment via Facebook the other day: “I was wondering if you had seen this video by The Value-Added Research Center [VARC], called the “Oak Tree Analogy” [it is the second video down]? My children’s school district has it on their web-site. What are your thoughts about VARC, and the video?”

I have my own thoughts about VARC, and I will share these next, but, better than that, I have somebody else’s much wiser thoughts about this video, which has in many ways gone “viral.”

Professor Edward Haertel, School of Education at Stanford University, wrote Linda Darling-Hammond (Stanford), Jesse Rothstein (Berkeley), and me an email a few years ago about just this video. While I could not find the email he eloquently drafted at the time, I persuaded (aka begged) him to recreate what he wrote, here, for all of you.

You might want to watch the video first to follow along or, at least, to view its contents more critically. You decide, but Professor Haertel writes:

The Value-Added Research Center’s ‘Oak Tree’ analogy is helpful in conveying the theory [emphasis added] behind value-added models. To compare the two gardeners, we adjust away various influences that are out of the gardeners’ control, and then, as with value added, we just assume that whatever is left over must have been due to the gardener.  But, we can draw some important lessons from this analogy in addition to those highlighted in the presentation.

In the illustration, the overall effect of rainfall was an 8-inch difference in annual growth (+3 inches for one gardener’s location; -5 for the other). Effects of soil and temperature, in one direction or the other, were 5 inches and 13 inches. But the estimated effect of the gardeners themselves was only a 4-inch difference. 

As with teaching, the value-added model must sort out a small “signal” from a much larger amount of “noise” in estimating the effects of interest. It follows that the answer obtained may depend critically on just what influences are adjusted for. Why adjust for soil condition? Couldn’t a skillful gardener aerate the soil or amend it with fertilizer? If we adjust only for rainfall and temperature then Gardener B wins. If we add in the soil adjustment, then Gardener A wins. Teasing apart precisely those factors for which teachers justifiably should be held accountable versus those beyond their control may be well-nigh impossible, and if some adjustments are left out, the results will change. 

Another message comes from the focus on oak tree height as the outcome variable.  The savvy gardener might improve the height measure by removing lower limbs to force growth in just one direction, just as the savvy teacher might improve standardized test scores by focusing instruction narrowly on tested content. If there are stakes attached to these gardener comparisons, the oak trees may suffer.

The oak tree height analogy also highlights another point. Think about the problem of measuring the exact height of a tree—not a little sketch on a PowerPoint slide, but a real tree. How confidently could you say how tall it was to the nearest inch?  Where, exactly, would you put your tape measure? Would you measure to the topmost branch, the topmost twig, or the topmost leaf? On a sunny day, or at a time when the leaves and branches were heavy with rain?

The oak tree analogy does not discuss measurement error. But one of the most profound limitations of value-added models, when used for individual decision making, is their degree of error, referred to technically as low reliability. Simply put, if we compare the same two gardeners again next year, it’s anyone’s guess which of the two will come out ahead.
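As a small addendum to Professor Haertel’s closing point about low reliability, here is a minimal simulation using assumed, illustrative numbers (they do not come from any real evaluation system). When the noise around each year’s estimate is large relative to the true difference between the two gardeners, the “winner” flips in a large share of years.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assume Gardener A's true effect is slightly larger than Gardener B's, but
# that each year's estimate carries substantial error. These values are
# illustrative assumptions, not estimates from any real evaluation system.
true_a, true_b = 2.0, 1.0   # "true" effects, in inches of attributable growth
error_sd = 4.0              # noise in each single-year estimate
n_years = 10_000            # simulated replications of the annual comparison

est_a = true_a + rng.normal(0, error_sd, n_years)
est_b = true_b + rng.normal(0, error_sd, n_years)

# How often does the truly weaker gardener come out ahead in a given year?
print(f"B beats A in about {(est_b > est_a).mean():.0%} of simulated years")   # roughly 43%

# How often does the "winner" in one year match the winner in the next?
winners = est_a > est_b
print(f"consecutive years agree about {(winners[:-1] == winners[1:]).mean():.0%} of the time")
```

With less noise relative to the true difference the rankings would stabilize; the trouble, as Professor Haertel notes, is that these models must pull a small signal out of a much larger amount of noise.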

Thanks are very much in order, Professor Haertel, for having “added value” to the conversations surrounding these issues and for helping us collectively understand the not-so-simple theory advanced via this video.