Is Combining Different Tests to Measure Value-Added Valid?

A few days ago, on Diane Ravitch’s blog, a person posted the following comment, that Diane sent to me for a response:

“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”

Here’s what I wrote in response, that I’m sharing here with you all as this too is becoming increasingly common across the country; hence, it is increasingly becoming a question of high priority and interest:

“We have three issues here when equating different tests SIMPLY because these tests test the same students on the same things around the same time.

First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers impacts on student learning? NO. The ASA statement captures decades of research on this point.

Second, doing this ASSUMES that all standardized tests are vertically scaled whereas scales increase linearly as students progress through different grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar,” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more that statistical wizardry without really any theoretical or otherwise foundation in support.

Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates WIDELY range across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”

See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.

Can Today’s Tests Yield Instructionally Useful Data?

The answer is no, or at best not yet.

Some heavy hitters in the academy just released an article that might be of interest to you all. In the article the authors discuss whether “today’s standardized achievement tests [actually] yield instructionally useful data.”

The authors include W. James Popham, Professor Emeritus from the University of California, Los Angeles; David Berliner, Regents’ Professor Emeritus at Arizona State University; Neal Kingston, Professor at the University of Kansas; Susan Fuhrman, current President of Teachers College, Columbia University; Steven Ladd, Superintendent of Elk Grove Unified School District in California; Jeffrey Charbonneau, National Board Certified Teacher in Washington and the 2013 US National Teacher of the Year; and Madhabi Chatterji, Associate Professor at Teachers College, Columbia University.

These authors explored some of the challenges and promises in terms of using and designing standardized achievement tests and other educational tests that are “instructionally useful.” This was the focus of a recent post about whether Pearson’s tests are “instructionally sensitive” and what University of Texas – Austin’s Associate Professor Walter Stroup versus Pearson’s Senior Vice President had to say on this topic.

In this study, authors deliberate more specifically the consequences of using inappropriately designed tests for decision-making purposes, particularly when tests are insensitive to instruction. Here, the authors underscore serious issues related to validity, ethics, and consequences, all of which they use and appropriately elevate to speak out, particularly against the use of current, large-scale standardized achievement tests for evaluating teachers and schools.

The authors also make recommendations for local policy contexts, offering recommendations to support (1) the design of more instructionally sensitive large-scale tests as well as (2) the design of other smaller scale tests that can also be more instructionally sensitive, and just better. These include but are not limited to classroom tests as typically created, controlled, and managed by teachers, as well as district tests as sometimes created, controlled, and managed by district administrators.

Such tests might help to create more but also better comprehensive educational evaluation systems, the authors ultimately argue. Although this, of course, would require more professional development to help teachers (and others, including district personnel) develop more instructionally sensitive, and accordingly useful tests. As they also note, this would also require that “validation studies…be undertaken to ensure validity in interpretations of results within the larger accountability policy context where schools and teachers are evaluated.”

This is especially important if tests are to be used for low and high-stakes decision-making purposes. Yet this is something that is way too often forgotten when it comes to test use, and in particular test abuse. All should really take heed here.

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week of the Texas Observer wrote an article, titled “Mute the Messenger,” about University of Texas – Austin’s Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests as supported by Pearson were flawed, as per their purposes to measure teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. This article was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert – Professor Emeritus W. James Popham) a test is able to differentiate between well taught and poorly taught students, versus able to differentiate between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school that students bring with them to the schoolhouse door).

Testing developers like Pearson seem to focus on the prior, that their tests are indeed sensitive to instruction. While testing/measurement academics and especially practitioners seem to focus on the latter, that tests are sensitive to instruction, but such tests are not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert – Regents Professor David Berliner) sensitive to instruction but more importantly sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. Here seems to be where this, now very heated and polarized argument between Pearson and Associate Professor Stroup now stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT) while defending their tests as “instructionally sensitive.” IRT is used to examine things like p-values (or essentially proportions of students who respond to items correctly) and item-discrimination indices (to see if test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] certain things otherwise). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” While from what I know about Dr. Stroup’s actual claims, this trivializes his overall arguments. Tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g, as per items’ aforementioned p-values). This alone makes tests less “instructionally insensitive” in practiceIn other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if using IRT deems an item as “too easy,” even if the item is taught well (i.e., “instructionally senstive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now more than before held accountable for teaching these items), and this makes the test (2) more “instructionally sensitive” in the eyes of test developers in that when fewer students get test items correct the better the items are when descriminating between those who know (or are taught) certain things and students who don’t know (or are not taught) certain things otherwise.

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., “instructionally sensitive”) “too many” students answer such items correctly. Sometimes these items yield p-values greater than p=0.80 or 80% correct.

When you need a test and its outcome score data to fit around the bell curve, you cannot have such, or too many of such items, on the final test. In the simplest of terms, for every item with a p-vale of 80% you would need another with a p-value of 20% to balance items out, or keep the overall mean of each test around p=0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around such a mean, otherwise the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) certain things otherwise. Such items (i.e., with high p-values) do not always distribute scores well enough because “too many students” answering such items correct reduces the variation (or spread of scores) needed.

The counter-item in this case is another item also meant to capture statistics and/or data display, but that is much more difficult, largely because it’s rarely taught because it rarely matters in the real world. Take, for example, the box and whisker plot. If you don’t know what this is, which is in and of itself telling in this example, see them described and illustrated here. Often, this item IS found on such tests because this item IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) certain things otherwise.

Because this item is not as often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test“), and because this item doesn’t really matter in the real world, it becomes an item that is more useful for the test, as well as the overall functioning of the test, than it is an item that is useful for the students tested on it.

A side bar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of the resulting doctoral students’ scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned the best as per our IRT analysis? The box and whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many other items similar, had a lower relative p-value, and accordingly helped to increase the difficulty of the test and discriminate results to assert a purported “insructional sensitivity,” regardless of whether the item was actually valued, and more importantly valued in instruction.

Thanks to IRT, the items often left on such tests are not often the items taught by teachers, or perhaps taught by teachers well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes – the tests are aligned with the standards as such items are in the standards, but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.

David Berliner’s “Thought Experiment”

My main mentor, David Berliner (Regents Professor at Arizona State University) wrote a “Thought Experiment” that Diane Ravitch posted on her blog yesterday. I have pasted the full contents here for those of you who may have missed it. Do take a read, and play along and see if you can predict which state will yield higher test performance in the end.


Let’s do a thought experiment. I will slowly parcel out data about two different states. Eventually, when you are nearly 100% certain of your choice, I want you to choose between them by identifying the state in which an average child is likely to be achieving better in school. But you have to be nearly 100% certain that you can make that choice.

To check the accuracy of your choice I will use the National Assessment of Educational Progress (NAEP) as the measure of school achievement. It is considered by experts to be the best indicator we have to determine how children in our nation are doing in reading and mathematics, and both states take this test.

Let’s start. In State A the percent of three and four year old children attending a state associated prekindergarten is 8.8% while in State B the percent is 1.7%. With these data think about where students might be doing better in 4th and 8th grade, the grades NAEP evaluates student progress in all our states. I imagine that most people will hold onto this information about preschool for a while and not yet want to choose one state over the other. A cautious person might rightly say it is too soon to make such a prediction based on a difference of this size, on a variable that has modest, though real effects on later school success.

So let me add more information to consider. In State A the percent of children living in poverty is 14% while in State B the percent is 24%. Got a prediction yet? See a trend? How about this related statistic: In State A the percent of households with food insecurity is 11.4% while in State B the percent is 14.9%. I also can inform you also that in State A the percent of people without health insurance is 3.8% while in State B the percent is 17.7%. Are you getting the picture? Are you ready to pick one state over another in terms of the likelihood that one state has its average student scoring higher on the NAEP achievement tests than the other?

​If you still say that this is not enough data to make yourself almost 100% sure of your pick, let me add more to help you. In State A the per capita personal income is $54,687 while in state B the per capita personal income is $35,979. Since per capita personal income in the country is now at about $42,693, we see that state A is considerably above the national average and State B is considerably below the national average. Still not ready to choose a state where kids might be doing better in school?

Alright, if you are still cautious in expressing your opinions, here is some more to think about. In State A the per capita spending on education is $2,764 while in State B the per capita spending on education is $2,095, about 25% less. Enough? Ready to choose now?
Maybe you should also examine some statistics related to the expenditure data, namely, that the pupil/teacher ratio (not the class sizes) in State A is 14.5 to one, while in State B it is 19.8 to one.

As you might now suspect, class size differences also occur in the two states. At the elementary and the secondary level, respectively, the class sizes for State A average 18.7 and 20.6. For State B those class sizes at elementary and secondary are 23.5 and 25.6, respectively. State B, therefore, averages at least 20% higher in the number of students per classroom. Ready now to pick the higher achieving state with near 100% certainty? If not, maybe a little more data will make you as sure as I am of my prediction.

​In State A the percent of those who are 25 years of age or older with bachelors degrees is 38.7% while in State B that percent is 26.4%. Furthermore, the two states have just about the same size population. But State A has 370 public libraries and State B has 89.
Let me try to tip the data scales for what I imagine are only a few people who are reluctant to make a prediction. The percent of teachers with Master degrees is 62% in State A and 41.6% in State B. And, the average public school teacher salary in the time period 2010-2012 was $72,000 in State A and $46,358 in State B. Moreover, during the time period from the academic year 1999-2000 to the academic year 2011-2012 the percent change in average teacher salaries in the public schools was +15% in State A. Over that same time period, in State B public school teacher salaries dropped -1.8%.

I will assume by now we almost all have reached the opinion that children in state A are far more likely to perform better on the NAEP tests than will children in State B. Everything we know about the ways we structure the societies we live in, and how those structures affect school achievement, suggests that State A will have higher achieving students. In addition, I will further assume that if you don’t think that State A is more likely to have higher performing students than State B you are a really difficult and very peculiar person. You should seek help!

So, for the majority of us, it should come as no surprise that in the 2013 data set on the 4th grade NAEP mathematics test State A was the highest performing state in the nation (tied with two others). And it had 16 percent of its children scoring at the Advanced level—the highest level of mathematics achievement. State B’s score was behind 32 other states, and it had only 7% of its students scoring at the Advanced level. The two states were even further apart on the 8th grade mathematics test, with State A the highest scoring state in the nation, by far, and with State B lagging behind 35 other states.

Similarly, it now should come as no surprise that State A was number 1 in the nation in the 4th grade reading test, although tied with 2 others. State A also had 14% of its students scoring at the advanced level, the highest rate in the nation. Students in State B scored behind 44 other states and only 5% of its students scored at the Advanced level. The 8th grade reading data was the same: State A walloped State B!

States A and B really exist. State B is my home state of Arizona, which obviously cares not to have its children achieve as well as do those in state A. It’s poor achievement is by design. Proof of that is not hard to find. We just learned that 6000 phone calls reporting child abuse to the state were uninvestigated. Ignored and buried! Such callous disregard for the safety of our children can only occur in an environment that fosters, and then condones a lack of concern for the children of the Arizona, perhaps because they are often poor and often minorities. Arizona, given the data we have, apparently does not choose to take care of its children. The agency with the express directive of insuring the welfare of children may need 350 more investigators of child abuse. But the governor and the majority of our legislature is currently against increased funding for that agency.

State A, where kids do a lot better, is Massachusetts. It is generally a progressive state in politics. To me, Massachusetts, with all its warts, resembles Northern European countries like Sweden, Finland, and Denmark more than it does states like Alabama, Mississippi or Arizona. According to UNESCO data and epidemiological studies it is the progressive societies like those in Northern Europe and Massachusetts that care much better for their children. On average, in comparisons with other wealthy nations, the U. S. turns out not to take good care of its children. With few exceptions, our politicians appear less likely to kiss our babies and more likely to hang out with individuals and corporations that won’t pay the taxes needed to care for our children, thereby insuring that our schools will not function well.

But enough political commentary: Here is the most important part of this thought experiment for those who care about education. Everyone of you who predicted that Massachusetts would out perform Arizona did so without knowing anything about the unions’ roles in the two states, the curriculum used by the schools, the quality of the instruction, the quality of the leadership of the schools, and so forth. You made your prediction about achievement without recourse to any of the variables the anti-public school forces love to shout about –incompetent teachers, a dumbed down curriculum, coddling of students, not enough discipline, not enough homework, and so forth. From a few variables about life in two different states you were able to predict differences in student achievement test scores quite accurately.

I believe it is time for the President, the Secretary of Education, and many in the press to get off the backs of educators and focus their anger on those who will not support societies in which families and children can flourish. Massachusetts still has many problems to face and overcome—but they are nowhere as severe as those in my home state and a dozen other states that will not support programs for neighborhoods, families, and children to thrive.

This little thought experiment also suggests also that a caution for Massachusetts is in order. It seems to me that despite all their bragging about their fine performance on international tests and NAEP tests, it’s not likely that Massachusetts’ teachers, or their curriculum, or their assessments are the basis of their outstanding achievements in reading and mathematics. It is much more likely that Massachusetts is a high performing state because it has chosen to take better care of its citizens than do those of us living in other states. The roots of high achievement on standardized tests is less likely to be found in the classrooms of Massachusetts and more likely to be discovered in its neighborhoods and families, a refection of the prevailing economic health of the community served by the schools of that state.