A VAMboozled! follower, Dan Wright, who is also a statistician at ACT (the nonprofit company best known for developing the ACT college-entrance test), emailed me a few weeks ago with his informed, statistical take on VAMs. I invited him to write a post for you all here. His response, with my appreciation, follows:
“I am part cognitive scientist, part statistician, and I work for ACT, Inc. About a year ago I was asked whether value-added models (VAMs) provide good estimates of teacher effectiveness. I found lots of papers that examined whether different statistical models gave similar answers, and lots showing that certain groups tended to have lower scores, but all of these papers seemed to bypass the question I was interested in: do VAMs accurately estimate teacher effectiveness? I began looking at this, and Audrey invited me to write a guest blog post describing what I found.
The difficulty is that in the real world we don’t really know how effective a teacher is in “objective” terms, so we do not have much with which to compare our VAM estimates. Hence, a popular way for statisticians to decide whether an estimation method is good or bad is to simulate data using particular known values, and then see whether the method produces estimates that are close to those values.
The trick is how to create the simulation data to do this. It is necessary to have some model for how the data arise. Hence, I will use a simple model to demonstrate the problem. It is usually best to start with simple models, make sure the statistical procedures work, and then progress to more complex models. On that note, I have a paper under review that goes into more detail, with other models and more simulation studies. Email me if you want a copy.
Anyhow, the model I used to create the data for this simulation starts with three variables. The first encapsulates everything about the students that is unique to them, including ability, effort, grit, and even the home environment (called AB in the code below). The second encapsulates all the neighborhood and environmental factors that can influence everything from financial spending in schools to which teachers are hired (called NE). This value is unique to each teacher (so it is simpler than real data where teachers are nested in schools). The third is teacher effectiveness (called TE). I created it from the neighborhood variable plus some random variation. These three variables would be unmeasured in a real data set (of course, some elements of them may be measured, but not all elements of them), but in a computer simulation their values are known.
In addition, there are two sets of test scores. The first set is from before the student has encountered the teacher (called PRE). These scores are created by adding the student ability variable, the neighborhood variable, and some random variation. The second set is from after the student has encountered the teacher (called POST). These scores are created by adding the first set of test scores, the student ability variable, one-fifth of the teacher effectiveness variable (less than the other effects, since the impact of a single teacher is usually estimated at about 10% or less of student achievement), and some random variation.
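Written out as equations, the description above amounts to roughly the following sketch. The subscripts (student i, teacher j) and the noise terms ε are notation introduced here for illustration, not taken from the original code, and the unit weights simply follow the word “adding” in the text; the actual simulation may scale each term differently.

```latex
\[
\begin{aligned}
TE_j      &= NE_j + \varepsilon^{TE}_j \\
PRE_{ij}  &= AB_i + NE_j + \varepsilon^{PRE}_{ij} \\
POST_{ij} &= PRE_{ij} + AB_i + \tfrac{1}{5}\,TE_j + \varepsilon^{POST}_{ij}
\end{aligned}
\]
```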
Again, however, this model is simpler than real educational data. Accordingly, there are no complications like missing values, teachers nested within schools, etc. Also, I will use a very simple VAM, just using the first set of scores to predict the second set of scores, but allowing for random variation by teachers and by students. Given the importance placed on the results of VAMs in many different countries and in many industries (not just in education), this method should work…Right?
Theories about causation actually suggest that there will be a problem. I won’t go into the details about this (again, email me for my paper if you want more information), but using the first set of scores in the VAM allows information to flow between the teacher effectiveness variable and the second set of test scores through the other unmeasured variables. This messes up trying to measure the effect of teachers on the final scores.
But let’s see whether a simulation confirms that this is a problem. The simulation used the freeware R and the code is below so that you can repeat it, if you want, or change the values. Download R onto your computer (follow the instructions on http://cran.us.r-project.org/), open it, copy the code, and paste it into R. The # sign means R ignores the remainder of the line, so I have used that to make comments. If you think another set of numbers is more realistic, put those in. Or construct the data in some other way. One nice thing about simulations is that you can keep trying all sorts of things.
For the values used, the correlation between the teachers’ true effectiveness and their estimated effectiveness is -.13. A negative correlation means the teachers who are more effective tend to have lower estimated teacher effectiveness scores. That’s not good. That’s bad! Don’t get hung up on this specific number, though, as it moves around depending on how the data are constructed. The correlation can go up (e.g., change any one, but not two, of the first four values to -1) and down (e.g., increase the size of AB2POST to 2).
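For readers who want to experiment without R, here is a minimal Python sketch of this kind of simulation. It is not the R code referred to above: the coefficient names other than AB2POST, the unit default weights, the class sizes, and the seed are all assumptions, and the VAM with random variation by teacher is replaced by a simpler stand-in (regress POST on PRE, then average each teacher's residuals). Its output should therefore not be expected to reproduce the -.13 reported above.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_teachers=100, n_students=25, ne2te=1.0, ab2pre=1.0,
             ne2pre=1.0, pre2post=1.0, ab2post=1.0, te2post=0.2, seed=1):
    """Generate true teacher effects and (teacher, PRE, POST) rows.

    All weights are assumed values; only the 0.2 on TE (one-fifth)
    and the name AB2POST come from the post itself.
    """
    rng = random.Random(seed)
    te, rows = [], []
    for t in range(n_teachers):
        ne = rng.gauss(0, 1)                    # neighborhood, one per teacher
        te_t = ne2te * ne + rng.gauss(0, 1)     # true teacher effectiveness
        te.append(te_t)
        for _ in range(n_students):
            ab = rng.gauss(0, 1)                # everything unique to the student
            pre = ab2pre * ab + ne2pre * ne + rng.gauss(0, 1)
            post = (pre2post * pre + ab2post * ab
                    + te2post * te_t + rng.gauss(0, 1))
            rows.append((t, pre, post))
    return te, rows


def estimate_teacher_effects(rows, n_teachers):
    """Stand-in VAM: pooled OLS of POST on PRE, then mean residual per teacher."""
    pre = [r[1] for r in rows]
    post = [r[2] for r in rows]
    n = len(rows)
    m_pre, m_post = sum(pre) / n, sum(post) / n
    slope = (sum((p - m_pre) * (q - m_post) for p, q in zip(pre, post))
             / sum((p - m_pre) ** 2 for p in pre))
    intercept = m_post - slope * m_pre
    sums = [0.0] * n_teachers
    counts = [0] * n_teachers
    for t, p, q in rows:
        sums[t] += q - (intercept + slope * p)  # residual for this student
        counts[t] += 1
    return [s / c for s, c in zip(sums, counts)]


te, rows = simulate()
est = estimate_teacher_effects(rows, len(te))
r = pearson(te, est)
print("correlation between true and estimated effectiveness:", round(r, 2))
```

With these assumed weights the printed correlation tends to sit close to zero, far from the value near 1 that an accurate method would give. Passing a different `ab2post`, or flipping the sign of one of the other weights, is a quick way to see how much the result depends on how the data are constructed.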
Given this, there is a conclusion, as well as a recommendation. The conclusion is that value-added estimates can be very inaccurate under what seem to be highly commonsensical and plausible models for how the data could arise, and where they go wrong is predicted by theories of causation. The recommendation is that those promoting and those critical of VAMs should write down plausible models for how they think the data in which they are interested arose, and then see whether the statistical procedures used perform well.”
*The personal opinions expressed do not necessarily represent the views and opinions of ACT, Inc.