“A Concerned New Mexico Parent” wrote in the following. Be sure to follow along, as this parent demonstrates some of the very real issues with the data that are being used to calculate VAMs, given their missingness, incompleteness, arbitrariness, and the like. VAMs “spit out” mathematical estimates that, because they are based on advanced statistics, are to be trusted, yet the resultant “estimates” mask all of the chaos (demonstrated below) behind the sophisticated statistical scene.
As a parent, I have been concerned with New Mexico’s implementation of the trifecta of bad education policies – stack-ranking of teachers, the Common Core curriculum, and the use of value-added models (VAMs) for teacher evaluation.
The Vamboozled blog has done a great job in educating teachers and the public about the pitfalls of the VAM approach to teacher evaluations. The recent blog post by an Arizona teacher about her “value-added” contribution caused me to investigate the issue closer to home.
Hanna Skandera (the acting head of the New Mexico Public Education Department and a Jeb Bush protégé) recently posted information on how data are to be handled for the New Mexico VAM system. Click here to review Skandera’s post.
With these two internet postings in mind, I decided to investigate the quality of data available for evaluating teachers at my local school.
According to researchers (and basic common sense), the more data you have, the better. According to the VAM literature, everyone (critics and proponents alike) seems to agree that at least three years of test scores are the minimum necessary for a legitimate [valid] calculation. We should have that much data per teacher as well.
As you will soon see, even this seemingly simple requirement is often not met in a real life school.
To calculate any VAM-type score, data are needed. Specifically, the underlying student test scores and teacher data on which everything else depends are crucial.
One type of “problem data” are data that scramble several people together into one number. For example, if you and I team-teach a class of students, it is not possible to tell how much I contributed versus how much you contributed to the final student score [regardless of what the statisticians say they can do in terms of fractional effects]. Anyone who has ever worked on a team project knows that not everyone contributes equally. Any division of score is purely arbitrary (aka “made up”) and indefensible. This is the situation referenced by the Arizona teacher who loses her students to math tutors for months at a time.
A second type of “problem data” are data that change kind mid-stream. Suppose we have a teacher who teaches 3rd grade one year and is then switched to 6th grade the next. No one who has taught would ever believe that teaching 3rd graders is the same as teaching 6th graders. Just mentally imagine using a 3rd grade approach with 6th grade boys, and I think you can see the problems. So, any teacher who switches grades is likely to have questionable data if all of her scores are considered the same.
A third type of “problem data” are not so much flawed as simply absent. Teachers who leave the school, or who are missing data for certain students [which occurs in non-random patterns], often do not have the data needed to support accurate VAM calculations.
A final type of “problem data” are data that are limited by too few observations. A teacher who has exactly one year of experience would have only one set of test scores, which is too few for any meaningful calculation. The New Mexico approach, as explained in the Skandera posting, is then to “fill in” the data with surrogate [observational] data.
If one is presented with two observations – why not just use the surrogate data only? We already have teacher observations and evaluations without the added expense of VAM calculations with specialized software. And how does using these less precise data help a VAM become valid? It would be like the difference between taking your child’s temperature with a precise in-ear thermometer and simply putting the back of your hand against their cheek. Both measurements can probably tell you whether your child is sick, but they are not equally accurate.
Regardless, it appears that if we want to have a good statistical calculation, we need to make sure we have the following:
- At least three years of teaching data.
- No team teaching or sharing of classes or students.
- No grade changing (the most recent three years should be at the same grade).
- Data must include the current year.
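The four requirements above can be written down as a simple check. What follows is only a rough sketch under stated assumptions: the record layout (`TeacherRecord`), the function name `failed_checks`, and the convention of labeling a school year by its ending year (2014 for 2013-2014) are all hypothetical and are not part of the New Mexico system or of any VAM software.

```python
from dataclasses import dataclass

CURRENT_YEAR = 2014  # assumption: school year 2013-2014, labeled by its ending year
MIN_YEARS = 3        # the three-year minimum from the list above


@dataclass
class TeacherRecord:
    """Hypothetical record of one teacher's tested-grade assignments."""
    name: str
    grades_by_year: dict       # ending year -> tested grade taught that year
    team_taught: bool = False  # True if classes or students were shared


def failed_checks(rec, current_year=CURRENT_YEAR, min_years=MIN_YEARS):
    """Return the list of requirements this record fails (empty if none)."""
    problems = []
    recent_years = range(current_year - min_years + 1, current_year + 1)
    recent_grades = [rec.grades_by_year.get(year) for year in recent_years]

    # Data must include the current year.
    if current_year not in rec.grades_by_year:
        problems.append("data not current")
    # At least three years of data, taken here as the three most recent years.
    if any(grade is None for grade in recent_grades):
        problems.append("fewer than three recent years")
    # No grade changing within the most recent three years.
    if len({grade for grade in recent_grades if grade is not None}) > 1:
        problems.append("grade change")
    # No team teaching or sharing of classes or students.
    if rec.team_taught:
        problems.append("team teaching")
    return problems
```

For example, a made-up teacher with `grades_by_year={2012: 6, 2013: 6, 2014: 6}` and no team teaching passes all four checks, while one who taught 3rd grade in 2013 and 6th grade in 2014 fails both the three-year and the grade-change checks.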
With these four very modest assumptions for ensuring at least minimally good data, how do we fare in real life?
I decided to chart the information of my local school, in light of the VAM data requirements, and I was shocked by what I found.
The results of real world teacher churn are shown below. Each line shows a teacher, their grade-level, the number of years teaching at that grade level, and their data status for a VAM calculation in school-year 2013-2014.
The chart includes all teachers from one school in grades 3 through 6. These are the only grades that currently take the New Mexico state-wide standardized test. The data are for the time period Aug 2010 to Feb 2014.
The “Teacher” and “Grade” columns are self-explanatory. The “Years Active” column shows the dates when the teacher taught at the school. The “Years at Same Grade Level” column shows the number of years a teacher has taught at a consistent grade level.
The “Data Status” column explains briefly why the teachers’ data may be invalid. More specifically, in the “Data Status” column, “Team teacher” means that the teacher shares duties with another teacher for the same set of students; “No longer teaching” means the teacher may have taught 3rd-6th grades in the past but is not currently teaching any of these grades; “Insufficient data” means the teacher has taught fewer than three years and does not have sufficient data for valid statistical calculations; “Data not current” means the teacher has taught 3rd-6th grade but not in the year of the VAM calculation; “Grade change” means the teacher changed grade levels sometime during the 2010-2014 school years; and “Valid data” means the data for the teacher appear to be reliable.
As you might guess, all of the teachers’ names have been changed; all of the other data are correct.
Table of Teacher Data Quality
As demonstrated in the Table above, 28 teachers taught 65 classes of 3rd through 6th grade. Note that two teachers (Sharrow and Franzoni) taught several grades during this time period. Remember, these are the only grades (3rd – 6th) that are given the New Mexico standardized test.
Once we exclude questionable data, we are reduced to exactly one (1/28 = 3.6%) solitary teacher (Condron) who taught three classes of 6th graders during the last four years for whom VAM data were available. All of the other data required for the VAM calculation would be provided by surrogate data.
According to the Skandera posting, the missing data would be replaced with the much more subjective and less reliable observational data. It is unclear how the “team teacher” data would be assessed. There is no scientific means to disaggregate the results for the two teachers. Any assignment of “value-added” for team teachers would be [almost] completely arbitrary [as typically based on teacher self-reported proportions of time taught].
Thus, we can see that for the 28 teachers listed in the above table, the calculations would then be based on the following data substitutions:
Table of Surrogate Data Substitutions
Again, only one teacher, as demonstrated, has valid data. I can guarantee that these major problems with the integrity or quality of these data were NOT publicized when the teacher VAM scores were released. Teachers were still held responsible, regardless.
Remember that for all of the other teachers on campus (K-2, computer, fine arts, PE, etc.), their VAM scores would be derived from the campus average. This average, as well, would be contaminated by the very problematic data revealed in these tables.
Does this approach of calculating VAM scores with such questionable and shaky data seem fair or equitable to the teachers or to this school? I believe not.