“A Concerned New Mexico Parent” wrote in with the following. Be sure to follow along, as this parent demonstrates some of the very real issues with the data that are being used to calculate VAMs, given their missingness, incompleteness, arbitrariness, and the like. VAMs “spit out” mathematical estimates that are supposedly to be trusted because they are based on advanced statistics, yet the resultant “estimates” mask all of the chaos (demonstrated below) behind the sophisticated statistical scene.
(S)he writes:
As a parent, I have been concerned with New Mexico’s implementation of the trifecta of bad education policies: stack-ranking of teachers, the Common Core curriculum, and the use of value-added models (VAMs) for teacher evaluation.
The Vamboozled blog has done a great job in educating teachers and the public about the pitfalls of the VAM approach to teacher evaluations. The recent blog post by an Arizona teacher about her “value-added” contribution caused me to investigate the issue closer to home.
Hanna Skandera (the acting head of the New Mexico Public Education Department and a Jeb Bush protégé) recently posted information on how data are to be handled for the New Mexico VAM system. Click here to review Skandera’s post.
With these two internet postings in mind, I decided to investigate the quality of data available for evaluating teachers at my local school.
According to researchers (and basic common sense), the more data you have, the better. According to the VAM literature, everyone (critics and proponents alike) seems to agree that at least three years of test scores are the minimum necessary for a legitimate [valid] calculation. We should have that much data per teacher as well.
As you will soon see, even this seemingly simple requirement is often not met in a real-life school.
To calculate any VAM-type score, data are needed. Specifically, everything else depends on the underlying student test scores and teacher data.
One type of “problem data” are data that scramble several people together into one number. For example, if you and I team-teach a class of students, it is not possible to tell how much I contributed versus how much you contributed to the final student score [regardless of what the statisticians say they can do in terms of fractional effects]. Anyone who has ever worked on a team project knows that not everyone contributes equally. Any division of the score is purely arbitrary (aka “made up”) and indefensible. This is the situation referenced by the Arizona teacher who loses her students to math tutors for months at a time.
A second type of “problem data” are data that change kind midstream. Suppose we have a teacher who teaches 3rd grade one year and is then switched to 6th grade the next. No one who has taught would ever believe that teaching 3rd graders is the same as teaching 6th graders. Just imagine using a 3rd grade approach with 6th grade boys, and I think you can see the problems. So, any teacher who switches grades is likely to have questionable data if all of her scores are treated as the same.
A third type of “problem data” are not really bad data at all but are problematic because information is absent. Teachers who leave the school, or who are missing data for certain students [which occurs in nonrandom patterns], often do not have the data needed to support accurate VAM calculations.
A final type of “problem data” are data that are limited by too few observations. A teacher who has exactly one year of experience would have only one set of test scores, and this is too few for any meaningful calculation. The New Mexico approach, as explained in the Skandera posting, is to “fill in” the data with surrogate [observational] data.
If one is presented with two observations, why not just use the surrogate data only? We already have teacher observations and evaluations without the added expense of VAM calculations with specialized software. And how does using these less precise data help a VAM become valid? It would be like the difference between taking your child’s temperature with a precise in-ear thermometer and simply putting the back of your hand against their cheek. Both measurements can probably tell you whether your child is sick, but the two are not equally accurate.
Regardless, it appears that if we want to have a good statistical calculation, we need to make sure we have the following:
 At least three years of teaching data.
 No team teaching or sharing of classes or students.
 No grade changing (the most recent three years should be at the same grade).
 Data must include the current year.
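The four requirements above can be written down as a simple check. The following is only a minimal sketch in Python: the `TeacherRecord` fields and function names are hypothetical, invented here for illustration, and do not reflect how the state actually stores or screens its data.

```python
from dataclasses import dataclass

MIN_YEARS = 3  # the three-year minimum discussed above


@dataclass
class TeacherRecord:
    """Hypothetical record; field names are assumptions for illustration."""
    name: str
    years_at_current_grade: int  # consecutive years taught at the same grade
    team_teaches: bool           # shares a class or students with another teacher
    changed_grade: bool          # switched grade levels within the window
    taught_current_year: bool    # has test data for the year being evaluated


def failed_requirements(rec: TeacherRecord) -> list[str]:
    """Return the requirements (of the four listed above) that a record fails."""
    failures = []
    if rec.years_at_current_grade < MIN_YEARS:
        failures.append("fewer than three years of data")
    if rec.team_teaches:
        failures.append("team teaching")
    if rec.changed_grade:
        failures.append("grade change")
    if not rec.taught_current_year:
        failures.append("data not current")
    return failures


def has_usable_data(rec: TeacherRecord) -> bool:
    """A record is usable only if it fails none of the four requirements."""
    return not failed_requirements(rec)


# Two made-up records, not real teachers:
steady = TeacherRecord("Teacher A", 3, False, False, True)
new_team = TeacherRecord("Teacher B", 1, True, False, True)
print(has_usable_data(steady))        # True
print(failed_requirements(new_team))  # two failures: too few years, team teaching
```

Returning the list of failed requirements, rather than a bare yes or no, keeps the reason for each exclusion visible.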
With these four very modest assumptions for ensuring at least minimally good data, how do we fare in real life?
I decided to chart the information of my local school, in light of the VAM data requirements, and I was shocked by what I found.
The results of real-world teacher churn are shown below. Each line shows a teacher, their grade level, the number of years teaching at that grade level, and their data status for a VAM calculation in school year 2013-2014.
The chart includes all teachers from one school in grades 3 through 6. These are the only grades that currently take the New Mexico statewide standardized test. The data are for the time period Aug 2010 to Feb 2014.
The “Teacher” and “Grade” columns are self-explanatory. The “Years Active” column shows the dates when the teacher taught at the school. The “Years at Same Grade Level” column shows the number of years a teacher has taught at a consistent grade level.
The “Data Status” column explains briefly why the teachers’ data may be invalid. More specifically, in the “Data Status” column, “Team teacher” means that the teacher shares duties with another teacher for the same set of students; “No longer teaching” means the teacher may have taught 3rd-6th grades in the past but is not currently teaching any of these grades; “Insufficient data” means the teacher has taught fewer than three years and does not have sufficient data for valid statistical calculations; “Data not current” means the teacher has taught 3rd-6th grade but not in the year of the VAM calculation; “Grade change” means the teacher changed grade levels sometime during the 2010-2014 school years; and “Valid data” means the data for the teacher appear to be reliable.
As you might guess, all of the teachers’ names have been changed; all of the other data are correct.
Table of Teacher Data Quality

| Teacher (n=28) | Grade | Years Active | Years at Same Grade Level | Data Status |
|---|---|---|---|---|
| Govan | 3 | 2010-2014 | 4 | Team teacher |
| Grubb | 3 | 2010-2014 | 4 | Team teacher |
| Durling | 3 | 2010-2014 | 4 | Team teacher |
| Jen | 3 | 2010-2012 | 2 | No longer teaching 3rd-6th |
| Bohanon | 3 | 2012-2014 | 2 | Insufficient data |
| Mcanulty | 3 | 2013-2014 | 1 | Insufficient data |
| Saum | 4 | 2010-2012 | 2 | No longer teaching 3rd-6th |
| Wirtz | 4 | 2010-2013 | 3 | No longer teaching 3rd-6th |
| Mccaslin | 4 | 2010-2011 | 1 | No longer teaching 3rd-6th |
| Finamore | 4 | 2010-2012 | 2 | No longer teaching 3rd-6th |
| Sharrow | 4 | 2011-2012 | 1 | Grade change |
| Sharrow | 5 | 2012-2014 | 2 | Insufficient data |
| Kime | 4 | 2012-2014 | 2 | Insufficient data |
| Blish | 4 | 2012-2014 | 2 | Insufficient data |
| Obregon | 4 | 2012-2014 | 2 | Insufficient data |
| Fraise | 4 | 2013-2014 | 1 | Insufficient data |
| Franzoni | 4 | 2013-2014 | 1 | Grade change |
| Franzoni | 5 | 2010-2013 | 3 | Grade change |
| Franzoni | 6 | 2012-2013 | 1 | Grade change |
| Henderson | 5 | 2010-2012 | 2 | Insufficient data |
| Regan | 5 | 2010-2014 | 4 | Team teacher |
| Kalis | 5 | 2011-2013 | 2 | No longer teaching 3rd-6th |
| Combest | 5 | 2013-2014 | 1 | Insufficient data |
| Meister | 5 | 2013-2014 | 1 | Insufficient data |
| Treacy | 6 | 2010-2011 | 1 | No longer teaching 3rd-6th |
| Sprayberry | 6 | 2010-2014 | 4 | Team teacher |
| Locust | 6 | 2010-2011 | 1 | No longer teaching 3rd-6th |
| Condron | 6 | 2011-2014 | 3 | Valid data |
| Monteiro | 6 | 2011-2014 | 3 | Team teacher |
| Arnwine | 6 | 2011-2012 | 1 | No longer teaching 3rd-6th |
| Sebree | 6 | 2013-2014 | 1 | Insufficient data |

As demonstrated in the Table above, 28 teachers taught 65 classes of 3rd through 6th grade. Note that two teachers (Sharrow and Franzoni) taught several grades during this time period. Remember, these are the only grades (3rd-6th) that are given the New Mexico standardized test.
Once we exclude questionable data, we are reduced to exactly one (1/28 = 3.6%) solitary teacher (Condron) who taught three classes of 6th graders during the last four years for whom VAM data were available. All of the other data required for the VAM calculation would be provided by surrogate data.
According to the Skandera posting, the missing data would be replaced with the much more subjective and less reliable observational data. It is unclear how the “team teacher” data would be assessed. There is no scientific means to disaggregate the results for the two teachers. Any assignment of “value-added” for team teachers would be [almost] completely arbitrary [as typically based on teacher self-reported proportions of time taught].
Thus, we can see that for the 28 teachers listed in the above table, the calculations would then be based on the following data substitutions:
Table of Surrogate Data Substitutions

| Data Status | Number of Teachers | Surrogate Data Information | Notes on Data Quality |
|---|---|---|---|
| No longer teaching 3rd-6th | 9 (32.1%) | Excluded from calculation | Teachers that no longer teach the students in question would be excluded from the VAM calculation. |
| Team teacher | 6 (21.4%) | Arbitrary division of credit for VAM score | Arbitrary (aka “self-reported”) division of responsibility |
| Insufficient data (missing one year of test data) | 6 (21.4%) | The missing data would be replaced by observational data. | These teachers would have 1/3 of their VAM calculation based on observational data. |
| Insufficient data (missing two years of test data) | 5 (17.9%) | The missing data would be replaced by observational data. | These teachers would have 2/3 of their VAM calculation based on observational data. |
| Grade change | 1 (3.6%) | One teacher (Franzoni) is still teaching 3rd-6th after a grade change. | Either her 4th and 5th grade data would be combined in some fashion, or the data for her most recent teaching (6th grade) would be based on two years of observational data. In either case, the quality of her data would be very suspect. |
| Valid data | 1 (3.6%) | There are at least three years of valid data. | The data would be valid for the VAM calculation. |

Again, only one teacher, as demonstrated, has valid data. I can guarantee that these major problems with the integrity or quality of these data were NOT publicized when the teacher VAM scores were released. Teachers were still held responsible, regardless.
Remember that for all of the other teachers on campus (K-2, computer, fine arts, PE, etc.), their VAM scores would be derived from the campus average. This average, as well, would be contaminated by the very problematic data revealed in these tables.
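As a quick check on the arithmetic, the counts in the substitution table can be tallied in a few lines of Python. This is only a sketch: the counts are copied from the table, while the variable and function names are illustrative assumptions.

```python
# Teacher counts per data status, copied from the substitution table.
counts = {
    "No longer teaching 3rd-6th": 9,
    "Team teacher": 6,
    "Insufficient data (missing one year)": 6,
    "Insufficient data (missing two years)": 5,
    "Grade change": 1,
    "Valid data": 1,
}

total = sum(counts.values())  # should equal the 28 teachers in the first table


def share(n: int, total: int) -> float:
    """Percentage of all teachers, rounded to one decimal place."""
    return round(100 * n / total, 1)


shares = {status: share(n, total) for status, n in counts.items()}

for status, pct in shares.items():
    print(f"{status}: {counts[status]} of {total} ({pct}%)")

# Share of teachers whose data fail at least one requirement.
not_valid = total - counts["Valid data"]
print(f"Without valid data: {not_valid} of {total} ({share(not_valid, total)}%)")
```

The six categories sum to 28, matching the teacher count, and only one of the 28 falls in the valid category.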
Does this approach of calculating VAM scores with such questionable and shaky data seem fair or equitable to the teachers or to this school? I believe not.