A colleague of mine, Stephen Caldas, is Professor of Educational Leadership at Manhattanville College and one of the “heavyweights” who recently visited New York to discuss the state’s teacher evaluation system. According to Chalkbeat New York, he “once called New York’s evaluation system ‘psychometrically indefensible.’” He wrote me with a critique of New York’s VAM, which I decided to post for you all here.
His critique is of the 2013-2014 Growth Model for Educator Evaluation Technical Report, produced by the American Institutes for Research (AIR), which “describes the models used to measure student growth for the purpose of educator evaluation in New York State for the 2013-2014 School Year” (p. 1).
Here’s what he wrote:
I’ve analyzed this tech report, which for many would be a great sedative prior to sleeping. It’s the latest in a series of three reports by AIR paid for by the New York State Education Department. The truth of how good the growth models used by AIR really are, though, is buried deep in the report in Table 11 (p. 31) and Table 20 (p. 44), both of which are recreated here.
These tables give us indicators of how good the growth models are at predicting current year student English/language arts (ELA) and mathematics (MATH) scores by grade level and subject (i.e., the dependent variables). At the secondary level, an additional outcome, or dependent variable, is the number of Regents Exams a student passed for the first time in the current year. The unadjusted models included only prior academic achievement as a predictor, and are shown for comparison purposes only. The adjusted models were the ones the state actually used to make the predictions that fed into teacher and principal effectiveness scores. In addition to prior student achievement, the adjusted prediction models included these predictor variables: student and school-level poverty status, student and school-level socio-economic status (SES), student and school-level English language learner (ELL) status, and scores on the New York State English as a Second Language Achievement Test (the NYSESLAT). The tables report a statistic called “Pseudo R-squared” or just “R-squared,” and this statistic shows us the predictive power of the overall models.
To help interpret these numbers, if one observes a “1.0” (which one won’t), it would mean that the model was “100%” perfect (with no prediction error). One obtains the “percentage of perfect” (if you will) by moving the decimal point two places to the right. The difference between that percentage and 100 is called the “error” or “e.”
With this knowledge, one can see in the adjusted ELA 8th grade model (Table 11) that the predictor variables altogether explain “74%” of the variance of current year student ELA 8th grade scores (R-squared = 0.74). Conversely, this same model has 26% error (and this is one of the best ones illustrated in the report). In other words, this particular prediction model cannot account for 26% of the variation in current ELA 8th grade scores, “all other things considered” (i.e., even with predictor variables that are so highly correlated with test scores in the first place).
The prediction models at the secondary level are much, MUCH worse. If one looks at Table 20, one sees that in the worst model (adjusted ELA Common Core) the predictor variables together explain only 45% of the variance in student ELA Common Core test scores. Thus, this prediction model cannot account for 55% of the variation in these scores!!
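The arithmetic behind these two figures is simply one minus R-squared. Here is a minimal Python sketch of that step, using only the two adjusted-model values quoted above; the function name `unexplained_share` and the dictionary labels are illustrative, not anything taken from the AIR report.

```python
def unexplained_share(r_squared: float) -> float:
    """Return the share of variance a model leaves unexplained (1 - R-squared)."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R-squared is expected to lie between 0 and 1")
    return 1.0 - r_squared


# R-squared values as quoted in the text above (labels are illustrative).
quoted_models = {
    "adjusted ELA grade 8 (Table 11)": 0.74,
    "adjusted ELA Common Core (Table 20)": 0.45,
}

for label, r2 in quoted_models.items():
    # Format both shares as whole percentages for easy reading.
    print(f"{label}: explained {r2:.0%}, unexplained {unexplained_share(r2):.0%}")
```

Any other R-squared value from the two tables could be passed to the same function to read off its error share.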
While not terrible R-squared values for social science research, these are horrific values for a model used to make individual-level predictions at the teacher or school level with any degree of precision. Quite frankly, they simply cannot be precise given these huge quantities of error. The chances that these models would precisely (with no error) predict a teacher’s or school’s ACTUAL student test scores are slim to none. Yet, the results of these imprecise growth models can contribute up to 40% of a teacher’s effectiveness rating.
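One rough way to see what this means for an individual prediction is to convert R-squared into the typical size of a prediction miss. The sketch below assumes an ordinary least-squares model, where the residual standard deviation is approximately the outcome’s standard deviation times the square root of (1 minus R-squared). That is an assumption made for illustration: the report gives a pseudo R-squared, so the relation is only approximate here, and the helper name `residual_sd_fraction` is hypothetical.

```python
import math


def residual_sd_fraction(r_squared: float) -> float:
    """Approximate residual SD as a fraction of the outcome SD.

    Assumes the ordinary least-squares relation sqrt(1 - R-squared);
    for a pseudo R-squared this is only a rough guide.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R-squared is expected to lie between 0 and 1")
    return math.sqrt(1.0 - r_squared)


# The two R-squared values quoted in the text (best and worst cases cited).
for r2 in (0.74, 0.45):
    fraction = residual_sd_fraction(r2)
    print(f"R-squared {r2:.2f}: typical miss is about {fraction:.2f} outcome SDs")
```

Under that assumption, even the 0.74 model leaves a typical miss of about half a standard deviation of student scores, and the 0.45 model leaves about three quarters of one.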
This high level of imprecision would explain why teachers like Sheri Lederman of Long Island, who is apparently a terrific fourth grade educator based on all kinds of data besides her most recent VAM scores, received an “ineffective” rating based on this flawed growth model (see prior posts here and here). She clearly has a solid basis for her lawsuit against the state of New York in which she claims her score was “arbitrary and capricious.”
This kind of information on all the prediction error in these growth models needs to be in an executive summary at the front of these technical reports. The interpretation of this error should be in PLAIN LANGUAGE for the taxpayers who foot the bill for these reports, the policymakers who need to understand the findings in these reports, and the educators who suffer the consequences of such imprecision in measurement.