Arizona’s Teacher Evaluation System, Not Strict Enough?

Last week, Arizona State Superintendent of Public Instruction, John Huppenthal, received the news that Arizona’s No Child Left Behind (NCLB) waiver extension request had been provisionally granted with a “high-risk” label (i.e., in danger of being revoked). Superintendent Huppenthal was given 60 days to make two revisions: (1) adjust the graduation rate to account for 20% of a school’s A-F letter grade instead of the proposed 15% and, as most pertinent here, (2) finalize the guidelines for the teacher and principal evaluations to comply with Elementary and Secondary Education Act (ESEA) Flexibility (i.e., the NCLB waiver guidelines).

Within 60 days, Superintendent Huppenthal and the Arizona Department of Education (ADE) must: (1) finalize their teacher and principal evaluation guidelines; (2) give sufficient weight to student growth so as to differentiate between teachers/principals who have contributed to more/less growth in student learning and achievement; (3) ensure that shared attribution of growth does not mask high- or low-performing teachers as measured by growth; and (4) guarantee that all of this is done in time for schools to implement the system in the 2014-2015 school year.

These demands, particularly #2 and #3 above, reflect some of the serious and unavoidable flaws of new teacher evaluations based on student growth (i.e., VAMs and similar growth models).

As per #2, the most blatant problem is the limited number of teachers (typically around 30%, though reported as only 17% in the recent post about DC’s teacher evaluation system) who are eligible for classroom-level student growth data (i.e., value-added). Thus, one of the key expectations (ensuring sufficient weight to student growth scores so as to differentiate between teachers’/principals’ impact on student learning and achievement) is impossible for roughly seven out of every ten teachers in Arizona and other states. While most states, including Arizona, have chosen to remedy this problem by attributing a school-level (or grade-level) value-added score to teachers who are ineligible for classroom-level scores (sometimes counting for as much as 50% of the teacher’s overall evaluation), this solution does not (and likely never will) satisfy #2 as written above. It seems the feds do not quite understand that what they are mandating, in practice, leaves well over half of teachers being evaluated on students and/or content they did not teach.

As per #3, Arizona (like all waiver-earning states) must also demonstrate how the state will ensure that shared attribution of growth does not mask high- or low-performing teachers as measured by growth. Yet, again, when these systems are implemented in practice, 70+% of teachers are assigned a school-level student growth score, meaning that all teachers in any given school who fall into this group will receive the same score. In what way is it feasible to “ensure” that no high- or low-performing teacher is “masked” when student growth is attributed to teachers this way? This is yet another example of the illogical circumstances by which schools must abide in order to meet the arbitrary (and often impossible) demands of ESEA Flexibility (and Race to the Top).

If Arizona fails to comply with the USDOE’s requests within 60 days, the state will lose its ESEA waiver and face the consequences of NCLB. In a statement to Education Week, however, AZ Superintendent Huppenthal stood by his position on providing school districts with as much flexibility as possible within the constraints of the waiver stipulations. He said he will not protest the “high risk” label and will instead attempt to “get around this and still keep local control for those school districts.” The revised application is due at the end of January.

Post contributed by Jessica Holloway-Libell

Mr. T’s Scores on the DC Public Schools’ IMPACT Evaluation System

After our recent post regarding the DC Public Schools’ IMPACT Evaluation System, and Diane Ravitch’s follow-up, a DC teacher wrote to Diane expressing his concerns about his DC IMPACT evaluation scores, attaching the scores he received after his supervising administrator and a master educator observed the same 30-minute lesson he recently taught to the same class.

First, take a look at his scores summarized below. Please note that other supportive “evidence” (e.g., notes re: what was observed to support and warrant the scores below) was available, but for purposes of brevity and confidentiality this “evidence” is not included here.

As you can easily see, these two evaluators were very much NOT on the same evaluation page, even though they observed the same lesson, at the same time, on the same instructional occasion.

Evaluative Criteria | Definition | Administrator Scores (Mean = 1.44) | Master Educator Scores (Mean = 3.11)
TEACH 1 | Lead Well-Organized, Objective-Driven Lessons | 1 = Ineffective | 4 = Highly Effective
TEACH 2 | Explain Content Clearly | 1 = Ineffective | 3 = Effective
TEACH 3 | Engage Students at All Learning Levels in Rigorous Work | 1 = Ineffective | 3 = Effective
TEACH 4 | Provide Students Multiple Ways to Engage with Content | 1 = Ineffective | 3 = Effective
TEACH 5 | Check for Student Understanding | 2 = Minimally Effective | 4 = Highly Effective
TEACH 6 | Respond to Student Understandings | 1 = Ineffective | 3 = Effective
TEACH 7 | Develop Higher-Level Understanding through Effective Questioning | 1 = Ineffective | 2 = Minimally Effective
TEACH 8 | Maximize Instructional Time | 2 = Minimally Effective | 3 = Effective
TEACH 9 | Build a Supportive, Learning-Focused Classroom Community | 3 = Effective | 3 = Effective

Overall, Mr. T (an obvious pseudonym) received a 1.44 from his supervising administrator and a 3.11 from the master educator, with scores ranging from 1 = Ineffective to 4 = Highly Effective.
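The two means are simple averages of the nine TEACH rubric scores listed above; a quick sketch (in Python, purely for illustration) confirms the arithmetic:

```python
# TEACH rubric scores (1 = Ineffective ... 4 = Highly Effective), one entry
# per criterion, TEACH 1 through TEACH 9, as reported in the table above.
administrator = [1, 1, 1, 1, 2, 1, 1, 2, 3]
master_educator = [4, 3, 3, 3, 4, 3, 2, 3, 3]

admin_mean = sum(administrator) / len(administrator)       # 13/9
master_mean = sum(master_educator) / len(master_educator)  # 28/9
print(round(admin_mean, 2), round(master_mean, 2))  # prints 1.44 3.11
```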

This is particularly important as illustrated in the prior post (Footnote 8 of the full piece to be exact), because “Teacher effectiveness ratings were based on, in order of importance by the proportion of weight assigned to each indicator [including first and foremost]: (1) scores derived via [this] district-created and purportedly “rigorous” (Dee & Wyckoff, 2013, p. 5) yet invalid (i.e., not having been validated) observational instrument with which teachers are observed five times per year by different folks, but about which no psychometric data were made available (e.g., Kappa statistics to test for inter-rater consistencies among scores).” For all DC teachers, this is THE observational system used, and for 83% of them these data are weighted at 75% of their total “worth” (Dee & Wyckoff, 2013, p. 10). This is precisely the system that is receiving (and gaining) praise, especially as it has thus far led to teacher bonuses (professedly up to $25,000 per year) as well as terminations of more than 500 teachers (≈ 8%) throughout DC’s Public Schools. Yet, as evident here again, this system has some fatal flaws and serious issues, despite its praised “rigor” (Dee & Wyckoff, 2013, p. 5).

See also ten representative comments taken from both the administrator’s evaluation form and the master educator’s evaluation form. Revealed here, as well, are MAJOR issues and discrepancies that should not occur in any “objective” and “reliable” evaluation system, especially in one to which such major consequences are attached and that has been, accordingly, so “rigorously” praised (Dee & Wyckoff, 2013, p. 5).

Administrator’s Comments:
1. The objective was not posted nor verbally articulated during the observation… Students were asked what the objective was and they looked to the board but when they saw no objective…
2. There was limited evidence that students mastered the content based on the work they produced.
3. Explanations of content weren’t clear and coherent based on student responses and the level of attention that Mr. T had to pay to most students.
4. Students were observed using limited academic language throughout the observation.
5. The lesson was not accessible to students and therefore posed too much challenge based on their level of ability.
6. [T]here wasn’t an appropriate balance between teacher‐directed and student‐centered learning.
7. There was limited higher-level understanding developed based on verbal conferencing or work products that were created.
8. Through [checks for understanding] Mr. T was able to get the pulse of the class… however there was limited evidence that Mr. T understood the depth of student understanding.
9. There were many students that had misunderstandings based on student responses from putting their heads down to moving to others to talk instead of work.
10. Inappropriate behaviors occurred regularly within the classroom.

Master Educator’s Comments:
1. Mr. T was highly effective at leading well-organized, objective-driven lessons.
2. Mr. T’s explanations of content were clear and coherent, and they built student understanding of content.
3. All parts of Mr. T’s lesson significantly moved students towards mastery of the objective as evidenced by students.
4. Mr. T included learning styles that were appropriate to students’ needs and all students responded positively and were actively involved.
5. Mr. T’s explanations of content were clear and coherent, and they built student understanding of content.
6. Mr. T was effective at engaging students at all levels in accessible and challenging work.
7. Students had adequate opportunities to meaningfully practice, apply, and demonstrate what they are learning.
8. Mr. T always used appropriate strategies to ensure that students moved toward higher-level understanding.
9. Mr. T was effective at maximizing instructional time…Inappropriate or off-task student behavior never interrupted or delayed the lesson.
10. Mr. T was effective at building a supportive, learning-focused classroom community. Students were invested in their work and valued academic success.

In sum, as Mr. T wrote in his email to Diane, while he is “fortunate enough to have a teaching position that is not affected by VAM nonsense…that doesn’t mean [he’s] completely immune from a flawed system of evaluations.” This “supposedly ‘objective’ measure seems to be anything but.” Is the administrator correct in positioning Mr. T as ineffective? Or might it be, perhaps, that the master educator was “just being too soft”? Either way, “it’s confusing and it’s giving [Mr. T.] some thought as to whether [he] should just spend the school day at [his] desk working on [his] resumé.”

Our thanks to Mr. T for sharing his DC data, and for sharing his story!

VAMs v. Student Growth Percentiles (SGPs) – Part II

A few weeks ago, a reader posted the following question: “What is the difference [between] VAM and Student Growth Percentiles (SGP) and do SGPs have any usefulness[?]”

In response, I invited a scholar and colleague who knows a lot about SGPs. This is the first of two posts to help others understand the distinctions and similarities. Thanks to our Guest Blogger – Sarah Polasky – for writing the following:

“First, I direct the readers to the VAMboozled! glossary and the information provided that contrasts VAMs and Student Growth Models, if they haven’t visited that section of the site yet. Second, I hope to build upon this by highlighting key terms and methodological differences between traditional VAMs and SGPs.

A Value-Added Model (VAM) is a multivariate (multiple variable) student growth model that attempts to account for, or statistically control for, all potential student, teacher, school, district, and external influences on outcome measures (i.e., growth in student achievement over time). The most well-known example of this model is the SAS Education Value-Added Assessment System (EVAAS)[1]. The primary goal of this model is to estimate teachers’ causal effects on student performance over time. Put differently, the purpose of this model is to measure groups of students’ academic gains over time and then attribute those gains (or losses) back to teachers as key indicators of the teachers’ effectiveness.

In contrast, the Student Growth Percentiles (SGP)[2] model uses students’ level(s) of past performance to determine students’ normative growth (i.e., as compared to his/her peers). As explained by Castellano & Ho[3], “SGPs describe the relative location of a student’s current score compared to the current scores of students with similar score histories” (p. 89). Students are compared to themselves (i.e., students serve as their own controls) over time; therefore, the need to control for other variables (e.g., student demographics) is less necessary. The SGP model was developed as a “better” alternative to existing models, with the goal of providing clearer, more accessible, and more understandable results to both internal and external education stakeholders and consumers. The primary goal of this model is to provide growth indicators for individual students, groups of students, schools, and districts.

The utility of the SGP model lies in reviewing, particularly by subject area, growth histories for individual students and aggregate measures for groups of students (e.g., English language learners) to track progress over time and examine group differences, respectively. The model’s developer admits that, on its own, SGPs should not be used to make causal interpretations, such as attributing high growth in one classroom to the teacher as the sole source of growth[4]. However, when paired with additional indicators, supporting concurrent-related evidence of validity (link to glossary), such inferences may be more appropriate.”
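To make the contrast above concrete, here is a toy sketch in Python with invented data, deliberately simplified: a real VAM typically uses mixed-effects models with many covariates, and the actual SGP package estimates conditional percentiles via quantile regression rather than the crude exact-match ranking used here. The “VAM” half regresses current scores on prior scores and averages each teacher’s residuals; the “SGP” half ranks a student’s current score only against peers with the same score history:

```python
from statistics import mean

# Hypothetical (teacher, prior_score, current_score) records
students = [
    ("A", 50, 58), ("A", 60, 67), ("A", 70, 78),
    ("B", 50, 52), ("B", 60, 61), ("B", 70, 72),
]

# --- Toy VAM: covariate adjustment ---------------------------------------
# Fit current = a + b * prior by ordinary least squares, then treat each
# teacher's mean residual as his or her "value-added" estimate.
prior = [p for _, p, _ in students]
current = [c for _, _, c in students]
mp, mc = mean(prior), mean(current)
b = sum((p - mp) * (c - mc) for p, c in zip(prior, current)) / \
    sum((p - mp) ** 2 for p in prior)
a = mc - b * mp

residuals = {}
for t, p, c in students:
    residuals.setdefault(t, []).append(c - (a + b * p))
value_added = {t: mean(r) for t, r in residuals.items()}
# Teacher A's students beat their predicted scores; teacher B's fell short.

# --- Toy SGP: normative ranking ------------------------------------------
def sgp(prior_score, current_score, peers):
    """Percentile of current_score among peers with the same prior score
    (an exact match stands in for 'similar score histories')."""
    cohort = [c for p, c in peers if p == prior_score]
    return 100 * sum(1 for c in cohort if c < current_score) / len(cohort)

pairs = [(p, c) for _, p, c in students]
# A student with a prior score of 60 who now scores 67 outgrew one of the
# two students sharing that history.
print(sgp(60, 67, pairs))
```

Note that the two halves answer different questions: the VAM residual is pitched as a causal teacher effect, while the SGP is purely descriptive of a student’s normative growth, which is exactly the distinction drawn in the guest post.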


[1] Sanders, W.L. & Horn, S.P. (1994). The Tennessee value-added assessment system (TVAAS): Mixed-model methodology in educational assessment. Journal of Personnel Evaluation in Education, 8(3): 299-311.

[2] Betebenner, D.W. (2013). Package ‘SGP’. Retrieved from

[3] Castellano, K.E. & Ho, A.D. (2013). A Practitioner’s Guide to Growth Models. Council of Chief State School Officers.

[4] Betebenner, D. W. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42-51. doi:10.1111/j.1745-3992.2009.00161.x

Random Assignment and Bias in VAM Estimates – Article Published in AERJ

“Nonsensical,” “impractical,” “unprofessional,” “unethical,” and even “detrimental” – these are just a few of the adjectives used by elementary school principals in Arizona to describe the use of randomized practices to assign students to teachers and classrooms. When asked whether principals might consider random assignment practices, one principal noted, “I prefer careful, thoughtful, and intentional placement [of students] to random. I’ve never considered using random placement. These are children, human beings.” Yet the value-added models (VAMs) being used in many states to measure the “value added” by individual teachers to their students’ learning assume that any school is as likely as any other school, and any teacher is as likely as any other teacher, to be assigned any student who is as likely as any other student to have similar backgrounds, abilities, aptitudes, dispositions, motivations, and the like.

One of my doctoral students – Noelle Paufler – and I recently reported in the highly esteemed American Educational Research Journal the results of a survey administered to all public and charter elementary principals in Arizona (see the online publication of “The Random Assignment of Students into Elementary Classrooms: Implications for Value-Added Analyses and Interpretations”). We examined the various methods used to assign students to classrooms in their schools, the student background characteristics considered in nonrandom placements, and the roles teachers and parents play in the placement process. In terms of bias, the fundamental question here was whether the use of nonrandom student assignment practices might lead to biased VAM estimates, if the nonrandom student sorting practices went beyond that which is typically controlled for in most VAMs (e.g., academic achievement and prior demonstrated abilities, special education status, ELL status, gender, giftedness, etc.).

We found that, overwhelmingly, principals use various placement procedures through which administrators and teachers consider a variety of student background characteristics and student interactions to make placement decisions. In other words, student placements are by and large nonrandom (contrary to the methodological assumptions to which VAM consumers often agree).

Principals frequently cited interactions among students, students’ peers, and previous teachers as justification for future placements. Principals stated that students were often matched with teachers based on the students’ individual learning styles and the teachers’ respective teaching strengths. Parents also wielded considerable control over the placement process, with a majority of principals stating that parents made placement requests, most of which were honored.

In addition, in general, principal respondents were greatly opposed to using random student assignment methods in lieu of placement practices based on human judgment—practices they collectively agreed were in the best interest of students. Random assignment, even if necessary to produce unbiased VAM-based estimates, was deemed highly “nonsensical,” “impractical,” “unprofessional,” “unethical,” and even “detrimental” to student learning and teacher success.

The nonrandom assignment of students to classrooms has significant implications for the use of value-added models to estimate teacher effects on student learning from large-scale standardized test scores. Given the widespread use of nonrandom placement methods indicated in this study, value-added researchers, policymakers, and educators should carefully consider both the implications of these placement practices and the validity of the inferences made using value-added estimates of teacher effectiveness.