Something to Be Thankful For, in New York

New York is one of a handful of states often of (dis)honorable mention on this blog (see for example here, here, and here), given that its state Schools Chancellor Merryl Tisch, with the support and prodding of New York Governor Andrew Cuomo, has continuously pushed to have teacher-level growth scores count for up to 50% of teachers’ total evaluation scores.

But now, it looks like there is something for which we all, and especially those in New York, might be thankful.

As per an article published yesterday in The New York Times, Governor “Cuomo, in Shift, Is Said to Back Reducing Test Scores’ Role in Teacher Reviews.” Thankful we should be for teachers who expressed their frustrations with the state’s policy movements, who were apparently heard. And thankful we should be for the parents who opted out last year in protest throughout New York, as it looks like their collective efforts also worked to reverse this state trend. “More than 200,000 of the nearly 1.2 million students [16.7%] expected to take the annual reading and math tests [in New York] did not sit for them in 2015.”

“Now, facing a parents’ revolt against testing, the state is poised to change course and reduce the role of test scores in evaluations. And according to two people involved in making state education policy, [Governor] Cuomo has been quietly pushing for a reduction, even to zero. That would represent an about-face from January, when the governor called for test scores to determine 50 percent of a teacher’s evaluation.”

It looks like a task force is to make recommendations to Governor Cuomo before his 2016 State of the State speech in January, with recommendations potentially including the “decoupling test scores from [teacher] evaluations or putting in place some kind of moratorium on teacher evaluations.”

As per Diane Ravitch’s post on this breaking story, “Cuomo may not only reduce the role of testing in teacher evaluation, but eliminate it altogether.” However, we might also be cautiously thankful, and critically aware, as “[t]his may be a hoax, a temporary moratorium intended to deflate the Opt Out Movement and cause it to disappear. Do not rest until the law is changed to delink testing and teacher-principal evaluations.” Rather, “Let’s remain watchful and wait to see what happens. In the meanwhile, this is [certainly] reason for joy on the day [of] Thanksgiving.”

National Council on Teacher Quality (NCTQ) Report on States’ Teacher Evaluation Systems

The controversial National Council on Teacher Quality (NCTQ), created by the conservative Thomas B. Fordham Institute, funded (in part) by the Bill & Melinda Gates Foundation, and “part of a coalition for ‘a better orchestrated agenda’ for accountability, choice, and using test scores to drive the evaluation of teachers” (see here; see also other instances of controversy here and here), recently issued a 106-page report titled “State of the States 2015: Evaluating Teaching, Leading and Learning.” In this report, they present “the most comprehensive and up-to-date policy trends on how states [plus DC] are evaluating teachers” (p. i). The report also provides similar information about how principals are being evaluated across states, but given the focus of this blog, I focus only on the information they offer regarding states’ teacher evaluation systems.

I want to underscore that this is, indeed, the most comprehensive and up-to-date report capturing what states are currently doing in terms of their teacher evaluation policies and systems; however, I would not claim all of the data included within are entirely accurate, although this is understandable given how very difficult it is to be comprehensive and remain up-to-date on this topic, especially across all 50 states (plus DC). Therefore, do consume the factual data included within this report as potentially incorrect, and certainly imperfect.

I also want to bring attention to the many figures included within this report. Many should find these of interest and use, again keeping in mind the likely/understandable errors and inconsistencies. The language around these figures, as well as much of the other text littered throughout this document, however, should be more critically consumed, in context, and primarily given the original source of the report (i.e., the NCTQ; see above). In other words, while the figures may prove to be useful, the polemics around them are likely of less value…unless, of course, you want to read/analyze a good example of the non-research-based rhetoric and assumptions advanced by “the other side.”

For example, their Figure B (p. i) titled Figure 1 below, illustrates that as of 2015 there were “just five states – California, Iowa, Montana, Nebraska and Vermont – that still have [emphasis added] no formal state policy requiring that teacher evaluations take objective measures of student achievement [emphasis added] into account in evaluating teacher effectiveness. Only three states – Alabama, New Hampshire and Texas – have evaluation policies that exist only in waiver requests to the federal government” (p. ii).


Figure 1. Teacher effectiveness state policy trends (2009-2015)

In addition, “27 states [now] require annual evaluations for all teachers, compared to just 15 states in 2009;” “17 states include student growth as the preponderant criterion in teacher evaluations, up from only four states in 2009…An additional 18 states include growth measures as a “significant” criterion in teacher evaluations;” “23 states require that evidence of teacher performance be used in tenure decisions [whereas no] state had such a policy in 2009;” “19 states require that teacher performance is considered in reduction in force decisions;” and the “majority of states (28) now articulate that ineffectiveness is grounds for teacher dismissal” (p. 6). These are all indicative of “progress,” as per the NCTQ.

Nonetheless, here is another figure that should be of interest (see their Figure 27, p. 37, titled Figure 2 below), capturing when states adopted their (more or less) current teacher evaluation policies.


Figure 2. Timeline for state adoption of teacher evaluation policies

One of the best/worst parts of their report is in one of their conclusions, that there’s “a real downside for states that indulge critics [emphasis added] by delaying implementation, adopting hold harmless policies or reducing the weight of student achievement in evaluations. These short-term public relations solutions [emphasis added] reinforce the idea that there are a lot of immediate punitive consequences coming for teachers when performance-based evaluations are fully implemented, which is simply not the case” (p. iii).

Ironic here is that they, immediately thereafter, insert their Figure D (p. v, titled Figure 3 below), which they explicitly title “Connecting the Dots,” to illustrate all of the punitive consequences already at play across the nation (giving Delaware, Florida, and Louisiana (dis)honorable mentions for leading the nation when it comes to their policy efforts to connect the dots). Some of these consequences/connectors are also at the source of the now 15 lawsuits occurring across the nation because of the low-to-high stakes consequences being attached to these data (see information about these 15 lawsuits here).


Figure 3. “Connecting the dots”

To what “dots” are they explicitly referring, and arguably hoping that states better “connect?” See their Figure 23 (p. 29, titled Figure 4 below).


Figure 4. “Connecting the dots” conceptual framework

Delaware, Florida, and Louisiana, as mentioned prior, “lead the nation when it comes to using teacher effectiveness data to inform other policies. Each of these states connects evaluation to nine of 11 related areas” (p. 29) illustrated above.

Related, the NCTQ advances a set of not-at-all-research-based claims that “there has been some good progress on connecting the dots in the states, [but] unless pay scales change [to increase merit pay initiatives, abort traditional salary schedules, and abandon traditional bonus pay systems, which research also does not support], evaluation is only going to be a feedback tool [also currently false] when it could be so much more [also currently false]” (p. vi). They conclude that “too few states are willing to take on the issue of teacher pay and lift the teaching profession [emphasis added] by rewarding excellence” (p. vi).

Related, the NCTQ also highlights the state-level trends in tying teacher effectiveness to dismissal policies, also as progress or movement in the right direction (see their Figure 6, p. 8, titled Figure 5 below).


Figure 5. Trends in state policy tying teacher effectiveness to dismissal policies

Here they make reference to this occurring, despite the current opt-out movement (see, for example, here) that has been taken up by teacher unions, set to undermine teacher evaluations, protect teachers, and put students at risk (especially poor and minority students) by stripping states, districts, and schools “of any means of accountability [emphasis added] for ensuring that all children learn” (p. 10).

Finally, they also note that “States could [should also] do a lot more to use evaluation data to better prepare future teachers. Only 14 states with evaluations of effectiveness (up from eight in 2013) have adopted policies connecting the performance of students to their teachers and the institutions where their teachers were trained [i.e., colleges of education]” (p. 31). While up from years prior, more should also be done in this potential area of “value-added” accountability, as well.

See also what Diane Ravitch called “a hilarious summary” of this same NCTQ report. It is written by Peter Greene, and posted on his Curmudgucation blog, here. See also here, here, and here for prior posts by Peter Greene.

Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation

States of attention on this blog, and often of (dis)honorable mention as per their state-level policies bent on value-added models (VAMs), include Florida, New York, Tennessee, and New Mexico. As for a quick update about the latter state of New Mexico, we are still waiting to hear the final decision from the judge who recently heard the state-level lawsuit still pending on this matter in New Mexico (see prior posts about this case here, here, here, here, and here).

Another locale of great interest, though, is the Houston Independent School District. This is the seventh largest urban school district in the nation, and the district that has tied more high-stakes consequences to their value-added output than any other district/state in the nation. These “initiatives” were “led” by soon-to-resign/retire Superintendent Terry Grier who, during his time in Houston (2009-2015), implemented some of the harshest consequences ever attached to teacher-level value-added output, as per the district’s use of the Education Value-Added Assessment System (EVAAS) (see other posts about the EVAAS here, here, and here; see other posts about Houston here, here, and here).

In fact, the EVAAS is still used throughout Houston today to evaluate all EVAAS-eligible teachers, and also to “reform” the district’s historically low-performing schools, by tying teachers’ purported value-added performance to teacher improvement plans, merit pay, nonrenewal, and termination (e.g., 221 Houston teachers were terminated “in large part” due to their EVAAS scores in 2011). However, pending litigation (i.e., this is the district in which the American and Houston Federation of Teachers (AFT/HFT) are currently suing the district for their wrongful use of, and over-emphasis on, this particular VAM; see here), Superintendent Grier and the district have recoiled on some of the high-stakes consequences they formerly attached to the EVAAS. This particular lawsuit is to commence this spring/summer.

Nonetheless, my most recent post about Houston was about some of its future school board candidates, who were invited by The Houston Chronicle to respond to Superintendent Grier’s teacher evaluation system. For the most part, those who responded did so unfavorably, especially as the evaluation system was/is disproportionately reliant on teachers’ EVAAS data and high-stakes use of these data in particular (see here).

Most recently, however, as per a “split” decision registered by Houston’s current school board (i.e., 4:3, and without any new members elected last November), Superintendent Grier received a $98,600 bonus for his “satisfactory evaluation” as the school district’s superintendent. See more from the full article published in The Houston Chronicle. As per the same article, Superintendent “Grier’s base salary is $300,000, plus $19,200 for car and technology allowances. He also is paid for unused leave time.”

More importantly, take a look at the two figures below, taken from actual district reports (see references below), highlighting Houston’s performance (declining, on average, in blue) as compared to the state of Texas (maintaining, on average, in black), to determine for yourself whether Superintendent Grier, indeed, deserved such a bonus (not to mention salary).

Another question to ponder is whether the district’s use of the EVAAS value-added system, especially since Superintendent Grier’s arrival in 2009, is actually reforming the school district as he and other district leaders have for so long now intended (e.g., since his Superintendent appointment in 2009).

Figure 1. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR tests, 2012-2015 (HISD, 2015a)

Figure 2. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR End-of-Course (EOC) tests, 2012-2015 (HISD, 2015b)


Houston Independent School District (HISD). (2015a). State of Texas Assessments of Academic Readiness (STAAR) performance, grades 3-8, spring 2015. Retrieved here.

Houston Independent School District (HISD). (2015b). State of Texas Assessments of Academic Readiness (STAAR) end-of-course results, spring 2015. Retrieved here.

Including Summers “Adds Considerable Measurement Error” to Value-Added Estimates

A new article titled “The Effect of Summer on Value-added Assessments of Teacher and School Performance” was recently released in the peer-reviewed journal Education Policy Analysis Archives. The article is authored by Gregory Palardy and Luyao Peng from the University of California, Riverside. 

Before we begin, though, here is some background so that you all understand the importance of the findings in this particular article.

In order to calculate teacher-level value added, all states are currently using (at minimum) the large-scale standardized tests mandated by No Child Left Behind (NCLB) in 2002. These tests were mandated for use in the subject areas of mathematics and reading/language arts. However, because these tests are given only once per year, typically in the spring, to calculate value-added, statisticians measure actual versus predicted “growth” (aka “value-added”) from spring to spring, over a 12-month span, which includes summers.

While many (including many policymakers) assume that value-added estimations are calculated from fall to spring, during time intervals in which students are under the same teachers’ supervision and instruction, this is not true. The reality is that the pre- to post-test occasions actually span 12-month periods, including the summers that cause the nettlesome summer effects often observed via VAM-based estimates. Different students learn different things over the summer, and this is strongly associated (and correlated) with students’ backgrounds and out-of-school opportunities (e.g., travel, summer camps, summer schools). Likewise, because summers are the time periods over which teachers and schools tend to have little control over what students do, this is also the time period during which research indicates that achievement gaps maintain or widen. More specifically, research indicates that students from relatively lower socio-economic backgrounds tend to suffer more from learning decay than their wealthier peers, although they learn at similar rates during the school year.

What these 12-month testing intervals also include are prior teachers’ residual effects, whereas students testing in the spring, for example, finish out every school year (e.g., two months or so) with their prior teachers before entering the classrooms of the teachers for whom value-added is to be calculated the following spring, although teachers’ residual effects were not of focus in this particular study.

Nonetheless, via the research, we have always known that these summer effects (and prior or adjacent teachers’ residual effects) are difficult if not impossible to statistically control. This in and of itself leads to much of the noise (fluctuations/lack of reliability, imprecision, and potential biases) we observe in the resulting value-added estimates. This is precisely what was of focus in this particular study.
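To make the arithmetic behind this concrete, here is a minimal sketch in Python. It is not a value-added model, and every number in it is invented for illustration; the classroom labels and the helper `mean_gain` are hypothetical. It assumes two classrooms whose students gain exactly the same amount from fall to spring, but whose summers differ, to show how a spring-to-spring gain folds the summer change into what looks like the teacher's contribution.

```python
from statistics import mean

# Hypothetical scale scores (invented for illustration only).
# Each student is a tuple: (prior spring score, fall score, spring score).
classrooms = {
    "lower_poverty": [(200, 202, 222), (210, 211, 231), (190, 193, 213)],
    "higher_poverty": [(200, 194, 214), (210, 203, 223), (190, 185, 205)],
}

PRIOR_SPRING, FALL, SPRING = 0, 1, 2


def mean_gain(students, start, end):
    """Average score change between two test occasions for one classroom."""
    return mean(s[end] - s[start] for s in students)


results = {}
for name, students in classrooms.items():
    results[name] = {
        # Change over the summer, when students are not with the teacher.
        "summer_change": mean_gain(students, PRIOR_SPRING, FALL),
        # Gain while students are actually in the teacher's classroom.
        "fall_to_spring": mean_gain(students, FALL, SPRING),
        # Gain over the full 12-month span, as in once-per-year testing.
        "spring_to_spring": mean_gain(students, PRIOR_SPRING, SPRING),
    }

for name, r in results.items():
    print(name, r)
```

In this toy case both classrooms show the same fall-to-spring gain, yet the spring-to-spring gains differ by exactly the difference in summer change. Actual VAMs compare actual versus predicted growth with statistical controls, so this only illustrates the decomposition, not the size of the effect.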

In this study researchers examined “the effects of including the summer period on value-added assessments (VAA) of teacher and school performance at the [1st] grade [level],” as compared to using VAM-based estimates derived from a fall-to-spring test administration within the same grade and same year (i.e., using data derived via a nationally representative sample via the National Center for Education Statistics (NCES) with an n=5,034 children).

Researchers found that:

  • Approximately 40-62% of the variance in VAM-based estimates originates from the summer period, depending on the reading or math outcome;
  • When summer is omitted from VAM-based calculations using within year pre/post-tests, approximately 51-61% of the teachers change performance categories. What this means in simpler terms is that including summers in VAM-based estimates is indeed causing some of the errors and misclassification rates being observed across studies.
  • Statistical controls for student and classroom/school variables reduce summer effects considerably (e.g., via controlling for students’ prior achievement), yet 36-47% of teachers still fall into different quintiles when summers are included in the VAM-based estimates.
  • Findings also evidence that including summers within VAM-based calculations tends to bias VAM-based estimates against schools with higher relative concentrations of poverty, or rather higher relative concentrations of students who are eligible for the federal free-and-reduced lunch program.
  • Overall, results suggest that removing summer effects from VAM-based estimates may require biannual achievement assessments (i.e., fall and spring). If we want VAM-based estimates to be more accurate, we might have to double the number of tests we administer per year in each subject area for which teachers are to be held accountable using VAMs. However, “if twice-annual assessments are not conducted, controls for prior achievement seem to be the best method for minimizing summer effects.”

This is certainly something to consider in terms of trade-offs, specifically in terms of whether we really want to “double-down” on the number of tests we already require our public students to take (also given the time that testing and test preparation already takes away from students’ learning activities), and whether we also want to “double-down” on the increased costs of doing so. I should also note here, though, that using pre/post-tests within the same year is (also) not as simple as it may seem (either). See another post forthcoming about the potential artificial deflation/inflation of pre/post scores to manufacture artificial levels of growth.

To read the full study, click here.

*I should note that I am an Associate Editor for this journal, and I served as editor for this particular publication, seeing it through the full peer-reviewed process.

Citation: Palardy, G. J., & Peng, L. (2015). The effects of including summer on value-added assessments of teachers and schools. Education Policy Analysis Archives, 23(92). doi:10.14507/epaa.v23.1997 Retrieved from

Just Released: The American Educational Research Association’s (AERA) Statement on VAMs

Yesterday, the Council of the American Educational Research Association (AERA) – AERA, founded in 1916, is the largest national professional organization devoted to the scientific study of education – publicly released their “AERA Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.” Below is a summary of the AERA Council’s key points, noting for transparency that I contributed to these points in June of 2014, before the final statement was externally reviewed, revised, and vetted for public release.

As per the introduction: “The purpose of this statement is to inform those using or considering the use of value-added models (VAM) about their scientific and technical limitations in the evaluation of educators [as well as programs that prepare teachers].” The purpose of this statement is also to stress “the importance of any educator evaluation system meeting the highest standards of practice in statistics and measurement,” well before VAM output are to carry any “high-stakes, dispositive weight in [such teacher or other] evaluations” (p. 1).

As per the main body of the statement, the AERA Council highlights eight very important technical requirements that must be met prior to such evaluative use. These eight technical requirements should be officially recognized by all states, and/or used by any of you out there to help inform your states regarding what they can and cannot, or should and should not do when using VAMs, whereas “[a]ny material departure from these [eight] requirements should preclude [VAM] use” (p. 2).

Here are AERA’s eight technical requirements for the use of VAM:

  1. “VAM scores must only be derived from students’ scores on assessments that meet professional standards of reliability and validity for the purpose to be served…Relevant evidence should be reported in the documentation supporting the claims and proposed uses of VAM results, including evidence that the tests used are a valid measure of growth [emphasis added] by measuring the actual subject matter being taught and the full range of student achievement represented in teachers’ classrooms” (p. 3).
  2. “VAM scores must be accompanied by separate lines of evidence of reliability and validity that support each [and every] claim and interpretative argument” (p. 3).
  3. “VAM scores must be based on multiple years of data from sufficient numbers of students…[Related,] VAM scores should always be accompanied by estimates of uncertainty to guard against [simplistic] overinterpretation[s] of [simple] differences” (p. 3).
  4. “VAM scores must only be calculated from scores on tests that are comparable over time…[In addition,] VAM scores should generally not be employed across transitions [to new, albeit different tests over time]” (AERA Council, 2015, p. 3).
  5. “VAM scores must not be calculated in grades or for subjects where there are not standardized assessments that are accompanied by evidence of their reliability and validity…When standardized assessment data are not available across all grades (K–12) and subjects (e.g., health, social studies) in a state or district, alternative measures (e.g., locally developed assessments, proxy measures, observational ratings) are often employed in those grades and subjects to implement VAM. Such alternative assessments should not be used unless they are accompanied by evidence of reliability and validity as required by the AERA, APA, and NCME Standards for Educational and Psychological Testing” (p. 3).
  6. “VAM scores must never be used alone or in isolation in educator or program evaluation systems…Other measures of practice and student outcomes should always be integrated into judgments about overall teacher effectiveness” (p. 3).
  7. “Evaluation systems using VAM must include ongoing monitoring for technical quality and validity of use…Ongoing monitoring is essential to any educator evaluation program and especially important for those incorporating indicators based on VAM that have only recently been employed widely. If authorizing bodies mandate the use of VAM, they, together with the organizations that implement and report results, are responsible for conducting the ongoing evaluation of both intended and unintended consequences. The monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use within a given evaluation system” (AERA Council, 2015, p. 3).
  8. “Evaluation reports and determinations based on VAM must include statistical estimates of error associated with student growth measures and any ratings or measures derived from them…There should be transparency with respect to VAM uses and the overall evaluation systems in which they are embedded. Reporting should include the rationale and methods used to estimate error and the precision associated with different VAM scores. Also, their reliability from year to year and course to course should be reported. Additionally, when cut scores or performance levels are established for the purpose of evaluative decisions, the methods used, as well as estimates of classification accuracy, should be documented and reported. Justification should [also] be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process…Dissemination should [also] include accessible formats that are widely available to the public, as well as to professionals” (pp. 3-4).
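For those wondering what “estimates of uncertainty” (requirement 3) and “statistical estimates of error” (requirement 8) might look like in the simplest possible case, here is a rough sketch in Python. It is not how any actual VAM (or the AERA statement) computes error; it simply attaches a standard error and an approximate 95% interval to a classroom's mean gain. The gain scores and the function `gain_with_interval` are hypothetical, and the 1.96 multiplier is a normal approximation that is crude for a class of eight students.

```python
from math import sqrt
from statistics import mean, stdev


def gain_with_interval(gains, z=1.96):
    """Return (mean gain, lower bound, upper bound) using a normal approximation.

    Assumption: gains are treated as a simple random sample, which real
    classrooms are not; this is for illustration only.
    """
    n = len(gains)
    estimate = mean(gains)
    standard_error = stdev(gains) / sqrt(n)
    return estimate, estimate - z * standard_error, estimate + z * standard_error


# Hypothetical student gain scores for two teachers (invented numbers).
teacher_a = [12, 18, 9, 22, 15, 11, 20, 13]
teacher_b = [10, 16, 8, 19, 14, 9, 17, 11]

a_est, a_low, a_high = gain_with_interval(teacher_a)
b_est, b_low, b_high = gain_with_interval(teacher_b)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = a_low <= b_high and b_low <= a_high

print(f"Teacher A: {a_est:.1f} ({a_low:.1f} to {a_high:.1f})")
print(f"Teacher B: {b_est:.1f} ({b_low:.1f} to {b_high:.1f})")
print("Intervals overlap:", intervals_overlap)
```

With these made-up numbers, Teacher A's mean gain is two points higher than Teacher B's, but the two intervals overlap, which is the kind of “simple difference” the statement cautions against overinterpreting.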

As per the conclusion: “The standards of practice in statistics and testing set a high technical bar for properly aggregating student assessment results for any purpose, especially those related to drawing inferences about teacher, school leader, or educator preparation program effectiveness” (p. 4). Accordingly, the AERA Council recommends that VAMs “not be used without sufficient evidence that this technical bar has been met in ways that support all claims, interpretative arguments, and uses (e.g., rankings, classification decisions)” (p. 4).

CITATION: AERA Council. (2015). AERA statement on use of value-added models (VAM) for the evaluation of educators and educator preparation programs. Educational Researcher, X(Y), 1-5. doi:10.3102/0013189X15618385 Retrieved from

Houston Board Candidates Respond to their Teacher Evaluation System

For a recent article in the Houston Chronicle, the newspaper sent 12 current candidates for the Houston Independent School District (HISD) School Board a series of questions about HISD, to which seven candidates responded. The seven candidates’ responses are of specific interest here in that HISD is the district well-known for attaching more high-stakes consequences to value-added output (e.g., teacher termination) than others (see for example here, here, and here). The seven candidates’ responses are of general interest in that the district uses the popular and (in)famous Education Value-Added Assessment System (EVAAS) for said purposes (see also here, here, and here). Accordingly, what these seven candidates have to say about the EVAAS and/or HISD’s teacher evaluation system might also be a sign of things to come, perhaps for the better, throughout HISD.

The questions are: (1) Do you support HISD’s current teacher evaluation system, which includes student test scores? Why or why not? What, if any, changes would you make? And (2) Do you support HISD’s current bonus system based on student test scores? Why or why not? What, if any, changes would you make? To see candidate names, their background information, their responses to other questions, etc. please read in full the article in the Houston Chronicle.

Here are the seven candidates’ responses to question #1:

  • I do not support the current teacher evaluation system. Teacher’s performance should not rely on the current formula using the evaluation system with the amount of weight placed on student test scores. Too many obstacles outside the classroom affect student learning today that are unfair in this system. Other means of support such as a community school model must be put in place to support the whole student, supporting student learning in the classroom (Fonseca).
  • No, I do not support the current teacher evaluation system, EVAAS, because it relies on an algorithm that no one understands. Testing should be diagnostic, not punitive. Teachers must have the freedom to teach basic math, reading, writing and science and not only teach to the test, which determines if they keep a job and/or get bonuses. Teachers should be evaluated on student growth. For example, did the third-grade teacher raise his/her non-reading third-grader to a higher level than that student read when he/she came into the teacher’s class? Did the teacher take time to figure out what non-educational obstacles the student had in order to address those needs so that the student began learning? Did the teacher coach the debate team and help the students become more well-rounded, and so on? Standardized tests in a vacuum indicate nothing (Jones).
  • I remember the time when teachers practically never revised test scores. Tests can be one of the best tools to help a child identify strengths and weakness. Students’ scores were filed, and no one ever checked them out from the archives. When student scores became part of their evaluation, teachers began to look into data more often. It is a magnificent tool for student and teacher growth. Having said that, I also believe that many variables that make a teacher great are not measured in his or her evaluation. There is nothing on character education for which teachers are greatly responsible. I do not know of a domain in the teacher’s evaluation that quite measures the art of teaching. Data is about the scientific part of teaching, but the art of teaching has to be evaluated by an expert at every school; we call them principals (Leal).
  • Student test scores were not designed to be used for this purpose. The use of students’ test scores to evaluate teachers has been discredited by researchers and statisticians. EVAAS and other value-added models are deeply flawed and should not be major components of a teacher evaluation system. The existing research indicates that 10-14 percent of students’ test scores are attributable to teacher factors. Therefore, I would support using student test scores (a measure of student achievement) as no more than 10-14 percent of teachers’ evaluations (McCoy).
  • No, I do not support the current teacher evaluation system, which includes student test scores, for the following reasons: 1) High-stakes decisions should not be made based on the basis of value-added scores alone. 2) The system is meant to assess and predict student performance with precision and reliability, but the data revealed that the EVAAS system is inconsistent and has consistent problems. 3) The EVAAS reports do not match the teachers’ “observation” PDAS scores [on the formal evaluation]; therefore, data is manipulated to show a relationship. 4) Most importantly, teachers cannot use the information generated as a formative tool because teachers receive the EVAAS reports in the summer or fall after the students leave their classroom. 5) Very few teachers realized that there was an HISD-sponsored professional development training linked to the EVAAS system to improve instruction. Changes that I will make are to make recommendations and confer with other board members to revamp the system or identify a more equitable system (McCullough).
  • The current teacher evaluation system should be reviewed and modified. While I believe we should test, it should only be a diagnostic measure of progress and indicator of deficiency for the purpose of aligned instruction. There should not be any high stakes attached for the student or the teacher. That opens the door for restricting teaching-to-test content and stifles the learning potential. If we have to have it, make it 5 percent. The classroom should be based on rich academic experiences, not memorization regurgitation (Skillern-Jones).
  • I support evaluating teachers on how well their students perform and grow, but I do not support high-stakes evaluation of teachers using a value-added test score that is based on the unreliable STAAR test. Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes decisions or compared across dissimilar student populations or schools. If we had a reliable test of student learning, I would support the use of value-added growth measures in a low-stakes fashion where measures of student growth are part of an integrated analysis of a teacher’s overall performance and practices. I strongly believe that teachers should be evaluated with an integrated set of measures that show what teachers do and what happens as a result. These measures may include meaningful evidence of student work and learning, pedagogy, classroom management, knowledge of content and even student surveys. Evaluators should be appropriately trained, and teachers should have regular evaluations with frequent feedback from strong mentors and professional development to strengthen their content knowledge and practice (Stipeche).

Here are the seven candidates’ responses to question #2:

  • I do not support the current bonus system based on student test scores as, again, teachers do not currently have support to affect what happens outside the classroom. Until we provide support, we cannot base teacher performance or bonuses on a heavy weight of test scores (Fonseca).
  • No, I do not support the current bonus system. Teachers who grow student achievement should receive bonuses, not just teachers whose students score well on tests. For example, a teacher who closes the educational achievement gap with a struggling student should earn a bonus before a teacher who has students who are not challenged and for whom learning is relatively easy. Teachers who grow their students in extracurricular activities should earn a bonus before a teacher that only focuses on education. Teachers that choose to teach in struggling schools should earn a bonus over a teacher that teaches in a school with non-struggling students. Teachers who work with their students in UIL participation, history fairs, debate, choir, student government and like activities should earn a bonus over a teacher who does not (Jones).
  • Extrinsic incentives killed creativity. I knew that from my counseling background, but in 2011 or 2010, Dr. Grier sent an email to school administrators with a link of a TED Talks video that contradicts any notion of giving monetary incentives to promote productivity in the classroom: Give incentives for perfect attendance or cooperation among teachers selected by teachers (Leal).
  • No. Student test scores were not designed to be used for this purpose. All teachers need salary increases (McCoy).
  • No, I do not support HISD’s current bonus system based on student test scores. Student test scores should be a diagnostic tool used to identify instructional gaps and improve student achievement. Not as a measure to reward teachers, because the process is flawed. I would work collaboratively to identify another system to reward teachers (McCullough).
  • The current bonus program does, in fact, reward teachers who students make significant academic gains. It leaves out those teachers who have students at the top of the achievement scale. By formulaic measures, it is flawed and the system, according to its creators, is being misused and misapplied. It would be beneficial overall to consider measures to expand the teacher population of recipients as well as to undertake measures to simplify the process if we keep it. I think a better focus would be to see how we can increase overall teacher salaries in a meaningful and impactful way to incentivize performance and longevity (Skillern-Jones).
  • No. I do not support the use of EVAAS in this manner. More importantly, ASPIRE has not closed the achievement gap nor dramatically improved the academic performance of all students in the district (Stipeche).

No responses, or none of any general substance, were received from Daniels, Davila, McKinzie, Smith, or Williams.

The Nation’s “Best Test” Scores Released: Test-Based Policies (Evidently) Not Working

From Diane Ravitch’s Blog (click here for direct link):

Sometimes events happen that seem to be disconnected, but after a few days or weeks, the pattern emerges. Consider this: On October 2, [U.S.] Secretary of Education Arne Duncan announced that he was resigning and planned to return to Chicago. Former New York Commissioner of Education John King, who is a clone of Duncan in terms of his belief in testing and charter schools, was designated to take Duncan’s place. On October 23, the Obama administration held a surprise news conference to declare that testing was out of control and should be reduced to not more than 2% of classroom time [see prior link on this announcement here]. Actually, that wasn’t a true reduction, because 2% translates into between 18-24 hours of testing, which is a staggering amount of annual testing for children in grades 3-8 and not different from the status quo in most states.

Disconnected events?

Not at all. Here comes the pattern-maker: the federal tests called the National Assessment of Educational Progress [NAEP] released its every-other-year report card in reading and math, and the results were dismal. There would be many excuses offered, many rationales, but the bottom line: the NAEP scores are an embarrassment to the Obama administration (and the George W. Bush administration that preceded it).

For nearly 15 years, Presidents Bush and Obama and the Congress have bet billions of dollars—both federal and state—on a strategy of testing, accountability, and choice. They believed that if every student was tested in reading and mathematics every year from grades 3 to 8, test scores would go up and up. In those schools where test scores did not go up, the principals and teachers would be fired and replaced. Where scores didn’t go up for five years in a row, the schools would be closed. Thousands of educators were fired, and thousands of public schools were closed, based on the theory that sticks and carrots, rewards and punishments, would improve education.

But the 2015 NAEP scores released today by the National Assessment Governing Board (a federal agency) showed that Arne Duncan’s $4.35 billion Race to the Top program had flopped. It also showed that George W. Bush’s No Child Left Behind was as phony as the “Texas education miracle” of 2000, which Bush touted as proof of his education credentials.

NAEP is an audit test. It is given every other year to samples of students in every state and in about 20 urban districts. No one can prepare for it, and no one gets a grade. NAEP measures the rise or fall of average scores for states in fourth grade and eighth grade in reading and math and reports them by race, gender, disability status, English language ability, economic status, and a variety of other measures.

The 2015 NAEP scores showed no gains nationally in either grade in either subject. In mathematics, scores declined in both grades, compared to 2013. In reading, scores were flat in grade 4 and lower in grade 8. Usually the Secretary of Education presides at a press conference where he points with pride to increases in certain grades or in certain states. Two years ago, Arne Duncan boasted about the gains made in Tennessee, which had won $500 million in Duncan’s Race to the Top competition. This year, Duncan had nothing to boast about.

In his Race to the Top program, Duncan made testing the primary purpose of education. Scores had to go up every year, because the entire nation was “racing to the top.” Only 12 states won a share of the $4.35 billion that Duncan was given by Congress: Tennessee and Delaware were first to win, in 2010. The next round, the following states won multi-millions of federal dollars to double down on testing: Maryland, Massachusetts, the District of Columbia, Florida, Georgia, Hawaii, New York, North Carolina, Ohio, and Rhode Island.

Tennessee, Duncan’s showcase state in 2013, made no gains in reading or mathematics, neither in fourth grade or eighth grade. The black-white test score gap was as large in 2015 as it had been in 1998, before either NCLB or the Race to the Top.

The results in mathematics were bleak across the nation, in both grades 4 and 8. The declines nationally were only 1 or 2 points, but they were significant in a national assessment on the scale of NAEP.

In fourth grade mathematics, the only jurisdictions to report gains were the District of Columbia, Mississippi, and the Department of Defense schools. Sixteen states had significant declines in their math scores, and thirty-three were flat in relation to 2013 scores. The scores in Tennessee (the $500 million winner) were flat.

In eighth grade, the lack of progress in mathematics was universal. Twenty-two states had significantly lower scores than in 2013, while 30 states or jurisdictions had flat scores. Pennsylvania, Kansas, and Florida (a Race to the Top winner), were the biggest losers, by dropping six points. Among the states that declined by four points were Race to the Top winners Ohio, North Carolina, and Massachusetts. Maryland, Hawaii, New York, and the District of Columbia lost two points. The scores in Tennessee were flat.

The District of Columbia made gains in fourth grade reading and mathematics, but not in eighth grade. It continues to have the largest score gap (56 points) between white and black students of any urban district in the nation. That is more than double the average of the other 20 urban districts. The state with the biggest achievement gap between black and white students is Wisconsin; it is also the state where black students have the lowest scores, lower than their peers in states like Mississippi and South Carolina. Wisconsin has invested heavily in vouchers and charter schools, which Governor Scott Walker intends to increase.

The best single word to describe NAEP 2015 is stagnation. Contrary to President George W. Bush’s law, many children have been left behind by the strategy of test-and-punish. Contrary to the Obama administration’s Race to the Top program, the mindless reliance on standardized testing has not brought us closer to some mythical “Top.”

No wonder Arne Duncan is leaving Washington. There is nothing to boast about, and the next set of NAEP results won’t be published until 2017. The program that he claimed would transform American education has not raised test scores, but has demoralized educators and created teacher shortages. Disgusted with the testing regime, experienced teachers leave and enrollments in teacher education programs fall. One can only dream about what the Obama administration might have accomplished had it spent that $5 billion in discretionary dollars to encourage states and districts to develop and implement realistic plans for desegregation of their schools, or had they invested the same amount of money in the arts.

The past dozen or so years have been a time when “reformers” like Arne Duncan, Michelle Rhee, Joel Klein, and Bill Gates proudly claimed that they were disrupting school systems and destroying the status quo. Now the “reformers” have become the status quo, and we have learned that disruption is not good for children or education.

Time is running out for this administration, and it is not likely that there will be any meaningful change of course in education policy. One can only hope that the next administration learns important lessons from the squandered resources and failure of NCLB and Race to the Top.

The Forgotten VAM: The A-F School Grading System

Here is another post from our “Concerned New Mexico Parent” (see prior posts from him/her here and here). This one is about New Mexico’s A-F School Grading System and how it is not only contradictory, within and beyond itself, but how it also provides little instrumental value to the public as an invalid indicator of the “quality” of any school.

(S)he writes:

  Q: What do you call a high school that has only 38% of its students proficient in reading and 35% of its students proficient in mathematics?
  A: A school that needs help improving its student scores.
  Q: What does the New Mexico Public Education Department (NMPED) call this same high school?
  A: A top-rated “A” school, of course.

Readers of this blog are familiar with the VAMs being used to grade teachers. Many states have implemented analogous formulas to grade entire schools. This “forgotten” VAM suffers from all of the familiar problems of the teacher formulas — incomprehensibility, lack of transparency, arbitrariness, and the like.

The first problem with the A-F Grading System is inherent in its very name. The “A-F” terminology implies that this one static assessment is an accurate representation of a school’s quality. As you will see, it is nothing of the sort.

The second problem with the A-F Grading System is that it is composed of benchmarks that are arbitrarily weighted and scored by the NMPED using VAM methodologies.

Thirdly, the “collapsing of the data” from a numeric score to a grade (corresponding to a range of values) causes valuable information to be lost.

Table 1 shows the range of values for reading and mathematics proficiencies for each of the five A-F grade categories for New Mexico schools.

Table 1: Ranges and Median of Reading and Mathematics Proficiencies by A-F School Grade

School A-F Grade   Number of Schools   Reading Proficiency Range   Mathematics Proficiency Range
A                  86                  37.90 – 94.00               31.50 – 95.70
B                  237                 16.90 – 90.90               4.90 – 90.90
C                  177                 0.00 – 83.80                0.00 – 76.20
D                  21                  4.50 – 64.60                2.20 – 70.00
F                  88                  7.80 – 52.30                3.30 – 40.90

For example, to earn an A rating, a school can have between 37.9% and 94.0% of its students proficient in reading. In other words, a school can have roughly two-thirds of its students fail reading proficiency yet be rated as an “A” school!

The median value listed shows the point which splits the group in half — one-half of the scores are below the median value. Thus, an “A” school median of 66.2% indicates that one-half of the “A” schools have a reading proficiency below 66.2%. In other words, in one-half of the “A” schools 1/3 or more of their students are NOT proficient in reading!

Amazingly, the figures for mathematics are even worse: the minimum mathematics proficiency for a B rating is only 4.9%! Scandalous!
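To make the arithmetic behind these ranges explicit, here is a minimal Python sketch that uses only the minimum and maximum values reported in Table 1. The dictionary and function names are illustrative assumptions and are not part of any NMPED tool or formula.

```python
# Minimum and maximum percent proficient by school grade, copied from Table 1.
# TABLE_1 and the helper names below are illustrative, not NMPED terminology.
TABLE_1 = {
    "A": {"reading": (37.90, 94.00), "math": (31.50, 95.70)},
    "B": {"reading": (16.90, 90.90), "math": (4.90, 90.90)},
    "C": {"reading": (0.00, 83.80), "math": (0.00, 76.20)},
    "D": {"reading": (4.50, 64.60), "math": (2.20, 70.00)},
    "F": {"reading": (7.80, 52.30), "math": (3.30, 40.90)},
}


def max_share_not_proficient(grade, subject):
    """Largest percent of non-proficient students at any school with this grade."""
    lowest, _highest = TABLE_1[grade][subject]
    return round(100.0 - lowest, 1)


def ranges_overlap(grade_a, grade_b, subject):
    """True if the two grades' proficiency ranges share any values."""
    low_a, high_a = TABLE_1[grade_a][subject]
    low_b, high_b = TABLE_1[grade_b][subject]
    return low_a <= high_b and low_b <= high_a


if __name__ == "__main__":
    for grade in TABLE_1:
        for subject in ("reading", "math"):
            worst = max_share_not_proficient(grade, subject)
            print(f"{grade} {subject}: up to {worst}% not proficient")
    print("A and F reading ranges overlap:", ranges_overlap("A", "F", "reading"))
```

The overlap check illustrates the same point from another angle: because the reading ranges for “A” and “F” schools overlap, a letter grade alone does not tell a reader which of two schools has the higher share of proficient students.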

Obviously, and contrary to popular and press perceptions, the A-F Grading System has nothing to do with the actual or current quality of the school!

A few case studies will highlight further absurdities of the New Mexico A-F School Grading System next.

Case Study 1: Highest “A” vs. Lowest “A” High School

Logan High School, Logan, New Mexico received the lowest reading proficiency of any “A” school, and the Albuquerque Institute of Math and Science received the highest reading proficiency score.

These two schools have both received an “A” rating. The Albuquerque Institute had a reading proficiency of 94% and a mathematics proficiency rating of 93%. Logan HS had a reading proficiency of only 38% and a mathematics proficiency rating of only 35%!

How is that possible?

First, much of the A-F VAM, like the teacher VAM, is based on multi-year growth calculations and predictions. Logan has plenty of opportunity for growth, whereas the Albuquerque Institute has “maxed” out most of its scores. Thus, the Albuquerque Institute is penalized in a manner analogous to Gifted and Talented teachers when teacher-level VAM is used. With already excellent scores, there is little, if any, room for improvement.

Second, Logan has an emphasis on shop/trade classes which yields a very high college and CAREER readiness score for the VAM calculation.

Also, a final factor is that the NMPED-defined range for an “A” extends from 75 to 100 points, and Logan barely made it into the A grouping.

Thus, a proficiency score of only 37.9% is no deterrent to an A score for Logan High.

Case Study 2: Hanging on by a Thread

As noted above, any school that scores between 75 and 100 points is considered an “A” school.

This statistical oddity was very beneficial to Hagerman High (Hagerman, NM) in their 2014 School Grade Report Card. They fell 5.99 points overall from the previous year’s score, but they managed to still receive an “A” score since their resulting 2014 score was exactly 75.01.

With this one one-hundredth of a point, they are in the same “A” grade category as the Albuquerque Institute of Math and Science (rated best in New Mexico by NMPED) and the Cottonwood Classical Preparatory School of Albuquerque (rated best in New Mexico by US News).
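The Hagerman numbers can be checked with a few lines of Python. This is only a sketch: the text states the “A” band (75 to 100 points) but not the cutoffs for the other grades, so only that band is modeled, and treating the 75-point boundary as inclusive is an assumption. The 74.99 score at the end is a hypothetical value added purely for contrast.

```python
from decimal import Decimal

# Only the "A" band (75 to 100 points) is stated in the text; cutoffs for the
# other grades are not given, so they are deliberately left out.
A_CUTOFF = Decimal("75")      # assumed to be inclusive
MAX_POINTS = Decimal("100")


def is_a_school(points):
    """True if an overall score falls inside the 75-100 point "A" band."""
    return A_CUTOFF <= points <= MAX_POINTS


hagerman_2014 = Decimal("75.01")
drop_from_prior_year = Decimal("5.99")
hagerman_2013 = hagerman_2014 + drop_from_prior_year  # implied prior-year score
margin = hagerman_2014 - A_CUTOFF                     # distance above the cutoff

print(hagerman_2013, is_a_school(hagerman_2013))  # 81.00 True
print(hagerman_2014, is_a_school(hagerman_2014))  # 75.01 True
print(margin)                                     # 0.01

# Hypothetical score two one-hundredths lower, shown only for contrast.
print(is_a_school(Decimal("74.99")))              # False
```

Decimal arithmetic is used so that the one one-hundredth margin is computed exactly rather than being blurred by binary floating-point rounding.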

Case Study 3: A Tale of Two Ranking Systems

This inaccuracy and arbitrariness of any A-F School Grading System was also apparent in a recent Albuquerque Journal News article (May 14, 2015) which reported on the most recent US News ratings of high schools nationwide.

The Journal reported on the top 12 high schools in New Mexico as rated by US News. It is not surprising that most were NMPED A-rated schools. What was unusual is that the 3rd and 5th US News highest rated schools in New Mexico (South Valley Academy and Albuquerque High, both in Albuquerque) were actually rated as B schools by the NMPED A-F School Grading System.

According to NMPED data, I tabulated at least forty-four (44) high schools that were rated as “A” schools with higher NMPED scores than South Valley Academy which had an NMPED score of 71.4.

None of these 44 higher NMPED scoring schools were rated above South Valley Academy by US News.

Case Study 4: Punitive Grading

Many school districts and school boards throughout New Mexico have adopted policies that prohibit punitive grading based on behavior. It is no longer possible to lower a student’s grade just because of their behavior. The grade should reflect classroom assessment only.

NMPED ignores this policy in the context of the A-F School Grading System. Schools were graded down one letter grade if they did not achieve 95% participation rates.

One such school was Mills Elementary in the Hobbs Municipal Schools District. Only 198 students were tested; the school fell 11 students short of the 95% mark and was penalized one “grade” level. Its grade was reduced from a “D” to an “F.” In fact, Mills Elementary’s proficiency scores were higher than those of the A-rated Logan High School discussed earlier.

The likely explanation is that Hobbs has a highly transient population with both seasonal farm laborers and oil-field workers predominating in the local economy.
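The participation figures for Mills Elementary can be worked through in a short Python sketch. The school's enrollment is not reported above, so the code only infers which enrollment counts would be consistent with “198 tested, 11 short of 95%,” assuming the 95% target is rounded up to whole students. The function names are illustrative and do not come from NMPED.

```python
def required_tested(enrolled, rate_percent=95):
    """Smallest number of tested students that meets the participation rate."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-rate_percent * enrolled // 100)


def enrollments_consistent_with(tested, shortfall, rate_percent=95):
    """Enrollment counts for which `tested` students fall `shortfall` short."""
    target = tested + shortfall
    # Enrollment cannot be smaller than the target; search a generous window.
    return [n for n in range(target, 2 * target)
            if required_tested(n, rate_percent) == target]


def lower_one_grade(grade):
    """Apply the one-letter penalty; an "F" cannot drop any further."""
    order = ["A", "B", "C", "D", "F"]
    return order[min(order.index(grade) + 1, len(order) - 1)]


candidates = enrollments_consistent_with(tested=198, shortfall=11)
print(candidates)                      # [219, 220]
for n in candidates:
    print(n, f"{198 / n:.1%}")         # 219 90.4%, then 220 90.0%
print(lower_one_grade("D"))            # F
```

Under these assumptions, the school tested roughly 90% of an enrollment of about 220 students, and that shortfall alone moved its grade from a “D” to an “F.”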

For more urban schools, it will be interesting to see how the NMPED policy of punitive grading will play out with the increasingly popular Opt-Out movement.


It is apparent that the NMPED’s A-F School Grading System rates schools deceptively using VAM-augmented data and provides little of any value to the public as to the “quality” of a school. By presenting it in the form of an “NMPED School Grade Report Card” the state seeks to hide its arbitrary nature.

Such a useless grade should certainly not be used to declare a school a “failure” and in need of radical reform.

VAMboozled’s Two-Year Anniversary

It’s our two year anniversary, so I thought I would share our current stats and our thanks to all who are following (n ≥ 15,000), and also sharing out our independent, open-access, research- and community-based content. We still have a lot of work to be done in terms of America’s test-based teacher evaluation systems, but I feel like we are certainly having a positive impact on the nation writ large, again, with thanks to you all!!

On that note, if there are others (e.g., teachers, students, parents, administrators, school board members, policy advisers, policymakers) who you might know but who might not be following, please do also share and recommend.

Here are our stats (that are also available on our About page):

November, 2013: Blog went live
May, 2014: Subscribers ≈ 3,000; Hits per month ≈ 50,000*
November, 2014: Subscribers ≈ 8,000; Hits per month ≈ 100,000*
May, 2015: Subscribers ≈ 13,000; Hits per month ≈ 160,000*
November, 2015: Subscribers ≈ 15,000; Hits per month ≈ 180,000*


We have also made public 295 posts, to date, averaging 2.8 posts per week and 12.3 posts per month.

*This number is calculated as (subscribers × average number of posts per month) + external hits per month, although external hits may also include subscribers, as analytics cannot differentiate between subscribers and hits. These two indicators are not mutually exclusive.
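For readers who want to reproduce the posting averages, here is a minimal Python sketch of the footnote formula and the per-week and per-month rates. It assumes the blog had been live for 24 months (November 2013 through November 2015) and that a year has 52 weeks; the function name is illustrative.

```python
def estimated_hits_per_month(subscribers, posts_per_month, external_hits):
    """Footnote formula: subscriber deliveries plus external hits."""
    return subscribers * posts_per_month + external_hits


total_posts = 295
months_live = 24                    # assumed: November 2013 to November 2015
weeks_live = months_live * 52 / 12  # assumes 52 weeks per year

posts_per_month = round(total_posts / months_live, 1)
posts_per_week = round(total_posts / weeks_live, 1)

print(posts_per_month)  # 12.3
print(posts_per_week)   # 2.8
```

Because the subscriber and hit counts above are approximations, and because external hits may include subscribers, the formula is best read as a rough estimate rather than an exact count.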