Some Lawmakers Reconsidering VAMs in the South

A few weeks ago in Education Week, Stephen Sawchuk and Emmanuel Felton wrote a post in its Teacher Beat blog about lawmakers, particularly in the southern states, who are beginning to reconsider, via legislation, the role of test scores and value-added measures in their states’ teacher evaluation systems. Perhaps the tides are turning.

I tweeted this one out, but I also pasted this (short) one below to make sure you all, especially those of you teaching and/or residing in states like Georgia, Oklahoma, Louisiana, Tennessee, and Virginia, did not miss it.

Southern Lawmakers Reconsidering Role of Test Scores in Teacher Evaluations

After years of fierce debates over the effectiveness and fairness of the methodology, several southern lawmakers are looking to minimize the weight placed on so-called value-added measures, derived from how much students’ test scores changed, in teacher-evaluation systems.

In part because these states are home to some of the weakest teachers unions in the country, southern policymakers were able to push past arguments that the state tests were ill-suited for teacher-evaluation purposes and that the system would punish teachers for working in the toughest classrooms. States like Louisiana, Georgia, and Tennessee became some of the earliest and strongest adopters of the practice. But in the past few weeks, lawmakers from Baton Rouge, La., to Atlanta have introduced bills to limit the practice.

In February, the Georgia Senate unanimously passed a bill that would reduce the student-growth component from 50 percent of a teacher’s evaluation down to 30 percent. Earlier this week, nearly 30 individuals signed up to speak on behalf of the bill at a State House hearing.

Similarly, Louisiana House Bill 479 would reduce student-growth weight from 50 percent to 35 percent. Tennessee House Bill 1453 would reduce the weight of student-growth data through the 2018-2019 school year and would require the state Board of Education to produce a report evaluating the policy’s ongoing effectiveness. Lawmakers in Florida, Kentucky, and Oklahoma have introduced similar bills, according to the Southern Regional Education Board’s 2016 educator-effectiveness bill tracker.

By and large, states adopted these test-score-centric teacher-evaluation systems to attain waivers from No Child Left Behind’s requirement that all students be proficient by 2014. To get a waiver, states had to adopt systems that evaluated teachers “in significant part, based on student growth.” That has looked very different from state to state, ranging from 20 percent in Utah to 50 percent in states like Alaska, Tennessee, and Louisiana.

No Child Left Behind’s replacement, the Every Student Succeeds Act, doesn’t require states to have a teacher-evaluation system at all, but, as my colleague Stephen Sawchuk reported, the nation’s state superintendents say they remain committed to maintaining systems that regularly review teachers.

But, as Sawchuk reported, Steven Staples, Virginia’s state superintendent, signaled that his state may move away from its current system where student test scores make up 40 percent of a teacher’s evaluation:

“What we’ve found is that through our experience [with the NCLB waivers], we have had some unintended outcomes. The biggest one is that there’s an over-reliance on a single measure; too many of our divisions defaulted to the statewide standardized test … and their feedback was that because that was a focus [of the federal government], they felt they needed to emphasize that, ignoring some other factors. It also drove a real emphasis on a summative, final evaluation. And it resulted in our best teachers running away from our most challenged.”

Some state lawmakers appear to be absorbing a similar message.

Everything is Bigger (and Badder) in Texas: Houston’s Teacher Value-Added System

Last November, I published a post about “Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation.” Thereafter, I engaged some of my former doctoral students to further explore some data from Houston Independent School District (HISD), and what we collectively found and wrote up was just published in the highly esteemed Teachers College Record journal (Amrein-Beardsley, Collins, Holloway-Libell, & Paufler, 2016). To view the full commentary, please click here.

In this commentary we discuss HISD’s highest-stakes use of its Education Value-Added Assessment System (EVAAS) data – the value-added system HISD pays for at an approximate rate of $500,000 per year. This district has used its EVAAS data for more consequential purposes (e.g., teacher merit pay and termination) than any other state or district in the nation; hence, HISD is well known for its “big use” of “big data” to reform and inform improved student learning and achievement throughout the district.

We note in this commentary, however, that as per the evidence, and more specifically the recent release of Texas’s large-scale standardized test scores, perhaps attaching such high-stakes consequences to teachers’ EVAAS output in Houston is not working as district leaders have, now for years, intended. See, for example, the recent test-based evidence comparing the state of Texas v. HISD, illustrated below.

Figure 1

“Perhaps the district’s EVAAS system is not as much of an “educational-improvement and performance-management model that engages all employees in creating a culture of excellence” as the district suggests (HISD, n.d.a).” Perhaps, as well, we should “ponder the specific model used by HISD—the aforementioned EVAAS—and [EVAAS modelers’] perpetual claims that this model helps teachers become more “proactive [while] making sound instructional choices;” helps teachers use “resources more strategically to ensure that every student has the chance to succeed;” or “provides valuable diagnostic information about [teachers’ instructional] practices” so as to ultimately improve student learning and achievement (SAS Institute Inc., n.d.).”

The bottom line, though, is that “Even the simplest evidence presented above should at the very least make us question this particular value-added system, as paid for, supported, and applied in Houston for some of the biggest and baddest teacher-level consequences in town.” See, again, the full text and another, similar graph in the commentary, linked here.

*****

References:

Amrein-Beardsley, A., Collins, C., Holloway-Libell, J., & Paufler, N. A. (2016). Everything is bigger (and badder) in Texas: Houston’s teacher value-added system. [Commentary]. Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=18983

Houston Independent School District (HISD). (n.d.a). ASPIRE: Accelerating Student Progress Increasing Results & Expectations: Welcome to the ASPIRE Portal. Retrieved from http://portal.battelleforkids.org/Aspire/home.html

SAS Institute Inc. (n.d.). SAS® EVAAS® for K–12: Assess and predict student performance with precision and reliability. Retrieved from www.sas.com/govedu/edu/k12/evaas/index.html

A Retired Massachusetts Principal on her Teachers’ “Value-Added”

A retired Massachusetts principal, Linda Murdock, published a post on her blog, “Murdock’s EduCorner,” about her experiences, as a principal, with “value-added,” or more specifically, in her state, the use of Student Growth Percentile (SGP) scores to estimate said “value-added.” It’s certainly worth reading, as what we continue to find in the research on value-added models (VAMs) is also being realized by practitioners in the schools required to use value-added output such as this. In this case, for example, while Murdock does not discuss the technical terms we use in the research (e.g., reliability, validity, and bias), she discusses these in pragmatic, real terms (e.g., year-to-year fluctuations, lack of relationship between SGP scores and other indicators of teacher effectiveness, and the extent to which certain sets of students can hinder teachers’ demonstrated growth or value-added, respectively). Hence, do give her post a read here, and also pasted in full below. Do also pay special attention to the bulleted sections in which she discusses these and other issues on a case-by-case basis.

Murdock writes:

At the end of the last school year, I was chatting with two excellent teachers, and our conversation turned to the new state-mandated teacher evaluation system and its use of student “growth scores” (“Student Growth Percentiles” or “SGPs” in Massachusetts) to measure a teacher’s “impact on student learning.”

“Guess we didn’t have much of an impact this year,” said one teacher.

The other teacher added, “It makes you feel about this high,” showing a tiny space between her thumb and forefinger.

Throughout the school, comments were similar — indicating that a major “impact” of the new evaluation system is demoralizing and discouraging teachers. (How do I know, by the way, that these two teachers are excellent? I know because I worked with them as their principal – being in their classrooms, observing and offering feedback, talking to parents and students, and reviewing products demonstrating their students’ learning – all valuable ways of assessing a teacher’s “impact”.)

According to the Massachusetts Department of Elementary and Secondary Education (“DESE”), the new evaluation system’s goals include promoting the “growth and development of leaders and teachers,” and recognizing “excellence in teaching and leading.” The DESE website indicates that the DESE considers a teacher’s median SGP as an appropriate measure of that teacher’s “impact on student learning”:

“ESE has confidence that SGPs are a high quality measure of student growth. While the precision of a median SGP decreases with fewer students, median SGP based on 8-19 students still provides quality information that can be included in making a determination of an educator’s impact on students.”

Given the many concerns about the use of “value-added measurement” tools (such as SGPs) in teacher evaluation, this confidence is difficult to understand, particularly as applied to real teachers in real schools. Considerable research notes the imprecision and variability of these measures as applied to the evaluation of individual teachers. On the other side, experts argue that use of an “imperfect measure” is better than past evaluation methods. Theories aside, I believe that the actual impact of this “measure” on real people in real schools is important.

As a principal, when I first heard of SGPs I was curious. I wondered whether the data would actually filter out other factors affecting student performance, such as learning disabilities, English language proficiency, or behavioral challenges, and I wondered if the data would give me additional information useful in evaluating teachers.

Unfortunately, I found that SGPs did not provide useful information about student growth or learning, and median SGPs were inconsistent and not correlated with teaching skill, at least for the teachers with whom I was working. In two consecutive years of SGP data from our Massachusetts elementary school:

  • One 4th grade teacher had median SGPs of 37 (ELA) and 36 (math) in one year, and 61.5 and 79 the next year. The first year’s class included students with disabilities and the next year’s did not.
  • Two 4th grade teachers who co-teach their combined classes (teaching together, all students, all subjects) had widely differing median SGPs: one teacher had SGPs of 44 (ELA) and 42 (math) in the first year and 40 and 62.5 in the second, while the other teacher had SGPs of 61 and 50 in the first year and 41 and 45 in the second.
  • A 5th grade teacher had median SGPs of 72.5 and 64 for two math classes in the first year, and 48.5, 26, and 57 for three math classes in the following year. The second year’s classes included students with disabilities and English language learners, but the first year’s did not.
  • Another 5th grade teacher had median SGPs of 45 and 43 for two ELA classes in the first year, and 72 and 64 in the second year. The first year’s classes included students with disabilities and students with behavioral challenges while the second year’s classes did not.
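To make the size of these swings concrete, here is a minimal sketch in Python that computes the year-over-year change in median SGP for the first two cases above. The numbers come directly from the bullets. The teacher labels (including "co-teacher A" and "co-teacher B") and the function name are only illustrative, and this is simple arithmetic on the reported medians, not how SGPs themselves are calculated.

```python
# Median SGPs as (year 1, year 2), copied from the bulleted cases above.
# Teacher labels are illustrative placeholders, not names from the post.
median_sgps = {
    "4th grade teacher": {"ELA": (37, 61.5), "math": (36, 79)},
    "co-teacher A": {"ELA": (44, 40), "math": (42, 62.5)},
    "co-teacher B": {"ELA": (61, 41), "math": (50, 45)},
}


def year_over_year_changes(data):
    """Return {(teacher, subject): year 2 median minus year 1 median}."""
    return {
        (teacher, subject): second - first
        for teacher, subjects in data.items()
        for subject, (first, second) in subjects.items()
    }


changes = year_over_year_changes(median_sgps)
for (teacher, subject), delta in changes.items():
    print(f"{teacher:18s} {subject:5s} {delta:+.1f}")
```

Note that the two co-teachers, who taught the same students together, move in opposite directions in math (+20.5 versus -5), which is the inconsistency the bullets describe.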

As an experienced observer/evaluator, I found that median SGPs did not correlate with teachers’ teaching skills but varied with class composition. Stronger teachers had the same range of SGPs in their classes as teachers with weaker skills, and median SGPs for a new teacher with a less challenging class were higher than median SGPs for a highly skilled veteran teacher with a class that included English language learners.

Furthermore, SGP data did not provide useful information regarding student growth. In analyzing students’ SGPs, I noticed obvious general patterns: students with disabilities had lower SGPs than students without disabilities, English language learners had lower SGPs than students fluent in English, students who had some kind of trauma that year (e.g., parents’ divorce) had lower SGPs, and students with behavioral/social issues had lower SGPs. SGPs were correlated strongly with test performance: in one year, for example, the median ELA SGP for students in the “Advanced” category was 88, compared with 51.5 for “Proficient” students, 19.5 for “Needs Improvement,” and 5 for the “Warning” category.

There were also wide swings in student SGPs, not explainable except perhaps by differences in student performance on particular test days. One student with disabilities had an SGP of 1 in the first year and 71 in the next, while another student had SGPs of 4 in ELA and 94 in math in 4th grade and SGPs of 50 in ELA and 4 in math in 5th grade, both with consistent district test scores.

So how does this “information” impact real people in a real school?  As a principal, I found that it added nothing to what I already knew about the teaching and learning in my school. Using these numbers for teacher evaluation does, however, negatively impact schools: it demoralizes and discourages teachers, and it has the potential to affect class and teacher assignments.

In real schools, student and teacher assignments are not random. Students are grouped for specific purposes, and teachers are assigned classes for particular reasons. Students with disabilities and English language learners are often grouped to allow specialists, such as the speech/language teacher or the ELL teacher, to work more effectively with them. Students with behavioral issues are sometimes placed in special classes, and are often assigned to teachers who work particularly well with them. Leveled classes (AP, honors, remedial) create different student combinations, and teachers are assigned particular classes based on the administrator’s judgment of which teachers will do the best with which classes. For example, I would assign new or struggling teachers less challenging classes so I could work successfully with them on improving their skills.

In the past, when I told a teacher that he/she had a particularly challenging class, because he/she could best work with these students, he/she generally cheerfully accepted the challenge, and felt complimented on his/her skills. Now, that teacher could be concerned about the effect of that class on his/her evaluation. Teachers may be reluctant to teach lower level courses, or to work with English language learners or students with behavioral issues, and administrators may hesitate to assign the most challenging classes to the most skilled teachers.

In short, in my experience, the use of this type of “value-added” measurement provides no useful information and has a negative impact on real teachers and real administrators in real schools. If “data” is not only not useful, but actively harmful, to those who are supposedly benefitting from using it, what is the point? Why is this continuing?

You Are Invited to Participate in the #HowMuchTesting Debate!

As the scholarly debate about the extent and purpose of educational testing rages on, the American Educational Research Association (AERA) wants to hear from you. During a key session at its Centennial Conference this spring in Washington DC, titled How Much Testing and for What Purpose? Public Scholarship in the Debate about Educational Assessment and Accountability, prominent educational researchers will respond to questions and concerns raised by YOU: parents, students, teachers, community members, and the public at large.

Hence, any and all of you with an interest in testing, value-added modeling, educational assessment, educational accountability policies, and the like are invited to post your questions, concerns, and comments using the hashtag #HowMuchTesting on Twitter, Facebook, Instagram, Google+, or the social media platform of your choice, as these are the posts to which AERA’s panelists will respond.

Organizers are interested in all #HowMuchTesting posts, but they are particularly interested in video-recorded questions and comments of 30 – 45 seconds in duration so that you can ask your own questions, rather than having them read by a moderator. In addition, in order to provide ample time for the panel of experts to prepare for the discussion, comments and questions posted by March 17 have the best chances for inclusion in the debate.

Thank you all in advance for your contributions!!

To read more about this session, from the session’s organizer, click here.

Is Alabama the New, New Mexico?

In Alabama, the Grand Old Party (GOP) has put forth a draft bill to be entitled as an act and ultimately called the Rewarding Advancement in Instruction and Student Excellence (RAISE) Act of 2016. The purpose of the act will be to…wait for it…use test scores to grade and pay teachers annual bonuses (i.e., “supplements”) as per their performance. More specifically, the bill is to “provide a procedure for observing and evaluating teachers” to help make “significant differentiation[s] in pay, retention, promotion, dismissals, and other staffing decisions, including transfers, placements, and preferences in the event of reductions in force, [as] primarily [based] on evaluation results.” Related, Alabama districts may no longer use teachers’ “seniority, degrees, or credentials as a basis for determining pay or making the retention, promotion, dismissal, and staffing decisions.” Genius!

Accordingly, Larry Lee, whose blog is based on the foundation that “education is everyone’s business,” sent me this bill to review, critique, and help make everyone’s business. I attach it here for others who are interested, but I also summarize and critique its most relevant (but also contemptible) issues below.

For the Alabama teachers who are eligible, they are (after a staggered period of time) to be primarily evaluated (i.e., for up to 45% of a teacher’s total evaluation score) on the extent to which they purportedly cause student growth in achievement, with student growth being defined as the teachers’ purported impacts on “[t]he change in achievement for an individual student between two or more points in time.” Teachers are also to be observed at least twice per year (i.e., for up to 45% of a teacher’s total evaluation score), by their appropriate and appropriately trained evaluators/supervisors, and an unnamed and undefined set of parent and student surveys are to be used to evaluate the teachers (i.e., up to 15% of a teacher’s total evaluation score).
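To make the weighting concrete, here is a minimal sketch in Python of how such a composite score might be combined. Only the three ceilings (up to 45%, 45%, and 15%) come from the draft bill as described above. Because those ceilings sum to 105%, the actual weights cannot all sit at their maximum, so the example weights, the 0-100 component scale, the component scores, and the function name are all hypothetical assumptions for illustration, not anything specified in the bill.

```python
# Ceilings per component as described in the draft bill (fractions of the total).
# Everything else in this sketch is a hypothetical assumption.
CAPS = {"student_growth": 0.45, "observations": 0.45, "surveys": 0.15}


def composite_score(weights, scores):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    if set(weights) != set(CAPS) or set(scores) != set(CAPS):
        raise ValueError("weights and scores must cover exactly the three components")
    for name, weight in weights.items():
        if not 0 <= weight <= CAPS[name]:
            raise ValueError(f"{name} weight {weight} is outside 0..{CAPS[name]}")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in CAPS)


# Hypothetical teacher: strong observations and surveys, low growth score.
example_weights = {"student_growth": 0.45, "observations": 0.40, "surveys": 0.15}
example_scores = {"student_growth": 40, "observations": 85, "surveys": 80}
result = composite_score(example_weights, example_scores)
print(round(result, 1))
```

Under these assumed weights, the hypothetical teacher’s growth score contributes 18 points of a composite of roughly 64, so the growth component alone pulls a teacher with otherwise strong ratings well below those ratings.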

Again, no real surprises here as the adoption of such measures is common among states like Alabama (and New Mexico), but when these components are explained in more detail is where things really go awry.

“For grade levels and subjects for which student standardized assessment data is not available and for teachers for whom student standardized assessment data is not available, the [state’s] department [of education] shall establish a list of preapproved options for governing boards to utilize to measure student growth.” This is precisely what has gotten the whole state of New Mexico wrapped up in, and currently losing, its ongoing lawsuit (see my most recent post on this here). While providing districts with menus of preapproved assessment options might make sense to policymakers, any self-respecting researcher or even assessment commoner should know why this is entirely inappropriate. To read more about this, the best research study explaining why doing just this will set any state up for lawsuits comes from Brown University’s John Papay in his highly esteemed and highly cited “Different tests, different answers: The stability of teacher value-added estimates across outcome measures” article. The title of this research article alone should explain enough why simply positioning and offering up such tests in such casual (and quite careless) ways makes way for legal recourse.

Otherwise, the only test mentioned that is also to be used to measure teachers’ purported impacts on student growth is the ACT Aspire – the ACT test corporation’s “college and career readiness” test that is aligned to and connected with their more familiar college-entrance ACT. This, too, was one of the sources of the aforementioned lawsuit in New Mexico in terms of what we call content validity, in that states cannot simply pull in tests that are not adequately aligned with a state’s curriculum (e.g., I could find no information about the alignment of the ACT Aspire to Alabama’s curriculum here, which is also highly problematic as this information should definitely be available) and that have not been validated for such purposes (i.e., to measure teachers’ impacts on student growth).

Regardless of the tests, however, all of the secondary measures to be used to evaluate Alabama teachers (e.g., student and parent survey scores, observational scores) are also to be “correlated with impacts on student achievement results.” We’ve also increasingly seen this becoming the case across the nation, whereby state/district leaders are not simply assessing whether these indicators are independently correlated, which they should be if they all, in fact, help to measure our construct of interest (i.e., teacher effectiveness), but are rather manufacturing and forcing these correlations via what I have termed “artificial conflation” strategies (see also a recent post here about how this is one of the fundamental and critical points of litigation in Houston).

The state is apparently also set on going “all in” on evaluating their principals in many of the same ways, although I did not critique those sections for this particular post.

Most importantly, though, for those of you who have access to such leaders in Alabama, do send them this post so they might be a bit more proactive, and appropriately more careful and cautious, before going down this poor educational policy path. While I do embrace my professional responsibility as a public scholar to be called to court to testify about all of this when such high-stakes consequences are ultimately, yet inappropriately, based upon invalid inferences, I’d much rather be proactive in this regard and save states and states’ taxpayers their time and money, respectively.

Accordingly, I see the state is also to put out a request for proposals to retain an external contractor to help them measure said student growth and teachers’ purported impacts on it. I would also be more than happy to help the state negotiate this contract, much more wisely than so many other states and districts have negotiated similar contracts thus far (e.g., without asking for reliability and validity evidence as a contractual deliverable)…should this poor educational policy actually come to fruition.

Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation

States of attention on this blog, and often of (dis)honorable mention as per their state-level policies bent on value-added models (VAMs), include Florida, New York, Tennessee, and New Mexico. As for a quick update about the latter state of New Mexico, we are still waiting to hear the final decision from the judge who recently heard the state-level lawsuit still pending on this matter in New Mexico (see prior posts about this case here, here, here, here, and here).

Another locale of great interest, though, is the Houston Independent School District. This is the seventh largest urban school district in the nation, and the district that has tied more high-stakes consequences to its value-added output than any other district/state in the nation. These “initiatives” were “led” by soon-to-resign/retire Superintendent Terry Grier who, during his time in Houston (2009-2015), implemented some of the harshest consequences ever attached to teacher-level value-added output, as per the district’s use of the Education Value-Added Assessment System (EVAAS) (see other posts about the EVAAS here, here, and here; see other posts about Houston here, here, and here).

In fact, the EVAAS is still used throughout Houston today to evaluate all EVAAS-eligible teachers, and to “reform” the district’s historically low-performing schools, by tying teachers’ purported value-added performance to teacher improvement plans, merit pay, nonrenewal, and termination (e.g., 221 Houston teachers were terminated “in large part” due to their EVAAS scores in 2011). However, with litigation pending (i.e., this is the district in which the American and Houston Federations of Teachers (AFT/HFT) are currently suing the district for its wrongful use of, and over-emphasis on, this particular VAM; see here), Superintendent Grier and the district have recoiled on some of the high-stakes consequences they formerly attached to the EVAAS. This particular lawsuit is to commence this spring/summer.

Nonetheless, my most recent post about Houston was about some of its future school board candidates, who were invited by The Houston Chronicle to respond to Superintendent Grier’s teacher evaluation system. For the most part, those who responded did so unfavorably, especially as the evaluation system was/is disproportionately reliant on teachers’ EVAAS data and high-stakes use of these data in particular (see here).

Most recently, however, as per a “split” decision registered by Houston’s current school board (i.e., 4:3, and without any new members elected last November), Superintendent Grier received a $98,600 bonus for his “satisfactory evaluation” as the school district’s superintendent. See more from the full article published in The Houston Chronicle. As per the same article, Superintendent “Grier’s base salary is $300,000, plus $19,200 for car and technology allowances. He also is paid for unused leave time.”

More importantly, take a look at the two figures below, taken from actual district reports (see references below), highlighting Houston’s performance (declining, on average, in blue) as compared to the state of Texas (maintaining, on average, in black), to determine for yourself whether Superintendent Grier, indeed, deserved such a bonus (not to mention salary).

Another question to ponder is whether the district’s use of the EVAAS value-added system, especially since Superintendent Grier’s arrival in 2009, is actually reforming the school district as he and other district leaders have for so long now intended.

Figure 1. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR tests, 2012-2015 (HISD, 2015a)

Figure 2. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR End-of-Course (EOC) tests, 2012-2015 (HISD, 2015b)

References:

Houston Independent School District (HISD). (2015a). State of Texas Assessments of Academic Readiness (STAAR) performance, grades 3-8, spring 2015. Retrieved here.

Houston Independent School District (HISD). (2015b). State of Texas Assessments of Academic Readiness (STAAR) end-of-course results, spring 2015. Retrieved here.

Houston Board Candidates Respond to their Teacher Evaluation System

For a recent article in the Houston Chronicle, the newspaper sent 12 current candidates for the Houston Independent School District (HISD) School Board a series of questions about HISD, to which seven candidates responded. The seven candidates’ responses are of specific interest here in that HISD is the district well-known for attaching more high-stakes consequences to value-added output (e.g., teacher termination) than others (see for example here, here, and here). The seven candidates’ responses are of general interest in that the district uses the popular and (in)famous Education Value-Added Assessment System (EVAAS) for said purposes (see also here, here, and here). Accordingly, what these seven candidates have to say about the EVAAS and/or HISD’s teacher evaluation system might also be a sign of things to come, perhaps for the better, throughout HISD.

The questions are: (1) Do you support HISD’s current teacher evaluation system, which includes student test scores? Why or why not? What, if any, changes would you make? And (2) Do you support HISD’s current bonus system based on student test scores? Why or why not? What, if any, changes would you make? To see candidate names, their background information, their responses to other questions, etc. please read in full the article in the Houston Chronicle.

Here are the seven candidates’ responses to question #1:

  • I do not support the current teacher evaluation system. Teacher’s performance should not rely on the current formula using the evaluation system with the amount of weight placed on student test scores. Too many obstacles outside the classroom affect student learning today that are unfair in this system. Other means of support such as a community school model must be put in place to support the whole student, supporting student learning in the classroom (Fonseca).
  • No, I do not support the current teacher evaluation system, EVAAS, because it relies on an algorithm that no one understands. Testing should be diagnostic, not punitive. Teachers must have the freedom to teach basic math, reading, writing and science and not only teach to the test, which determines if they keep a job and/or get bonuses. Teachers should be evaluated on student growth. For example, did the third-grade teacher raise his/her non-reading third-grader to a higher level than that student read when he/she came into the teacher’s class? Did the teacher take time to figure out what non-educational obstacles the student had in order to address those needs so that the student began learning? Did the teacher coach the debate team and help the students become more well-rounded, and so on? Standardized tests in a vacuum indicate nothing (Jones).
  • I remember the time when teachers practically never revised test scores. Tests can be one of the best tools to help a child identify strengths and weakness. Students’ scores were filed, and no one ever checked them out from the archives. When student scores became part of their evaluation, teachers began to look into data more often. It is a magnificent tool for student and teacher growth. Having said that, I also believe that many variables that make a teacher great are not measured in his or her evaluation. There is nothing on character education for which teachers are greatly responsible. I do not know of a domain in the teacher’s evaluation that quite measures the art of teaching. Data is about the scientific part of teaching, but the art of teaching has to be evaluated by an expert at every school; we call them principals (Leal).
  • Student test scores were not designed to be used for this purpose. The use of students’ test scores to evaluate teachers has been discredited by researchers and statisticians. EVAAS and other value-added models are deeply flawed and should not be major components of a teacher evaluation system. The existing research indicates that 10-14 percent of students’ test scores are attributable to teacher factors. Therefore, I would support using student test scores (a measure of student achievement) as no more than 10-14 percent of teachers’ evaluations (McCoy).
  • No, I do not support the current teacher evaluation system, which includes student test scores, for the following reasons: 1) High-stakes decisions should not be made on the basis of value-added scores alone. 2) The system is meant to assess and predict student performance with precision and reliability, but the data revealed that the EVAAS system is inconsistent and has consistent problems. 3) The EVAAS reports do not match the teachers’ “observation” PDAS scores [on the formal evaluation]; therefore, data is manipulated to show a relationship. 4) Most importantly, teachers cannot use the information generated as a formative tool because teachers receive the EVAAS reports in the summer or fall after the students leave their classroom. 5) Very few teachers realized that there was an HISD-sponsored professional development training linked to the EVAAS system to improve instruction. Changes that I will make are to make recommendations and confer with other board members to revamp the system or identify a more equitable system (McCullough).
  • The current teacher evaluation system should be reviewed and modified. While I believe we should test, it should only be a diagnostic measure of progress and indicator of deficiency for the purpose of aligned instruction. There should not be any high stakes attached for the student or the teacher. That opens the door for restricting teaching-to-test content and stifles the learning potential. If we have to have it, make it 5 percent. The classroom should be based on rich academic experiences, not memorization regurgitation (Skillern-Jones).
  • I support evaluating teachers on how well their students perform and grow, but I do not support high-stakes evaluation of teachers using a value-added test score that is based on the unreliable STAAR test. Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes decisions or compared across dissimilar student populations or schools. If we had a reliable test of student learning, I would support the use of value-added growth measures in a low-stakes fashion where measures of student growth are part of an integrated analysis of a teacher’s overall performance and practices. I strongly believe that teachers should be evaluated with an integrated set of measures that show what teachers do and what happens as a result. These measures may include meaningful evidence of student work and learning, pedagogy, classroom management, knowledge of content and even student surveys. Evaluators should be appropriately trained, and teachers should have regular evaluations with frequent feedback from strong mentors and professional development to strengthen their content knowledge and practice (Stipeche).

Here are the seven candidates’ responses to question #2:

  • I do not support the current bonus system based on student test scores as, again, teachers do not currently have support to affect what happens outside the classroom. Until we provide support, we cannot base teacher performance or bonuses on a heavy weight of test scores (Fonseca).
  • No, I do not support the current bonus system. Teachers who grow student achievement should receive bonuses, not just teachers whose students score well on tests. For example, a teacher who closes the educational achievement gap with a struggling student should earn a bonus before a teacher who has students who are not challenged and for whom learning is relatively easy. Teachers who grow their students in extracurricular activities should earn a bonus before a teacher that only focuses on education. Teachers that choose to teach in struggling schools should earn a bonus over a teacher that teaches in a school with non-struggling students. Teachers who work with their students in UIL participation, history fairs, debate, choir, student government and like activities should earn a bonus over a teacher who does not (Jones).
  • Extrinsic incentives killed creativity. I knew that from my counseling background, but in 2011 or 2010, Dr. Grier sent an email to school administrators with a link of a TED Talks video that contradicts any notion of giving monetary incentives to promote productivity in the classroom: http://www.ted.com/talks/dan_pink_on_motivation?language=en. Give incentives for perfect attendance or cooperation among teachers selected by teachers (Leal).
  • No. Student test scores were not designed to be used for this purpose. All teachers need salary increases (McCoy).
  • No, I do not support HISD’s current bonus system based on student test scores. Student test scores should be a diagnostic tool used to identify instructional gaps and improve student achievement. Not as a measure to reward teachers, because the process is flawed. I would work collaboratively to identify another system to reward teachers (McCullough).
  • The current bonus program does, in fact, reward teachers whose students make significant academic gains. It leaves out those teachers who have students at the top of the achievement scale. By formulaic measures, it is flawed and the system, according to its creators, is being misused and misapplied. It would be beneficial overall to consider measures to expand the teacher population of recipients as well as to undertake measures to simplify the process if we keep it. I think a better focus would be to see how we can increase overall teacher salaries in a meaningful and impactful way to incentivize performance and longevity (Skillern-Jones).
  • No. I do not support the use of EVAAS in this manner. More importantly, ASPIRE has not closed the achievement gap nor dramatically improved the academic performance of all students in the district (Stipeche).

No responses, or no responses of any general substance, were received from Daniels, Davila, McKinzie, Smith, and Williams.

The Forgotten VAM: The A-F School Grading System

Here is another post from our “Concerned New Mexico Parent” (see prior posts from him/her here and here). This one is about New Mexico’s A-F School Grading System and how it is not only contradictory, within and beyond itself, but how it also provides little instrumental value to the public as an invalid indicator of the “quality” of any school.

(S)he writes:

  Q: What do you call a high school that has only 38% of its students proficient in reading and 35% of its students proficient in mathematics?
  A: A school that needs help improving their student scores.
  Q: What does the New Mexico Public Education Department (NMPED) call this same high school?
  A: A top-rated “A” school, of course.

Readers of this blog are familiar with the VAMs being used to grade teachers. Many states have implemented analogous formulas to grade entire schools. This “forgotten” VAM suffers from all of the familiar problems of the teacher formulas — incomprehensibility, lack of transparency, arbitrariness, and the like.

The first problem with the A-F Grading System is inherent in its very name. The “A-F” terminology implies that this one static assessment is an accurate representation of a school’s quality. As you will see, it is nothing of the sort.

The second problem with the A-F Grading System is that it is composed of benchmarks that are arbitrarily weighted and scored by the NMPED using VAM methodologies.

Thirdly, the “collapsing of the data” from a numeric score to a grade (corresponding to a range of values) causes valuable information to be lost.

Table 1 shows the range of values for reading and mathematics proficiencies for each of the five A-F grade categories for New Mexico schools.

Table 1: Ranges and Median of Reading and Mathematics Proficiencies by A-F School Grade

School A-F Grade | Number of Schools | Reading Proficiency Range (%) | Reading Median (%) | Mathematics Proficiency Range (%) | Mathematics Median (%)
A | 86  | 37.90 – 94.00 | 66.16 | 31.50 – 95.70 | 58.95
B | 237 | 16.90 – 90.90 | 58.00 | 4.90 – 90.90  | 51.30
C | 177 | 0.00 – 83.80  | 46.30 | 0.00 – 76.20  | 38.00
D | 21  | 4.50 – 64.60  | 40.70 | 2.20 – 70.00  | 31.80
F | 88  | 7.80 – 52.30  | 31.85 | 3.30 – 40.90  | 23.30

For example, to earn an A rating, a school can have between 37.9% and 94.0% of its students proficient in reading. In other words, a school can have more than 60% of its students fall short of reading proficiency yet be rated as an “A” school!

The median value listed shows the point which splits the group in half — one-half of the scores are below the median value. Thus, an “A” school median of 66.2% indicates that one-half of the “A” schools have a reading proficiency below 66.2%. In other words, in one-half of the “A” schools 1/3 or more of their students are NOT proficient in reading!

Amazingly, the figures for mathematics are even worse: the minimum mathematics proficiency among B-rated schools is only 4.9%! Scandalous!
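The arithmetic behind these statements can be reproduced directly from Table 1. The following is a minimal sketch, not anything published by NMPED: the table values are copied from above, while the names `TABLE_1`, `not_proficient`, and `summarize` are hypothetical and chosen only for illustration.

```python
# Values copied from Table 1, all in percent:
# (grade, reading minimum, reading median, math minimum, math median)
TABLE_1 = [
    ("A", 37.90, 66.16, 31.50, 58.95),
    ("B", 16.90, 58.00, 4.90, 51.30),
    ("C", 0.00, 46.30, 0.00, 38.00),
    ("D", 4.50, 40.70, 2.20, 31.80),
    ("F", 7.80, 31.85, 3.30, 23.30),
]


def not_proficient(percent_proficient):
    """Share of students (in percent) who are NOT proficient."""
    return round(100.0 - percent_proficient, 2)


def summarize(table=TABLE_1):
    """For each grade, report the non-proficient share at the lowest-scoring
    school in that grade and at the median school in that grade."""
    rows = {}
    for grade, read_min, read_med, math_min, math_med in table:
        rows[grade] = {
            "reading_worst_case": not_proficient(read_min),
            "reading_at_median": not_proficient(read_med),
            "math_worst_case": not_proficient(math_min),
            "math_at_median": not_proficient(math_med),
        }
    return rows


if __name__ == "__main__":
    for grade, row in summarize().items():
        print(grade, row)
```

Read this way, the lowest-scoring “A” school has 62.1% of its students not proficient in reading, and the lowest-scoring “B” school has 95.1% of its students not proficient in mathematics.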

Obviously, and contrary to popular and press perceptions, the A-F Grading System has nothing to do with the actual or current quality of the school!

A few case studies will highlight further absurdities of the New Mexico A-F School Grading System next.

Case Study 1: Highest “A” vs. Lowest “A” High School

Logan High School, Logan, New Mexico received the lowest reading proficiency of any “A” school, and the Albuquerque Institute of Math and Science received the highest reading proficiency score.

These two schools have both received an “A” rating. The Albuquerque Institute had a reading proficiency of 94% and a mathematics proficiency rating of 93%. Logan HS had a reading proficiency of only 38% and a mathematics proficiency rating of only 35%!

How is that possible?

First, much of the A-F VAM, like the teacher VAM, is based on multi-year growth calculations and predictions. Logan has plenty of opportunity for growth, whereas the Albuquerque Institute has “maxed” out most of its scores. Thus, the Albuquerque Institute is penalized in a manner analogous to Gifted and Talented teachers when teacher-level VAM is used. With already excellent scores, there is little, if any, room for improvement.
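The ceiling effect described here is easy to quantify. This is only an illustrative sketch: it treats the room for growth as the distance from 100% proficiency, which is a simplifying assumption and not the NMPED growth formula. The function name `growth_headroom` is hypothetical; the reading proficiencies are the rounded figures quoted above.

```python
def growth_headroom(percent_proficient):
    """Maximum possible gain, in percentage points, before hitting 100%.

    Simplifying assumption: growth is measured as a rise in the
    percent of students proficient, so it is capped at 100.
    """
    return 100.0 - percent_proficient


# Reading proficiency figures quoted in the post (rounded).
schools = {
    "Logan High School": 38.0,
    "Albuquerque Institute of Math and Science": 94.0,
}

headroom = {name: growth_headroom(p) for name, p in schools.items()}

if __name__ == "__main__":
    for name, points in headroom.items():
        print(f"{name}: up to {points:.0f} points of possible growth")
```

Under this assumption Logan could still gain up to 62 points in reading, while the Albuquerque Institute could gain at most 6.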

Second, Logan has an emphasis on shop/trade classes which yields a very high college and CAREER readiness score for the VAM calculation.

Also, a final factor is that the NMPED-defined range for an “A” extends from 75 to 100 points, and Logan barely made it into the A grouping.

Thus, a proficiency score of only 37.9% is no deterrent to an A score for Logan High.

Case Study 2: Hanging on by a Thread

As noted above, any school that scores between 75 and 100 points is considered an “A” school.

This statistical oddity was very beneficial to Hagerman High (Hagerman, NM) in their 2014 School Grade Report Card. They fell 5.99 points overall from the previous year’s score, but they managed to still receive an “A” score since their resulting 2014 score was exactly 75.01.

With this one one-hundredth of a point, they are in the same “A” grade category as the Albuquerque Institute of Math and Science (rated best in New Mexico by NMPED) and the Cottonwood Classical Preparatory School of Albuquerque (rated best in New Mexico by US News).
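The loss of information from collapsing a numeric score into a letter can be shown in a few lines. This is a sketch under stated assumptions: only the “A” range (75 to 100 points) is given in the post, so only that cutoff is modeled, and the names `A_CUTOFF` and `is_a_school` are hypothetical.

```python
A_CUTOFF = 75.0  # from the post: 75 to 100 points earns an "A"


def is_a_school(points):
    """True if the overall score falls inside the "A" range."""
    return A_CUTOFF <= points <= 100.0


# Hagerman High figures from the post.
hagerman_2014 = 75.01
drop_from_prior_year = 5.99
hagerman_prior = round(hagerman_2014 + drop_from_prior_year, 2)

# How far above the cutoff the 2014 score sits.
margin = round(hagerman_2014 - A_CUTOFF, 2)

if __name__ == "__main__":
    print("prior-year score:", hagerman_prior, "A?", is_a_school(hagerman_prior))
    print("2014 score:", hagerman_2014, "A?", is_a_school(hagerman_2014))
    print("margin above cutoff:", margin)
```

Both years map to the same letter even though the underlying score fell by almost six points, and a further drop of two hundredths of a point would have changed the grade.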

Case Study 3: A Tale of Two Ranking Systems

This inaccuracy and arbitrariness of any A-F School Grading System was also apparent in a recent Albuquerque Journal News article (May 14, 2015) which reported on the most recent US News ratings of high schools nationwide.

The Journal reported on the top 12 high schools in New Mexico as rated by US News. It is not surprising that most were NMPED A-rated schools. What was unusual is that the 3rd and 5th US News highest rated schools in New Mexico (South Valley Academy and Albuquerque High, both in Albuquerque) were actually rated as B schools by the NMPED A-F School Grading System.

According to NMPED data, I tabulated at least forty-four (44) high schools that were rated as “A” schools with higher NMPED scores than South Valley Academy which had an NMPED score of 71.4.

None of these 44 higher NMPED scoring schools were rated above South Valley Academy by US News.

Case Study 4: Punitive Grading

Many school districts and school boards throughout New Mexico have adopted policies that prohibit punitive grading based on behavior. It is no longer possible to lower a student’s grade just because of their behavior. The grade should reflect classroom assessment only.

NMPED ignores this policy in the context of the A-F School Grading System. Schools were graded down one letter grade if they did not achieve 95% participation rates.

One such school was Mills Elementary in the Hobbs Municipal Schools District. Only 198 students were tested; they fell 11 short of the 95% mark and were penalized one “grade” level. Their grade was reduced from a “D” to an “F.” In fact, Mills Elementary’s proficiency scores were higher than those of the A-rated Logan High School discussed earlier.
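The participation penalty can be worked through with the numbers given. This is a rough sketch, not NMPED’s procedure: the enrollment figure is not stated in the post and is inferred here from the 95% rule, and the names `GRADES`, `REQUIRED_RATE`, and `apply_participation_penalty` are hypothetical. Leaving an “F” school at “F” is also an assumption.

```python
GRADES = ["A", "B", "C", "D", "F"]
REQUIRED_RATE = 0.95  # from the post: 95% participation required


def apply_participation_penalty(grade, participation_rate):
    """Drop one letter grade when participation is below the required rate.

    Assumption: a school already at "F" stays at "F".
    """
    if participation_rate >= REQUIRED_RATE:
        return grade
    idx = GRADES.index(grade)
    return GRADES[min(idx + 1, len(GRADES) - 1)]


# Mills Elementary figures from the post.
tested = 198
shortfall = 11
needed = tested + shortfall  # students required to reach the 95% mark

# Inferred, not stated in the post: the enrollment for which 95% equals `needed`.
implied_enrollment = round(needed / REQUIRED_RATE)
implied_rate = tested / implied_enrollment

if __name__ == "__main__":
    print("students needed:", needed)
    print("implied enrollment:", implied_enrollment)
    print("implied participation rate:", round(implied_rate, 3))
    print("grade after penalty:", apply_participation_penalty("D", implied_rate))
```

On these assumptions the school tested about 90% of roughly 220 students, and those 11 missing test-takers alone moved it from “D” to “F” regardless of how the tested students performed.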

The likely explanation is that Hobbs has a highly transient population with both seasonal farm laborers and oil-field workers predominating in the local economy.

For more urban schools, it will be interesting to see how the NMPED policy of punitive grading will play out with the increasingly popular Opt-Out movement.

Conclusion

It is apparent that the NMPED’s A-F School Grading System rates schools deceptively using VAM-augmented data and provides little of any value to the public as to the “quality” of a school. By presenting it in the form of an “NMPED School Grade Report Card” the state seeks to hide its arbitrary nature.

Such a useless grade should certainly not be used to declare a school a “failure” and in need of radical reform.

“Value-Less” Value-Added Data

Peter Greene, a veteran English teacher in Pennsylvania, a state that uses its own version of the Education Value-Added Assessment System (EVAAS), wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread, and least research-supported, claims about value-added models (VAMs): that the data derived from them can inform and improve what teachers do in their classrooms.

Greene sententiously critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on– well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist– you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no — all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line– we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions along with instructional videos that seem to assume that the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) to explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”

NY Teacher Lederman’s Day in Court

Do you recall the case of Sheri Lederman? She is the Long Island 4th grade teacher and 18-year veteran who, by all accounts other than her composite growth (or value-added) score, is a terrific teacher, yet who received a score of 1 out of 20 after scoring 14 out of 20 the year prior (see prior posts here, here, and here; see also here and here).

With her husband, attorney Bruce Lederman, leading her case, she is suing the state of New York (the state in which Governor Cuomo is pushing to now have teachers’ value-added scores count for approximately 50% of their total evaluations) to challenge the state’s teacher evaluation system. She is also being fully supported by her students, her principal, her superintendent, and a series of VAM experts including: Linda Darling-Hammond (Stanford), Aaron Pallas (Columbia University Teachers College), Carol Burris (Educator and Principal of the Year from New York), Brad Lindell (Long Island Research Consultant), and me (Arizona State University) (see their/our expert witness affidavits here). See also an affidavit more recently submitted by Jesse Rothstein (Berkeley) here, as well as the full document explaining the entire case – the Memorandum of Law – here.

Well, the Ledermans had their day in court this past Wednesday (August 12, 2015).

It was apparent in the hearing that the Judge carefully read all the papers prior, and he was fully familiar with the issues. As per Bruce Lederman, “[t]he issue that seemed to catch the Judge’s attention the most was whether it was rational to have a system which decides in advance that 7% of teachers will be ineffective, regardless of actual results. The Judge asked numerous questions about whether it was fair to use a bell curve,” whereby when using a bell curve to distribute teachers’ growth or value-added scores, there will always be a set of “ineffective” teachers, regardless of whether in fact they are truly “ineffective.” This occurs not naturally but by the statistical manipulation needed to fit all scores within the normal distribution needed to spread out the scores in order to make relative distinctions and categorizations (e.g., highly effective, effective, ineffective), the validity of which is highly uncertain (see, for example, a prior post here). Hence, “[t]he Judge pressed the lawyer representing New York’s Education Department very hard on this particular issue,” but the state’s lawyer did not (most likely because she could not) give the Judge a satisfactory explanation, justification, or rationale.
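The point that caught the Judge’s attention can be illustrated with a short sketch. This is not New York’s actual growth model: it is a deliberately simplified, hypothetical ranking rule that flags a fixed share of teachers, using the 7% figure quoted above. The function name `label_by_quota` and the made-up score lists are assumptions for illustration only.

```python
def label_by_quota(scores, ineffective_percent=7):
    """Flag the lowest-ranked share of teachers as "ineffective".

    The number flagged depends only on how many teachers there are,
    never on how high or low the scores actually are.
    """
    n_flagged = (len(scores) * ineffective_percent) // 100
    # Rank teachers from lowest to highest score; ties keep list order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flagged = set(order[:n_flagged])
    return [
        "ineffective" if i in flagged else "not ineffective"
        for i in range(len(scores))
    ]


# Hypothetical scores for 100 teachers (made-up numbers).
baseline = [50 + i for i in range(100)]
# Every teacher improves by the same large amount.
improved = [s + 30 for s in baseline]
# Every teacher has exactly the same score.
identical = [100] * 100

if __name__ == "__main__":
    for name, scores in [("baseline", baseline), ("improved", improved), ("identical", identical)]:
        print(name, label_by_quota(scores).count("ineffective"))
```

In all three hypothetical cases seven teachers are labeled “ineffective,” including the case in which every teacher improves and the case in which all teachers perform identically, which is the sense in which the outcome is decided in advance.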

For more information on the case, see here the video that I feel best captures the case, thanks to CBS News in Albany. For another video see here, courtesy of NBC News in Albany. See also two additional articles, here and here, with the latter including the photo of Sheri and Bruce Lederman pasted below.

[Photo: Sheri and Bruce Lederman]