Is Alabama the New, New Mexico?

In Alabama, the Grand Old Party (GOP) has put forth a draft bill, to be entitled an act and ultimately called the Rewarding Advancement in Instruction and Student Excellence (RAISE) Act of 2016. The purpose of the act will be to…wait for it…use test scores to grade teachers and pay them annual bonuses (i.e., “supplements”) as per their performance. More specifically, the bill is to “provide a procedure for observing and evaluating teachers” to help make “significant differentiation[s] in pay, retention, promotion, dismissals, and other staffing decisions, including transfers, placements, and preferences in the event of reductions in force, [as] primarily [based] on evaluation results.” Relatedly, Alabama districts may no longer use teachers’ “seniority, degrees, or credentials as a basis for determining pay or making the retention, promotion, dismissal, and staffing decisions.” Genius!

Accordingly, Larry Lee, whose blog is based on the foundation that “education is everyone’s business,” sent me this bill to review, critique, and help make everyone’s business. I attach it here for others who are interested, but I also summarize and critique its most relevant (but also contemptible) issues below.

Alabama teachers who are eligible are (after a staggered period of time) to be primarily evaluated (i.e., for up to 45% of a teacher’s total evaluation score) on the extent to which they purportedly cause student growth in achievement, with student growth being defined as teachers’ purported impacts on “[t]he change in achievement for an individual student between two or more points in time.” Teachers are also to be observed at least twice per year (i.e., for up to 45% of a teacher’s total evaluation score) by their appropriate and appropriately trained evaluators/supervisors, and an unnamed and undefined set of parent and student surveys is to be used to evaluate teachers (i.e., for up to 15% of a teacher’s total evaluation score).
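
For illustration only, here is a minimal sketch of how such a weighted composite might be computed under the 45/45/15 split described above; the function name, variable names, and example scores are hypothetical and not taken from the bill itself.

```python
# Minimal sketch: a weighted composite of the three components described above
# (growth 45%, observations 45%, surveys 15%). All names and example scores
# below are hypothetical, not language from the RAISE Act.
def composite_score(growth, observation, survey, weights=(0.45, 0.45, 0.15)):
    """Combine three 0-100 component scores into one weighted composite."""
    w_growth, w_obs, w_survey = weights
    return w_growth * growth + w_obs * observation + w_survey * survey

# Example: strong observations and surveys, weaker test-based growth
print(composite_score(growth=60, observation=85, survey=90))  # 78.75
```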

Again, no real surprises here, as the adoption of such measures is common among states like Alabama (and New Mexico); it is when these components are explained in more detail, however, that things really go awry.

“For grade levels and subjects for which student standardized assessment data is not available and for teachers for whom student standardized assessment data is not available, the [state’s] department [of education] shall establish a list of preapproved options for governing boards to utilize to measure student growth.” This is precisely what has gotten the whole state of New Mexico wrapped up in, and currently losing, its ongoing lawsuit (see my most recent post on this here). While providing districts with menus of preapproved assessment options might make sense to policymakers, any self-respecting researcher or even assessment commoner should know why this is entirely inappropriate. To read more about this, the best research study explaining why doing just this will set any state up for lawsuits comes from Brown University’s John Papay in his highly esteemed and highly cited “Different tests, different answers: The stability of teacher value-added estimates across outcome measures” article. The title of this research article alone should explain well enough why simply positioning and offering up such tests in such casual (and quite careless) ways makes way for legal recourse.

Otherwise, the only test mentioned that is also to be used to measure teachers’ purported impacts on student growth is the ACT Aspire – the ACT test corporation’s “college and career readiness” test that is aligned to and connected with their more familiar college-entrance ACT. This, too, was one of the sources of the aforementioned lawsuit in New Mexico in terms of what we call content validity, in that states cannot simply pull in tests that are not adequately aligned with a state’s curriculum (e.g., I could find no information about the alignment of the ACT Aspire to Alabama’s curriculum here, which is also highly problematic as this information should definitely be available) and that have not been validated for such purposes (i.e., to measure teachers’ impacts on student growth).

Regardless of the tests, however, all of the secondary measures to be used to evaluate Alabama teachers (e.g., student and parent survey scores, observational scores) are also to be “correlated with impacts on student achievement results.” We’ve also increasingly seen this become the case across the nation, whereby state/district leaders are not simply assessing whether these indicators are independently correlated (which they should be if they all, in fact, help to measure our construct of interest, that is, teacher effectiveness), but are rather manufacturing and forcing these correlations via what I have termed “artificial conflation” strategies (see also a recent post here about how this is one of the fundamental and critical points of litigation in Houston).
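
To make the distinction concrete, here is a minimal sketch (with made-up data and hypothetical variable names) of the difference between simply checking whether two indicators are independently correlated and “artificially conflating” them by forcing one measure to agree with the other:

```python
# Minimal sketch (made-up data): checking whether two evaluation indicators are
# independently correlated, versus forcing them to agree ("artificial conflation").
import numpy as np

vam_scores = np.array([1, 2, 2, 3, 4, 4, 5, 5])          # hypothetical VAM ratings
observation_scores = np.array([3, 4, 2, 4, 3, 5, 2, 4])  # hypothetical observation ratings

# Defensible: estimate the correlation and report it as it is.
r = np.corrcoef(vam_scores, observation_scores)[0, 1]
print(f"Observed correlation: {r:.2f}")

# One crude way "conflation" could be operationalized: cap observation ratings
# so they can never sit more than one point above the VAM rating.
conflated = np.minimum(observation_scores, vam_scores + 1)
r_forced = np.corrcoef(vam_scores, conflated)[0, 1]
print(f"Correlation after forcing agreement: {r_forced:.2f}")  # higher by construction, not by evidence
```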

The state is apparently also set on going “all in” on evaluating their principals in many of the same ways, although I did not critique those sections for this particular post.

Most importantly, though, for those of you who have access to such leaders in Alabama, do send them this post so they might be a bit more proactive, and appropriately more careful and cautious, before going down this poor educational policy path. While I do embrace my professional responsibility as a public scholar to be called to court to testify about all of this when such high-stakes consequences are ultimately, yet inappropriately, based upon invalid inferences, I’d much rather be proactive in this regard and save states and states’ taxpayers their time and money, respectively.

Accordingly, I see the state is also to put out a request for proposals to retain an external contractor to help them measure said student growth and teachers’ purported impacts on it. I would also be more than happy to help the state negotiate this contract, much more wisely than so many other states and districts have negotiated similar contracts thus far (e.g., without asking for reliability and validity evidence as a contractual deliverable)…should this poor educational policy actually come to fruition.

Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation

States of attention on this blog, and often of (dis)honorable mention as per their state-level policies bent on value-added models (VAMs), include Florida, New York, Tennessee, and New Mexico. As for a quick update about the latter state of New Mexico, we are still waiting to hear the final decision from the judge who recently heard the state-level lawsuit still pending on this matter in New Mexico (see prior posts about this case here, here, here, here, and here).

Another locale of great interest, though, is the Houston Independent School District. This is the seventh largest urban school district in the nation, and the district that has tied more high-stakes consequences to its value-added output than any other district/state in the nation. These “initiatives” were “led” by soon-to-resign/retire Superintendent Terry Grier who, during his time in Houston (2009-2015), implemented some of the harshest consequences ever attached to teacher-level value-added output, as per the district’s use of the Education Value-Added Assessment System (EVAAS) (see other posts about the EVAAS here, here, and here; see other posts about Houston here, here, and here).

In fact, the EVAAS is still used throughout Houston today to evaluate all EVAAS-eligible teachers, and also to “reform” the district’s historically low-performing schools, by tying teachers’ purported value-added performance to teacher improvement plans, merit pay, nonrenewal, and termination (e.g., 221 Houston teachers were terminated “in large part” due to their EVAAS scores in 2011). However, given pending litigation (i.e., this is the district in which the American and Houston Federation of Teachers (AFT/HFT) are currently suing the district for its wrongful use of, and over-emphasis on, this particular VAM; see here), Superintendent Grier and the district have recoiled on some of the high-stakes consequences they formerly attached to the EVAAS. This particular lawsuit is to commence this spring/summer.

Nonetheless, my most recent post about Houston was about some of its future school board candidates, who were invited by The Houston Chronicle to respond to Superintendent Grier’s teacher evaluation system. For the most part, those who responded did so unfavorably, especially as the evaluation system was/is disproportionately reliant on teachers’ EVAAS data, and on the high-stakes use of these data in particular (see here).

Most recently, however, as per a “split” decision registered by Houston’s current school board (i.e., 4:3, and without any new members elected last November), Superintendent Grier received a $98,600 bonus for his “satisfactory evaluation” as the school district’s superintendent. See more from the full article published in The Houston Chronicle. As per the same article, Superintendent “Grier’s base salary is $300,000, plus $19,200 for car and technology allowances. He also is paid for unused leave time.”

More importantly, take a look at the two figures below, taken from actual district reports (see references below), highlighting Houston’s performance (declining, on average, in blue) as compared to the state of Texas (maintaining, on average, in black), to determine for yourself whether Superintendent Grier, indeed, deserved such a bonus (not to mention salary).

Another question to ponder is whether the district’s use of the EVAAS value-added system, especially since Superintendent Grier’s arrival in 2009, is actually reforming the school district as he and other district leaders have for so long now intended.


Figure 1. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR tests, 2012-2015 (HISD, 2015a)


Figure 2. Houston (blue trend line) v. Texas (black trend line) performance on the state’s STAAR End-of-Course (EOC) tests, 2012-2015 (HISD, 2015b)

References:

Houston Independent School District (HISD). (2015a). State of Texas Assessments of Academic Readiness (STAAR) performance, grades 3-8, spring 2015. Retrieved here.

Houston Independent School District (HISD). (2015b). State of Texas Assessments of Academic Readiness (STAAR) end-of-course results, spring 2015. Retrieved here.

Houston Board Candidates Respond to their Teacher Evaluation System

For a recent article in the Houston Chronicle, the newspaper sent 12 current candidates for the Houston Independent School District (HISD) School Board a series of questions about HISD, to which seven candidates responded. The seven candidates’ responses are of specific interest here in that HISD is the district well known for attaching more high-stakes consequences to value-added output (e.g., teacher termination) than any other (see for example here, here, and here). The seven candidates’ responses are of general interest in that the district uses the popular and (in)famous Education Value-Added Assessment System (EVAAS) for said purposes (see also here, here, and here). Accordingly, what these seven candidates have to say about the EVAAS and/or HISD’s teacher evaluation system might also be a sign of things to come, perhaps for the better, throughout HISD.

The questions are: (1) Do you support HISD’s current teacher evaluation system, which includes student test scores? Why or why not? What, if any, changes would you make? And (2) Do you support HISD’s current bonus system based on student test scores? Why or why not? What, if any, changes would you make? To see candidate names, their background information, their responses to other questions, etc. please read in full the article in the Houston Chronicle.

Here are the seven candidates’ responses to question #1:

  • I do not support the current teacher evaluation system. Teacher’s performance should not rely on the current formula using the evaluation system with the amount of weight placed on student test scores. Too many obstacles outside the classroom affect student learning today that are unfair in this system. Other means of support such as a community school model must be put in place to support the whole student, supporting student learning in the classroom (Fonseca).
  • No, I do not support the current teacher evaluation system, EVAAS, because it relies on an algorithm that no one understands. Testing should be diagnostic, not punitive. Teachers must have the freedom to teach basic math, reading, writing and science and not only teach to the test, which determines if they keep a job and/or get bonuses. Teachers should be evaluated on student growth. For example, did the third-grade teacher raise his/her non-reading third-grader to a higher level than that student read when he/she came into the teacher’s class? Did the teacher take time to figure out what non-educational obstacles the student had in order to address those needs so that the student began learning? Did the teacher coach the debate team and help the students become more well-rounded, and so on? Standardized tests in a vacuum indicate nothing (Jones).
  • I remember the time when teachers practically never revised test scores. Tests can be one of the best tools to help a child identify strengths and weakness. Students’ scores were filed, and no one ever checked them out from the archives. When student scores became part of their evaluation, teachers began to look into data more often. It is a magnificent tool for student and teacher growth. Having said that, I also believe that many variables that make a teacher great are not measured in his or her evaluation. There is nothing on character education for which teachers are greatly responsible. I do not know of a domain in the teacher’s evaluation that quite measures the art of teaching. Data is about the scientific part of teaching, but the art of teaching has to be evaluated by an expert at every school; we call them principals (Leal).
  • Student test scores were not designed to be used for this purpose. The use of students’ test scores to evaluate teachers has been discredited by researchers and statisticians. EVAAS and other value-added models are deeply flawed and should not be major components of a teacher evaluation system. The existing research indicates that 10-14 percent of students’ test scores are attributable to teacher factors. Therefore, I would support using student test scores (a measure of student achievement) as no more than 10-14 percent of teachers’ evaluations (McCoy).
  • No, I do not support the current teacher evaluation system, which includes student test scores, for the following reasons: 1) High-stakes decisions should not be made based on the basis of value-added scores alone. 2) The system is meant to assess and predict student performance with precision and reliability, but the data revealed that the EVAAS system is inconsistent and has consistent problems. 3) The EVAAS reports do not match the teachers’ “observation” PDAS scores [on the formal evaluation]; therefore, data is manipulated to show a relationship. 4) Most importantly, teachers cannot use the information generated as a formative tool because teachers receive the EVAAS reports in the summer or fall after the students leave their classroom. 5) Very few teachers realized that there was an HISD-sponsored professional development training linked to the EVAAS system to improve instruction. Changes that I will make are to make recommendations and confer with other board members to revamp the system or identify a more equitable system (McCullough).
  • The current teacher evaluation system should be reviewed and modified. While I believe we should test, it should only be a diagnostic measure of progress and indicator of deficiency for the purpose of aligned instruction. There should not be any high stakes attached for the student or the teacher. That opens the door for restricting teaching-to-test content and stifles the learning potential. If we have to have it, make it 5 percent. The classroom should be based on rich academic experiences, not memorization regurgitation (Skillern-Jones).
  • I support evaluating teachers on how well their students perform and grow, but I do not support high-stakes evaluation of teachers using a value-added test score that is based on the unreliable STAAR test. Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes decisions or compared across dissimilar student populations or schools. If we had a reliable test of student learning, I would support the use of value-added growth measures in a low-stakes fashion where measures of student growth are part of an integrated analysis of a teacher’s overall performance and practices. I strongly believe that teachers should be evaluated with an integrated set of measures that show what teachers do and what happens as a result. These measures may include meaningful evidence of student work and learning, pedagogy, classroom management, knowledge of content and even student surveys. Evaluators should be appropriately trained, and teachers should have regular evaluations with frequent feedback from strong mentors and professional development to strengthen their content knowledge and practice (Stipeche).

Here are the seven candidates’ responses to question #2:

  • I do not support the current bonus system based on student test scores as, again, teachers do not currently have support to affect what happens outside the classroom. Until we provide support, we cannot base teacher performance or bonuses on a heavy weight of test scores (Fonseca).
  • No, I do not support the current bonus system. Teachers who grow student achievement should receive bonuses, not just teachers whose students score well on tests. For example, a teacher who closes the educational achievement gap with a struggling student should earn a bonus before a teacher who has students who are not challenged and for whom learning is relatively easy. Teachers who grow their students in extracurricular activities should earn a bonus before a teacher that only focuses on education. Teachers that choose to teach in struggling schools should earn a bonus over a teacher that teaches in a school with non-struggling students. Teachers who work with their students in UIL participation, history fairs, debate, choir, student government and like activities should earn a bonus over a teacher who does not (Jones).
  • Extrinsic incentives killed creativity. I knew that from my counseling background, but in 2011 or 2010, Dr. Grier sent an email to school administrators with a link of a TED Talks video that contradicts any notion of giving monetary incentives to promote productivity in the classroom: http://www.ted.com/talks/dan_pink_on_motivation?language=en. Give incentives for perfect attendance or cooperation among teachers selected by teachers (Leal).
  • No. Student test scores were not designed to be used for this purpose. All teachers need salary increases (McCoy).
  • No, I do not support HISD’s current bonus system based on student test scores. Student test scores should be a diagnostic tool used to identify instructional gaps and improve student achievement. Not as a measure to reward teachers, because the process is flawed. I would work collaboratively to identify another system to reward teachers (McCullough).
  • The current bonus program does, in fact, reward teachers whose students make significant academic gains. It leaves out those teachers who have students at the top of the achievement scale. By formulaic measures, it is flawed and the system, according to its creators, is being misused and misapplied. It would be beneficial overall to consider measures to expand the teacher population of recipients as well as to undertake measures to simplify the process if we keep it. I think a better focus would be to see how we can increase overall teacher salaries in a meaningful and impactful way to incentivize performance and longevity (Skillern-Jones).
  • No. I do not support the use of EVAAS in this manner. More importantly, ASPIRE has not closed the achievement gap nor dramatically improved the academic performance of all students in the district (Stipeche).

No responses, or no responses of any general substance, were received from Daniels, Davila, McKinzie, Smith, and Williams.

The Forgotten VAM: The A-F School Grading System

Here is another post from our “Concerned New Mexico Parent” (see prior posts from him/her here and here). This one is about New Mexico’s A-F School Grading System and how it is not only contradictory, within and beyond itself, but how it also provides little instrumental value to the public as an invalid indicator of the “quality” of any school.

(S)he writes:

Q: What do you call a high school that has only 38% of its students proficient in reading and 35% of its students proficient in mathematics?
A: A school that needs help improving its student scores.
Q: What does the New Mexico Public Education Department (NMPED) call this same high school?
A: A top-rated “A” school, of course.

Readers of this blog are familiar with the VAMs being used to grade teachers. Many states have implemented analogous formulas to grade entire schools. This “forgotten” VAM suffers from all of the familiar problems of the teacher formulas — incomprehensibility, lack of transparency, arbitrariness, and the like.

The first problem with the A-F Grading System is inherent in its very name. The “A-F” terminology implies that this one static assessment is an accurate representation of a school’s quality. As you will see, it is nothing of the sort.

The second problem with the A-F Grading System is that it is composed of benchmarks that are arbitrarily weighted and scored by the NMPED using VAM methodologies.

Thirdly, the “collapsing of the data” from a numeric score to a grade (corresponding to a range of values) causes valuable information to be lost.
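
As a minimal sketch of this point: the only cut score given in this post is that 75 to 100 points earns an “A” (see Case Study 2 below), so the other grade bands in the snippet are hypothetical placeholders.

```python
# Minimal sketch: collapsing a numeric school score into a letter grade discards
# information. Only the A band (75-100 points) is stated in this post; the other
# cut points below are hypothetical placeholders.
def letter_grade(points):
    if points >= 75:
        return "A"
    if points >= 60:
        return "B"   # hypothetical cut
    if points >= 45:
        return "C"   # hypothetical cut
    if points >= 30:
        return "D"   # hypothetical cut
    return "F"

# Two very different schools collapse to the same letter grade.
print(letter_grade(75.01), letter_grade(99.0))  # A A
```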

Table 1 shows the range of values for reading and mathematics proficiencies for each of the five A-F grade categories for New Mexico schools.

Table 1: Ranges and Median of Reading and Mathematics Proficiencies by A-F School Grade

School A-F Grade | Number of Schools | Reading Proficiency Range | Median | Mathematics Proficiency Range | Median
A | 86  | 37.90 – 94.00 | 66.16 | 31.50 – 95.70 | 58.95
B | 237 | 16.90 – 90.90 | 58.00 | 4.90 – 90.90  | 51.30
C | 177 | 0.00 – 83.80  | 46.30 | 0.00 – 76.20  | 38.00
D | 21  | 4.50 – 64.60  | 40.70 | 2.20 – 70.00  | 31.80
F | 88  | 7.80 – 52.30  | 31.85 | 3.30 – 40.90  | 23.30

For example, to earn an A rating, a school can have between 37.9% and 94.0% of its students proficient in reading. In other words, a school can have roughly two-thirds of its students fail reading proficiency yet be rated as an “A” school!

The median value listed shows the point which splits the group in half — one-half of the scores are below the median value. Thus, an “A” school median of 66.2% indicates that one-half of the “A” schools have a reading proficiency below 66.2%. In other words, in one-half of the “A” schools 1/3 or more of their students are NOT proficient in reading!
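
For readers who want to see how such a tabulation is assembled, here is a minimal sketch using made-up school-level data (not the actual NMPED file) to compute the per-grade ranges and medians reported in Table 1:

```python
# Minimal sketch (made-up data): computing the range and median of reading
# proficiency within each letter-grade category, as reported in Table 1.
from statistics import median

# (letter grade, % of students proficient in reading) for hypothetical schools
schools = [("A", 37.9), ("A", 66.2), ("A", 94.0),
           ("B", 16.9), ("B", 58.0), ("B", 90.9),
           ("F", 7.8), ("F", 31.9), ("F", 52.3)]

by_grade = {}
for grade, proficiency in schools:
    by_grade.setdefault(grade, []).append(proficiency)

for grade, values in sorted(by_grade.items()):
    print(f"{grade}: range {min(values):.1f}-{max(values):.1f}, median {median(values):.1f}")
```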

Amazingly, the figures for mathematics are even worse: the minimum proficiency for a B rating is only 4.9%! Scandalous!

Obviously, and contrary to popular and press perceptions, the A-F Grading System has nothing to do with the actual or current quality of the school!

Next, a few case studies highlight further absurdities of the New Mexico A-F School Grading System.

Case Study 1: Highest “A” vs. Lowest “A” High School

Logan High School, Logan, New Mexico received the lowest reading proficiency of any “A” school, and the Albuquerque Institute of Math and Science received the highest reading proficiency score.

These two schools have both received an “A” rating. The Albuquerque Institute had a reading proficiency of 94% and a mathematics proficiency rating of 93%. Logan HS had a reading proficiency of only 38% and a mathematics proficiency rating of only 35%!

How is that possible?

First, much of the A-F VAM, like the teacher VAM, is based on multi-year growth calculations and predictions. Logan has plenty of opportunity for growth, whereas the Albuquerque Institute has “maxed out” most of its scores. Thus, the Albuquerque Institute is penalized in a manner analogous to Gifted and Talented teachers when teacher-level VAM is used. With already excellent scores, there is little, if any, room for improvement.

Second, Logan has an emphasis on shop/trade classes which yields a very high college and CAREER readiness score for the VAM calculation.

A final factor is that the NMPED-defined range for an “A” extends from 75 to 100 points, and Logan barely made it into the A grouping.

Thus, a proficiency score of only 37.9% is no deterrent to an A score for Logan High.

Case Study 2: Hanging on by a Thread

As noted above, any school that scores between 75 and 100 points is considered an “A” school.

This statistical oddity was very beneficial to Hagerman High (Hagerman, NM) in their 2014 School Grade Report Card. They fell 5.99 points overall from the previous year’s score, but they managed to still receive an “A” score since their resulting 2014 score was exactly 75.01.

With this one one-hundredth of a point, they are in the same “A” grade category as the Albuquerque Institute of Math and Science (rated best in New Mexico by NMPED) and the Cottonwood Classical Preparatory School of Albuquerque (rated best in New Mexico by US News).

Case Study 3: A Tale of Two Ranking Systems

The inaccuracy and arbitrariness of any A-F School Grading System were also apparent in a recent Albuquerque Journal News article (May 14, 2015), which reported on the most recent US News ratings of high schools nationwide.

The Journal reported on the top 12 high schools in New Mexico as rated by US News. It is not surprising that most were NMPED A-rated schools. What was unusual is that the 3rd and 5th US News highest rated schools in New Mexico (South Valley Academy and Albuquerque High, both in Albuquerque) were actually rated as B schools by the NMPED A-F School Grading System.

According to NMPED data, I tabulated at least forty-four (44) high schools that were rated as “A” schools with higher NMPED scores than South Valley Academy, which had an NMPED score of 71.4.

None of these 44 higher NMPED scoring schools were rated above South Valley Academy by US News.

Case Study 4: Punitive Grading

Many school districts and school boards throughout New Mexico have adopted policies that prohibit punitive grading based on behavior. It is no longer possible to lower a student’s grade just because of their behavior. The grade should reflect classroom assessment only.

NMPED ignores this policy in the context of the A-F School Grading System. Schools were graded down one letter grade if they did not achieve 95% participation rates.

One such school was Mills Elementary in the Hobbs Municipal Schools District. Only 198 students were tested; the school fell 11 students short of the 95% mark and was penalized one grade level: its grade was reduced from a “D” to an “F.” In fact, Mills Elementary’s proficiency scores were higher than those of the A-rated Logan High School discussed earlier.
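
In sketch form, the participation penalty amounts to the rule below; note that the enrollment figure is inferred from “198 tested, 11 short of the 95% mark” and may not match the actual NMPED report.

```python
# Minimal sketch: the one-letter-grade participation penalty described above.
# The enrollment figure is inferred from "198 tested, 11 short of 95%" and is
# approximate; it may not match the actual NMPED report.
GRADES = ["A", "B", "C", "D", "F"]

def apply_participation_penalty(grade, tested, enrolled, threshold=0.95):
    participation = tested / enrolled
    if participation < threshold and grade != "F":
        return GRADES[GRADES.index(grade) + 1]  # drop one letter grade
    return grade

print(apply_participation_penalty("D", tested=198, enrolled=220))  # "F"
```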

The likely explanation is that Hobbs has a highly transient population with both seasonal farm laborers and oil-field workers predominating in the local economy.

For more urban schools, it will be interesting to see how the NMPED policy of punitive grading will play out with the increasingly popular Opt-Out movement.

Conclusion

It is apparent that the NMPED’s A-F School Grading System rates schools deceptively using VAM-augmented data and provides little, if any, value to the public as to the “quality” of a school. By presenting it in the form of an “NMPED School Grade Report Card,” the state seeks to hide its arbitrary nature.

Such a useless grade should certainly not be used to declare a school a “failure” and in need of radical reform.

“Value-Less” Value-Added Data

Peter Greene, a veteran English teacher in Pennsylvania, a state using its own version of the Education Value-Added Assessment System (EVAAS), wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread, yet least research-supported, claims about using the data derived via value-added models (VAMs) to inform and improve what teachers do in their classrooms.

Greene sententiously critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on– well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist– you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no — all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line– we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions along with instructional videos that seem to assume that the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) with explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.
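
For illustration of the mechanism Greene describes, in which each student receives a model-predicted (“alternate universe”) score and “growth” simply means the actual score beat it, here is a minimal sketch with hypothetical numbers; the real PVAAS projection model is proprietary and far more elaborate.

```python
# Minimal sketch (hypothetical numbers): the predicted-vs-actual comparison Greene
# describes. The real PVAAS projection model is proprietary and far more complex.
students = [
    {"name": "student_1", "predicted": 1520, "actual": 1540},
    {"name": "student_2", "predicted": 1490, "actual": 1450},
    {"name": "student_3", "predicted": 1600, "actual": 1600},
]

for s in students:
    showed_growth = s["actual"] > s["predicted"]  # beat the "alternate universe" score?
    print(f"{s['name']}: predicted {s['predicted']}, actual {s['actual']}, "
          f"showed growth = {showed_growth}")
```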

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”

NY Teacher Lederman’s Day in Court

Do you recall the case of Sheri Lederman? The Long Island teacher who, apparently by all accounts other than her composite growth (or value-added) score, is a terrific 4th grade teacher and 18-year veteran, and who received a score of 1 out of 20 after she scored a 14 out of 20 the year prior (see prior posts here, here, and here; see also here and here)?

With her husband, attorney Bruce Lederman, leading her case, she is suing the state of New York (the state in which Governor Cuomo is pushing to now have teachers’ value-added scores count for approximately 50% of their total evaluations) to challenge the state’s teacher evaluation system. She is also being fully supported by her students, her principal, her superintendent, and a series of VAM experts including Linda Darling-Hammond (Stanford), Aaron Pallas (Columbia University Teachers College), Carol Burris (Educator and Principal of the Year from New York), Brad Lindell (Long Island Research Consultant), and me (Arizona State University) (see their/our expert witness affidavits here). See also an affidavit more recently submitted by Jesse Rothstein (Berkeley) here, as well as the full document explaining the entire case – the Memorandum of Law – here.

Well, the Ledermans had their day in court this past Wednesday (August 12, 2015).

It was apparent in the hearing that the Judge had carefully read all the papers prior, and he was fully familiar with the issues. As per Bruce Lederman, “[t]he issue that seemed to catch the Judge’s attention the most was whether it was rational to have a system which decides in advance that 7% of teachers will be ineffective, regardless of actual results. The Judge asked numerous questions about whether it was fair to use a bell curve,” whereby, when a bell curve is used to distribute teachers’ growth or value-added scores, there will always be a set of “ineffective” teachers, regardless of whether they are truly “ineffective.” This occurs not naturally, but via the statistical manipulation needed to fit all scores within a normal distribution, which spreads out the scores in order to make relative distinctions and categorizations (e.g., highly effective, effective, ineffective), the validity of which is highly uncertain (see, for example, a prior post here). Hence, “[t]he Judge pressed the lawyer representing New York’s Education Department very hard on this particular issue,” but the state’s lawyer did not (most likely because she could not) give the Judge a satisfactory explanation, justification, or rationale.
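
As a minimal sketch of why a forced, bell-curve-style distribution behaves this way, consider the following snippet; the scores, the 7% cutoff, and the variable names are hypothetical, used only to show that the bottom slice gets labeled “ineffective” regardless of how strong the underlying scores are.

```python
# Minimal sketch (hypothetical data): a forced distribution labels the bottom 7%
# "ineffective" no matter how well teachers actually performed.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=85, scale=3, size=1000)  # suppose every score clusters near 85/100

cutoff = np.percentile(scores, 7)                # the 7th percentile of the distribution
ineffective = scores < cutoff                    # bottom 7% by rank, not by any absolute standard
print(ineffective.sum())                         # roughly 70 teachers labeled "ineffective"
print(round(scores[ineffective].mean(), 1))      # even though their scores sit near 80/100
```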

For more information on the case, see here the video that I feel best captures the case, thanks to CBS news in Albany. For another video see here, compliments of NBC news in Albany. See also two additional articles, here and here, with the latter including the photo of Sheri and Bruce Lederman pasted below.


The Multiple Teacher Evaluation System(s) in New Mexico, from a Concerned New Mexico Parent

A “concerned New Mexico parent” who wrote a prior post for this blog here, wrote another for you all below, about the sheer number of different teacher evaluation systems, or variations thereof, now in place in his/her state of New Mexico. (S)he writes:

Readers of this blog are well aware of the limitations of VAMs for evaluating teachers. However, many readers may not be aware that there are actually many system variations used to evaluate teachers. In the state of New Mexico, for example, 217 different variations are used to evaluate the many and diverse types of teachers teaching in the state [and likely all other states].

But. Is there any evidence that they are valid? NO. Is there any evidence that they are equivalent? NO. Is there any evidence that this is fair? NO.

The New Mexico Public Education Department (NMPED) provides a framework for teacher evaluations, and the final teacher evaluation should be weighted as follows: Improved Student Achievement (50%), Teacher Observations (25%), and Multiple Measures (25%).

Every school district in New Mexico is required to submit a detailed evaluation plan of specifically what measures will be used to satisfy the overall NMPED 50-25-25 percentage framework, after which NMPED approves all plans.
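
As a minimal sketch of what “satisfying the 50-25-25 framework” amounts to, the snippet below checks one hypothetical district plan against the required category totals; all measure names and sub-weights are made up, which is precisely why so many different variations can pass this check.

```python
# Minimal sketch: checking a hypothetical district plan against the NMPED
# 50-25-25 framework. All measure names and sub-weights below are made up.
framework = {"student_achievement": 50, "observations": 25, "multiple_measures": 25}

district_plan = {
    "student_achievement": {"SBA_math_ELA": 35, "end_of_course_exams": 15},
    "observations":        {"observation_rubric": 25},
    "multiple_measures":   {"teacher_attendance": 10, "student_surveys": 15},
}

for category, required in framework.items():
    total = sum(district_plan[category].values())
    status = "OK" if total == required else "DOES NOT MATCH"
    print(f"{category}: {total}% (required {required}%) {status}")
```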

The exact details of any district’s educator effectiveness plan can be found on the NMTEACH website, as every public and charter school plan is posted here.

There are massive differences in how groups of teachers are graded across districts, however, which distorts most everything about the system(s), including the extent to which similar (and different) teachers might be similarly (and fairly) evaluated and assessed.

Even within districts, there are massive differences in how grade level (elementary, middle, high school) teachers are evaluated.

And, even something as seemingly simple as evaluating K-2 teachers requires 42 different variations in scoring.

Table 1 below shows, at the state level, the number of different scales used to calculate teacher effectiveness for each group of teachers and each grade level.

New Mexico divides all teachers into three categories — group A teachers have scores based on the statewide test (mathematics, English/language arts (ELA)), group B teachers (e.g. music or history) do not have a corresponding statewide test, and group C teachers teach grades K-2. Table 1 shows the number of scales used by New Mexico school districts for each teacher group. It is further broken down by grade-level. For example, as illustrated, there are 42 different scales used to evaluate Elementary-level Group A teachers in New Mexico. The column marked “Unique (one-offs)” indicates the number of scales that are completely unique for a given teacher group and grade-level. For example, as illustrated, there are 11 unique scales used to grade Group B High School teachers, and for each of these eleven scales, only one district, one grade-level, and one teacher group is evaluated within the entire state.

Based on the size of the school district, a unique scale may be grading as few as a dozen teachers! In addition, there are 217 scales used statewide, with 99 of these scales being unique (by teacher)!

Table 1: New Mexico Teacher Evaluation System(s)

Group | Grade | Scales Used | Unique (one-offs)
Group A (SBA-based; e.g., 5th grade English teacher) | All | 58 | 15
  | Elem | 42 | 10
  | MS | 37 | 2
  | HS | 37 | 3
Group B (non-SBA; e.g., Elem music teacher) | All | 117 | 56
  | Elem | 67 | 37
  | MS | 62 | 8
  | HS | 61 | 11
Group C (grades K-2) | All | 42 | 28
  | Elem | 42 | 28
TOTAL | | 217 variants | 99 one-offs

The table above highlights the spectacular absurdity of the New Mexico Teacher Evaluation System.

(The complete listings of all variants for the three groups are contained here (in Table A for Group A), here (in Table B for Group B), and here (in Table C for Group C). The abbreviations and notes for these tables are listed here (in Table D).)

By approving all of these different formulas, all things considered, NMPED is also making the following nonsensical claims:

NMPED Claim: The prototype 50-25-25 percentage split has some validity.

There is no evidence to support this division between student achievement measures, observation, and multiple measures at all. It simply represents what NMPED could politically “get away with” in terms of a formula. Why not 60-20-20 or 57-23-20 or 46-18-36, etcetera? The NMPED prototype scale has no proven validity whatsoever.

NMPED Claim: All 217 formulas are equivalent to evaluate teachers.

This claim by NMPED is absurd on its face. Is there any evidence that NMPED has cross-validated the tests? No. There is no evidence that any of these scales are valid or accurate measures of “teacher effectiveness,” and there is no evidence whatsoever that they are equivalent.

Further, if the formulas are equivalent (as NMPED claims), why is New Mexico wasting money on technology for administering SBA tests or End-of-Course exams? Why not use an NMPED-approved formula that includes tests like Discovery, MAPS, DIBELS, or Star that are already being used?

NMPED Claim: Teacher Attendance and Student Surveys are interchangeable.

According to the approved plans, many districts assign 10% to Teacher Attendance while other districts assign 10% to Student Surveys. Both variants have been approved by NMPED.

Mathematically (i.e., in terms of the proportions either is to be allotted), they appear to be interchangeable. If that is so, why is NMPED also specifically trying to enforce Teacher Attendance as an element of the evaluation scale? Why did Hanna Skandera proclaim to the press that this measure improved New Mexico education? (For typical news coverage on this topic, for example, see here.)

The use of teacher attendance appears to be motivated by union-busting rather than any mathematical rationale.

NMPED Claim: All observation methods are equivalent.

NMPED allows for three very different observation methods to be used for 40% of the final score. Each method is somewhat complicated and involves different observers.

There is no indication that NMPED has evaluated the reliability or validity of these three very different observation methods, or tested their results for equivalence. They simply assert that they are equivalent.

NMPED Claim: These formulas will be used to rate teachers.

These formulas are the worst kind of statistical jiggery-pokery (to use a newly current phrase). NMPED presents a seemingly rational, scientific number to the public using invalid and unvalidated mathematical manipulations and then determines teachers’ careers based on the completely bogus New Mexico teacher evaluation system(s).

Conclusion: Not only is the emperor naked, he has a closet containing 217 equivalent outfits at home!

Splits, Rotations, and Other Consequences of Teaching in a High-Stakes Environment in an Urban School

An Arizona teacher who teaches in a very urban, high-needs school writes about the realities of teaching in her school, under the pressures that come along with high-stakes accountability, and with a teacher workforce and an administration both operating in chaos. This is a must read, as she also talks about two unintended consequences of educational reform in her school about which I’ve never heard before: splits and rotations. Both seem to occur at all costs simply to stay afloat during “rough” times, but both also likely have deleterious effects on students in such schools, as well as on the teachers being held accountable for the students “they” teach.

She writes:

Last academic year (2012-2013) a new system for evaluating teachers was introduced into my school district. And it was rough. Teachers were dropping like flies. Some were stressed to the point of requiring medical leave. Others were labeled ineffective based on a couple classroom observations and were asked to leave. By mid-year, the school was down five teachers. And there were a handful of others who felt it was just a matter of time before they were labeled ineffective and asked to leave, too.

The situation became even worse when the long-term substitutes who had been brought in to cover those teacher-less classrooms began to leave also. Those students with no contracted teacher and no substitute began getting “split”. “Splitting” is what the administration of a school does in a desperate effort to put kids somewhere. And where the students go doesn’t seem to matter. A class roster is printed, and the first five students on the roster go to teacher A. The second five students go to teacher B, and so on. Grade-level isn’t even much of a consideration. Fourth graders get split to fifth grade classrooms. Sixth graders get split to 5th and 7th grade classrooms. And yes, even 7th and 8th graders get split to 5th grade classrooms. Was it difficult to have another five students in my class? Yes. Was it made more difficult that they weren’t even of the same grade level I was teaching? Yes. This went on for weeks…

And then the situation became even worse. As it became more apparent that the revolving door of long-term substitutes was out of control, the administration began “The Rotation.” “The Rotation” was a plan that used the contracted teachers (who remained!) as substitutes in those teacher-less classrooms. And so once or twice a week, I (and others) would get an email from the administration alerting me that it was my turn to substitute during prep time. Was it difficult to sacrifice 20-40 % of weekly prep time (that is used to do essential work like plan lessons, gather materials, grade, call parents, etc…) Yes. Was it difficult to teach in a classroom that had a different teacher, literally, every hour without coordinated lessons? Yes.

Despite this absurd scenario, in October 2013, I received a letter from my school district indicating how I fared in this inaugural year of the teacher evaluation system. It wasn’t good. Fifty percent of my performance label was based on school test scores (not on the test scores of my homeroom students). How well can students perform on tests when they don’t have a consistent teacher?

So when I think about accountability, I wonder now what it is I was actually held accountable for? An ailing, urban school? An ineffective leadership team who couldn’t keep a workforce together? Or was I just held accountable for not walking away from a no-win situation?

Coincidentally, this 2013-2014 academic year has, in many ways, mirrored the 2012-2013. The upside is that this year, only 10% of my evaluation is based on school-wide test scores (the other 40% will be my homeroom students’ test scores). This year, I have a fighting chance to receive a good label. One more year of an unfavorable performance label and the district will have to, by law, do something about me. Ironically, if it comes to that point, the district can replace me with a long-term substitute, who is not subject to the same evaluation system that I am. Moreover, that long-term substitute doesn’t have to hold a teaching certificate. Further, that long-term substitute will cost the district a lot less money in benefits (i.e. healthcare, retirement system contributions).

I should probably start looking for a job—maybe as a long-term substitute.

Another Teacher’s Contract Not Renewed — Another Teacher Speaks Out

As per a video linked within this article just released by ABC News in Tennessee, another teacher, this time coming from Knox County, Tennessee, speaks to her school board about her teaching contract not being renewed.


As per the article, teacher “Christina Graham taught kindergarten for three years at Copper Ridge Elementary. Last year she spoke out against SAT-10 testing for kindergartners. After her speech to the school board, Graham was called in to talk to her principal about being a representative for Knox County Schools. Graham said that was the only time she was ever pulled in to talk about an issue, and she was told then that it wasn’t a disciplinary meeting.”

The only reason for her non-renewal? “She no longer fit the vision for that school.”

Many parents and teachers are speaking out on her behalf, arguing her non-renewal is the district’s way of retaliating against a teacher who spoke out. See, also, the district’s official response here.

One School’s Legitimately, “New and Improved” Teacher Evaluation System: In TIME Magazine

In an article featured this week in TIME Magazine titled “How Do You Measure a Teacher’s Worth?” author Karen Hunter Quartz – research director at the UCLA Community School and a faculty member in the UCLA Graduate School of Education – describes the legitimately, “new and improved” teacher evaluation system co-constructed by teachers, valued as professionals, in Los Angeles.

Below are what I read as the highlights, and also some comments re: the highlights, but please do click here for the full read as this whole article is in line with what many who research teacher evaluation systems support (see, for example, Chapter 8 in my Rethinking Value-Added Models in Education…).

“For the past five years, teachers at the UCLA Community School, in Koreatown, have been mapping out their own process of evaluation based on multiple measures — and building both a new system and their faith in it…this school is the only one trying to create its own teacher evaluation infrastructure, building on the district’s groundwork…[with] the evaluation process [fully] owned by the teachers themselves.”

“Indeed, these teachers embrace their individual and collective responsibility to advance exemplary teaching practices and believe that collecting and using multiple measures of teaching practice will increase their professional knowledge and growth. They are tough critics of the measures under development, with a focus on making sure the measures help make teachers better at their craft.”

Their new and improved system is based on three different kinds of data — student surveys, observations, and portfolio assessments. The latter includes an assignment teachers gave students, how teachers taught this assignment, and samples of the student work produced during and after the assignment. Teachers’ portfolios were then scored by “educators trained at UCLA to assess teaching quality on several dimensions, including academic rigor and relevance. Teachers then completed a reflection on the scores they received, what they learned from the data, and how they planned to improve their practice.”

Hence, the “legitimate” part of the title of this post, in that this system is being externally vetted. As for the “new and improved” part of the title of this post, this comes from data indicating that “almost all teachers reported in a survey that they appreciated receiving multiple measures of their practice. Most teachers reported that the measures were a fair assessment of the quality of their teaching, and that the evaluation process helped them grow as educators.”

However, there was also “consensus that more information was needed to help them improve their scores. For example, some teachers wanted to know how to make assignments more relevant to students’ lives; others asked for more support reflecting on their observation transcripts.”

In the end, though, “[p]erhaps the most important accomplishment of this new system was that it restored teachers’ trust in the process of evaluation. Very few teachers trust that value-added measures — which are based on tests that are far removed from their daily work — can inform their improvement. This is an issue explored by researchers who are probing the unintended consequences of teacher accountability systems tied to value-added measures.”