States’ Teacher Evaluation Systems Now “All over the Map”

We are now just one year past the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states are no longer required to set up teacher-evaluation systems based in significant part on their students’ test scores. As per a recent article written in Education Week, accordingly, most states are still tinkering with their teacher evaluation systems, particularly regarding the student growth or value-added measures (VAMs) that were also formerly required to help states assess teachers’ purported impacts on students’ test scores over time.

“States now have a newfound flexibility to adjust their evaluation systems—and in doing so, they’re all over the map.” Likewise, though, “[a] number of states…have been moving away from [said] student growth [and value-added] measures in [teacher] evaluations,” said a friend, colleague, co-editor, and occasional writer on this blog (see, for example, here and here) Kimberly Kappler Hewitt (University of North Carolina at Greensboro).  She added that this is occurring “whether [this] means postponing [such measures’] inclusion, reducing their percentage in the evaluation breakdown, or eliminating those measures altogether.”

While states like Alabama, Iowa, and Ohio seem to still be moving forward with the attachment of students’ test scores to their teachers, other states seem to be going “back and forth” or putting a halt to all of this altogether (e.g., California). Alaska cut back the weight of the measure, while New Jersey tripled the weight to count for 30% of a teacher’s evaluation score, and then introduced a bill to reduce it back to 0%. In New York, teachers are still to receive a test-based evaluation score, but it is not to be tied to consequences, and the system is to be completely revamped by 2019. In Alabama, a bill that would have tied 25% of a teacher’s evaluation to his/her students’ ACT and ACT Aspire college-readiness tests has yet to see the light of day. In North Carolina, state leaders re-framed the use(s) of such measures to be more of an improvement tool (e.g., for professional development), but not “a hammer” to be used against schools or teachers. The same thing is happening in Oklahoma, although this state is not specifically mentioned in this piece.

While some might see all of this as good news (or rather better news than what we have seen for nearly the last decade, during which states, state departments of education, and practitioners have been grappling with and trying to make sense of student growth measures and VAMs), others are still (and likely forever will be) holding onto what seem to be some of the now unclenched promises attached to such stronger accountability measures.

Namely in this article, Daniel Weisberg of The New Teacher Project (TNTP) and author of the now famous “Widget Effect” report — about “Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness” that helped to “inspire” the last near-decade of these policy-based reforms — “doesn’t see states backing away” from using these measures given ESSA’s new flexibility. We “haven’t seen the clock turn back to 2009, and I don’t think [we]’re going to see that.”

Citation: Will, M. (2017). States are all over the map when it comes to how they’re looking to approach teacher-evaluation systems under ESSA. Education Week. Retrieved from http://www.edweek.org/ew/articles/2017/01/04/assessing-quality-of-teaching-staff-still-complex.html?intc=EW-QC17-TOC&_ga=1.138540723.1051944855.1481128421

Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers representing racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of their relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

Authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the same district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise, it’s also possibly prejudiced.

While study authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Miami-Dade, Florida’s Recent “Symbolic” and “Artificial” Teacher Evaluation Moves

Last spring, Eduardo Porter, writer of the Economic Scene column for The New York Times, wrote an excellent article, from an economics perspective, about that which is happening with our current obsession in educational policy with “Grading Teachers by the Test” (see also my prior post about this article here; although you should give the article a full read; it’s well worth it). In short, though, Porter wrote about what economists often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the great (e.g., high-stakes) value (mis)placed on any measure, and the distortion (i.e., in terms of artificial inflation or deflation, depending on the desired direction of the measure) that often-to-always comes about as a result.

Well, it’s happened again, this time in Miami-Dade, Florida, where the district’s teachers are saying it’s now “getting harder to get a good evaluation” (see the full article here). Apparently, teachers’ evaluation scores, from last year to this year, are being “dragged down,” primarily given teachers’ students’ performances on tests (as well as tests in subject areas they do not teach and of students whom they do not teach).

“In the weeks after teacher evaluations for the 2015-16 school year were distributed, Miami-Dade teachers flooded social media with questions and complaints. Teachers reported similar stories of being evaluated based on test scores in subjects they don’t teach and not being able to get a clear explanation from school administrators. In dozens of Facebook posts, they described feeling confused, frustrated and worried. Teachers risk losing their jobs if they get a series of low evaluations, and some stand to gain pay raises and a bonus of up to $10,000 if they get top marks.”

As per the figure also included in this article, see the illustration of how this is occurring below; that is, how it is becoming more difficult for teachers to get “good” overall evaluation scores but also, and more importantly, how it is becoming more common for districts to simply set different cut scores to artificially increase teachers’ overall evaluation scores.

[Figure not available: illustration from the Miami Herald article]

“Miami-Dade say the problems with the evaluation system have been exacerbated this year as the number of points needed to get the “highly effective” and “effective” ratings has continued to increase. While it took 85 points on a scale of 100 to be rated a highly effective teacher for the 2011-12 school year, for example, it now takes 90.4.”

This, as mentioned prior, is something called “artificial deflation,” whereby the quality of teaching is likely not changing nearly to the extent the data might illustrate it is. Rather, what is happening behind the scenes (e.g., the manipulation of cut scores) is giving the impression that the overall teacher evaluation system is in fact becoming better, more rigorous, more aligned with policymakers’ “higher standards,” etc.
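To make the cut-score point concrete, here is a minimal sketch in Python. The two cutoffs (85 and 90.4) come from the article quoted above; the teacher's score of 88, the function name, and the assumption that a rating is earned by meeting or exceeding the cutoff are all hypothetical, chosen only for illustration.

```python
# Points needed for a "highly effective" rating, as quoted from the article.
CUTOFF_2011_12 = 85.0
CUTOFF_NOW = 90.4


def is_highly_effective(points: float, cutoff: float) -> bool:
    """Assumed rule: a teacher earns the rating by meeting or exceeding the cutoff."""
    return points >= cutoff


# Hypothetical teacher whose points are identical in both years.
hypothetical_points = 88.0

rating_then = is_highly_effective(hypothetical_points, CUTOFF_2011_12)
rating_now = is_highly_effective(hypothetical_points, CUTOFF_NOW)

# Same teaching, same points, different label once the cutoff moves.
print(f"2011-12: {rating_then}, now: {rating_now}")
```

Under these assumptions, the same score earns the top rating under the older cutoff but not under the newer one, without any change in the teacher's measured performance.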

This is something in the educational policy arena that we also call “symbolic policies,” whereby nothing really instrumental or material is happening, and everything else is a facade, concealing a less pleasant or creditable reality that nothing, in fact, has changed.

Citation: Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. The Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html#storylink=cpy

New Mexico: Holding Teachers Accountable for Missing More Than 3 Days of Work

One state that seems to still be going strong after the passage of last January’s Every Student Succeeds Act (ESSA), via which the federal government removed (or significantly relaxed) its former mandates that all states adopt and use growth and value-added models (VAMs) to hold their teachers accountable (see here), is New Mexico.

This should come as no surprise to followers of this blog, especially those who have recognized the decline in posts on this blog post-ESSA (see a post about this decline here) and who have also noted that “New Mexico” is the state most often mentioned in said posts post-ESSA (see for example here, here, and here).

Well, apparently now (and post-revisions likely caused by the ongoing lawsuit regarding New Mexico’s teacher evaluation system, of which attendance is/was a part; see for example here, here, and here), teachers are also to be penalized if they miss more than three days of work.

As per a recent article in the Santa Fe New Mexican (here), and the title of this article, these new teacher attendance regulations, to be factored into teachers’ performance evaluations, have clearly caught schools “off guard.”

“The state has said that including attendance in performance reviews helps reduce teacher absences, which saves money for districts and increases students’ learning time.” In fact, effective this calendar year, 5 percent of a teacher’s evaluation is to be made up of teacher attendance. New Mexico Public Education Department spokesman Robert McEntyre clarified that “teachers can miss up to three days of work without being penalized.” He added that “Since attendance was first included in teacher evaluations, it’s estimated that New Mexico schools are collectively saving $3.5 million in costs for substitute teachers and adding 300,000 hours of instructional time back into [their] classrooms.”

“The new guidelines also do not dock teachers for absences covered by the federal Family and Medical Leave Act, or absences because of military duty, jury duty, bereavement, religious leave or professional development programs.” Reported to me only anecdotally (i.e., I could not find evidence of this elsewhere), the new guidelines might also dock teachers for engaging in professional development or overseeing extracurricular events such as debate team performances. If anybody has anything to add on this end, especially as evidence of this, please do comment below.

Value-Added for Kindergarten Teachers in Ecuador

Authors of a study that a colleague of mine recently sent me, published in The Quarterly Journal of Economics and titled “Teacher Quality and Learning Outcomes in Kindergarten,” (nearly randomly) assigned two cohorts of more than 24,000 kindergarten students to teachers to examine whether, indeed and once again, teacher behaviors are related to growth in students’ test scores over time (i.e., value-added).

To assess this, researchers administered 12 tests to the Kindergarteners (I know) at the beginning and end of the year in mathematics and language arts (although apparently the 12 posttests only took 30-40 minutes to complete, which is a content validity and coverage issue in and of itself, p. 1424). They also assessed something they called executive function (EF), which they defined as children’s inhibitory control, working memory, capacity to pay attention, and cognitive flexibility, all of which they argue to be related to “Volumetric measures of prefrontal cortex size [when] predict[ed]” (p. 1424). This, along with the fact that teachers’ IQs were also measured (using the Spanish-language version of the Wechsler Adult Intelligence Scale), speaks directly to the researchers’ background theory and approach (e.g., recall our world’s history with craniometry, aptly captured in one of my favorite books, Stephen J. Gould’s best-selling “The Mismeasure of Man”). Teachers were also observed using the Classroom Assessment Scoring System (CLASS), and parents were also solicited for their opinions about their children’s teachers (see other measures collected, pp. 1417-1418).

What should by now be some familiar names (e.g., Raj Chetty, Thomas Kane) served as collaborators on the study. Likewise, their works and the works of other likely familiar scholars and notorious value-added supporters (e.g., Eric Hanushek, Jonah Rockoff) are also cited throughout as evidence of “substantial research” (p. 1416) in support of value-added models (VAMs). Of course, this is unfortunate but important to point out in that this is an indicator of “researcher bias” in and of itself. For example, one of the authors’ findings really should come as no surprise: “Our results…complement estimates from [Thomas Kane’s Bill & Melinda Gates Measures of Effective Teaching] MET project” (p. 1419); although, the authors in a very interesting footnote (p. 1419) describe in more detail than I’ve seen elsewhere all of the weaknesses with the MET study in terms of its design, “substantial attrition,” “serious issue[s]” with contamination and compliance, and possibly/likely biased findings caused by self-selection given the extent to which teachers volunteered to be a part of the MET study.

Also very important to note is that this study took place in Ecuador. Apparently, “they,” including some of the key players in this area of research noted above, are moving their VAM-based efforts across international waters, perhaps in part given the Every Student Succeeds Act (ESSA) recently passed in the U.S., that we should all know by now dramatically curbed federal efforts akin to what is apparently going on now and being pushed here and in other developing countries (although the authors assert that Ecuador is a middle-income country, not a developing country, even though this categorization apparently only applies to the petroleum rich sections of the nation). Related, they assert that, “concerns about teacher quality are likely to be just as important in [other] developing countries” (p. 1416); hence, adopting VAMs in such countries might just be precisely what these countries need to “reform” their schools, as well.

Unfortunately, many big businesses and banks (e.g., the Inter-American Development Bank that funded this particular study) are becoming increasingly interested in investing in and solving these and other developing countries’ educational woes, as well, via measuring and holding teachers accountable for teacher-level value-added, regardless of the extent to which doing this has not worked in the U.S. to improve much of anything. Needless to say, many who are involved with these developing nation initiatives, including some of those mentioned above, are also financially benefitting by continuing to serve others their proverbial Kool-Aid.

Nonetheless, their findings:

  • First, they “estimate teacher (rather than classroom) effects of 0.09 on language and math” (p. 1434). That is, just less than 1/10th of a standard deviation, or just over a 3% move in the positive direction away from the mean.
  • Similarly, they “estimate classroom effects of 0.07 standard deviation on EF” (p. 1433). That is, 7/100ths of a standard deviation, or just under a 3% move in the positive direction away from the mean.
  • They found that “children assigned to teachers with a 1-standard deviation higher CLASS score have between 0.05 and 0.07 standard deviation higher end-of-year test scores” (p. 1437), or a 2-3% move in the positive direction away from the mean.
  • And they found that “that parents generally give higher scores to better teachers…parents are 15 percentage points more likely to classify a teacher who produces 1 standard deviation higher test scores as ‘‘very good’’ rather than ‘‘good’’ or lower” (p. 1442). This is quite an odd way of putting it, along with the assumption that the difference between “very good” and “good” is not arbitrary but empirically grounded, along with whatever reason a simple correlation was not more simply reported.
  • Their major finding is that “a 1 standard deviation increase in classroom quality, corrected for sampling error, results in 0.11 standard deviation higher test scores in both language and math” (p. 1433; see also other findings on pp. 1434-1447).
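For readers who want to check how effect sizes in standard deviation units translate into a “move away from the mean,” here is a rough sketch in Python. It assumes test scores are normally distributed and that the student in question starts exactly at the mean (the 50th percentile); neither assumption comes from the study, and the function names are mine.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_shift(effect_size_sd: float) -> float:
    """Percentile points gained by a student who starts at the mean,
    assuming normally distributed scores (an assumption, not the authors')."""
    return 100.0 * (normal_cdf(effect_size_sd) - 0.5)


# Effect sizes reported in the findings listed above.
for effect in (0.05, 0.07, 0.09, 0.11):
    shift = percentile_shift(effect)
    print(f"{effect:.2f} SD -> about {shift:.1f} percentile points")
```

Under these assumptions, every effect reported above works out to a shift of only a few percentile points for a student starting at the mean.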

Interestingly, the authors equate all of these effects to teacher or classroom “shocks,” although I’d hardly call them “shocks,” a term that inherently implies a large, unidirectional, and causal impact. Moreover, this also shows how the authors, as economists, still view this type of research (i.e., not correlational, even with close-to-random assignment, although they make a slight mention of this possibility on p. 1449).

Nonetheless, the authors conclude that in this article they effectively evidenced “that there are substantial differences [emphasis added] in the amount of learning that takes place in language, math, and executive function across kindergarten classrooms in Ecuador” (p. 1448). In addition, “These differences are associated with differences in teacher behaviors and practices,” as observed, and “that parents can generally tell better from worse teachers, but do not meaningfully alter their investments in children in response to random shocks [emphasis added] to teacher quality” (p. 1448).

Ultimately, they find that “value added is a useful summary measure of teacher quality in Ecuador” (p. 1448). Go figure…

They conclude “to date, no country in Latin America regularly calculates the value added of teachers,” yet “in virtually all countries in the region, decisions about tenure, in-service training, promotion, pay, and early retirement are taken with no regard for (and in most cases no knowledge about) a teacher’s effectiveness” (p. 1448). Also sound familiar??

“Value added is no silver bullet,” and indeed it is not as per much evidence now existent throughout the U.S., “but knowing which teachers produce more or less learning among equivalent students [is] an important step to designing policies to improve learning outcomes” (p. 1448), they also recognizably argue.

Citation: Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in Kindergarten. The Quarterly Journal of Economics, 131(3), 1415–1453. doi:10.1093/qje/qjw016 Retrieved from http://qje.oxfordjournals.org/content/131/3/1415.abstract

Bias in Teacher Observations, As Well

Following a post last month titled “New Empirical Evidence: Students’ ‘Persistent Economic Disadvantage’ More Likely to Bias Value-Added Estimates,” Matt Barnum — senior staff writer for The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown and covering America’s education system “in crisis” (see, also, a prior post about The 74 here) — followed up with a tweet via Twitter. He wrote: “Yes, though [bias caused by economic disadvantage] likely applies with equal or even more force to other measures of teacher quality, like observations.” I replied via Twitter that I disagreed with this statement in that I was unaware of research in support of his assertion, and Barnum sent me two articles to review thereafter.

I attempted to review both of these articles herein, although I quickly figured out that I had actually read and reviewed the first (2014) piece on this blog (see original post here, see also a 2014 Brookings Institution article summarizing this piece here). In short, in this study researchers found that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts, especially for (in)formative purposes. However, researchers found that observational bias also exists, akin to value-added bias, in that teachers who are non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students entering their classrooms with lower levels of prior achievement. Researchers concluded that because districts “do not have processes in place to address the possible biases in observational scores,” statistical adjustments might be made to offset said bias, as might external observers/raters be brought in to yield more “objective” observational assessments of teachers.

For the second study, and for this post here, I gave the article a more thorough read (you can find the full study, pre-publication, here). Using data from the Measures of Effective Teaching (MET) Project, in which random assignment was used (or more accurately attempted), researchers also explored the extent to which students enrolled in teachers’ classrooms influence classroom observational scores.

They found, primarily, that:

  1. “[T]he context in which teachers work – most notably, the incoming academic performance of their students – plays a critical role in determining teachers’ performance” as measured by teacher observations. More specifically, “ELA [English/language arts] teachers were more than twice as likely to be rated in the top performance quintile if [nearly randomly] assigned the highest achieving students compared with teachers assigned the lowest achieving students,” and “math teachers were more than 6 times as likely.” In addition, “approximately half of the teachers – 48% in ELA and 54% in math – were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores.”
  2. “[T]he intentional sorting of teachers to students has a significant influence on measured performance” as well. More specifically, results further suggest that “higher performing students [are, at least sometimes] endogenously sorted into the classes of higher performing teachers…Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would reveal more effective teacher performance, as measured by classroom observation scores, than may actually be true.”
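As a quick check on the top-two-quintile percentages quoted in the first finding, here is a small sketch in Python that converts them into “times as likely” ratios. The percentages are from the study as quoted; the variable and function names are hypothetical.

```python
# Shares of teachers rated in the top two performance quintiles, as quoted above,
# by the incoming achievement level of the students they were assigned.
top_two_quintiles = {
    "ELA": {"highest_achieving": 0.48, "lowest_achieving": 0.37},
    "math": {"highest_achieving": 0.54, "lowest_achieving": 0.18},
}


def relative_likelihood(shares: dict) -> float:
    """Ratio of the top-two-quintile share for teachers of the highest
    achieving students to that for teachers of the lowest achieving students."""
    return shares["highest_achieving"] / shares["lowest_achieving"]


ratios = {subject: relative_likelihood(s) for subject, s in top_two_quintiles.items()}
for subject, ratio in ratios.items():
    print(f"{subject}: {ratio:.1f} times as likely")
```

These ratios describe the top two quintiles only; the “more than twice” and “more than 6 times” figures quoted above refer to the top quintile alone, so the two sets of numbers are not directly comparable.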

So, the non-random assignment of teachers biases both the value-added and observational components written into America’s now “more objective” teacher evaluation systems, as (formerly) required of all states that were to comply with federal initiatives and incentives (e.g., Race to the Top). In addition, when those responsible for assigning students to classrooms (sub)consciously favor teachers with high prior observational scores, this exacerbates the issues. This is especially important when observational (and value-added) data are to be used for high-stakes accountability systems, in that the data yielded via both measurement systems may be less likely to reflect “true” teaching effectiveness due to “true” bias. “Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality,” and vice-versa.

Citation Study #1: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf

Citation Study #2: Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293-317. doi:10.3102/0162373715616249  Retrieved from http://static.politico.com/58/5f/f14b2b144846a9b3365b8f2b0897/study-of-classroom-observations-of-teachers.pdf

The “Value-Added” of Teacher Preparation Programs: New Research

The journal Economics of Education Review recently published a study titled “Teacher Quality Differences Between Teacher Preparation Programs: How Big? How Reliable? Which Programs Are Different?” The study was authored by researchers at the University of Texas at Austin, Duke University, and Tulane. The pre-publication version of this piece can be found here.

As the title implies, the purpose of the study was to “evaluate statistical methods for estimating teacher quality differences between TPPs [teacher preparation programs].” Needless to say, this research is particularly relevant, here, given “Sixteen US states have begun to hold teacher preparation programs (TPPs) accountable for teacher quality, where quality is estimated by teacher value-added to student test scores.” The federal government continues to support and advance these initiatives, as well (see, for example, here).

But this research study is also particularly important because, while researchers found that “[t]he most convincing estimates [of TPP quality] [came] from a value-added model where confidence intervals [were] widened” (that is, where the extent to which measurement errors were permitted was dramatically increased, and also widened further using statistical corrections), even when using these statistical techniques and accommodations they found that it was still “rarely possible to tell which TPPs, if any, [were] better or worse than average.”

They therefore concluded that “[t]he potential benefits of TPP accountability may be too small to balance the risk that a proliferation of noisy TPP estimates will encourage arbitrary and ineffective policy actions” in response. More specifically, and in their own words, they found that:

  1. Differences between TPPs. While most of [their] results suggest that real differences between TPPs exist, the differences [were] not large [or large enough to make or evidence the differentiation between programs as conceptualized and expected]. [Their] estimates var[ied] a bit with their statistical methods, but averaging across plausible methods [they] conclude[d] that between TPPs the heterogeneity [standard deviation (SD) was] about .03 in math and .02 in reading. That is, a 1 SD increase in TPP quality predict[ed] just [emphasis added] a [very small] .03 SD increase in student math scores and a [very small] .02 SD increase in student reading scores.
  2. Reliability of TPP estimates. Even if the [above-mentioned] differences between TPPs were large enough to be of policy interest, accountability could only work if TPP differences could be estimated reliably. And [their] results raise doubts that they can. Every plausible analysis that [they] conducted suggested that TPP estimates consist[ed] mostly of noise. In some analyses, TPP estimates appeared to be about 50% noise; in other analyses, they appeared to be as much as 80% or 90% noise…Even in large TPPs the estimates were mostly noise [although]…[i]t is plausible [although perhaps not probable]…that TPP estimates would be more reliable if [researchers] had more than one year of data…[although states smaller than the one in this study, Texas]…would require 5 years to accumulate the amount of data that [they used] from one year of data.
  3. Notably Different TPPs. Even if [they] focus[ed] on estimates from a single model, it remains hard to identify which TPPs differ from the average…[Again,] TPP differences are small and estimates of them are uncertain.
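To give a feel for what “mostly noise” means numerically, here is a back-of-the-envelope sketch in Python. It is not the authors’ method. It assumes that “X% noise” means X% of the variance in TPP estimates is estimation noise (so reliability is 1 minus that share), and it uses the standard Spearman-Brown projection, which assumes independent and equally noisy yearly estimates, to ask what averaging several years of data might buy. The function names and the choice of 5 years for the projection are mine.

```python
import math


def noise_sd(true_sd: float, noise_share: float) -> float:
    """SD of estimation noise implied by a true between-program SD and the
    assumed share of estimate variance that is noise."""
    reliability = 1.0 - noise_share
    noise_variance = true_sd ** 2 * noise_share / reliability
    return math.sqrt(noise_variance)


def reliability_after_years(one_year_reliability: float, years: int) -> float:
    """Spearman-Brown projection for an average of several yearly estimates,
    assuming the yearly noise is independent and equally large."""
    r = one_year_reliability
    return years * r / (1.0 + (years - 1) * r)


TRUE_SD_MATH = 0.03  # between-TPP SD in math, as reported in the study

for share in (0.5, 0.8, 0.9):
    sd = noise_sd(TRUE_SD_MATH, share)
    projected = reliability_after_years(1.0 - share, 5)
    print(f"noise share {share:.0%}: noise SD ~ {sd:.3f}, "
          f"5-year reliability ~ {projected:.2f}")
```

Under these assumptions, an 80% noise share implies estimation noise with an SD twice as large as the real differences between programs, which is one way to see why individual programs are so hard to tell apart.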

In conclusion, that researchers found “that there are only small teacher quality differences between TPPs” might seem surprising, but it really is not, given that the outcome variables they used to measure and assess TPP effects were students’ test scores. In short, students’ test scores are three times removed from the primary unit of analysis in studies like these. That is, (1) the TPP is to be measured by the effectiveness of its teacher graduates, and (2) teacher graduates are to be measured by their purported impacts on their students’ test scores, while (3) students’ test scores are only to be used for, and have only been validated for, measuring student learning and achievement. These test scores have not been validated to assess and measure, in the inverse, teachers’ causal impacts on said achievement or TPPs’ impacts on teachers and, in turn, on said achievement.

If this sounds confusing, it is, and it is also highly nonsensical. But this is also a reason why this is so difficult to do and, as evidenced in this study, improbable to do well or as theorized, in that TPP estimates are sensitive to error, insensitive given error, and, accordingly, highly uncertain and invalid.

Citation: von Hippel, P. T., Bellows, L., Osborne, C., Lincove, J. A., & Mills, N. (2016). Teacher quality differences between teacher preparation programs: How big? How reliable? Which programs are different? Economics of Education Review, 53, 31–45. doi:10.1016/j.econedurev.2016.05.002

VAM-Based Chaos Reigns in Florida, as Caused by State-Mandated Teacher Turnovers

The state of Florida is another one of our states to watch in that, even since the passage of the Every Student Succeeds Act (ESSA) last January, the state is still moving forward with using its VAMs for high-stakes accountability reform. See my most recent post about one district in Florida here, after the state ordered it to dismiss a good number of its teachers as per their low VAM scores when this school year started. After realizing this also caused or contributed to a teacher shortage in the district, the district scrambled to hire Kelly Services contracted substitute teachers to replace them, after which the district also put administrators back into the classroom to help alleviate the bad situation turned worse.

In a recent article published by The Ledger, teachers from the same Polk County School District (size = 100K students) added much-needed details and voiced concerns about all of this. Author Madison Fantozzi titled the article “Polk teachers: We are more than value-added model scores.”

Throughout this piece Fantozzi covers the story of Elizabeth Keep, a teacher who was “plucked from” the middle school in which she taught for 13 years, after which she was involuntarily placed at a district high school “just days before she was to report back to work.” She was one of 35 teachers moved from five schools in need of reform as based on schools’ value-added scores, although this was clearly done with no real concern or regard for the disruption this would cause these teachers, not to mention the students on the exiting and receiving ends. Likewise, and according to Keep, “If you asked students what they need, they wouldn’t say a teacher with a high VAM score…They need consistency and stability.” Apparently not. In Keep’s case, she “went from being the second most experienced person in [her middle school’s English] department…where she was department chair and oversaw the gifted program, to a [new, and never before] 10th- and 11th-grade English teacher” at the new high school to which she was moved.

As background, when Polk County School District officials presented turnaround plans to the State Board of Education last July, school board members “were most critical of their inability to move ‘unsatisfactory’ teachers out of the schools and ‘effective’ teachers in.” One board member, for example, expressed finding it “horrendous” that the district was “held hostage” by the extent to which the local union was protecting teachers from being moved as per their value-added scores. Referring to the union, and its interference in this “reform,” he accused the union of “shackling” the district and preventing its intended reforms. Note that the “effective” teachers who are to replace the “ineffective” ones can earn up to $7,500 in bonuses per year to help “turn around” the schools into which they enter.

Likewise, the state’s Commissioner of Education concurred, saying that she also “wanted ‘unsatisfactory’ teachers out and ‘highly effective’ teachers in,” again, with effectiveness being defined by teachers’ value-added or lack thereof, even though (1) the teachers targeted only had one or two years of the three years of value-added data required by state statute, and even though (2) the district’s senior director of assessment, accountability and evaluation noted that, in line with a plethora of other research findings, teachers being evaluated using the state’s VAM have a 51% chance of changing their scores from one year to the next. This lack of reliability, as we know it, should outright prevent any such moves in that, without some level of stability, valid inferences from which valid decisions are to be made cannot be drawn. It’s literally impossible.

Nonetheless, state board of education members “unanimously… threatened to take [all of the district’s poor performing schools] over or close them in 2017-18 if district officials [didn’t] do what [the Board said].” See also other tales of similar districts in the article available, again, here.

In Keep’s case, “her ‘unsatisfactory’ VAM score [that caused the district to move her, as] paired with her ‘highly effective’ in-class observations by her administrators brought her overall district evaluation to ‘effective’…[although she also notes that]…her VAM scores fluctuate because the state has created a moving target.” Regardless, Keep was notified “five days before teachers were due back to their assigned schools Aug. 8 [after which she was] told she had to report to a new school with a different start time that [also] disrupted her 13-year routine and family that shares one car.”

VAM-based chaos reigns, especially in Florida.

New Mexico Is “At It Again”

“A Concerned New Mexico Parent” sent me yet another blog entry for you all to stay apprised of the ongoing “situation” in New Mexico and the continuous escapades of the New Mexico Public Education Department (NMPED). See “A Concerned New Mexico Parent’s” prior posts here, here, and here, but in this one (s)he writes what follows:

Well, the NMPED is at it again.

They just released the teacher evaluation results for the 2015-2016 school year. And, the report and media press releases are something.

Readers of this blog are familiar with my earlier documentation of the myriad varieties of scoring formulas used by New Mexico to evaluate its teachers. If I recall, I found something like 200 variations in scoring formulas [see his/her prior post on this here with an actual variation count at n=217].

However, a recent article published in the Albuquerque Journal indicates that, now according to the NMPED, “only three types of test scores are [being] used in the calculation: Partnership for Assessment of Readiness for College and Careers [PARCC], end-of-course exams, and the [state’s new] Istation literacy test.” [Recall from another article released last January that New Mexico’s Secretary of Education Hanna Skandera is also the head of the governing board for the PARCC test].

Further, the Albuquerque Journal article author reports that the “PED also altered the way it classifies teachers, dropping from 107 options to three. Previously, the system incorporated many combinations of criteria such as a teacher’s years in the classroom and the type of standardized test they administer.”

The new state-wide evaluation plan is also available in more detail here. I should also add, though, that there has been no published notification of the radical changes in this plan. It was simply and quietly posted on NMPED’s public website.

Important to note, though, is that for Group B teachers (all levels), the many variations documented previously have all been replaced by end-of-course (EOC) exams. Also note that for Group A teachers (all levels) the percentage assigned to the PARCC test has been reduced from 50% to 35%. (Oh, how the mighty have fallen …). The remaining 15% of the Group A score is to be composed of EOC exam scores.

There are only two small problems with this NMPED simplification.

First, in many districts, no EOC exams were given to Group B teachers in the 2015-2016 school year, and none were given in the previous year either. Any EOC scores that might exist were from a solitary administration of EOC exams three years previously.

Second, for Group A teachers whose scores formerly relied solely on the PARCC test for 50% of their score, no EOC exams were ever given.

Thus, NMPED has replaced their policy of evaluating teachers on the basis of students they don’t teach with this new policy of evaluating teachers on the basis of tests they never administered!

Well done, NMPED (not…)

Luckily, NMPED still cannot make any consequential decisions based on these data, again, until NMPED proves to the court that the consequential decisions that they would still very much like to make (e.g., employment, advancement and licensure decisions) are backed by research evidence. I know, interesting concept…

A Case of VAM-Based Chaos in Florida

Within a recent post, I wrote about my recent “silence,” explaining that, since the federal government’s (January 1, 2016) passage of the Every Student Succeeds Act (ESSA), which no longer requires teachers to be evaluated by their students’ test scores using VAMs (see prior posts on this here and here), “crazy” VAM-related events have apparently subsided. While I noted in the post that this did not mean that certain states and districts are not still drinking (and overdosing on) the VAM-based Kool-Aid, what I did not note is that I get many of the stories I cover on this blog via Google Alerts. This is where I have noticed a significant decline in VAM-related stories. Clearly, however, the news outlets often covered via Google Alerts don’t include district-level stories, so to cover these we must continue to rely on our followers (i.e., teachers, administrators, parents, students, school board members, etc.) to keep the stories coming.

Coincidentally, Billy Townsend, who is running for a school board seat in Polk County, Florida (district size = 100K students), sent me one such story. As an edublogger himself, he actually sent me three blog posts (see post #1, post #2, and post #3 listed by order of relevance) capturing what is happening in his district, again, as situated under the state of Florida’s ongoing, VAM-based nonsense. I’ve summarized the situation below as based on his three posts.

In short, the state ordered the district to dismiss a good number of its teachers as per their VAM scores when this school year started. “[T]his has been Florida’s [educational reform] model for nearly 20 years [actually since 1979, so 35 years]: Choose. Test. Punish. Stigmatize. Segregate. Turnover.” Because the district already had a massive teacher shortage as well, however, these teachers were replaced with Kelly Services contracted substitute teachers. Thereafter, district leaders decided that this was not “a good thing,” and they decided that administrators and “coaches” would temporarily replace the substitute teachers to make the situation “better.” While, of course, the substitutes’ replacements did not have VAM scores themselves, they were nonetheless deemed fit to teach and clearly more fit to teach than the teachers who were terminated as based on their VAM scores.

One teacher, writing anonymously about her terminated colleagues, said the following about one of the district’s “best” teachers: “She knew our kids well. She understood how to reach them, how to talk to them. Because she ‘looked like them’ and was from their neighborhood, she [also] had credibility with the students and parents. She was professional, always did what was best for students. She had coached several different sports teams over the past decade. Her VAM score just wasn’t good enough.”

Consequently, this has turned into a “chaotic reality for real kids and adults” throughout the county’s schools, and the district and state apparently acknowledged this by “threaten[ing] all of [the district’s] teachers with some sort of ethics violation if they talk about what’s happening” throughout the district. While “[t]he repetition of stories that sound just like this from [the district’s] schools is numbing and heartbreaking at the same time,” the state, district, and school board, apparently, “has no interest” in such stories.

Put simply, and put well as this aligns with our philosophy here: “Let’s [all] consider what [all of this] really means: [Florida] legislators do not want to hear from you if you are communicating a real experience from your life at a school — whether you are a teacher, parent, or student. Your experience doesn’t matter. Only your test score.”

Isn’t that the unfortunate truth. Hence, and with reference to the introduction above, please do keep these relatively more invisible stories coming so that we can share them with the nation and make such stories more visible and accessible. VAMs, again, are alive and well, just perhaps in more undisclosed ways, like within districts as is the case here.