Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014, and just one day ago (October 10, 2017) the case reached its final settlement in federal court. Click here to read the “Settlement and Full and Final Release Agreement.” In short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state, or in this case district, shall deprive any person of life, liberty, or property without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here p. 13). This was confirmed by one of the plaintiffs’ expert witnesses, who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in attorney’s fees and other legal expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this outcome is an unprecedented development in VAM litigation. Teachers and others being evaluated with the EVAAS, or with any other VAM that is similarly “unverifiable,” should take note, at minimum.

Call for Stories for An Education/Test-Based Musical

Anne Heintz is a teacher and writer whom I met a couple of months ago. She is working on a musical about teachers in high-stakes testing environments, including environments in which high-stakes consequences (e.g., merit pay, tenure decisions, contract renewal, termination) are being attached to value-added model (VAM) output.
Accordingly, she is gathering stories from educational practitioners including teachers, principals, and staff, who are new to the profession as well as seasoned professionals, about their experiences working in such environments.
While it may not be the easiest sell to a producer, this topic has it all: absurdism, humor, and the pathos of seeing good people fighting for autonomy, caught in a storm of externally imposed aims.
Hence, if you have an especially good anecdote or an insider perspective, or if you want to help her turn these stories into art so that she and her creative colleagues can make these stories known to others, please contact her directly via email at:
She is trying to get “it” as right as she can. Some of you out there may be the key!
Thank you in advance!

Much of the Same in Louisiana

As I wrote in a recent post: “…it seems that the residual effects of the federal government’s former [teacher evaluation reform policies and] efforts are still dominating states’ actions with regards to educational accountability.” In other words, many states are still moving forward with their continued reliance on value-added models (VAMs) for increased teacher accountability purposes, regardless of the passage of the Every Student Succeeds Act (ESSA).

Related, three articles were recently published online (here, here, and here) about how, in Louisiana, the state’s old and controversial teacher evaluation system based on VAMs is resuming after a four-year hiatus. It was put on hold while the state was in the process of adopting the Common Core.

This, of course, has serious implications for the approximately 50,000 teachers throughout the state, or more precisely the proportion of them who are now VAM-eligible, believed to be around 15,000 (i.e., approximately 30%, which is in line with trends in other states).

The state’s system has been partly adjusted, however: whereas 50% of a teacher’s evaluation was formerly to be based on growth in student achievement over time using VAMs, the new system reduces this percentage to 35%, while teachers of mathematics, English, science, and social studies are now also to be held accountable using VAMs. The other 50% of these teachers’ evaluation scores is to be based on observations, with the remaining 15% based on student learning targets (a.k.a. student learning objectives (SLOs)).

Evaluation system output is to be used to keep teachers from earning tenure, or to cause teachers to lose the tenure they might already have.

Among other controversies and issues of contention noted in these articles (see again here, here, and here), one of note (highlighted here) is that, “even after seven years,” the state is still “unable to truly explain or provide the actual mathematical calculation or formula” used to link test scores with teacher ratings. “This obviously lends to the distrust of the entire initiative among the education community.”

A spokeswoman for the state, however, countered the transparency charge noting that the VAM formula has been on the state’s department of education website, “and updated annually, since it began in 2012.” She did not provide a comment about how to adequately explain the model, perhaps because she could not either.

Just because it might be available does not mean it is understandable and, accordingly, usable. This we have come to know from administrators, teachers, and yes, state-level administrators in charge of these models (and their adoption and implementation) for years. This is, indeed, one of the most common criticisms of VAMs around.

“Virginia SGP” Overruled

You might recall from a post I released approximately 1.5 years ago the story of how a person who self-identifies as “Virginia SGP,” now also known to be Brian Davison, a parent of two public school students in the affluent Loudoun, Virginia area (hereafter referred to as Virginia SGP), sued the state of Virginia in an attempt to force the release of teachers’ student growth percentile (SGP) data for all teachers across the state.

More specifically, Virginia SGP “pressed for the data’s release because he thinks parents have a right to know how their children’s teachers are performing, information about public employees that exists but has so far been hidden. He also want[ed] to expose what he sa[id was] Virginia’s broken promise to begin [to use] the data to evaluate how effective the state’s teachers are.” The “teacher data should be out there,” especially if taxpayers are paying for it.

In January of 2016, a Richmond, Virginia judge ruled in Virginia SGP’s favor. The following April, a Richmond Circuit Court judge ruled that the Virginia Department of Education was to also release Loudoun County Public Schools’ SGP scores by school and by teacher, including teachers’ identifying information. Accordingly, the judge noted that the department of education and the Loudoun school system failed to “meet the burden of proof to establish an exemption under Virginia’s Freedom of Information Act [FOIA]” preventing the release of teachers’ identifiable information (i.e., beyond teachers’ SGP data). The court also ordered VDOE to pay Davison $35,000 to cover his attorney’s fees and other costs.

As per an article published last week, the Virginia Supreme Court overruled this former ruling, noting that the department of education did not have to provide teachers’ identifiable information along with teachers’ SGP data, after all.

See more details in the actual article here, but ultimately the Virginia Supreme Court concluded that the Richmond Circuit Court “erred in ordering the production of these documents containing teachers’ identifiable information.” The court added that “it was [an] error for the circuit court to order that the School Board share in [Virginia SGP’s] attorney’s fees and costs,” pushing that decision (i.e., the decision regarding how much to pay, if anything at all, in legal fees) back down to the circuit court.

Virginia SGP plans to ask for a rehearing of this ruling. See also his comments on this ruling here.

One Florida District Kisses VAMs Goodbye

I recently wrote about how, in Louisiana, the state is reverting back to its value-added model (VAM)-based teacher accountability system after its four-year hiatus (see here). The post, titled “Much of the Same in Louisiana,” likely did not come as a surprise to teachers there in that the state (like most other states in the sunbelt, excluding California) has a common and also perpetual infatuation with such systems, whether they be based on student-level or teacher-level accountability.

Well, at least one school district in Florida is kissing the state’s six-year infatuation with its VAM-based teacher accountability system goodbye. I could have invoked a much more colorful metaphor here, but let’s just go with something along the lines of a sophomoric love affair.

According to a recent article in the Tampa Bay Times (see here), “[u]sing new authority from the [state] Legislature, the Citrus County School Board became the first in the state to stop using VAM, citing its unfairness and opaqueness…[with this]…decision…expected to prompt other boards to action.”

That’s all the article has to offer on the topic, but let’s all hope others, in Florida and beyond, do follow.

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest, and I thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written in Audrey’s prior post) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful, the points are colored based on the quintile of the simulated teachers’ value-added scores (i.e., the lowest 20%, the next 20%, etc.).

In this figure you can see, by looking at the blue “least squares line,” that “on average,” as a simulated teacher’s value-added estimate increases, the average of that teacher’s observational estimate also increases. However, there is a lot of variability (or scatter) around the line. Given this variability, we can make statements about averages, such as that teachers in the top 20% for VAM scores will, on average, tend to have higher observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in these scores. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R-squared). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to make statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) on VAM have a very large spread in their observational scores, with 95% of the scores in the top quintile falling between the 7th and 98th percentiles for their observational scores. So, if we observe a VAM score for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than that their observational score is likely to be in the top 90%. Similarly, if we observe a VAM score in the bottom 20%, we cannot say much more than that their observational score is likely to be somewhere in the bottom 90%. That is not saying a lot, in terms of precision or in terms of practice.
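For readers who would like to reproduce a figure like this themselves, below is a minimal sketch in R of simulation (1). This is not Laura’s actual code; the sample size, random seed, and plotting details are illustrative assumptions.

library(MASS)   # for mvrnorm(), to draw two correlated normal indicators
set.seed(123)
n     <- 1000                                     # number of simulated teachers (illustrative)
Sigma <- matrix(c(1, 0.28, 0.28, 1), nrow = 2)    # true correlation of 0.28
xy    <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
vam   <- xy[, 1]                                  # simulated value-added estimates
obs   <- xy[, 2]                                  # simulated observational estimates

# color points by VAM quintile (lowest 20%, next 20%, etc.)
quintile <- cut(vam, quantile(vam, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE, labels = 1:5)

plot(vam, obs, col = as.numeric(quintile), pch = 19,
     xlab = "Value-added estimate", ylab = "Observational estimate")
abline(lm(obs ~ vam), col = "blue", lwd = 2)      # the "least squares line"

cor(vam, obs)      # sample correlation, roughly 0.28
cor(vam, obs)^2    # R-squared, roughly 0.08 (about 8% of the variance)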

I ran the second simulation to test how bias that impacts only a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I again simulated two values for a population of teachers: a teacher’s value-added score and a teacher’s observational score. Then I inserted a group of teachers (as Audrey described) who represent 20% of the population and teach a disproportionate number of students who come from relatively lower socioeconomic, higher racial minority, etc. backgrounds. I assumed that this group is measured with negative bias (by one standard deviation, on average) on both indicators and has a moderate correlation between indicators of r = 0.50, while for the other 80% of the population the two indicators are assumed to be uncorrelated (i.e., a correlation of zero).

What you can see is that if there is bias that impacts only a certain group of teachers on the two indicators, it is possible for this bias alone to produce an observed correlation overall. In other words, a correlation present in just one group of teachers (i.e., in this case the teachers scoring lowest on both their value-added and observational indicators) can be relatively stronger than the “weak” correlation observed on average or overall, while still being what drives that overall correlation.
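Below is a minimal sketch in R of this second simulation, under the assumptions just described (an 80/20 mixture, a one-standard-deviation downward shift for the biased subgroup, and a within-subgroup correlation of r = 0.50). Again, this is illustrative code, not Laura’s.

library(MASS)
set.seed(456)
n_unbiased <- 800
n_biased   <- 200

# 80% of teachers: indicators uncorrelated, no bias
unbiased <- mvrnorm(n_unbiased, mu = c(0, 0),
                    Sigma = matrix(c(1, 0, 0, 1), 2))
# 20% of teachers: shifted down 1 SD on both indicators, within-group r = 0.50
biased   <- mvrnorm(n_biased, mu = c(-1, -1),
                    Sigma = matrix(c(1, 0.5, 0.5, 1), 2))

vam <- c(unbiased[, 1], biased[, 1])
obs <- c(unbiased[, 2], biased[, 2])
grp <- rep(c("unbiased", "biased"), c(n_unbiased, n_biased))

cor(vam, obs)                                       # overall correlation is no longer ~0
cor(vam[grp == "biased"], obs[grp == "biased"])     # ~0.50 within the biased subgroup

plot(vam, obs, col = ifelse(grp == "biased", "red", "grey50"), pch = 19,
     xlab = "Value-added estimate", ylab = "Observational estimate")
abline(lm(obs ~ vam), col = "blue", lwd = 2)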

Another possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quintiles on VAM have a different linear relationship with the observational score. For example, in the plot there is not a constant slope; rather, teachers in the first quintile on VAM are assumed to have a correlation of r = 0.50 with observational scores, teachers in the second quintile are assumed to have a correlation of r = 0.20, and the other quintiles are assumed to be uncorrelated. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e., a very small chance that a correlation of this size would be observed by random chance alone if the true correlation were zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant overall correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in their mean observational score depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not informative, as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson correlations (the standard correlation coefficient) measure the overall strength of linear relationships between X and Y, but if X and Y have a non-linear relationship (as illustrated above), this statistic can be very misleading.
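Here is a minimal sketch in R of this third simulation, under the stated assumptions (within-quintile correlations of 0.50, 0.20, and zero). The construction below is one plausible way to set this up and is not Laura’s actual code; the sample size and seed are illustrative.

set.seed(789)
n   <- 1000
vam <- rnorm(n)
q   <- cut(vam, quantile(vam, probs = seq(0, 1, 0.2)),
           include.lowest = TRUE, labels = FALSE)   # VAM quintile, 1 = lowest

r_by_q <- c(0.50, 0.20, 0, 0, 0)   # assumed within-quintile correlations (illustrative)

obs <- numeric(n)
for (k in 1:5) {
  idx      <- which(q == k)
  r        <- r_by_q[k]
  # build obs so its within-quintile correlation with vam is approximately r
  obs[idx] <- r * as.numeric(scale(vam[idx])) + sqrt(1 - r^2) * rnorm(length(idx))
}

cor.test(vam, obs)                                      # small overall correlation, tiny p-value
sapply(1:5, function(k) cor(vam[q == k], obs[q == k]))  # within-quintile correlations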

Note also that for all of these simulations very small p-values are observed (e.g., p-values < 0.0000001), which, again, means these correlations are statistically significant, or that the probability of observing correlations this large by chance, if the true correlation were zero, is nearly 0%. What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between ‘Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).
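The gap between statistical and practical significance is easy to see for yourself; the short snippet below (with a purely illustrative sample size) shows how even a correlation of about 0.10 comes with an essentially zero p-value once the sample is large enough.

set.seed(101)
n <- 10000
x <- rnorm(n)
y <- 0.10 * x + sqrt(1 - 0.10^2) * rnorm(n)   # true correlation of about 0.10

ct <- cor.test(x, y)
ct$estimate               # roughly 0.10 -- a "very weak" correlation
ct$p.value                # effectively 0 -- "statistically significant" nonetheless
unname(ct$estimate)^2     # about 1% of variance explained -- little practical value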

At the end of the day r = 0.28 is still a “weak” correlation. In addition, it might be “weak,” on average, but much stronger and statistically and practically significant for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above) typically teaching the highest needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Although, whether the systems have actually been reformed is also up for debate, in that states are still using many of the same observational systems they were using prior (i.e., not the “binary checklists” exaggerated in the original as well as this report, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output were combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft, an Assistant Professor of Education & Economics at Brown University, and Allison F. Gilmour, an Assistant Professor at Temple University (and former doctoral student at Vanderbilt University), revisited the original report. Just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play, here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently debated whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated as highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see Figure below, as per Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are also (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine, released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record, about how this is also occurring in response to the “Widget Effect” report and the educational policies that followed).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of the study come from one district, their findings are similar to those evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and with using survey methods to keep these systems in check, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also of issue in this regard, in terms of the 1% issue, are (a) that the time and effort it takes supervisors to assist (or move to dismiss) teachers after rating them low sometimes makes assigning low ratings not seem worth it; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.

References:

Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from http://files.eric.ed.gov/fulltext/ED533382.pdf

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect. Education Digest, 75(2), 31–35.

A “Next Generation” Vision for School, Teacher, and Student Accountability

Within a series of prior posts (see, for example, here and here), I have written about what the Every Student Succeeds Act (ESSA), passed in December of 2015, means for the U.S., or more specifically states’ school and teacher evaluation systems as per the federal government’s prior mandates requiring their use of growth and value-added models (VAMs).

Related, states were recently (this past May) required to submit to the federal government their revised school and teacher evaluation plans, post-ESSA, detailing how they have changed, or not. I have a doctoral student currently gathering updated teacher evaluation data, state by state, and our preliminary findings indicate that “things” have not (yet) changed much post-ESSA, at least at the teacher level of focus in this study and except in a few states (e.g., Connecticut, Oklahoma). Regardless, states still have the liberty to change what they do on both ends (i.e., school and teacher accountability).

Recently, a colleague shared with me a study titled “Next Generation Accountability: A Vision for School Improvement Under ESSA” that warrants coverage here, in hopes that states are still “out there” trying to reform their school and teacher evaluation systems, of course, for the better. While the document was drafted by folks coming from the aforementioned state of Oklahoma, who are also affiliated with the Learning Policy Institute, it is important to note that the document was also vetted by some “heavy hitters” in this line of research including, but not limited to, David C. Berliner (Arizona State University), Peter W. Cookson Jr. (American Institutes for Research (AIR)), Linda Darling-Hammond (Stanford University), and William A. Firestone (Rutgers University).

As per ESSA, states are to have increased opportunities “to develop innovative strategies for advancing equity, measuring success, and developing cycles of continuous improvement” while using “multiple measures to assess school and student performance” (p. iii). Likewise, the authors of this report state that “A broader spectrum of indicators, going well beyond a summary of annual test performance, seems necessary to account transparently for performance and assign responsibility for improvement.”

Here are some of their more specific recommendations that I found of value for blog followers:

  • The continued use of a single composite indicator to reduce and then sort teachers or schools by their overall effectiveness or performance (e.g., using teacher “effectiveness” categories or school A–F letter grades) is myopic, to say the least. This is because doing this (a) misses all that truly “matters,” including  multidimensional concepts and (non)cognitive competencies we want students to know and to be able to do, not captured by large-scale tests; and (b) inhibits the usefulness of what may be informative, stand-alone data (i.e., as taken from “multiple measures” individually) once these data are reduced and then collapsed so that they can be used for hierarchical categorizations and rankings. This also (c) very much trivializes the multiple causes of low achievement, also of importance and in much greater need of attention.
  • Accordingly, “Next Generation” accountability systems should include “a broad palette of functionally significant indicators to replace [such] single composite indicators [as this] will likely be regarded as informational rather than controlling, thereby motivating stakeholders to action” (p. ix). Stakeholders should be defined in the following terms…
  • “Next Generation” accountability systems should incorporate principles of “shared accountability,” whereby educational responsibility and accountability should be “distributed across system components and not foisted upon any one group of actors or stakeholders” (p. ix). “[E]xerting pressure on stakeholders who do not have direct control over [complex educational] elements is inappropriate and worse, harmful” (p. ix). Accordingly, the goal of “shared accountability” is to “create an accountability environment in which all participants [including governmental organizations] recognize their obligations and commitments in relation to each other” (p. ix) and their collective educational goals.
  • To facilitate this, “Next Generation” information systems should be designed and implemented in order to serve the “dual reporting needs of compliance with federal mandates and the particular improvement needs of a state’s schools,” while also addressing “the different information needs of state, district, school site leadership, teachers, and parents” (p. ix). Data may include, at minimum, data on school resources, processes, outcomes, and other nuanced indicators, and this information must be made transparent and accessible in order for all types of data users to be responsive, holistically and individually (e.g., at school or classroom levels). The formative functions of such “Next Generation” informational systems accordingly take priority, at least initially, until informational data can be used to “identify and transform schools in catastrophic failure” (p. ix).
  • Related, all test- or other educational measurement-related components of states’ “Next Generation” statutes and policies should adhere to the Standards for Educational and Psychological Testing, and more specifically their definitions of reliability, validity, bias, fairness, and the like. Statutes and policies should also be written “in the least restrictive and prescriptive terms possible to allow for [continuous] corrective action and improvement” (p. x).
  • Finally, “Next Generation” accountability systems should adhere to the following five essentials: “(a) state, district, and school leaders must create a system-wide culture grounded in “learning to improve;” (b) learning to improve using [the aforementioned informational systems also] necessitates the [overall] development of [students’] strong pedagogical data-literacy skills; (c) resources in addition to funding—including time, access to expertise, and collaborative opportunities—should be prioritized for sustaining these ongoing improvement efforts; (d) there must be a coherent structure of state-level support for learning to improve, including the development of a strong Longitudinal Data System (LDS) infrastructure; and (e) educator labor market policy in some states may need adjustment to support the above elements” (p. x).

To read more, please access the full report here.

In sum, “Next Generation” accountability systems aim at “a loftier goal—universal college and career readiness—a goal that current accountability systems were not designed to achieve. To reach this higher level, next generation accountability must embrace a wider vision, distribute trustworthy performance information, and build support infrastructure, while eliciting the assent, support, and enthusiasm of citizens and educators” (p. vii).

As briefly noted prior, “a few states have been working to put more supportive, humane accountability systems in place, but others remain stuck in a compliance mindset that undermines their ability to design effective accountability systems” (p. vii). Perhaps (or perhaps likely) this is because, for the past decade or so, states invested so much time, effort, and money into “reforming” their prior teacher evaluation systems as formerly required by the federal government. This included investments in states’ growth models or VAMs, onto which many/most states seem to be holding firm.

Hence, while it seems that the residual effects of the federal government’s former efforts are still dominating states’ actions with regards to educational accountability, hopefully some states can at least begin to lead the way to what will likely yield the educational reform…still desired…

Observational Systems: Correlations with Value-Added and Bias

A colleague recently sent me a report released in November of 2016 by the Institute of Education Sciences (IES) division of the U.S. Department of Education that should be of interest to blog followers. The study is about “The content, predictive power, and potential bias in five widely used teacher observation instruments” and is authored by affiliates of Mathematica Policy Research.

Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies, researchers examined five widely used teacher observation instruments. Instruments included the more generally popular Classroom Assessment Scoring System (CLASS) and Danielson Framework for Teaching (of general interest in this post), as well as the more subject-specific instruments including the Protocol for Language Arts Teaching Observations (PLATO), the Mathematical Quality of Instruction (MQI), and the UTeach Observational Protocol (UTOP) for science and mathematics teachers.

Researchers examined these instruments in terms of (1) what they measure (which is not of general interest in this post), but also (2) the relationships of observational output to teachers’ impacts on growth in student learning over time (as measured using a standard value-added model (VAM)), and (3) whether observational output are biased by the characteristics of the students non-randomly (or in this study randomly) assigned to teachers’ classrooms.

As per #2 above, researchers found that the instructional practices captured across these instruments modestly [emphasis added] correlate with teachers’ value-added scores, with an adjusted (and likely artificially inflated; see Note 1 below) correlation coefficient between observational and value-added indicators of 0.13 ≤ r ≤ 0.28 (see also Table 4, p. 10). As per the higher, adjusted r (emphasis added; see also Note 1 below), they found that these instruments’ classroom management dimensions correlated most strongly (r = 0.28) with teachers’ value-added.

Related, also at issue here is that such correlations are not “modest,” but rather “weak” to “very weak” (see Note 2 below). While all correlation coefficients were statistically significant, this is much more likely due to the sample size used in this study than to the actual or practical magnitude of these results. In sum, this hardly supports the overall conclusion that “observation scores predict teachers’ value-added scores” (p. 11); although, it should also be noted that this summary statement, in and of itself, suggests that the value-added score is the indicator around which all other “less objective” indicators are to revolve.

As per #3 above, researchers found that students randomly assigned to teachers’ classrooms (as per the MET data, although there were some noncompliance issues with the random assignment employed in the MET studies) do bias teachers’ observational scores, for better or worse, and more often in English language arts than in mathematics. More specifically, they found that for the Danielson Framework and CLASS (the two more generalized instruments examined in this study, also of main interest in this post), teachers with relatively more racial/ethnic minority and lower-achieving students (in that order, although these are correlated themselves) tended to receive lower observation scores. Bias was observed more often for the Danielson Framework than for the CLASS, but it was observed in both cases. An “alternative explanation [may be] that teachers are providing less-effective instruction to non-White or low-achieving students” (p. 14).

Notwithstanding, and in sum, in classrooms in which students were randomly assigned to teachers, teachers’ observational scores were biased by students’ group characteristics, which also means that bias is likely even more prevalent in classrooms to which students are non-randomly assigned (which is common practice). These findings are also akin to those found elsewhere (see, for example, two similar studies here), and they were also evidenced in mathematics, which may likewise be due to the random assignment factor present in this study. In other words, if non-random assignment of students into classrooms is the practice, a biasing influence may (likely) still exist in both English language arts and mathematics.

The long and short of it, though, is that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts (see also here), especially when considering these systems’ (in)formative purposes. But to suggest that, because these observational indicators (artificially) correlate with teachers’ value-added scores at “weak” and “very weak” levels (see Notes 1 and 2 below), these observational systems might “add” more “value” to the summative sides of teacher evaluations (i.e., their predictive value) is premature, not to mention a bit absurd. Adding import to this statement is the fact that, as duly noted in this study, these observational indicators are oft-to-sometimes biased against teachers who teach lower-achieving and racial minority students, even when random assignment is present, making such bias worse when non-random assignment, which is very common, occurs.

Hence, and again, this does not make the case for the summative uses of really either of these indicators or instruments, especially when high-stakes consequences are to be attached to output from either indicator (or both indicators together given the “weak” to “very weak” relationships observed). On the plus side, though, remain the formative functions of the observational indicators.

*****

Note 1: Researchers used the “year-to-year variation in teachers’ value-added scores to produce an adjusted correlation [emphasis added] that may be interpreted as the correlation between teachers’ average observation dimension score and their underlying value added—the value added that is [not very] stable [or reliable] for a teacher over time, rather than a single-year measure (Kane & Staiger, 2012)” (p. 9). This practice, and the statistic derived from it, has not been externally vetted. Likewise, it also likely yields a correlation coefficient that is falsely inflated. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in which Kane is one of the defendants’ expert witnesses testifying in support of his/this practice.
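For context, the standard way to “adjust” a correlation for unreliability in one of the measures is a correction for attenuation, which divides the observed correlation by the square root of the measure’s reliability. Assuming the report’s adjustment follows this general logic (the report cites Kane & Staiger, 2012, rather than spelling out a formula, so this is an assumption), the toy calculation below, with purely hypothetical numbers not taken from the report, shows why such an adjustment can only push the coefficient upward.

# Toy illustration of a correction for attenuation: dividing an observed
# correlation by the square root of a reliability < 1 can only make the
# coefficient larger. Both values below are hypothetical, for illustration only.
r_observed      <- 0.20   # hypothetical raw correlation
vam_reliability <- 0.50   # hypothetical year-to-year reliability of value-added

r_adjusted <- r_observed / sqrt(vam_reliability)
r_adjusted                # about 0.28 -- larger than the observed 0.20 by construction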

Note 2: As is common with social science research when interpreting correlation coefficients: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.

*****

Citation: Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from https://ies.ed.gov/ncee/edlabs/regions/midatlantic/pdf/REL_2017191.pdf

The New York Times on “The Little Known Statistician” Who Passed

As many of you may recall, I wrote a post last March about the passing of William L. Sanders at age 74. Sanders developed the Education Value-Added Assessment System (EVAAS) — the value-added model (VAM) on which I have conducted most of my research (see, for example, here and here) and the VAM at the core of most of the teacher evaluation lawsuits in which I have been (or still am) engaged (see here, here, and here).

Over the weekend, though, The New York Times released a similar piece about Sanders’s passing, titled “The Little-Known Statistician Who Taught Us to Measure Teachers.” Because I had multiple colleagues and blog followers email me (or email me about) this article, I thought I would share it out with all of you, with some additional comments, of course, but also given the comments I already made in my prior post here.

First, I will start by saying that the title of this article is misleading in that what this “little-known” statistician contributed to the field of education was hardly “little” in terms of its size and impact. Rather, Sanders and his associates at SAS Institute Inc. greatly influenced the last decade of our nation’s educational policies, as largely bent on high-stakes teacher accountability for educational reform. This occurred in large part due to Sanders’s (and others’) lobbying efforts when the federal government ultimately chose to incentivize and de facto require that all states hold their teachers accountable for their value-added, or lack thereof, while attaching high-stakes consequences (e.g., teacher termination) to teachers’ value-added estimates. This, of course, was to ensure educational reform. This occurred at the federal level, as we all likely know, primarily via Race to the Top and the No Child Left Behind waivers essentially forced upon states, whereby states had to adopt VAMs (or growth models) to also reform their teachers, and subsequently their schools, in order to continue to receive the federal funds upon which all states still rely.

It should be noted, though, that we as a nation have been relying upon similar high-stakes educational policies since the late 1970s (i.e., for now over 35 years); however, we have literally no research evidence that these high-stakes accountability policies have yielded any of their intended effects, as still perpetually conceptualized (see, for example, Nevada’s recent legislative ruling here) and as still advanced via large- and small-scale educational policies (e.g., we are still A Nation At Risk in terms of our global competitiveness). Yet, we continue to rely on the logic in support of such “carrot and stick” educational policies, even with this last decade’s teacher- versus student-level “spin.” We as a nation could really not be more ahistorical in terms of our educational policies in this regard.

Regardless, Sanders contributed to all of this at the federal level (that also trickled down to the state level) while also actively selling his VAM to state governments as well as local school districts (i.e., including the Houston Independent School District in which teacher plaintiffs just won a recent court ruling against the Sanders value-added system here), and Sanders did this using sets of (seriously) false marketing claims (e.g., purchasing and using the EVAAS will help “clear [a] path to achieving the US goal of leading the world in college completion by the year 2020”). To see two empirical articles about the claims made to sell Sanders’s EVAAS system, the research non-existent in support of each of the claims, and the realities of those at the receiving ends of this system (i.e., teachers) as per their experiences with each of the claims, see here and here.

Hence, to assert that what this “little-known” statistician contributed to education was trivial or inconsequential is entirely false. Thankfully, with the passage of the Every Student Succeeds Act (ESSA), the federal government came around, in at least some ways. While not yet acknowledging how holding teachers accountable for their students’ test scores, while ideal, simply does not work (see the “Top Ten” reasons why this does not work here), at least the federal government has given back to the states the authority to devise, hopefully, some more research-informed educational policies in these regards (I know….).

Nonetheless, may he rest in peace (see also here), perhaps also knowing that his forever stance of “[making] no apologies for the fact that his methods were too complex for most of the teachers whose jobs depended on them to understand,” just landed his EVAAS in serious jeopardy in court in Houston (see here) given this stance was just ruled as contributing to the violation of teachers’ Fourteenth Amendment rights (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process [emphasis added]).