On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest, and I thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written into Audrey’s post prior) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful the points are colored based on their quintile scores as per simulated teachers’ value-added divided into the lowest 20%, next 20%, etc.

In this figure you can see by looking at the blue “least squares line” that, “on average,” as a simulated teacher’s value-added estimate increases the average of a teacher’s observational estimate increases. However, there is a lot of variability (or scatter points) around the (scatterplot) line. Given this variability, we can make statements about averages, such as “on average” teachers in the top 20% for VAM scores will likely have on average higher observed observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in VAM score. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R squared). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to say statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) of VAM have a very large spread in their scores on the observational score, with 95% of the scores in the top quintile being in between the 7th and 98th percentiles for their observational scores. So, here if we observe a VAM for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than their observational score is likely to be in the top 90%. Similarly, if we observe a VAM in the bottom 20%, we cannot say much more than their observational score is likely to be somewhere in the bottom 90%. That’s not saying a lot, in terms of precision, but also in terms of practice.

The second scatterplot I ran to test how bias that only impacts a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I simulated a situation where, again, there are two values present in a population of teachers: a teacher’s value-added and a teacher’s observational score. Then I insert a group of teachers (as Audrey described) who represent 20% of a population and teach a disproportionate number of students who come from relatively lower socioeconomic, high racial minority, etc. backgrounds, and I assume this group is measured with negative bias on both indicators and this group has a moderate correlation between indicators of r = 0.50. The other 80% of the population is assumed to be uncorrelated. Note: for this demonstration I assume that this group includes 20% of teachers from the aforementioned population, these teachers I assume to be measured with negative bias (by one standard deviation on average) on both measures, and, again, I set their correlation at r = 0.50 with the other 80% of teachers at a correlation of zero.

What you can see is that if there is bias in this correlation that impacts only a certain group on the two instrument indicators; hence, it is possible that this bias can result in an observed correlation overall. In other words, a strong correlation noted in just one group of teachers (i.e., teachers scoring the lowest on their value-added and observational indicators in this case) can be relatively stronger than the “weak” correlation observed on average or overall.

Another, possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quantiles on VAM have a different linear relationship with the observational score. For example, in the plot there is not a constant slope, but teachers who are in the first quintile on VAM I assume to have a correlation of r = 0.50 with observational scores, the second quintile I assume to have a correlation of r = 0.20, and the other quintiles I assume to be uncorrelated. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e. a very small chance that a correlation of this size would be observed by random chance alone if the true correlation was zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in the mean observational score depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not informative as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson Correlations (the standard correlation coefficient) measure the overall strength of  linear relationships between X and Y, but if X and Y have a non-linear relationship (like as illustrated in the above), this statistic can be very misleading.

Note also that for all of these simulations very small p-values are observed (e.g., p-values <0.0000001 which, again, mean these correlations are statistically significant or that the probability of observing correlations this large by chance if the true correlation is zero, is nearly 0%). What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between”Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).

At the end of the day r = 0.28 is still a “weak” correlation. In addition, it might be “weak,” on average, but much stronger and statistically and practically significant for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above) typically teaching the highest needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Although, whether in fact the systems have actually been reformed is also of debate in that states are still using many of the same observational systems they were using prior (i.e., not the “binary checklists” exaggerated in the original as well as this report, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output were combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft – an Assistant Professor of Education & Economics at Brown University and Allison F. Gilmour – an Assistant Professor at Temple University (and former doctoral student at Vanderbilt University), revisited the original report. Just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except for in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play, here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently debated whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated as highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see Figure below, as per Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are also (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record about how this is also occurring in response to the “Widget Effect” report and the educational policies that follows).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of this study come from one district, their findings are similar to others evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and using survey methods to put into check these systems, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also of issue in this regard, in terms of the 1% issue, is (a) the time and effort it takes supervisors to assist/desist after rating teachers low is sometimes not worth assigning low ratings; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.

References:

Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from http://files.eric.ed.gov/fulltext/ED533382.pdf

Kraft, M. A, & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5) 234-249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). “The Widget Effect.” Education Digest, 75(2), 31–35.

Breaking News: Another Big Victory in Court in Texas

Earlier today I released a post regarding “A Big Victory in Court in Houston,” in which I wrote about how, yesterday, US Magistrate Judge Smith ruled — in Houston Federation of Teachers et al. v. Houston Independent School District — that Houston teacher plaintiffs’ have legitimate claims regarding how their Education Value-Added Assessment System (EVAAS) value-added scores, as used (and abused) in HISD, was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). Hence, on this charge, this case is officially going to trial.

Well, also yesterday, “we” won another court case on which I also served as an expert witness (I served as an expert witness on behalf of the plaintiffs alongside Jesse Rothstein in the court case noted above). As per this case — Texas State Teachers Association v. Texas Education Agency, Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas (although there were three similar cases also filed – see all four referenced below) — The Honorable Lora J. Livingston ruled that the Defendants are to make revisions to 19 Tex. Admin. Code § 150.1001 that most notably include the removal of (A) student learning objectives [SLOs], (B) student portfolios, (C) pre and post test results on district level assessments; or (D) value added data based on student state assessment results. In addition, “The rules do not restrict additional factors a school district may consider…,” and “Under the local appraisal system, there [will be] no required weighting for each measure…,” although districts can chose to weight whatever measures they might choose. “Districts can also adopt an appraisal system that does not provide a single, overall summative rating.” That is, increased local control.

If the Texas Education Agency (TEA) does not adopt the regulations put forth by the court by next October, this case will continue. This does not look likely, however, in that as per a news article released today, here, Texas “Commissioner of Education Mike Morath…agreed to revise the [states’] rules in exchange for the four [below] teacher groups’ suspending their legal challenges.” As noted prior, the terms of this settlement call for the removal of the above-mentioned, state-required, four growth measures when evaluating teachers.

This was also highlighted in a news article, released yesterday, here, with this one more generally about how teachers throughout Texas will no longer be evaluated using their students’ test scores, again, as required by the state.

At the crux of this case, as also highlighted in this particular piece, and to which I testified (quite extensively), was that the value-added measures formerly required/suggested by the state did not constitute teachers’ “observable,” job-related behaviors. See also a prior post about this case here.

*****

Cases Contributing to this Ruling:

1. Texas State Teachers Association v. Texas Education Agency, Mike Morath, in his Official Capacity as Commissioner of Education for the State of Texas; in the 345th Judicial District Court, Travis County, Texas

2. Texas Classroom Teachers Association v. Mike Morath, Texas Commissioner of Education; in the 419th Judicial District Court, Travis County, Texas

3. Texas American Federation of Teachers v. Mike Morath, Commissioner of Education, in his official capacity, and Texas Education Agency; in the 201st Judicial District Court, Travis County, Texas

4. Association of Texas Professional Educators v. Mike Morath, the Commissioner of Education and the Texas Education Agency; in the 200th District Court of Travis County, Texas.

Breaking News: A Big Victory in Court in Houston

Recall from multiple prior posts (see here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores, derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than anywhere else in the nation (e.g., the termination of 221 teachers in just one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just yesterday, United States Magistrate Judge Stephen Wm. Smith denied in the United States District Court, Southern District of Texas, the district’s request for summary judgment given the plaintiffs’ due process claims. Put differently, Judge Smith ruled that the plaintiffs’ did have legitimate claims regarding how EVAAS use in HISD was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case organization shall deprive any person of life, liberty, or property, without due process). Hence, on this charge, this case is officially going to trial.

This is a huge victory, and one unprecedented that will likely set precedent, trial pending, for others, and more specifically other teachers.

Of primary issue will be the following (as taken from Judge Smith’s Summary Judgment released yesterday): “Plaintiffs [will continue to] challenge the use of EVAAS under various aspects of the Fourteenth Amendment, including: (1) procedural due process, due to lack of sufficient information to meaningfully challenge terminations based on low EVAAS scores,” and given “due process is designed to foster government decision-making that is both fair and accurate.”

Related, and of most importance, as also taken directly from Judge Smith’s Summary, he wrote:

  • HISD’s value-added appraisal system poses a realistic threat to deprive plaintiffs of constitutionally protected property interests in employment.
  • HISD does not itself calculate the EVAAS score for any of its teachers. Instead, that task is delegated to its third party vendor, SAS. The scores are generated by complex algorithms, employing “sophisticated software and many layers of calculations.” SAS treats these algorithms and software as trade secrets, refusing to divulge them to either HISD or the teachers themselves. HISD has admitted that it does not itself verify or audit the EVAAS scores received from SAS, nor does it engage any contractor to do so. HISD further concedes that any effort by teachers to replicate their own scores, with the limited information available to them, will necessarily fail. This has been confirmed by plaintiffs’ expert, who was unable to replicate the scores despite being given far greater access to the underlying computer codes than is available to an individual teacher [emphasis added, as also related to a prior post about how SAS claimed that plaintiffs violated SAS’s protective order (protecting its trade secrets), that the court overruled, see here].
  • The EVAAS score might be erroneously calculated for any number of reasons, ranging from data-entry mistakes to glitches in the computer code itself. Algorithms are human creations, and subject to error like any other human endeavor. HISD has acknowledged that mistakes can occur in calculating a teacher’s EVAAS score; moreover, even when a mistake is found in a particular teacher’s score, it will not be promptly corrected. As HISD candidly explained in response to a frequently asked question, “Why can’t my value-added analysis be recalculated?”:
    • Once completed, any re-analysis can only occur at the system level. What this means is that if we change information for one teacher, we would have to re- run the analysis for the entire district, which has two effects: one, this would be very costly for the district, as the analysis itself would have to be paid for again; and two, this re-analysis has the potential to change all other teachers’ reports.
  • The remarkable thing about this passage is not simply that cost considerations trump accuracy in teacher evaluations, troubling as that might be. Of greater concern is the house-of-cards fragility of the EVAAS system, where the wrong score of a single teacher could alter the scores of every other teacher in the district. This interconnectivity means that the accuracy of one score hinges upon the accuracy of all. Thus, without access to data supporting all teacher scores, any teacher facing discharge for a low value-added score will necessarily be unable to verify that her own score is error-free.
  • HISD’s own discovery responses and witnesses concede that an HISD teacher is unable to verify or replicate his EVAAS score based on the limited information provided by HISD.
  • According to the unrebutted testimony of plaintiffs’ expert, without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious “black box,” impervious to challenge.
  • While conceding that a teacher’s EVAAS score cannot be independently verified, HISD argues that the Constitution does not require the ability to replicate EVAAS scores “down to the last decimal point.” But EVAAS scores are calculated to the second decimal place, so an error as small as one hundredth of a point could spell the difference between a positive or negative EVAAS effectiveness rating, with serious consequences for the affected teacher.

Hence, “When a public agency adopts a policy of making high stakes employment decisions based on secret algorithms incompatible with minimum due process, the proper remedy is to overturn the policy.”

Moreover, he wrote, that all of this is part of the violation of teaches’ Fourteenth Amendment rights. Hence, he also wrote, “On this summary judgment record, HISD teachers have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs.”

Otherwise, Judge Smith granted summary judgment to the district on the other claims forwarded by the plaintiffs, including plaintiffs’ equal protection claims. All of us involved in the case — recall that Jesse Rothstein and I served as the expert witnesses on behalf of the plaintiffs, and Thomas Kane of the Measures of Effective Teaching (MET) Project and John Friedman of the infamous Chetty et al. studies (see here and here) served as the expert witnesses on behalf of the defendants — knew that all of the plaintiffs’ claims would be tough to win given all of the constitutional legal standards would be difficult for plaintiffs to satisfy (e.g., that evaluating teachers using their value-added scores was not “unreasonable” was difficult to prove, as it was in the Tennessee case we also fought and was then dismissed on similar grounds (see here)).

Nonetheless, that “we” survived on the due process claim is fantastic, especially as this is the first case like this of which we are aware across the country.

Here is the press release, released last night by the AFT:

May 4, 2017 – AFT, Houston Federation of Teachers Hail Court Ruling on Flawed Evaluation System

Statements by American Federation of Teachers President Randi Weingarten and Houston Federation of Teachers President Zeph Capo on U.S. District Court decision on Houston’s Evaluation Value-Added Assessment System (EVAAS), known elsewhere as VAM or value-added measures:

AFT President Randi Weingarten: “Houston developed an incomprehensible, unfair and secret algorithm to evaluate teachers that had no rational meaning. This is the algebraic formula: = + (Σ∗≤Σ∗∗ × ∗∗∗∗=1)+

“U.S. Magistrate Judge Stephen Smith saw that it was seriously flawed and posed a threat to teachers’ employment rights; he rejected it. This is a huge victory for Houston teachers, their students and educators’ deeply held contention that VAM is a sham.

“The judge said teachers had no way to ensure that EVAAS was correctly calculating their performance score, nor was there a way to promptly correct a mistake. Judge Smith added that the proper remedy is to overturn the policy; we wholeheartedly agree. Teaching must be about helping kids develop the skills and knowledge they need to be prepared for college, career and life—not be about focusing on test scores for punitive purposes.”

HFT President Zeph Capo: “With this decision, Houston should wipe clean the record of every teacher who was negatively evaluated. From here on, teacher evaluation systems should be developed with educators to ensure that they are fair, transparent and help inform instruction, not be used as a punitive tool.”

New Texas Lawsuit: VAM-Based Estimates as Indicators of Teachers’ “Observable” Behaviors

Last week I spent a few days in Austin, one day during which I provided expert testimony for a new state-level lawsuit that has the potential to impact teachers throughout Texas. The lawsuit — Texas State Teachers Association (TSTA) v. Texas Education Agency (TEA), Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas.

The key issue is that, as per the state’s Texas Education Code (Sec. § 21.351, see here) regarding teachers’ “Recommended Appraisal Process and Performance Criteria,” The Commissioner of Education must adopt “a recommended teacher appraisal process and criteria on which to appraise the performance of teachers. The criteria must be based on observable, job-related behavior, including: (1) teachers’ implementation of discipline management procedures; and (2) the performance of teachers’ students.” As for the latter, the State/TEA/Commissioner defined, as per its Texas Administrative Code (T.A.C., Chapter 15, Sub-Chapter AA, §150.1001, see here), that teacher-level value-added measures should be treated as one of the four measures of “(2) the performance of teachers’ students;” that is, one of the four measures recognized by the State/TEA/Commissioner as an “observable” indicator of a teacher’s “job-related” performance.

While currently no district throughout the State of Texas is required to use a value-added component to assess and evaluate its teachers, as noted, the value-added component is listed as one of four measures from which districts must choose at least one. All options listed in the category of “observable” indicators include: (A) student learning objectives (SLOs); (B) student portfolios; (C) pre- and post-test results on district-level assessments; and (D) value-added data based on student state assessment results.

Related, the state has not recommended or required that any district, if the value-added option is selected, to choose any particular value-added model (VAM) or calculation approach. Nor has it recommended or required that any district adopt any consequences as attached to these output; however, things like teacher contract renewal and sharing teachers’ prior appraisals with other districts in which teachers might be applying for new jobs is not discouraged. Again, though, the main issue here (and the key points to which I testified) was that the value-added component is listed as an “observable” and “job-related” teacher effectiveness indicator as per the state’s administrative code.

Accordingly, my (5 hour) testimony was primarily (albeit among many other things including the “job-related” part) about how teacher-level value-added data do not yield anything that is observable in terms of teachers’ effects. Likewise, officially referring to these data in this way is entirely false, in fact, in that:

  • “We” cannot directly observe a teacher “adding” (or detracting) value (e.g., with our own eyes, like supervisors can when they conduct observations of teachers in practice);
  • Using students’ test scores to measure student growth upwards (or downwards) and over time, as is very common practice using the (very often instructionally insensitive) state-level tests required by No Child Left Behind (NCLB), and doing this once per year in mathematics and reading/language arts (that includes prior and other current teachers’ effects, summer learning gains and decay, etc.), is not valid practice. That is, doing this has not been validated by the scholarly/testing community; and
  • Worse and less valid is to thereafter aggregate this student-level growth to the teacher level and then call whatever “growth” (or the lack thereof) is because of something the teacher (and really only the teacher did), as directly “observable.” These data are far from assessing a teacher’s causal or “observable” impacts on his/her students’ learning and achievement over time. See, for example, the prior statement released about value-added data use in this regard by the American Statistical Association (ASA) here. In this statement it is written that: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation [emphasis added to note that this is variation explained which = correlational versus causal research] in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability [emphasis added] to teachers. This is not saying that teachers have little effect on students, but that variation among teachers [emphasis added] accounts for a small part of the variation [emphasis added] in [said test] scores. The majority of the variation in [said] test scores is [inversely, 86%-99% related] to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences.”

If any of you have anything to add to this, please do so in the comments section of this post. Otherwise, I will keep you posted on how this goes. My current understanding is that this one will be headed to court.

Difficulties When Combining Multiple Teacher Evaluation Measures

A new study about multiple “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in the American Educational Research Association (AERA) sponsored and highly-esteemed Educational Evaluation and Policy Analysis journal. You can find the free and full version of this study here.

In this study authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – at the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). Multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”

This, of course, has huge implications for teacher evaluation systems as based upon multiple measures in that “accuracy” means “validity” and “valid” decisions cannot be made as based on “invalid” or “inaccurate” data that can so arbitrarily change. In other words, what this means is that likely never will a decision about a teacher being this or that actually mean this or that. In fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures combined are assumed to function otherwise. This is especially interesting, again and as stated prior, that the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with the attachment of consequences to teachers’ multiple evaluation measures, as assumed (by the state but not the state’s consultant?) to be accurate and correct (see, for example, here).

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively referred others to use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given all of this also makes the social science arbitrary, which is also a legal issue in and of itself, as also quasi noted.

Now, while I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, as unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary, how principal component, unit weight, policy weight, optimal prediction approaches vary), I will also note that forcing the fit of such multiple measures in such ways, especially without a thorough background in and understanding of reliability and validity and what reliability means for validity (i.e., with rather high levels of reliability required before any valid inferences and especially high-stakes decisions can be made) is certainly unwise.

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly cause pause, if not a justifiable grievance as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0162373716666166

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

Miami-Dade, Florida’s Recent “Symbolic” and “Artificial” Teacher Evaluation Moves

Last spring, Eduardo Porter – writer of the Economic Scene column for The New York Times – wrote an excellent article, from an economics perspective, about that which is happening with our current obsession in educational policy with “Grading Teachers by the Test” (see also my prior post about this article here; although you should give the article a full read; it’s well worth it). In short, though, Porter wrote about what economist’s often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the great (e.g., high-stakes) value (mis)placed on any measure, and the distortion (i.e., in terms of artificial inflation or deflation, depending on the desired direction of the measure) that often-to-always comes about as a result.

Well, it’s happened again, this time in Miami-Dade, Florida, where the Miami-Dade district’s teachers are saying its now “getting harder to get a good evaluation” (see the full article here). Apparently, teachers evaluation scores, from last to this year, are being “dragged down,” primarily given teachers’ students’ performances on tests (as well as tests of subject areas that and students whom they do not teach).

“In the weeks after teacher evaluations for the 2015-16 school year were distributed, Miami-Dade teachers flooded social media with questions and complaints. Teachers reported similar stories of being evaluated based on test scores in subjects they don’t teach and not being able to get a clear explanation from school administrators. In dozens of Facebook posts, they described feeling confused, frustrated and worried. Teachers risk losing their jobs if they get a series of low evaluations, and some stand to gain pay raises and a bonus of up to $10,000 if they get top marks.”

As per the figure also included in this article, see the illustration of how this is occurring below; that is, how it is becoming more difficult for teachers to get “good” overall evaluation scores but also, and more importantly, how it is becoming more common for districts to simply set different cut scores to artificially increase teachers’ overall evaluation scores.

00-00 template_cs5

“Miami-Dade say the problems with the evaluation system have been exacerbated this year as the number of points needed to get the “highly effective” and “effective” ratings has continued to increase. While it took 85 points on a scale of 100 to be rated a highly effective teacher for the 2011-12 school year, for example, it now takes 90.4.”

This, as mentioned prior, is something called “artificial deflation,” whereas the quality of teaching is likely not changing nearly to the extent the data might illustrate it is. Rather, what is happening behind the scenes (e.g., the manipulation of cut scores) is giving the impression that indeed the overall teacher system is in fact becoming better, more rigorous, aligning with policymakers’ “higher standards,” etc).

This is something in the educational policy arena that we also call “symbolic policies,” whereas nothing really instrumental or material is happening, and everything else is a facade, concealing a less pleasant or creditable reality that nothing, in fact, has changed.

Citation: Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. The Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html#storylink=cpy

Ohio Rejects Subpar VAM, for Another VAM Arguably Less Subpar?

From a prior post coming from Ohio (see here), you may recall that Ohio state legislators recently introduced a bill to review its state’s value-added model (VAM), especially as it pertains to the state’s use of their VAM (i.e., the Education Value-Added Assessment System (EVAAS); see more information about the use of this model in Ohio here).

As per an article published last week in The Columbus Dispatch, the Ohio Department of Education (ODE) apparently rejected a proposal made by the state’s pro-charter school Ohio Coalition for Quality Education and the state’s largest online charter school, all of whom wanted to add (or replace) this state’s VAM with another, unnamed “Similar Students” measure (which could be the Student Growth Percentiles model discussed prior on this blog, for example, here, here, and here) used in California.

The ODE charged that this measure “would lower expectations for students with different backgrounds, such as those in poverty,” which is not often a common criticism of this model (if I have the model correct), nor is it a common criticism of the model they already have in place. In fact, and again if I have the model correct, these are really the only two models that do not statistically control for potentially biasing factors (e.g., student demographic and other background factors) when calculating teachers’ value-added; hence, their arguments about this model may be in actuality no different than that which they are already doing. Hence, statements like that made by Chris Woolard, senior executive director of the ODE, are false: “At the end of the day, our system right now has high expectations for all students. This (California model) violates that basic principle that we want all students to be able to succeed.”

The models, again if I am correct, are very much the same. While indeed the California measurement might in fact consider “student demographics such as poverty, mobility, disability and limited-English learners,” this model (if I am correct on the model) does not statistically factor these variables out. If anything, the state’s EVAAS system does, even though EVAAS modelers claim they do not do this, by statistically controlling for students’ prior performance, which (unfortunately) has these demographics already built into them. In essence, they are already doing the same thing they now protest.

Indeed, as per a statement made by Ron Adler, president of the Ohio Coalition for Quality Education, not only is it “disappointing that ODE spends so much time denying that poverty and mobility of students impedes their ability to generate academic performance…they [continue to] remain absolutely silent about the state’s broken report card and continually defend their value-added model that offers no transparency and creates wild swings for schools across Ohio” (i.e., the EVAAS system, although in all fairness all VAMs and the SGP yield the “wild swings’ noted). See, for example, here.

What might be worse, though, is that the ODE apparently found that, depending on the variables used in the California model, it produced different results. Guess what! All VAMs, depending on the variables used, produce different results. In fact, using the same data and different VAMs for the same teachers at the same time also produce (in some cases grossly) different results. The bottom line here is if any thinks that any VAM is yielding estimates from which valid or “true” statements can be made are fooling themselves.

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 20 years ago (i.e., 1986) by – Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should as Professor Emeritus Haertel is one of the best on the topic of, and history behind VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Heartel wrote this article when at the time policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured on states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today — investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” While “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” This, all, makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or our, as well.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the cri- teria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately ap- plied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific inter- actions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

The Late Stephen Jay Gould on IQ Testing (with Implications for Testing Today)

One of my doctoral students sent me a YouTube video I feel compelled to share with you all. It is an interview with one of my all time favorite and most admired academics — Stephen Jay Gould. Gould, who passed away at age 60 from cancer, was a paleontologist, evolutionary biologist, and scientist who spent most of his academic career at Harvard. He was “one of the most influential and widely read writers of popular science of his generation,” and he was also the author of one of my favorite books of all time: The Mismeasure of Man (1981).

In The Mismeasure of Man Gould examined the history of psychometrics and the history of intelligence testing (e.g., the methods of nineteenth century craniometry, or the physical measures of peoples’ skulls to “objectively” capture their intelligence). Gould examined psychological testing and the uses of all sorts of tests and measurements to inform decisions (which is still, as we know, uber-relevant today) as well as “inform” biological determinism (i.e., “the view that “social and economic differences between human groups—primarily races, classes, and sexes—arise from inherited, inborn distinctions and that society, in this sense, is an accurate reflection of biology). Gould also examined in this book the general use of mathematics and “objective” numbers writ large to measure pretty much anything, as well as to measure and evidence predetermined sets of conclusions. This book is, as I mentioned, one of the best. I highly recommend it to all.

In this seven-minute video, you can get a sense of what this book is all about, as also so relevant to that which we continue to believe or not believe about tests and what they really are or are not worth. Thanks, again, to my doctoral student for finding this as this is a treasure not to be buried, especially given Gould’s 2002 passing.