A North Carolina Teacher’s Guest Post on His/Her EVAAS Scores

A teacher from the state of North Carolina recently emailed me for my advice regarding how to help him/her read and understand his/her recently received Education Value-Added Assessment System (EVAAS) value-added scores. You likely recall that the EVAAS is the model I cover most on this blog, in that it is the system I have researched the most, as well as the proprietary system adopted by multiple states (e.g., Ohio, North Carolina, and South Carolina) and districts across the country for which taxpayers continue to pay big $. Of late, this is also the value-added model (VAM) of sole interest in the recent lawsuit that teachers won in Houston (see here).

You might also recall that the EVAAS is the system developed by the now late William Sanders (see here), who ultimately sold it to SAS Institute Inc., which now holds all rights to the VAM (see also prior posts about the EVAAS here, here, here, here, here, and here). It is also important to note, because this teacher teaches in North Carolina where SAS Institute Inc. is located and where its CEO James Goodnight is considered the richest man in the state, that as a major Grand Old Party (GOP) donor “he” helps to set all of the state’s education policy as the state is also dominated by Republicans. All of this also means that it is unlikely EVAAS will go anywhere unless there is honest and open dialogue about the shortcomings of the data.

Hence, the attempt here is to begin at least some honest and open dialogue herein. Accordingly, here is what this teacher wrote in response to my request that (s)he write a guest post:

***

SAS Institute Inc. claims that the EVAAS enables teachers to “modify curriculum, student support and instructional strategies to address the needs of all students.”  My goal this year is to see whether these claims are actually possible or true. I’d like to dig deep into the data made available to me — for which my state pays over $3.6 million per year — in an effort to see what these data say about my instruction, accordingly.

For starters, here is what my EVAAS-based growth looks like over the past three years:

As you can see, three years ago I met my expected growth, but my growth measure was slightly below zero. The year after that I knocked it out of the park. This past year I was right in the middle of my prior two years of results. Notice the volatility [aka an issue with VAM-based reliability, or consistency, or a lack thereof; see, for example, here].

Notwithstanding, SAS Institute Inc. makes the following recommendations in terms of how I should approach my data:

Reflecting on Your Teaching Practice: Learn to use your Teacher reports to reflect on the effectiveness of your instructional delivery.

The Teacher Value Added report displays value-added data across multiple years for the same subject and grade or course. As you review the report, you’ll want to ask these questions:

  • Looking at the Growth Index for the most recent year, were you effective at helping students to meet or exceed the Growth Standard?
  • If you have multiple years of data, are the Growth Index values consistent across years? Is there a positive or negative trend?
  • If there is a trend, what factors might have contributed to that trend?
  • Based on this information, what strategies and instructional practices will you replicate in the current school year? What strategies and instructional practices will you change or refine to increase your success in helping students make academic growth?

Yet my growth index values are not consistent across years, as also noted above. Rather, my “trends” are baffling to me.  When I compare those three instructional years in my mind, nothing stands out to me in terms of differences in instructional strategies that would explain the fluctuations in growth measures, either.

So let’s take a closer look at my data for last year (i.e., 2016-2017).  I teach 7th grade English/language arts (ELA), so my numbers are based on my students’ grade 7 reading scores in the table below.

What jumps out for me here is the contradiction in “my” data for achievement Levels 3 and 4 (achievement levels start at Level 1 and top out at Level 5, whereas levels 3 and 4 are considered proficient/middle of the road).  There is moderate evidence that my grade 7 students who scored a Level 4 on the state reading test exceeded the Growth Standard.  But there is also moderate evidence that my same grade 7 students who scored Level 3 did not meet the Growth Standard.  At the same time, the number of students I had demonstrating proficiency on the same reading test (by scoring at least a 3) increased from 71% in 2015-2016 (when I exceeded expected growth) to 76% in school year 2016-2017 (when my growth declined significantly). This makes no sense, right?

Hence, and after considering my data above, the question I’m left with is actually really important:  Are the instructional strategies I’m using for my students whose achievement levels are in the middle working, or are they not?

I’d love to hear from other teachers on their interpretations of these data.  A tool that costs taxpayers this much money and impacts teacher evaluations in so many states should live up to its claims of being useful for informing our teaching.

The More Weight VAMs Carry, the More Teacher Effects (Will Appear to) Vary

Matthew A. Kraft — an Assistant Professor of Education & Economics at Brown University and co-author of an article published in Educational Researcher on “Revisiting The Widget Effect” (here), and another of his co-authors Matthew P. Steinberg — an Assistant Professor of Education Policy at the University of Pennsylvania — just published another article in this same journal on “The Sensitivity of Teacher Performance Ratings to the Design of Teacher Evaluation Systems” (see the full and freely accessible, at least for now, article here; see also its original and what should be enduring version here).

In this article, Steinberg and Kraft (2017) examine teacher performance measure weights while conducting multiple simulations of data taken from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. They conclude that “performance measure weights and ratings” surrounding teachers’ value-added, observational measures, and student survey indicators play “critical roles” when “determining teachers’ summative evaluation ratings and the distribution of teacher proficiency rates.” In other words, the weighting of teacher evaluation systems’ multiple measures matters, matters differently for different types of teachers within and across school districts and states, and matters also in that so often these weights are arbitrarily and politically defined and set.

Indeed, because “state and local policymakers have almost no empirically based evidence [emphasis added, although I would write “no empirically based evidence”] to inform their decision process about how to combine scores across multiple performance measures…decisions about [such] weights…are often made through a somewhat arbitrary and iterative process, one that is shaped by political considerations in place of empirical evidence” (Steinberg & Kraft, 2017, p. 379).

This is very important to note in that the consequences attached to these measures, also given the arbitrary and political constructions they represent, can be both career and life changing (i.e., professionally and personally, respectively). How and to what extent “the proportion of teachers deemed professionally proficient changes under different weighting and ratings thresholds schemes” (p. 379), then, clearly matters.

While Steinberg and Kraft (2017) have other key findings they also present throughout this piece, their most important finding, in my opinion, is that, again, “teacher proficiency rates change substantially as the weights assigned to teacher performance measures change” (p. 387). Moreover, the more weight assigned to measures with higher relative means (e.g., observational or student survey measures), the greater the rate by which teachers are rated effective or proficient, and vice versa (i.e., the more weight assigned to teachers’ value-added, the higher the rate by which teachers will be rated ineffective or inadequate; as also discussed on p. 388).

Put differently, “teacher proficiency rates are lowest across all [district and state] systems when norm-referenced teacher performance measures, such as VAMs [i.e., with scores that are normalized in line with bell curves, with a mean or average centered around the middle of the normal distributions], are given greater relative weight” (p. 389).
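To make the mechanics of this finding concrete, here is a minimal Python sketch. It is not Steinberg and Kraft's simulation and it does not use MET data. Every number in it is an assumption made up for illustration: a hypothetical 1-4 rating scale, observation ratings that cluster high (mean 3.0), a norm-referenced VAM rating centered on the scale midpoint (mean 2.5), and a composite cutoff of 2.5 for "proficient."

```python
import random


def proficiency_rate(vam_weight, n_teachers=20000, cutoff=2.5, seed=0):
    """Share of simulated teachers whose weighted composite meets the cutoff.

    All distributions and the cutoff are hypothetical, chosen only to
    illustrate how the weight on a lower-mean measure moves the rate.
    """
    rng = random.Random(seed)  # same seed, so every weight sees the same teachers
    proficient = 0
    for _ in range(n_teachers):
        # Assumed: observation ratings cluster toward the top of a 1-4 scale.
        observation = rng.gauss(3.0, 0.4)
        # Assumed: the norm-referenced VAM rating is centered on the midpoint.
        vam = rng.gauss(2.5, 0.7)
        composite = vam_weight * vam + (1 - vam_weight) * observation
        if composite >= cutoff:
            proficient += 1
    return proficient / n_teachers


rates = {w: proficiency_rate(w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)}
for weight, rate in rates.items():
    print(f"VAM weight {weight:.2f}: {rate:.1%} rated proficient")
```

Under these assumptions the simulated teachers never change; only the weight does, and the share rated proficient falls as the VAM weight rises. Different assumed means, spreads, or cutoffs would change the specific rates.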

This becomes problematic when states or districts then use these weighted systems (again, weighted in arbitrary and political ways) to illustrate, often to the public, that their new-and-improved teacher evaluation systems, as inspired by the MET studies mentioned prior, are now “better” at differentiating between “good and bad” teachers. Thereafter, some states over others are then celebrated (e.g., by the National Council on Teacher Quality; see, for example, here) for taking the evaluation of teacher effects more seriously than others when, as evidenced herein, this is (unfortunately) more due to manipulation than true changes in these systems. Accordingly, the fact remains that the more weight VAMs carry, the more teacher effects (will appear to) vary. It’s not necessarily that they vary in reality; rather, the manipulation of the weights on the back end causes such variation and then leads to, quite literally, such delusions of grandeur in these regards (see also here).

At a more pragmatic level, this also suggests that the teacher evaluation ratings for the roughly 70% of teachers who are not VAM eligible “are likely to differ in systematic ways from the ratings of teachers for whom VAM scores can be calculated” (p. 392). This is precisely why evidence in New Mexico suggests VAM-eligible teachers are up to five times more likely to be ranked as “ineffective” or “minimally effective” than their non-VAM-eligible colleagues; that is, “[also b]ecause greater weight is consistently assigned to observation scores for teachers in nontested grades and subjects” (p. 392). This also causes a related but also important issue with fairness, whereas equally effective teachers may be five-or-so times more likely (e.g., in states like New Mexico) to be rated as ineffective by the mere fact that they are VAM eligible and their states, quite literally, “value” value-added “too much” (as also arbitrarily defined).

Finally, it should also be noted as an important caveat here that the findings advanced by Steinberg and Kraft (2017) “are not intended to provide specific recommendations about what weights and ratings to select—such decisions are fundamentally subject to local district priorities and preferences” (p. 379). These findings do, however, “offer important insights about how these decisions will affect the distribution of teacher performance ratings as policymakers and administrators continue to refine and possibly remake teacher evaluation systems” (p. 379).

Related, please recall that via the MET studies one of the researchers’ goals was to determine which weights per multiple measure were empirically defensible. MET researchers failed to do so and then defaulted to recommending an equal distribution of weights without empirical justification (see also Rothstein & Mathis, 2013). This also means that anyone at any state or district level who might say that this weight here or that weight there is empirically defensible should be asked for the evidence in support.

Citations:

Rothstein, J., & Mathis, W. J. (2013, January). Review of two culminating reports from the MET Project. Boulder, CO: National Educational Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-MET-final-2013

Steinberg, M. P., & Kraft, M. A. (2017). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher, 46(7), 378–396. doi:10.3102/0013189X17726752 Retrieved from http://journals.sagepub.com/doi/abs/10.3102/0013189X17726752

Breaking News: The End of Value-Added Measures for Teacher Termination in Houston

Recall from multiple prior posts (see, for example, here, here, here, here, and here) that a set of teachers in the Houston Independent School District (HISD), with the support of the Houston Federation of Teachers (HFT) and the American Federation of Teachers (AFT), took their district to federal court to fight against the (mis)use of their value-added scores derived via the Education Value-Added Assessment System (EVAAS) — the “original” value-added model (VAM) developed in Tennessee by William L. Sanders who just recently passed away (see here). Teachers’ EVAAS scores, in short, were being used to evaluate teachers in Houston in more consequential ways than any other district or state in the nation (e.g., the termination of 221 teachers in one year as based, primarily, on their EVAAS scores).

The case — Houston Federation of Teachers et al. v. Houston ISD — was filed in 2014 and just one day ago (October 10, 2017) came the case’s final federal suit settlement. Click here to read the “Settlement and Full and Final Release Agreement.” But in short, this means the “End of Value-Added Measures for Teacher Termination in Houston” (see also here).

More specifically, recall that the judge notably ruled prior (in May of 2017) that the plaintiffs did have sufficient evidence to proceed to trial on their claims that the use of EVAAS in Houston to terminate their contracts was a violation of their Fourteenth Amendment due process protections (i.e., no state or in this case district shall deprive any person of life, liberty, or property, without due process). That is, the judge ruled that “any effort by teachers to replicate their own scores, with the limited information available to them, [would] necessarily fail” (see here p. 13). This was confirmed by one of the plaintiffs’ expert witnesses who was also “unable to replicate the scores despite being given far greater access to the underlying computer codes than [was] available to an individual teacher” (see here p. 13).

Hence, and “[a]ccording to the unrebutted testimony of [the] plaintiffs’ expert [witness], without access to SAS’s proprietary information – the value-added equations, computer source codes, decision rules, and assumptions – EVAAS scores will remain a mysterious ‘black box,’ impervious to challenge” (see here p. 17). Consequently, the judge concluded that HISD teachers “have no meaningful way to ensure correct calculation of their EVAAS scores, and as a result are unfairly subject to mistaken deprivation of constitutionally protected property interests in their jobs” (see here p. 18).

Thereafter, and as per this settlement, HISD agreed to refrain from using VAMs, including the EVAAS, to terminate teachers’ contracts as long as the VAM score is “unverifiable.” More specifically, “HISD agree[d] it will not in the future use value-added scores, including but not limited to EVAAS scores, as a basis to terminate the employment of a term or probationary contract teacher during the term of that teacher’s contract, or to terminate a continuing contract teacher at any time, so long as the value-added score assigned to the teacher remains unverifiable” (see here p. 2; see also here). HISD also agreed to create an “instructional consultation subcommittee” to more inclusively and democratically inform HISD’s teacher appraisal systems and processes, and HISD agreed to pay the Texas AFT $237,000 in its attorney and other legal fees and expenses (State of Texas, 2017, p. 2; see also AFT, 2017).

This is yet another big win for teachers in Houston, and potentially elsewhere, as this ruling is an unprecedented development in VAM litigation. Teachers and others using the EVAAS or another VAM for that matter (e.g., that is also “unverifiable”) do take note, at minimum.

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest, and I thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written into Audrey’s post prior) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful the points are colored based on their quintile scores as per simulated teachers’ value-added divided into the lowest 20%, next 20%, etc.

In this figure you can see by looking at the blue “least squares line” that, “on average,” as a simulated teacher’s value-added estimate increases the average of a teacher’s observational estimate increases. However, there is a lot of variability (or scatter) in the points around the line. Given this variability, we can make statements about averages, such as “on average” teachers in the top 20% for VAM scores will have higher observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in observational scores. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R squared of 0.28 × 0.28 ≈ 0.08). The other 92% of the variance is due to error and other factors.

What this means in practice is that when correlations are this “weak,” it is reasonable to make statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) of VAM have a very large spread in their scores on the observational score, with 95% of the scores in the top quintile being in between the 7th and 98th percentiles for their observational scores. So, here if we observe a VAM for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than their observational score is likely to be in the top 90%. Similarly, if we observe a VAM in the bottom 20%, we cannot say much more than their observational score is likely to be somewhere in the bottom 90%. That’s not saying a lot, in terms of precision, but also in terms of practice.
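For readers who want to try this themselves, here is a minimal Python sketch of the same kind of simulation. It is not the R code used for the figure. The sample size, the seed, and the choice to draw both scores as standard normal variables are assumptions made only for illustration.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(rho=0.28, n=20000, seed=1):
    """Draw (VAM, observation) pairs whose true correlation is rho."""
    rng = random.Random(seed)
    vam, obs = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Mix x with independent noise so that corr(x, y) equals rho.
        y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        vam.append(x)
        obs.append(y)
    return vam, obs


vam, obs = simulate()
r = pearson_r(vam, obs)

# Percentile rank of every observational score within the whole sample.
order = sorted(range(len(obs)), key=obs.__getitem__)
obs_pct = [0.0] * len(obs)
for rank, i in enumerate(order):
    obs_pct[i] = 100 * (rank + 0.5) / len(obs)

# Observational percentiles of teachers in the top 20% on VAM.
vam_cut = sorted(vam)[int(0.8 * len(vam))]
top = sorted(p for v, p in zip(vam, obs_pct) if v >= vam_cut)
low = top[int(0.025 * len(top))]
high = top[int(0.975 * len(top))]

print(f"r = {r:.2f}, r squared = {r * r:.2f}")
print(f"middle 95% of top-VAM teachers span observational percentiles {low:.0f} to {high:.0f}")
```

Under these assumptions, the printed range should land near the spread described above: teachers in the top VAM quintile cover almost the entire range of observational percentiles.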

The second scatterplot I ran to test how bias that only impacts a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I simulated a situation where, again, there are two values present in a population of teachers: a teacher’s value-added and a teacher’s observational score. Then I insert a group of teachers (as Audrey described) who represent 20% of a population and teach a disproportionate number of students who come from relatively lower socioeconomic, high racial minority, etc. backgrounds, and I assume this group is measured with negative bias on both indicators and this group has a moderate correlation between indicators of r = 0.50. The other 80% of the population is assumed to be uncorrelated. Note: for this demonstration I assume that this group includes 20% of teachers from the aforementioned population, these teachers I assume to be measured with negative bias (by one standard deviation on average) on both measures, and, again, I set their correlation at r = 0.50 with the other 80% of teachers at a correlation of zero.

What you can see is that if there is bias that impacts only a certain group on the two indicators, it is possible that this bias alone can result in an observed correlation overall. In other words, the correlation noted in just one group of teachers (i.e., teachers scoring the lowest on their value-added and observational indicators in this case) can be relatively stronger than the “weak” correlation observed on average or overall.
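The setup described above (20% of teachers shifted down by one standard deviation on both measures with r = 0.50, and the other 80% uncorrelated) can be sketched in Python as follows. This is not the R code behind the figure. The sample size, the seed, and the use of standard normal scores are assumptions.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_biased_subgroup(n=20000, share=0.2, shift=-1.0, rho_group=0.5, seed=2):
    """Return (vam, obs, in_group) lists for a population with one biased subgroup."""
    rng = random.Random(seed)
    vam, obs, in_group = [], [], []
    for _ in range(n):
        biased = rng.random() < share
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        if biased:
            # Correlated within the subgroup, then shifted down on BOTH measures.
            y = rho_group * x + (1 - rho_group ** 2) ** 0.5 * noise
            x += shift
            y += shift
        else:
            # Everyone else: the two measures are unrelated.
            y = noise
        vam.append(x)
        obs.append(y)
        in_group.append(biased)
    return vam, obs, in_group


vam, obs, in_group = simulate_biased_subgroup()
r_overall = pearson_r(vam, obs)
r_group = pearson_r([v for v, g in zip(vam, in_group) if g],
                    [o for o, g in zip(obs, in_group) if g])
r_rest = pearson_r([v for v, g in zip(vam, in_group) if not g],
                   [o for o, g in zip(obs, in_group) if not g])

print(f"overall r = {r_overall:.2f}, biased group r = {r_group:.2f}, everyone else r = {r_rest:.2f}")
```

Note that part of the overall correlation in this sketch comes from the shared downward shift itself, not only from the within-group correlation: pushing one group down on both axes pulls the cloud of points into a tilted shape even though 80% of the teachers show no relationship at all.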

Another possible situation is that there might be a non-linear relationship between these two measures. In the simulation below, I assume that different quintiles on VAM have a different linear relationship with the observational score. For example, in the plot there is not a constant slope, but teachers who are in the first quintile on VAM I assume to have a correlation of r = 0.50 with observational scores, the second quintile I assume to have a correlation of r = 0.20, and the other quintiles I assume to be uncorrelated. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e., a very small chance that a correlation of this size would be observed by random chance alone if the true correlation were zero).

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant correlation. As evidenced, teachers in the lowest 20% on the VAM score have differences in the mean observational score depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not informative as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson Correlations (the standard correlation coefficient) measure the overall strength of  linear relationships between X and Y, but if X and Y have a non-linear relationship (like as illustrated in the above), this statistic can be very misleading.
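The following Python sketch illustrates the general idea with a different and simpler construction than the one described above, so it will not reproduce the r = 0.24 figure. Instead of fixing within-quintile correlations, it assumes a piecewise-linear relationship: a steep slope in the lowest VAM quintile, a shallow slope in the second, and no relationship in the upper 60%. The slopes, sample size, and seed are all hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_piecewise(n=20000, slope_q1=1.2, slope_q2=0.4, seed=3):
    """VAM scores sorted ascending, with observational scores that depend
    on VAM only in the two lowest quintiles (hypothetical slopes)."""
    rng = random.Random(seed)
    vam = sorted(rng.gauss(0, 1) for _ in range(n))
    cut_20 = vam[n // 5]       # 20th percentile of VAM
    cut_40 = vam[2 * n // 5]   # 40th percentile of VAM
    obs = []
    for x in vam:
        # Continuous piecewise-linear mean: flat above cut_40,
        # slope_q2 between the cuts, slope_q1 below cut_20.
        mean = slope_q2 * min(x - cut_40, 0.0) + (slope_q1 - slope_q2) * min(x - cut_20, 0.0)
        obs.append(mean + rng.gauss(0, 1))
    return vam, obs


vam, obs = simulate_piecewise()
n = len(vam)
r_overall = pearson_r(vam, obs)
r_q1 = pearson_r(vam[: n // 5], obs[: n // 5])
r_q2 = pearson_r(vam[n // 5 : 2 * n // 5], obs[n // 5 : 2 * n // 5])
r_upper = pearson_r(vam[2 * n // 5 :], obs[2 * n // 5 :])

print(f"overall r = {r_overall:.2f}")
print(f"quintile 1 r = {r_q1:.2f}, quintile 2 r = {r_q2:.2f}, upper 60% r = {r_upper:.2f}")
```

The single overall coefficient hides the fact that, in this sketch, the relationship exists only at the bottom of the VAM distribution, which is the sense in which a Pearson correlation can mislead when the relationship is not linear.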

Note also that for all of these simulations very small p-values are observed (e.g., p-values <0.0000001 which, again, mean these correlations are statistically significant or that the probability of observing correlations this large by chance if the true correlation is zero, is nearly 0%). What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between ‘Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).
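A short Python sketch shows why a tiny p-value says little about the strength of a correlation. It uses Fisher's z transformation with a normal approximation, which is a standard approximation and not necessarily the test used for the simulations above. The sample sizes are assumptions, since the sizes of the simulated samples are not stated here.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: true correlation is zero.

    Uses Fisher's z transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3) when the true correlation is zero.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.28
for n in (20, 100, 1000, 10000):  # hypothetical sample sizes
    print(f"n = {n:>5}: p is about {approx_p_value(r, n):.2e}, r squared = {r ** 2:.2f}")
```

The same r = 0.28 is not significant in a small sample and overwhelmingly significant in a large one, while the share of variance explained stays at about 8% throughout.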

At the end of the day r = 0.28 is still a “weak” correlation. In addition, it might be “weak,” on average, but much stronger and statistically and practically significant for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above) typically teaching the highest needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Although, whether in fact the systems have actually been reformed is also of debate in that states are still using many of the same observational systems they were using prior (i.e., not the “binary checklists” exaggerated in the original as well as this report, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output were combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft – an Assistant Professor of Education & Economics at Brown University and Allison F. Gilmour – an Assistant Professor at Temple University (and former doctoral student at Vanderbilt University), revisited the original report. Just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except for in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play, here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently debated whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated as highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see Figure below, as per Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are also (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record about how this is also occurring in response to the “Widget Effect” report and the educational policies that followed).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of this study come from one district, their findings are similar to others evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and using survey methods to put into check these systems, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also at issue in this regard, in terms of the 1% issue, are (a) how the time and effort it takes supervisors to assist/desist after rating teachers low can make assigning low ratings seem not worth it; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.

References:

Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from http://files.eric.ed.gov/fulltext/ED533382.pdf

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). “The Widget Effect.” Education Digest, 75(2), 31–35.

The Tripod Student Survey Instrument: Its Factor Structure and Value-Added Correlations

The Tripod student perception survey instrument is a “research-based” instrument increasingly being used by states to add to states’ teacher evaluation systems as based on “multiple measures.” While there are other instruments also in use, as well as student survey instruments being developed by states and local districts, this one in particular is gaining in popularity, also in that it was used throughout the Bill & Melinda Gates Foundation’s ($43 million worth of) Measures of Effective Teaching (MET) studies. A current estimate (as per the study discussed in this post) is that during the 2015–2016 school year approximately 1,400 schools purchased and administered the Tripod. See also a prior post (here) about this instrument, or more specifically a chapter of a book about the instrument as authored by the instrument’s developer and lead researcher in the research surrounding it – Ronald Ferguson.

In a study recently released in the esteemed American Educational Research Journal (AERJ), and titled “What Can Student Perception Surveys Tell Us About Teaching? Empirically Testing the Underlying Structure of the Tripod Student Perception Survey,” researchers found that the Tripod’s factor structure did not “hold up.” That is, Tripod’s 7Cs (i.e., seven constructs including: Care, Confer, Captivate, Clarify, Consolidate, Challenge, Classroom Management; see more information about the 7Cs here) and the 36 items that are positioned within each of the 7Cs did not fit the 7C framework as theorized by instrument developer(s).

Rather, using the MET database (N=1,049 middle school math class sections; N=25,423 students), researchers found that an alternative bi-factor structure (i.e., two versus seven constructs) best fit the Tripod items theoretically positioned otherwise. These two factors included (1) a general responsivity dimension that includes all items (more or less) unrelated to (2) a classroom management dimension that governs responses on items surrounding teachers’ classroom management. Researchers were unable to distinguish seven separate dimensions across items.

Researchers also found that the two alternative factors noted (general responsivity and classroom management) were positively associated with teacher value-added scores. More specifically, results suggested that these two factors were positively and statistically significantly associated with teachers’ value-added measures based on state mathematics tests (standardized coefficients were .25 and .25, respectively), although for undisclosed reasons, results apparently suggested nothing about these two factors’ (cor)relationships with value-added estimates based on state English/language arts (ELA) tests. As per authors’ findings in the area of mathematics, prior researchers have also found low to moderate agreement between teacher ratings and student perception ratings; hence, this particular finding simply adds another source of convergent evidence.

Authors do give multiple reasons and plausible explanations as to why they found what they did that you all can read in more depth via the full article, linked to above and fully cited below. Authors also note that “It is unclear whether the original 7Cs that describe the Tripod instrument were intended to capture seven distinct dimensions on which students can reliably discriminate among teachers or whether the 7Cs were merely intended to be more heuristic domains that map out important aspects of teaching” (p. 1859); hence, this is also important to keep in mind given study findings.

As per study authors, and to their knowledge, “this study [was] the first to systematically investigate the multidimensionality of the Tripod student perception survey” (p. 1863).

Citation: Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. doi:10.3102/0002831216671864 Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0002831216671864

New Texas Lawsuit: VAM-Based Estimates as Indicators of Teachers’ “Observable” Behaviors

Last week I spent a few days in Austin, one day during which I provided expert testimony for a new state-level lawsuit that has the potential to impact teachers throughout Texas. The lawsuit is Texas State Teachers Association (TSTA) v. Texas Education Agency (TEA), Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas.

The key issue is that, as per the state’s Texas Education Code (Sec. § 21.351, see here) regarding teachers’ “Recommended Appraisal Process and Performance Criteria,” the Commissioner of Education must adopt “a recommended teacher appraisal process and criteria on which to appraise the performance of teachers. The criteria must be based on observable, job-related behavior, including: (1) teachers’ implementation of discipline management procedures; and (2) the performance of teachers’ students.” As for the latter, the State/TEA/Commissioner defined, as per its Texas Administrative Code (T.A.C., Chapter 15, Sub-Chapter AA, §150.1001, see here), that teacher-level value-added measures should be treated as one of the four measures of “(2) the performance of teachers’ students;” that is, one of the four measures recognized by the State/TEA/Commissioner as an “observable” indicator of a teacher’s “job-related” performance.

While currently no district throughout the State of Texas is required to use a value-added component to assess and evaluate its teachers, as noted, the value-added component is listed as one of four measures from which districts must choose at least one. All options listed in the category of “observable” indicators include: (A) student learning objectives (SLOs); (B) student portfolios; (C) pre- and post-test results on district-level assessments; and (D) value-added data based on student state assessment results.

Related, the state has not recommended or required that any district, if the value-added option is selected, choose any particular value-added model (VAM) or calculation approach. Nor has it recommended or required that any district adopt any consequences as attached to these outputs; however, things like teacher contract renewal and sharing teachers’ prior appraisals with other districts in which teachers might be applying for new jobs are not discouraged. Again, though, the main issue here (and the key point to which I testified) was that the value-added component is listed as an “observable” and “job-related” teacher effectiveness indicator as per the state’s administrative code.

Accordingly, my (5 hour) testimony was primarily (albeit among many other things including the “job-related” part) about how teacher-level value-added data do not yield anything that is observable in terms of teachers’ effects. Likewise, officially referring to these data in this way is entirely false, in fact, in that:

  • “We” cannot directly observe a teacher “adding” (or detracting) value (e.g., with our own eyes, like supervisors can when they conduct observations of teachers in practice);
  • Using students’ test scores to measure student growth upwards (or downwards) and over time, as is very common practice using the (very often instructionally insensitive) state-level tests required by No Child Left Behind (NCLB), and doing this once per year in mathematics and reading/language arts (that includes prior and other current teachers’ effects, summer learning gains and decay, etc.), is not valid practice. That is, doing this has not been validated by the scholarly/testing community; and
  • Worse and less valid is to thereafter aggregate this student-level growth to the teacher level and then attribute whatever “growth” (or the lack thereof) is found to something the teacher (and really only the teacher) did, calling it directly “observable.” These data are far from assessing a teacher’s causal or “observable” impacts on his/her students’ learning and achievement over time. See, for example, the prior statement released about value-added data use in this regard by the American Statistical Association (ASA) here. In this statement it is written that: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation [emphasis added to note that this is variation explained which = correlational versus causal research] in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability [emphasis added] to teachers. This is not saying that teachers have little effect on students, but that variation among teachers [emphasis added] accounts for a small part of the variation [emphasis added] in [said test] scores. The majority of the variation in [said] test scores is [inversely, 86%-99% related] to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences.”

If any of you have anything to add to this, please do so in the comments section of this post. Otherwise, I will keep you posted on how this goes. My current understanding is that this one will be headed to court.

Difficulties When Combining Multiple Teacher Evaluation Measures

A new study about multiple “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in the American Educational Research Association (AERA) sponsored and highly-esteemed Educational Evaluation and Policy Analysis journal. You can find the free and full version of this study here.

In this study authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – at the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). Multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”
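
To make concrete how models “may yield different teacher classifications,” here is a minimal Python sketch; it is not the study authors’ procedure. The teacher’s scores, the 0.50 cut score, and both sets of weights are hypothetical values chosen only for illustration, and all three measures are assumed to already sit on a common 0-to-1 scale.

```python
# All numbers below are hypothetical; none come from the study.
SCORES = {"vam": 0.35, "observation": 0.80, "survey": 0.60}
CUT = 0.50  # assumed proficiency cut score on a common 0-1 scale


def conjunctive(scores, cut):
    """Proficient only if EVERY measure meets the cut."""
    return all(value >= cut for value in scores.values())


def disjunctive(scores, cut):
    """Proficient if ANY single measure meets the cut."""
    return any(value >= cut for value in scores.values())


def compensatory(scores, weights, cut):
    """Proficient if the weighted average of the measures meets the cut."""
    composite = sum(weights[name] * scores[name] for name in scores)
    return composite >= cut


# Two hypothetical weighting policies (each set of weights sums to 1).
OBSERVATION_HEAVY = {"vam": 0.2, "observation": 0.5, "survey": 0.3}
VAM_HEAVY = {"vam": 0.7, "observation": 0.2, "survey": 0.1}

results = {
    "conjunctive": conjunctive(SCORES, CUT),
    "disjunctive": disjunctive(SCORES, CUT),
    "compensatory (observation-heavy)": compensatory(SCORES, OBSERVATION_HEAVY, CUT),
    "compensatory (VAM-heavy)": compensatory(SCORES, VAM_HEAVY, CUT),
}

for model, proficient in results.items():
    print(f"{model}: {'proficient' if proficient else 'not proficient'}")
```

Under these assumed numbers, the same teacher is rated not proficient by the conjunctive and VAM-heavy models but proficient by the other two, even though nothing about the teacher’s underlying scores changes.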

This, of course, has huge implications for teacher evaluation systems as based upon multiple measures in that “accuracy” means “validity” and “valid” decisions cannot be made as based on “invalid” or “inaccurate” data that can so arbitrarily change. In other words, what this means is that likely never will a decision about a teacher being this or that actually mean this or that. In fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures combined are assumed to function otherwise. This is especially interesting given that, again and as stated prior, the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with the attachment of consequences to teachers’ multiple evaluation measures, as assumed (by the state but not the state’s consultant?) to be accurate and correct (see, for example, here).

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively advised others to use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers” (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given all of this also makes the social science arbitrary, which is also a legal issue in and of itself, as also quasi noted.

Now, while I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, as unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary, how principal component, unit weight, policy weight, optimal prediction approaches vary), I will also note that forcing the fit of such multiple measures in such ways, especially without a thorough background in and understanding of reliability and validity and what reliability means for validity (i.e., with rather high levels of reliability required before any valid inferences and especially high-stakes decisions can be made) is certainly unwise.

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly cause pause, if not a justifiable grievance as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0162373716666166

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers representing racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of their relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

Authors’ key finding follows (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the same district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise, it’s also possibly prejudiced.

While study authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Miami-Dade, Florida’s Recent “Symbolic” and “Artificial” Teacher Evaluation Moves

Last spring, Eduardo Porter – writer of the Economic Scene column for The New York Times – wrote an excellent article, from an economics perspective, about that which is happening with our current obsession in educational policy with “Grading Teachers by the Test” (see also my prior post about this article here; although you should give the article a full read; it’s well worth it). In short, though, Porter wrote about what economists often refer to as Goodhart’s Law, which states that “when a measure becomes the target, it can no longer be used as the measure.” This occurs given the great (e.g., high-stakes) value (mis)placed on any measure, and the distortion (i.e., in terms of artificial inflation or deflation, depending on the desired direction of the measure) that often-to-always comes about as a result.

Well, it’s happened again, this time in Miami-Dade, Florida, where the Miami-Dade district’s teachers are saying it’s now “getting harder to get a good evaluation” (see the full article here). Apparently, teachers’ evaluation scores, from last year to this year, are being “dragged down,” primarily given teachers’ students’ performances on tests (as well as tests in subject areas they do not teach and of students whom they do not teach).

“In the weeks after teacher evaluations for the 2015-16 school year were distributed, Miami-Dade teachers flooded social media with questions and complaints. Teachers reported similar stories of being evaluated based on test scores in subjects they don’t teach and not being able to get a clear explanation from school administrators. In dozens of Facebook posts, they described feeling confused, frustrated and worried. Teachers risk losing their jobs if they get a series of low evaluations, and some stand to gain pay raises and a bonus of up to $10,000 if they get top marks.”

As per the figure also included in this article, see the illustration of how this is occurring below; that is, how it is becoming more difficult for teachers to get “good” overall evaluation scores but also, and more importantly, how it is becoming more common for districts to simply set different cut scores to artificially increase teachers’ overall evaluation scores.

“Miami-Dade say the problems with the evaluation system have been exacerbated this year as the number of points needed to get the “highly effective” and “effective” ratings has continued to increase. While it took 85 points on a scale of 100 to be rated a highly effective teacher for the 2011-12 school year, for example, it now takes 90.4.”

This, as mentioned prior, is something called “artificial deflation,” whereby the quality of teaching is likely not changing nearly to the extent the data might illustrate it is. Rather, what is happening behind the scenes (e.g., the manipulation of cut scores) is giving the impression that the overall teacher evaluation system is in fact becoming better, more rigorous, more aligned with policymakers’ “higher standards,” etc.
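
A small Python sketch of the arithmetic behind this point follows. The two cut scores (85 and 90.4) come from the Miami Herald quote above; the list of teacher point totals is entirely hypothetical and is held fixed so that only the cut score changes.

```python
# Cut scores for "highly effective" as quoted from the Miami Herald article.
CUT_2011_12 = 85.0
CUT_NOW = 90.4

# Hypothetical point totals (0-100 scale) for ten teachers; NOT real data.
POINTS = [78.0, 82.5, 85.0, 86.0, 87.5, 88.0, 89.9, 90.4, 92.0, 95.0]


def share_highly_effective(points, cut):
    """Return the fraction of teachers whose points meet or exceed the cut."""
    return sum(p >= cut for p in points) / len(points)


old_share = share_highly_effective(POINTS, CUT_2011_12)
new_share = share_highly_effective(POINTS, CUT_NOW)

print(f"Cut of {CUT_2011_12}: {old_share:.0%} rated highly effective")
print(f"Cut of {CUT_NOW}: {new_share:.0%} rated highly effective")
```

With these made-up point totals, the share of teachers rated highly effective falls from 8 in 10 to 3 in 10 purely because the cut score moved, which is the kind of shift that can look like a change in teaching quality when none has occurred.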

This is something in the educational policy arena that we also call “symbolic policies,” whereby nothing really instrumental or material is happening, and everything else is a facade, concealing a less pleasant or creditable reality that nothing, in fact, has changed.

Citation: Gurney, K. (2016). Teachers say it’s getting harder to get a good evaluation. The school district disagrees. The Miami Herald. Retrieved from http://www.miamiherald.com/news/local/education/article119791683.html#storylink=cpy