Can More Teachers Be Covered Using VAMs?

Some researchers continue to explore the potential worth of value-added models (VAMs) for measuring teacher effectiveness. I do not endorse the perpetual tweaking of this or twisting of that to explore how VAMs might be made “better” for such purposes, especially given the decades of research we now have evidencing the plethora of problems with using VAMs in these ways. Nonetheless, I do try to write about current events, including current research published on this topic, for this blog. Hence, I write here about a study that researchers from Mathematica Policy Research released last month on whether more teachers might be made VAM-eligible (download the full study here).

One of the main issues with VAMs is that they can typically be used to measure the effects of only approximately 30% of all public school teachers. The other 70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers) or teachers who do not teach the core subject areas assessed using large-scale standardized tests (e.g., mathematics and reading/language arts), cannot be evaluated or held accountable using VAM data. This is more generally termed an issue with fairness, defined by our profession’s Standards for Educational and Psychological Testing as the impartiality of “test score interpretations for intended use(s) for individuals from all [emphasis added] relevant subgroups” (p. 219). Issues of fairness arise when a test, or a test-based inference or use, impacts some more than others in unfair or prejudiced, yet often consequential, ways.

Accordingly, in this study researchers explored whether teachers of subject areas that are tested only occasionally and in non-consecutive grade levels (e.g., science and social studies in grades 4 and 7, or 5 and 8) can be evaluated with VAMs by using their students’ scores on the other, consecutively administered subject area tests (i.e., mathematics and reading/language arts) as proxy pretests to help isolate teachers’ contributions to students’ achievement in the otherwise excluded subject areas. Indeed, it is true that “states and districts have little information about how value-added models [VAMs] perform in grades when tests in the same subject are not available from the previous year.” Yet states (e.g., New Mexico) continue to do this without evidence that it works; this is also one point of contention in the ongoing lawsuit there. Hence, the purpose of this study was to explore (using state-level data from Oklahoma) how well doing this works, given that the use of such proxy pretests “could allow states and districts to increase the number of teachers for whom value-added models [could] be used” (i.e., increase fairness).

However, researchers found that when doing just this: (1) VAM estimates that do not account for same-subject pretests may be less credible than estimates that use same-subject pretests from prior and adjacent grade levels (note that the authors do not explicitly define what they mean by credible, but they treat the term as synonymous with valid); (2) doing this may subsequently lead to relatively more biased VAM estimates, even more so than changing some other features of VAMs; and (3) doing this may make VAM estimates less precise, or reliable. Put more succinctly, using mathematics and reading/language arts scores as pretests to help measure teachers’ value-added in other subjects (e.g., science and social studies) yields VAM estimates that are less credible (aka less valid), more biased, and less precise (aka less reliable).
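To see why a weaker proxy pretest can degrade value-added estimates, consider a minimal simulation. This is a sketch in Python, not the Mathematica study’s actual model; the variance parameters and the classroom-sorting assumption are mine. The intuition it illustrates: when the pretest captures less of students’ prior achievement, more of that achievement leaks into teachers’ estimated effects.

```python
# Sketch: compare VAM estimates computed with a same-subject pretest vs. a
# weaker "proxy" pretest (e.g., prior math used for a science teacher).
# All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 200, 30

true_effect = rng.normal(0, 1, n_teachers)            # true teacher effects
class_mean = rng.normal(0, 0.7, n_teachers)           # classroom-level sorting on ability
ability = class_mean[:, None] + rng.normal(0, 1, (n_teachers, n_students))

same_pre = ability + rng.normal(0, 0.5, ability.shape)          # prior same-subject score
proxy_pre = 0.6 * ability + rng.normal(0, 0.8, ability.shape)   # prior other-subject score
post = ability + true_effect[:, None] + rng.normal(0, 0.5, ability.shape)

def vam(pre):
    """Covariate-adjustment VAM: regress post on pretest, average residuals by teacher."""
    x, y = pre.ravel(), post.ravel()
    b = np.cov(x, y)[0, 1] / np.var(x)
    return (y - b * x).reshape(post.shape).mean(axis=1)

r_same = np.corrcoef(vam(same_pre), true_effect)[0, 1]
r_proxy = np.corrcoef(vam(proxy_pre), true_effect)[0, 1]
print(f"correlation with true effects: same-subject {r_same:.2f}, proxy {r_proxy:.2f}")
```

Under these assumptions, the proxy-pretest estimates track the true teacher effects noticeably less well, mirroring the study’s credibility and bias findings.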

The authors caution, however, that “some policy makers might interpret [these] findings as firm evidence against using value-added estimates that rely on proxy pretests, [but this interpretation may be] too strong. The choice between different evaluation measures always involves trade-offs, and alternatives to value-added estimates [e.g., classroom observations and student learning objectives (SLOs)] also have important limitations.”

Their suggestion, rather, is for “[p]olicymakers [to] reduce the weight given to value-added estimates from models that rely on proxy pretests relative to the weight given to those of other teachers in subjects with pretests.” With all of this, I disagree. Using this or that statistical adjustment, shrinkage approach, set of adjusted weights, or the like is, as I said before, at this point frivolous.

Reference: Walsh, E., Dotter, D., & Liu, A. Y. (2018). Can more teachers be covered? The accuracy, credibility, and precision of value-added estimates with proxy pre-tests. Washington, DC: Mathematica Policy Research. Retrieved from https://www.mathematica-mpr.com/our-publications-and-findings/publications/can-more-teachers-be-covered-the-accuracy-credibility-and-precision-of-value-added-estimates

New Mexico Teacher Evaluation Lawsuit Updates

In December of 2015 in New Mexico, via a preliminary injunction set forth by state District Judge David K. Thomson, all consequences attached to teacher-level value-added model (VAM) scores (e.g., flagging the files of teachers with low VAM scores) were suspended throughout the state until the state (and/or others external to the state) could prove to the state court that the system was reliable, valid, fair, uniform, and the like. The trial during which this evidence is to be presented by the state is currently set for this October. See more information about this ruling here.

As the expert witness for the plaintiffs in this case, I was deposed a few weeks ago here in Phoenix, given my analyses of the state’s data (supported by one of my PhD students – Tray Geiger). In short, we found and I testified during the deposition that:

  • In terms of uniformity and fairness, approximately 70% of New Mexico teachers are ineligible to be assessed using VAMs, and this proportion held constant across the years of data analyzed. This is even more important to note given that, when VAM-based data are used to make consequential decisions about teachers, issues with fairness and uniformity become even more pressing: accountability-eligible teachers are also those relatively more likely to realize the negative, or reap the positive, consequences attached to VAM-based estimates.
  • In terms of reliability (or the consistency of teachers’ VAM-based scores over time), approximately 40% of teachers differed by one quintile (quintiles are derived when a sample or population is divided into fifths) and approximately 28% of teachers differed by two or more quintiles, from year to year, in terms of their VAM-derived effectiveness ratings. These results make sense when New Mexico’s results are situated within the current literature, in which teachers classified as “effective” one year can have a 25%-59% chance of being classified as “ineffective” the next, or vice versa, with other permutations also possible.
  • In terms of validity (i.e., concurrent-related evidence of validity), and importantly as also situated within the current literature, the correlations between New Mexico teachers’ VAM-based and observational scores ranged from r = 0.153 to r = 0.210. Not only are these correlations very weak[1]; they are also weak relative to the literature, in which correlations between multiple VAMs and observational scores typically range from 0.30 ≤ r ≤ 0.50.
  • In terms of bias, New Mexico’s Caucasian teachers had significantly higher observation scores than non-Caucasian teachers, implying, also as per the current research, that Caucasian teachers may be (falsely) perceived as being better teachers than non-Caucasian teachers given bias within these instruments and/or bias among the scorers observing and scoring teachers using these instruments in practice. See prior posts about observational-based bias here, here, and here.
  • Also of note in terms of bias: (1) teachers with fewer years of experience had VAM scores that were significantly lower than those of teachers with more years of experience, with similar patterns noted across teachers’ observation scores, which could mean, in line with both common sense and the research, that teachers with more experience are typically better teachers; (2) teachers who taught English language learners (ELLs) or special education students had lower VAM scores across the board than those who did not teach such students; (3) teachers who taught gifted students had significantly higher VAM scores than teachers who did not, which runs counter to the current research evidencing that ceiling effects among gifted students often thwart or prevent their teachers from demonstrating growth; and (4) teachers in schools with lower relative proportions of ELLs, special education students, students eligible for free-or-reduced lunches, and students from racial minority backgrounds, as well as higher relative proportions of gifted students, consistently had significantly higher VAM scores. These results suggest that teachers in these schools are, as a group, better, and/or that VAM-based estimates might be biased against teachers not teaching in these schools, preventing them from demonstrating comparable growth.
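The reliability finding in the second bullet above is easy to reproduce in principle. Here is a small Python sketch; it is my own illustration, not the actual New Mexico data, and the assumed year-to-year stability of r = 0.4 is a figure drawn from the broader VAM literature rather than from the state’s files.

```python
# Sketch of how modest year-to-year stability (assumed r = 0.4) produces
# large quintile churn in simulated teachers' VAM ratings.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
r = 0.4  # assumed year-to-year correlation of VAM scores

year1 = rng.normal(size=n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.normal(size=n)

def quintile(x):
    """Return quintile labels 0 (lowest 20%) through 4 (highest 20%)."""
    return np.searchsorted(np.quantile(x, [0.2, 0.4, 0.6, 0.8]), x)

jump = np.abs(quintile(year1) - quintile(year2))
print(f"moved exactly 1 quintile: {np.mean(jump == 1):.0%}")
print(f"moved 2+ quintiles:       {np.mean(jump >= 2):.0%}")
```

With these assumptions, roughly a third or more of simulated teachers move one quintile and roughly a third move two or more, figures in the same ballpark as those observed in New Mexico.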

To read more about the data and methods used, as well as other findings, please see my affidavit submitted to the court attached here: Affidavit Feb2018.

Also in terms of a recent update, I should note that a few weeks ago, as per an article in the Albuquerque Journal, New Mexico’s teacher evaluation system is now likely to be overhauled, or simply allowed to “expire,” as early as 2019. In short, “all three Democrats running for governor and the lone Republican candidate…have expressed misgivings about using students’ standardized test scores to evaluate the effectiveness of [New Mexico’s] teachers, a key component of the current system [at issue in this lawsuit and] imposed by the administration of outgoing Gov. Susana Martinez.” All four candidates described the current system “as fundamentally flawed and said they would move quickly to overhaul it.”

While I/we will continue our efforts pertaining to this lawsuit until further notice, this is important to note at this time in that New Mexico’s incoming policymakers seem likely to be much wiser than the outgoing ones, at least in these regards.

[1] Interpreting r: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0.0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.


An Important but False Claim about the EVAAS in Ohio

Just this week in Ohio – a state that continues to contract with SAS Institute Inc. for test-based accountability output from its Education Value-Added Assessment System – SAS’s EVAAS Director, John White, “defended” the use of his model statewide, during which he also claimed before Ohio’s Joint Education Oversight Committee (JEOC) that “poorer schools do no better or worse on student growth than richer schools” when using the EVAAS model.

For the record, this is false. First, about five years ago, while Ohio was using the same EVAAS model, The Plain Dealer, in conjunction with StateImpact Ohio, found that Ohio’s “value-added results show that districts, schools and teachers with large numbers of poor students tend to have lower value-added results than those that serve more-affluent ones.” They also found that:

  • Value-added scores were 2½ times higher on average for districts where the median family income is above $35,000 than for districts with income below that amount.
  • For low-poverty school districts, two-thirds had positive value-added scores — scores indicating students made more than a year’s worth of progress.
  • For high-poverty school districts, two-thirds had negative value-added scores — scores indicating that students made less than a year’s progress.
  • Almost 40 percent of low-poverty schools scored “Above” the state’s value-added target, compared with 20 percent of high-poverty schools.
  • At the same time, 25 percent of high-poverty schools scored “Below” state value-added targets while low-poverty schools were half as likely to score “Below.” See the study here.

Second, about three years ago, similar results were evidenced in Pennsylvania, another state that uses the same EVAAS statewide, although there the model is known as the Pennsylvania Education Value-Added Assessment System (PVAAS). Research for Action (click here for more about the organization and its mission), more specifically, evidenced that bias also appears to exist, particularly at the school level. See more here.

Third, and related, in Arizona, my state, which also uses growth to measure school-level value-added, albeit not with the EVAAS, the same issues with bias are being evidenced when measuring school-level growth for similar purposes. Just two days ago, for example, The Arizona Republic evidenced that the “schools with ‘D’ and ‘F’ letter grades” recently released by the state board of education “were more likely to have high percentages of students eligible for free and reduced-price lunch, an indicator of poverty” (see more here). In actuality, the correlation is as high, or as “strong,” as r = -0.60 (correlation coefficients between ±0.50 and ±1.00 are often said to indicate “strong” correlations). In more pragmatic terms, this negative correlation means that the better the school letter grade received, the lower the level of poverty at the school.

While the state of Arizona combines a proficiency measure (always strongly correlated with poverty) with growth, and this explains at least some of the strength of this correlation (although combining proficiency with growth is also a practice endorsed and encouraged by John White), this strong correlation is certainly at issue.

More specifically at issue, though, should be how to get any such correlation down to zero or near-zero (if possible), which is the only correlation that would, in fact, warrant any such claim, again as noted to the JEOC this week in Ohio, that “poorer schools do no better or worse on student growth than richer schools”.

Bias in VAMs, According to Validity Expert Michael T. Kane

During the still ongoing value-added lawsuit in New Mexico (see my most recent update about this case here), I was honored to testify as the expert witness on behalf of the plaintiffs (see, for example, here). I was also fortunate to witness the testimony of the expert witness who testified on behalf of the defendants: Thomas Kane, Economics Professor at Harvard and former Director of the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies. During Kane’s testimony, one of the highlights (i.e., for the plaintiffs), or rather low-lights (i.e., for him and the defendants), in my opinion, was when one of the plaintiffs’ attorneys questioned Kane, on the stand, about his expertise in the area of validity. In sum, Kane responded that he defined himself as an “expert” in the area, having also been trained by some of the best. Consequently, the plaintiffs’ attorneys questioned Kane about different types of validity evidence (e.g., construct, content, criterion), and Kane could not answer those questions. The only form of validity evidence with which he was familiar, and which he could clearly define, was evidence related to predictive validity. This hardly made him the expert he proclaimed himself to be minutes prior.

Let’s not mince words, though, or in this case names.

A real expert in validity (and validity theory) is another Kane, who goes by the full name of Michael T. Kane. This Kane is The Samuel J. Messick Chair in Test Validity at the Educational Testing Service (ETS); this Kane wrote one of the best, most contemporary, and currently most foundational papers on validity (see here); and this Kane just released an ETS-sponsored paper on Measurement Error and Bias in Value-Added Models certainly of interest here. I summarize this piece below (see the PDF of this report here).

In this paper Kane examines “the origins of [value-added model (VAM)-based] bias and its potential impact” and indicates that bias that is observed “is an increasing linear function of the student’s prior achievement and can be quite large (e.g., half a true-score standard deviation) for very low-scoring and high-scoring students [i.e., students in the extremes of any normal distribution]” (p. 1). Hence, Kane argues, “[t]o the extent that students with relatively low or high prior scores are clustered in particular classes and schools, the student-level bias will tend to generate bias in VAM estimates of teacher and school effects” (p. 1; see also prior posts about this type of bias here, here, and here; see also Haertel (2013) cited below). Kane concludes that “[a]djusting for this bias is possible, but it requires estimates of generalizability (or reliability) coefficients that are more accurate and precise than those that are generally available for standardized achievement tests” (p. 1; see also prior posts about issues with reliability across VAMs here, here, and here).

Kane’s more specific points of note:

  • To accurately calculate teachers’/schools’ value-added, “current and prior scores have to be on the same scale (or on vertically aligned scales) for the differences to make sense. Furthermore, the scale has to be an interval scale in the sense that a difference of a certain number of points has, at least approximately, the same meaning along the scale, so that it makes sense to compare gain scores from different parts of the scale…some uncertainty about scale characteristics is not a problem for many applications of vertical scaling, but it is a serious problem if the proposed use of the scores (e.g., educational accountability based on growth scores) demands that the vertical scale be demonstrably equal interval” (p. 1).
  • Likewise, while some approaches can be used to minimize the need for such scales (e.g., residual gain scores, covariate-adjustment models, and ordinary least squares (OLS) regression approaches which are of specific interest in this piece), “it is still necessary to assume [emphasis added] that a difference of a certain number of points has more or less the same meaning along the score scale for the current test scores” (p. 2).
  • Related, “such adjustments can [still] be biased to the extent that the predicted score does not include all factors that may have an impact on student performance. Bias can also result from errors of measurement in the prior scores included in the prediction equation…[and this can be]…substantial” (p. 2).
  • Accordingly, “gains for students with high true scores on the prior year’s test will be overestimated, and the gains for students with low true scores in the prior year will be underestimated. To the extent that students with relatively low and high true scores tend to be clustered in particular classes and schools, the student-level bias will generate bias in estimates of teacher and school effects” (p. 2).
  • Hence, “if not corrected, this source of bias could have a substantial negative impact on estimated VAM scores for teachers and schools that serve students with low prior true scores and could have a substantial positive impact for teachers and schools that serve mainly high-performing students” (p. 2).
  • Put differently, random errors in students’ prior scores may “tend to add a positive bias to the residual gain scores for students with prior scores above the population mean, and they [may] tend to add a negative bias to the residual gain scores for students with prior scores below the mean. Th[is] bias is associated with the well-known phenomenon of regression to the mean” (p. 10).
  • At least this latter claim (that students with relatively high true scores in the prior year could substantially and positively impact their teachers’/schools’ value-added estimates) does, though, run somewhat contrary to other claims evidenced in the literature about the extent to which ceiling effects substantially and negatively impact teachers’/schools’ value-added estimates (see, for example, Point #7 as per the ongoing lawsuit in Houston here, and see also Florida teacher Luke Flint’s “Story” here).
  • In sum, and as should be a familiar conclusion to followers of this blog, “[g]iven that the results of VAMs may be used for high-stakes decisions about teachers and schools in the context of accountability programs,…any substantial source of bias would be a matter of great concern” (p. 2).
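Kane’s regression-to-the-mean argument can be illustrated with a short Python sketch. This is an assumption-laden toy, not Kane’s analysis: I assume standard-normal true prior scores, pretest measurement error with a standard deviation of 0.5, and no true growth differences at all. Even so, students with high true prior scores receive positive residual “gains” and students with low true prior scores receive negative ones.

```python
# Sketch of bias from measurement error in the pretest: residual gain scores
# are biased up for high-true-score students and down for low-true-score ones,
# even when no student actually grows more than any other.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
true_prior = rng.normal(size=n)
prior_obs = true_prior + rng.normal(0, 0.5, n)  # pretest with measurement error
post = true_prior + rng.normal(0, 0.5, n)       # no real growth differences

b = np.cov(prior_obs, post)[0, 1] / np.var(prior_obs)  # OLS slope < 1 (attenuation)
resid = post - b * prior_obs                            # residual "gain" scores

hi = true_prior > 1   # students with high *true* prior achievement
lo = true_prior < -1
print(f"slope {b:.2f}; mean residual: high group {resid[hi].mean():+.2f}, "
      f"low group {resid[lo].mean():+.2f}")
```

If students like these cluster in particular classrooms, the same positive and negative biases pass straight through to teacher- and school-level estimates, which is exactly Kane’s point.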

Citation: Kane, M. T. (2017). Measurement error and bias in value-added models. Princeton, NJ: Educational Testing Service (ETS) Research Report Series. doi:10.1002/ets2.12153. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full

See also Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

More of Kane’s “Objective” Insights on Teacher Evaluation Measures

You might recall from a series of prior posts (see, for example, here, here, and here) the name of Thomas Kane, an economics professor from Harvard University who directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, and who also testified as an expert witness in two lawsuits (i.e., in New Mexico and Houston) opposite me (and, in the case of Houston, also opposite Jesse Rothstein).

He, along with Andrew Bacher-Hicks (PhD Candidate at Harvard), Mark Chin (PhD Candidate at Harvard), and Douglas Staiger (Economics Professor at Dartmouth), just released yet another National Bureau of Economic Research (NBER) “working paper” (i.e., not peer-reviewed, and in this case not internally reviewed by NBER for public consumption and use either) titled “An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.” I review this study here.

Using Kane’s MET data, they test whether 66 mathematics teachers’ performance, as measured (1) by students’ test achievement gains (i.e., calculated using value-added models (VAMs)), classroom observations, and student surveys, and (2) under naturally occurring (i.e., non-experimental) settings, “predicts performance following random assignment of that teacher to a class of students” (p. 2). More specifically, researchers “observed a sample of fourth- and fifth-grade mathematics teachers and collected [these] measures…[under normal conditions, and then in]…the third year…randomly assigned participating teachers to classrooms within their schools and then again collected all three measures” (p. 3).

They concluded that “the test-based value-added measure…is a valid predictor of teacher impacts on student achievement following random assignment” (p. 28). This finding “is the latest in a series of studies” (p. 27) substantiating this not-surprising, as-oft-Kane-asserted finding, or, as he might assert it, fact. I should note here that no other studies substantiating the “latest in a series of studies” (p. 27) claim are referenced or cited. A quick review of the 31 total references included in this report, however, shows that 16/31 (52%) are works conducted by econometricians alone (i.e., not statisticians or other educational researchers) on this general topic, of which 10/16 (63%) are not peer reviewed and 6/16 (38%) are either authored or co-authored by Kane (1/6 being published in a peer-reviewed journal). The other articles cited are about the measurements used, the general methods used in this study, and four other articles written on the topic not authored by econometricians. Needless to say, there is a clear and obvious slant in this piece, and it is unfortunately not surprising; but had the piece gone through any respectable vetting process, this sh/would have been caught and addressed prior to the study’s release.

I must add that this reminds me of Kane’s New Mexico testimony (see here), in which he, again, “stressed that numerous studies [emphasis added] show[ed] that teachers [also] make a big impact on student success.” He stated this on the stand while expressly contradicting the findings of the American Statistical Association (ASA). While testifying otherwise, he again referenced only (non-representative) studies, in his (or rather the defendants’) support, authored primarily by him (e.g., as per his MET studies) and some of his econometric colleagues (e.g., Raj Chetty, Eric Hanushek, Doug Staiger), as also cited within this piece. This was also a concern registered by the court, in terms of whether Kane’s expertise was that of a generalist (i.e., competent across the multi-disciplinary studies conducted on the matter) or a “selectivist” (i.e., biased in terms of his prejudice against, or rather his selectivity of, certain studies for confirmation, inclusion, or acknowledgment). This is certainly relevant, and should be taken into consideration here as well.

Otherwise, in this study the authors also found that the Mathematical Quality of Instruction (MQI) observational measure (one of the two observational measures they used, the other being the Classroom Assessment Scoring System (CLASS)) was a valid predictor of teachers’ classroom observations following random assignment. The MQI also did “not seem to be biased by the unmeasured characteristics of students [a] teacher typically teaches” (p. 28). This expressly contradicts what is now an emerging set of studies evidencing the contrary, also not cited in this particular piece (see, for example, here, here, and here), some of which were also conducted using Kane’s MET data (see, for example, here and here).

Finally, the authors’ evidence on the predictive validity of student surveys was inconclusive.

Needless to say…

Citation: Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w23478

On Conditional Bias and Correlation: A Guest Post

After I posted about “Observational Systems: Correlations with Value-Added and Bias,” a blog follower, associate professor, and statistician named Laura Ring Kapitula (see also a very influential article she wrote on VAMs here) posted comments on this site that I found of interest, and I thought would also be of interest to blog followers. Hence, I invited her to write a guest post, and she did.

She used R (i.e., a free software environment for statistical computing and graphics) to simulate correlation scatterplots (see Figures below) to illustrate three unique situations: (1) a simulation where there are two indicators (e.g., teacher value-added and observational estimates plotted on the x and y axes) that have a correlation of r = 0.28 (the highest correlation coefficient at issue in the aforementioned post); (2) a simulation exploring the impact of negative bias and a moderate correlation on a group of teachers; and (3) another simulation with two indicators that have a non-linear relationship possibly induced or caused by bias. She designed simulations (2) and (3) to illustrate the plausibility of the situation suggested next (as written in Audrey’s prior post) about potential bias in both value-added and observational estimates:

If there is some bias present in value-added estimates, and some bias present in the observational estimates…perhaps this is why these low correlations are observed. That is, only those teachers teaching classrooms inordinately stacked with students from racial minority, poor, low achieving, etc. groups might yield relatively stronger correlations between their value-added and observational scores given bias, hence, the low correlations observed may be due to bias and bias alone.

Laura continues…

Here, Audrey makes the point that a correlation of r = 0.28 is “weak.” It is, accordingly, useful to see an example of just how “weak” such a correlation is by looking at a scatterplot of data selected from a population where the true correlation is r = 0.28. To make the illustration more meaningful, the points are colored based on quintiles of the simulated teachers’ value-added (the lowest 20%, the next 20%, etc.).

In this figure you can see, by looking at the blue “least squares line,” that “on average,” as a simulated teacher’s value-added estimate increases, the average of that teacher’s observational estimate increases. However, there is a lot of variability (or scatter) around the line. Given this variability, we can make statements about averages, such as that “on average” teachers in the top 20% for VAM scores will likely have higher observational scores; however, there is not nearly enough precision to make any (and certainly not any good) predictions about the observational score from the VAM score for individual teachers. In fact, the linear relationship between teachers’ VAM and observational scores only accounts for about 8% of the variation in either score. Note: we get 8% by squaring the aforementioned r = 0.28 correlation (i.e., an R-squared). The other 92% of the variance is due to error and other factors.
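For readers who want to reproduce this, a minimal version of the first simulation (in Python rather than Laura’s R, with assumed standard-normal scores) is:

```python
# Draw pairs with a true correlation of 0.28 and check how little variance
# the two simulated indicators actually share.
import numpy as np

rng = np.random.default_rng(3)
n, r = 10_000, 0.28
vam = rng.normal(size=n)
obs = r * vam + np.sqrt(1 - r**2) * rng.normal(size=n)

r_hat = np.corrcoef(vam, obs)[0, 1]
print(f"sample r = {r_hat:.2f}, R^2 = {r_hat**2:.1%}")  # roughly 8% shared variance
```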

What this means in practice is that when correlations are this “weak,” it is reasonable to make statements about averages, for example, that “on average” as one variable increases the mean of the other variable increases, but it would not be prudent or wise to make predictions for individuals based on these data. See, for example, that individuals in the top 20% (quintile 5) on VAM have a very large spread in their observational scores, with 95% of the scores in the top quintile falling between the 7th and 98th percentiles for their observational scores. So, if we observe a VAM score for a specific teacher in the top 20%, and we do not know their observational score, we cannot say much more than that their observational score is likely to be somewhere in the top 90%. Similarly, if we observe a VAM score in the bottom 20%, we cannot say much more than that their observational score is likely to be somewhere in the bottom 90%. That is not saying a lot, in terms of precision or in terms of practice.

I ran the second simulation to test how bias that impacts only a small group of teachers might theoretically impact an overall correlation, as posited by Audrey. Here I simulated a situation where, again, there are two measures for each teacher in a population: a value-added score and an observational score. I then inserted a group of teachers (as Audrey described) who represent 20% of the population and teach a disproportionate number of students from relatively lower socioeconomic, higher racial minority, etc. backgrounds. I assumed this group is measured with negative bias (by one standard deviation, on average) on both indicators and has a moderate correlation between the indicators of r = 0.50, while for the other 80% of the population the two indicators are uncorrelated.

What you can see is that bias impacting only a certain group on the two indicators can, by itself, produce an observed correlation overall. In other words, a stronger correlation within just one group of teachers (here, the teachers scoring lowest on both their value-added and observational indicators) can drive the “weak” correlation observed on average or overall.
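A Python sketch of this second simulation (my own paraphrase of Laura’s R setup, keeping her group sizes, one-standard-deviation shifts, and within-group correlations) might look like:

```python
# Mixture simulation: a 20% "biased" group shifted down one SD on both
# measures with r = 0.5 within the group; the other 80% uncorrelated.
import numpy as np

rng = np.random.default_rng(4)
n_biased, n_rest = 20_000, 80_000

# Biased group: a shared component yields r = 0.5; both measures shifted down 1 SD
common = rng.normal(0, np.sqrt(0.5), n_biased)
vam_b = common + rng.normal(0, np.sqrt(0.5), n_biased) - 1.0
obs_b = common + rng.normal(0, np.sqrt(0.5), n_biased) - 1.0

# Remaining 80%: the two measures are independent (r = 0)
vam_r = rng.normal(size=n_rest)
obs_r = rng.normal(size=n_rest)

vam = np.concatenate([vam_b, vam_r])
obs = np.concatenate([obs_b, obs_r])

r_biased = np.corrcoef(vam_b, obs_b)[0, 1]
r_all = np.corrcoef(vam, obs)[0, 1]
print(f"within the biased group r = {r_biased:.2f}; pooled overall r = {r_all:.2f}")
```

Even though 80% of the simulated population shows no relationship at all, the pooled correlation comes out around 0.2, in the same range as the “weak” correlations at issue.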

Another possible situation is that there is a non-linear relationship between these two measures. In the simulation below, I assume that different quintiles on VAM have different linear relationships with the observational score. That is, the plot does not have a constant slope: teachers in the first quintile on VAM are assumed to have a correlation of r = 0.50 with observational scores, teachers in the second quintile a correlation of r = 0.20, and teachers in the other quintiles no correlation at all. This results in an overall correlation in the simulation of r = 0.24, with a very small p-value (i.e., a very small chance that a correlation of this size would be observed by random chance alone if the true correlation were zero).
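A sketch of this quintile-dependent construction, again in hypothetical Python rather than the original R, with the quintile correlations (0.50, 0.20, 0, 0, 0) taken from the text (how the within-quintile relationships are generated here is my own assumption about the construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
vam = rng.standard_normal(n)
noise = rng.standard_normal(n)

# Assign each teacher to a VAM quintile (0 = bottom 20%, 4 = top 20%)
q = np.searchsorted(np.quantile(vam, [0.2, 0.4, 0.6, 0.8]), vam)

# Quintile-specific relationships: only the bottom two quintiles have
# any linear relationship with the observational score
r_by_q = np.array([0.50, 0.20, 0.0, 0.0, 0.0])[q]
obs = r_by_q * vam + np.sqrt(1 - r_by_q**2) * noise

r_overall = np.corrcoef(vam, obs)[0, 1]
print(f"overall r = {r_overall:.2f}")
```

The overall Pearson correlation comes out small but clearly nonzero, even though for 60% of the simulated teachers the two measures are completely unrelated.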

What this means in practice is that if, in fact, there is a non-linear relationship between teachers’ observational and VAM scores, this can induce a small but statistically significant overall correlation. As evidenced, teachers in the lowest 20% on the VAM score have mean observational scores that differ depending on the VAM score (a moderate correlation of r = 0.50), but for the other 80%, knowing the VAM score is not very informative, as there is a very small correlation for the second quintile and no correlation for the upper 60%. So, if quintile cut-off scores are used, teachers can easily be misclassified. In sum, Pearson correlations (the standard correlation coefficient) measure the overall strength of linear relationships between X and Y, but if X and Y have a non-linear relationship (as illustrated above), this statistic can be very misleading.

Note also that for all of these simulations very small p-values are observed (e.g., p-values < 0.0000001 which, again, mean these correlations are statistically significant, or that the probability of observing correlations this large by chance, if the true correlation is zero, is nearly 0%). What this illustrates, again, is that correlations (especially correlations this small) are (still) often misleading. While they might be statistically significant, they might mean relatively little in the grand scheme of things (i.e., in terms of practical significance; see also “The Difference Between ‘Significant’ and ‘Not Significant’ is not Itself Statistically Significant” or posts on Andrew Gelman’s blog for more discussion on these topics if interested).
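To see how sample size, rather than practical magnitude, drives these tiny p-values, here is a small Python sketch using the standard t-statistic for a Pearson correlation (with a normal approximation to the t-distribution for the tail probability, which is adequate at these sample sizes):

```python
import math

def pearson_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, via the usual test
    statistic t = r * sqrt((n - 2) / (1 - r^2)), with a normal
    approximation to the t-distribution's tail probability."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

# The same "weak" r = 0.24 is vanishingly "significant" with a large
# sample, but not even significant at 0.05 with a small one
print(pearson_p_value(0.24, 10_000))  # effectively zero
print(pearson_p_value(0.24, 50))      # not significant at 0.05
```

The correlation itself is identical in both calls; only the sample size changes, which is exactly why a tiny p-value says nothing about practical importance.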

At the end of the day, r = 0.28 is still a “weak” correlation. In addition, it might be “weak” on average, but much stronger, and statistically and practically significant, for teachers in the bottom quintiles (e.g., teachers in the bottom 20%, as illustrated in the final figure above), who typically teach the highest-needs students. Accordingly, this might be due, at least in part, to bias.

In conclusion, one should always be wary of claims based on “weak” correlations, especially if they are positioned to be stronger than industry standards would classify them (e.g., in the case highlighted in the prior post). Even if a correlation is “statistically significant,” it is possible that the correlation is the result of bias, and that the relationship is so weak that it is not meaningful in practice, especially when the goal is to make high-stakes decisions about individual teachers. Accordingly, when you see correlations this small, keep these scatterplots in mind or generate some of your own (see, for example, here to dive deeper into what these correlations might mean and how significant these correlations might really be).

*Please contact Dr. Kapitula directly at kapitull@gvsu.edu if you want more information or to access the R code she used for the above.

Observational Systems: Correlations with Value-Added and Bias

A colleague recently sent me a report released in November of 2016 by the Institute of Education Sciences (IES) division of the U.S. Department of Education that should be of interest to blog followers. The study is about “The content, predictive power, and potential bias in five widely used teacher observation instruments” and is authored by affiliates of Mathematica Policy Research.

Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies, researchers examined five widely used teacher observation instruments. Instruments included the more generally popular Classroom Assessment Scoring System (CLASS) and Danielson Framework for Teaching (of general interest in this post), as well as the more subject-specific instruments including the Protocol for Language Arts Teaching Observations (PLATO), the Mathematical Quality of Instruction (MQI), and the UTeach Observational Protocol (UTOP) for science and mathematics teachers.

Researchers examined these instruments in terms of (1) what they measure (which is not of general interest in this post), but also (2) the relationships of observational output to teachers’ impacts on growth in student learning over time (as measured using a standard value-added model (VAM)), and (3) whether observational output is biased by the characteristics of the students non-randomly (or, in this study, randomly) assigned to teachers’ classrooms.

As per #2 above, researchers found that the instructional practices captured across these instruments modestly [emphasis added] correlate with teachers’ value-added scores, with an adjusted (and likely, artificially inflated; see Note 1 below) correlation coefficient between observational and value-added indicators of 0.13 ≤ r ≤ 0.28 (see also Table 4, p. 10). As per the higher, adjusted r (emphasis added; see also Note 1 below), they found that these instruments’ classroom management dimensions most strongly (r = 0.28) correlated with teachers’ value-added.

Related, and also at issue here, is that such correlations are not “modest,” but rather “weak” to “very weak” (see Note 2 below). While all correlation coefficients were statistically significant, this is much more likely due to the sample size used in this study than to the actual or practical magnitude of these results. “In sum,” this hardly supports the overall conclusion that “observation scores predict teachers’ value-added scores” (p. 11); although it should also be noted that this summary statement, in and of itself, suggests that the value-added score is the indicator around which all other “less objective” indicators are to revolve.

As per #3 above, researchers found that students randomly assigned to teachers’ classrooms (as per the MET data, although there were some noncompliance issues with the random assignment employed in the MET studies) do bias teachers’ observational scores, for better or worse, and more often in English language arts than in mathematics. More specifically, they found that for the Danielson Framework and CLASS (the two more generalized instruments examined in this study, also of main interest in this post), teachers with relatively more racial/ethnic minority and lower-achieving students (in that order, although these are correlated themselves) tended to receive lower observation scores. Bias was observed more often for the Danielson Framework than for the CLASS, but it was observed in both cases. An “alternative explanation [may be] that teachers are providing less-effective instruction to non-White or low-achieving students” (p. 14).

Notwithstanding, and in sum, in classrooms in which students were randomly assigned to teachers, teachers’ observational scores were biased by students’ group characteristics, which means that bias is likely even more prevalent in classrooms to which students are non-randomly assigned (which is common practice). These findings are also akin to those found elsewhere (see, for example, two similar studies here). That this was less evident in mathematics may be due to the random assignment present in this study; in other words, when non-random assignment of students into classrooms is the practice, a biasing influence may (likely) still exist in both English language arts and mathematics.

The long and short of it, though, is that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts (see also here), especially when considering these systems’ (in)formative purposes. But to suggest that because these observational indicators (artificially) correlate with teachers’ value-added scores at “weak” and “very weak” levels (see Notes 1 and 2 below), these observational systems might “add” more “value” to the summative sides of teacher evaluations (i.e., their predictive value) is premature, not to mention a bit absurd. Adding import to this statement is the fact that, as is duly noted in this study, these observational indicators are oft-to-sometimes biased against teachers who teach lower-achieving and racial minority students, even when random assignment is present, and such bias is likely worse when non-random assignment, which is very common, occurs.

Hence, and again, this does not make the case for the summative uses of really either of these indicators or instruments, especially when high-stakes consequences are to be attached to output from either indicator (or both indicators together given the “weak” to “very weak” relationships observed). On the plus side, though, remain the formative functions of the observational indicators.

*****

Note 1: Researchers used the “year-to-year variation in teachers’ value-added scores to produce an adjusted correlation [emphasis added] that may be interpreted as the correlation between teachers’ average observation dimension score and their underlying value added—the value added that is [not very] stable [or reliable] for a teacher over time, rather than a single-year measure (Kane & Staiger, 2012)” (p. 9). This practice, and the statistic derived from it, has not been externally vetted. Likewise, it also likely yields a correlation coefficient that is falsely inflated. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in which Kane is serving as one of the defendants’ expert witnesses, testifying in support of his/this practice in both cases.
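For illustration only: if the “adjusted correlation” works in the spirit of Spearman’s classic correction for attenuation (an assumption on my part; the report’s exact formula is not reproduced here, and the reliability value below is illustrative, not from the report), dividing an observed correlation by the square root of a measure’s year-to-year reliability inflates it noticeably:

```python
import math

def disattenuated_r(r_observed, reliability_x, reliability_y=1.0):
    """Spearman's correction for attenuation: the observed correlation
    divided by the square root of the measures' reliabilities. Offered
    only as one plausible reading of the report's 'adjusted correlation'."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# With an illustrative year-to-year reliability of 0.5 for value-added,
# an observed r of 0.20 "adjusts" upward to roughly 0.28
print(round(disattenuated_r(0.20, 0.5), 2))
```

The lower the assumed reliability of the value-added measure, the larger the upward adjustment, which is precisely why an unvetted adjustment of this kind can make “very weak” correlations look merely “weak.”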

Note 2: As is common with social science research when interpreting correlation coefficients: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.
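This scale can be expressed as a simple classification function (assigning the boundary values to the stronger category is my own choice, since the prose scale leaves the boundaries ambiguous):

```python
def correlation_strength(r):
    """Classify |r| using the conventional social science scale in Note 2."""
    r = abs(r)
    if r >= 0.8:
        return "very strong"
    if r >= 0.6:
        return "strong"
    if r >= 0.4:
        return "moderate"
    if r >= 0.2:
        return "weak"
    return "very weak"

print(correlation_strength(0.28))  # "weak"
print(correlation_strength(0.13))  # "very weak"
```

By this standard scale, both ends of the study’s adjusted range (0.13 to 0.28) fall well short of “modest.”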

*****

Citation: Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from https://ies.ed.gov/ncee/edlabs/regions/midatlantic/pdf/REL_2017191.pdf

New Article Published on Using Value-Added Data to Evaluate Teacher Education Programs

A former colleague, a current PhD student, and I just had an article released about using value-added data to (or rather not to) evaluate teacher education/preparation, higher education programs. The article is titled “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added,’” and the abstract of the article is included below.

If there is anyone out there who might be interested in this topic, please note that the journal in which this piece was published (online first and to be published in its paper version later) – Teaching Education – has made the article free for its first 50 visitors. Hence, I thought I’d share this with you all first.

If you’re interested, do access the full piece here.

Happy reading…and here’s the abstract:

In this study researchers examined the effectiveness of one of the largest teacher education programs located within the largest research-intensive universities within the US. They did this using a value-added model as per current federal educational policy imperatives to assess the measurable effects of teacher education programs on their teacher graduates’ students’ learning and achievement as compared to other teacher education programs. Correlational and group comparisons revealed little to no relationship between value-added scores and teacher education program regardless of subject area or position on the value-added scale. These findings are discussed within the context of several very important data and methodological challenges researchers also made transparent, as also likely common across many efforts to evaluate teacher education programs using value-added approaches. Such transparency and clarity might assist in the creation of more informed value-added practices (and more informed educational policies) surrounding teacher education accountability.

Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers from racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of their relatively lower evaluation scores.

The Regional Educational Laboratory (REL) authored and U.S. Department of Education (Institute of Education Sciences) sponsored study titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” can be found here, and a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

The authors’ key findings follow (as based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the other teachers in the same district to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise, it’s also possibly prejudiced.

While study authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations literally across the country must be made aware.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Ohio Rejects Subpar VAM, for Another VAM Arguably Less Subpar?

From a prior post coming from Ohio (see here), you may recall that Ohio state legislators recently introduced a bill to review its state’s value-added model (VAM), especially as it pertains to the state’s use of their VAM (i.e., the Education Value-Added Assessment System (EVAAS); see more information about the use of this model in Ohio here).

As per an article published last week in The Columbus Dispatch, the Ohio Department of Education (ODE) apparently rejected a proposal made by the state’s pro-charter school Ohio Coalition for Quality Education and the state’s largest online charter school, both of whom wanted to add (or replace) the state’s VAM with another, unnamed “Similar Students” measure used in California (which could be the Student Growth Percentiles model discussed prior on this blog, for example, here, here, and here).

The ODE charged that this measure “would lower expectations for students with different backgrounds, such as those in poverty.” This, however, is not a common criticism of this model (if I have the model correct), nor is it a common criticism of the model they already have in place. In fact, and again if I have the model correct, these are really the only two models that do not statistically control for potentially biasing factors (e.g., student demographic and other background factors) when calculating teachers’ value-added; hence, what they criticize in this model may in actuality be no different from what they are already doing. Accordingly, statements like that made by Chris Woolard, senior executive director of the ODE, are false: “At the end of the day, our system right now has high expectations for all students. This (California model) violates that basic principle that we want all students to be able to succeed.”

The models, again if I am correct, are very much the same. While the California measure might in fact consider “student demographics such as poverty, mobility, disability and limited-English learners,” this model (if I am correct about the model) does not statistically factor these variables out. If anything, the state’s EVAAS does, even though EVAAS modelers claim they do not, by statistically controlling for students’ prior performance, which (unfortunately) already has these demographics built into it. In essence, they are already doing the same thing they now protest.

Indeed, as per a statement made by Ron Adler, president of the Ohio Coalition for Quality Education, not only is it “disappointing that ODE spends so much time denying that poverty and mobility of students impedes their ability to generate academic performance…they [continue to] remain absolutely silent about the state’s broken report card and continually defend their value-added model that offers no transparency and creates wild swings for schools across Ohio” (i.e., the EVAAS, although in all fairness, all VAMs and SGPs yield the “wild swings” noted). See, for example, here.

What might be worse, though, is that the ODE apparently found that, depending on the variables used in the California model, it produced different results. Guess what! All VAMs, depending on the variables used, produce different results. In fact, using the same data and different VAMs for the same teachers at the same time also produces (in some cases grossly) different results. The bottom line here is that anyone who thinks that any VAM yields estimates from which valid or “true” statements can be made is fooling themselves.