Special Issue of “Educational Researcher” (Paper #8 of 9, Part II): A Modest Solution Offered by Linda Darling-Hammond

One of my prior posts was about the peer-reviewed journal Educational Researcher (ER)’s “Special Issue” on VAMs and the commentary titled “Can Value-Added Add Value to Teacher Evaluation?” contributed to the “Special Issue” by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

In this post, I noted that Darling-Hammond “added” a lot of “value” in one particular section of her commentary, in which she offered a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given that it is rare in this area of research to focus on actual solutions, and that this section is a must read, I paste this small section here for you all to read (and bookmark, especially if you are currently grappling with how to develop good evaluation systems that must meet external mandates requiring VAMs).

Here is Darling-Hammond’s “Modest Proposal” (p. 135-136):

What if, instead of insisting on the high-stakes use of a single approach to VAM as a significant percentage of teachers’ ratings, policymakers were to acknowledge the limitations that have been identified and allow educators to develop more thoughtful approaches to examining student learning in teacher evaluation? This might include sharing with practitioners honest information about imprecision and instability of the measures they receive, with instructions to use them cautiously, along with other evidence that can help paint a more complete picture of how students are learning in a teacher’s classroom. An appropriate warning might alert educators to the fact that VAM ratings based on state tests are more likely to be informative for students already at grade level, and least likely to display the gains of students who are above or below grade level in their knowledge and skills. For these students, other measures will be needed.

What if teachers could create a collection of evidence about their students’ learning that is appropriate for the curriculum and students being taught and targeted to goals the teacher is pursuing for improvement? In a given year, one teacher’s evidence set might include gains on the vertically scaled Developmental Reading Assessment she administers to students, plus gains on the English language proficiency test for new English learners, and rubric scores on the beginning and end of the year essays her grade level team assigns and collectively scores.

Another teacher’s evidence set might include the results of the AP test in Calculus with a pretest on key concepts in the course, plus pre- and posttests on a unit regarding the theory of limits which he aimed to improve this year, plus evidence from students’ mathematics projects using trigonometry to estimate the distance of a major landmark from their home. VAM ratings from a state test might be included when appropriate, but they would not stand alone as though they offered incontrovertible evidence about teacher effectiveness.

Evaluation ratings would combine the evidence from multiple sources in a judgment model, as Massachusetts’ plan does, using a matrix to combine and evaluate several pieces of student learning data, and then integrate that rating with those from observations and professional contributions. Teachers receive low or high ratings when multiple indicators point in the same direction. Rather than merely tallying up disparate percentages and urging administrators to align their observations with inscrutable VAM scores, this approach would identify teachers who warrant intervention while enabling pedagogical discussions among teachers and evaluators based on evidence that connects what teachers do with how their students learn. A number of studies suggest that teachers become more effective as they receive feedback from standards-based observations and as they develop ways to evaluate their students’ learning in relation to their practice (Darling-Hammond, 2013).

If the objective is not just to rank teachers and slice off those at the bottom, irrespective of accuracy, but instead to support improvement while providing evidence needed for action, this modest proposal suggests we might make more headway by allowing educators to design systems that truly add value to their knowledge of how students are learning in relation to how teachers are teaching.

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here; and see the Review of Article #8, Part I – on a more research-based assessment of VAMs’ potentials here.

Article #8, Part II Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these conditions is actually true (or near to true, nor will they likely ever be true) in educational practice, however. Hence the errors we continue to observe, errors that continue to prevent VAM use for their intended purposes, even with the sophisticated statistics meant to mitigate error and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs as highlighted by Darling-Hammond, per each of the three categories above, pertain to (1) the tests used to measure value-added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level, given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes” and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include: VAM output are grossly unstable given the swings and variations observed in teacher classifications across time, and VAM output are “notoriously imprecise” (p. 133) given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data, noting that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Related, “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given that it is rare in this area of research to focus on actual solutions, this section is a must read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (pp. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Teachers’ “Similar” Value-Added Estimates Yield “Different” Meanings Across “Different” Contexts

Some, particularly educational practitioners, might respond with a sense of “duh”-like sarcasm to the title of this post above, but as per a new study recently released in the highly reputable, peer-reviewed American Educational Research Journal (AERJ), researchers evidenced this very headline via an extensive study they conducted in the northeast United States. Hence, this title has now been substantiated with empirical evidence.

Researchers David Blazar (Doctoral Candidate at Harvard), Erica Litke (Assistant Professor at University of Delaware), and Johanna Barmore (Doctoral Candidate at Harvard) examined (1) the comparability of teachers’ value-added estimates within and across four urban districts and (2), given the extent of the variation observed, how and whether said value-added estimates consistently captured differences in teachers’ observed, videotaped, and scored classroom practices.

Regarding their first point of investigation, they found that teachers were categorized differently when compared within versus across districts (i.e., when compared to other similar teachers within districts versus across districts, which is a methodological choice that value-added modelers often make). The researchers did not assert that either approach yielded more valid interpretations, however. Rather, they evidenced that the differences they observed within and across districts were notable, and that these differences had notable implications for validity, whereby a teacher classified as adding X value in one context could be categorized as adding Y value in another, given the context in which (s)he was teaching. In other words, the validity of the inferences to be drawn about potentially any teacher depended greatly on the context in which the teacher taught, in that his/her value-added estimate did not necessarily generalize across contexts. Put in their words, “it is not clear whether the signal of teachers’ effectiveness sent by their value-added rankings retains a substantive interpretation across contexts” (p. 326). Inversely put, “it is clear that labels such as highly effective or ineffective based on value-added scores do not have fixed meaning” (p. 351).
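To make the methodological choice at issue concrete, here is a minimal, entirely hypothetical sketch (the districts, distributions, and teacher score below are invented for illustration, not taken from Blazar et al.’s data): the same value-added estimate can land at a noticeably different percentile depending on whether the comparison group is the teacher’s own district or a pool of districts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical value-added estimates for two districts whose score
# distributions differ (different means and spreads).
district_a = rng.normal(loc=0.10, scale=0.20, size=200)
district_b = rng.normal(loc=-0.05, scale=0.25, size=200)

def percentile_rank(score, comparison_group):
    """Share of the comparison group scoring at or below `score`."""
    return np.mean(comparison_group <= score)

teacher_score = 0.05  # one hypothetical teacher's estimate in district A

within = percentile_rank(teacher_score, district_a)
across = percentile_rank(teacher_score, np.concatenate([district_a, district_b]))

print(f"Percentile within district A:     {within:.0%}")
print(f"Percentile across both districts: {across:.0%}")
# The same estimate earns a lower percentile within its own district than
# in the pooled comparison, so the "effectiveness" label attached to it
# depends on the comparison group, not just on the teacher.
```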

Regarding their second point of investigation, they found “stark differences in instructional practices across districts among teachers who received similar within-district value-added rankings” (p. 324). In other words, “when comparing [similar] teachers within districts, value-added rankings signaled differences in instructional quality in some but not all instances” (p. 351), in that similarly ranked teachers did not necessarily display similarly effective or ineffective instructional practices. This has also been more loosely evidenced via those who have investigated the correlations between teachers’ value-added and observational scores, and have found weak to moderate correlations (see prior posts on this here, here, here, and here). In the simplest of terms, “value-added categorizations did not signal common sets of instructional practices across districts” (p. 352).

The bottom line here, then, is that those in charge of making consequential decisions about teachers, as based even in part on teachers’ value-added estimates, need to be cautious when making particularly high-stakes decisions based on said estimates. A teacher, as based on the evidence presented in this particular study, could logically but also legally argue that had (s)he been teaching in a different district, even within the same state and using the same assessment instruments, (s)he could have received a substantively different value-added score given the teacher(s) to whom (s)he was compared when his/her value-added was estimated elsewhere. Hence, the validity of the inferences and statements asserting that a teacher was or was not effective as based on his/her value-added estimates is suspect, again, given the contexts in which teachers teach as well as the comparison teachers against whom teacher-level value-added is estimated. “Here, the instructional quality of the lowest ranked teachers was not particularly weak and in fact was as strong as the instructional quality of the highest ranked teachers in other districts” (p. 353).

This has serious implications, not only for practice but also for the lawsuits ongoing across the nation, especially in terms of those pertaining to teachers’ wrongful terminations, as charged.

Citation: Blazar, D., Litke, E., & Barmore, J. (2016). What does it mean to be ranked a ‘‘high’’ or ‘‘low’’ value-added teacher? Observing differences in instructional quality across districts. American Educational Research Journal, 53(2), 324–359.  doi:10.3102/0002831216630407

The Marzano Observational Framework’s Correlations with Value-Added

“Good news for schools using the Marzano framework,”…according to Marzano researchers. “One of the largest validation studies ever conducted on [the] observation framework shows that the Marzano model’s research-based structure is correlated with state VAMs.” See this claim, via the full report here.

The more specific claim is as follows: Study researchers found a “strong [emphasis added, see discussion forthcoming] correlation between Dr. Marzano’s nine Design Questions [within the model’s Domain 1] and increased student achievement on state math and reading scores.” All correlations were positive, with the highest correlation just below r = 0.4 and the lowest correlation just above r = 0.0. See the actual correlations illustrated here:

[Figure: correlations between the nine Design Questions (Domain 1) and value-added scores on state math and reading tests]

See also a standard model to categorize such correlations, albeit out of any particular context, below. Using this, one can see that the correlations observed were indeed small to moderate, but not “strong” as claimed. Elsewhere, as also cited in this report, other observed correlations from similar studies on the same model ranged from r = 0.13 to 0.15, r = 0.14 to 0.21, and r = 0.21 to 0.26. While these are also noted as statistically significant, using the table below one can determine that statistical significance does not necessarily mean that such “very weak” to “weak” correlations are of much practical significance, especially if and when high-stakes decisions about teachers and their effects are to be attached to such evidence.

[Figure: a standard scale for categorizing the strength of correlation coefficients, from “very weak” to “strong”]
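To see why “statistically significant” correlations in this range can still carry little practical significance, consider a small sketch (the sample size below is hypothetical, not taken from the Marzano report): with enough teachers in the sample, even a very weak correlation clears conventional significance thresholds, yet the shared variance (r squared) between observation scores and value-added scores remains tiny.

```python
from math import sqrt
from scipy.stats import t

def practical_vs_statistical(r, n):
    """For a Pearson correlation r computed on n pairs, return the shared
    variance (r^2) and the two-sided p-value from the usual t-test of
    H0: the true correlation is zero."""
    df = n - 2
    t_stat = r * sqrt(df / (1 - r ** 2))
    return r ** 2, 2 * t.sf(abs(t_stat), df)

# Correlations in the range reported in these studies, evaluated at an
# illustrative (hypothetical) sample of 1,000 teachers.
for r in (0.15, 0.26, 0.40):
    r2, p = practical_vs_statistical(r, n=1000)
    print(f"r = {r:.2f}: shared variance = {r2:.1%}, p = {p:.1e}")
# Even r = 0.15 is highly "significant" at this sample size, yet explains
# only about 2% of the variance in the value-added scores.
```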

Likewise, if such results (i.e., 0.0 < r < 0.4) sound familiar, they should, as a good number of researchers have set out to explore similar correlations in the past, using different value-added and observational data, and these researchers have also found similar zero-to-moderate (i.e., 0.0 < r < 0.4), but not (and dare I say never) “strong” correlations. See prior posts about such studies, for example, here, here, and here. See also the authors’ Endnote #1 in their report, again, here.

As the authors write: “When evaluating the validity of observation protocols, studies [sic] typically assess the correlations between teacher observation scores and their value-added scores.” This is true, but only in that such correlations offer just one piece of validity evidence.

Validity, or rather evidencing that something from which inferences are drawn is in fact valid, is MUCH more complicated than simply running these types of correlations. Rather, the type of evidence that these authors are exploring is called convergent-related evidence of validity; however, for something to actually be deemed valid, MUCH more validity evidence is needed (e.g., content-, consequence-, and predictive-related evidence of validity, among others). See, for example, some of the Educational Testing Service (ETS)’s Michael T. Kane’s work on validity here. See also The Standards for Educational and Psychological Testing developed by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) here.

Instead, in this report the authors write that “Small to moderate correlations permit researchers to claim that the framework is validated (Kane, Taylor, Tyler, & Wooten, 2010).” This is false. As well, this unfortunately demonstrates a very naive conception and unsophisticated treatment of validity. This is also illustrated in that the authors use one external citation, authored by Thomas Kane NOT the aforementioned validity expert Michael Kane, to allege that validity can be (and is being) claimed. Here is the actual Thomas Kane et al. article the Marzano authors reference to support their validity claim, also noting that nowhere in this piece do Thomas Kane et al. make this actual claim. In fact, a search for “small” or “moderate” correlations yields zero total hits.

In the end, what can be more fairly and appropriately asserted from this research report is that the Marzano model is indeed correlated with value-added estimates, and their correlation coefficients fall right in line with all other correlation coefficients evidenced via other current studies on this topic, again, whereby researchers have correlated multiple observational models with multiple value-added estimates. These correlations are small to moderate, and certainly not “strong,” and definitely not “strong” enough to warrant high-stakes decisions (e.g., teacher termination) given everything (i.e., the unexplained variance) that is still not captured among these multiple measures…and that still threatens the validity of the inferences to be drawn from these measures combined.

Contradictory Data Complicate Definitions of, and Attempts to Capture, “Teacher Effects”

A researcher from Brown University — Matthew Kraft — and his student recently released a “working paper” in which I think you all will (and should) be interested. The study is about “…Teacher Effects on Complex Cognitive Skills and Social-Emotional Competencies” — those effects that are beyond “just” test scores (see also a related article on this working piece released in The Seventy Four). This one is 64 pages long, but here are the (condensed) highlights as I see them.

The researchers use data from the Bill & Melinda Gates Foundations’ Measures of Effective Teaching (MET) Project “to estimate teacher effects on students’ performance on cognitively demanding open-ended tasks in math and reading, as well as their growth mindset, grit, and effort in class.” They find “substantial variation in teacher effects on complex task performance and social-emotional measures. [They] also find weak relationships between teacher effects on state standardized tests, complex tasks, and social-emotional competencies” (p. 1).

More specifically, researchers found that: (1) “teachers who are most effective at raising student performance on standardized tests are not consistently the same teachers who develop students’ complex cognitive abilities and social-emotional competencies” (p. 7); (2) “While teachers who add the most value to students’ performance on state tests in math do also appear to strengthen their analytic and problem-solving skills, teacher effects on state [English/language arts] tests are only moderately correlated with open-ended tests in reading” (p. 7); and (3) “[T]eacher effects on social-emotional measures are only weakly correlated with effects on state achievement tests and more cognitively demanding open-ended tasks” (p. 7).

The ultimate finding, then, is that “teacher effectiveness differs across specific abilities” and definitions of what it means to be an effective teacher (p. 7). Likewise, the authors concluded that really all current teacher evaluation systems, even those included within the MET studies (which are/were among the best, given the multiple sources of data MET researchers included, e.g., traditional tests, indicators capturing complex cognitive skills and social-emotional competencies, observations, and student surveys), are not mapping onto similar definitions of teacher quality or effectiveness as “we” have theorized and continue to theorize them.

Hence, attaching high-stakes consequences to these data, especially when multiple data sources yield contradictory findings depending on how one defines effective teaching or its most important components (e.g., test scores v. affective, socio-emotional effects), is (still) not yet warranted, whereby an effective teacher here might not be an effective teacher there, even if defined similarly in like schools, districts, or states. As per Kraft, “while high-stakes decisions may be an important part of teacher evaluation systems, we need to decide on what we value most when making these decisions rather than just using what we measure by default because it is easier.” Ignoring such complexities will not make standard, uniform, or defensible some of the high-stakes decisions that some states still want to attach to data derived via multiple measures and sources, given that data defined differently, even within standard definitions of “effective teaching,” continue to contradict one another.

Accordingly, “[q]uestions remain about whether those teachers and schools that are judged as effective by state standardized tests [and the other measures] are also developing the skills necessary to succeed in the 21st century economy” (p. 36). Likewise, teachers defined as high value-added teachers using these common indicators are not necessarily high value-added teachers, given the same teachers may be defined as low value-added teachers when different aspects of “effective teaching” are also examined.

Ultimately, this further complicates “our” current definitions of effective teaching, especially when those definitions are constructed in policy arenas oft-removed from the realities of America’s public schools.

Reference: Kraft, M. A., & Grace, S. (2016). Teaching for tomorrow’s economy? Teacher effects on complex cognitive skills and social-emotional competencies. Providence, RI: Brown University. Working paper.

*Note: This study was ultimately published with the following citation: Blazar, D., & Kraft, M. A. (2017). Teacher and teaching effects on students’ attitudes and behaviors. Educational Evaluation and Policy Analysis, 39(1), 146 –170. doi:10.3102/0162373716670260. Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0162373716670260

Everything is Bigger (and Badder) in Texas: Houston’s Teacher Value-Added System

Last November, I published a post about “Houston’s “Split” Decision to Give Superintendent Grier $98,600 in Bonuses, Pre-Resignation.” Thereafter, I engaged some of my former doctoral students to further explore some data from Houston Independent School District (HISD), and what we collectively found and wrote up was just published in the highly-esteemed Teachers College Record journal (Amrein-Beardsley, Collins, Holloway-Libell, & Paufler, 2016). To view the full commentary, please click here.

In this commentary we discuss HISD’s highest-stakes use of its Education Value-Added Assessment System (EVAAS) data – the value-added system HISD pays for at an approximate rate of $500,000 per year. This district has used its EVAAS data for more consequential purposes (e.g., teacher merit pay and termination) than any other state or district in the nation; hence, HISD is well known for its “big use” of “big data” to reform and inform improved student learning and achievement throughout the district.

We note in this commentary, however, that as per the evidence, and more specifically the recent release of Texas’s large-scale standardized test scores, perhaps attaching such high-stakes consequences to teachers’ EVAAS output in Houston is not working as district leaders have, now for years, intended. See, for example, the recent test-based evidence comparing the state of Texas v. HISD, illustrated below.

[Figure 1: test-based evidence comparing the state of Texas v. HISD]

Perhaps the district’s EVAAS system is not as much of an “educational-improvement and performance-management model that engages all employees in creating a culture of excellence” as the district suggests (HISD, n.d.a). Perhaps, as well, we should “ponder the specific model used by HISD—the aforementioned EVAAS—and [EVAAS modelers’] perpetual claims that this model helps teachers become more “proactive [while] making sound instructional choices;” helps teachers use “resources more strategically to ensure that every student has the chance to succeed;” or “provides valuable diagnostic information about [teachers’ instructional] practices” so as to ultimately improve student learning and achievement” (SAS Institute Inc., n.d.).

The bottom line, though, is that “Even the simplest evidence presented above should at the very least make us question this particular value-added system, as paid for, supported, and applied in Houston for some of the biggest and baddest teacher-level consequences in town.” See, again, the full text and another, similar graph in the commentary, linked here.

*****

References:

Amrein-Beardsley, A., Collins, C., Holloway-Libell, J., & Paufler, N. A. (2016). Everything is bigger (and badder) in Texas: Houston’s teacher value-added system. [Commentary]. Teachers College Record. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=18983

Houston Independent School District (HISD). (n.d.a). ASPIRE: Accelerating Student Progress Increasing Results & Expectations: Welcome to the ASPIRE Portal. Retrieved from http://portal.battelleforkids.org/Aspire/home.html

SAS Institute Inc. (n.d.). SAS® EVAAS® for K–12: Assess and predict student performance with precision and reliability. Retrieved from www.sas.com/govedu/edu/k12/evaas/index.html

Chetty et al. v. Rothstein on VAM-Based Bias, Again

Recall the Chetty, Friedman, and Rockoff studies at the focus of many posts on this blog in the past (see, for example, here, here, and here)? These studies were cited in President Obama’s 2012 State of the Union address. Since then, they have been cited by every VAM proponent as the key set of studies to which others should defer, especially when advancing, or defending in court, the large- and small-scale educational policies bent on VAM-based accountability for educational reform.

In a newly released, not-yet-peer-reviewed National Bureau of Economic Research (NBER) working paper, Chetty, Friedman, and Rockoff attempt to assess how “Using Lagged Outcomes to Evaluate Bias in Value-Added Models [VAMs]” might better address the amount of bias in VAM-based estimates due to the non-random assignment of students to teachers (a.k.a. sorting). Accordingly, Chetty et al. argue that the famous “Rothstein” falsification test (a.k.a. the Jesse Rothstein — Associate Professor of Economics at University of California – Berkeley — falsification test) that is oft-referenced/used to test for the presence of bias in VAM-based estimates might not be the most effective approach. This is the second time this set of researchers has argued with Rothstein about the merits of his falsification test (see prior posts about these debates here and here).

In short, at question is the extent to which teacher-level VAM-based estimates might be influenced by the groups of students a teacher is assigned to teach. If so influenced, the value-added estimates are said to be biased, or markedly different from the actual parameter of interest the VAM is supposed to estimate, ideally, in an unbiased way. If bias is found, the VAM-based estimates should not be used in personnel evaluations, especially those associated with high-stakes consequences (e.g., merit pay, teacher termination). Hence, in order to test for the presence of bias, Rothstein demonstrated that he could “predict” students’ past outcomes with current teacher value-added estimates, which should be impossible (i.e., a teacher cannot cause outcomes that occurred before students ever entered his/her classroom). One would expect that past outcomes should not be related to current teacher effectiveness, so if the Rothstein falsification test proves otherwise, it indicates the presence of bias. Rothstein also demonstrated that this was (and is still) the case with all conventional VAMs.
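To illustrate the intuition behind the falsification test, here is a minimal simulation sketch (hypothetical, simulated data and a deliberately naive value-added calculation; this is not Rothstein’s or Chetty et al.’s actual specification): when students are sorted to teachers on prior achievement, a teacher’s estimate ends up “predicting” the class’s prior-year scores, something a genuine teacher effect could not do.

```python
import numpy as np

rng = np.random.default_rng(42)
n_teachers, class_size = 100, 25
n_students = n_teachers * class_size

# Simulated students: a persistent ability component drives both the
# prior-year score and the current-year score.
ability = rng.normal(0, 1, n_students)
prior_score = ability + rng.normal(0, 0.5, n_students)
teacher_effect = rng.normal(0, 0.2, n_teachers)  # true teacher effects

# Non-random assignment: students are tracked to teachers by prior score.
order = np.argsort(prior_score)
teacher_of = np.empty(n_students, dtype=int)
teacher_of[order] = np.repeat(np.arange(n_teachers), class_size)

current_score = ability + teacher_effect[teacher_of] + rng.normal(0, 0.5, n_students)

# A deliberately naive "value-added" estimate: the class mean of current scores.
naive_va = np.array([current_score[teacher_of == j].mean() for j in range(n_teachers)])

# Falsification check: does the estimate "predict" what the teacher could
# not have caused, i.e., the class's PRIOR-year scores?
class_prior = np.array([prior_score[teacher_of == j].mean() for j in range(n_teachers)])
print("corr(naive value-added, class prior mean):",
      round(float(np.corrcoef(naive_va, class_prior)[0, 1]), 2))
# A correlation near 1 here signals that sorting, not teaching, is doing
# most of the work in the estimates.
```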

In their new study, however, Chetty et al. demonstrate that there might be another explanation for why Rothstein’s falsification test would reveal bias, even if there is no bias in the VAM estimates themselves and the bias is not caused by student sorting. Rather, the bias might result from different sources, given the presence of what they term dynamic sorting (i.e., common trends across grades and years, known as correlated shocks). Likewise, they argue, the small sample sizes available for any one teacher, normally calculated as the number of students in a teacher’s class or on a teacher’s roster, also cause such bias. This problem cannot be solved even with large-scale data, since the number of students per teacher remains the same, independent of the total number of students in any data set.

Chetty et al., then, using simulated data (i.e., generated with predetermined characteristics of teachers and students), demonstrate that even in the absence of bias, when dynamic sorting is not accounted for in a VAM, teacher-level VAM estimates will be correlated with lagged student outcomes and will therefore still “reveal” said bias. However, they argue that the correlations observed will be due to noise rather than, again, the non-random sorting of students as claimed by Rothstein.

So, the bottom line is that bias exists; it just depends on whose side one falls as to where one claims the bias comes from.

Accordingly, Chetty et al. offer two potential solutions: (1) “We” develop VAMs that might account for dynamic sorting and be, thus, more robust to misspecification, or (2) “We” use experimental or quasi-experimental data to estimate the magnitude of such bias. This all, of course, assumes we should continue with our use of VAMs for said purposes, but given the academic histories of these authors, this is of no surprise.

Chetty et al. ultimately conclude that more research is needed on this matter, and that researchers should focus future studies on quantifying the bias that appears within and across any VAM, thus providing a potential threshold for an acceptable magnitude of bias, versus trying to prove its existence or lack thereof.

*****

Thanks to ASU Assistant Professor of Education Economics, Margarita Pivovarova, for her review of this study.

Report on the Stability of Student Growth Percentile (SGP) “Value-Added” Estimates

The Student Growth Percentiles (SGP) model, which is loosely defined by value-added model (VAM) purists as a VAM, uses students’ level(s) of past performance to determine students’ normative growth over time, as compared to their peers. “SGPs describe the relative location of a student’s current score compared to the current scores of students with similar score histories” (Castellano & Ho, 2013, p. 89). Students are compared to themselves (i.e., students serve as their own controls) over time; therefore, the need to control for other variables (e.g., student demographics) is less necessary, although this is a matter of debate. Nonetheless, the SGP model was developed as a “better” alternative to existing models, with the goal of providing clearer, more accessible, and more understandable results to both internal and external education stakeholders and consumers. For more information about the SGP please see prior posts here and here. See also an original source about the SGP here.
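As a rough conceptual illustration only (the operational SGP model fits quantile regressions to students’ full score histories; the banding approach and the numbers below are hypothetical simplifications), a student’s growth percentile can be thought of as the percentile rank of his or her current score among students with similar prior scores:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical prior- and current-year scale scores for a large cohort.
prior = rng.normal(500, 50, 5000)
current = 0.8 * prior + rng.normal(100, 30, 5000)

def crude_sgp(student_prior, student_current, prior, current, band=10):
    """Percentile rank of the student's current score among peers whose
    prior scores fall within +/- `band` points of the student's prior score.
    (A conceptual stand-in for the SGP model's quantile regressions.)"""
    peers = np.abs(prior - student_prior) <= band
    return np.mean(current[peers] <= student_current)

# A student who scored 480 last year and 495 this year grew more than
# roughly this share of students who started from a similar place.
print(f"Approximate growth percentile: {crude_sgp(480, 495, prior, current):.0%}")
```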

Related, in a study released last week, WestEd researchers conducted an “Analysis of the stability of teacher-level growth scores [derived] from the student growth percentile [SGP] model” in one, large school district in Nevada (n=370 teachers). The key finding they present is that “half or more of the variance in teacher scores from the [SGP] model is due to random or otherwise unstable sources rather than to reliable information that could predict future performance. Even when derived by averaging several years of teacher scores, effectiveness estimates are unlikely to provide a level of reliability desired in scores used for high-stakes decisions, such as tenure or dismissal. Thus, states may want to be cautious in using student growth percentile [SGP] scores for teacher evaluation.”

Most importantly, the evidence in this study should make us (continue to) question the extent to which “the learning of a teacher’s students in one year will [consistently] predict the learning of the teacher’s future students.” This is counter to the claims continuously made by VAM proponents, including folks like Thomas Kane — economics professor from Harvard University who directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation. While faint signals of what we call predictive validity might be observed across VAMs, what folks like Kane overlook or avoid is that very often these faint signals do not remain constant over time. Accordingly, the extent to which we can make stable predictions is limited.

Worse is when folks falsely assume that said predictions will remain constant over time, and they make high-stakes decisions about teachers unaware of the lack of stability present in, typically, 25-59% of teachers’ value-added (or in this case SGP) scores (estimates vary by study and by analyses using one to three years of data — see, for example, the studies detailed in Appendix A of this report; see also other research on this topic here, here, and here). Nonetheless, researchers in this study found that in mathematics, 50% of the variance in teachers’ value-added scores was attributable to differences among teachers, and the other 50% was random or unstable. In reading, 41% of the variance in teachers’ value-added scores was attributable to differences among teachers, and the other 59% was random or unstable.
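One simple way to see where a “stable versus unstable” split like this comes from is the year-to-year correlation of the same teachers’ scores: under simple assumptions, that correlation approximates the share of variance in any one year’s scores that reflects persistent differences among teachers rather than noise. A hypothetical sketch (numbers invented, not the report’s data):

```python
import numpy as np

rng = np.random.default_rng(7)
n_teachers = 370  # same order of magnitude as the district studied

# Hypothetical teacher scores: a persistent component plus year-specific
# noise, calibrated so roughly half of each year's variance is persistent.
persistent = rng.normal(0, 1, n_teachers)
year1 = persistent + rng.normal(0, 1, n_teachers)
year2 = persistent + rng.normal(0, 1, n_teachers)

r = np.corrcoef(year1, year2)[0, 1]
print(f"Year-to-year correlation (≈ stable share of variance): {r:.2f}")
# With equal variance in the persistent and noise components, the expected
# correlation is 0.50: about half signal, half noise, as in the math results.
```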

In addition, using a 95% confidence interval (which is very common in educational statistics), researchers found that in mathematics, the interval around a teacher’s score would span 48 points, “a margin of error that covers nearly half the 100 point score scale,” whereby “one would be 95 percent confident that the true math score of a teacher who received a score of 50 [would actually fall] between 26 and 74.” For reading, the interval would span 44 points, whereby one would be 95 percent confident that the true reading score of a teacher who received a score of 50 would actually fall between 28 and 72. The stability of these scores would increase with three years of data, which has also been found by other researchers on this topic. However, they too have found that such error rates persist to an extent that still prohibits high-stakes decision making.
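The reported margins of error follow directly from the usual 95% confidence interval arithmetic; working backward from the reported spans gives a sense of the standard errors involved (a back-of-the-envelope calculation of mine, not figures taken from the report):

```python
# A 95% confidence interval is roughly the estimate +/- 1.96 * SE,
# so the full span of the interval is 2 * 1.96 * SE.
for subject, span in [("math", 48), ("reading", 44)]:
    se = span / (2 * 1.96)
    low, high = 50 - span / 2, 50 + span / 2
    print(f"{subject}: implied SE ≈ {se:.1f} points on the 100-point scale; "
          f"a teacher scored at 50 has a 95% CI of roughly [{low:.0f}, {high:.0f}]")
```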

In more practical terms, what this also means is that a teacher who might be considered highly ineffective might be terminated, even though the following year (s)he could have been observed to be highly effective. Inversely, teachers who are awarded tenure might be observed as ineffective one, two, and/or three years following, not because their true level(s) of effectiveness changed, but because of the error in the estimates that causes such instabilities to occur. Hence, examinations of the stability of such estimates over time provide essential evidence of the validity, and in this case predictive validity, of the interpretations and uses of such scores over time. This is particularly pertinent when high-stakes decisions are to be based on (or in large part on) such scores, especially given some researchers are calling for reliability coefficients of .85 or higher to make such decisions (Haertel, 2013; Wasserman & Bracken, 2003).

In the end, the researchers’ overall conclusion is that SGP-derived “growth scores alone may not be sufficiently stable to support high-stakes decisions.” Likewise, relying on the extant research on this topic, the overall conclusion can be broadened in that neither SGP- nor VAM-based growth scores may be sufficiently stable to support high-stakes decisions. In other words, it is not just the SGP model that is yielding such issues with stability (or a lack thereof). Again, see the other literature in which the researchers situated their findings in Appendix A. See also other similar studies here, here, and here.

Accordingly, those who read this report, and consequently seek a better model that yields more stable estimates, will unfortunately, in all likelihood, fail in their search.

References:

Castellano, K. E., & Ho, A. D. (2013). A practitioner’s guide to growth models. Washington, DC: Council of Chief State School Officers.

Haertel, E. H. (2013). Reliability and validity of inferences about teachers based on student test scores (14th William H. Angoff Memorial Lecture). Princeton, NJ: Educational Testing Service (ETS).

Lash, A., Makkonen, R., Tran, L., & Huang, M. (2016). Analysis of the stability of teacher-level growth scores [derived] from the student growth percentile [SGP] model. (16–104). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory West.

Wasserman, J. D., & Bracken, B. A. (2003). Psychometric characteristics of assessment procedures. In I. B. Weiner, J. R. Graham, & J. A. Naglieri (Eds.), Handbook of psychology: Assessment psychology (pp. 43–66). Hoboken, NJ: John Wiley & Sons.

Special Issue of “Educational Researcher” (Paper #5 of 9): Teachers’ Perceptions of Observations and Student Growth

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#5 of 9) here, titled “Teacher Perspectives on Evaluation Reform: Chicago’s REACH [Recognizing Educators Advancing Chicago Students] Students.” This one is authored by Jennie Jiang, Susan Sporte, and Stuart Luppescu, all of whom are associated with The University of Chicago’s Consortium on Chicago School Research, and all of whom conducted survey- and interview-based research on teachers’ perceptions of the Chicago Public Schools (CPS) teacher evaluation system, twice since it was implemented in 2012–2013. They did this across CPS’s almost 600 schools and its more than 12,000 teachers, with high stakes recently attached to teacher evaluations (e.g., professional development plans, remediation, tenure attainment, teacher dismissal/contract non-renewal; p. 108).

Directly related to the Review of Article #4 prior (i.e., #4 of 9 on observational systems’ potentials here), these researchers found that Chicago teachers are, in general, positive about the evaluation system, primarily given the system’s observational component (i.e., the Charlotte Danielson Framework for Teaching, used twice per year for tenured teachers and counting for 75% of teachers’ evaluation scores), and not given the inclusion of student growth in this evaluation system (which counts for the other 25%). Researchers also found, however, that overall satisfaction with the REACH system at large is declining at a statistically significant rate over time, perhaps as teachers get to know the system better.

This system, like the strong majority of others across the nation, is based on only these two components, although the growth measure includes a combination of two different metrics (i.e., value-added scores and growth on “performance tasks,” as per the grades and subject areas taught). See more information about how these measures are broken down by teacher type in Table 1 (p. 107), and see also page 107 for the different types of measures used (e.g., the Northwest Evaluation Association’s Measures of Academic Progress assessment (NWEA-MAP), a Web-based, computer-adaptive, multiple-choice assessment that is used to measure value-added scores for teachers in grades 3-8).

As for the student growth component, more specifically, when researchers asked teachers “if their evaluation relies too heavily on student growth, 65% of teachers agreed or strongly agreed” (p. 112); “Fifty percent of teachers disagreed or strongly disagreed that NWEA-MAP [and other off-the-shelf tests used to measure growth in CPS offered] a fair assessment of their student’s learning” (p. 112); “teachers expressed concerns about the narrow representation of student learning that is measured by standardized tests and the increase in the already heavy testing burden on teachers and students” (p. 112); and “Several teachers also expressed concerns that measures of student growth were unfair to teachers in more challenging schools [i.e., bias], because student growth is related to the supports that students may or may not receive outside of the classroom” (p. 112). One teacher explained this concern [writing]: “I think the part that I find unfair is that so much of what goes on in these kids’ lives is affecting their academics, and those are things that a teacher cannot possibly control” (p. 112).

As for the performance tasks meant to complement (or serve as) the student growth or VAM measure, teachers were discouraged by this component being so subjective and susceptible to distortion, because teachers “score their own students’ performance tasks at both the beginning and end of the year. Teachers noted that if they wanted to maximize their student growth score, they could simply give all students a low score on the beginning-of-year task and a higher score at the end of the year” (p. 113).

As for the observational component, however, researchers found that “almost 90% of teachers agreed that the feedback they were provided in post-observation conferences” (p. 111) was of the highest value; the observational processes, and more importantly the post-observational processes, made them and their supervisors more accountable for their effectiveness and, more importantly, their improvement. While in the conclusions section of this article the authors stretch this finding out a bit, writing that “Overall, this study finds that there is promise in teacher evaluation reform in Chicago” (p. 114), as primarily based on their findings about “the new observation process” (p. 114) being used in CPS, recall from the Review of Article #4 prior (i.e., #4 of 9 on observational systems’ potentials here) that these observational systems are not “new and improved.” Rather, these are the same observational systems that, given their levels of subjectivity featured and highlighted in reports like “The Widget Effect” (here), brought us to our now (over)reliance on VAMs.

Researchers also found that teachers were generally confused about the REACH system, and what actually “counted,” and for how much, in their evaluations. The most confusion surrounded the student growth or value-added component, as (based on prior research) would be expected. Beginning teachers reported more clarity than did relatively more experienced teachers, high school teachers, and teachers of special education students, and all of this was related to the extent to which a measure of student growth directly impacted teachers’ evaluations. Teachers receiving school-wide value-added scores were also relatively more critical.

Lastly, researchers found that in 2014, “79% of teachers reported that the evaluation process had increased their levels of stress and anxiety, and almost 60% of teachers agreed or strongly agreed the evaluation process takes more effort than the results are worth.” Again, beginning teachers were “consistently more positive on all…measures than veteran teachers; elementary teachers were consistently more positive than high school teachers, special education teachers were significantly more negative about student growth than general teachers,” and the like (p. 113). And all of this was positively and significantly related to teachers’ perceptions of their school’s leadership, perceptions of the professional communities at their schools, and teachers’ perceptions of evaluation writ large.

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; and see the Review of Article #4 – on observational systems’ potentials here.

Article #5 Reference: Jiang, J. Y., Sporte, S. E., & Luppescu, S. (2015). Teacher perspectives on evaluation reform: Chicago’s REACH students. Educational Researcher, 44(2), 105-116. doi:10.3102/0013189X15575517

Special Issue of “Educational Researcher” (Paper #4 of 9): Make Room VAMs for Observations

Recall that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of nine articles (#4 of 9) here, titled “Make Room Value-Added: Principals’ Human Capital Decisions and the Emergence of Teacher Observation Data.” This one is authored by Ellen Goldring, Jason A. Grissom, Christine Neumerski, Marisa Cannata, Mollie Rubin, Timothy Drake, and Patrick Schuermann, all of whom are associated with Vanderbilt University.

This article is primarily about (1) the extent to which the data generated by “high-quality observation systems” can inform principals’ human capital decisions (e.g., teacher hiring, contract renewal, assignment to classrooms, professional development), and (2) the extent to which principals are relying less on test scores derived via value-added models (VAMs), when making the same decisions, and why. Here are some of their key (and most important, in my opinion) findings:

  • Principals across all school systems revealed major hesitations and challenges regarding the use of VAM output for human capital decisions. Barriers preventing VAM use included the timing of data availability (e.g., the fall), which is well after human capital decisions are made (p. 99).
  • VAM output are too far removed from the practice of teaching (p. 99), and this lack of instructional sensitivity impedes, if not entirely prevents, their actual versus hypothetical use for school/teacher improvement.
  • “Principals noted they did not really understand how value-added scores were calculated, and therefore they were not completely comfortable using them” (p. 99). Likewise, principals reported that because teachers did not understand how the systems worked either, teachers did not use VAM output data either (p. 100).
  • VAM output are not transparent when used to determine compensation, and especially when used to evaluate teachers teaching nontested subject areas. In districts that use school-wide VAM output to evaluate teachers in nontested subject areas, in fact, principals reported regularly ignoring VAM output altogether (p. 99-100).
  • “Principals reported that they perceived observations to be more valid than value-added measures” (p. 100); hence, principals reported using observational output much more, again, in terms of making human capital decisions and making such decisions “valid” (p. 100).
  • “One noted exception to the use of value-added scores seemed to be in the area of assigning teachers to particular grades, subjects, and classes. Many principals mentioned they use value-added measures to place teachers in tested subjects and with students in grade levels that ‘count’ for accountability purpose…some principals [also used] VAM [output] to move ineffective teachers to untested grades, such as K-2 in elementary schools and 12th grade in high schools” (p. 100).

Of special note here is also the following finding: “In half of the systems [in which researchers investigated these systems], there [was] a strong and clear expectation that there be alignment between a teacher’s value-added growth score and observation ratings…Sometimes this was a state directive and other times it was district-based. In some systems, this alignment is part of the principal’s own evaluation; principals receive reports that show their alignment” (p. 101). In other words, principals are being evaluated and held accountable given the extent to which their observations of their teachers match their teachers’ VAM-based data. If misalignment is noticed, it is not taken to be the fault of either measure (e.g., in terms of measurement error); rather, it is taken to be the fault of the principal, who is critiqued for inaccuracy and therefore (inversely) incentivized to skew his/her observational data (the only data over which the supervisor has control) to artificially match VAM-based output. This clearly distorts validity, or rather the validity of the inferences that are to be made using such data. Appropriately, principals also “felt uncomfortable [with this] because they were not sure if their observation scores should align primarily…with the VAM” output (p. 101).

“In sum, the use of observation data is important to principals for a number of reasons: It provides a “bigger picture” of the teacher’s performance, it can inform individualized and large group professional development, and it forms the basis of individualized support for remediation plans that serve as the documentation for dismissal cases. It helps principals provide specific and ongoing feedback to teachers. In some districts, it is beginning to shape the approach to teacher hiring as well” (p. 102).

The only significant weakness, again in my opinion, with this piece is that the authors write that these observational data, at focus in this study, are “new,” thanks to recent federal initiatives. They write, for example, that “data from structured teacher observations—both quantitative and qualitative—constitute a new [emphasis added] source of information principals and school systems can utilize in decision making” (p. 96). They are also “beginning to emerge [emphasis added] in the districts…as powerful engines for principal data use” (p. 97). I would beg to differ, as these systems have not changed much over time, pre and post these federal initiatives, as (without evidence or warrant) claimed by these authors herein. See, for example, Table 1 on p. 98 of the article to see whether what they have included within the list of components of such new and “complex, elaborate teacher observation systems” is actually new or much different from most of the observational systems in use prior. As an aside, one such system in use and of issue in this examination is one with which I am familiar, in use in the Houston Independent School District. Click here to also see if this system is also more “complex” or “elaborate” over and above such systems prior.

Also recall that one of the key reports that triggered the current call for VAMs, as the “more objective” measures needed to measure and therefore improve teacher effectiveness, was based on data that suggested that “too many teachers” were being rated as satisfactory or above. The observational systems in use then are essentially the same observational systems still in use today (see “The Widget Effect” report here). This is in stark contradiction to the authors’ claims throughout this piece, for example, when they write that “Structured teacher observations, as integral components of teacher evaluations, are poised to be a very powerful lever for changing principal leadership and the influence of principals on schools, teachers, and learning.” This counters all that is in, and all that came from, “The Widget Effect” report here.

*****

If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; and see the Review of Article #3 – on VAMs’ potentials here.

Article #4 Reference: Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. doi:10.3102/0013189X15575031