Nevada (Potentially) Dropping Students’ Test Scores from Its Teacher Evaluation System

This week in Nevada “Lawmakers Mull[ed] Dropping Student Test Scores from Teacher Evaluations,” as per a recent article in The Nevada Independent (see here). This would be quite a shift from 2011, when the state (backed by state Republicans, not by federal Race to the Top funds, and inspired by Michelle Rhee) passed into policy a requirement that 50% of every Nevada teacher’s evaluation rely on said data. The current percentage stands at 20%, but it is set to double to 40% next year.

Nevada is one of a still uncertain number of states looking to scale back the weight, and the purported “value-added,” of such measures. Note also that last week Connecticut dropped some of the test-based components of its teacher evaluation system (see here). All of this is occurring, of course, following the federal passage of the Every Student Succeeds Act (ESSA), under which states are no longer required to base teacher evaluation systems in significant part on their students’ test scores.

Accordingly, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” The Democratic Assembly Speaker, for example, said that “he’s always been troubled that teachers are rated on standardized test scores,” more specifically noting: “I don’t think any single teacher that I’ve talked to would shirk away from being held accountable…[b]ut if they’re going to be held accountable, they want to be held accountable for things that … reflect their actual work.” I’ve never met a teacher who would disagree with this statement.

Anyhow, this past Monday the state’s Assembly Education Committee heard public testimony on these matters and three bills “that would alter the criteria for how teachers’ effectiveness is measured.” These three bills are as follows:

  • AB212 would prohibit the use of student test scores in evaluating teachers;
  • AB320 would eliminate statewide [standardized] test results as a measure but allow local assessments to account for 20 percent of the total evaluation; and
  • AB312 would ensure that teachers in overcrowded classrooms not be penalized for certain evaluation metrics deemed out of their control given the student-to-teacher ratio.

Many presented testimony in support of these bills over an extended period of time on Tuesday. I was also invited to speak; in my testimony I “cautioned lawmakers against being ‘mesmerized’ by the promised objectivity of standardized tests. They have their own flaws, [I] argued, estimating that 90-95 percent of researchers who are looking at the effects of high-stakes testing agree that they’re not moving the dial [really whatsoever] on teacher performance.”

Lawmakers have until the end of tomorrow (i.e., Friday) to pass these bills out of committee. Otherwise, they will die.

Of course, I will keep you posted, but things are currently looking “very promising,” especially for AB320.

The Tripod Student Survey Instrument: Its Factor Structure and Value-Added Correlations

The Tripod student perception survey instrument is a “research-based” instrument increasingly being used by states to add a “multiple measures” component to their teacher evaluation systems. While other instruments are also in use, and student survey instruments are being developed by states and local districts, this one in particular is gaining in popularity, in part because it was used throughout the Bill & Melinda Gates Foundation’s ($43 million worth of) Measures of Effective Teaching (MET) studies. A current estimate (as per the study discussed in this post) is that during the 2015–2016 school year approximately 1,400 schools purchased and administered the Tripod. See also a prior post (here) about this instrument, or more specifically about a book chapter on the instrument authored by the instrument’s developer and the lead researcher of much of the research surrounding it – Ronald Ferguson.

In a study recently released in the esteemed American Educational Research Journal (AERJ), and titled “What Can Student Perception Surveys Tell Us About Teaching? Empirically Testing the Underlying Structure of the Tripod Student Perception Survey,” researchers found that the Tripod’s factor structure did not “hold up.” That is, the Tripod’s 7Cs (i.e., seven constructs including Care, Confer, Captivate, Clarify, Consolidate, Challenge, and Classroom Management; see more information about the 7Cs here) and the 36 items positioned across the 7Cs did not fit the 7C framework as theorized by the instrument’s developer(s).

Rather, using the MET database (N=1,049 middle school math class sections; N=25,423 students), researchers found that an alternative bi-factor structure (i.e., two versus seven constructs) best fit the Tripod items theoretically positioned otherwise. These two factors included (1) a general responsivity dimension underlying (more or less) all of the items, distinct from (2) a classroom management dimension governing responses to the items about teachers’ classroom management. Researchers were unable to distinguish seven separate dimensions across the items.
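
To see, under simple assumptions, why seven theorized subscales can collapse into something closer to a bi-factor structure, consider the toy simulation below. This is my own illustration, not the authors’ analysis: the item counts, loadings, and noise levels are hypothetical. The point is that when one general dimension drives nearly every item, scores built from the seven theorized constructs become so highly intercorrelated that students cannot be said to be discriminating among seven separate dimensions; only the classroom management subscale stands somewhat apart.

```python
# Toy simulation (not the authors' analysis): if one general "responsivity" factor
# drives most Tripod-style items, subscale scores built from the seven theorized
# constructs end up nearly indistinguishable from one another.
import numpy as np

rng = np.random.default_rng(0)
n_students = 5000
constructs = ["Care", "Confer", "Captivate", "Clarify",
              "Consolidate", "Challenge", "ClassMgmt"]
items_per_construct = 5  # hypothetical; the real instrument has 36 items in total

general = rng.normal(size=n_students)      # general responsivity factor
class_mgmt = rng.normal(size=n_students)   # specific classroom-management factor

subscales = {}
for c in constructs:
    items = []
    for _ in range(items_per_construct):
        loading_g = 0.7                                  # every item loads on the general factor
        loading_s = 0.5 if c == "ClassMgmt" else 0.0     # only ClassMgmt items load on the specific factor
        noise = rng.normal(scale=0.6, size=n_students)
        items.append(loading_g * general + loading_s * class_mgmt + noise)
    subscales[c] = np.mean(items, axis=0)                # subscale score = mean of its items

# Correlations among the seven theorized subscale scores
names = list(subscales)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = np.corrcoef(subscales[a], subscales[b])[0, 1]
        print(f"{a:12s} vs {b:12s}: r = {r:.2f}")
```

Run under these assumptions, the six non-management subscales correlate with one another at around r = .87, while each correlates with the classroom management subscale at around r = .73, a pattern consistent with one general dimension plus a distinguishable classroom management dimension rather than seven separable constructs.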

Researchers also found that the two alternative factors noted — general responsivity and classroom management — were positively associated with teacher value-added scores. More specifically, results suggested that these two factors were positively and statistically significantly associated with teachers’ value-added measures based on state mathematics tests (standardized coefficients were .25 and .25, respectively), although for undisclosed reasons, results apparently suggested nothing about these two factors’ (cor)relationships with value-added estimates based on state English/language arts (ELA) tests. As per the authors’ findings in the area of mathematics, prior researchers have also found low to moderate agreement between teacher ratings and student perception ratings; hence, this particular finding simply adds another source of convergent evidence.

Authors do give multiple reasons and plausible explanations as to why they found what they did, which you can read in more depth via the full article, linked to above and fully cited below. Authors also note that “It is unclear whether the original 7Cs that describe the Tripod instrument were intended to capture seven distinct dimensions on which students can reliably discriminate among teachers or whether the 7Cs were merely intended to be more heuristic domains that map out important aspects of teaching” (p. 1859); hence, this is also important to keep in mind given the study’s findings.

As per study authors, and to their knowledge, “this study [was] the first to systematically investigate the multidimensionality of the Tripod student perception survey” (p. 1863).

Citation: Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. doi:10.3102/0002831216671864 Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0002831216671864

New (Unvetted) Research about Washington DC’s Teacher Evaluation Reforms

In November of 2013, I published a blog post about a “working paper” released by the National Bureau of Economic Research (NBER) and written by authors Thomas Dee – Economics and Educational Policy Professor at Stanford, and James Wyckoff – Economics and Educational Policy Professor at the University of Virginia. In the study titled “Incentives, Selection, and Teacher Performance: Evidence from IMPACT,” Dee and Wyckoff (2013) analyzed the controversial IMPACT educator evaluation system that was put into place in Washington DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee. In this paper, Dee and Wyckoff (2013) presented what they termed to be “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” via DC’s IMPACT initiative worked to improve the performance of high-performing teachers, and that dismissal threats worked to increase the voluntary attrition of low-performing teachers, as well as improve the performance of the students of the teachers who replaced them.

I critiqued this study in full (see both short and long versions of this critique here), and ultimately asserted that the study had “fatal flaws” which compromised the (exaggerated) claims Dee and Wyckoff (2013) advanced. These flaws included, but were not limited to, the fact that only 17% of the teachers included in this study (i.e., teachers of reading and mathematics in grades 4 through 8) were actually evaluated under the value-added component of the IMPACT system. Put inversely, 83% of the teachers included in this study about teachers’ “value-added” did not have student test scores available to determine whether they were indeed of “added value.” That is, 83% of the teachers evaluated were assessed on their overall levels of effectiveness, or subsequent increases/decreases in effectiveness, as per only the subjective observational and other self-report data included within the IMPACT system. Hence, while the authors’ findings were presented as hard fact, given the 17% figure and the sample limitations, their (exaggerated) conclusions did not generalize across teachers, despite what they claimed.

In short, the extent to which Dee and Wyckoff (2013) oversimplified very complex data to oversimplify a very complex context and policy situation, after which they exaggerated questionable findings, was at issue and should have been reconciled prior to the study’s release. I should add that this study was published in 2015 in the (economics-oriented and not educational-policy-specific) Journal of Policy Analysis and Management (see here), although I have not since revisited the piece to compare (e.g., via a content analysis) the original 2013 version with the final 2015 piece.

Anyhow, they are at it again. Just this past January (2017) they published another report, albeit alongside two additional authors: Melinda Adnot – a Visiting Assistant Professor at the University of Virginia, and Veronica Katz – an Educational Policy PhD student, also at the University of Virginia. This study, titled “Teacher Turnover, Teacher Quality, and Student Achievement in DCPS,” was also (prematurely) released as a “working paper” by the same NBER, again without any internal or external vetting, but (irresponsibly) released “for discussion and comment.”

Hence, I provide my “discussion and comments” below, all the while underscoring how this continues to be problematic, also given the fact that I was contacted by the media for comment. Frankly, no media reports should be released about these (or, for that matter, any other) “working papers” until they are not only internally but also externally reviewed (e.g., in press or published, post vetting). Unfortunately, as it too commonly does, NBER released this report, clearly without such concern. Now, we as the public are responsible for consuming this study with much critical caution, while also advocating that others do the same (and helping them to do so). Hence, I write into this post my critiques of this particular study.

First, the primary assumption (i.e., the “conceptual model”) driving this Adnot, Dee, Katz, & Wyckoff (2016) piece is that low-performing teachers should be identified and replaced with more effective teachers. This is akin to the assumption noted in the first Dee and Wyckoff (2013) piece. It should be noted here that in DCPS, teachers rated “Ineffective” (once) or “Minimally Effective” (for consecutive years) are “separated” from the district; hence, DCPS has adopted educational policies that align with this “conceptual model” as well. Interesting to note is how researchers, purportedly external to DCPS, entered into this study with the same a priori “conceptual model.” This, in and of itself, is an indicator of researcher bias (see also forthcoming).

Nonetheless, Adnot et al.’s (2016) highlighted finding was that “on average, DCPS replaced teachers who left with teachers who increased student achievement by 0.08 SD [standard deviations] in math.” Buried further into the report, they also found that DCPS replaced teachers who left with teachers who increased student achievement by 0.05 SD in reading (at not a 5% but a 10% statistical significance level). These findings, in simpler but also more realistic terms, mean that (if actually precise and correct, also given all of the problems with how teacher classifications were determined at the DCPS level) “effective” mathematics teachers who replaced “ineffective” mathematics teachers increased student achievement by approximately 2.7%, and “effective” reading teachers who replaced “ineffective” reading teachers increased student achievement by approximately 1.7% (again, at not a 5% but a 10% statistical significance level). These are hardly groundbreaking results, as these proportional movements likely represented one or maybe two total test items on the large-scale standardized tests used to assess DCPS’s policy impacts.
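
For context, here is a rough back-of-the-envelope conversion of what effects of 0.08 and 0.05 standard deviations might mean in raw test items. This is my own illustration: the 50-item test length and the raw-score spread of about 10 items are hypothetical, and the authors may have used a different conversion to arrive at their 2.7% and 1.7% figures.

```python
# Back-of-the-envelope: convert an SD-unit effect into raw test items.
# Assumes a hypothetical 50-item test whose raw-score standard deviation is about
# 10 items; neither number is taken from the study or from DCPS's actual tests.
def effect_in_items(effect_sd, test_sd_in_items=10.0):
    return effect_sd * test_sd_in_items

for subject, effect in [("math", 0.08), ("reading", 0.05)]:
    print(f"{subject}: {effect} SD is roughly {effect_in_items(effect):.1f} items "
          "on this hypothetical test")
# math: 0.08 SD is roughly 0.8 items on this hypothetical test
# reading: 0.05 SD is roughly 0.5 items on this hypothetical test
```

Under these assumptions the movement amounts to roughly one test item, which is consistent with the point above about one or maybe two total test items.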

Interesting to also note is that not only were the “small effects” exaggerated to mean so much more than what they are actually worth (see also forthcoming), but also that only the larger of the two findings – the mathematics finding – is highlighted in the abstract. The complementary and smaller reading effect is actually buried in the text. Also buried is that these findings pertain only to grades four through eight, general education teachers who were value-added eligible, akin to Dee and Wyckoff’s (2013) earlier piece (e.g., typically 30% of a school’s population, although Dee and Wyckoff’s (2013) piece marked this percentage at 17%).

As mentioned prior, none of this would have likely happened had this piece been internally and/or externally reviewed prior to this study’s release.

Regardless, Adnot et al. (2016) also found that the attrition of relatively higher-performing teachers (e.g., “Effective” or “Highly Effective”) had a negative but also statistically insignificant effect.

Hence, it can be concluded that the only “finding” highlighted in the abstract of this piece was not the only “finding”; rather, other findings were (perhaps) purposefully buried in the text. It is possible, in other words, that because these other findings did not support the researchers’ a priori conclusions and claims, the researchers chose not to bring attention to them, or rather to the lack thereof (e.g., in the abstract).

Related, I should note that in a few places the authors exaggerate how, for example, teachers’ effects on their students’ achievement are so tangible, without any mention of contrary reports, namely the statement published by the American Statistical Association (ASA), in which the ASA evidenced that these (oft-exaggerated) teacher effects account for no more than 1%-14% of the variance in students’ growth scores (see more information here). In fact, teacher effectiveness is very likely not “qualitatively large,” as Adnot et al. (2016) argue without evidence in this piece and also imply throughout as a foundational part of their aforementioned “conceptual model.”

Likewise, while most everyone would likely agree that there are “stark inequities” in students’ access to effective teachers, how to improve this condition is certainly a matter of great debate, which is neither explicitly nor implicitly acknowledged throughout this piece. Rather, much disagreement and debate still exist regarding whether inducing teacher turnover will get “us” really anywhere in terms of school reform, as also related to how big (or small) teachers’ effects on students’ measurable performance actually are, as discussed prior. Accordingly, and perhaps not surprisingly, Adnot et al. (2016) cite only the articles of other researchers, or rather members of their proverbial choir (e.g., Eric Hanushek who, without actual evidence, has been hypothesizing for nearly a decade now about how replacing “ineffective” teachers with “effective” teachers will reform America’s schools) to support these same a priori conclusions. Consequently, the professional integrity of the researchers must be put into check given these simple (albeit biased) errors.

Taking all of this into consideration, I would hardly call the findings they advanced in this piece (and emphasized in the abstract) solid indicators of the “overall positive effects of teacher turnover,” with only one statistically, but not practically, significant finding of note in mathematics (i.e., a 2.7% increase, if accurate). None of this could/should, accordingly, lead anyone to conclude that “the supply of entering teachers appears to be of sufficient quality to sustain a relatively high turnover rate.”

Hence, this is yet another case of these authors oversimplifying very complex data to oversimplify a very complex context and policy situation, after which they exaggerated negligible findings while also dismissing others.

Related, would we not expect greater results given that teachers deemed highly effective are to receive one-time bonuses of up to $25,000 and permanent increases to their base salaries of up to $27,000 per year? This bang, or lack thereof, may not be worth the buck, either.

Additionally, is an annual attrition rate of “low-performing teachers” (e.g., those classified as such for one or two consecutive years) in the district, currently hovering at around 46%, worth these diminutive results?

Did they also actually find, overall, that “high-poverty schools actually improve as a result of teacher turnover?” I don’t think so, but do give this study a full read to test their as well as my conclusions for yourself (see, again, the full study here).

In the end, Adnot et al. (2016) do conclude that they found “that the overall effect of teacher turnover in DCPS conservatively had no effect on achievement and, under reasonable assumptions, improved achievement.” This is a MUCH more balanced interpretation of this study, although I would certainly question their “reasonable assumptions” (see also prior). Moreover, it is curious why we had to wait until the end for the actual headline of this study. This is especially important given that others, including members of the media, public, and policy-making community, might not make it that far (i.e., trusting only what is in the abstract).

Citations:

Adnot, M., Dee, T., Katz, V., & Wyckoff, J. (2016). Teacher turnover, teacher quality, and student achievement in DCPS [Washington DC Public Schools]. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w21922.pdf

Dee, T., & Wyckoff, J. (2013). Incentives, selection, and teacher performance: Evidence from IMPACT. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w19529.pdf

Rest in Peace, EVAAS Developer William L. Sanders

Over the last 3.5 years since I developed this blog, I have written many posts about one particular value-added model (VAM) – the Education Value-Added Assessment System (EVAAS), formerly known as the Tennessee Value-Added Assessment System (TVAAS), now known by some states as the TxVAAS in Texas, the PVAAS in Pennsylvania, and also known as the generically-named EVAAS in states like Ohio, North Carolina, and South Carolina (and many districts throughout the nation). It is this model on which I have conducted most of my research (see, for example, the first piece I published about this model here, in which most of the claims I made still stand, although EVAAS modelers disagreed here). And it is this model that is at the source of the majority of the teacher evaluation lawsuits in which I have been or still am currently engaged (see, for example, details about the Houston lawsuit here, the former Tennessee lawsuit here, and the new Texas lawsuit here, although the model is more peripheral in this particular case).

Anyhow, the original EVAAS model (i.e., the TVAAS) was developed by a man named William L. Sanders, who ultimately sold it to SAS Institute Inc., which now holds all rights to the proprietary model. See, for example, here. See also examples of prior posts about Sanders here, here, here, here, here, and here. See also examples of prior posts about the EVAAS here, here, here, here, here, and here.

It is William L. Sanders who just passed away, and we sincerely hope he may rest in peace.

Sanders had a bachelor’s degree in animal science and a doctorate in statistics and quantitative genetics. As an adjunct professor and agricultural statistician in the college of business at the University of Tennessee, Knoxville, he developed his TVAAS in the late 1980s.

Sanders thought that educators struggling with student achievement in the state should “simply” use more advanced statistics, similar to those used when modeling genetic and reproductive trends among cattle, to measure growth, hold teachers accountable for that growth, and solve the educational measurement woes facing the state of Tennessee at the time. It was to be as simple as that…. I should also mention that given this history, not surprisingly, Tennessee was one of the first states to receive Race to the Top funds to the tune of $502 million to further advance this model; hence, this has also contributed to this model’s popularity across the nation.

Nonetheless, Sanders passed away this past Thursday, March 16, 2017, from natural causes in Columbia, Tennessee. As per his obituary here,

  • He was most well-known for developing “a method used to measure a district, school, and teacher’s effect on student performance by tracking the year-to-year progress of students against themselves over their school career with various teachers’ classes.”
  • He “stood for a hopeful view that teacher effectiveness dwarfs all other factors as a predictor of student academic growth…[challenging]…decades of assumptions that student family life, income, or ethnicity has more effect on student learning.”
  • He believed, in the simplest of terms, “that educational influence matters and teachers matter most.”

Of course, we have much research evidence to counter these claims, but for now we will just leave all of this at that. Again, may he rest in peace.

New Texas Lawsuit: VAM-Based Estimates as Indicators of Teachers’ “Observable” Behaviors

Last week I spent a few days in Austin, one day during which I provided expert testimony for a new state-level lawsuit that has the potential to impact teachers throughout Texas. The lawsuit is Texas State Teachers Association (TSTA) v. Texas Education Agency (TEA), Mike Morath in his Official Capacity as Commissioner of Education for the State of Texas.

The key issue is that, as per the state’s Texas Education Code (Sec. § 21.351, see here) regarding teachers’ “Recommended Appraisal Process and Performance Criteria,” the Commissioner of Education must adopt “a recommended teacher appraisal process and criteria on which to appraise the performance of teachers. The criteria must be based on observable, job-related behavior, including: (1) teachers’ implementation of discipline management procedures; and (2) the performance of teachers’ students.” As for the latter, the State/TEA/Commissioner defined, as per its Texas Administrative Code (T.A.C., Chapter 15, Sub-Chapter AA, §150.1001, see here), that teacher-level value-added measures should be treated as one of the four measures of “(2) the performance of teachers’ students;” that is, one of the four measures recognized by the State/TEA/Commissioner as an “observable” indicator of a teacher’s “job-related” performance.

While currently no district throughout the State of Texas is required to use a value-added component to assess and evaluate its teachers, as noted, the value-added component is listed as one of four measures from which districts must choose at least one. All options listed in the category of “observable” indicators include: (A) student learning objectives (SLOs); (B) student portfolios; (C) pre- and post-test results on district-level assessments; and (D) value-added data based on student state assessment results.

Related, the state has not recommended or required that any district, if the value-added option is selected, choose any particular value-added model (VAM) or calculation approach. Nor has it recommended or required that any district adopt any consequences as attached to this output; however, things like teacher contract renewal and sharing teachers’ prior appraisals with other districts in which teachers might be applying for new jobs are not discouraged. Again, though, the main issue here (and the key point to which I testified) was that the value-added component is listed as an “observable” and “job-related” teacher effectiveness indicator as per the state’s administrative code.

Accordingly, my (5-hour) testimony was primarily (albeit among many other things, including the “job-related” part) about how teacher-level value-added data do not yield anything that is observable in terms of teachers’ effects. Hence, officially referring to these data in this way is simply false, in that:

  • “We” cannot directly observe a teacher “adding” (or detracting) value (e.g., with our own eyes, like supervisors can when they conduct observations of teachers in practice);
  • Using students’ test scores to measure student growth upwards (or downwards) and over time, as is very common practice using the (very often instructionally insensitive) state-level tests required by No Child Left Behind (NCLB), and doing this once per year in mathematics and reading/language arts (that includes prior and other current teachers’ effects, summer learning gains and decay, etc.), is not valid practice. That is, doing this has not been validated by the scholarly/testing community; and
  • Worse and less valid is to thereafter aggregate this student-level growth to the teacher level and then attribute whatever “growth” (or lack thereof) occurs to something the teacher (and really only the teacher) did, as if it were directly “observable.” These data are far from assessing a teacher’s causal or “observable” impacts on his/her students’ learning and achievement over time. See, for example, the prior statement released about value-added data use in this regard by the American Statistical Association (ASA) here. In this statement it is written that: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation [emphasis added to note that this is variation explained which = correlational versus causal research] in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability [emphasis added] to teachers. This is not saying that teachers have little effect on students, but that variation among teachers [emphasis added] accounts for a small part of the variation [emphasis added] in [said test] scores. The majority of the variation in [said] test scores is [inversely, 86%-99% related] to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences.” (See a brief worked conversion of these percentages just after this list.)
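
As a brief worked conversion (my own illustration, not part of the ASA statement): because variance explained is the square of a correlation, the ASA’s 1% to 14% range corresponds to correlations of only about .10 to .37 between teacher-level factors and students’ test scores or growth, which is the sense in which these are correlational, not causal, quantities.

```python
# Convert the ASA's "percent of total variability attributable to teachers" range
# into correlation magnitudes via r = sqrt(R^2); a worked illustration, not ASA code.
import math

for r_squared in (0.01, 0.14):
    r = math.sqrt(r_squared)
    print(f"{r_squared:.0%} of variance explained ~ a correlation of about {r:.2f}")
# 1% of variance explained ~ a correlation of about 0.10
# 14% of variance explained ~ a correlation of about 0.37
```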

If any of you have anything to add to this, please do so in the comments section of this post. Otherwise, I will keep you posted on how this goes. My current understanding is that this one will be headed to court.

New Article Published on Using Value-Added Data to Evaluate Teacher Education Programs

A former colleague, a current PhD student, and I just had an article released about using value-added data to (or rather not to) evaluate teacher education/preparation programs in higher education. The article is titled “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added’,” and the abstract of the article is included below.

If there is anyone out there who might be interested in this topic, please note that the journal in which this piece was published (online first and to be published in its paper version later) – Teaching Education – has made the article free for its first 50 visitors. Hence, I thought I’d share this with you all first.

If you’re interested, do access the full piece here.

Happy reading…and here’s the abstract:

In this study researchers examined the effectiveness of one of the largest teacher education programs located within the largest research-intensive universities within the US. They did this using a value-added model as per current federal educational policy imperatives to assess the measurable effects of teacher education programs on their teacher graduates’ students’ learning and achievement as compared to other teacher education programs. Correlational and group comparisons revealed little to no relationship between value-added scores and teacher education program regardless of subject area or position on the value-added scale. These findings are discussed within the context of several very important data and methodological challenges researchers also made transparent, as also likely common across many efforts to evaluate teacher education programs using value-added approaches. Such transparency and clarity might assist in the creation of more informed value-added practices (and more informed educational policies) surrounding teacher education accountability.

David Berliner on The Purported Failure of America’s Schools

My primary mentor, David Berliner (Regents Professor at Arizona State University (ASU)) wrote, yesterday, a blog post for the Equity Alliance Blog (also at ASU) on “The Purported Failure of America’s Schools, and Ways to Make Them Better” (click here to access the original blog post). See other posts about David’s scholarship on this blog here, here, and here. See also one of our best blog posts that David also wrote here, about “Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs).”

In sum, for many years David has been writing “about the lies told about the poor performance of our students and the failure of our schools and teachers.” For example, he wrote one of the education profession’s all-time classics and best sellers: The Manufactured Crisis: Myths, Fraud, And The Attack On America’s Public Schools (1995). If you have not read it, you should! All educators should read this book, on that note and in my opinion, but also in the opinion of many other iconic educational scholars throughout the U.S. (Paufler, Amrein-Beardsley, Hobson, under revision for publication).

While the title of this book accurately captures its contents, more specifically it “debunks the myths that test scores in America’s schools are falling, that illiteracy is rising, and that better funding has no benefit. It shares the good news about public education.” I’ve found the contents of this book to still be my best defense when others with whom I interact attack America’s public schools, attacks often misinformed and perpetuated by many American politicians and journalists.

In this blog post David, once again, debunks many of these myths surrounding America’s public schools using more up-to-date data from international tests, our country’s National Assessment of Educational Progress (NAEP), state-level SAT and ACT scores, and the like. He reminds us of how student characteristics “strongly influence the [test] scores obtained by the students” at any school and, accordingly, “strongly influence” or bias these scores when used in any aggregate form (e.g., to hold teachers, schools, districts, and states accountable for their students’ performance).

He reminds us that “in the US, wealthy children attending public schools that serve the wealthy are competitive with any nation in the world…[but in]…schools in which low-income students do not achieve well, [that are not competitive with many nations in the world] we find the common correlates of poverty: low birth weight in the neighborhood, higher than average rates of teen and single parenthood, residential mobility, absenteeism, crime, and students in need of special education or English language instruction.” These societal factors explain poor performance much more (i.e., more variance explained) than any school-level or, as pertinent to this blog, teacher-level factor (e.g., teacher quality as measured by large-scale standardized test scores).

In this post David reminds us of much, much more that we need to remember and also often recall in defense of our public schools and in support of our schools’ futures (e.g., research-based notes to help “fix” some of our public schools).

Again, please do visit the original blog post here to read more.

Difficulties When Combining Multiple Teacher Evaluation Measures

A new study, “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in the American Educational Research Association (AERA)-sponsored and highly esteemed journal Educational Evaluation and Policy Analysis. You can find the free and full version of this study here.

In this study authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – at the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). Multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.
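
To make the distinction among these models concrete, here is a minimal sketch, with hypothetical scores, weights, and a hypothetical cut-point rather than anything taken from the study, of how the same teacher’s three measures can be classified differently under conjunctive, disjunctive, and weighted (compensatory) rules.

```python
# Minimal sketch of conjunctive, disjunctive, and weighted (compensatory) decision rules.
# Scores are on a common 1-4 scale; the weights and the 2.5 cut-point are hypothetical.
teacher = {"vam": 1.8, "observation": 3.6, "student_survey": 3.2}
weights = {"vam": 0.5, "observation": 0.35, "student_survey": 0.15}
cut = 2.5

conjunctive = all(score >= cut for score in teacher.values())   # must clear the cut on every measure
disjunctive = any(score >= cut for score in teacher.values())   # clearing the cut on any one measure suffices
composite = sum(weights[m] * teacher[m] for m in teacher)       # weighted (compensatory) composite
compensatory = composite >= cut                                 # strong measures can offset weak ones

print("conjunctive: ", "effective" if conjunctive else "not effective")
print("disjunctive: ", "effective" if disjunctive else "not effective")
print(f"compensatory: composite = {composite:.2f} ->",
      "effective" if compensatory else "not effective")
```

Under these hypothetical numbers the same teacher is “not effective” under the conjunctive rule but “effective” under the disjunctive and compensatory rules, which is precisely the kind of classification instability the authors flag.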

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”

This, of course, has huge implications for teacher evaluation systems based upon multiple measures, in that “accuracy” means “validity,” and “valid” decisions cannot be made based on “invalid” or “inaccurate” data that can so arbitrarily change. In other words, what this means is that likely never will a decision about a teacher being this or that actually mean this or that. In fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures combined are assumed to function otherwise. This is especially interesting given, again and as stated prior, that the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with the attachment of consequences to teachers’ multiple evaluation measures, as assumed (by the state but not the state’s consultant?) to be accurate and correct (see, for example, here).

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively advised others to use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers” (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given that all of this also makes the social science arbitrary, which is a legal issue in and of itself, as also quasi noted prior.
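
For readers who want to see why the choice of weights matters for reliability, here is a hedged sketch using the standard formula for the reliability of a weighted composite with uncorrelated measurement errors. The component reliabilities, correlations, and weight sets below are my own hypothetical values, not the study’s; the point is only that different weighting schemes yield composites of different reliability.

```python
# Reliability of a weighted composite, assuming uncorrelated measurement errors:
#   rho_composite = 1 - sum_i(w_i^2 * var_i * (1 - rho_i)) / Var(composite)
# All component reliabilities, SDs, correlations, and weights below are hypothetical.
import numpy as np

reliabilities = np.array([0.35, 0.65, 0.80])  # e.g., VAM, student survey, observation (illustrative)
sds = np.array([1.0, 1.0, 1.0])               # measures placed on a common scale
corr = np.array([[1.0, 0.2, 0.3],
                 [0.2, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])             # hypothetical observed-score correlations
cov = corr * np.outer(sds, sds)

def composite_reliability(weights):
    w = np.asarray(weights, dtype=float)
    total_var = w @ cov @ w                                       # variance of the weighted composite
    error_var = np.sum(w ** 2 * sds ** 2 * (1 - reliabilities))   # summed weighted error variances
    return 1 - error_var / total_var

print("policy-style weights (.50, .25, .25):", round(composite_reliability([0.50, 0.25, 0.25]), 2))
print("equal weights (1/3 each):            ", round(composite_reliability([1/3, 1/3, 1/3]), 2))
print("observation-heavy (.20, .20, .60):   ", round(composite_reliability([0.2, 0.2, 0.6]), 2))
```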

Now, I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, as unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary; how principal component, unit weight, policy weight, and optimal prediction approaches vary). I will also note, however, that forcing the fit of such multiple measures in such ways, especially without a thorough background in and understanding of reliability and validity and what reliability means for validity (i.e., with rather high levels of reliability required before any valid inferences, and especially high-stakes decisions, can be made), is certainly unwise.

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly cause pause, if not a justifiable grievance as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0162373716666166

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

NCTQ on States’ Teacher Evaluation Systems’ Failures

The controversial National Council on Teacher Quality (NCTQ) — created by the conservative Thomas B. Fordham Institute and funded (in part) by the Bill & Melinda Gates Foundation as “part of a coalition for ‘a better orchestrated agenda’ for accountability, choice, and using test scores to drive the evaluation of teachers” (see here; see also other instances of controversy here and here) — recently issued yet another report about states’ teacher evaluation systems titled: “Running in Place: How New Teacher Evaluations Fail to Live Up to Promises.” See a related blog post in Education Week about this report here. See also a related blog post about NCTQ’s prior large-scale (and also slanted) study — “State of the States 2015: Evaluating Teaching, Leading and Learning” — here. Like I did in that post, I summarize this study below.

From the abstract: Authors of this report find that “within the 30 states that [still] require student learning measures to be at least a significant factor in teacher evaluations, state guidance and rules in most states allow teachers to be rated effective even if they receive low scores on the student learning component of the evaluation.” They add in the full report that in many states “a high score on an evaluation’s observation and [other] non-student growth components [can] result in a teacher earning near or at the minimum number of points needed to earn an effective rating. As a result, a low score on the student growth component of the evaluation is sufficient in several states to push a teacher over the minimum number of points needed to earn a summative effective rating. This essentially diminishes any real influence the student growth component has on the summative evaluation rating” (p. 3-4).
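
A small worked example helps show the mechanism the report describes; the weights, the 1-5 rating scale, and the cut score below are hypothetical, loosely patterned on the Colorado illustration summarized later in this post, and not taken from any specific state’s rules.

```python
# Hypothetical compensatory scoring: a teacher with the lowest possible student-growth
# rating can still clear an "effective" cut score if the practice/observation score is high.
# The weights, the 1-5 scale, and the cut score are illustrative only.
growth_weight, practice_weight = 0.35, 0.65
growth_score, practice_score = 1, 5     # lowest growth rating, top professional practice rating
effective_cut = 3.0                     # hypothetical summative cut for an "effective" label

summative = growth_weight * growth_score + practice_weight * practice_score
print(f"summative score = {summative:.2f} ->",
      "effective" if summative >= effective_cut else "not effective")
# summative score = 3.60 -> effective
```

This is exactly the compensatory dynamic the report criticizes: a low score on the student growth component has essentially no bearing on the summative label.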

The first assumption underlying the authors’ main tenets is made explicit: that “[u]nfortunately, [the] policy transformation [that began with the publication of the “Widget Effect” report in 2009] has not resulted in drastic alterations in outcomes” (p. 2). This is because, “[in] effect…states have been running in place” (p. 2) and not using teachers’ primarily test-based indicators for high-stakes decision-making. Hence, “evaluation results continue to look much like they did…back in 2009” (p. 2). The authors then, albeit ahistorically, ask, “How could so much effort to change state laws result in so little actual change?” (p. 2). Yet they don’t realize (or care to realize) that this is because we have almost 40 years of evidence that virtually no type of test-based educational accountability policy or initiative has ever yielded its intended consequences (i.e., increased student achievement on national and international indicators). Rather, the authors argue that “most states’ evaluation laws fated these systems to status quo results long before” they really had a chance (p. 2).

The authors’ second assumption is implied: that the two most often used teacher evaluation indicators (i.e., the growth or value-added and observational measures) should be highly correlated, which many argue they should be IF in fact they are measuring general teacher effectiveness. But the more fundamental assumption here is that if the student learning (i.e., test-based) indicators do not correlate with the observational indicators, the latter MUST be wrong, biased, distorted, and accordingly less trustworthy and the like. They add that “teachers and students are not well served when a teacher is rated effective or higher even though her [sic] students have not made sufficient gains in their learning over the course of a school year” (p. 4). Accordingly, they add that “evaluations should require that a teacher is rated well on both the student growth measures and the professional practice component (e.g., observations, student surveys, etc.) in order to be rated effective” (p. 4). Hence, also in this report the authors put forth recommendations for how states might address this challenge. See these recommendations forthcoming, as also related to a new phenomenon my students and I are studying called artificial inflation.

Artificial inflation is a term I recently coined to represent what is/was happening in Houston, and elsewhere (e.g., Tennessee), when district leaders (e.g., superintendents) mandate or force principals and other teacher effectiveness appraisers or evaluators to align their observational ratings of teachers’ effectiveness with teachers’ value-added scores, with the value-added scores being (sometimes relentlessly) considered the “objective measure” around which all other measures (e.g., subjective observational measures) should revolve, or align. Hence, the push is to inflate the “subjective” observational measure to match the “objective” value-added measure, even if the process of artificial inflation causes both indicators to become invalid. As per my affidavit from the still ongoing lawsuit in Houston (see here), “[t]o purposefully and systematically endorse the engineering and distortion of the perceptible ‘subjective’ indicator, using the perceptibly ‘objective’ indicator as a keystone of truth and consequence, is more than arbitrary, capricious, and remiss…not to mention in violation of the educational measurement field’s “Standards for Educational and Psychological Testing.”

Nonetheless…

Here is one important figure, taken out of context in some ways on purpose (e.g., as the text surrounding this particular figure is ironically, subjectively used to define what the NCTQ defines as indicators of progress, or regress).

Near Figure 1 (p. 1) the authors note that “as of January 2017, there has been little evidence of a large-scale reversal of states’ formal evaluation policies. In fact, only four states (Alaska, Mississippi, North Carolina, and Oklahoma) have reversed course on factoring student learning into a teacher’s evaluation rating” (p. 3). While this reversal of four is not illustrated in their accompanying figure, see also a prior post here about what other states, beyond just these four states of dishonorable mention, have done to “reverse” the “course” (p. 3). While the authors shame all states for minimizing teachers’ test-based ratings before these systems had a chance, as if states were ignorant of what the authors cite as “a robust body of research” (cited without references or citations here, and with few elsewhere in a set of footnotes), they add that it remains unknown “why state educational agencies put forth regulations or guidance that would allow teachers to be rated effective without meeting their student growth goals” (p. 4). Many of us know that this was often done to counter the unreliable and invalid results often yielded via the “objective,” test-based sides of things that the NCTQ continues to advance.

Otherwise, here are also some important descriptive findings:

  • Thirty states require measures of student academic growth to be at least a significant factor within teacher evaluations; another 10 states require some student growth, and 11 states do not require any objective measures of student growth (p. 5).
  • With only [emphasis added] two exceptions, in the 30 states where student
    growth is at least a significant factor in teacher evaluations, state
    rules or guidance effectively allow teachers who have not met student
    growth goals to still receive a summative rating of at least effective (p. 5).
  • In 18 [of these 30] states, state educational agency regulations and/or guidance
    explicitly permit teachers to earn a summative rating of effective even after earning a less-than-effective score on the student learning portion of their evaluations…these regulations meet the letter of the law while still allowing teachers with low ratings on
    student growth measures to be rated effective or higher (p. 5). In Colorado, for example…a teacher can earn a rating of highly effective with a score of just 1 for student growth (which the state classifies as “less than expected”) in conjunction with a top professional practice score (p. 4).
  • Ten states do not specifically address whether a teacher who has not met student growth goals may be rated as effective or higher. These states neither specifically allow nor specifically disallow such a scenario, but by failing to provide guidance to prevent such an occurrence, they enable it to exist (p. 6).
  • Only two of the 30 states (Indiana and Kentucky) make it impossible for a teacher who has not been found effective at increasing student learning to receive a summative rating of effective (p. 6).

Finally, here are some of their important recommendations, as related to all of the above, for creating more meaningful teacher evaluation systems. As they argue, states should:

  • Establish policies that preclude teachers from earning a label of effective if they are found ineffective at increasing student learning (p. 12).
  • Track the results of discrete components within evaluation systems, both statewide and districtwide. In districts where student growth measures and observation measures are significantly out of alignment, states should reevaluate their systems and/or offer districts technical assistance (p. 12). [That is, states should possibly promote artificial inflation, as we have observed elsewhere. The authors add that] to ensure that evaluation ratings better reflect teacher performance, states should [more specifically] track the results of each evaluation measure to pinpoint where misalignment between components, such as between student learning and observation measures, exists. Where major components within an evaluation system are significantly misaligned, states should examine their systems and offer districts technical assistance where needed, whether through observation training or examining student growth models or calculations (p. 12-13). [Tennessee, for example,] publishes this information so that it is transparent and publicly available to guide actions by key stakeholders and point the way to needed reforms (p. 13).

See also state-by-state reports in the appendices of the full report, in case your state was one of the states that responded or, rather, “recognized the factual accuracy of this analysis.”

Citation: Walsh, K., Joseph, N., Lakis, K., & Lubell, S. (2017). Running in place: How new teacher evaluations fail to live up to promises. Washington, DC: National Council on Teacher Quality (NCTQ). Retrieved from http://www.nctq.org/dmsView/Final_Evaluation_Paper

Last Saturday Night Live’s VAM-Related Skit

For those of you who may have missed it last Saturday, Melissa McCarthy portrayed Sean Spicer — President Trump’s new White House Press Secretary and Communications Director — in one of the funniest of a very funny set of skits recently released on Saturday Night Live. You can watch the full video, compliments of YouTube, here:

In one of the sections of the skit, though, “Spicer” introduces “Betsy DeVos” — portrayed by Kate McKinnon and also just today confirmed as President Trump’s Secretary of Education — to answer some very simple questions about today’s public schools which she, well, very simply could not answer. See this section of the clip starting at about 6:00 (of the above 8:00 minute total skit).

In short, “the man” reporter asks “DeVos” how she values “growth versus proficiency in [sic] measuring progress in students.” Literally at a loss for words, “DeVos” responds that she really doesn’t “know anything about school.” She rambles on until “Spicer” pushes her off of the stage 40-or-so seconds later.

Humor aside, this was the one question Saturday Night Live writers wrote into this skit, which reminds us that what we know more generally as the purpose of VAMs is still alive and well in our educational rhetoric as well as in popular culture. As background, this question apparently derived from a similar question Minnesota Sen. Al Franken asked DeVos during her confirmation hearing.

Notwithstanding, Steve Snyder – the editorial director of The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown (see prior posts about this news site here and here) – took the opportunity to write a “featured” piece about this section of the script (see here). The purpose of the piece was, as the title illustrates, to help us “understand” the skit, as well as its important meaning for all of “us.”

Snyder notes that Saturday Night Live writers, with their humor, might have consequently (and perhaps mistakenly) “made their viewers just a little more knowledgeable about how their child’s school works,” or rather should work, as “[g]rowth vs. proficiency is a key concept in the world of education research.” Thereafter, Snyder falsely asserts that more than two-thirds of educational researchers agree that VAMs are a good way to measure school quality. If you visit the actual statistic cited in this piece, however, as “non-partisan, honest, and fact-based” as it is supposed to be, you would find (here) that this two-thirds consists of 57% of responding American Education Finance Association (AEFA) members, and AEFA members alone, who are certainly not representative of “educational researchers” as claimed.

Regardless, Snyder asks: “Why are researchers…so in favor of [these] growth measures?” Because, Snyder, this disciplinary subset does not represent educational researchers writ large, but only a subset.

As it is with politics today, many educational researchers who define themselves as aligned with the disciplines of educational finance or educational econometrics are substantively more in favor of VAMs than those who align with the more general disciplines of educational research and educational measurement, methods, and statistics. While this is somewhat of a sweeping generalization, which is not wise, as I also argue and acknowledge in this piece, there is certainly more to be said here about the validity of the inferences drawn, and (too) often driven, via “media” like The 74.

The bottom line is to question and critically consume everything, and to question everyone who feels qualified to write about particular things without enough expertise in them, including, in this case, good and professional journalism, this area of educational research, and what it means to make valid inferences and then responsibly share them with the public.