The “Widget Effect” Report Revisited

You might recall that in 2009, The New Teacher Project published a highly influential “Widget Effect” report in which researchers (see citation below) evidenced that 99% of teachers (whose teacher evaluation reports they examined across a sample of school districts spread across a handful of states) received evaluation ratings of “satisfactory” or higher. Inversely, only 1% of the teachers whose reports researchers examined received ratings of “unsatisfactory,” even though teachers’ supervisors could identify more teachers whom they deemed ineffective when asked otherwise.

Accordingly, this report was widely publicized given the assumed improbability that only 1% of America’s public school teachers were, in fact, ineffectual, and given the fact that such ineffective teachers apparently existed but were not being identified using standard teacher evaluation/observational systems in use at the time.

Hence, this report was used as evidence that America’s teacher evaluation systems were unacceptable and in need of reform, primarily given the subjectivities and flaws apparent and arguably inherent across the observational components of these systems. This reform was also needed to help reform America’s public schools, writ large, so the logic went and (often) continues to go. While binary constructions of complex data such as these are often used to ground simplistic ideas and push definitive policies, ideas, and agendas, this tactic certainly worked here, as this report (among a few others) was used to inform the federal and state policies pushing teacher evaluation system reform as a result (e.g., Race to the Top (RTTT)).

Likewise, this report continues to be used whenever a state’s or district’s new-and-improved teacher evaluation systems (still) evidence “too many” (as typically arbitrarily defined) teachers as effective or higher (see, for example, an Education Week article about this here). Whether the systems have in fact been reformed is also up for debate, however, in that states are still using many of the same observational systems they were using prior (i.e., not the “binary checklists” exaggerated in the original as well as this report, albeit true in the case of the district of focus in this study). The real “reforms,” here, pertained to the extent to which value-added model (VAM) or other growth output were combined with these observational measures, and the extent to which districts adopted state-level observational models as per the centralized educational policies put into place at the same time.

Nonetheless, now eight years later, Matthew A. Kraft (an Assistant Professor of Education & Economics at Brown University) and Allison F. Gilmour (an Assistant Professor at Temple University and former doctoral student at Vanderbilt University) revisited the original report. In a study just published in the esteemed, peer-reviewed journal Educational Researcher (see an earlier version of the published study here), Kraft and Gilmour compiled “teacher performance ratings across 24 [of the 38, including 14 RTTT] states that [by 2014-2015] adopted major reforms to their teacher evaluation systems” as a result of such policy initiatives. They found that “the percentage of teachers rated Unsatisfactory remains less than 1%,” except in two states (i.e., Maryland and New Mexico), with Unsatisfactory (or similar) ratings varying “widely across states with 0.7% to 28.7%” as the low and high, respectively (see also the study Abstract).

Related, Kraft and Gilmour found that “some new teacher evaluation systems do differentiate among teachers, but most only do so at the top of the ratings spectrum” (p. 10). More specifically, observers in states in which teacher evaluation ratings include five versus four rating categories differentiate teachers more, but still do so along the top three ratings, which still does not solve the negative skew at issue (i.e., “too many” teachers still scoring “too well”). They also found that when these observational systems were used for formative (i.e., informative, improvement) purposes, teachers’ ratings were lower than when they were used for summative (i.e., final summary) purposes.

Clearly, the assumptions of all involved in this area of policy research come into play, here, akin to how they did in The Bell Curve and The Bell Curve Debate. During this (still ongoing) debate, many fervently debated whether socioeconomic and educational outcomes (e.g., IQ) should be normally distributed. What this means in this case, for example, is that for every teacher who is rated highly effective there should be a teacher rated as highly ineffective, more or less, to yield a symmetrical distribution of teacher observational scores across the spectrum.
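To make the symmetry assumption concrete, here is a minimal sketch that computes the skewness of a rating distribution. Both distributions are hypothetical (they are not data from any state or study), and the function name is my own. A symmetric distribution has a skewness of roughly zero, while one with ratings piled at the top has negative skewness.

```python
def skewness(proportions: dict[int, float]) -> float:
    """Skewness of a discrete rating distribution given as {rating: proportion}."""
    mean = sum(rating * p for rating, p in proportions.items())
    variance = sum((rating - mean) ** 2 * p for rating, p in proportions.items())
    third_moment = sum((rating - mean) ** 3 * p for rating, p in proportions.items())
    return third_moment / variance ** 1.5

# Hypothetical four-category systems (1 = lowest rating, 4 = highest rating).
symmetric = {1: 0.10, 2: 0.40, 3: 0.40, 4: 0.10}
top_heavy = {1: 0.01, 2: 0.04, 3: 0.55, 4: 0.40}

print(f"symmetric: {skewness(symmetric):.2f}")
print(f"top-heavy: {skewness(top_heavy):.2f}")
```

Whether teacher effectiveness should look like the first distribution is, of course, exactly the assumption under debate.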

In fact, one observational system of which I am aware (i.e., the TAP System for Teacher and Student Advancement) is marketing its proprietary system, using as a primary selling point figures illustrating (with text explaining) how clients who use their system will improve their prior “Widget Effect” results (i.e., yielding such normal curves; see Figure below, as per Jerald & Van Hook, 2011, p. 1).

Evidence also suggests that these scores are (sometimes) being artificially deflated to assist in these attempts (see, for example, a recent publication of mine released a few days ago here in the (also) esteemed, peer-reviewed Teachers College Record about how this is also occurring in response to the “Widget Effect” report and the educational policies that followed).

While Kraft and Gilmour assert that “systems that place greater weight on normative measures such as value-added scores rather than…[just]…observations have fewer teachers rated proficient” (p. 19; see also Steinberg & Kraft, forthcoming; a related article about how this has occurred in New Mexico here; and New Mexico’s 2014-2016 data below and here, as also illustrative of the desired normal curve distributions discussed above), I highly doubt this purely reflects New Mexico’s “commitment to putting students first.”

I also highly doubt that, as per New Mexico’s acting Secretary of Education, this was “not [emphasis added] designed with quote unquote end results in mind.” That is, “the New Mexico Public Education Department did not set out to place any specific number or percentage of teachers into a given category.” If true, it’s pretty miraculous how this simply worked out as illustrated… This is also at issue in the lawsuit in which I am involved in New Mexico, in which the American Federation of Teachers won an injunction in 2015 that still stands today (see more information about this lawsuit here). Indeed, as per Kraft, all of this “might [and possibly should] undercut the potential for this differentiation [if ultimately proven artificial, for example, as based on statistical or other pragmatic deflation tactics] to be seen as accurate and valid” (as quoted here).

Notwithstanding, Kraft and Gilmour, also as part (and actually the primary part) of this study, “present original survey data from an urban district illustrating that evaluators perceive more than three times as many teachers in their schools to be below Proficient than they rate as such.” Accordingly, even though their data for this part of this study come from one district, their findings are similar to others evidenced in the “Widget Effect” report; hence, there are still likely educational measurement (and validity) issues on both ends (i.e., with using such observational rubrics as part of America’s reformed teacher evaluation systems and using survey methods to put into check these systems, overall). In other words, just because the survey data did not match the observational data does not mean either is wrong, or right, but there are still likely educational measurement issues.

Also at issue in this regard, in terms of the 1% issue, are (a) how the time and effort it takes supervisors to assist/desist after rating teachers low can make assigning low ratings not worth it; (b) how supervisors often give higher ratings to those with perceived potential, also in support of their future growth, even if current evidence suggests a lower rating is warranted; (c) how having “difficult conversations” can sometimes prevent supervisors from assigning the scores they believe teachers may deserve, especially if things like job security are on the line; (d) supervisors’ challenges with removing teachers, including “long, laborious, legal, draining process[es];” and (e) supervisors’ challenges with replacing teachers, if terminated, given current teacher shortages and the time and effort, again, it often takes to hire (ideally more qualified) replacements.


Jerald, C. D., & Van Hook, K. (2011). More than measurement: The TAP system’s lessons learned for designing better teacher evaluation systems. Santa Monica, CA: National Institute for Excellence in Teaching (NIET). Retrieved from

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249. doi:10.3102/0013189X17718797

Steinberg, M. P., & Kraft, M. A. (forthcoming). The sensitivity of teacher performance ratings to the design of teacher evaluation systems. Educational Researcher.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). “The Widget Effect.” Education Digest, 75(2), 31–35.

Observational Systems: Correlations with Value-Added and Bias

A colleague recently sent me a report released in November of 2016 by the Institute of Education Sciences (IES) division of the U.S. Department of Education that should be of interest to blog followers. The study is about “The content, predictive power, and potential bias in five widely used teacher observation instruments” and is authored by affiliates of Mathematica Policy Research.

Using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies, researchers examined five widely used teacher observation instruments. Instruments included the more generally popular Classroom Assessment Scoring System (CLASS) and Danielson Framework for Teaching (of general interest in this post), as well as the more subject-specific instruments including the Protocol for Language Arts Teaching Observations (PLATO), the Mathematical Quality of Instruction (MQI), and the UTeach Observational Protocol (UTOP) for science and mathematics teachers.

Researchers examined these instruments in terms of (1) what they measure (which is not of general interest in this post), but also (2) the relationships of observational output to teachers’ impacts on growth in student learning over time (as measured using a standard value-added model (VAM)), and (3) whether observational output are biased by the characteristics of the students non-randomly (or in this study randomly) assigned to teachers’ classrooms.

As per #2 above, researchers found that the instructional practices captured across these instruments modestly [emphasis added] correlate with teachers’ value-added scores, with an adjusted (and likely, artificially inflated; see Note 1 below) correlation coefficient between observational and value added indicators at: 0.13 ≤ r ≤ 0.28 (see also Table 4, p. 10). As per the higher, adjusted r (emphasis added; see also Note 1 below), they found that these instruments’ classroom management dimensions most strongly (r = 0.28) correlated with teachers’ value-added.

Related, also at issue here is that such correlations are not “modest,” but rather “weak” to “very weak” (see Note 2 below). While all correlation coefficients were statistically significant, this is much more likely due to the sample size used in this study versus the actual or practical magnitude of these results. “In sum” this hardly supports the overall conclusion that “observation scores predict teachers’ value-added scores” (p. 11); although, it should also be noted that this summary statement, in and of itself, suggests that the value-added score is the indicator around which all other “less objective” indicators are to revolve.
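To illustrate why statistical significance says little about magnitude, the sketch below computes the usual t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), at several sample sizes. The sample sizes are hypothetical (they are not the study's), and the function name is my own.

```python
import math

def t_statistic_for_r(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical sample sizes, for illustration only (not taken from the study).
for n in (30, 100, 1000):
    t = t_statistic_for_r(0.13, n)
    # For large n, |t| above roughly 1.96 is significant at the .05 level.
    print(f"r = 0.13, n = {n:>4}: t = {t:.2f}")
```

The same small coefficient crosses the conventional threshold only once the sample is large, even though its magnitude never changes.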

As per #3 above, researchers found that students randomly assigned to teachers’ classrooms (as per the MET data, although there were some noncompliance issues with the random assignment employed in the MET studies) do bias teachers’ observational scores, for better or worse, and more often in English language arts than in mathematics. More specifically, they found that for the Danielson Framework and CLASS (the two more generalized instruments examined in this study, also of main interest in this post), teachers with relatively more racial/ethnic minority and lower-achieving students (in that order, although these are correlated themselves) tended to receive lower observation scores. Bias was observed more often for the Danielson Framework versus the CLASS, but it was observed in both cases. An “alternative explanation [may be] that teachers are providing less-effective instruction to non-White or low-achieving students” (p. 14).

Notwithstanding, and in sum, in classrooms in which students were randomly assigned to teachers, teachers’ observational scores were biased by students’ group characteristics, which also means that bias is also likely more prevalent in classrooms to which students are non-randomly assigned (which is common practice). These findings are also akin to those found elsewhere (see, for example, two similar studies here), as this was also evidenced in mathematics, which may also be due to the random assignment factor present in this study. In other words, if non-random assignment of students into classrooms is practice, a biasing influence may (likely) still exist in English language arts and mathematics.

The long and short of it, though, is that the observational components of states’ contemporary teacher systems certainly “add” more “value” than their value-added counterparts (see also here), especially when considering these systems’ (in)formative purposes. But to suggest that because these observational indicators (artificially) correlate with teachers’ value-added scores at “weak” and “very weak” levels (see Notes 1 and 2 below), that this means that these observational systems might “add” more “value” to the summative sides of teacher evaluations (i.e., their predictive value) is premature, not to mention a bit absurd. Adding import to this statement is the fact that, as is duly noted in this study, these observational indicators are oft-to-sometimes biased against teachers who teach lower-achieving and racial minority students, even when random assignment is present, making such bias worse when non-random assignment, which is very common, occurs.

Hence, and again, this does not make the case for the summative uses of really either of these indicators or instruments, especially when high-stakes consequences are to be attached to output from either indicator (or both indicators together given the “weak” to “very weak” relationships observed). On the plus side, though, remain the formative functions of the observational indicators.


Note 1: Researchers used the “year-to-year variation in teachers’ value-added scores to produce an adjusted correlation [emphasis added] that may be interpreted as the correlation between teachers’ average observation dimension score and their underlying value added—the value added that is [not very] stable [or reliable] for a teacher over time, rather than a single-year measure (Kane & Staiger, 2012)” (p. 9). This practice or its statistic derived has not been externally vetted. Likewise, this also likely yields a correlation coefficient that is falsely inflated. Both of these concerns are at issue in the ongoing New Mexico and Houston lawsuits, in which Kane is one of the defendants’ expert witnesses in both cases testifying in support of his/this practice.
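For readers unfamiliar with this kind of adjustment, a common textbook version is the correction for attenuation, which divides an observed correlation by the square root of the reliability of the noisy measure. Whether the researchers used exactly this formula is my assumption, and the observed correlation and reliability values below are hypothetical, not taken from the report.

```python
import math

def disattenuated_r(observed_r: float, reliability: float) -> float:
    """Correct an observed correlation for unreliability in ONE of the two measures."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return observed_r / math.sqrt(reliability)

observed = 0.15  # hypothetical single-year correlation
for reliability in (1.0, 0.5, 0.3):  # hypothetical year-to-year reliabilities
    adjusted = disattenuated_r(observed, reliability)
    print(f"reliability = {reliability:.1f}: adjusted r = {adjusted:.2f}")
```

The less stable the value-added scores are assumed to be, the larger the adjusted coefficient becomes relative to the observed one.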

Note 2: As is common with social science research when interpreting correlation coefficients: 0.8 ≤ r ≤ 1.0 = a very strong correlation; 0.6 ≤ r ≤ 0.8 = a strong correlation; 0.4 ≤ r ≤ 0.6 = a moderate correlation; 0.2 ≤ r ≤ 0.4 = a weak correlation; and 0 ≤ r ≤ 0.2 = a very weak correlation, if any at all.
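The cut points in Note 2 can be sketched as a small lookup, shown here alongside the shared variance (r squared) that each coefficient implies. Because the listed ranges touch at their boundaries, assigning a boundary value to the higher label is my own choice, as is the function name.

```python
def correlation_strength(r: float) -> str:
    """Label the magnitude of r using the cut points listed in Note 2."""
    size = abs(r)
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak"

# The low and high adjusted coefficients reported in the study.
for r in (0.13, 0.28):
    print(f"r = {r:.2f}: {correlation_strength(r)}, shared variance = {r ** 2:.1%}")
```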


Citation: Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from

The Tripod Student Survey Instrument: Its Factor Structure and Value-Added Correlations

The Tripod student perception survey instrument is a “research-based” instrument increasingly being used by states to add to their teacher evaluation systems as based on “multiple measures.” While there are other instruments also in use, as well as student survey instruments being developed by states and local districts, this one in particular is gaining in popularity, also in that it was used throughout the Bill & Melinda Gates Foundation’s ($43 million worth of) Measures of Effective Teaching (MET) studies. A current estimate (as per the study discussed in this post) is that during the 2015–2016 school year approximately 1,400 schools purchased and administered the Tripod. See also a prior post (here) about this instrument, or more specifically a chapter of a book about the instrument as authored by the instrument’s developer and lead researcher in the research surrounding it – Ronald Ferguson.

In a study recently released in the esteemed American Educational Research Journal (AERJ), and titled “What Can Student Perception Surveys Tell Us About Teaching? Empirically Testing the Underlying Structure of the Tripod Student Perception Survey,” researchers found that the Tripod’s factor structure did not “hold up.” That is, Tripod’s 7Cs (i.e., seven constructs including: Care, Confer, Captivate, Clarify, Consolidate, Challenge, Classroom Management; see more information about the 7Cs here) and the 36 items that are positioned within each of the 7Cs did not fit the 7C framework as theorized by instrument developer(s).

Rather, using the MET database (N=1,049 middle school math class sections; N=25,423 students), researchers found that an alternative bi-factor structure (i.e., two versus seven constructs) best fit the Tripod items theoretically positioned otherwise. These two factors included (1) a general responsivity dimension that includes all items (more or less) unrelated to (2) a classroom management dimension that governs responses on items surrounding teachers’ classroom management. Researchers were unable to distinguish seven separate dimensions across items.

Researchers also found that the two alternative factors noted (general responsivity and classroom management) were positively associated with teacher value-added scores. More specifically, results suggested that these two factors were positively and statistically significantly associated with teachers’ value-added measures based on state mathematics tests (standardized coefficients were .25 and .25, respectively), although for undisclosed reasons, results apparently suggested nothing about these two factors’ (cor)relationships with value-added estimates based on state English/language arts (ELA) tests. As per authors’ findings in the area of mathematics, prior researchers have also found low to moderate agreement between teacher ratings and student perception ratings; hence, this particular finding simply adds another source of convergent evidence.

Authors do give multiple reasons and plausible explanations as to why they found what they did that you all can read in more depth via the full article, linked to above and fully cited below. Authors also note that “It is unclear whether the original 7Cs that describe the Tripod instrument were intended to capture seven distinct dimensions on which students can reliably discriminate among teachers or whether the 7Cs were merely intended to be more heuristic domains that map out important aspects of teaching” (p. 1859); hence, this is also important to keep in mind given study findings.

As per study authors, and to their knowledge, “this study [was] the first to systematically investigate the multidimensionality of the Tripod student perception survey” (p. 1863).

Citation: Wallace, T. L., Kelcey, B., & Ruzek, E. (2016). What can student perception surveys tell us about teaching? Empirically testing the underlying structure of the Tripod student perception survey. American Educational Research Journal, 53(6), 1834–1868. doi:10.3102/0002831216671864 Retrieved from

New Article Published on Using Value-Added Data to Evaluate Teacher Education Programs

A former colleague, a current PhD student, and I just had an article released about using value-added data to (or rather not to) evaluate teacher education/preparation, higher education programs. The article is titled “An Elusive Policy Imperative: Data and Methodological Challenges When Using Growth in Student Achievement to Evaluate Teacher Education Programs’ ‘Value-Added,’” and the abstract of the article is included below.

If there is anyone out there who might be interested in this topic, please note that the journal in which this piece was published (online first and to be published in its paper version later) – Teaching Education – has made the article free for its first 50 visitors. Hence, I thought I’d share this with you all first.

If you’re interested, do access the full piece here.

Happy reading…and here’s the abstract:

In this study researchers examined the effectiveness of one of the largest teacher education programs located within the largest research-intensive universities within the US. They did this using a value-added model as per current federal educational policy imperatives to assess the measurable effects of teacher education programs on their teacher graduates’ students’ learning and achievement as compared to other teacher education programs. Correlational and group comparisons revealed little to no relationship between value-added scores and teacher education program regardless of subject area or position on the value-added scale. These findings are discussed within the context of several very important data and methodological challenges researchers also made transparent, as also likely common across many efforts to evaluate teacher education programs using value-added approaches. Such transparency and clarity might assist in the creation of more informed value-added practices (and more informed educational policies) surrounding teacher education accountability.

Difficulties When Combining Multiple Teacher Evaluation Measures

A new study about multiple “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in the American Educational Research Association (AERA) sponsored and highly-esteemed Educational Evaluation and Policy Analysis journal. You can find the free and full version of this study here.

In this study authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – at the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). Multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”
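To see how the same scores can yield different classifications, consider this minimal sketch of the three families of models named above. The scores, the 1-4 scale, the cut score, and the weights are all hypothetical (they are mine, not the study's), and the decision rules are simplified versions of what each family implies: conjunctive means meeting the cut on every measure, disjunctive means meeting it on at least one, and compensatory means meeting it on a weighted average.

```python
from dataclasses import dataclass

@dataclass
class TeacherScores:
    vam: float          # value-added estimate, rescaled to a 1-4 scale (hypothetical)
    observation: float  # observational rubric score, 1-4 (hypothetical)
    survey: float       # student survey score, rescaled to 1-4 (hypothetical)

CUT = 2.5  # hypothetical "proficient" cut score

def conjunctive(t: TeacherScores) -> bool:
    """Proficient only if EVERY measure meets the cut."""
    return min(t.vam, t.observation, t.survey) >= CUT

def disjunctive(t: TeacherScores) -> bool:
    """Proficient if AT LEAST ONE measure meets the cut."""
    return max(t.vam, t.observation, t.survey) >= CUT

def compensatory(t: TeacherScores, weights=(0.5, 0.3, 0.2)) -> bool:
    """Proficient if the weighted average meets the cut (hypothetical policy weights)."""
    w_vam, w_obs, w_survey = weights
    composite = w_vam * t.vam + w_obs * t.observation + w_survey * t.survey
    return composite >= CUT

teachers = {
    "Teacher A": TeacherScores(vam=1.8, observation=3.4, survey=3.0),
    "Teacher B": TeacherScores(vam=2.6, observation=2.6, survey=2.7),
    "Teacher C": TeacherScores(vam=2.0, observation=3.0, survey=2.2),
}

for name, scores in teachers.items():
    print(name, conjunctive(scores), disjunctive(scores), compensatory(scores))
```

With these made-up numbers, only Teacher B is classified the same way under all three rules; the other two teachers' labels depend entirely on which model is chosen.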

This, of course, has huge implications for teacher evaluation systems as based upon multiple measures in that “accuracy” means “validity” and “valid” decisions cannot be made as based on “invalid” or “inaccurate” data that can so arbitrarily change. In other words, what this means is that likely never will a decision about a teacher being this or that actually mean this or that. In fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures combined are assumed to function otherwise. It is especially interesting, again and as stated prior, that the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with the attachment of consequences to teachers’ multiple evaluation measures, as assumed (by the state but not the state’s consultant?) to be accurate and correct (see, for example, here).

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively encouraged others to use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers” (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given all of this also makes the social science arbitrary, which is also a legal issue in and of itself, as also quasi noted.

Now, while I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, as unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary, how principal component, unit weight, policy weight, optimal prediction approaches vary), I will also note that forcing the fit of such multiple measures in such ways, especially without a thorough background in and understanding of reliability and validity and what reliability means for validity (i.e., with rather high levels of reliability required before any valid inferences and especially high-stakes decisions can be made) is certainly unwise.

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly cause pause, if not a justifiable grievance as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

Value-Added for Kindergarten Teachers in Ecuador

Authors of a study that a colleague of mine recently sent me, recently released in The Quarterly Journal of Economics and titled “Teacher Quality and Learning Outcomes in Kindergarten,” (nearly randomly) assigned two cohorts of more than 24,000 kindergarten students to teachers to examine whether, indeed and once again, teacher behaviors are related to growth in students’ test scores over time (i.e., value-added).

To assess this, researchers administered 12 tests to the Kindergarteners (I know) at the beginning and end of the year in mathematics and language arts (although apparently the 12 posttests only took 30-40 minutes to complete, which is a content validity and coverage issue in and of itself, p. 1424). They also assessed something they called the executive function (EF), and that they defined as children’s inhibitory control, working memory, capacity to pay attention, and cognitive flexibility, all of which they argue to be related to “Volumetric measures of prefrontal cortex size [when] predict[ed]” (p. 1424). This, along with the fact that teachers’ IQs were also measured (using the Spanish-speaking version of the Wechsler Adult Intelligence Scale) speaks directly to the researchers’ background theory and approach (e.g., recall our world’s history with craniometry, aptly captured in one of my favorite books, Stephen J. Gould’s best-selling “The Mismeasure of Man”). Teachers were also observed using the Classroom Assessment Scoring System (CLASS), and parents were also solicited for their opinions about their children’s teachers (see other measures collected pp. 1417-1418).

What should by now be some familiar names (e.g., Raj Chetty, Thomas Kane) served as collaborators on the study. Likewise, their works and the works of other likely familiar scholars and notorious value-added supporters (e.g., Eric Hanushek, Jonah Rockoff) are also cited throughout in support as evidence of “substantial research” (p. 1416) in support of value-added models (VAMs). Of course, this is unfortunate but important to point out in that this is an indicator of “researcher bias” in and of itself. For example, one of the authors’ findings really should come as no surprise: “Our results…complement estimates from [Thomas Kane’s Bill & Melinda Gates Measures of Effective Teaching] MET project” (p. 1419); although, the authors in a very interesting footnote (p. 1419) describe in more detail than I’ve seen elsewhere all of the weaknesses with the MET study in terms of its design, “substantial attrition,” “serious issue[s]” with contamination and compliance, and possibly/likely biased findings caused by self-selection given the extent to which teachers volunteered to be a part of the MET study.

Also very important to note is that this study took place in Ecuador. Apparently, “they,” including some of the key players in this area of research noted above, are moving their VAM-based efforts across international waters, perhaps in part given the Every Student Succeeds Act (ESSA) recently passed in the U.S., that we should all know by now dramatically curbed federal efforts akin to what is apparently going on now and being pushed here and in other developing countries (although the authors assert that Ecuador is a middle-income country, not a developing country, even though this categorization apparently only applies to the petroleum rich sections of the nation). Related, they assert that, “concerns about teacher quality are likely to be just as important in [other] developing countries” (p. 1416); hence, adopting VAMs in such countries might just be precisely what these countries need to “reform” their schools, as well.

Unfortunately, many big businesses and banks (e.g., the Inter-American Development Bank that funded this particular study) are becoming increasingly interested in investing in and solving these and other developing countries’ educational woes, as well, via measuring and holding teachers accountable for teacher-level value-added, regardless of the extent to which doing this has not worked in the U.S. to improve much of anything. Needless to say, many who are involved with these developing nation initiatives, including some of those mentioned above, are also financially benefitting by continuing to serve others their proverbial Kool-Aid.

Nonetheless, their findings:

  • First, they “estimate teacher (rather than classroom) effects of 0.09 on language and math” (p. 1434). That is, just less than 1/10th of a standard deviation, or just over a 3% move in the positive direction away from the mean.
  • Similarly, they “estimate classroom effects of 0.07 standard deviation on EF” (p. 1433). That is, precisely 7/100th of a standard deviation, or just under a 3% move in the positive direction away from the mean.
  • They found that “children assigned to teachers with a 1-standard deviation higher CLASS score have between 0.05 and 0.07 standard deviation higher end-of-year test scores” (p. 1437), or a 2-3% move in the positive direction away from the mean.
  • And they found “that parents generally give higher scores to better teachers…parents are 15 percentage points more likely to classify a teacher who produces 1 standard deviation higher test scores as ‘very good’ rather than ‘good’ or lower” (p. 1442). This is quite an odd way of putting it: it assumes that the difference between “very good” and “good” is not arbitrary but empirically grounded, and it leaves unexplained why a simple correlation was not reported instead.
  • Their main finding is that “a 1 standard deviation increase in classroom quality, corrected for sampling error, results in 0.11 standard deviation higher test scores in both language and math” (p. 1433; see also other findings from pp. 1434-1447).
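
To put effect sizes like these in perspective, here is a minimal sketch (mine, not the authors’) of how a shift expressed in standard deviations translates into percentile points for a student who starts at the mean. It assumes test scores are normally distributed, which the article does not guarantee, and `percentile_shift` is a hypothetical helper name.

```python
from statistics import NormalDist


def percentile_shift(effect_sd: float) -> float:
    """Percentile points gained by a student who starts at the mean
    (the 50th percentile) and moves up by `effect_sd` standard deviations.

    Assumption: scores follow a standard normal distribution.
    """
    return (NormalDist().cdf(effect_sd) - 0.5) * 100


# Effect sizes quoted in the bullets above (in standard deviation units).
quoted_effects = {
    "teacher effect, language and math": 0.09,
    "classroom effect, executive function": 0.07,
    "CLASS score, lower bound": 0.05,
    "classroom quality, corrected": 0.11,
}

for label, effect in quoted_effects.items():
    print(f"{label}: {effect:.2f} SD -> {percentile_shift(effect):.1f} percentile points")
```

Under that normality assumption, effects of 0.05 to 0.11 standard deviations move an otherwise average student by roughly 2 to 4 percentile points.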

Interestingly, the authors equate all of these effects with teacher or classroom “shocks,” although I’d hardly call them “shocks” that inherently imply a large, unidirectional, and causal impact. Moreover, this wording signals how the authors, as economists, still view this type of research (i.e., as causal rather than correlational, even with only close-to-random assignment, although they briefly mention this possibility on p. 1449).

Nonetheless, the authors conclude that in this article they effectively evidenced “that there are substantial differences [emphasis added] in the amount of learning that takes place in language, math, and executive function across kindergarten classrooms in Ecuador” (p. 1448). In addition, “These differences are associated with differences in teacher behaviors and practices,” as observed, and “that parents can generally tell better from worse teachers, but do not meaningfully alter their investments in children in response to random shocks [emphasis added] to teacher quality” (p. 1448).

Ultimately, they find that “value added is a useful summary measure of teacher quality in Ecuador” (p. 1448). Go figure…

They conclude “to date, no country in Latin America regularly calculates the value added of teachers,” yet “in virtually all countries in the region, decisions about tenure, in-service training, promotion, pay, and early retirement are taken with no regard for (and in most cases no knowledge about) a teacher’s effectiveness” (p. 1448). Also sound familiar??

“Value added is no silver bullet,” and indeed it is not as per much evidence now existent throughout the U.S., “but knowing which teachers produce more or less learning among equivalent students [is] an important step to designing policies to improve learning outcomes” (p. 1448), they also recognizably argue.

Citation: Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in Kindergarten. The Quarterly Journal of Economics, 1415–1453. doi:10.1093/qje/qjw016 Retrieved from

The “Value-Added” of Teacher Preparation Programs: New Research

The journal Economics of Education Review recently published a study titled “Teacher Quality Differences Between Teacher Preparation Programs: How Big? How Reliable? Which Programs Are Different?” The study was authored by researchers at the University of Texas – Austin, Duke University, and Tulane. The pre-publication version of this piece can be found here.

As the title implies, the purpose of the study was to “evaluate statistical methods for estimating teacher quality differences between TPPs [teacher preparation programs].” Needless to say, this research is particularly relevant, here, given “Sixteen US states have begun to hold teacher preparation programs (TPPs) accountable for teacher quality, where quality is estimated by teacher value-added to student test scores.” The federal government continues to support and advance these initiatives, as well (see, for example, here).

But this research study is also particularly important because researchers found that “[t]he most convincing estimates [of TPP quality] [came] from a value-added model where confidence intervals [were] widened”; that is, the extent to which measurement errors were permitted was dramatically increased, and also widened further using statistical corrections. Even when using these statistical techniques and accommodations, however, they found that it was still “rarely possible to tell which TPPs, if any, [were] better or worse than average.”

They therefore concluded that “[t]he potential benefits of TPP accountability may be too small to balance the risk that a proliferation of noisy TPP estimates will encourage arbitrary and ineffective policy actions” in response. More specifically, and in their own words, they found that:

  1. Differences between TPPs. While most of [their] results suggest that real differences between TPPs exist, the differences [were] not large [or large enough to make or evidence the differentiation between programs as conceptualized and expected]. [Their] estimates var[ied] a bit with their statistical methods, but averaging across plausible methods [they] conclude[d] that between TPPs the heterogeneity [standard deviation (SD) was] about .03 in math and .02 in reading. That is, a 1 SD increase in TPP quality predict[ed] just [emphasis added] a [very small] .03 SD increase in student math scores and a [very small] .02 SD increase in student reading scores.
  2. Reliability of TPP estimates. Even if the [above-mentioned] differences between TPPs were large enough to be of policy interest, accountability could only work if TPP differences could be estimated reliably. And [their] results raise doubts that they can. Every plausible analysis that [they] conducted suggested that TPP estimates consist[ed] mostly of noise. In some analyses, TPP estimates appeared to be about 50% noise; in other analyses, they appeared to be as much as 80% or 90% noise…Even in large TPPs the estimates were mostly noise [although]…[i]t is plausible [although perhaps not probable]…that TPP estimates would be more reliable if [researchers] had more than one year of data…[although states smaller than the one in this study, Texas]…would require 5 years to accumulate the amount of data that [they used] from one year of data.
  3. Notably Different TPPs. Even if [they] focus[ed] on estimates from a single model, it remains hard to identify which TPPs differ from the average…[Again,] TPP differences are small and estimates of them are uncertain.
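
To make the “mostly noise” point more concrete, here is a rough sketch of how a noise share relates to reliability, and of how much averaging over additional years could plausibly help. This is my illustration, not the authors’ method: the function names are hypothetical, and the multi-year calculation uses the classical Spearman-Brown formula, which assumes that true TPP quality is constant and that noise is independent and equally large from year to year.

```python
import math


def reliability(signal_sd: float, noise_sd: float) -> float:
    """Share of the variance in TPP estimates that reflects true differences."""
    signal_var = signal_sd ** 2
    return signal_var / (signal_var + noise_sd ** 2)


def implied_noise_sd(signal_sd: float, noise_share: float) -> float:
    """Noise SD implied by a true between-TPP SD and a stated noise share."""
    return signal_sd * math.sqrt(noise_share / (1.0 - noise_share))


def reliability_with_years(one_year_reliability: float, years: int) -> float:
    """Spearman-Brown: reliability of an average over `years` years.

    Assumption (mine): constant true quality, independent yearly noise.
    """
    r = one_year_reliability
    return years * r / (1.0 + (years - 1) * r)


# The quoted between-TPP SD in math is about .03; noise shares of 50%, 80%,
# and 90% are the figures the authors report across analyses.
for noise_share in (0.5, 0.8, 0.9):
    noise_sd = implied_noise_sd(0.03, noise_share)
    one_year = reliability(0.03, noise_sd)
    five_year = reliability_with_years(one_year, 5)
    print(f"noise share {noise_share:.0%}: noise SD {noise_sd:.3f}, "
          f"1-year reliability {one_year:.2f}, 5-year reliability {five_year:.2f}")
```

Even under these generous assumptions, an estimate that is 80% noise in one year would still be a bit under half noise after five years of averaging.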

In conclusion, that researchers found “that there are only small teacher quality differences between TPPs” might seem surprising, but not really, given that the outcome variables they used to measure and assess TPP effects were students’ test scores. In short, students’ test scores are three times removed from the primary unit of analysis in studies like these. That is, (1) the TPP is to be measured by the effectiveness of its teacher graduates, and (2) teacher graduates are to be measured by their purported impacts on their students’ test scores, while (3) students’ test scores are meant to measure, and have only been validated for measuring, student learning and achievement. These test scores have not been validated to assess and measure, in the inverse, teachers’ causal impacts on said achievement or TPPs’ impacts on teachers’ impacts on said achievement.

If this sounds confusing, it is, and it is also highly nonsensical. This is also a reason why this is so difficult to do and, as evidenced in this study, so improbable to do well or as theorized: TPP estimates are sensitive to error, insensitive given error, and, accordingly, highly uncertain and invalid.

Citation: von Hippel, P. T., Bellows, L., Osborne, C., Lincove, J. A., & Mills, N. (2016). Teacher quality differences between teacher preparation programs: How big? How reliable? Which programs are different? Economics of Education Review, 53, 31–45. doi:10.1016/j.econedurev.2016.05.002

U.S. Department of Education: Value-Added Not Good for Evaluating Schools and Principals

Just this month, the Institute of Education Sciences (IES) wing of the U.S. Department of Education released a report about using value-added models (VAMs) for measuring school principals’ performance. The study, conducted by researchers at Mathematica Policy Research and titled “Can Student Test Scores Provide Useful Measures of School Principals’ Performance?,” can be found online here, with my summary of the study findings highlighted below.

Before the passage of the Every Student Succeeds Act (ESSA), 40 states had written into their state statutes, as incentivized by the federal government, requirements to use growth in student achievement for annual principal evaluation purposes. More states had written growth/value-added models (VAMs) into statute for teacher evaluation purposes, which we have covered extensively via this blog, but this report pertains only to school and/or principal evaluation purposes. Since the passage of ESSA, and the reduction in the federal government’s control over state-level policies, states have much more liberty to decide whether to continue using student achievement growth for either purpose. This paper is positioned within this reasoning, and more specifically is meant to help states decide whether or to what extent they might (or might not) continue to move forward with using growth/VAMs for school and principal evaluation purposes.

Researchers, more specifically, assessed (1) reliability – or the consistency or stability of these ratings over time, which is important “because only stable parts of a rating have the potential to contain information about principals’ future performance; unstable parts reflect only transient aspects of their performance;” and (2) one form of multiple evidences of validity – the predictive validity of these principal-level measures, with predictive validity defined as “the extent to which ratings from these measures accurately reflect principals’ contributions to student achievement in future years.” In short, “A measure could have high predictive validity only if [emphasis added] it was highly stable between consecutive years [i.e., reliability]…and its stable part was strongly related to principals’ contributions to student achievement” over time (i.e., predictive validity).
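
The “only if” logic quoted above can be sketched with a simple decomposition. This is my own simplification and not the estimation approach the researchers used: it treats a rating as a stable part plus a transient part, assumes the transient part is unrelated to principals’ future contributions, and uses a hypothetical function name.

```python
import math


def predictive_correlation(stability: float, stable_part_validity: float) -> float:
    """Correlation between a rating and a principal's future contribution.

    stability: share of the rating's variance that is stable from year to year.
    stable_part_validity: correlation between that stable part and the
        principal's future contribution to student achievement.

    Assumption (mine): the transient part of the rating is pure noise with
    respect to future contributions, so only the stable part can predict.
    """
    if not 0.0 <= stability <= 1.0:
        raise ValueError("stability must be between 0 and 1")
    return stable_part_validity * math.sqrt(stability)


# A rating can fail to predict for either reason named in the quote:
print(predictive_correlation(stability=0.25, stable_part_validity=0.8))  # unstable rating
print(predictive_correlation(stability=0.81, stable_part_validity=0.2))  # stable but inaccurate
print(predictive_correlation(stability=0.81, stable_part_validity=0.9))  # both conditions met
```

In this sketch the predictive correlation is high only when both inputs are high, which is the point of the quoted passage.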

Researchers used principal-level value-added (unadjusted and adjusted for prior achievement and other potentially biasing demographic variables) to more directly examine “the extent to which student achievement growth at a school differed from average growth statewide for students with similar prior achievement and background characteristics.” Also important to note is that the data they used to examine school-level value-added came from Pennsylvania, which is one of a handful of states that uses the popular and proprietary (and controversial) Education Value-Added Assessment System (EVAAS) statewide.

Here are the researchers’ key findings, taken directly from the study’s summary (again, for more information see the full manuscript here).

  • The two performance measures in this study that did not account for students’ past achievement—average achievement and adjusted average achievement—provided no information for predicting principals’ contributions to student achievement in the following year.
  • The two performance measures in this study that accounted for students’ past achievement—school value-added and adjusted school value-added—provided, at most, a small amount of information for predicting principals’ contributions to student achievement in the following year. This was due to instability and inaccuracy in the stable parts.
  • Averaging performance measures across multiple recent years did not improve their accuracy for predicting principals’ contributions to student achievement in the following year. In simpler terms, a principal’s average rating over three years did not predict his or her future contributions more accurately than did a rating from the most recent year only. This is more of a statistical finding than one that has direct implications for policy and practice (except for silly states who might, despite findings like those presented in this study, decide that they can use one year to do this not at all well instead of three years to do this not at all well).

Their bottom line? “…no available measures of principal [/school] performance have yet been shown to accurately identify principals [/schools] who will contribute successfully to student outcomes in future years,” especially if based on students’ test scores, although the researchers also assert that “no research has ever determined whether non-test measures, such as measures of principals’ leadership practices, [have successfully or accurately] predict[ed] their future contributions” either.

The researchers follow up with a highly cautionary note: “the value-added measures will make plenty of mistakes when trying to identify principals [/schools] who will contribute effectively or ineffectively to student achievement in future years. Therefore, states and districts should exercise caution when using these measures to make major decisions about principals. Given the inaccuracy of the test-based measures, state and district leaders and researchers should also make every effort to identify nontest measures that can predict principals’ future contributions to student outcomes [instead].”

Citation: Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from

New Empirical Evidence: Students’ “Persistent Economic Disadvantage” More Likely to Bias Value-Added Estimates

The National Bureau of Economic Research (NBER) recently released a circulated but not-yet internally or externally reviewed study titled “The Gap within the Gap: Using Longitudinal Data to Understand Income Differences in Student Achievement.” Note that we have covered NBER studies such as this in the past in this blog, so in all fairness and like I have noted in the past, this paper should also be critically consumed, as well as my interpretations of the authors’ findings.

Nevertheless, this study is authored by Katherine Michelmore — Assistant Professor of Public Administration and International Affairs at Syracuse University, and Susan Dynarski — Professor of Public Policy, Education, and Economics at the University of Michigan, and this study is entirely relevant to value-added models (VAMs). Hence, below I cover their key highlights and takeaways, as I see them. I should note up front, however, that the authors did not directly examine how the new measure of economic disadvantage that they introduce (see below) actually affects calculations of teacher-level value-added. Rather, they motivate their analyses by saying that calculating teacher value-added is one application of their analyses.

The background to their study is as follows: “Gaps in educational achievement between high- and low-income children are growing” (p. 1), but the data that are used to capture “high- and low-income” in the state of Michigan (i.e., the state in which their study took place) and many if not most other states throughout the US, capture “income” demographics in very rudimentary, blunt, and often binary ways (i.e., “yes” for students who are eligible to receive federally funded free-or-reduced lunches and “no” for the ineligible).

Consequently, in this study the authors “leverage[d] the longitudinal structure of these data sets to develop a new measure of persistent economic disadvantage” (p. 1), all the while defining “persistent economic disadvantage” by the extent to which students were “eligible for subsidized meals in every grade since kindergarten” (p. 8). Students “who [were] never eligible for subsidized meals during those grades [were] defined as never [being economically] disadvantaged” (p. 8), and students who were eligible for subsidized meals for variable years were defined as “transitorily disadvantaged” (p. 8). This all runs counter, however, to the binary codes typically used, again, across the nation.
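
The three categories described above can be sketched in a few lines. This is only an illustration of the definitions as quoted, not the authors’ code; the function names and the example students are hypothetical, and each student is represented by one eligibility flag per grade from kindergarten onward.

```python
from typing import Sequence


def classify_disadvantage(eligible_by_grade: Sequence[bool]) -> str:
    """Label a student from subsidized-meal eligibility flags, one per grade."""
    if len(eligible_by_grade) == 0:
        raise ValueError("need at least one grade of eligibility data")
    if all(eligible_by_grade):
        return "persistently disadvantaged"   # eligible in every grade
    if not any(eligible_by_grade):
        return "never disadvantaged"          # never eligible
    return "transitorily disadvantaged"       # eligible in some grades only


def grades_disadvantaged(eligible_by_grade: Sequence[bool]) -> int:
    """Number of grades in which the student was eligible."""
    return sum(bool(flag) for flag in eligible_by_grade)


# Hypothetical students, kindergarten through grade 8 (nine flags each).
students = {
    "A": [True] * 9,
    "B": [False] * 9,
    "C": [False] * 6 + [True] * 3,
}

for name, history in students.items():
    current_flag = history[-1]  # the binary indicator typically used
    print(name, classify_disadvantage(history),
          grades_disadvantaged(history), current_flag)
```

Note that students A and C share the same current-year flag, so the usual binary code treats them identically even though their histories differ.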

Appropriately, then, their goal (among other things) was to see how a new measure they constructed to better measure and capture “persistent economic disadvantage” might help when calculating teacher-level value-added. They accordingly argue (among other things) that, perhaps, not accounting for persistent disadvantage might subsequently cause more biased value-added estimates “against teachers of [and perhaps schools educating] persistently disadvantaged children” (p. 3). This, of course, also depends on how persistently disadvantaged students are (non)randomly assigned to teachers.

With statistics like the following as also reported in their paper: “Students [in Michigan] [persistently] disadvantaged by 8th grade were six times more likely to be black and four times more likely to be Hispanic, compared to those who were never disadvantaged,” their assertions speak volumes not only to the importance of their findings for educational policy, but also for the teachers and schools still being evaluated using value-added scores and the researchers investigating, criticizing, promoting, or even trying to make these models better (if that is possible). In short, though, teachers who disproportionately teach in urban areas, and who accordingly teach more persistently disadvantaged students, might realize relatively more biased value-added estimates as a result.

For value-added purposes, then, it is clear that the assumption that one can control for student disadvantage by using such basal indicators of current economic disadvantage is overly simplistic, and just using test scores to also account for this economic disadvantage (i.e., as promoted in most versions of the Education Value-Added Assessment System (EVAAS)) is likely worse. More specifically, the assumption that economic disadvantage does not impact some students more than others over time, or over the period of data being used to capture value-added (typically 3-5 years of students’ test score data), is also highly suspect. The finding “[t]hat children who are persistently disadvantaged perform worse than those who are disadvantaged in only some grades” (p. 14) also violates another fundamental assumption: that teachers’ effects are consistent over time for similar students who learn at more or less consistent rates over time, regardless of these and other demographics.

The bottom line here, then, is that the indicator that should be used instead of our currently used proxies for current economic disadvantage is the number of grades students spend in economic disadvantage. If the value-added indicator does not effectively account for the “negative, nearly linear relationship between [students’ test] scores and the number of grades spent in economic disadvantage” (p. 18), while controlling for other student demographics and school fixed effects, value-added estimates will likely be (even) more biased against teachers who teach these students as a result.

Otherwise, teachers who teach students with persistent economic disadvantages will likely have it worse (i.e., in terms of bias) than teachers who teach students with current economic disadvantages, teachers who teach students with economic disadvantage in their current or past histories will have it worse than teachers who teach students without (m)any prior economic disadvantages, and so on.

Citation: Michelmore, K., & Dynarski, S. (2016). The gap within the gap: Using longitudinal data to understand income differences in student achievement. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written more than 20 years ago (i.e., 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured on states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” Yet “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or out, as well.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” As it was then, so it still is today: “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.