The Gates Foundation’s Expensive ($335 Million) Teacher Evaluation Missteps

The header of an Education Week article released last week (click here) was that “[t]he Bill & Melinda Gates Foundation’s multi-million-dollar, multi-year effort aimed at making teachers more effective largely fell short of its goal to increase student achievement-including among low-income and minority students.”

An evaluation of the Gates Foundation’s Intensive Partnerships for Effective Teaching initiative, funded at $290 million and an extension of its Measures of Effective Teaching (MET) project funded at $45 million, was the focus of this article. The MET project was led by Thomas Kane (Professor of Education and Economics at Harvard and expert witness on the defendant’s side of the ongoing lawsuit supporting New Mexico’s MET project-esque statewide teacher evaluation system; see here and here), and both projects were primarily meant to hold teachers accountable using their students’ test scores via growth or value-added models (VAMs) and financial incentives. Both projects were tangentially meant to improve staffing and professional development opportunities, improve the retention of the teachers of “added value,” and ultimately lead to more-effective teaching and student achievement, especially in low-income schools and schools with higher relative proportions of racial minority students. The six-year evaluation of focus in this Education Week article was conducted by the RAND Corporation and the American Institutes for Research, and the evaluation was also funded by the Gates Foundation (click here for the evaluation report; see below for the full citation of this study).

Their key finding was that Intensive Partnerships for Effective Teaching district/school sites (see them listed here) implemented new measures of teaching effectiveness and modified personnel policies, but they did not achieve their goals for students.

Evaluators also found (see also here):

  • The sites succeeded in implementing measures of effectiveness to evaluate teachers and made use of the measures in a range of human-resource decisions.
  • Every site adopted an observation rubric that established a common understanding of effective teaching. Sites devoted considerable time and effort to train and certify classroom observers and to observe teachers on a regular basis.
  • Every site implemented a composite measure of teacher effectiveness that included scores from direct classroom observations of teaching and a measure of growth in student achievement.
  • Every site used the composite measure to varying degrees to make decisions about human resource matters, including recruitment, hiring, placement, tenure, dismissal, professional development, and compensation.

Overall, the initiative did not achieve its goals for student achievement or graduation, especially for low-income and racial minority students. With minor exceptions, student achievement, access to effective teaching, and dropout rates were also not dramatically better than they were for similar sites that did not participate in the intensive initiative.

Their recommendations were as follows (see also here):

  • Reformers should not underestimate the resistance that could arise if changes to teacher-evaluation systems have major negative consequences.
  • A near-exclusive focus on teacher evaluation systems such as these might be insufficient to improve student outcomes. Many other factors might also need to be addressed, ranging from early childhood education, to students’ social and emotional competencies, to the school learning environment, to family support. Dramatic improvement in outcomes, particularly for low-income and racial minority students, will likely require attention to many of these factors as well.
  • In change efforts such as these, it is important to measure the extent to which each of the new policies and procedures is implemented in order to understand how the specific elements of the reform relate to outcomes.


Stecher, B. M., Holtzman, D. J., Garet, M. S., Hamilton, L. S., Engberg, J., Steiner, E. D., Robyn, A., Baird, M. D., Gutierrez, I. A., Peet, E. D., de los Reyes, I. B., Fronberg, K., Weinberger, G., Hunter, G. P., & Chambers, J. (2018). Improving teaching effectiveness: Final report. The Intensive Partnerships for Effective Teaching through 2015–2016. Santa Monica, CA: The RAND Corporation. Retrieved from

More of Kane’s “Objective” Insights on Teacher Evaluation Measures

You might recall from a series of prior posts (see, for example, here, here, and here), the name of Thomas Kane — an economics professor from Harvard University who directed the $45 million worth of Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation, who also testified as an expert witness in two lawsuits (i.e., in New Mexico and Houston) opposite me (and in the case of Houston, also opposite Jesse Rothstein).

He, along with Andrew Bacher-Hicks (PhD Candidate at Harvard), Mark Chin (PhD Candidate at Harvard), and Douglas Staiger (Economics Professor of Dartmouth), just released yet another National Bureau of Economic Research (NBER) “working paper” (i.e., not peer-reviewed, and in this case not internally reviewed by NBER for public consumption and use either) titled “An Evaluation of Bias in Three Measures of Teacher Quality: Value-Added, Classroom Observations, and Student Surveys.” I review this study here.

Using Kane’s MET data, they test whether 66 mathematics teachers’ performance measured (1) by using teachers’ student test achievement gains (i.e., calculated using value-added models (VAMs)), classroom observations, and student surveys, and (2) under naturally occurring (i.e., non-experimental) settings “predicts performance following random assignment of that teacher to a class of students” (p. 2). More specifically, researchers “observed a sample of fourth- and fifth-grade mathematics teachers and collected [these] measures…[under normal conditions, and then in]…the third year…randomly assigned participating teachers to classrooms within their schools and then again collected all three measures” (p. 3).

They concluded that the test-based value-added measure “is a valid predictor of teacher impacts on student achievement following random assignment” (p. 28). This finding “is the latest in a series of studies” (p. 27) substantiating this not-surprising, as-oft-Kane-asserted finding, or as he might assert it, fact. I should note here that no other studies substantiating the “latest in a series of studies” (p. 27) claim are referenced or cited, but a quick review of the 31 total references included in this report shows that 16/31 (52%) were conducted by only econometricians (i.e., not statisticians or other educational researchers) on this general topic, of which 10/16 (63%) are not peer reviewed and of which 6/16 (38%) are either authored or co-authored by Kane (1/6 being published in a peer-reviewed journal). The other articles cited are about the measurements used, the general methods used in this study, and four other articles written on the topic not authored by econometricians. Needless to say, there is a clear slant in this piece, which is unfortunately not surprising; had this piece gone through any respectable vetting process, this sh/would have been caught and addressed prior to this study’s release.

I must add that this reminds me of Kane’s New Mexico testimony (see here) where he, again, “stressed that numerous studies [emphasis added] show[ed] that teachers [also] make a big impact on student success.” He stated this on the stand while expressly contradicting the findings of the American Statistical Association (ASA). While testifying otherwise, and again, he also only referenced (non-representative) studies in his (or rather defendants’) support, authored primarily by him (e.g., as per his MET studies) and some of his other econometric friends (e.g., Raj Chetty, Eric Hanushek, Doug Staiger), as also cited within this piece here. This was also a concern registered by the court, in terms of whether Kane’s expertise was that of a generalist (i.e., competent across multi-disciplinary studies conducted on the matter) or a “selectivist” (i.e., biased in terms of his prejudice against, or rather selectivity of certain studies for confirmation, inclusion, or acknowledgment). This is also certainly relevant, and should be taken into consideration here.

Otherwise, in this study the authors also found that the Mathematical Quality of Instruction (MQI) observational measure (one of two observational measures they used in this study, with the other one being the Classroom Assessment Scoring System (CLASS)) was a valid predictor of teachers’ classroom observations following random assignment. The MQI also did “not seem to be biased by the unmeasured characteristics of students [a] teacher typically teaches” (p. 28). This also expressly contradicts what is now an emerging set of studies evidencing the contrary, also not cited in this particular piece (see, for example, here, here, and here), some of which were also conducted using Kane’s MET data (see, for example, here and here).

Finally, authors’ evidence on the predictive validity of student surveys was inconclusive.

Needless to say…

Citation: Bacher-Hicks, A., Chin, M. J., Kane, T. J., & Staiger, D. O. (2017). An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from

Difficulties When Combining Multiple Teacher Evaluation Measures

A new study about multiple “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in the American Educational Research Association (AERA) sponsored and highly esteemed Educational Evaluation and Policy Analysis journal. You can find the free and full version of this study here.

In this study authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – at the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). Multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”
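To make the three families of models named above concrete, here is a minimal sketch of how conjunctive, disjunctive, and weighted (compensatory) rules can classify the same teacher differently. Everything in it is an assumption for illustration only: the 0-to-1 score scale, the 0.5 cut-score, the weights, and the teacher's scores are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class TeacherScores:
    """Three measures for one teacher, each rescaled to 0-1 (hypothetical scale)."""
    vam: float
    observation: float
    survey: float


CUT = 0.5  # hypothetical cut-score for an "effective" rating


def conjunctive(t: TeacherScores, cut: float = CUT) -> bool:
    """Effective only if every measure clears the cut-score."""
    return min(t.vam, t.observation, t.survey) >= cut


def disjunctive(t: TeacherScores, cut: float = CUT) -> bool:
    """Effective if at least one measure clears the cut-score."""
    return max(t.vam, t.observation, t.survey) >= cut


def compensatory(t: TeacherScores, weights: tuple, cut: float = CUT) -> bool:
    """Effective if the weighted composite clears the cut-score."""
    w_vam, w_obs, w_survey = weights
    composite = w_vam * t.vam + w_obs * t.observation + w_survey * t.survey
    return composite >= cut


# One hypothetical teacher: low value-added, stronger observation and survey scores.
teacher = TeacherScores(vam=0.30, observation=0.70, survey=0.60)

results = {
    "conjunctive": conjunctive(teacher),
    "disjunctive": disjunctive(teacher),
    # Hypothetical "policy" weights that emphasize value-added.
    "policy_weights": compensatory(teacher, (0.5, 0.3, 0.2)),
    # Equal (unit) weights across the three measures.
    "unit_weights": compensatory(teacher, (1 / 3, 1 / 3, 1 / 3)),
}

for model, effective in results.items():
    print(f"{model}: {'effective' if effective else 'not effective'}")
```

In this toy case the same teacher is rated not effective under the conjunctive and policy-weight rules but effective under the disjunctive and unit-weight rules, which is the kind of divergence in classifications the authors describe.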

This, of course, has huge implications for teacher evaluation systems as based upon multiple measures in that “accuracy” means “validity” and “valid” decisions cannot be made as based on “invalid” or “inaccurate” data that can so arbitrarily change. In other words, what this means is that likely never will a decision about a teacher being this or that actually mean this or that. In fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures combined are assumed to function otherwise. It is especially interesting, again and as stated prior, that the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with the attachment of consequences to teachers’ multiple evaluation measures, as assumed (by the state but not the state’s consultant?) to be accurate and correct (see, for example, here).

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively advised others to use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers” (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given all of this also makes the social science arbitrary, which is also a legal issue in and of itself, as also quasi noted.

Now, while I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, as unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary, how principal component, unit weight, policy weight, optimal prediction approaches vary), I will also note that forcing the fit of such multiple measures in such ways, especially without a thorough background in and understanding of reliability and validity and what reliability means for validity (i.e., with rather high levels of reliability required before any valid inferences and especially high-stakes decisions can be made) is certainly unwise.

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly cause pause, if not a justifiable grievance as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

Special Issue of “Educational Researcher” (Paper #9 of 9): Amidst the “Blooming Buzzing Confusion”

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the last of nine articles (#9 of 9), which is actually a commentary titled “Value Added: A Case Study in the Mismatch Between Education Research and Policy.” This commentary is authored by Stephen Raudenbush – Professor of Sociology and Public Policy Studies at the University of Chicago.

Like with the last two commentaries reviewed here and here, Raudenbush writes of the “Special Issue” that, in this topical area, “[r]esearchers want their work to be used, so we flirt with the idea that value-added research tells us how to improve schooling…[Luckily, perhaps] this volume has some potential to subdue this flirtation” (p. 138).

Raudenbush positions the research covered in this “Special Issue,” as well as the research on teacher evaluation and education in general, as being conducted amidst the “blooming buzzing confusion” (p. 138) surrounding the messy world through which we negotiate life. This is why “specific studies don’t tell us what to do, even if they sometimes have large potential for informing expert judgment” (p. 138).

With that being said, “[t]he hard question is how to integrate the new research on teachers with other important strands of research [e.g., effective schools research] in order to inform rather than distort practical judgment” (p. 138). Echoing Susan Moore Johnson’s sentiments, reviewed as article #6 here, this is appropriately hard if we are to augment versus undermine “our capacity to mobilize the “social capital” of the school to strengthen the human capital of the teacher” (p. 138).

On this note, and “[i]n sum, recent research on value added tells us that, by using data from student perceptions, classroom observations, and test score growth, we can obtain credible evidence [albeit weakly related evidence, referring to the Bill & Melinda Gates Foundation’s MET studies] of the relative effectiveness of a set of teachers who teach similar kids [emphasis added] under similar conditions [emphasis added]…[Although] if a district administrator uses data like that collected in MET, we can anticipate that an attempt to classify teachers for personnel decisions will be characterized by intolerably high error rates [emphasis added]. And because districts can collect very limited information, a reliance on district-level data collection systems will [also] likely generate…distorted behavior[s] [in] which teachers attempt to “game” the comparatively simple indicators,” or system (p. 138-139).

Accordingly, “[a]n effective school will likely be characterized by effective ‘distributed’ leadership, meaning that expert teachers share responsibility for classroom observation, feedback, and frequent formative assessments of student learning. Intensive professional development combined with classroom follow-up generates evidence about teacher learning and teacher improvement. Such local data collection efforts [also] have some potential to gain credibility among teachers, a virtue that seems too often absent” (p. 140).

This might be at least a significant part of the solution.

“If the school is potentially rich in information about teacher effectiveness and teacher improvement, it seems to follow that key personnel decisions should be located firmly at the school level…This sense of collective efficacy [accordingly] seems to be a key feature of…highly effective schools” (p. 140).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here; and see the Review of Article #8, Part I – on a more research-based assessment of VAMs’ potentials here and Part II on “a modest solution” provided to us by Linda Darling-Hammond here.

Article #9 Reference: Raudenbush, S. W. (2015). Value added: A case study in the mismatch between education research and policy. Educational Researcher, 44(2), 138-141. doi:10.3102/0013189X15575345




Laura Chapman: SLOs Continued

Within my last post, about “Student Learning Objectives (SLOs) [and] What (Little) We Know about Them…,” I requested more information about SLOs and Laura H. Chapman (whose work on SLOs was at the core of this prior post) responded with the paper also referenced in the prior post. This paper is about using SLOs as a proxy for value-added modeling (VAM) and is available for download here: The Marketing of Student Learning Objectives (SLOs)-1999-2014.

Chapman defines SLOs as “a version of the 1950s business practice known as management-by-objectives modified with pseudo-scientific specifications intended to create an aura of objectivity,” although “the apparent scientific precision of the SLO process [remains] an illusion.” In business, this occurs when “lower-level managers identify measurable goals and ‘targets’ to be met [and a] manager of higher rank approves the goals, targets, and measures,” after which performance pay is attained if and when the targets are met. In education, SLOs are to be used “for rating the majority of teachers not covered by VAM, including teachers in the arts and other ‘untested’ or ‘nontested’ subjects.” In education, SLOs are also otherwise called “student learning targets,” “student learning goals,” “student growth targets (SGOs),” or “SMART goals”—Specific, Measurable, Achievable, Results-oriented and Relevant, and Time-bound.

Why is this all happening in Chapman’s view? “This preoccupation with ratings and other forms of measurement is one manifestation of what I have called the econometric turn in federal and state policies. The econometric turn is most evident in the treatment of educational issues as managerial problems and the reification of metrics, especially test scores, as if these are objective, trustworthy, and essential for making educational decisions (Chapman, 2013).”

Chapman then reviews four reports funded by the US Department of Education that, despite a series of positive promotional attempts, altogether “point out the absence of evidence to support any use of SLOs other than securing teacher compliance with administrative mandates.” I also discussed this in my aforementioned post on this topic, but do read Chapman’s full report for more in-depth coverage.

Regardless, SLOs along with VAMs have become foundational to the “broader federal project to make pay-for-performance the national norm for teacher compensation.” Likewise, internal funders including the US Department of Education and their Reform Support Network (RSN), and external funders including but not limited to the Bill and Melinda Gates Foundation, Teach Plus, Center for Teacher Quality, Hope Street Group, Educators for Excellence, and Teachers United continue to fund and advance SLO + VAM efforts, despite the evidence, or lack thereof, especially in the case of SLOs.

As per Chapman, folks affiliated with these groups (and others) continue to push SLOs forward by focusing on four points in the hope of inducing increased compliance. These points include assertions that the SLO process (1) is collaborative, (2) is adaptable, (3) improves instruction (which has no evidence in support), and (4) improves student learning (which has no evidence in support). You can read more about each of these studies in Chapman’s report, linked to again here, and the evidence that exists (or not) per report.

Surveys + Observations for Measuring Value-Added

Following up on a recent post about the promise of Using Student Surveys to Evaluate Teachers using a more holistic definition of a teacher’s value added, I just read a chapter written by Ronald Ferguson (the creator of the Tripod student survey instrument and Tripod’s lead researcher) along with Charlotte Danielson (the creator of the Framework for Teaching and founder of The Danielson Group; see a prior post about this instrument here). Both instruments are “research-based,” both are used nationally and internationally, both are (increasingly being) used as key indicators to evaluate teachers across the U.S., and both were used throughout the Bill & Melinda Gates Foundation’s ($45 million worth of) Measures of Effective Teaching (MET) studies.

The chapter, titled “How Framework for Teaching and Tripod 7Cs Evidence Distinguish Key Components of Effective Teaching,” was recently published in a book all about the MET studies, titled “Designing Teacher Evaluation Systems: New Guidance from the Measures of Effective Teaching Project” and edited by Thomas Kane, Kerri Kerr, and Robert Pianta. The chapter is about whether and how data derived via the Tripod student survey instrument (i.e., as built on 7Cs: challenging students, control of the classroom, teacher caring, teachers confer with students, teachers captivate their students, teachers clarify difficult concepts, teachers consolidate students’ concerns) align with the data derived via Danielson’s Framework for Teaching, to collectively capture teacher effectiveness.

Another purpose for this chapter is to examine how both indicators also align with teacher-level value-added. Ferguson (and Danielson) find that:

  • Their two measures (i.e., the Tripod and the Framework for Teaching) are more reliable (and likely more valid) than value-added measures. The over-time, teacher-level classroom correlations, cited in this chapter, are r = 0.38 for value-added (which is comparable with the correlations noted in plentiful studies elsewhere), r = 0.42 for the Danielson Framework, and r = 0.61 for the Tripod student survey component. These “clear correlations,” while not strong particularly in terms of value-added, do indicate there is some common signal that the indicators are capturing, some stronger than the others (as should be obvious given the above numbers).
  • Contrary to what some (softies) might think, classroom management, not caring (i.e., the extent to which teachers care about their students and what their students learn and achieve), is the strongest predictor of a teacher’s value-added. However, the correlation (i.e., the strongest of the bunch) is still quite “weak” at an approximate r = 0.26, even though it is statistically significant. Caring, rather, is the strongest predictor of whether students are happy in their classrooms with their teachers.
  • In terms of “predicting” teacher-level value-added, and of the aforementioned 7Cs, the things that also matter “most” next to classroom management (although none of the coefficients are as strong as we might expect [i.e., r < 0.26]) include: the extent to which teachers challenge their students and have control over their classrooms.
  • Value-added in general is more highly correlated with teachers at the extremes in terms of their student survey and observational composite indicators.
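One common way to read the correlations listed above is to square them, which gives the proportion of variance that two sets of scores share under a simple linear relationship. The sketch below applies that standard rule of thumb to the four r values reported in the bullets; the interpretation as "shared variance" is my gloss, not a calculation the chapter's authors report.

```python
# Correlations as reported in the bullets above.
correlations = {
    "value-added (over time)": 0.38,
    "Danielson Framework (over time)": 0.42,
    "Tripod survey (over time)": 0.61,
    "classroom management vs. value-added": 0.26,
}

# Squaring r gives the share of variance two score sets have in common
# under a simple linear relationship.
shared_variance = {name: round(r ** 2, 2) for name, r in correlations.items()}

for name, share in shared_variance.items():
    print(f"{name}: r = {correlations[name]:.2f}, shared variance = {share:.0%}")
```

Read this way, a teacher's value-added in one year accounts for roughly 14% of the variance in value-added in another year, and the strongest survey predictor of value-added accounts for roughly 7%.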

In the end, the authors of this chapter do not disclose the actual correlations between their two measures and value-added, specifically (although from the appendix one can infer that the correlation between value-added and Tripod output is around r = 0.45 as based on an unadjusted r-squared). I should mention this is a HUGE shortcoming of this chapter (one that would not have passed peer review should this chapter have been submitted to a journal for publication). The authors do mention that “the conceptual overlap between the frameworks is substantial and that empirical patterns in the data show similarities.” Unfortunately again, however, they do not quantify the strength of said “similarities.” This only leaves us to assume that, since they were not reported, the actual strength of the similarities empirically observed between the measures was likely low (as is also evidenced in many other studies, although not as often with student survey indicators as opposed to observational indicators).
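For readers wondering how a correlation can be inferred from an unadjusted r-squared: with a single predictor, the magnitude of r is simply the square root of r-squared. The sketch below shows the arithmetic. Note that the r-squared value of 0.20 is an assumption, chosen only because it is consistent with the r of about 0.45 mentioned above; the appendix's actual figure is not reproduced in this post.

```python
import math

# ASSUMPTION: an illustrative unadjusted r-squared, not the appendix's actual value.
assumed_r_squared = 0.20

# With one predictor, |r| is the square root of r-squared.
# The square root recovers magnitude only, not the sign of the relationship.
implied_r = math.sqrt(assumed_r_squared)

print(f"implied |r| = {implied_r:.2f}")
```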

The final conclusion the authors of this chapter make is that educators should “cross-walk” the two frameworks (i.e., the Tripod and the Danielson Framework) and use both frameworks when reflecting on teaching. I must say I’m concerned about this recommendation as well, mainly given that it will cost states and districts more $$$, and the returns or “added value” (using the grandest definition of this term) of engaging in such an approach do not have the evidence one would need to adequately justify it.

State Tests: Instructional Sensitivity and (Small) Differences between Extreme Teachers

As per Standard 1.2. of the newly released 2014 Standards for Educational and Psychological Testing authored by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME):

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation. (p. 23)

After a series of conversations I have had, also recently with a colleague named Harris Zwerling, who is a researcher for the Pennsylvania State Education Association, I asked him to submit a guest blog about what he has observed working with Pennsylvania Value-Added Assessment System (PVAAS, aka the Education Value-Added Assessment System [EVAAS] administered by SAS-EVAAS) data in his state as per this standard. Here is what he wrote:

Recent Vamboozled posts have focused on one of the qualities of the achievement tests used as the key measures in value-added models (VAMs) and other “growth” models. Specifically, the instructional sensitivity of current student achievement tests has been questioned (see, for example, here). Evidence of whether the achievement tests in use can actually detect the influence of instruction on student test performance should, accordingly, be a central part of validity evidence required for VAMs [as per the specific standard also mentioned above]. After all, it is achievement test scores that provide the bases for VAM-based estimates. In addition, while many have noted that the statistical methods employed to generate value-added estimates will force a distribution of “teacher effectiveness” scores, it is not clear that the distinctions generated and measured have educational (or clinical) meaning.

This concern can be illustrated with an example taken from comments I previously made while serving on a Gates Foundation-funded stakeholders committee tasked with making recommendations for the redesign of Pennsylvania’s teacher and principal evaluation systems (see the document to which I refer here). Let me note that what follows is only suggestive of the possible limitations of some of Pennsylvania’s state achievement tests (the PSSAs) and is not offered as an example of an actual analysis of their instructional sensitivity. The value-added calculations, which I converted to raw scores, were performed by a research team from Mathematica Policy Research (for more information click here).

From my comments in this report:

“…8th grade Reading provided the smallest teacher effects from three subjects reported in Table III.1. The difference the [Final Report] estimates comparing the teacher at the 15th percentile of effectiveness to the average teacher (50th percentile) is -22 scaled score points on the 5th grade PSSA Reading test…[referring] to the 2010 PSSA Technical Manual raw score table… for the 8th grade Reading test, that would be a difference of approximately 2 raw score points, or the equivalent of 2 multiple choice (MC) questions (1 point apiece) or half credit on one OE [open-ended] question. (There are 40 MC questions and 4 (3 point) OE questions on the Reading test.) The entire range from the 15th percentile of effectiveness to the 85th percentile of effectiveness …covers approximately 3.5 raw score points [emphasis added].” (p. 9)… “[T]he range of teacher effectiveness covering the 5th to the 95th percentiles (73 scaled score points) represents approximately a 5.5 point change in the raw score (i.e., 5.5 of 52 total possible points) [emphasis added].” (p. 9)

The 5th to 95th percentile range corresponds to the effectiveness range used in the PVAAS.

“It would be difficult to see how the range of teacher effects on the 8th grade Reading test would be considered substantively meaningful if this result holds up in a larger and random sample. In the case of this subject and grade level test, the VAM does not appear to be able to detect a range of teacher effectiveness that is substantially more discriminating than those reported for more traditional forms of evaluation.” (p. 9-10)
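The raw-score arithmetic in the comments quoted above can be laid out in a few lines. This is only a sketch that restates the quoted figures (40 one-point MC questions, four 3-point OE questions, and the approximate raw-score spans for each percentile range); the variable names are mine, and nothing here comes from the PSSA Technical Manual directly.

```python
# Test structure as described in the quoted comments.
mc_questions, mc_points_each = 40, 1
oe_questions, oe_points_each = 4, 3
total_points = mc_questions * mc_points_each + oe_questions * oe_points_each

# Approximate raw-score spans quoted for each range of teacher effectiveness.
raw_point_spans = {
    "15th to 50th percentile": 2.0,
    "15th to 85th percentile": 3.5,
    "5th to 95th percentile": 5.5,
}

# Express each span as a percentage of the total possible raw score.
shares = {
    label: 100 * points / total_points
    for label, points in raw_point_spans.items()
}

print(f"total possible raw points: {total_points}")
for label, pct in shares.items():
    print(f"{label}: {raw_point_spans[label]} points = {pct:.1f}% of the test")
```

Put differently, on these figures the full 5th to 95th percentile range of estimated teacher effectiveness spans a little over a tenth of the points available on the test.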

Among other issues, researchers have considered which scaling properties are necessary for measuring growth (see, for example, here), whether the tests’ scale properties met the assumptions of the statistical models being used (see, for example, here), whether growth in student achievement is scale dependent (see, for example, here), and even whether tests that were vertically scaled could meet the assumptions required by regression-based models (see, for example, here).

Another issue raised by researchers is the fundamental question of whether state tests can be put to more than one use, that is, as valid measures of both student progress and teacher performance, or whether the multiple uses would inevitably lead to behaviors that invalidate the tests for either purpose (see, for example, here). All of this suggests that, before racing ahead to implement value-added components of teacher evaluation systems, states have an obligation to assure all involved that these psychometric issues have been explicitly addressed and that the tests used are properly validated for use in value-added measurements of “teacher effectiveness.”

So…How many states have validated their student achievement tests for use as measures of teachers’ performance? How many have produced evidence of their tests’ instructional sensitivity or of the educational significance of the distinctions made by the value-added models they use?

One major vendor of value-added measures (i.e., SAS, as in SAS-EVAAS) has long held that the tests need only satisfy three conditions: 1) that there is sufficient “stretch” in the scales “to ensure that progress could be measured for low-and high achieving students,” 2) that “the test is highly related to the academic standards,” and 3) that “the scales are sufficiently reliable from one year to the next” (see, for example, here). These assertions should be subject to independent psychometric verification, but they have not been. Rather, and notably, in all the literature it has produced regarding its VAMs, SAS does not indicate that it requires or conducts the tests that researchers have designed for determining instructional sensitivity. Nonetheless, SAS asserts that it can provide accurate measures of individual teacher effectiveness.

To date, Pennsylvania’s Department of Education (PDE) apparently has ignored the fact that the Data Recognition Corporation (DRC), the vendor of its state tests, has avoided making any claims that its tests are valid for use as measures of teacher performance (see, for example, here). In DRC’s most recent Technical Report chapter on validity, for example, the authors list the main purposes of the PSSA, the second of which is to “Provid[e] information on school and district accountability” (p. 283). The omission of teacher accountability here certainly stands out.

To date, again, none of the publicly released DRC technical reports for Pennsylvania’s state tests provides any indication that instructional sensitivity has been examined, either. Nor do the reports provide a basis for others to make their own judgment. In fact, they make it clear that historically the PSSA exams were designed for school-level accountability and only later moved toward measuring individual student mastery of Pennsylvania’s academic standards. Various PSSA tests will sample an entire academic standard with as few as one or two questions (see, for example, Appendix C here), which aligns with the evidence presented in the first section of this post.

In sum, VAMs, by design, force a distribution of “teacher effectiveness” even if the measured differences are substantively quite small, and much smaller than the lay observer or, in particular, the educational policymaker might otherwise have believed (in the case of the 8th grade PSSA Reading test, 5.5 of 52 total possible points covers the entire range from the bottom to the top categories of “teacher effectiveness”). It is therefore essential that both instructional sensitivity and educational significance be explicitly established. Otherwise, untrained users of the data may labor under the misapprehension that the differences observed provide a solid basis for making (low- and high-stakes) personnel decisions.

This is a completely separate concern from the fact that, at best, measured “teacher effects” may only explain from 1% to 15% or 20% of the variance in student test scores, a figure dwarfed by the variance explained by factors beyond the control of teachers. To date, SAS has not publicly reported the amount of variance explained by its PVAAS models for any of PA’s state tests. I believe every vendor of value-added models should report this information for every achievement test being used as a measure of “teacher effectiveness.” This, too, would provide essential context for understanding the substantive meaning of reported scores.
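As a rough illustration of the proportions involved, the sketch below takes the 1%, 15%, and 20% figures cited above and computes what is left over. The function names are hypothetical, and the "remainder" here is simply everything not attributed to teachers (which would include measurement error as well as out-of-school factors), so this is back-of-the-envelope context, not a decomposition from any actual model.

```python
# Shares of test-score variance attributed to teachers, as cited above.
teacher_shares = [0.01, 0.15, 0.20]


def remainder(share):
    """Variance share NOT attributed to teachers (all other sources combined)."""
    return 1.0 - share


def remainder_ratio(share):
    """How many times larger the remainder is than the teacher share."""
    return remainder(share) / share


for share in teacher_shares:
    print(
        f"teacher share {share:.0%}: "
        f"remainder {remainder(share):.0%}, "
        f"about {remainder_ratio(share):.1f} to 1"
    )
```

Even at the upper bound of 20%, the remainder is four times the size of the teacher share.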

Again…Does the Test Matter?

Following up on my most recent post about whether “The Test Matters,” I received an email from a colleague of mine at the University of Southern California (USC), Morgan Polikoff, whom I also referenced on this blog this past summer regarding a study he and Andrew Porter (University of Pennsylvania) released about the “Surprisingly Weak” correlations among VAMs and other teacher quality indicators. In this email he sent another study very similar to the one referenced above, about whether “The Test Matters.” But this study is titled “Does the Test Matter?” (see full reference below).

Given that this comes from a book chapter included within a volume capturing many of the Bill & Melinda Gates Measures of Effective Teaching (MET) studies (studies that have also been the sources of prior posts here and here), I thought it even more appropriate to share this with you all, given that book chapters are sometimes hard to find and access.

Like the authors of “The Test Matters,” Polikoff used MET data to investigate whether large-scale standardized state tests “differ in the extent to which they reflect the content or quality of teachers’ instruction” (i.e., tests’ instructional sensitivity). He also investigated whether differences in instructional sensitivity affect recommendations made in the MET reports for creating multiple-measure evaluation systems.

Polikoff found that “state tests indeed vary considerably in their correlations with observational and student survey measures of effective teaching.” These correlations, as they were in the previous article/post cited above, were statistically significant, positive, and fell in the very weak range in mathematics (0.15 < r < 0.19) and the very weak range in English/language arts (0.07 < r < 0.15). These, and other noted correlations that approach zero (i.e., r = 0, or no correlation at all), all indicate, as per Polikoff, “weak sensitivity to instructional quality.”

Put differently, while this indicates that the extent to which state tests are instructionally sensitive appears slightly more favorable in mathematics versus English/language arts, we might otherwise conclude that the state tests used in at least the MET partner states “are only weakly sensitive to pedagogical quality, as judged by the MET study instruments” – the instruments that in many ways are “the best” we have thus far to offer in terms of teacher evaluation (i.e., observational, and student survey instruments).

But why does instructional sensitivity matter? Because the inferences made from these state test results, independently or, more likely, post VAM calculation, “rely on the assumption that [state test] results accurately reflect the instruction received by the students taking the test. This assumption is at the heart of [investigations] of instructional sensitivity.” See another related (and still troublesome) post about instructional sensitivity here.

Polikoff, M. S. (2014). Does the Test Matter? Evaluating teachers when tests differ in their sensitivity to instruction. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project (pp. 278-302). San Francisco, CA: Jossey-Bass.

“The Test Matters”

Last month in Educational Researcher (ER), researchers Pam Grossman (Stanford University), Julie Cohen (University of Virginia), Matthew Ronfeldt (University of Michigan), and Lindsay Brown (Stanford University) published a new article titled: “The Test Matters: The Relationship Between Classroom Observation Scores and Teacher Value Added on Multiple Types of Assessment.” Click here to view a prior version of this document, pre-official-publication in ER.

Building upon what the research community has consistently evidenced in terms of the generally modest to low correlations (or relationships) observed between observational data and VAMs, researchers in this study set out to investigate these relationships in more depth.

Using the Bill & Melinda Gates Foundation-funded Measures of Effective Teaching (MET) data, researchers examined the extent to which the (cor)relationships between one specific observation tool designed to assess instructional quality in English/language arts (ELA) – the Protocol for Language Arts Teaching Observation (PLATO) – and value-added model (VAM) output changed when two different tests were used to assess student achievement using VAMs. One set of tests included the states’ tests (depending on where teachers in the sample were located by state), and the other test was the Stanford Achievement Test (SAT-9) Open-Ended ELA test on which students are required not only to answer multiple choice items but to also construct responses to open or free response items. This is more akin to what is to be expected with the new Common Core tests, hence, this is significant looking forward.

Researchers found, not surprisingly given the aforementioned prior research (see, for example, Papay’s similar and seminal 2010 work here), that the relationship between teachers’ PLATO observational scores and VAM output scores varied given which of the two aforementioned tests was used to calculate VAM output. In addition, researchers found that both sets of correlation coefficients were low, which also aligns with current research, but in this case it might be more accurate to say the correlations they observed were very low.

More descriptively, PLATO was more highly correlated with the VAM scores derived using the SAT-9 (r = 0.16) versus the states’ tests (r = 0.09), and these differences were statistically significant. These differences also varied by the type of teacher, the type of “teacher effectiveness” domain observed, etc. To read more about the finer distinctions the authors make in this regard, please click, again, here.

These results also provide “initial evidence that SAT-9 VAMs may be more sensitive to the kinds of instruction measured by PLATO.” In other words, the SAT-9 test may be more sensitive than the state tests all states currently use (post the implementation of No Child Left Behind [NCLB] in 2002). These are the tests that are to align “better” with state standards, but on which multiple-choice items (that often drive multiple-choice learning activities) are much more, if not solely, valued. This is an interesting conundrum, to say the least, as we anticipate the release of the “new and improved” set of Common Core tests.

Researchers also found some preliminary evidence that teachers who scored relatively higher on the “Classroom Environment” observational factor demonstrated “better” value-added scores across tests. Note here again, however, that the correlation coefficients were also quite weak (r = 0.15 for the SAT-9 VAM and r = 0.13 for the state VAM). Researchers also found preliminary evidence that only the VAM that used the SAT-9 test scores was significantly related to the “Cognitive and Disciplinary Demand” factor, although again the correlation coefficients were very weak (r = 0.12 for the SAT-9 VAM and r = 0.05 for the state VAM).
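One way to get a feel for how weak these correlations are is to square them: for a simple linear relationship, r squared is the proportion of variance the two measures share. The sketch below applies that standard conversion to the coefficients reported above; the labels and the function name are mine, and the interpretation assumes an ordinary Pearson correlation, which is an assumption on my part about how the coefficients were computed.

```python
# Correlations (r) between PLATO observation scores and VAM output,
# as reported above for "The Test Matters."
reported_r = {
    "PLATO overall, SAT-9 VAM": 0.16,
    "PLATO overall, state VAM": 0.09,
    "Classroom Environment, SAT-9 VAM": 0.15,
    "Classroom Environment, state VAM": 0.13,
    "Cognitive and Disciplinary Demand, SAT-9 VAM": 0.12,
    "Cognitive and Disciplinary Demand, state VAM": 0.05,
}


def shared_variance_pct(r):
    """Percentage of variance shared by two measures correlated at r (r squared)."""
    return 100.0 * r ** 2


for label, r in reported_r.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance_pct(r):.2f}%")
```

Under that reading, none of the reported coefficients corresponds to even 3% of shared variance between the observational scores and the VAM output.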

That being said, the “big ideas” I think we can take away from this study are namely that:

  1. Which tests we use to construct value-added scores matters because different tests yield different results (see also the Papay 2010 reference cited above). Accordingly, “researchers and policymakers [DO] need to pay careful attention to the assessments used to measure student achievement in designing teacher evaluation systems” as these decisions will [emphasis added] yield different results.
  2. Related, certain tests are going to be more instructionally sensitive than others. Click here and here for two prior posts about this topic.
  3. In terms of using observational data, in general, “[p]erhaps the best indicator[s] of whether or not students are learning to engage in rigorous academic discourse…[should actually come from] classroom observations that capture the extent to which students are actually participating in [rigorous/critical] discussions” and their related higher-order thinking activities. This line of thinking aligns, as well, with a recent post on this blog titled “Observations: Where Most of the Action and Opportunities Are.”

Consumer Alert: Researchers in New Study Find “Surprisingly Weak” Correlations among VAMs and Other Teacher Quality Indicators

Two weeks ago, an article in the U.S. News and World Report (as well as similar articles in Education Week and linked to from the homepage of the American Educational Research Association [AERA]) highlighted the results of recent research conducted by University of Southern California’s Morgan Polikoff and University of Pennsylvania’s Andrew Porter. The research article was released online here, in the AERA-based, peer-reviewed, and highly esteemed journal Educational Evaluation and Policy Analysis.
As per the study’s abstract, the researchers found (and their peer reviewers apparently agreed) that the associations between teachers’ instructional alignment and their contributions to student learning and their effectiveness on VAMs, using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study, were “surprisingly weak,” or as per the aforementioned U.S. News and World Report article, “weak to nonexistent.”
Specifically, researchers analyzed the (cor)relationships among VAM estimates and observational data, student survey data, and other data pertaining to whether teachers aligned their instruction with state standards. They did this using data taken from 327 fourth and eighth grade math and English teachers in six school districts, again as derived via the aforementioned MET study.
Researchers concluded that “there were few if any correlations that large [i.e., greater than r = 0.3] between any of the indicators of pedagogical quality and the VAM scores. Nor were there many correlations of that magnitude in the main MET study. Simply put, the correlations of value-added with observational measures of pedagogical quality, student survey measures, and instructional alignment were small” (Polikoff & Porter, 2014, p. 13).
Interestingly enough, the research I recently conducted with my current doctoral student (see Paufler & Amrein-Beardsley, here) was used to supplement these researchers’ findings. In Education Week, the article’s author, Holly Yettick, wrote the following:
In addition to raising questions about the sometimes weak correlations between value-added assessments and other teacher-evaluation methods, researchers continue to assess how the models are created, interpreted, and used.
In a study that appears in the current issue of the American Educational Research Journal, Noelle A. Paufler and Audrey Amrein-Beardsley, a doctoral candidate and an associate professor at Arizona State University, respectively, conclude that elementary school students are not randomly distributed into classrooms. That finding is significant because random distribution of students is a technical assumption that underlies some value-added models.
Even when value-added models do account for nonrandom classroom assignment, they typically fail to consider behavior, personality, and other factors that profoundly influenced the classroom-assignment decisions of the 378 Arizona principals surveyed. That, too, can bias value-added results.