Morality, Validity, and the Design of Instructionally Sensitive Tests by David Berliner

A recent article in Education Week was written by David C. Berliner, Regents Professor Emeritus at Arizona State University. As it is pertinent to our purposes here, I have included his piece in full below, along with the direct link to it here.

Please pay particular attention to his points about (1) “instructional sensitivity” and why this relates to the tests we currently use to calculate value-added, (2) attribution and whether valid inferences can be made about teachers in isolation from other competing variables, and (3) teachers’ actual potential to have “effects” on test scores versus “effects” on students’ lives otherwise.

Moral Reasons for Using Appropriate Tests to Evaluate Teachers and Schools

The first reason for caring about how sensitive our standardized tests are to instruction is moral. If the tests we use to judge the effects of instruction on student learning are not sensitive to differences in the instructional skills of teachers, then teachers will be seen as less powerful than they might actually be in affecting student achievement. This would not be fair. Thus, instructionally insensitive tests give rise to concerns about fairness, a moral issue.

Additionally, we need to be concerned about whether the scores obtained on instructionally insensitive tests are used consequentially, for example, to judge a teacher’s performance, with the possibility of the teacher being fired or rewarded. If that is the case, then we move from the moral issue of fairness in trying to assess the contributions of teachers to student achievement, to the psychometric issue of test validity: What inferences can we make about teachers from the scores students get on a typical standardized test?

Validity Reasons for Using Appropriate Tests to Evaluate Teachers and Schools

What does a change in a student’s test score over the course of a year actually mean? To whom or to what do we attribute the changes that occur? If the standardized tests we use are not sensitive to instruction by teachers, yet still show growth in achievement over a year, the likely causes of such growth will be attributed to other influences on our nation’s students. These would be school factors other than teachers–say qualities of the peer group, or the textbook, or the principal’s leadership. Or such changes might be attributed to outside-of-school factors, such as parental involvement in schooling and homework, income and social class of the neighborhood in which the child lives, and so forth.

Currently, all the evidence we have is that teachers are not particularly powerful sources of influence on aggregate measures of student achievement such as mean scores of classrooms on standardized tests. Certainly teachers do, occasionally and to some extent, affect the test scores of everyone in a class (Pedersen, Faucher, & Eaton, 1978; Barone, 2001). And teachers can make a school or a district look like a great success based on average student test scores (Casanova, 2010; Kirp, 2013). But exceptions do not negate the rule.

Teachers Account for Only a Little Variance in Students’ Test Scores

Teachers are not powerful forces in accounting for the variance we see in the achievement test scores of students in classrooms, grades, schools, districts, states and nations. Teachers, it turns out, affect individuals a lot more than they affect aggregate test scores, say, the means of classrooms, schools or districts.

A consensus is that outside-of-school factors account for about 60% of the variance in student test scores, while schools account for about 20% of that variance (Haertel, 2013; Borman and Dowling, 2012; Coleman et al., 1966). Further, about half of the variance accounted for by schools is attributed to teachers. So, on tests that may be insensitive to instruction, teachers appear to account for about 10% of the variance we see in student achievement test scores (American Statistical Association, 2014). Thus, outside-of-school factors appear six times more powerful than teachers in affecting student achievement.
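The arithmetic behind this variance decomposition can be sketched as follows (the percentages are simply the rounded figures cited above, not new data):

```python
# Approximate shares of variance in student achievement test scores,
# per the consensus estimates cited above.
outside_school = 0.60     # outside-of-school factors
schools = 0.20            # school factors overall
teachers = schools / 2    # about half of the school share is attributed to teachers

print(f"Teacher share of variance: {teachers:.0%}")                      # 10%
print(f"Outside-of-school vs. teachers: {outside_school / teachers:.0f}x")  # 6x
```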

How Instructionally Sensitive Tests Might Help

What would teacher effects on student achievement test scores be were tests designed differently? We don’t know because we have no information about the sensitivity of the tests currently used to detect teacher differences in instructional competence. Teachers judged excellent might be able to screen items for instructional sensitivity during test design. That might be helpful. Even better, I think, might be cognitive laboratories, in which teachers judged to be excellent provide instruction to students on curriculum units appropriate for a grade. The test items showing pre-post gains–items empirically found to be sensitive to instruction–could be chosen for the tests, while less sensitive items would be rejected.

Would the percent of variance attributed to teachers be greater if the tests used to judge teachers were more sensitive to instruction? I think so. Would the variance accounted for by teachers be a lot greater? I doubt that. But if the variance accounted for by teachers went up from 10% to 15%, then teacher effects would be estimated to be 50% greater than currently. And that is the estimate of teachers’ effects over just one year. Over twelve years, teachers clearly can play an influential role on aggregate data, as well as continuing to be a powerful force on their students, at an individual level. In sum, only with instructionally sensitive tests can we be fair to teachers and make valid inferences about their contributions to student growth. 
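The “50% greater” figure above is a relative, not absolute, increase, which a line of arithmetic makes plain (the 15% figure is Berliner’s hypothetical, not an empirical estimate):

```python
current = 0.10        # teachers' current estimated share of variance
hypothetical = 0.15   # hypothetical share under a more instructionally sensitive test

# A 5-percentage-point absolute gain is a 50% relative increase over the baseline.
relative_increase = (hypothetical - current) / current
print(f"Relative increase in estimated teacher effects: {relative_increase:.0%}")  # 50%
```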

Reference: Berliner, D. C. (2014). Morality, validity, and the design of instructionally sensitive tests. Education Week. Retrieved from

Brookings Institution Paper on VAMs and Their More Important Observational Counterparts

A paper recently released by the Brookings Institution (click here to read the paper in full) has “added” some “value” to our thinking about the use of observations in the teacher evaluation systems at topic here, and in most educational policy circles these days.

Researchers, situating their work in the federal context surrounding these systems (including the more than $4 billion released to 19 states via Race to the Top, as well as the 43 NCLB waivers also granted), examined the observational systems rather than the VAMs themselves, as these observational systems typically accompany the VAM components in these (oft-high-stakes) systems.

Researchers found that the observational components of these systems certainly “add” more “value” than their VAM counterparts. This is true largely because, as researchers found in their analyses, only 22% of teachers were VAM-eligible (which is significantly lower than others’ estimates that usually hang around 30%). In this case, observational systems were the only real “hard data” available for the other 78% of teachers across school sites. In addition, researchers found (although we knew this previously) that “classroom observations have the potential of providing formative feedback to teachers that [theoretically] helps them improve their practice [more than VAMs]…[because] feedback from [VAMs]…is often too delayed and vague to produce improvement in teaching.”

Researchers do note, however, that “improvements are needed” if such observational systems are to carry the weight for which they are currently being tasked, again provided the above and their findings below:

  1. Observational biases also exist, as per their research, whereby teachers non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students who enter their classrooms with lower levels of prior achievement;
  2. Relatedly, districts “do not have processes in place to address the possible biases in observational scores,” not to mention the VAM scores with which they are most often combined across contemporary systems. While the authors suggest that statistical adjustments be made, I’m not certain I agree, for the same research-based reasons I don’t agree with this approach with VAMs — no matter how sophisticated the statistics, they do not effectively control for such bias;
  3. “The reliability of both value-added measures and demographic-adjusted teacher evaluation scores is dependent on sample size, such that these measures will be less reliable and valid when calculated in small districts than in large districts.” While the authors recommend that states also control for this statistically, I would default to reality, whereby districts interpreting and using these data should certainly keep this in mind, especially if they are set to make high-stakes decisions based on these data, yet another set of still faulty indicators;
  4. “Observations conducted by outside observers are more valid than observations conducted by school administrators. At least one observation of a teacher each year should [be] conducted by a trained observer from outside the teacher’s school who does not have substantial prior knowledge of the teacher being observed.” On this I would agree, but I would still recommend that observers either represent or fully understand the full context in which teachers are to be observed before passing “more objective” judgments about them;
  5. And finally, their best recommendation…wait for it…that “[t]he inclusion of a school value-added component in teachers’ evaluation scores negatively impacts good teachers in bad schools and positively impacts bad teachers in good schools. This measure should be eliminated or reduced to a low weight in teacher evaluation programs.”

Citation: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from

Chetty Et Al…Are “Et” It “Al” Again!

As per a recent post about Raj Chetty, the fantabulous study he and his colleagues (i.e., Chetty, Friedman, and Rockoff) “Keep on Giving…,” and a recent set of emails exchanged among Raj, Diane, and me about their (implied) Nobel-worthy work in this area, given said fantabulous study, it seems Chetty et al are “et” it “al” again!

Excuse my snarkiness here, as I am still in shock that they took it upon themselves to “take on” the American Statistical Association. Yes, the ASA, as critiqued by Chetty et al. If that’s not a sign of how Chetty et al feel about their work in this area, then I don’t know what is.

Given my already close proximity to this debate, I thought it wise to consult a colleague of mine–one who, in fact, is an economist (like Chetty et al) and who is also an expert in this area–to write a response to their newly released “take” on the ASA’s recently released “Statement on Using Value-Added Models for Educational Assessment.” Margarita Pivovarova is also an economist conducting research in this area, and I value her perspective more than many other potential reviewers out there, as she is both smart and wise when it comes to the careful use and interpretation of data and methods like those at the focus of these papers. Here’s what she wrote:

What an odd set of opponents in the interpretation of the ASA statement, this one apparently being a set of scholars who self-selected to answer what they seemingly believed others were asking; that is, “What say Chetty et al on this?”

That being said, the overall tone of this response reminds me of a “response to the editor,” as if, in a more customary manner, the ASA had reviewed [the aforementioned] Chetty et al paper and Chetty et al had been invited to respond to the ASA. Interestingly, though, Chetty et al seemed compelled to take a self-assigned “expert” stance on the ASA’s statement, regardless of the fact that this statement represents the expertise and collective stance of a whole set of highly regarded and highly deserving scholars in this area, who quite honestly would NOT have asked for Chetty et al’s advice on the matter.

This is the ASA, and these are the ASA experts Chetty et al also explicitly marginalize in the reference section of their critique. [Note for yourselves the major lack of alignment between Chetty et al.’s references and the 70 articles linked to here on this website.] This should give others pause as to why Chetty et al are so extremely selective in their choices of relevant literature – the literature on which they rely “to prove” some of their own statements and “to prove” some of the ASA’s statements false or suspect. The largest chunk of the best and most relevant literature [as again linked to here] is completely left out of their “critique.” This also includes the whole set of peer-reviewed articles published in educational and education finance journals. A search on ERIC (The Educational Resources Information Center) with “value-added” as key words for the last 5 years yields 406 entries, and a similar search in JSTOR (a shared digital library) returns 495. This is quite a deep and rich literature to leave out, outright and in its entirety, of Chetty et al’s purportedly more informed and expert take on this statement, ESPECIALLY given that many in the ASA wrote these aforementioned, marginalized articles! While Chetty et al critique the ASA for not including citations, this is quite customary in a position statement that is meant to be research-based, but not cluttered with the hundreds of articles in support of the association’s position.

Another point of contention surrounds ASA Point #5: “VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.” Although I should mention that this was taken from within the ASA statement, interpreted, and then defined by Chetty et al as “ASA Point #5.” Notwithstanding, the answer to “ASA Point #5” guides policy implications, so it is important to consider: Do value-added estimates yield causal interpretations as opposed to just descriptive information? In 2004, the editors of the American Educational Research Association’s (AERA) Journal of Educational and Behavioral Statistics (JEBS) published a special issue entirely devoted to value-added models (VAMs). One notable publication in that issue was written by Rubin, Stuart, and Zanutto (2004). Any discussion of the causal interpretations of VAMs cannot be devoid of this foundational article and the potential outcomes framework, or Rubin Causal Model, advanced within this piece and elsewhere (see also Holland, 1986). These authors clearly communicated that value-added estimates cannot be considered causal unless a set of “heroic assumptions” is agreed to and imposed. Accordingly, and pertinent here, “anyone familiar with education will realize that this [is]…fairly unrealistic.” Even back then, Rubin et al suggested we switch gears and instead evaluate interventions and reward incentives based on the descriptive qualities of the indicators and estimates derived via VAMs. This is something Chetty et al overlooked here and continue to overlook and neglect in their research.

Another concern surrounds how Chetty et al suggest “we” deal with the interpretation of VAM estimates (what they define as ASA Points #2 and #3). The grander question here is: Does one need to be an expert to understand the “truth” behind the numbers? If we, in general, are talking about those who are in charge of making decisions based on these numbers, then the answer is yes, or they should at least be aware of what the numbers do and do not do! While different levels of sophistication are probably required to develop VAMs versus interpret VAM results, interpretation is a science in itself and ultimately requires background knowledge about the model’s design and how the estimates were generated if informed, accurate, and valid inferences are to be drawn. While it is true that using an iPad doesn’t require one to understand the technology behind it, the same cannot be said about the interpretation of this set of statistics upon which people’s lives increasingly depend.

Relatedly, the ASA rightfully states that the standardized test scores used in VAMs are not the only outcomes that should be of interest to policy makers and stakeholders (ASA Point #4). In view of recent findings on the role of cognitive versus non-cognitive skills in students’ future well-being, the current agreement is that test scores might not even be one of the most important outcomes capturing a student’s educated self. Addressing this point, Chetty et al, rather, demonstrate how “interpretation matters.” Jackson (2013) [not surprisingly cited by Chetty et al, likely because his study was published by the National Bureau of Economic Research (NBER) – the same organization that published their now infamous study without it being peer or even internally reviewed way back when] used value-added techniques to estimate teacher effects on the non-cognitive and long-term outcomes of students. One of his main findings was that teachers who are good at boosting test scores are not always the same teachers who have positive and long-lasting outcomes on non-cognitive skills acquisition. In fact, value-added to test scores and to non-cognitive outcomes for the same teacher were then and have since been shown to be only weakly correlated [or related] with one another. Regardless, Chetty et al use the results of standardized test scores to rank teachers and then use those rankings to estimate the effects of “highly effective” teachers on the long-run outcomes of students, REGARDLESS OF THE RESEARCH in which their study should have been situated (excluding the work of Thomas Kane and Eric Hanushek). As noted before, if value-added estimates from standardized test scores cannot be interpreted as causal, then the effect of “high value-added” teachers on college attendance, earnings, and reduced teenage birth rates CANNOT BE CONSIDERED CAUSAL either.

Lastly, the ASA expressed concerns about the sensitivity of value-added estimates to model specifications (ASA Point #6). Likewise, researchers contributing to a now large body of literature have also recently explored this same issue, and collectively they have found that value-added estimates are highly sensitive to the assessments being used, even within the same subject areas (Papay, 2011), and across different subject areas taught by the same teachers given different student compositions (Loeb & Candelaria, 2012). Echoing those findings, others (also ironically cited by Chetty et al in the discussion to demonstrate the opposite) have found that “even when the correlations between performance estimates generated by different models are quite high for the workforce as a whole, there are still sizable differences in teacher rankings generated by different models that are associated with the composition of students in a teacher’s classroom” (Goldhaber, Walch, & Gabele, 2014). And others have noted that “models that are less aggressive in controlling for student-background and schooling-environment information systematically assign higher rankings to more-advantaged schools, and to individuals who teach at these schools” (Ehlert, Koedel, Parsons, and Podgursky, 2014).

In sum, these are only a few “points” from this “point-by-point discussion” that would, or rather should, immediately strike anyone fairly familiar with the debate surrounding the use and abuse of VAMs as way off-base. While the debates surrounding VAMs continue, it is all too easy for a casual and/or careless observer to lose the big picture behind the surrounding debates and get distracted by technicalities. This certainly seems to be the case here, as Chetty et al continue to argue their technical arguments from a million miles away from the actual schools’, teachers’, and students’ data they study, and from which they make miracles.

*To see VAMboozled!’s initial “take” on the same ASA statement, a take that in our minds still stands as it was not a self-promotional critique but only a straightforward interpretation for our followers, please click here.

Consumer Alert: Researchers in New Study Find “Surprisingly Weak” Correlations among VAMs and Other Teacher Quality Indicators

Two weeks ago, an article in the U.S. News and World Report (as well as similar articles in Education Week and linked to from the homepage of the American Educational Research Association [AERA]) highlighted the results of recent research conducted by University of Southern California’s Morgan Polikoff and University of Pennsylvania’s Andrew Porter. The research article was released online here, in the AERA-based, peer-reviewed, and highly esteemed journal: Educational Evaluation and Policy Analysis.
As per the study’s abstract, the researchers found (and their peer reviewers apparently agreed) that the extent to which teachers’ instructional alignment was associated with their contributions to student learning and their effectiveness on VAMs, using data from the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study, was “surprisingly weak,” or as per the aforementioned U.S. News and World Report article, “weak to nonexistent.”
Researchers, specifically, analyzed the (co)relationships among VAM estimates and observational data, student survey data, and other data pertaining to whether teachers aligned their instruction with state standards. They did this using data taken from 327 fourth and eighth grade math and English teachers in six school districts, again as derived via the aforementioned MET study.
Researchers concluded that “there were few if any correlations that large [i.e., greater than r = 0.3] between any of the indicators of pedagogical quality and the VAM scores. Nor were there many correlations of that magnitude in the main MET study. Simply put, the correlations of value-added with observational measures of pedagogical quality, student survey measures, and instructional alignment were small” (Polikoff & Porter, 2014, p. 13).
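To make concrete what a correlation below the authors’ r = 0.3 benchmark looks like, here is a minimal sketch of computing a Pearson correlation between two teacher-quality indicators and checking it against that threshold. The numbers are made up for illustration; they are not the MET data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical VAM scores and observational ratings for five teachers
# (illustrative values only, chosen to show a weak relationship).
vam = [0.2, -0.1, 0.4, 0.0, -0.3]
obs = [3.3, 3.3, 3.1, 3.4, 2.9]

r = pearson_r(vam, obs)
print(f"r = {r:.2f}; exceeds 0.3 threshold: {abs(r) > 0.3}")  # r = 0.28; False
```

A correlation of this size means the two indicators share well under 10% of their variance (r squared), which is why Polikoff and Porter characterize such relationships as small.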
Interestingly enough, the research I recently conducted with my current doctoral student (see Paufler & Amrein-Beardsley, here) was used to supplement these researchers’ findings. In Education Week, the article’s author, Holly Yettick, wrote the following:
In addition to raising questions about the sometimes weak correlations between value-added assessments and other teacher-evaluation methods, researchers continue to assess how the models are created, interpreted, and used.
In a study that appears in the current issue of the American Educational Research Journal, Noelle A. Paufler and Audrey Amrein-Beardsley, a doctoral candidate and an associate professor at Arizona State University, respectively, conclude that elementary school students are not randomly distributed into classrooms. That finding is significant because random distribution of students is a technical assumption that underlies some value-added models.
Even when value-added models do account for nonrandom classroom assignment, they typically fail to consider behavior, personality, and other factors that profoundly influenced the classroom-assignment decisions of the 378 Arizona principals surveyed. That, too, can bias value-added results.