“The Test Matters”

Last month in Educational Researcher (ER), researchers Pam Grossman (Stanford University), Julie Cohen (University of Virginia), Matthew Ronfeldt (University of Michigan), and Lindsay Brown (Stanford University) published a new article titled “The Test Matters: The Relationship Between Classroom Observation Scores and Teacher Value Added on Multiple Types of Assessment.” Click here to view a prior, pre-publication version of the article.

Building on the research community's consistent finding of generally modest to low correlations (or relationships) between observational data and VAMs, the researchers in this study set out to investigate these relationships further, and in more depth.

Using the Bill & Melinda Gates Foundation-funded Measures of Effective Teaching (MET) data, researchers examined the extent to which the (cor)relationships between one specific observation tool designed to assess instructional quality in English/language arts (ELA) – the Protocol for Language Arts Teaching Observation (PLATO) – and value-added model (VAM) output changed when two different tests were used to assess student achievement using VAMs. One set of tests included the states’ tests (depending on where teachers in the sample were located by state), and the other test was the Stanford Achievement Test (SAT-9) Open-Ended ELA test on which students are required not only to answer multiple choice items but to also construct responses to open or free response items. This is more akin to what is to be expected with the new Common Core tests, hence, this is significant looking forward.

The researchers found, not surprisingly given the prior research noted above (see, for example, Papay’s similar and seminal 2010 work here), that the relationship between teachers’ PLATO observation scores and their VAM output varied depending on which of the two tests was used to calculate the VAM scores. They also found that both sets of correlation coefficients were low, which likewise aligns with current research, though in this case it might be more accurate to say the correlations they observed were very low.

More specifically, PLATO was more highly correlated with the VAM scores derived using the SAT-9 (r = 0.16) than with those derived using the states’ tests (r = 0.09), and this difference was statistically significant. The differences also varied by the type of teacher, the type of “teacher effectiveness” domain observed, and so on. To read more about the finer distinctions the authors make in this regard, please click, again, here.
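
For readers curious how a difference between two correlations might be tested mechanically, below is a minimal sketch using a Fisher z-transformation. Note the simplification: this version treats the two correlations as coming from independent samples, whereas the study's correlations share the same teachers, so the authors' actual test would need to account for that dependence. The sample sizes are placeholders, not figures from the study.

```python
# Minimal sketch: compare two correlation coefficients via Fisher z.
# Simplification: treats the samples as independent; the study's correlations
# share the same teachers, so a dependent-correlation test would be needed.
import math
from scipy.stats import norm

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> tuple:
    """Return (z statistic, two-sided p-value) for H0: r1 == r2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher z-transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

# Placeholder sample sizes (n = 250 is illustrative, not from the study):
z, p = compare_correlations(r1=0.16, n1=250, r2=0.09, n2=250)
print(f"z = {z:.2f}, p = {p:.3f}")
```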

These results also provide “initial evidence that SAT-9 VAMs may be more sensitive to the kinds of instruction measured by PLATO.” In other words, the SAT-9 may be more sensitive than the state tests all states currently use (following the implementation of No Child Left Behind [NCLB] in 2002). These are the tests that are supposed to align “better” with state standards, but on which multiple-choice items (which often drive multiple-choice learning activities) are much more, if not solely, valued. This is an interesting conundrum, to say the least, as we anticipate the release of the “new and improved” set of Common Core tests.

Researchers also found some preliminary evidence that teachers who scored relatively higher on the “Classroom Environment” observational factor demonstrated “better” value-added scores across tests. Note again, however, that these correlation coefficients were also quite weak (r = 0.15 for the SAT-9 VAM and r = 0.13 for the state VAM). Researchers likewise found preliminary evidence that only the VAM based on the SAT-9 test scores was significantly related to the “Cognitive and Disciplinary Demand” factor, although here too the correlation coefficients were very weak (r = 0.12 for the SAT-9 VAM and r = 0.05 for the state VAM).

That being said, the “big ideas” I think we can take away from this study are that:

  1. The tests we use to construct value-added scores matter, because different tests yield different results (see also the Papay 2010 reference cited above). Accordingly, “researchers and policymakers [DO] need to pay careful attention to the assessments used to measure student achievement in designing teacher evaluation systems,” as these decisions will [emphasis added] yield different results.
  2. Relatedly, certain tests are going to be more instructionally sensitive than others. Click here and here for two prior posts on this topic.
  3. In terms of using observational data in general, “[p]erhaps the best indicator[s] of whether or not students are learning to engage in rigorous academic discourse…[should actually come from] classroom observations that capture the extent to which students are actually participating in [rigorous/critical] discussions” and their related higher-order thinking activities. This line of thinking also aligns with a recent post on this blog titled “Observations: Where Most of the Action and Opportunities Are.”

Bias in School Composite Measures Using Value-Added

Ed Fuller – Associate Professor of Educational Policy at The Pennsylvania State University, who is also affiliated with the university’s Center for Evaluation and Education Policy Analysis (CEEPA) – recently released an externally reviewed technical report that I believe you all might find of interest: “An Examination of Pennsylvania School Performance Profile Scores.”

He found that the Commonwealth of Pennsylvania’s School Performance Profile (SPP) scores – which the Pennsylvania Department of Education (PDE) defines for each school, and which are based 40% on the Pennsylvania Value-Added Assessment System (PVAAS, a derivative of the Education Value-Added Assessment System [EVAAS]) – are strongly associated with student and school characteristics. Thus, he concludes, these measures are “inaccurate measures of school effectiveness. [Pennsylvania’s] SPP scores, in [fact], are more accurate indicators of the percentage of economically disadvantaged students in a school than of the effectiveness of a school.”

Take a look at Fuller’s Figure 1, a scatterplot illustrating that there is indeed a “strong relationship” (or strong correlation) between the percentage of economically disadvantaged students and elementary schools’ SPP scores. Put more simply, as the percentage of economically disadvantaged students goes up (moving from left to right on the x-axis), the SPP score goes down (moving from top to bottom on the y-axis). The actual correlation coefficient, while not listed explicitly, is noted as being stronger than r = –0.60.

[Figure 1 from Fuller’s report: scatterplot of elementary schools’ SPP scores by percentage of economically disadvantaged students]

Note also how few outliers there are – that is, schools that defy the trend – even though such schools are what is often assumed to occur when “we” more fairly measure growth, give all types of students a fair chance to demonstrate growth, level the playing field for all, and the like. Figures 2 and 3 in the same document illustrate essentially the same thing, but for middle schools (r = 0.65) and high schools (r = 0.68), respectively. In these cases, the outliers are even less rare.
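
As a rough illustration of the kind of relationship Fuller plots, here is a minimal sketch that computes the Pearson correlation between a school's percentage of economically disadvantaged students and its SPP score, and fits a simple trend line. The data file and column names (pct_econ_disadv, spp_score) are hypothetical stand-ins, not Fuller's actual dataset.

```python
# Sketch: correlation and trend line between poverty rate and SPP score.
# Hypothetical columns: pct_econ_disadv, spp_score (stand-ins for Fuller's data).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def poverty_spp_relationship(df: pd.DataFrame) -> None:
    r, p = pearsonr(df["pct_econ_disadv"], df["spp_score"])
    slope, intercept = np.polyfit(df["pct_econ_disadv"], df["spp_score"], deg=1)
    print(f"Pearson r = {r:.2f} (p = {p:.3f})")
    print(f"Trend line: spp_score ~ {intercept:.1f} + {slope:.2f} * pct_econ_disadv")

# Usage (hypothetical file):
# poverty_spp_relationship(pd.read_csv("pa_elementary_schools.csv"))
```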

When Fuller conducted somewhat more complicated statistical analyses (i.e., regression analyses), which control or account for more variables and factors than simple correlations can, he still found that “the vast majority of the differences in SPP scores across schools [were still] explained [better] by student- and school-characteristics that [were not and will likely never be] under the control of educators.”
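
Below is a hedged sketch of the kind of regression Fuller describes: regress SPP scores on student and school characteristics and ask how much of the variation (the R-squared) those characteristics alone explain. The column names are hypothetical, and Fuller's actual model specification may differ.

```python
# Sketch: how much of the variation in SPP scores do student/school
# characteristics explain? (Hypothetical columns; not Fuller's exact model.)
import pandas as pd
import statsmodels.formula.api as smf

def spp_variance_explained(df: pd.DataFrame) -> float:
    """Regress SPP scores on school characteristics and return R-squared."""
    formula = "spp_score ~ pct_econ_disadv + pct_ell + pct_special_ed + enrollment"
    model = smf.ols(formula, data=df).fit()
    print(model.summary().tables[1])  # coefficient table
    return model.rsquared

# A large R-squared would mean most score differences track school demographics
# rather than anything educators control.
# r2 = spp_variance_explained(pd.read_csv("pa_schools.csv"))
```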

Fuller’s conclusions? “If the scores accurately capture the true effectiveness of schools and the educators within schools, there should be only a weak or non-existent relationship between the SPP scores and the percentage of economically disadvantaged students.”

Hence:

  1. “SPP scores should not be used as an indication of either school effectiveness or as a component of educator evaluations” because of the bias inherent within the system, of which PVAAS data are 40%. Bias in this case means that “teachers and principals in schools serving high percentages of economically disadvantaged students will be identified as less effective than they really are while those serving in schools with low percentages of economically disadvantaged students will be identified as more effective than in actuality.”
  2. Because SPP scores are biased, and therefore inaccurate assessments of school effectiveness, “the use of the SPP scores in teacher and principal evaluations will lead to inaccurate judgments about teacher and principal effectiveness.”
  3. Using SPP scores in teacher and principal evaluations will create “additional incentive[s] for the most qualified and effective educators to seek employment in schools with high SPP scores. Thus, the use of SPP scores on educator evaluations will simply exacerbate the existing inequities in the distribution of educator quality across schools based on the characteristics of the students enrolled in the schools.”

Critique of Chetty et al.’s ASA Critique Published in TCR

You might recall a post I wrote this past summer titled “Chetty Et Al…Are “Et” It “Al” Again!” The post was about how Raj Chetty and his colleagues (Harvard academics who have been the subjects of other posts here, here, and here) took it upon themselves to “take on” the American Statistical Association (ASA) and its recently released “Statement on Using Value-Added Models for Educational Assessment.” Yes, Chetty et al. critiqued the ASA itself – yet another sign of how Chetty et al. feel about their own work in this area.

Anyhow, I consulted a colleague to write a critique of Chetty et al.’s critique, which we published first on VAMboozled (here). Soon thereafter, partly in response to feedback from VAMboozled followers, we consulted another colleague/statistician, built the critique into a more thorough, scholarly commentary, and recently found out that the highly esteemed, peer-reviewed journal Teachers College Record picked it up for publication.

Congrats to my colleagues, Assistant Professors Margarita Pivovarova and Jennifer Broatch.

For followers, here is the piece in full, in case you’re interested in reading the revised (and now more thorough, fully cited) post. Here also is the direct link to the commentary online.