Last month in Educational Researcher (ER), researchers Pam Grossman (Stanford University), Julie Cohen (University of Virginia), Matthew Ronfeldt (University of Michigan), and Lindsay Brown (Stanford University) published a new article titled “The Test Matters: The Relationship Between Classroom Observation Scores and Teacher Value Added on Multiple Types of Assessment.” Click here to view a pre-publication version of the article.
Building on the modest-to-low correlations that the research community has consistently observed between observational data and value-added models (VAMs), the researchers in this study set out to investigate these relationships further and in more depth.
Using the Bill & Melinda Gates Foundation-funded Measures of Effective Teaching (MET) data, researchers examined the extent to which the (cor)relationships between one specific observation tool designed to assess instructional quality in English/language arts (ELA) – the Protocol for Language Arts Teaching Observation (PLATO) – and value-added model (VAM) output changed when two different tests were used to assess student achievement using VAMs. One set of tests included the states’ tests (depending on where teachers in the sample were located by state), and the other test was the Stanford Achievement Test (SAT-9) Open-Ended ELA test on which students are required not only to answer multiple choice items but to also construct responses to open or free response items. This is more akin to what is to be expected with the new Common Core tests, hence, this is significant looking forward.
Researchers found, not surprisingly given the prior research noted above (see, for example, Papay’s similar and seminal 2010 work here), that the relationship between teachers’ PLATO observation scores and their VAM scores varied depending on which of the two tests was used to calculate the VAM output. Researchers also found that both sets of correlation coefficients were low, which likewise aligns with current research, although in this case it might be more accurate to say the correlations they observed were very low.
More descriptively, PLATO was more highly correlated with the VAM scores derived using the SAT-9 (r = 0.16) than with those derived using the state tests (r = 0.09), and this difference was statistically significant. These differences also varied by the type of teacher, the type of “teacher effectiveness” domain observed, and the like. To read more about the finer distinctions the authors make in this regard, again, click here.
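To make that comparison concrete, here is a minimal sketch of the general idea, not the authors’ actual model, data, or significance test: correlate a single set of observation scores with two different VAM estimates, then ask whether those two dependent correlations differ, here using the Meng, Rosenthal, and Rubin (1992) z-test for correlated correlation coefficients. Everything below (sample size, variable names, and the simulated scores) is hypothetical and for illustration only.

```python
# Hypothetical sketch: compare the correlation of observation scores with two
# different VAM estimates. All data are simulated; this is not the MET data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300                                          # hypothetical number of teachers
plato = rng.normal(size=n)                       # simulated PLATO observation scores
vam_sat9 = 0.16 * plato + rng.normal(size=n)     # simulated VAM from SAT-9 scores
vam_state = 0.09 * plato + rng.normal(size=n)    # simulated VAM from state-test scores

r1, _ = stats.pearsonr(plato, vam_sat9)          # r(PLATO, SAT-9 VAM)
r2, _ = stats.pearsonr(plato, vam_state)         # r(PLATO, state VAM)
rx, _ = stats.pearsonr(vam_sat9, vam_state)      # correlation between the two VAMs

# Meng, Rosenthal & Rubin (1992): z-test for two correlations sharing one variable
r_sq_bar = (r1**2 + r2**2) / 2
f = min((1 - rx) / (2 * (1 - r_sq_bar)), 1.0)
h = (1 - f * r_sq_bar) / (1 - r_sq_bar)
z1, z2 = np.arctanh(r1), np.arctanh(r2)          # Fisher z-transforms of each r
z = (z1 - z2) * np.sqrt((n - 3) / (2 * (1 - rx) * h))
p = 2 * (1 - stats.norm.cdf(abs(z)))

print(f"r(PLATO, SAT-9 VAM) = {r1:.2f}")
print(f"r(PLATO, state VAM) = {r2:.2f}")
print(f"difference of dependent correlations: z = {z:.2f}, p = {p:.3f}")
```

The design point this illustrates is simply that the same observation scores can look more or less related to “teacher effectiveness” depending on which assessment feeds the value-added model, which is the crux of the study’s title.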
These results also provide “initial evidence that SAT-9 VAMs may be more sensitive to the kinds of instruction measured by PLATO.” In other words, the SAT-9 may be more sensitive to such instruction than the tests all states currently use (post the implementation of No Child Left Behind [NCLB] in 2002). These are the tests that are supposed to align “better” with state standards, but on which multiple-choice items (which often drive multiple-choice learning activities) are much more, if not solely, valued. This is an interesting conundrum, to say the least, as we anticipate the release of the “new and improved” set of Common Core tests.
Researchers also found some preliminary evidence that teachers who scored relatively higher on the “Classroom Environment” observational factor demonstrated “better” value-added scores across both tests. Note again, however, that these correlation coefficients were also quite weak (r = 0.15 for the SAT-9 VAM and r = 0.13 for the state VAM). Researchers found preliminary evidence, as well, that only the VAM based on the SAT-9 scores was significantly related to the “Cognitive and Disciplinary Demand” factor, although here again the correlation coefficients were very weak (r = 0.12 for the SAT-9 VAM and r = 0.05 for the state VAM).
That being said, the “big ideas” I think we can take away from this study are these:
- Which tests we use to construct value-added scores matters, because different tests yield different results (see also the Papay 2010 reference cited above). Accordingly, “researchers and policymakers [DO] need to pay careful attention to the assessments used to measure student achievement in designing teacher evaluation systems,” as these decisions will [emphasis added] yield different results.
- Relatedly, certain tests are going to be more instructionally sensitive than others. Click here and here for two prior posts on this topic.
- In terms of using observational data in general, “[p]erhaps the best indicator[s] of whether or not students are learning to engage in rigorous academic discourse…[should actually come from] classroom observations that capture the extent to which students are actually participating in [rigorous/critical] discussions” and their related higher-order thinking activities. This line of thinking also aligns with a recent post on this blog titled “Observations: Where Most of the Action and Opportunities Are.”