VAMs and Observations: Consistencies, Correlations, and Contortions

A research article titled “The Stability of Teacher Performance and Effectiveness: Implications for Policies Concerning Teacher Evaluation” was recently published in the peer-reviewed journal Education Policy Analysis Archives. The article was written by professors/doctoral researchers from Baylor University, Texas A&M University, and the University of South Carolina. The researchers set out to explore the stability (or fluctuation) over time of effectiveness ratings for 132 teachers across 23 schools in South Carolina, using both observational and VAM-based output.

Researchers found, not surprisingly given prior research in this area, that neither teacher performance using value-added nor effectiveness using observations was highly stable over time. This is most problematic when “sound decisions about continued employment, tenure, and promotion are predicated on some degree of stability over time. It is imprudent to make such decisions of the performance and effectiveness of a teacher as “Excellent” one year and “Mediocre” the next.”

They also observed “a generally weak positive relationship between the two sets of ratings [i.e., value-added estimates and observations],” a relationship that has also been the subject of many studies in the literature.

On both of these findings, really, we are well past the point of saturation. That is, we could not have more agreement across research studies on the following: (1) that teacher-level value-added scores are highly unstable over time, and (2) that these value-added scores do not align well with observational scores, as they should if both measures were appropriately capturing the “teacher effectiveness” construct.

Another interesting finding from this study, one discussed before but that has not yet reached the point of saturation in the research literature like the prior two, is (3) that teacher performance ratings based on observational data also differ markedly across schools. At issue here is “that performance ratings may be school-specific.” Or, as per a recent post on this blog, there is indeed much “Arbitrariness Inherent in Teacher Observations.” This is also highly problematic in that where a teacher is housed might determine his/her ratings more than his/her actual “quality” or “effectiveness” does; that is, ratings may depend on his/her location, his/her rater, and his/her rater’s scoring approach given differential tendencies towards leniency or severity. This might leave us with more of a luck-of-the-draw approach than an actually “objective” measurement of true teacher quality, contrary to current and popular (especially policy) beliefs.

Accordingly, and also per the research, this is not getting much better. As per the authors of this article as well as many other scholars, the problems include (1) that “the variance in value-added scores that can be attributed to teacher performance rarely exceeds 10 percent”; (2) the in many ways “gross” measurement errors that come, first, from the tests being used to calculate value-added; (3) the restricted ranges in teacher effectiveness scores, also given these test scores and their limited stretch and depth, and their instructional insensitivity (this was also at the heart of a recent post that demonstrated that “the entire range from the 15th percentile of effectiveness to the 85th percentile of [teacher] effectiveness [using the EVAAS] cover[ed] approximately 3.5 raw score points [given the tests used to measure value-added]”); (4) context, or student, family, school, and community background effects that simply cannot be controlled for or factored out; and (5) that these problems arise especially at the classroom/teacher level when students are not randomly assigned to classrooms (and teachers are not randomly assigned to teach those classrooms)… although random assignment will likely never happen, as it would mean prioritizing the sophistication and rigor of the value-added model over students’ “best interests.”

Rothstein, Chetty et al., and VAM-Based Bias

Recall the Chetty et al. study at the focus of many posts on this blog (see for example here, here, and here)? The study was cited in President Obama’s 2012 State of the Union address when Obama said, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.” Well, this study is at the source of a new and very interesting VAM-based debate, again.

This time, new research conducted by Berkeley Associate Professor of Economics – Jesse Rothstein – provides evidence that puts the aforementioned Chetty et al. results under another, more appropriate light. While Rothstein and others have previously written critiques of the Chetty et al. study (see prior reviews here, here, here, and here), what Rothstein recently found (in his working, not-yet-peer-reviewed study here) is that by using “teacher switching” statistical procedures, Chetty et al. masked evidence of bias in their prior study. While Chetty et al. have repeatedly claimed bias was not an issue (see for example a series of emails on this topic here), it seems indeed it was.

While Rothstein replicated Chetty et al.’s overall results using a similar dataset, Rothstein did not replicate Chetty et al.’s findings when it came to bias. As mentioned, Chetty et al. used a process of “teacher switching” to test for bias in their study, and by doing so found, with evidence, that bias did not exist in their value-added output. Rothstein found that when “teacher switching” is appropriately controlled, however, “bias accounts for about 20% of the variance in [VAM] scores.” This makes suspect, more now than before, Chetty et al.’s prior assertions that their model, and their findings, were immune to bias.

What this means, as per Rothstein, is that “teacher switching [the process used by Chetty et al.] is correlated with changes in students’ prior grade scores that bias the key coefficient toward a finding of no bias.” Hence, the reason Chetty et al. did not find bias in their value-added estimates is that they did not use the proper statistical controls to control for bias in the first place. When properly controlled, or adjusted, estimates yield “evidence of moderate bias;” hence, “[t]he association between [value-added] and long-run outcomes is not robust and quite sensitive to controls.”

This has major implications in the sense that this makes suspect the causal statements also made by Chetty et al. and repeated by President Obama, the Vergara v. California judge, and others – that “high value-added” teachers caused students to ultimately realize higher long-term incomes, fewer pregnancies, etc. X years down the road. If Chetty et al. did not appropriately control for bias, which again Rothstein argues with evidence they did not, it is likely that students would have realized these “things” almost if not entirely regardless of their teachers or what “value” their teachers purportedly “added” to their learning X years prior.

In other words, students were likely not randomly assigned to classrooms in either the Chetty et al. or the Rothstein datasets (making these datasets comparable). So if the statistical controls used did not effectively “control for” the non-random assignment of students into classrooms, teachers may have been assigned high value-added scores not necessarily because they were high value-added teachers but because they were non-randomly assigned higher performing, higher aptitude, etc. students, in the first place and as a whole. Thereafter, they were given credit for the aforementioned long-term outcomes, regardless.
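
To illustrate the logic, here is a minimal sketch in Python. It is not Rothstein’s or Chetty et al.’s method, and every number in it is hypothetical. It assumes two teachers with an identical true effect, a simple gain score standing in for a value-added estimate, and a student-level growth factor that the model cannot observe. When students are sorted into classrooms on that unobserved factor, the two teachers receive different “value-added” scores even though they are equally effective.

```python
from statistics import mean

# Assumption: both teachers add exactly the same true gain.
TRUE_TEACHER_EFFECT = 5

# Hypothetical students: (prior score, unobserved growth factor).
# The growth factor stands in for anything the statistical controls miss.
students = [(50, growth) for growth in range(8)]


def estimated_value_added(classroom):
    """Mean gain over prior score: a simple stand-in for a VAM estimate."""
    gains = [
        (prior + TRUE_TEACHER_EFFECT + growth) - prior
        for prior, growth in classroom
    ]
    return mean(gains)


# Non-random assignment: teacher A receives the four students with the
# highest unobserved growth, teacher B the four with the lowest.
sorted_gap = estimated_value_added(students[4:]) - estimated_value_added(students[:4])

# Balanced assignment: a stand-in for what random assignment achieves on average.
class_a = [students[i] for i in (0, 3, 4, 7)]
class_b = [students[i] for i in (1, 2, 5, 6)]
balanced_gap = estimated_value_added(class_a) - estimated_value_added(class_b)

print(f"Gap under sorted assignment:   {sorted_gap}")
print(f"Gap under balanced assignment: {balanced_gap}")
```

In this toy setup the entire gap between the two teachers under sorted assignment comes from who was placed in their classrooms, not from anything the teachers did.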

If the name Jesse Rothstein sounds familiar, it should. I have referenced his research in prior posts here, here, and here, as he is well-known in the area of VAM research, in particular, for a series of papers in which he provided evidence that assigning students to classrooms in non-random ways can create biased, teacher-level value-added scores. If random assignment were the norm (i.e., where students are randomly assigned to classrooms and, ideally, teachers are randomly assigned to teach those classrooms of randomly assigned students), teacher-level bias would not be so problematic. However, given research I also recently conducted on this topic (see here), random assignment (at least in the state of Arizona) occurs 2% of the time, at best. Principals otherwise outright reject the notion, as random assignment is not viewed as in “students’ best interests,” regardless of whether randomly assigning students to classrooms might mean “more accurate” value-added output as a result.

So it seems, we either get the statistical controls right (which I doubt is possible) or we randomly assign (which I highly doubt is possible). Otherwise, we are left wondering whether value-added analyses will ever work as per their intended (and largely ideal) purposes, especially when it comes to evaluating and holding accountable America’s public school teachers for their effectiveness.


In case you’re interested, Chetty et al. have responded to Rothstein’s critique. Their full response can be accessed here. Not surprisingly, they first highlight that Rothstein (and another set of their colleagues at Harvard) replicated their results. That “value-added (VA) measures of teacher quality show very consistent properties across different settings” is that on which Chetty et al. focus first and foremost. What they dismiss, however, is whether the main concerns raised by Rothstein threaten the validity of their methods and their conclusions. They also dismiss the fact that Rothstein addressed Chetty et al.’s counterpoints before they published them, in Appendix B of his paper, given Chetty et al. shared their concerns with Rothstein prior to his study’s release.

Nonetheless, the concerns Chetty et al. attempt to counter are whether their “teacher-switching” approach was invalid, and whether the “exclusion of teachers with missing [value-added] estimates biased the[ir] conclusion[s]” as well. The extent to which missing data bias value-added estimates has also been discussed previously, given that statisticians force the assumption in their analyses that missing data are “missing at random” (MAR), which is a difficult (although for some like Chetty et al., necessary) assumption to swallow (see, for example, the Braun 2004 reference here).

Five Nashville Teachers Face Termination

Three months ago, Metro Nashville Public Schools Director Jesse Register announced that he was to fire 63 Tennessean teachers, of 195 total who for two consecutive years scored lowest (i.e., a 1 on a scale of 1 to 5) in terms of their overall “value-added” (as based on 35% EVAAS, 15% related “student achievement,” and 50% observational data). These 63 were the ones who three months ago were still “currently employed,” given the other 132 apparently left voluntarily or retired (which is a nice-sized cost-savings, so let’s be sure not to forget about the economic motivators behind all of this as well). To see a better breakdown of these numbers, click here.

This was to be the first time in Tennessee that its controversial, “new and improved” teacher evaluation system would be used to take deliberate action against those whom officials deemed their “lowest-performing” teachers, as “objectively” identified in the classroom, although officials at that time did not expect to have a “final number” to be terminated until fall.

Well, fall is here, and it seems this final number is officially five: three middle school teachers, one elementary school teacher, and one high school teacher, all teaching in metro Nashville.

The majority of these teachers come from Neely’s Bend: “one of 14 Nashville schools on the state’s priority list for operating at the bottom 5 percent in performance statewide.” Some of these teachers were evaluated even though the principal who evaluated them is “no longer there.” Another is a computer instructor being terminated based on this school’s overall “school-level value-added.” This is problematic in and of itself given teacher-level and, in this case, school-level bias seem to go hand in hand with the use of these models, and grossly interfere with accusations that these teachers “caused” low performance (see a recent post about this here).

This is not to say these teachers were not indeed the lowest performing; maybe they were. But I for one would love to talk to these teachers and take a look at their actual data, EVAAS and observational data included. Based on prior experiences working with such individuals, there may be more to this than meets the eye. Hence, if anybody knows these folks, do let them know I’d like to better understand their stories.

Otherwise, all of this effort to ultimately attempt to terminate five of a total 5,685 certified teachers in the district (0.09%) seems awfully inefficient, and costly, and quite frankly absurd given this is a “new and improved” system meant to be much better than a prior system that likely yielded a similar termination rate, not including, however, those who left voluntarily prior.
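
For those who want to check the arithmetic, here is a quick sketch in Python using only the counts reported in this post (the variable names are mine, for illustration):

```python
# Counts as reported in this post.
flagged_teachers = 195      # scored lowest for two consecutive years
still_employed = 63         # still employed when the announcement was made
terminated = 5              # final number facing termination
district_teachers = 5685    # certified teachers in the district

left_or_retired = flagged_teachers - still_employed
termination_rate = terminated / district_teachers

print(f"Left voluntarily or retired: {left_or_retired}")
print(f"Termination rate: {termination_rate:.2%}")
```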

Perhaps an ulterior motive is, indeed, the cost-savings realized given the mere “new and improved” threat.

State Tests: Instructional Sensitivity and (Small) Differences between Extreme Teachers

As per Standard 1.2. of the newly released 2014 Standards for Educational and Psychological Testing authored by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME):

A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation. (p. 23)

After a series of conversations I have had, also recently with a colleague named Harris Zwerling, who is a researcher for the Pennsylvania State Education Association, I asked him to submit a guest blog about what he has observed working with Pennsylvania Value-Added Assessment System (PVAAS, aka the Education Value-Added Assessment System [EVAAS] administered by SAS-EVAAS) data in his state as per this standard. Here is what he wrote:

Recent Vamboozled posts have focused on one of the qualities of the achievement tests used as the key measures in value-added models (VAMs) and other “growth” models. Specifically, the instructional sensitivity of current student achievement tests has been questioned (see, for example, here). Evidence of whether the achievement tests in use can actually detect the influence of instruction on student test performance should, accordingly, be a central part of validity evidence required for VAMs [as per the specific standard also mentioned above]. After all, it is achievement test scores that provide the bases for VAM-based estimates. In addition, while many have noted that the statistical methods employed to generate value-added estimates will force a distribution of “teacher effectiveness” scores, it is not clear that the distinctions generated and measured have educational (or clinical) meaning.

This concern can be illustrated with an example taken from comments I previously made while serving on a Gates Foundation-funded stakeholders committee tasked with making recommendations for the redesign of Pennsylvania’s teacher and principal evaluation systems (see the document to which I refer here). Let me note that what follows is only suggestive of the possible limitations of some of Pennsylvania’s state achievement tests (the PSSAs) and is not offered as an example of an actual analysis of their instructional sensitivity. The value-added calculations, which I converted to raw scores, were performed by a research team from Mathematica Policy Research (for more information click here).

From my comments in this report:

“…8th grade Reading provided the smallest teacher effects from three subjects reported in Table III.1. The difference the [Final Report] estimates comparing the teacher at the 15th percentile of effectiveness to the average teacher (50th percentile) is -22 scaled score points on the 5th grade PSSA Reading test…[referring] to the 2010 PSSA Technical Manual raw score table… for the 8th grade Reading test, that would be a difference of approximately 2 raw score points, or the equivalent of 2 multiple choice (MC) questions (1 point apiece) or half credit on one OE [open-ended] question. (There are 40 MC questions and 4 (3 point) OE questions on the Reading test.) The entire range from the 15th percentile of effectiveness to the 85th percentile of effectiveness …covers approximately 3.5 raw score points [emphasis added].” (p. 9) “…[T]he range of teacher effectiveness covering the 5th to the 95th percentiles (73 scaled score points) represents approximately a 5.5 point change in the raw score (i.e., 5.5 of 52 total possible points) [emphasis added].” (p. 9)

The 5th to 95th percentile range corresponds to the effectiveness range used in the PVAAS.

“It would be difficult to see how the range of teacher effects on the 8th grade Reading test would be considered substantively meaningful if this result holds up in a larger and random sample. In the case of this subject and grade level test, the VAM does not appear to be able to detect a range of teacher effectiveness that is substantially more discriminating than those reported for more traditional forms of evaluation.” (pp. 9-10)
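
To put these raw-score figures in proportion, here is a minimal sketch in Python. It uses only the item counts and point ranges quoted above; the variable names are illustrative, and this is simple arithmetic, not an analysis of instructional sensitivity.

```python
# Test composition as quoted above: 40 MC items (1 point each)
# and 4 open-ended items (3 points each).
MC_ITEMS, MC_POINTS = 40, 1
OE_ITEMS, OE_POINTS = 4, 3
total_points = MC_ITEMS * MC_POINTS + OE_ITEMS * OE_POINTS

# Raw-score width of each "teacher effectiveness" range, as quoted above.
raw_point_ranges = {
    "15th to 85th percentile": 3.5,
    "5th to 95th percentile": 5.5,
}

# Share of the total possible raw score covered by each range.
share_of_test = {
    label: points / total_points for label, points in raw_point_ranges.items()
}

print(f"Total possible raw points: {total_points}")
for label, share in share_of_test.items():
    print(f"{label}: {share:.1%} of the total possible points")
```

In other words, under these figures the distance between the bottom and top of the reported effectiveness range amounts to roughly a tenth of the test.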

Among other issues, researchers have considered which scaling properties are necessary for measuring growth (see, for example, here), whether the tests’ scale properties met the assumptions of the statistical models being used (see, for example, here), if growth in student achievement is scale dependent (see, for example, here), and even if tests that were vertically scaled could meet the assumptions required by regression-based models (see, for example, here).

Another issue raised by researchers is the fundamental question of whether state tests could be put to more than one use, that is, as a valid measure of student progress and teacher performance, or would the multiple uses inevitably lead to behaviors that would invalidate the tests for either purpose (see, for example, here). All this suggests is that before racing ahead to implement value-added components of teacher evaluation systems, states have an obligation to assure all involved that these psychometric issues have been explicitly addressed and the tests used are properly validated for use in value-added measurements of “teacher effectiveness.”

So…How many states have validated their student achievement tests for use as measures of teachers’ performance? How many have produced evidence of their tests’ instructional sensitivity or of the educational significance of the distinctions made by the value-added models they use?

One major vendor of value-added measures (i.e., SAS as in SAS-EVAAS) long has held that the tests need only to have 1) sufficient “stretch” in the scales “to ensure that progress could be measured for low-and high achieving students”, 2) that “the test is highly related to the academic standards,” and 3) “the scales are sufficiently reliable from one year to the next” (see, for example, here). These assertions should be subject to independent psychometric verification, but they have not been. Rather, and notably, in all the literature they have produced regarding their VAMs, SAS does not indicate that it requires or conducts the tests that researchers have designed for determining instructional sensitivity. However, they assert that they can provide accurate measures of individual teacher effectiveness nonetheless.

To date Pennsylvania’s Department of Education (PDE) apparently has ignored the fact that the Data Recognition Corporation (DRC), the vendor of its state tests, has avoided making any claims that their tests are valid for use as measures of teacher performance (see, for example, here). In DRC’s most recent Technical Report chapter on validity, for example, they list the main purposes of the PSSA, the second of which is to “Provid[e] information on school and district accountability” (p.283). The omission of teacher accountability here certainly stands out.

To date, again, none of the publicly released DRC technical reports for Pennsylvania’s state tests provides any indication that instructional sensitivity has been examined, either. Nor do the reports provide a basis for others to make their own judgment. In fact, they make it clear that historically the PSSA exams were designed for school level accountability and only later have moved toward measuring individual student mastery of Pennsylvania’s academic standards. Various PSSA tests will sample an entire academic standard with as little as one or two questions (see, for example, Appendix C here), which aligns with the evidence presented in the first section of this post.

In sum, given that VAMs, by design, force a distribution of “teacher effectiveness” even if the measured differences are substantively quite small, and much smaller than the lay observer or, in particular, the educational policymaker might otherwise have believed (in the case of the 8th grade PSSA Reading test, 5.5 of 52 total possible points covers the entire range from the bottom to the top categories of “teacher effectiveness”), it is essential that both instructional sensitivity and educational significance be explicitly established. Otherwise, untrained users of the data may labor under the misapprehension that the differences observed provide a solid basis for making (low- and high-stakes) personnel decisions.

This is a completely separate concern from the fact that, at best, measured “teacher effects” may only explain from 1% to 15 or 20% of the variance in student test scores, a figure dwarfed by the variance explained by factors beyond the control of teachers. To date, SAS has not publicly reported the amount of variance explained by its PVAAS models for any of PA’s state tests. I believe every vendor of value-added models should report this information for every achievement test being used as a measure of “teacher effectiveness.” This too, would provide essential context for understanding the substantive meaning of reported scores.

A Major VAMbarrassment in New Orleans

Tulane University’s Cowen Institute for Education Initiatives, in what is now being called a “high-profile embarrassment,” is apologizing for releasing a high-profile report based on faulty research. Their report, which at one point seemed to have successfully VAMboozled folks representing numerous Louisiana governmental agencies and education non-profits, many of which emerged after Hurricane Katrina, has since been pulled from its once prominent placement on the institute’s website.

It seems that on October 1, 2014 the institute released a report titled “Beating-the-Odds: Academic Performance and Vulnerable Student Populations in New Orleans Public High Schools.” The report was celebrated widely (see, for example, here) in that it “proved” that students in post-Hurricane Katrina (largely charter) schools, despite the disadvantages they faced prior to the school-based reforms triggered by Katrina, were now “beating the odds,” while “posting better test scores and graduation rates than predicted by their populations,” thanks to these reforms. Demographics were no longer predicting students’ educational destinies, and the institute had the VAM-based evidence to prove it.

Institute researchers also “found” that New Orleans charter schools, in which over 80% of all New Orleans students are now educated, were substantively helping students “beat the odds,” accordingly. Let’s just say the leaders of the charter movement were all over this one, and also allegedly involved.

To some, however, the report’s initial findings (the findings that have now been retracted) did not come as much of a surprise. It seems the Cowen Institute and the city’s leading charter school incubator, New Schools for New Orleans, literally share office space; that is, they are literally “in office space” together.

To read more about this, as per the research of Kristen Buras (Associate Professor at Georgia State), click here. Mercedes Schneider, in a recent post she also wrote about this, noted that the “Cowen Institute at Tulane University has been promoting the New Orleans Charter Miracle [emphasis added]” and consistently trying “to sell the ‘transformed’ post-Katrina education system in New Orleans” since 2007 (two years post Katrina). Thanks also go out to Mercedes Schneider because, before the report was taken down, she downloaded it. The report can still be accessed there, or also directly here: Beating-the-Odds, for those who want to dive into this further.

Anyhow, and as per another news article recently released about this mess, the Cowen Institute’s Executive Director John Ayers removed the report because the research within it was “inaccurate,” and institute “[o]fficials determined the report’s methodology was flawed, making its conclusions inaccurate.” The report will not be reissued. The institute also intends to “thoroughly examine and strengthen [the institute's] internal protocols” because of this and to make sure that this does not happen again. The released report was not appropriately internally reviewed, as per the institute’s official response, although external review would have certainly been more appropriate here. Similar situations capturing why internal AND more importantly external review are so very important have been the focus of prior blog posts here, here, and here.

But in this report, which listed Debra Vaughan (the Institute’s Director of Research) and Patrick Sims (the Institute’s Senior Research Analyst) as the lead researchers, the researchers used what they called a VAM – BUT what they did in terms of analyses was certainly not akin to an advanced or “sophisticated” VAM (not that using a VAM would have revealed entirely more accurate and/or less flawed results). Instead, they used a simpler regression approach to reach (or confirm) their conclusions. While Ayers “would not say what piece of the methodology was flawed,” we can all be quite certain it was the so-called “VAM” that was the cause of this serious case of VAMbarrassment.

As background, and from a related article about the Cowen Institute and one of its new affiliates – Douglas Harris, who has also written a book about VAMs but positioned VAMs in a more positive light than I did in my book, but who is also not listed as a direct or affiliated author on the report – the situation in New Orleans post Katrina is as follows:

Before the storm hit in August 2005, New Orleans public schools were like most cities and followed the 100-year-old “One Best System.” A superintendent managed all schools in the school district, which were governed by a locally-elected school board. Students were taught by certified and unionized teachers and attended schools mainly based on where they lived.

But that all changed when the hurricane shuttered most schools and scattered students around the nation. That opened the door for alternative forms of public education, such as charter schools that shifted control from the Orleans Parish School Board into the hands of parents and a state agency, the Recovery School District.

In the 2012-13 school year, 84 percent of New Orleans public school students attended charter schools…New Orleans [currently] leads the nation in the percentage of public school students enrolled in charter schools, with the next-highest percentages in Washington D.C. and Detroit (41 percent in each).

Again…Does the Test Matter?

Following up on my most recent post about whether “The Test Matters,” a colleague of mine from the University of Southern California (USC) – Morgan Polikoff, whom I also referenced on this blog this past summer as per a study he and Andrew Porter (University of Pennsylvania) released about the “‘Surprisingly Weak’ Correlations among VAMs and Other Teacher Quality Indicators” – wrote me an email. In this email he sent another study very similar to the one referenced above, about whether “The Test Matters.” But this study is titled “Does the Test Matter?” (see full reference below).

Given this is coming from a book chapter included within a volume capturing many of the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) studies – studies that have also been the source of prior posts here and here – I thought it even more appropriate to share this with you all, given book chapters are sometimes hard to find and access.

Like the authors of “The Test Matters,” Polikoff used MET data to investigate whether large-scale standardized state tests “differ in the extent to which they reflect the content or quality of teachers’ instruction” (i.e., tests’ instructional sensitivity). He also investigated whether differences in instructional sensitivity affect recommendations made in the MET reports for creating multiple-measure evaluation systems.

Polikoff found that “state tests indeed vary considerably in their correlations with observational and student survey measures of effective teaching.” These correlations, as they were in the previous article/post cited above, were statistically significant, positive, and fell in the very weak range in mathematics (0.15 < r < 0.19) and very weak range in English/language arts (0.07 < r < 0.15). This, and other noted correlations that approach zero (i.e., r = 0 or no correlation at all), all, as per Polikoff, indicate “weak sensitivity to instructional quality.”

Put differently, while this indicates that the extent to which state tests are instructionally sensitive appears slightly more favorable in mathematics versus English/language arts, we might otherwise conclude that the state tests used in at least the MET partner states “are only weakly sensitive to pedagogical quality, as judged by the MET study instruments” – the instruments that in many ways are “the best” we have thus far to offer in terms of teacher evaluation (i.e., observational, and student survey instruments).

But why does instructional sensitivity matter? Because the inferences made from these state test results, independently or more likely post VAM calculation, “rely on the assumption that [state test] results accurately reflect the instruction received by the students taking the test. This assumption is at the heart of [investigations] of instructional sensitivity.” See another related (and still troublesome) post about instructional sensitivity here.

Polikoff, M. S. (2014). Does the Test Matter? Evaluating teachers when tests differ in their sensitivity to instruction. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.). Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project (pp. 278-302). San Francisco, CA: Jossey-Bass.

“The Test Matters”

Last month in Educational Researcher (ER), researchers Pam Grossman (Stanford University), Julie Cohen (University of Virginia), Matthew Ronfeldt (University of Michigan), and Lindsay Brown (Stanford University) published a new article titled: “The Test Matters: The Relationship Between Classroom Observation Scores and Teacher Value Added on Multiple Types of Assessment.” Click here to view a prior version of this document, pre-official-publication in ER.

Building upon what the research community has consistently evidenced in terms of the generally modest to low correlations (or relationships) consistently being observed between observational data and VAMs, researchers in this study set out to investigate these relationships a bit further, and in more depth.

Using the Bill & Melinda Gates Foundation-funded Measures of Effective Teaching (MET) data, researchers examined the extent to which the (cor)relationships between one specific observation tool designed to assess instructional quality in English/language arts (ELA) – the Protocol for Language Arts Teaching Observation (PLATO) – and value-added model (VAM) output changed when two different tests were used to assess student achievement using VAMs. One set of tests included the states’ tests (depending on where teachers in the sample were located by state), and the other test was the Stanford Achievement Test (SAT-9) Open-Ended ELA test on which students are required not only to answer multiple choice items but to also construct responses to open or free response items. This is more akin to what is to be expected with the new Common Core tests, hence, this is significant looking forward.

Researchers found, not surprisingly given the aforementioned prior research (see, for example, Papay’s similar and seminal 2010 work here), that the relationship between teachers’ PLATO observational scores and VAM output scores varied given which of the two aforementioned tests was used to calculate VAM output. In addition, researchers found that both sets of correlation coefficients were low, which also aligns with current research, but in this case it might be more accurate to say the correlations they observed were very low.

More descriptively, PLATO was more highly correlated with the VAM scores derived using the SAT-9 (r = 0.16) versus the states’ tests (r = 0.09), and these differences were statistically significant. These differences also varied by the type of teacher, the type of “teacher effectiveness” domain observed, etc. To read more about the finer distinctions the authors make in this regard, please click, again, here.

These results do also provide “initial evidence that SAT-9 VAMs may be more sensitive to the kinds of instruction measured by PLATO.” In other words, the SAT-9 test may be more sensitive than the state tests all states currently use (post the implementation of No Child Left Behind [NCLB] in 2002). These are the tests that are to align “better” with state standards, but on which multiple-choice items (that often drive multiple-choice learning activities) are much more, if not solely, valued. This is an interesting conundrum, to say the least, as we anticipate the release of the “new and improved” set of Common Core tests.

Researchers also found some preliminary evidence that teachers who scored relatively higher on the “Classroom Environment” observational factor demonstrated “better” value-added scores across tests. Note here again, however, that these correlation coefficients were also quite weak (r = 0.15 for the SAT-9 VAM and r = 0.13 for the state VAM). Researchers also found preliminary evidence that only the VAM that used the SAT-9 test scores was significantly related to the “Cognitive and Disciplinary Demand” factor, although again the correlation coefficients were very weak (r = 0.12 for the SAT-9 VAM and r = 0.05 for the state VAM).
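To put these coefficients in perspective, here is a rough sketch (my own illustration, not a calculation from the article) that squares each reported correlation to estimate the share of variance the two measures have in common. It assumes a simple linear relationship between each pair of measures; the labels and the helper name `shared_variance` are hypothetical, and only the r values come from the study as reported above.

```python
# Correlations reported above between PLATO scores and VAM output.
# Squaring r gives the proportion of shared variance under a simple
# linear (bivariate) relationship. This is an illustrative assumption,
# not an analysis the authors ran.
reported_r = {
    "PLATO overall vs. SAT-9 VAM": 0.16,
    "PLATO overall vs. state VAM": 0.09,
    "Classroom Environment vs. SAT-9 VAM": 0.15,
    "Classroom Environment vs. state VAM": 0.13,
    "Cognitive/Disciplinary Demand vs. SAT-9 VAM": 0.12,
    "Cognitive/Disciplinary Demand vs. state VAM": 0.05,
}


def shared_variance(r):
    """Return r squared, the proportion of variance two measures share."""
    return r * r


shared = {label: shared_variance(r) for label, r in reported_r.items()}

for label, value in shared.items():
    print(f"{label}: r = {reported_r[label]:.2f}, r^2 = {value:.4f} ({value:.1%})")
```

Under this assumption, even the strongest relationship reported (r = 0.16) corresponds to less than 3% of shared variance between the observational scores and the value-added output.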

That being said, the “big ideas” I think we can take away from this study are namely that:

  1. Which tests we use to construct value-added scores matters because the tests we decide to use yield different results (see also the Papay 2010 reference cited above). Accordingly, “researchers and policymakers [DO] need to pay careful attention to the assessments used to measure student achievement in designing teacher evaluation systems” as these decisions will [emphasis added] yield different results.
  2. Related, certain tests are going to be more instructionally sensitive than others. Click here and here for two prior posts about this topic.
  3. In terms of using observational data, in general, “[p]erhaps the best indicator[s] of whether or not students are learning to engage in rigorous academic discourse…[should actually come from] classroom observations that capture the extent to which students are actually participating in [rigorous/critical] discussions” and their related higher-order thinking activities. This line of thinking aligns, as well, with a recent post on this blog titled “Observations: ‘Where Most of the Action and Opportunities Are.’”

Bias in School Composite Measures Using Value-Added

Ed Fuller, Associate Professor of Educational Policy at The Pennsylvania State University (and also affiliated with the university’s Center for Evaluation and Education Policy Analysis [CEEPA]), recently released an externally reviewed technical report that I believe you all might find of interest. He conducted “An Examination of Pennsylvania School Performance Profile Scores.”

He found that the Commonwealth of Pennsylvania’s School Performance Profile (SPP) scores, which the Pennsylvania Department of Education (PDE) defines for each school and 40% of which are based on the Pennsylvania Value-Added Assessment System (PVAAS, a derivative of the Education Value-Added Assessment System [EVAAS]), are strongly associated with student and school characteristics. Thus, he concludes, these measures are “inaccurate measures of school effectiveness. [Pennsylvania's] SPP scores, in [fact], are more accurate indicators of the percentage of economically disadvantaged students in a school than of the effectiveness of a school.”

Take a look at Fuller’s Figure 1. This figure is a scatterplot illustrating that there is indeed a “strong relationship” or “strong (co)relationship” between the percent of economically disadvantaged students and elementary schools’ SPP scores. Put more simply, as the percentage of economically disadvantaged students goes up (moving from left to right on the x-axis), the SPP goes down (moving from top to bottom on the y-axis). The actual correlation coefficient illustrated, while not listed explicitly, is noted as being stronger than r = -0.60.

Note also the very few outliers, or schools that defy the trend, as is often assumed to occur when “we” more fairly measure growth, give all types of students a fair chance to demonstrate growth, level the playing field for all, and the like. Figures 2 and 3 in the same document illustrate essentially the same things, but for middle schools (r = 0.65) and high schools (r = 0.68). In these cases the outliers are even rarer.

When Fuller conducted somewhat more complicated statistical analyses (i.e., regression analyses), which help to control or account for more variables and factors than simple correlations do, he still found that “the vast majority of the differences in SPP scores across schools [were still] explained [better] by student- and school-characteristics that [were not and will likely never be] under the control of educators.”

Fuller’s conclusions? “If the scores accurately capture the true effectiveness of schools and the educators within schools, there should be only a weak or non-existent relationship between the SPP scores and the percentage of economically disadvantaged students.”

Hence: (1) “SPP scores should not be used as an indication of either school effectiveness or as a component of educator evaluations” because of the bias inherent within the system, of which PVAAS data are 40%. Bias in this case means that “teachers and principals in schools serving high percentages of economically disadvantaged students will be identified as less effective than they really are while those serving in schools with low percentages of economically disadvantaged students will be identified as more effective than in actuality.” (2) Because SPP scores are biased, and therefore inaccurate assessments of school effectiveness, “the use of the SPP scores in teacher and principal evaluations will lead to inaccurate judgments about teacher and principal effectiveness.” And (3) Using SPP scores in teacher and principal evaluations will create “additional incentive[s] for the most qualified and effective educators to seek employment in schools with high SPP scores. Thus, the use of SPP scores on educator evaluations will simply exacerbate the existing inequities in the distribution of educator quality across schools based on the characteristics of the students enrolled in the schools.”

Critique of Chetty et al.’s ASA Critique Published in TCR

You might recall from this past summer a post I wrote titled “Chetty Et Al…Are “Et” It “Al” Again!” The post was about how Raj Chetty and his colleagues (Harvard academics who have been the sources of other posts here, here, and here) took it upon themselves to “take on” the American Statistical Association (ASA) and their recently released “Statement on Using Value-Added Models for Educational Assessment.” Yes, the ASA as critiqued by Chetty et al., as yet another sign of how Chetty et al. feel about their own work in this area.

Anyhow, I consulted a colleague to write a critique of Chetty et al.’s critique that we published first on VAMboozled (here). Soon thereafter, also given some VAMboozled followers’ feedback, we consulted another colleague/statistician, built this critique into a more thorough/scholarly commentary, and recently found out that the highly-esteemed, peer-reviewed journal – Teachers College Record – picked it up for publication.

Congrats to my colleagues, Assistant Professors Margarita Pivovarova and Jennifer Broatch.

For followers, here is the piece in full in case you’re all interested in reading the revised (and now more thorough with citations) post. Here also is the direct link to the commentary online.

Observations: “Where Most of the Action and Opportunities Are”

In a study just released on the website of Education Next, researchers discuss results from their recent examinations of “new teacher-evaluation systems in four school districts that are at the forefront of the effort [emphasis added] to evaluate teachers meaningfully.” The four districts’ evaluation systems were based on classroom observations, achievement test gains for the whole school (i.e., school-level value-added), performance on non-standardized tests, and some form of measure of teacher professionalism and/or teacher commitment to the school community.

Researchers found the following: The ratings assigned to teachers across the four districts’ leading evaluation systems, as based primarily (i.e., 50-75%) on observations (not including value-added scores except for the amazingly low 20% of teachers who were VAM eligible), were “sufficiently predictive” of a teacher’s future performance. Later they define what “sufficiently predictive” is in terms of predictive validity coefficients that ranged from 0.33 to 0.38, which are actually quite “low” coefficients in reality. Later they say these coefficients are also “quite predictive,” regardless.

While such low coefficients are to be expected as per others’ research on this topic, one must question how the authors came up with their determinations that these were “sufficiently” and “quite” predictive (see also Bill Honig’s comments at the bottom of this article). The authors of this article qualify these classifications later, though, writing that “[t]he degree of correlation confirms that these systems perform substantially better in predicting future teacher performance than traditional systems based on paper credentials and years of experience.” They explain further that these correlations are “in the range that is typical of systems for evaluating and predicting future performance in other fields of human endeavor, including, for example, those used to make management decisions on player contracts in professional sports.” So it seems their qualifications were based on a “better than” or relative, but not empirical, judgment. That being said, this is something to certainly consume critically, particularly in the ways they’ve inappropriately categorized these coefficients.
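One way to see why “sufficiently predictive” deserves scrutiny is to convert the reported coefficients into the share of variation in future performance they would account for. The short sketch below is my own back-of-the-envelope arithmetic, not the authors’; it assumes a simple linear relationship, and the helper name `variance_explained` is hypothetical.

```python
# Predictive validity coefficients reported in the Education Next study.
predictive_r = (0.33, 0.38)


def variance_explained(r):
    """Hypothetical helper: share of variance in future performance that a
    rating accounts for, assuming a simple linear relationship."""
    return round(r ** 2, 4)


explained = [variance_explained(r) for r in predictive_r]
unexplained = [round(1 - e, 4) for e in explained]

for r, e, u in zip(predictive_r, explained, unexplained):
    print(f"r = {r:.2f}: explained = {e:.1%}, unexplained = {u:.1%}")
```

Under this assumption, the ratings would account for roughly 11% to 14% of the variation in teachers’ future performance, leaving the large majority unaccounted for.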

Researchers also found the following: “The stability generated by the districts’ evaluation systems range[d] from a bit more than 0.50 for teachers with value-added scores to about 0.65 when value-added is not a component of the score.” In other words, districts’ “[e]valuation scores that [did] not include value-added [were] more stable [when districts] assign[ed] more weight to observation scores, which [were demonstrably] more stable over time than value-added scores.” That is, observational scores outperformed value-added scores. Likewise, the stability they observed in the value-added scores (i.e., 0.50) fell within the upper range of those coefficients also reported elsewhere in the research. So, researchers also confirmed that teacher-level value-added scores are still quite inconsistent from year to year as they still (and too often) vary widely, and wildly over time.
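To get a feel for what stability coefficients of about 0.50 and 0.65 might mean in practice, the following is a small simulation sketch of my own, not anything the researchers ran. It treats each stability figure as a year-to-year correlation and assumes teachers’ scores follow a bivariate normal distribution, which is a simplifying assumption; the function name `top_quintile_retention` and its parameters are hypothetical.

```python
import random


def top_quintile_retention(rho, n=50_000, seed=0):
    """Simulate the share of teachers in the top 20% in year 1 who are
    again in the top 20% in year 2, assuming bivariate normal scores
    with year-to-year correlation rho (a simplifying assumption)."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Construct a year-2 score whose correlation with year 1 is rho.
        y = rho * x + (1 - rho ** 2) ** 0.5 * noise
        year1.append(x)
        year2.append(y)
    # Cut scores marking the top fifth of each year's distribution.
    cut1 = sorted(year1)[int(0.8 * n)]
    cut2 = sorted(year2)[int(0.8 * n)]
    top1 = [i for i in range(n) if year1[i] >= cut1]
    stayed = sum(1 for i in top1 if year2[i] >= cut2)
    return stayed / len(top1)


retention_vam = top_quintile_retention(0.50)  # scores with value-added
retention_obs = top_quintile_retention(0.65)  # scores without value-added

print(f"Stability 0.50: {retention_vam:.1%} of top-fifth teachers stay in the top fifth")
print(f"Stability 0.65: {retention_obs:.1%} of top-fifth teachers stay in the top fifth")
```

Under these assumptions, a sizable share of teachers rated in the top fifth one year would fall out of it the next at either level of stability, with the lower coefficient producing more churn.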

Researchers’ key recommendations, as based on improving the quality of data derived from classroom observations: “Teacher evaluations should include two to three annual classroom observations, with at least one observation being conducted by a trained external observer.” They provide some evidence in support of this assertion in the full article. In addition, they assert that “[c]lassroom observations should carry at least as much weight as test-score gains in determining a teacher’s overall evaluation score.” Although I would argue, as based on their (and others’) results, they certainly made a greater case for observations in lieu of teacher-level value-added throughout their paper.

Put differently, and in their own words – words with which I agree: “[M]ost of the action and nearly all the opportunities for improving teacher evaluations lie in the area of classroom observations rather than in test-score gains.” So there it is.

Note: The authors of this article do also talk about the “bias” inherent in classroom observations. As based on their findings, for example, they also recommend that “districts adjust classroom observation scores for the degree to which the students assigned to a teacher create challenging conditions for the teacher. Put simply, the current observation systems are patently unfair to teachers who are assigned less-able and -prepared students. The result is an unintended but strong incentive for good teachers to avoid teaching low-performing students and to avoid teaching in low-performing schools.” While I did not highlight these sections above, do click here if wanting to read more.
