Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

I recently re-read, in full, an article that is now 10 years old, published in 2004 and, in the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as, unfortunately, this most often serves as a reminder of how little has changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientist), Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the time the authors wrote this article, besides the aforementioned data and database issues, there were issues with “multiple measures on the same student and multiple teachers instructing each student,” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” The authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (still) comes from influences outside of the teacher’s control (see the illustrative sketch following this list). A similar statistic (i.e., 1% to 14%) was recently highlighted in the position statement on VAMs released by the American Statistical Association.
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here: “Putting growth and value-added models on the map: A national overview.”
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically unsolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” The only real solution here is to use “implicit imputation of values for unobserved gains using the observed scores,” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observe[d] gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.
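
To make the first bullet concrete, here is a minimal, purely illustrative simulation (my own sketch, with made-up variance components, not any state’s actual VAM) of what it means for teachers to account for only a small share of the total variability in test scores:

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers, students_per_class = 200, 25

# Assumed (illustrative) variance components: teachers contribute a small
# share of the total variability in student test scores.
teacher_sd = 3.0    # between-teacher ("value-added") spread
student_sd = 12.0   # everything else: prior achievement, home factors, noise

teacher_effects = rng.normal(0, teacher_sd, n_teachers)
scores = (np.repeat(teacher_effects, students_per_class)
          + rng.normal(0, student_sd, n_teachers * students_per_class))

teacher_share = teacher_sd**2 / (teacher_sd**2 + student_sd**2)
print(f"Teacher share of total score variance: {teacher_share:.1%}")  # ~6%

# Even so, naive class means spread more widely than the true teacher
# effects, because each mean also absorbs sampling noise.
class_means = scores.reshape(n_teachers, students_per_class).mean(axis=1)
print("SD of true teacher effects:", round(teacher_effects.std(), 2))
print("SD of observed class means:", round(class_means.std(), 2))
```

Under these assumed numbers, teachers account for roughly 6% of the total variance, squarely within the ranges the authors and the ASA report, yet the observed class means still fan out well beyond the true teacher effects.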

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [their] desired inferences” and uses, including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”

Victory in Court: Consequences Attached to VAMs Suspended Throughout New Mexico

Great news for New Mexico and New Mexico’s approximately 23,000 teachers, and great news for states and teachers potentially elsewhere, in terms of setting precedent!

Late yesterday, state District Judge David K. Thomson, who presided over the ongoing teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set for now, for April) that the system is reliable, valid, fair, uniform, and the like.

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), is unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here).

Thereafter, one scheduled day of testimonies turned into five in Santa Fe, which ran from the end of September through the beginning of October (each of which I covered here, here, here, here, and here). I served as the expert witness for the plaintiffs’ side, along with other witnesses including lawmakers (e.g., a state senator) and educators (e.g., teachers, superintendents) who made various (and very articulate) claims about the state’s teacher evaluation system on the stand. Thomas Kane served as the expert witness for the defendant’s side, along with other witnesses including lawmakers and educators who made counter claims about the system, some of which backfired, unfortunately for the defense, primarily during cross-examination.

See articles released about this ruling this morning in the Santa Fe New Mexican (“Judge suspends penalties linked to state’s teacher eval system”) and the Albuquerque Journal (“Judge curbs PED teacher evaluations”). See also the AFT’s press release, written by AFT President Randi Weingarten, here. Click here for the full 77-page Order written by Judge Thomson (see also, below, five highlights I pulled from this Order).

The journalist at the Santa Fe New Mexican, though, provided the most detailed information about Judge Thomson’s Order, writing, for example, that the “ruling by state District Judge David Thomson focused primarily on the complicated combination of student test scores used to judge teachers. The ruling [therefore] prevents the Public Education Department [PED] from denying teachers licensure advancement or renewal, and it strikes down a requirement that poorly performing teachers be placed on growth plans.” In addition, the Judge noted that “the teacher evaluation system varies from district to district, which goes against a state law calling for a consistent evaluation plan for all educators.”

The PED continues to stand by its teacher evaluation system, calling the court challenge “frivolous” and “a legal PR stunt,” all the while noting that Judge Thomson’s decision “won’t affect how the state conducts its teacher evaluations.” Indeed it will, for now and until the state’s teacher evaluation system is vetted, and validated, and “the court” is “assured” that the system can actually be used to take the “consequential actions” against teachers, “required” by the state’s PED.

Here are some other highlights that I took directly from Judge Thomson’s ruling, capturing what I viewed as his major areas of concern about the state’s system (click here, again, to read Judge Thomson’s full Order):

  • Validation Needed: “The American Statistical Association says ‘estimates from VAM should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAM are used for high stake[s] purposes’” (p. 1). These are the measures, assumptions, limitations, and the like that are to be made transparent in this state.
  • Uniformity Required: “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate” (p. 2)…with the existing statutory framework for teacher evaluations for licensure purposes requiring “that the teacher be evaluated for ‘competency’ against a ‘highly objective uniform statewide standard of evaluation’ to be developed by PED” (p. 4). “It is the term ‘highly objective uniform’ that is the subject matter of this suit” (p. 4), whereby the state and no other “party provided [or could provide] the Court a total calculation of the number of available district-specific plans possible given all the variables” (p. 54). See also the Judge’s points #78-#80 (starting on page 70) for some of the factors that helped to “establish a clear lack of statewide uniformity among teachers” (p. 70).
  • Transparency Missing: “The problem is that it is not easy to pull back the curtain, and the inner workings of the model are not easily understood, translated or made accessible” (p. 2). “Teachers do not find the information transparent or accurate” and “there is no evidence or citation that enables a teacher to verify the data that is the content of their evaluation” (p. 42). In addition, “[g]iven the model’s infancy, there are no real studies to explain or define the [s]tate’s value-added system…[hence, the consequences and decisions]…that are to be made using such system data should be examined and validated prior to making such decisions” (p. 12).
  • Consequences Halted: “Most significant to this Order, [VAMs], in this [s]tate and others, are being used to make consequential decisions…This is where the rubber hits the road [as per]…teacher employment impacts. It is also where, for purposes of this proceeding, the PED departs from the statutory mandate of uniformity requiring an injunction” (p. 9). In addition, it should be noted that indeed “[t]here are adverse consequences to teachers short of termination” (p. 33) including, for example, “a finding of ‘minimally effective’ [that] has an impact on teacher licenses” (p. 41). These, too, are to be halted under this injunction Order.
  • Clarification Required: “[H]ere is what this [O]rder is not: This [O]rder does not stop the PED’s operation, development and improvement of the VAM in this [s]tate, it simply restrains the PED’s ability to take consequential actions…until a trial on the merits is held” (p. 2). In addition, “[a] preliminary injunction differs from a permanent injunction, as does the factors for its issuance…’ The objective of the preliminary injunction is to preserve the status quo [minus the consequences] pending the litigation of the merits. This is quite different from finally determining the cause itself” (p. 74). Hence, “[t]he court is simply enjoining the portion of the evaluation system that has adverse consequences on teachers” (p. 75).

The PED also argued that “an injunction would hurt students because it could leave in place bad teachers.” As per Judge Thomson, “That is also a faulty argument. There is no evidence that temporarily halting consequences due to the errors outlined in this lengthy Opinion more likely results in retention of bad teachers than in the firing of good teachers” (p. 75).

Finally, given my involvement in this lawsuit and given the team with whom I was/am still so fortunate to work (see picture below), including all of those who testified as part of the team and whose testimonies clearly proved critical in Judge Thomson’s final Order, I want to thank everyone for all of their time, energy, and efforts in this case, thus far, on behalf of the educators attempting to (still) do what they love to do — teach and serve students in New Mexico’s public schools.

[Photo]

Left to right: (1) Stephanie Ly, President of AFT New Mexico; (2) Dan McNeil, AFT Legal Department; (3) Ellen Bernstein, ATF President; (4) Shane Youtz, Attorney at Law; and (5) me 😉

Rothstein, Chetty et al., and VAM-Based Bias

Recall the Chetty et al. study at the focus of many posts on this blog (see, for example, here, here, and here)? The study was cited in President Obama’s 2012 State of the Union address when Obama said, “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” and this study was more recently the focus of attention when the judge in Vergara v. California cited Chetty et al.’s study as providing evidence that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.” Well, this study is at the source of a new, and very interesting, VAM-based debate, again.

This time, new research conducted by Berkeley Associate Professor of Economics Jesse Rothstein provides evidence that puts the aforementioned Chetty et al. results under another, appropriate light. While Rothstein and others have previously written critiques of the Chetty et al. study (see prior reviews here, here, here, and here), what Rothstein recently found (in his working, not-yet-peer-reviewed study here) is that by using “teacher switching” statistical procedures, Chetty et al. masked evidence of bias in their prior study. While Chetty et al. have repeatedly claimed bias was not an issue (see, for example, a series of emails on this topic here), it seems indeed it was.

While Rothstein replicated Chetty et al.’s overall results using a similar dataset, Rothstein did not replicate Chetty et al.’s findings when it came to bias. As mentioned, Chetty et al. used a process of “teacher switching” to test for bias in their study, and by doing so found, with evidence, that bias did not exist in their value-added output. Rothstein found that when “teacher switching” is appropriately controlled, however, “bias accounts for about 20% of the variance in [VAM] scores.” This makes suspect, more now than before, Chetty et al.’s prior assertions that their model, and their findings, were immune to bias.

What this means, as per Rothstein, is that “teacher switching [the process used by Chetty et al.] is correlated with changes in students’ prior grade scores that bias the key coefficient toward a finding of no bias.” Hence, there was a reason Chetty et al. did not find bias in their value-added estimates: they did not use the proper statistical controls to control for bias in the first place. When properly controlled, or adjusted, estimates yield “evidence of moderate bias;” hence, “[t]he association between [value-added] and long-run outcomes is not robust and quite sensitive to controls.”

This has major implications in the sense that this makes suspect the causal statements also made by Chetty et al. and repeated by President Obama, the Vergara v. California judge, and others – that “high value-added” teachers caused students to ultimately realize higher long-term incomes, fewer pregnancies, etc. X years down the road. If Chetty et al. did not appropriately control for bias, which again Rothstein argues with evidence they did not, it is likely that students would have realized these “things” almost if not entirely regardless of their teachers or what “value” their teachers purportedly “added” to their learning X years prior.

In other words, students were likely not randomly assigned to classrooms in either the Chetty et al. or the Rothstein datasets (making these datasets comparable). So if the statistical controls used did not effectively “control for” the non-random assignment of students into classrooms, teachers may have been assigned high value-added scores not necessarily because they were high value-added teachers but because they were non-randomly assigned higher performing, higher aptitude, etc. students, in the first place and as a whole. Thereafter, they were given credit for the aforementioned long-term outcomes, regardless.
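
For readers who want to see the mechanism in miniature, the following stylized sketch (my own simulation, not the Chetty et al. or Rothstein models, with all numbers invented) shows how a naive value-added-style estimate can end up tracking classroom composition rather than teacher quality when students are sorted into classrooms by aptitude and the prior test only partially captures that aptitude:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 100, 25
n_students = n_teachers * class_size

true_effects = rng.normal(0, 1.0, n_teachers)          # real teacher "value-added"
ability = np.sort(rng.normal(0, 10.0, n_students))      # sorting = non-random assignment
ability = ability.reshape(n_teachers, class_size)       # some teachers get stronger classes

# The prior test is only a noisy proxy for aptitude; the post test reflects
# aptitude, the teacher, and noise.
prior = ability + rng.normal(0, 5.0, ability.shape)
post = ability + true_effects[:, None] + rng.normal(0, 5.0, ability.shape)

# Naive VAM-style estimate: regress post on prior for all students, then
# average each teacher's residuals.
x, y = prior.ravel(), post.ravel()
slope = np.cov(x, y)[0, 1] / np.var(x)
resid = (y - (y.mean() + slope * (x - x.mean()))).reshape(n_teachers, class_size)
naive_va = resid.mean(axis=1)

print("corr(naive VA, true teacher effect):",
      round(np.corrcoef(naive_va, true_effects)[0, 1], 2))
print("corr(naive VA, class mean aptitude):",
      round(np.corrcoef(naive_va, ability.mean(axis=1))[0, 1], 2))
```

In this contrived setup, the “value-added” estimate correlates far more strongly with the aptitude of the class a teacher happened to receive than with the teacher’s true effect, which is the basic worry behind non-random assignment and bias.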

If the name Jesse Rothstein sounds familiar, it should. I have referenced his research in prior posts here, here, and here, as he is well-known in the area of VAM research, in particular, for a series of papers in which he provided evidence that students who are assigned to classrooms in non-random ways can create biased, teacher-level value-added scores. If random assignment were the norm (i.e., whereby students are randomly assigned to classrooms and, ideally, teachers are randomly assigned to teach those classrooms of randomly assigned students), teacher-level bias would not be so problematic. However, given research I also recently conducted on this topic (see here), random assignment (at least in the state of Arizona) occurs 2% of the time, at best. Principals otherwise outright reject the notion, as random assignment is not viewed as in “students’ best interests,” regardless of whether randomly assigning students to classrooms might mean “more accurate” value-added output as a result.

So it seems, we either get the statistical controls right (which I doubt is possible) or we randomly assign (which I highly doubt is possible). Otherwise, we are left wondering whether value-added analyses will ever work as per their intended (and largely ideal) purposes, especially when it comes to evaluating and holding accountable America’s public school teachers for their effectiveness.

—–

In case you’re interested, Chetty et al. have responded to Rothstein’s critique. Their full response can be accessed here. Not surprisingly, they first highlight that Rothstein (and another set of their colleagues at Harvard) replicated their results. That “value-added (VA) measures of teacher quality show very consistent properties across different settings” is what Chetty et al. focus on first and foremost. What they dismiss, however, is whether the main concerns raised by Rothstein threaten the validity of their methods and their conclusions. They also dismiss the fact that Rothstein had already addressed their counterpoints, in Appendix B of his paper, before they published them, given that Chetty et al. shared their concerns with Rothstein prior to his study’s release.

Nonetheless, the concerns Chetty et al. attempt to counter are whether their “teacher-switching” approach was invalid, and whether the “exclusion of teachers with missing [value-added] estimates biased the[ir] conclusion[s]” as well. The extent to which missing data bias value-added estimates has also been discussed before, when statisticians force the assumption in their analyses that missing data are “missing at random” (MAR), which is a difficult (although for some, like Chetty et al., necessary) assumption to swallow (see, for example, the Braun 2004 reference here).

Observations: “Where Most of the Action and Opportunities Are”

In a study just released on the website of Education Next, researchers discuss results from their recent examinations of “new teacher-evaluation systems in four school districts that are at the forefront of the effort [emphasis added] to evaluate teachers meaningfully.” The four districts’ evaluation systems were based on classroom observations, achievement test gains for the whole school (i.e., school-level value-added), performance on non-standardized tests, and some form of measure of teacher professionalism and/or teacher commitment to the school community.

Researchers found the following: The ratings assigned to teachers across the four districts’ leading evaluation systems, based primarily (i.e., 50-75%) on observations (and not including value-added scores except for the amazingly low 20% of teachers who were VAM eligible), were “sufficiently predictive” of a teacher’s future performance. They later define “sufficiently predictive” in terms of predictive validity coefficients that ranged from 0.33 to 0.38, which are actually quite “low” coefficients in reality, though they also call these same coefficients “quite predictive,” regardless.

While such low coefficients are to be expected as per others’ research on this topic, one must question how the authors came up with their determinations that these were “sufficiently” and “quite” predictive. The authors of this article qualify these classifications later, though, writing that “[t]he degree of correlation confirms that these systems perform substantially better in predicting future teacher performance than traditional systems based on paper credentials and years of experience.” They explain further that these correlations are “in the range that is typical of systems for evaluating and predicting future performance in other fields of human endeavor, including, for example, those used to make management decisions on player contracts in professional sports.” So it seems their qualifications were based on a “better than,” or relative, but not empirical judgment (see also Bill Honig’s comments at the bottom of this article). That being said, this is something to certainly consume critically, particularly in the ways they’ve inappropriately categorized these coefficients.
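
For a rough sense of what coefficients of this size mean, here is the back-of-the-envelope arithmetic (mine, not the authors’):

```python
# Rough interpretation (my arithmetic, not the authors'): a predictive
# validity coefficient r explains about r**2 of the variance in the outcome.
for r in (0.33, 0.38):
    print(f"r = {r:.2f} -> about {r**2:.0%} of future-performance variance explained")
# r = 0.33 -> about 11% of future-performance variance explained
# r = 0.38 -> about 14% of future-performance variance explained
```

In other words, even the better of these composites leaves roughly 85% to 90% of the variation in future performance unexplained, which is worth keeping in mind when reading “sufficiently predictive.”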

Researchers also found the following: “The stability generated by the districts’ evaluation systems range[d] from a bit more than 0.50 for teachers with value-added scores to about 0.65 when value-added is not a component of the score.” In other words, districts’ “[e]valuation scores that [did] not include value-added [were] more stable [when districts] assign[ed] more weight to observation scores, which [were demonstrably] more stable over time than value-added scores.” Put differently, observational scores outperformed value-added scores. Likewise, the stability they observed in the value-added scores (i.e., 0.50) fell within the upper range of the coefficients also reported elsewhere in the research. So, researchers also confirmed that teacher-level value-added scores are still quite inconsistent from year to year, as they (too often) vary widely, and wildly, over time.
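
And for intuition about what a year-to-year stability of roughly 0.50 implies, here is another small, generic sketch, under the deliberately generous assumption that each teacher’s true effectiveness does not change at all between years (my simulation, not the authors’ data):

```python
import numpy as np

rng = np.random.default_rng(4)
n_teachers = 1000

# Assume true effectiveness is perfectly stable, and that a single year's
# VAM score measures it with as much noise as signal.
true_effect = rng.normal(0, 1.0, n_teachers)
year1 = true_effect + rng.normal(0, 1.0, n_teachers)
year2 = true_effect + rng.normal(0, 1.0, n_teachers)

print("year-to-year correlation:", round(np.corrcoef(year1, year2)[0, 1], 2))  # ~0.50

# Even with perfectly stable true effectiveness, many teachers change
# performance quintile from one year to the next at this noise level.
q1 = np.digitize(year1, np.quantile(year1, [0.2, 0.4, 0.6, 0.8]))
q2 = np.digitize(year2, np.quantile(year2, [0.2, 0.4, 0.6, 0.8]))
print("share of teachers changing quintile:", round(np.mean(q1 != q2), 2))
```

A stability of 0.50, in other words, is consistent with most teachers shifting performance categories from one year to the next even when nothing about their actual teaching has changed.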

Researchers’ key recommendations, as based on improving the quality of data derived from classroom observations: “Teacher evaluations should include two to three annual classroom observations, with at least one observation being conducted by a trained external observer.” They provide some evidence in support of this assertion in the full article. In addition, they assert that “[c]lassroom observations should carry at least as much weight as test-score gains in determining a teacher’s overall evaluation score,” although I would argue, as based on their (and others’) results, that they certainly made a greater case for observations in lieu of teacher-level value-added throughout their paper.

Put differently, and in their own words – words with which I agree: “[M]ost of the action and nearly all the opportunities for improving teacher evaluations lie in the area of classroom observations rather than in test-score gains.” So there it is.

Note: The authors of this article do also talk about the “bias” inherent in classroom observations. As based on their findings, for example, they also recommend that “districts adjust classroom observation scores for the degree to which the students assigned to a teacher create challenging conditions for the teacher. Put simply, the current observation systems are patently unfair to teachers who are assigned less-able and -prepared students. The result is an unintended but strong incentive for good teachers to avoid teaching low-performing students and to avoid teaching in low-performing schools.” While I did not highlight these sections above, do click here if wanting to read more.

American Statistical Association (ASA) Position Statement on VAMs

Included in my most recent post, about the Top 14 research-based articles about VAMs, was a great research-based statement that was released just last week by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

It is short, accessible, easy to understand, and hard to dispute, so I wanted to be sure nobody missed it as this is certainly a must read for all of you following this blog, not to mention everybody else dealing/working with VAMs and their related educational policies. Likewise, this represents the current, research-based evidence and thinking of probably 90% of the educational researchers and econometricians (still) conducting research in this area.

Again, the ASA is the best statistical organization in the U.S. and likely one of, if not the, best statistical associations in the world. Some of the most important parts of their statement, taken directly from their full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that, included in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).
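
To illustrate the kind of precision information the ASA is asking for (a generic, hypothetical example, not one drawn from the ASA statement itself), consider the standard error and confidence interval that should accompany a teacher-level estimate based on a single class of students:

```python
import numpy as np

# Hypothetical class of 25 student gain scores (in scale-score points).
rng = np.random.default_rng(2)
gains = rng.normal(3.0, 12.0, 25)

estimate = gains.mean()
se = gains.std(ddof=1) / np.sqrt(len(gains))       # standard error of the mean gain
ci = (estimate - 1.96 * se, estimate + 1.96 * se)  # approximate 95% interval

print(f"teacher estimate: {estimate:.1f}, SE: {se:.1f}, "
      f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
# With only 25 students the interval is wide, which is exactly the kind of
# uncertainty the ASA says should always be reported alongside the estimate.
```

Reporting only the point estimate, without the interval around it, is what the ASA (and this blog) mean by statistical errors being “buried.”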

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

Do give the position statement a read and use it as needed!

Correction: Make the “Top 13” VAM Articles the “Top 14”

As per my most recent post earlier today, about the Top 13 research-based articles about VAMs, lo and behold, another great research-based statement was just this week released by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

So, let’s make the Top 13 the Top 14 and call it a day. I say “day” deliberately; this is such a hot and controversial topic it is often hard to keep up with the literature in this area, on literally a daily basis.

As per this outstanding statement released by the ASA – the best statistical organization in the U.S. and one of, if not the, best statistical associations in the world – some of the most important parts of their statement, taken directly from their full statement as I see them, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.

Also important to point out is that, included in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and often not reported with results (i.e., in terms of confidence intervals or standard errors).

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

If you’re going to choose one article to read and review, this week or this month, and one that is thorough and to the key points, this is the one I recommend you read…at least for now!

Another Lawsuit in Tennessee

As per Diane Ravitch’s blog, “The Tennessee Education Association filed a second lawsuit against the use of value-added assessment (called TVAAS in Tennessee), this time including extremist Governor Haslam and ex-TFA state commissioner Huffman in their suit.”

As per a more detailed post about this lawsuit, “The state’s largest association for teachers filed a second lawsuit on behalf of a Knox County teacher, calling the use of the Tennessee Value-Added Assessment System (TVAAS), which uses students’ growth on state assessments to evaluate teachers, unconstitutional.

Farragut Middle School eighth grade science teacher Mark Taylor believes he was unfairly denied a bonus after his value-added estimate was based on the standardized test scores of 22 of his 142 students. “Mr. Taylor teaches four upper-level physical science courses and one regular eighth grade science class,” said Richard Colbert, TEA general counsel, in a press release. “The students in the upper-level course take a locally developed end-of-course test in place of the state’s TCAP assessment. As a result, those high-performing students were not included in Mr. Taylor’s TVAAS estimate.”

Taylor received ‘exceeding expectations’ classroom observation scores, but a low value-added estimate reduced his final evaluation score below the requirement to receive the bonus.

The lawsuit includes six counts against the governor, commissioner and local school board.

TEA’s general counsel argues the state has violated Taylor’s 14th Amendment right to equal protection from “irrational state-imposed classifications” by using a small fraction of his students to determine his overall effectiveness.

TEA filed its first TVAAS lawsuit last month on behalf of Knox County teacher Lisa Trout, who was denied the district’s bonus. The lawsuit also cites the arbitrariness of TVAAS estimates that use test results of only a small segment of a teacher’s students to estimate her overall effectiveness.

TEA says it expects additional lawsuits to be filed so long as the state continues to tie more high-stakes decisions to TVAAS estimates.”

Florida’s Released Albeit “Flawed” VAM Data

The Florida Times-Union’s Lead Op-Ed Letter on Monday was about why the value-added data recently released by the Florida Department of Education have, at best, made it “clearer than ever that the data is meaningless,” which is made even more unfortunate by the nearly four years and millions of dollars (including human resource dollars) spent on perfecting the state’s VAM and its advanced-as-accurate estimates.

In the letter, Andy Ford, the current president of the Florida Education Association (yes, the union), writes, in sum and among other key points:

  • “The lists released by the DOE are a confusing mess.”
  • “Throughout the state, teachers who have been honored as among the best in their districts received low VAM marks.”
  • “Band teachers, physical education teachers and guidance counselors received VAM ratings despite not teaching subjects that are tested.”
  • “Teachers who worked at their schools for a few weeks received VAM scores as did teachers who retired three years ago.”
  • “A given teacher may appear to have differential effectiveness from class to class, from year to year and from test to test. Ratings are most unstable at the upper and lower ends where the ratings are most likely to be used to determine high or low levels of effectiveness…Most researchers agree that VAM is not appropriate as a primary measure for evaluating individual teachers. Reviews of research on value-added methods have concluded that they are too unstable and too vulnerable to many sources of error to be used for teacher evaluation.”

“Once again the state of Florida has proven that it puts test scores above everything else in public education. And once again it provided false data that misleads more than informs…When will our political leaders and the DOE stop chasing these flawed data models and begin listening to the teachers, education staff professionals, administrators and parents of Florida?”

The union “fully supports teacher accountability. But assessments of teachers, like assessments of students, must be valid, transparent and multi-faceted. These value-added model calculations are none of these.”

Research Study: Missing Data and VAM-Based Bias

A new Assistant Professor here at ASU, from outside the College of Education but in the College of Mathematical and Natural Sciences, also specializes in value-added modeling (and statistics). Her name is Jennifer Broatch; she is a rising star in this area of research, and she just sent me an article I missed, just read, and certainly found worth sharing with you all.

The peer-reviewed article, published in Statistics and Public Policy this past November, is fully cited and linked below so that you all can read it in full. But in terms of its CliffsNotes version, researchers evidenced the following two key findings:

First, researchers found that “VAMs that include shorter test score histories perform fairly well compared to those with longer score histories.” The current thinking is that we need at least two if not three years of data to yield reliable estimates, or estimates that are consistent over time (which they should be). These authors argue that, when three years of data are required, so much data go missing that the target is not worth shooting for. Rather, again they argue, this is an issue of trade-offs. This is certainly something to consider, as long as we continue to understand that all of this is about “tinkering towards a utopia” (Tyack & Cuban, 1997) that I’m not at all certain exists in terms of VAMs and VAM-based accuracy.

Second, researchers found that, “the decision about whether to control for student covariates [or background/demographic variables] and schooling environments, and how to control for this information, influences [emphasis added] which types of schools and teachers are identified as top and bottom performers. Models that are less aggressive in controlling for student characteristics and schooling environments systematically identify schools and teachers that serve more advantaged students as providing the most value-added, and correspondingly, schools and teachers that serve more disadvantaged students as providing the least.”

This certainly adds evidence to the research on VAM-based bias. While there are many researchers who still claim that controlling for student background variables is unnecessary when using VAMs, and if anything bad practice because controlling for such demographics causes perverse effects (e.g., if teachers focus relatively less on such students who are given such statistical accommodations or boosts), this study adds more evidence that “to not control” for such demographics does indeed yield biased estimates. The authors do not disclose, however, how much bias is still “left over” after the controls are used; hence, this is still a very serious point of contention. Whether the controls, even if used, function appropriately is still something to be taken in earnest, particularly when consequential decisions are to be tied to VAM-based output (see also “The Random Assignment of Students into Elementary Classrooms: Implications for Value-Added Analyses and Interpretations”).
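
As a stylized illustration of this specification sensitivity (my own simulation, not the Ehlert et al. analysis, with all numbers invented), compare a model that ignores a student disadvantage indicator with one that adjusts for it:

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, class_size = 50, 30

true_effects = rng.normal(0, 1.0, n_teachers)

# Assumed setup: classes differ in their share of disadvantaged students, and
# disadvantage is associated with lower measured growth for reasons outside
# the teacher's control.
poverty_share = rng.uniform(0, 1, n_teachers)
poverty = rng.random((n_teachers, class_size)) < poverty_share[:, None]
gains = (true_effects[:, None] - 4.0 * poverty
         + rng.normal(0, 6.0, (n_teachers, class_size)))

# "Less aggressive" model: raw mean gains, no student covariates.
va_no_controls = gains.mean(axis=1)

# "More aggressive" model: estimate the average disadvantage gap and remove it
# (a real VAM would do this in a regression with many covariates at once).
gap = gains[poverty].mean() - gains[~poverty].mean()
va_controls = (gains - gap * poverty).mean(axis=1)

rank_no = np.argsort(np.argsort(-va_no_controls))
rank_with = np.argsort(np.argsort(-va_controls))

print("corr(uncontrolled VA, class disadvantage share):",
      round(np.corrcoef(va_no_controls, poverty_share)[0, 1], 2))
print("teachers moving 10+ rank positions across the two models:",
      int(np.sum(np.abs(rank_no - rank_with) >= 10)))
```

Under these assumptions, the uncontrolled model systematically favors teachers with more advantaged classes, and a nontrivial number of teachers change rank substantially depending only on which specification is chosen, which is exactly the pattern the study reports.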

Citation: Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2013, November). The sensitivity of value-added estimates to specification adjustments: Evidence from school- and teacher-level models in Missouri. Statistics and Public Policy, 1(1), 19-27. doi: 10.1080/2330443X.2013.856152

Data Quality and VAM Evaluations, by “A Concerned New Mexico Parent”

“A Concerned New Mexico Parent” wrote in the following. Be sure to follow along, as this parent demonstrates some of the very real, real-world issues with the data that are being used to calculate VAMs, given their missingness, incompleteness, arbitrariness, and the like. Yet VAMs “spit out” mathematical estimates that, because they are based on advanced statistics, are to be trusted, even though the resultant “estimates” mask all of the chaos (demonstrated below) behind the sophisticated statistical scene.

(S)he writes:

As a parent, I have been concerned with New Mexico’s implementation of the trifecta of bad education policies – stack-ranking of teachers, the Common Core curriculum, and the use of value-added models (VAMs) for teacher evaluation.

The Vamboozled blog has done a great job in educating teachers and the public about the pitfalls of the VAM approach to teacher evaluations. The recent blog post by an Arizona teacher about her “value-added” contribution caused me to investigate the issue closer to home.

Hanna Skandera (the acting head of the New Mexico Public Education Department and a Jeb Bush protégé) recently posted information on how data are to be handled for the New Mexico VAM system. Click here to review Skandera’s post.

With these two internet postings in mind, I decided to investigate the quality of data available for evaluating teachers at my local school.

According to researchers (and basic common sense), the more data you have, the better. According to the VAM literature, everyone (critics and proponents alike) seems to agree that at least three years of test scores are the minimum necessary for a legitimate [valid] calculation. We should have that much data per teacher as well.

As you will soon see, even this seemingly simple requirement is often not met in a real life school.

To calculate any VAM-type score, data are needed. Specifically, the underlying student test scores and teacher data on which everything else depends are crucial.

One type of “problem data” is data that scramble several people together into one number. For example, if you and I team-teach a class of students, it is not possible to tell how much I contributed versus how much you contributed to the final student score [regardless of what the statisticians say they can do in terms of fractional effects]. Anyone who has ever worked on a team project knows that not everyone contributes equally. Any division of the score is purely arbitrary (aka “made up”) and indefensible. This is the situation referenced by the Arizona teacher who loses her students to math tutors for months at a time.

A second type of “problem data” is data that change kind mid-stream. Suppose we have a teacher who teaches 3rd grade one year and is then switched to 6th grade the next. No one who has taught would ever believe that teaching 3rd graders is the same as teaching 6th graders. Just mentally imagine using a 3rd grade approach with 6th grade boys, and I think you can see the problems. So, any teacher who switches grades is likely to have questionable data if all of her scores are considered the same.

A third type of “problem data” is not really a problem but is simply problematic given the absence of information. Teachers who leave the school, who are missing data for certain students [which occurs in non-random patterns], and the like often do not have data to support accurate VAM calculations.

Finally, a fourth type of “problem data” is data limited by too few observations. A teacher who has exactly one year of experience would have only one set of test scores, yet this is too few for any meaningful calculation. The New Mexico approach, as explained in the Skandera posting, then, is to “fill in” the data with surrogate [observational] data.

If one is presented with two observations – why not just use the surrogate data only? We already have teacher observations and evaluations without the added expense of VAM calculations with specialized software. And how does using these less precise data help a VAM become valid? It would be like the difference between taking your child’s temperature with a precise in-ear thermometer or simply putting the back of your hand against their cheek. Both measurements can probably tell you whether your child is sick, but both are not equally accurate.

Regardless, it appears that if we want to have a good statistical calculation, we need to make sure we have the following:

  1. At least three years of teaching data.
  2. No team teaching or sharing of classes or students.
  3. No grade changing (the most recent three years should be at the same grade).
  4. Data must include the current year.

With these four very modest assumptions for ensuring at least minimally good data, how do we fare in real life?
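
Expressed as an explicit filter, the four criteria look roughly like the following sketch (the record fields and helper function are hypothetical, invented only to mirror the table below; they are not the PED’s actual data format):

```python
from dataclasses import dataclass

CURRENT_YEAR = 2014  # school year 2013-2014

@dataclass
class TeacherRecord:
    name: str
    grade: int
    first_year: int      # first year teaching this grade at this school
    last_year: int       # last year teaching this grade (2014 = still teaching)
    team_teaches: bool

def has_valid_vam_data(t: TeacherRecord) -> bool:
    years_at_grade = t.last_year - t.first_year
    return (years_at_grade >= 3              # 1. at least three years at this grade (also covers 3)
            and not t.team_teaches           # 2. no team teaching or shared students
            and t.last_year == CURRENT_YEAR) # 4. data include the current year

# Example: the one teacher in the table below who passes, and one who does not.
print(has_valid_vam_data(TeacherRecord("Condron", 6, 2011, 2014, False)))  # True
print(has_valid_vam_data(TeacherRecord("Sharrow", 5, 2012, 2014, False)))  # False: only two years
```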

I decided to chart the information of my local school, in light of the VAM data requirements, and I was shocked by what I found.

The results of real world teacher churn are shown below. Each line shows a teacher, their grade-level, the number of years teaching at that grade level, and their data status for a VAM calculation in school-year 2013-2014.

The chart includes all teachers from one school in grades 3 through 6. These are the only grades that currently take the New Mexico state-wide standardized test. The data are for the time period Aug 2010 to Feb 2014.

The “Teacher” and “Grade” columns are self-explanatory. The “Years Active” column shows the dates when the teacher taught at the school. The “Years at Same Grade Level” column shows the number of years a teacher has taught at a consistent grade level.

The “Data Status” column explains briefly why the teachers’ data may be invalid. More specifically in the “Data status” column, “Team teacher” means that the teacher shares duties with another teacher for the same set of students; “No longer teaching” means the teacher may have taught 3rd-6th grades in the past but is not currently teaching any of these grades; “Insufficient data” means the teacher has taught less than three years and does not have sufficient data for valid statistical calculations; “Data not current” means the teacher has taught 3rd-6th grade but not in the year of the VAM calculation; “Grade change” means the teacher changed grade levels sometime during the 2010-2014 school years; and “Valid data” means the data for the teacher appears to be reliable.

As you might guess, all of the teachers’ names have been changed; all of the other data are correct.

 Table of Teacher Data Quality

Teacher (n=28) | Grade | Years Active | Years at Same Grade Level | Data Status
-------------- | ----- | ------------ | ------------------------- | -----------
Govan | 3 | 2010-2014 | 4 | Team teacher
Grubb | 3 | 2010-2014 | 4 | Team teacher
Durling | 3 | 2010-2014 | 4 | Team teacher
Jen | 3 | 2010-2012 | 2 | No longer teaching 3rd-6th
Bohanon | 3 | 2012-2014 | 2 | Insufficient data
Mcanulty | 3 | 2013-2014 | 1 | Insufficient data
Saum | 4 | 2010-2012 | 2 | No longer teaching 3rd-6th
Wirtz | 4 | 2010-2013 | 3 | No longer teaching 3rd-6th
Mccaslin | 4 | 2010-2011 | 1 | No longer teaching 3rd-6th
Finamore | 4 | 2010-2012 | 2 | No longer teaching 3rd-6th
Sharrow | 4 | 2011-2012 | 1 | Grade change
Sharrow | 5 | 2012-2014 | 2 | Insufficient data
Kime | 4 | 2012-2014 | 2 | Insufficient data
Blish | 4 | 2012-2014 | 2 | Insufficient data
Obregon | 4 | 2012-2014 | 2 | Insufficient data
Fraise | 4 | 2013-2014 | 1 | Insufficient data
Franzoni | 4 | 2013-2014 | 1 | Grade change
Franzoni | 5 | 2010-2013 | 3 | Grade change
Franzoni | 6 | 2012-2013 | 1 | Grade change
Henderson | 5 | 2010-2012 | 2 | Insufficient data
Regan | 5 | 2010-2014 | 4 | Team teacher
Kalis | 5 | 2011-2013 | 2 | No longer teaching 3rd-6th
Combest | 5 | 2013-2014 | 1 | Insufficient data
Meister | 5 | 2013-2014 | 1 | Insufficient data
Treacy | 6 | 2010-2011 | 1 | No longer teaching 3rd-6th
Sprayberry | 6 | 2010-2014 | 4 | Team teacher
Locust | 6 | 2010-2011 | 1 | No longer teaching 3rd-6th
Condron | 6 | 2011-2014 | 3 | Valid data
Monteiro | 6 | 2011-2014 | 3 | Team teacher
Arnwine | 6 | 2011-2012 | 1 | No longer teaching 3rd-6th
Sebree | 6 | 2013-2014 | 1 | Insufficient data

As demonstrated in the Table above, 28 teachers taught 65 classes of 3rd through 6th grade. Note that two teachers (Sharrow and Franzoni) taught several grades during this time period. Remember, these are the only grades (3rd – 6th) that are given the New Mexico standardized test.

Once we exclude questionable data, we are reduced to exactly one (1/28 = 3.6%) solitary teacher (Condron) who taught three classes of 6th graders during the last four years for whom VAM data were available. All of the other data required for the VAM calculation would be provided by surrogate data.

According to the Skandera posting, the missing data would be replaced with the much more subjective and less reliable observational data. It is unclear how the “team teacher” data would be assessed. There is no scientific means to disaggregate the results for the two teachers. Any assignment of “value-added” for team teachers would be [almost] completely arbitrary [as typically based on teacher self-reported proportions of time taught].

Thus, we can see that for the 28 teachers listed in the above table, the calculations would then be based on the following data substitutions:

Table of Surrogate Data Substitutions

Data Status | Number of Teachers | Surrogate Data Information | Notes on Data Quality
----------- | ------------------ | -------------------------- | ---------------------
No longer teaching 3rd-6th | 9 (32.1%) | Excluded from calculation | Teachers that no longer teach the students in question would be excluded from the VAM calculation.
Team teacher | 6 (21.4%) | Arbitrary division of credit for VAM score | Arbitrary (aka “self-reported”) division of responsibility.
Insufficient data (missing one year of test data) | 6 (21.4%) | The missing data would be replaced by observational data | These teachers would have 1/3 of their VAM calculation based on observational data.
Insufficient data (missing two years of test data) | 5 (17.9%) | The missing data would be replaced by observational data | These teachers would have 2/3 of their VAM calculation based on observational data.
Grade change | 1 (3.6%) | One teacher (Franzoni) is still teaching 3rd-6th after a grade change | Either her 4th and 5th grade data would be combined in some fashion, or the data for her most recent teaching (6th grade) would be based on two years of observational data. In either case, the quality of her data would be very suspect.
Valid data | 1 (3.6%) | There are at least three years of valid data | The data would be valid for the VAM calculation.

Again, only one teacher, as demonstrated, has valid data. I can guarantee that these major problems with the integrity or quality of these data were NOT publicized when the teacher VAM scores were released. Teachers were still held responsible, regardless.

Remember that for all of the other teachers on campus (K-2, computer, fine arts, PE, etc.), their VAM scores would be derived from the campus average. This average, as well, would be contaminated by the very problematic data revealed in these tables.

Does this approach of calculating VAM scores with such questionable and shaky data seem fair or equitable to the teachers or to this school? I believe not.