Difficulties When Combining Multiple Teacher Evaluation Measures

A new study, “Approaches for Combining Multiple Measures of Teacher Performance,” with special attention paid to reliability, validity, and policy, was recently published in Educational Evaluation and Policy Analysis, the highly esteemed journal sponsored by the American Educational Research Association (AERA). You can find the free and full version of this study here.

In this study, authors José Felipe Martínez – Associate Professor at the University of California, Los Angeles, Jonathan Schweig – of the RAND Corporation, and Pete Goldschmidt – Associate Professor at California State University, Northridge, and creator of the value-added model (VAM) at legal issue in the state of New Mexico (see, for example, here), set out to help practitioners “combine multiple measures of complex [teacher evaluation] constructs into composite indicators of performance…[using]…various conjunctive, disjunctive (or complementary), and weighted (or compensatory) models” (p. 738). The multiple measures in this study include teachers’ VAM estimates, observational scores, and student survey results.

While authors ultimately suggest that “[a]ccuracy and consistency are greatest if composites are constructed to maximize reliability,” perhaps more importantly, especially for practitioners, authors note that “accuracy varies across models and cut-scores and that models with similar accuracy may yield different teacher classifications.”

This, of course, has huge implications for teacher evaluation systems based upon multiple measures, in that “accuracy” speaks to “validity,” and “valid” decisions cannot be made from “invalid” or “inaccurate” data that can change so arbitrarily. In other words, a decision that a teacher is this or that will likely never actually mean this or that; in fact, this or that might be close, not so close, or entirely wrong, which is a pretty big deal when the measures being combined are assumed to function otherwise. This is especially interesting given, again and as stated prior, that the third author on this piece – Pete Goldschmidt – is the person consulting with the state of New Mexico. Again, this is the state that is still trying to move forward with attaching consequences to teachers’ multiple evaluation measures, as assumed (by the state, but not the state’s consultant?) to be accurate and correct (see, for example, here).
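To make concrete how the same underlying scores can classify a teacher differently under different combination rules, here is a minimal sketch of the conjunctive, disjunctive, and weighted (compensatory) models the authors compare. The scores, cut-score, and weights below are hypothetical illustrations, not data from the study:

```python
# Hypothetical sketch (not from the study): three standardized measures
# for one teacher, each scored 0-100, with a cut-score of 50 per measure.
vam, observation, survey = 48, 62, 55
cut = 50

# Conjunctive model: the teacher must pass EVERY measure.
conjunctive = all(score >= cut for score in (vam, observation, survey))

# Disjunctive (complementary) model: passing ANY one measure suffices.
disjunctive = any(score >= cut for score in (vam, observation, survey))

# Weighted (compensatory) model: strengths on one measure offset
# weaknesses on another, via a weighted composite.
weights = (0.5, 0.25, 0.25)  # illustrative policy weights
composite = sum(w * s for w, s in zip(weights, (vam, observation, survey)))
compensatory = composite >= cut

print(conjunctive, disjunctive, round(composite, 2), compensatory)
# The same teacher fails conjunctively (48 < 50 on VAM) but passes both
# disjunctively and compensatorily -- different models, different labels.
```

The point of the sketch is exactly the authors’ finding: nothing about the teacher changes across the three rules, yet the classification does.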

Indeed, this is a highly inexact and imperfect social science.

Authors also found that “policy weights yield[ed] more reliable composites than optimal prediction [i.e., empirical] weights” (p. 750). In addition, “[e]mpirically derived weights may or may not align with important theoretical and policy rationales” (p. 750); hence, the authors collectively recommended that others use theory and policy when combining measures, while also noting that doing so would (a) still yield overall estimates that would “change from year to year as new crops of teachers and potentially measures are incorporated” (p. 750) and (b) likely “produce divergent inferences and judgments about individual teachers” (p. 751). Authors, therefore, concluded that “this in turn highlights the need for a stricter measurement validity framework guiding the development, use, and monitoring of teacher evaluation systems” (p. 751), given that all of this also makes the social science arbitrary, which is a legal issue in and of itself, as also quasi noted above.
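The distinction between policy weights and optimal prediction (i.e., empirical) weights can be sketched as follows. Everything here is an illustrative assumption, not the study’s data or models: the simulated measures, the external criterion, and the 50-25-25 policy weights are all made up to show the mechanics only:

```python
import numpy as np

# Simulated scores for 50 teachers on three measures (VAM, observation,
# survey), plus an external criterion a composite is meant to predict.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
criterion = X @ np.array([0.7, 0.2, 0.1]) + rng.normal(scale=0.5, size=50)

# Policy weights: fixed a priori by theory or statute (e.g., 50-25-25).
policy_w = np.array([0.50, 0.25, 0.25])

# Optimal-prediction (empirical) weights: a least-squares fit to the
# criterion, rescaled to sum to one for comparability.
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
empirical_w = beta / beta.sum()

print(policy_w, np.round(empirical_w, 2))
# The empirical weights track whatever sample was fit (and so shift as new
# crops of teachers are incorporated), while the policy weights stay fixed.
```

This is the trade-off the authors describe: empirical weights chase predictive accuracy sample by sample, while policy weights stay stable and interpretable but may not be optimal predictors.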

Now, I will admit that those who are (perhaps unwisely) devoted to the (in many ways forced) combining of these measures (despite what low reliability indicators already mean for validity, which goes unaddressed in this piece) might find some value in this piece (e.g., how conjunctive and disjunctive models vary; how principal component, unit weight, policy weight, and optimal prediction approaches vary). I will also note, however, that forcing the fit of such multiple measures in such ways is certainly unwise, especially without a thorough background in and understanding of reliability and validity and of what reliability means for validity (i.e., rather high levels of reliability are required before any valid inferences, and especially high-stakes decisions, can be made).

If high-stakes decisions are not to be attached, such nettlesome (but still necessary) educational measurement issues are of less importance. But any positive (e.g., merit pay) or negative (e.g., performance improvement plan) consequence that comes about without adequate reliability and validity should certainly give pause, if not justify a grievance, as based on the evidence provided herein, called for herein, and required pretty much every time such a decision is to be made (and before it is made).

Citation: Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi: 10.3102/0162373716666166 Retrieved from http://journals.sagepub.com/doi/pdf/10.3102/0162373716666166

Note: New Mexico’s data were not used for analytical purposes in this study, unless any districts in New Mexico participated in the Bill & Melinda Gates Foundation’s Measures of Effective Teaching (MET) study yielding the data used for analytical purposes herein.

Special Issue of “Educational Researcher” (Paper #8 of 9, Part I): A More Research-Based Assessment of VAMs’ Potentials

Recall that the peer-reviewed journal Educational Researcher (ER) published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#8 of 9), which is actually a commentary titled “Can Value-Added Add Value to Teacher Evaluation?” This commentary is authored by Linda Darling-Hammond – Professor of Education, Emeritus, at Stanford University.

Like with the last commentary reviewed here, Darling-Hammond reviews some of the key points taken from the five feature articles in the aforementioned “Special Issue.” More specifically, though, Darling-Hammond “reflect[s] on [these five] articles’ findings in light of other work in this field, and [she] offer[s her own] thoughts about whether and how VAMs may add value to teacher evaluation” (p. 132).

She starts her commentary with VAMs “in theory,” in that VAMs COULD accurately identify teachers’ contributions to student learning and achievement IF (and this is a big IF) the following three conditions were met: (1) “student learning is well-measured by tests that reflect valuable learning and the actual achievement of individual students along a vertical scale representing the full range of possible achievement measures in equal interval units;” (2) “students are randomly assigned to teachers within and across schools—or, conceptualized another way, the learning conditions and traits of the group of students assigned to one teacher do not vary substantially from those assigned to another;” and (3) “individual teachers are the only contributors to students’ learning over the period of time used for measuring gains” (p. 132).

None of these things is actually true (or near true, nor will they likely ever be true) in educational practice, however. Hence, we continue to observe the errors that continue to prevent VAM use for their intended utilities, even with the sophisticated statistics meant to mitigate errors and account for the above-mentioned, let’s call them, “less than ideal” conditions.

Other pervasive and perpetual issues surrounding VAMs, as highlighted by Darling-Hammond per each of the three categories above, pertain to (1) the tests used to measure value-added, which are very narrow, focus on lower-level skills, and are manipulable. These tests in their current form cannot effectively measure the learning gains of a large share of students who are above or below grade level, given a lack of sufficient coverage and stretch. As per Haertel (2013, as cited in Darling-Hammond’s commentary), this “translates into bias against those teachers working with the lowest-performing or the highest-performing classes”…and “those who teach in tracked school settings.” It is also important to note here that the new tests created by the Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced multistate consortia “will not remedy this problem…Even though they will report students’ scores on a vertical scale, they will not be able to measure accurately the achievement or learning of students who started out below or above grade level” (p. 133).

With respect to (2) above, on the equivalence (or rather non-equivalence) of the groups of students across teachers’ classrooms whose VAM scores are relativistically compared, the main issue here is that “the U.S. education system is one of the most segregated and unequal in the industrialized world…[likewise]…[t]he country’s extraordinarily high rates of childhood poverty, homelessness, and food insecurity are not randomly distributed across communities…[Add] the extensive practice of tracking to the mix, and it is clear that the assumption of equivalence among classrooms is far from reality” (p. 133). Whether sophisticated statistics can control for all of this variation is one of the most debated issues surrounding VAMs and their levels of outcome bias, accordingly.

And as per (3) above, “we know from decades of educational research that many things matter for student achievement aside from the individual teacher a student has at a moment in time for a given subject area. A partial list includes the following [that are also supposed to be statistically controlled for in most VAMs, but are also clearly not controlled for effectively enough, if even possible]: (a) school factors such as class sizes, curriculum choices, instructional time, availability of specialists, tutors, books, computers, science labs, and other resources; (b) prior teachers and schooling, as well as other current teachers—and the opportunities for professional learning and collaborative planning among them; (c) peer culture and achievement; (d) differential summer learning gains and losses; (e) home factors, such as parents’ ability to help with homework, food and housing security, and physical and mental support or abuse; and (f) individual student needs, health, and attendance” (p. 133).

“Given all of these influences on [student] learning [and achievement], it is not surprising that variation among teachers accounts for only a tiny share of variation in achievement, typically estimated at under 10%” (see, for example, highlights from the American Statistical Association’s (ASA’s) Position Statement on VAMs here). “Suffice it to say [these issues]…pose considerable challenges to deriving accurate estimates of teacher effects…[A]s the ASA suggests, these challenges may have unintended negative effects on overall educational quality” (p. 133). “Most worrisome [for example] are [the] studies suggesting that teachers’ ratings are heavily influenced [i.e., biased] by the students they teach even after statistical models have tried to control for these influences” (p. 135).

Other “considerable challenges” include the following: VAM output is grossly unstable, given the swings and variations observed in teacher classifications across time, and VAM output is “notoriously imprecise” (p. 133), given the other errors observed as caused, for example, by varying class sizes (e.g., Sean Corcoran (2010) documented with New York City data that the “true” effectiveness of a teacher ranked in the 43rd percentile could have had a range of possible scores from the 15th to the 71st percentile, qualifying as “below average,” “average,” or close to “above average”). In addition, practitioners including administrators and teachers are skeptical of these systems, and their (appropriate) skepticisms are impacting the extent to which they use and value their value-added data; they note that they value their observational data (and the professional discussions surrounding them) much more. Also important is that another likely unintended effect exists (i.e., citing Susan Moore Johnson’s essay here) when statisticians’ efforts to parse out learning to calculate individual teachers’ value-added cause “teachers to hunker down and focus only on their own students, rather than working collegially to address student needs and solve collective problems” (p. 134). Related, because “the technology of VAM ranks teachers against each other relative to the gains they appear to produce for students, [hence] one teacher’s gain is another’s loss, thus creating disincentives for collaborative work” (p. 135). This is what Susan Moore Johnson termed the egg-crate model, or rather the egg-crate effects.
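A small simulation can show why such wide percentile swings arise. This is my own illustration in the spirit of the Corcoran example, not his New York City data: when the estimation error is roughly the same magnitude as the true teacher effect, a teacher’s estimated percentile rank can land far from her “true” rank on noise alone:

```python
import numpy as np

# Simulated "true" effects for 1,000 teachers, plus noisy VAM estimates
# whose error is about the same scale as the signal (a common finding).
rng = np.random.default_rng(42)
n = 1000
true_effects = rng.normal(size=n)
estimates = true_effects + rng.normal(size=n)

def pct_rank(x, pool):
    """Percentile rank of x within pool (0-100)."""
    return 100.0 * float(np.mean(pool < x))

true_ranks = np.array([pct_rank(t, true_effects) for t in true_effects])
est_ranks = np.array([pct_rank(e, estimates) for e in estimates])

# How far noise alone moves a teacher's percentile rank.
shift = np.abs(est_ranks - true_ranks)
print(round(float(np.median(shift)), 1), "percentile points (median shift)")
```

Even the median teacher's rank moves by double-digit percentile points here, which is the same instability the 15th-to-71st-percentile range above describes.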

Darling-Hammond’s conclusions are that VAMs have “been prematurely thrust into policy contexts that have made it more the subject of advocacy than of careful analysis that shapes its use. There is [good] reason to be skeptical that the current prescriptions for using VAMs can ever succeed in measuring teaching contributions well” (p. 135).

Darling-Hammond also “adds value” in one whole section (highlighted in another post forthcoming here), offering a very sound set of solutions, whether using VAMs for teacher evaluations or not. Given that it is rare in this area of research that we can focus on actual solutions, this section is a must-read. If you don’t want to wait for the next post, read Darling-Hammond’s “Modest Proposal” (pp. 135-136) within her larger article here.

In the end, Darling-Hammond writes that, “Trying to fix VAMs is rather like pushing on a balloon: The effort to correct one problem often creates another one that pops out somewhere else” (p. 135).


If interested, see the Review of Article #1 – the introduction to the special issue here; see the Review of Article #2 – on VAMs’ measurement errors, issues with retroactive revisions, and (more) problems with using standardized tests in VAMs here; see the Review of Article #3 – on VAMs’ potentials here; see the Review of Article #4 – on observational systems’ potentials here; see the Review of Article #5 – on teachers’ perceptions of observations and student growth here; see the Review of Article (Essay) #6 – on VAMs as tools for “egg-crate” schools here; and see the Review of Article (Commentary) #7 – on VAMs situated in their appropriate ecologies here.

Article #8, Part I Reference: Darling-Hammond, L. (2015). Can value-added add value to teacher evaluation? Educational Researcher, 44(2), 132-137. doi:10.3102/0013189X15575346

Victory in Court: Consequences Attached to VAMs Suspended Throughout New Mexico

Great news for New Mexico and New Mexico’s approximately 23,000 teachers, and great news for states and teachers potentially elsewhere, in terms of setting precedent!

Late yesterday, state District Judge David K. Thomson, who presided over the ongoing teacher-evaluation lawsuit in New Mexico, granted a preliminary injunction preventing consequences from being attached to the state’s teacher evaluation data. More specifically, Judge Thomson ruled that the state can proceed with “developing” and “improving” its teacher evaluation system, but the state is not to make any consequential decisions about New Mexico’s teachers using the data the state collects until the state (and/or others external to the state) can evidence to the court during another trial (set, for now, for April) that the system is reliable, valid, fair, uniform, and the like.

As you all likely recall, the American Federation of Teachers (AFT), joined by the Albuquerque Teachers Federation (ATF), last year, filed a “Lawsuit in New Mexico Challenging [the] State’s Teacher Evaluation System.” Plaintiffs charged that the state’s teacher evaluation system, imposed on the state in 2012 by the state’s current Public Education Department (PED) Secretary Hanna Skandera (with value-added counting for 50% of teachers’ evaluation scores), is unfair, error-ridden, spurious, harming teachers, and depriving students of high-quality educators, among other claims (see the actual lawsuit here).

Thereafter, one scheduled day of testimonies turned into five in Santa Fe, which ran from the end of September through the beginning of October (each of which I covered here, here, here, here, and here). I served as the expert witness for the plaintiffs’ side, along with other witnesses including lawmakers (e.g., a state senator) and educators (e.g., teachers, superintendents) who made various (and very articulate) claims about the state’s teacher evaluation system on the stand. Thomas Kane served as the expert witness for the defendant’s side, along with other witnesses including lawmakers and educators who made counterclaims about the system, some of which backfired, unfortunately for the defense, primarily during cross-examination.

See articles released about this ruling this morning in the Santa Fe New Mexican (“Judge suspends penalties linked to state’s teacher eval system”) and the Albuquerque Journal (“Judge curbs PED teacher evaluations”). See also the AFT’s press release, written by AFT President Randi Weingarten, here. Click here for the full 77-page Order written by Judge Thomson (see also, below, five highlights I pulled from this Order).

The Santa Fe New Mexican journalist, though, provided the most detailed information about Judge Thomson’s Order, writing, for example, that the “ruling by state District Judge David Thomson focused primarily on the complicated combination of student test scores used to judge teachers. The ruling [therefore] prevents the Public Education Department [PED] from denying teachers licensure advancement or renewal, and it strikes down a requirement that poorly performing teachers be placed on growth plans.” In addition, the Judge noted that “the teacher evaluation system varies from district to district, which goes against a state law calling for a consistent evaluation plan for all educators.”

The PED continues to stand by its teacher evaluation system, calling the court challenge “frivolous” and “a legal PR stunt,” all the while noting that Judge Thomson’s decision “won’t affect how the state conducts its teacher evaluations.” Indeed it will, for now and until the state’s teacher evaluation system is vetted and validated, and “the court” is “assured” that the system can actually be used to take the “consequential actions” against teachers “required” by the state’s PED.

Here are some other highlights that I took directly from Judge Thomson’s ruling, capturing what I viewed as his major areas of concern about the state’s system (click here, again, to read Judge Thomson’s full Order):

  • Validation Needed: “The American Statistical Association says ‘estimates from VAM should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAM are used for high stake[s] purposes’” (p. 1). These are the measures, assumptions, limitations, and the like that are to be made transparent in this state.
  • Uniformity Required: “New Mexico’s evaluation system is less like a [sound] model than a cafeteria-style evaluation system where the combination of factors, data, and elements are not easily determined and the variance from school district to school district creates conflicts with the [state] statutory mandate” (p. 2)…with the existing statutory framework for teacher evaluations for licensure purposes requiring “that the teacher be evaluated for ‘competency’ against a ‘highly objective uniform statewide standard of evaluation’ to be developed by PED” (p. 4). “It is the term ‘highly objective uniform’ that is the subject matter of this suit” (p. 4), whereby the state and no other “party provided [or could provide] the Court a total calculation of the number of available district-specific plans possible given all the variables” (p. 54). See also the Judge’s points #78-#80 (starting on page 70) for some of the factors that helped to “establish a clear lack of statewide uniformity among teachers” (p. 70).
  • Transparency Missing: “The problem is that it is not easy to pull back the curtain, and the inner workings of the model are not easily understood, translated or made accessible” (p. 2). “Teachers do not find the information transparent or accurate” and “there is no evidence or citation that enables a teacher to verify the data that is the content of their evaluation” (p. 42). In addition, “[g]iven the model’s infancy, there are no real studies to explain or define the [s]tate’s value-added system…[hence, the consequences and decisions]…that are to be made using such system data should be examined and validated prior to making such decisions” (p. 12).
  • Consequences Halted: “Most significant to this Order, [VAMs], in this [s]tate and others, are being used to make consequential decisions…This is where the rubber hits the road [as per]…teacher employment impacts. It is also where, for purposes of this proceeding, the PED departs from the statutory mandate of uniformity requiring an injunction” (p. 9). In addition, it should be noted that indeed “[t]here are adverse consequences to teachers short of termination” (p. 33) including, for example, “a finding of ‘minimally effective’ [that] has an impact on teacher licenses” (p. 41). These, too, are to be halted under this injunction Order.
  • Clarification Required: “[H]ere is what this [O]rder is not: This [O]rder does not stop the PED’s operation, development and improvement of the VAM in this [s]tate, it simply restrains the PED’s ability to take consequential actions…until a trial on the merits is held” (p. 2). In addition, “[a] preliminary injunction differs from a permanent injunction, as does the factors for its issuance…’ The objective of the preliminary injunction is to preserve the status quo [minus the consequences] pending the litigation of the merits. This is quite different from finally determining the cause itself” (p. 74). Hence, “[t]he court is simply enjoining the portion of the evaluation system that has adverse consequences on teachers” (p. 75).

The PED also argued that “an injunction would hurt students because it could leave in place bad teachers.” As per Judge Thomson, “That is also a faulty argument. There is no evidence that temporarily halting consequences due to the errors outlined in this lengthy Opinion more likely results in retention of bad teachers than in the firing of good teachers” (p. 75).

Finally, given my involvement in this lawsuit and given the team with whom I was/am still so fortunate to work (see picture below), including all of those who testified as part of the team and whose testimonies clearly proved critical in Judge Thomson’s final Order, I want to thank everyone for all of their time, energy, and efforts in this case, thus far, on behalf of the educators attempting to (still) do what they love to do — teach and serve students in New Mexico’s public schools.


Left to right: (1) Stephanie Ly, President of AFT New Mexico; (2) Dan McNeil, AFT Legal Department; (3) Ellen Bernstein, ATF President; (4) Shane Youtz, Attorney at Law; and (5) me 😉

The Multiple Teacher Evaluation System(s) in New Mexico, from a Concerned New Mexico Parent

A “concerned New Mexico parent,” who wrote a prior post for this blog here, wrote another for you all below, about the sheer number of different teacher evaluation systems, or variations thereof, now in place in his/her state of New Mexico. (S)he writes:

Readers of this blog are well aware of the limitations of VAMs for evaluating teachers. However, many readers may not be aware that there are actually many system variations used to evaluate teachers. In the state of New Mexico, for example, 217 different variations are used to evaluate the many and diverse types of teachers teaching in the state [and likely all other states].

But. Is there any evidence that they are valid? NO. Is there any evidence that they are equivalent? NO. Is there any evidence that this is fair? NO.

The New Mexico Public Education Department (NMPED) provides a framework for teacher evaluations, and the final teacher evaluation should be weighted as follows: Improved Student Achievement (50%), Teacher Observations (25%), and Multiple Measures (25%).

Every school district in New Mexico is required to submit a detailed evaluation plan of specifically what measures will be used to satisfy the overall NMPED 50-25-25 percentage framework, after which NMPED approves all plans.
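As a minimal sketch (with made-up component scores, not NMPED data), any district plan satisfying the 50-25-25 framework ultimately reduces to a simple weighted sum of its three component scores:

```python
# Illustrative sketch of the NMPED 50-25-25 framework: Improved Student
# Achievement (50%), Teacher Observations (25%), Multiple Measures (25%),
# with each component assumed here to be scored on a 0-100 scale.
def final_score(achievement, observation, multiple_measures):
    return (0.50 * achievement
            + 0.25 * observation
            + 0.25 * multiple_measures)

print(final_score(60, 80, 70))  # 0.5*60 + 0.25*80 + 0.25*70 = 67.5
```

The formula itself is trivial; the controversy below is about what feeds into each component, and how differently districts are allowed to fill in those blanks.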

The exact details of any district’s educator effectiveness plan can be found on the NMTEACH website, as every public and charter school plan is posted here.

There are massive differences between how groups of teachers are graded between districts, however, which distorts most everything about the system(s), including the extent to which similar (and different) teachers might be similarly (and fairly) evaluated and assessed.

Even within districts, there are massive differences in how grade level (elementary, middle, high school) teachers are evaluated.

And, even something as seemingly simple as evaluating K-2 teachers requires 42 different variations in scoring.

Table 1 below shows the number of different scales used to calculate teacher effectiveness for each group of teachers and each grade level, for example, at the state level.

New Mexico divides all teachers into three categories — group A teachers have scores based on the statewide test (mathematics, English/language arts (ELA)), group B teachers (e.g. music or history) do not have a corresponding statewide test, and group C teachers teach grades K-2. Table 1 shows the number of scales used by New Mexico school districts for each teacher group. It is further broken down by grade-level. For example, as illustrated, there are 42 different scales used to evaluate Elementary-level Group A teachers in New Mexico. The column marked “Unique (one-offs)” indicates the number of scales that are completely unique for a given teacher group and grade-level. For example, as illustrated, there are 11 unique scales used to grade Group B High School teachers, and for each of these eleven scales, only one district, one grade-level, and one teacher group is evaluated within the entire state.

Based on the size of the school district, a unique scale may be grading as few as a dozen teachers! In addition, there are 217 scales used statewide, with 99 of these scales being unique (by teacher)!

Table 1: New Mexico Teacher Evaluation System(s)

| Group | Grade | Scales Used | Unique (one-offs) |
| --- | --- | --- | --- |
| Group A (SBA-based; e.g., 5th-grade English teacher) | All | 58 | 15 |
| | Elem | 42 | 10 |
| | MS | 37 | 2 |
| | HS | 37 | 3 |
| Group B (non-SBA; e.g., Elem music teacher) | All | 117 | 56 |
| | Elem | 67 | 37 |
| | MS | 62 | 8 |
| | HS | 61 | 11 |
| Group C (grades K-2) | All | 42 | 28 |
| | Elem | 42 | 28 |
| Total | | 217 variants | 99 one-offs |

The table above highlights the spectacular absurdity of the New Mexico Teacher Evaluation System.

(The complete listings of all variants for the three groups are contained here (in Table A for Group A), here (in Table B for Group B), and here (in Table C for Group C). The abbreviations and notes for these tables are listed here (in Table D).)

By approving all of these different formulas, all things considered, NMPED is also making the following nonsensical claims:

NMPED Claim: The prototype 50-25-25 percentage split has some validity.

There is no evidence to support this division between student achievement measures, observation, and multiple measures at all. It simply represents what NMPED could politically “get away with” in terms of a formula. Why not 60-20-20, or 57-23-20, or 46-18-36, etcetera? The NMPED prototype scale has no proven validity whatsoever.

NMPED Claim: All 217 formulas are equivalent to evaluate teachers.

This claim by NMPED is absurd on its face. Is there any evidence that NMPED has cross-validated these formulas against one another? There is no evidence that any of these scales are valid or accurate measures of “teacher effectiveness.” Also, there is no evidence whatsoever that they are equivalent.

Further, if the formulas are equivalent (as NMPED claims), why is New Mexico wasting money on technology for administering SBA tests or End-of-Course exams? Why not use an NMPED-approved formula that includes tests like Discovery, MAPS, DIBELS, or Star that are already being used?

NMPED Claim: Teacher Attendance and Student Surveys are interchangeable.

According to the approved plans, many districts assign 10% to Teacher Attendance while other districts assign 10% to Student Surveys. Both variants have been approved by NMPED.

Mathematically (i.e., in terms of the proportions either is to be allotted), they appear to be interchangeable. If that is so, why is NMPED also specifically trying to enforce Teacher Attendance as an element of the evaluation scale? Why did Hanna Skandera proclaim to the press that this measure improved New Mexico education? (For typical news coverage on this topic, for example, see here.)

The use of teacher attendance appears to be motivated by union-busting rather than any mathematical rationale.

NMPED Claim: All observation methods are equivalent.

NMPED allows for three very different observation methods to be used for 40% of the final score. Each method is somewhat complicated and involves different observers.

There is no indication that NMPED has evaluated the reliability or validity of these three very different observation methods, or tested their results for equivalence. They simply assert that they are equivalent.

NMPED Claim: These formulas will be used to rate teachers.

These formulas are the worst kind of statistical jiggery-pokery (to use a newly current phrase). NMPED presents a seemingly rational, scientific number to the public using invalid and unvalidated mathematical manipulations and then determines teachers’ careers based on the completely bogus New Mexico teacher evaluation system(s).

Conclusion: Not only is the emperor naked, he has a closet containing 217 equivalent outfits at home!

School-Level Bias in the PVAAS Model in Pennsylvania

Research for Action (click here for more about the organization and its mission) just released an analysis of the state of Pennsylvania’s school building rating system – the 100-point School Performance Profile (SPP) used throughout the state to categorize and rank public schools (including charters), although a school can go over 100 points by earning bonus points. The SPP is largely (40%) based on school-level value-added model (VAM) output derived via the state’s Pennsylvania Education Value-Added Assessment System (PVAAS), also generally known as the Education Value-Added Assessment System (EVAAS), and also generally known as the VAM that is “expressly designed to control for out-of-school [biasing] factors.” This model should sound familiar to followers, as this is the model with which I am most familiar and on which I conduct most of my research, and it is widely considered one of “the most popular” and one of “the best” VAMs in town (see recent posts about this model here, here, and here).

Research for Action‘s report, titled “Pennsylvania’s School Performance Profile: Not the Sum of its Parts,” details some serious issues with bias in the state’s SPP. That is, bias (a highly controversial issue covered in the research literature and also on this blog; see recent posts about bias here, here, and here) does also appear to exist in this state, particularly at the school level, (1) for subject areas less traditionally tested and, hence, not often consecutively tested (e.g., from one consecutive grade level to the next), and (2) given that the state is combining growth measures with proficiency (i.e., “snapshot”) measures to evaluate schools, the latter being significantly negatively correlated with the populations of the students in the schools being evaluated.

First, “growth scores do not appear to be working as intended for writing and science, the subjects that are not tested year-to-year.” The figure below “displays the correlation between poverty, PVAAS [growth scores], and proficiency scores in these subjects. The lines for both percent proficient and growth follow similar patterns. The correlation between poverty and proficiency rates is stronger, but PVAAS scores are also significantly and negatively correlated with poverty.”

Figure 1

Whereas we have covered before on this blog how VAMs, and this VAM in particular (i.e., the PVAAS, TVAAS, EVAAS, etc.), can also be biased by the subject areas teachers teach (see prior posts here and here), this is the first piece of evidence of which I am aware illustrating that this VAM (and likely other VAMs) might also be biased for subject areas that are not consecutively tested (i.e., from one grade to the next) and that lack pretest scores, in which case VAM statisticians substitute pretest scores from other subject areas instead (e.g., from mathematics and English/language arts, as scores from these tests are likely correlated with the missing pretest scores). This, of course, has implications for schools (as illustrated in these analyses) and also likely for teachers (as logically generalized from these findings).

Second, researchers found that because five of the six SPP indicators, accounting for 90% of a school’s base score (including an extra-credit portion), rely entirely on test scores, “this reliance on test scores, despite the partial use of growth measures, results in a school rating system that favors more advantaged schools.” The finding here is that using growth or value-added measures in conjunction with proficiency (i.e., “snapshot”) indicators also biases total output against schools serving more disadvantaged students. This is what this looks like, noting in the figures below that bias exists for both the proficiency and growth measures; although the bias in the latter is smaller, it is still evident.

Figure 2

“The black line in each graph shows the correlation between poverty and the percent of students scoring proficient or above in math and reading; the red line displays the relationship between poverty and growth in PVAAS. As displayed…the relationship between poverty measures and performance is much stronger for the proficiency indicator.”

“This analysis shows a very strong negative correlation between SPP and poverty in both years. In other words, as the percent of a school’s economically disadvantaged population increases, SPP scores decrease.” That is, school performance declines sharply as schools’ aggregate levels of student poverty increase, if and when states use combinations of growth and proficiency to evaluate schools. This is not surprising, but a good reminder of how much poverty is (co)related with achievement captured by test scores when growth is not considered.
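The strong negative relationship described here is, at bottom, just a correlation between two school-level variables. As a minimal sketch (using entirely hypothetical numbers, not the report’s actual data), it can be computed as follows:

```python
import numpy as np

# Hypothetical school-level data: percent of economically disadvantaged
# students vs. each school's composite SPP score (illustrative values only).
poverty = np.array([10, 25, 40, 55, 70, 85, 95])
spp_score = np.array([92, 88, 78, 70, 61, 55, 48])

# Pearson correlation between poverty and the composite rating
r = np.corrcoef(poverty, spp_score)[0, 1]
print(f"correlation between poverty and SPP: {r:.2f}")
```

The made-up data above were chosen so that the correlation comes out strongly negative, mirroring the direction (though not necessarily the exact magnitude) of the relationship the report describes.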

Please read the full 16-page report, as it includes much more important information about both of these key findings than described herein. Otherwise, well done, Research for Action, for “adding value,” for lack of a better term, to the current literature surrounding VAM-based bias.

Combining Different Tests to Measure Value-Added, Continued

Following up on a recent post regarding whether “Combining Different Tests to Measure Value-Added [Is] Valid?,” an anonymous VAMboozled! follower emailed me an article I had never read on the topic. I found the article, and its contents related to this question, important enough to share with you all here.

The 2005 article titled “Adjusting for Differences in Tests,” authored by Robert Linn – a highly respected scholar in the academy and also Distinguished Professor Emeritus in Research & Evaluation Methodology at the University of Colorado at Boulder – and published by the esteemed National Academy of Sciences, is all about “[t]he desire to treat scores obtained from different tests as if they were interchangeable, or at least comparable.” This is very much in line with current trends to combine different tests to measure value-added, given statisticians’ and (particularly in the universe of VAMs) econometricians’ “natural predilections” to find (and force) solutions to what they see as very rational and practical problems. Hence, this paper is still very relevant today.

To equate tests, as per Dorans and Holland (2000) as cited by Linn, five requirements must be met. These requirements “enjoy a broad professional consensus,” and they pertain to test equating within VAMs as well:

  1. The Equal Construct Requirement: Tests that measure different constructs (e.g., mathematics versus reading/language arts) should not be equated. States that are using different subject area tests as pretests for other subject area tests to measure growth over time, for example, are in serious violation of this requirement.
  2. The Equal Reliability Requirement: Tests that measure the same construct but which differ in reliability (i.e., consistency at the same time or over time as measured in a variety of ways) should not be equated. Most large-scale tests, often depending on whether they yield norm- or criterion-referenced data given the normed or state standards to which they are aligned, yield different levels of reliability. Likewise, the types of consequences attached to test results throw this off as well.
  3. The Symmetry Requirement: The equating function for equating the scores of Y to those of X should be the inverse of the equating function for equating the scores of X to those of Y. I am unaware of studies that have addressed this, as it is typically overlooked and assumed.
  4. The Equity Requirement: It ought to be a matter of indifference for an examinee to be tested by either one of the two tests that have been equated. It is not, however, a matter of indifference in the universe of VAMs, as best demonstrated by the research conducted by Papay (2010) in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study.
  5. The Population Invariance Requirement: The choice of (sub)population used to compute the equating function between the scores on tests X and Y should not matter. In other words, the equating function used to link the scores of X and Y should be population invariant. As also noted in Papay (2010), however, even when using the same populations, the scores on tests X and Y DO matter.
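To make the symmetry requirement (number 3 above) concrete, here is a minimal sketch using simple linear equating; the means and standard deviations below are hypothetical illustration values, not taken from any real test:

```python
# A minimal sketch of the symmetry requirement using linear equating.
# All means and standard deviations are hypothetical illustration values.

def linear_equate(y, mean_y, sd_y, mean_x, sd_x):
    """Map a score on test Y onto the scale of test X (linear equating)."""
    return mean_x + sd_x * (y - mean_y) / sd_y

# Hypothetical score scales for two tests of the same construct
mean_x, sd_x = 500.0, 100.0
mean_y, sd_y = 50.0, 10.0

y_score = 62.0
x_equiv = linear_equate(y_score, mean_y, sd_y, mean_x, sd_x)   # Y -> X
y_back = linear_equate(x_equiv, mean_x, sd_x, mean_y, sd_y)    # X -> Y

# Linear equating satisfies symmetry: mapping Y to X and back to Y
# recovers the original score exactly.
print(x_equiv, y_back)  # prints: 620.0 62.0
```

Linear equating satisfies symmetry by construction; the point of the requirement is that whatever equating function is used, mapping Y to X and then back to Y must return the original score.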

Accordingly, here are Linn’s two key points of caution: (1) Factors that complicate doing this, and make such simple conversions less reliable and less valid (or invalid), include “the content, format, and margins of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests.” Likewise, (2) “it cannot be assumed that converted scores will behave just like those of the test whose scale is adopted.” That is, converting or norming scores to permit longitudinal analyses (i.e., in the case of VAMs) can very well violate what the test scores mean, even if common sense might tell us they essentially mean the same thing.

Hence, and also as per the National Research Council (NRC), it is not feasible to compare “the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale.” Tests vary so much “in content, item format, conditions of administration, and consequences attached to [test] results that the linked scores could not be considered sufficiently comparable to justify the development of [a] single equivalency scale.”

“Although NAEP [the National Assessment of Educational Progress – the nation’s “best” test] would seem to hold the greatest promise for linking state assessments… NAEP has no consequences either for individual students or for the schools they attend. State assessments, on the other hand, clearly have important consequences for schools and in a number of states they also have significant consequences for individual students” and now teachers. This is the key factor that throws off the potential to equate tests that are not designed to be interchangeable. This is certainly not to say, however, that simply increasing the consequences across tests would make them more interchangeable. It’s just not that simple – see requirements 1-5 above.

Is Combining Different Tests to Measure Value-Added Valid?

A few days ago, on Diane Ravitch’s blog, a person posted the following comment, that Diane sent to me for a response:

“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”

Here’s what I wrote in response, which I’m sharing here with you all as this practice is becoming increasingly common across the country; hence, it is a question of high priority and interest:

“We have three issues here when equating different tests SIMPLY because these tests test the same students on the same things around the same time.

First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.

Second, doing this ASSUMES that all standardized tests are vertically scaled, whereby scales increase linearly as students progress through different grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry, without any real theoretical or other foundation in support.

Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates range WIDELY across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”
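The “norming” criticized in the second point can be sketched as follows (with hypothetical scores). The sketch shows that standardizing two tests to the same mean and standard deviation makes the output look similar without making the tests rank students the same way:

```python
import numpy as np

# Hypothetical scores for the same five students on two different tests
test_a = np.array([12.0, 15.0, 18.0, 22.0, 33.0])
test_b = np.array([480.0, 520.0, 455.0, 560.0, 510.0])

def z(x):
    """Standardize scores to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

za, zb = z(test_a), z(test_b)

# After norming, both score sets share (nearly) the same mean and SD...
print(za.mean(), za.std(), zb.mean(), zb.std())

# ...but the students' rank orders still disagree, so the two tests are
# not interchangeable despite the similar-looking output.
print(np.argsort(za), np.argsort(zb))
```

In this made-up example the two normed distributions are statistically identical in mean and spread, yet they order the five students differently, which is exactly why normed output can look “similar” while meaning different things.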

See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.

Same Teachers, Similar Students, Similar Tests, Different Answers

One of my favorite studies to date about VAMs was conducted by John Papay, an economist once at Harvard and now at Brown University. In the study titled “Different Tests, Different Answers: The Stability of Teacher Value-Added Estimates Across Outcome Measures,” published in 2009 in the American Educational Research Journal – one of the field’s best and most reputable peer-reviewed journals – Papay presents evidence that different yet similar tests (i.e., similar in content, and administered at similar times to similar sets of students) do not provide similar answers about teachers’ value-added performance. This is an issue with validity in that, if a test is measuring the same things for the same folks at the same times, similar (if not the same) results should be realized. But they are not. Papay, rather, found only moderate rank correlations, ranging from r=0.15 to r=0.58, among the value-added estimates derived from the different tests.

Recently released, yet another study (albeit not yet peer-reviewed) has found similar results…potentially solidifying this finding further into our understandings about VAMs and their issues, particularly in terms of validity (or truth in VAM-based results). This study, “Comparing Estimates of Teacher Value-Added Based on Criterion- and Norm-Referenced Tests,” released by the U.S. Department of Education and conducted by four researchers representing the University of Notre Dame, Basis Policy Research, and the American Institutes for Research, provides evidence, again, that estimates of teacher value-added based on different yet similar tests (i.e., in this case a criterion-referenced state assessment and a widely used norm-referenced test given in the same subject around the same time) were only moderately correlated.

If we had confidence in the validity of the inferences based on value-added measures, these correlations (or, more simply put, “relationships”) should be much higher than what the researchers found: a range of 0.44 to 0.65, similar to what Papay found. While the ideal correlation coefficient in this case is r=+1.0, that is very rarely achieved. But for the purposes for which teacher-level value-added is currently being used, correlations above r=+0.70 or r=+0.80 would (and should) be desired, and possibly required, before high-stakes decisions about teachers are made based on these data.

In addition, researchers in this study found that, on average, only 33.3% of teachers were positioned in the same range of scores (i.e., quintiles, or bands spanning 20 percentile points) by both sets of value-added estimates in the same school year. This too has implications for validity in that, again, teachers’ value-added estimates should fall in the same ranges, if and when similar tests are used, if any valid inferences are to be made using value-added estimates.
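For readers curious how such a classification-agreement figure is computed, here is a hedged sketch using simulated data (not the study’s), with the two sets of value-added estimates built to correlate at roughly the moderate levels the studies report:

```python
import numpy as np

# Simulated value-added estimates for 200 teachers on two different tests
# (hypothetical data, not the study's). The second estimate is constructed
# to correlate with the first at roughly r = 0.5.
rng = np.random.default_rng(0)
vam_test1 = rng.normal(size=200)
vam_test2 = 0.5 * vam_test1 + np.sqrt(1 - 0.5**2) * rng.normal(size=200)

def quintile(x):
    """Assign each estimate to a quintile (0-4) of its own distribution."""
    cuts = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(cuts, x)

# Share of teachers classified into the same quintile by both measures
same = np.mean(quintile(vam_test1) == quintile(vam_test2))
print(f"same-quintile classification rate: {same:.0%}")
```

Even with a correlation near r=0.5, typically well under half of the simulated teachers land in the same quintile on both measures, which is the phenomenon the study documents.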

Again, please note that this study has not yet been peer-reviewed. While some results naturally seem to make more sense than others, like the one reviewed above, peer review matters equally for studies whose results we might tend to accept and for those whose results we might tend to reject. Take anything that is not peer-reviewed as just that…a study with methods and findings not yet critically vetted by the research community.

In Response to a Tennessee Assistant Principal’s Concerns – Part II Mathematics

In our most recent post, we conducted the first of two (this is the second) follow-up analyses to examine the claims put forth in another recent post about a Tennessee assistant principal’s suspicions regarding his state of Tennessee’s value-added scores, as measured by the Tennessee Value-Added Assessment System (TVAAS; publicly available here). This is the final installment of this series; this time we focus on the mathematics value-added scores and, probably not surprisingly, illustrate quite similar results below.

Again, we analyzed data from the 10 largest school districts in the state based on population data (found here). We looked at grade-level value-added scores for 3rd through 8th grade mathematics. For each grade level, we calculated the percentage of schools per district that had positive value-added scores for 2013 (see Table 1) and for their three-year composite scores, since these are often considered more reliable (see Table 2).

We were particularly interested in knowing how likely a grade level was to get a positive value-added score and in seeing whether there were any trends across districts. Consistent with our English/Language Arts (ELA) findings, similar trends were apparent in mathematics as well. For clarity, and as we did with the ELA findings, we color-coded the chart—green signifies the grade levels for which 75% or more of the schools received a positive value-added score, while red signifies the grade levels for which 25% or less of the schools received a positive value-added score.

Table 1: Percent of Schools that had Positive Value-Added Scores by Grade and District (2013) Mathematics

District 3rd Grade 4th Grade 5th Grade 6th Grade 7th Grade 8th Grade
Memphis 42% 91% 70% 37% 84% 55%
Nashville-Davidson NA 60% 65% 70% 90% 44%
Knox 72% 87% 57% 64% 93% 67%
Hamilton 31% 86% 84% 38% 76% 78%
Shelby 97% 97% 61% 75% 100% 69%
Sumner 81% 77% 50% 67% 75% 58%
Montgomery NA 86% 71% 14% 86% 57%
Rutherford 79% 100% 67% 62% 77% 85%
Williamson NA 67% 21% 100% 100% 56%
Murfreesboro NA 100% 60% 90% NA* NA*

Though the mathematics scores were not as glaringly biased as the ELA scores, there were some alarming trends to notice. In particular, the 4th and 7th grade value-added scores were consistently higher than those of the 3rd, 5th, 6th, and 8th grades in mathematics, which had much greater variation across districts. In fact, all districts had at least 75% of their schools receive positive value-added scores in 7th grade and at least 60% in 4th grade. To recall, the 7th grade scores in ELA were drastically lower, with all districts having no more than 50% of their schools receive positive value-added scores (five of which had fewer than 25%). The 6th and 8th grades also had more variation in mathematics than in ELA.
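For transparency, the tabulation and color-coding behind these tables can be sketched as follows; the value-added scores below are hypothetical stand-ins, not actual TVAAS output:

```python
# A sketch of the tabulation behind the tables: for each grade in a
# district, compute the percent of schools with a positive value-added
# score, then flag it green at >= 75% or red at <= 25% (hypothetical data).

district_scores = {
    "4th": [1.2, 0.8, 2.1, -0.3, 0.5],     # mostly positive schools
    "6th": [-1.1, -0.4, 0.2, -2.0, -0.7],  # mostly negative schools
}

def percent_positive(scores):
    """Percent of schools in the list with a value-added score above zero."""
    return 100.0 * sum(s > 0 for s in scores) / len(scores)

def flag(pct):
    """Color-code a grade level per the 75%/25% rule used in the tables."""
    if pct >= 75.0:
        return "green"
    if pct <= 25.0:
        return "red"
    return "none"

for grade, scores in district_scores.items():
    pct = percent_positive(scores)
    print(grade, f"{pct:.0f}%", flag(pct))
```

The thresholds in `flag` mirror the color-coding rule stated above; everything between 25% and 75% is left uncolored.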

Table 2: Percent of Schools that had Positive Value-Added Scores by Grade and District (Three-Year Composite) Mathematics

District 3rd Grade 4th Grade 5th Grade 6th Grade 7th Grade 8th Grade
Memphis NA 90% 72% 62% 96% 78%
Nashville-Davidson NA 73% 67% 89% 97% 74%
Knox NA 96% 79% 90% 93% 93%
Hamilton NA 89% 80% 52% 90% 71%
Shelby NA 97% 79% 75% 100% 69%
Sumner NA 92% 65% 90% 92% 100%
Montgomery NA 95% 85% 71% 86% 71%
Rutherford NA 100% 79% 69% 92% 85%
Williamson NA 91% 22% 100% 100% 89%
Murfreesboro NA 100% 90% 100% NA NA

As for the three-year composites scores, schools across the state were much more likely to receive positive value-added scores than negative value-added scores in all tested grade levels. This, compared to the ELA scores where 6th and 7th grades struggled to earn positive scores, suggests that there is some level of subject bias going on here. Specifically, a majority of schools across all districts received positive value-added scores at each grade level for mathematics, with the small exception of 5th grade in one school district. For 4th and 7th grades, almost every school received positive scores.

Again, of most importance here is how we choose to interpret these results. By Tennessee’s standard (given their heavy reliance on the TVAAS to evaluate teachers), our conclusion would be that the mathematics teachers are, overall, more effective than the ELA teachers in almost every tested grade level (with the exception of 8th grade ELA), regardless of school district.

Perhaps a more reasonable explanation, though, is that there is some bias in the tests upon which the TVAAS scores are based (likely related to issues with the vertical scaling of Tennessee’s tests, not to mention other measurement errors). Far more students across the state demonstrated growth in mathematics than in ELA (for the past three years at least). To simply assume that this is caused by teacher effectiveness is crass at best.


Analysis conducted by Jessica Holloway-Libell

A Consumer Alert Issued by The 21st Century Principal

In an excellent post just released by The 21st Century Principal, the author writes about two more companies calculating value-added for school districts, again on the taxpayer’s dime. Teacher Match and Hanover Research are the companies specifically named and targeted for marketing and selling a series of highly false assumptions about teaching and teachers; highly false claims about value-added (without empirical research in support); highly false assertions about how value-added estimates can be used for better teacher evaluation/accountability; and highly false sales pitches about what they, as value-added/research “experts,” can do to help with the complex statistics needed for the above.

The main points of the article, as I see them and in order of priority, follow:

  1. School districts are purchasing these “products” based entirely on the promises and related marketing efforts of these (and other) companies. Consumer Alert! Instead of accepting these (and other) companies’ sales pitches and promises that these companies’ “products” will do what they say they will, these companies must be forced to produce independent, peer-reviewed research to prove that what they are selling is in fact real. If they can’t produce the studies, they should not earn the contracts!!
  2. Doing all of this is just another expensive drain on already scarce educational resources. One district is paying Teacher Match over $30,000 per year for its services, as cited in this piece. Relatedly, the Houston Independent School District is paying SAS Inc. $500,000 per year for its EVAAS-based value-added calculations. These are not trivial expenditures, especially when considering the other potential research-based initiatives toward which these valuable resources could otherwise be spent.
  3. States (and the companies selling their value-added services) haven’t done the validation studies to prove that the value-added scores/estimates are valid. Again, the sales and marketing claims made by these companies are almost always devoid of evidence that supports them.
  4. Doing all of this elevates standardized testing even higher in the decision-making and data-driven processes for schools, even though doing this is not warranted or empirically supported (as mentioned).
  5. Relatedly, value-added calculations rely on inexpensive (aka “cheap”) large-scale tests, also of questionable validity, that still are not designed for the purposes for which they are being tasked and used (e.g., measuring growth over time cannot be done without tests with equivalent vertical scales, which virtually no tests at this point have).

The shame in all of this, besides the major issues mentioned in the five points above, is that the federal government, thanks to US Secretary of Education Arne Duncan and the Obama administration, is incentivizing these and other companies (e.g. SAS EVAAS, Mathematica) to exist, construct and sell such “products,” and then seek out and compete for these publicly funded and subsidized contracts. We, as taxpayers, are the ones consistently footing the bills.

See another recent article about the chaos a simple error in Mathematica’s code caused in Washington DC’s public schools, following another VAMboozled post about the same topic two weeks ago.