Rest in Peace, EVAAS Developer William L. Sanders

Over the last 3.5 years since I developed this blog, I have written many posts about one particular value-added model (VAM) – the Education Value-Added Assessment System (EVAAS), formerly known as the Tennessee Value-Added Assessment System (TVAAS), now known by some states as the TxVAAS in Texas, the PVAAS in Pennsylvania, and also known as the generically-named EVAAS in states like Ohio, North Carolina, and South Carolina (and many districts throughout the nation). It is this model on which I have conducted most of my research (see, for example, the first piece I published about this model here, in which most of the claims I made still stand, although EVAAS modelers disagreed here). And it is this model that is at the source of the majority of the teacher evaluation lawsuits in which I have been or still am currently engaged (see, for example, details about the Houston lawsuit here, the former Tennessee lawsuit here, and the new Texas lawsuit here, although the model is more peripheral in this particular case).

Anyhow, the original EVAAS model (i.e., the TVAAS) was developed by a man named William L. Sanders, who ultimately sold it to SAS Institute Inc., which now holds all rights to the proprietary model. See, for example, here. See also examples of prior posts about Sanders here, here, here, here, here, and here. See also examples of prior posts about the EVAAS here, here, here, here, here, and here.

It is William L. Sanders who just passed away and we sincerely hope may rest in peace.

Sanders had a bachelor’s degree in animal science and a doctorate in statistics and quantitative genetics. As an adjunct professor and agricultural statistician in the college of business at the University of Tennessee, Knoxville, he developed his TVAAS in the late 1980s.

Sanders thought that educators struggling with student achievement in the state should “simply” use more advanced statistics, similar to those used when modeling genetic and reproductive trends among cattle, to measure growth, hold teachers accountable for that growth, and solve the educational measurement woes facing the state of Tennessee at the time. It was to be as simple as that…. I should also mention that given this history, not surprisingly, Tennessee was one of the first states to receive Race to the Top funds to the tune of $502 million to further advance this model; hence, this has also contributed to this model’s popularity across the nation.

Nonetheless, Sanders passed away this past Thursday, March 16, 2017, from natural causes in Columbia, Tennessee. As per his obituary here,

  • He was most well-known for developing “a method used to measure a district, school, and teacher’s effect on student performance by tracking the year-to-year progress of students against themselves over their school career with various teachers’ classes.”
  • He “stood for a hopeful view that teacher effectiveness dwarfs all other factors as a predictor of student academic growth…[challenging]…decades of assumptions that student family life, income, or ethnicity has more effect on student learning.”
  • He believed, in the simplest of terms, “that educational influence matters and teachers matter most.”

Of course, we have much research evidence to counter these claims, but for now we will just leave all of this at that. Again, may he rest in peace.

David Berliner on The Purported Failure of America’s Schools

My primary mentor, David Berliner (Regents Professor at Arizona State University (ASU)) wrote, yesterday, a blog post for the Equity Alliance Blog (also at ASU) on “The Purported Failure of America’s Schools, and Ways to Make Them Better” (click here to access the original blog post). See other posts about David’s scholarship on this blog here, here, and here. See also one of our best blog posts that David also wrote here, about “Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs).”

In sum, for many years David has been writing “about the lies told about the poor performance of our students and the failure of our schools and teachers.” For example, he wrote one of the education profession’s all-time classics and best sellers: The Manufactured Crisis: Myths, Fraud, And The Attack On America’s Public Schools (1995). If you have not read it, you should! All educators should read this book, on that note and in my opinion, but also in the opinion of many other iconic educational scholars throughout the U.S. (Paufler, Amrein-Beardsley, Hobson, under revision for publication).

While the title of this book accurately captures its contents, more specifically it “debunks the myths that test scores in America’s schools are falling, that illiteracy is rising, and that better funding has no benefit. It shares the good news about public education.” I’ve found the contents of this book to still be my best defense when others with whom I interact attack America’s public schools, attacks that are often misinformed and perpetuated by many American politicians and journalists.

In this blog post David, once again, debunks many of these myths surrounding America’s public schools using more up-to-date data from international tests, our country’s National Assessment of Educational Progress (NAEP), state-level SAT and ACT scores, and the like. He reminds us of how student characteristics “strongly influence the [test] scores obtained by the students” at any school and, accordingly, “strongly influence” or bias these scores when used in any aggregate form (e.g., to hold teachers, schools, districts, and states accountable for their students’ performance).

He reminds us that “in the US, wealthy children attending public schools that serve the wealthy are competitive with any nation in the world…[but in]…schools in which low-income students do not achieve well, [that are not competitive with many nations in the world] we find the common correlates of poverty: low birth weight in the neighborhood, higher than average rates of teen and single parenthood, residential mobility, absenteeism, crime, and students in need of special education or English language instruction.” These societal factors explain poor performance much more (i.e., more variance explained) than any school-level, and as pertinent to this blog, teacher-level factor (e.g., teacher quality as measured by large-scale standardized test scores).

In this post David reminds us of much, much more, that we need to remember and also often recall in defense of our public schools and in support of our schools’ futures (e.g., research-based notes to help “fix” some of our public schools).

Again, please do visit the original blog post here to read more.

Last Saturday Night Live’s VAM-Related Skit

For those of you who may have missed it last Saturday, Melissa McCarthy portrayed Sean Spicer — President Trump’s new White House Press Secretary and Communications Director — in one of the funniest of a very funny set of skits recently released on Saturday Night Live. You can watch the full video, compliments of YouTube, here:

In one of the sections of the skit, though, “Spicer” introduces “Betsy DeVos” — portrayed by Kate McKinnon and also just today confirmed as President Trump’s Secretary of Education — to answer some very simple questions about today’s public schools which she, well, very simply could not answer. See this section of the clip starting at about 6:00 (of the above 8:00 minute total skit).

In short, “the man” reporter asks “DeVos” how she values “growth versus proficiency in [sic] measuring progress in students.” Literally at a loss for words, “DeVos” responds that she really doesn’t “know anything about school.” She rambles on, until “Spicer” pushes her off of the stage 40-or-so seconds later.

Humor set aside, this was the one question Saturday Night Live writers wrote into this skit, which reminds us that what we know more generally as the purpose of VAMs is still alive and well in our educational rhetoric as well as popular culture. As background, this question apparently came from Minnesota Sen. Al Franken’s prior, albeit similar question during DeVos’s confirmation hearing.

Notwithstanding, Steve Snyder, the editorial director of The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown (see prior posts about this news site here and here), took the opportunity to write a “featured” piece about this section of the script (see here). The purpose of the piece was, as the title illustrates, to help us “understand” the skit, as well as its important meaning for all of “us.”

Snyder notes that Saturday Night Live writers, with their humor, might have consequently (and perhaps mistakenly) “made their viewers just a little more knowledgeable about how their child’s school works,” or rather should work, as “[g]rowth vs. proficiency is a key concept in the world of education research.” Thereafter, Snyder falsely asserts that more than 2/3rds of educational researchers agree that VAMs are a good way to measure school quality. If you visit the actual statistic cited in this piece, however, as “non-partisan, honest, and fact-based” as it is supposed to be, you would find (here) that this 2/3rds consists of 57% of responding American Education Finance Association (AEFA) members, and AEFA members alone, who are certainly not representative of “educational researchers” as claimed.

Regardless, Snyder asks: “Why are researchers…so in favor of [these] growth measures?” Because, Snyder, this disciplinary subset is only a subset and does not represent educational researchers writ large.

As it is with politics today, many educational researchers who define themselves as aligned with the disciplines of educational finance or educational econometrics are substantively more in favor of VAMs than those who align with the more general disciplines of educational research and educational measurement, methods, and statistics. While this is somewhat of a sweeping generalization, which is not wise as I also argue and also acknowledge in this piece, there is certainly more to be said here about the validity of the inferences drawn here, and (too) often driven via the “media” like The 74.

The bottom line is to question and critically consume everything, and everyone who feels qualified to write about particular things without enough expertise in most everything, including in this case good and professional journalism, this area of educational research, and what it means to make valid inferences and then responsibly share them out with the public.

States’ Teacher Evaluation Systems Now “All over the Map”

We are now just one year past the federal passage of the Every Student Succeeds Act (ESSA), within which it is written that states must no longer set up teacher-evaluation systems based in significant part on their students’ test scores. As per a recent article written in Education Week, accordingly, most states are still tinkering with their teacher evaluation systems, particularly regarding the student growth or value-added measures (VAMs) that were also formerly required to help states assess teachers’ purported impacts on students’ test scores over time.

“States now have a newfound flexibility to adjust their evaluation systems—and in doing so, they’re all over the map.” Likewise, though, “[a] number of states…have been moving away from [said] student growth [and value-added] measures in [teacher] evaluations,” said a friend, colleague, co-editor, and occasional writer on this blog (see, for example, here and here) Kimberly Kappler Hewitt (University of North Carolina at Greensboro).  She added that this is occurring “whether [this] means postponing [such measures’] inclusion, reducing their percentage in the evaluation breakdown, or eliminating those measures altogether.”

While states like Alabama, Iowa, and Ohio seem to still be moving forward with the attachment of students’ test scores to their teachers, other states seem to be going “back and forth” or putting a halt to all of this altogether (e.g., California). Alaska cut back the weight of the measure, while New Jersey tripled the weight to count for 30% of a teacher’s evaluation score, and then introduced a bill to reduce it back to 0%. In New York teachers are still to receive a test-based evaluation score, but it is not to be tied to consequences and is to be completely revamped by 2019. In Alabama a bill that would have tied 25% of a teacher’s evaluation to his/her students’ ACT and ACT Aspire college-readiness tests has yet to see the light of day. In North Carolina state leaders re-framed the use(s) of such measures to be more of an improvement tool (e.g., for professional development), but not “a hammer” to be used against schools or teachers. The same thing is happening in Oklahoma, although this state is not specifically mentioned in this piece.

While some might see all of this as good news — or rather better news than what we have seen for nearly the last decade during which states, state departments of education, and practitioners have been grappling with and trying to make sense of student growth measures and VAMs — others are still (and likely forever will be) holding onto what now seems to be some of the now unclenched promises attached to such stronger accountability measures.

Namely in this article, Daniel Weisberg of The New Teacher Project (TNTP) and author of the now famous “Widget Effect” report — about “Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness” that helped to “inspire” the last near-decade of these policy-based reforms — “doesn’t see states backing away” from using these measures given ESSA’s new flexibility. We “haven’t seen the clock turn back to 2009, and I don’t think [we]’re going to see that.”

Citation: Will, M. (2017). States are all over the map when it comes to how they’re looking to approach teacher-evaluation systems under ESSA. Education Week. Retrieved from

Value-Added for Kindergarten Teachers in Ecuador

Authors of a study that a colleague of mine recently sent me, recently released in The Quarterly Journal of Economics and titled “Teacher Quality and Learning Outcomes in Kindergarten,” (nearly randomly) assigned two cohorts of more than 24,000 kindergarten students to teachers to examine whether, indeed and once again, teacher behaviors are related to growth in students’ test scores over time (i.e., value-added).

To assess this, researchers administered 12 tests to the Kindergarteners (I know) at the beginning and end of the year in mathematics and language arts (although apparently the 12 posttests only took 30-40 minutes to complete, which is a content validity and coverage issue in and of itself, p. 1424). They also assessed something they called the executive function (EF), which they defined as children’s inhibitory control, working memory, capacity to pay attention, and cognitive flexibility, all of which they argue to be related to “Volumetric measures of prefrontal cortex size [when] predict[ed]” (p. 1424). This, along with the fact that teachers’ IQs were also measured (using the Spanish-language version of the Wechsler Adult Intelligence Scale) speaks directly to the researchers’ background theory and approach (e.g., recall our world’s history with craniometry, aptly captured in one of my favorite books, Stephen J. Gould’s best-selling “The Mismeasure of Man”). Teachers were also observed using the Classroom Assessment Scoring System (CLASS), and parents were also solicited for their opinions about their children’s teachers (see other measures collected pp. 1417-1418).

What should by now be some familiar names (e.g., Raj Chetty, Thomas Kane) served as collaborators on the study. Likewise, their works and the works of other likely familiar scholars and notorious value-added supporters (e.g., Eric Hanushek, Jonah Rockoff) are also cited throughout in support as evidence of “substantial research” (p. 1416) in support of value-added models (VAMs). Of course, this is unfortunate but important to point out in that this is an indicator of “researcher bias” in and of itself. For example, one of the authors’ findings really should come as no surprise: “Our results…complement estimates from [Thomas Kane’s Bill & Melinda Gates Measures of Effective Teaching] MET project” (p. 1419); although, the authors in a very interesting footnote (p. 1419) describe in more detail than I’ve seen elsewhere all of the weaknesses with the MET study in terms of its design, “substantial attrition,” “serious issue[s]” with contamination and compliance, and possibly/likely biased findings caused by self-selection given the extent to which teachers volunteered to be a part of the MET study.

Also very important to note is that this study took place in Ecuador. Apparently, “they,” including some of the key players in this area of research noted above, are moving their VAM-based efforts across international waters, perhaps in part given the Every Student Succeeds Act (ESSA) recently passed in the U.S., that we should all know by now dramatically curbed federal efforts akin to what is apparently going on now and being pushed here and in other developing countries (although the authors assert that Ecuador is a middle-income country, not a developing country, even though this categorization apparently only applies to the petroleum rich sections of the nation). Related, they assert that, “concerns about teacher quality are likely to be just as important in [other] developing countries” (p. 1416); hence, adopting VAMs in such countries might just be precisely what these countries need to “reform” their schools, as well.

Unfortunately, many big businesses and banks (e.g., the Inter-American Development Bank that funded this particular study) are becoming increasingly interested in investing in and solving these and other developing countries’ educational woes, as well, via measuring and holding teachers accountable for teacher-level value-added, regardless of the extent to which doing this has not worked in the U.S. to improve much of anything. Needless to say, many who are involved with these developing nation initiatives, including some of those mentioned above, are also financially benefitting by continuing to serve others their proverbial Kool-Aid.

Nonetheless, their findings:

  • First, they “estimate teacher (rather than classroom) effects of 0.09 on language and math” (p. 1434). That is, just less than 1/10th of a standard deviation, or just over a 3% move in the positive direction away from the mean.
  • Similarly, they “estimate classroom effects of 0.07 standard deviation on EF” (p. 1433). That is, precisely 7/100ths of a standard deviation, or just under a 3% move in the positive direction away from the mean.
  • They found that “children assigned to teachers with a 1-standard deviation higher CLASS score have between 0.05 and 0.07 standard deviation higher end-of-year test scores” (p. 1437), or a 2-3% move in the positive direction away from the mean.
  • And they found “that parents generally give higher scores to better teachers…parents are 15 percentage points more likely to classify a teacher who produces 1 standard deviation higher test scores as ‘very good’ rather than ‘good’ or lower” (p. 1442). This is quite an odd way of putting it, along with the assumption that the difference between “very good” and “good” is not arbitrary but empirically grounded, along with whatever reason a simple correlation was not more simply reported.
  • Their main finding is that “a 1 standard deviation increase in classroom quality, corrected for sampling error, results in 0.11 standard deviation higher test scores in both language and math” (p. 1433; see also other findings from pp. 1434-1447).
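
To put the effect sizes listed above on a more familiar scale, here is a minimal sketch, assuming test scores are normally distributed, of how an effect expressed in standard deviations converts to percentile points for a student who starts at the mean. The function names (normal_cdf, percentile_shift) and the labels are my own illustrative choices, not anything taken from the study; only the effect sizes come from the findings quoted above.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_shift(effect_sd: float) -> float:
    """Percentile points gained by a student who starts at the mean
    (the 50th percentile) and moves up by effect_sd standard deviations.
    Assumes normally distributed scores (an assumption of this sketch)."""
    return 100.0 * (normal_cdf(effect_sd) - 0.5)


# Effect sizes quoted in the findings above, in standard deviation units.
# The labels are illustrative shorthand, not the study's terminology.
effects = {
    "teacher effect, language and math": 0.09,
    "classroom effect, executive function": 0.07,
    "CLASS score, lower bound": 0.05,
    "classroom quality, language and math": 0.11,
}

for label, sd in effects.items():
    shift = percentile_shift(sd)
    print(f"{label}: {sd:.2f} SD is about {shift:.1f} percentile points")
```

Under that normality assumption, effects of 0.09, 0.07, 0.05, and 0.11 standard deviations work out to roughly 3.6, 2.8, 2.0, and 4.4 percentile points, respectively. A student starting far from the mean would move by fewer percentile points, because the normal curve is thinner in the tails.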

Interestingly, the authors equate all of these effects with teacher or classroom “shocks,” although I’d hardly call them “shocks,” a term that inherently implies a large, unidirectional, and causal impact. Moreover, this also implies how the authors, also as economists, still view this type of research (i.e., not correlational, even with close-to-random assignment, although they make a slight mention of this possibility on p. 1449).

Nonetheless, the authors conclude that in this article they effectively evidenced “that there are substantial differences [emphasis added] in the amount of learning that takes place in language, math, and executive function across kindergarten classrooms in Ecuador” (p. 1448). In addition, “These differences are associated with differences in teacher behaviors and practices,” as observed, and “that parents can generally tell better from worse teachers, but do not meaningfully alter their investments in children in response to random shocks [emphasis added] to teacher quality” (p. 1448).

Ultimately, they find that “value added is a useful summary measure of teacher quality in Ecuador” (p. 1448). Go figure…

They conclude “to date, no country in Latin America regularly calculates the value added of teachers,” yet “in virtually all countries in the region, decisions about tenure, in-service training, promotion, pay, and early retirement are taken with no regard for (and in most cases no knowledge about) a teacher’s effectiveness” (p. 1448). Also sound familiar??

“Value added is no silver bullet,” and indeed it is not as per much evidence now existent throughout the U.S., “but knowing which teachers produce more or less learning among equivalent students [is] an important step to designing policies to improve learning outcomes” (p. 1448), they also recognizably argue.

Citation: Araujo, M. C., Carneiro, P.,  Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in Kindergarten. The Quarterly Journal of Economics, 1415–1453. doi:10.1093/qje/qjw016  Retrieved from

A New Book about VAMs “On Trial”

I recently heard about a new book that was written by Mark Paige — J.D. and Ph.D., assistant professor of public policy at the University of Massachusetts-Dartmouth, and a former school law attorney — and published by Rowman & Littlefield. The book is about, as per the secondary part of its title “Understanding Value-Added Models [VAMs] in the Law of Teacher Evaluation.” See more on this book, including information about how to purchase it, for those of you who might be interested in reading more, here, and also via Amazon here.

Clearly, this book is to prove very relevant given the ongoing court cases across the country (see a prior post on these cases here) regarding teachers and the systems being used to evaluate them when especially (or extremely) reliant upon VAM-based estimates for consequential decision-making purposes (e.g., teacher tenure, pay, and termination). While I have not yet read the book, I just ordered my copy the other day. I suggest you do the same, again, should you be interested in further or better understanding the federal and state law pertinent to these cases.

Notwithstanding, I also requested that the author of this book — Mark Paige — write a guest post so that you too could find out more. Here is what he wrote:

Many of us have been following VAMs in legal circles. Several courts have faced the issue of VAMs as they relate to employment law matters. These cases have tested a chief selling point (pardon [or underscore] the business reference) of VAMs: that they will effectuate, for example, teacher termination with greater ease because nobody besides the advanced statisticians and econometricians can argue with the numbers derived. In other words, if a teacher’s VAM rating is bad, then the teacher must be bad. It’s to be as simple as that. How can a court deny that reality?

Of course, as we [should] already know, VAMs are anything but certain. Bluntly stated: VAMs are a statistical “hot mess.” The American Statistical Association, among many others, warned in no uncertain terms that VAMs cannot – and should not – be trusted to make significant employment decisions. Of course, that has not stopped many policymakers from a full-throated adoption of their use in employment and evaluation decisions. Talk about hubris.

Accordingly, I recently completed this book, again, that focuses squarely on the intersection of VAMs and the law. Its full title is “Building a Better Teacher: Understanding Value-Added Models in the Law of Teacher Evaluation” (Rowman & Littlefield, 2016). Again, I provide a direct link to the book along with its description here.

To offer a bit of a sneak preview, though, I draw many conclusions throughout the book, but the first of two important take-aways is this: VAMs may actually complicate the effectuation of a teacher’s termination. Here’s one way: because VAMs are so statistically infirm, they invite plaintiff-side attorneys to attack any underlying negative decision based on these models. See, for example, the recent New York State Supreme Court decision in Sheri Lederman’s case, here. [See also a related post in this blog here].

In other words, the evidence upon which districts or states rely to make significant decisions is untrustworthy (or arbitrary) and, therefore, so is any decision based, even if only in part, on VAMs. Thus, VAMs may actually strengthen a teacher’s case. This, of course, is quite apart from the fact that VAM use results in firing good teachers based on poor information, thereby contributing to the teacher shortages and lower morale (among a parade of other horribles) being reported across the nation, and now likely more than ever.

The second important take-away is this, especially given followers of this blog include many educators and administrators facing a barrage of criticisms that only “de-professionalize” them: Courts have, over time, consistently deferred to the professional judgment of administrators (and their assessment of effective teaching). The members of that august institution – the judiciary – actually believe that educators know best about teaching, and that years of accumulated experience and knowledge have actual and also court-relevant value. That may come as a startling revelation to those who consistently diminish the education profession, or those who at least feel like they and their efforts are consistently being diminished.

To be sure, the system of educator evaluation is not perfect. Our schools continue to struggle to offer equal and equitable educational opportunities to all students, especially those in the nation’s highest needs schools. But what this book ultimately concludes is that the continued use of VAMs will not, ahem, add any value to these efforts.

To reach author Mark Paige via email, please contact him at To reach him via Twitter: @mpaigelaw

Houston Education and Civil Rights Summit (Friday, Oct. 14 to Saturday, Oct. 15)

For those of you interested, and perhaps close to Houston, Texas, I will be presenting my research on the Houston Independent School District’s (now hopefully past) use of the Education Value-Added Assessment System for more high-stakes, teacher-level consequences than anywhere else in the nation.

As you may recall from prior posts (see, for example, here, here, and here), seven teachers in the district, with the support of the Houston Federation of Teachers (HFT), are taking the district to federal court over how their value-added scores are/were being used, and allegedly abused. The case, Houston Federation of Teachers, et al. v. Houston ISD, is still ongoing; although, also as per a prior post, the school board just this past June, in a 3:3 split vote, elected to no longer pay an annual $680K to SAS Institute Inc. to calculate the district’s EVAAS estimates. Hence, by non-renewing this contract it appears, at least for the time being, that the district is free from its prior history using the EVAAS for high-stakes accountability. See also this post here for an analysis of Houston’s test scores post EVAAS implementation, as compared to other districts in the state of Texas. Apparently, all of the time and energy invested did not pay off for the district, or more importantly its teachers and students located within its boundaries.

Anyhow, those presenting and attending the conference–the Houston Education and Civil Rights Summit, as also sponsored and supported by United Opt Out National–will prioritize and focus on the “continued challenges of public education and the teaching profession [that] have only been exacerbated by past and current policies and practices,”  as well as “the shifting landscape of public education and its impact on civil and human rights and civil society.”

As mentioned, I will be speaking, alongside two featured speakers: Samuel Abrams–the Director of the National Center for the Study of Privatization in Education (NCSPE) and an instructor in Columbia’s Teachers College, and Julian Vasquez Heilig–Professor of Educational Leadership and Policy Studies at California State University, Sacramento and creator of the blog Cloaking Inequality. For more information about these and other speakers, many of whom are practitioners, see the conference website available, again, here.

When is it? Friday, October 14, 2016 at 4:00 PM through to Saturday, October 15, 2016 at 8:00 PM (CDT).

Where is it? Houston Hilton Post Oak – 2001 Post Oak Blvd, Houston, TX 77056

Hope to see you there!

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, but unfortunately, this article was written about 30 years ago (i.e., 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, like they still are now, trying to hold teachers accountable for their students’ learning as measured on states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” Yet “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework…. The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be “controlled” or “factored” in or out, either.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” As it was then, so it still is today: “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

The Late Stephen Jay Gould on IQ Testing (with Implications for Testing Today)

One of my doctoral students sent me a YouTube video I feel compelled to share with you all. It is an interview with one of my all-time favorite and most admired academics: Stephen Jay Gould. Gould, who passed away at age 60 from cancer, was a paleontologist, evolutionary biologist, and scientist who spent most of his academic career at Harvard. He was “one of the most influential and widely read writers of popular science of his generation,” and he was also the author of one of my favorite books of all time: The Mismeasure of Man (1981).

In The Mismeasure of Man, Gould examined the history of psychometrics and the history of intelligence testing (e.g., the methods of nineteenth-century craniometry, or the physical measures of people’s skulls to “objectively” capture their intelligence). Gould examined psychological testing and the uses of all sorts of tests and measurements to inform decisions (which is still, as we know, uber-relevant today) as well as to “inform” biological determinism (i.e., the view that “social and economic differences between human groups—primarily races, classes, and sexes—arise from inherited, inborn distinctions and that society, in this sense, is an accurate reflection of biology”). Gould also examined in this book the general use of mathematics and “objective” numbers writ large to measure pretty much anything, as well as to measure and evidence predetermined sets of conclusions. This book is, as I mentioned, one of the best. I highly recommend it to all.

In this seven-minute video, you can get a sense of what this book is all about, and of why it remains so relevant to what we continue to believe or not believe about tests and what they really are or are not worth. Thanks, again, to my doctoral student for finding this; it is a treasure not to be buried, especially given Gould’s 2002 passing.

Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

I recently re-read, in full, an article that is now 10 years old, published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs; unfortunately, most of the time this serves as a reminder of how little things have changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientist), Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the point at which the authors wrote this article, besides the aforementioned data and database issues, there were issues with “multiple measures on the same student and multiple teachers instructing each student” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” The authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (still) comes from influences outside of the teacher’s control. A similar statistic (i.e., 1% to 14%) was highlighted in the recent position statement on VAMs released by the American Statistical Association.
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one step further now, noting that all VAMs now focus exclusively on large-scale standardized tests. This I evidenced in a recent paper I published here: Putting growth and value-added models on the map: A national overview.
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing at random assumption, however, runs through most VAMs because without it, data missingness would be pragmatically insolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” The only real solution here is to use “implicit imputation of values for unobserved gains using the observed scores” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observed gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.
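
To make the variance-share point in the first bullet above concrete, here is a minimal Python sketch. It is an illustration on simulated data, not the models McCaffrey et al. examined and not any vendor’s VAM. Everything in it is an assumption made for the example: the function names, the number of teachers and students, the two standard deviations, and above all the random assignment of students to classrooms (which real data do not offer). It shows only what it means to say that teachers account for a given share of the variance in scores, and how that share might be estimated from classroom data.

```python
import random
import statistics


def true_teacher_share(teacher_sd, other_sd):
    """Share of score variance due to teachers when the two sources are independent."""
    return teacher_sd ** 2 / (teacher_sd ** 2 + other_sd ** 2)


def estimate_teacher_share(n_teachers=200, n_students=25,
                           teacher_sd=1.0, other_sd=3.0, seed=0):
    """Simulate classrooms and estimate the teacher share of score variance.

    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    class_means = []
    within_vars = []
    for _ in range(n_teachers):
        effect = rng.gauss(0, teacher_sd)  # one simulated teacher effect
        # Each score = teacher effect + everything else (home, peers, noise).
        scores = [effect + rng.gauss(0, other_sd) for _ in range(n_students)]
        class_means.append(statistics.fmean(scores))
        within_vars.append(statistics.variance(scores))

    within_var = statistics.fmean(within_vars)
    # Class means differ because of teachers AND because of sampling noise,
    # so subtract the noise part (within_var / n_students) before comparing.
    between_var = max(
        statistics.variance(class_means) - within_var / n_students, 0.0
    )
    return between_var / (between_var + within_var)


if __name__ == "__main__":
    print(f"true share:      {true_teacher_share(1.0, 3.0):.3f}")
    print(f"estimated share: {estimate_teacher_share():.3f}")
```

With the assumed standard deviations (1.0 for teachers, 3.0 for everything else), the true teacher share is 10%, which sits inside the 3% to 20% range reported above. Note that even in this best-case simulation the estimate only approximates that share; once students are not randomly assigned, the between-classroom variance also absorbs student differences, which is the bias problem the later bullets describe.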

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”