The “Value-Added” of Teacher Preparation Programs: New Research

The journal Economics of Education Review recently published a study titled “Teacher Quality Differences Between Teacher Preparation Programs: How Big? How Reliable? Which Programs Are Different?” The study was authored by researchers at the University of Texas at Austin, Duke University, and Tulane University. The pre-publication version of this piece can be found here, and a condensed but thorough version, as covered in EducationNext, can be found here (see citations below).

As the title implies, the purpose of the study was to “evaluate statistical methods for estimating teacher quality differences between TPPs [teacher preparation programs].” Needless to say, this research is particularly relevant here, given that “Sixteen US states have begun to hold teacher preparation programs (TPPs) accountable for teacher quality, where quality is estimated by teacher value-added to student test scores.” The federal government continues to support and advance these initiatives as well (see, for example, here).

But this research study is also particularly important because of what the researchers found: “[t]he most convincing estimates [of TPP quality] [came] from a value-added model where confidence intervals [were] widened;” that is, the extent to which measurement errors were permitted was dramatically increased, and the intervals were also widened further using statistical corrections. Yet even when using these statistical techniques and accommodations, they found that it was still “rarely possible to tell which TPPs, if any, [were] better or worse than average.”

They therefore concluded that “[t]he potential benefits of TPP accountability may be too small to balance the risk that a proliferation of noisy TPP estimates will encourage arbitrary and ineffective policy actions” in response. More specifically, and in their own words, they found that:

  1. Differences between TPPs. While most of [their] results suggest that real differences between TPPs exist, the differences [were] not large [or large enough to make or evidence the differentiation between programs as conceptualized and expected]. [Their] estimates var[ied] a bit with their statistical methods, but averaging across plausible methods [they] conclude[d] that between TPPs the heterogeneity [standard deviation (SD) was] about .03 in math and .02 in reading. That is, a 1 SD increase in TPP quality predict[ed] just [emphasis added] a [very small] .03 SD increase in student math scores and a [very small] .02 SD increase in student reading scores.
  2. Reliability of TPP estimates. Even if the [above-mentioned] differences between TPPs were large enough to be of policy interest, accountability could only work if TPP differences could be estimated reliably. And [their] results raise doubts that they can. Every plausible analysis that [they] conducted suggested that TPP estimates consist[ed] mostly of noise. In some analyses, TPP estimates appeared to be about 50% noise; in other analyses, they appeared to be as much as 80% or 90% noise…Even in large TPPs the estimates were mostly noise [although]…[i]t is plausible [although perhaps not probable]…that TPP estimates would be more reliable if [researchers] had more than one year of data…[although states smaller than the one in this study, Texas]…would require 5 years to accumulate the amount of data that [they used] from one year of data.
  3. Notably Different TPPs. Even if [they] focus[ed] on estimates from a single model, it remains hard to identify which TPPs differ from the average…[Again,] TPP differences are small and estimates of them are uncertain.
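
To put the figures quoted above side by side, here is a minimal back-of-the-envelope sketch. It is not the authors' model. It assumes that each TPP estimate is simply the true TPP effect plus independent error, that "X% noise" means error variance divided by total estimate variance, and that an unadjusted 95% interval uses the usual 1.96 multiplier (the study widened its intervals further, so this sketch understates the uncertainty). The function names are hypothetical; only the .03 and .02 SDs and the 50%, 80%, and 90% noise shares come from the study as quoted.

```python
import math


def noise_sd(true_sd: float, noise_share: float) -> float:
    """Error SD implied by a given noise share (ASSUMED definition:
    noise_share = error variance / (true variance + error variance))."""
    if not 0 <= noise_share < 1:
        raise ValueError("noise_share must be in [0, 1)")
    true_var = true_sd ** 2
    error_var = true_var * noise_share / (1 - noise_share)
    return math.sqrt(error_var)


def ci_half_width(true_sd: float, noise_share: float, z: float = 1.96) -> float:
    """Half-width of an unadjusted normal-approximation 95% interval (assumption)."""
    return z * noise_sd(true_sd, noise_share)


if __name__ == "__main__":
    # True between-TPP SDs as quoted: about .03 in math and .02 in reading.
    for subject, true_sd in (("math", 0.03), ("reading", 0.02)):
        for share in (0.5, 0.8, 0.9):
            half = ci_half_width(true_sd, share)
            print(f"{subject}: true SD {true_sd:.2f}, {share:.0%} noise, "
                  f"error SD {noise_sd(true_sd, share):.3f}, "
                  f"95% half-width {half:.3f}")
```

Under these assumptions, even in the most favorable case (50% noise), the error SD equals the true between-program SD, so the interval around a program sitting a full SD above average in math (.03) would still comfortably include zero. That is consistent with the authors' point that it is rarely possible to tell which TPPs differ from average.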

In conclusion, that researchers found “that there are only small teacher quality differences between TPPs” might seem surprising, but it really is not, given that the outcome variables they used to measure and assess TPP effects were students’ test scores. In short, students’ test scores are three times removed from the primary unit of analysis in studies like these. That is, (1) the TPP is to be measured by the effectiveness of its teacher graduates, and (2) teacher graduates are to be measured by their purported impacts on their students’ test scores, while (3) students’ test scores have only been validated for measuring student learning and achievement. These test scores have not been validated to assess and measure, in the inverse, teachers’ causal impacts on said achievement, or TPPs’ impacts on teachers’ impacts on said achievement.

If this sounds confusing, it is, and it is also highly nonsensical. This is also a reason why this is so difficult to do and, as evidenced in this study, improbable to do well or as theorized: TPP estimates are sensitive to error, insensitive given error, and, accordingly, highly uncertain and invalid.

Citations: von Hippel, P. T., Bellows, L., Osborne, C., Lincove, J. A., & Mills, N. (2016). Teacher quality differences between teacher preparation programs: How big? How reliable? Which programs are different? Economics of Education Review, 53, 31–45. doi:10.1016/j.econedurev.2016.05.002

von Hippel, P. T., & Bellows, L. (2018). Rating teacher-preparation programs: Can value-added make useful distinctions? EducationNext. Retrieved from

Your Voices Were (At Least) Heard: Federal Plans for College of Education Value-Added

A couple of weeks ago I published a post titled “Your Voice Also Needs to Be Heard.” In this post I put out an all-call to solicit responses to an open request for feedback regarding the US Department of Education’s proposal to require teacher training programs (i.e., colleges of education) to track and be accountable for how their teacher graduates’ students are performing on standardized tests, once their teachers teach in the field for x years. That is, teacher-level value-added that reflects all the way back to a college of education’s purported quality.

In an article titled “U.S. Teacher-Prep Rules Face Heavy Criticism in Public Comments,” written by Stephen Sawchuk and published this past week in Education Week, you can read more about some of your and many others’ responses (i.e., more than 2,300 separate public comments).

As written into the first paragraph, the feedback was “overwhelmingly critical,” and Sawchuk charged this was the case given the “coordinated opposition from higher education officials and assorted policy groups.” Not that many in the former group, in particular, have valid arguments to make on the topic, given they are the ones at the center of the proposed reforms and understand the realities surrounding such reforms much better than the policymakers in charge…

Among the policy groups, Sawchuk accordingly positions groups like the National Education Policy Center (NEPC), which he defined as “a left-leaning think tank at the University of Colorado at Boulder that is partly funded by teachers’ unions and generally opposes market-based education policies,” against, for example, the Thomas B. Fordham Institute, which in reality is a neoconservative education policy think tank, but which Sawchuk, in his “reporting of the facts,” defines as just “generally back[ing] stronger accountability mechanisms in education.” Why such (biased) reporting, Sawchuk?

Regardless, the proposed rules at issue here were “issued under the Higher Education Act [and]…released by the U.S. Department of Education in November, some two years after negotiations with representatives from various types of colleges broke down over the regulations’ shape and scope. Among other provisions, the rules would require states to use measures such as surveys of school districts, teacher-employment data, and student-achievement results to classify each preparation program in one of four categories…The lowest-rated would be barred from offering federal grants of up to $4,000 a year to help pay for teacher education under the TEACH program.”

Although Sawchuk does not disaggregate the data, I would venture to say that the main, if not only, provision with which folks are actually taking issue is the latter piece: the use of student-achievement results to classify each preparation program in one of four categories as per their “value-added.” Sawchuk did, however, report on five major themes common across responses about how the new rules would:

  • Prioritize student test scores, potentially leading to deleterious effects on teacher-preparation coursework;
  • Apply punitive sanctions to programs rather than support them;
  • Expand federal meddling in state affairs;
  • Prescribe flawed measures that would yield biased results; and
  • Cost far more to implement than the $42 million the Education Department estimated.

You can see individuals’ responses also highlighted within the article, again linked to here.

“Only a handful of commenters were outright supportive of the rules.” Yet, “[w]hether the [US] Education Department will be swayed by the volume of negative comments to rewrite or withdraw the rules remains an open question.” What do you think they will do?

As per Michael J. Petrilli, a former staffer under George W. Bush’s administration, the US Department of Education “must give the public a chance to provide input, and has to explain if it has changed its regulations as a result of the process. But it doesn’t have to change a word.” I will try to stay positive, but I guess we shall wait and see…

Petrilli also cautioned that “critics’ attempts to undermine the rules could backfire. ‘If opponents want to be constructive, they need to suggest ways to improve the regulation, not just argue for its elimination.’” For the record, I am MORE THAN HAPPY to help offer better and much more reasonable and valid solutions!!