One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written some 27 years ago (i.e., in 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and the history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, as they still are now, trying to hold teachers accountable for their students’ learning as measured by states’ standardized test scores. Although the article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, its contents still have much relevance given where we are today: investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” While “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” All of this makes isolating the teacher effect nearly, if not wholly, impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or out, either.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” As it was then, so it is today: “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

The Study that Keeps on Giving…(Hopefully) in its Final Round

In January I wrote a post about “The Study that Keeps on Giving…” Specifically, this post was about the study conducted and authored by Raj Chetty (Economics Professor at Harvard), John Friedman (Assistant Professor of Public Policy at Harvard), and Jonah Rockoff (Associate Professor of Finance and Economics at Columbia). The study was first published in 2011 (in its non-peer-reviewed, and not even internally reviewed, form) by the National Bureau of Economic Research (NBER) and then published again by NBER in January of 2014 (in the same form), but this time split into two separate studies (see them split here and here).

Their re-release of the same albeit split study was what prompted the title of the initial “The Study that Keeps on Giving…” post. Little did I know then, though, that the reason this study was re-released in split form was that it was soon to be published in a peer-reviewed journal. Its non-peer-reviewed publication status was a major source of prior criticism. While journal editors seemed to have suggested the split, NBER seemingly took advantage of this opportunity to publicize this study in two forms, regardless and without prior explanation.

Anyhow, this came to my attention when the study’s lead author – Raj Chetty – emailed me a few weeks ago, copied Diane Ravitch on the same email, and also apparently emailed other study “critics” at the same time (see prior reviews of this study as per this study’s other notable “critics” here, here, here, and here) to notify all of us that this study had made it through peer review and was to be published in a forthcoming issue of the American Economic Review. While Diane and I responded to our joint email (as other critics may have done as well), we ultimately promised Chetty that we would not share the actual contents of any of the approximately 20 email exchanges that went back and forth among the three of us over the following days.

What I can say, though, is that no genuine concern was expressed by Chetty, or on behalf of his co-authors, about the intended or unintended consequences that have come about as a result of his study, nor about how many policymakers have since used and abused the study’s results for political gain and the further advancement of VAM-based policies. Instead, the emails were more or less self-promotional and celebratory, especially given that President Obama cited the study in his 2012 State of the Union Address and that Chetty apparently continues to advise U.S. Secretary of Education Arne Duncan about his similar VAM-based policies. Perhaps, next, the Nobel prize committee might pay this study its due respects, now overdue, but again I only paraphrase from what I inferred from these email conversations.

As a refresher, Chetty et al. conducted value-added analyses on a massive data set (with over 1 million student-level test and tax records) and presented (highly questionable) evidence in favor of teachers’ long-lasting, enduring, and in some cases miraculous effects. While some of the findings would have been very welcome to the profession had they indeed been true (e.g., that high value-added teachers substantively affect students’ incomes in their adult years), the study’s authors overstated their findings, and they did not duly consider (or provide evidence to counter) the alternative hypotheses in terms of what other factors besides teachers might have caused the outcomes they observed (e.g., those things that happen outside of schools while students are in school and throughout students’ lives).

Nor did they consider, or rather satisfactorily consider, how the non-random assignment of students into both schools and classrooms might have biased the effects observed, whereby the students in high “value-added” teachers’ classrooms might have been more “likely to succeed” regardless of, or even despite, the teacher effect, in both the short-term and long-term outcomes demonstrated in their findings…which were then widely publicized via the media and beyond throughout other political spheres.

Rather, Chetty et al. advanced what they argued were a series of causal effects by exploiting a series of correlations that they turned attributional. They did this because (I believe) they truly believe that their sophisticated econometric models, and the sophisticated controls and approaches they use, in fact work as intended. Perhaps this also explains why Chetty et al. give pretty much all credit in the area of value-added research to econometricians, and they do this throughout their papers, all the while over-citing the works of their fellow economists/friends while ignoring the others (besides Berkeley economist Jesse Rothstein; see the full reference to his study here) who have outright contradicted their findings, with evidence. Apparently, educational researchers do not have much to add on this topic, but I digress.

But this too is a serious fault, as “they” (and I don’t mean to make sweeping generalizations here) have never been much for understanding what goes into the data they analyze, data that are socially constructed and largely context dependent. Nor do they seem to care to fully understand the realities of the classrooms from which they receive such data, or what test scores actually mean, or what one can and cannot actually infer when using them. This, too, was made clear via our email exchange. It seems this from-the-sky-down view of educational data is the best (as well as the most convenient) approach, one “they” might even expressly prefer, so that they do not have to get their data fingers dirty and deal with the messiness that always surrounds these types of educational data and that always comes into play when conducting most any type of educational research that relies (in this case solely) on students’ large-scale standardized test scores.

Regardless, I decided to give this study yet another review to see if, now that it has made it through the peer review process, I was missing something. I wasn’t. The studies are pretty much exactly the same as they were when first released (which unfortunately does not say much for peer review). The first study here is about VAM-based bias and how VAM estimates that control for students’ prior test scores “exhibit little bias despite the grouping of students,” despite the number of studies, not referenced or cited, that continue to evidence the opposite. The second study here is about teacher-level value-added and how teachers with a lot of it (purportedly) cause grander things throughout their students’ lives. More specifically, they found that “students [non-randomly] assigned to high [value-added] teachers are more likely to attend college, earn higher salaries, and are less likely to have children as teenagers.” They also found that “[r]eplacing a teacher whose [value-added] is in the bottom 5% with an average teacher would increase the present value of students’ lifetime income by approximately $250,000 per classroom [emphasis added].” Please note that this overstated figure is not per student; had it been broken out by student it would have become “chump change,” for lack of a better term, which serves as just one example of their classic exaggerations (see the arithmetic sketch below). They do, however, when you read through the actual text, tone their powerful language down a bit to note that, on average, this figure is more accurately $185,000, still per classroom. Again, to read the more thorough critiques conducted by other scholars with equally impressive academic profiles, I suggest readers click here, here, here, or here.
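
As a quick back-of-the-envelope illustration of why the per-classroom framing flatters the result (this is my arithmetic, not theirs; the class size of 28 and the 40-year working life are assumptions made purely for illustration), consider:

```python
# Hypothetical arithmetic only: the class size and working-life length are my
# assumptions for illustration; the dollar figures are the ones quoted above.
CLASS_SIZE = 28        # assumed number of students per classroom
WORKING_YEARS = 40     # assumed length of a working life

for classroom_total in (250_000, 185_000):
    per_student = classroom_total / CLASS_SIZE
    per_year = per_student / WORKING_YEARS
    print(f"${classroom_total:,} per classroom is about ${per_student:,.0f} per student "
          f"over a lifetime, or roughly ${per_year:,.0f} per working year")
```

Framed per student per working year rather than per classroom per lifetime, the figure looks far less headline-worthy, which is the point.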

What I did find important to bring to light during this round of review were the assumptions that, thanks to Chetty and his emails, were made more obvious (and likewise more troublesome) than before. These are the, in some cases, “very strong” assumptions that Chetty et al. make explicit in both of their studies (see Assumptions 1-3 in the first and second papers), along with the “evidence” they offer for why these assumptions should not be rejected (most likely, and in some cases clearly, because their studies relied on them). The assumptions they made were so strong, in fact, that at one point they even mention that it would have been useful had they been able to “relax” some of them. In other cases, they justify their adoption of these assumptions given the data limitations and methodological issues they faced, plainly and simply because there was no other way to conduct (or continue) their analyses without making and agreeing to these assumptions.

So, see for yourselves if you agree with the following three assumptions, the ones they make most explicit and use throughout both studies (although other assumptions are littered throughout both pieces). I would love for Chetty et al. to discuss whether their assumptions in fact hold given the realities that everyday teachers, schools, and classrooms face. But again, I digress…

Assumption 1 [Stationarity]: Teacher-level value-added, as based on growth in student achievement over time, follows a stationary, unchanging, constant, and consistent process. On average, “teacher quality does not vary across calendar years and [rather] depends only on the amount of time that elapses between” years. While completely nonsensical to the average adult with really any common sense, this assumption helped them “simplif[y] the estimation of teacher [value-added] by reducing the number of parameters” needed in their models, or more appropriately, needed to model their effects.

Assumption 2 [Selection on Excluded Observables]: Students are sorted or assigned to teachers on excluded observables that can be estimated. See a recent study that I conducted with a doctoral student of mine (just published in this month’s issue of the highly esteemed American Educational Research Journal here) in which we found, with evidence, that 98% of the time this assumption is false. Students are non-randomly sorted on “observables” and “non-observables” (most of which are not and cannot be included in such data sets) 98% of the time; both types of variables bias teacher-level value-added over time, given that the statistical procedures meant to control for these variables do not work effectively, especially for students at the extremes, on both tails of the normal bell curve. While convenient, especially when conducting this type of far-removed research, this assumption is false and cannot really be taken seriously given the pragmatic realities of schools.

Assumption 3 [Teacher Switching as a Quasi-Experiment]: Changes in teacher-level value-added scores across cohorts within a school-grade are orthogonal to (i.e., uncorrelated with, or independent of) changes in other determinants of student scores. While Chetty et al. themselves write that this assumption “could potentially be violated by endogenous student or teacher sorting to schools over time,” they also state that “[s]tudent sorting at an annual frequency is minimal because of the costs of changing schools,” which is yet another unchecked assumption without reference(s) in support. They further note that “[w]hile endogenous teacher sorting is plausible over long horizons, the high-frequency changes [they] analyze are likely driven by idiosyncratic shocks such as changes in staffing needs, maternity leaves, or the relocation of spouses.” These are all plausible assumptions too, right? Is “high-frequency teacher turnover…uncorrelated with student and school characteristics?” Concerns about this and, really, all of these assumptions, and ultimately how they impact study findings, should certainly give pause.
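
To make the stakes of Assumptions 2 and 3 concrete, here is a minimal, purely illustrative simulation (mine, not Chetty et al.’s; every parameter value is made up) of what happens when students are non-randomly sorted to teachers on something the data set never records, such as out-of-school support. A simple gain-score value-added estimate (post-test minus pre-test, averaged by classroom) ends up tracking the unobserved sorting variable nearly as strongly as it tracks the true teacher effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 200, 25

# True teacher effects, in test-score standard-deviation units (made-up scale).
true_effect = rng.normal(0, 0.1, n_teachers)

# Unobserved out-of-school support, which differs systematically across classrooms
# because placement is non-random -- exactly what Assumptions 2 and 3 rule out.
support_by_class = rng.normal(0, 0.2, n_teachers)

estimate = np.empty(n_teachers)
for t in range(n_teachers):
    support = support_by_class[t] + rng.normal(0, 0.1, n_students)
    prior = rng.normal(0, 1, n_students) + 0.5 * support       # prior score partly reflects support
    post = prior + true_effect[t] + 0.5 * support + rng.normal(0, 0.3, n_students)
    estimate[t] = np.mean(post - prior)                        # classroom-average gain

print("corr(estimated effect, true teacher effect):",
      round(np.corrcoef(estimate, true_effect)[0, 1], 2))
print("corr(estimated effect, unobserved support):",
      round(np.corrcoef(estimate, support_by_class)[0, 1], 2))
```

Under these made-up numbers, the two correlations come out in the same ballpark, which is the whole problem: the estimated “teacher effect” is partly a sorting effect.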

My final point, interestingly enough, also came up during the email exchanges mentioned above. Chetty made the argument that he, more or less, had no dog in the fight surrounding value-added. In the first sentence of the first manuscript, however, he (and his colleagues) wrote, “Are teachers’ impacts on students’ test scores (“value-added”) a good measure of their quality?” The answer, made known soon thereafter and repeatedly in both papers, is an unequivocal “Yes.” Chetty et al. write in the first paper that “[they] established that value-added measures can help us [emphasis added, as “us” is undefined] identify which teachers have the greatest ability to raise students’ test scores.” In the second paper, they write that “We find that teacher [value-added] has substantial impacts on a broad range of outcomes.” Apparently, Chetty wasn’t representing his and/or his colleagues’ honest, “research-based” opinions and feelings about VAMs very well in one place (i.e., our emails) or the other (his publications).

Contradictions…as nettlesome as those dirty little assumptions, I suppose.

American Statistical Association (ASA) Position Statement on VAMs

In my most recent post, about the top 14 research-based articles about VAMs, I highlighted a great research-based statement released just last week by the American Statistical Association (ASA), titled the “ASA Statement on Using Value-Added Models for Educational Assessment.”

It is short, accessible, easy to understand, and hard to dispute, so I wanted to be sure nobody missed it; this is certainly a must-read for all of you following this blog, not to mention everybody else dealing/working with VAMs and their related educational policies. Likewise, it represents the current, research-based evidence and thinking of probably 90% of the educational researchers and econometricians (still) conducting research in this area.

Again, the ASA is the foremost statistical organization in the U.S. and likely one of, if not the, best statistical associations in the world. Some of the most important parts of their statement (as I see them), taken directly from the full statement, follow:

  1. VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and [emphasis added] interpret their results.
  2. Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
  3. VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  4. VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  5. Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  6. VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  7. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.
  8. Attaching too much importance to a single item of quantitative information is counter-productive—in fact, it can be detrimental to the goal of improving quality.
  9. When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes…[but only if used for descriptive/description purposes]. Otherwise, using VAM scores to improve education requires that they provide meaningful information about a teacher’s ability to promote student learning…[and they just do not do this at this point, as there is no research evidence to support this ideal].
  10. A decision to use VAMs for teacher evaluations might change the way the tests are viewed and lead to changes in the school environment. For example, more classroom time might be spent on test preparation and on specific content from the test at the exclusion of content that may lead to better long-term learning gains or motivation for students. Certain schools may be hard to staff if there is a perception that it is harder for teachers to achieve good VAM scores when working in them. Overreliance on VAM scores may foster a competitive environment, discouraging collaboration and efforts to improve the educational system as a whole.
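
To put point 7 above in more concrete terms, here is a minimal, purely illustrative simulation (mine, not the ASA’s; the 10% share is simply one value picked from the 1% to 14% range they cite, and everything else is made up) of what it means for teachers to account for that share of the variability in test scores.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 1000, 25

# Hypothetical variance share for teachers, picked from the 1%-14% range the ASA
# cites in point 7; everything else here is illustrative, not from the statement.
teacher_share = 0.10
teacher_effect = rng.normal(0, np.sqrt(teacher_share), n_teachers)
scores = teacher_effect[:, None] + rng.normal(0, np.sqrt(1 - teacher_share),
                                              (n_teachers, class_size))

# How strongly does an individual student's score line up with his or her
# teacher's true effect when teachers explain ~10% of the score variance?
r = np.corrcoef(np.repeat(teacher_effect, class_size), scores.ravel())[0, 1]
print(f"teacher share of score variance (by construction): {teacher_share:.0%}")
print(f"correlation between a student's score and the teacher's true effect: {r:.2f}")
print(f"share of score variance not attributable to the teacher: {1 - teacher_share:.0%}")
```

Even at the top of the ASA’s range (14%), the large majority of the variability in any individual student’s score lies outside the teacher’s contribution, which is precisely why the statement points to system-level conditions.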

Also important to point out is that, in the report, the ASA makes recommendations regarding the “key questions states and districts [yes, practitioners!] should address regarding the use of any type of VAM.” These include, although they are not limited to, questions about reliability (consistency), validity, the tests on which VAM estimates are based, and the major statistical errors that always accompany VAM estimates but are often buried and not reported with results (i.e., in terms of confidence intervals or standard errors).

Also important is the purpose for ASA’s statement, as written by them: “As the largest organization in the United States representing statisticians and related professionals, the American Statistical Association (ASA) is making this statement to provide guidance, given current knowledge and experience, as to what can and cannot reasonably be expected from the use of VAMs. This statement focuses on the use of VAMs for assessing teachers’ performance but the issues discussed here also apply to their use for school or principal accountability. The statement is not intended to be prescriptive. Rather, it is intended to enhance general understanding of the strengths and limitations of the results generated by VAMs and thereby encourage the informed use of these results.”

Do give the position statement a read and use it as needed!

An AZ Teacher’s Perspective on Her “Value-Added”

This came to me from a teacher in my home state – Arizona. Read not only what is becoming an all-too-familiar story, but also her perspective about whether she is the only one who is “adding value” (and I use that term very loosely here) to her students’ learning and achievement.

She writes:

Initially, the focus of this note was going to be my 6-year long experience with a seemingly ever-changing educational system.  I was going to list, with some detail, all the changes that I have seen in my brief time as a K-6 educator, the end-user of educational policy and budget cuts.  Changes like (in no significant order):

  • Math standards (2008?)
  • Common Core implementation and associated instructional shifts (2010?)
  • State accountability system (2012?)
  • State requirements related to ELD classrooms (2009?)
  • Teacher evaluation system (to include a new formula of classroom observation instrument and value-added measures) (2012-2014)
  • State laws governing teacher evaluation/performance, labeling and contracts (2010?)

have happened in a span of not much more than three years. And all these changes have happened against a backdrop of budget cuts severe enough to, in my school district, render librarians, counselors, and data coordinators extinct. In this note, I was going to ask, rhetorically: “What other field or industry has seen this much change this quickly and why?” or “How can any field or industry absorb this much change effectively?”

But then I had a flash of focus just yesterday during a meeting with my school administrators, and I knew immediately the simple message I wanted to relay about the interaction of high-stakes policies and the real world of a school.

At my school, we have entered what is known as “crunch time”—the three-month long period leading up to state testing.  The purpose of the meeting was to roll out a plan, commonly used by my school district, to significantly increase test scores in math via a strategy of leveled grouping. The plan dictates that my homeroom students will be assigned to groups based on benchmark testing data and will then be sent out of my homeroom to other teachers for math instruction for the next three months. In effect, I will be teaching someone else’s students, and another teacher will be teaching my students.

But, wearisomely, sometime after this school year, a formula will be applied to my homeroom students’ state test scores in order to determine close to 50% of my performance. And then another formula (to include classroom observations) will be applied to convert this performance into a label (ineffective, developing, effective, highly effective) that is then reported to the state. And so my question now is (not rhetorically!), “Whose performance is really being measured by this formula—mine or the teachers who taught my students math for three months of the school year?” At best, professional reputations are at stake; at worst, employment is.

Stanford Professor, Dr. Edward Haertel, on VAMs

In a recent speech and subsequent paper, Dr. Edward Haertel – National Academy of Education member and Professor at Stanford University – writes about VAMs and the extent to which VAMs, being based on student test scores, can be used to make reliable and valid inferences about teachers and teacher effectiveness. This is a must-read, particularly for those out there who are new to the research literature in this area. Dr. Haertel is certainly an expert here, actually one of the best we have, and in this piece he captures the major issues well.

Some of the issues highlighted include concerns about the tests used to model value-added and how their scales (falsely assumed to be as objective and equal as units on a measuring stick) complicate and distort VAM-based estimates. He also discusses the general issues with the tests almost always, if not always, used when modeling value-added (i.e., the state-level tests mandated as per No Child Left Behind in 2002).

He discusses why VAM estimates are least trustworthy, and most volatile and error-prone, when used to compare teachers who work in very different schools with very different student populations – students who do not attend schools in randomized patterns and who are rarely if ever randomly assigned to classrooms. The issues with bias, as highlighted by Dr. Haertel and also in a recent VAMboozled! post with a link to a new research article here, are probably the most serious VAM-related problems/issues going. As captured in his words, “VAMs will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in” (Haertel, 2013, pp. 12-13).

He reiterates issues with reliability, or a lack thereof. As per one research study he cites, researchers found that “a minimum of 10% of the teachers in the bottom fifth of the distribution one year were in the top fifth the next year, and conversely. Typically, only about a third of 1 year’s top performers were in the top category again the following year, and likewise, only about a third of 1 year’s lowest performers were in the lowest category again the following year. These findings are typical [emphasis added]…[While a] few studies have found reliabilities around .5 or a little higher…this still says that only half the variation in these value-added estimates is signal, and the remainder is noise [and/or error, which makes VAM estimates entirely invalid about half of the time]” (Haertel, 2013, p. 18).
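
Haertel’s reliability point is easy to see with a small simulation. The sketch below is mine, not his; the signal-plus-noise setup and the year-to-year correlations of .3 and .5 are illustrative stand-ins for the values he discusses, so treat the printed percentages as a pattern rather than as findings from any real data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # simulated teachers

def quintile_churn(year_to_year_r):
    """Simulate two years of VAM estimates whose correlation is year_to_year_r."""
    stable = rng.normal(0, np.sqrt(year_to_year_r), n)          # persistent "signal"
    noise_sd = np.sqrt(1 - year_to_year_r)
    y1 = stable + rng.normal(0, noise_sd, n)                    # year-1 estimate
    y2 = stable + rng.normal(0, noise_sd, n)                    # year-2 estimate
    q1 = np.digitize(y1, np.quantile(y1, [0.2, 0.4, 0.6, 0.8]))
    q2 = np.digitize(y2, np.quantile(y2, [0.2, 0.4, 0.6, 0.8]))
    top_stay = np.mean(q2[q1 == 4] == 4)       # top fifth in year 1 still top in year 2
    bottom_to_top = np.mean(q2[q1 == 0] == 4)  # bottom fifth in year 1, top fifth in year 2
    return top_stay, bottom_to_top

for r in (0.3, 0.5):
    stay, flip = quintile_churn(r)
    print(f"year-to-year correlation {r}: {stay:.0%} of the top fifth stay on top; "
          f"{flip:.0%} jump from the bottom fifth to the top fifth")
```

The exact percentages will differ run to run and model to model, but the qualitative pattern matches the findings quoted above: with this much noise, a large share of each year’s “top” and “bottom” teachers are reclassified the following year.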

Dr. Haertel also discusses other correlations among VAM estimates and teacher observational scores, VAM estimates and student evaluation scores, and VAM estimates taken from the same teachers at the same time but using different tests, all of which also yield abysmally (and unfortunately) low correlations, similar to those mentioned above.

His bottom line? “VAMs are complicated, but not nearly so complicated as the reality they are intended to represent” (Haertel, 2013, p. 12). They just do not measure well what so many believe they measure so very well.

Again, to find out more reasons and more in-depth explanations as to why, click here for the full speech and subsequent paper.

Random Assignment and Bias in VAM Estimates – Article Published in AERJ

“Nonsensical,” “impractical,” “unprofessional,” “unethical,” and even “detrimental” – these are just a few of the adjectives used by elementary school principals in Arizona to describe the use of randomized practices to assign students to teachers and classrooms. When asked whether principals might consider random assignment practices, one principal noted, “I prefer careful, thoughtful, and intentional placement [of students] to random. I’ve never considered using random placement. These are children, human beings.” Yet the value-added models (VAMs) being used in many states to measure the “value-added” by individual teachers to their students’ learning assume that any school is as likely as any other school, and any teacher is as likely as any other teacher, to be assigned any student who is as likely as any other student to have similar backgrounds, abilities, aptitudes, dispositions, motivations, and the like.

One of my doctoral students – Noelle Paufler – and I recently reported in the highly esteemed American Educational Research Journal the results of a survey administered to all public and charter elementary principals in Arizona (see the online publication of “The Random Assignment of Students into Elementary Classrooms: Implications for Value-Added Analyses and Interpretations”). We examined the various methods used to assign students to classrooms in their schools, the student background characteristics considered in nonrandom placements, and the roles teachers and parents play in the placement process. In terms of bias, the fundamental question here was whether the use of nonrandom student assignment practices might lead to biased VAM estimates, if the nonrandom student sorting practices went beyond that which is typically controlled for in most VAM models (e.g., academic achievement and prior demonstrated abilities, special education status, ELL status, gender, giftedness, etc.).

We found that, overwhelmingly, principals use various placement procedures through which administrators and teachers consider a variety of student background characteristics and student interactions to make placement decisions. In other words, student placements are far from random (contrary to the methodological assumptions to which VAM consumers often agree).

Principals frequently cited interactions between students, students’ peers, and previous teachers as justification for future placements. Principals stated that students were often matched with teachers based on their individual learning styles and the teachers’ respective teaching strengths. Parents also wielded considerable control over the placement process, with a majority of principals stating that parents made placement requests, most of which were honored.

In addition, in general, principal respondents were greatly opposed to using random student assignment methods in lieu of placement practices based on human judgment—practices they collectively agreed were in the best interest of students. Random assignment, even if necessary to produce unbiased VAM-based estimates, was deemed highly “nonsensical,” “impractical,” “unprofessional,” “unethical,” and even “detrimental” to student learning and teacher success.

The nonrandom assignment of students to classrooms has significant implications for the use of value-added models to estimate teacher effects on student learning using large-scale standardized test scores. Given the widespread use of nonrandom methods as indicated in this study, value-added researchers, policymakers, and educators should carefully consider the implications of placement decisions, as well as the validity of the inferences made using value-added estimates of teacher effectiveness.

How Might a Test Measure Teachers’ Causal Effects?

A reader wrote a very good question (see the VAMmunition post) that I feel is worth “sharing out,” with a short but (hopefully) informative answer that will help others better understand some of “the issues.”

(S)he wrote: “[W]hat exactly would a test look like if it were, indeed, ‘designed to estimate teachers’ causal effects’? Moreover, how different would it be from today’s tests?”

Here is (most of) my response: While large-scale standardized tests are typically limited in both the number and types of items included, among other things, one could quite simply use a similar test with more items, and more “instructionally sensitive” items, to better capture a teacher’s causal effects. This would be done with the pre- and post-tests occurring in the same year, while students are being instructed by the same (albeit not only…) teacher. However, this does not happen in any value-added system at this point, as these tests are given once per year (typically spring to spring). Hence, student growth scores include prior and other teachers’ effects, as well as the differential learning gains/losses that occur over the summers, during which students have little to no interaction with formal education systems or their teachers. This “biases” these measures of growth, big time!

The other necessary condition for doing this would be random assignment. If students were randomly assigned to classrooms (and teachers were randomly assigned to classrooms), this would help to make sure that all students are indeed similar at the outset, before what we might term the “treatment” (i.e., how effectively a teacher teaches for X amount of time). However, again, this rarely if ever happens in practice, as administrators and teachers (rightfully) see random assignment practices as great for experimental research purposes, but bad for students and their learning! Regardless, some statisticians suggest that their sophisticated controls can “account” for non-random assignment practices, yet again the evidence suggests that no matter how sophisticated the controls are, they simply do not work here either.
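
For readers who like to see the logic rather than just read it, here is a minimal, purely illustrative simulation of the two design points above (mine, with made-up numbers throughout). It compares a within-year, fall-to-spring growth measure against the usual spring-to-spring measure when classrooms differ, non-randomly, in students’ out-of-school support and summer learning gains/losses depend on that support. Under random assignment, the classroom differences in support would wash out on average; here they do not.

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, n_students = 300, 25
true_effect = rng.normal(0, 0.1, n_teachers)            # what we want to recover

fall_to_spring = np.empty(n_teachers)
spring_to_spring = np.empty(n_teachers)
class_background = np.empty(n_teachers)

for t in range(n_teachers):
    # Non-random placement: classes differ in average out-of-school support.
    background = rng.normal(rng.normal(0, 0.3), 0.2, n_students)
    class_background[t] = background.mean()
    prior_spring = rng.normal(0, 1, n_students)
    summer = 0.5 * background + rng.normal(0, 0.2, n_students)  # summer gain/loss
    fall = prior_spring + summer                                # score at start of the year
    spring = fall + true_effect[t] + rng.normal(0, 0.3, n_students)
    fall_to_spring[t] = np.mean(spring - fall)             # within-year pre/post growth
    spring_to_spring[t] = np.mean(spring - prior_spring)   # the usual annual design

for name, est in (("fall-to-spring", fall_to_spring),
                  ("spring-to-spring", spring_to_spring)):
    print(f"{name}: corr with true teacher effect = "
          f"{np.corrcoef(est, true_effect)[0, 1]:.2f}, "
          f"corr with class background = "
          f"{np.corrcoef(est, class_background)[0, 1]:.2f}")
```

In this toy setup, the within-year measure tracks the simulated true teacher effects far more closely, while the spring-to-spring measure partly tracks which students happened to be placed in the room, which is exactly the bias described above.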

See, for example, the Hermann et al. (2013), the Newton et al. (2010), and the Rothstein (2009, 2010) citations here, in this blog, under the “VAM Readings” link. I also have an article coming out about this, this month, co-authored with one of my doctoral students, in a highly esteemed peer-reviewed journal. Here is the reference if you want to keep an eye out for it; these references should (hopefully) explain all of this with greater depth and clarity: Paufler, N. A., & Amrein-Beardsley, A. (2013, October). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal. doi:10.3102/0002831213508299