New Research Study: Incentives with No Impact

Following VAMboozled!’s most recent post (November 25, 2013) about the non-peer-reviewed National Bureau of Economic Research (NBER) study of DC’s IMPACT program, another recently published study, this time in the Journal of Labor Economics, a top-field, peer-reviewed journal, found no impact of incentives in New York City’s (NYC) public schools, despite the program’s large, multimillion-dollar scale. The author, Roland Fryer Jr. (2013), analyzed data from a school-based experiment “to better understand the impact of teacher incentives on student achievement.”

A randomized experiment, the gold standard in applied work of this kind, was implemented in more than 200 NYC public schools. The schools decided on the specific incentive scheme, either team-based or individual. The stakes were relatively high: on average, a high-performing school (i.e., a school that met 100% of its target) received a transfer of $180,000, and a school that met 75% of its target received $90,000. Not bad by all accounts!

The target was set based on a school’s performance in terms of students’ achievement, improvement, and the learning environment. Yes, a fraction of schools met the target and received the transfers, but this did not improve students’ achievement, to say the least. If anything, the incentives in fact worsened students’ performance. The estimates from the experiment imply that if a student attended a middle school with an incentive in place for three years, his/her math test scores would decline by 0.138 of a standard deviation and his/her reading scores would drop by 0.09 of a standard deviation.

Not only that, but the incentive program had no effect on teachers’ absenteeism or their retention in the school or district, nor did it affect teachers’ perceptions of the learning environment in a school. The estimated $75 million invested and spent brought, quite literally, zero return!

This evaluation, together with a few others (Glazerman & Seifullah, 2010; Springer et al., 2010; Vigdor, 2008), raises questions about the cost-effectiveness of similar incentive programs in schools, about achievement-based accountability measures in particular, and about their ability to positively affect students’ achievement.

Thanks to Margarita Pivovarova – Assistant Professor of Economics at Arizona State University – for this post.

References:

Fryer, R. G. (2013). Teacher incentives and student achievement: Evidence from New York City Public Schools. Journal of Labor Economics, 31(2), 373-407.

Glazerman, S., & Seifullah, A. (2010). An evaluation of the Teacher Advancement Program (TAP) in Chicago: Year two impact report. Washington, DC: Mathematica Policy Research. Retrieved from http://www.mathematica-mpr.com/publications/pdfs/education/tap_yr2_rpt.pdf

Springer, M. G., Ballou, D., Hamilton, L. S., Le, V.-N., Lockwood, J.R., McCaffrey, D.F., Pepper, M., & Stecher, B.M. (2010). Teacher pay for performance: Experimental evidence from the project on incentives in teaching. Nashville, TN: National Center on Performance Incentives. Retrieved from http://www.rand.org/content/dam/rand/pubs/reprints/2010/RAND_RP1416.pdf

Vigdor, J. L. (2008). Teacher salary bonuses in North Carolina. Nashville, TN: National Center on Performance Incentives. Retrieved from https://my.vanderbilt.edu/performanceincentives/files/2012/10/200803_Vigdor_TeacherBonusesNC.pdf

Unpacking DC’s Impact, or the Lack Thereof

Recently, I posted a critique of the newly released and highly publicized Mathematica Policy Research study about the (vastly overstated) “value” of value-added measures and their ability to effectively measure teacher quality. The study, which did not go through a peer review process, is fraught with methodological and conceptual problems, which I dismantled in that post, with a consumer alert.

Yet again, VAM enthusiasts are attempting to VAMboozle policymakers and the general public with another faulty study, this time released to the media by the National Bureau of Economic Research (NBER). The “working paper” (i.e., not peer-reviewed, and in this case not even internally reviewed by those at NBER) analyzed the controversial teacher evaluation system (i.e., IMPACT) that was put into place in DC Public Schools (DCPS) under the then Chancellor, Michelle Rhee.

The authors, Thomas Dee and James Wyckoff (2013), present what they term “novel evidence” to suggest that the “uniquely high-powered incentives” linked to “teacher performance” worked to improve the “performance” of high-performing teachers, and that “dismissal threats” worked to increase the “voluntary attrition of low-performing teachers.” The authors, however, like those of the Mathematica study, assert highly troublesome claims despite a plethora of problems, problems that, had this study undergone peer review before it was released to the public and hailed in the media, would likely have prevented the hype that ensued. Hence, it is appropriate to issue yet another consumer alert.

The most major problems include, but are not limited to, the following:

“Teacher Performance:” Probably the largest fatal flaw, and the study’s most serious limitation, was that only 17% of the teachers included in this study (i.e., teachers of reading and mathematics in grades 4 through 8) were actually evaluated under the IMPACT system for their “teacher performance,” or for that which they contributed to the system’s most valued indicator: student achievement. The other 83% of teachers did not have student test scores available to determine whether they were indeed effective (or not) using individual value-added scores. It is implied throughout the paper, as well as in the media reports covering the study after its release, that “teacher performance” was what was investigated, when in fact, for four out of five DC teachers, “performance” was evaluated only in terms of what they were observed doing or self-reported doing. These teachers were instead evaluated on their “performance” using, almost exclusively (except for the 5% school-level value-added indicator), the same subjective measures integral to many traditional evaluation systems, as well as student achievement/growth on teacher-developed and administrator-approved classroom-based tests.

Score Manipulation and Inflation: Relatedly, another major limitation was that the indicators used to define and observe changes in “teacher performance” (for the 83% of DC teachers) were almost entirely highly subjective, highly manipulable, and highly volatile. These socially constructed indicators were undoubtedly subject to score bias via manipulation and artificial inflation, as teachers (and their evaluators) were able to influence their own ratings. While evidence of this was provided in the study, the authors banally dismissed the possibility as “theoretically [not really] reasonable.” When using tests, and especially subjective indicators, to measure “teacher performance,” one must exercise caution to ensure that those being measured do not engage in manipulation and inflation techniques known to effectively increase the scores derived and valued, particularly within such high-stakes accountability systems. Again, for 83% of the teachers, the “teacher performance” indicators were almost entirely manipulable (with the exception of the school-level value-added indicator, weighted at 5%).

Unrestrained Bias: Relatedly, the authors set forth a series of assumptions throughout their study that would have permitted readers to correctly predict the study’s findings without reading it. This, too, is highly problematic, and it would not have been permitted had the scientific community been involved. Researcher bias can certainly impact (or sway) study findings, and this most certainly happened here.

Other problems include gross overstatements (e.g., about how the IMPACT system has evidenced itself as financially sound and sustainable over time), dismissals of highly complex technical issues (e.g., classification errors and the arbitrary thresholds the authors used to statistically define and examine whether teachers “jumped” thresholds and became more effective), over-simplistic treatments of major methodological and pragmatic issues (e.g., cheating in DC Public Schools and whether this impacted the outcome “teacher performance” data), and the like.

To read the full critique of the NBER study, click here.

The claims the authors have asserted in this study are disconcerting, at best. I wouldn’t be as worried if I knew that this paper truly were in a “working” state and still had to undergo peer review before being released to the public. Unfortunately, it’s too late for that, as NBER irresponsibly released the report without such concern. Now we, as the public, are responsible for consuming this study with critical caution and for advocating that our peers and politicians do the same.

NY Teacher of the Year Not “Highly Effective” at “Adding Value”

Diane Ravitch recently posted “New York’s Teacher of the Year Is Not Rated ‘Highly Effective’” on her blog. In her post she writes about Kathleen Ferguson, the state of New York’s Teacher of the Year, who has also received numerous excellence-in-teaching awards, including her district’s teacher of the year award.

Ms. Ferguson, despite the clear consensus that she is an excellent teacher, was not able to evidence herself as even “highly effective” on her evaluation, largely because she teaches a disproportionate number of special education students in her classroom. She recently testified before a Senate Education Committee, on her own and other teachers’ behalf, about these rankings, their (over)reliance on student test scores to define “effectiveness,” and the Common Core. To see the original report and to hear a short piece of her testimony, please click here. “This system [simply] does not make sense,” Ferguson said.

In terms of VAMs, this presents evidence of issues with what is called “concurrent-related evidence of validity.” When gathering concurrent-related evidence of validity (see the full definition on the Glossary page of this site), it is necessary to assess, for example, whether teachers who post large or small value-added gains or losses over time are the same teachers deemed effective or ineffective, respectively, over the same period of time using other independent quantitative and qualitative measures of teacher effectiveness (e.g., external teaching excellence awards, as in the case here). If the evidence points in the same direction, concurrent validity is supported, and this adds to the overall argument in support of “validity.” Conversely, if the evidence does not point in the same direction, concurrent validity is not supported, which further limits the overall argument in support of “validity.”

Only when similar sources of evidence support similar inferences and conclusions can our confidence in the sources as independent measures of the same construct (i.e., teaching effectiveness) increase.
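
As a purely hypothetical illustration of what gathering this kind of evidence can look like in practice, here is a minimal sketch in Python, with invented numbers and invented variable names throughout; it simply checks how strongly value-added scores agree with an independent measure of teacher effectiveness (e.g., rigorous observational ratings), which is the crux of a concurrent validity check.

# A minimal, hypothetical sketch of checking concurrent-related evidence of
# validity: do value-added scores and an independent measure of teacher
# effectiveness point in the same direction? All data below are invented.

import numpy as np

rng = np.random.default_rng(0)

n_teachers = 100
# Hypothetical independent ratings (e.g., averaged observational scores, 1-5).
independent_ratings = rng.uniform(1, 5, n_teachers)
# Hypothetical value-added estimates that only weakly track those ratings.
value_added = 0.3 * (independent_ratings - 3) + rng.normal(0, 1, n_teachers)

# If both measures captured the same construct (teaching effectiveness),
# this correlation would be strong; a weak or near-zero correlation would
# count as evidence against concurrent validity.
r = np.corrcoef(independent_ratings, value_added)[0, 1]
print(f"Correlation between independent ratings and value-added: {r:.2f}")

# A coarser check: how often do the two measures agree on who falls in the
# top fifth of the distribution?
top_va = value_added >= np.quantile(value_added, 0.8)
top_ratings = independent_ratings >= np.quantile(independent_ratings, 0.8)
agreement = np.mean(top_va == top_ratings)
print(f"Agreement on top-quintile classification: {agreement:.0%}")

In Ms. Ferguson’s case, and in the Houston cases described below, it is precisely this kind of agreement that fails to materialize.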

This is also highlighted in research I have conducted in Houston. Please watch this video (about 12 minutes) to investigate and understand “the validity issue” further, through the eyes of four teachers who had similar stories but were terminated due to their value-added scores, which, for the most part, contradicted other indicators of their effectiveness as teachers. You can also read the original study here.

http://www.youtube.com/watch?v=18R2af2CKk8

Florida Newspaper Following “Suit”

On Thursday (November 14th), I wrote about what is happening in the courts of Los Angeles regarding the LA Times’ controversial open public records request soliciting the student test scores of all Los Angeles Unified School District (LAUSD) teachers (see LA Times up to the Same Old Antics). Today I write about the same thing happening in the courts of Florida, as per The Florida Times-Union’s suit against the state (see Florida teacher value-added data is public record, appeal court rules).

As the headline reads, “Florida’s controversial value-added teacher data are [to be] public record” and released to The Times-Union, for the same reasons and purposes they are being released, again, to the LA Times. These (in many ways right) reasons include: (1) the data should not be exempt from public inspection; (2) these records, because they are public, should be open for public consumption; (3) parents, as members of the public and direct consumers of these data, should be granted access; and the list goes on.

What is in too many ways wrong, however, is that while the court wrote that the data are “only one part of a larger spectrum of criteria by which a public school teacher is evaluated,” the data will be consumed by a highly assuming, highly unaware, and highly uninformed public as the only source of data that count.

Worse, because the data are derived via complex analyses of “objective” test scores that yield (purportedly) hard numbers from which teacher-level value-added can be calculated using even more complex statistics, the public will trust the statisticians behind the scenes, because they are smart, and they will consume the numbers as true and valid because smart people constructed them.

The key here, though, is that they are, in fact, constructed. In every step of the process of constructing the value-added data, there are major issues that arise and major decisions that are made. Both cause major imperfections in the data that will in the end come out clean (i.e., as numbers), even though they are still super dirty on the inside.

As written by The Times-Union reporter, “Value-added is the difference between the learning growth a student makes in a teacher’s class and the statistically predicted learning growth the student should have earned based on previous performance.” It is just not as simple and as straightforward as that.
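
For readers who want to see what that “statistically predicted learning growth” usually amounts to mechanically, here is a deliberately stripped-down sketch in Python, with simulated data and hypothetical names; it predicts each student’s current score from the prior year’s score alone and treats the average leftover residual per classroom as that teacher’s “value added.” Real state models layer many more controls and assumptions on top of this, which is exactly where the complications listed below begin.

# A deliberately naive value-added sketch: predict current scores from prior
# scores, then call the average leftover residual per classroom that teacher's
# "value added." All data here are simulated for illustration only.

import numpy as np

rng = np.random.default_rng(42)

n_students = 500
teacher_ids = rng.integers(0, 20, n_students)           # 20 hypothetical teachers
prior_scores = rng.normal(500, 50, n_students)           # last year's test scores
# Simulated current scores: mostly driven by prior scores plus noise.
current_scores = 20 + 0.98 * prior_scores + rng.normal(0, 25, n_students)

# Step 1: the "statistically predicted" current score from the prior score
# (ordinary least squares with a single predictor).
slope, intercept = np.polyfit(prior_scores, current_scores, 1)
predicted = intercept + slope * prior_scores

# Step 2: residual = actual score minus predicted score.
residuals = current_scores - predicted

# Step 3: a teacher's "value added" = mean residual of his or her students.
for t in range(20):
    effect = residuals[teacher_ids == t].mean()
    print(f"Teacher {t:2d}: estimated 'value added' = {effect:+.1f} points")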

Here are just some reasons why:

  1. Large-scale standardized achievement tests offer very narrow measures of what students have achieved, although they are assumed to measure the breadth and depth of student learning covering an entire year;
  2. The 40 to 50 total tested items do not represent the hundreds of more complex items we value more;
  3. Test developers use statistical tools to remove the items that too many students answered correctly, making much of what we value not at all valued on the tests;
  4. Calculating value-added, or growth “upwards” over time, requires that the scales used to measure growth from one year to the next be scales of equal units, but this is (to my knowledge) never the case;
  5. The standard statistical response, otherwise, is to norm the test scores before using them, but this then means that for every winner there must be a loser, or in this case that as some teachers get better, other teachers must get worse, which does not reflect reality;
  6. Value-added estimates do not indicate whether a teacher is good or highly effective, as is also often assumed, but rather whether a teacher is purportedly better or worse than other teachers to whom they are compared, teachers who teach entirely different sets of students who are not randomly assigned to their classrooms but for whom they are to demonstrate “similar” levels of growth;
  7. Teachers assigned more “difficult-to-teach” students are then held accountable for demonstrating similar growth regardless of the numbers of, for example, English Language Learners (ELLs) or special education students in their classes (although statisticians argue they can “control for this,” despite recent research evidence);
  8. This becomes even more of a biasing issue because statisticians cannot effectively control for what happens (or does not happen) over the summers: in every current state-level value-added system these tests are administered annually, from the spring of year X to the spring of year Y, always capturing the summer months in the post-test scores and biasing scores dramatically in one way or another based on that which is entirely outside the control of the school;
  9. This forces the value-added statisticians to statistically assume that summer learning growth and decay matter the same for all students, despite the research-based fact that different types of students lose or gain variable levels of knowledge over the summer months; and
  10. The list goes on into (11), (12), and so forth.

But the numbers do not reflect any of this, now do they?

LA Times up to the Same Old Antics

The Los Angeles Times has made it into its own news, again, for its controversial open public records request soliciting the student test scores of all Los Angeles Unified School District (LAUSD) teachers. The paper has done this repeatedly since 2011, when it first hired external contractors to calculate LAUSD teachers’ value-added scores so that it could publish those scores on its Los Angeles Teacher Ratings website. It has also done this, repeatedly since 2011, despite the major research-based issues (see, for example, a National Education Policy Center (NEPC) analysis that followed; see also a recent/timely post about this on another blog at techcrunch.com).

This time, the Los Angeles Times is charging that the district opposes all of this because it wants to keep the data “secret.” How about being opposed because doing this is just plain wrong?!

This is an ongoing source of frustration for me, since the authors of the initial articles and the creators of the searchable website (Jason Felch and Jason Song) contacted me back in 2011 regarding whether what they were doing was appropriate, valid, and fair. Despite my strong warnings, research-based apprehensions, and cautionary notes, Felch and Song thanked me for my time and moved forward.

It became apparent to me soon thereafter that I was merely a pawn in their media game, there so that they could ultimately assert: “The Times shared [what they did] with leading researchers, including skeptics of value added methods, and incorporated their input.” They did not incorporate any of my input, not one bit, nor, as far as I could tell, any of the input they might have garnered from the other skeptics of value-added methods with whom they claimed to have spoken. The claim nonetheless became particularly handy when they had to defend themselves and their actions once the Los Angeles Times’ Ethics and Standards board stepped in on their behalf.

While I wrote words of protest in response to their “Ethics and Standards” post back then, the truth of the matter was, and still is, that what they continue to do at the Los Angeles Times has literally nothing to do with the current VAM-based research. It is “research” only if one defines research as a far-removed action involving far-removed statisticians crunching numbers as part of a monetary contract.

New Tests Nothing but New

Across the country, states are bidding farewell to the tests they adopted and implemented as part of No Child Left Behind (NCLB) in 2002 while at the same time “welcoming” (with apprehension and fear) a series of new and improved tests meant to, once again, increase standards and measure the new (and purportedly improved) Common Core State Standards.

I say “once again” in that we have witnessed policy trends similar to these for more than 30 years now—federally-backed policies that are hyper-reliant on increasing standards and holding educators and students accountable for meeting the higher standards with new and improved tests. Yet, while these perpetual policies are forever meant to improve and reform America’s “failing” public schools, we still have the same concerns about the same “failing” public schools despite 30 years of the same/similar attempts to reform them.

The bummer that comes along with being an academic who has dedicated her scholarly life to conducting research in this area is that it can sometimes create a sense of cynicism, or realistic skepticism, that is based on nothing more than history. Studying over three decades of similar policies (i.e., educational policies based on utopian ideals that continuously promote new and improved standards along with new and improved tests to ensure that the higher standards are met) does a realistic skeptic make!

Mark my words! Here is how history is to, once again, repeat itself:

  1. States’ test scores are going to plummet across the nation when students in America’s public schools first take the new Common Core tests (e.g., this has already happened in Kentucky, New York, and North Carolina, the first states to administer the new tests);
  2. Many (uninformed) policymakers and members of the media are going to point their fingers at America’s public schools for not taking the past 30 years of repeated test-based reforms seriously enough, blaming the teachers, not the new tests, for why students score so low;
  3. The same (uninformed) policymakers and members of the media will say things like, “finally, those in America’s public schools are going to blatantly see everything that is wrong with what they are (or are not) doing and will finally start taking more seriously their charge to teach students how to achieve higher standards with these better tests in place” – see, for example, U.S. Secretary of Education Arne Duncan’s recent comments informing U.S. citizens that the low scores should be seen as a new baseline that will incentivize, in and of itself, school reform and improvement;
  4. Test scores will then miraculously begin to rise year-after-year, albeit for only a few years, after which the same (uninformed) policymakers and members of the media will attribute the observed increases in growth to the teachers finally taking things seriously, all the while ignoring (i.e., being uninformed) that the new and improved tests will be changing, behind the scenes, at the same time (e.g., the “cut-scores” defining proficiency will change and the most difficult test items that are always removed after new tests are implemented will be removed, all of which will cause scores to “artificially” increase regardless of what students might be learning);
  5. Educators will help out with this as they too know very well (given over 30 years of experience and training) how to help “artificially inflate” their scores, thinking much of the time that test score boosting practices (e.g., teaching to the test, narrowing the curriculum to focus excessively on the tested content) are generally in students’ best interests; but then…
  6. America’s national test scores (i.e., on the National Assessment of Educational Progress [NAEP]) and America’s international test scores (i.e., on the Trends in International Mathematics and Science Study [TIMSS], Progress in International Reading Literacy Study [PIRLS], and Programme for International Student Assessment [PISA]) won’t change much…what!?!?…then;
  7. Another round of panic will set in, fingers will point at America’s public schools, yet again, and we will, yet again (though hopefully not by the grace of more visionary educational policymakers) look to even higher standards and better tests to adopt, implement, and repeat, from the beginning – see #1 above.

While I am a gambling woman, and I would bet my savings on the historically rooted predictions above should anybody want to take that bet, I would walk away from this one altogether: in the state of Arizona alone, this “initiative” is set to cost taxpayers more than $22.5 million a year, $9 million more than the state is currently paying for its NCLB testing program (i.e., the Arizona Instrument to Measure Standards [AIMS]). Now that is a losing bet, based on nothing more than a gambler’s fallacy. The notion that this is going to work, this time, is false!

Mathematica Study – Value-Added & Teacher Experience

A study released yesterday by Mathematica Policy Research (and sponsored by the U.S. Department of Education) titled “Teachers with High ‘Value Added’ Can Boost Test Scores in Low-Performing Schools” implies that, yet again, value-added estimates are the key statistical indicators we as a nation should be using, above all else, to make pragmatic and policy decisions about America’s public school teachers. In this case, the argument is that value-added estimates can and should be used to make decisions about where to position high value-added teachers so that they might have greater effects, as well as greater potentials to “add” more “value” to student learning and achievement over time.

While most if not all educational researchers agree with the fundamental idea behind this research study — that it is important to explore ways to improve America’s lowest achieving schools by providing students in such schools increased access to better teachers — this study overstates its “value-added” results.

Hence, I have to issue a Consumer Alert! While I don’t recommend reading the whole technical report (at over 200 single-spaced pages), the main issues with this piece, again not in terms of its overall purpose but in terms of its value-added felicitations, follow:

  1. The high value-added teachers who were selected to participate in this study, and to transfer into high-needs schools to teach for two years, were disproportionately National Board Certified Teachers and teachers with more years of teaching experience. The finding that these teachers, selected only because they were high value-added teachers, outperformed their comparison peers was confounded by the very fact that they were compared to “similar” teachers in the high-needs schools, many of whom were not certified as exemplary teachers and many of whom (20%) were new teachers…as in, entirely new to the teaching profession! While the high value-added teachers who chose to teach in higher-needs schools for two years (with $20,000 bonuses to boot) were likely wonderful teachers in their own right, the same study results would likely have been achieved by simply choosing teachers with more than X years of experience or choosing teachers whose supervisors selected them as “the best.” Hence, this study was not about using “value-added” as the arbiter of all that is good and objective in measuring teacher effects; it was about selecting teachers who were distinctly different from the teachers to whom they were compared and attributing the predictable results back to the “value-added” selections that were made.
  2. Relatedly, many of the politicians and policymakers who are advancing national and state value-added initiatives and policies continuously use sets of false assumptions about teacher experience, teacher credentials, and how/why these things do not matter in order to advance their agendas. In this study, however, it seems that teacher experience and credentials mattered the most. Results from this study, hence, contradict initiatives, for example, to get rid of salary schedules that rely on years of experience and credentials, as value-added scores, as evidenced in this study, do seem to capture these other variables (i.e., experience and credentials) as well.
  3. Another argument could be made against the millions of dollars in taxpayer-generated funds that our politicians are pouring into these initiatives, as well as into federally funded studies like this one. For years, we have known that America’s best teachers are disproportionately located in America’s best schools. While this, too, was one of this study’s “new” findings, it is in no way “new” to the research literature in this area. Nor has it only recently become a “growing concern” (see, for example, an article that I wrote in Phi Delta Kappan back in 2007 that helps to explore this issue’s historical roots). In addition, we have about 20 years of research examining teacher transfer initiatives like the one studied here, but such initiatives are so costly that they are rarely if ever sustainable, and they typically fizzle out in due time. This is yet another instance in which the U.S. Department of Education took an ahistorical approach and funded a research study with a question to which we already knew the answer: “transferring teacher talent” works to improve schools, but not with real-world education funds. It would be fabulous should significant investments be made in this area, though!
  4. Interesting to add, as well, is that “[s]pearheading” this study were staff from The New Teacher Project (TNTP), the group that authored the now-famous The Widget Effect study. This group is also famously known for advancing false notions that teachers matter so much that they can literally wipe out all of the social issues that continue to ail so many of America’s public schools (e.g., poverty). This study’s ahistorical and overstated results are a bit less surprising taking this into consideration.
  5. Finally, call me a research purist, but another Consumer Alert! should automatically ding whenever “random assignment,” “random sampling,” “a randomized design,” or a like term (e.g., “a multisite randomized experiment,” as was used here) appears, especially in a study’s marketing materials. In this study, such terms are exploited when in fact “treatment” teachers were selected due to high value-added scores (that were not calculated consistently across participating districts), only 35 to 65 percent of teachers had enough value-added data to be eligible for selection, teachers were surveyed to determine if they wanted to participate (self-selection bias), teachers were approved to participate by their supervisors, and teachers were then interviewed by the new principals for whom they were to work, as if they were actually up for hire. To make this truly random, principals would have had to agree to whatever placements they got, and all high value-added teachers would have had to be randomly selected and then randomly placed, with equal probabilities in their placements, regardless of the actors and human agents involved. On the flip side, the “control” teachers to whom the “treatment” (i.e., high value-added) teachers were compared should also have been randomly selected from a pool of control applicants who were “similar” to the other teachers in the control schools. To get more valid results, the control-group teachers should have better represented “the typical teachers” at the control schools (i.e., not brand-new teachers entering the field). “[N]ormal hiring practices” should not have been followed if study results were to answer whether expert teachers would have greater effects on student achievement than other teachers in high-needs schools. The research question answered, instead, was whether high value-added teachers, a statistically significant proportion of whom were National Board Certified and who had about four more years of teaching experience on average, would have greater effects on student achievement than other newly hired teachers, many of whom (20%) were brand new to the profession.

VAMs v. Student Growth Percentiles (SGPs)

Yesterday (11/4/2013) a reader posted a question, the first part of which I am partially addressing here: “What is the difference [between] VAM[s] and Student Growth Percentiles (SGP[s]) and do SGPs have any usefulness[?]” One of my colleagues, soon to be a “Guest Blogger” on TheTeam, but already an SGP expert, is helping me work on a more nuanced response, but for the time-being please check out the Glossary section of this blog:

VAMs v. Student Growth Models: The main similarities between VAMs and student growth models are that they all use students’ large-scale standardized test score data from current and prior years to calculate students’ growth in achievement over time. In addition, they all use students’ prior test score data to “control for” the risk factors that impact student learning and achievement, both at singular points in time and over time. The main differences between VAMs and student growth models lie in how precisely the estimates are made, as related to whether, how, and how many control variables are included in the statistical models to account for these risk factors and other extraneous variables (e.g., other teachers’ simultaneous and prior teachers’ residual effects). The best and most popular example of a student growth model is the Student Growth Percentiles (SGP) model. It is not a VAM by traditional standards and definitions, mainly because the SGP model does not use as many sophisticated controls as do its VAM counterparts.
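
As a rough, hypothetical illustration of the distinction (simulated data, invented names), the sketch below computes an SGP-style conditional growth percentile by ranking each student’s current score only against “academic peers” with similar prior scores; a VAM would instead estimate each teacher’s effect as an average regression residual after layering on additional controls. Operational SGP models use quantile regression rather than the simple binning shown here, and teacher- or school-level summaries are then typically reported as the median of students’ growth percentiles.

# A minimal, hypothetical SGP-style calculation: a student's growth percentile
# is his or her current-score percentile among "academic peers," i.e., students
# with similar prior scores. Simulated data; real SGP models use quantile
# regression rather than simple binning.

import numpy as np

rng = np.random.default_rng(7)

n_students = 2000
prior = rng.normal(500, 50, n_students)
current = 15 + 0.95 * prior + rng.normal(0, 30, n_students)

def growth_percentile(i, band=10.0):
    """Percentile of student i's current score among students whose prior
    scores fall within +/- band points of student i's prior score."""
    peers = np.abs(prior - prior[i]) <= band
    peer_scores = current[peers]
    return 100.0 * np.mean(peer_scores <= current[i])

# Example: SGPs for the first five simulated students.
for i in range(5):
    print(f"Student {i}: prior={prior[i]:.0f}, current={current[i]:.0f}, "
          f"SGP={growth_percentile(i):.0f}")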

See more forthcoming…

Thanks Diane for the “Shout-Out!”

Thanks to Diane Ravitch for the “shout-out” she blasted via her blog yesterday afternoon. We at VAMboozled! really appreciate it and are pleased to report that, given her endorsement, we received 1,000 new visitors in approximately one day! There are scholarly journals out there that are lucky to get 1,000 readers of a volume, and we hit that mark in about one day!! Thanks again!!

To read her post from yesterday, see: VAMboozled! Welcome to the Blogosphere, Audrey!

How Might a Test Measure Teachers’ Causal Effects?

A reader wrote a very good question (see the VAMmunition post) that I feel is worth “sharing out,” with a short but (hopefully) informative answer that will help others better understand some of “the issues.”

(S)he wrote: “[W]hat exactly would a test look like if it were, indeed, ‘designed to estimate teachers’ causal effects’? Moreover, how different would it be from today’s tests?”

Here is (most of) my response: While large-scale standardized tests are typically limited in both the number and types of items included, among other things, one could use a similar test with more items, and more “instructionally sensitive” items, to better capture a teacher’s causal effects, quite simply actually. This would be done with the pre- and post-tests occurring in the same year, while students are being instructed by the same (albeit not only…) teacher. However, this does not happen in any value-added system at this point, as these tests are given once per year (typically spring to spring). Hence, student growth scores include prior and other teachers’ effects, as well as the differential learning gains/losses that occur over the summers, during which students have little to no interaction with formal education systems or their teachers. This “biases” these measures of growth, big time!
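
To make that last point concrete, here is a hedged, toy simulation in Python (hypothetical numbers throughout): every teacher is given an identical true within-year effect, but students lose different amounts over the summer. Fall-to-spring gains recover the teachers’ identical effects, while spring-to-spring gains, the only kind current systems actually measure, spread the summer differences across teachers as if the teachers caused them.

# Toy simulation of summer-loss bias: every teacher produces the same true
# within-year growth, but classrooms differ in how much their students lose
# over the summer. Spring-to-spring "growth" then varies across classrooms
# for reasons that have nothing to do with the teachers. Hypothetical numbers.

import numpy as np

rng = np.random.default_rng(1)

n_teachers, class_size = 10, 30
true_teacher_effect = 20.0                       # identical for all teachers

for t in range(n_teachers):
    prior_spring = rng.normal(500, 40, class_size)
    # Classrooms differ in average summer loss (e.g., by student demographics).
    summer_loss = rng.normal(10 + 3 * t, 5, class_size)
    fall = prior_spring - summer_loss            # score at the start of the year
    spring = fall + true_teacher_effect + rng.normal(0, 5, class_size)

    fall_to_spring = np.mean(spring - fall)             # captures the teacher's effect
    spring_to_spring = np.mean(spring - prior_spring)   # folds in summer loss
    print(f"Teacher {t}: fall-to-spring gain = {fall_to_spring:5.1f}, "
          f"spring-to-spring gain = {spring_to_spring:5.1f}")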

The other necessary condition for doing this would be random assignment. If students were randomly assigned to classrooms (and teachers were randomly assigned to classrooms), this would help to make sure that indeed all students are similar at the outset, before what we might term the “treatment” (i.e., how effectively a teacher teaches for X amount of time). However, again, this rarely if ever happens in practice, as administrators and teachers (rightfully) see random assignment practices as great for experimental research purposes but bad for students and their learning! Regardless, some statisticians suggest that their sophisticated controls can “account” for non-random assignment practices, yet again the evidence suggests that no matter how sophisticated the controls are, they simply do not work here either.

See, for example, the Hermann et al. (2013), Newton et al. (2010), and Rothstein (2009, 2010) citations here, in this blog, under the “VAM Readings” link. I also have an article coming out about this this month, co-authored with one of my doctoral students, in a highly esteemed peer-reviewed journal. Here is the reference if you want to keep an eye out for it; these references should (hopefully) explain all of this with greater depth and clarity: Paufler, N. A., & Amrein-Beardsley, A. (2013, October). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal. doi: 10.3102/0002831213508299