EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video out, and now, following up on another post about the potential impact these videos should really have, I thought this an appropriate time to share. “We can be the change,” and social media can help.

My former doctoral student and I put together this video, after conducting a study with teachers in the Houston Independent School District and more specifically four teachers whose contracts were not renewed due in large part to their EVAAS scores in the summer of 2011. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools, post the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.

Splits, Rotations, and Other Consequences of Teaching in a High-Stakes Environment in an Urban School

An Arizona teacher who teaches in a very urban, high-needs school writes about the realities of teaching in her school, under the pressures that come along with high-stakes accountability and a teacher workforce working under an administration, both of which are operating in chaos. This is a must read, as she also talks about two unintended consequences of educational reform in her school of which I had never heard before: splits and rotations. Both seem to occur at all costs simply to stay afloat during “rough” times, but both also likely have deleterious effects on students in such schools, as well as teachers being held accountable for the students “they” teach.

She writes:

Last academic year (2012-2013) a new system for evaluating teachers was introduced into my school district. And it was rough. Teachers were dropping like flies. Some were stressed to the point of requiring medical leave. Others were labeled ineffective based on a couple classroom observations and were asked to leave. By mid-year, the school was down five teachers. And there were a handful of others who felt it was just a matter of time before they were labeled ineffective and asked to leave, too.

The situation became even worse when the long-term substitutes who had been brought in to cover those teacher-less classrooms began to leave also. Those students with no contracted teacher and no substitute began getting “split”. “Splitting” is what the administration of a school does in a desperate effort to put kids somewhere. And where the students go doesn’t seem to matter. A class roster is printed, and the first five students on the roster go to teacher A. The second five students go to teacher B, and so on. Grade-level isn’t even much of a consideration. Fourth graders get split to fifth grade classrooms. Sixth graders get split to 5th and 7th grade classrooms. And yes, even 7th and 8th graders get split to 5th grade classrooms. Was it difficult to have another five students in my class? Yes. Was it made more difficult that they weren’t even of the same grade level I was teaching? Yes. This went on for weeks…

And then the situation became even worse. As it became more apparent that the revolving door of long-term substitutes was out of control, the administration began “The Rotation.” “The Rotation” was a plan that used the contracted teachers (who remained!) as substitutes in those teacher-less classrooms. And so once or twice a week, I (and others) would get an email from the administration alerting me that it was my turn to substitute during prep time. Was it difficult to sacrifice 20-40% of weekly prep time (that is used to do essential work like plan lessons, gather materials, grade, call parents, etc.)? Yes. Was it difficult to teach in a classroom that had a different teacher, literally, every hour without coordinated lessons? Yes.

Despite this absurd scenario, in October 2013, I received a letter from my school district indicating how I fared in this inaugural year of the teacher evaluation system. It wasn’t good. Fifty percent of my performance label was based on school test scores (not on the test scores of my homeroom students). How well can students perform on tests when they don’t have a consistent teacher?

So when I think about accountability, I wonder now what it is I was actually held accountable for? An ailing, urban school? An ineffective leadership team who couldn’t keep a workforce together? Or was I just held accountable for not walking away from a no-win situation?

Coincidentally, this 2013-2014 academic year has, in many ways, mirrored the 2012-2013. The upside is that this year, only 10% of my evaluation is based on school-wide test scores (the other 40% will be my homeroom students’ test scores). This year, I have a fighting chance to receive a good label. One more year of an unfavorable performance label and the district will have to, by law, do something about me. Ironically, if it comes to that point, the district can replace me with a long-term substitute, who is not subject to the same evaluation system that I am. Moreover, that long-term substitute doesn’t have to hold a teaching certificate. Further, that long-term substitute will cost the district a lot less money in benefits (i.e. healthcare, retirement system contributions).

I should probably start looking for a job—maybe as a long-term substitute.

Same Model (EVAAS), Different State (Ohio), Same Problems (Bias and Data Errors)

Following up on my most recent post about “School-Level Bias in the PVAAS Model in Pennsylvania,” the same seems to be a problem in Ohio, a state that also uses “the best” and “most sophisticated” VAM (i.e., a version of the Education Value-Added Assessment System [EVAAS]; for more information click here). This is as per an older (2013) article, “Teachers’ ‘Value-Added’ Ratings and [their] Relationship to Student Income Levels [being] Questioned,” that was just sent to me following my prior post.

The key finding? Ohio’s “2011-12 value-added results show that districts, schools and teachers with large numbers of poor students tend to have lower value-added results than those that serve more-affluent ones.” Such output continues to evidence how using VAMs may not be “the great equalizer” after all. VAMs might not be the “true” measures they are assumed (and marketed/sold) to be.

Here are the state’s stats, as highlighted in this piece and taken directly from the Plain Dealer/StateImpact Ohio analysis:

  • Value-added scores were 2½ times higher on average for districts where the median family income is above $35,000 than for districts with income below that amount.
  • For low-poverty school districts, two-thirds had positive value-added scores — scores indicating students made more than a year’s worth of progress.
  • For high-poverty school districts, two-thirds had negative value-added scores — scores indicating that students made less than a year’s progress.
  • Almost 40 percent of low-poverty schools scored “Above” the state’s value-added target, compared with 20 percent of high-poverty schools.
  • At the same time, 25 percent of high-poverty schools scored “Below” state value-added targets while low-poverty schools were half as likely to score “Below.”

One issue likely causing this level of bias is that student background factors are not directly accounted for in the EVAAS model. Rather, the EVAAS uses students’ prior test scores to control for these factors: “By including all of a student’s testing history, each student serves as his or her own control,” as per the analytics company – SAS – that now sells the model and runs the model calculations. As per a document available on the SAS website: “To the extent that SES/DEM [socioeconomic/demographic] influences persist over time, these influences are already represented in the student’s data. This negates the need for SES/DEM adjustment.”

Within the same document, SAS authors write “that adjusting value-added for the race or poverty level of a student adds the dangerous assumption that those students will not perform well. It would also create separate standards for measuring students who have the same history on scores on previous tests. And it says that adjusting for these issues can also hide whether students in poor areas are receiving worse instruction. ‘We recommend that these adjustments not be made,’ the brief reads. ‘Not only are they largely unnecessary, but they may be harmful.’”

See also another article, also from Ohio, about how a value-added “Glitch Cause[d the] State to Pull Back Teacher ‘Value Added’ Student Growth Scores.” From the intro, “An error by contractor SAS Institute Inc. forced the state to withdraw some key teacher performance measurements that it had posted online for teachers to review. The state’s decision to take down the value added scores for teachers across the state on Wednesday has some educators questioning the future reliability of the scores and other state data…” (click here to read more).

Houston, We Have A Problem: New Research Published about the EVAAS

New VAM research was recently published in the peer-reviewed Education Policy Analysis Archives journal, titled “Houston, We Have a Problem: Teachers Find No Value in the SAS Education Value-Added Assessment System (EVAAS®).” This article was published by a former doctoral student of mine, Clarin Collins, now a researcher at a large non-profit. I asked her to write a guest post for you all summarizing the full study (linked again here). Here is what she wrote.

As someone who works in the field of philanthropy, completed a doctoral program more than two years ago, and recently became a new mom, you might question why I worked on an academic publication and am writing about it here as a guest blogger? My motivation is simple: the teachers. Teachers continue to be at the crux of the national education reform efforts as they are blamed for the nation’s failing education system and student academic struggles. National and state legislation has been created and implemented as believed remedies to “fix” this problem by holding teachers accountable for student progress as measured by achievement gains.

While countless researchers have highlighted the faults of teacher accountability systems and growth models (unfortunately to fall on the deaf ears of those mandating such policies), very rarely are teachers asked how such policies play out in practice, or for their opinions, as representing their voices in all of this. The goal of this research, therefore, was first, to see how one such teacher evaluation policy is playing out in practice and second, to give voice to marginalized teachers, those who are at the forefront of these new policy initiatives. That being said, while I encourage you to check out the full article [linked again here], I highlight key findings in this summary, using the words of teachers as often as possible to permit them, really, to speak for themselves.

In this study I examined the SAS Education Value-Added Assessment System (EVAAS) in practice, as perceived and experienced by teachers in the Southwest School District (SSD). SSD [a pseudonym] is using EVAAS for high-stakes consequences more than any other district or state in the country. I used a mixed-method design including a large-scale electronic survey to investigate the model’s reliability and validity; to determine whether teachers used the EVAAS data in formative ways as intended; to gather teachers’ opinions on EVAAS’s claimed benefits and statements; and to understand the unintended consequences that might have also occurred as a result of EVAAS use in SSD.

Results revealed that the reliability of the EVAAS model produced split and inconsistent results among teacher participants regardless of subject or grade-level taught. As one teacher stated, “In three years, I was above average, below average and average.” Teachers indicated that it was the students and their varying background demographics who biased their EVAAS results, and much that was demonstrated via their scores was beyond the control of teachers. “[EVAAS] depends a lot on home support, background knowledge, current family situation, lack of sleep, whether parents are at home, in jail, etc. [There are t]oo many outside factors – behavior issues, etc.” that apparently are not controlled or accounted for in the model.

Teachers reported dissimilar EVAAS and principal observation scores, reducing the criterion-related validity of both measures of teacher quality. Some even reported that principals changed their observation scores to match their EVAAS scores; “One principal told me one year that even though I had high [state standardized test] scores and high Stanford [test] scores, the fact that my EVAAS scores showed no growth, it would look bad to the superintendent.” Added another teacher, “I had high appraisals but low EVAAS, so they had to change the appraisals to match lower EVAAS scores.”

The majority of teachers disagreed with SAS’s marketing claims such as EVAAS reports are easy to use to improve instruction, and EVAAS will ensure growth opportunities for all students. Teachers called the reports “vague” and “unclear” and were “not quite sure how to interpret” and use the data to inform their instruction. As one teacher explained, she looked at her EVAAS report “only to guess as to what to do for the next group in my class.”

Many unintended consequences associated with the high-stakes use of EVAAS emerged through teachers’ responses, which revealed among others that teachers felt heightened pressure and competition, which they believed reduced morale and collaboration, and encouraged cheating or teaching to the test in attempt to raise EVAAS scores. Teachers made comments such as, “To gain the highest EVAAS score, drill and kill and memorization yields the best results, as does teaching to the test,” and “When I figured out how to teach to the test, the scores went up,” as well as, “EVAAS leaves room for me to teach to the test and appear successful.”

Teachers realized this emphasis on test scores was detrimental for students, as one teacher wrote, “As a result of the emphasis on EVAAS, we teach less math, not more. Too much drill and kill and too little understanding [for the] love of math… Raising a generation of children under these circumstances seems best suited for a country of followers, not inventors, not world leaders.”

Teachers also admitted they are not collaborating to share best practices as much anymore: “Since the inception of the EVAAS system, teachers have become even more distrustful of each other because they are afraid that someone might steal a good teaching method or materials from them and in turn earn more bonus money. This is not conducive to having a good work environment, and it actually is detrimental to students because teachers are not willing to share ideas or materials that might help increase student learning and achievement.”

While I realize this body of work could simply add to “the shelves” along with those findings of other researchers striving to deflate and demystify this latest round of education reform, if nothing else, I hope the teachers who participated in this study know I am determined to let their true experiences, perceptions of their experiences, and voices be heard.


Again, to find out more information including the statistics in support of the above assertions and findings, please click here to read the full study.

Five Nashville Teachers Face Termination

Three months ago, Tennessee Schools Director Jesse Register announced he was to fire 63 Tennessean teachers, of 195 total who for two consecutive years scored lowest (i.e., a 1 on a scale of 1 to 5) in terms of their overall “value-added” (as based on 35% EVAAS, 15% related “student achievement,” and 50% observational data). These 63 were the ones who three months ago were still “currently employed,” given the other 132 apparently left voluntarily or retired (which is a nice-sized cost-savings, so let’s be sure not to forget about the economic motivators behind all of this as well). To see a better breakdown of these numbers, click here.
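For readers who want to see how such weights combine, here is a minimal sketch in Python. The 35/15/50 weights come from the breakdown above; treating the overall score as a simple weighted average on the 1-to-5 scale is an assumption for illustration, not a documented formula, and the teacher's component scores are hypothetical.

```python
# Weights as reported above. Combining them as a weighted average is an
# ASSUMPTION made for illustration, not the district's documented formula.
WEIGHTS = {"evaas": 0.35, "achievement": 0.15, "observation": 0.50}


def composite_score(scores):
    """Weighted average of component scores, each on a 1-to-5 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical teacher: lowest EVAAS score, middling observation score.
hypothetical_teacher = {"evaas": 1, "achievement": 2, "observation": 3}
overall = composite_score(hypothetical_teacher)
print(f"Overall composite: {overall:.2f}")
```

Under this assumed formula, observational data carry half the weight, so a teacher with the lowest possible EVAAS score does not necessarily land at an overall 1.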

This was to be the first time in Tennessee that its controversial, and “new and improved” teacher evaluation system would be used to take deliberate action against whom they deemed their “lowest-performing” teachers, as “objectively” identified in the classroom; although, officials at that time did not expect to have a “final number” to be terminated until fall.

Well, fall is here, and it seems this final number is officially five: three middle school teachers, one elementary school teacher, and one high school teacher, all teaching in metro Nashville.

The majority of these teachers come from Neely’s Bend: “one of 14 Nashville schools on the state’s priority list for operating at the bottom 5 percent in performance statewide.” Some of these teachers were evaluated even though the principal who evaluated them is “no longer there.” Another is a computer instructor being terminated based on this school’s overall “school-level value-added.” This is problematic in and of itself given teacher-level and in this case school-level bias seem to go hand in hand with the use of these models, and grossly interfere with accusations that these teachers “caused” low performance (see a recent post about this here).

This is not to say these teachers were not indeed the lowest performing; maybe they were. But I for one would love to talk to these teachers and take a look at their actual data, EVAAS and observational data included. Based on prior experiences working with such individuals, there may be more to this than it seems. Hence, if anybody knows these folks, do let them know I’d like to better understand their stories.

Otherwise, all of this effort to ultimately attempt to terminate five of a total 5,685 certified teachers in the district (0.09%) seems awfully inefficient, and costly, and quite frankly absurd given this is a “new and improved” system meant to be much better than a prior system that likely yielded a similar termination rate, not including, however, those who left voluntarily prior.
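The arithmetic behind these figures is easy to check. The short Python sketch below uses only the numbers reported in this post (195 teachers flagged, 63 still employed, five facing termination, 5,685 certified teachers); the variable names are mine.

```python
# Figures as reported in this post.
flagged = 195             # scored lowest for two consecutive years
still_employed = 63       # of those, still on the payroll three months ago
terminated = 5            # final number facing termination
certified_teachers = 5685

# Teachers who left voluntarily or retired before any action was taken.
left_or_retired = flagged - still_employed

# Terminations as a percentage of all certified teachers in the district.
termination_rate_pct = 100 * terminated / certified_teachers

print(f"Left or retired: {left_or_retired}")
print(f"Termination rate: {termination_rate_pct:.2f}%")  # prints 0.09%
```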

Perhaps an ulterior motive is, indeed, the cost-savings realized given the mere “new and improved” threat.

A Major VAMbarrassment in New Orleans

Tulane University’s Cowen Institute for Education Initiatives, in what is now being called a “high-profile embarrassment,” is apologizing for releasing a high-profile report based on faulty research. Their report, which at one point seemed to have successfully VAMboozled folks representing numerous Louisiana governmental agencies and education non-profits, many of which emerged after Hurricane Katrina, has since been pulled from its once prominent placement on the institute’s website.

It seems that on October 1, 2014 the institute released a report titled “Beating-the-Odds: Academic Performance and Vulnerable Student Populations in New Orleans Public High Schools.” The report was celebrated widely (see, for example, here) in that it “proved” that students in post-Hurricane Katrina (largely charter) schools, despite the disadvantages they faced prior to the school-based reforms triggered by Katrina, were now “beating the odds,” while “posting better test scores and graduation rates than predicted by their populations,” thanks to these reforms. Demographics were no longer predicting students’ educational destinies, and the institute had the VAM-based evidence to prove it.

Institute researchers also “found” that New Orleans charter schools, in which over 80% of all New Orleans students are now educated, were substantively helping students “beat the odds,” accordingly. Let’s just say the leaders of the charter movement were all over this one, and also allegedly involved.

To some, however, the report’s initial findings (the findings that have now been retracted) did not come as much of a surprise. It seems the Cowen Institute and the city’s leading charter school incubator, New Schools for New Orleans, literally share office space; that is, they are literally “in office space” together.

To read more about this, as per the research of Kristen Buras (Associate Professor at Georgia State), click here. Mercedes Schneider, in a recent post she also wrote about this, noted that the “Cowen Institute at Tulane University has been promoting the New Orleans Charter Miracle [emphasis added]” and consistently trying “to sell the ‘transformed’ post-Katrina education system in New Orleans” since 2007 (two years post Katrina). Thanks also go out to Mercedes Schneider because, before the report was taken down, she downloaded it. The report can still be accessed there, or also directly here: Beating-the-Odds, for those who want to dive into this further.

Anyhow, and as per another news article recently released about this mess, the Cowen Institute’s Executive Director John Ayers removed the report because the research within it was “inaccurate,” and institute “[o]fficials determined the report’s methodology was flawed, making its conclusions inaccurate.” The report is not to be reissued. The institute also intends to “thoroughly examine and strengthen [the institute’s] internal protocols” because of this and to make sure that this does not happen again. The released report was not appropriately internally reviewed, as per the institute’s official response, although external review would have certainly been more appropriate here. Similar situations capturing why internal AND more importantly external review are so very important have been the focus of prior blog posts here, here, and here.

But in this report, which listed Debra Vaughan (the Institute’s Director of Research) and Patrick Sims (the Institute’s Senior Research Analyst) as the lead researchers, the authors used what they called a VAM – BUT what they did in terms of analyses was certainly not akin to an advanced or “sophisticated” VAM (not that using a VAM would have revealed entirely more accurate and/or less flawed results). Instead, they used a simpler regression approach to reach (or confirm) their conclusions. While Ayers “would not say what piece of the methodology was flawed,” we can all be quite certain it was the so-called “VAM” that was the cause of this serious case of VAMbarrassment.

As background, and from a related article about the Cowen Institute and one of its new affiliates – Douglas Harris, who has also written a book about VAMs but positioned VAMs in a more positive light than I did in my book, but who is also not listed as a direct or affiliated author on the report – the situation in New Orleans post Katrina is as follows:

Before the storm hit in August 2005, New Orleans public schools were like most cities and followed the 100-year-old “One Best System.” A superintendent managed all schools in the school district, which were governed by a locally-elected school board. Students were taught by certified and unionized teachers and attended schools mainly based on where they lived.

But that all changed when the hurricane shuttered most schools and scattered students around the nation. That opened the door for alternative forms of public education, such as charter schools that shifted control from the Orleans Parish School Board into the hands of parents and a state agency, the Recovery School District.

In the 2012-13 school year, 84 percent of New Orleans public school students attended charter schools…New Orleans [currently] leads the nation in the percentage of public school students enrolled in charter schools, with the next-highest percentages in Washington D.C. and Detroit (41 percent in each).

Bias in School Composite Measures Using Value-Added

Ed Fuller – Associate Professor of Educational Policy at The Pennsylvania State University (also affiliated with this university’s Center for Evaluation and Education Policy Analysis [CEEPA]) – recently released an externally-reviewed technical report that I believe you all might find of interest. He conducted “An Examination of Pennsylvania School Performance Profile Scores.”

He found that the Commonwealth of Pennsylvania’s School Performance Profile (SPP) scores, as defined by the Pennsylvania Department of Education (PDE) for each school as also 40% based on Pennsylvania Value-Added Assessment System (PVAAS, which is a derivative of the Education Value-Added Assessment System [EVAAS]), are strongly associated with student- and school-characteristics. Thus, he concludes, these measures are “inaccurate measures of school effectiveness. [Pennsylvania’s] SPP scores, in [fact], are more accurate indicators of the percentage of economically disadvantaged students in a school than of the effectiveness of a school.”

Take a look at Fuller’s Figure 1. In this figure is a scatterplot that illustrates that there is indeed a “strong relationship” or “strong (co)relationship” between the percent of economically disadvantaged students and elementary schools’ SPP scores. Put more simply, as the percentage of economically disadvantaged students goes up (moving from left to right on the x-axis), the SPP goes down (moving from top to bottom on the y-axis). The actual correlation coefficient illustrated, while not listed explicitly, is noted as being stronger than r = –0.60 (i.e., larger than 0.60 in absolute value).

Note also the very few outliers, or schools that defy the trend, as is often assumed to occur when “we” more fairly measure growth, give all types of students a fair chance to demonstrate growth, level the playing field for all, and the like. Figures 2 and 3 in the same document illustrate essentially the same things, but for middle schools (r = –0.65) and high schools (r = –0.68). In these cases the outliers are even rarer.
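To make the correlation coefficients above more concrete, here is a minimal Python sketch of how such a coefficient is computed, assuming the reported values are Pearson's r. The six schools and their numbers are hypothetical and are not taken from Fuller's report; they only illustrate how a negative r arises when SPP scores fall as the percentage of economically disadvantaged students rises.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL data for six schools (not from Fuller's report).
pct_disadvantaged = [10, 25, 40, 55, 70, 85]
spp_score = [92, 85, 80, 74, 70, 61]

r = pearson_r(pct_disadvantaged, spp_score)
print(f"r = {r:.2f}")
```

A value near –1 means the points fall close to a downward-sloping line with few outliers; a value near 0 is what one would expect if the scores were unrelated to student poverty.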

When Fuller conducted somewhat more complicated statistical analyses (i.e., regression analyses) that help to control or account for more variables and factors than simple correlations do, he still found that “the vast majority of the differences in SPP scores across schools [were still] explained [better] by student- and school-characteristics that [were not and will likely never be] under the control of educators.”

Fuller’s conclusions? “If the scores accurately capture the true effectiveness of schools and the educators within schools, there should be only a weak or non-existent relationship between the SPP scores and the percentage of economically disadvantaged students.”

Hence: (1) “SPP scores should not be used as an indication of either school effectiveness or as a component of educator evaluations” because of the bias inherent within the system, of which PVAAS data are 40%. Bias in this case means that “teachers and principals in schools serving high percentages of economically disadvantaged students will be identified as less effective than they really are while those serving in schools with low percentages of economically disadvantaged students will be identified as more effective than in actuality.” (2) Because SPP scores are biased, and therefore inaccurate assessments of school effectiveness, “the use of the SPP scores in teacher and principal evaluations will lead to inaccurate judgments about teacher and principal effectiveness.” And (3) Using SPP scores in teacher and principal evaluations will create “additional incentive[s] for the most qualified and effective educators to seek employment in schools with high SPP scores. Thus, the use of SPP scores on educator evaluations will simply exacerbate the existing inequities in the distribution of educator quality across schools based on the characteristics of the students enrolled in the schools.”

Jesse Rothstein on Teacher Evaluation and Teacher Tenure

Last week, via the Washington Post’s Wonkblog, Max Ehrenfreund released a piece titled “Teacher tenure has little to do with student achievement, economist says.” For those of you who do not know Jesse Rothstein, he’s an Associate Professor of Economics at University of California – Berkeley, and he is one of the leading researchers/economists conducting research on teacher evaluation and accountability policies writ large, as well as the value-added models (VAMs) being used for such purposes. He’s probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms indeed biases (or distorts) value-added estimations, pretty much despite the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here.

Anyhow, in this piece author Ehrenfreund discusses teacher evaluation and teacher tenure with Rothstein. Some of the key take-aways from the interview and for this audience follow, but do read the full piece, linked again here, if so inclined:

Rothstein, on teacher evaluation:

  • In terms of evaluating teachers, “[t]here’s no perfect method. I think there are lots of methods that give you some information, and there are lots of problems with any method. I think there’s been a tendency in thinking about methods to prioritize cheap methods over methods that might be more expensive. In particular, there’s been a tendency to prioritize statistical computations based on student test scores, because all you need is one statistician and the test score data. Classroom observation requires having lots of people to sit in the back of lots and lots of classrooms and make judgments.”
  • Why the interest in value-added? “I think that’s a complicated question. It seems scientific, in a way that other methods don’t. Partly it has to do with the fact that it’s cheap, and it seems like an easy answer.”
  • What about the fantabulous study Raj Chetty and his Harvard colleagues (Friedman and Rockoff) conducted about teachers’ value-added (which has been the source of many prior posts herein)? “I don’t think anybody disputes that good teachers are important, that teachers matter. I have some methodological concerns about that study, but in any case, even if you take it at face value, what it tells you is that higher value-added teachers’ students earn more on average.”
  • What are the alternatives? “We could double teachers’ salaries. I’m not joking about that. The standard way that you make a profession a prestigious, desirable profession, is you pay people enough to make it attractive. The fact that that doesn’t even enter the conversation tells you something about what’s wrong with the conversation around these topics. I could see an argument that says it’s just not worth it, that it would cost too much. The fact that nobody even asks the question tells me that people are only willing to consider cheap solutions.”

Rothstein, on teacher tenure:

  • “Getting good teachers in front of classrooms is tricky,” and it will likely “still be a challenge without tenure, possibly even harder. There are only so many people willing to consider teaching as a career, and getting rid of tenure could eliminate one of the job’s main attractions.”
  • Likewise, “there are certainly some teachers in urban, high-poverty settings that are not that good, and we ought to be figuring out ways to either help them get better or get them out of the classroom. But it’s important to keep in mind that that’s only one of several sources of the problem.”
  • “Even if you give the principal the freedom to fire lots of teachers, they won’t do it very often, because they know the alternative is worse.” The alternative is replacing an ineffective teacher with an even less effective teacher. Contrary to what is oft-assumed, highly qualified teachers are not knocking down the doors to teach in such schools.
  • Teacher tenure is “really a red herring” in the sense that debating tenure ultimately misleads and distracts others from the more relevant and important issues at hand (e.g., recruiting strong teachers into such schools). Tenure “just doesn’t matter that much. If you got rid of tenure, you would find that the principals don’t really fire very many people anyway” (see also point above).

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week, the Texas Observer published an article, titled “Mute the Messenger,” about University of Texas – Austin’s Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests, as supported by Pearson, were flawed for the purpose of measuring teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. This article was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of, Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert – Professor Emeritus W. James Popham) a test is able to differentiate between well taught and poorly taught students, versus able to differentiate between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school that students bring with them to the schoolhouse door).

Testing developers like Pearson seem to focus on the former, that their tests are indeed sensitive to instruction, while testing/measurement academics and especially practitioners seem to focus on the latter, that tests are sensitive to instruction but not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert – Regents Professor David Berliner) sensitive to instruction but, more importantly, sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. This seems to be where the now very heated and polarized argument between Pearson and Associate Professor Stroup stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT), while defending its tests as “instructionally sensitive.” IRT is used to examine things like p-values (or essentially proportions of students who respond to items correctly) and item-discrimination indices (to see if test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] certain things otherwise). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.
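
To make the two item statistics described above a bit more concrete, here is a minimal Python sketch. It is a simplified, classical-style calculation, not a full IRT model fit, and certainly not Pearson’s actual procedure. The response data and the function names are made up purely for illustration.

```python
from statistics import mean, pstdev

# Hypothetical 0/1 response matrix (assumption, not real test data):
# each row is a student, each column is an item; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]


def item_p_values(matrix):
    """Proportion of students who answered each item correctly."""
    n_items = len(matrix[0])
    return [mean(row[j] for row in matrix) for j in range(n_items)]


def item_discrimination(matrix, j):
    """Correlation between item j (0/1) and the total score on the other items.

    A higher value means students who do well on the rest of the test
    also tend to get this item right.
    """
    item = [row[j] for row in matrix]
    rest = [sum(row) - row[j] for row in matrix]
    sd_item, sd_rest = pstdev(item), pstdev(rest)
    if sd_item == 0 or sd_rest == 0:
        return 0.0  # no variation, so the item cannot discriminate at all
    cov = mean(a * b for a, b in zip(item, rest)) - mean(item) * mean(rest)
    return cov / (sd_item * sd_rest)


p_values = item_p_values(responses)
discrimination = [item_discrimination(responses, j) for j in range(3)]
```

In this made-up data, four of the five students answer the first item correctly, so its p-value is 0.80. Note that an item everyone answers correctly would get a discrimination of zero in this sketch, no matter how well it was taught.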

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” From what I know about Dr. Stroup’s actual claims, however, this trivializes his overall arguments. Tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g., as per items’ aforementioned p-values). This alone makes tests less “instructionally sensitive” in practice. In other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if using IRT deems an item as “too easy,” even if the item is taught well (i.e., “instructionally sensitive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now more than before held accountable for teaching these items), and this makes the test (2) more “instructionally sensitive” in the eyes of test developers, in that the fewer students who get test items correct, the better the items are at discriminating between those who know (or are taught) certain things and students who don’t know (or are not taught) certain things otherwise.

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., “instructionally sensitive”) “too many” students answer such items correctly. Sometimes these items yield p-values greater than p=0.80 or 80% correct.

When you need a test and its outcome score data to fit around the bell curve, you cannot have such items, or too many of them, on the final test. In the simplest of terms, for every item with a p-value of 80% you would need another with a p-value of 20% to balance items out, or keep the overall mean of each test around p=0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around such a mean, otherwise the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) certain things otherwise. Such items (i.e., with high p-values) do not always distribute scores well enough because “too many students” answering such items correctly reduces the variation (or spread of scores) needed.
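
The arithmetic behind this can be sketched in a few lines of Python. For an item scored simply right or wrong, the variance of students’ scores on that item is p(1 - p), which is largest at p=0.50 and shrinks as p moves toward 0 or 1. The function names below are hypothetical, and this is only the simplest version of the reasoning, not any test company’s actual procedure.

```python
def item_variance(p):
    """Variance of a right/wrong (0/1) item with proportion correct p."""
    return p * (1 - p)


def mean_p_value(p_values):
    """Average p-value across a set of items."""
    return sum(p_values) / len(p_values)


# An "easy" item (80% correct) spreads scores less than a 50% item.
easy_item_spread = item_variance(0.80)    # about 0.16
middle_item_spread = item_variance(0.50)  # 0.25

# Pairing an 80% item with a 20% item pulls the average back to about 0.50.
balanced_mean = mean_p_value([0.80, 0.20])
```

Notice that the 20% item carries the same variance as the 80% item, so pairing the two recenters the test around p=0.50 without adding any spread beyond what the easy item already contributes.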

The counter-item in this case is another item also meant to capture statistics and/or data display, but that is much more difficult, largely because it’s rarely taught because it rarely matters in the real world. Take, for example, the box and whisker plot. If you don’t know what this is, which is in and of itself telling in this example, see them described and illustrated here. Often, this item IS found on such tests because this item IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) certain things otherwise.

Because this item is not as often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test”), and because this item doesn’t really matter in the real world, it becomes an item that is more useful for the test, as well as the overall functioning of the test, than it is an item that is useful for the students tested on it.

A sidebar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of the doctoral students’ resulting scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned the best as per our IRT analysis? The box and whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many similar items, had a lower relative p-value, and accordingly helped to increase the difficulty of the test and discriminate results to assert a purported “instructional sensitivity,” regardless of whether the item was actually valued, and more importantly valued in instruction.

Thanks to IRT, the items often left on such tests are not often the items taught by teachers, or perhaps taught by teachers well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes – the tests are aligned with the standards as such items are in the standards, but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.

Texas Hangin’ its Hat on its New VAM System

A fellow blogger, James Hamric, author of Hammy’s Education Reform Blog, emailed a few weeks ago connecting me with a recent post he wrote about teacher evaluations in Texas, titling them and his blog post “The good, the bad and the ridiculous.”

It seems that the Texas Education Agency (TEA), which serves a similar role in Texas as a state department of education elsewhere, recently posted details about the state’s new Teacher Evaluation and Support System (TESS) that the state submitted to the U.S. Department of Education to satisfy the conditions of its No Child Left Behind (NCLB) waiver, excusing Texas from not meeting NCLB’s prior goal that all students in the state (and all other states) would be 100% proficient in mathematics and reading/language arts by 2014.

While “80% of TESS will be rubric based evaluations consisting of formal observations, self assessment and professional development across six domains…The remaining 20% of TESS ‘will be reflected in a student growth measure at the individual teacher level that will include a value-add score based on student growth as measured by state assessments.’ These value added measures (VAMs) will only apply to approximately one quarter of the teachers [however, and as is the case more or less throughout the country] – those [who] teach testable subjects/grades. For all the other teachers, local districts will have flexibility for the remaining 20% of the evaluation score.” This “flexibility” will include options that include student learning objectives (SLOs), portfolios or district-level pre- and post-tests.

Hamric then goes on to review his concerns about the VAM-based component. While we have highlighted these issues and concerns many times prior on this blog, I do recommend reading these as summarized by others besides those of us who write here on this blog. This may just help to saturate our minds, and also prepare us to defend ourselves against the “good, bad, and the ridiculous” and perhaps work towards better systems of teacher evaluation, as is really the goal. Click here, again, to read this post in full.

Relatedly, Hamric concludes with the following: “the vast majority of educators want constructive feedback, almost to a fault. As long as the administrator is well trained and qualified, a rubric based evaluation should be sufficient to assess the effectiveness of a teacher. While the mathematical validity of value added models are accepted in more economic and concrete realms, they should not be even a small part of educator evaluations and certainly not any part of high-stakes decisions as to continuing employment. It is my hope that, as Texas rolls out TESS in pilot districts in the 2014-2015 school year, serious consideration will be given to removing the VAM component completely.”