Observations: “Where Most of the Action and Opportunities Are”

In a study just released on the website of Education Next, researchers discuss results from their recent examinations of “new teacher-evaluation systems in four school districts that are at the forefront of the effort [emphasis added] to evaluate teachers meaningfully.” The four districts’ evaluation systems were based on classroom observations, achievement test gains for the whole school (i.e., school-level value-added), performance on non-standardized tests, and some form of measure of teacher professionalism and/or teacher commitment to the school community.

Researchers found the following: the ratings assigned to teachers across the four districts’ leading evaluation systems, based primarily (i.e., 50-75%) on observations (not including value-added scores, except for the strikingly low 20% of teachers who were VAM eligible), were “sufficiently predictive” of a teacher’s future performance. The authors later define “sufficiently predictive” in terms of predictive validity coefficients that ranged from 0.33 to 0.38, which are actually quite “low” coefficients in reality. Later still, they call these same coefficients “quite predictive,” regardless.

While such low coefficients are to be expected as per others’ research on this topic, one must question how the authors arrived at their determinations that these were “sufficiently” and “quite” predictive (see also Bill Honig’s comments at the bottom of this article). The authors do qualify these classifications later, writing that “[t]he degree of correlation confirms that these systems perform substantially better in predicting future teacher performance than traditional systems based on paper credentials and years of experience.” They explain further that these correlations are “in the range that is typical of systems for evaluating and predicting future performance in other fields of human endeavor, including, for example, those used to make management decisions on player contracts in professional sports.” So it seems their qualifications rested on a “better than,” or relative, judgment rather than an empirical one (see also Bill Honig’s comments at the bottom of this article). That being said, this is certainly something to consume critically, particularly given the ways they have inappropriately categorized these coefficients.
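For readers who want a concrete sense of what coefficients of this size mean, a correlation r implies that only r² of the variance in the outcome is accounted for. The sketch below (my own illustration of standard arithmetic; the 0.33 and 0.38 figures are the ones reported above) makes the point:

```python
# Variance explained (r^2) implied by a predictive validity
# coefficient r: the share of variation in future performance
# that the evaluation score actually accounts for.
def variance_explained(r: float) -> float:
    return r ** 2

for r in (0.33, 0.38):
    print(f"r = {r:.2f} -> r^2 = {variance_explained(r):.3f}")

# A coefficient of 0.33 accounts for roughly 11% of the variance
# in future performance, and 0.38 roughly 14%; the remaining
# 86-89% goes unexplained, which is why such coefficients are
# conventionally read as "low."
```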

Researchers also found the following: “The stability generated by the districts’ evaluation systems range[d] from a bit more than 0.50 for teachers with value-added scores to about 0.65 when value-added is not a component of the score.” In other words, districts’ “[e]valuation scores that [did] not include value-added [were] more stable [when districts] assign[ed] more weight to observation scores, which [were demonstrably] more stable over time than value-added scores.” Put simply, observational scores outperformed value-added scores. Likewise, the stability researchers observed in the value-added scores (i.e., 0.50) fell within the upper range of the coefficients reported elsewhere in the research. So, researchers also confirmed that teacher-level value-added scores are still quite inconsistent from year to year, as they still (and too often) vary widely, and wildly, over time.
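To illustrate what a year-to-year stability of about 0.50 means for individual teachers, here is a small simulation (my own illustration, not from the study): scores across two years are generated with a true correlation near 0.50, and we count how often a teacher lands in a different quintile from one year to the next.

```python
import bisect
import random

# Illustrative simulation: two years of teacher scores correlated at
# r = 0.50, as reported for teacher-level value-added above.
random.seed(42)
r = 0.50
n = 10_000
year1 = [random.gauss(0, 1) for _ in range(n)]
year2 = [r * x + (1 - r ** 2) ** 0.5 * random.gauss(0, 1) for x in year1]

def quintile(x, sorted_scores):
    # Rank-based quintile: 0 = bottom 20%, 4 = top 20%.
    rank = bisect.bisect_left(sorted_scores, x)
    return min(4, 5 * rank // len(sorted_scores))

s1, s2 = sorted(year1), sorted(year2)
moved = sum(quintile(a, s1) != quintile(b, s2) for a, b in zip(year1, year2))
print(f"teachers changing quintiles year to year: {moved / n:.0%}")

# With r = 0.50, well over half of teachers land in a different
# quintile the following year; this churn is the instability at issue.
```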

Researchers’ key recommendations, as based on improving the quality of data derived from classroom observations: “Teacher evaluations should include two to three annual classroom observations, with at least one observation being conducted by a trained external observer.” They provide some evidence in support of this assertion in the full article. In addition, they assert that “[c]lassroom observations should carry at least as much weight as test-score gains in determining a teacher’s overall evaluation score.” I would argue, though, based on their (and others’) results, that they made an even stronger case for observations in lieu of teacher-level value-added throughout their paper.

Put differently, and in their own words – words with which I agree: “[M]ost of the action and nearly all the opportunities for improving teacher evaluations lie in the area of classroom observations rather than in test-score gains.” So there it is.

Note: The authors of this article also discuss the “bias” inherent in classroom observations. Based on their findings, for example, they also recommend that “districts adjust classroom observation scores for the degree to which the students assigned to a teacher create challenging conditions for the teacher. Put simply, the current observation systems are patently unfair to teachers who are assigned less-able and -prepared students. The result is an unintended but strong incentive for good teachers to avoid teaching low-performing students and to avoid teaching in low-performing schools.” While I did not highlight these sections above, do click here to read more.

Does A “Statistically Sound” Alternative Exist?

A few weeks ago a follower posed the following question on our website, and I thought it imperative to share.

Following the post about “The Arbitrariness Inherent in Teacher Observations,” he wrote: “Have you written about a statistically sound alternative proposal?”

My reply? “Nope. I do not believe such a thing exists. I do have a sound alternative proposal though, that has sound statistics to support it. It serves as the core of chapter 8 of my recent book.”

Essentially, this is a solution that, counter-intuitively, is even more conventional and traditional than current systems. It has research and statistical evidence in its support, and it has proven itself superior to using value-added measures, along with other measures of teacher effectiveness in their current forms, for evaluating and holding teachers accountable for their effectiveness. It is based on the use of multiple measures, aligned with the standards of the profession and with locally defined theories capturing what it means to be an effective teacher. Its effectiveness also relies on competent supervisors and elected colleagues serving as professional members of educators’ representative juries.

This solution does not rely solely on mathematics and the allure of numbers or grandeur of objectivity that too often comes along with numerical representation, especially in the social sciences. This solution does not trust the test scores too often (and wrongly) used to assess teacher quality, simply because the test output is already available (and paid for) and these data can be represented numerically, mathematically, and hence objectively. This solution does not marginalize human judgment, but rather embraces human judgment for what it is worth, as positioned and operationalized within a more professional, democratically-based, and sound system of judgment, decision-making, and support.

Thomas Kane On Educational Reform

You might recall from a prior post the name of Thomas Kane, an economics professor at Harvard University who also directed the $45 million Measures of Effective Teaching (MET) studies for the Bill & Melinda Gates Foundation. Not surprisingly, as a VAM advocate, he advanced then, and continues to advance now, a series of highly false claims about the wonderful potentials of VAMs.

As highlighted in the piece Kane wrote, and Brookings released on its website, on “The Case for Combining Teacher Evaluation and the Common Core,” Kane continues to advance a series of highly false claims and assumptions in terms of how “better teacher evaluation systems will be vital for any broad [educational] reform effort, such as implementing the Common Core.” Asserting a series of “heroic assumptions” without evidence seems to be a recurring theme, which I cannot figure out, given that Kane is an academic and, quite honestly, should know better.

Here are some examples of that of which I speak (and against which I protest):

  • Educational reform is “a massive adult behavior change exercise…[U]nless we change what adults do every day inside their classrooms, we cannot expect student outcomes to improve.” Enter teachers as the new and popular (thanks to folks like Kane) sources of blame. We are to accept Kane’s assumption here that teachers have not previously been motivated to change their adult behaviors, teach their students well, help their students learn, improve their students’ outcomes, and the like.
  • Hence, when “current attempts to implement new teacher evaluations fall short—as they certainly will, given the long history of box-checking—we must improve them.” We are to accept Kane’s assumption here that new teacher evaluation systems based on carrot-and-stick measures are going to accomplish this, despite the fact that little to no research evidence exists showing that teacher evaluation systems improve much of anything, including student outcomes.
  • Positioning new and improved teacher evaluation systems against another educational reform approach (which I have never seen positioned as a reform approach, but nonetheless), Kane argues that because “professional development hasn’t worked in the past,” we must go with new teacher evaluation systems. Nobody I know who conducts research on educational reform has ever suggested that professional development was, or could ever be, proposed to reform America’s public schools. Rather, professional development is simply a standard of the teaching profession, which (at least to many of us) is still meant to be just that: a profession. If we are to talk about research-based ways to reform our schools, there are indeed other solutions. These other solutions, however, are unfortunately more expensive and, hence, less popular among those who continue to advance cheap and “logical” or “rational” solutions such as those advanced by Kane.
  • Ironically, Kane cites and links to two external studies when arguing that “[b]etter teacher evaluation systems have been shown to be related to better outcomes for students.” While the first piece Kane references might have something to do with this (as per its abstract, though not the full piece), the second piece he cites and links to is, rather, about how professional or teacher development focused on supporting teacher and student interactions actually increased student learning and achievement. But “professional development hasn’t worked in the past?” Funny…
  • Kane also asserts that “The Common Core is more likely to succeed in sites that are implementing better teacher evaluation and feedback as well.” Where’s the evidence on that one…
  • There is really only one thing written into this piece on which we agree: the use of student surveys to provide teachers with student-based feedback (this was the source of a recent post I wrote here).

Thereafter, Kane goes into a series of suggestions for administrators and teachers on how they should, for example, conduct “side-by-side comparison[s] of the new and old standards and identify a few standards—no more than two or three in each grade and subject—to focus on during the upcoming year” — and — how administrators should “schedule classroom observations for the days when the new standards are to be taught.” Indeed, “[e]ven one successful cycle will lay the foundation for the next round of instructional improvement.”

I do have to say, though, as a former teacher, I would advise others not to heed the advice of a person who has conducted a heck of a lot of research “on” education but who has, as far as I can tell or find on the internet (see his full resume or curriculum vita here), never been a teacher “in” education himself, much less set foot in a classroom. I’m sorry, practitioners, on behalf of my colleague, for his doing this from atop his ivory tower post (as you sometimes criticize us as a whole for doing).

Kane concludes with the following: “The norm of autonomous, self-made, self-directed instruction—with no outside feedback or intervention—is long-standing and makes the U.S. education system especially resistant to change. In most high-performing countries, teachers have no such expectations. The lesson study in Japan is a good example. Teachers do not bootstrap their own instruction. They do not expect to be left alone. They expect standards, they expect feedback from peers and supervisors and they expect to be held accountable—for the quality of their delivery as well as for student results. Therefore, a better system for teacher evaluation and feedback is necessary to support individual behavior change, and it’s a tool for collective culture change as well.”

So much of what he wrote here, really in every single sentence, could not be further from the truth, so much so I care not to dissect each point and waste your time further.

As I also said in my prior post, if I were to make a list of VAMboozlers, Kane would still be near the top. All of the reasons for my nomination are highlighted yet again here, unfortunately, but this time as per what Kane wrote himself. Again, though, you can be the judges and read this piece for yourselves, or not.

Jesse Rothstein on Teacher Evaluation and Teacher Tenure

Last week, released via the Washington Post’s Wonkblog, Max Ehrenfreund wrote a piece titled “Teacher tenure has little to do with student achievement, economist says.” For those of you who do not know Jesse Rothstein, he’s an Associate Professor of Economics at University of California – Berkeley, and he is one of the leading researchers/economists conducting research on teacher evaluation and accountability policies writ large, as well as the value-added models (VAMs) being used for such purposes. He’s probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms indeed biases (or distorts) value-added estimations, pretty much despite the sophistication of the statistical controls meant to block (or control for) such bias (or distorting effects). You can find this study referenced here.

Anyhow, in this piece, author Ehrenfreund discusses teacher evaluation and teacher tenure with Rothstein. Some of the key take-aways from the interview for this audience follow, but do read the full piece, linked again here, if so inclined:

Rothstein, on teacher evaluation:

  • In terms of evaluating teachers, “[t]here’s no perfect method. I think there are lots of methods that give you some information, and there are lots of problems with any method. I think there’s been a tendency in thinking about methods to prioritize cheap methods over methods that might be more expensive. In particular, there’s been a tendency to prioritize statistical computations based on student test scores, because all you need is one statistician and the test score data. Classroom observation requires having lots of people to sit in the back of lots and lots of classrooms and make judgments.”
  • Why the interest in value-added? “I think that’s a complicated question. It seems scientific, in a way that other methods don’t. Partly it has to do with the fact that it’s cheap, and it seems like an easy answer.”
  • What about the fantabulous study Raj Chetty and his Harvard colleagues (Friedman and Rockoff) conducted about teachers’ value-added (which has been the source of many prior posts herein)? “I don’t think anybody disputes that good teachers are important, that teachers matter. I have some methodological concerns about that study, but in any case, even if you take it at face value, what it tells you is that higher value-added teachers’ students earn more on average.”
  • What are the alternatives? “We could double teachers’ salaries. I’m not joking about that. The standard way that you make a profession a prestigious, desirable profession, is you pay people enough to make it attractive. The fact that that doesn’t even enter the conversation tells you something about what’s wrong with the conversation around these topics. I could see an argument that says it’s just not worth it, that it would cost too much. The fact that nobody even asks the question tells me that people are only willing to consider cheap solutions.”

Rothstein, on teacher tenure:

  • “Getting good teachers in front of classrooms is tricky,” and it will likely “still be a challenge without tenure, possibly even harder. There are only so many people willing to consider teaching as a career, and getting rid of tenure could eliminate one of the job’s main attractions.”
  • Likewise, “there are certainly some teachers in urban, high-poverty settings that are not that good, and we ought to be figuring out ways to either help them get better or get them out of the classroom. But it’s important to keep in mind that that’s only one of several sources of the problem.”
  • “Even if you give the principal the freedom to fire lots of teachers, they won’t do it very often, because they know the alternative is worse.” The alternative being replacing an ineffective teacher with an even less effective teacher. Contrary to what is oft-assumed, highly qualified teachers are not knocking down the doors to teach in such schools.
  • Teacher tenure is “really a red herring” in the sense that debating tenure ultimately misleads and distracts others from the more relevant and important issues at hand (e.g., recruiting strong teachers into such schools). Tenure “just doesn’t matter that much. If you got rid of tenure, you would find that the principals don’t really fire very many people anyway” (see also point above).

Can Today’s Tests Yield Instructionally Useful Data?

The answer is no, or at best not yet.

Some heavy hitters in the academy just released an article that might be of interest to you all. In the article the authors discuss whether “today’s standardized achievement tests [actually] yield instructionally useful data.”

The authors include W. James Popham, Professor Emeritus from the University of California, Los Angeles; David Berliner, Regents’ Professor Emeritus at Arizona State University; Neal Kingston, Professor at the University of Kansas; Susan Fuhrman, current President of Teachers College, Columbia University; Steven Ladd, Superintendent of Elk Grove Unified School District in California; Jeffrey Charbonneau, National Board Certified Teacher in Washington and the 2013 US National Teacher of the Year; and Madhabi Chatterji, Associate Professor at Teachers College, Columbia University.

These authors explored some of the challenges and promises in terms of using and designing standardized achievement tests and other educational tests that are “instructionally useful.” This was the focus of a recent post about whether Pearson’s tests are “instructionally sensitive” and what University of Texas – Austin’s Associate Professor Walter Stroup versus Pearson’s Senior Vice President had to say on this topic.

In this study, the authors deliberate more specifically the consequences of using inappropriately designed tests for decision-making purposes, particularly when tests are insensitive to instruction. Here, the authors underscore serious issues related to validity, ethics, and consequences, all of which they appropriately elevate to speak out against the use of current, large-scale standardized achievement tests for evaluating teachers and schools.

The authors also make recommendations for local policy contexts, supporting (1) the design of more instructionally sensitive large-scale tests as well as (2) the design of other, smaller-scale tests that can also be more instructionally sensitive, and simply better. These include, but are not limited to, classroom tests as typically created, controlled, and managed by teachers, as well as district tests as sometimes created, controlled, and managed by district administrators.

Such tests might help to create more, but also better, comprehensive educational evaluation systems, the authors ultimately argue. This, of course, would require more professional development to help teachers (and others, including district personnel) develop more instructionally sensitive, and accordingly useful, tests. As they also note, it would also require that “validation studies…be undertaken to ensure validity in interpretations of results within the larger accountability policy context where schools and teachers are evaluated.”

This is especially important if tests are to be used for low- and high-stakes decision-making purposes. Yet this is something that is far too often forgotten when it comes to test use, and in particular test abuse. All should really take heed here.

Charter Schools’ Value-Added in Ohio

On the 10th Period blog, an Education Policy Fellow at Innovation Ohio named Stephen Dyer wrote about charter schools’ versus traditional schools’ value-added. Click here to read the full blog post, and also to view Dyer’s illustrative graphs explaining the headline: that “Charter Value Added Grades [are] Not Much Better” than the value added grades of their comparable public schools.

First, it is important to note that the state of Ohio uses the Education Value-Added Assessment System (EVAAS), of interest in many prior posts on this blog. Second, it is important to note that there are flaws in all of these data, so consume these findings with a critical eye: very few people agree that value-added data are yielding valid results, or rather results from which valid inferences can be drawn. Even if the self-reported “best” value-added system is being used in the state of Ohio, this does not mean that these results (even though they support public schools) are accurate, much less informative.

Let’s just suppose, though, that using VAM output at the macro level (i.e., the district/school versus teacher level) might be okay, largely again if used only for descriptive purposes. As Dyer stated, at that level “VAM holds more promise, is less swayed by demographics than raw test scores, and is better philosophically. Though it still needs a lot of work.” Because this is a school-level analysis, other researchers would also be more inclined to agree.

While there are certainly some sampling issues in this analysis, as also acknowledged by Dyer, in that charter schools in general have fewer students, making some analyses (e.g., analyses of gifted students) impossible, Dyer’s main findings follow:

  • “Districts still get higher percentages of As and Bs on all the value added categories. Meanwhile, Charters get higher percentages of Ds and Fs than districts do.”
  • In one value added category (VAM among the lowest scoring 20% of students), charters got 1% more As than districts.
  • Otherwise, charters “fail at a significantly higher level in all these categories than the districts from which they receive their children and money.”
  • Overall, “Charters do a little bit better than their raw scores would indicate. But it’s still nothing to write home about.”

It is also important to note that “every Ohio school district lost money and children to Charter Schools last year (only Ohio’s tiny Lake Erie island districts did not).” If I were a parent in Ohio, I for one would pause before making such a decision given the above, even given the limitations. If I were a policymaker in Ohio? I’d really rethink this year’s budget given last year’s budget that came in at $914 million.

The Arbitrariness Inherent in Teacher Observations

In a recent article released in The Journal News, a newspaper serving many suburban New York counties, another common problem is highlighted whereby districts that have adopted the same teacher observational system (in this case as mandated by the state) are scoring what are likely very similar teachers very differently. Whereas one of the best school districts not only in the state but in the nation apparently has no “highly effective” teachers on staff, a neighboring district apparently has a staff 99% filled with “highly effective” teachers.

The “believed to be” model developer, Charlotte Danielson, is cited as stating that “[s]aying 99 percent of your teachers are highly effective is laughable.” I don’t know if I completely agree with her statement, and I do have to admit I question her perspective on this one, and all of her comments throughout this article for that matter, as she is the one purportedly offering up her “valid” Framework for Teaching for such observational purposes. Perhaps she’s displacing blame, arguing that it’s the subjectivity of the scorers, rather than the subjectivity inherent in her system, that is to blame for the stark discrepancies.

As per Danielson: “The local administrators know who they are evaluating and are often influenced by personal bias…What it also means is that they might have set the standards too low.” As per the Superintendent of the district with 99% highly effective teachers: the state’s “flawed” evaluation model forced districts to “bump up” the scores so “effective” teachers wouldn’t end up with a rating of “developing.” The Superintendent adds that it is possible under the state’s system to be rated “effective” across domains and still end up rated “developing” overall, which means teachers may be in need of intervention/improvement, or may be eligible for an expedited hearing process that could lead to their termination. Rather, it may have been the case that the scores were inflated to save effective teachers from what the district viewed as an ineffective set of consequences attached to the observational system (i.e., intervention or termination).

Danielson is also cited as saying that teachers should “live in ‘effective’” and only occasionally “visit ‘highly effective.’” She also notes that if her system contradicts teachers’ value-added scores, this too should “raise red flags” about the quality of the teacher, although she does not (in this article) pay any regard to the issues inherent not only in value-added measures but also in her own observational system.

What is most important in this article, though, is that reading through it illustrates well the arbitrariness of how all of the measures being mandated and used to evaluate teachers are actually being used. Take, for example, the other note herein that the state department’s intent seems to be that 70%-80% of teachers should “fall in the middle” as “developing” or “effective.” Besides being mathematically dubious (i.e., forcing 70%-80% of teachers to hang around average), this could not be more arbitrary.

In the end, teacher evaluation systems are highly flawed, highly subjective, and highly prone to error, and for people who just don’t “get it” to be passing policies to the contrary is nonsensical and absurd. These flaws are not as important when evaluation system data are used for formative, or informative, purposes, where data consumers have more freedom to take the data for what they are worth. When summative decisions are made based on these data, however, regardless of whether low or high stakes are attached to the decision, this is where things really go awry.

Principals’ Perspectives on Value-Added

Principals are not using recent teacher evaluation data, including data from value-added assessment systems, student surveys, and other student achievement indicators, to inform decisions about hiring, placements, and professional development, according to findings from a research study recently released by researchers at Vanderbilt University.

The data most often used by principals? Data collected via their direct observations of their teachers in practice.

Education Week’s Denisa Superville also covered this study here, writing that principals are most likely to use classroom-observation data to inform such decisions, rather than the data yielded via VAMs and other student test scores. Of least relevance were data derived via parent surveys.

Reasons for not using value-added data specifically? “[A]ccess to the data, the availability of value-added measures when decisions are being made, a lack of understanding of the statistical models used in the evaluation systems, and the absence of training in using [value-added] data.”

Moving forward, “the researchers recommend that districts clarify their expectations for how principals should use data and what data sources should be used for specific human-resources decisions. They recommend training for principals on using value-added estimates, openly encouraging discussions about data use, and clarifying the roles of value-added estimates and observation scores.”

If this is to happen, hopefully such efforts will be informed by the research community, in order to help districts and administrators more critically consume value-added data in particular, understanding what such data can and cannot do.

Note: This study is not yet peer-reviewed, so please consume this information with that in mind.

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week, the Texas Observer published an article, titled “Mute the Messenger,” about University of Texas – Austin’s Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests, as supported by Pearson, were flawed as per their purposes to measure teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. It was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of, Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert Professor Emeritus W. James Popham) a test is able to differentiate between well-taught and poorly taught students, versus merely between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school and that students bring with them to the schoolhouse door).

Test developers like Pearson seem to focus on the former, claiming that their tests are indeed sensitive to instruction. Testing/measurement academics, and especially practitioners, seem to focus on the latter: tests are sensitive to instruction, but not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert Regents Professor David Berliner) sensitive to instruction but, more importantly, sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. Here seems to be where this now very heated and polarized argument between Pearson and Associate Professor Stroup stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT), while defending its tests as “instructionally sensitive.” IRT is used to examine things like p-values (essentially, the proportions of students who respond to items correctly) and item-discrimination indices (to see if test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] those things). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” From what I know about Dr. Stroup’s actual claims, however, this trivializes his overall arguments. Tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g., as per the items’ aforementioned p-values). This alone can make tests less “instructionally sensitive” in practice. In other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if IRT deems an item “too easy,” even if the item is taught well (i.e., is “instructionally sensitive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now, more than before, held accountable for teaching these items), and (2) more “instructionally sensitive” in the eyes of test developers, in that the fewer students who get an item correct, the better the item discriminates between those who know (or are taught) certain things and those who don’t know (or are not taught) those things otherwise.
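The screening rule described above can be sketched in a few lines. Everything here is hypothetical: the item names, the p-values, and the 0.80 cutoff are invented for illustration, not taken from any real test blueprint.

```python
# Invented p-values for three hypothetical items.
p_values = {"bar_chart": 0.85, "box_whisker": 0.35, "mean_median": 0.55}

TOO_EASY = 0.80  # hypothetical cutoff for "too many students get it right"

# Drop items above the cutoff, regardless of how well the content is taught.
retained = {item: p for item, p in p_values.items() if p <= TOO_EASY}

print(retained)  # the well-taught, high-p-value bar-chart item is screened out
```

Note what the rule cannot see: it has no input for whether an item was taught well, only for how many students answered it correctly, which is the crux of the “instructional sensitivity” complaint.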

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., are “instructionally sensitive”), “too many” students answer them correctly. Sometimes these items yield p-values greater than p = 0.80, or 80% correct.

When you need a test and its outcome score data to fit the bell curve, you cannot have such items, or too many of them, on the final test. In the simplest of terms, for every item with a p-value of 80% you would need another with a p-value of 20% to balance things out, keeping the overall mean of the test around p = 0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around that mean; otherwise, the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) those things otherwise. Items with high p-values do not always distribute scores well enough, because “too many students” answering them correctly reduces the variation (or spread of scores) needed.
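The balancing logic above can be checked with a little arithmetic. An item scored 0/1 with p-value p has variance p(1 − p), which peaks at p = 0.50, and pairing an 80% item with a 20% item does keep the average difficulty at 0.50. A quick sketch:

```python
# Variance of a single 0/1 (correct/incorrect) item with p-value p.
def item_variance(p):
    return p * (1 - p)

# Spread is maximized at p = 0.5 and shrinks toward the extremes.
print(item_variance(0.5))   # 0.25, the maximum possible spread
print(item_variance(0.8))   # same variance as p = 0.2, but much smaller than 0.25

# Pairing an easy item (p = 0.8) with a hard one (p = 0.2)
# keeps the average difficulty at the center of the curve.
print((0.8 + 0.2) / 2)      # 0.5
```

This is why a test built to spread scores out keeps its items “hanging around” p = 0.50: very easy (and very hard) items simply contribute less variance.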

The counter-item in this case is another item also meant to capture statistics and/or data displays, but one that is much more difficult, largely because it’s rarely taught, because it rarely matters in the real world. Take, for example, the box-and-whisker plot. If you don’t know what this is (which is in and of itself telling in this example), see it described and illustrated here. Often, this item IS found on such tests because it IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) those things otherwise.

Because this item is not often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test”), and because it doesn’t really matter in the real world, it becomes an item that is more useful for the test, and for the overall functioning of the test, than it is useful for the students tested on it.

A side bar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of the doctoral students’ resulting scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned best per our IRT analysis? The box-and-whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many similar items, had a lower relative p-value and accordingly helped to increase the difficulty of the test and to spread out its results, asserting a purported “instructional sensitivity,” regardless of whether the item was actually valued, and more importantly valued in instruction.

Thanks to IRT, the items often left on such tests are not the items taught by teachers, or perhaps not the items taught well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes, the tests are aligned with the standards, as such items are in the standards; but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.