My Presentation, Tomorrow (Friday Night) in Las Cruces, New Mexico

For just over one year now, I have been blogging about that which is occurring in my neighbor state – New Mexico – under the leadership of Hanna Skandera (the acting head of the New Mexico Public Education Department and Jeb Bush protégé; see prior post here). This state is increasingly becoming another “state to watch” in terms of their teacher accountability policies as based on value-added models (VAMs), or what in New Mexico they term their value-added scores (VASs).

Accordingly, the state is another in which a recent lawsuit is bringing to light some of the issues with these models, particularly in terms of data errors, missingness, incompleteness, and Common Core, Partnership for Assessment of Readiness for College and Careers (PARCC) tests that are now being used for consequential decision-making purposes, right out of the gate. For more about this lawsuit, click here. For more information about the PARCC tests, and protests against them – a reported 950 students in Las Cruces walked out of classes 10 days ago in protest, and 800 students “opted out” with parental support – click here and here.

For those of you who live in New Mexico, I am heading there for a series of presentations and meetings tomorrow (Friday, March 13) with an assortment of folks ranging from school board members, parents, students, members of the media, and academics to discuss that which is happening in New Mexico. My visit is being sponsored by New Mexico State University’s Borderlands Writing Project.

Here is the information for my main talk, tomorrow (Friday, March 13) night from 7:00-9:00 p.m., on “Critical Perspectives on Assessment-Based Accountability,” in the Health and Social Services Auditorium on the NMSU campus in Las Cruces. Attached is the press release as well as the advertising flyer. After my presentation there will be breakout sessions that are also open to the public (1) for parents on making decisions about the tests; (2) for Taking the Test, allowing people to try out a practice test and provide feedback; and (3) for next steps – a session for teachers, students and people who want to help create positive change.

Hope to see you there!

An Average of 60-80 Days of 180 School Days (≈ 40%) on State Testing in Florida

“In Florida, which tests students more frequently than most other states, many schools this year will dedicate on average 60 to 80 days out of the 180-day school year to standardized testing.” So it is written in an article recently released in the New York Times titled “States Listen as Parents Give Rampant Testing an F.”

At issue, as per a serious set of parents, are the following:

  • The aforementioned focus on state standardized testing, as highlighted above, is being defined as a serious educational issue/concern. In the words of the article’s author, “Parents railed at a system that they said was overrun by new tests coming from all levels — district, state and federal.”
  • Related, parents took issue with the new Common Core tests and a “state mandate that students use computers for [these, and other] standardized tests” which has “made the situation worse because computers are scarce and easily crash” and because “the state did not give districts extra money for computers or technology help.”
  • Related, “Because schools do not have computers for every student, tests are staggered throughout the day, which translates to more hours spent administering tests and less time teaching. Students who are not taking tests often occupy their time watching movies. The staggered test times also mean computer labs are not available for other students.”
  • In addition, parents “wept as they described teenagers who take Xanax to cope with test stress, children who refuse to go to school and teachers who retire rather than promote a culture that seems to value testing over learning.” One father cried confessing that he planned to pull his second grader from school because, in his words, “Teaching to a test is destroying our society,” as a side effect of also destroying the public education system.

Regardless, former Governor Jeb Bush — a possible presidential contender and one of the first governors to introduce high-stakes testing into “his” state of Florida — “continues to advocate test-based accountability through his education foundation. Former President George W. Bush, his brother, introduced similar measures as [former] governor of Texas and, as [former] president, embraced No Child Left Behind, the law that required states to develop tests to measure progress.”

While the testing craze existed in many ways and places prior to NCLB, NCLB and the Bush brothers’ insistent reliance on high-stakes tests to reform America’s public education system have really brought the U.S. to where it is today; that is, relying even more on more and more tests and the use of VAMs to better measure growth in between the tests administered.

Likewise, and as per the director of FairTest, “The numbers and consequences of these tests have driven public opinion over the edge, and politicians are scrambling to figure out how to deal with that.” In the state of Florida in particular, “[d]espite continued support in the Republican-dominated State Legislature for high-stakes testing,” these are just some of the signs that “Florida is headed for a showdown with opponents of an education system that many say is undermining its original mission: to improve student learning, help teachers and inform parents.”

Tennessee’s Senator Lamar Alexander to “Fix” No Child Left Behind (NCLB)

In The Tennessean this week was an article about how Lamar Alexander – the new chairman of the US Senate Education Committee, the current US Senator from Tennessee, and the former US Secretary of Education under former President George Bush Senior – is planning on “fixing” No Child Left Behind (NCLB). See another Tennessean article about this last week as well, here. This is good news, also given this is coming from Tennessee – one of our states to watch given its history of value-added reforms tied to NCLB tests.

NCLB, even though it has not been reauthorized since 2002 (despite a failed attempt in 2007), is still foundational to current test-based reforms, including those that tie federal funds to the linking of students’ (NCLB) test scores to students’ teachers to measure teacher-level value-added. To date, for example, 42 states have been granted NCLB waivers for not getting 100% of their students to 100% proficiency in mathematics and reading/language arts by 2014, as long as states adopted and implemented federal reforms based on Common Core tests and began tying test scores to teacher quality using growth/value-added measures. This was in exchange for the federal educational funds on which states also rely.

Well, Lamar apparently believes that this is precisely what needs fixing. “The law has become unworkable,” Alexander said. “States are struggling. As a result, we need to act.”

More specifically, Senator Alexander wants to:

  • take power away from the US Department of Education, which he often refers to as our country’s “national school board.”
  • prevent the US Department of Education from further pressuring schools to adopt certain tests and standards.
  • prevent current US Secretary of Education Duncan from continuing to hold “states over a barrel,” forcing them to do what he has wanted them to do to avoid being labeled failures.
  • “set realistic goals, keep the best portions of the law, and restore to states and communities the responsibility to decide whether schools and teachers are succeeding or failing.” Although, Alexander has still been somewhat neutral regarding the future role of tests.

“Are there too many? Are they redundant? Are they the right tests? I’m open on the question,” Alexander said.

As noted by the journalist of this article, however, this is the biggest concern with this (potentially) big win for education in that “There is broad agreement that students should be tested less, but what agency wants to relinquish the ability to hold teachers, administrators and school districts accountable for the money we [as a public] spend on education?” Current US Secretary of Education Duncan, for example, believes if we don’t continue down his envisioned path, “we [will] turn back the clock on educational progress, 15 years or more.” This from a guy who has built his political career on the fact that educators have made no educational progress; hence, this is the reason we need to hold teachers accountable for the lack of progress thereof.

We shall see how this one goes, I guess.

For an excellent “Dear Lamar” letter, from Lamar Alexander’s former Assistant Secretary of Education who served under Lamar when he was US Secretary of Education under former President Bush Senior, click here. It’s a wonderful letter written by Diane Ravitch; wonderful because it includes so many recommendations highlighting that which could be at this potential turning point in federal policy.

Policy Idiocy in New York: Teacher Value-Added to Trump All Else

A Washington Post post, recently released by Valerie Strauss and written by award-winning (e.g., New York’s 2013 High School Principal of the Year) Carol Burris of South Side High School in New York, details how in New York their teacher evaluation system is “going from bad to worse.”

It seems that New York state’s education commissioner – John King – recently resigned, thanks to a new job working as a top assistant directly under US Secretary of Education Arne Duncan. But that is not where the “going from bad to worse” phrase applies.

Rather, it seems the state’s Schools Chancellor Merryl Tisch, with the support and prodding of New York Governor Andrew Cuomo, wants to take what was the state’s teacher evaluation system requirement that 20% of an educator’s evaluation be based on “locally selected measures of achievement,” to a system in which teachers’ value-added, as based on growth on the state’s (Common Core) standardized test scores, will be set at 40%.

In addition, she (along with the support and prodding of Cuomo) is pushing for a system in which these scores would “trump all,” and in which a teacher rated as ineffective in the growth score portion would be rated ineffective overall. A teacher with two ineffective ratings would “not return to the classroom.”

This is not only “going from bad to worse,” but going from bad to idiotic.

All of this is happening despite the research studies that, by this point, should have literally buried such policy initiatives. Many of these research studies are highlighted on this blog here, here, here, and here, and are also summarized in two position statements on value-added models (VAMs) released by the American Statistical Association last April (to read about their top 10 concerns/warnings, click here) and the National Association of Secondary School Principals (NASSP) last month (to read about their top 6 concerns/warnings, click here).

All of this is also happening despite the fact that this flies in the face of the 2014 “Standards for Educational and Psychological Testing” also released this year by the leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).

And all of this is happening despite the fact that teachers and principals in this state of New York already, and democratically, created sound evaluation plans to which the majority had already agreed, given the system they created to meet state and federal policy mandates was much more defensible, and much less destructive.

I, for one, find this policy idiocy infuriating.

To read the full post, one that is definitely worth a full read, click here.

US Secretary of Education Duncan “Loves Him Some VAM Sauce”

US Secretary of Education “Arne [Duncan] loves him some VAM sauce, and it is a love that simply refuses to die,” writes Peter Greene in a recent Huffington Post post. Duncan’s (simple) mind loves it because, indeed, the plan is overly simplistic. All that the plan requires are two simple ingredients: “1) A standardized test that reliably and validly measures how much students know 2) A super-sciency math algorithm that will reliably and validly strip out all influences except that of the teacher.”

Sticking with the cooking metaphor, however, Greene writes, “VAM is no spring chicken, and perhaps when it was fresh and young some affection for it could be justified. After all, lots of folks, including non-reformy folks, like the idea of recognizing and rewarding teachers for being excellent. But how would we identify these pillars of excellence? That was the puzzler for ages until VAM jumped up to say, ‘We can do it! With Science!!’ We’ll give some tests and then use super-sciency math to filter out every influence that’s Not a Teacher and we’ll know exactly how much learnin’ that teacher poured into that kid.”

“Unfortunately, we don’t have either,” and we likely never will. Why this is the case is also highlighted in this post, with Greene explicitly citing three main sources for support: the recent oppositional statement released by the National Association of Secondary School Principals, the oppositional statement released this past summer by the American Statistical Association, and our mostly-oppositional blog Vamboozled! (by Audrey Amrein-Beardsley). Hopefully getting the research into the hands of educational practitioners, school board members, the general public, and the like is indeed “adding value” in the purest sense of this phrase’s meaning. I sure hope so!

Anyhow, in this post Greene also illustrates and references a nice visual (with a side of sarcasm) explaining the complexity behind VAMs in pretty clear terms. I also paste this illustration here, which Greene references as originally coming from a blog post from Daniel Katz, Ph.D. but I have seen similar versions elsewhere and prior (e.g., a New York Times article here).


Greene ultimately asks why Duncan is still so fixated on a policy that is disproportionately loaded and increasingly rejected and unsupported.

Greene’s answer: “[I]f Duncan were to admit that his beloved VAM is a useless tool…then all his other favorite [reform-based] programs would collapse” around him… “Why do we give the Big Test? To measure teacher effectiveness. How do we rank and evaluate our schools? By looking at teacher effectiveness. How do we find the teachers that we are going to move around so that every classroom has a great teacher? With teacher effectiveness ratings. How do we institute merit pay and a career ladder? By looking at teacher effectiveness. How do we evaluate every single program instituted in any school? By checking to see how it affects teacher effectiveness. How do we prove that centralized planning (such as Common Core) is working? By looking at teacher effectiveness. How do we prove that corporate involvement at every stage is a Good Thing? By looking at teacher effectiveness. And by “teacher effectiveness,” we always mean VAM (because we [i.e., far-removed educational reformers] don’t know any other way, at all).”

If Duncan’s “magic VAM sauce, is a sham and a delusion and a big bowl of nothing,” his career would literally fold in.

To read more from Greene, do click here to read his post in full.

VAM Updates from An Important State to Watch: Tennessee

The state of Tennessee, the state in which our beloved education-based VAMs were born (see here and here), has been one state we have been following closely on VAMboozled! throughout the last year’s blog posts.

Throughout this period we have heard from concerned administrators and teachers in Tennessee (see, for example, here and here). We have written about how something called subject area bias also exists, unheard of in the VAM-related literature until a Tennessee administrator sent us a lead, and we analyzed Tennessee’s data (see here and here, and also an article written by my graduate student and now Dr. Jessica Holloway-Libell forthcoming in the esteemed Teachers College Record). We followed closely the rise and recent fall of the career of Tennessee’s Education Commissioner Kevin Huffman (see, for example, here and here, respectively). And we have watched how the Tennessee Board of Education and other leaders in the state have met, attempted to rescind, and actually rescinded some of the policy requirements that tie teachers to their VAM scores, again as determined by teachers’ students’ performance as calculated by the familiar Tennessee Value-Added Assessment System (TVAAS), and its all-too-familiar mother-ship, the Education Value-Added Assessment System (EVAAS).

Now, following the (in many ways celebrated) exit of Commissioner Huffman, it seems the state is taking an even more reasonable stance towards VAMs and their use(s) for teacher accountability…at least for now.

As per a recent article in the Tennessee Education Report (see also an article in The Tennessean here), Governor Bill Haslam announced this week that “he will be proposing changes to the state’s teacher evaluation process in the 2015 legislative session,” the most significant change being “to reduce the weight of value-added data on teacher evaluations during the transition [emphasis added] to a new test for Tennessee students.” New tests are to be developed in 2016, which are unlikely to be part of the Common Core, and rather significantly informed by teachers in consultation with Measurement Inc.

Anyhow, as per Governor Haslam’s press release (as also cited in this article), he intends to do the following three things:

  1. Adjust the weighting of student growth data in a teacher’s evaluation so that the new state assessments in ELA [English/language arts] and math will count 10 percent of the overall evaluation in the first year of administration (2016), 20 percent in year two (2017) and 35 percent in year three (2018). Currently 35 percent of an educator’s evaluation is comprised of student achievement data based on student growth;
  2. Lower the weight of student achievement growth for teachers in non-tested grades and subjects from 25 percent to 15 percent;
  3. And make explicit local school district discretion in both the qualitative teacher evaluation model that is used for the observation portion of the evaluation as well as the specific weight student achievement growth in evaluations will play in personnel decisions made by the district.

Obviously, the latter two points (i.e., #2 and #3) demonstrate steps in the right direction: #2 to be a bit more reasonable about whether teachers who don’t actually teach students in the subject areas should be held accountable for out-of-subject scores (although I’d vote for 0% weight here) and #3 to hand over to districts more local discretion and control over how their teacher evaluations are conducted (although VAMs still must be a part).

I’m less optimistic about the first intended change, however, as “the proposal does not go as far as some have proposed” (e.g., the American Statistical Association (ASA) as per some of their key points in their position statement on VAMs). This first change still supports what is still a “heroic” assumption that VAMs do work, and in this case will get better over time (i.e., 2016, 2017, 2018) with “new and improved tests,” so that the weights in place now (i.e., 35%) might be more appropriate then, and hence reached, or just reached regardless of appropriateness at that time…

A “Must Watch” Video on VAMs from Albuquerque, New Mexico

As written into a recent article in the Albuquerque Journal, an Albuquerque, New Mexico Public School Board member publicly, but in many ways appropriately, unleashed her frustration over the use of standardized tests and VAMs in Albuquerque’s public schools.

This is a “must watch” video that can be watched in just over two minutes here.

At one point she said that she is “so goddamn grateful she (her eldest of four children, a high school senior) is leaving the public schools” system at the end of the school year, although she still has three children left in the system. Hence, her outrage, but also her call to her fellow board members to delay, or rather suspend indefinitely, the use of the state of New Mexico’s new Common Core tests and to suspend the use of such test scores in teacher evaluations using VAMs, also suspect as per the board. In the state of New Mexico, test scores are to make up 50% of all test-eligible teachers’ evaluation scores. She also cited the American Statistical Association’s (ASA) recently released Position Statement on VAMs (discussed here and directly accessible here).

A fellow board member criticized Kort for her language choices that he defined as “poor.” He and a few other members also criticized her for not wanting to continue to work with the state that is pushing VAMs as working and negotiating with the state is in their best interests. Thereafter, “[t]he board passed a resolution calling on itself to draft a resolution addressing its concerns with the PARCC exam and its role in teacher evaluations and school grades. The board is looking to get feedback from parents and teachers before drafting its resolution.”

Recent Gallup Poll on Teacher Perceptions

In an “Update” published by the Gallup public opinion polling company titled “Teachers Favor Common Core Standards, Not the Testing,” some interesting data can be found regarding current teachers’ perceptions about some currently hot reform topics in education. These Gallup data include, as pertinent here, self-report data on how teachers feel about using students’ test scores to hold them accountable.

Here are what I see as the most interesting findings, again as pertinent here. For more information, however, do click here.

  • 76% of America’s public school teachers “reacted positively” to the primary goal of the Common Core State Standards (i.e., to have all states use the same set of academic standards for reading, writing and math in grades K-12).
  • 9% of America’s public school teachers were in favor of linking students’ test scores to teacher evaluations.
  • 89% of America’s public school teachers viewed linking students’ test scores to teacher evaluations negatively, and 2% reported having “no opinion” on the matter.
  • 78% of America’s public school teachers agreed that the tests to be used in general and also to hold them accountable for their effectiveness (e.g., using VAMs) take too much time away from teaching.

As one teacher who helped Gallup to interpret these findings put it: “The [Common Core] standards were positive until standardized testing was involved.”

As for linking students’ test scores to teacher performance, another teacher elaborated on this by stating, “Student populations are not spread evenly among teachers. Some students will have learning disabilities and some will have familial and environmental factors that do not allow them to progress at the same rate as others. If I have to be so concerned about a test score, I cannot address the needs of individual students.” And a Texas high school teacher concludes, “We must have different yardsticks for different student circumstances so teachers can focus on individual student growth.”

The teachers who participated in this survey research study included a random sample of 854 public K-12 school teachers, aged 18 and older, with Internet access, living in all 50 U.S. states and the District of Columbia.

“The Test Matters”

Last month in Educational Researcher (ER), researchers Pam Grossman (Stanford University), Julie Cohen (University of Virginia), Matthew Ronfeldt (University of Michigan), and Lindsay Brown (Stanford University) published a new article titled: “The Test Matters: The Relationship Between Classroom Observation Scores and Teacher Value Added on Multiple Types of Assessment.” Click here to view a prior version of this document, pre-official-publication in ER.

Building upon what the research community has consistently evidenced in terms of the generally modest to low correlations (or relationships) consistently being observed between observational data and VAMs, researchers in this study set out to investigate these relationships a bit further, and in more depth.

Using the Bill & Melinda Gates Foundation-funded Measures of Effective Teaching (MET) data, researchers examined the extent to which the (cor)relationships between one specific observation tool designed to assess instructional quality in English/language arts (ELA) – the Protocol for Language Arts Teaching Observation (PLATO) – and value-added model (VAM) output changed when two different tests were used to assess student achievement using VAMs. One set of tests included the states’ tests (depending on where teachers in the sample were located by state), and the other test was the Stanford Achievement Test (SAT-9) Open-Ended ELA test on which students are required not only to answer multiple choice items but to also construct responses to open or free response items. This is more akin to what is to be expected with the new Common Core tests, hence, this is significant looking forward.

Researchers found, not surprisingly given the aforementioned prior research (see, for example, Papay’s similar and seminal 2010 work here), that the relationship between teachers’ PLATO observational scores and VAM output scores varied given which of the two aforementioned tests was used to calculate VAM output. In addition, researchers found that both sets of correlation coefficients were low, which also aligns with current research, but in this case it might be more accurate to say the correlations they observed were very low.

More descriptively, PLATO was more highly correlated with the VAM scores derived using the SAT-9 (r = 0.16) versus the states’ tests (r = 0.09), and these differences were statistically significant. These differences also varied by the type of teacher, the type of “teacher effectiveness” domain observed, etc. To read more about the finer distinctions the authors make in this regard, please click, again, here.

These results do also provide “initial evidence that SAT-9 VAMs may be more sensitive to the kinds of instruction measured by PLATO.” In other words, the SAT-9 test may be more sensitive than the state tests all states currently use (post the implementation of No Child Left Behind [NCLB] in 2002). These are the tests that are to align “better” with state standards, but on which multiple-choice items (that oft-drive multiple-choice learning activities) are much more, if not solely, valued. This is an interesting conundrum, to say the least, as we anticipate the release of the “new and improved” set of Common Core tests.

Researchers also found some preliminary evidence that teachers who scored relatively higher on the “Classroom Environment” observational factor demonstrated “better” value-added scores across tests. Note here again, however, that the correlation coefficients demonstrated here were also quite weak (r = 0.15 for the SAT-9 VAM and r = 0.13 for the state VAM). Researchers also found preliminary evidence that only the VAM that used the SAT-9 test scores was significantly related to the “Cognitive and Disciplinary Demand” factor, although again the correlation coefficient was also very weak (r = 0.12 for the SAT-9 VAM and r = 0.05 for the state VAM).
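For readers who want a feel for just how weak these correlations are, here is a minimal sketch (mine, not the authors’) that squares each reported r to get the proportion of variance the two measures share. The r values are those reported above; the labels and the function name `shared_variance` are my own, for illustration only.

```python
# Correlations as reported in this post (PLATO observation scores vs. VAM output).
# The dictionary labels are illustrative shorthand, not the study's own terms.
reported_r = {
    "PLATO overall vs. SAT-9 VAM": 0.16,
    "PLATO overall vs. state-test VAM": 0.09,
    "Classroom Environment vs. SAT-9 VAM": 0.15,
    "Classroom Environment vs. state-test VAM": 0.13,
    "Cognitive/Disciplinary Demand vs. SAT-9 VAM": 0.12,
    "Cognitive/Disciplinary Demand vs. state-test VAM": 0.05,
}


def shared_variance(r: float) -> float:
    """Return r squared: the proportion of variance two measures share."""
    return r ** 2


for label, r in reported_r.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.1%}")
```

Even the strongest relationship here (r = 0.16) works out to under 3% shared variance, which is why “very low” seems the fairer description.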

That being said, the “big ideas” I think we can take away from this study are namely that:

  1. Which tests we use to construct value-added scores matters because different tests yield different results (see also the Papay 2010 reference cited above). Accordingly, “researchers and policymakers [DO] need to pay careful attention to the assessments used to measure student achievement in designing teacher evaluation systems” as these decisions will [emphasis added] yield different results.
  2. Related, certain tests are going to be more instructionally sensitive than others. Click here and here for two prior posts about this topic.
  3. In terms of using observational data, in general, “[p]erhaps the best indicator[s] of whether or not students are learning to engage in rigorous academic discourse…[should actually come from] classroom observations that capture the extent to which students are actually participating in [rigorous/critical] discussions” and their related higher-order thinking activities. This line of thinking aligns, as well, with a recent post on this blog titled “Observations: ‘Where Most of the Action and Opportunities Are.’”

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week the Texas Observer published an article, titled “Mute the Messenger,” about University of Texas – Austin’s Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests as supported by Pearson were flawed, as per their purposes to measure teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. This article was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert – Professor Emeritus W. James Popham) a test is able to differentiate between well taught and poorly taught students, versus able to differentiate between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school that students bring with them to the schoolhouse door).

Testing developers like Pearson seem to focus on the former, that their tests are indeed sensitive to instruction, while testing/measurement academics and especially practitioners seem to focus on the latter, that tests are sensitive to instruction, but not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert – Regents Professor David Berliner) sensitive to instruction but more importantly sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. Here seems to be where this now very heated and polarized argument between Pearson and Associate Professor Stroup stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT), while defending its tests as “instructionally sensitive.” IRT is used to examine things like p-values (or essentially proportions of students who respond to items correctly) and item-discrimination indices (to see if test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] certain things otherwise). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.
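To make these two statistics a bit more concrete, here is a minimal sketch under my own assumptions. The response data are made up, and the discrimination index is the simple “upper half minus lower half” version from classical item analysis, not whatever Pearson actually computes. The names `responses`, `item_p_values`, and `discrimination_indices` are hypothetical.

```python
from typing import List

# Hypothetical data: each row is a student, each column an item (1 = correct).
responses: List[List[int]] = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]


def item_p_values(data: List[List[int]]) -> List[float]:
    """Proportion of students answering each item correctly."""
    n_students = len(data)
    n_items = len(data[0])
    return [sum(row[j] for row in data) / n_students for j in range(n_items)]


def discrimination_indices(data: List[List[int]]) -> List[float]:
    """Upper-lower index: p in the top-scoring half minus p in the bottom half."""
    ranked = sorted(data, key=sum, reverse=True)  # highest total scores first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    n_items = len(data[0])
    return [
        sum(row[j] for row in upper) / half - sum(row[j] for row in lower) / half
        for j in range(n_items)
    ]


for j, (p, d) in enumerate(zip(item_p_values(responses),
                               discrimination_indices(responses)), start=1):
    print(f"Item {j}: p-value = {p:.2f}, discrimination = {d:.2f}")
```

In this toy data set the first item is answered correctly by every student, so its p-value is 1.0 and its discrimination index is 0: it cannot separate higher from lower scorers, no matter how well it was taught.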

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” From what I know about Dr. Stroup’s actual claims, however, this trivializes his overall arguments. Tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g., as per items’ aforementioned p-values). This alone makes tests less “instructionally sensitive” in practice. In other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if using IRT deems an item as “too easy,” even if the item is taught well (i.e., “instructionally sensitive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now more than before held accountable for teaching these items), and (2) more “instructionally sensitive” in the eyes of test developers, in that the fewer students who get test items correct, the better the items are at discriminating between those who know (or are taught) certain things and those who don’t know (or are not taught) certain things otherwise.

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., “instructionally sensitive”), “too many” students answer such items correctly. Sometimes these items yield p-values greater than p=0.80, or 80% correct.

When you need a test and its outcome score data to fit around the bell curve, you cannot have such items, or too many of such items, on the final test. In the simplest of terms, for every item with a p-value of 80% you would need another with a p-value of 20% to balance items out, or keep the overall mean of each test around p=0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around such a mean; otherwise the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) certain things otherwise. Such items (i.e., with high p-values) do not always distribute scores well enough because “too many students” answering such items correctly reduces the variation (or spread of scores) needed.
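The balancing act described above can be put into numbers. For an item scored right/wrong, the variance of the item scores is p × (1 − p), which is largest at p=0.50 and shrinks as an item gets easier (or harder). The sketch below is only that simple arithmetic, with hypothetical function names; it is not a description of how any testing company actually assembles a test.

```python
def item_variance(p):
    """Variance of a right/wrong (0/1) item with proportion correct p."""
    return p * (1 - p)

def mean_difficulty(p_values):
    """Average p-value across a set of items."""
    return sum(p_values) / len(p_values)

# How much spread a single item contributes at different p-values.
for p in (0.20, 0.50, 0.80, 0.95):
    print(f"p = {p:.2f} -> item variance = {item_variance(p):.4f}")

# One easy item (80% correct) paired with one hard item (20% correct)
# pulls the average back toward the middle.
print(f"mean p-value of the pair: {mean_difficulty([0.80, 0.20]):.2f}")
```

Put differently, an item that 95% of students answer correctly adds very little spread to the score distribution, no matter how well it was taught.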

The counter-item in this case is another item also meant to capture statistics and/or data display, but one that is much more difficult, largely because it’s rarely taught because it rarely matters in the real world. Take, for example, the box and whisker plot. If you don’t know what this is, which is in and of itself telling in this example, see it described and illustrated here. Often, this item IS found on such tests because this item IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) certain things otherwise.

Because this item is not as often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test”), and because this item doesn’t really matter in the real world, it becomes an item that is more useful for the test, as well as the overall functioning of the test, than it is an item that is useful for the students tested on it.

A side bar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of the resulting doctoral students’ scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned the best as per our IRT analysis? The box and whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many similar items, had a lower relative p-value, and accordingly helped to increase the difficulty of the test and discriminate results to assert a purported “instructional sensitivity,” regardless of whether the item was actually valued, and more importantly valued in instruction.

Thanks to IRT, the items often left on such tests are not often the items taught by teachers, or perhaps taught by teachers well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes – the tests are aligned with the standards as such items are in the standards, but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.