Tennessee’s Senator Lamar Alexander to “Fix” No Child Left Behind (NCLB)

In The Tennessean this week was an article about how Lamar Alexander – the new chairman of the US Senate Education Committee, the current US Senator from Tennessee, and the former US Secretary of Education under former President George Bush Senior – is planning to “fix” No Child Left Behind (NCLB). See also another Tennessean article about this from last week, here. This is good news, especially given that it is coming from Tennessee – one of our states to watch given its history of value-added reforms tied to NCLB tests.

NCLB, even though it has not been reauthorized since 2002 (despite a failed attempt in 2007), is still foundational to current test-based reforms, including those that tie federal funds to the linking of students’ (NCLB) test scores to their teachers to measure teacher-level value-added. To date, for example, 42 states have been granted NCLB waivers exempting them from the requirement that 100% of their students reach proficiency in mathematics and reading/language arts by 2014, provided those states adopted and implemented federal reforms based on Common Core tests and began tying test scores to teacher quality using growth/value-added measures. This was in exchange for the federal educational funds on which states also rely.

Well, Lamar apparently believes that this is precisely what needs fixing. “The law has become unworkable,” Alexander said. “States are struggling. As a result, we need to act.”

More specifically, Senator Alexander wants to:

  • take power away from the US Department of Education, which he often refers to as our country’s “national school board.”
  • prevent the US Department of Education from further pressuring schools to adopt certain tests and standards.
  • prevent current US Secretary of Education Duncan from continuing to hold “states over a barrel,” forcing them to do what he has wanted them to do to avoid being labeled failures.
  • “set realistic goals, keep the best portions of the law, and restore to states and communities the responsibility to decide whether schools and teachers are succeeding or failing.” Alexander has, however, remained somewhat neutral regarding the future role of tests.

“Are there too many? Are they redundant? Are they the right tests? I’m open on the question,” Alexander said.

As the journalist who wrote this article notes, however, herein lies the biggest concern with this (potentially) big win for education: “There is broad agreement that students should be tested less, but what agency wants to relinquish the ability to hold teachers, administrators and school districts accountable for the money we [as a public] spend on education?” Current US Secretary of Education Duncan, for example, believes that if we don’t continue down his envisioned path, “we [will] turn back the clock on educational progress, 15 years or more.” This from a man who has built his political career on the claim that educators have made no educational progress; hence, in his view, the reason we need to hold teachers accountable for that lack of progress.

We shall see how this one goes, I guess.

For an excellent “Dear Lamar” letter from Diane Ravitch – Lamar Alexander’s former Assistant Secretary of Education, who served under him when he was US Secretary of Education under former President Bush Senior – click here. It’s a wonderful letter, wonderful because it includes so many recommendations highlighting what could be done at this potential turning point in federal policy.

Is Combining Different Tests to Measure Value-Added Valid?

A few days ago, on Diane Ravitch’s blog, a person posted the following comment, which Diane sent to me for a response:

“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”

Here’s what I wrote in response, which I’m sharing here with you all because this practice is becoming increasingly common across the country; hence, it is a question of increasingly high priority and interest:

“We have three issues here when equating different tests SIMPLY because these tests test the same students on the same things around the same time.

First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.

Second, doing this ASSUMES that all standardized tests are vertically scaled, whereby scales increase linearly as students progress through different grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry, without any real theoretical or other foundation in support.

Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates range WIDELY across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”
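To make concrete what the “norming” mentioned above can and cannot do, here is a minimal sketch (the test names and scores are entirely hypothetical): a linear z-score transformation gives any two sets of scores the same mean and standard deviation, but it is only a change of units; it cannot make tests that sample different content, or that sit on different scales, interchangeable.

```python
from statistics import mean, stdev

def standardize(scores):
    """Linearly rescale scores to mean 0, standard deviation 1 (z-scores)."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

# Entirely hypothetical scores from two tests on very different scales.
iowa_scores = [155, 162, 170, 148, 181, 166]
stanford_scores = [612, 655, 701, 590, 740, 670]

iowa_z = standardize(iowa_scores)
stanford_z = standardize(stanford_scores)

# After "norming," both sets share the same mean (0) and SD (1)...
print(round(mean(iowa_z), 6), round(stdev(iowa_z), 6))
print(round(mean(stanford_z), 6), round(stdev(stanford_z), 6))
# ...but this is only a linear change of units: the rank order within each
# test is untouched, and nothing about the rescaling makes two tests that
# sample different content equivalent measures of the same construct.
```

This is the “statistical wizardry” at issue: the output looks comparable, but the comparability is an artifact of the rescaling, not evidence that the tests measure the same thing on the same scale.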

See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.

Teacher Evaluation and Accountability Alternatives, for a New Year

At the beginning of December I wrote a post about Diane Ravitch’s really nice piece, published in the Huffington Post, about what she views as a much better paradigm for teacher evaluation and accountability. Diane has since posted another piece on similar alternatives, although this one was written by teachers themselves.

I thought sharing this was more than appropriate, especially given that a New Year is upon us. While it might very well be wishful thinking, perhaps at least some of our state policy makers might be willing to think in new ways about what really could be new and improved teacher evaluation systems. Cheers to that!

The main point here, though, is that alternatives do, indeed, exist. Likewise, it’s not that teachers do not want to be held accountable for, and evaluated on, that which they do; rather, they want whatever systems are in place (formal or informal) to be appropriate, professional, and fair. How about that for a policy-based resolution?

This is from Diane’s post: The Wisdom of Teachers: A New Vision of Accountability.

Anyone who criticizes the current regime of test-based accountability is inevitably asked: What would you replace it with? Test-based accountability fails because it is based on a lack of trust in professionals. It fails because it confuses measurement with instruction. No doctor ever said to a sick patient, “Go home, take your temperature hourly, and call me in a month.” Measurement is not a treatment or a cure. It is measurement. It doesn’t close gaps: it measures them.

Here is a sound alternative approach to accountability, written by a group of teachers whose collective experience is 275 years in the classroom. Over 900 teachers contributed ideas to the plan. It is a new vision that holds all actors responsible for the full development and education of children, acknowledging that every child is a unique individual.

Its key features:

  • Shared responsibility, not blame
  • Educate the whole child
  • Full and adequate funding for all schools, with less emphasis on standardized testing
  • Teacher autonomy and professionalism
  • A shift from evaluation to support
  • Recognition that in education one size does not fit all

The Nation’s High School Principals (Working) Position Statement on VAMs

The Board of Directors of the National Association of Secondary School Principals (NASSP) officially released a working position statement on VAMs, which was also recently referenced in an Education Week article here (“Principals’ Group Latest to Criticize ‘Value Added’ for Teacher Evaluations“) and a Washington Post post here (“Principals Reject ‘Value-Added’ Assessment that Links Test Scores to Educators’ Jobs“).

I have pasted this statement below and also link to it here. The position’s highlights follow, as summarized in the above links and in the position statement itself:

  • “[T]est-score-based algorithms for measuring teacher quality aren’t appropriate.”
  • “[T]he timing for using [VAMs] comes at a terrible time, just as schools adjust to demands from the Common Core State Standards and other difficult new expectations for K-12 students.”
  • “Principals are concerned that the new evaluation systems are eroding trust and are detrimental to building a culture of collaboration and continuous improvement necessary to successfully raise student performance to college and career-ready levels.”
  • “Value-added systems, the statement concludes, should be used to measure school improvement and help determine the effectiveness of some programs and instructional methods; they could even be used to tailor professional development. But they shouldn’t be used to make ‘key personnel decisions’ about individual teachers.”
  • “[P]rincipals often don’t use value-added data even where it exists, largely because a lot of them don’t trust it.”
  • The position statement also quotes Mel Riddile, a former National Principal of the Year and chief architect of the NASSP statement, who says: “We are using value-added measurement in a way that the science does not yet support. We have to make it very clear to policymakers that using a flawed measurement both misrepresents student growth and does a disservice to the educators who live the work each day.”

See also two other great blog posts re: the potential impact the NASSP’s working statement might/should have on America’s current VAM situation. The first external post comes from the blog “curmudgucation” and discusses in great detail the highlights of the NASSP’s statement. The second is a guest post on Diane Ravitch’s blog.

Below, again, is the full statement as per the website of the NASSP:

Purpose

To determine the efficacy of the use of data from student test scores, particularly in the form of Value-Added Measures (VAMs), to evaluate and to make key personnel decisions about classroom teachers.

Background

Currently, a number of states either are adopting or have adopted new or revamped teacher evaluation systems, which are based in part on data from student test scores in the form of value-added measures (VAM). Some states mandate that up to fifty percent of the teacher evaluation must be based on data from student test scores. States and school districts are using the evaluation systems to make key personnel decisions about retention, dismissal and compensation of teachers and principals.

At the same time, states have also adopted and are implementing new, more rigorous college- and career standards. These new standards are intended to raise the bar from having every student earn a high school diploma to the much more ambitious goal of having every student be on-target for success in post-secondary education and training.

The assessments accompanying these new standards depart from the old, much less expensive, multiple-choice style tests to assessments that include constructed responses. These new assessments demand higher-order thinking and up to a two-year increase in expected reading and writing skills. Not surprisingly, the newness of the assessments, combined with increased rigor, has resulted in significant drops in the number of students reaching “proficient” levels on assessments aligned to the new standards.

Herein lies the challenge for principals and school leaders. New teacher evaluation systems demand the inclusion of student data at a time when scores on new assessments are dropping. The fears accompanying any new evaluation system have been magnified by the inclusion of data that will get worse before it gets better. Principals are concerned that the new evaluation systems are eroding trust and are detrimental to building a culture of collaboration and continuous improvement necessary to successfully raise student performance to college and career-ready levels.

Specific questions have arisen about using value-added measurement (VAM) to retain, dismiss, and compensate teachers. VAMs are statistical measures of student growth. They employ complex algorithms to figure out how much teachers contribute to their students’ learning, holding constant factors such as demographics. And so, at first glance, it would appear reasonable to use VAMs to gauge teacher effectiveness. Unfortunately, policy makers have acted on that impression over the consistent objections of researchers who have cautioned against this inappropriate use of VAM.

In a 2014 report, the American Statistical Association urged states and school districts against using VAM systems to make personnel decisions. A statement accompanying the report pointed out the following:

  • “VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
  • VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
  • Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
  • VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools.
  • Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.”

Another peer-reviewed study funded by the Gates Foundation and published by the American Educational Research Association (AERA) stated emphatically, “Value-Added Performance Measures Do Not Reflect the Content or Quality of Teachers’ Instruction.” The study found that “state tests and these measures of evaluating teachers don’t really seem to be associated with the things we think of as defining good teaching.” It further found that some teachers who were highly rated on student surveys, classroom observations by principals and other indicators of quality had students who scored poorly on tests. The opposite also was true. “We need to slow down or ease off completely for the stakes for teachers, at least in the first few years, so we can get a sense of what do these things measure, what does it mean,” the researchers admonished. “We’re moving these systems forward way ahead of the science in terms of the quality of the measures.”

Researcher Bruce Baker cautions against using VAMs even when test scores count for less than fifty percent of a teacher’s final evaluation. Using VAM estimates in a parallel weighting system with other measures like student surveys and principal observations “requires that VAM be considered even in the presence of a likely false positive. NY legislation prohibits a teacher from being rated highly if their test-based effectiveness estimate is low. Further, where VAM estimates vary more than other components, they will quite often be the tipping point – nearly 100% of the decision even if only 20% of the weight.”
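Baker’s “tipping point” is simple variance arithmetic, and a quick simulation can illustrate it (the standard deviations below are assumed purely for illustration, not drawn from any real evaluation system): when the VAM component varies much more than the other components, a nominal 20% weight can still drive almost all of the variation in the final rating.

```python
import random
from statistics import mean, stdev

random.seed(0)
n = 1000

# Hypothetical scores for n teachers: VAM estimates vary far more
# (sd = 10) than the observation/survey composite (sd = 1).
vam = [random.gauss(0, 10) for _ in range(n)]
obs = [random.gauss(0, 1) for _ in range(n)]

# Nominal weights: VAM counts for only 20% of the final composite.
composite = [0.2 * v + 0.8 * o for v, o in zip(vam, obs)]

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)

# The composite tracks the high-variance VAM component far more closely
# than it tracks the 80%-weighted observation component.
print("corr(composite, VAM):", round(corr(composite, vam), 2))
print("corr(composite, obs):", round(corr(composite, obs), 2))
```

Under these assumed numbers the theoretical correlation between the composite and the VAM component is about 0.93 despite its 20% weight, which is Baker’s point: weights apply to the components as scaled, so the noisiest component dominates unless the components are first put on comparable scales.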

Stanford’s Edward Haertel takes the objection for using VAMs for personnel decisions one step further: “Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. The information they provide is simply not good enough to use in that way. It is not just that the information is noisy. Much more serious is the fact that the scores may be systematically biased for some teachers and against others, and major potential sources of bias stem from the way our school system is organized. No statistical manipulation can assure fair comparisons of teachers working in very different schools, with very different students, under very different conditions.”

Still other researchers believe that VAM is flawed at its very foundation. Linda Darling-Hammond et al. point out that the use of test scores via VAMs assumes “that student learning is measured by a given test, is influenced by the teacher alone, and is independent from the growth of classmates and other aspects of the classroom context. None of these assumptions is well supported by current evidence.” Other factors, including class size, instructional time, home support, peer culture, and summer learning loss, impact student achievement. Darling-Hammond points out that VAMs are inconsistent from class to class and year to year. VAMs are based on the false assumption that students are randomly assigned to teachers. VAMs cannot account for the fact that “some teachers may be more effective at some forms of instruction…and less effective in others.”

Guiding Principles

  • As instructional leader, “the principal’s role is to lead the school’s teachers in a process of learning to improve teaching, while learning alongside them about what works and what doesn’t.”
  • The teacher evaluation system should aid the principal in creating a collaborative culture of continuous learning and incremental improvement in teaching and learning.
  • Assessment for learning is critical to continuous improvement of teachers.
  • Data from student test scores should be used by schools to move students to mastery and a deep conceptual understanding of key concepts as well as to inform instruction, target remediation, and to focus review efforts.
  • NASSP supports recommendations for the use of “multiple measures” to evaluate teachers as indicated in the 2014 “Standards for Educational and Psychological Testing” measurement standards released by leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).

Recommendations

  • Successful teacher evaluation systems should employ “multiple classroom observations across the year by expert evaluators looking to multiple sources of data, and they provide meaningful feedback to teachers.”
  • Districts and States should encourage the use of Peer Assistance and Review (PAR) programs, in which expert mentor teachers support novice teachers and struggling veteran teachers; such programs have been proven to be an effective system for improving instruction.
  • States and Districts should allow the use of teacher-constructed portfolios of student learning, which are being successfully used as a part of teacher evaluation systems in a number of jurisdictions.
  • VAMs should be used by principals to measure school improvement and to determine the effectiveness of programs and instructional methods.
  • VAMs should be used by principals to target professional development initiatives.
  • VAMs should not be used to make key personnel decisions about individual teachers.
  • States and Districts should provide ongoing training for Principals in the appropriate use of student data and VAMs.
  • States and Districts should make student data and VAMs available to principals at a time when decisions about school programs are being made.
  • States and Districts should provide the resources and time principals need in order to make the best use of data.

A New Paradigm for Accountability

Diane Ravitch recently published in the Huffington Post a really nice piece about what she views as a much better paradigm for accountability — one based on much better indicators than large scale standardized test scores. This does indeed offer a much better and much more positive and supportive accountability alternative to that with which we have been “dealing” for the last, really, 30 years.

The key components of this new paradigm, as taken from the full post titled “A New Paradigm for Accountability: The Joy of Learning,” are pasted below. I would recommend giving the article a full read, instead or in addition, as the way Diane frames her reasoning around this list is also important to understand. Click here to see the full article on the Huffington Post website. Otherwise, here’s her paradigm:

The new accountability system would be called No Child Left Out. The measures would be these:

  • How many children had the opportunity to learn to play a musical instrument?
  • How many children had the chance to play in the school band or orchestra?
  • How many children participated in singing, either individually or in the chorus or a glee club or other group?
  • How many public performances did the school offer?
  • How many children participated in dramatics?
  • How many children produced documentaries or videos?
  • How many children engaged in science experiments? How many started a project in science and completed it?
  • How many children learned robotics?
  • How many children wrote stories of more than five pages, whether fiction or nonfiction?
  • How often did children have the chance to draw, paint, make videos, or sculpt?
  • How many children wrote poetry? Short stories? Novels? History research papers?
  • How many children performed service in their community to help others?
  • How many children were encouraged to design an invention or to redesign a common item?
  • How many students wrote research papers on historical topics?

Can you imagine an accountability system whose purpose is to encourage and recognize creativity, imagination, originality, and innovation? Isn’t this what we need more of?

Well, you can make up your own metrics, but you get the idea. Setting expectations in the arts, in literature, in science, in history, and in civics can change the nature of schooling. It would require far more work and self-discipline than test prep for a test that is soon forgotten.

My paradigm would dramatically change schools from Gradgrind academies to halls of joy and inspiration, where creativity, self-discipline, and inspiration are nurtured, honored, and valued.

This is only a start. Add your own ideas. The sky is the limit. Surely we can do better than this era of soul-crushing standardized testing.

Education-Related Election Updates

Great news! The state of Missouri did not pass its “VAM with a Vengeance” constitutional amendment, which proposed to tie teacher evaluation to student test scores and to require that teachers be dismissed, retained, promoted, demoted, and paid based primarily on the test scores of their students. It would also have required teachers to enter into contracts of three years or less, eliminating seniority and tenure. Well done, voters of Missouri!!

Uncertain news! My state of Arizona is still counting ballots to determine who will be our new State Superintendent of Public Instruction. For the first time in history, we have somebody running who actually knows a lot about education – Dr. David Garcia – an associate professor colleague of mine at Arizona State University who also studies educational policy. He would also be the first Latino elected to a state office in over 30 years. He is running as a Democrat, though, and in a very red state, wins for candidates like him have often proved impossible.

His opponent is Diane Douglas, a former school board member who focused her entire campaign on protesting the Common Core Standards and the federal government’s involvement in state-level educational policies. She ran a low-key campaign, dodged the media spotlight, and avoided all but one debate with David, during which David ate her lunch. BUT, lucky for her, she ran on the Republican ticket. While there are still 300,000 votes to be counted, she has a narrow lead, which is making many on both sides very upset with those in our state who perpetually vote along the party line. Even with David drawing some very significant endorsements from Republicans and conservative business groups throughout this red state, the state’s conservative voters continue to dominate. Many agree on this note and “see her success as part of the decisive Republican sweep and not a triumph of ideology.” So very sad for our state is this one. The next move might just be to amend the state constitution and make this a state-appointed position, in that David is by far the best and most informed candidate our state has likely ever had in the running (anonymous phone conversation, this morning ;).

Unfortunate news! As per a recent post by Diane Ravitch, the other election news as related to education was bad. There were many victories for similar people “who hold the public sector in contempt and believe in Social Darwinism. It was a bad night for those who hope for a larger vision of the common good, some vision grander than each one on his own. American history and politics are cyclical. It may require the excesses of this time to bring a turn of the wheel. It’s always darkest just before dawn.”

“VAM with a Vengeance” on the Ballot in Missouri

On this election day, the state of Missouri is worth mentioning as it has a very relevant constitutional amendment on its ballot for the day. As featured on Diane Ravitch’s blog a few days ago here, this (see below) is “the worst constitutional amendment to appear on any state ballot in 2014.”


“It ties teacher evaluation to student test scores. It bans collective bargaining about teacher evaluation. It requires teachers to be dismissed, retained, promoted, demoted, and paid based primarily on the test scores of their students. It requires teachers to enter into contracts of three years or less, thus eliminating seniority and tenure.

This is VAM with a vengeance.

This ballot resolution is the work of the far-right Show-Me Institute, funded by the multi-millionaire Rex Sinquefeld. He is a major contributor to politics in Missouri and to ALEC.

The Center for Media and Democracy writes about him:

‘Sinquefield is doing to Missouri what the Koch Brothers are doing to the entire country. For the Koch Brothers and Sinquefield, a lot of the action these days is not at the national but at the state level.’

‘By examining what Sinquefield is up to in Missouri, you get a sobering glimpse of how the wealthiest conservatives are conducting a low-profile campaign to destroy civil society.’

‘Sinquefield told The Wall Street Journal in 2012 that his two main interests are “rolling back taxes” and “rescuing education from teachers’ unions.’

‘His anti-tax, anti-labor, and anti-public education views are common fare on the right. But what sets Sinquefield apart is the systematic way he has used his millions to try to push his private agenda down the throats of the citizens of Missouri.”

Will the Obama Administration Continue to Stay the Course?

According to a “recent” post by Diane Ravitch, President Obama “recently” selected Robert Gordon, an economist and “early proponent of VAMs,” for a top policy post in the US Department of Education. “This is a very important position” as “he will be the person in charge of the agency that basically decides what is working, what is not, and which way to go next with [educational] policy.”

Diane writes: “When he worked in the Office of Management and Budget, Gordon helped to develop the priorities for the controversial Race to the Top program. Before joining the Obama administration, he worked for Joel Klein in the New York City Department of Education. An economist, Gordon was [also] lead author of an influential paper in 2006 that helped to put value-added-measurement at the top of the “reformers” policy agenda. That paper, called “Identifying Effective Teachers Using Performance on the Job,” was co-authored by Thomas J. Kane and Douglas O. Staiger. Kane became the lead adviser to the Gates Foundation in developing its “Measures of Effective Teaching,” which has spent hundreds of millions of dollars trying to develop the formula for the teacher who can raise test scores consistently. Gordon went on to Obama’s Office of Management and Budget, which is the U.S. government’s lead agency for determining budget priorities.”

Here is the abstract from this paper:

Traditionally, policymakers have attempted to improve the quality of the teaching force by raising minimum credentials for entering teachers. Recent research, however, suggests that such paper qualifications have little predictive power in identifying effective teachers. We propose federal support to help states measure the effectiveness of individual teachers—based on their impact on student achievement, subjective evaluations by principals and peers, and parental evaluations. States would be given considerable discretion to develop their own measures, as long as student achievement impacts (using so-called “value-added” measures) are a key component. The federal government would pay for bonuses to highly rated teachers willing to teach in high-poverty schools. In return for federal support, schools would not be able to offer tenure to new teachers who receive poor evaluations during their first two years on the job without obtaining district approval and informing parents in the schools. States would open further the door to teaching for those who lack traditional certification but can demonstrate success on the job [Gordon’s wife worked for TFA]. This approach would facilitate entry into teaching by those pursuing other careers. The new measures of teacher performance would also provide key data for teachers and schools to use in their efforts to improve their performance.

Diane writes again: “Much has happened since Gordon, Kane, and Staiger speculated about how to identify effective teachers by performance measures such as student test scores. We now have evidence that these measures are fraught with error and instability. We now have numerous examples where teachers are evaluated based on the scores of students they never taught. We have numerous examples of teachers rated highly effective one year, but ineffective the next year, showing that what mattered most was the composition of their class, not their quality or effectiveness. Just recently, the American Statistical Association said: “Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality.” In a joint statement, the National Academy of Education and the American Educational Research Association warned about the defects and limitations of VAM and showed that most of the factors that determine test scores are beyond the control of teachers. Numerous individual scholars have taken issue with the naive belief that teacher quality can be established by the test scores of their students, even when the computer matches as many variables as it can find.”

So, will the Obama administration continue to stay its now well-established course? Diane, having met Gordon and “knowing him to be a very smart person,” is betting, and hoping, that he might come to his senses, given the research that has burgeoned since his initial report was released in 2006. Gordon might just help the Obama administration change course, not stay it, we all sincerely hope.

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week the Texas Observer published an article, titled “Mute the Messenger,” about University of Texas at Austin Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests, as supported by Pearson, are flawed, at least as per their purported purpose: to measure teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. It was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of, Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert – Professor Emeritus W. James Popham) a test is able to differentiate between well taught and poorly taught students, versus able to differentiate between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school that students bring with them to the schoolhouse door).

Test developers like Pearson seem to focus on the former, that their tests are indeed sensitive to instruction. Testing/measurement academics and especially practitioners, meanwhile, seem to focus on the latter: tests are sensitive to instruction, but not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert – Regents Professor David Berliner) sensitive to instruction but, more importantly, sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. Here seems to be where this now very heated and polarized argument between Pearson and Associate Professor Stroup stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT), while defending its tests as “instructionally sensitive.” IRT is used to examine things like p-values (essentially the proportions of students who respond to items correctly) and item-discrimination indices (to see whether test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] those things otherwise). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.
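To make the p-value and discrimination logic concrete, here is a minimal sketch in Python, with invented toy data. It illustrates classical item analysis in spirit (an item’s p-value and a simple upper-lower discrimination index), not Pearson’s actual, and far more complex, IRT pipeline; all names and numbers here are my own for illustration only.

```python
# Illustrative sketch only (invented data, not any actual test):
# classical item statistics of the kind described above.

def p_value(item_scores):
    """Proportion of students answering the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)

def discrimination(item_scores, total_scores):
    """Upper-lower discrimination index: the item's p-value among the
    top-scoring half of students minus its p-value among the bottom half."""
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    half = len(ranked) // 2
    low = [item_scores[i] for i in ranked[:half]]
    high = [item_scores[i] for i in ranked[-half:]]
    return p_value(high) - p_value(low)

# Toy data: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
totals = [sum(row) for row in responses]

for j in range(3):
    item = [row[j] for row in responses]
    print(f"item {j}: p = {p_value(item):.2f}, D = {discrimination(item, totals):.2f}")
```

Items with discrimination indices near zero are the ones flagged as not “working,” regardless of whether the item’s content was well taught.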

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” From what I know about Dr. Stroup’s actual claims, however, this trivializes his overall argument: tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g., as per items’ aforementioned p-values). This alone makes tests less “instructionally sensitive” in practice. In other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if using IRT deems an item “too easy,” even if the item is taught well (i.e., “instructionally sensitive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now more than before held accountable for teaching these items), and (2) more “instructionally sensitive” in the eyes of test developers, in that the fewer students who answer an item correctly, the better the item discriminates between those who know (or are taught) certain things and those who don’t know (or are not taught) those things otherwise.

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., “instructionally sensitive”) “too many” students answer such items correctly. Sometimes these items yield p-values greater than p=0.80 or 80% correct.

When you need a test and its outcome score data to fit around the bell curve, you cannot have such items, or too many of them, on the final test. In the simplest of terms, for every item with a p-value of 80% you would need another with a p-value of 20% to balance items out, or keep the overall mean of each test around p=0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around such a mean; otherwise the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) those things otherwise. Items with high p-values do not always distribute scores well enough, because “too many students” answering such items correctly reduces the variation (or spread of scores) needed.
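A quick back-of-the-envelope sketch of this balancing act, with invented p-values (these numbers come from me, not from any actual test): pairing “easy,” well-taught items with difficult counterweight items keeps the mean item p-value near 0.50, which is also where a 0/1 item’s score variance, p(1 − p), is at its maximum.

```python
# Toy illustration of the balancing act described above (invented p-values).
easy = [0.85, 0.80, 0.78]   # items "too many" students answer correctly
hard = [0.22, 0.20, 0.15]   # difficult counterweight items
pool = easy + hard

mean_p = sum(pool) / len(pool)
item_variances = [p * (1 - p) for p in pool]  # a 0/1 item's score variance

print(f"mean item p-value: {mean_p:.2f}")       # hovers near 0.50
print(f"variance of an easy item:  {0.85 * 0.15:.3f}")
print(f"variance of a 50/50 item:  {0.50 * 0.50:.3f}")
```

The easy item contributes far less score spread than the 50/50 item, which is exactly why the “too easy” items get weeded out.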

The counter-item in this case is another item also meant to capture statistics and/or data displays, but one that is much more difficult, largely because it is rarely taught, in part because it rarely matters in the real world. Take, for example, the box and whisker plot. If you don’t know what this is, which is in and of itself telling in this example, see them described and illustrated here. Often, this item IS found on such tests because this item IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) those things otherwise.
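For readers unfamiliar with it, a box and whisker plot displays a data set’s five-number summary: minimum, first quartile, median, third quartile, and maximum. A quick sketch with invented scores, using Python’s standard statistics module:

```python
# Five-number summary behind a box and whisker plot (invented scores).
import statistics

scores = [52, 55, 61, 64, 68, 70, 71, 75, 80, 88, 93]
q1, median, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
five_number = (min(scores), q1, median, q3, max(scores))
print(five_number)
```

The “box” spans the first to third quartile with a line at the median; the “whiskers” reach out to the minimum and maximum.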

Because this item is not as often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test“), and because this item doesn’t really matter in the real world, it becomes an item that is more useful for the test, as well as the overall functioning of the test, than it is an item that is useful for the students tested on it.

A side bar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of these students’ resulting scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned the best as per our IRT analysis? The box and whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many similar items, had a lower relative p-value and accordingly helped to increase the difficulty of the test and discriminate results so as to assert a purported “instructional sensitivity,” regardless of whether the item was actually valued, and more importantly valued in instruction.

Thanks to IRT, the items left on such tests are often not the items teachers teach, or at least not the items they teach well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes – the tests are aligned with the standards, as such items are in the standards, but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.

Unidentified Bloggers Defending Themselves Anonymously

“People should stand behind their own words…that’s doubly-true of people in public life.”

For those of you who missed it, about two weeks ago Diane Ravitch posted a piece about (and including) a set of emails exchanged between Diane, Raj Chetty, and me (although this post did not include Chetty’s emails, as he did not grant Diane permission to share them) about Chetty et al.’s now infamous study, the study at the heart of the recent Vergara v. California win (see Diane’s post here). In the comments section of this blog post, an unidentified respondent by the name of “WT,” as in “What The…,” went after me and my review/critique of Chetty et al.’s study, also referenced in this same post. See WT’s comments, again here, about 40 comments down: “WT on June 2, 2014 at 4:00 p.m.”

Anyhow, it became clear after a few back-and-forths that “WT” was closer to this study than (s)he seemingly wanted to expose. Following these suspicions, I wrote that I was at that point “even more curious as to why the real “WT” [wasn’t] standing up? Who [was] hiding behind (a perhaps fictitious) set of initials, or perhaps an acronym? Whoever “WT” [was] seem[ed] to care pretty deeply about this study, not to mention know a lot about it to cite directly from [very] small sections of the 56 pages (something I for sure would not be able to do nor, quite frankly, would I take the time to do given I’ve already conducted two reviews of this study since 2011). Might “WT” [have been] somebody a bit too close to this study, hence the knee-jerk, irrational reactions, that (still) lack[ed] precision and care? Everyone else (besides Harold [another commentator on this blog post]) ha[d] fully identified themselves in this string. So what [was to say] WT?”

Well, “WT” said nothing, besides a bunch of nothingness surrounding his anonymous withdrawal from the conversation string. Perhaps this was Chetty as “WT?” Perhaps not, but I’d bet some serious cash whoever “WT” was was pretty darn close to the Chetty et al. study and didn’t want to show it.

Sooo…now getting to the best part of all of this and how this has, in an interesting and similar turn of events, evidenced itself elsewhere. This video just came out on our local news station in Phoenix about a very similar situation in the state of Arizona and the state’s Superintendent of Public Instruction – John Huppenthal – engaging in some equally shady blogging tactics. Do give it a watch if you want yet another check on reality.

View Video Here