ACT Also Finds but Discourages States’ Decreased Use of VAMs Post-ESSA

Last June (2018), I released a blog post covering three key findings from a study that I, along with two colleagues, conducted on states’ revised teacher evaluation systems following the passage of the federal government’s Every Student Succeeds Act (ESSA; see the full study here). In short, we evidenced that (1) the role of growth or value-added models (VAMs) for teacher evaluation purposes is declining across states, (2) many states are embracing the increased local control afforded them via ESSA and no longer have one-size-fits-all teacher evaluation systems, and (3) the rhetoric surrounding teacher evaluation has changed (e.g., language about holding teachers accountable is increasingly less evident than language about teachers’ professional development and support).

Last week, a similar study was released by the ACT standardized testing company. As per the title of this report (see the full report here), they too found that there is a “Shrinking Use of Growth” across states’ “Teacher Evaluation Legislation since ESSA.” They also found that for some states there was a “complete removal” of the state’s teacher evaluation system (e.g., Maine, Washington), a postponement of the state’s teacher evaluation systems until further notice (e.g., Connecticut, Indiana, Tennessee) or a complete prohibition of the use of students’ standardized tests in any teachers’ evaluations moving forward (e.g., Connecticut, Idaho). Otherwise, as we also found, states are increasingly “allowing districts, rather than the state, to determine their evaluation frameworks” themselves.

Unlike in our study, however, ACT (perhaps not surprisingly, as a standardized testing company that also advertises its tests’ value-added capacities; see, for example, here) cautions states against “the complete elimination of student growth as part of teacher evaluation systems,” in that such measures have “clearly” (without citations in support) proven “their value and potential value.” Hence, ACT defines this as “a step backward.” In our aforementioned study and blog post we reviewed some of the actual evidence (with citations in support) and would, accordingly, contest all day long that these systems have a “clear” and “proven” “value.” Hence, we (also perhaps not surprisingly) characterized our similar findings as “steps in the right direction.”

Regardless, in all fairness, they recommend that states not “respond to the challenges of using growth measures in evaluation systems by eliminating their use” but “first consider less drastic measures such as:

  • postponing the use of student growth for employment decisions while refinements to the system can be made;
  • carrying out special studies to better understand the growth model; and/or
  • reviewing evaluation requirements for teachers who teach in untested grades and subjects so that the measures used more accurately reflect their performance.

“Pursuing such refinements, rather than reversing efforts to make teacher evaluation more meaningful and reflective of performance, is the best first step toward improving states’ evaluation systems.”

Citation: Croft, M., Guffy, G., & Vitale, D. (2018). The shrinking use of growth: Teacher evaluation legislation since ESSA. Iowa City, IA: ACT. Retrieved from https://www.act.org/content/dam/act/unsecured/documents/teacher-evaluation-legislation-since-essa.pdf

Also Last Thursday in Nevada: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

Last Thursday was a BIG day in terms of value-added models (VAMs). For those of you who missed it, US Magistrate Judge Smith ruled — in Houston Federation of Teachers (HFT) et al. v. Houston Independent School District (HISD) — that Houston teacher plaintiffs have legitimate claims that their EVAAS value-added estimates, as used (and abused) in HISD, violated their Fourteenth Amendment due process protections (i.e., no state, or in this case organization, shall deprive any person of life, liberty, or property without due process). See post here: “A Big Victory in Court in Houston.” On the same day, “we” won another court case — Texas State Teachers Association v. Texas Education Agency — in which the Honorable Lora J. Livingston ruled that the state was to remove all student growth requirements from all state-level teacher evaluation systems. In other words, and in the name of increased local control, teachers throughout Texas will no longer be required to be evaluated using their students’ test scores. See prior post here: “Another Big Victory in Court in Texas.”

Also last Thursday (it was a BIG day, like I said), I testified, again, regarding a similar provision (hopefully) being passed in the state of Nevada. As per a prior post here, Nevada’s “Democratic lawmakers are trying to eliminate — or at least reduce — the role [students’] standardized tests play in evaluations of teachers, saying educators are being unfairly judged on factors outside of their control.” More specifically, as per AB320 the state would eliminate statewide, standardized test results as a mandated teacher evaluation measure but allow local assessments to account for 20% of a teacher’s total evaluation. AB320 is still in work session. It has the votes in committee and on the floor, thus far.

The National Council on Teacher Quality (NCTQ), unsurprisingly (see here and here), submitted testimony against AB320 that can be read here, and I submitted testimony (I think, quite effectively 😉 ) refuting their “research-based” testimony, also making explicit what I termed “The ‘Top Ten’ Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers” here. I have also pasted my submission below, in case anybody wants to forward/share any of my main points with others, especially others in similar positions looking to impact state or local educational policies in similar ways.

*****

May 4, 2017

Dear Assemblywoman Miller:

Re: The “Top Ten” Research-Based Reasons Why Large-Scale, Standardized Tests Should Not Be Used to Evaluate Teachers

While I understand that the National Council on Teacher Quality (NCTQ) submitted a letter expressing its opposition to Assembly Bill (AB) 320, it should be officially noted that, counter to that which the NCTQ wrote into its “research-based” letter,[1] the American Statistical Association (ASA), the American Educational Research Association (AERA), the National Academy of Education (NAE), and other large-scale, highly esteemed, professional educational and educational research/measurement associations disagree with the assertions the NCTQ put forth. Indeed, the NCTQ is not a nonpartisan research and policy organization as claimed, but one of only a small handful of partisan operations still in existence and still pushing forward what are increasingly being dismissed as America’s ideal teacher evaluation systems (e.g., announced today, Texas dropped its policy requirement that standardized test scores be used to evaluate teachers; Connecticut moved in the same policy direction last month).

Accordingly, these aforementioned and highly esteemed organizations have all released statements cautioning all against the use of students’ large-scale, state-level standardized tests to evaluate teachers, primarily, for the following research-based reasons, that I have limited to ten for obvious purposes:

  1. The ASA evidenced that teacher effects account for only 1-14% of the variance in their students’ large-scale standardized test scores. This means that the other 86-99% of the variance is due to factors outside of any teacher’s control (e.g., out-of-school and student-level variables). That teachers’ effects, as measured by large-scale standardized tests (and not including other teacher effects that cannot be measured using large-scale standardized tests), account for so little variance makes using them to evaluate teachers wholly irrational and unreasonable.
  2. Large-scale standardized tests have always been, and continue to be, developed to assess levels of student achievement, but not levels of growth in achievement over time, and definitely not growth in achievement that can be attributed back to a teacher (i.e., in terms of his/her effects). Put differently, these tests were never designed to estimate teachers’ effects; hence, using them in this regard is also psychometrically invalid and indefensible.
  3. Large-scale standardized tests, when used to evaluate teachers, often yield unreliable or inconsistent results. Teachers who should be (more or less) consistently effective are, accordingly, being classified in sometimes highly inconsistent ways year-to-year. As per the current research, a teacher evaluated using large-scale standardized test scores as effective one year has a 25% to 65% chance of being classified as ineffective the following year(s), and vice versa. This makes the probability of a teacher being identified as effective, as based on students’ large-scale test scores, no different than the flip of a coin (i.e., random; see also the illustrative simulation following this list).
  4. The estimates derived via teachers’ students’ large-scale standardized test scores are also invalid. Very limited evidence exists to support that teachers whose students yield high large-scale standardized test scores are also rated as effective using at least one other correlated criterion (e.g., teacher observational scores, student satisfaction survey data), and vice versa. That these “multiple measures” do not map onto each other, also given the error prevalent in all of the “multiple measures” being used, decreases the degree to which all measures, students’ test scores included, can yield valid inferences about teachers’ effects.
  5. Large-scale standardized tests are often biased when used to measure teachers’ purported effects over time. More specifically, test-based estimates for teachers who teach inordinate proportions of English Language Learners (ELLs), special education students, students who receive free or reduced lunches, students retained in grade, and gifted students often reflect not these teachers’ true effects but group effects that bias their estimates upwards or downwards given these mediating factors. The same holds true for teachers who teach English/language arts versus mathematics, in that mathematics teachers typically yield more positive test-based effects (which defies logic and common sense).
  6. Relatedly, large-scale standardized test-based estimates are fraught with measurement errors that negate their usefulness. These errors are caused by inordinate amounts of inaccurate and missing data that cannot be replaced or disregarded; student variables that cannot be statistically “controlled for;” current and prior teachers’ effects on the same tests that also prevent their use for making determinations about single teachers’ effects; and the like.
  7. Using large-scale standardized tests to evaluate teachers is unfair. Issues of fairness arise when these test-based indicators impact some teachers more than others, sometimes in consequential ways. Typically, as is true across the nation, only teachers of mathematics and English/language arts in certain grade levels (e.g., grades 3-8 and once in high school) can be measured or held accountable using students’ large-scale test scores. Across the nation, this leaves approximately 60-70% of teachers as test-based ineligible.
  8. Large-scale standardized test-based estimates are typically of very little formative or instructional value. Relatedly, no research to date evidences that using tests for said purposes has improved teachers’ instruction or student achievement as a result. As per UCLA Professor Emeritus W. James Popham: the farther the test moves away from the classroom level (e.g., a test developed and used at the state level), the worse the test gets in terms of its instructional value and its potential to help promote change within teachers’ classrooms.
  9. Large-scale standardized test scores are being used inappropriately to make consequential decisions, although they do not have the reliability, validity, fairness, etc. to satisfy that for which they are increasingly being used, especially at the teacher level. This is becoming increasingly recognized by US court systems as well (e.g., in New York and New Mexico).
  10. The unintended consequences of such test score use for teacher evaluation purposes continuously go unrecognized by the states that pass such policies, even though states should acknowledge them in advance of adopting such policies. Research has evidenced, for example, that teachers are choosing not to teach certain types of students whom they deem the most likely to hinder their potential positive effects. Principals are also stacking teachers’ classes to make sure certain teachers are more likely to demonstrate positive effects, or vice versa, to protect or penalize certain teachers, respectively. Teachers are leaving/refusing assignments to grades in which test-based estimates matter most, and some are leaving teaching altogether out of discontent or in professional protest.
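To make the reliability point in item 3 more concrete, below is a minimal, hypothetical simulation (written in Python, with assumed numbers; it is not drawn from the studies referenced above). It shows that when teacher effects account for only a small share of the variance in students’ test-based scores, in line with the ASA’s 1-14% figure, year-to-year “effective”/“ineffective” labels become nearly as unstable as coin flips.

```python
# Minimal, hypothetical simulation (not from any cited study): if teacher
# effects explain only ~10% of the variance in students' test-based growth
# scores, how stable are year-to-year "effective"/"ineffective" labels?
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000

# Fixed "true" teacher effects, scaled so they account for ~10% of variance.
true_effect = rng.normal(0, np.sqrt(0.10), n_teachers)

def observed_scores():
    # Everything else (student, home, and school factors, measurement error)
    # contributes the remaining ~90% of the variance each year.
    return true_effect + rng.normal(0, np.sqrt(0.90), n_teachers)

year1, year2 = observed_scores(), observed_scores()

# Label teachers "effective" if they fall in the top half of scores each year.
eff1 = year1 > np.median(year1)
eff2 = year2 > np.median(year2)

flipped = np.mean(eff1 != eff2)
print(f"Share of teachers whose label flips between years: {flipped:.2f}")
# With only ~10% of the variance attributable to teachers, close to half of
# the labels flip from one year to the next (near coin-toss behavior).
```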

[1] Note that the two studies the NCTQ used to substantiate their “research-based” letter do not support the claims included. For example, their statement that “According to the best-available research, teacher evaluation systems that assign between 33 and 50 percent of the available weight to student growth ‘achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than measured by a single test’” is false. First, the actual “best-available” research comes from over 10 years of peer-reviewed publications on this topic, including over 500 peer-reviewed articles. Second, what the authors of the Measures of Effective Teaching (MET) studies found was that the percentages to be assigned to student test scores were arbitrary at best, because their attempts to empirically determine such a percentage failed. This fact the authors also made explicit in their report; that is, they also noted that the percentages they suggested were not empirically supported.

Is Combining Different Tests to Measure Value-Added Valid?

A few days ago, on Diane Ravitch’s blog, a person posted the following comment, that Diane sent to me for a response:

“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”

Here’s what I wrote in response, that I’m sharing here with you all as this too is becoming increasingly common across the country; hence, it is increasingly becoming a question of high priority and interest:

“We have three issues here when equating different tests SIMPLY because these tests test the same students on the same things around the same time.

First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.

Second, doing this ASSUMES that all standardized tests are vertically scaled, whereby scores increase linearly as students progress through the grades on similar linear scales. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry without any real theoretical or other foundation in support.

Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates WIDELY range across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”
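To illustrate the second point above with a toy example (assumed numbers only; this is not real test or VARC data), the Python sketch below standardizes two hypothetical tests taken by the same students so that they share the same mean and standard deviation. The “normed” scores still rank the same students quite differently, which is the problem with treating differently scaled tests as interchangeable.

```python
# Toy illustration (assumed numbers, not real test data): "norming" two
# different tests to the same mean/SD does not make them interchangeable.
import numpy as np

rng = np.random.default_rng(1)
n_students = 5_000

ability = rng.normal(0, 1, n_students)  # what the two tests partly share
test_a = 400 + 40 * (0.8 * ability + 0.6 * rng.normal(0, 1, n_students))
test_b = 250 + 25 * (0.8 * ability + 0.6 * rng.normal(0, 1, n_students))

# Standardize both to z-scores: identical means (0) and SDs (1)...
z_a = (test_a - test_a.mean()) / test_a.std()
z_b = (test_b - test_b.mean()) / test_b.std()

# ...yet the same students are ranked quite differently on the two tests.
print("correlation of normed scores:", round(np.corrcoef(z_a, z_b)[0, 1], 2))
# ~0.64 here: far lower than the near-1.0 correspondence that treating the
# two normed tests as equivalent would assume.
```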

See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.

Can Today’s Tests Yield Instructionally Useful Data?

The answer is no, or at best not yet.

Some heavy hitters in the academy just released an article that might be of interest to you all. In the article the authors discuss whether “today’s standardized achievement tests [actually] yield instructionally useful data.”

The authors include W. James Popham, Professor Emeritus from the University of California, Los Angeles; David Berliner, Regents’ Professor Emeritus at Arizona State University; Neal Kingston, Professor at the University of Kansas; Susan Fuhrman, current President of Teachers College, Columbia University; Steven Ladd, Superintendent of Elk Grove Unified School District in California; Jeffrey Charbonneau, National Board Certified Teacher in Washington and the 2013 US National Teacher of the Year; and Madhabi Chatterji, Associate Professor at Teachers College, Columbia University.

These authors explored some of the challenges and promises in terms of using and designing standardized achievement tests and other educational tests that are “instructionally useful.” This was the focus of a recent post about whether Pearson’s tests are “instructionally sensitive” and what University of Texas – Austin’s Associate Professor Walter Stroup versus Pearson’s Senior Vice President had to say on this topic.

In this article, the authors deliberate more specifically about the consequences of using inappropriately designed tests for decision-making purposes, particularly when tests are insensitive to instruction. Here, the authors underscore serious issues related to validity, ethics, and consequences, all of which they appropriately elevate to speak out against, in particular, the use of current, large-scale standardized achievement tests for evaluating teachers and schools.

The authors also offer recommendations for local policy contexts, to support (1) the design of more instructionally sensitive large-scale tests as well as (2) the design of other, smaller-scale tests that can also be more instructionally sensitive and, simply, better. These include but are not limited to classroom tests as typically created, controlled, and managed by teachers, as well as district tests as sometimes created, controlled, and managed by district administrators.

Such tests might help to create more, but also better, comprehensive educational evaluation systems, the authors ultimately argue, although this, of course, would require more professional development to help teachers (and others, including district personnel) develop more instructionally sensitive, and accordingly more useful, tests. As they also note, this would also require that “validation studies…be undertaken to ensure validity in interpretations of results within the larger accountability policy context where schools and teachers are evaluated.”

This is especially important if tests are to be used for low- and high-stakes decision-making purposes. Yet this is something that is way too often forgotten when it comes to test use, and in particular test abuse. All should really take heed here.

Reference: Popham, W. J., Berliner, D. C., Kingston, N. M., Fuhrman, S. H., Ladd, S. M., Charbonneau, J., & Chatterji, M. (2014). Can today’s standardized achievement tests yield instructionally useful data? Quality Assurance in Education, 22(4), 303-318. doi:10.1108/QAE-07-2014-0033. Retrieved from http://www.tc.columbia.edu/aeri/publications/QAE1.pdf

Pearson Tests v. UT Austin’s Associate Professor Stroup

Last week the Texas Observer published an article, titled “Mute the Messenger,” about University of Texas – Austin’s Associate Professor Walter Stroup, who publicly and quite visibly claimed that Texas’ standardized tests, as supported by Pearson, were flawed in terms of their purported capacities to measure teachers’ instructional effects. The article is also about how “the testing company [has since] struck back,” purportedly in a very serious way. This article (linked again here) is well worth a full read for many reasons I will leave you all to infer. This article was also covered recently on Diane Ravitch’s blog here, although readers should also see Pearson’s Senior Vice President’s prior response to, and critique of, Stroup’s assertions and claims (from August 2, 2014) here.

The main issue? Whether Pearson’s tests are “instructionally sensitive.” That is, whether (as per testing and measurement expert – Professor Emeritus W. James Popham) a test is able to differentiate between well taught and poorly taught students, versus able to differentiate between high and low achievers regardless of how students were taught (i.e., as per that which happens outside of school that students bring with them to the schoolhouse door).

Test developers like Pearson seem to focus on the former, that their tests are indeed sensitive to instruction. Testing/measurement academics, and especially practitioners, seem to focus on the latter: that such tests may be somewhat sensitive to instruction, but not nearly as “instructionally sensitive” as testing companies might claim. Rather, tests are (as per testing and measurement expert – Regents Professor David Berliner) sensitive to instruction but, more importantly, sensitive to everything else students bring with them to school from their homes, parents, siblings, and families, all of which are situated in their neighborhoods and communities and related to their social class. Here seems to be where this now very heated and polarized argument between Pearson and Associate Professor Stroup stands.

Pearson is focusing on its advanced psychometric approaches, namely its use of Item Response Theory (IRT), while defending its tests as “instructionally sensitive.” IRT is used to examine things like p-values (essentially, the proportions of students who respond to items correctly) and item-discrimination indices (to see whether test items discriminate between students who know [or are taught] certain things and students who don’t know [or are not taught] those things otherwise). This is much more complicated than what I am describing here, but hopefully this gives you all the gist of what now seems to be the crux of this situation.
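For readers who want to see what these two statistics look like in practice, here is a simplified item-analysis sketch in Python. It computes classical item statistics (p-values and point-biserial discrimination indices) on a simulated response matrix; it is illustrative only and is not Pearson’s actual IRT machinery, which involves far more elaborate models.

```python
# Simplified item-analysis sketch (classical item statistics on simulated
# data, not Pearson's actual IRT pipeline): the p-value and discrimination
# index referred to in the post.
import numpy as np

rng = np.random.default_rng(2)
n_students, n_items = 1_000, 5

# Fake 0/1 response matrix: rows = students, columns = items.
ability = rng.normal(0, 1, n_students)
difficulty = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])  # easy -> hard (assumed)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((n_students, n_items)) < prob_correct).astype(int)

total = responses.sum(axis=1)

for j in range(n_items):
    p_value = responses[:, j].mean()                 # proportion correct
    rest = total - responses[:, j]                   # score on the other items
    discrimination = np.corrcoef(responses[:, j], rest)[0, 1]  # point-biserial
    print(f"item {j}: p = {p_value:.2f}, discrimination = {discrimination:.2f}")
# Items that nearly everyone answers correctly (very high p-values) tend to
# show weaker discrimination on this metric and are the ones most likely to
# be flagged for removal or revision.
```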

As per Pearson’s Senior Vice President’s statement, linked again here, “Dr. Stroup claim[ed] that selecting questions based on Item Response Theory produces tests that are not sensitive to measuring what students have learned.” From what I know about Dr. Stroup’s actual claims, however, this trivializes his overall arguments. Tests, after undergoing revisions as per IRT methods, are not always “instructionally sensitive.”

When using IRT methods, test companies, for example, remove items that “too many students get right” (e.g., as per items’ aforementioned p-values). This alone makes tests less “instructionally sensitive” in practice. In other words, while the use of IRT methods is sound psychometric practice based on decades of research and development, if using IRT deems an item “too easy,” even if the item is taught well (i.e., “instructionally sensitive”), the item might be removed. This makes the test (1) less “instructionally sensitive” in the eyes of teachers who are to teach the tested content (and who are now more than before held accountable for teaching these items), and (2) more “instructionally sensitive” in the eyes of test developers, in that the fewer students who get test items correct, the better the items are at discriminating between those who know (or are taught) certain things and students who don’t know (or are not taught) those things otherwise.

A paradigm example of what this looks like in practice comes from advanced (e.g., high school) mathematics tests.

Items capturing statistics and/or data displays on such tests should theoretically include items illustrating standard column or bar charts, with questions prompting students to interpret the meanings of the statistics illustrated in the figures. Too often, however, because these items are often taught (and taught well) by teachers (i.e., they are “instructionally sensitive”), “too many” students answer such items correctly. Sometimes these items yield p-values greater than p=0.80, or 80% correct.

When you need a test and its outcome score data to fit around the bell curve, you cannot have too many such items on the final test. In the simplest of terms, for every item with a p-value of 80% you would need another with a p-value of 20% to balance items out, or to keep the overall mean of the test around p=0.50 (the center of the standard normal curve). It’s best if test items, more or less, hang around such a mean; otherwise the test will not function as it needs to, mainly to discriminate between who knows (or is taught) certain things and who doesn’t know (or isn’t taught) those things otherwise. Items with high p-values do not always distribute scores well enough, because “too many students” answering such items correctly reduces the variation (or spread of scores) needed.
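A quick bit of arithmetic (my own illustration, not anything from a test vendor’s documentation) shows why difficulty-balancing pushes very easy, well-taught items off such tests: a right/wrong item’s contribution to score spread is p(1 - p), which peaks when exactly half of the students answer correctly.

```python
# Quick arithmetic behind the "hang around p = 0.50" point: a 0/1 item's
# score variance is p(1 - p), largest at p = 0.50 and shrinking as items
# get very easy or very hard.
for p in (0.20, 0.50, 0.80, 0.95):
    print(f"p = {p:.2f}  ->  item variance p(1 - p) = {p * (1 - p):.4f}")
# 0.50 gives 0.25; 0.95 gives roughly 0.05.  A test full of very easy
# (well-taught) items spreads scores poorly, so difficulty-balancing
# tends to push them out.
```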

The counter-item in this case is another item also meant to capture statistics and/or data display, but that is much more difficult, largely because it’s rarely taught because it rarely matters in the real world. Take, for example, the box and whisker plot. If you don’t know what this is, which is in and of itself telling in this example, see them described and illustrated here. Often, this item IS found on such tests because this item IS DIFFICULT and, accordingly, works wonderfully well to discriminate between those who know (or are taught) certain things and those who don’t know (or aren’t taught) certain things otherwise.

Because this item is not as often taught (unless teachers know it’s coming, which is a whole other issue when we think about “instructional sensitivity” and “teaching-to-the-test“), and because this item doesn’t really matter in the real world, it becomes an item that is more useful for the test, as well as the overall functioning of the test, than it is an item that is useful for the students tested on it.

A side bar on this: A few years ago I had a group of advanced doctoral students studying statistics take Arizona’s (now former) High School Graduation Exam. We then performed an honest analysis of the resulting doctoral students’ scores using some of the above-mentioned IRT methods. Guess which item students struggled with the most, which also happened to be the item that functioned the best as per our IRT analysis? The box and whisker plot. The conversation that followed was most memorable, as the statistics students themselves questioned the utility of this traditional item, for them as advanced doctoral students but also for high school graduates in general.

Anyhow, this item, like many other similar items, had a lower relative p-value and, accordingly, helped to increase the difficulty of the test and to discriminate results so as to assert a purported “instructional sensitivity,” regardless of whether the item was actually valued, and more importantly valued in instruction.

Thanks to IRT, the items left on such tests are often not the items taught by teachers, or at least not the items taught by teachers well, BUT they distribute students’ test scores effectively and help others make inferences about who knows what and who doesn’t. This happens even though the items left do not always capture what matters most. Yes – the tests are aligned with the standards, as such items are in the standards, but when the most difficult items in the standards trump the others, and many of the others that likely matter more are removed for really no better reason than what IRT dictates, this is where things really go awry.

David Berliner’s “Thought Experiment”

My main mentor, David Berliner (Regents Professor at Arizona State University) wrote a “Thought Experiment” that Diane Ravitch posted on her blog yesterday. I have pasted the full contents here for those of you who may have missed it. Do take a read, and play along and see if you can predict which state will yield higher test performance in the end.

—–

Let’s do a thought experiment. I will slowly parcel out data about two different states. Eventually, when you are nearly 100% certain of your choice, I want you to choose between them by identifying the state in which an average child is likely to be achieving better in school. But you have to be nearly 100% certain that you can make that choice.

To check the accuracy of your choice I will use the National Assessment of Educational Progress (NAEP) as the measure of school achievement. It is considered by experts to be the best indicator we have to determine how children in our nation are doing in reading and mathematics, and both states take this test.

Let’s start. In State A the percent of three- and four-year-old children attending a state-associated prekindergarten is 8.8%, while in State B the percent is 1.7%. With these data, think about where students might be doing better in 4th and 8th grade, the grades in which NAEP evaluates student progress in all our states. I imagine that most people will hold onto this information about preschool for a while and not yet want to choose one state over the other. A cautious person might rightly say it is too soon to make such a prediction based on a difference of this size, on a variable that has modest, though real, effects on later school success.

So let me add more information to consider. In State A the percent of children living in poverty is 14% while in State B the percent is 24%. Got a prediction yet? See a trend? How about this related statistic: In State A the percent of households with food insecurity is 11.4% while in State B the percent is 14.9%. I can also inform you that in State A the percent of people without health insurance is 3.8% while in State B the percent is 17.7%. Are you getting the picture? Are you ready to pick one state over another in terms of the likelihood that one state has its average student scoring higher on the NAEP achievement tests than the other?

If you still say that this is not enough data to make yourself almost 100% sure of your pick, let me add more to help you. In State A the per capita personal income is $54,687 while in State B the per capita personal income is $35,979. Since per capita personal income in the country is now at about $42,693, we see that State A is considerably above the national average and State B is considerably below the national average. Still not ready to choose a state where kids might be doing better in school?

Alright, if you are still cautious in expressing your opinions, here is some more to think about. In State A the per capita spending on education is $2,764 while in State B the per capita spending on education is $2,095, about 25% less. Enough? Ready to choose now?
Maybe you should also examine some statistics related to the expenditure data, namely, that the pupil/teacher ratio (not the class sizes) in State A is 14.5 to one, while in State B it is 19.8 to one.

As you might now suspect, class size differences also occur in the two states. At the elementary and the secondary level, respectively, the class sizes for State A average 18.7 and 20.6. For State B those class sizes at elementary and secondary are 23.5 and 25.6, respectively. State B, therefore, averages at least 20% higher in the number of students per classroom. Ready now to pick the higher achieving state with near 100% certainty? If not, maybe a little more data will make you as sure as I am of my prediction.

In State A the percent of those who are 25 years of age or older with bachelor’s degrees is 38.7% while in State B that percent is 26.4%. Furthermore, the two states have just about the same size population. But State A has 370 public libraries and State B has 89.
Let me try to tip the data scales for what I imagine are only a few people who are reluctant to make a prediction. The percent of teachers with master’s degrees is 62% in State A and 41.6% in State B. And, the average public school teacher salary in the time period 2010-2012 was $72,000 in State A and $46,358 in State B. Moreover, during the time period from the academic year 1999-2000 to the academic year 2011-2012 the percent change in average teacher salaries in the public schools was +15% in State A. Over that same time period, in State B public school teacher salaries dropped -1.8%.

I will assume by now we almost all have reached the opinion that children in State A are far more likely to perform better on the NAEP tests than will children in State B. Everything we know about the ways we structure the societies we live in, and how those structures affect school achievement, suggests that State A will have higher achieving students. In addition, I will further assume that if you don’t think that State A is more likely to have higher performing students than State B you are a really difficult and very peculiar person. You should seek help!

So, for the majority of us, it should come as no surprise that in the 2013 data set on the 4th grade NAEP mathematics test State A was the highest performing state in the nation (tied with two others). And it had 16 percent of its children scoring at the Advanced level—the highest level of mathematics achievement. State B’s score was behind 32 other states, and it had only 7% of its students scoring at the Advanced level. The two states were even further apart on the 8th grade mathematics test, with State A the highest scoring state in the nation, by far, and with State B lagging behind 35 other states.

Similarly, it now should come as no surprise that State A was number 1 in the nation in the 4th grade reading test, although tied with 2 others. State A also had 14% of its students scoring at the advanced level, the highest rate in the nation. Students in State B scored behind 44 other states and only 5% of its students scored at the Advanced level. The 8th grade reading data was the same: State A walloped State B!

States A and B really exist. State B is my home state of Arizona, which obviously cares not to have its children achieve as well as do those in State A. Its poor achievement is by design. Proof of that is not hard to find. We just learned that 6,000 phone calls reporting child abuse to the state were uninvestigated. Ignored and buried! Such callous disregard for the safety of our children can only occur in an environment that fosters, and then condones, a lack of concern for the children of Arizona, perhaps because they are often poor and often minorities. Arizona, given the data we have, apparently does not choose to take care of its children. The agency with the express directive of ensuring the welfare of children may need 350 more investigators of child abuse. But the governor and the majority of our legislature are currently against increased funding for that agency.

State A, where kids do a lot better, is Massachusetts. It is generally a progressive state in politics. To me, Massachusetts, with all its warts, resembles Northern European countries like Sweden, Finland, and Denmark more than it does states like Alabama, Mississippi or Arizona. According to UNESCO data and epidemiological studies it is the progressive societies like those in Northern Europe and Massachusetts that care much better for their children. On average, in comparisons with other wealthy nations, the U.S. turns out not to take good care of its children. With few exceptions, our politicians appear less likely to kiss our babies and more likely to hang out with individuals and corporations that won’t pay the taxes needed to care for our children, thereby ensuring that our schools will not function well.

But enough political commentary: Here is the most important part of this thought experiment for those who care about education. Every one of you who predicted that Massachusetts would outperform Arizona did so without knowing anything about the unions’ roles in the two states, the curriculum used by the schools, the quality of the instruction, the quality of the leadership of the schools, and so forth. You made your prediction about achievement without recourse to any of the variables the anti-public school forces love to shout about: incompetent teachers, a dumbed-down curriculum, coddling of students, not enough discipline, not enough homework, and so forth. From a few variables about life in two different states you were able to predict differences in student achievement test scores quite accurately.

I believe it is time for the President, the Secretary of Education, and many in the press to get off the backs of educators and focus their anger on those who will not support societies in which families and children can flourish. Massachusetts still has many problems to face and overcome—but they are nowhere as severe as those in my home state and a dozen other states that will not support programs for neighborhoods, families, and children to thrive.

This little thought experiment also suggests that a caution for Massachusetts is in order. It seems to me that despite all the bragging about Massachusetts’ fine performance on international tests and NAEP tests, it’s not likely that its teachers, or their curriculum, or their assessments are the basis of their outstanding achievements in reading and mathematics. It is much more likely that Massachusetts is a high-performing state because it has chosen to take better care of its citizens than do those of us living in other states. The roots of high achievement on standardized tests are less likely to be found in the classrooms of Massachusetts and more likely to be discovered in its neighborhoods and families, a reflection of the prevailing economic health of the communities served by the schools of that state.