Just Released: The American Educational Research Association’s (AERA) Statement on VAMs

Yesterday, the Council of the American Educational Research Association (AERA) – founded in 1916, AERA is the largest national professional organization devoted to the scientific study of education – publicly released its “AERA Statement on Use of Value-Added Models (VAM) for the Evaluation of Educators and Educator Preparation Programs.” Below is a summary of the AERA Council’s key points, noting for transparency that I contributed to these points in June of 2014, before the final statement was externally reviewed, revised, and vetted for public release.

As per the introduction: “The purpose of this statement is to inform those using or considering the use of value-added models (VAM) about their scientific and technical limitations in the evaluation of educators [as well as programs that prepare teachers].” The purpose of this statement is also to stress “the importance of any educator evaluation system meeting the highest standards of practice in statistics and measurement,” well before VAM output is to carry any “high-stakes, dispositive weight in [such teacher or other] evaluations” (p. 1).

As per the main body of the statement, the AERA Council highlights eight very important technical requirements that must be met prior to such evaluative use. These eight technical requirements should be officially recognized by all states, and/or used by any of you out there to help inform your states regarding what they can and cannot, or should and should not, do when using VAMs, given that “[a]ny material departure from these [eight] requirements should preclude [VAM] use” (p. 2).

Here are AERA’s eight technical requirements for the use of VAM:

  1. “VAM scores must only be derived from students’ scores on assessments that meet professional standards of reliability and validity for the purpose to be served…Relevant evidence should be reported in the documentation supporting the claims and proposed uses of VAM results, including evidence that the tests used are a valid measure of growth [emphasis added] by measuring the actual subject matter being taught and the full range of student achievement represented in teachers’ classrooms” (p. 3).
  2. “VAM scores must be accompanied by separate lines of evidence of reliability and validity that support each [and every] claim and interpretative argument” (p. 3).
  3. “VAM scores must be based on multiple years of data from sufficient numbers of students…[Related,] VAM scores should always be accompanied by estimates of uncertainty to guard against [simplistic] overinterpretation[s] of [simple] differences” (p. 3).
  4. “VAM scores must only be calculated from scores on tests that are comparable over time…[In addition,] VAM scores should generally not be employed across transitions [to new, albeit different tests over time]” (AERA Council, 2015, p. 3).
  5. “VAM scores must not be calculated in grades or for subjects where there are not standardized assessments that are accompanied by evidence of their reliability and validity…When standardized assessment data are not available across all grades (K–12) and subjects (e.g., health, social studies) in a state or district, alternative measures (e.g., locally developed assessments, proxy measures, observational ratings) are often employed in those grades and subjects to implement VAM. Such alternative assessments should not be used unless they are accompanied by evidence of reliability and validity as required by the AERA, APA, and NCME Standards for Educational and Psychological Testing” (p. 3).
  6. “VAM scores must never be used alone or in isolation in educator or program evaluation systems…Other measures of practice and student outcomes should always be integrated into judgments about overall teacher effectiveness” (p. 3).
  7. “Evaluation systems using VAM must include ongoing monitoring for technical quality and validity of use…Ongoing monitoring is essential to any educator evaluation program and especially important for those incorporating indicators based on VAM that have only recently been employed widely. If authorizing bodies mandate the use of VAM, they, together with the organizations that implement and report results, are responsible for conducting the ongoing evaluation of both intended and unintended consequences. The monitoring should be of sufficient scope and extent to provide evidence to document the technical quality of the VAM application and the validity of its use within a given evaluation system” (AERA Council, 2015, p. 3).
  8. “Evaluation reports and determinations based on VAM must include statistical estimates of error associated with student growth measures and any ratings or measures derived from them…There should be transparency with respect to VAM uses and the overall evaluation systems in which they are embedded. Reporting should include the rationale and methods used to estimate error and the precision associated with different VAM scores. Also, their reliability from year to year and course to course should be reported. Additionally, when cut scores or performance levels are established for the purpose of evaluative decisions, the methods used, as well as estimates of classification accuracy, should be documented and reported. Justification should [also] be provided for the inclusion of each indicator and the weight accorded to it in the evaluation process…Dissemination should [also] include accessible formats that are widely available to the public, as well as to professionals” (pp. 3-4).

As per the conclusion: “The standards of practice in statistics and testing set a high technical bar for properly aggregating student assessment results for any purpose, especially those related to drawing inferences about teacher, school leader, or educator preparation program effectiveness” (p. 4). Accordingly, the AERA Council recommends that VAMs “not be used without sufficient evidence that this technical bar has been met in ways that support all claims, interpretative arguments, and uses (e.g., rankings, classification decisions)” (p. 4).

CITATION: AERA Council. (2015). AERA statement on use of value-added models (VAM) for the evaluation of educators and educator preparation programs. Educational Researcher, X(Y), 1-5. doi:10.3102/0013189X15618385. Retrieved from http://edr.sagepub.com/content/early/2015/11/10/0013189X15618385.full.pdf+html

Houston Board Candidates Respond to their Teacher Evaluation System

For a recent article in the Houston Chronicle, the newspaper sent 12 current candidates for the Houston Independent School District (HISD) School Board a series of questions about HISD, to which seven candidates responded. The seven candidates’ responses are of specific interest here in that HISD is the district well known for attaching higher-stakes consequences to value-added output (e.g., teacher termination) than perhaps any other district (see for example here, here, and here). The responses are of general interest in that the district uses the popular and (in)famous Education Value-Added Assessment System (EVAAS) for said purposes (see also here, here, and here). Accordingly, what these seven candidates have to say about the EVAAS and/or HISD’s teacher evaluation system might also be a sign of things to come, perhaps for the better, throughout HISD.

The two questions of interest here are: (1) Do you support HISD’s current teacher evaluation system, which includes student test scores? Why or why not? What, if any, changes would you make? And (2) Do you support HISD’s current bonus system based on student test scores? Why or why not? What, if any, changes would you make? To see candidate names, their background information, their responses to other questions, etc., please read in full the article in the Houston Chronicle.

Here are the seven candidates’ responses to question #1:

  • I do not support the current teacher evaluation system. Teacher’s performance should not rely on the current formula using the evaluation system with the amount of weight placed on student test scores. Too many obstacles outside the classroom affect student learning today that are unfair in this system. Other means of support such as a community school model must be put in place to support the whole student, supporting student learning in the classroom (Fonseca).
  • No, I do not support the current teacher evaluation system, EVAAS, because it relies on an algorithm that no one understands. Testing should be diagnostic, not punitive. Teachers must have the freedom to teach basic math, reading, writing and science and not only teach to the test, which determines if they keep a job and/or get bonuses. Teachers should be evaluated on student growth. For example, did the third-grade teacher raise his/her non-reading third-grader to a higher level than that student read when he/she came into the teacher’s class? Did the teacher take time to figure out what non-educational obstacles the student had in order to address those needs so that the student began learning? Did the teacher coach the debate team and help the students become more well-rounded, and so on? Standardized tests in a vacuum indicate nothing (Jones).
  • I remember the time when teachers practically never revised test scores. Tests can be one of the best tools to help a child identify strengths and weakness. Students’ scores were filed, and no one ever checked them out from the archives. When student scores became part of their evaluation, teachers began to look into data more often. It is a magnificent tool for student and teacher growth. Having said that, I also believe that many variables that make a teacher great are not measured in his or her evaluation. There is nothing on character education for which teachers are greatly responsible. I do not know of a domain in the teacher’s evaluation that quite measures the art of teaching. Data is about the scientific part of teaching, but the art of teaching has to be evaluated by an expert at every school; we call them principals (Leal).
  • Student test scores were not designed to be used for this purpose. The use of students’ test scores to evaluate teachers has been discredited by researchers and statisticians. EVAAS and other value-added models are deeply flawed and should not be major components of a teacher evaluation system. The existing research indicates that 10-14 percent of students’ test scores are attributable to teacher factors. Therefore, I would support using student test scores (a measure of student achievement) as no more than 10-14 percent of teachers’ evaluations (McCoy).
  • No, I do not support the current teacher evaluation system, which includes student test scores, for the following reasons: 1) High-stakes decisions should not be made on the basis of value-added scores alone. 2) The system is meant to assess and predict student performance with precision and reliability, but the data revealed that the EVAAS system is inconsistent and has consistent problems. 3) The EVAAS reports do not match the teachers’ “observation” PDAS scores [on the formal evaluation]; therefore, data is manipulated to show a relationship. 4) Most importantly, teachers cannot use the information generated as a formative tool because teachers receive the EVAAS reports in the summer or fall after the students leave their classroom. 5) Very few teachers realized that there was an HISD-sponsored professional development training linked to the EVAAS system to improve instruction. Changes that I will make are to make recommendations and confer with other board members to revamp the system or identify a more equitable system (McCullough).
  • The current teacher evaluation system should be reviewed and modified. While I believe we should test, it should only be a diagnostic measure of progress and indicator of deficiency for the purpose of aligned instruction. There should not be any high stakes attached for the student or the teacher. That opens the door for restricting teaching-to-test content and stifles the learning potential. If we have to have it, make it 5 percent. The classroom should be based on rich academic experiences, not memorization regurgitation (Skillern-Jones).
  • I support evaluating teachers on how well their students perform and grow, but I do not support high-stakes evaluation of teachers using a value-added test score that is based on the unreliable STAAR test. Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes decisions or compared across dissimilar student populations or schools. If we had a reliable test of student learning, I would support the use of value-added growth measures in a low-stakes fashion where measures of student growth are part of an integrated analysis of a teacher’s overall performance and practices. I strongly believe that teachers should be evaluated with an integrated set of measures that show what teachers do and what happens as a result. These measures may include meaningful evidence of student work and learning, pedagogy, classroom management, knowledge of content and even student surveys. Evaluators should be appropriately trained, and teachers should have regular evaluations with frequent feedback from strong mentors and professional development to strengthen their content knowledge and practice (Stipeche).

Here are the seven candidates’ responses to question #2:

  • I do not support the current bonus system based on student test scores as, again, teachers do not currently have support to affect what happens outside the classroom. Until we provide support, we cannot base teacher performance or bonuses on a heavy weight of test scores (Fonseca).
  • No, I do not support the current bonus system. Teachers who grow student achievement should receive bonuses, not just teachers whose students score well on tests. For example, a teacher who closes the educational achievement gap with a struggling student should earn a bonus before a teacher who has students who are not challenged and for whom learning is relatively easy. Teachers who grow their students in extracurricular activities should earn a bonus before a teacher that only focuses on education. Teachers that choose to teach in struggling schools should earn a bonus over a teacher that teaches in a school with non-struggling students. Teachers who work with their students in UIL participation, history fairs, debate, choir, student government and like activities should earn a bonus over a teacher who does not (Jones).
  • Extrinsic incentives killed creativity. I knew that from my counseling background, but in 2011 or 2010, Dr. Grier sent an email to school administrators with a link of a TED Talks video that contradicts any notion of giving monetary incentives to promote productivity in the classroom: http://www.ted.com/talks/dan_pink_on_motivation?language=en. Give incentives for perfect attendance or cooperation among teachers selected by teachers (Leal).
  • No. Student test scores were not designed to be used for this purpose. All teachers need salary increases (McCoy).
  • No, I do not support HISD’s current bonus system based on student test scores. Student test scores should be a diagnostic tool used to identify instructional gaps and improve student achievement. Not as a measure to reward teachers, because the process is flawed. I would work collaboratively to identify another system to reward teachers (McCullough).
  • The current bonus program does, in fact, reward teachers whose students make significant academic gains. It leaves out those teachers who have students at the top of the achievement scale. By formulaic measures, it is flawed and the system, according to its creators, is being misused and misapplied. It would be beneficial overall to consider measures to expand the teacher population of recipients as well as to undertake measures to simplify the process if we keep it. I think a better focus would be to see how we can increase overall teacher salaries in a meaningful and impactful way to incentivize performance and longevity (Skillern-Jones).
  • No. I do not support the use of EVAAS in this manner. More importantly, ASPIRE has not closed the achievement gap nor dramatically improved the academic performance of all students in the district (Stipeche).

No responses, or no responses of any general substance, were received from Daniels, Davila, McKinzie, Smith, and Williams.

The Nation’s “Best Test” Scores Released: Test-Based Policies (Evidently) Not Working

From Diane Ravitch’s Blog (click here for direct link):

Sometimes events happen that seem to be disconnected, but after a few days or weeks, the pattern emerges. Consider this: On October 2, [U.S.] Secretary of Education Arne Duncan announced that he was resigning and planned to return to Chicago. Former New York Commissioner of Education John King, who is a clone of Duncan in terms of his belief in testing and charter schools, was designated to take Duncan’s place. On October 23, the Obama administration held a surprise news conference to declare that testing was out of control and should be reduced to not more than 2% of classroom time [see prior link on this announcement here]. Actually, that wasn’t a true reduction, because 2% translates into between 18-24 hours of testing, which is a staggering amount of annual testing for children in grades 3-8 and not different from the status quo in most states.

Disconnected events?

Not at all. Here comes the pattern-maker: the federal tests called the National Assessment of Educational Progress [NAEP] released its every-other-year report card in reading and math, and the results were dismal. There would be many excuses offered, many rationales, but the bottom line: the NAEP scores are an embarrassment to the Obama administration (and the George W. Bush administration that preceded it).

For nearly 15 years, Presidents Bush and Obama and the Congress have bet billions of dollars—both federal and state—on a strategy of testing, accountability, and choice. They believed that if every student was tested in reading and mathematics every year from grades 3 to 8, test scores would go up and up. In those schools where test scores did not go up, the principals and teachers would be fired and replaced. Where scores didn’t go up for five years in a row, the schools would be closed. Thousands of educators were fired, and thousands of public schools were closed, based on the theory that sticks and carrots, rewards and punishments, would improve education.

But the 2015 NAEP scores released today by the National Assessment Governing Board (a federal agency) showed that Arne Duncan’s $4.35 billion Race to the Top program had flopped. It also showed that George W. Bush’s No Child Left Behind was as phony as the “Texas education miracle” of 2000, which Bush touted as proof of his education credentials.

NAEP is an audit test. It is given every other year to samples of students in every state and in about 20 urban districts. No one can prepare for it, and no one gets a grade. NAEP measures the rise or fall of average scores for states in fourth grade and eighth grade in reading and math and reports them by race, gender, disability status, English language ability, economic status, and a variety of other measures.

The 2015 NAEP scores showed no gains nationally in either grade in either subject. In mathematics, scores declined in both grades, compared to 2013. In reading, scores were flat in grade 4 and lower in grade 8. Usually the Secretary of Education presides at a press conference where he points with pride to increases in certain grades or in certain states. Two years ago, Arne Duncan boasted about the gains made in Tennessee, which had won $500 million in Duncan’s Race to the Top competition. This year, Duncan had nothing to boast about.

In his Race to the Top program, Duncan made testing the primary purpose of education. Scores had to go up every year, because the entire nation was “racing to the top.” Only 12 states won a share of the $4.35 billion that Duncan was given by Congress: Tennessee and Delaware were first to win, in 2010. The next round, the following states won multi-millions of federal dollars to double down on testing: Maryland, Massachusetts, the District of Columbia, Florida, Georgia, Hawaii, New York, North Carolina, Ohio, and Rhode Island.

Tennessee, Duncan’s showcase state in 2013, made no gains in reading or mathematics, neither in fourth grade or eighth grade. The black-white test score gap was as large in 2015 as it had been in 1998, before either NCLB or the Race to the Top.

The results in mathematics were bleak across the nation, in both grades 4 and 8. The declines nationally were only 1 or 2 points, but they were significant in a national assessment on the scale of NAEP.

In fourth grade mathematics, the only jurisdictions to report gains were the District of Columbia, Mississippi, and the Department of Defense schools. Sixteen states had significant declines in their math scores, and thirty-three were flat in relation to 2013 scores. The scores in Tennessee (the $500 million winner) were flat.

In eighth grade, the lack of progress in mathematics was universal. Twenty-two states had significantly lower scores than in 2013, while 30 states or jurisdictions had flat scores. Pennsylvania, Kansas, and Florida (a Race to the Top winner), were the biggest losers, by dropping six points. Among the states that declined by four points were Race to the Top winners Ohio, North Carolina, and Massachusetts. Maryland, Hawaii, New York, and the District of Columbia lost two points. The scores in Tennessee were flat.

The District of Columbia made gains in fourth grade reading and mathematics, but not in eighth grade. It continues to have the largest score gap – 56 points – between white and black students of any urban district in the nation. That is more than double the average of the other 20 urban districts. The state with the biggest achievement gap between black and white students is Wisconsin; it is also the state where black students have the lowest scores, lower than their peers in states like Mississippi and South Carolina. Wisconsin has invested heavily in vouchers and charter schools, which Governor Scott Walker intends to increase.

The best single word to describe NAEP 2015 is stagnation. Contrary to President George W. Bush’s law, many children have been left behind by the strategy of test-and-punish. Contrary to the Obama administration’s Race to the Top program, the mindless reliance on standardized testing has not brought us closer to some mythical “Top.”

No wonder Arne Duncan is leaving Washington. There is nothing to boast about, and the next set of NAEP results won’t be published until 2017. The program that he claimed would transform American education has not raised test scores, but has demoralized educators and created teacher shortages. Disgusted with the testing regime, experienced teachers leave and enrollments in teacher education programs fall. One can only dream about what the Obama administration might have accomplished had it spent that $5 billion in discretionary dollars to encourage states and districts to develop and implement realistic plans for desegregation of their schools, or had they invested the same amount of money in the arts.

The past dozen or so years have been a time when “reformers” like Arne Duncan, Michelle Rhee, Joel Klein, and Bill Gates proudly claimed that they were disrupting school systems and destroying the status quo. Now the “reformers” have become the status quo, and we have learned that disruption is not good for children or education.

Time is running out for this administration, and it is not likely that there will be any meaningful change of course in education policy. One can only hope that the next administration learns important lessons from the squandered resources and failure of NCLB and Race to the Top.

The Obama Administration’s (Smoke and Mirrors) Calls for “Less Testing”

For those of you who have not yet heard, last weekend the Obama Administration released a new “Testing Action Plan” in which the administration calls for a “decreased,” “curbed,” “reversed,” less “obsessed,” etc. emphasis on standardized testing for the nation. The plan, headlined as such, has hit the proverbial “front pages” of many news (and other) outlets since. See, for example, articles in The New York Times, The Huffington Post, The Atlantic, The Washington Post, CNN, US News & World Report, Education Week, and the like.

The gist of the “Testing Action Plan” is that student-level tests, when “[d]one poorly, in excess, or without clear purpose…take valuable time away from teaching and learning, draining creative approaches from our classrooms.” It is about time the federal government acknowledges this, officially, and kudos to them for “bear[ing] some of the responsibility for this” throughout the nation. However, they also assert that testing is, nevertheless, still essential as long as tests “take up the minimum necessary time, and reflect the expectation that students will be prepared for success.”

What is this “necessary time” of which they speak?

They set the testing limit for all states at no more than 2%. More specifically, they “recommend that states place a cap on the percentage of instructional time students spend taking required statewide standardized assessments to ensure that… [pause marker added] no child spends more than 2 percent of her classroom time taking these tests [emphasis added].” Notice the circumlocution here, echoing No Child Left Behind (NCLB) – the law that substantively helped bring us to become such a test-crazed nation in the first place.

When I first heard this, though, the first thing I did was pull out my trusty calculator to determine what this would actually mean in practice. If students across the nation attend school 180 days (which is standard), and they spend approximately 5 of approximately 6 hours of each of these 180 days in instruction (e.g., not including lunch), this would mean that students spend approximately 900 instructional hours in school every year (i.e., 180 days x 5 hours/day). Taking 2% of those 900 hours yields an approximate number of actual testing hours (as “recommended” and possibly soon to be mandated by the feds, pending a legislative act of Congress) equal to 18 hours per academic year. “Assess” for yourself whether you think that amount of testing time (i.e., 18 hours of just test taking per student across all public schools) is likely to reduce the nation’s current over-emphasis on testing, especially given this does not include the time it takes for what the feds also define as high-quality “test preparation strategies.”
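
For those who want to check or adjust this arithmetic (e.g., for a different calendar), here is a minimal sketch of the calculation; the 180-day year and 5 instructional hours per day are the same assumptions used above, not official federal figures.

```python
# Back-of-the-envelope check of the feds' 2% testing cap.
# Assumptions (same as in the paragraph above, not official figures):
# a 180-day school year and roughly 5 instructional hours per day.
days_per_year = 180
hours_per_day = 5
cap = 0.02  # the "Testing Action Plan" cap on required statewide testing

instructional_hours = days_per_year * hours_per_day  # 900 hours per year
testing_hours_allowed = cap * instructional_hours    # 18 hours per year

print(f"Instructional hours per year: {instructional_hours}")
print(f"Hours of testing allowed under the 2% cap: {testing_hours_allowed:.0f}")
```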

Nonetheless, President Obama also directed the U.S. Department of Education to review its test-based policies to address places where the feds may have contributed to the problem, and where they might now contribute to the (arguably token) solutions (i.e., by offering financial support to help states develop better and less burdensome tests, and by offering “expertise” to help states reduce time spent on testing – see the comment about the 2% limit prior). You can see their other strategies in their “Testing Action Plan.” Note, however, that the plan also clearly states that the feds will do this to help states still “meet [states’] policy objectives and requirements [as required] under [federal] law,” although the feds also state that they will become at least a bit more flexible on this end, as well.

In this regard, the feds express that they will provide more flexibility and support in terms of non-tested grades and subjects, and for states that wish to amend their NCLB flexibility waivers (e.g., in terms of evaluating out-of-tested-subject-area teachers). However, states will still be required to maintain their “teacher and leader evaluation and support systems that include [and rely upon] growth in student learning [emphasis added],” albeit with greater flexibility when determining how much weight to ascribe to teacher-level growth measures.

How clever of the feds to carry out such a smoke and mirrors explanation.

Another indicator of this is the fact that the 10 states the feds highlight in their “Testing Action Plan” as the states whose educational leaders are helping to lead these federal initiatives are as follows: Delaware, Florida, New Mexico, New York, North Carolina, Massachusetts, Minnesota, Rhode Island, Tennessee, and Washington DC. Seven of these 10 (all except Delaware, Minnesota, and Rhode Island) are the states about which I write blog posts most often, as these 7 states have the most draconian educational policies mandating high-stakes use of said tests across the nation. In addition, Massachusetts, New Mexico, and New York are the states leading the nation in terms of the national opt-out movement – and not because these states are leading the way in focusing less on said tests.

In addition, all of this was also based (at least in part, see also here) on new survey results recently released by the Council of the Great City Schools, in which researchers set out to determine how much time is spent on testing. They found that across their (large) district members, the average time spent testing was “surprisingly low [?!?]” at 2.34%, which study authors calculate to be approximately 4.22 total days spent on just testing (i.e., around 21 hours if one assumes, again, an average day’s instructional time = 5 hours). Again, this does not include time spent preparing for tests, nor does it include other non-standardized tests (e.g., those that teachers develop and use to assess their students’ learning).

So, really, the feds did not decrease the amount of time spent testing at all; they simply rounded down, shaving off 0.34 of a percentage point. For more information about this survey research study, click here.
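
To put a number on “rounded down,” here is the same arithmetic extended to compare the survey’s 2.34% average with the new 2% cap, again under the assumed 180-day, 5-hour-per-day calendar used above (my assumptions, not the survey’s exact ones):

```python
# Compare the reported average share of time spent testing (2.34%) with the
# 2% federal cap, using the same assumed calendar as above (180 days x 5 hours).
instructional_hours = 180 * 5  # 900 instructional hours per year
survey_share = 0.0234          # Council of the Great City Schools average
cap_share = 0.02               # federal "Testing Action Plan" cap

survey_hours = survey_share * instructional_hours  # about 21 hours (~4.2 days)
cap_hours = cap_share * instructional_hours        # 18 hours (3.6 days)

print(f"Current average: {survey_hours:.1f} hours (~{survey_hours / 5:.1f} school days)")
print(f"Under the 2% cap: {cap_hours:.1f} hours (~{cap_hours / 5:.1f} school days)")
print(f"Difference: {survey_hours - cap_hours:.1f} hours per year")
```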

If these two things are not both indicators of something (else) gone awry within the feds’ “Testing Action Plan,” not to mention the hype surrounding it, I don’t know what would be. I do know one thing for certain, though: it is way too soon to call this announcement or the Obama administration’s “Testing Action Plan” a victory. Relief from testing is really not on the way (see, for example, here). Additional details are to be released in January.

“Value-Less” Value-Added Data

Peter Greene, a veteran English teacher in Pennsylvania – a state using its own version of the Education Value-Added Assessment System (EVAAS) – wrote last week (October 5, 2015) in his Curmudgucation blog about his “Value-Less Data.” I thought it very important to share with you all, as he does a great job deconstructing one of the most widespread, and least research-supported, claims being made about using the data derived via value-added models (VAMs) to inform and improve what teachers do in their classrooms.

Greene pointedly critiques this claim, writing:

It’s autumn in Pennsylvania, which means it’s time to look at the rich data to be gleaned from our Big Standardized Test (called PSSA for grades 3-8, and Keystone Exams at the high school level).

We love us some value added data crunching in PA (our version is called PVAAS, an early version of the value-added baloney model). This is a model that promises far more than it can deliver, but it also makes up a sizeable chunk of our school evaluation model, which in turn is part of our teacher evaluation model.

Of course the data crunching and collecting is supposed to have many valuable benefits, not the least of which is unleashing a pack of rich and robust data hounds who will chase the wild beast of low student achievement up the tree of instructional re-alignment. Like every other state, we have been promised that the tests will have classroom teachers swimming in a vast vault of data, like Scrooge McDuck on a gold bullion bender. So this morning I set out early to the state’s Big Data Portal to see what riches the system could reveal.

Here’s what I can learn from looking at the rich data.

* the raw scores of each student
* how many students fell into each of the achievement subgroups (test scores broken down by 20 point percentile slices)
* if each of the five percentile slices was generally above, below, or at its growth target

Annnnd that’s about it. I can sift through some of that data for a few other features.

For instance, PVAAS can, in a Minority Report sort of twist, predict what each student should get as a score based on– well, I’ve been trying for six years to find someone who can explain this to me, and still nothing. But every student has his or her own personal alternate universe score. If the student beats that score, they have shown growth. If they don’t, they have not.

The state’s site will actually tell me what each student’s alternate universe score was, side by side with their actual score. This is kind of an amazing twist– you might think this data set would be useful for determining how well the state’s predictive legerdemain actually works. Or maybe a discrepancy might be a signal that something is up with the student. But no — all discrepancies between predicted and actual scores are either blamed on or credited to the teacher.

I can use that same magical power to draw a big target on the backs of certain students. I can generate a list of students expected to fall within certain score ranges and throw them directly into the extra test prep focused remediation tank. Although since I’m giving them the instruction based on projected scores from a test they haven’t taken yet, maybe I should call it premediation.

Of course, either remediation or premediation would be easier to develop if I knew exactly what the problem was.

But the website gives only raw scores. I don’t know what “modules” or sections of the test the student did poorly on. We’ve got a principal working on getting us that breakdown, but as classroom teachers we don’t get to see it. Hell, as classroom teachers, we are not allowed to see the questions, and if we do see them, we are forbidden to talk about them, report on them, or use them in any way. (Confession: I have peeked, and many of the questions absolutely suck as measures of anything).

Bottom line– we have no idea what exactly our students messed up to get a low score on the test. In fact, we have no idea what they messed up generally.

So that’s my rich data. A test grade comes back, but I can’t see the test, or the questions, or the actual items that the student got wrong.

The website is loaded with bells and whistles and flash-dependent functions along with instructional videos that seem to assume that the site will be used by nine-year-olds, combining instructions that should be unnecessary (how to use a color-coding key to read a pie chart) to explanations of “analysis” that isn’t (by looking at how many students have scored below basic, we can determine how many students have scored below basic).

I wish some of the reformsters who believe that BS [i.e., not “basic skills” but the “other” BS] Testing gets us rich data that can drive and focus instruction would just get in there and take a look at this, because they would just weep. No value is being added, but lots of time and money is being wasted.

Valerie Strauss also covered Greene’s post in her Answer Sheet Blog in The Washington Post here, in case you’re interested in seeing her take on this as well: “Why the ‘rich’ student data we get from testing is actually worthless.”

Florida’s Superintendents Have Officially “Lost Confidence” in the State’s Teacher and School Evaluation System

As reported in an article written by Valerie Strauss and featured yesterday in The Washington Post, school superintendents throughout the state of Florida have officially declared to the state that they have “lost confidence” in Florida’s teacher and school evaluation system (click here for the original article). More specifically, state school superintendents submitted a letter to the state asserting that “they ‘have lost confidence’ in the system’s accuracy and are calling for a suspension and a review of the system” (click here also for their original letter).

This is the evaluation model instituted by Florida’s former governor Jeb Bush, brother of former President George W. Bush and son of former President George H. W. Bush. Former President George W. Bush was the architect of what most now agree is the failed No Child Left Behind (NCLB) Act of 2001, which he put forth as federal policy on the strength of his “educational reforms” in the state of Texas in the late 1990s. Beyond this brotherly association, Jeb is also currently a candidate for the 2016 Republican presidential nomination, and via his campaign he is also (like his brother, pre-NCLB) touting his “educational reforms” in the state of Florida, hoping to take these reforms to the White House as well.

It is Jeb Bush’s “educational reforms” that, without research evidence in support, also became models of “educational reform” for other states around the country, including states like New Mexico. In New Mexico, the current evaluation system, based on the Florida model, is at the core of a potentially significant lawsuit (see prior posts about this lawsuit here, here, here, and here). Not surprisingly, one of Jeb Bush’s protégés – Hanna Skandera, currently the Secretary of New Mexico’s Public Education Department (PED) – took Bush’s model to even more of an extreme, making this particular model one of the most arbitrary and capricious of all such models currently in place throughout the U.S. (as also defined and detailed more fully here).

Hence, it should certainly be of interest to those in New Mexico (and other states) that the teacher and school evaluation model upon which their state model was built is now apparently falling apart, at least for now.

“The superintendents [in Florida] are calling for the state to suspend the system for a year – meaning that the scores from this spring’s administration of the [state’s] exams will not be used in [the state-level] evaluations [of teachers or schools].” The state school superintendents are also calling for “a full review” of the system, which certainly seems warranted in Florida, New Mexico, and any other state, for that matter, in which similar evaluation models are being paid for with taxpayer revenues, without much if any research evidence to support that they are working as intended, as sold, and as paid for.

No Longer in Washington State are Test Scores to be Tied to Teachers’ Evaluations

You might recall from a prior post that U.S. Secretary of Education Arne Duncan pulled the state of Washington’s No Child Left Behind (NCLB) waiver because the state did not meet the U.S. Department of Education’s timeline for tying teacher evaluations to student test scores (i.e., using growth or value-added modeling). Washington was the first to lose its waiver, but hopefully not the last.

Teachers in Washington achieved yet another victory in this regard.

As per a recent article in The Seattle Times (see also a related post here), it looks like “[a]fter four months of negotiations, a five-day [teachers] strike and one final all-night [negotiations] talk, the Seattle teachers union and Seattle Public Schools reached a tentative contract agreement” – an agreement that (if it is ultimately finalized) will:

  • Remove test scores from playing any role in teacher evaluations;
  • Give teachers significantly more say in how often students are tested;
  • Secure for teachers pay increases of 9.5 percent over three years, in addition to the state cost-of-living adjustment of 4.8 percent over two years;
  • Pay teachers for the longer school days they are to teach (longer school days are forthcoming);
  • Lower special-education student-teacher ratios;
  • Lower teacher specialists’ caseloads;
  • Allow for guaranteed recesses for students; etc.

As also noted in this article, this is certainly “groundbreaking,” especially given the federal politics and policies surrounding the use of test scores to evaluate teachers (see bullet #1 above), the issue most pertinent to followers of this blog.

Hopefully other states can, and will follow suit…

“Efficiency” as a Constitutional Mandate for Texas’s Educational System

The Texas Constitution requires that the state “establish and make suitable provision for the support and maintenance of an efficient system of public free schools,” as the “general diffusion of knowledge [is]…essential to the preservation of the liberties and rights of the people.” Following this notion, The George W. Bush Institute’s Education Reform Initiative recently released its first set of reports as part of its Productivity for Results series: “A Legal Lever for Enhancing Productivity.” The report was authored by an affiliate of The New Teacher Project (TNTP) – the non-profit organization founded by Michelle Rhee, the controversial former Chancellor of Washington DC’s public schools; an unknown and apparently unaffiliated “education researcher” named Krishanu Sengupta; and Sandy Kress, the “key architect of No Child Left Behind [under the presidential leadership of George W. Bush] who later became a lobbyist for Pearson, the testing company” (see, for example, here).

Authors of this paper review the economic and education research (although, if you look through the references, the strong majority of pieces come from economics research, which makes sense as this is an economically driven venture) to identify characteristics that typify efficient enterprises. More specifically, the authors use the principles of x-efficiency set out in the work of the highly respected Henry Levin, which require efficient organizations – in this case as (perhaps inappropriately) applied to schools – to have: 1) a clear objective function with measurable outcomes; 2) incentives that are linked to success on the objective function; 3) efficient access to useful information for decisions; 4) adaptability to meet changing conditions; and 5) use of the most productive technology consistent with cost constraints.

The authors also advance another series of premises, as related to this view of x-efficiency and its application to education/schools in Texas: (1) that “if Texas is committed to diffusing knowledge efficiently, as mandated by the state constitution, it should ensure that the system for putting effective teachers in classrooms and effective materials in the hands of teachers and students is characterized by the principles that undergird an efficient enterprise, such as those of x-efficiency;” (2) that this system must include value-added measurement systems (i.e., VAMs), deemed throughout this paper not only constitutional but also rational and in support of x-efficiency; (3) that “rational policies for teacher training, certification, evaluation, compensation, and dismissal are key to an efficient education system;” (4) that “the extent to which teacher education programs prepare their teachers to achieve this goal should [also] be [an] important factor;” (5) that “teacher evaluation systems [should also] be properly linked to incentives…[because]…in x-efficient enterprises, incentives are linked to success in the objective function of the organization;” (6) that this is contradictory with current, less x-efficient teacher compensation systems that link incentives to time on the job, or tenure, rather than to “the success of the organization’s function;” and (7) that, in the end, “x-efficient organizations have efficient access to useful information for decisions, and by not linking teacher evaluations to student achievement, [education] systems [such as the one in Texas will] fail to provide the necessary information to improve or dismiss teachers.”

The two districts highlighted in this report as being the most x-efficient in Texas include, to no surprise: “Houston [which] adds a value-added system to reward teachers, with student performance data counting for half of a teacher’s overall rating. HISD compares students’ academic growth year to year, under a commonly used system called EVAAS.” We’ve discussed not only this system but also its use in Houston often on this blog (see, for example, here, here, and here). Teachers in Houston who consistently perform poorly can be fired for “insufficient student academic growth as reflected by value added scores…In 2009, before EVAAS became a factor in terminations, 36 of 12,000 teachers were fired for performance reasons, or .3%, a number so low the Superintendent [Terry Grier] himself called the dismissal system into question. From 2004-2009, the district fired or did not renew 365 teachers, 140 for ‘performance reasons,’ including poor discipline management, excessive absences, and a lack of student progress. In 2011, 221 teacher contracts were not renewed, multiple for ‘significant lack of student progress attributable to the educator,’ as well as ‘insufficient student academic growth reflected by [SAS EVAAS] value-added scores’…In the 2011-12 school year, 54% of the district’s low-performing teachers were dismissed.” That’s “progress,” right?!?

Anyhow, for those of you who have not heard, this same (controversial) Superintendent, who pushed this system throughout his district, is retiring (see, for example, here).

The other district of (dis)honorable mention was the Dallas Independent School District; it also uses a VAM, called the Classroom Effectiveness Index (CEI), although I know less about this system as I have never examined or researched it myself, nor have I read much of anything about it. But in 2012, the district’s Board “decided not to renew 259 contracts due to poor performance, five times more than the previous year.” The “progress” such x-efficiency brings…

What is still worrisome to the authors, though, is that “[w]hile some districts appear to be increasing their efforts to eliminate ineffective teachers, the percentage of teachers dismissed for any reason, let alone poor performance, remains well under one percent in the state’s largest districts.” Relatedly – and I preface this by noting that this next argument is one of the most over-cited and hyper-utilized by organizations backing “initiatives” or “reforms” such as these – the authors add that this “falls well below the five to eight percent that Hanushek calculates would elevate achievement to internationally competitive levels”: “Calculations by Eric Hanushek of Stanford University show that removing the bottom five percent of teachers in the United States and replacing them with teachers of average effectiveness would raise student achievement in the U.S. 0.4 standard deviations, to the level of student achievement in Canada. Replacing the bottom eight percent would raise student achievement to the level of Finland, a top performing country on international assessments.” As Linda Darling-Hammond, also of Stanford, would argue, we cannot simply “fire our way to Finland.” Sorry, Eric! But this is based on econometric predictions, and no evidence exists whatsoever that this is in fact a valid inference. Nonetheless, it is cited over and over again by the same/similar folks (such as the authors of this piece) to justify their currently trendy educational reforms.

The major point here, though, is that “if Texas wanted to remove (or improve) the bottom five to eight percent of its teachers, the current evaluation system would not be able to identify them;” hence, the authors argue, the state desperately needs a VAM-based system to do this. Again, no research to counter this claim, or really any claim, is included in this piece; only the primarily economics-based literatures were selected in support.

In the end, though, the authors conclude that “While the Texas Constitution has established a clear objective function for the state school system and assessments are in place to measure the outcome, it does not appear that the Texas education system shares the other four characteristics of x-efficient enterprises as identified by Levin. Given the constitutional mandate for efficiency and the difficult economic climate, it may be a good time for the state to remedy this situation…[Likewise] the adversity and incentives may now be in place for Texas to focus on improving the x-efficiency of its school system.”

As I know and very much respect Henry Levin (see, for example, an interview I conducted with him a few years ago, with the shorter version here and the longer version here), I’d be curious to know what his response might be to the authors’ use of his x-efficiency framework to frame such neo-conservative (and again trendy) initiatives and reforms. Perhaps I will email him…

Special Issue of “Educational Researcher” (Paper #2 of 9): VAMs’ Measurement Errors, Issues with Retroactive Revisions, and (More) Problems with Using Test Scores

Recall from a prior post that the peer-reviewed journal Educational Researcher (ER) recently published a “Special Issue” including nine articles examining value-added measures (VAMs). I have reviewed the next of the nine articles (#2 of 9) here, titled “Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems” and authored by Dale Ballou – Associate Professor of Leadership, Policy, and Organizations at Vanderbilt University – and Matthew Springer – Assistant Professor of Public Policy, also at Vanderbilt.

As written into the article’s abstract, their “aim in this article [was] to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. [They focused] on four [problems]: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.”

Here is background on their perspective, so that you all can read and understand their forthcoming findings in context: “On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.”

Their claims of most interest, in my opinion and given their perspective as illustrated above, are as follows:

  • “Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates” (p. 78).
  • “[C]omparing a teacher of 25 students to [an equally effective] teacher of 100 students… the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher’s students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently” (p. 78). (See the illustrative simulation sketched just after this list.)
  • “[R]esources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error” (p. 78).
  • “Because many state administrative data systems are not up to [the data challenges required to calculate VAM output], many states have implemented procedures wherein teachers are called on to verify and correct their class rosters [i.e., roster verification]…[Hence]…the notion that teachers might manipulate their rosters in order to improve their value-added scores [is worrisome as the possibility of this occurring] obtains indirect support from other studies of strategic behavior in response to high-stakes accountability…These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game a test-based evaluation system…” (p. 80), especially if they view the system as unfair (this is my addition, not theirs) and regardless of the extent to which school or district administrators monitor the process or verify the final roster data. This is another gaming technique not often discussed, or researched.
  • Related, in one analysis these authors found that “students [who teachers] do not claim [during this roster verification process] have on average test scores far below those of the students who are claimed…a student who is not claimed is very likely to be one who would lower teachers’ value added” (p. 80). Interestingly, and inversely, they also found that “a majority of the students [they] deem[ed] exempt [were actually] claimed by their teachers [on teachers’ rosters]” (p. 80). They note that when either occurs, it is rare; hence, it should not significantly impact teachers’ value-added scores on the whole. However, this finding also “raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger” (p. 80).
  • In terms of teachers versus proctors or other teachers monitoring students when they take the large-scale standardized tests (that are used across all states to calculate value-added estimates), the researchers also found that “[a]t every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher” (p. 82). They believe this finding is more relevant than I do, in that the difference was one question (although when multiplied by the number of students included in a teacher’s value-added calculations this might be more noteworthy). In addition, I know of very few teachers, anymore, who are permitted to proctor their own students’ tests, but for those who still allow this, this finding might also be relevant. “An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know” (p. 83).
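
To make the class-size point in the second bullet above concrete, here is a small illustrative simulation (my own sketch, not the authors’ calculation): two hypothetical teachers of truly average effectiveness, one with 25 tested students and one with 100, are “rated” on their class-mean score gain and flagged as “ineffective” whenever that mean falls below a cutoff. The spread of student gains and the cutoff are made-up values chosen only for illustration.

```python
# Illustrative simulation (not the authors' calculation): with noisier
# class-mean scores, a small-class teacher of truly average effectiveness
# is flagged as "ineffective" far more often than a large-class teacher.
import random
import statistics

random.seed(1)

def flag_rate(class_size, n_sims=20000, cutoff=-0.2):
    """Share of simulated years in which an average teacher's class-mean
    gain falls below the cutoff purely by chance (the true effect is zero)."""
    flagged = 0
    for _ in range(n_sims):
        gains = [random.gauss(0, 1) for _ in range(class_size)]
        if statistics.mean(gains) < cutoff:
            flagged += 1
    return flagged / n_sims

for n in (25, 100):
    print(f"class of {n:>3}: flagged 'ineffective' in {flag_rate(n):.1%} of simulated years")
```

Under these made-up numbers, the 25-student teacher is flagged roughly 16% of the time versus roughly 2% for the 100-student teacher – on the order of the 4-to-12-times disparity the authors describe – even though both teachers are, by construction, equally (and averagely) effective.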

The authors also critique, quite extensively in fact, the Education Value-Added Assessment System (EVAAS) used statewide in North Carolina, Ohio, Pennsylvania, and Tennessee, and in many districts elsewhere. In particular, they take issue with the model’s use of the conventional t-test statistic to identify a teacher for whom they are 95% confident (s)he differs from average. They also take issue with the EVAAS practice whereby teachers’ EVAAS scores change retroactively as more data become available, to get at more “precision,” even though teachers’ scores can change well after the initial score is registered (and used for whatever purposes) – sometimes one or two years later.
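
For readers unfamiliar with what a “95% confident” flag means in this context, here is a minimal sketch of that kind of check (an illustration of the general idea, not the EVAAS computation, and the numbers are hypothetical): a teacher is declared different from average only if zero falls outside a 95% confidence interval built from the value-added estimate and its standard error.

```python
# Minimal sketch of a 95% confidence check on a value-added estimate.
# This illustrates the general idea only; it is not the EVAAS computation.
def differs_from_average(estimate, standard_error, z=1.96):
    """True if the estimate is statistically distinguishable from 0
    (the 'average teacher' benchmark) at roughly the 95% level."""
    lower = estimate - z * standard_error
    upper = estimate + z * standard_error
    return lower > 0 or upper < 0

# Same point estimate, different amounts of data behind it (hypothetical numbers):
print(differs_from_average(-0.3, standard_error=0.10))  # True: flagged as below average
print(differs_from_average(-0.3, standard_error=0.25))  # False: too noisy to tell
```

Note how the same point estimate is or is not flagged depending only on how noisy it is, which is exactly why the retroactive revisions described next are so consequential.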

“This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions…revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher’s performance above the threshold at which these sanctions would apply?…[Hence,] it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance” (p. 79).

Hence, “a state that [makes] a practice of issuing revised ‘improved’ estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts” (p. 83).

*****

If interested, see the Review of Article #1 – the introduction to the special issue here.

Article #2 Reference: Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86. doi:10.3102/0013189X15574904

EVAAS, Value-Added, and Teacher Branding

I do not think I ever shared this video out, and now, following up on another post about the potential impact these videos can really have, I thought this an appropriate time to share it. “We can be the change,” and social media can help.

My former doctoral student and I put together this video, after conducting a study with teachers in the Houston Independent School District and more specifically four teachers whose contracts were not renewed due in large part to their EVAAS scores in the summer of 2011. This video (which is really a cartoon, although it certainly lacks humor) is about them, but also about what is happening in general in their schools, post the adoption and implementation (at approximately $500,000/year) of the SAS EVAAS value-added system.

To read the full study from which this video was created, click here. Below is the abstract.

The SAS Educational Value-Added Assessment System (SAS® EVAAS®) is the most widely used value-added system in the country. It is also self-proclaimed as “the most robust and reliable” system available, with its greatest benefit to help educators improve their teaching practices. This study critically examined the effects of SAS® EVAAS® as experienced by teachers, in one of the largest, high-needs urban school districts in the nation – the Houston Independent School District (HISD). Using a multiple methods approach, this study critically analyzed retrospective quantitative and qualitative data to better comprehend and understand the evidence collected from four teachers whose contracts were not renewed in the summer of 2011, in part given their low SAS® EVAAS® scores. This study also suggests some intended and unintended effects that seem to be occurring as a result of SAS® EVAAS® implementation in HISD. In addition to issues with reliability, bias, teacher attribution, and validity, high-stakes use of SAS® EVAAS® in this district seems to be exacerbating unintended effects.