“Evaluation Systems that are Byzantine at Best and At Worst, Draconian”

To ring in the New Year, Valerie Strauss at The Washington Post recently released the top 11 articles of the year from her education blog, “The Answer Sheet.” The top article focused on a letter to the Post in which a teacher explained why he finally decided to leave the teaching “profession.” For multiple reasons, he writes of both his sadness and discontent, and most pertinently here, given the nature of this blog, he writes (as highlighted in the title of this post) of “Evaluation Systems that are Byzantine at Best and At Worst, Draconian.”

He writes: “My profession is being demeaned by a pervasive atmosphere of distrust, dictating that teachers cannot be permitted to develop and administer their own quizzes and tests (now titled as generic “assessments”) or grade their own students’ examinations. The development of plans, choice of lessons and the materials to be employed are increasingly expected to be common [i.e., the common core] to all teachers in a given subject. This approach not only strangles creativity, it smothers the development of critical thinking in our students and assumes a one-size-fits-all mentality more appropriate to the assembly line than to the classroom [i.e., value-added and its inputs and outputs].”

He continues: “‘[D]ata driven’ education seeks only conformity, standardization, testing and a zombie-like adherence to the shallow and generic Common Core…Creativity, academic freedom, teacher autonomy, experimentation and innovation are being stifled in a misguided effort to fix what is not broken in our system of public education.”

He then concludes: “After writing all of this I realize that I am not leaving my profession, in truth, it has left me. It no longer exists.”

Take a read for yourself, as there is much more to read not directly related to the teacher evaluation systems he protests. And perhaps take a read, with a New Year’s resolution to help this not [continue to] happen to others.




Stanford Professor Darling-Hammond on America’s Testing Fixation and Frenzy

Just recently on National Public Radio (NPR), Linda Darling-Hammond, a Stanford professor and the runner-up to current secretary Arne Duncan for President Obama’s appointment as US Secretary of Education, was interviewed about why she thought “School Testing Systems Should Be Examined In 2014.”

Her reasons?

  • Post-No Child Left Behind (NCLB, a law that positioned states as the steroids of educational reform), America’s public schools have seen no substantive changes, or more specifically no gains for the better, as intended. According to Darling-Hammond, “We’re actually not doing any better than we were doing a decade ago,” when NCLB was first signed into law (2002).
  • “When No Child Left Behind was passed back in 2002, there was a target set for each year for each school [in each state] that [students in each state] would get to a place where 100 percent of students would be, quote/unquote ‘proficient’ on the state tests. Researchers knew even then that would be impossible.” Accordingly, and in many ways unfortunately, all states have since failed to meet this target (i.e., 100% proficiency), a target falsely assumed attainable and used as political rhetoric by both Republicans and Democrats to endorse and pass NCLB, one of the first bipartisan educational policies of its time.
  • “Testing has some utility, if you use it in thoughtful ways” but in our country we are lacking serious thought and consideration about that which tests can and should do versus that which they cannot and should not do.

Moving forward we MUST “change our policies around the nature of testing, the amount of testing, and the uses of testing…and move [forward] from a test-and-punish philosophy – which was the framework for No Child Left Behind…to an assess-and-improve philosophy.” More importantly, we MUST “address childhood poverty,” as this, not testing and holding teachers accountable for their students’ test scores, is where true reform should be positioned.

While “certainly [a] good education is a [good] way out of poverty…we [still must] address some of these issues that adhere to poverty itself” and perpetually cause (and are significantly and substantially related to) low levels of student learning and low levels of achievement. “The two are completely intertwined, and we have to work on both at the same time.”

US Dept of Education Rejects District’s Concerns

A school district in South Carolina, educating approximately 45,000 K-12 students, received a US Dept of Education Teacher Incentive Fund (TIF) grant and recently requested a one-year extension to delay the full implementation of its TIF-funded (and federally backed to the tune of $23.7 million) teacher evaluation system.

The US Dept of Education noted that the district received TIF funding as based on the district’s assurance that it would meet the grant requirements along with its timeline. The district superintendent said that she “wasn’t inclined to move forward with any district-wide evaluation system until she had more confidence in it than she [currently] does.”

The biggest concerns are about the system’s value-added component. The US Dept of Education gave the district a bit of leeway on these expressed concerns, but only in terms of the value-added component for the district’s high school teachers.

District administrators and its (supportive) school board will soon have to have an open discussion about next steps, with one option being to return their remaining TIF funds to the US Dept of Education. As stated by one district teacher, “It seems to me that the path forward is clear…It’s time to let go of the grant.”

Check out the full article here.

More Value-Added Problems in DC’s Public Schools

Over the past month I have posted two entries about what’s going on in DC’s public schools with the value-added-based teacher evaluation system developed and advanced by the former School Chancellor Michelle Rhee and carried on by the current School Chancellor Kaya Henderson.

The first post was about a bogus “research” study in which National Bureau of Economic Research (NBER)/University of Virginia and Stanford researchers advanced overstated and false claims that the system was indeed working and effective, despite the fact that (among other problems) 83% of the teachers in the study did not have student test scores available to measure their “value added.” The second post was about a DC teacher’s experiences being evaluated under this system (as part of the aforementioned 83%) almost solely via his administrator’s and master educator’s observational scores. That post demonstrated, with data, how error prone this part of the DC system also proved to be.

Adding to the value-added issues in DC, it was just revealed by DC public school officials (the day before winter break), and then in two Washington Post articles (see the first article here and the second here), that 44 DC public school teachers also received incorrect evaluation scores for the last academic year (2012-2013) because of technical errors in the ways the scores were calculated. One of the 44 teachers was fired as a result, although (s)he is now looking to be reinstated and compensated for the salary lost.

While “[s]chool officials described the errors as the most significant since the system launched a controversial initiative in 2009 to evaluate teachers in part on student test scores,” they also downplayed the situation as impacting only 44 teachers.

VAM formulas are certainly “subject to error,” and they are subject to error always, across the board: for teachers in general as well as for the 470 DC public school teachers with value-added scores based on student test scores. Put more accurately, just under 12% (n=470) of all DC teachers (n=4,000) were evaluated using their students’ test scores, which is even less than the roughly 17% implied by the 83% figure mentioned above. And for nearly 10% of these teachers (n=44), calculation errors were found.
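The percentages above can be checked with a few lines of arithmetic; a minimal sketch, using only the figures reported in this post:

```python
# A quick sanity check of the percentages discussed above
# (all figures are those reported in the post, not new data).

total_teachers = 4000   # approximate number of DC public school teachers
with_vam_scores = 470   # teachers evaluated using their students' test scores
with_errors = 44        # teachers whose scores were miscalculated

share_with_vam = with_vam_scores / total_teachers     # share of all teachers with VAM scores
share_with_errors = with_errors / with_vam_scores     # share of VAM-scored teachers with errors

print(f"Share of teachers with VAM scores: {share_with_vam:.1%}")
print(f"Share of those with calculation errors: {share_with_errors:.1%}")
```

That is, roughly 12% of DC teachers received value-added scores at all, and nearly one in ten of those scores was miscalculated.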

This is not a “minor glitch,” as written into a recent Huffington Post article covering the same story, which positions the teachers’ unions as almost irrational for “slamming the school system for the mistake and raising broader questions about the system.” It is a major glitch caused both by inappropriate “weightings” of teachers’ administrators’ and master educators’ observational scores, as well as by “a small technical error” that directly impacted the teachers’ value-added calculations. It is a major glitch with major implications, about which others, including not just those from the unions but many (e.g., 90%) from the research community, are concerned. It is a major glitch that warrants additional caution about this AND all of the other statistical and other errors not mentioned but prevalent in all value-added scores (e.g., the errors always found in large-scale standardized tests, particularly given their non-equivalent scales; the errors caused by missing data; the errors caused by small class sizes; the errors caused by summer learning loss/gains; the errors caused by other teachers’ simultaneous and carry-over effects; the errors caused by parental and peer effects [see also this recent post about these]; etc.).

So what type of consequence is in store for those perpetuating such nonsense? This includes, particularly here, those charged with calculating and releasing value-added “estimates” (“estimates” because these are not and should never be interpreted as hard data), but also the reporters who report on these issues without understanding them or reading the research about them. I, for one, would like to see them held accountable for the “value” they too are to “add” to our thinking about these social issues, rather than detracting and distracting readers from the very real, research-based issues at hand.

Who “Added Value” to My Son’s Learning and Achievement?

‘Tis the beginning of the holiday break and the end of the second semester of my son’s 5th grade year. This past Friday morning, after breakfast and before my son’s last day of school, he explained to me how his B grades in reading and language arts from his first grading period in 5th grade had now moved up to two A scores in both subject areas.

When I asked him why he thought he dramatically improved both scores, he explained how he and his friends really wanted to become “Millionaires” as per their Accelerated Reading (AR) program word count goals. He thought this helped him improve his reading and language arts homework/test/benchmark scores (i.e., peer effects). He also explained how he thought that my husband and I requesting that he read every night and every morning before school also helped him become a better reader (surprise!) and increase his homework/test/benchmark scores (i.e., parental effects). And he explained how my helping him understand his tests/benchmarks, how to take them, how to focus, and how to (in my terms) analyze and exclude the “distractor” items (i.e., the items that are often times “right,” but not as “right” as the best and most correct response options, yet that are purposefully placed on multiple choice tests to literally “distract” test-takers from the correct answers) also helped his overall performance (i.e., the effects of having educated/professional parent(s)).

So, who “added value” to my son’s learning and achievement? Therein lies the rub. Will the complicated statistics used to capture my son’s teacher’s value-added this year take all of this into consideration, or better yet, control for it and factor it out? Will controlling for my son’s 4th grade scores (and other demographic variables as available) help to account for and counter that which happens in our home? I highly doubt it (with research evidence in support).

Complicating things further, my son’s 4th grade teacher last year liked my son, and, knowing that I am a professor in a college of education, was sure to place my son in arguably “the best” 5th grade teacher’s class this year. His teacher is indeed wonderful, and not surprisingly a National Board Certified Teacher (NBCT), but my being a professor gave my son a distinct advantage…and perhaps gave his current teacher a “value-added” advantage as well.

Herein lies another rub. Who else besides my son was non-randomly placed into this classroom with this 5th grade teacher? And what other “external effects” might other parents of other students in this class be having on their own children’s learning and achievement, outside of school, similar to those explained by my son Friday morning? Can this too be taken into consideration, or better yet, controlled for and factored out? I highly doubt it, again.

And what will the implications for this teacher be when it comes time to measure her value-added? Lucky her, she will likely get the kudos and perhaps the monetary bonus she truly deserves, thanks in large part, though, to so many things that were, and continue to be, outside of her control as a teacher…things she did not, even being a phenomenal teacher, cause or have a causal impact on. Yet another uncontrollable factor that must be considered.

What’s Happening in Tennessee?

VAMs have been used in Tennessee for more than 20 years, coming about largely “by accident.” In the late 1980s, an agricultural statistician/adjunct professor at the University of Tennessee, Knoxville – William Sanders – thought that educators struggling with student achievement in the state could “simply” use more advanced statistics, similar to those used when modeling genetic and reproductive trends among cattle, to measure growth, hold teachers accountable for that growth, and solve the educational measurement woes facing the state at the time.

Sanders developed the Tennessee Value-Added Assessment System (TVAAS), which is now known as the Education Value-Added Assessment System (EVAAS®) and operated by SAS® Institute Inc. Nowadays, the SAS® EVAAS® is widely considered, with over 20 years of development, the largest, one of the best, one of the most widely adopted and used, and likely the most controversial VAM in the country. It is controversial in that it is a proprietary model (i.e., it is costly and used/marketed under exclusive legal rights of the inventors/operators), and it is akin to a “black box” model (i.e., it is protected by a good deal of secrecy and mystery, largely given its non-transparency). Nonetheless, on the SAS® EVAAS® website, developers continue to make grandiose marketing claims without much caution or really any research evidence in support (e.g., using the SAS® EVAAS® will provide “a clear path to achieve the US goal to lead the world in college completion by the year 2020”). Riding on such claims, they continue to sell their model to states (e.g., Tennessee, North Carolina, Ohio, Pennsylvania) and districts (e.g., the Houston Independent School District) regardless, at taxpayers’ expense.

In fact, thanks to similar marketing efforts and promotions, not to mention lobbying efforts coming out of Tennessee, VAMs have now been forced upon America’s public schools by the Obama administration’s Race to the Top program. Not surprisingly, Tennessee was one of the first states to receive Race to the Top funds, to the tune of $502 million, to further advance the SAS® EVAAS® model, which is still referred to as the TVAAS in the state of Tennessee.

Well, it seems things are not going so well in Tennessee. As highlighted in this recent article, it seems that boards of education across the state are increasingly opposing the TVAAS for high-stakes use (e.g., using TVAAS estimates when issuing, renewing or denying teacher licenses).

Some of the key issues?

  • The TVAAS is too complex to understand.
  • Teachers’ scores are highly and unacceptably inconsistent from one year to the next and, hence, invalid.
  • Teachers are being held accountable for things that are out of their control (e.g., what happens outside of school).
  • There are major concerns about the extent to which teachers actually cause changes in TVAAS estimates over time.
  • The state does not have any improvement plans in place for teachers with subpar TVAAS estimates.
  • These issues are being further complicated by simultaneously changing standards (e.g., the adoption of the Common Core); and the like.

Interestingly enough, researchers have been raising precisely these same concerns since the early 1990s when the TVAAS was first implemented in Tennessee. Had policymakers listened to researchers’ research-based concerns, many a taxpayer dollar would have been saved!

VAMboozled! Post “Best of the Ed Blogs”…Plus

A recent VAMboozled! post titled “Unpacking DC’s Impact, or the Lack Thereof” was selected as one of the National Education Policy Center’s (NEPC’s) “Best of the Ed Blogs,” a feature on the NEPC’s website highlighting interesting and insightful blog posts on education policy and today’s most important education topics.

The same post was highlighted in an article Diane Ravitch wrote for the Talking Points Memo titled “Did Michelle Rhee’s Policies In D.C. Work?” Another good read, capturing the major issues with not only DC’s teacher evaluation systems but also with national policy trends in education.



Charter v. Public School Students: NAEP 2013 Performance

On November 19, Diane Ravitch blogged in response to the Thomas B. Fordham Institute’s latest report on Ohio charter school students’ National Assessment of Educational Progress (NAEP) performance – that students in traditional public schools significantly outperformed charter school students on the NAEP (the nation’s best test). In her post, Dr. Ravitch states, “It was findings like these that convinced me that the proliferation of charter schools was no panacea; that most charter schools were no better and possibly weaker than traditional public schools.” She asked us at VAMboozled! to probe further.

Thanks to Nicole Blalock, PhD, a postdoctoral scholar at Arizona State University, for this analysis and entry. Nicole writes:

Aside from discussions around the validity of utilizing NAEP “proficiency” as a reliable measure of academic learning, it is important to focus on a broader view of the performance of students in charter schools, beyond the single state analyzed in the aforementioned report. However, “it is difficult – not to mention scientifically invalid – to make blanket comparisons of charter schools to traditional public schools” (“Charter Schools,” n.d.). Nonetheless, because the proliferation of charters relies on theories that charters and increased school choice improve student achievement (Bulkley, 2003), it is also appropriate to do just this to evaluate charter effectiveness on a large scale.

About half of states (n≈23) across the U.S. had enough charter schools to sample for the NAEP in 2013, such that charter students’ performance could be compared to public school students’ performance. And in 2013, students in charter schools performed no differently than students in public schools in approximately half of these states (7-13 of the 23 states in each grade/subject pair). In about half of the states with significant performance differences (4-7 of the 7-13 states in each grade/subject pair), students attending public schools outperformed students attending charter schools, while in the other half charter school students outperformed public school students. The differences between charter and public school students, by state, were really no different than the flip of a coin.

In four of the 7-13 states with a statistical difference between charter school students’ NAEP scores and public school students’ NAEP scores, statistical differences were observed for all grade/subject pairs tested: Alaska, Maryland, Ohio, and Pennsylvania. On average, in Alaska, students attending charter schools outperformed students in public schools by approximately 10 points on most grade/subject area tests and by more than 20 points in grade 4 reading. However, the National Alliance for Public Charter Schools, which ranked the 42 states with charter schools and the District of Columbia as per their charter school laws, ranked Alaska nearly the lowest (i.e., 41st of 43) for the “best” charter laws (“Measuring up to the model,” 2013). Put differently, the state whose charter school students performed among the best as compared to their public school peers just happened to be one of the worst charter states as externally ranked.

Otherwise, public school students outperformed charter school students in the other three states (i.e., Maryland, Ohio, and Pennsylvania) with consistent and significant score differences across the board. Maryland was one of two states to be ranked lower than Alaska for the “best” charter laws overall (i.e., 42nd of 43), and Ohio and Pennsylvania ranked in the middle of the pack (27th and 19th of 43 respectively). Each of these states demonstrated charter school student performance that lagged behind public school students by an average of 23 points.

In the other states with relatively larger numbers of charters, results were also mixed. Texas, a large state with over 200 charter schools, came in at the middle of the aforementioned “best” charter laws rankings (position 23rd of 43), and on average 4th grade charter school students in Texas scored 20 points lower on their NAEP reading assessment than their public school peers. That difference went away among grade 8 students tested in reading, and there was no average difference in performance measured among students tested in mathematics.

In California, another large charter state, one that boasts more than 1300 charter schools and ranked 7th of 43 in terms of “best” charter laws, grade 4 charter students scored 10 points higher in reading than their public school peers, yet no differences in performance were measured for grade 8 students in reading or for grade 4 and grade 8 students tested in mathematics.

In Arizona, a state often characterized by policy makers as the leader in school choice (Gresham, Hess, Maranto, & Milliman, 2000), with over 600 approved charters and a ranking on the aforementioned list of 13th of 43 in terms of “best” charter laws, charter and public school students performed about the same.

How did students in Minnesota, the state to write the first charter school law in 1991 and ranked 1st of the 43 for “best” charter laws fare? On average, there was no difference between the performance of students in charters and students in public schools there either.

Hence, it is clear that the performance of charter school students, when compared to public school students, was mixed on the 2013 NAEP. Charter schools are certainly no panacea, as often claimed and/or intended to be.


Bulkley, K. (2003). A decade of charter schools: From theory to practice. Educational Policy 17(3), 317-342. doi:10.1177/0895904803017003002

Charter schools. (n.d.). Retrieved December 15, 2013, from http://www.nea.org/home/16332.htm

Gresham, A., Hess, F., Maranto, S., & Milliman, R. (2000). Desert bloom: Arizona’s free market in education. Phi Delta Kappan, 18(10), 751-757.

Measuring up to the model: A tool for comparing state charter school laws. (2013). Retrieved December 15, 2013, from http://www.publiccharters.org/law/

Haertel: “Are Teacher-Level VAM Scores Good for Anything?”

In my most recent post, I highlighted Dr. Edward Haertel’s recent speech and subsequent paper about VAMs. As referenced in that post, Dr. Haertel did a solid and thorough review of the major issues with VAMs.

In his conclusion, it is also worth mentioning his answer, with which I (for the most part) agree, to what he positions as a bottom-line question: “Are teacher-level VAM scores good for anything?”

He writes: “Yes, absolutely.” But only for some purposes, and for some purposes “they must be used with considerable caution.”

For example, VAMs are useful “for researchers comparing large groups of teachers to investigate the effects of teacher training approaches or educational policies, or simply to investigate the size and importance of long-term teacher effects…[I]t is clear that value-added scores are far superior to unadjusted end-of-year student test scores” (Haertel, 2013, p. 23).

VAMs are indeed better alternatives to the snapshot or status models used for decades past to measure student achievement at one point in time (e.g., like a snapshot), typically once per year. With this, most if not all VAM researchers agree, largely because these snapshot tests were/are incapable of capturing where students were, in terms of their academic achievement before entering a classroom or school, and where they ended up X months later. VAMs are far-better alternatives to such prior approaches in that VAMs at least attempt to measure student growth over time.

The more nuanced question to ask, though, is whether VAMs offer much more than other statistical and analytical alternatives (e.g., ANCOVA [Analysis of Covariance] and multiple regression models) when conducting such large-scale analyses. I disagree that, at the very least, VAMs might be uniquely useful when conducting large-scale research, because the other methods often used to measure gains or the impacts of large-scale interventions and policies over time perform just as well. If anything, VAMs might make statistical differences more difficult to come by and shrink effect sizes because of the multiple (and sometimes hyper) controls typically used.

VAMs are not the only nor are they the best method to examine things like large-scale interventions and policies. Hence, I disagree with the statement that, “for research purposes, VAM estimates definitely have a place” (Haertel, 2013, p. 23). Plenty of other models perform well for such large-scale educational research studies.

Dr. Haertel also writes, “A considerably riskier use, but one I would cautiously endorse, would be providing individual teachers’ VAM estimates to the teachers themselves and to their principals, provided all 5 of the following critically important conditions are met” (Haertel, 2013, p. 25):

  1. Scores are based on sound and appropriate student tests;
  2. Comparisons are limited to homogeneous teacher groups (to facilitate apples to apples comparisons);
  3. Fixed weights (e.g., 50% of a teacher’s effectiveness score will be based on VAM output) are NOT used, and educators retain full rights in terms of how they interpret their VAM scores, in context, all things considered, case-by-case, etc. Educators also retain rights to completely set aside VAM scores — to ignore them completely — if educators have specific information that could plausibly render the VAM output invalid;
  4. Users are well trained and able to interpret scores validly; and
  5. Clear and accurate information about VAM levels of uncertainty (e.g., margins of error) are disclosed and continuously made available so that all can be critical consumers of the data that pertain directly to them.
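As an illustration of condition 5, here is a minimal sketch of what disclosing uncertainty might look like in practice. The estimate and standard error below are invented, hypothetical numbers, not values from any real evaluation system:

```python
# Hypothetical illustration of condition 5: reporting a VAM estimate
# together with its margin of error rather than as a bare point score.
# The numbers below are invented for illustration only.

vam_estimate = 0.15     # hypothetical teacher "effect" in student SD units
standard_error = 0.12   # hypothetical standard error of that estimate

margin = 1.96 * standard_error                    # 95% margin of error
low, high = vam_estimate - margin, vam_estimate + margin

print(f"Estimate: {vam_estimate:+.2f} (95% CI {low:+.2f} to {high:+.2f})")
if low < 0 < high:
    print("The interval includes zero: these data cannot distinguish "
          "this teacher from an average teacher.")
```

Even a seemingly positive point estimate, once its margin of error is disclosed, may be statistically indistinguishable from zero, which is exactly why Haertel insists this information be continuously made available.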

With these assertions I agree, though with a very cautious endorsement as well. VAMs are not “all” wrong, but they are often wrong, actually too often wrong, especially at the teacher level when a lot of “things” cannot be statistically controlled or accounted for.

Accordingly, “VAMs may have a modest place in teacher evaluation systems, but only as an adjunct to other information [and] used in a context where teachers and principals have genuine autonomy in their decisions about using and interpreting teacher effectiveness estimates [i.e. VAM output] in local contexts” (Haertel, 2013, p. 25).


Stanford Professor, Dr. Edward Haertel, on VAMs

In a recent speech and subsequent paper written by Dr. Edward Haertel – National Academy of Education member and Professor at Stanford University – he writes about VAMs and the extent to which VAMs, being based on student test scores, can be used to make reliable and valid inferences about teachers and teacher effectiveness. This is a must-read, particularly for those out there who are new to the research literature in this area. Dr. Haertel is certainly an expert here, actually one of the best we have, and in this piece he captures the major issues well.

Some of the issues highlighted include concerns about the tests used to model value-added and how their scales (falsely assumed to be as objective and equal as units on a measuring stick) complicate and distort VAM-based estimates. He also discusses the general issues with the tests almost if not always used when modeling value-added (i.e., the state-level tests mandated as per No Child Left Behind in 2002).

He discusses why VAM estimates are least trustworthy, and most volatile and error prone, when used to compare teachers who work in very different schools with very different student populations – students who do not attend schools in randomized patterns and who are rarely if ever randomly assigned to classrooms. The issues with bias, as highlighted by Dr. Haertel and also in a recent VAMboozled! post with a link to a new research article here, are probably the most major VAM-related problems/issues going. As captured in his words, “VAMs will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in” (Haertel, 2013, pp. 12-13).

He reiterates issues with reliability, or a lack thereof. As per one research study he cites, researchers found that “a minimum of 10% of the teachers in the bottom fifth of the distribution one year were in the top fifth the next year, and conversely. Typically, only about a third of 1 year’s top performers were in the top category again the following year, and likewise, only about a third of 1 year’s lowest performers were in the lowest category again the following year. These findings are typical [emphasis added]…[While a] few studies have found reliabilities around .5 or a little higher…this still says that only half the variation in these value-added estimates is signal, and the remainder is noise [and/or error, which makes VAM estimates entirely invalid about half of the time]” (Haertel, 2013, p. 18).
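Haertel's reliability point can be made concrete with a small simulation. This is a sketch under my own simplifying assumption (not his model): each year's VAM score is a stable teacher signal plus independent noise, with reliability 0.5, the optimistic end of the figures he cites, so that half the variance in any one year's scores is noise:

```python
# Simulate year-to-year volatility of scores with reliability 0.5:
# score = sqrt(0.5)*signal + sqrt(0.5)*noise, so half of each year's
# variance is a stable teacher signal and half is independent noise.
import random

random.seed(42)
N = 20000            # simulated teachers
RELIABILITY = 0.5    # share of score variance that is stable signal

signal = [random.gauss(0, 1) for _ in range(N)]
year1 = [RELIABILITY**0.5 * s + (1 - RELIABILITY)**0.5 * random.gauss(0, 1)
         for s in signal]
year2 = [RELIABILITY**0.5 * s + (1 - RELIABILITY)**0.5 * random.gauss(0, 1)
         for s in signal]

def quintile(scores):
    """Assign each teacher a fifth: 0 = bottom fifth, 4 = top fifth."""
    ranked = sorted(range(N), key=lambda i: scores[i])
    q = [0] * N
    for rank, i in enumerate(ranked):
        q[i] = rank * 5 // N
    return q

q1, q2 = quintile(year1), quintile(year2)
bottom = [i for i in range(N) if q1[i] == 0]
stay = sum(1 for i in bottom if q2[i] == 0) / len(bottom)
jump = sum(1 for i in bottom if q2[i] == 4) / len(bottom)

print(f"Bottom-fifth teachers still in the bottom fifth next year: {stay:.0%}")
print(f"Bottom-fifth teachers in the TOP fifth next year: {jump:.0%}")
```

Even in this optimistic scenario, a majority of bottom-fifth teachers leave the bottom fifth the following year, and some land in the top fifth outright; with the lower reliabilities typical of real VAM estimates, the churn Haertel describes is worse still.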

Dr. Haertel also discusses correlations between VAM estimates and teacher observational scores, between VAM estimates and student evaluation scores, and between VAM estimates taken from the same teachers at the same time but using different tests, all of which are abysmally (and unfortunately) low, similar to those mentioned above.

His bottom line? “VAMs are complicated, but not nearly so complicated as the reality they are intended to represent” (Haertel, 2013, p. 12). They just do not measure well what so many believe they measure so very well.

Again, to find out more reasons and more in-depth explanations as to why, click here for the full speech and subsequent paper.