Same Teachers, Similar Students, Similar Tests, Different Answers

One of my favorite studies to date about VAMs was conducted by John Papay, an economist once at Harvard and now at Brown University. In the study, titled “Different Tests, Different Answers: The Stability of Teacher Value-Added Estimates Across Outcome Measures” and published in 2009 in the 3rd best and most reputable peer-reviewed journal, the American Educational Research Journal, Papay presents evidence that different yet similar tests (i.e., similar in content, and similar in when the tests were administered to similar sets of students) do not provide similar answers about teachers’ value-added performance. This is a validity issue: if tests are measuring the same things for the same folks at the same times, similar if not the same results should be realized. But they are not. Rather, Papay found low-to-moderate rank correlations, ranging from r=0.15 to r=0.58, among the value-added estimates derived from the different tests.

Yet another recently released study (albeit not yet peer-reviewed) has found similar results, potentially solidifying this finding further in our understanding of VAMs and their issues, particularly in terms of validity (or truth in VAM-based results). This study, “Comparing Estimates of Teacher Value-Added Based on Criterion- and Norm-Referenced Tests,” released by the U.S. Department of Education and conducted by four researchers representing the University of Notre Dame, Basis Policy Research, and the American Institutes for Research, provides evidence, again, that different yet similar tests (i.e., in this case a criterion-referenced state assessment and a widely used norm-referenced test given in the same subject around the same time) yield only moderately correlated estimates of teacher-level value-added.

If we had confidence in the validity of the inferences based on value-added measures, these correlations (or more simply put “relationships”) should be much higher than what the researchers found, which was similar to what Papay found: correlations in the range of 0.44 to 0.65. While the ideal correlation coefficient in this case is r=+1.0, that is very rarely achieved. But for the purposes for which teacher-level value-added is currently being used, correlations above r=+0.70 or r=+0.80 would (and should) be most desired, and possibly required, before high-stakes decisions about teachers are made based on these data.
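For readers who want to see what a rank correlation actually computes, here is a minimal sketch in Python. The six pairs of teacher estimates are made up for illustration and come from neither study; the function names are likewise hypothetical. The idea is simply to rank teachers on each test and then correlate the two sets of ranks.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical value-added estimates for six teachers on two similar tests.
test_a = [0.30, -0.10, 0.05, 0.22, -0.25, 0.12]
test_b = [0.10, 0.02, -0.15, 0.25, -0.05, 0.18]
print(round(spearman(test_a, test_b), 2))
```

If the two tests ranked teachers identically, the function would return +1.0; the further the two rankings diverge, the further the value falls below that.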

In addition, the researchers found that, on average, only 33.3% of teachers were positioned in the same range of scores (using quintiles, or bands each 20% wide) by both sets of value-added estimates in the same school year. This too has implications for validity in that, again, teachers’ value-added estimates should fall in the same ranges when similar tests are used, if any valid inferences are to be made using value-added estimates.
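The quintile comparison can be sketched in a few lines of Python. This is only an illustration of the general idea, not the researchers’ actual procedure: the rank-based banding rule and the function names are assumptions of mine, and the example data are invented.

```python
def quintiles(estimates):
    """Assign each estimate to a band from 1 (lowest fifth) to 5 (highest fifth) by rank."""
    n = len(estimates)
    order = sorted(range(n), key=lambda i: estimates[i])
    bands = [0] * n
    for position, index in enumerate(order):
        bands[index] = position * 5 // n + 1
    return bands


def same_quintile_share(estimates_a, estimates_b):
    """Share of teachers placed in the same quintile by both sets of estimates."""
    bands_a = quintiles(estimates_a)
    bands_b = quintiles(estimates_b)
    matches = sum(1 for a, b in zip(bands_a, bands_b) if a == b)
    return matches / len(bands_a)


# Hypothetical estimates for ten teachers; the second test reverses the ordering.
test_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
test_b = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(same_quintile_share(test_a, test_a))  # identical rankings
print(same_quintile_share(test_a, test_b))  # fully reversed rankings
```

With five equal bands, agreement of roughly 20% is what chance alone would produce, which is a useful yardstick when reading a figure like 33.3%.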

Again, please note that this study has not yet been peer-reviewed. While some results naturally seem to make more sense, like the one reviewed above, peer review matters equally for studies whose results we might tend to agree with and those whose results we might tend to reject. Take anything that is not peer-reviewed, for that matter, as just that: a study with methods and findings not critically vetted by the research community.

In Response to a Tennessee Assistant Principal’s Concerns – Part II Mathematics

In our most recent post, we conducted the first of two (this is the second) follow-up analyses to examine the claims put forth in another recent post about a Tennessee assistant principal’s suspicions regarding his state’s value-added scores, as measured by the Tennessee Value-Added Assessment System (TVAAS; publicly available here). This is the final installment of this series, but this time we focus on the mathematics value-added scores and, probably not surprisingly, illustrate below quite similar results.

Again, we analyzed data from the 10 largest school districts in the state based on population data (found here). We looked at grade-level value-added scores for 3rd through 8th grade mathematics. For each grade level, we calculated the percentage of schools per district that had positive value-added scores for 2013 (see Table 1) and for their three-year composite scores, since these are often considered more reliable (see Table 2).

We were particularly interested in knowing how likely a grade level was to get a positive value-added score and to see if there were any trends across districts. Consistent with our English/Language Arts (ELA) findings, similar trends in mathematics were apparent as well. For clarity purposes, and as we did with the ELA findings, we color-coded the chart: green signifies the grade levels for which 75% or more of the schools received a positive value-added score, while red signifies the grade levels for which 25% or fewer of the schools received a positive value-added score.
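The calculation behind each cell in the tables is simple enough to sketch in Python. The school-level scores below are hypothetical, and treating any score above zero as “positive” is my assumption about how the published TVAAS figures would be read; only the 75% and 25% thresholds come from the description above.

```python
def percent_positive(scores):
    """Percentage (0-100) of schools whose value-added score is above zero."""
    return 100 * sum(1 for s in scores if s > 0) / len(scores)


def color_code(percent):
    """Apply the color thresholds described above."""
    if percent >= 75:
        return "green"
    if percent <= 25:
        return "red"
    return "none"


# Hypothetical value-added scores for one grade level across eight schools.
grade_scores = [1.2, 0.4, -0.3, 2.1, 0.8, 0.1, -1.0, 0.6]
pct = percent_positive(grade_scores)
print(pct, color_code(pct))
```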

Table 1: Percent of Schools that had Positive Value-Added Scores by Grade and District (2013) Mathematics

District            3rd Grade  4th Grade  5th Grade  6th Grade  7th Grade  8th Grade
Memphis             42%        91%        70%        37%        84%        55%
Nashville-Davidson  NA         60%        65%        70%        90%        44%
Knox                72%        87%        57%        64%        93%        67%
Hamilton            31%        86%        84%        38%        76%        78%
Shelby              97%        97%        61%        75%        100%       69%
Sumner              81%        77%        50%        67%        75%        58%
Montgomery          NA         86%        71%        14%        86%        57%
Rutherford          79%        100%       67%        62%        77%        85%
Williamson          NA         67%        21%        100%       100%       56%
Murfreesboro        NA         100%       60%        90%        NA*        NA*

Though the mathematics scores were not as glaringly biased as the ELA scores, there were some alarming trends to notice. In particular, the 4th and 7th grade value-added scores were consistently higher than those of the 3rd, 5th, 6th, and 8th grades in mathematics, which had much greater variation across districts. In fact, all districts had at least 75% of their schools receive positive value-added scores in 7th grade and at least 60% in 4th grade. To recall, the 7th grade scores in ELA were drastically lower, with all districts having no more than 50% of their schools receive positive value-added scores (five of which had fewer than 25%). The 6th and 8th grades also had more variation in mathematics than in ELA.
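For anyone who wants to check these statements against Table 1, here is a short Python sketch that takes the 2013 percentages from the table (with NA entered as None) and reports the lowest and highest percentage per grade across districts. The variable and function names are mine.

```python
GRADES = ["3rd", "4th", "5th", "6th", "7th", "8th"]

# Percent of schools with positive value-added scores, copied from Table 1 (None = NA).
TABLE_1 = {
    "Memphis":            [42, 91, 70, 37, 84, 55],
    "Nashville-Davidson": [None, 60, 65, 70, 90, 44],
    "Knox":               [72, 87, 57, 64, 93, 67],
    "Hamilton":           [31, 86, 84, 38, 76, 78],
    "Shelby":             [97, 97, 61, 75, 100, 69],
    "Sumner":             [81, 77, 50, 67, 75, 58],
    "Montgomery":         [None, 86, 71, 14, 86, 57],
    "Rutherford":         [79, 100, 67, 62, 77, 85],
    "Williamson":         [None, 67, 21, 100, 100, 56],
    "Murfreesboro":       [None, 100, 60, 90, None, None],
}


def grade_range(table, grade_index):
    """Lowest and highest percentage for one grade, skipping districts with NA."""
    values = [row[grade_index] for row in table.values() if row[grade_index] is not None]
    return min(values), max(values)


for index, grade in enumerate(GRADES):
    low, high = grade_range(TABLE_1, index)
    print(f"{grade} grade: {low}% to {high}%")
```

The 4th and 7th grade minimums (60% and 75%) match the claims above, while the 5th and 6th grade ranges show how much wider the spread is elsewhere.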

Table 2: Percent of Schools that had Positive Value-Added Scores by Grade and District (Three-Year Composite) Mathematics

District            3rd Grade  4th Grade  5th Grade  6th Grade  7th Grade  8th Grade
Memphis             NA         90%        72%        62%        96%        78%
Nashville-Davidson  NA         73%        67%        89%        97%        74%
Knox                NA         96%        79%        90%        93%        93%
Hamilton            NA         89%        80%        52%        90%        71%
Shelby              NA         97%        79%        75%        100%       69%
Sumner              NA         92%        65%        90%        92%        100%
Montgomery          NA         95%        85%        71%        86%        71%
Rutherford          NA         100%       79%        69%        92%        85%
Williamson          NA         91%        22%        100%       100%       89%
Murfreesboro        NA         100%       90%        100%       NA         NA

As for the three-year composite scores, schools across the state were much more likely to receive positive value-added scores than negative value-added scores in all tested grade levels. This, compared to the ELA scores where 6th and 7th grades struggled to earn positive scores, suggests that there is some level of subject bias going on here. Specifically, a majority of schools across all districts received positive value-added scores at each grade level for mathematics, with the small exception of 5th grade in one school district. For 4th and 7th grades, almost every school received positive scores.

Again, of most importance here is how we choose to interpret these results. By Tennessee’s standard (given their heavy reliance on the TVAAS to evaluate teachers), our conclusion would be that the mathematics teachers are, overall, more effective than the ELA teachers in almost every tested grade level (with the exception of 8th grade ELA), regardless of school district.

Perhaps a more reasonable explanation, though, is that there is some bias in the tests upon which the TVAAS scores are based (likely related to issues with the vertical scaling of Tennessee’s tests, not to mention other measurement errors). Far more students across the state demonstrated growth in mathematics than in ELA (for the past three years at least). To simply assume that this is caused by teacher effectiveness is crass at best.


Analysis conducted by Jessica Holloway-Libell

A Consumer Alert Issued by The 21st Century Principal

In an excellent post just released by The 21st Century Principal, the author writes about two more companies calculating value-added for school districts, again on the taxpayers’ dime. Teacher Match and Hanover Research are the companies specifically named and targeted for marketing and selling a series of highly false assumptions about teaching and teachers, highly false claims about value-added (without empirical research in support), highly false assertions about how value-added estimates can be used for better teacher evaluation/accountability, and highly false sales pitches about what they as value-added/research “experts” can do to help with the complex statistics needed for the above.

The main points of the article, as I see them and in order of priority, follow:

  1. School districts are purchasing these “products” based entirely on the promises and related marketing efforts of these (and other) companies. Consumer Alert! Instead of accepting these (and other) companies’ sales pitches and promises that these companies’ “products” will do what they say they will, these companies must be forced to produce independent, peer-reviewed research to prove that what they are selling is in fact real. If they can’t produce the studies, they should not earn the contracts!!
  2. Doing all of this is just another expensive drain on what are already short educational resources. One district is paying over $30,000 to Teacher Match per year for their services, as cited in this piece. Related, the Houston Independent School District is paying SAS Inc. $500,000 per year for their EVAAS-based value-added calculations. These are not trivial expenditures, especially when considering the other potential research-based initiatives on which these valuable resources could otherwise be spent.
  3. States (and the companies selling their value-added services) haven’t done the validation studies to prove that the value-added scores/estimates are valid. Again, the sales and marketing claims made by these companies are almost always void of evidence that supports the claims being made.
  4. Doing all of this elevates standardized testing even higher in the decision-making and data-driven processes for schools, even though doing this is not warranted or empirically supported (as mentioned).
  5. Related, value-added calculations rely on inexpensive (aka “cheap”) large-scale tests, also of questionable validity, that still are not designed for the purposes for which they are being tasked and used (e.g., measuring growth upwards cannot be done without tests with equivalent scales, which really no tests at this point have).

The shame in all of this, besides the major issues mentioned in the five points above, is that the federal government, thanks to US Secretary of Education Arne Duncan and the Obama administration, is incentivizing these and other companies (e.g. SAS EVAAS, Mathematica) to exist, construct and sell such “products,” and then seek out and compete for these publicly funded and subsidized contracts. We, as taxpayers, are the ones consistently footing the bills.

See another recent article about the chaos a simple error in Mathematica’s code caused in Washington DC’s public schools, following another VAMboozled post about the same topic two weeks ago.


Student Learning Objectives, aka Student Growth Objectives, aka Another Attempt to Quantify “High Quality” Teaching

After a previous post about VAMs v. Student Growth Percentiles (SGPs) (see also VAMs v. SGPs Part II) a reader posted a comment asking for more information about the utility of SGPs, but also about the difference between SGPs and Student Growth Objectives.

“Student Growth Objectives” is a new term for an older concept that is being increasingly integrated into educational accountability systems nationwide, and that is also under scrutiny (see one of Diane Ravitch’s recent posts about this here). But the concept underlying Student Growth Objectives (SGOs) is essentially just Student Learning Objectives (SLOs). Why they insist on using the term “growth” in place of the term “learning” is perhaps yet another fad. Related, it also likely has something to do with various legislative requirements (e.g., Race to the Top terminologies), although evidence in support of this transition is also void.

Regardless, and put simply, an SGO/SLO is an annual goal for measuring student growth/learning of the students instructed by teachers (or principals, for school-level evaluations) who are not eligible to participate in a school’s or district’s value-added or student growth model. This includes the vast majority of teachers in most schools or districts (e.g., 70+%), because only those teachers who instruct reading/language arts or mathematics in state achievement tested grade levels, typically grades 3-8, are eligible to participate in the VAM or SGP evaluation system. Hence, SGOs/SLOs were developed because administrators and others were either unwilling to allow these exclusions to continue or were forced to establish a mechanism to include the other teachers to meet some legislative mandate.

New Jersey, for example, defines an SGO as “a long-term academic goal that teachers set for groups of students and must be: Specific and measureable; Aligned to New Jersey’s curriculum standards; Based on available prior student learning data; A measure of what a student has learned between two points in time; Ambitious and achievable” (for more information click here).

Denver Public Schools has been using SGOs for many years; their 2008-2009 Teacher Handbook states that an SGO must be “focused on the expected growth of [a teacher’s] students in areas identified in collaboration with their principal,” as well as that the objectives must be “Job-based; Measurable; Focused on student growth in learning; Based on learning content and teaching strategies; Discussed collaboratively at least three times during the school year; May be adjusted during the school year; Are not directly related to the teacher evaluation process; [and] Recorded online” (for more information click here).

That being said, and in sum, SGOs/SLOs, like VAMs, are not supported with empirical work. As Jersey Jazzman summarized very well in his post about this, the correlational evidence is very weak, the conclusions drawn by outside researchers are a stretch, and the rush to implement these measures is just as unfounded as the rush to implement VAMs for educator evaluation. We don’t know that SGOs/SLOs make a difference in distinguishing “good” from “poor” teachers; and in fact, some could argue (like Jersey Jazzman does) that they don’t actually do so much of anything at all. They’re just another metric being used in the attempt to quantify “high quality” teaching.

Thanks to Dr. Sarah Polasky for this post.

Stanford Professor, Dr. Edward Haertel, on VAMs

In a recent speech and subsequent paper, Dr. Edward Haertel, National Academy of Education member and Professor at Stanford University, writes about VAMs and the extent to which VAMs, being based on student test scores, can be used to make reliable and valid inferences about teachers and teacher effectiveness. This is a must-read, particularly for those out there who are new to the research literature in this area. Dr. Haertel is certainly an expert here, actually one of the best we have, and in this piece he captures the major issues well.

Some of the issues highlighted include concerns about the tests used to model value-added and how their scales (falsely assumed to be as objective and equal as units on a measuring stick) complicate and distort VAM-based estimates. He also discusses the general issues with the tests almost if not always used when modeling value-added (i.e., the state-level tests mandated as per No Child Left Behind in 2002).

He discusses why VAM estimates are least trustworthy, and most volatile and error prone, when used to compare teachers who work in very different schools with very different student populations – students who do not attend schools in randomized patterns and who are rarely if ever randomly assigned to classrooms. The issues with bias, as highlighted by Dr. Haertel and also in a recent VAMboozled! post with a link to a new research article here, are probably the most major VAM-related problems/issues going. As captured in his words, “VAMs will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in” (Haertel, 2013, pp. 12-13).

He reiterates issues with reliability, or a lack thereof. As per one research study he cites, researchers found that “a minimum of 10% of the teachers in the bottom fifth of the distribution one year were in the top fifth the next year, and conversely. Typically, only about a third of 1 year’s top performers were in the top category again the following year, and likewise, only about a third of 1 year’s lowest performers were in the lowest category again the following year. These findings are typical [emphasis added]…[While a] few studies have found reliabilities around .5 or a little higher…this still says that only half the variation in these value-added estimates is signal, and the remainder is noise [and/or error, which makes VAM estimates entirely invalid about half of the time]” (Haertel, 2013, p. 18).
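To get a feel for what a reliability of .5 implies, here is a small Python simulation. It is a sketch under a deliberately simple assumption that is mine, not Dr. Haertel’s: each teacher’s yearly estimate is a stable “signal” plus independent yearly “noise,” with the two contributing equal shares of the variance. All names and settings are hypothetical.

```python
import random


def simulate(n_teachers=20000, reliability=0.5, seed=0):
    """Simulate two years of estimates as a stable signal plus independent noise."""
    rng = random.Random(seed)
    signal_sd = reliability ** 0.5        # share of variance that is signal
    noise_sd = (1 - reliability) ** 0.5   # share of variance that is noise
    signal = [rng.gauss(0, signal_sd) for _ in range(n_teachers)]
    year1 = [s + rng.gauss(0, noise_sd) for s in signal]
    year2 = [s + rng.gauss(0, noise_sd) for s in signal]
    return year1, year2


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def top_fifth(values):
    """Indices of the teachers whose estimates fall in the top 20%."""
    cutoff = sorted(values)[int(len(values) * 0.8)]
    return {i for i, v in enumerate(values) if v >= cutoff}


year1, year2 = simulate()
stay = len(top_fifth(year1) & top_fifth(year2)) / len(top_fifth(year1))
print("year-to-year correlation:", round(pearson(year1, year2), 2))
print("share of top fifth still in top fifth:", round(stay, 2))
```

Under this setup the year-to-year correlation lands near .5, and a large share of one year’s top fifth drops out of the top fifth the next year even though no teacher’s underlying “signal” changed at all.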

Dr. Haertel also discusses other correlations among VAM estimates and teacher observational scores, VAM estimates and student evaluation scores, and VAM estimates taken from the same teachers at the same time but using different tests, all of which also yield abysmally (and unfortunately) low correlations, similar to those mentioned above.

His bottom line? “VAMs are complicated, but not nearly so complicated as the reality they are intended to represent” (Haertel, 2013, p. 12). They just do not measure well what so many believe they measure so very well.

Again, to find out more reasons and more in-depth explanations as to why, click here for the full speech and subsequent paper.

Mr. T’s Scores on the DC Public Schools’ IMPACT Evaluation System

After our recent post regarding the DC Public Schools’ IMPACT Evaluation System, and Diane Ravitch’s follow-up, a DC teacher wrote to Diane expressing his concerns about his DC IMPACT evaluation scores, attaching the scores he recently received after his supervising administrator and a master educator observed the same 30-minute lesson he recently taught to the same class.

First, take a look at his scores summarized below. Please note that other supportive “evidence” (e.g., notes re: what was observed to support and warrant the scores below) was available, but for purposes of brevity and confidentiality this “evidence” is not included here.

As you can easily see, these two evaluators were very much NOT on the same evaluation page, again, when observing the same thing during the same time at the same instructional occasion.

Criterion Definition                                                        Administrator Score (Mean = 1.44)  Master Educator Score (Mean = 3.11)
TEACH 1   Lead Well-Organized, Objective-Driven Lessons                     1 (Ineffective)                    4 (Highly Effective)
TEACH 2   Explain Content Clearly                                           1 (Ineffective)                    3 (Effective)
TEACH 3   Engage Students at All Learning Levels in Rigorous Work           1 (Ineffective)                    3 (Effective)
TEACH 4   Provide Students Multiple Ways to Engage with Content             1 (Ineffective)                    3 (Effective)
TEACH 5   Check for Student Understanding                                   2 (Minimally Effective)            4 (Highly Effective)
TEACH 6   Respond to Student Understandings                                 1 (Ineffective)                    3 (Effective)
TEACH 7   Develop Higher-Level Understanding through Effective Questioning  1 (Ineffective)                    2 (Minimally Effective)
TEACH 8   Maximize Instructional Time                                       2 (Minimally Effective)            3 (Effective)
TEACH 9   Build a Supportive, Learning-Focused Classroom Community          3 (Effective)                      3 (Effective)

Overall, Mr. T (an obvious pseudonym) received a 1.44 from his supervising administrator and a 3.11 from the master educator, with scores ranging from 1 = Ineffective to 4 = Highly Effective.
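The two mean scores, and just how little the two raters agreed, can be reproduced from the nine TEACH scores in a few lines of Python. The Cohen’s kappa calculation is my own addition for illustration: kappa is normally computed across many observed lessons, so treating the nine criteria of a single lesson as the units of analysis is an assumption, not something the district or Mr. T reported.

```python
# Scores on TEACH 1 through TEACH 9, taken from the table above.
ADMINISTRATOR = [1, 1, 1, 1, 2, 1, 1, 2, 3]
MASTER_EDUCATOR = [4, 3, 3, 3, 4, 3, 2, 3, 3]


def mean(scores):
    """Arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


def exact_agreement(rater_a, rater_b):
    """Share of criteria on which both raters gave the identical score."""
    return sum(1 for a, b in zip(rater_a, rater_b) if a == b) / len(rater_a)


def cohens_kappa(rater_a, rater_b, categories=(1, 2, 3, 4)):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)


print(round(mean(ADMINISTRATOR), 2), round(mean(MASTER_EDUCATOR), 2))
print(round(exact_agreement(ADMINISTRATOR, MASTER_EDUCATOR), 2))
print(round(cohens_kappa(ADMINISTRATOR, MASTER_EDUCATOR), 3))
```

The raters gave the same score on only one of the nine criteria (TEACH 9), and the kappa value sits barely above zero, which is roughly what chance agreement alone would look like.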

This is particularly important, as illustrated in the prior post (Footnote 8 of the full piece to be exact), because “Teacher effectiveness ratings were based on, in order of importance by the proportion of weight assigned to each indicator [including first and foremost]: (1) scores derived via [this] district-created and purportedly “rigorous” (Dee & Wyckoff, 2013, p. 5) yet invalid (i.e., not having been validated) observational instrument with which teachers are observed five times per year by different folks, but about which no psychometric data were made available (e.g., Kappa statistics to test for inter-rater consistencies among scores).” For all DC teachers, this is THE observational system used, and for 83% of them these data are weighted at 75% of their total “worth” (Dee & Wyckoff, 2013, p. 10). This is precisely the system that is receiving (and gaining) praise, especially as it has thus far led to teacher bonuses (professedly up to $25,000 per year) as well as terminations of more than 500 teachers (≈ 8%) throughout DC’s Public Schools. Yet as evident here, again, this system has some fatal flaws and serious issues, despite its praised “rigor” (Dee & Wyckoff, 2013, p. 5).

See also ten representative comments taken from both the administrator’s evaluation form and the master educator’s evaluation form. Revealed here, as well, are MAJOR issues and discrepancies that should not occur in any “objective” and “reliable” evaluation system, especially in one to which such major consequences are attached and that has been, accordingly, so “rigorously” praised (Dee & Wyckoff, 2013, p. 5).

Administrator’s Comments:
1. The objective was not posted nor verbally articulated during the observation… Students were asked what the objective was and they looked to the board but when they saw no objective.
2. There was limited evidence that students mastered the content based on the work they produced.
3. Explanations of content weren’t clear and coherent based on student responses and the level of attention that Mr. T had to pay to most students.
4. Students were observed using limited academic language throughout the observation.
5. The lesson was not accessible to students and therefore posed too much challenge based on their level of ability.
6. [T]here wasn’t an appropriate balance between teacher‐directed and student‐centered learning.
7. There was limited higher-level understanding developed based on verbal conferencing or work products that were created.
8. Through [checks for understanding] Mr. T was able to get the pulse of the class… however there was limited evidence that Mr. T understood the depth of student understanding.
9. There were many students that had misunderstandings based on student responses from putting their heads down to moving to others to talk instead of work.
10. Inappropriate behaviors occurred regularly within the classroom.

Master Educator’s Comments:
1. Mr. T was highly effective at leading well-organized, objective-driven lessons.
2. Mr. T’s explanations of content were clear and coherent, and they built student understanding of content.
3. All parts of Mr. T’s lesson significantly moved students towards mastery of the objective as evidenced by students.
4. Mr. T included learning styles that were appropriate to students needs and all students responded positively and were actively involved.
5. Mr. T’s explanations of content were clear and coherent, and they built student understanding of content.
6. Mr. T was effective at engaging students at all levels in accessible and challenging work.
7. Students had adequate opportunities to meaningfully practice, apply, and demonstrate what they are learning.
8. Mr. T always used appropriate strategies to ensure that students moved toward higher-level understanding.
9. Mr. T was effective at maximizing instructional time…Inappropriate or off-task student behavior never interrupted or delayed the lesson.
10. Mr. T was effective at building a supportive, learning-focused classroom community. Students were invested in their work and valued academic success.

In sum, as Mr. T wrote in his email to Diane, while he is “fortunate enough to have a teaching position that is not affected by VAM nonsense…that doesn’t mean [he’s] completely immune from a flawed system of evaluations.” This “supposedly ‘objective’ measure seems to be anything but.” Is the administrator correct in positioning Mr. T as ineffective? Or might it be, perhaps, that the master educator was “just being too soft”? Either way, “it’s confusing and it’s giving [Mr. T.] some thought as to whether [he] should just spend the school day at [his] desk working on [his] resumé.”

Our thanks to Mr. T for sharing his DC data, and for sharing his story!

Why VAMs & Merit Pay Aren’t Fair

An “oldie” (i.e., published about one year ago), but a goodie! This one is already posted in the video gallery of this site, but it recently came up again as a good, short (three-minute) video that captures some of the main issues.
Check it out and share as (so) needed!

Six Reasons Why VAMs and Merit Pay Aren’t Fair