Student Learning Objectives (SLOs): What (Little) We Know about Them Besides We Are to Use Them

Following up on a recent post, a VAMboozled! follower – Laura Chapman – wrote the comment below about Student Learning Objectives (SLOs) that I found important to share with you all. SLOs are objectives that are teacher-developed and administrator-approved to help hold teachers accountable for their students’ growth, although growth in this case is individually and loosely defined, which makes SLOs about as subjective as it gets. Ironically, SLOs serve as alternatives to VAMs when teachers who are VAM-ineligible need to be held accountable for “growth.”

Laura commented that I should write more about SLOs, as states are increasingly adopting them without much, if any, research evidence in support of the concept, much less the practice. That may sound more surprising than it is: very little research has been conducted on SLOs, yet. One research document of which I am aware, and which I reviewed here, was written by Mathematica and published by the US Department of Education here: “Alternative student growth measures for teacher evaluation: Profiles of early-adopting districts.”

Conducting a search on ERIC, I found only two additional pieces, also contracted out and published by the US Department of Education, although the first piece is more about describing what states are doing in terms of SLOs than about researching the actual properties of SLOs. The second piece better illustrates the fact that “very little of the literature on SLOs addresses their statistical properties.”

What little we do know about SLOs at this point, however, is two-fold: (1) “no studies have looked at SLO reliability” and (2) “[l]ittle is known about whether SLOs can yield ratings that correlate with other measures of teacher performance” (i.e., one indicator of validity). The very few studies in which researchers have examined this found “small but positive correlations” between SLOs and VAM-based ratings (i.e., not a strong indicator of validity).

With that being said, if any of you are aware of research I should review or if any of you have anything to say or write about SLOs in your states, districts, or schools, feel free to email me at audrey.beardsley@asu.edu.

In the meantime, do also read what Laura wrote about SLOs here:

I appreciate your work on the VAM problem. Equal attention needs to be given to the use of SLOs for evaluating teacher education in so-called untested and non-tested subjects. It has been estimated that about 65-69% of teachers have job assignments for which there are not state-wide tests. SLOs (and variants) are the proxy of choice for VAM. This writing exercise is required in at least 27 states, with pretest-posttest and/or baseline to post-test reports on student growth. Four reports from USDE (2014) [I found three] show that there is no empirical research to support the use of the SLO process (and associated district-devised tests and cut-off scores) for teacher evaluation.

The template for SLOs originated in Denver in 1999. It has been widely copied and promoted via publications from USDE’s “Reform Support Network,” which operates free of any need for evidence and with few constraints other than marketing a deeply flawed product. SLO templates in wide use have no peer-reviewed evidence to support their use for teacher evaluation…not one reliability study, not one study addressing their validity for teacher evaluation.

SLO templates in Ohio and other states are designed to fit the teacher-student data link project (funded by Gates and USDE since 2005). This means that USDE’s proposed evaluations of specific teacher education programs (e.g., art education at Ohio State University) will be aided by the use of extensive “teacher of record” data routinely gathered by schools and districts, including personnel files that typically require the teacher’s college transcripts, degree earned, certifications, scores on tests for any teacher license and so on.

There are technical questions galore, but a big chunk of the data of interest to the promoters of this latest extension of the Gates/USDE’s rating game is already in place.

I have written about the use of SLOs as a proxy for VAM in an unpublished paper titled The Marketing of Student Learning Objectives (SLOs): 1999-2014. A pdf with references can be obtained by request at chapmanLH@aol.com.

Combining Different Tests to Measure Value-Added, Continued

Following up on a recent post regarding whether “Combining Different Tests to Measure Value-Added [Is] Valid?” an anonymous VAMboozled! follower emailed me an article I had never read on the topic. I found the article and its contents, as related to this question, important enough to share with you all here.

The 2005 article titled “Adjusting for Differences in Tests,” authored by Robert Linn – a highly respected scholar in the academy and also Distinguished Professor Emeritus in Research & Evaluation Methodology at the University of Colorado at Boulder – and published by the esteemed National Academy of Sciences, is all about “[t]he desire to treat scores obtained from different tests as if they were interchangeable, or at least comparable.” This is very much in line with current trends to combine different tests to measure value-added, given statisticians’ and (particularly in the universe of VAMs) econometricians’ “natural predilections” to find (and force) solutions to what they see as very rational and practical problems. Hence, this paper is still very relevant today.

To equate tests, as per Dorans and Holland (2000) as cited by Linn, there are five requirements that must be met. These requirements “enjoy a broad professional consensus” and also pertain to test equating as it is done within VAMs:

  1. The Equal Construct Requirement: Tests that measure different constructs (e.g., mathematics versus reading/language arts) should not be equated. States that are using different subject area tests as pretests for other subject area tests to measure growth over time, for example, are in serious violation of this requirement.
  2. The Equal Reliability Requirement: Tests that measure the same construct but which differ in reliability (i.e., consistency at the same time or over time as measured in a variety of ways) should not be equated. Most large-scale tests, often depending on whether they yield norm- or criterion-referenced data given the normed or state standards to which they are aligned, yield different levels of reliability. Likewise, the types of consequences attached to test results throw this off as well.
  3. The Symmetry Requirement: The equating function for equating the scores of Y to those of X should be the inverse of the equating function for equating the scores of X to those of Y. I am unaware of studies that have addressed this, as it is typically overlooked and assumed (see the sketch following this list).
  4. The Equity Requirement: It ought to be a matter of indifference for an examinee to be tested by either one of the two tests that have been equated. It is not, however, a matter of indifference in the universe of VAMs, as best demonstrated by the research conducted by Papay (2010) in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study.
  5. The Population Invariance Requirement: The choice of (sub)population used to compute the equating function between the scores on tests X and Y should not matter. In other words, the equating function used to link the scores of X and Y should be population invariant. As also noted in Papay (2010), even when using the same populations, the choice between tests X and Y DOES matter.
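To make the symmetry and population invariance requirements a bit more concrete, here is a minimal sketch of simple linear equating in Python, using hypothetical means and standard deviations (none of these numbers come from any actual test). It is an illustration of the idea only, not a description of any state’s or vendor’s actual equating procedure.

```python
# Hypothetical summary statistics for two tests (illustrative numbers only).
mu_x, sd_x = 200.0, 25.0   # mean and standard deviation of scores on test X
mu_y, sd_y = 500.0, 80.0   # mean and standard deviation of scores on test Y

def equate_x_to_y(x):
    """Linear equating: map a test-X score onto the test-Y scale."""
    return mu_y + (sd_y / sd_x) * (x - mu_x)

def equate_y_to_x(y):
    """The inverse function: map a test-Y score back onto the test-X scale."""
    return mu_x + (sd_x / sd_y) * (y - mu_y)

x = 230.0
y = equate_x_to_y(x)
print(y)                  # 596.0 on the Y scale
print(equate_y_to_x(y))   # 230.0 -- the symmetry requirement holds in this simple case

# The population invariance requirement says this conversion should not depend on
# which (sub)population of students is used to estimate the means and standard
# deviations above. If re-estimating them on different subgroups yields different
# equating functions, the requirement is violated.
```

Even in this idealized case, the conversion only makes sense if both scales measure the same construct with comparable reliability; that is, requirements 1 and 2 above are doing the real work.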

Accordingly, here are Linn’s two key points of caution: (1) Factors that complicate doing this, and that make such simple conversions less reliable and less valid (or invalid), include “the content, format, and margins of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests.” Likewise, (2) “it cannot be assumed that converted scores will behave just like those of the test whose scale is adopted.” That is, converting or norming scores to permit longitudinal analyses (i.e., in the case of VAMs) can very well violate what the test scores mean, even if common sense might tell us they essentially mean the same thing.

Hence, and also as per the National Research Council (NRC), “it is not feasible to compare the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale.” Tests vary so much “in content, item format, conditions of administration, and consequences attached to [test] results that the linked scores could not be considered sufficiently comparable to justify the development of [a] single equivalency scale.”

“Although NAEP [the National Assessment of Educational Progress – the nation’s “best” test] would seem to hold the greatest promise for linking state assessments… NAEP has no consequences either for individual students or for the schools they attend. State assessments, on the other hand, clearly have important consequences for schools and in a number of states they also have significant consequences for individual students” and now teachers. This is the key factor that undermines the potential to equate tests that are not designed to be interchangeable. This is certainly not to say, however, that simply increasing the consequences across tests would make them more interchangeable. It’s just not that simple – see requirements 1-5 above.

Tennessee’s Senator Lamar Alexander to “Fix” No Child Left Behind (NCLB)

In The Tennessean this week was an article about how Lamar Alexander – the new chairman of the US Senate Education Committee, the current US Senator from Tennessee, and the former US Secretary of Education under former President George Bush Senior – is planning on “fixing” No Child Left Behind (NCLB). See another Tennessean article about this from last week as well, here. This is good news, also given that it is coming from Tennessee – one of our states to watch given its history of value-added reforms tied to NCLB tests.

NCLB, even though it has not been reauthorized since 2002 (despite a failed attempt in 2007), is still foundational to current test-based reforms, including those that tie federal funds to the linking of students’ (NCLB) test scores to students’ teachers to measure teacher-level value-added. To date, for example, 42 states have been granted NCLB waivers excusing them from getting 100% of their students to proficiency in mathematics and reading/language arts by 2014, as long as the states adopted and implemented federal reforms based on Common Core tests and began tying test scores to teacher quality using growth/value-added measures. This was in exchange for the federal educational funds on which states also rely.

Well, Lamar apparently believes that this is precisely what needs fixing. “The law has become unworkable,” Alexander said. “States are struggling. As a result, we need to act.”

More specifically, Senator Alexander wants to:

  • take power away from the US Department of Education, which he often refers to as our country’s “national school board.”
  • prevent the US Department of Education from further pressuring schools to adopt certain tests and standards.
  • prevent current US Secretary of Education Duncan from continuing to hold “states over a barrel,” forcing them to do what he has wanted them to do to avoid being labeled failures.
  • “set realistic goals, keep the best portions of the law, and restore to states and communities the responsibility to decide whether schools and teachers are succeeding or failing.” Alexander has, however, remained somewhat neutral regarding the future role of tests.

“Are there too many? Are they redundant? Are they the right tests? I’m open on the question,” Alexander said.

As noted by the journalist who wrote this article, however, the biggest concern with this (potentially) big win for education is that “There is broad agreement that students should be tested less, but what agency wants to relinquish the ability to hold teachers, administrators and school districts accountable for the money we [as a public] spend on education?” Current US Secretary of Education Duncan, for example, believes that if we don’t continue down his envisioned path, “we [will] turn back the clock on educational progress, 15 years or more.” This from a guy who has built his political career on the claim that educators have made no educational progress; hence, his reasoning goes, we need to hold teachers accountable for that lack of progress.

We shall see how this one goes, I guess.

For an excellent “Dear Lamar” letter, from Lamar Alexander’s former Assistant Secretary of Education who served under him when he was US Secretary of Education under former President Bush Senior, click here. It’s a wonderful letter written by Diane Ravitch; wonderful because it includes so many recommendations highlighting what could be done at this potential turning point in federal policy.

Your Voice Also Needs to Be Heard

Following up on “On Rating The Effectiveness of Colleges of Education Using VAMs” – which is about how the US Department of Education wants teacher training programs to track how the students of colleges of education’s teacher graduates perform on standardized tests (i.e., teacher-level value-added that reflects all the way back to a college of education’s purported quality) – the proposal for these new sanctions is now open for public comment.

Click here on Regulations.gov, “Your Voice in Federal Decision-Making” to read more, but also to post any comments you might have (click on the “Comment Now!” button in blue in the upper right hand corner). I encourage you all to post your concerns, as this really is a potential case of things going from bad to worse in the universe of VAMs. The deadline is Monday, February 2, 2015.

I pasted what I submitted below, as taken from an article I published about this in Teachers College Record in 2013:

1. The model posed is inappropriately one-dimensional. More than 50% of college graduates attend more than one higher education institution before receiving a bachelor’s degree (Ewell, Schild, & Paulson, 2003), and approximately 60% of teacher education occurs in general liberal arts and sciences, and other academic departments outside of teacher education. There are many more variables that contribute to teachers’ knowledge by the time they graduate than just the teacher education program (Anrig, 1986; Darling-Hammond & Sykes, 2003).

2. The implied assumptions of the aforementioned linear formula are overly simplistic given the nonrandomness of the teacher candidate population…If teacher candidates who enroll in a traditional teacher education program are arguably different from teacher candidates who enroll in an alternative program, and both groups are compared once they become teachers, one group might have a distinct and unfair advantage over the other…What cannot be overlooked, controlled for, or dismissed from these comparative investigations are teachers’ enduring qualities that go beyond their preparation (Boyd et al., 2006; Boyd, Grossman, Lankford, Loeb, & Wyckoff, 2007; Harris & Sass, 2007; Shulman, 1988; Wenglinsky, 2002).

3. Teachers are nonrandomly distributed into schools after graduation as well. The type of teacher education program from which a student graduates is highly correlated with the type and location of the school in which the teacher enters the profession (Good et al., 2006; Harris & Sass, 2007; Rivkin, 2007; Wineburg, 2006), especially given the geographic proximity of the program…Without randomly distributing teachers across schools, comparison groups will never be adequately equivalent, as implied in this model, to warrant valid assertions about teacher education quality (Boyd et al., 2006; Good et al., 2006). It should be noted, however, that whether the use of students’ pretest scores and other covariates can account or control for such inter- and intra-classroom variations is still being debated and remains highly uncertain (Ballou, Sanders, & Wright, 2004; Capitol Hill Briefing, 2011; Koedel & Betts, 2010; Kupermintz, 2003; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004; J. Rothstein, 2009; Tekwe et al., 2004).

4. Students are also not randomly placed into classrooms…Students’ innate abilities and motivation levels bias even the most basic examinations in which researchers attempt to link teachers with student learning (Newton et al., 2010; Harris & Sass, 2007; Rivkin, 2007)…the degree to which such systematic errors, often considered measurement biases, impact value-added output is [still] highly unsettled (Ballou et al., 2004; Capitol Hill Briefing, 2011; Koedel & Betts, 2010; Kupermintz, 2003; McCaffrey et al., 2004; J. Rothstein, 2009; Tekwe et al., 2004).

5. A student’s performance is also empirically compounded by what teachers learn “on the job” post-graduation via professional development (see, for example, Greenleaf et al., 2011). If researchers are to measure the impact of a teacher education program using student achievement, and graduates have received professional development, mentoring, and enrichment opportunities post-graduation, one must question whether it is feasible to disentangle the impact that professional development, versus teacher education, has on teacher quality and students’ learning over time. Graduates’ opportunities to learn on the job, and the extent to which they take advantage of such opportunities, introduces yet another source of construct irrelevant variance (CIV) into, what seemed to be, the conceptually simple relational formula presented earlier (Good et al., 2006; Harris & Sass, 2007; Rivkin, 2007; Yinger, Daniel, & Lawton, 2007). CIV is generally prevalent when a test measures too many variables, including extraneous and uncontrolled variables that ultimately impact test outcomes and test-based inferences (Haladyna & Downing, 2004) [and the statistics, no matter how sophisticated they might be, cannot control for or factor all of this out].

Is Combining Different Tests to Measure Value-Added Valid?

A few days ago, on Diane Ravitch’s blog, a person posted the following comment, which Diane sent to me for a response:

“Diane, In Wis. one proposed Assembly bill directs the Value Added Research Center [VARC] at UW to devise a method to equate three different standardized tests (like Iowas, Stanford) to one another and to the new SBCommon Core to be given this spring. Is this statistically valid? Help!”

Here’s what I wrote in response, and what I’m sharing here with you all, as this practice is becoming increasingly common across the country; hence, it is a question of high priority and interest:

“We have three issues here when equating different tests SIMPLY because these tests test the same students on the same things around the same time.

First is the ASSUMPTION that all varieties of standardized tests can be used to accurately measure educational “value,” when none have been validated for such purposes. To measure student achievement? YES/OK. To measure teachers’ impacts on student learning? NO. The ASA statement captures decades of research on this point.

Second, doing this ASSUMES that all standardized tests are vertically scaled, whereby scales increase linearly as students progress through the grades. This is also (grossly) false, ESPECIALLY when one combines different tests with different scales to (force a) fit that simply doesn’t exist. While one can “norm” all test data to make the data output look “similar” (e.g., with a similar mean and similar standard deviations around the mean), this is really nothing more than statistical wizardry, without any real theoretical or other foundation in support.

Third, in one of the best and most well-respected studies we have on this to date, Papay (2010) [in his Different tests, different answers: The stability of teacher value-added estimates across outcome measures study] found that value-added estimates WIDELY range across different standardized tests given to the same students at the same time. So “simply” combining these tests under the assumption that they are indeed similar “enough” is also problematic. Using different tests (in line with the proposal here) with the same students at the same time yields different results, so one cannot simply combine them thinking they will yield similar results regardless. They will not…because the test matters.”
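To illustrate the second and third points in the response above, here is a minimal sketch in Python using simulated (entirely hypothetical) data; it is not any state’s or VARC’s actual procedure, just a demonstration that standardizing two different tests to a common mean and standard deviation does not make them interchangeable at the student level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two tests taken by the same 1,000 students. The tests tap overlapping
# but not identical skills, so scores correlate well below 1.0 (hypothetical data).
ability = rng.normal(size=1000)
test_a = 200 + 25 * (0.8 * ability + 0.6 * rng.normal(size=1000))
test_b = 500 + 80 * (0.8 * ability + 0.6 * rng.normal(size=1000))

# "Norming": put both tests on a common scale with the same mean (0) and SD (1).
z_a = (test_a - test_a.mean()) / test_a.std()
z_b = (test_b - test_b.mean()) / test_b.std()

# The scales now look identical, but individual students' relative standings still
# differ whenever the tests are not perfectly correlated -- the same general point
# Papay (2010) demonstrates with real value-added estimates.
print(round(float(np.corrcoef(z_a, z_b)[0, 1]), 2))   # roughly 0.64, well below 1.0
print(round(float(np.mean(np.abs(z_a - z_b))), 2))    # nontrivial per-student gaps
```

In other words, forcing a common scale changes how the numbers look, not what they measure; students’ (and, by extension, teachers’) relative positions still depend on which test is used.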

See also: Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, D.C.: Economic Policy Institute.

A Pragmatic Position on VAMs

Bennett Mackinney, a local administrator and former doctoral student in one of my research methods courses, recently emailed me asking the following: “Is a valid and reliable VAM theoretically possible? If so, is the issue that no state is taking the time and energy to develop it correctly? Or, is it just impossible? I need to land on a pragmatic position on VAMs. Part of the problem is that for the past decade all of us over here at Title I schools [i.e., schools with disproportionate numbers of disadvantaged students] have been saying “[our state test] is not fair…. our kids come to us with so many issues… you need to measure us on growth not final performance….”  I feel like VAM advocates will come to us and say, “fine, here’s a model that will meet your request to be fairly measured on growth…”

I responded with the following: “In my research-based opinion, we are searching and always will be searching for a type of utopia in this area, one that will likely be out of our reach forever UNLESS these PARCC, etc. tests come through with miracles [which, as history is likely to repeat itself, is highly unlikely]. However, at the end of the day, [we] can be confident that this is better than the snapshot measures used before (i.e., before growth measures), particularly for analyses of large-scale trends, but certainly not for teacher evaluation and especially not for high-stakes teacher evaluation purposes. Regardless of the purpose, though, NEVER should [VAMs] be used in isolation of other measures and NEVER should they be used without a great deal of human judgment re: what the VAM estimates in reality and in context demonstrate, in light of what they can and just cannot do.”

For more on this, see my prior post about the position statement recently released by the American Statistical Association (ASA), or the actual statement itself.

A Really Old Oldie but Still Very Relevant Goodie

Thanks to a colleague in Florida, I recently read an article about the “Problems of Teacher Measurement” published in 1917 in the Journal of Educational Psychology by B. F. Pittenger. As mentioned, it’s always interesting to take a historical approach (hint here to policymakers) – in this case, a historical view via the perspective of an author writing on the same topic of interest to followers here, in an article published almost 100 years ago. Let’s see how things have changed, or more specifically, how things have not changed.

Then, “they” had the same goals we still have today, if this isn’t telling in and of itself. From 1917: “The current efforts of experimentalists in the field of teacher measurement are only attempts to extract from the consciousness of principals and supervisors these personal criteria of good teaching, and to assemble and condense them into a single objective schedule, thoroughly tested, by means of which every judge of teaching may make his [sic] estimates more accurate, and more consistent with those of other judges. There is nothing new about the entire movement except the attempt to objectify what already exists subjectively, and to unify and render universal what is now the scattered property of many men.”

Policymakers continue to invest entirely in an ideal known then, too, to be (possibly forever) false. From 1917: “There are those who believe that the movement toward teacher measurement is a monstrous innovation, which threatens the holiest traditions of the educational profession by putting a premium upon mechanical methodology…the phrase ‘teacher-measurement,’ itself, no doubt, is in part responsible for this misunderstanding, as it suggests a mathematical exactness of procedure which is clearly impossible in this field [emphasis added]. Teacher measurement will probably never become more than a carefully controlled process of estimating a teacher’s individual efficiency…[This is]…sufficiently convenient and euphonious, and has now been used widely enough, to warrant its continuation.”

As for the methods “issues” in 1917? “However sympathetic one may be with the general plan of devising schedules for teacher measurement, it is difficult to justify many of the methods by which these investigators have attacked the problem. For example, all of them appear to have set up as their goal the construction of a schedule which can be applied to any teacher, whether in the elementary or high school, and irrespective of the grade or subject in which his teaching is being done. “Teaching is teaching,” is the evident assumption, “and the same wherever found.” But it may reasonably be maintained that different qualities and methods, at least in part, are requisite…In so far as the criteria of good teaching are the same in these very diverse situations, it seems probable that the comparative importance to be attached to each must differ.” Sound familiar?

On the use of multiple measures, in line with the current measurement standards of the profession, from 1917: “students of teacher measurement appear to have erred in that they have attempted too much. The writer is strongly of the opinion that, for the present at least, efforts to construct a schedule for teacher measurement should be confined to a single one of the three planes which have been enumerated. Doubtless in the end we shall want to know as much as possible about all three; and to combine in our final estimate of a teacher’s merit all attainable facts as to her equipment, her classroom procedure, and the results which she achieves. But at present we should do wisely to project our investigations upon one plane at a time, and to make each of these investigations as thorough as it is possible to make it. Later, when we know the nature and comparative value of the various items necessary to adequate judgment upon all planes, there will be time and opportunity for putting together the different schedules into one.” One-hundred years later…

On prior teachers’ effects: “we must keep constantly in mind the fact that the results which pupils achieve in any given subject are by no means the product of the labor of any single teacher. Earlier teachers, other contemporary teachers, and the environment external to the school, are all factors in determining pupil efficiency in any school subject. It has been urged that the influence of these complicating factors can be materially reduced by measuring only the change in pupil achievement which takes place under the guidance of a single teacher. But it must be remembered that this process only reduces these complications; it does not and cannot eliminate them.”

Finally, the supreme goal to be sought, then and now? “The plane of results (in the sense of changes wrought in pupils) would be the ideal plane upon which to build an estimate of a teacher’s individual efficiency, if it were possible (1) to measure all of the results of teaching, and (2) to pick out from the body of measured results any single teacher’s contribution. At present these desiderata are impossible to attain [emphasis added]…[but]…let us not make the mistake of assuming that the results that we can measure are the only results of teaching, or even that they are the most important part.”

Likewise, “no one teacher can be given the entire blame or credit for the doings of the pupils in her classroom…the ‘classroom process’ should be regarded as including the activities of both teachers and pupils.” In the end, “The promotion, discharge, or constructive criticism of teachers cannot be reduced to mathematical formulae. The proper function of a scorecard for teacher measurement is not to substitute such a formula for a supervisor’s personal judgment, but to aid him in discovering and assembling all the data upon which intelligent judgment should be based.”

My Book Listed As One of the Five “Best Education Books of 2014”

Following up on a recent post about A (Great) Review of My Book on Value-Added Models (VAMs), my book was also recently listed as one of the five “Best Education Books of 2014.”

For those of you who have not read it, please do borrow or pick up a copy, as the research and research-based information inside it can, should, and hopefully will prove critical in the fight against so many of the malformed, VAM-based policies for increased teacher accountability still being perpetuated across the country.

Here is the list of the top five “Best Education Books of 2014” in case you all are interested in reading (any of) the others.

Doug Harris on the (Increased) Use of Value-Added in Louisiana

Thus far, there have been four total books written about value-added models (VAMs) in education: one (2005) scholarly, edited book that was published prior to our heightened policy interest in VAMs; one (2012) that is less scholarly but more of a field guide on how to use VAM-based data; my recent (2014) scholarly book; and another recent (2011) scholarly book written by Doug Harris. Doug is an Associate Professor of Economics at Tulane in Louisiana. He is also, as I’ve written prior, “a ‘cautious’ but quite active proponent of VAMs.”

There is quite an interesting history surrounding these latter two books, given that Harris and I have quite different views on VAMs and their potentials in education. To read more about our differing opinions you can read a review of Harris’s book I wrote for Teachers College Record, and another review a former doctoral student and I wrote for Education Review, to which he responded in his (and his book’s) defense, to which we also responded (with a “rebuttal to a rebuttal,” if you will). What was ultimately titled a “Value-Added Smackdown” in a blog post featured in Education Week, let’s just say, got a little out of hand, with the “smackdown” ending up focusing almost solely on our claim that Harris believed – and we disagreed with – the notion that “value-added [was and still is] good enough to be used for [purposes of] educational accountability.” We asserted then, and I continue to assert now, that “value-added is not good enough to be attaching any sort of consequences much less any such decisions to its output. Value-added may not even be good enough at the most basic, pragmatic level.”

Harris continues to disagree…

Just this month he released a technical report to his state’s school board (i.e., the Louisiana Board of Elementary and Secondary Education (BESE)), in which he (unfortunately) evidenced that he has not (yet) changed his scholarly stripes…even given the most recent research about VAMs’ increasingly apparent methodological, statistical, and pragmatic limitations (see, for example, here, here, here, here, and here), and the recent position statement released by the American Statistical Association underscoring the key points being evidenced across (most) educational research studies. See also the 24 articles published about VAMs in all American Educational Research Association (AERA) journals here, along with open-access links to the actual articles.

In this report Harris makes “Recommendations to Improve the Louisiana System of Accountability for Teachers, Leaders, Schools, and Districts,” the main one being that the state focus “more on student learning or growth—[by] specifically, calculating the predicted test scores and rewarding schools based on how well students do compared with those predictions.” The recommendations, detailed below, that pertain to our interests here include the following five (of six recommendations total):

1. “Focus more on student growth [i.e., value-added] in order to better measure the performance of schools.” Not that there is any research evidence in support, but “The state should [also] aim for a 50-50 split between growth and achievement levels [i.e., not based on value-added].” Doing this at the school accountability level “would also improve alignment with teacher accountability, which includes student growth [i.e., teacher-level value-added] as 50% of the evaluation.”

2. “Reduce uneven incentives and avoid “incentive cliffs” by increasing [school performance score] points more gradually as students move to higher performance levels,” notwithstanding the fact that no research to date has evidenced that such incentives incentivize much of anything intended, at least in education. Regardless, and despite the research, “Giving more weight to achievement growth [will help to create] more even [emphasis added] incentives (see Recommendation #1).”

3. Related, “Create a larger number of school letter grades [to] create incentives for all schools to improve,” by adding +/- extensions to the school letter grades, because “[i]f there were more categories, the next [school letter grade] level would always be within reach…. This way all schools will have an incentive to improve, whereas currently only those who are at the high end of the B-D categories have much incentive.” If only the real world of education worked as informed by simple idioms, like those simplifying the theories supporting incentives (e.g., a carrot dangled just beyond the mule’s reach will make the mule pull the cart harder).

5. “Eliminate the first override provision in the teacher accountability system, which automatically places teachers who are ‘Ineffective’ on either measure in the ‘Ineffective’ performance category.” With this recommendation, I fully agree, as Louisiana is one of the most extreme states when it comes to attaching consequences to problematic data, although I don’t think Harris would agree with my “problematic” classification. But this would mean that “teachers who appear highly effective on one measure could not end up in the ‘Ineffective’ category,” which for this state would certainly be a step in the right direction. Harris also asserts that doing this would help prevent principals from saving truly ineffective teachers (e.g., by countering teachers’ value-added scores with artificially inflated or allegedly fake observational scores) – an assertion that, on behalf of principals as professionals, I find insulting.

6. “Commission a full-scale third-party evaluation of the entire accountability system focused on educator responses and student outcomes.” With this recommendation, I also fully agree, under certain conditions: (1) the external evaluator is indeed external to the system and has no conflicts of interest, including financial (even prior to payment for the external review), (2) that which the external evaluator is to investigate is informed by the research in terms of the framing of the questions that need to be asked, (3) as also recommended by Harris, the perspectives of those involved (e.g., principals and teachers) are included in the evaluation design, and (4) all parties formally agree to releasing all data regardless of what (positive or negative) the external evaluator might evidence and find.

Harris’s additional details and “other, more modest recommendations” include the following:

  • Keep “value-added out of the principal [evaluation] measure,” but “the state should consider calculating principal value-added measures and issuing reports that describe patterns of variation (e.g., variation in performance overall [in] certain kinds of schools) both for the state as a whole and specific districts.” This reminds me of the time that value-added measures for teachers were to be used only for descriptive purposes. While noble as a recommendation, we know from history what policymakers can do once the data are made available.
  • “Additional Teacher Accountability Recommendations” start on page 11 of this report, although all of these (unfortunately, again) focus on value-added model twists and tweaks (e.g., how to adjust for ceiling effects for schools and teachers with disproportionate numbers of gifted/high-achieving students, how to watch and account for bias) to make the teacher value-added model even better.

Harris concludes that “With these changes, Louisiana would have one of the best accountability systems in the country. Rather than weakening accountability, these recommendations [would] make accountability smarter and make it more likely to improve students’ academic performance.” Following these recommendations would “make the state a national leader.” While Harris cites 20 years of failed attempts in Louisiana and across the country as the reason America’s public education system has not improved its public school students’ academic performance, I’d argue it’s more like 40 years of failed attempts, because Harris’s (and so many others’) accountability-bent logic is seriously flawed.

Those Who Can, Teach—Those Who Don’t Understand Teaching, Make Policy

There is an old adage that many in education have undoubtedly (and unfortunately) encountered at some point in their education careers: that “those who can, do—those who can’t, teach.” For decades now, researchers, including Dr. Thomas Good, who is the author of an article published in 2014 in Teachers College Record, have evidenced that, contrary to this belief, teachers can do, and some teachers do better than others. Dr. Good points out in his historical analysis, What Do We Know About How Teachers Influence Student Performance on Standardized Tests: And Why Do We Know so Little About Other Student Outcomes?, that teachers matter and that teachers vary in their effects on student achievement. He provides evidence from several decades of scholarly research on teacher effectiveness to show that teachers do make a difference in student achievement as measured by large-scale standardized achievement tests.

Dr. Good is also quick to acknowledge that, despite the reiterated notion that teachers matter and thus should possess (and continue to be trained in) effective teaching qualities (e.g., be well versed in their content knowledge, have strong classroom management skills, hold appropriate expectations, etc.), “fad-driven” education reform policies (e.g., teacher evaluation policies that are based in large part on student achievement growth or teachers’ “value-added”) have gone too far and have actually overvalued the effects of teachers. He explains that simplistic reform efforts, such as Race to the Top and VAM-based teacher evaluation systems, overvalue teacher effects in terms of the actual levels of impact teachers have on student achievement. It has been well documented that teacher effects can only explain, on average, between 10% and 20% of the variation in student achievement scores (this will be further explored in forthcoming posts). This means that 80-90% of the variation in student achievement scores is the result of other factors that are completely outside of teachers’ control (e.g., poverty, parental support, etc.). Regardless, new teacher evaluation systems that rely so heavily on VAM estimates ignore this very important fact.

Dr. Good also emphasizes that teaching is a complex practice and that attempting to isolate the variables that make up effective teaching, and focusing on each one separately, only oversimplifies the complexities of the teaching-achievement relationship. For one thing, VAMs and other popular evaluative practices, such as classroom observations, inhibit one’s ability to recognize the patterns of effective teaching and instead promote a simplistic view of just some of the individual variables that actually matter when thinking about effective teaching. He provides the following example:

Many observational systems call for the demonstration of high expectations. However…expectations can be too high or too low, and the issue is for teachers to demonstrate appropriate expectations. How then does a classroom observer know and code if expectations are appropriate both for individual students and for the class as a whole?

The bottom line is that, like teaching, understanding the impact of teachers on student learning is very complex. Teachers matter, and they should be trained, treated, and valued as such. However, they are one of many factors that impact student learning and achievement over time, and this very critical point cannot be (though currently is) ignored by policy. VAM-based policies, particularly, place an exorbitantly over-estimated value on the impact of teachers on student achievement scores.

Post contributed by Jessica Holloway-Libell