Educator Evaluations (and the Use of VAM) Unlikely to be Mandated in Reauthorization of ESEA

I invited a colleague of mine – Kimberly Kappler Hewitt (Assistant Professor, University of North Carolina, Greensboro) – to write a guest post for you all, sharing her thoughts on what is currently occurring on Capitol Hill regarding the reauthorization of the Elementary and Secondary Education Act (ESEA). Here is what she wrote:

Amidst what is largely a bitterly partisan culture on Capitol Hill, Republicans and Democrats agree that teacher evaluation is unlikely to be mandated in the reauthorization of the Elementary and Secondary Education Act (ESEA), the most recent iteration of which is No Child Left Behind (NCLB), passed in 2001 and signed into law in January 2002. See here for an Education Week article by Lauren Camera on the topic.

In another piece on the topic (here), the same author Camera explains: “Republicans, including Chairman Lamar Alexander, R-Tenn., said Washington shouldn’t mandate such policies, while Democrats, including ranking member Patty Murray, D-Wash., were wary of increasing the role student test scores play in evaluations and how those evaluations are used to compensate teachers.” However, under draft legislation introduced by Senator Lamar Alexander (R-Tenn.), Chairman of the Senate Health, Education, Labor, and Pensions Committee, Title II funding would turn into federal block grants, which could be used by states for educator evaluation. Regardless, excluding a teacher evaluation mandate from ESEA reauthorization may undermine efforts by the Obama administration to incorporate student test score gains as a significant component of educator evaluation.

Camera further explains: “Should Congress succeed in overhauling the federal K-12 law, the lack of teacher evaluation requirements will likely stop in its tracks the Obama administration’s efforts to push states to adopt evaluation systems based in part on student test scores and performance-based compensation systems.”

Under the Obama administration, in order for states to obtain a waiver from NCLB penalties and to receive a Race to the Top Grant, they had to incorporate—as a significant component—student growth data in educator evaluations. Influenced by these powerful policy levers, forty states and the District of Columbia require objective measures of student learning to be included in educator evaluations—a sea change from just five years ago (Doherty & Jacobs/National Council on Teacher Quality, 2013). Most states use either some type of value-added model (VAM) or student growth percentile (SGP) model to calculate a teacher’s contribution to student score changes.
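For readers unfamiliar with what these models actually compute, below is a deliberately bare-bones sketch of the covariate-adjustment logic underlying many (though certainly not all) VAMs. It is illustrative only: the input file and column names are hypothetical, and real state models add multiple prior scores, student and classroom covariates, and statistical adjustments (SGP models, for their part, rely on quantile regression rather than ordinary least squares).

```python
# Illustrative only: a bare-bones covariate-adjustment "value-added" estimate.
# Real state VAMs use multiple prior scores, student/classroom covariates, and
# shrinkage; SGP models use quantile regression instead. All file and column
# names here ("student_scores.csv", "score_2014", etc.) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("student_scores.csv")

# Step 1: predict each student's current score from the prior-year score.
model = smf.ols("score_2014 ~ score_2013", data=students).fit()

# Step 2: the residual is the gap between a student's actual and predicted score.
students["residual"] = students["score_2014"] - model.predict(students)

# Step 3: a teacher's estimated "contribution" is the mean residual of his or her students.
value_added = students.groupby("teacher_id")["residual"].mean().sort_values()
print(value_added)
```

Even in this simplified form, the logic makes clear how much rides on what the model does (and does not) control for.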

The Good, the Bad, and the Ugly

As someone who is skeptical about the use of VAMs and SGPs for evaluating educators, I have mixed feelings about the idea that educator evaluation will be left out of ESEA reauthorization. I believe that student growth measures such as VAMs and SGPs should be used not as a calculable component of an educator’s evaluation but as a screener to flag educators who may need further scrutiny or support, a recommendation made by a number of student growth measure (SGM) experts (e.g., Baker et al., 2010; Hill, Kapitula, & Umland, 2011; IES, 2010; Linn, 2008).

Here are two thoughts about the consequences of not incorporating policy on educator evaluation in the reauthorization of ESEA:

  1. Lack of a clear federal vision for educator evaluation devolves the debate to the states. There are strong debates about what the nature of educator evaluation can and should be, and education luminaries such as Linda Darling-Hammond and James Popham have weighed in on the issue (see here and here, respectively). If Congress does not address educator evaluation in ESEA legislation, the void will be filled by disparate state policies. This in itself is neither good nor bad. It does, however, call into question the longevity of the efforts the Obama administration has made to leverage educator evaluation as a way to increase teacher quality. Essentially, the lack of action on the part of Congress regarding educator evaluation devolves the debates to the state level, which means that heated—and sometimes vitriolic—debates about educator evaluation will endure, shifting attention away from other efforts that could have a more powerful and more positive effect on student learning.
  2. Possibility of increases in inequity. ESEA was first passed in 1965 as part of President Johnson’s War on Poverty. ESEA was intended to promote equity for students living in poverty by providing federal funding to districts serving low-income students. The idea was that the federal government could help to level the playing field, so to speak, for students who lacked the advantages of higher-income students. My own research suggests that the use of VAM for educator evaluation potentially exacerbates inequity in that some teachers avoid working with certain groups of students (e.g., students with disabilities, gifted students, and students who are multiple grade levels behind) and at certain schools, especially high-poverty schools, based on the perception that teaching such students and in such schools will result in lower value-added scores. Without federal legislation that provides clear direction to states that student test score data should not be used for high-stakes evaluation and personnel decisions, states may continue to use data in this manner, which could exacerbate the very inequities that ESEA was originally designed to address.

While it is a good thing, in my mind, that ESEA reauthorization will not mandate educator evaluation that incorporates student test score data, it is a bad (or at least ugly) thing that Congress is abdicating the role of promoting sound educator evaluation policy.

References

Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., . . . Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers (EPI Briefing Paper). Washington, D.C.: Economic Policy Institute.

Doherty, K. M., & Jacobs, S. (2013). State of the states 2013: Connect the dots: Using evaluation of teacher effectiveness to inform policy and practice. Washington, D.C.: National Council on Teacher Quality.

Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794-831.

Institute of Education Sciences. (2010). Error rates in measuring teacher and school performance based on students’ test score gains. Washington, D.C.: U.S. Department of Education.

Linn, R. L. (2008). Methodological issues in achieving school accountability. Journal of Curriculum Studies, 40(6), 699-711.

Student Learning Objectives (SLOs): What (Little) We Know about Them Besides We Are to Use Them

Following up on a recent post, a VAMboozled! follower – Laura Chapman – wrote the comment below about Student Learning Objectives (SLOs), which I found important to share with you all. SLOs are objectives that are teacher-developed and administrator-approved to help hold teachers accountable for their students’ growth, although growth in this case is individually and loosely defined, which makes SLOs about as subjective as it gets. Ironically, SLOs serve as alternatives to VAMs when teachers who are VAM-ineligible need to be held accountable for “growth.”

Laura commented that I need to write more about SLOs, as states are increasingly adopting them, but states are doing this without really any research evidence in support of the concept, much less the practice. That might seem more surprising than it really is, but there is not a lot of research being conducted on SLOs, yet. One research document of which I am aware, and which I reviewed here, was written by Mathematica and published by the US Department of Education here: “Alternative student growth measures for teacher evaluation: Profiles of early-adopting districts.”

Conducting a search on ERIC, I found only two additional pieces also contracted out and published by the US Department of Education, although the first piece is more about describing what states are doing in terms of SLOs versus researching the actual properties of the SLOs. The second piece better illustrates the fact that “very little of the literature on SLOs addresses their statistical properties.”

What little we do know about SLOs at this point, however, is two-fold: (1) “no studies have looked at SLO reliability” and (2) “[l]ittle is known about whether SLOs can yield ratings that correlate with other measures of teacher performance” (i.e., one indicator of validity). The very few studies in which researchers have examined this found “small but positive correlations” between SLOs and VAM-based ratings (i.e., not a strong indicator of validity).

With that being said, if any of you are aware of research I should review or if any of you have anything to say or write about SLOs in your states, districts, or schools, feel free to email me at audrey.beardsley@asu.edu.

In the meantime, do also read what Laura wrote about SLOs here:

I appreciate your work on the VAM problem. Equal attention needs to be given to the use of SLOs for evaluating teachers in so-called untested and non-tested subjects. It has been estimated that about 65-69% of teachers have job assignments for which there are no state-wide tests. SLOs (and variants) are the proxy of choice for VAM. This writing exercise is required in at least 27 states, with pretest-posttest and/or baseline-to-post-test reports on student growth. Four reports from USDE (2014) [I found three] show that there is no empirical research to support the use of the SLO process (and associated district-devised tests and cut-off scores) for teacher evaluation.

The template for SLOs originated in Denver in 1999. It has been widely copied and promoted via publications from USDE’s “Reform Support Network,” which operates free of any need for evidence and with few constraints other than marketing a deeply flawed product. SLO templates in wide use have no peer-reviewed evidence to support their use for teacher evaluation…not one reliability study, not one study addressing their validity for teacher evaluation.

SLO templates in Ohio and other states are designed to fit the teacher-student data link project (funded by Gates and USDE since 2005). This means that USDE’s proposed evaluations of specific teacher education programs (e.g., art education at Ohio State University) will be aided by the use of extensive “teacher of record” data routinely gathered by schools and districts, including personnel files that typically require the teacher’s college transcripts, degree earned, certifications, scores on tests for any teacher license, and so on.

There are technical questions galore, but a big chunk of the data of interest to the promoters of this latest extension of the Gates/USDE’s rating game are in place.

I have written about the use of SLOs as a proxy for VAM in an unpublished paper titled The Marketing of Student Learning Objectives (SLOs): 1999-2014. A pdf with references can be obtained by request at chapmanLH@aol.com.

Teacher Evaluation and Accountability Alternatives, for A New Year

At the beginning of December I published a post about Diane Ravitch’s really nice piece in the Huffington Post about what she views as a much better paradigm for teacher evaluation and accountability. Diane Ravitch has since posted another piece on similar alternatives, although this one was written by teachers themselves.

I thought this was more than appropriate, especially given a New Year is upon us, and while it might very well be wishful thinking, perhaps at least some of our state policy makers might be willing to think in new ways about what really could be new and improved teacher evaluation systems. Cheers to that!

The main point here, though, is that alternatives do, indeed, exist. Likewise, it’s not that teachers do not want to be held accountable for, and evaluated on, what they do; rather, they want whatever systems are in place (formal or informal) to be appropriate, professional, and fair. How about that for a policy-based New Year’s resolution?

This is from Diane’s post: The Wisdom of Teachers: A New Vision of Accountability.

Anyone who criticizes the current regime of test-based accountability is inevitably asked: What would you replace it with? Test-based accountability fails because it is based on a lack of trust in professionals. It fails because it confuses measurement with instruction. No doctor ever said to a sick patient, “Go home, take your temperature hourly, and call me in a month.” Measurement is not a treatment or a cure. It is measurement. It doesn’t close gaps: it measures them.

Here is a sound alternative approach to accountability, written by a group of teachers whose collective experience is 275 years in the classroom. Over 900 teachers contributed ideas to the plan. It is a new vision that holds all actors responsible for the full development and education of children, acknowledging that every child is a unique individual.

Its key features:

  • Shared responsibility, not blame
  • Educate the whole child
  • Full and adequate funding for all schools, with less emphasis on standardized testing
  • Teacher autonomy and professionalism
  • A shift from evaluation to support
  • Recognition that in education one size does not fit all

A New Paradigm for Accountability

Diane Ravitch recently published in the Huffington Post a really nice piece about what she views as a much better paradigm for accountability — one based on much better indicators than large-scale standardized test scores. It offers a more positive and supportive accountability alternative to that with which we have been “dealing” for, really, the last 30 years.

The key components of this new paradigm, as taken from the full post titled “A New Paradigm for Accountability: The Joy of Learning,” are pasted below. I would, however, recommend giving the article a full read, instead or in addition, as the way Diane frames her reasoning around this list is also important to understand. Click here to see the full article on the Huffington Post website. Otherwise, here’s her paradigm:

The new accountability system would be called No Child Left Out. The measures would be these:

  • How many children had the opportunity to learn to play a musical instrument?
  • How many children had the chance to play in the school band or orchestra?
  • How many children participated in singing, either individually or in the chorus or a glee club or other group?
  • How many public performances did the school offer?
  • How many children participated in dramatics?
  • How many children produced documentaries or videos?
  • How many children engaged in science experiments? How many started a project in science and completed it?
  • How many children learned robotics?
  • How many children wrote stories of more than five pages, whether fiction or nonfiction?
  • How often did children have the chance to draw, paint, make videos, or sculpt?
  • How many children wrote poetry? Short stories? Novels? History research papers?
  • How many children performed service in their community to help others?
  • How many children were encouraged to design an invention or to redesign a common item?
  • How many students wrote research papers on historical topics?

Can you imagine an accountability system whose purpose is to encourage and recognize creativity, imagination, originality, and innovation? Isn’t this what we need more of?

Well, you can make up your own metrics, but you get the idea. Setting expectations in the arts, in literature, in science, in history, and in civics can change the nature of schooling. It would require far more work and self-discipline than test prep for a test that is soon forgotten.

My paradigm would dramatically change schools from Gradgrind academies to halls of joy and inspiration, where creativity, self-discipline, and inspiration are nurtured, honored, and valued.

This is only a start. Add your own ideas. The sky is the limit. Surely we can do better than this era of soul-crushing standardized testing.

Surveys + Observations for Measuring Value-Added

Following up on a recent post about the promise of Using Student Surveys to Evaluate Teachers, using a more holistic definition of a teacher’s value-added, I just read a chapter written by Ronald Ferguson — the creator of the Tripod student survey instrument and Tripod’s lead researcher — along with Charlotte Danielson — the creator of the Framework for Teaching and founder of The Danielson Group (see a prior post about this instrument here). Both instruments are “research-based,” both are used nationally and internationally, both are (increasingly being) used as key indicators to evaluate teachers across the U.S., and both were used throughout the Bill & Melinda Gates Foundation’s ($43 million worth of) Measures of Effective Teaching (MET) studies.

The chapter, titled “How Framework for Teaching and Tripod 7Cs Evidence Distinguish Key Components of Effective Teaching,” was recently published in a book all about the MET studies, titled “Designing Teacher Evaluation Systems: New Guidance from the Measures of Effective Teaching Project,” edited by Thomas Kane, Kerri Kerr, and Robert Pianta. The chapter is about whether and how data derived via the Tripod student survey instrument (i.e., as built on the 7Cs: challenging students, control of the classroom, teacher caring, teachers confer with students, teachers captivate their students, teachers clarify difficult concepts, and teachers consolidate students’ learning) align with the data derived via Danielson’s Framework for Teaching, to collectively capture teacher effectiveness.

Another purpose of this chapter is to examine how both indicators also align with teacher-level value-added. Ferguson (and Danielson) find that:

  • Their two measures (i.e., the Tripod and the Framework for Teaching) are more reliable (and likely more valid) than value-added measures. The over-time, teacher-level classroom correlations cited in this chapter are r = 0.38 for value-added (which is comparable with the correlations noted in plentiful studies elsewhere), r = 0.42 for the Danielson Framework, and r = 0.61 for the Tripod student survey component. These “clear correlations,” while not strong, particularly in terms of value-added, do indicate that there is some common signal the indicators are capturing, some more strongly than others (as should be obvious given the above numbers).
  • Contrary to what some (softies) might think, classroom management, not caring (i.e., the extent to which teachers care about their students and what their students learn and achieve), is the strongest predictor of a teacher’s value-added. However, this correlation (i.e., the strongest of the bunch) is still quite “weak” at approximately r = 0.26, even though it is statistically significant. Caring, rather, is the strongest predictor of whether students are happy in their classrooms with their teachers.
  • In terms of “predicting” teacher-level value-added, the other 7Cs components that matter “most” next to classroom management (although none of the coefficients are as strong as we might expect [i.e., r < 0.26]) are the extent to which teachers challenge their students and have control over their classrooms.
  • Value-added in general is more highly correlated with teachers at the extremes in terms of their student survey and observational composite indicators.

In the end, the authors of this chapter do not disclose the actual correlations between their two measures and value-added (although from the appendix one can infer that the correlation between value-added and Tripod output is around r = 0.45, based on an unadjusted r-squared). I should mention that this is a HUGE shortcoming of this chapter, one that would not have passed peer review had the chapter been submitted to a journal for publication. The authors do mention that “the conceptual overlap between the frameworks is substantial and that empirical patterns in the data show similarities.” Unfortunately, however, they do not quantify the strength of said “similarities.” This leaves us to assume that, because they were not reported, the actual strengths of the similarities empirically observed were likely low (as is also evidenced in many other studies, although less often with student survey indicators than with observational indicators).
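As an aside, for readers wondering how one backs a correlation out of a reported r-squared: for a simple, one-predictor, unadjusted regression, the correlation is just the square root of R². Here is a quick sketch of that arithmetic, using a hypothetical R² of 0.20 (roughly what an r of 0.45 implies), not a figure reported in the chapter itself.

```python
# Backing a correlation out of a reported unadjusted R-squared (one-predictor case).
# The 0.20 below is a hypothetical value chosen to illustrate the arithmetic
# (it is roughly what r = 0.45 implies), not a figure reported in the chapter.
import math

r_squared = 0.20
r = math.sqrt(r_squared)
print(round(r, 2))  # prints 0.45
```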

The final conclusion the authors make is that educators should “cross-walk” the two frameworks (i.e., the Tripod and the Danielson Framework) and use both when reflecting on teaching. I must say I’m concerned about this recommendation as well, mainly because it will cost states and districts more money, and the returns or “added value” (using the grandest definition of this term) of such an approach do not have the evidence necessary to adequately justify it.

“Accountability that Sticks” v. “Accountability with Sticks”

Michael Fullan, Professor Emeritus at the University of Toronto and former Dean of the Ontario Institute for Studies in Education (OISE), gave a presentation he titled “Accountability that Sticks,” which, as he noted, is “a preposition away from accountability with sticks.”

In his speech he said: “Firing teachers and closing schools if student test scores and graduation rates do not meet a certain bar is not an effective way to raise achievement across a district or a state…Linking student achievement to teacher appraisal, as sensible as it might seem on the surface, is a non-starter…It’s a wrong policy [emphasis added]…[and] Its days are numbered.”

He noted that teacher evaluation is “the biggest factor that most policies get wrong…Teacher appraisal, even if you get it right – which the federal government doesn’t do – is the wrong driver. It will never be intensive enough.”

He then spoke about how, at least in the state of California, things look more promising than they have in the past, from his view working with local districts throughout the state. He noted that “Growing numbers of superintendents, teachers and parents…are rejecting punitive measures…in favor of” what he called “more collaborative, humane and effective approaches to supporting teachers and improving student achievement.”

If the goal is to improve teaching, then, what’s the best way to do this according to Fullan? “[A] culture of collaboration is the most powerful tool for improving what happens in classrooms and across districts…This is the foundation. You reinforce it with selective professional development and teacher appraisal.”

In addition, “[c]ollaboration requires a positive school climate – teachers need to feel respected and listened to, school principals need to step back, and the tone has to be one of growth and improvement, not degradation.” Accordingly, “New Local Control and Accountability Plans [emphasis added], created individually by districts, could be used by teachers and parents to push for ways to create collaborative cultures” and cultures of community-based and community-respected accountability.

This will help “talented schools” improve “weak teachers” and further prevent the attrition of “talented teachers” from “weak schools.”

To read more about his speech, as highlighted by Jane Meredith Adams on EdSource, click here.

Does A “Statistically Sound” Alternative Exist?

A few weeks ago a follower posed the following question on our website, and I thought it imperative to share.

Following the post about “The Arbitrariness Inherent in Teacher Observations,” he wrote: “Have you written about a statistically sound alternative proposal?”

My reply? “Nope. I do not believe such a thing exists. I do have a sound alternative proposal though, that has sound statistics to support it. It serves as the core of chapter 8 of my recent book.”

Essentially, this is a solution that, counter-intuitively, is even more conventional and traditional. It is a solution that has research and statistical evidence in its support, and it has evidenced itself as superior to using value-added measures, along with other measures of teacher effectiveness in their current forms, for evaluating and holding teachers accountable for their effectiveness. It is based on the use of multiple measures, aligned with the standards of the profession as well as with locally defined theories capturing what it means to be an effective teacher. Its effectiveness also relies on competent supervisors and elected colleagues serving as professional members of educators’ representative juries.

This solution does not rely solely on mathematics and the allure of numbers or grandeur of objectivity that too often comes along with numerical representation, especially in the social sciences. This solution does not trust the test scores too often (and wrongly) used to assess teacher quality, simply because the test output is already available (and paid for) and these data can be represented numerically, mathematically, and hence objectively. This solution does not marginalize human judgment, but rather embraces human judgment for what it is worth, as positioned and operationalized within a more professional, democratically-based, and sound system of judgment, decision-making, and support.

Jesse Rothstein on Teacher Evaluation and Teacher Tenure

Last week, via the Washington Post’s Wonkblog, Max Ehrenfreund published a piece titled “Teacher tenure has little to do with student achievement, economist says.” For those of you who do not know Jesse Rothstein, he’s an Associate Professor of Economics at the University of California – Berkeley, and he is one of the leading researchers/economists conducting research on teacher evaluation and accountability policies writ large, as well as on the value-added models (VAMs) being used for such purposes. He’s probably most famous for a study he conducted in 2009 about how the non-random, purposeful sorting of students into classrooms biases (or distorts) value-added estimates, largely regardless of the sophistication of the statistical controls meant to block (or account for) such bias (or distorting effects). You can find this study referenced here.
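To make the intuition behind Rothstein’s finding concrete, here is a small simulation sketch (mine, not his; his actual falsification tests are far more sophisticated): two teachers who are, by construction, equally effective, but whose students are sorted on an unobserved trait that also drives growth. A naive gain-score comparison then “finds” a teacher effect that does not exist. All numbers and variable names are hypothetical.

```python
# Toy simulation: two teachers who are equally effective by construction, but
# students are sorted to Teacher A on an unobserved trait that also drives growth.
# A naive gain-score comparison then attributes that growth to the teacher.
# All numbers and variable names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

ability = rng.normal(0, 1, n)            # unobserved student trait
prior = ability + rng.normal(0, 1, n)    # noisy prior-year score

# Purposeful (non-random) sorting: higher-ability students assigned to Teacher A.
teacher_a = ability > np.median(ability)

# Both teachers contribute identically; growth still depends on the unobserved trait.
current = prior + 0.5 * ability + rng.normal(0, 1, n)

gain = current - prior
naive_gap = gain[teacher_a].mean() - gain[~teacher_a].mean()
print(f"Naive estimate of Teacher A's advantage: {naive_gap:.2f} (true advantage: 0.00)")
```

In runs like this, the naive gap comes out well above zero even though neither teacher is actually better, which is the essence of the bias problem.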

Anyhow, in this piece author Ehrenfreund discusses teacher evaluation and teacher tenure with Rothstein. Some of the key take-aways from the interview, for this audience, follow, but do read the full piece, linked again here, if so inclined:

Rothstein, on teacher evaluation:

  • In terms of evaluating teachers, “[t]here’s no perfect method. I think there are lots of methods that give you some information, and there are lots of problems with any method. I think there’s been a tendency in thinking about methods to prioritize cheap methods over methods that might be more expensive. In particular, there’s been a tendency to prioritize statistical computations based on student test scores, because all you need is one statistician and the test score data. Classroom observation requires having lots of people to sit in the back of lots and lots of classrooms and make judgments.”
  • Why the interest in value-added? “I think that’s a complicated question. It seems scientific, in a way that other methods don’t. Partly it has to do with the fact that it’s cheap, and it seems like an easy answer.”
  • What about the fantabulous study Raj Chetty and his colleagues (Friedman and Rockoff) conducted about teachers’ value-added (which has been the source of many prior posts herein)? “I don’t think anybody disputes that good teachers are important, that teachers matter. I have some methodological concerns about that study, but in any case, even if you take it at face value, what it tells you is that higher value-added teachers’ students earn more on average.”
  • What are the alternatives? “We could double teachers’ salaries. I’m not joking about that. The standard way that you make a profession a prestigious, desirable profession, is you pay people enough to make it attractive. The fact that that doesn’t even enter the conversation tells you something about what’s wrong with the conversation around these topics. I could see an argument that says it’s just not worth it, that it would cost too much. The fact that nobody even asks the question tells me that people are only willing to consider cheap solutions.”

Rothstein, on teacher tenure:

  • “Getting good teachers in front of classrooms is tricky,” and it will likely “still be a challenge without tenure, possibly even harder. There are only so many people willing to consider teaching as a career, and getting rid of tenure could eliminate one of the job’s main attractions.”
  • Likewise, “there are certainly some teachers in urban, high-poverty settings that are not that good, and we ought to be figuring out ways to either help them get better or get them out of the classroom. But it’s important to keep in mind that that’s only one of several sources of the problem.”
  • “Even if you give the principal the freedom to fire lots of teachers, they won’t do it very often, because they know the alternative is worse.” The alternative being replacing an ineffective teacher by an even less effective teacher. Contrary to what is oft-assumed, highly qualified teachers are not knocking down the doors to teach in such schools.
  • Teacher tenure is “really a red herring” in the sense that debating tenure ultimately misleads and distracts others from the more relevant and important issues at hand (e.g., recruiting strong teachers into such schools). Tenure “just doesn’t matter that much. If you got rid of tenure, you would find that the principals don’t really fire very many people anyway” (see also point above).

Using Student Surveys to Evaluate Teachers

The technology section of The New York Times released an article yesterday called “Grading Teachers, With Data From Class.” It’s about using student-level survey data, or what students themselves have to say about the effectiveness of their teachers, to supplement (or perhaps trump) value-added and other test-based data when evaluating teacher effectiveness.

I recommend this article to you all in that it’s pretty much right on in terms of using “multiple measures” to measure most anything educational these days, including teacher effectiveness. Likewise, such an approach aligns with the 2014 “Standards for Educational and Psychological Testing” measurement standards recently released by the leading professional organizations in the area of educational measurement, including the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME).

Some of the benefits of using student surveys to help measure teacher effectiveness:

  • Student-level data based on such surveys typically yield data that are of more formative use to teachers than most other data, including data generated via value-added models (VAMs) and many observational systems.
  • These data represent students’ perceptions and opinions. This is important as these data come directly from students in teachers’ classrooms, and students are the most direct “consumers” of (in)effective teaching.
  • In this article in particular, the survey instrument described is open-source. This is definitely of “added value;” rare is it that products are offered to big (and small) money districts, more or less, for free.
  • This helps with current issues of fairness, or the lack thereof (given that only about 30% of current PreK-12 teachers can be evaluated using students’ test scores). Survey data can apply to nearly all teachers, provided teachers agree that the more generalized items pertain to them and the subject areas they teach (e.g., physical education). One thing to note, however, is that issues typically arise when these survey data are to come from young children. Our littlest ones are typically happy with most any teacher and do not really have the capacity to differentiate among teacher effectiveness items or sub-factors; hence, surveys do not typically yield very useful data for either formative (informative) or summative (summary) purposes in the lowest grade levels. Whether student surveys are appropriate for students in such grades is, accordingly, highly questionable.

Some things to consider and some major notes of caution when using student surveys to help measure teacher effectiveness:

  • Response rates are always an issue when valid inferences are to be drawn from such survey data. Too often folks draw assertions and conclusions they believe to be valid from samples of respondents that are too small and not representative of the population (in this case, the students who were initially solicited for their responses). Response rates cannot be overlooked; if response rates are inadequate, this can and should void the data entirely.
  • There is a rapidly growing market for student-level survey systems such as these, and some vendors are rushing to satisfy the demand without conducting the research necessary to support the claims they are simultaneously marketing. Consumers need to make sure the survey instruments themselves (as well as the online/paper administration systems that often come along with them) are functioning appropriately and, accordingly, yielding reliable, accurate, and useful data (see the sketch after this list for one basic reliability check). These instruments are very difficult to construct and validate, so serious attention should be paid to the actual research supporting marketers’ claims. Consumers should continue to ask for the research evidence, as such research is often incomplete or not done when tools are needed ASAP. District-level researchers should be more than capable of examining the evidence before any contracts are signed.
  • Relatedly, districts should not necessarily do this on their own. Not that district personnel are not capable, but as stated, validation research is a long, arduous, and very necessary process. And typically, the instruments already available (especially if free) do a decent job capturing the general teacher effectiveness construct. This too can be debated, however (e.g., in terms of universal and/or too many items and halo effects).
  • Many in higher education have experience with both developing and using student-level survey data, and much can be learned from the wealth of research on using such systems to evaluate college instructor/professor effectiveness. This research certainly applies here. There is, accordingly, much research about how such survey data can be gamed and manipulated by instructors (e.g., via the use of external incentives/disincentives) and can be biased by respondent or student background variables (e.g., charisma, attractiveness, gender and race as compared to the gender and race of the teacher or instructor, grade expected or earned in the class, overall grade point average, perceived course difficulty or the lack thereof), and the like. This literature should be consulted so that all users of such student-level survey data are aware of the potential pitfalls when using and consuming such output. This research can also help future consumers be proactive in ensuring, as best they can, that results yield inferences that are as valid as possible.
  • On that note, all educational measurements and measurement systems are imperfect. This is precisely why the standards of the profession call for “multiple measures”: the strengths of each measure hopefully help to offset the weaknesses of the others. This should yield a more holistic assessment of the construct of interest, in this case teacher effectiveness. However, the extent to which these data holistically capture teacher effectiveness also needs to be continuously researched and assessed.
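As promised above, here is one minimal sketch of the kind of basic reliability check (Cronbach’s alpha, in this case) a district-level researcher might run on pilot survey data before any contracts are signed. The item responses below are hypothetical, and internal consistency is only one narrow slice of the validation work described above.

```python
# Minimal internal-consistency check (Cronbach's alpha) on hypothetical pilot data.
# Rows are student respondents; columns are Likert-type items from one survey scale.
# Alpha is one narrow slice of reliability evidence, not a substitute for the fuller
# validation research discussed above.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with shape (n_respondents, n_items)."""
    n_items = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - sum_item_variances / total_score_variance)

# Hypothetical 1-5 responses from eight students on four items of one scale.
responses = np.array([
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
    [3, 4, 3, 3],
    [5, 4, 5, 5],
])
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```

A commonly cited rule of thumb treats values around 0.8 or higher as adequate for scales like these, though such a check says nothing about whether the items measure teacher effectiveness in the first place.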

I hope this helps, and please do respond with comments if you all have anything else to add for the good of the group. I should also add that this is an incomplete list of both the strengths and drawbacks of such approaches; the aforementioned research literature, particularly as it represents 30+ years of using student-level surveys in higher education, should be consulted if more information is needed and desired.


Vermont’s Enlightened State Board of Education

The Vermont State Board of Education recently released a more than reasonable “Statement on Assessment and Accountability” that I certainly wish would be read and adopted by other leaders across other states.

They encourage their educators to “make use of diverse indicators of student learning and strengths,” when measuring student learning and achievement, the growth of both over time, and especially when using such data to inform their practice. The use of multiple and diverse indicators (i.e., including traditional and non-traditional tests, teacher-developed assessments, and student work samples) is in line with the professional measurement and assessment standards. At the same time, however, they must also “document the opportunities schools provide to further the goals of equity and [said] growth.”

As for growth on standardized tests, particularly in the case of value-added models (VAMs), they write that such tests and test uses cannot “adequately capture the strengths of all children, nor the growth that can be ascribed to individual teachers. And under high-stakes conditions, when schools feel extraordinary pressure to raise scores, even rising scores may not be a signal that students are actually learning more. At best, a standardized test is an incomplete picture of learning: without additional measures, a single test is inadequate to capture a years’ worth of learning and growth.” This too aligns with the standards of the profession.

They continue, noting that “the way in which standardized tests have been used under federal law as almost the single measure of school quality has resulted in the frequent misuse of these instruments across the nation.” Hence, they also put forth a set of guiding principles they, as a state, are to use to inform their assessment and accountability goals (and mandates).

The principle that should be of most interest to readers of this blog?

  • “Value-added scores – Although the federal government is encouraging states to use value added scores for teacher, principal and school evaluations, this policy direction is not appropriate. A strong body of recent research has found that there is no valid method of calculating “value-added” scores which compare pass rates from one year to the next, nor do current value-added models adequately account for factors outside the school that influence student performance scores. Thus, other than for research or experimental purposes, this technique will not be employed in Vermont schools for any consequential purpose.”

See also their other related principles, which are also very important and are summarized briefly here:

  • All tests must have evidence validating their particular uses. In other words, tests may not be used simply because certain uses make sense in theory or seem convenient. Rather, research evidence must support their uses; otherwise, valid inferences cannot be made, or more importantly, accepted as valid.
  • When such test scores are reported via press and media outlets, more than just test scores, hierarchical rankings of test scores, and the like are to be reported, giving the people of Vermont more holistic understandings of the schools in their state.
  • Educators must actively and consciously prevent “excessive testing” as it “diverts resources and time away from learning while providing little additional value for accountability purposes.”
  • “While the federal government continues to require the use of subjectively determined, cut-off scores; employing such metrics lacks scientific foundation…Claims to the contrary are technically indefensible and their application [is to] be [considered] unethical.”
  • “So that [they] can more validly and meaningfully describe school- and state-level progress…[they also endorse] reporting performance in terms of scale scores and standard deviations rather than percent proficient” indicators.
  • “[A]ny report on a school based on the state’s EQS standards must also include a report on the adequacy of resources provided by or to that school in light of the school’s unique needs. Such evaluations shall address the adequacy of resources [and] the judicious use of resources.”
  • In terms of assessment in general, educators are to always align with and follow “the aforementioned guidelines and principles adopted by the American Educational Research Association, the National Council on Measurement in Education, and the American Psychological Association.”

See also their list of resolutions at the end of this document. Again, I can’t think of a better adjective than enlightened!