Another Study about Bias in Teachers’ Observational Scores

Following up on two prior posts about potential bias in teachers’ observations (see prior posts here and here), another research study was recently released evidencing, again, that the evaluation ratings derived via observations of teachers in practice are indeed related to (and potentially biased by) teachers’ demographic characteristics. The study also evidenced that teachers from racial and ethnic minority backgrounds might be more likely than others not only to receive relatively lower scores but also to be identified for possible dismissal as a result of those relatively lower evaluation scores.

The study, authored by the Regional Educational Laboratory (REL) and sponsored by the U.S. Department of Education’s Institute of Education Sciences, is titled “Teacher Demographics and Evaluation: A Descriptive Study in a Large Urban District” and can be found here; a condensed version of the study can be found here. Interestingly, the study was commissioned by district leaders who were already concerned about what they believed to be occurring in this regard, but for which they had no hard evidence… until the completion of this study.

The authors’ key findings follow (based on three consecutive years of data): Black teachers, teachers age 50 and older, and male teachers were rated below proficient relatively more often than the other district teachers to whom they were compared. More specifically,

  • In all three years the percentage of teachers who were rated below proficient was higher among Black teachers than among White teachers, although the gap was smaller in 2013/14 and 2014/15.
  • In all three years the percentage of teachers with a summative performance rating who were rated below proficient was higher among teachers age 50 and older than among teachers younger than age 50.
  • In all three years the difference in the percentage of male and female teachers with a summative performance rating who were rated below proficient was approximately 5 percentage points or less.
  • The percentage of teachers who improved their rating during all three year-to-year comparisons did not vary by race/ethnicity, age, or gender.

This is certainly something to (still) keep in consideration, especially when teachers are rewarded (e.g., via merit pay) or penalized (e.g., via performance improvement plans or plans for dismissal). Basing these or other high-stakes decisions on not only subjective but also likely biased observational data (see, again, other studies evidencing that this is happening here and here) is not only unwise, it’s also possibly prejudicial.

While the study’s authors note that their findings do not necessarily “explain why the patterns exist or to what they may be attributed,” and that there is a “need for further research on the potential causes of the gaps identified, as well as strategies for ameliorating them,” for starters and at minimum, those conducting these observations across the country must be made aware of these findings.

Citation: Bailey, J., Bocala, C., Shakman, K., & Zweig, J. (2016). Teacher demographics and evaluation: A descriptive study in a large urban district. Washington DC: U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/edlabs/regions/northeast/pdf/REL_2017189.pdf

Ohio Rejects Subpar VAM, for Another VAM Arguably Less Subpar?

From a prior post coming from Ohio (see here), you may recall that Ohio state legislators recently introduced a bill to review the state’s value-added model (VAM), especially as it pertains to the state’s use of the Education Value-Added Assessment System (EVAAS) (see more information about the use of this model in Ohio here).

As per an article published last week in The Columbus Dispatch, the Ohio Department of Education (ODE) apparently rejected a proposal made by the state’s pro-charter school Ohio Coalition for Quality Education and the state’s largest online charter school, both of which wanted to add (or replace) the state’s VAM with a “Similar Students” measure used in California (which could be the Student Growth Percentiles model discussed prior on this blog, for example, here, here, and here).

The ODE charged that this measure “would lower expectations for students with different backgrounds, such as those in poverty,” which is not a common criticism of this model (if I have the model correct), nor is it a common criticism of the model the state already has in place. In fact, and again if I have the model correct, these are really the only two models that do not statistically control for potentially biasing factors (e.g., student demographic and other background factors) when calculating teachers’ value-added; hence, the ODE’s argument against this model may in actuality apply no less to what the state is already doing. Accordingly, statements like that made by Chris Woolard, senior executive director of the ODE, are false: “At the end of the day, our system right now has high expectations for all students. This (California model) violates that basic principle that we want all students to be able to succeed.”

The models, again if I am correct, are very much the same. While the California measure might indeed consider “student demographics such as poverty, mobility, disability and limited-English learners,” this model (if I am correct on the model) does not statistically factor these variables out. If anything, the state’s EVAAS system does, even though EVAAS modelers claim they do not, by statistically controlling for students’ prior performance, which (unfortunately) already has these demographics built into it. In essence, they are already doing the same thing they now protest.
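
To make this distinction concrete, below is a minimal sketch in Python, using simulated data and hypothetical variable names (this is not the actual EVAAS or “Similar Students” code), of the two specifications at issue: a value-added-style regression that controls only for prior test scores versus one that also includes a demographic covariate directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (hypothetical) student data: poverty influences both prior and
# current achievement, so prior scores already "carry" poverty with them.
poverty = rng.binomial(1, 0.4, n)                 # 1 = economically disadvantaged
prior = rng.normal(0, 1, n) - 0.5 * poverty       # prior-year test score
current = 0.7 * prior - 0.3 * poverty + rng.normal(0, 0.5, n)

def ols(y, columns):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Specification 1: control for prior achievement only (in the spirit of EVAAS).
b_prior_only = ols(current, [prior])
# Specification 2: also control for a demographic covariate directly.
b_with_poverty = ols(current, [prior, poverty])

print("prior-only coefficients (intercept, prior):", b_prior_only.round(2))
print("with-poverty coefficients (intercept, prior, poverty):", b_with_poverty.round(2))
# In a VAM, residuals from a specification like one of these would then be
# aggregated by teacher; the sketch only illustrates that the two
# specifications treat demographics differently.
```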

Indeed, as per a statement made by Ron Adler, president of the Ohio Coalition for Quality Education, not only is it “disappointing that ODE spends so much time denying that poverty and mobility of students impedes their ability to generate academic performance…they [continue to] remain absolutely silent about the state’s broken report card and continually defend their value-added model that offers no transparency and creates wild swings for schools across Ohio” (i.e., the EVAAS system, although in all fairness all VAMs and the SGP yield the “wild swings” noted). See, for example, here.

What might be worse, though, is that the ODE apparently found that, depending on the variables used in the California model, it produced different results. Guess what! All VAMs, depending on the variables used, produce different results. In fact, using the same data and different VAMs for the same teachers at the same time also produces (in some cases grossly) different results. The bottom line here is that anyone who thinks that any VAM is yielding estimates from which valid or “true” statements can be made is fooling him- or herself.

Value-Added for Kindergarten Teachers in Ecuador

In a study a colleague of mine recently sent me, recently published in The Quarterly Journal of Economics and titled “Teacher Quality and Learning Outcomes in Kindergarten,” the authors (nearly randomly) assigned two cohorts of more than 24,000 kindergarten students to teachers to examine whether, indeed and once again, teacher behaviors are related to growth in students’ test scores over time (i.e., value-added).

To assess this, researchers administered 12 tests to the kindergarteners (I know) at the beginning and end of the year in mathematics and language arts (although apparently the 12 posttests took only 30–40 minutes to complete, which is a content validity and coverage issue in and of itself, p. 1424). They also assessed something they called executive function (EF), which they defined as children’s inhibitory control, working memory, capacity to pay attention, and cognitive flexibility, all of which they argue to be related to “Volumetric measures of prefrontal cortex size [when] predict[ed]” (p. 1424). This, along with the fact that teachers’ IQs were also measured (using the Spanish-language version of the Wechsler Adult Intelligence Scale), speaks directly to the researchers’ background theory and approach (e.g., recall our world’s history with craniometry, aptly captured in one of my favorite books — Stephen J. Gould’s best-selling “The Mismeasure of Man”). Teachers were also observed using the Classroom Assessment Scoring System (CLASS), and parents were also solicited for their opinions about their children’s teachers (see other measures collected, pp. 1417–1418).

What should by now be some familiar names (e.g., Raj Chetty, Thomas Kane) served as collaborators on the study. Likewise, their works and the works of other likely familiar scholars and notorious value-added supporters (e.g., Eric Hanushek, Jonah Rockoff) are cited throughout as evidence of “substantial research” (p. 1416) in support of value-added models (VAMs). Of course, this is unfortunate but important to point out in that it is an indicator of “researcher bias” in and of itself. For example, one of the authors’ findings really should come as no surprise: “Our results…complement estimates from [Thomas Kane’s Bill & Melinda Gates Measures of Effective Teaching] MET project” (p. 1419); although, in a very interesting footnote (p. 1419), the authors describe in more detail than I’ve seen elsewhere all of the weaknesses with the MET study in terms of its design, “substantial attrition,” “serious issue[s]” with contamination and compliance, and possibly/likely biased findings caused by self-selection given the extent to which teachers volunteered to be a part of the MET study.

Also very important to note is that this study took place in Ecuador. Apparently, “they,” including some of the key players in this area of research noted above, are moving their VAM-based efforts across international waters, perhaps in part given the Every Student Succeeds Act (ESSA) recently passed in the U.S., which, as we should all know by now, dramatically curbed federal efforts akin to what is apparently going on now and being pushed here and in other developing countries (although the authors assert that Ecuador is a middle-income country, not a developing country, even though this categorization apparently applies only to the petroleum-rich sections of the nation). Relatedly, they assert that “concerns about teacher quality are likely to be just as important in [other] developing countries” (p. 1416); hence, adopting VAMs in such countries might just be precisely what these countries need to “reform” their schools, as well.

Unfortunately, many big businesses and banks (e.g., the Inter-American Development Bank that funded this particular study) are becoming increasingly interested in investing in and solving these and other developing countries’ educational woes, as well, via measuring and holding teachers accountable for teacher-level value-added, regardless of the extent to which doing this has not worked in the U.S. to improve much of anything. Needless to say, many who are involved with these developing-nation initiatives, including some of those mentioned above, are also financially benefitting by continuing to serve others their proverbial Kool-Aid.

Nonetheless, their findings:

  • First, they “estimate teacher (rather than classroom) effects of 0.09 on language and math” (p. 1434). That is, just less than 1/10th of a standard deviation, or just over a 3-percentile-point move in the positive direction away from the mean (see the conversion sketch after this list).
  • Similarly, they “estimate classroom effects of 0.07 standard deviation on EF” (p. 1433). That is, precisely 7/100ths of a standard deviation, or just under a 3-percentile-point move in the positive direction away from the mean.
  • They found that “children assigned to teachers with a 1-standard deviation higher CLASS score have between 0.05 and 0.07 standard deviation higher end-of-year test scores” (p. 1437), or about a 2- to 3-percentile-point move in the positive direction away from the mean.
  • And they found “that parents generally give higher scores to better teachers…parents are 15 percentage points more likely to classify a teacher who produces 1 standard deviation higher test scores as ‘very good’ rather than ‘good’ or lower” (p. 1442). This is quite an odd way of putting it, as it assumes that the difference between “very good” and “good” is not arbitrary but empirically grounded, and it is unclear why a simple correlation was not more simply reported.
  • Their most major finding is that “a 1 standard deviation increase in classroom quality, corrected for sampling error, results in 0.11 standard deviation higher test scores in both language and math” (p. 1433; see also other findings on pp. 1434–1447).
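
For readers who want to check these standard-deviation-to-percentile conversions themselves, here is a minimal sketch assuming approximately normal score distributions (an assumption; the effect sizes are simply those quoted above, and the output can be compared against my approximations).

```python
import math

def percentile_gain_from_mean(effect_sd: float) -> float:
    """Percentile-point movement from the 50th percentile implied by an effect
    expressed in standard-deviation units, assuming a normal distribution."""
    return (0.5 * (1.0 + math.erf(effect_sd / math.sqrt(2.0))) - 0.5) * 100

for label, effect in [("teacher effect on language/math", 0.09),
                      ("classroom effect on EF", 0.07),
                      ("CLASS-associated effect (low end)", 0.05),
                      ("classroom quality, corrected", 0.11)]:
    print(f"{label}: {effect} SD is about "
          f"{percentile_gain_from_mean(effect):.1f} percentile points")
```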

Interestingly, the authors equate all of these effects to teacher or classroom “shocks,” although I’d hardly call them “shocks,” a term that inherently implies a large, unidirectional, and causal impact. Moreover, this also reveals how the authors, as economists, still view this type of research (i.e., as causal rather than correlational, even with close-to-random assignment, although they make a slight mention of this possibility on p. 1449).

Nonetheless, the authors conclude that in this article they effectively evidenced “that there are substantial differences [emphasis added] in the amount of learning that takes place in language, math, and executive function across kindergarten classrooms in Ecuador” (p. 1448). In addition, “These differences are associated with differences in teacher behaviors and practices,” as observed, and “that parents can generally tell better from worse teachers, but do not meaningfully alter their investments in children in response to random shocks [emphasis added] to teacher quality” (p. 1448).

Ultimately, they find that “value added is a useful summary measure of teacher quality in Ecuador” (p. 1448). Go figure…

They conclude that “to date, no country in Latin America regularly calculates the value added of teachers,” yet “in virtually all countries in the region, decisions about tenure, in-service training, promotion, pay, and early retirement are taken with no regard for (and in most cases no knowledge about) a teacher’s effectiveness” (p. 1448). Also sound familiar?

“Value added is no silver bullet,” they also argue, and indeed it is not, as per much evidence now existent throughout the U.S., “but knowing which teachers produce more or less learning among equivalent students [is] an important step to designing policies to improve learning outcomes” (p. 1448).

Citation: Araujo, M. C., Carneiro, P., Cruz-Aguayo, Y., & Schady, N. (2016). Teacher quality and learning outcomes in kindergarten. The Quarterly Journal of Economics, 131(3), 1415–1453. doi:10.1093/qje/qjw016 Retrieved from http://qje.oxfordjournals.org/content/131/3/1415.abstract

Bias in Teacher Observations, As Well

Following a post last month titled “New Empirical Evidence: Students’ ‘Persistent Economic Disadvantage’ More Likely to Bias Value-Added Estimates,” Matt Barnum — senior staff writer for The 74, an (allegedly) non-partisan, honest, and fact-based news site backed by Editor-in-Chief Campbell Brown and covering America’s education system “in crisis” (see, also, a prior post about The 74 here) — followed up via Twitter. He wrote: “Yes, though [bias caused by economic disadvantage] likely applies with equal or even more force to other measures of teacher quality, like observations.” I replied via Twitter that I disagreed with this statement in that I was unaware of research in support of his assertion, and Barnum sent me two articles to review thereafter.

I attempted to review both of these articles herein, although I quickly figured out that I had actually read and reviewed the first (2014) piece on this blog (see the original post here; see also a 2014 Brookings Institution article summarizing this piece here). In short, in this study researchers found that the observational components of states’ contemporary teacher evaluation systems certainly “add” more “value” than their value-added counterparts, especially for (in)formative purposes. However, researchers found that observational bias also exists, akin to value-added bias, in that teachers who are non-randomly assigned students who enter their classrooms with higher levels of prior achievement tend to get higher observational scores than teachers non-randomly assigned students entering their classrooms with lower levels of prior achievement. Researchers concluded that because districts “do not have processes in place to address the possible biases in observational scores,” statistical adjustments might be made to offset said bias, as might external observers/raters be brought in to yield more “objective” observational assessments of teachers.

For the second study, and this post here, I gave this one a more thorough read (you can find the full, pre-publication study here). Using data from the Measures of Effective Teaching (MET) Project, in which random assignment was used (or, more accurately, attempted), researchers also explored the extent to which the students enrolled in teachers’ classrooms influence teachers’ classroom observational scores.

They found, primarily, that:

  1. “[T]he context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ performance” as measured by teacher observations. More specifically, “ELA [English/language arts] teachers were more than twice as likely to be rated in the top performance quintile if [nearly randomly] assigned the highest achieving students compared with teachers assigned the lowest achieving students,” and “math teachers were more than 6 times as likely.” In addition, “approximately half of the teachers—48% in ELA and 54% in math—were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores.”
  2. “[T]he intentional sorting of teachers to students has a significant influence on measured performance” as well. More specifically, results further suggest that “higher performing students [are, at least sometimes] endogenously sorted into the classes of higher performing teachers…Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would reveal more effective teacher performance, as measured by classroom observation scores, than may actually be true.”

So, the non-random assignment of teachers biases both the value-added and observational components written into America’s now “more objective” teacher evaluation systems, as (formerly) required of all states that were to comply with federal initiatives and incentives (e.g., Race to the Top). In addition, when those responsible for assigning students to classrooms (sub)consciously favor teachers with high prior observational scores, this exacerbates these issues. This is especially important when observational (and value-added) data are to be used for high-stakes accountability systems, in that the data yielded via both measurement systems may be less likely to reflect “true” teaching effectiveness due to “true” bias. “Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality,” and vice versa.

Citation Study #1: Whitehurst, G. J., Chingos, M. M., & Lindquist, K. M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution. Retrieved from https://www.brookings.edu/wp-content/uploads/2016/06/Evaluating-Teachers-with-Classroom-Observations.pdf

Citation Study #2: Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293-317. doi:10.3102/0162373715616249  Retrieved from http://static.politico.com/58/5f/f14b2b144846a9b3365b8f2b0897/study-of-classroom-observations-of-teachers.pdf

The “Value-Added” of Teacher Preparation Programs: New Research

The journal Economics of Education Review recently published a study titled “Teacher Quality Differences Between Teacher Preparation Programs: How Big? How Reliable? Which Programs Are Different?” The study was authored by researchers at the University of Texas at Austin, Duke University, and Tulane University. The pre-publication version of this piece can be found here.

As the title implies, the purpose of the study was to “evaluate statistical methods for estimating teacher quality differences between TPPs [teacher preparation programs].” Needless to say, this research is particularly relevant here given that “Sixteen US states have begun to hold teacher preparation programs (TPPs) accountable for teacher quality, where quality is estimated by teacher value-added to student test scores.” The federal government continues to support and advance these initiatives, as well (see, for example, here).

But this research study is also particularly important because researchers found that “[t]he most convincing estimates [of TPP quality] [came] from a value-added model where confidence intervals [were] widened;” that is, a model in which the allowance for measurement error was dramatically increased, and widened further using statistical corrections. Yet even when using these statistical techniques and accommodations, they found that it was still “rarely possible to tell which TPPs, if any, [were] better or worse than average.”

They therefore concluded that “[t]he potential benefits of TPP accountability may be too small to balance the risk that a proliferation of noisy TPP estimates will encourage arbitrary and ineffective policy actions” in response. More specifically, and in their own words, they found that:

  1. Differences between TPPs. While most of [their] results suggest that real differences between TPPs exist, the differences [were] not large [or large enough to make or evidence the differentiation between programs as conceptualized and expected]. [Their] estimates var[ied] a bit with their statistical methods, but averaging across plausible methods [they] conclude[d] that between TPPs the heterogeneity [standard deviation (SD) was] about .03 in math and .02 in reading. That is, a 1 SD increase in TPP quality predict[ed] just [emphasis added] a [very small] .03 SD increase in student math scores and a [very small] .02 SD increase in student reading scores.
  2. Reliability of TPP estimates. Even if the [above-mentioned] differences between TPPs were large enough to be of policy interest, accountability could only work if TPP differences could be estimated reliably. And [their] results raise doubts that they can. Every plausible analysis that [they] conducted suggested that TPP estimates consist[ed] mostly of noise. In some analyses, TPP estimates appeared to be about 50% noise; in other analyses, they appeared to be as much as 80% or 90% noise (see the simulation sketch after this list for what, say, 80% noise implies)…Even in large TPPs the estimates were mostly noise [although]…[i]t is plausible [although perhaps not probable]…that TPP estimates would be more reliable if [researchers] had more than one year of data…[although states smaller than the one in this study — Texas]…would require 5 years to accumulate the amount of data that [they used] from one year of data.
  3. Notably Different TPPs. Even if [they] focus[ed] on estimates from a single model, it remains hard to identify which TPPs differ from the average…[Again,] TPP differences are small and estimates of them are uncertain.
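
To see what, say, 80% noise implies in practice, here is a minimal simulation sketch using hypothetical numbers (not the authors’ actual data or estimator): true TPP effects with a standard deviation of about .03 are buried in sampling noise, and the resulting estimates re-order programs largely at random.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tpps = 100

true_sd = 0.03                 # heterogeneity in math reported in the study
noise_share = 0.80             # "80% noise" scenario (hypothetical)
noise_sd = true_sd * np.sqrt(noise_share / (1 - noise_share))

true_effects = rng.normal(0, true_sd, n_tpps)
estimates = true_effects + rng.normal(0, noise_sd, n_tpps)

# Reliability: share of the estimates' variance that is real signal.
reliability = true_sd**2 / (true_sd**2 + noise_sd**2)

# Spearman rank correlation between true and estimated TPP quality.
rank_corr = np.corrcoef(np.argsort(np.argsort(true_effects)),
                        np.argsort(np.argsort(estimates)))[0, 1]

print(f"implied reliability of TPP estimates: {reliability:.2f}")
print(f"rank correlation, true vs. estimated TPP quality: {rank_corr:.2f}")
```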

In conclusion, that researchers found “that there are only small teacher quality differences between TPPs” might seem surprising, but it is not, really, given that the outcome variables used to measure and assess TPP effects were students’ test scores. In short, students’ test scores are three times removed from the primary unit of analysis in studies like these. That is, (1) the TPP is to be measured by the effectiveness of its teacher graduates, (2) teacher graduates are to be measured by their purported impacts on their students’ test scores, and (3) students’ test scores have only been validated for measuring student learning and achievement. These test scores have not been validated to assess and measure, in the inverse, teachers’ causal impacts on said achievement, or TPPs’ impacts on teachers and, through them, on said achievement.

If this sounds confusing, it is, and it is also highly nonsensical; but this is also a reason why this is so difficult to do and, as evidenced in this study, improbable to do well or as theorized, in that TPP estimates are highly sensitive to error and, accordingly, highly uncertain and of questionable validity.

Citation: von Hippel, P. T., Bellows, L., Osborne, C., Lincove, J. A., & Mills, N. (2016). Teacher quality differences between teacher preparation programs: How big? How reliable? Which programs are different? Economics of Education Review, 53, 31–45. doi:10.1016/j.econedurev.2016.05.002

VAM-Based Chaos Reigns in Florida, as Caused by State-Mandated Teacher Turnovers

The state of Florida is another one of our states to watch in that, even since the passage of the Every Student Succeeds Act (ESSA) last January, the state is still moving forward with using its VAMs for high-stakes accountability reform. See my most recent post about one district in Florida here, after the state ordered it to dismiss a good number of its teachers as per their low VAM scores when this school year started. After realizing this also caused or contributed to a teacher shortage in the district, the district scrambled to hire Kelly Services-contracted substitute teachers to replace them, after which the district also put administrators back into the classroom to help alleviate the bad situation turned worse.

In a recent article released by The Ledger, teachers from the same Polk County School District (approximately 100,000 students) added much-needed details and also voiced concerns about all of this in the piece that author Madison Fantozzi titled “Polk teachers: We are more than value-added model scores.”

Throughout this piece Fantozzi covers the story of Elizabeth Keep, a teacher who was “plucked from” the middle school in which she taught for 13 years, after which she was involuntarily placed at a district high school “just days before she was to report back to work.” She was one of 35 teachers moved from five schools in need of reform as based on the schools’ value-added scores, although this was clearly done with no real concern or regard for the disruption this would cause these teachers, not to mention the students on the exiting and receiving ends. Likewise, and according to Keep, “If you asked students what they need, they wouldn’t say a teacher with a high VAM score…They need consistency and stability.” Apparently not. In Keep’s case, she “went from being the second most experienced person in [her middle school’s English] department…where she was department chair and oversaw the gifted program, to a [new, never-before-held position as a] 10th- and 11th-grade English teacher” at the new high school to which she was moved.

As background, when Polk County School District officials presented turnaround plans to the State Board of Education last July, board members “were most critical of their inability to move ‘unsatisfactory’ teachers out of the schools and ‘effective’ teachers in.” One board member, for example, expressed finding it “horrendous” that the district was “held hostage” by the extent to which the local union was protecting teachers from being moved as per their value-added scores. Referring to the union, and its interference in this “reform,” he accused the union of “shackling” the district and preventing its intended reforms. Note that the “effective” teachers who are to replace the “ineffective” ones can earn up to $7,500 in bonuses per year to help “turn around” the schools into which they enter.

Likewise, the state’s Commissioner of Education concurred, saying that she also “wanted ‘unsatisfactory’ teachers out and ‘highly effective’ teachers in,” again, with effectiveness being defined by teachers’ value-added or lack thereof, even though (1) the teachers targeted had only one or two of the three years of value-added data required by state statute, and even though (2) the district’s senior director of assessment, accountability and evaluation noted that, in line with a plethora of other research findings, teachers being evaluated using the state’s VAM have a 51% chance of changing their scores from one year to the next. This lack of reliability, as we know it, should outright prevent any such moves, in that without some level of stability, valid inferences from which valid decisions are to be made cannot be drawn. It’s literally impossible.

Nonetheless, state board of education members “unanimously… threatened to take [all of the district’s poor-performing schools] over or close them in 2017-18 if district officials [didn’t] do what [the Board said].” See also other tales of similar districts in the article available, again, here.

In Keep’s case, “her ‘unsatisfactory’ VAM score [that caused the district to move her], paired with her ‘highly effective’ in-class observations by her administrators, brought her overall district evaluation to ‘effective’…[although she also notes that]…her VAM scores fluctuate because the state has created a moving target.” Regardless, Keep was notified “five days before teachers were due back to their assigned schools Aug. 8, [after which she was] told she had to report to a new school with a different start time that [also] disrupted her 13-year routine and family that shares one car.”

VAM-based chaos reigns, especially in Florida.

U.S. Department of Education: Value-Added Not Good for Evaluating Schools and Principals

Just this month, the Institute of Education Sciences (IES) wing of the U.S. Department of Education released a report about using value-added models (VAMs) for measuring school principals’ performance. The study, conducted by researchers at Mathematica Policy Research and titled “Can Student Test Scores Provide Useful Measures of School Principals’ Performance?”, can be found online here, with my summary of the study’s findings highlighted next and herein.

Before the passage of the Every Student Succeeds Act (ESSA), 40 states had written into their state statutes, as incentivized by the federal government, requirements to use student achievement growth for annual principal evaluation purposes. More states had written growth/value-added models (VAMs) into statute for teacher evaluation purposes, which we have covered extensively via this blog, but this report pertains only to school and/or principal evaluation purposes. Now, since the passage of ESSA and the reduction in the federal government’s control over state-level policies, states have much more liberty to decide whether to continue using student achievement growth for either purpose. This paper is positioned within this reasoning, and more specifically to help states decide whether or to what extent they might (or might not) continue to move forward with using growth/VAMs for school and principal evaluation purposes.

Researchers, more specifically, assessed (1) reliability – or the consistency or stability of these ratings over time, which is important “because only stable parts of a rating have the potential to contain information about principals’ future performance; unstable parts reflect only transient aspects of their performance;” and (2) one form of validity evidence – the predictive validity of these principal-level measures, with predictive validity defined as “the extent to which ratings from these measures accurately reflect principals’ contributions to student achievement in future years.” In short, “A measure could have high predictive validity only if [emphasis added] it was highly stable between consecutive years [i.e., reliability]…and its stable part was strongly related to principals’ contributions to student achievement” over time (i.e., predictive validity).
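
As a minimal sketch of what these two properties mean in practice, the simulation below (with hypothetical numbers, not the study’s actual measures or models) computes a rating’s year-to-year correlation as its stability and its correlation with a simulated future contribution as its predictive validity.

```python
import numpy as np

rng = np.random.default_rng(2)
n_principals = 500

# Hypothetical decomposition: a rating = a stable component + transient noise.
stable = rng.normal(0, 1, n_principals)
rating_y1 = stable + rng.normal(0, 1.5, n_principals)
rating_y2 = stable + rng.normal(0, 1.5, n_principals)

# Assume (for illustration only) that future contributions are only weakly
# related to the stable component of the rating.
future_contribution = 0.3 * stable + rng.normal(0, 1, n_principals)

stability = np.corrcoef(rating_y1, rating_y2)[0, 1]
predictive_validity = np.corrcoef(rating_y1, future_contribution)[0, 1]

print(f"year-to-year stability of the rating: {stability:.2f}")
print(f"correlation with future contribution: {predictive_validity:.2f}")
# High predictive validity requires BOTH a large stable share of the rating
# AND a strong link between that stable part and future contributions.
```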

Researchers used principal-level value-added (unadjusted and adjusted for prior achievement and other potentially biasing demographic variables) to more directly examine “the extent to which student achievement growth at a school differed from average growth statewide for students with similar prior achievement and background characteristics.” Also important to note is that the data they used to examine school-level value-added came from Pennsylvania, which is one of a handful of states that uses the popular and proprietary (and controversial) Education Value-Added Assessment System (EVAAS) statewide.

Here are the researchers’ key findings, taken directly from the study’s summary (again, for more information see the full manuscript here).

  • The two performance measures in this study that did not account for students’ past achievement—average achievement and adjusted average achievement—provided no information for predicting principals’ contributions to student achievement in the following year.
  • The two performance measures in this study that accounted for students’ past achievement—school value-added and adjusted school value-added—provided, at most, a small amount of information for predicting principals’ contributions to student achievement in the following year. This was due to instability and inaccuracy in the stable parts.
  • Averaging performance measures across multiple recent years did not improve their accuracy for predicting principals’ contributions to student achievement in the following year. In simpler terms, a principal’s average rating over three years did not predict his or her future contributions more accurately than did a rating from the most recent year only. This is more of a statistical finding than one that has direct implications for policy and practice (except for silly states who might, despite findings like those presented in this study, decide that they can use one year to do this not at all well instead of three years to do this not at all well).

Their bottom line? “…no available measures of principal [/school] performance have yet been shown to accurately identify principals [/schools] who will contribute successfully to student outcomes in future years,” especially if based on students’ test scores, although the researchers also assert that “no research has ever determined whether non-test measures, such as measures of principals’ leadership practices, [have successfully or accurately] predict[ed] their future contributions” either.

The researchers follow-up with a highly cautionary note: “the value-added measures will make plenty of mistakes when trying to identify principals [/schools] who will contribute effectively or ineffectively to student achievement in future years. Therefore, states and districts should exercise caution when using these measures to make major decisions about principals. Given the inaccuracy of the test-based measures, state and district leaders and researchers should also make every effort to identify nontest measures that can predict principals’ future contributions to student outcomes [instead].”

Citation: Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington DC: U.S. Department of Education, Institute of Education Sciences. Retrieved from http://ies.ed.gov/ncee/pubs/2016002/pdf/2016002.pdf

New Empirical Evidence: Students’ “Persistent Economic Disadvantage” More Likely to Bias Value-Added Estimates

The National Bureau of Economic Research (NBER) recently released a circulated but not-yet internally or externally reviewed study titled “The Gap within the Gap: Using Longitudinal Data to Understand Income Differences in Student Achievement.” Note that we have covered NBER studies such as this in the past in this blog, so in all fairness and like I have noted in the past, this paper should also be critically consumed, as well as my interpretations of the authors’ findings.

Nevertheless, this study is authored by Katherine Michelmore — Assistant Professor of Public Administration and International Affairs at Syracuse University, and Susan Dynarski — Professor of Public Policy, Education, and Economics at the University of Michigan, and this study is entirely relevant to value-added models (VAMs). Hence, below I cover their key highlights and takeaways, as I see them. I should note up front, however, that the authors did not directly examine how the new measure of economic disadvantage that they introduce (see below) actually affects calculations of teacher-level value-added. Rather, they motivate their analyses by saying that calculating teacher value-added is one application of their analyses.

The background to their study is as follows: “Gaps in educational achievement between high- and low-income children are growing” (p. 1), but the data that are used to capture “high- and low-income” in the state of Michigan (i.e., the state in which their study took place) and many if not most other states throughout the US, capture “income” demographics in very rudimentary, blunt, and often binary ways (i.e., “yes” for students who are eligible to receive federally funded free-or-reduced lunches and “no” for the ineligible).

Consequently, in this study the authors “leverage[d] the longitudinal structure of these data sets to develop a new measure of persistent economic disadvantage” (p. 1), all the while defining “persistent economic disadvantage” by the extent to which students were “eligible for subsidized meals in every grade since kindergarten” (p. 8). Students “who [were] never eligible for subsidized meals during those grades [were] defined as never [being economically] disadvantaged” (p. 8), and students who were eligible for subsidized meals for variable years were defined as “transitorily disadvantaged” (p. 8). This all runs counter, however, to the binary codes typically used, again, across the nation.
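
As a minimal sketch of how such a longitudinal measure might be constructed (with hypothetical column names and toy data; the authors’ actual Michigan data and coding decisions are more involved), one can count the grades in which each student was eligible for subsidized meals and classify students accordingly.

```python
import pandas as pd

# Hypothetical long-format data: one row per student per grade, with a binary
# subsidized-meal eligibility flag (the usual blunt indicator).
records = pd.DataFrame({
    "student_id":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "grade":        [0, 1, 2, 0, 1, 2, 0, 1, 2],   # K = 0
    "frl_eligible": [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

summary = (records.groupby("student_id")["frl_eligible"]
           .agg(grades_disadvantaged="sum", grades_observed="count")
           .reset_index())

def classify(row):
    if row.grades_disadvantaged == row.grades_observed:
        return "persistently disadvantaged"
    if row.grades_disadvantaged == 0:
        return "never disadvantaged"
    return "transitorily disadvantaged"

summary["status"] = summary.apply(classify, axis=1)
print(summary)
```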

Appropriately, then, their goal (among other things) was to see how a new measure they constructed to better measure and capture “persistent economic disadvantage” might help when calculating teacher-level value-added. They accordingly argue (among other things) that, perhaps, not accounting for persistent disadvantage might subsequently cause more biased value-added estimates “against teachers of [and perhaps schools educating] persistently disadvantaged children” (p. 3). This, of course, also depends on how persistently disadvantaged students are (non)randomly assigned to teachers.

Given statistics like the following, reported in their paper: “Students [in Michigan] [persistently] disadvantaged by 8th grade were six times more likely to be black and four times more likely to be Hispanic, compared to those who were never disadvantaged,” their assertions speak volumes not only about the importance of their findings for educational policy, but also for the teachers and schools still being evaluated using value-added scores and the researchers investigating, criticizing, promoting, or even trying to make these models better (if that is possible). In short, though, teachers who disproportionately teach persistently disadvantaged students, often in urban areas and alongside equally disadvantaged peers, might realize relatively more biased value-added estimates as a result.

For value-added purposes, then, it is clear that the assumption that one can adequately control for student disadvantage using such basal indicators of current economic disadvantage is overly simplistic, and just using test scores to also account for this economic disadvantage (i.e., as promoted in most versions of the Education Value-Added Assessment System (EVAAS)) is likely worse. More specifically, the assumption that economic disadvantage does not impact some students more than others over time, or over the period of data being used to capture value-added (typically 3–5 years of students’ test score data), is also highly suspect. “[T]hat children who are persistently disadvantaged perform worse than those who are disadvantaged in only some grades” (p. 14) also violates another fundamental assumption: that teachers’ effects are consistent over time for similar students who learn at more or less consistent rates over time, regardless of these and other demographics.

The bottom line here, then, is that the indicator that should be used instead of our currently used proxies for current economic disadvantage is the number of grades students spend in economic disadvantage. If the value-added indicator does not effectively account for the “negative, nearly linear relationship between [students’ test] scores and the number of grades spent in economic disadvantage” (p. 18), while controlling for other student demographics and school fixed effects, value-added estimates will likely be (even) more biased against teachers who teach these students as a result.
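
As a minimal sketch of the kind of specification this implies (simulated data and hypothetical variable names; this is not the authors’ actual model), one could regress current scores on the number of grades spent in economic disadvantage, other student covariates, and school fixed effects, and check whether the estimated per-grade penalty is recovered.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_schools = 6000, 30

school = rng.integers(0, n_schools, n)
grades_disadv = rng.integers(0, 9, n)          # grades spent disadvantaged, K-8
other_covariate = rng.normal(0, 1, n)          # stand-in for other demographics

# Simulated scores with a nearly linear penalty per grade of disadvantage
# and school-level intercepts (the "school fixed effects").
school_intercepts = rng.normal(0, 0.3, n_schools)
score = (school_intercepts[school] - 0.05 * grades_disadv
         + 0.2 * other_covariate + rng.normal(0, 0.8, n))

# Design matrix: grades-in-disadvantage, other covariate, school dummies.
school_dummies = np.eye(n_schools)[school]
X = np.column_stack([grades_disadv, other_covariate, school_dummies])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

print(f"estimated penalty per grade spent in disadvantage: {beta[0]:.3f}")
```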

Otherwise, teachers who teach students with persistent economic disadvantage will likely have it worse (i.e., in terms of bias) than teachers who teach students with current economic disadvantage, teachers who teach students with economic disadvantage in their current or past histories will have it worse than teachers who teach students without (m)any prior economic disadvantages, and so on.

Citation: Michelmore, K., & Dynarski, S. (2016). The gap within the gap: Using longitudinal data to understand income differences in student achievement. Cambridge, MA: National Bureau of Economic Research (NBER). Retrieved from http://www.nber.org/papers/w22474

One Score and Seven Policy Iterations Ago…

I just read what might be one of the best articles I’ve read in a long time on using test scores to measure teacher effectiveness, and why this is such a bad idea. Not surprisingly, unfortunately, this article was written 30 years ago (i.e., in 1986) by Edward Haertel, National Academy of Education member and recently retired Professor at Stanford University. If the name sounds familiar, it should, as Professor Emeritus Haertel is one of the best on the topic of, and history behind, VAMs (see prior posts about his related scholarship here, here, and here). To access the full article, please scroll to the reference at the bottom of this post.

Haertel wrote this article at a time when policymakers were, as they still are now, trying to hold teachers accountable for their students’ learning as measured by states’ standardized test scores. Although this article deals with minimum competency tests, which were in policy fashion at the time, about seven policy iterations ago, the contents of the article still have much relevance given where we are today — investing in “new and improved” Common Core tests and still riding on unsinkable beliefs that this is the way to reform the schools that have been in despair and (still) in need of major repair since 20+ years ago.

Here are some of the points I found of most “value:”

  • On isolating teacher effects: “Inferring teacher competence from test scores requires the isolation of teaching effects from other major influences on student test performance,” while “the task is to support an interpretation of student test performance as reflecting teacher competence by providing evidence against plausible rival hypotheses or interpretation.” While “student achievement depends on multiple factors, many of which are out of the teacher’s control,” and many of which cannot and likely never will be able to be “controlled.” In terms of home supports, “students enjoy varying levels of out-of-school support for learning. Not only may parental support and expectations influence student motivation and effort, but some parents may share directly in the task of instruction itself, reading with children, for example, or assisting them with homework.” In terms of school supports, “[s]choolwide learning climate refers to the host of factors that make a school more than a collection of self-contained classrooms. Where the principal is a strong instructional leader; where schoolwide policies on attendance, drug use, and discipline are consistently enforced; where the dominant peer culture is achievement-oriented; and where the school is actively supported by parents and the community.” This, all, makes isolating the teacher effect nearly if not wholly impossible.
  • On the difficulties with defining the teacher effect: “Does it include homework? Does it include self-directed study initiated by the student? How about tutoring by a parent or an older sister or brother? For present purposes, instruction logically refers to whatever the teacher being evaluated is responsible for, but there are degrees of responsibility, and it is often shared. If a teacher informs parents of a student’s learning difficulties and they arrange for private tutoring, is the teacher responsible for the student’s improvement? Suppose the teacher merely gives the student low marks, the student informs her parents, and they arrange for a tutor? Should teachers be credited with inspiring a student’s independent study of school subjects? There is no time to dwell on these difficulties; others lie ahead. Recognizing that some ambiguity remains, it may suffice to define instruction as any learning activity directed by the teacher, including homework….The question also must be confronted of what knowledge counts as achievement. The math teacher who digresses into lectures on beekeeping may be effective in communicating information, but for purposes of teacher evaluation the learning outcomes will not match those of a colleague who sticks to quadratic equations.” Much if not all of this cannot and likely never will be able to be “controlled” or “factored” in or out, as well.
  • On standardized tests: The best of standardized tests will (likely) always be too imperfect and not up to the teacher evaluation task, no matter the extent to which they are pitched as “new and improved.” While it might appear that these “problem[s] could be solved with better tests,” they cannot. Ultimately, all that these tests provide is “a sample of student performance. The inference that this performance reflects educational achievement [not to mention teacher effectiveness] is probabilistic [emphasis added], and is only justified under certain conditions.” Likewise, these tests “measure only a subset of important learning objectives, and if teachers are rated on their students’ attainment of just those outcomes, instruction of unmeasured objectives [is also] slighted.” Like it was then as it still is today, “it has become a commonplace that standardized student achievement tests are ill-suited for teacher evaluation.”
  • On the multiple choice formats of such tests: “[A] multiple-choice item remains a recognition task, in which the problem is to find the best of a small number of predetermined alternatives and the criteria for comparing the alternatives are well defined. The nonacademic situations where school learning is ultimately applied rarely present problems in this neat, closed form. Discovery and definition of the problem itself and production of a variety of solutions are called for, not selection among a set of fixed alternatives.”
  • On students and the scores they are to contribute to the teacher evaluation formula: “Students varying in their readiness to profit from instruction are said to differ in aptitude. Not only general cognitive abilities, but relevant prior instruction, motivation, and specific interactions of these and other learner characteristics with features of the curriculum and instruction will affect academic growth.” In other words, one cannot simply assume all students will learn or grow at the same rate with the same teacher. Rather, they will learn at different rates given their aptitudes, their “readiness to profit from instruction,” the teachers’ instruction, and sometimes despite the teachers’ instruction or what the teacher teaches.
  • And on the formative nature of such tests, as it was then: “Teachers rarely consult standardized test results except, perhaps, for initial grouping or placement of students, and they believe that the tests are of more value to school or district administrators than to themselves.”

Sound familiar?

Reference: Haertel, E. (1986). The valid use of student performance measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45-60.

Another Oldie but Still Very Relevant Goodie, by McCaffrey et al.

I recently re-read an article in full that is now 10 years old, published in 2004 and, as per the words of the authors, before VAM approaches were “widely adopted in formal state or district accountability systems.” Unfortunately, I consistently find it interesting, particularly in terms of the research on VAMs, to re-explore/re-discover what we actually knew 10 years ago about VAMs, as most of the time this serves as a reminder of how things have not changed.

The article, “Models for Value-Added Modeling of Teacher Effects,” is authored by Daniel McCaffrey (Educational Testing Service [ETS] Scientist, and still a “big name” in VAM research), J. R. Lockwood (RAND Corporation Scientist), Daniel Koretz (Professor at Harvard), Thomas Louis (Professor at Johns Hopkins), and Laura Hamilton (RAND Corporation Scientist).

At the point at which the authors wrote this article, besides the aforementioned data and database issues, there were issues with “multiple measures on the same student and multiple teachers instructing each student,” as “[c]lass groupings of students change annually, and students are taught by a different teacher each year.” The authors, more specifically, questioned “whether VAM really does remove the effects of factors such as prior performance and [students’] socio-economic status, and thereby provide[s] a more accurate indicator of teacher effectiveness.”

The assertions they advanced, accordingly and as relevant to these questions, follow:

  • Across different types of VAMs, given different types of approaches to control for some of the above (e.g., bias), teachers’ contribution to total variability in test scores (as per value-added gains) ranged from 3% to 20%. That is, teachers can realistically only be held accountable for 3% to 20% of the variance in test scores using VAMs, while the other 80% to 97% of the variance (still) comes from influences outside of the teacher’s control (see the variance-decomposition sketch after this list). A similar statistic (i.e., 1% to 14%) was recently highlighted in the position statement on VAMs released by the American Statistical Association.
  • Most VAMs focus exclusively on scores from standardized assessments, although I will take this one step further now, noting that all VAMs now focus exclusively on large-scale standardized tests (as I evidenced in a recent paper I published here: “Putting Growth and Value-Added Models on the Map: A National Overview”).
  • VAMs introduce bias when missing test scores are not missing completely at random. The missing-at-random assumption, however, runs across most VAMs because without it, data missingness would be pragmatically insolvable, especially “given the large proportion of missing data in many achievement databases and known differences between students with complete and incomplete test data.” Really, the only solution here is to use “implicit imputation of values for unobserved gains using the observed scores,” which is “followed by estimation of teacher effect[s] using the means of both the imputed and observed gains [together].”
  • Bias “[still] is one of the most difficult issues arising from the use of VAMs to estimate school or teacher effects…[and]…the inclusion of student level covariates is not necessarily the solution to [this] bias.” In other words, “Controlling for student-level covariates alone is not sufficient to remove the effects of [students’] background [or demographic] characteristics.” There is a reason why bias is still such a highly contested issue when it comes to VAMs (see a recent post about this here).
  • All (or now most) commonly-used VAMs assume that teachers’ (and prior teachers’) effects persist undiminished over time. This assumption “is not empirically or theoretically justified,” either, yet it persists.
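
As a simple illustration of what it means for teachers to account for only 3% to 20% of the variance in test scores, here is a simulation sketch with hypothetical numbers (not McCaffrey et al.’s actual models): scores are decomposed into a teacher component and everything outside the teacher’s control, and the teacher share of the variance is computed directly.

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, students_per_teacher = 200, 25

# Hypothetical decomposition: a student's score = teacher effect + everything
# outside the teacher's control (prior achievement, home, peers, noise, ...).
teacher_sd, other_sd = 0.35, 0.94   # chosen so teachers explain roughly 12%

teacher_effects = rng.normal(0, teacher_sd, n_teachers)
scores = (np.repeat(teacher_effects, students_per_teacher)
          + rng.normal(0, other_sd, n_teachers * students_per_teacher))

teacher_share = teacher_sd**2 / (teacher_sd**2 + other_sd**2)
print(f"share of score variance attributable to teachers: {teacher_share:.1%}")
print(f"total score variance in the simulation: {scores.var():.2f}")
```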

These authors’ overall conclusion, again from 10 years ago but one that in many ways still stands? VAMs “will often be too imprecise to support some of [its] desired inferences” and uses including, for example, making low- and high-stakes decisions about teacher effects as produced via VAMs. “[O]btaining sufficiently precise estimates of teacher effects to support ranking [and such decisions] is likely to [forever] be a challenge.”