Glossary

Assumptions: Assumptions appeal to people’s emotions and, accordingly, are used to sway people’s thinking in desired ways. Assumptions are often made about intangible ideas or tangible products, but often without proof or data to support their legitimacy. Correspondingly, most governmental and private groups, in efforts to promote and/or protect sets of private and/or public interests, attempt to methodically sway or change public attitudes and behaviors, oftentimes using assumptions. Such assumptions are often expressed via enthusiastic statements and bold claims, further ingraining the assumptions and transforming them into accepted realities, even though they are not always true or supported by evidence.

Bias: Bias is a huge threat to validity, as biasing factors (e.g., student risk factors) distort both the measurement of a variable and its interpretation, either increasing or decreasing, in this case, VAM-based estimates. This occurs even though the biasing factors are unrelated to what the test-based indicators (i.e., VAMs) are meant to represent (e.g., teacher effectiveness). Accordingly, if VAM estimates are highly correlated with biasing factors, then it becomes impossible to make valid interpretations about the causes of student achievement gains or losses as intended. Bias is most difficult to statistically “control for” because students are rarely if ever randomly assigned to classrooms (and teachers are rarely randomly assigned to classrooms as well).

Confidence Intervals: Confidence intervals are the statistical ranges within which one can be confident that, in this case, a teacher’s true value-added estimate has been accurately captured. To contextualize, the typical confidence intervals used in statistics give or take about 5 percentage points from the reported estimate (i.e., given standard 95% confidence intervals). However, in New York, for example, the confidence intervals around 18,000 teachers’ value-added ratings spanned 35 percentile points in mathematics and 53 percentile points in reading/language arts. This meant that a mathematics teacher who ranked at the 50th percentile could have actually had a true score anywhere between the 33rd and 67th percentile ranks (rounding inwards). In reading/language arts, a teacher who ranked at the 50th percentile could have had a true score anywhere between the 24th and 76th percentile ranks (rounding inwards). While 95% confidence intervals are typically used, with VAMs the resulting ranges are typically much larger, reflecting the relatively higher potential for measurement error.
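
The percentile ranges quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the reported span is centered on the teacher's observed rank; the function name percentile_bounds is hypothetical and not part of any VAM software.

```python
import math

def percentile_bounds(observed, span):
    """Return (low, high) percentile ranks for an interval of total width
    `span` centered on `observed`, rounding each bound inward."""
    half = span / 2
    low = math.ceil(observed - half)    # round the lower bound up (inward)
    high = math.floor(observed + half)  # round the upper bound down (inward)
    return low, high

# Spans reported in the New York example: 35 points (mathematics)
# and 53 points (reading/language arts), for a teacher at the 50th percentile.
math_bounds = percentile_bounds(50, 35)
reading_bounds = percentile_bounds(50, 53)
print(math_bounds)     # (33, 67)
print(reading_bounds)  # (24, 76)
```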

Education Production Function: In the education production function, it is assumed that using VAMs will induce teachers to work harder, and using VAMs will incite teachers who do not work hard enough to improve out of fear of being penalized or terminated. Such positive and negative motivators will increase the quality of the inputs (i.e., teaching) and subsequently enhance the quality of the outputs (i.e., student achievement) included in the production function. In addition, once VAM estimates are made available (i.e., test-based information about student growth over time per teacher, school, or district), the outcome data can be fed back into the system, internalized and operationalized by those who are the key inputs in the production function, and then used in (in)formative ways, yielding even better results in effect.

Fairness: Issues of fairness arise when a test, or more likely its inference-based use, impacts some more than others in unfair yet often consequential ways. In terms of VAMs, the main issue here is that VAM-based estimates can be produced for only approximately 30-40% of all teachers across America’s public schools. The other 60-70%, which sometimes includes entire campuses of teachers (e.g., early elementary and high school teachers), cannot altogether be evaluated or held accountable using teacher- or individual-based value-added data.

Instructional Sensitivity: Instructional sensitivity represents the degree to which students’ test performance has the capacity to reflect, and actually reflects, the quality of instruction provided to facilitate student learning and mastery of that which is being assessed.

No Child Left Behind (NCLB): The No Child Left Behind (NCLB) Act of 2001, or the reauthorized Elementary and Secondary Education Act (ESEA), was the nation’s first Act of Congress to explicitly support higher standards and stronger accountability mechanisms, primarily to hold students and educators responsible for meeting higher standards and, hence, improve student learning and achievement. With NCLB, the U.S. government mandated that all states use large-scale standardized tests (i.e., from grades 3-8 and once in high school) that are aligned to state-level standards, again to help determine whether students are meeting higher standards each year. Hence, these tests have been used most often by states to date, and now within all current VAM and student growth analyses.

Public Policy: A public policy is a tool used by governments to define a course of action that will ultimately lead a public (group of people) to some end. Sometimes public policies can be unwise, however, as when seemingly principled means produce unintended, unanticipated, and perverse consequences instead.

Race to the Top: President Obama’s Race to the Top (RttT) initiative helped (and continues to help) to distribute billions of dollars in federal stimulus monies to states, thus far to a total of $4.35 billion, if states promise via their legislative policies that they will use students’ large-scale standardized test scores for even more consequential purposes than NCLB previously required. Specifically, states, again in exchange for federal funds, were required to use large-scale standardized test scores, in particular, for teacher evaluation, teacher termination, and teacher compensation purposes. The states applying for RttT funds, expressly 40 states for the first year of funding, also had to agree to adopt even stronger accountability mechanisms if they were to secure waivers excusing them for not meeting NCLB’s prior goal that 100% of the students in their states would be academically proficient by the year 2014.

Random Assignment: The purpose of random assignment is to make the probability of the occurrence of any observable differences among treatment groups (e.g., treatment or no treatment) equal at the outset of any experiment or study. Randomized experiments occur when individuals are premeasured on an outcome (e.g., a pretest score[s]), randomly assigned to receive different treatments (e.g., different teachers) whereby each participant has an equal chance of receiving any treatment (or no treatment), and then measured again on the post-test occasion (e.g., post-test score[s]) to determine whether different changes occurred across different treatments (e.g., teachers’ attributional or causal effects). Hence, random assignment is considered the “gold standard” when scientific causal associations and inferences are desired that control for, and are seemingly “free” of, the biasing impacts caused by extraneous (i.e., unmeasured or immeasurable) variables and other risk factors (e.g., students’ out-of-school experiences).
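
As an illustration only, the assignment step described above might be sketched as follows. The student and teacher labels, the function name randomly_assign, and the seed are all hypothetical; the sketch shuffles students and deals them out evenly so that each student has an equal chance of landing with any teacher.

```python
import random

def randomly_assign(students, teachers, seed=None):
    """Shuffle the students, then deal them out to teachers in turn so that
    every student has the same chance of being placed with any teacher."""
    rng = random.Random(seed)   # seeded only so the example is repeatable
    shuffled = list(students)   # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    rosters = {teacher: [] for teacher in teachers}
    for i, student in enumerate(shuffled):
        rosters[teachers[i % len(teachers)]].append(student)
    return rosters

students = [f"student_{i}" for i in range(10)]   # hypothetical students
teachers = ["teacher_A", "teacher_B"]            # hypothetical teachers
rosters = randomly_assign(students, teachers, seed=0)
for teacher, roster in rosters.items():
    print(teacher, roster)
```

In a full randomized experiment, pretest and post-test scores would then be compared across the resulting rosters.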

Reliability: Reliability is the psychometric term used to represent the degree to which, in this case, a set of large-scale standardized test scores is free of random error. Random error can be positive or negative and large or small, although it cannot be directly observed. Instead, what is observed, again in the case of VAMs, is the extent to which a value-added measure produces consistent or dependable results over time (i.e., reliability or inter-temporal stability). In terms of VAMs, reliability should be observed when VAM estimates of teacher (or school/district) effectiveness are consistent or endure over time, from one year to the next, regardless of the types of students and perhaps subject areas teachers teach. Reliability is typically captured using reliability coefficients as well as confidence intervals that help to situate and contextualize VAM estimates and their measurement errors.
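
One common way to express year-to-year consistency is a correlation between the same teachers' estimates in two consecutive years. The sketch below assumes a Pearson correlation is used as the reliability coefficient; the estimates are made-up numbers for five hypothetical teachers, not real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical value-added estimates for the same five teachers in two years.
year1 = [0.20, -0.10, 0.05, 0.30, -0.25]
year2 = [0.10, 0.00, -0.05, 0.25, -0.20]

stability = pearson_r(year1, year2)
print(f"year-to-year correlation: {stability:.2f}")
```

A value near 1 would indicate that teachers keep roughly the same relative standing from one year to the next; a value near 0 would indicate that they do not.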

Students “At-Risk:” Students “at-risk” include populations of students in America’s public schools who, disproportionate to their low-risk peers, are more likely to have emotional/learning disabilities and/or come from high-needs, high-poverty, English-language deficient, culturally isolated (e.g., inner-cities, hoods, ghettoes, enclaves, and American-Indian reservations), and often racial/ethnic minority backgrounds. However, it should be noted that a large number of students “at-risk” are not students of color. Such risk factors, however, complicate the interpretation of large-scale standardized test scores and their related value-added estimates, as VAMs rely solely on large-scale standardized test scores to yield their growth estimates. While complex statistical methods are often used to “control” or “account” for students’ levels of risk or risk factors, much debate exists about the extent to which such statistical controls, no matter how sophisticated, indeed work (e.g., to partly/wholly eliminate “bias”).

Student Growth Percentile (SGP): The SGP model is traditionally not considered a VAM mainly because it does not use as many sophisticated statistical controls. As a result, SGP model estimates are often viewed as less precise and accurate. Unlike a typical VAM, the SGP model has been developed and used to serve as a more normative method for describing similarly matched students’ growth, as measured in reference to the growth levels of students’ peers, and for describing teachers’ potential impacts on that growth. Once teachers’ rosters of students are identified, student SGPs are aggregated, and model users compute the median (or sometimes the average) of the SGPs for an individual teacher. This statistic is termed the median growth percentile (MGP).
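
The aggregation step can be sketched in a few lines. This is a minimal illustration that assumes each student's SGP has already been computed; the rosters and the function name growth_percentile_summary are hypothetical.

```python
from statistics import mean, median

def growth_percentile_summary(sgps, use_mean=False):
    """Summarize one teacher's student SGPs with the median (the MGP)
    or, optionally, with the average."""
    return mean(sgps) if use_mean else median(sgps)

# Hypothetical student SGPs, grouped by teacher roster.
roster_sgps = {
    "teacher_A": [35, 50, 62, 71, 48],
    "teacher_B": [20, 40, 55, 90],
}

mgp = {t: growth_percentile_summary(s) for t, s in roster_sgps.items()}
print(mgp)  # {'teacher_A': 50, 'teacher_B': 47.5}
```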

Tests – Large-Scale Standardized Tests: Large-scale standardized tests are tests that are administered to large populations of individuals to determine students’ learning and academic achievements. They are also tests that are scored in a predetermined, uniform, or “standard” manner. Large-scale standardized testing companies (e.g., Pearson, Harcourt Educational Measurement, CTB McGraw-Hill, Riverside Publishing [a Houghton Mifflin company], etc.) manufacture most (if not all) large-scale standardized achievement tests.

Tests – Norm and Criterion Referenced Standardized Tests: Large-scale standardized tests are sometimes designed to align with national standards. These are often, yet falsely, termed norm-referenced tests, as companies often use a national norm group as part of their development. Otherwise, these tests are designed to align with state or more local standards, objectives, outcome statements, and the like. These are often, yet falsely, termed criterion-referenced tests, as local agencies develop criteria against which to measure students’ mastery of tested content. To be clear, however, norm-referenced and criterion-referenced are not types of tests but types of test interpretations. Norm referencing occurs, for example, when resultant test scores are distributed around a bell curve so that students’ test scores might be compared to the greater “normal” population of their peers (e.g., using percentile ranks or normal curve equivalents [NCEs]). This is useful when determining whether individual students did relatively better or worse than their peers (e.g., students in the same grade levels who took the test prior if using a national test or who took the test concurrently if using a state-aligned test). Criterion referencing occurs, for example, when resultant test scores are used to more straightforwardly (vs. relatively) determine whether students achieved a certain (often predetermined) standard of achievement. While all large-scale standardized achievement tests can yield valid norm-referenced interpretations, only tests aligned to a set of standards can produce valid criterion-referenced interpretations.
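
To make the distinction concrete, the sketch below applies both interpretations to the same score. The norm-group scores, the student score, and the cut score are invented for illustration, and the percentile-rank definition used (percent of peers scoring strictly lower) is only one of several in common use.

```python
def percentile_rank(score, norm_group):
    """Norm-referenced reading: percent of norm-group scores below `score`."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

def meets_standard(score, cut_score):
    """Criterion-referenced reading: did the score reach the set standard?"""
    return score >= cut_score

# Hypothetical scale scores for a norm group of ten peers.
norm_group = [410, 450, 480, 500, 520, 540, 560, 590, 610, 650]
student_score = 545
cut_score = 550  # hypothetical proficiency standard

print(percentile_rank(student_score, norm_group))  # 60.0
print(meets_standard(student_score, cut_score))    # False
```

Here the same student scores above most peers yet falls short of the standard, which is why the two interpretations answer different questions.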

Transparency: Transparency can be defined as the extent to which something is easily seen and readily capable of being understood. Along with transparency come the “formative” aspects key to any, in this case, VAM-based measurement system. Put differently, VAM-based estimates must be made transparent, in order to be understood, so that they can ultimately be used to “inform” change, growth, and hopefully future progress in “formative” ways.

Type I and II Errors: Of concern with value-added estimates is the prevalence of false positive or false discovery errors (i.e., Type I errors), whereby an effective teacher is falsely identified as ineffective. However, the inverse is equally likely, defined as false negative or false non-discovery errors (i.e., Type II errors), whereby an ineffective teacher might go unnoticed instead. For example, with an r = 0.4 or an R-squared = 0.16, out of every 1000 teachers 750 teachers would be identified correctly and 250 teachers would not. That is, one in four teachers would be falsely identified as either being worse or better than they were originally classified, considering both Type I and Type II errors are at play.

Validity: Validity is the psychometric term used to describe the accuracy of an interpretation that is derived from some use of a, in this case, large-scale standardized test score. Validity describes the extent to which an assessment measure produces authentic, accurate, strong results and yields acceptable inferences about that which the assessment tool, or test, is intended to measure. Validity is an essential property of any measurement, although it must be noted that reliability is a necessary or qualifying condition for validity. That is, validity can follow only if reliability is observed. Without consistency (i.e., reliability) one cannot typically achieve any certain level or sense of truth (i.e., validity). Put differently, if scores are unreliable, it is virtually impossible to support valid interpretations or use.

—Content-Related Evidence of Validity: To collect content-related evidence of validity, it is necessary, in this case, to examine whether large-scale standardized test scores can and should be used to make inferences about student learning and achievement, and in the case of VAMs, teachers’ causal impacts on students’ levels of learning and achievement over time. Evidencing content validity is not about the properties of the tests, but the meanings of the inferences derived from the tests being used, given the purposes for which they are being used.

—Criterion-Related Evidence of Validity: Criterion-related evidence of validity is based on the degree of the relationships between, in this case, large-scale standardized test scores and other criteria (e.g., teachers’ supervisor evaluation scores). To establish criterion-related evidence of validity, VAM estimates should be highly correlated with other criteria, all of which should yield valid interpretations and uses. Criterion-related evidence of validity is typically composed of two types: concurrent-related evidence of validity and predictive-related evidence of validity.

—Concurrent-Related Evidence of Validity: Concurrent-related evidence of validity is observed when, in this case, large-scale standardized test and other criterion data are collected at the same time. In this case, this might be observed when VAM estimates of teacher (or school/district) effectiveness relate to, or more specifically correlate well with, other measures (e.g., supervisor evaluation scores) that are also developed to measure the same construct (e.g., teacher effectiveness) at or around the same time. If VAM estimates of teacher effectiveness are valid, there should be research-based evidence (and some common sense) showing that all similar indicators collected at or around the same time are together pointing towards the same proverbial truth.

—Predictive-Related Evidence of Validity: Predictive-related evidence of validity is observed when large-scale standardized test and other criterion data are collected at different times, typically one in advance of the other. In this case, VAM-based estimates might be used to predict future outcomes on a related measure (e.g., success on a college-entrance exam), after which predictions might be verified to evidence whether and to what extent the VAM-based predictions came true. If predictions are detected as anticipated, predictive-related evidence of validity is present. If not, predictive-related evidence is void.

—Construct-Related Evidence of Validity: Construct-related evidence of validity is based on the integration of all of the evidence that helps with the interpretation or meaning of the, in this case, large-scale standardized test scores and their purported uses, in this case, in terms of measuring growth in student achievement over time. Construct-related evidence of validity helps us better define theoretical constructs, although it is difficult to establish. Gathering construct-related evidence of validity requires defining the construct to be measured (e.g., teacher effectiveness) and logically determining whether all of the instruments and measures used to assess the construct represent and capture the construct well, and ultimately support the accuracy of the overall interpretations and inferences derived.

—Consequence-Related Evidence of Validity: Consequence-related evidence of validity pertains to the social consequences of, in this case, using large-scale tests to measure student growth in achievement over time and then attributing that growth back to teachers. Consequence-related evidence of validity also pertains to the ethical consequences that come about as a result. Of social and ethical concern are, primarily, both the intended and unintended consequences of use as determined by the research (and dissemination of research) concerning both the positive and the negative effects of whatever system or program is in use. The burden of proof rests on the shoulders of VAM researchers to provide credible evidence about the intended and unintended consequences, to explain the positive and negative effects to external constituencies including policymakers, and to collectively work to determine whether VAM use, all things considered, can be rendered as acceptable and of value.

Value-Added Models (VAMs): VAMs by definition are statistical models that are designed to isolate and measure teachers’ (or schools’/districts’) contributions to student learning and achievement on large-scale standardized achievement tests from one year to the next. VAM statisticians attempt to measure value-added by mathematically calculating the “value” a teacher (or school/district) “adds” to (or detracts from) student achievement scores from the point at which students enter a classroom (or a school/district) to the point they leave. In more precise terms, VAM statisticians attempt to calculate value-added by computing the difference between students’ composite test scores between these two points in time, after which they compare the added/detracted value (or growth/decline) coefficients to what they predicted beforehand and to the coefficients of other “similar” teachers (or schools/districts) who posted “similar” value-added estimates at the same time. VAM statisticians then position teachers (or schools/districts) accordingly, and typically hierarchically along a categorical yet arbitrary continuum, assigning teachers (or schools/districts) high to low value-added categorizations with negative and positive differences yielding negative and positive value-added classifications respectively. From here, high-stakes decisions (e.g., merit pay, teacher tenure, teacher terminations) can more “appropriately” and “accurately” be made, again, so it is assumed.
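
The basic arithmetic described above can be sketched very roughly as follows. This is a deliberately simplified illustration, not an actual VAM: real models use regression-based predictions with many statistical controls, and the scores, predictions, and function name below are all hypothetical.

```python
from statistics import mean

def simple_value_added(records):
    """Average difference between each student's actual growth and the
    growth predicted beforehand, for one teacher's roster.

    Each record is (prior_score, current_score, predicted_current_score).
    """
    residuals = []
    for prior, current, predicted in records:
        actual_growth = current - prior        # growth between the two time points
        predicted_growth = predicted - prior   # growth expected beforehand
        residuals.append(actual_growth - predicted_growth)
    return mean(residuals)

# Hypothetical roster: (prior score, current score, predicted current score).
classroom = [(300, 320, 315), (280, 290, 295), (350, 365, 360)]

value_added = simple_value_added(classroom)
print(f"{value_added:+.2f}")  # a positive value means growth exceeded prediction
```

A positive result would place the teacher toward the "high value-added" end of the continuum and a negative result toward the low end, subject to all of the caveats raised elsewhere in this glossary.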

VAMs v. Student Growth Models: The main similarities between VAMs and student growth models are that they all use students’ large-scale standardized test score data from current and prior years to calculate students’ growth in achievement over time. In addition, they all use students’ prior test score data to “control for” the risk factors that impact student learning and achievement both at singular points in time as well as over time. The main differences between VAMs and student growth models are how precisely estimates are made, as related to whether, how, and how many control variables are included in the statistical models to control for these risk and other extraneous variables (e.g., other teachers’ simultaneous and prior teachers’ residual effects). The best and most popular example of a student growth model is the Student Growth Percentiles (SGP) model. It is not a VAM by traditional standards and definitions, mainly because the SGP model does not use as many sophisticated controls as do its VAM counterparts.

3 thoughts on “Glossary”

  1. Through your blog, I finally feel that I have a voice and that my voice will truly be heard by you and other educators in this country. The VAM evaluation system and state testings have created a plague that is killing this country’s public school system! Outstanding and effective teachers are purposefully being pushed out of public schools to make room for business controlled charter schools.

    I am a 7th grade teacher at a public school in Miami-Dade County, Florida and I want to save our public schools from the money making testing machines supported by the state legislatures.
