Public Policy: A public policy is a tool used by governments to define a course of action that will ultimately lead a public (a group of people) to some end. Public policies can be unwise, however, when seemingly principled means produce unintended, unanticipated, and perverse consequences instead.
Race to the Top: President Obama’s Race to the Top (RttT) initiative has helped (and continues to help) distribute federal stimulus monies to states, thus far totaling $4.35 billion, provided that states promise via their legislative policies to use students’ large-scale standardized test scores for even more consequential purposes than NCLB previously required. Specifically, in exchange for federal funds, states were required to use large-scale standardized test scores for teacher evaluation, teacher termination, and teacher compensation purposes. The states applying for RttT funds (40 states in the first year of funding) also had to agree to adopt even stronger accountability mechanisms if they were to secure waivers excusing them from NCLB’s prior goal that 100% of the students in their states would be academically proficient by the year 2014.
Random Assignment: The purpose of random assignment is to make the probability of the occurrence of any observable differences among treatment groups (e.g., treatment or no treatment) equal at the outset of any experiment or study. Randomized experiments occur when individuals are premeasured on an outcome (e.g., a pretest score[s]), randomly assigned to receive different treatments (e.g., different teachers) whereby each participant has an equal chance of receiving any treatment (or no treatment), and then measured again on the post-test occasion (e.g., post-test score[s]) to determine whether different changes occurred across different treatments (e.g., teachers’ attributional or causal effects). Hence, random assignment is considered the “gold standard” when scientific causal associations and inferences are desired, as it controls for, and is seemingly “free” of, the biasing impacts caused by extraneous (i.e., unmeasured or immeasurable) variables and other risk factors (e.g., students’ out-of-school experiences).
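As a sketch of the idea, the following Python snippet randomly assigns a hypothetical roster of students to two groups and checks that their simulated pretest means are balanced; all identifiers and numbers here are illustrative, not drawn from any actual study:

```python
import random
import statistics

def randomly_assign(student_ids, n_groups=2, seed=0):
    """Shuffle the roster and deal students into groups so that each
    student has an equal chance of landing in any group."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

# Hypothetical pretest scores for 1,000 students.
rng = random.Random(42)
pretests = {sid: rng.gauss(500, 100) for sid in range(1000)}

treatment, control = randomly_assign(pretests, n_groups=2)

# In expectation, randomization balances the groups on the pretest,
# and on unmeasured variables, before any treatment is applied.
mean_t = statistics.mean(pretests[s] for s in treatment)
mean_c = statistics.mean(pretests[s] for s in control)
print(round(mean_t), round(mean_c))  # the two means land close together
```

Because the balance holds only in expectation, larger samples yield tighter balance, which is one reason randomized experiments typically enroll many participants.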
Reliability: Reliability is the psychometric term used to represent the degree to which, in this case, a set of large-scale standardized test scores is free of random error. Random error can be positive or negative and large or small, although it cannot be directly observed. Instead, what is observed, again in the case of VAMs, is the extent to which a value-added measure produces consistent or dependable results over time (i.e., reliability or inter-temporal stability). In terms of VAMs, reliability should be observed when VAM estimates of teacher (or school/district) effectiveness are consistent or endure over time, from one year to the next, regardless of the types of students and perhaps subject areas teachers teach. Reliability is typically captured using reliability coefficients as well as confidence intervals that help to situate and contextualize VAM estimates and their measurement errors.
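As an illustration of inter-temporal stability, the sketch below (hypothetical teachers and numbers throughout) correlates simulated value-added estimates for the same teachers across two years; the year-to-year correlation serves as a crude reliability coefficient:

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical value-added estimates for 200 teachers in two years:
# a stable "true effectiveness" component plus yearly random error.
rng = random.Random(0)
true_effect = [rng.gauss(0, 1) for _ in range(200)]
year1 = [t + rng.gauss(0, 1) for t in true_effect]
year2 = [t + rng.gauss(0, 1) for t in true_effect]

# Inter-temporal stability: the correlation of the same teachers'
# estimates across years.
r = pearson(year1, year2)
print(round(r, 2))
```

In this simulation the error variance matches the true-score variance, so the stability coefficient hovers around 0.5; the larger the random error, the lower the year-to-year correlation.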
Students “At-Risk:” Students “at-risk” include populations of students in America’s public schools who, disproportionate to their low-risk peers, are more likely to have emotional/learning disabilities and/or come from high-needs, high-poverty, English-language deficient, culturally isolated (e.g., inner-cities, hoods, ghettoes, enclaves, and American-Indian reservations), and often racial/ethnic minority backgrounds. It should be noted, however, that a large number of students “at-risk” are not students of color. Such risk factors complicate the interpretation of large-scale standardized test scores and their related value-added estimates, as VAMs rely solely on large-scale standardized test scores to yield their growth estimates. While complex statistical methods are often used to “control” or “account” for students’ levels of risk or risk factors, much debate exists about the extent to which such statistical controls, no matter how sophisticated, indeed work (e.g., to partly/wholly eliminate “bias”).
Student Growth Percentile (SGP): The SGP model is traditionally not considered a VAM, mainly because it does not use as many sophisticated statistical controls. As a result, SGP model estimates are often viewed as less precise and accurate. Unlike a typical VAM, the SGP model has been developed and used as a more normative method for describing similarly matched students’ growth, as measured in reference to the growth levels of students’ peers, and then for describing teachers’ potential impacts on that growth. Once teachers’ rosters of students are identified, student SGPs are aggregated, and model users compute the median (or sometimes the average) of the SGPs for an individual teacher. This statistic is termed the median growth percentile (MGP).
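The roster-to-MGP step can be sketched as follows; the records, the ±50-point peer band, and the percentile formula are deliberate simplifications (real SGP models condition on full score histories via quantile regression):

```python
import statistics

def growth_percentile(score, peer_scores):
    """Percentile rank of a student's current score among peers with
    similar prior achievement (a simplified stand-in for the SGP
    model's quantile-regression machinery)."""
    below = sum(p < score for p in peer_scores)
    ties = sum(p == score for p in peer_scores)
    return 100.0 * (below + 0.5 * ties) / len(peer_scores)

# Hypothetical data: (teacher, prior score, current score).
records = [
    ("A", 400, 420), ("A", 410, 450), ("A", 600, 640),
    ("B", 405, 395), ("B", 415, 430), ("B", 610, 600),
]

# Peer groups: students with similar prior scores (here, a crude
# +/- 50-point band around each student's prior score).
sgps = {}
for i, (t, prior, curr) in enumerate(records):
    peers = [c for _, p, c in records if abs(p - prior) <= 50]
    sgps[i] = growth_percentile(curr, peers)

# A teacher's median growth percentile (MGP) is the median of the
# SGPs of the students on that teacher's roster.
mgp = {
    t: statistics.median(sgps[i] for i, r in enumerate(records) if r[0] == t)
    for t in {r[0] for r in records}
}
print(mgp)
```

In this toy example teacher A’s students tend to outgrow their similarly scoring peers while teacher B’s do not, which is exactly the normative comparison the MGP is meant to summarize.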
Tests – Large-Scale Standardized Tests: Large-scale standardized tests are tests that are administered to large populations of individuals to determine students’ learning and academic achievement, and they are scored in a predetermined, uniform, or “standard” manner. Large-scale standardized testing companies (e.g., Pearson, Harcourt Educational Measurement, CTB McGraw-Hill, Riverside Publishing [a Houghton Mifflin company]) manufacture most (if not all) large-scale standardized achievement tests.
Tests – Norm and Criterion Referenced Standardized Tests: Large-scale standardized tests are sometimes designed to align with national standards. These are often, if inaccurately, termed norm-referenced tests, as companies often use a national norm group as part of their development. Otherwise, these tests are designed to align with state or more local standards, objectives, outcome statements, and the like. These are often, if inaccurately, termed criterion-referenced tests, as local agencies develop criteria against which to measure students’ mastery of tested content. To be clear, however, norm-referenced and criterion-referenced are not types of tests but types of test interpretations. Norm referencing occurs, for example, when resultant test scores are distributed around a bell curve so that students’ test scores might be compared to the greater “normal” population of their peers (e.g., using percentile ranks or normal curve equivalents [NCEs]). This is useful when determining whether individual students did relatively better or worse than their peers (e.g., students in the same grade levels who took the test prior if using a national test or who took the test concurrently if using a state-aligned test). Criterion referencing occurs, for example, when resultant test scores are used to more straightforwardly (vs. relatively) determine whether students achieved a certain (often predetermined) standard of achievement. While all large-scale standardized achievement tests can yield valid norm-referenced interpretations, only tests aligned to a set of standards can produce valid criterion-referenced interpretations.
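The two interpretations can be contrasted with a toy example; the norm group and the cut score below are invented for illustration:

```python
def percentile_rank(score, norm_group):
    """Norm-referenced interpretation: where a score falls relative
    to the distribution of scores in a norm group."""
    below = sum(s < score for s in norm_group)
    ties = sum(s == score for s in norm_group)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)

def meets_standard(score, cut_score):
    """Criterion-referenced interpretation: the same score judged
    against a predetermined standard, ignoring peers entirely."""
    return score >= cut_score

# Hypothetical norm group and cut score.
norm_group = [380, 410, 440, 470, 500, 530, 560, 590, 620, 650]
score = 530

print(percentile_rank(score, norm_group))   # relative standing among peers
print(meets_standard(score, cut_score=500)) # mastery, yes or no
```

The same raw score thus supports two different statements: “better than about half of the norm group” (norm referencing) and “met the standard” (criterion referencing).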
Transparency: Transparency can be defined as the extent to which something is easily seen and readily capable of being understood. Along with transparency come the “formative” aspects key to any, in this case, VAM-based measurement system. Put differently, VAM-based estimates must be made transparent, in order to be understood, so that they can ultimately be used to “inform” change, growth, and hopefully future progress in “formative” ways.
Type I and II Errors: Of concern with value-added estimates is the prevalence of false positive or false discovery errors (i.e., Type I errors), whereby an ineffective teacher is falsely identified as effective. However, the inverse is equally likely, defined as false negative or false non-discovery errors (i.e., Type II errors), whereby an effective teacher is falsely identified as ineffective. For example, with an r = 0.4 (i.e., an R-squared = 0.16), out of every 1,000 teachers, 750 teachers would be identified correctly and 250 teachers would not. That is, one in four teachers would be falsely identified as either worse or better than they were originally classified, considering both Type I and Type II errors are at play.
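A small simulation makes the two error types concrete; the assumed correlation of r = 0.4 between estimates and “true” effectiveness, and the bottom-20% cutoff, are illustrative assumptions only:

```python
import random

# Simulation assuming (hypothetically) that VAM estimates correlate
# with teachers' "true" effectiveness at r = 0.4.
rng = random.Random(1)
n, r = 1000, 0.4

true_eff = [rng.gauss(0, 1) for _ in range(n)]
# Estimate = r * truth + noise, scaled so corr(truth, estimate) = r.
est = [r * t + rng.gauss(0, (1 - r * r) ** 0.5) for t in true_eff]

# Label the bottom 20% on each scale "ineffective".
cut_true = sorted(true_eff)[n // 5]
cut_est = sorted(est)[n // 5]

# Type I (false positive): a truly ineffective teacher whose estimate
# looks effective, so the teacher goes unnoticed.
type_i = sum(t < cut_true and e >= cut_est for t, e in zip(true_eff, est))
# Type II (false negative): a truly effective teacher whose estimate
# falsely flags him or her as ineffective.
type_ii = sum(t >= cut_true and e < cut_est for t, e in zip(true_eff, est))

print(type_i, type_ii)
```

With a correlation this modest, the flagged group and the truly ineffective group overlap only partially, which is the source of the one-in-four misclassification figure cited above.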
—Criterion-Related Evidence of Validity: Criterion-related evidence of validity is based on the degree of the relationships between, in this case, large-scale standardized test scores and other criteria (e.g., teachers’ supervisor evaluation scores). To establish criterion-related evidence of validity, VAM estimates should be highly correlated with other criteria, all of which should yield valid interpretations and uses. Criterion-related evidence of validity typically comprises two types: concurrent-related evidence of validity and predictive-related evidence of validity.
——Concurrent-Related Evidence of Validity: Concurrent-related evidence of validity is observed when, in this case, large-scale standardized test and other criterion data are collected at the same time. This might be observed when VAM estimates of teacher (or school/district) effectiveness relate to, or more specifically correlate well with, other measures (e.g., supervisor evaluation scores) that are also developed to measure the same construct (e.g., teacher effectiveness) at or around the same time. If VAM estimates of teacher effectiveness are valid, there should be research-based evidence (and some common sense) showing that all similar indicators collected at or around the same time together point towards the same proverbial truth.
——Predictive-Related Evidence of Validity: Predictive-related evidence of validity is observed when large-scale standardized test and other criterion data are collected at different times, typically one in advance of the other. In this case, VAM-based estimates might be used to predict future outcomes on a related measure (e.g., success on a college-entrance exam), after which predictions might be verified to evidence whether and to what extent the VAM-based predictions came true. If predictions are detected as anticipated, predictive-related evidence of validity is present. If not, predictive-related evidence is void.
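Both forms of criterion-related evidence reduce to correlations gathered at different times. The sketch below simulates hypothetical VAM estimates, same-year supervisor ratings (the concurrent criterion), and next-year exam results (the predictive criterion); all quantities are invented for illustration:

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sum((a - mx) ** 2 for a in xs) ** 0.5
                  * sum((b - my) ** 2 for b in ys) ** 0.5)

# Hypothetical data for 100 teachers: a VAM estimate, a supervisor
# rating gathered the same year, and their students' mean scores on
# a later exam. All three share a common "effectiveness" component.
rng = random.Random(7)
effect = [rng.gauss(0, 1) for _ in range(100)]
vam = [e + rng.gauss(0, 0.7) for e in effect]
supervisor = [e + rng.gauss(0, 0.7) for e in effect]  # same time
later_exam = [e + rng.gauss(0, 0.7) for e in effect]  # next year

concurrent_r = pearson(vam, supervisor)   # concurrent evidence
predictive_r = pearson(vam, later_exam)   # predictive evidence
print(round(concurrent_r, 2), round(predictive_r, 2))
```

If either correlation were near zero, the corresponding type of criterion-related evidence would be absent, regardless of how the estimates were produced.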
—Construct-Related Evidence of Validity: Construct-related evidence of validity is based on the integration of all of the evidence that helps with the interpretation or meaning of, in this case, large-scale standardized test scores and their purported uses, here in terms of measuring growth in student achievement over time. Although difficult to establish, construct-related evidence of validity helps us better define theoretical constructs. Gathering construct-related evidence of validity requires defining the construct to be measured (e.g., teacher effectiveness) and logically determining whether all of the instruments and measures used to assess the construct represent and capture the construct well, and ultimately support the accuracy of the overall interpretations and inferences derived.
—Consequence-Related Evidence of Validity: Consequence-related evidence of validity pertains to the social consequences of, in this case, using large-scale tests to measure student growth in achievement over time and then attributing that growth back to teachers. Consequence-related evidence of validity also pertains to the ethical consequences that come about as a result. Of social and ethical concern are, primarily, both the intended and unintended consequences of use as determined by the research (and dissemination of research) concerning both the positive and the negative effects of whatever system or program is in use. The burden of proof rests on the shoulders of VAM researchers to provide credible evidence about the intended and unintended consequences, to explain the positive and negative effects to external constituencies including policymakers, and to collectively work to determine whether VAM use, all things considered, can be rendered as acceptable and of value.