Random error arises from student variables, task sampling, item calibration, and scaling, as well as other sources. These sources affect the process at different times in the development and implementation of large-scale assessments; therefore, they need to be documented and monitored throughout the process. Their effects can then be minimized to provide more stable and dependable estimates of students' performance. The documentation would provide appropriate procedural evidence to allow the formulation of a validity argument. With appropriate analyses, statistical evidence would be used to complement the procedural evidence. However, as noted in the Standards (American Educational Research Association et al., 1999), various forms of reliability estimates are possible, and each needs to address the specific source of error at which it is targeted. For example, if raters are used in the scoring process, then interjudge reliability needs to be documented; if alternate forms are used, alternate-forms reliability needs to be reported; when change over time is being documented, test-retest reliability needs to be established; finally, internal consistency provides evidence of the reliability of items and tasks.
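Two of the reliability estimates mentioned above, internal consistency and interjudge agreement, are straightforward to compute. A minimal Python sketch follows; the formulas (Cronbach's alpha and Cohen's kappa) are standard, but the function names and any data passed in are purely illustrative:

```python
import numpy as np

def cronbach_alpha(scores):
    """Internal consistency: k/(k-1) * (1 - sum of item variances / variance of totals).
    scores is an examinee-by-item matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                      # number of items
    item_vars = scores.var(axis=0, ddof=1)   # per-item sample variance
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def cohen_kappa(rater1, rater2, n_categories):
    """Chance-corrected agreement between two raters scoring the same papers."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    observed = np.mean(r1 == r2)
    # expected agreement from each rater's marginal category proportions
    expected = sum(np.mean(r1 == c) * np.mean(r2 == c)
                   for c in range(n_categories))
    return (observed - expected) / (1 - expected)
```

Alpha addresses item/task error; kappa addresses rater error. Each coefficient speaks only to its own error source, which is why the Standards call for the estimate to match the source.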
An alternate assessment poses numerous challenges associated with measurement error. Some sources of random error pertain to examinee characteristics, item and test design, administration, and scoring protocols. State large-scale assessments typically use both SR and CR items or tasks, with or without accommodations. CR appears in performance measures that require a rubric (and hence subjective scoring), observation of student performance, completion of performance tasks, or collection of student work samples. The opportunities for measurement error are likely to expand with this increased flexibility. As a consequence, assessment design and reliability estimates need to take into account the multiple factors that can attenuate measurement accuracy. The challenge of isolating and controlling sources of measurement error is complicated by the relationships among error sources, as described below.
Students come into school situations from a variety of home environments, all of which can affect their performance in school. For example, students may come to school hungry, fatigued, or otherwise unready to perform. As they interact with classroom tasks and receive feedback, students come to have expectations of success or failure, reflecting motivation and self-efficacy that may interact differentially with the kinds of tasks they are given. All of these conative factors may influence the results of large-scale assessments in unsystematic (i.e., random) ways (McGrew, Johnson, Cosio, & Evans, 2003).
In addition, for students with disabilities, a number of personal and behavioral characteristics may also unsystematically influence performance. For example, with some disabilities (e.g., attention deficit-hyperactivity disorders), medications are used; depending upon the dosage or uptake, performance on large-scale tests may be inconsistent. Even without the use of medications, students with disabilities may exhibit behavioral tendencies that distract them from attending to tasks (tendencies of perseveration, distractibility, inattentiveness, etc.). The administration of the test may be nonstandardized and therefore may influence students unevenly (e.g., it may negatively affect some and act neutrally for others). Whenever such behavior or conditions influence students' performance unsystematically, reliability is weakened, as is the overall claim of validity (the claim that the outcomes reflect what the student knows and can do). Therefore, the inference of proficiency is less certain. A careful analysis of the context and the student is needed, however, as some variations in personal state (health, attention deficit-hyperactivity disorders) would be regarded as sources of systematic error. For example, the Standards note that test anxieties that can be "recognized in an examinee" are considered "systematic errors" and "are not generally regarded as an element that contributes to unreliability" (American Educational Research Association et al., 1999, p. 26).
As a consequence, participation in large-scale assessment systems is not only a matter of scheduling students to take a test at the end of the year. Rather, the assessment needs to be considered an important part of the school's annual cycle of activities. The large-scale assessment program needs to take into account such behavioral factors when collecting performances from students. Although tests may be given only once during the year (typically in the spring), plans for test administration should be introduced early in the year to allow students and teachers a fair opportunity to participate.
As an example of testing conditions reflecting ongoing classroom conditions, many states require teachers to use the same accommodations in testing that have been part of the accommodations used in the classroom. If these accommodations are not implemented in a standardized manner as part of the teaching or testing, unsystematic variance may be introduced. Furthermore, a teacher who is watchful during the year may be able to better understand critical student behaviors and recommend specific accommodations for testing at the end of the year.
Samples of performance tasks must be prepared so that they are parallel in format and difficulty. That is, the tasks are ideally comparable to the extent that a student would perform equivalently on either form because both are of equal difficulty. In practice, any sample of tasks will be more or less variable in difficulty and in its representation of the performance domain. Using multiple forms, individuals can be assessed over time or compared to one another. The extent to which tasks differ is of obvious consequence: with more variation, change over time or comparisons across multiple individuals are less trustworthy. Score variability that is attributable to task differences needs to be identified through carefully controlled studies in which parallel tasks and forms are used.
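Two simple empirical checks on parallelism can be sketched as follows: the correlation between scores on the two forms (alternate-forms reliability) and a comparison of mean difficulty. Both functions, their names, and the 0.25 pooled-SD threshold are illustrative assumptions, not recommended criteria:

```python
import numpy as np

def alternate_forms_reliability(form_a, form_b):
    """Correlation between the same students' scores on two forms intended
    to be parallel; values well below 1 suggest the forms differ in
    difficulty or domain representation."""
    return float(np.corrcoef(form_a, form_b)[0, 1])

def forms_comparable(form_a, form_b, max_mean_gap=0.25):
    """Crude check that mean difficulty is similar across forms, expressed
    in pooled-SD units (the threshold here is arbitrary, for illustration)."""
    a, b = np.asarray(form_a, float), np.asarray(form_b, float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd <= max_mean_gap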
With portfolio assessments, it is often the teacher who selects the student work to be included in a collection or portfolio. Although selection criteria may be specified in both the test administration manual and in training, teacher judgment is ultimately involved. Consequently, the portfolio or collection of work may represent the grade level content broadly or narrowly. Choices of what is included or excluded in the collection can therefore affect the adequacy of the evidence in representing what a student knows and can do. From the perspective of repeatability, another collection, assembled at a different time or by a different teacher, may or may not support inferences drawn from the original collection. Therefore, it is critical to consider task sampling in the context of the assessment approach. It is generally easier to establish parallel forms when dealing with brief constructed tasks; when using performance collections and portfolios, it may be more difficult to establish comparability of tasks and forms.
Assessment developers increasingly recognize the value of item calibration, since assessment items are not necessarily equivalent (Thissen & Wainer, 2001). Whether CR or SR, assessment items provide differential amounts of information depending on the respondent's true ability. Item calibrations are estimates of item characteristics such as difficulty or discrimination (van der Linden & Hambleton, 1997). With accurate item calibrations, estimation of true scores becomes considerably more accurate; that is, calibration helps minimize the standard error of estimation.
Calibration accuracy pertains directly to measurement reliability. The value of item calibrations for ability estimation depends on the appropriate choice of IRT model and proper calibration procedures. The technicalities involved in these decisions are far beyond the scope of this paper, but the importance of good calibration should be noted. First, the calibration process requires adequate sampling of examinee response patterns; ideally, a range of abilities is represented in the calibration sample. Second, an appropriate IRT model must be applied. For instance, alternate assessments rely heavily on performance tasks. Usually, observed performance is scored polytomously, that is, with more than two score categories rather than simply correct/incorrect. This method of scoring requires a rating scale, partial credit, or graded response model. Numerous other possible models are described in the literature (van der Linden & Hambleton, 1997; Boomsma, van Duijn, & Snijders, 2001).
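To make the polytomous case concrete, the partial credit model can be sketched in a few lines. Given an ability estimate and a set of step difficulties (the values below are hypothetical, not from any real calibration), it returns the probability of each score category:

```python
import math

def pcm_probs(theta, step_difficulties):
    """Partial credit model: probability of each score category 0..m for an
    examinee with ability theta. step_difficulties[j] is the difficulty of
    moving from category j to category j+1."""
    # cumulative sums of (theta - step) define each category's log-numerator;
    # category 0 has an empty sum, i.e., logit 0
    logits = [0.0]
    for d in step_difficulties:
        logits.append(logits[-1] + (theta - d))
    denom = sum(math.exp(v) for v in logits)
    return [math.exp(v) / denom for v in logits]
```

For example, a low-ability examinee is most likely to land in category 0, and a high-ability examinee in the top category; the step difficulties control where those transitions occur on the ability scale.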
Another aspect of IRT item calibration that can influence reliability is the unidimensionality of the assessment, that is, the degree to which it measures a single construct (an ideal condition). In reality, unidimensionality is difficult to achieve even on rigidly constructed measures, and with alternate assessments the challenge increases dramatically: flexible CR tasks risk involving multiple ability factors as well as other variables such as time constraints and rater severity. Multidimensional IRT models can be used to calibrate such performance tasks accurately, thereby yielding more reliable ability estimation. Local item dependency is generally attributable to multidimensionality; good assessment development must identify and correct for this situation.
Finally, when developing measures for diverse populations, it is important to understand whether assessment tasks function identically across those populations. Ideally, tasks perform equivalently across populations, even though one population may have a higher mean ability than another. This type of analysis helps maintain quality control over assessment development. For example, Yovanoff and Tindal (in press) examined how well the Oregon Early Reading Extended Assessment tasks functioned irrespective of whether students were in special or general education. In this study, a range of constructed-response tasks was placed on the same scale and used as the first benchmark of the state test so that students of all abilities were presented with a sufficient range of difficulties, leading to appropriate assessment. These tasks included letter naming and letter sounding as well as word, sentence, and passage reading.
Scaling and Equating
If items are calibrated, they can be fitted onto a scale and then used to monitor change over time or to compare students of differing abilities. In this process, assessments need to be rescaled or equated with an external measure. State measurement programs perform this function when scores on alternate assessments are placed on the same scale as general assessment scores. As noted above, the Oregon Early Reading Extended Assessment (Yovanoff & Tindal, in press) was scaled with the general Oregon Statewide Assessment. Using the appropriate IRT model and the necessary research design, Yovanoff and Tindal demonstrated that the early reading performance tasks functioned appropriately for students who otherwise could not be measured accurately with the general benchmark of the statewide assessment.
In the end, this type of scaling and calibration makes the assessment both more accurate and more informative. Equating standard errors are extremely important for appraising the accuracy of score equivalents (Kolen & Brennan, 1995). Whenever assessments are equated, the standard error should be reported both for the overall population and for individual students. The standard error is conditional on the student's score: if the scale contains too few items appropriate for a student's ability, the SEM is greater, and it is more difficult to locate the student accurately on the scale.
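The dependence of the SEM on the student's location can be shown directly. Under IRT, the conditional SEM is the reciprocal square root of the test information at the student's ability. A sketch using the two-parameter logistic model (the item parameters below are hypothetical):

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def conditional_sem(theta, items):
    """SEM(theta) = 1 / sqrt(test information); items is a list of (a, b) pairs.
    Each item contributes a^2 * P * (1 - P) to the information."""
    info = sum(a * a * p_2pl(theta, a, b) * (1 - p_2pl(theta, a, b))
               for a, b in items)
    return 1.0 / math.sqrt(info)
```

With a set of items clustered near difficulty 0, the SEM is smallest for students near that region and grows for students far above or below it, which is exactly the scenario described above: too few items match the student's ability, so the student's location is estimated poorly.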
Irrespective of any errors made in collecting assessment data or estimated with reliability coefficients, different or unique errors can also be made when making judgments. This type of random error refers to ratings and classifications made for students, such as pass/fail or below basic, basic, proficient, and advanced. In this instance, the focus is less on actual score consistency than on the consistency of judgments about states of mastery. Two types of judgments can contain error: (a) at the score level, the focus is on rubrics (or partially correct responses); (b) at the classification level, the focus is not only on the final decision to classify a student's performance but also on the standard-setting process itself. The analysis, therefore, needs to consider both the individual judgments made for a student and the overall process for making classification decisions.
Although score errors need to be addressed, classification errors are far more serious, more difficult to detect, and more resource-intensive to resolve. Furthermore, whereas score error is usually minimized at the cut score, judgment error is most problematic at the cut score. According to the Standards:
Where the purpose of measurement is classification, some measurement errors are more serious than others. An individual who is far above or far below the value established for pass/fail or for eligibility for a special program can be mis-measured without serious consequences. Mis-measurement of examinees whose true scores are close to the cut score is a more serious concern. The techniques used to quantify reliability should recognize these circumstances. This can be done by reporting the conditional standard error in the vicinity of the critical score. (American Educational Research Association et al., 1999, p. 3)
Even with the conditional SEM (for an individual score) reported at the cut score, classification judgments can be problematic. For example, Hollenbeck and Tindal (1999) reported that, although judges were in considerable agreement at the exact or adjacent score values, they were in the greatest disagreement with respect to judgments of proficiency (at the cut score). In this study, judges agreed about writing quality (using a 6-point scale) when it was judged very low (rated 1 or 2) or very high (rated 5 or 6); they disagreed, however, when writing scores were in the middle (at ratings of 3 or 4, which includes the cut score of 4 for passing). As a consequence, the state educational agency began reporting "conditional" proficiency (in essence noting that disagreement occurred at the cut score) to acknowledge this type of error.
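The pattern Hollenbeck and Tindal observed can be quantified with exact and adjacent agreement rates, computed overall and then restricted to papers near the cut. A sketch (the ratings passed in are invented, and defining "near the cut" as within one point is one of several possible choices, not the study's method):

```python
import numpy as np

def agreement_rates(rater1, rater2, cut=None):
    """Exact and adjacent (within one point) agreement between two raters
    scoring the same papers. If a cut score is given, also report exact
    agreement among papers either rater placed within one point of the cut,
    the region where the classification decision is actually made."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    exact = float(np.mean(r1 == r2))
    adjacent = float(np.mean(np.abs(r1 - r2) <= 1))
    if cut is None:
        return exact, adjacent
    near = (np.abs(r1 - cut) <= 1) | (np.abs(r2 - cut) <= 1)
    exact_near = float(np.mean(r1[near] == r2[near])) if near.any() else float("nan")
    return exact, adjacent, exact_near
```

High overall exact or adjacent agreement can coexist with poor agreement in the band around the cut score, which is why reporting only overall interrater statistics can mask the classification problem.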
Reliability at the classification level involves attention to proper selection of content experts as well as training and feedback. The Standards are very clear:
When subjective judgment enters into test scoring, evidence should be provided on both inter-rater consistency in scoring and within examinee consistency over repeated measurements. A clear distinction should be made among reliability data based on (a) independent panels of raters scoring the same performances, (b) a single panel scoring successive performances or new products, and (c) independent panels scoring successive performances or new products. (American Educational Research Association et al., 1999, p. 34)
The distinction between the reliability of the score and the reliability of the judgment sometimes becomes blurred when both are estimated at the same time. When a single judgment is based on all of the evidence, a hybrid model is created that takes into account both the reliability of the scoring process and the decisions made from the score. Oregon's juried assessment represents such a combined judgment: it is primarily a classification but also includes evidence of individual scores (e.g., performance on classroom tests or achievement on the state test when it has been modified) (Yovanoff & Tindal, in press). In analyzing reliability for this system, the key is to ensure both that the judgment is reliable and that the achievement is judged against grade level standards.
If Oregon's approach with the juried assessment is typical of what states develop to meet the "alternate assessment judged against grade level content standards," there is likely to be heavy reliance on two categories of judgment: one addressing the sufficiency of evidence to make a determination and another addressing proficiency based on the collection as a whole. To achieve reliability in these judgments, Oregon relies on systematic procedures and structured criteria for making the determination. This indirectly increases reliability: portions of what was thought to be random error are identified as having systematic causes, and those causes can then be addressed through systematic procedures, thereby decreasing the amount of random error. Oregon's method was reviewed by the state's national Technical Advisory Committee (TAC) and judged a reasoned and prudent approach to avoiding false negative judgments (i.e., denying proficiency when it was deserved). Some committee members conjectured that the juried assessment might actually present a higher standard of achievement than the general multiple-choice assessments. Their view was noteworthy in that they considered the approach on its merits and weighed it against both the Standards for testing and the consequences to the student. In the end, the TAC considered it a promising strategy and determined that they had nothing better to offer that would meet the same demands for integrity and fairness.
The Juried Assessment administration manual describes these assessments as being completed under nonstandard conditions, either through the Collection of Evidence process or through a Collection to Jury a Modification, to answer the question: "Does the evidence provided by the student meet the Oregon content and performance standards for a particular subject?" (Oregon Department of Education, 2005, p. 5). Training in scoring is provided to ensure reliability; in addition, collections that are rated above the standard must be verified through a secondary source. The requirements for a collection of evidence in Oregon's Juried Assessment are clarified in subject-specific documents on the Web site along with the administration manual. A description of the juried process was extracted from the manual for inclusion here. The Juried Assessment is designed for students who are literate in a language other than English, who have physical disabilities preventing participation in the writing assessment, or who have any other disability affecting the ability to read and write. According to the Juried Assessment Guidelines:
The Moderation Panel would consider the evidence and determine whether test results using the translation in the first example, the word prediction software in the second, or the auditory methods in the third, are reliable and valid in addressing a specific standard. If the panel determines that the change does not affect the validity of the test score for this student, the student's score would then be considered for meeting that standard.
Juried Modifications are approved one student and one assessment at a time, with a panel of experts making the final determination. Consider, for example, a student with a significant learning disability who uses assistive technology, screen readers, and recorded text to perform the task of understanding text and interpreting "meaning". After reviewing the student's case, the panel might approve this modification as an accommodation for the particular student if:
- The student is skilled in using the read aloud adaptations
- The measure of comprehension reflects the student's own knowledge and understanding
- The student achieves the same standards for interpreting text required of all students
If approved, the student would be permitted to use the "read aloud" modification with the Reading/Literature Knowledge and Skills assessment and have the opportunity to "meet" (i.e., be determined "proficient" on) the standard. The decision could even be made after testing was completed if there was sufficient documentation of the process to assure that the work was the student's own (Oregon Department of Education, 2005, p. 5).
Another source of unsystematic error in the data collection process arises when an alignment analysis is conducted between grade level content standards and portfolio assessment approaches. When states use portfolios as part of an alternate assessment that is defined by student need (e.g., a fixed set of entries is not specified a priori; rather, teachers select unique entries for each student), alignment between the alternate assessment and the standards cannot be analyzed without sampling students. If the sampling of standards is isomorphic with the sampling of students, any statements about alignment on content coverage, breadth of knowledge, and depth of knowledge are primarily a function of those who participated. With this source of error (more akin to survey sampling error), stable statements about alignment are difficult to make. Because each student's portfolio, by design, samples only a subset of the standards, alignment at the student level is inherently skewed. As a result, the process needs to sample a sufficient number of students to determine the coverage of standards being addressed at the system level. At this level, the sampling of students needs to consider not only age but also disability, geographic region, and type of program before any inferences about alignment can be made.
A final source of error relates to assessment administration. One reason for using standardized procedures in large-scale assessment systems is to minimize the error from external sources. Testing personnel (most often teachers), however, can introduce error (unsystematic variance) through the way that they administer or score the test. Ironically, few states have training systems for test administration. Educators assume that the conditions as noted in the test booklets are the same as those enacted in the classroom. Significant deficits are evident in teacher knowledge concerning high-stakes testing. Most teachers' knowledge about testing and measurement comes from "trial-and-error learning in the classroom" (Wise, Lukin, & Roos, 1991, p. 39). This problem, however, is rarely addressed through any in-service programs, even though these authors attributed the lack of assessment knowledge to teacher certification agencies at the state level (i.e., states do not require assessment/measurement courses for initial teacher certification).
This kind of unsystematic error is best addressed by state educational agencies (SEAs) through rigorous training and monitoring throughout administration of the large-scale assessment system. "Measurements derived from observations of behavior or evaluations of products are especially sensitive to a variety of error factors. These include evaluator biases and idiosyncrasies, scoring subjectivity, and intra-examinee factors that cause variation from one performance or product to another" (American Educational Research Association et al., 1999, p. 29). Evidence for documenting the reliability associated with (a) test administration and (b) response rating can be both procedural and empirical. "Because random measurement errors are inconsistent and unpredictable, they cannot be removed from observed scores. However, their aggregate magnitude can be summarized in several ways…" (American Educational Research Association et al., 1999, p. 27).
Options for Participation in Large-Scale Assessments
Participation methods present a somewhat different challenge regarding sources of error. The estimate of reliability for the first method, taking the general assessment without accommodations, is likely to be a characteristic that has already been addressed in most states' technical reports. There is still some question, however, about which students participated in the assessment or, more importantly, which subgroups did not participate, and the possible effect participation might have had on estimates of reliability.
Participation in the general assessment with accommodations has increasingly been studied (Sireci, Li, & Scarpati, 2003), and accommodations themselves are increasingly being used (Clapper, Morse, Lazarus, Thompson, & Thurlow, 2005), although in both instances little is definite about the effect of participation or accommodation use on reliability (see section on standard error of measures). Nevertheless, the 1999 Standards are quite clear: "When significant variations are permitted in test administration procedures, separate reliability analyses should be provided for scores produced under each major variation if adequate sample sizes are available" (American Educational Research Association et al., 1999, p. 36).
As a consequence, little is known about estimates of reliability in any testing program in which changes are made, either as accommodations or for the remaining three methods: alternate assessments judged against grade level, modified, or alternate achievement standards. Two of these methods present particular challenges: (a) the alternate assessment judged against grade level achievement standards and (b) the alternate assessment judged against alternate achievement standards. For both, the quantity of data resulting from the method is likely to be limited in scope, making it difficult to draw inferences beyond an instance of testing. It is possible that the alternate assessment based on modified achievement standards can follow the course of research established for accommodations, with sufficient numbers and adequate standardization to allow generalizations.
Ironically, standardization may be the antithesis of the solution for controlling measurement error. Again, there is no direct control of random error; rather, by identifying systematic sources within what had been considered random error, total error (and therefore estimated measurement error) can be reduced. By forcing tests to be taken in the same way across all students, both internal and external sources of error may be maintained or even exacerbated rather than controlled. For example, consider a student with hyperactive tendencies who takes medication to control inattentiveness (an error source internal to the student) and who is given the test in one session to control time and setting (error sources external to the student). This student may actually need accommodations in order to support inferences about performance that are not influenced by construct-irrelevant variance. When accommodations are made, reliability-related evidence is needed to support the consistency of administration and scoring across replications (time, items, and raters).
Generalizability Theory and Differentiated Error
So far, all error has been undifferentiated and treated as one source. However, this discussion has noted many plausible sources of this undesirable error. Using generalizability theory, this single "error" term is decomposed into various facets, or factors, that influence performance (Brennan, 2001). The most consistently studied facets are judges, tasks, and occasions. Using carefully planned research designs, assessment developers can better understand to what extent the assessment facets influence the reliability of observations.
Generalizability studies (G-studies) are used to differentiate error and identify how much of the examinee score variance is attributable to, for instance, lack of rater agreement or task variability. This information is extremely valuable because it casts light on where assessments need adjustment. Alternate assessment can obviously benefit from this effort to identify exact sources of error. Once the error term is partitioned into specific sources, assessment development research can proceed to estimate measurement reliability.
A typical finding of G-studies is that the primary source of variance is the task itself. For example, assessing science with hands-on experiments versus a paper-and-pencil test results in very different estimates of performance. Likewise, comparisons of other CR and SR formats may yield different performance estimates, primarily because of format rather than content. Raters and occasions have typically not been found to be as influential as format.
Using decision studies (D-studies), multiple assessment scenarios can be constructed along with their corresponding reliability estimates. For instance, with G-study information, it is possible to know how much reliability will improve if more raters are used or responses to more tasks are obtained. Typically, using more than five to seven raters does not substantially improve estimates of performance. The implications for time and money are very clear: if longer examinee testing times are not possible, then the assessment cannot include more tasks; instead, adding another rater could be considered to the extent that it would bring reliability up to an acceptable standard.
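The D-study projection for a persons-by-tasks-by-raters design can be sketched directly from the standard formula for the generalizability coefficient for relative decisions; the variance components below are invented for illustration, not taken from any actual G-study:

```python
def projected_g_coefficient(var_p, var_pt, var_pr, var_ptre, n_tasks, n_raters):
    """D-study projection of the generalizability coefficient for relative
    decisions, given G-study variance components for persons (p),
    person-by-task (pt), person-by-rater (pr), and the residual (ptr,e).
    Interaction components shrink as the corresponding facet is averaged
    over more conditions."""
    relative_error = (var_pt / n_tasks
                      + var_pr / n_raters
                      + var_ptre / (n_tasks * n_raters))
    return var_p / (var_p + relative_error)
```

With illustrative components in which task variance dominates (e.g., var_pt = 0.8 vs. var_pr = 0.05), adding raters buys very little, while adding tasks raises the coefficient substantially, the trade-off described above.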