U.S. Department of Education: Promoting Educational Excellence for all Americans - Link to ED.gov Home Page
OSEP Ideas tha Work-U.S. Office of Special Education Programs
Ideas that work logo
  Home  Contact us
Models for Large-Scale Assessment
TOOL KIT HOME
OVERVIEW
MODELS FOR LARGE-SCALE ASSESSMENT
TECHNICAL ASSISTANCE PRODUCTS
Assessment
Instructional Practices
Behavior
Accommodations
RESOURCES
 
 
 Information About PDF
 
 

 Printer Friendly Version (pdf, 91K)

Reliability Issues And Evidence

Sources of Random Error

Random error arises from student variables, task sampling, item calibration, and scaling, as well as other sources. These different sources affect the process at different times in the development and implementation of large-scale assessments; therefore, they need to be documented and monitored throughout the process. The effect can then be minimized to provide more stable and dependable estimates of students' performance. The documentation would provide appropriate procedural evidence to allow the formulation of a validity argument. With appropriate analyses, statistical evidence would be used to complement the procedural evidence. However, as noted in the Standards (American Educational Research Association et al., 1999), various forms of reliability estimates are possible, and they need to address specifically the source of error for which they are targeted. For example, if raters are used in the scoring process, then interjudge reliability needs to be documented; with alternate forms, this type needs to be noted; when change over time is being documented, test-retest reliability needs to be established; finally, internal consistency provides evidence of reliability of items and tasks.

An alternate assessment poses numerous challenges that are associated with measurement error. Some sources of random error pertain to examinee characteristics, item and test design, administration, and scoring protocols. State large-scale assessments typically use both SR and CR items or tasks, either with or without accommodations. CR is used in performance measures that require a rubric (subjectively scored) or performance measures that require observation of student performance, completion of performance tasks, or collection of student work samples. The opportunities for measurement error are likely to expand with increased flexibility. As a consequence, assessment design and reliability estimates need to take into account the multiple factors that can attenuate measurement accuracy. The challenge with isolating and controlling sources of measurement error is complicated by the relationships among error sources, as described below.

Student Variables

Students come into school situations from a variety of home environments, all of which can affect their performance in school. For example, students come to school hungry, tired, or fatigued, and so forth. As they interact with classroom tasks and receive feedback, students come to have expectations of success or failure, reflecting motivation and self-efficacy that may interact differentially with the kinds of tasks they are given. All of these conative factors may influence the results of large-scale assessments in unsystematic (i.e., random) ways (McGrew, Johnson, Cosio, & Evans, 2003).

In addition, for students with disabilities, a number of personal and behavioral characteristics may also unsystematically influence performance. For example, with some disabilities (e.g., attention deficit-hyperactivity disorders), medications are used; depending upon the dosage or uptake, performance on large-scale tests may be inconsistent. Even without the use of medications, students with disabilities may exhibit behavioral tendencies that distract them from attending to tasks (tendencies of perseveration, distractibility, inattentiveness, etc.). The administration of the test may be nonstandardized and therefore may influence students unevenly (e.g., it may negatively affect some and act neutrally for others). Whenever such behavior or conditions influence students' performance unsystematically, reliability is weakened, as is the overall claim of validity (the claim that the outcomes reflect what the student knows and can do). Therefore, the inference of proficiency is less certain. A careful analysis of the context and the student is needed, however, as some variations in personal state (health, attention deficit-hyperactivity disorders) would be regarded as sources of systematic error. For example, the Standards note that test anxieties that can be "recognized in an examinee" are considered "systematic errors" and "are not generally regarded as an element that contributes to unreliability" (American Educational Research Association et al., 1999 p. 26).

As a consequence, participation in large-scale assessment systems is not only a matter of scheduling students to take a test at the end of the year. Rather, the assessment needs to be considered an important part of the school's annual cycle of activities. The large-scale assessment program needs to take into account such behavioral factors when collecting performances from students. Although tests may be given only once during the year (typically in the spring), plans for test administration should be introduced early in the year to allow students and teachers a fair opportunity to participate.

As an example of testing conditions reflecting ongoing classroom conditions, many states require teachers to use the same accommodations in testing that have been part of the accommodations used in the classroom. If these accommodations are not implemented in a standardized manner as part of the teaching or testing, unsystematic variance may be introduced. Furthermore, a teacher who is watchful during the year may be able to better understand critical student behaviors and recommend specific accommodations for testing at the end of the year.

Task Sampling

Samples of performance tasks must be prepared so that they are parallel in format and difficulty. That is, the tasks are ideally comparable to the extent that a student would not perform differently with one or another because they are both of equal difficulty. The sample of tasks is apt to be more or less variable with respect to difficulty and representation of the performance domain. Using multiple forms, individuals can be assessed over time or compared to another. The extent to which tasks differ is of obvious consequence because, with more variation, the change over time or comparisons over multiple individuals is less trustworthy. Score variability that is attributable to task differences needs to be identified with carefully controlled studies in which parallel tasks and forms are used.

With portfolio assessments, it is often the teacher who selects the student work to be included in a collection or portfolio. Although selection criteria may be specified in both the test administration manual and in training, teacher judgment is ultimately involved. Consequently, the portfolio or collection of work may represent the grade level content broadly or narrowly. Choices of what is included or excluded in the collection can therefore affect the adequacy of the evidence in representing what a student knows and can do. From the perspective of repeatability, another collection, assembled at a different time or by a different teacher, may or may not support inferences drawn from the original collection. Therefore, it is critical to consider task sampling in the context of the assessment approach. It is generally easier to establish parallel forms when dealing with brief constructed tasks; when using performance collections and portfolios, it may be more difficult to establish comparability of tasks and forms.

Item Calibration

Assessment developers increasingly recognize the value of item calibration, since assessment items are not necessarily equivalent (Thissen & Wainer, 2001). Whether CR or SR, assessment items provide differential amounts of information depending on the respondent's true ability. Item calibrations are estimates of item characteristics such as item difficulty or sensitivity (van der Linden & Hambleton, 1997). With accurate item calibrations, estimation of true scores becomes considerably more accurate. That is, calibration helps minimize standard error of estimation.

Calibration accuracy pertains directly to measurement reliability. The value of item calibrations for ability estimation depends on the appropriate choice of IRT model and proper calibration procedures. The technicalities involved in these decisions are far beyond the scope of this paper, but the importance of good calibration should be noted. First, the calibration process requires adequate sampling of examinee response patterns. Ideally, a range of abilities is represented in the calibration sample. Second, an appropriate IRT model must be applied. For instance, alternate assessments rely heavily on performance tasks. Usually, observed performance is scored polytomously, that is, more than correct/incorrect. This method of scoring requires the use of a rating scale, partial credit, or graded response model. Numerous other possible models are described in the literature (van der Linden & Hambleton, 1997; Boomsma, van Duijn & Snijders, 2001).

Another aspect of IRT item calibration that can influence reliability is the unidimensionality of the assessment—the degree to which an assessment measures a single construct (an ideal condition). In reality, this outcome is very difficult to achieve on rigidly constructed measures. Also, when using alternate assessments, the challenge increases dramatically as flexible CR tasks are applied and risk the involvement of multiple ability factors and other variable factors such as time constraints, rater severity, and so forth. Multidimensional IRT models can be used to accurately calibrate performance tasks, thereby yielding more reliable ability estimation. Local item dependency is generally attributable to multidimensional problems. Good assessment development must identify and provide corrections for this situation.

Finally, when developing measures for diverse populations, it is important to understand whether assessment tasks function identically across the populations. Ideally, tasks should perform equivalently across populations, although one population may have higher mean ability than another population. This type of analysis helps maintain quality control over assessment development. For example, Yovanoff and Tindal (in press) were able to understand how well the Oregon Early Reading Extended Assessment tasks functioned irrespective of whether students were in special or general education. In this study, a range of constructed response tasks was placed on the same scale and used as the first benchmark of the state test to provide students of all abilities a sufficient range of difficulties, leading to appropriate assessment. These tasks included letter naming and letter sounding as well as word, sentence, and passage reading.

 Previous  |  Next