To understand a validity argument, it is essential to have a clear idea of which construct is being tested, because the construct forms the basis of the validity claim. For example, Table 1 pairs standards from Massachusetts and Oregon with a mathematics problem from a fifth-grade state practice test, in which a story problem is presented as a multiple-choice item. It is essential to know whether this test item also contains other constructs that are irrelevant to the mathematical construct being tested.
To answer the question we must consider (1) the construct being assessed; (2) the knowledge and skills reflected in the specific tasks and the manner in which this knowledge and these skills are sampled, formatted and scored; and (3) the use of test scores to make inferences about the teaching and learning process as well as the accountability system (relative to the construct). The validity claim is that the test adequately reflects the domain of knowledge and skills of the standards and can be used as the basis for the inference of proficiency.
Table 1
Construct in an Example of a Mathematics Standard and an Assessment Problem^{1}
Oregon Standard: | Add and subtract decimals to hundredths, including money amounts. |
---|---|
Massachusetts Standard: | Select and use appropriate operations (addition, subtraction, multiplication and division) to solve problems, including those involving money. |
Assessment Problem: | Tommy bought 4 shirts for $18.95 each and 3 pairs of pants for $21.49 each. What was the total Tommy spent? |
Assessment Options: | A) $11.33 B) $135.97 C) $132.27 D) $140.27 |
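The arithmetic the item asks for can be checked directly. The short sketch below is only an illustration: it multiplies and adds the amounts printed in the assessment problem and looks up which printed option matches. The variable names are hypothetical, and the source does not itself state which option is keyed as correct; that is inferred here from the computation.

```python
from decimal import Decimal

# Prices and quantities as printed in the assessment problem (Table 1).
shirts = 4 * Decimal("18.95")   # cost of the shirts
pants = 3 * Decimal("21.49")    # cost of the pants
total = shirts + pants          # total amount spent

# Answer options as printed in Table 1.
options = {
    "A": Decimal("11.33"),
    "B": Decimal("135.97"),
    "C": Decimal("132.27"),
    "D": Decimal("140.27"),
}

# Identify which option letter(s) equal the computed total.
matching = [letter for letter, value in options.items() if value == total]
print(total, matching)  # prints: 140.27 ['D']
```

Solving the item therefore requires two multiplications and one addition with decimals to hundredths, which appears to correspond to both the operations named in the Massachusetts standard and the decimal computation named in the Oregon standard.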
The validity argument considers whether the task presented on the large-scale assessment appropriately measures the domain of achievement or whether it is misrepresented or underrepresented as described in Table 2.
Table 2
Validation Claim and Questions Supported by Evidence
Construct | Misrepresentation | Underrepresentation |
---|---|---|
Achievement (domain of tasks) | Does the math story problem include other constructs or rely on access or prerequisite skills that prevent students from displaying their knowledge and skill? | Does the math story problem adequately represent the kind of mathematics operations needed to solve money estimation problems in the presence of suitable distracters (i.e., irrelevant elements of the problem)? |
In this simple mathematics problem, reading may be part of the construct-irrelevant variance that impedes our efforts to measure mathematical knowledge and skills as applied in this limited situation (a printed math story problem). However, if we had used a performance task to measure achievement (an open-ended problem requiring the student to write his or her answer), then writing might have become part of the construct-irrelevant variance. If we had required a demonstration of money estimation in a local store or in the community, a host of other factors that are part of the assessment (the type of store in which we shopped, the presence of others at the checkout, the bills being used, etc.) would then have become sources of construct-irrelevant variance.
Construct-irrelevant variance can arise from several sources, including the unique needs of students with disabilities or groups of individuals and how they participate in large-scale assessment systems. This source of variance is systematic and consistently either disadvantages or advantages individuals or groups. For example, if students are allowed only 60 minutes to complete a reading test, students with poor reading skills will be consistently disadvantaged. Or if students are given read-aloud assistance and the tester inadvertently prompts the correct choice by inflection, students taking the test from this person are systematically advantaged. In both examples, performance is confounded with (influenced by) other characteristics of the measurement process that are irrelevant to the construct being measured.
When the math story problem is used as a measure of achievement, the construct also can be seriously underrepresented. The item may fail to include appropriate operations (addition, subtraction, multiplication or division), steps (making exact change or estimations of change), distracters (elements of the problem that need to be seen as irrelevant), or critical strategies (self-guided actions that the student used but that were not documented).
The validity claim can be threatened by several factors, such as insufficient evidence, and serious social consequences are at stake in making it. Misinterpretations could be made (e.g., the student is not proficient in mathematics). Resources could be misdirected (e.g., very complex tasks are used that require intensive manpower to administer and score, for which reliability-related evidence is found lacking). Tasks could be misrepresented as constructs because measurement specialists, content experts, and special educators fundamentally disagree with (or are uninformed by) each other. Given the limitations of assessments for making inferences about proficiency in cognitive skills using more complex tasks, it is important to emphasize the need for appropriate and credible assessment approaches.
^{1} Examples excerpted from (1) the Oregon Department of Education’s fifth grade mathematics content standards for computation and estimation, available at: http://www.ode.state.or.us/teachlearn/real/Standards/Default.aspx (accessed March 25, 2006); and (2) the Massachusetts Department of Education’s fourth grade mathematics content standards for number sense and operations, available at: http://www.doe.mass.edu/frameworks/math/2000/num3.html (accessed March 24, 2006).
Accommodations
We introduce accommodations to remove construct-irrelevant variance by making changes in the supports (and not by making changes in the content domains). For example, the mathematics problem could be read aloud to students who cannot read well to eliminate reading as a construct-irrelevant variable. Likewise, we could allow a calculator to remove the computational requirements for mathematics problems targeting other constructs. We also could allow more time so the student can finish the item (or test). Notice, however, that the effects of these task (test) features may be problem- and person-specific. In their review of mathematics accommodations research on four major classes of accommodations (using calculators, reading mathematics problems to students, employing extended time, and using multiple accommodation packages), Tindal and Ketterlin-Geller (2004, p. 8) note the following:
In general, the findings from using calculators and reading mathematics problems to students clearly document the effect of accommodations to be dependent on the type of items and populations. For some items, calculators are facilitative (e.g., solving fractions problems) and for others detractive (e.g., on complex calculations as part of mathematical reasoning). Similarly, item specific findings are beginning to appear in reading mathematics problems: when the problems are wordy (both in count and difficulty) and contain several verb phrases, the accommodations appear effective. Likewise, student characteristic is an important variable. The positive effects of the read-aloud accommodation are more likely with younger students or those with lower reading skills. Finally, the use of extended time appears relatively inert though often it appears as part of other accommodations. For example, calculators and reading mathematics problems often take more time.
Thus, the research on accommodations shows that changes in the way tests are given or taken (the supports used) can indeed make a difference, sometimes removing construct-irrelevant variance. Furthermore, the effect of an accommodation depends on the characteristics of the population using it. At other times, however, accommodations may actually introduce construct-irrelevant variance (e.g., teachers systematically provide extra prompts). Accommodations therefore cannot be considered a panacea or a simple process. Their usefulness depends on the construct of the standard, the assessment approach or format, and the needs of the student.
Currently, most states have both participation and accommodation policies. These policies, however, focus mostly on who needs to participate and how they should participate, and less on why certain types of participation options should be recommended or applied. This statement is particularly true for the use of accommodations. Very few states have policies that explain the reasoning behind an accommodation in terms of the intended construct to be measured and the evidence needed to support its measurement (see Thurlow & Bolt, 2001). We address that kind of evidence through the consequences of assessment, most of which are seriously underreported (cf. National Center on Educational Outcomes Online Accommodations Bibliography). In the end, states need to have policies on what accommodations to allow and why; these policies need to provide IEP teams guidance in determining how the unique needs of students with disabilities require changes in testing.
Table 3
Types of Accommodations
Presentation | Presentation Equipment | Response | Setting | Scheduling |
---|---|---|---|---|
Large print | Magnification equipment | Proctor/scribe | Individual | Extended time |
Braille | Light/acoustics | Computer or machine | Small group | With breaks |
Read-aloud | Calculator | Write in test booklets | Carrel | Multiple sessions |
Interpreter for instructions | Amplification equipment | Tape recorder | Separate room | Time beneficial to student |
Read/reread/simplify/clarify | Templates/graph paper | Communication device | Seat location/proximity | Over multiple days |
Directions | Audio/video cassette | Spell checker/assistance | Minimize distractions/quiet/reduced noise | Flexible schedule |
Visual cues on test/instructions | Noise buffer | Braille | Student’s home | Other |
Administration by other | Adaptive or special furniture | Pointing | Special ed. class | |
Additional examples | Abacus | Other | Other | |
Other | Other | | | |
Alternate Assessments
The general education large-scale assessment (with or without accommodations, or when it involves multiple administrations) is intended to allow educators to make comparable inferences about proficiency on state standards. Yet, at some point, the changes made are significant enough to constrain the inference about proficiency on standards, and states then need to consider them as part of their alternate assessments. Because of changes in supports (assistive technologies, prompts or scaffolds) and/or changes in the breadth, depth, and complexity of the material being tested, the scores on alternate assessments based on alternate achievement standards cannot be aggregated with the scores on regular assessments (and therefore must be reported separately). However, as explained later in this paper, using a validity argument within the context of federal regulations allows for the aggregation of proficiency levels based on grade-level, modified, and alternate achievement standards for purposes of reporting Adequate Yearly Progress.
In the sample mathematics problem presented at the beginning of the paper, changes could be made in the assessment approach by (1) observing the student actually making change and using a checklist or rating scale to note the correctness of the response; (2) assembling into a portfolio materials that document the student making change during an interaction at a local store in the community; or (3) observing or recording a performance task in which the student is required to add these amounts of money using real bills and make change accordingly. All of these options could become part of an assessment judged against modified achievement standards or an alternate assessment judged against alternate achievement standards. Remember, however, that these "situated" environments may well introduce other sources of variance that are irrelevant to the construct. Therefore, each of these approaches brings with it the need to collect specific kinds of evidence, both procedural and empirical, to ensure that the construct is being fully assessed (and not underrepresented).