Open Response

We will answer as many questions from our readers as time and space permit. If you have an assessment-related question, please send us an E-mail.

Do Out-of-Grade Items Compromise a Test’s Validity?

QUESTION (paraphrased): I recently became aware that my state’s department of education incorporated into criterion-referenced tests “extra” items from a grade level other than the one being tested. The DOE stated that it was “conducting research to create a vertical scale to measure growth across grade levels and the more difficult items did not count against the students’ scores.” For example, the third-grade test contained embedded fourth-grade items. Do you think that the inclusion of these more difficult items, whether they counted or not, may compromise the validity of the test, especially since the test is being used as the sole criterion in deciding whether or not to retain third graders this year?

The DOE has not researched the potential effects of the inclusion of these items, but I would think that it would add to the anxiety these third graders already had when they encountered several items containing content they had never even heard of and had not yet been taught. The DOE did not inform classroom teachers ahead of time so they could warn students that fourth-grade items would be on the third-grade test. It just seems that, from a statistical, as well as ethical, standpoint, this test can no longer be considered a valid measure when the DOE deliberately included items that may have increased anxiety levels to the point of having a detrimental effect on student scores.

ANSWER: This may not be the answer you want, but it is standard practice to embed items from other grades in a test for a particular target grade. Nevertheless, the issues you raise are good ones. First, we want to point out that traditional norm-referenced tests (NRTs) routinely include off-grade items, and they count. The purpose of NRTs is to rank-order students of widely varying ability levels based on their competency in a subject. Recognizing that the range of ability in any one grade spans many grade levels in terms of the old “grade-level equivalents,” an effective NRT, by necessity, includes items that discriminate all along the ability continuum represented by the students in that grade. In fact, the same NRT test forms were frequently used at multiple grades, with the same raw score earned by a student corresponding to different scaled scores and percentile ranks depending on the grade level identified by the student (or teacher) on the student’s answer sheet. If anything, today’s standards-based tests, which are aligned with grade-specific standards (grade-level expectations), are much more focused on the content of a single grade than the types of tests used historically.

While there are pluses and minuses to the use of vertically scaled scores, they are much in demand. Parents and teachers would like to know that a “240” earned by a student on a fourth-grade test, compared to a “210” earned by the same student on a third-grade test a year earlier, really represents significant improvement or “growth.” Vertical scaling is done to support such a conclusion. The alternatives are independent interpretations of scores each year or comparisons of normative performance information across years. Vertical scaling or equating does indeed require the use of embedded, “off-grade” test items.
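
For readers who want to see why shared items are needed, here is a minimal sketch of one simple linking approach (mean/mean linking of item difficulties, of the kind used with Rasch-type models). It is an illustration only, not a description of the method any particular state uses. The item labels, difficulty values, student scores, and function names are all hypothetical. The idea is that the same embedded items are calibrated separately at each grade, and the average difference in their difficulty estimates tells us how far apart the two grades' scales sit.

```python
from statistics import mean


def linking_constant(base_difficulties, new_difficulties):
    """Return the shift that moves new-scale values onto the base scale.

    Both arguments map item labels to difficulty estimates. Only items
    present in both (the embedded "overlap" items) are used.
    """
    common = sorted(set(base_difficulties) & set(new_difficulties))
    if not common:
        raise ValueError("no common items; the two scales cannot be linked")
    return mean(base_difficulties[i] - new_difficulties[i] for i in common)


def to_base_scale(value, shift):
    """Express a new-scale ability or difficulty on the base scale."""
    return value + shift


# Hypothetical difficulty estimates, each calibrated within its own grade.
# A1-A3 are the overlap items given at both grades; they look harder to
# third graders than to fourth graders, as one would expect.
grade3_items = {"A1": 0.9, "A2": 1.3, "A3": 1.7, "G3_only": -0.4}
grade4_items = {"A1": -0.1, "A2": 0.2, "A3": 0.8, "G4_only": 1.1}

shift = linking_constant(grade3_items, grade4_items)

# A hypothetical student: 0.8 on the grade 3 scale last year,
# 0.5 on the grade 4 scale this year.
last_year = 0.8
this_year_on_grade3_scale = to_base_scale(0.5, shift)
growth = this_year_on_grade3_scale - last_year

print(f"linking shift: {shift:.2f}")
print(f"growth on the common scale: {growth:.2f}")
```

Without the overlap items there would be nothing to compute the shift from, and the student's two scores could not be compared directly. Operational programs use more elaborate procedures, but they depend on common items in the same way.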

In the skill areas of reading and math, we don’t believe there is much of an issue about the impact of these off-grade items. In reading, for example, even without embedded items, the passages for a target grade would represent a range of reading difficulty. (The skills addressed by the questions might be much the same.) There could be readability indexes for some passages well above grade level, as well as some well below. This is not unlike the situation with the NRTs described above.

Even an all-on-grade test would include some very hard questions to discriminate at the higher ability levels. In fact, many embedded questions from a higher grade could be easier than the more challenging “on-grade” questions. The situation would be the same in both reading and math. Thus, for the most part, items appropriate for two grades are being used at the two grades for the purposes of vertical equating. Whether taking a “pure” grade-level test or a test with embedded off-grade questions, the weaker the student, the greater the number of very challenging questions he or she will encounter. Standard test directions often include an admonition to students that the test may include “questions you will find difficult to answer. Do not spend too much time on any one question. Move on to the next one.” Test directions should continue to include such statements.

The content areas of science and social studies pose some special problems. State tests, no matter what the target grade level, often cover all the major disciplines of school science (life science, earth/space science, physical science). Yet, instruction at a particular grade level might not cover all those disciplines. The upcoming science testing requirement under NCLB calls for science assessments only at three grades: one elementary grade, one middle school grade, and one high school grade. This gives schools some flexibility regarding the grades in which they teach science subjects prior to the tested grade.

For states that test all disciplines of science in each of many adjacent grades, if instruction does not address all disciplines each year, then it is very likely that some portion of the test content will be unfamiliar to the students. This would be the case with or without embedded items for vertical equating. If science tests across different grades address very different content, consistent with instructional programs that emphasize different disciplines of science in different years, then vertical equating should not be done at all. Test equating, including vertical equating, should only be done if the tests measure the same content. As explained earlier, there is little problem with vertical equating in the skill areas of reading and math. If science tests in adjacent grades cover all science disciplines and the only differences are slight variations in the sophistication of questions, then vertical equating may be reasonable. However, there would still be the problem of a curriculum-assessment mismatch if the instructional programs addressed different science disciplines in different grades. By the way, most of this discussion would apply to social studies as well.

In summary, the embedding of off-grade items is standard practice. In reality, because of the range of abilities over which tests must discriminate, these off-grade items are typically within the wide range of difficulty that would be covered by a “pure” on-grade test. In skill areas such as math and reading, the off-grade items would be highly unlikely to cover content that is inappropriate for the target grade. For vertical scaling, some “overlap” items must be used, i.e., identical items used at adjacent grades. However, most items targeted to a particular grade level can also be used appropriately at other grade levels. Even a test that does not include off-grade items will include some very difficult items and, possibly, items with some content seemingly “foreign” to some students. Test directions should tell students that they may encounter difficult questions and should not spend too much time on any one question.

Copyright 2004 by Measured Progress. All rights reserved.