
Stakes, Mistakes & Statewide Testing

By Stuart Kahl

From The State Education Standard, Winter 2000

What do the following states have in common: Arizona, California, Florida, Idaho, Illinois, Indiana, Kentucky, Maine, Missouri, Nevada, New York, Ohio, Rhode Island, South Carolina, Tennessee, Vermont, Washington, Wisconsin, and Wyoming? The answer: news articles over the past three years have reported problems encountered by their statewide testing programs with respect to testing materials or results. An extensive search would undoubtedly turn up even more states than those listed.

The nature of the problems encountered by these states varies considerably but tends to fall into three general categories: errors in testing materials, security breaches, and errors in results. For example, in 1999, thousands of New York City students were mistakenly sent to summer school based on incorrectly determined test scores. Erroneous scores were reported in at least five additional states (Indiana, Nevada, South Carolina, Tennessee, and Wisconsin) where the same test that was used in New York City was administered as part of statewide testing.

Also last summer, reporting errors due to the miscategorization of students in California got a great deal of media attention. In Washington, 500,000 student essays had to be rescored as a result of a scoring error the first time through. Recent problems with testing materials or test security were reported in Ohio, Rhode Island, Vermont, and Wyoming, to name a few. A few years earlier, analysis and reporting errors occurred in Florida, Illinois, Kentucky, Maine, and Ohio.

As it turns out, over the past several years, hardly a month has passed without a newspaper report of a testing error. No testing company has escaped these problems. One wonders if these errors are occurring more often than in the past, or if they are just getting more attention. The answer is both. One also wonders if customized programs are more prone to such errors than programs that use off-the-shelf products. While it would be logical to come to this conclusion, even the established off-the-shelf products have had their problems, and these problems extend beyond statewide testing programs. For example, in 1997, the SSAT program, used for admission to prep schools, reported scores that were miscomputed. In 1998, it was discovered that the SAT math scores reported in the fall of 1997 were incorrect for over 15,000 students, with some scores off by as much as a hundred points. And several of the errors mentioned at the start of this article were associated with off-the-shelf tests that were used in statewide programs.

What is going on? What is it about statewide testing that makes errors and other problems so commonplace? Are all testing contractors incompetent? I know that is not the case. To get a better sense of what is happening in the testing industry, a little history is helpful.

Increasing Demands

In the mid-1980s, state educational accountability laws and reform acts led to an increase in the number of statewide assessment programs. Although a few states had already been using tests they had developed themselves, the instruments most available and most used were the published, off-the-shelf products. These tests, designed to compare or rank-order students, were not intended to be used for the evaluation of curricular and instructional programs. With the Lake Wobegon score inflation scandal in the testing industry a few years later, even more states began to develop their own programs. More recently, of course, Goals 2000 funds and Title I requirements have led to still more do-it-yourself state tests because the off-the-shelf products are not aligned with state content standards.

Throughout the 1980s and 90s, state educational assessment systems have been characterized by increased stakes for students and educators, increased scope (more grades, more subjects, and more innovative approaches to assessment), increased costs, and increased politicization. With all of these changes come demands for more instruments and faster work. The higher stakes attached to test results have caused problems even for state testing programs employing off-the-shelf products. Security issues arise because the published products that are selected for state programs are often already in many schools. (Off-the-shelf test series have a lifetime of almost a decade.) This was a problem recently in one state, where some schools used for practice the very same form the state administered as its grade three early warning reading test. Also with higher stakes come challenges that require far greater disclosure than before of testing materials, data, etc. Consequently, many states are contracting to have new tests produced and administered every year. Of course, these tests have to be tailored to the state content standards, and they have to be equated to the previous year's tests.

At the same time that assessment systems are growing in complexity and scope, the time line for developing and implementing the tests is frequently shrinking. In 1991, Kentucky awarded a customized testing contract in July. Content advisory committees were convened in September to discuss test specifications and again in October to review the first item sets. Later in that same school year, accountability tests in seven subjects, including multiple-choice and constructed-response questions and writing prompts, were administered in three grade levels statewide. "Off-grade" assessments in four subjects were administered in seven other grades. Also, each student in the three accountability grades participated in hands-on performance assessments monitored by trained facilitators, and each student submitted writing portfolios that were scored as part of the accountability requirement. All of this activity took place within eleven months of the selection of a contractor. The scope of that program, great as it seemed at the time, is now being approached in many other states.

The extensive release of materials, a common practice for many programs now, necessitates tremendous development efforts. In the past few years, states have issued item development contracts with development goals on the order of 30,000 test questions in a half year! Isn't it likely, considering the demands for tests and the requirement that they be produced in months—rather than years as in the past—that errors in materials will occur?

I was once asked by one state's commissioner of education if my company would guarantee error-free assessments. I responded, "I will assure you that we will do everything possible, within the constraints of time and money you have placed on us, to minimize the probability of error in testing materials and results for your program." While this was not the answer the commissioner was seeking, it at least brought a smile. In reality, when state testing takes place and errors are avoided, the testing companies and their partners, the state departments of education, have accomplished miracles. Another way to look at it is to say they have dodged a lot of bullets.

Turn-around Time for Results

A common concern local school personnel express regarding many statewide testing programs is the time it takes for them to get test results. Their frame of reference is their experience in administering off-the-shelf tests as part of school or district testing programs. These tests are developed over a period of years, studies are completed to determine national norms, and scoring and reporting programs are perfected before the tests are put on the market. Schools administer the tests, then return their student answer sheets, and within weeks receive reports. Because of the advance work that has already been completed, computers do little more than determine points earned by students, then look up in a table the corresponding scaled scores by going to the right place for the age of the students and time of year tested. This is much like using a tax table once adjusted gross income is known.
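The score-and-look-up step described above can be sketched in a few lines of Python. This is an illustration only: the answer key, the norms table, its values, and the function names are all hypothetical, and a real publisher's tables would be far larger and keyed in whatever way its norming studies require.

```python
# Hypothetical answer key for a three-question test.
ANSWER_KEY = ["B", "D", "A"]

# Hypothetical norms table built in advance by the publisher:
# (student age, time of year tested) -> {raw points: scaled score}.
NORMS = {
    (9, "fall"): {0: 400, 1: 420, 2: 440, 3: 460},
    (9, "spring"): {0: 390, 1: 410, 2: 430, 3: 450},
}


def raw_score(responses, key=ANSWER_KEY):
    """Count the points earned: one per response that matches the key."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)


def scaled_score(responses, age, season):
    """Look up the scaled score in the table for this age and time of year."""
    return NORMS[(age, season)][raw_score(responses)]


# Two correct answers, tested in the fall at age 9.
print(scaled_score(["B", "D", "C"], 9, "fall"))
```

Because the table already exists, nothing has to be estimated at scoring time, which is why results for these tests can come back within weeks.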

In the scenario above, schools receive student reports as well as school summaries within a relatively short time period. For local testing, it does not matter whether the test publisher receives answer documents for all students within a school. The publisher scans and scores whatever materials it was sent and returns results for just those students whose materials it processed, having no knowledge of whether students were inappropriately excluded or otherwise unaccounted for. However, the situation is usually different for statewide testing. In many state programs, the contractor spends a great deal of time tracking down information on students when the numbers of student testing materials do not agree with the enrollment figures of the schools. More often than not, this happens whether the statewide program uses customized or off-the-shelf instruments. Since many programs produce current-year statewide results for purposes of comparisons in the school reports, all school and student materials must be accounted for before final results for the state can be determined. This goal has two inherently contradictory requirements: complete information on all districts, schools, and students; and fast, efficient reporting of results.

Many tasks must be accomplished before reports of results can be distributed. If a program uses mixed-item formats, then multiple-choice responses can be processed immediately, but answers to constructed-response questions must be scored before data from those questions can be merged with data from the multiple-choice component. The merging of files generally uncovers the need for additional data cleanup because of unmatched partial records. If the program is a totally customized assessment program, using new tests every year, existing computer programs must be modified, and sophisticated scaling and equating procedures must be implemented each year. Then, after all reportable statistics have been computed, report programs must be run to transfer the statistics from the data file to the report shells or formats. With some testing programs producing many different reports of student, school, and district results, the pages of unique data generated may number in the millions. Smudging and other printing problems may create the need for time-consuming reruns.
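A minimal sketch of the merging step shows why it tends to surface cleanup work. The student IDs, point values, and function name below are made up for illustration, and real files carry many more fields; the point is only that any student found in one file but not the other cannot be given a total score until the record is resolved.

```python
def merge_components(mc_points, cr_points):
    """Combine multiple-choice and constructed-response points by student ID.

    Returns (merged, unmatched). `merged` maps each ID found in both files to
    its total points. `unmatched` maps each ID found in only one file to the
    name of the component that is missing for that student.
    """
    merged = {}
    unmatched = {}
    for student_id in sorted(set(mc_points) | set(cr_points)):
        if student_id in mc_points and student_id in cr_points:
            merged[student_id] = mc_points[student_id] + cr_points[student_id]
        elif student_id in mc_points:
            unmatched[student_id] = "constructed-response"
        else:
            unmatched[student_id] = "multiple-choice"
    return merged, unmatched


# Hypothetical data: S2 has no scored essay, S4 has no scanned answer sheet.
mc = {"S1": 30, "S2": 25, "S3": 28}
cr = {"S1": 8, "S3": 6, "S4": 7}
merged, unmatched = merge_components(mc, cr)
print(merged)
print(unmatched)
```

Each entry in `unmatched` stands for a partial record that someone has to track down, by phone or by hand, before final results can be computed.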

The Irony of Turn-around Demands

While state testing programs vary considerably in terms of complexity and the amount of work that must be accomplished to generate reports, one thing is very common: policy makers are requiring faster and faster turn-around times for results. The irony of this is that a primary purpose of many state programs is to monitor changes in school performance for accountability reasons, and yet for this purpose, data from several years of testing are required to make defensible interpretations and legitimate use of test results. For example, when a program tests at a fixed grade, say grade four at the elementary level, it is well-known that school results will fluctuate considerably from year to year due to the variation in the capabilities of students passing through the tested grade from one year to the next (just as the caliber of a school's basketball team will change year to year even if the coaches and equipment remain the same). With such a testing design, several years of data should be aggregated to produce baseline scores, and several more years should be aggregated for change scores to compare to the baseline scores.

While school results may fluctuate from year to year, statewide results should not be expected to change much from one year to the next. The near random school fluctuations cancel each other out. Furthermore, the cumulative educational experience of fourth graders in a school or statewide one year is not significantly different from that of fourth graders the next year, regardless of the programmatic or instructional changes that occur in a school or statewide that one year. Improvement is detectable only after a critical mass of teachers and students has been reached for a critical number of years.
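The argument in the last two paragraphs can be illustrated with a small simulation. It is a sketch under loudly stated assumptions, not a model of any real program: every school is given a fixed underlying performance level that never changes, each year's cohort is a fresh random draw of students around that level, and the numbers chosen (500 schools, 50 students per cohort, the score scale) are arbitrary.

```python
import random
import statistics


def simulate(num_schools=500, cohort_size=50, years=6, seed=0):
    """Return results[s][y]: the mean score of school s's cohort in year y.

    Assumption: a school's underlying level is constant, so all year-to-year
    movement comes from differences between the cohorts passing through.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(num_schools):
        school_level = rng.gauss(250, 10)  # fixed for all years
        yearly_means = []
        for _ in range(years):
            cohort = [rng.gauss(school_level, 40) for _ in range(cohort_size)]
            yearly_means.append(statistics.fmean(cohort))
        results.append(yearly_means)
    return results


def summarize(results):
    """Compare single-year changes with changes between three-year averages."""
    year1 = [r[0] for r in results]
    year2 = [r[1] for r in results]
    one_year_changes = [b - a for a, b in zip(year1, year2)]
    three_year_changes = [
        statistics.fmean(r[3:6]) - statistics.fmean(r[0:3]) for r in results
    ]
    return {
        # Share of schools whose score went down from year 1 to year 2.
        "share_declining": sum(1 for c in one_year_changes if c < 0) / len(results),
        # Change in the statewide average over the same two years.
        "state_change": statistics.fmean(year2) - statistics.fmean(year1),
        # Spread of school-level changes, single year vs. three-year averages.
        "one_year_change_sd": statistics.pstdev(one_year_changes),
        "three_year_change_sd": statistics.pstdev(three_year_changes),
    }


for name, value in summarize(simulate()).items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, roughly half the schools show a decline in any given year even though nothing about them has changed, the statewide average barely moves, and changes computed from three-year averages are noticeably less volatile than single-year changes.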

The starting point for measuring change is the first year of testing, no matter how many years earlier reform efforts were begun. Recently, when one state released its second-year statewide test results, political leaders and policy makers seemed shocked that there were no significant changes from the first year to the second. (A reform act had been passed a few years before assessment development was begun.) Such results are to be expected, but a lack of understanding of what could reasonably be anticipated could well lead to the discontinuation of funding of reform efforts in that state. The notion that educational reform is a long-term commitment that requires five to ten years to significantly raise achievement levels is incompatible with the term limits of political figures. A few years ago, in another state, three months were arbitrarily cut from a reporting schedule. This resulted in the requirement that high-stakes school accountability results, involving data from a four-year cycle of testing, be reported only one month after raw data files for the fourth year of testing were produced. Not surprisingly, there were errors.

What Can Be Done?

Given the capabilities of technology and the effectiveness of the newer psychometric techniques and software packages, high-quality work in testing is possible. However, too little time and too few resources are being provided to accomplish the quality control necessary for statewide testing programs as large and complex as many of today's programs. Nobody wants errors in testing materials or reports of results. When they occur, contractors generally accept responsibility for them; yet the role client decision-makers play should not be ignored. They have the power to do a great deal to minimize the likelihood of errors in statewide testing programs. A few suggestions follow.

  • Report results of a testing program in phases. The results that schools can use soonest are not the high-stakes school accountability results that require sophisticated scaling and equating analyses and many years' worth of data. Many programs report test item results, item-by-item student responses, and student raw total test scores and percentile ranks. These types of results do not require extensive analysis and can be released soon after scoring is completed. They can be used by teacher teams looking for particular strengths and weaknesses of their students and their instruction with respect to concepts, topics, and skills tested. The total test scores can be used by schools to assist in comparing the performance of students for purposes of placement and other decisions. As an added benefit, the early release of these data can also be used for data verification. Irregularities in these preliminary reports, such as misclassifications of students, identified by local school personnel, can be corrected before high-stakes school accountability results are computed.
  • Don't succumb to demands for data that can only be misused. As indicated earlier, school scores can fluctuate from one year to the next due to differences between the groups of students passing through the tested grade. Thus, school scores for close to half the schools in a state should be expected to go down each year, relative to the previous year's results. These declines, if reported, can only be misinterpreted. Multi-year data, such as cumulative averages, should be reported to provide more stable indicators of schools' relative performances. Only when there are enough years' data to compute meaningful change scores should change data be the focus of reporting. Since these results take years to produce anyway, they should not be rushed in the final year of a change cycle. That can only lead to errors.
  • Simplify testing programs. Many statewide testing programs have become quite complex. What is gained by complex designs can easily be lost when the complexity, combined with unreasonable time lines, leads to errors in testing materials or results. Sometimes a single testing program can effectively address multiple purposes (e.g., accountability and school improvement). Sometimes, however, separate tests that are optimally designed for two different purposes (e.g., school program evaluation and student graduation requirement) might be very different. In the latter case, combining tests can cause a test design to be overly complex or inefficient.
  • Don't change programs unnecessarily. The famous "reading anomaly"* in national assessment results a few years ago led officials at the National Assessment of Educational Progress (NAEP) to conclude, "When measuring change, you shouldn't change the measure." The NAEP issue was not one of mistakes; none were found. Instead, the concern was that the purpose of monitoring change could not be accomplished because of comparability problems due to changes in testing procedures or instruments.

Such comparability problems can affect statewide assessments, too. However, with respect to the topic of this article, changes in programs increase the likelihood of errors. Especially in statewide testing programs, with greatly compacted time lines, most changes seem to be made at the last minute, and all too often these cause more trouble than they are worth. Better initial designs are the solution. Unfortunately, legislation prepared by individuals who are not knowledgeable about many testing issues sometimes goes too far in specifying the design characteristics of programs. Then, after less-than-ideal designs are implemented, the need for changes becomes more obvious.

Without question, testing contractors must continually seek to improve processes and reduce errors. State policy makers and decision-makers can go a long way to support that effort by recognizing their crucial roles regarding the competing goals of speed and accuracy. While both are desirable, going too far with one jeopardizes the other.

*The "reading anomaly" occurred when results from the 1986 NAEP reading assessment showed a steep decline in reading scores. However, an expert panel convened by the U.S. Department of Education concluded that the drop in scores resulted from changes in test design and administration rather than actual declines in student achievement.