Research Framework - Scoring and Reporting
Observed Scores, Scaled Scores, Norms, and IRT Ability Estimates
Observed Scores
Research on observed scores might focus directly on individual item responses, or it could be based on the sum of item scores or some type of weighted sum of item scores. The research could include how to score constructed-response items (scoring rubrics, multiple human raters, computer scoring, etc.), although this would also clearly interact with task design and development. Observed scores could be based on the entire test or on a subset of the items, in which case they are typically called subscores.
Research examples include examining how different types of weighted scores affect other psychometric or substantive characteristics of interest, such as reliability, score interpretation, or effectiveness of remediation methods. Research on subscores is especially concerned with how best to carry out the subscoring process and report such scores in a way that maintains sufficient reliability for particular purposes, usually related to giving diagnostic information on individual knowledge or skill domains that would otherwise be absorbed into the total score.
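As a minimal sketch of the quantities discussed above, the following Python computes a weighted sum score, a subscore, and coefficient alpha as one common reliability index. The function names, the weights, and the small response matrix are hypothetical and chosen only for illustration; they do not come from any particular study.

```python
from statistics import pvariance


def weighted_score(responses, weights):
    """Weighted sum of one examinee's scored item responses."""
    return sum(w * x for w, x in zip(weights, responses))


def subscore(responses, item_indices):
    """Unweighted sum over a subset of items (a subscore)."""
    return sum(responses[i] for i in item_indices)


def cronbach_alpha(score_matrix):
    """Coefficient alpha; one row per examinee, one column per item."""
    n_items = len(score_matrix[0])
    item_vars = [pvariance([row[j] for row in score_matrix])
                 for j in range(n_items)]
    total_var = pvariance([sum(row) for row in score_matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 examinees by 4 dichotomously scored items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

# Alpha for the full test and for a two-item subscore (items 0 and 1).
alpha_total = cronbach_alpha(data)
alpha_sub = cronbach_alpha([[row[0], row[1]] for row in data])
```

With this toy matrix the two-item subscore has lower alpha than the full test, which is the kind of reliability loss that subscore research tries to quantify and manage.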
IRT Ability Estimates
This research area is not about the ability estimation method itself but rather about how ability estimates relate to other aspects of measurement and how to report them for research purposes (the assumption being that consumers would not be interested in them). For example, for continuous models of ability, estimation results may be reported on a logistic scale, true-score scale, or percentile scale. For discrete models of ability (as in levels of mastery in skills diagnosis), estimation results might be reported as probabilities for individual skills, for different skill patterns, or simply the most likely discrete level or category.
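For the continuous case, the sketch below shows one way a single ability estimate could be re-expressed on a true-score scale and a percentile scale. It assumes a two-parameter logistic (2PL) model and a standard normal reference distribution; both are illustrative assumptions, and the item parameters are hypothetical.

```python
import math
from statistics import NormalDist


def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def true_score(theta, items):
    """Expected number-correct score: the test characteristic curve."""
    return sum(p_correct(theta, a, b) for a, b in items)


def percentile(theta, population=NormalDist(0, 1)):
    """Percent of the assumed reference population at or below theta."""
    return 100 * population.cdf(theta)


# Hypothetical (discrimination, difficulty) pairs for a 3-item test.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

theta_hat = 0.0
ts = true_score(theta_hat, items)   # expected score out of 3
pct = percentile(theta_hat)         # percentile under N(0, 1)
```

All three reporting scales preserve the ordering of examinees; they differ in how distances between examinees are represented, which is one reason the choice of scale can affect correlations with external variables.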
Such research would almost always be conducted in conjunction with another area; for example, seeing how the choice of scale affects external validity correlations or how it affects decision making.
One innovative research area is the incorporation of response time into the estimation procedure through the use of an appropriate IRT model. Another area is the estimation of growth or learning over time by use of a sophisticated IRT model. These innovations would clearly overlap with Modeling and Design. Also, although both of these innovative areas have typically had an IRT flavor to them, other types of statistical analyses (such as regression analyses of observed scores or modifications thereof) could also be used.
Score Scale
This research pertains to choosing the scale used for official score reporting when a continuous scale is desired. The score scale will be some type of monotonic transformation of either the observed score scale or the scale used for the IRT continuous ability estimates. Loosely speaking, the observed score scale or IRT scale will be "stretched" or "shrunk" at various points, and then relabeled with numeric values that maintain the same order as the original scale. Both substantive and statistical issues play a role in this decision making, and research can be conducted to help guide these choices.
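The "stretch or shrink, then relabel" idea can be sketched as a piecewise-linear, order-preserving map. The knot values and the 200 to 280 reporting range below are hypothetical, and operational scales may use other monotonic transformations.

```python
from bisect import bisect_right


def make_scale(raw_knots, scale_knots):
    """Build a piecewise-linear map from raw scores to reported scores.

    Both knot lists must be strictly increasing, which keeps the map
    order-preserving within the raw score range.
    """
    if any(x2 <= x1 for x1, x2 in zip(raw_knots, raw_knots[1:])):
        raise ValueError("raw knots must be strictly increasing")
    if any(y2 <= y1 for y1, y2 in zip(scale_knots, scale_knots[1:])):
        raise ValueError("scale knots must be strictly increasing")

    def to_scale(raw):
        # Clamp to the defined raw range, then interpolate linearly
        # inside the segment that contains the raw score.
        raw = min(max(raw, raw_knots[0]), raw_knots[-1])
        i = min(bisect_right(raw_knots, raw) - 1, len(raw_knots) - 2)
        x0, x1 = raw_knots[i], raw_knots[i + 1]
        y0, y1 = scale_knots[i], scale_knots[i + 1]
        return y0 + (raw - x0) * (y1 - y0) / (x1 - x0)

    return to_scale


# Hypothetical scale: raw 0-20 is stretched (3 scale points per raw
# point), raw 20-40 is shrunk (1 scale point per raw point).
to_scale = make_scale([0, 20, 40], [200, 260, 280])
scaled = [to_scale(r) for r in (0, 10, 20, 30, 40)]
```

In this example the lower half of the raw range receives three times as many scale points per raw point as the upper half, yet the rank order of examinees is unchanged.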
Proficiency Scaling
Research in this area is concerned with how best to link an examinee's official scale score or observed score with diagnostic information (either statistical, substantive, or both) about the individual's proficiency on the knowledge and skills measured by the test.
Normed Scores
Normed scores express an examinee's performance relative to how a particular reference population performed on the test, so that teachers, students, or parents can better understand the meaning of the scores. Research may, for example, focus on what kind of score to use for a norm, the quantification of its accuracy or precision, or how the norm scores relate to other variables of interest.
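One common kind of normed score is the percentile rank. The sketch below uses the midpoint convention (percent scoring below, plus half of those at the same score); that convention and the norm-group scores are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the norm group below the score, plus half of the ties."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    at = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * at) / len(ordered)


# Hypothetical norm-group scores from a reference population sample.
norm_group = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

pr_22 = percentile_rank(22, norm_group)
pr_30 = percentile_rank(30, norm_group)
```

Because the norm group is a sample, a percentile rank carries sampling error, which is one reason the accuracy and precision of norms is itself a research question.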
Relationship between IRT Ability Estimates, Observed Scores, and Scaled Scores
Examinees typically have several types of score statistics associated with them, such as observed score(s) (the sum of their scored item responses on either the test as a whole or on subtests), IRT ability estimates (derived from scored item responses in conjunction with the psychometric model), and scaled scores. Research in this area investigates interactions among some or all of these components.
Research
A Two-Stage Scoring Method to Enhance Accuracy of Performance Level Classification by Matthew Finkelman, Mark Darby, and Michael Nering (2009), from Educational and Psychological Measurement, Volume 69, Number 1
When Push Comes to Shove: Modified Performance Profile Procedure as Viable Alternative to Body of Work Standard Setting Method by Luz Bay, Kevin Haley, and Liz Burton (2008), Paper presented at the Annual Meeting of the National Council on Measurement in Education, New York, NY
An Adaptive Scoring Protocol to Enhance Accuracy of Performance Classification by Mark Darby, Matt Finkelman, and Michael Nering (2007), Paper presented at the American Educational Research Association Annual Meeting, Chicago, IL
Creating a Proficiency Scale with Models for Cognitive Design by Robert A. Henson and Jonathan L. Templin (2004)
The Long-term Sustainability of Different Item Response Theory Scaling Methods by Lisa A. Keller and Robert R. Keller (2009)

