2003 PretestPosttestDesignsandMeasurementofChange
- (Dimitrov & Rumrill, 2003) ⇒ Dimiter M. Dimitrov, and Phillip D. Rumrill Jr. (2003). “Pretest-posttest Designs and Measurement of Change.” In: WORK: A Journal of Prevention, Assessment and Rehabilitation, 20(2).
Subject Headings: Treatment Effect; Pretest-Posttest Experiment
Notes
Cited By
Quotes
Author Keywords
Abstract
The article examines issues involved in comparing groups and measuring change with pretest and posttest data. Different pretest-posttest designs are presented in a manner that can help rehabilitation professionals to better understand and determine effects resulting from selected interventions. The reliability of gain scores in pretest-posttest measurement is also discussed in the context of rehabilitation research and practice.
1. Introduction
Pretest-posttest designs are widely used in behavioral research, primarily for the purpose of comparing groups and/or measuring change resulting from experimental treatments. The focus of this article is on comparing groups with pretest and posttest data and related reliability issues. In rehabilitation research, change is commonly measured in such dependent variables as employment status, income, empowerment, assertiveness, self-advocacy skills, and adjustment to disability. The measurement of change provides a vehicle for assessing the impact of rehabilitation services, as well as the effects of specific counseling and allied health interventions.
2. Basic pretest-posttest experimental designs
This section addresses designs in which one or more experimental groups are exposed to a treatment or intervention and then compared to one or more control groups that did not receive the treatment. Brief notes on the internal and external validity of such designs are first necessary. Internal validity is the degree to which the experimental treatment makes a difference in (or causes change in) the specific experimental settings. External validity is the degree to which the treatment effect can be generalized across populations, settings, treatment variables, and measurement instruments. As described in previous research (e.g. [11]), factors that threaten internal validity are: history, maturation, pretest effects, instruments, statistical regression toward the mean, differential selection of participants, mortality, and interactions of factors (e.g., selection and maturation). Threats to external validity include: interaction effects of selection biases and treatment, reactive interaction effect of pretesting, reactive effect of experimental procedures, and multiple-treatment interference. For a thorough discussion of threats to internal and external validity, readers may consult Bellini and Rumrill [1]. Notations used in this section are [math]\displaystyle{ Y_1 }[/math] = pretest scores, [math]\displaystyle{ T }[/math] = experimental treatment, [math]\displaystyle{ Y_2 }[/math] = posttest scores, [math]\displaystyle{ D = Y_2 - Y_1 }[/math] (gain scores), and RD = randomized design (random selection and assignment of participants to groups and, then, random assignment of groups to treatments).
With the RDs discussed in this section, one can compare experimental and control groups on (a) posttest scores, while controlling for pretest differences or (b) mean gain scores, that is, the difference between the posttest mean and the pretest mean. Appropriate statistical methods for such comparisons and related measurement issues are discussed later in this article.
- Design 1
- Randomized pretest-posttest control-group design
With this RD, all conditions are the same for both the experimental and control groups, with the exception that the experimental group is exposed to a treatment, T, whereas the control group is not. Maturation and history are major problems for internal validity in this design, whereas the interaction of pretesting and treatment is a major threat to external validity. Maturation occurs when biological and psychological characteristics of research participants change during the experiment, thus affecting their posttest scores. History occurs when participants experience an event (external to the experimental treatment) that affects their posttest scores. Interaction of pretesting and treatment comes into play when the pretest sensitizes participants so that they respond to the treatment differently than they would with no pretest. For example, participants in a job-seeking skills training program take a pretest regarding job-seeking behaviors (e.g., how many applications they have completed in the past month, how many job interviews they have attended). Responding to questions about their job-seeking activities might prompt participants to initiate or increase those activities, irrespective of the intervention.
- Design 2
- Randomized Solomon four-group design
This RD involves two experimental groups, E1 and E2, and two control groups, C1 and C2. All four groups complete posttest measures, but only groups E1 and C1 complete pretest measures, in order to allow for better control of pretesting effects. In general, the Solomon four-group RD enhances both internal and external validity. This design, unlike other pretest-posttest RDs, also allows the researcher to evaluate separately the magnitudes of effects due to treatment, maturation, history, and pretesting. Let D1, D2, D3, and D4 denote the gain scores for groups E1, C1, E2, and C2, respectively. These gain scores are affected by several factors (given in parentheses) as follows: D1 (pretesting, treatment, maturation, history), D2 (pretesting, maturation, history), D3 (treatment, maturation, history), and D4 (maturation, history). With this, the difference D3 − D4 evaluates the effect of treatment alone, D2 − D4 the effect of pretesting alone, and D1 − D2 − D3 + D4 the effect of the interaction of pretesting and treatment [11, p. 68]. Despite the advantages of the Solomon four-group RD, Design 1 is still predominantly used in studies with pretest-posttest data. When the groups are relatively large, for example, one can randomly split the experimental group into two groups and the control group into two groups to use the Solomon four-group RD. However, sample size is almost always an issue in intervention studies in rehabilitation, which often leaves researchers opting for the simpler, more limited two-group design.
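The decomposition above reduces to simple arithmetic on the four mean gain scores. The following minimal Python sketch (not from the article; the function name and the group means are hypothetical) computes the three contrasts:

```python
# Minimal sketch of the Solomon four-group contrasts (illustrative only).
# D1: group E1 (pretested, treated)   D2: group C1 (pretested only)
# D3: group E2 (treated only)         D4: group C2 (neither)

def solomon_contrasts(d1, d2, d3, d4):
    """Effect estimates implied by the gain-score decomposition."""
    treatment = d3 - d4               # treatment alone
    pretesting = d2 - d4              # pretesting alone
    interaction = d1 - d2 - d3 + d4   # pretesting-by-treatment interaction
    return treatment, pretesting, interaction

# Hypothetical mean gains for E1, C1, E2, C2:
print(solomon_contrasts(d1=8.0, d2=2.0, d3=6.5, d4=1.0))
# -> (5.5, 1.0, 0.5)
```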
- Design 3
- Nonrandomized control-group design
This design is similar to Design 1, but the participants are not randomly assigned to groups. Design 3 has practical advantages over Design 1 and Design 2 because it deals with intact groups and thus does not disrupt the existing research setting. This reduces the reactive effects of the experimental procedure and, therefore, improves the external validity of the design. Indeed, conducting a legitimate experiment without the participants being aware of it is possible with intact groups, but not with random assignment of subjects to groups. Design 3, however, is more sensitive to internal validity problems due to interaction between such factors as selection and maturation, selection and history, and selection and pretesting. For example, a common quasi-experimental approach in rehabilitation research is to use time-sampling methods whereby the first, say, 25 participants receive an intervention and the next 25 or so form a control group. The problem with this approach is that, even if there are posttest differences between groups, those differences may be attributable to characteristic differences between the groups rather than to the intervention. Random assignment to groups, on the other hand, equalizes groups on existing characteristics and, thereby, isolates the effects of the intervention.
3. Statistical methods for analysis of pretest-posttest data
The brief discussion of modern approaches for measuring change in this section requires the definition of some concepts from classical test theory (CTT) and item response theory (IRT). In CTT, each observed score, [math]\displaystyle{ X }[/math], is a sum of a true score, [math]\displaystyle{ T }[/math], and an error of measurement, [math]\displaystyle{ E }[/math] (i.e., [math]\displaystyle{ X = T + E }[/math]). The true score is unobservable, because it represents the theoretical mean of all observed scores that an individual may have under an unlimited number of administrations of the same test under the same conditions. Statistical tests for measuring change in true scores from pretest to posttest have important advantages over classical raw-score differences in terms of accuracy, flexibility, and control of error sources. Theoretical frameworks, designs, procedures, and software for such tests, based on structural equation modeling, have been developed and successfully used over the last three decades [13,21].
3.1. ANOVA on gain scores
The gain scores, [math]\displaystyle{ D = Y_2 - Y_1 }[/math], represent the dependent variable in ANOVA comparisons of two or more groups. The use of gain scores in the measurement of change has been criticized because of the (generally false) assertion that the difference between scores is much less reliable than the scores themselves [5,14,15]. This assertion is true only if the pretest scores and the posttest scores have equal (or proportional) variances and equal reliability. When this is not the case, which may happen in many testing situations, the reliability of the gain scores can be high [18,19,23]. The unreliability of the gain score does not preclude valid testing of the null hypothesis of zero mean gain score in a population of examinees. If the gain score is unreliable, however, it is not appropriate to correlate the gain score with other variables in a population of examinees [17]. An important practical implication is that, without ignoring the caution urged by previous authors, researchers should not always discard gain scores and should be aware of situations in which gain scores are useful.
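The condition underlying this claim can be made explicit with the standard CTT formula for the reliability of a difference score (consistent with [15,23], though the formula itself is not reproduced in this excerpt): [math]\displaystyle{ \rho_{DD} = \frac{\sigma_1^2\rho_{11} + \sigma_2^2\rho_{22} - 2\rho_{12}\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2} }[/math], where [math]\displaystyle{ \sigma_1, \sigma_2 }[/math] are the pretest and posttest standard deviations, [math]\displaystyle{ \rho_{11}, \rho_{22} }[/math] their reliabilities, and [math]\displaystyle{ \rho_{12} }[/math] the pretest-posttest correlation. With equal variances and reliabilities ([math]\displaystyle{ \sigma_1 = \sigma_2 }[/math], [math]\displaystyle{ \rho_{11} = \rho_{22} = \rho_{XX} }[/math]), this reduces to [math]\displaystyle{ \rho_{DD} = (\rho_{XX} - \rho_{12})/(1 - \rho_{12}) }[/math], which approaches zero as the pretest-posttest correlation approaches the common reliability; when the variances or reliabilities differ, the numerator need not shrink, so the gain score can remain reliable.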
3.2. ANCOVA with pretest-posttest data
3.3. ANOVA on residual scores
3.4. Repeated measures ANOVA with pretest-posttest data
4. Measurement of change with pretest-posttest data
4.1. Classical approach
4.2. Modern approaches for measurement of change
5. Conclusion
Important summary points are as follows:
- The experimental and control groups with Designs 1 and 2 discussed in the first section of this article are assumed, on the basis of random selection, to be equivalent on the pretest and on other variables that may affect their posttest scores. Both designs control well for threats to internal and external validity. Design 2 (Solomon four-group design) is superior to Design 1 because, along with controlling for effects of history, maturation, and pretesting, it allows for evaluation of the magnitudes of such effects. With Design 3 (nonrandomized control-group design), the groups being compared cannot be assumed to be equivalent on the pretest. Therefore, data analysis with this design should use ANCOVA or another appropriate statistical procedure. An advantage of Design 3 over Designs 1 and 2 is that it involves intact groups (i.e., keeps the participants in natural settings), thus allowing a higher degree of external validity.
- The discussion of statistical methods for analysis of pretest-posttest data in this article focuses on several important facts. First, contrary to the traditional misconception, the reliability of gain scores can be high in many practical situations, particularly when the pre- and posttest scores do not have equal variance and equal reliability. Second, the unreliability of gain scores does not preclude valid testing of the null hypothesis related to the mean gain score in a population of examinees. It is not appropriate, however, to correlate unreliable gain scores with other variables. Third, ANCOVA should be the preferred method for analysis of pretest-posttest data. ANOVA on gain scores is also useful, whereas ANOVA on residual scores and repeated measures ANOVA with pretest-posttest data should be avoided. With randomized designs (Designs 1 and 2), the purpose of ANCOVA is to reduce error variance, whereas with nonrandomized designs (Design 3) ANCOVA is used to adjust the posttest means for pretest differences among intact groups. If the pretest scores are not reliable, the treatment effects can be seriously biased, particularly with nonrandomized designs. Another caution with ANCOVA relates to possible differential growth on the dependent variable in intact or self-selected groups. (A minimal illustration contrasting the gain-score and ANCOVA analyses appears after this list.)
- The methodological appropriateness and social benefit of measuring change in terms of mean gain score is questionable; it is not clear, for example, that a method yielding a lower mean gain score in a rehabilitation experiment is uniformly inferior to the other method(s) involved in this experiment. Also, the results from using raw-score differences in measuring change are generally misleading because they depend on the level of difficulty of test items. Specifically, for subjects with equal actual (true score or ability) change, an easy test (a ceiling-effect test) will falsely favor low-ability subjects and, conversely, a difficult test (a floor-effect test) will falsely favor high-ability subjects. These problems with raw-score differences are eliminated by using (a) modern approaches such as structural equation modeling for measuring true score changes or (b) item response models (e.g., the Linear Logistic Model for Change, LLMC) for measuring changes in the ability underlying subjects’ performance on a test. Researchers in the field of rehabilitation can also benefit from using recently developed computer software with modern theoretical frameworks and procedures for measuring change across two (pretest-posttest) or more time points.
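As a concrete companion to the summary points above, the following self-contained Python sketch (not from the article; data are simulated and all names are illustrative) runs the two recommended analyses, ANOVA on gain scores and ANCOVA with the pretest as covariate, via the statsmodels formula API:

```python
# Sketch: ANOVA on gain scores vs. ANCOVA on posttest scores (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50  # participants per group (hypothetical)
pre = rng.normal(50, 10, 2 * n)                    # pretest scores
group = np.repeat(["control", "treatment"], n)     # randomized assignment
effect = np.where(group == "treatment", 5.0, 0.0)  # true treatment effect
post = pre + effect + rng.normal(0, 5, 2 * n)      # posttest scores

df = pd.DataFrame({"group": group, "pre": pre, "post": post,
                   "gain": post - pre})

# ANOVA on gain scores: one-way comparison of mean gains.
gain_fit = smf.ols("gain ~ C(group)", data=df).fit()

# ANCOVA: posttest means adjusted for the pretest covariate.
ancova_fit = smf.ols("post ~ C(group) + pre", data=df).fit()

print("gain-score estimate:", gain_fit.params["C(group)[T.treatment]"])
print("ANCOVA estimate:   ", ancova_fit.params["C(group)[T.treatment]"])
```

With randomized groups, both analyses estimate the same treatment effect; ANCOVA typically does so with smaller error variance, which is the rationale given above for preferring it, whereas with nonrandomized intact groups its role shifts to adjusting posttest means for pretest differences.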
References
[1] J. Bellini and P. Rumrill, Research in rehabilitation counseling, Charles C. Thomas, Springfield, IL, 1999.
[2] R.D. Bock, Basic issues in the measurement of change, in: Advances in Psychological and Educational Measurement, D.N.M. DeGruijter and L.J.Th. Van der Kamp, eds, John Wiley & Sons, NY, 1976, pp. 75–96.
[3] A.D. Bryk and H. I. Weisberg, Use of the nonequivalent control group design when subjects are growing, Psychological Bulletin 85 (1977), 950–962.
[4] I.S. Cahen and R.L. Linn, Regions of significant criterion difference in aptitude-treatment interaction research, American Educational Research Journal 8 (1971), 521–530.
[5] L.J. Cronbach and L. Furby, How should we measure change - or should we? Psychological Bulletin 74 (1970), 68–80.
[6] D.M. Dimitrov, S. McGee and B. Howard, Changes in students’ science ability produced by multimedia learning environments: Application of the Linear Logistic Model for Change, School Science and Mathematics 102(1) (2002), 15–22.
[7] G.H. Fischer, Some probabilistic models for measuring change, in: Advances in Psychological and Educational Measurement, D.N.M. DeGruijter and L.J.Th. Van der Kamp, eds, John Wiley & Sons, NY, 1976, pp. 97–110.
[8] G.H. Fischer and E. Ponocny-Seliger, Structural Rasch modeling: Handbook of the usage of LPCM-WIN 1.0, ProGAMMA, Groningen, Netherlands, 1998.
[9] R.K. Hambleton, H. Swaminathan and H. J. Rogers, Fundamentals of Item Response Theory, Sage, Newbury Park, CA, 1991.
[10] S.W. Huck and R.A. McLean, Using a repeated measures ANOVA to analyze data from a pretest-posttest design: A potentially confusing task, Psychological Bulletin 82 (1975), 511–518.
[11] S. Isaac and W.B. Michael, Handbook in research and evaluation, 2nd ed., EdITS, San Diego, CA, 1981.
[12] E. Jennings, Models for pretest-posttest data: repeated measures ANOVA revisited, Journal of Educational Statistics 13 (1988), 273–280.
[13] K.G. Jöreskog and D. Sörbom, Statistical models and methods for test-retest situations, in: Advances in Psychological and Educational Measurement, D.N.M. DeGruijter and L.J.Th. Van der Kamp, eds, John Wiley & Sons, NY, 1976, pp. 135–157.
[14] R.L. Linn and J.A. Slinde, The determination of the significance of change between pre- and posttesting periods, Review of Educational Research 47 (1977), 121–150.
[15] F.M. Lord, The measurement of growth, Educational and Psychological Measurement 16 (1956), 421–437.
[16] S. Maxwell, H.D. Delaney and J. Manheimer, ANOVA of residuals and ANCOVA: Correcting an illusion by using model comparisons and graphs, Journal of Educational Statistics 95 (1985), 136–147.
[17] G.J. Mellenbergh, A note on simple gain score precision, Applied Psychological Measurement 23 (1999), 87–89.
[18] J.E. Overall and J. A. Woodward, Unreliability of difference scores: A paradox for measurement of change, Psychological Bulletin 82 (1975), 85–86.
[19] D. Rogosa, D. Brandt and M. Zimowski, A growth curve approach to the measurement of change, Psychological Bulletin 92 (1982), 726–748.
[20] I. Rop, The application of a linear logistic model describing the effects of preschool education on cognitive growth, in: Some mathematical models for social psychology, W.H. Kempf and B.H. Repp, eds, Huber, Bern, 1976.
[21] D. Sörbom, A statistical model for the measurement of change in true scores, in: Advances in Psychological and Educational Measurement, D.N.M. DeGruijter and L.J.Th. Van der Kamp, eds, John Wiley & Sons, NY, 1976, pp. 159–170.
[22] J. Stevens, Applied multivariate statistics for the social sciences, 3rd ed., Lawrence Erlbaum, Mahwah, NJ, 1996.
[23] D.W. Zimmerman and R.H. Williams, Gain scores in research can be highly reliable, Journal of Educational Measurement 19 (1982), 149–154.