KINGSBURY RESEARCH CENTER

Setting Response Time Thresholds for a CAT Item Pool: The Normative Threshold Method

Steven L. Wise and Lingling Ma
Northwest Evaluation Association

Paper presented at the 2012 annual meeting of the National Council on Measurement in Education, Vancouver, Canada

Send correspondence to:
Northwest Evaluation Association (NWEA)
121 NW Everett St.
Portland, OR 97209
(503) 915-6118
Abstract

Identification of rapid-guessing test-taking behavior is important because it indicates the presence of non-effortful item responses that can negatively bias examinee proficiency estimates. Identifying rapid guesses requires the specification of a time threshold for each item. This requirement can be practically challenging for computerized adaptive tests (CATs), which use item pools that may contain thousands of items. This study investigated a new method—termed the normative threshold method—for identifying item thresholds. The results showed that the normative threshold method could markedly outperform a common three-second threshold (which has been used in previous research) in identifying non-effortful behavior.
Setting Response Time Thresholds for a CAT Item Pool: The Normative Threshold Method

The goal of educational measurement is to provide test scores that validly indicate what examinees know and can do. To attain this goal, test givers have traditionally focused on developing tests that are of suitable length and contain items that are sufficiently representative of the content domain of interest. In addition, attention is paid to how these tests are given. Standardized test administration procedures are typically followed, in settings that are conducive to obtaining valid scores from examinees.

In practice, however, obtaining valid scores from a test administration requires more than simply presenting items to examinees under standardized conditions. A valid score also requires a motivated examinee who behaves effortfully throughout a test event. It is not uncommon, though, for some examinees to exhibit test-taking behavior that is non-effortful during a standardized test. For these examinees, the resulting scores are less valid because they are likely to underestimate what the examinees actually know and can do.

The idea that score validity can vary across examinees is consistent with the concept of individual score validity (ISV; Hauser, Kingsbury, & Wise, 2008; Kingsbury & Hauser, 2007; Wise, Kingsbury, & Hauser, 2009). ISV conceptualizes an examinee's test performance as being potentially influenced by one or more construct-irrelevant factors (Haladyna & Downing, 2004). ISV for a particular test event is driven in part by the degree to which the resulting score is free of these factors. Thus, to obtain a valid score from a particular examinee, the test giver has the dual challenge of (a) developing a test that is capable of producing valid scores and (b) conducting a test event that minimizes the impact of construct-irrelevant factors on that examinee's test performance.

Unfortunately, examinee effort is a construct-irrelevant factor that lies largely outside the test giver's control. In particular, whenever the test stakes are low from an examinee's perspective, there is a realistic chance that the examinee will not become or remain engaged in devoting effort to the test. If enough disengagement occurs during the test event, low ISV will result. The impact of low effort on test performance can be sizable; a synthesis of 15 studies of test-taking motivation indicated that less
motivated examinees showed mean test performance to be .58 standard deviations lower than that of their more motivated peers (Wise & DeMars, 2005).

Although there appears to be a generalized recognition among test givers that low examinee effort threatens the validity of test scores, until recently, methods have not been available to help them manage the problem. All of these methods require the ability to measure examinee effort. The current study looks at the analysis of examinee effort in the context of a computerized adaptive test (CAT).

Measurement of Examinee Effort

Self-Report Measures. A popular method for measuring effort is to give examinees a self-report instrument about the effort they gave to a test they had just taken. Such instruments are easy to administer and require relatively few items to attain adequate reliability. For example, the Student Opinion Scale (Sundre & Moore, 2002), which uses five Likert-style items to measure examinee effort, commonly exhibits internal consistency reliability estimates in the .80s.

Despite its ease of use, self-reported effort has several limitations. First, it is unclear how truthful examinees will be when asked about their test-taking effort. Self-report measures are potentially vulnerable to response biases. Examples include low-effort examinees who report high effort because they fear punishment from test givers and high-effort examinees who report low effort because they felt they did poorly on the test and tend to attribute failure to low effort (Pintrich & Schunk, 2002). Second, self-report measures—which typically ask examinees for an overall assessment of their test-taking effort—are not sensitive to changes in effort during test events, which often occur (Wise & Kong, 2005). A third limitation of self-report measures is that they require the awkward assumption that examinees who did not give good effort to their test were sufficiently motivated to seriously complete the self-report instrument regarding their lack of effort.

Response Time-Based Measures. The idea that response time is associated with test-taking effort has its roots in the work of Schnipke and Scrams (1997, 2002), who studied examinee behavior at the end of timed high-stakes tests. They found that, as the time limit approached, some examinees switched from trying to work out the answer to items (which they termed solution behavior) to rapidly
answering remaining items in hopes of getting some correct by chance (termed rapid-guessing behavior). Schnipke and Scrams found that responses made under such rapid-guessing behavior tended to be correct at a rate consistent with random responding.

Wise and Kong (2005) observed that rapid-guessing behavior also occurred during untimed low-stakes tests. They showed that in this context rapid guessing indicated instances where examinees were not motivated to give their best effort. Similar to the findings of Schnipke and Scrams (1997, 2002), rapid guesses on low-stakes tests have accuracy rates resembling those expected by random responding (Wise & Kong, 2005; Wise, 2006). In addition, rapid-guessing behavior has been found to be relatively unrelated to examinee ability.

Regardless of the stakes of the test, rapid-guessing behavior indicates that the examinee has ceased trying to show what he knows and can do. The motives, however, differ with test stakes. In a high-stakes test, a good test performance helps the examinee obtain something he wants (e.g., a high grade, graduation, job certification). In this case, rapid-guessing behavior is used strategically by the examinee in the hopes of raising his score by correctly guessing some items that would otherwise be left unanswered (and therefore be scored as incorrect). In contrast, an examinee who is responding rapidly on a low-stakes test is ostensibly trying to get the test over with, rather than trying to maximize his score.

It is helpful to clarify what is meant by solution behavior, because there is an asymmetry to the inferences being made. It is assumed that all rapid guesses are non-effortful. It is not assumed, however, that all responses classified as solution behavior are effortful. We recognize that there may be non-effortful responses made slowly by an examinee, and such responses are not readily distinguishable from slowly-made effortful responses. Hence, we characterize effort analysis using response time as identifying item responses that should not be trusted to be informative about an examinee's proficiency level.

Response time-based measurement of test-taking effort has three advantages. First, it is based on student behavior that can be objectively and unobtrusively collected by the computer. It does not rely on
self-report data, which may be biased. Second, it has been shown to have high internal consistency. Third, it can evaluate effort on an item-by-item basis, which allows us to assess the possibility that a student's level of effort changes during the test event. The primary limitation of response time-based measures is that the collection of response time data requires that the test under study be computer administered.

The identification of rapid-guessing behavior is important because it indicates the presence of item responses that are uninformative about an examinee's proficiency level. Worse, rapid guesses tend to exert a negative bias on a proficiency estimate because rapid guesses are correct at a rate that is usually markedly lower than what would have been the case had the examinee exhibited solution behavior. The more rapid guesses occur during a test event, the more negative bias is likely present in a test score. Thus, the presence of rapid-guessing behavior—if it is pervasive enough—provides an indicator that a score has low ISV.

In conducting an analysis of response time-based effort on a test, there are two basic operational questions that must be addressed. First, exactly what comprises rapid-guessing behavior on an item? To keep the effort analysis as objective as possible, it is important to specify an operational definition of rapid guessing. Second, how extensive does rapid-guessing behavior have to be during a test event to warrant action on the part of the test giver? For example, if a test giver decided to identify test scores reflecting low ISV on score reports, how many rapid guesses would be required to warrant such a label? The current study focuses on the first question in the context of a CAT.

Identification of Rapid Guesses

The typical approach to identifying rapid guessing has been to use item response time to classify each item response as reflecting either solution behavior or rapid-guessing behavior. Differentiating between rapid-guessing behavior and solution behavior requires the establishment of a time threshold for each item such that all responses occurring faster than that threshold are classified as rapid guesses, and responses occurring slower are deemed solution behaviors.
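To make this classification rule concrete, here is a minimal Python sketch; the function and variable names are illustrative assumptions rather than part of any operational testing system. It simply compares a response time against an item's threshold and returns the behavior category.

def classify_response(response_time_seconds, item_threshold_seconds):
    """Classify a single item response as a rapid guess or solution behavior.

    A response faster than the item's time threshold is treated as a rapid
    guess; all other responses are treated as solution behavior.
    """
    if response_time_seconds < item_threshold_seconds:
        return "rapid_guess"
    return "solution_behavior"

# Example: with a 4-second threshold, a 2.1-second response is a rapid guess.
print(classify_response(2.1, 4.0))   # rapid_guess
print(classify_response(9.8, 4.0))   # solution_behavior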
There are two principles underlying the choice of a time threshold. First, it is desirable to identify as many instances of non-effortful item responses as possible. Second, it is important to avoid classifying effortful responses as non-effortful. There is a tension between the two principles; the first encourages us to choose a longer threshold, while the second encourages us to choose a shorter one. How we balance the two principles depends on our data-analytic goals. For example, if test data from a group of examinees are being used to calibrate item parameters and we want to cleanse the data of as many rapid guesses as possible, the first principle might be of higher concern. In contrast, if test scores are used to make inferences about the instructional needs of individual examinees, fairness considerations for the examinees might suggest that the second principle would be the predominant concern.

Several different methods have been examined for identifying time thresholds for a given item. Schnipke and Scrams (1997) noted that, while a frequency distribution of response times to an item typically has a positively skewed unimodal shape, the presence of rapid guessing results in an initial frequency spike occurring during the first few seconds the item is displayed. Schnipke and Scrams conceptualized this as indicating that rapid-guessing behavior and solution behavior exhibit distinctly different response time distributions. Based on this, they used a two-state mixture model to identify a time threshold for each item. Wise and Kong (2005) based their time thresholds on surface features of the items (e.g., the number of characters in the stem and options), reasoning that items requiring more reading should have longer thresholds. Wise (2006) alternatively proposed that the time threshold for an item could be identified through visual inspection of its response time frequency distribution, with the time threshold corresponding to the end of the initial frequency spike. Kong, Wise, and Bhola (2007) compared these three variable-threshold identification methods, finding that they produced similar thresholds.

Threshold Identification in a CAT

Threshold identification becomes more challenging in an adaptive testing context. A CAT selects items from pools that contain hundreds, if not thousands, of items. In addition, these item pools typically are dynamic, with new items being continually added. Thus, relative to a conventional fixed test, the task
of identifying variable thresholds for CAT items is necessarily more time and resource intensive. Because of this, much of the early effort analysis research with CATs has used a common three-second threshold (Wise, Kingsbury, Thomason, & Kong, 2004; Wise, Kingsbury, & Hauser, 2009; Wise, Ma, Kingsbury, & Hauser, 2010), because it could be immediately applied to any item pool.

Despite its ease of use, a common threshold may not be adequate for operational use with CATs such as the Measures of Academic Progress (MAP), which draws on large item pools to assess the academic growth of primary and secondary school students. Hauser and Kingsbury (2009) argued that a three-second threshold is too conservative for many items that contain a lot of reading, such as those that would require more than a minute for an engaged student with strong reading skills to read. For these items, use of a three-second threshold would violate the first threshold principle described earlier. On the other hand, an engaged student might be able to select the correct answer for some basic math items (e.g., "What is 4 x 5?") effortfully in less than three seconds. In these instances, a three-second threshold could classify effortful test-taking behavior as non-effortful, which would violate the second threshold principle. In addition, Kong et al. (2007) found that variable-threshold methods performed somewhat better than a common three-second threshold in terms of both threshold agreement and the impact of motivation filtering (Sundre & Wise, 2003; Wise & DeMars, 2005) on convergent validity. For these reasons, a common threshold was deemed unacceptable for MAP, and a practical variable-threshold method was needed.

Ma, Wise, Thum, and Kingsbury (2011) studied the use of several variable-threshold methods with MAP data. Three versions of visual inspection were investigated: inspection of response time distributions, inspection of both response time and response accuracy distributions, and inspection of response time and response accuracy distributions along with item content. They found that agreement between judges was low for all three of the inspection methods. In addition, they explored the use of two statistical threshold identification methods (mixture models and a non-parametric method), finding that neither method exhibited practical utility.
Because inspection methods and statistical methods have not been found sufficiently useful in identifying thresholds for MAP data, and because the calculation of thresholds based on surface features was considered too resource intensive, there is a need for a threshold identification method that can be practically applied to large and frequently changing item pools. The current study introduces and investigates a new variable-threshold method in the analysis of examinee effort on the adaptive MAP assessment.

The Normative Threshold Method

Items vary in the time spent by examinees in responding to them. This variation can be attributed to multiple factors, such as the amount of reading required by the items or how mentally taxing they are. Similarly, what might be called a rapid response varies according to the time demands of the items. In the Normative Threshold (NT) method, the time threshold for a particular item is defined as a percentage of the item's mean response time (i.e., the mean elapsed time between when the item is displayed and when a response is given), up to a maximum threshold value of 10 seconds. For example, if it takes examinees an average of 40 seconds to respond to a particular item, a 10 percent threshold (NT10) would be 4 seconds (that is, 10% of 40 seconds), whereas an NT15 threshold would be 6 seconds. A maximum threshold value of 10 seconds was chosen following the observation by Setzer, Wise, van den Heuvel, and Ling (in press) that it might be problematic to credibly characterize a response taking longer than 10 seconds as a rapid guess.

The NT method requires only (a) the mean response time for each item in a pool (which can be readily calculated and stored based on earlier administrations of the item) and (b) a chosen percentage value. Establishing a suitable percentage value involves balancing the two threshold principles. The goal is to identify as many non-effortful responses as possible, with the constraint that responses deemed rapid guesses should be correct at a rate that is not higher than one would expect from random responses. In the current study, three threshold values were studied: NT10, NT15, and NT20.
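To make the threshold rule concrete, the short Python sketch below computes an NT threshold from an item's mean response time and a chosen percentage, applying the 10-second cap. The function name and inputs are assumptions made for illustration, not part of the MAP system. It reproduces the worked example above: an item with a 40-second mean response time receives an NT10 threshold of 4 seconds and an NT15 threshold of 6 seconds.

def normative_threshold(mean_response_time_seconds, percent, max_threshold_seconds=10.0):
    """Return the NT time threshold for an item.

    The threshold is the chosen percentage of the item's mean response time,
    capped at a maximum value (10 seconds in the current study).
    """
    raw_threshold = (percent / 100.0) * mean_response_time_seconds
    return min(raw_threshold, max_threshold_seconds)

# Worked example from the text: an item with a 40-second mean response time.
print(normative_threshold(40.0, 10))   # 4.0  (NT10)
print(normative_threshold(40.0, 15))   # 6.0  (NT15)
print(normative_threshold(40.0, 20))   # 8.0  (NT20)

# The cap matters for items with long mean response times, e.g., 120 seconds.
print(normative_threshold(120.0, 15))  # 10.0 (capped at the maximum)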
Method

Data Set and Sample

All of the tests administered were part of the Northwest Evaluation Association's (NWEA's) MAP testing system. MAP tests are untimed interim CATs; the tests in Math are generally 50 multiple-choice items in length, while those in Reading are generally 40 multiple-choice items in length. MAP proficiency estimates are expressed as scale (RIT) scores on a common scale that allows growth to be assessed when examinees are tested at multiple times. Over 285,000 examinees in grades 3-9 from a single U.S. state were administered MAP in both the fall and spring of the 2010-2011 academic year. For each examinee, a growth score was generated (calculated as RIT_Spring - RIT_Fall). MAP growth scores were generated for both Math and Reading.

Outcome Measures

Ideally, a criterion would be available that specified which of the item responses were non-effortful and which were not. This would allow a direct comparison among the three NT methods relative to the two threshold principles discussed earlier. In the absence of such a criterion, a set of test events that indicated the presence of non-effortful test-taking behavior was studied, and the degree to which those test events were identified using the three NT methods was assessed.

A practical measurement problem frequently encountered with MAP data is the occurrence of growth scores whose values are difficult to interpret because their direction and/or magnitude are not considered credible. These types of scores come in two forms. During a school year, some students exhibit negative growth values that are much larger in magnitude than can be attributed to measurement error. Such negative values are problematic because they imply that an examinee knows substantially less in the spring than they did the previous fall. Wise et al. (2009) showed that students exhibiting effortful test-taking behavior during fall testing, but not during spring testing, could explain many instances of large negative growth values. Conversely, some students exhibit unrealistically large positive growth values, which can be explained by students exhibiting effortful test-taking behavior in the spring, but not in the fall.
Non-credible growth scores (either positive or negative) were used as indicators of non-effortful test events. One of the current study's outcome measures was the percentage of non-credible growth scores that were accounted for by applying a set of response time-based flagging criteria (examples of which were described in Wise et al., 2009) for each of the threshold methods. This allowed a comparison of the NT methods relative to the first threshold principle.

The accuracy rate of item responses classified as rapid guesses comprised the second outcome measure. Rapid guesses should show an accuracy rate similar to that expected by chance through random responding (Wise & Kong, 2005). Accuracy rates exceeding chance would indicate the inclusion of effortful responses (whose accuracy rates were substantially higher than chance). Hence, the accuracy rate provides an indication of the extent to which the second threshold principle had been violated by a particular threshold method. For Math MAP items, which contain five response options, the random accuracy rate should be .20; for four-choice Reading items, the accuracy rate should be .25.

This is an oversimplification, however, because test takers exhibiting rapid-guessing behavior do not actually respond randomly. It has been shown that examinees are somewhat less likely to guess either the first or the last response option and somewhat more likely to guess one of the middle options (Attali & Bar-Hillel, 2003; Wise, 2006). This is an example of "middle bias," a well-known psychological phenomenon that explains why people sometimes have difficulty behaving randomly (even when instructed to do so). Middle bias would suggest that the expected rapid-guessing accuracy rate for an item depends in part on the location of the correct answer. For example, rapid guesses to a Math item whose answer is option "a" might have an accuracy rate of .17, while those to a Math item whose answer is "b" might have an accuracy rate of .24. If the positions of the correct answers are reasonably balanced across a set of items, however, the overall accuracy rate of rapid guesses should be close to what would be expected under random responding. An analysis of the MAP item pools revealed that the pools were slightly imbalanced, with a disproportionately higher number of items having correct answers in the second and third response option positions. (Item writers are also vulnerable to middle bias, which would result in their tending to more often construct items with the correct answers in the middle response positions; Attali & Bar-Hillel, 2003.) Thus, it was expected that rapid guesses would be correct at a rate slightly above that expected purely by chance.
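As a brief illustration of this point, the sketch below computes the overall expected rapid-guess accuracy as a weighted average of position-specific guessing rates and the proportion of items keyed at each position. The guessing rates and keying proportions used here are hypothetical values chosen only to illustrate the arithmetic; they are not estimates from the MAP item pools.

# Hypothetical probabilities that a rapid guesser selects each of five options
# (middle options somewhat more likely, consistent with middle bias).
guess_rates = {"a": 0.17, "b": 0.24, "c": 0.24, "d": 0.20, "e": 0.15}

# Hypothetical proportion of items keyed at each position (slightly imbalanced
# toward the second and third positions).
key_proportions = {"a": 0.18, "b": 0.22, "c": 0.22, "d": 0.20, "e": 0.18}

# Expected overall accuracy of rapid guesses: sum over positions of
# P(item keyed at position) * P(rapid guess lands on that position).
expected_accuracy = sum(key_proportions[pos] * guess_rates[pos] for pos in guess_rates)
print(round(expected_accuracy, 3))  # 0.203, slightly above the .20 chance rate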
Data Analysis

Comparisons were made among the three normative threshold methods (NT10, NT15, and NT20), along with a common three-second threshold (Common3). Mean response times were identified for items in both the Math and Reading item pools, and thresholds for NT10, NT15, and NT20 were calculated for each item. Next, for each threshold method, each item response in a test event was classified as either rapid-guessing behavior or solution behavior. The test events were then evaluated for each threshold method using five flagging criteria designed to identify low-ISV scores (Wise et al., 2009). Three of the flagging criteria were based on the preponderance and pattern of rapid guesses in the data, with the other two criteria based on the accuracy rates of sets of item responses. A test event was classified as low ISV if it was flagged by at least one of the five criteria. Finally, the accuracy rates for all rapid guesses and solution behaviors were calculated (aggregating across both items and examinees) for each threshold method.
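To outline how the classification and aggregation steps fit together, the Python sketch below processes the responses for one NT percentage and summarizes two of the outcomes reported later (the proportion of rapid guesses and the accuracy of each behavior class). The data structure and function name are assumptions made for illustration, and the five ISV flagging criteria are not implemented here; their definitions appear in Wise et al. (2009).

from statistics import mean

def summarize_threshold_method(responses, percent, max_threshold=10.0):
    """Classify responses for one NT method and summarize the outcomes.

    `responses` is assumed to be a list of dicts with keys:
      'rt'      - response time in seconds,
      'correct' - True/False,
      'mean_rt' - the item's mean response time from earlier administrations.
    Returns the proportion of rapid guesses and the accuracy rates of
    rapid guesses and solution behaviors.
    """
    rapid, solution = [], []
    for r in responses:
        threshold = min((percent / 100.0) * r["mean_rt"], max_threshold)
        (rapid if r["rt"] < threshold else solution).append(r)
    return {
        "proportion_rapid": len(rapid) / len(responses),
        "rapid_accuracy": mean(r["correct"] for r in rapid) if rapid else None,
        "solution_accuracy": mean(r["correct"] for r in solution) if solution else None,
    }

# Example with toy data (hypothetical values).
toy = [
    {"rt": 2.0, "correct": False, "mean_rt": 40.0},
    {"rt": 35.0, "correct": True, "mean_rt": 40.0},
    {"rt": 3.5, "correct": False, "mean_rt": 25.0},
    {"rt": 28.0, "correct": True, "mean_rt": 25.0},
]
print(summarize_threshold_method(toy, percent=10))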
Results

Descriptive statistics for the four threshold methods are shown in Table 1. Each of the NT methods yielded a range of threshold values, with some values less than three seconds and some exceeding three seconds. The median threshold value exceeded three seconds for each NT method and, as the NT percentage increased, increases were observed in both the median threshold value and the number of thresholds set at the maximum value of ten seconds. Figure 1 shows the distributions of the threshold values for the NT methods.

The generally higher threshold values for the NT methods imply that they would identify rapid guesses more often than would Common3. Table 1 shows that the proportion of responses classified as rapid guesses varied substantially across the methods, with Common3 identifying rapid guesses about 1% of the time, and NT20 about 3-4% of the time.

Table 2 shows the results of applying the ISV flagging criteria based on rapid guesses identified under the various threshold methods. More test events were flagged by the NT methods, with the numbers of flagged test events increasing with the NT percentage value. In math, a greater number of test events were flagged in the fall testing session only (ranging from 2.7% to 6.6%) than in the spring testing session only (ranging from 1.7% to 2.5%). In reading, the pattern of results was similar, except that the prevalence of low ISV was generally at least 80% higher than that found for math. The overall numbers of flagged test events suggested that there were numerous instances of extreme positive or negative growth scores in the data for both math and reading.

The distributions of growth scores from examinees whose scores were flagged for low ISV during fall testing only are shown in Table 3. These were examinees for whom the low ISV in the fall would be expected to exert a positive bias on the growth scores. The majority of these growth scores were positive, as their values were subject to the joint influence of the true proficiency gains of the students and the positive bias introduced by the presence of low ISV in the fall session. Regardless of threshold method, there were a substantial number of high positive growth scores that were flagged. In math, there were at least 1,600 scores that exceeded +20; in reading, at least 3,000 exceeded +20.

Removing the low-ISV test events identified in Table 3 from the overall sample had a differential impact on the growth score intervals. Table 4 shows the impact for each threshold method. As expected, the greatest impact was seen for the most extreme positive scores (i.e., those that were +20 or higher). The NT methods accounted for substantially more extreme positive scores than did Common3, and the NT20 method accounted for the most, followed by NT15 and then NT10. In addition, higher percentages of the extreme positive scores were accounted for in reading than in math. Overall, however, the percentages of high positive growth scores that were accounted for by low ISV were moderate. For the most extreme positive growth (i.e., greater than +30), between 29% and 42% of the scores were accounted for by low ISV in math and between 39% and 58% in reading.
Tables 5 and 6 show the same analyses for examinees whose scores were flagged for low ISV during spring testing only. Low ISV in the spring scores would be expected to exert a negative bias on growth scores. In Table 5, the distributions of growth scores for these examinees are shown. The values in the rows of Table 5 are more symmetric than those shown in Table 3, reflecting the combination of true growth (which tends to be positive) with the negative effects of low ISV. Table 6 shows the effects of removing examinees exhibiting low ISV in the spring only from the overall sample. The pattern of results is similar to that found in Table 4, except that, as expected, the majority of the scores removed were negative. Moreover, the percentages of high negative growth scores accounted for by low ISV were higher. For the most extreme negative growth (i.e., less than -30), between 87% and 97% of the scores were accounted for by low ISV in math and between 73% and 82% in reading.

As Table 6 illustrates, the NT methods outperformed the Common3 method in accounting for negative growth scores. Furthermore, NT20 accounted for the highest percentage, followed by NT15, and then NT10. The relative performance of the NT methods, however, follows logically from their definition and should therefore not be surprising. Because each NT15 threshold is at least as high as its NT10 counterpart, the NT15 method classified as rapid guesses all of the item responses that were classified as such by NT10, plus some additional responses. Similarly, NT20 classified as rapid guesses all of the responses that were classified by NT15, plus some additional responses. This relationship among NT10, NT15, and NT20 was also observed when the ISV flagging criteria were applied to the test events. The higher the NT threshold value, the higher the percentage of non-credible growth scores that would be accounted for. This is consistent with the first threshold principle. It is clear, however, that as the NT threshold values increase, there will be an increased likelihood that effortful responses will be misclassified as rapid guesses (which violates the second threshold principle). The impact of the threshold methods relative to the second principle was evaluated by considering the accuracy rates of responses classified as either solution behavior or rapid-guessing behavior.
Table 7 shows the accuracy rates for each threshold method. In math, solution behaviors showed accuracy rates very close to the value of .50 expected under a CAT, for each of the methods. Rapid guesses for math items—which had five response options—should exhibit accuracy rates close to .20. This occurred for Common3 and NT10. In contrast, both NT15 and NT20 exhibited accuracy rates noticeably higher than .20, indicating that some effortful responses (whose accuracy rates should be closer to .50) had been classified as rapid guesses. The presence of these effortful responses can be seen more clearly if only the additional rapid guesses that were gained by moving from one NT method to another are considered. When moving from NT10 to NT15, the additional rapid guesses gained were correct 31% of the time. When moving from NT15 to NT20, they were correct 47% of the time.

The same pattern of results was found in reading (also shown in Table 7), whose items had four response options. The accuracy of solution behaviors was close, for each of the four methods, to the expected value of .50. Regarding rapid-guessing behavior, Common3 and NT10 exhibited accuracy rates reasonably close to the expected rapid-guessing accuracy rate of .25. NT15 and NT20 again exhibited signs that they were classifying some effortful responses as rapid guesses. When moving from NT10 to NT15, the additional rapid guesses gained were correct 38% of the time. When moving from NT15 to NT20, they were correct 51% of the time.

A final data analysis examined the correlations between the fall and spring RIT scores. Wise et al. (2009) showed that the fall-spring RIT correlations increased after low-ISV scores were removed. These increased correlations were interpreted as indicating that construct-irrelevant variance—which had been attenuating the correlations—had been removed. Table 8 shows the fall-spring correlations in the current study, by threshold method. In math, the correlations increased from .91 to .92 for each of the methods. In reading, the correlation increased from .89 to .91 for NT10, but increased only to .90 with the other three methods.
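A minimal sketch of this final analysis is given below, assuming a simple record layout of paired fall and spring RIT scores plus a low-ISV flag; all values and the record layout are hypothetical, used only to show how the correlation is compared before and after flagged test events are removed.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Each record: (fall RIT, spring RIT, flagged as low ISV under some method?)
records = [
    (210, 218, False),
    (195, 204, False),
    (222, 230, False),
    (201, 178, True),   # large negative growth; flagged for low ISV in spring
    (188, 196, False),
]

kept = [r for r in records if not r[2]]

# Correlation with all scores, then with low-ISV test events removed.
print(round(pearson([r[0] for r in records], [r[1] for r in records]), 3))
print(round(pearson([r[0] for r in kept], [r[1] for r in kept]), 3))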
Discussion

Identification of rapid-guessing behavior for a CAT requires that a practical, effective method be available for establishing time thresholds for pools of items. Choice of a threshold method should consider the two threshold principles. The first principle states that the method should account for as many non-effortful item responses as possible. The second principle states that the method should not classify effortful responses as non-effortful. The current study explored the validity of a new variable-threshold identification method—the normative threshold method—and sought to identify a threshold value that best balances the two principles.

As a point of reference, the Common3 method, which has been used previously for effort analysis with MAP scores, did a reasonably good job of balancing the two principles. It identified low-ISV test events that accounted for a large proportion of the extreme growth scores that were accounted for by any of the methods. Moreover, the accuracy rates of solution behaviors and rapid-guessing behaviors classified by Common3 were very close to the expected values. Finally, removing its low-ISV scores from the sample increased the fall-spring RIT score correlations.

Each of the NT methods studied identified markedly more low-ISV test events than Common3 and, as a consequence, the NT methods identified higher percentages of non-credible growth scores. However, when accuracy rates were considered, only the NT10 method maintained accuracy for solution behaviors and rapid-guessing behaviors at their expected rates. The NT15 and NT20 methods showed clear signs that they were classifying effortful responses as rapid guesses. In addition, the NT10 method had the strongest positive effect (albeit slightly) on the fall-spring RIT correlations. Hence, overall, the NT10 method showed the best performance and is therefore recommended.

The differences in the performance of the various NT methods illustrate the absence of a clear line between rapid guessing and effortful responding. The second threshold principle, that we should not classify as non-effortful any responses that were actually effortful, implies that we will necessarily act conservatively in our decisions about the validity of test events. For example, suppose that the time threshold for a particular item is 2.6 seconds. If a given student responded in 2.7 seconds, it may still be
more highly probable that the student was rapid guessing, but now there is also a realistic possibility that a student who reads, thinks, and responds very quickly could have engaged in solution behavior and answered the item effortfully. In this situation, the response would be classified as reflecting solution behavior even though it may be much more likely that it was actually a rapid guess. As a result, decisions are biased toward classifying ambiguous response times as reflecting effortful behavior, which means that we will sometimes fail to flag test events whose scores actually reflect low ISV. This explains, at least in part, why higher percentages of extreme growth scores were not accounted for by the ISV flagging criteria.

The general conclusion of this study is that the normative threshold method can provide a practical and effective way to establish time thresholds for identifying rapid-guessing behavior. In particular, the NT10 method was found to provide a good balance between the two threshold principles. It can be readily applied to a CAT item pool containing thousands of items, as it requires only the average response time that examinees spend on each item. Such item information can be routinely extracted from field test or operational data and stored as an attribute of each item in the pool.
References

Attali, Y., & Bar-Hillel, M. (2003). Guess where: The position of correct answers in multiple-choice test items as a psychometric variable. Journal of Educational Measurement, 40, 109-128.

Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17-27.

Hauser, C., Kingsbury, G. G., & Wise, S. L. (2008, March). Individual validity: Adding a missing link. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.

Hauser, C., & Kingsbury, G. G. (2009, April). Individual score validity in a modest-stakes adaptive educational testing setting. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Kingsbury, G. G., & Hauser, C. (2007, April). Individual validity in the context of an adaptive test. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Kong, X., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement, 67, 606-619.

Ma, L., Wise, S. L., Thum, Y. M., & Kingsbury, G. G. (2011, April). Detecting response time threshold under the computer adaptive testing environment. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Pintrich, P. R., & Schunk, D. H. (2002). Motivation in education: Theory, research, and applications (2nd ed.). Upper Saddle River, NJ: Merrill Prentice-Hall.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213-232.

Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behavior: Insights gained from response-time analyses. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments. Mahwah, NJ: Lawrence Erlbaum Associates.

Setzer, J. C., Wise, S. L., van den Heuvel, J. R., & Ling, G. (in press). An investigation of test-taking effort on a large-scale assessment. Applied Measurement in Education.

Sundre, D. L., & Moore, D. L. (2002). The Student Opinion Scale: A measure of examinee motivation. Assessment Update, 14(1), 8-9.

Sundre, D. L., & Wise, S. L. (2003, April). "Motivation filtering": An exploration of the impact of low examinee motivation on the psychometric quality of tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes, computer-based test. Applied Measurement in Education, 19, 95-114.
Wise, S. L., & DeMars, C. E. (2005). Examinee motivation in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10, 1-18.

Wise, S. L., Kingsbury, G. G., & Hauser, C. (2009, April). How do I know that this score is valid? The case for assessing individual score validity. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Wise, S. L., Kingsbury, G. G., Thomason, J., & Kong, X. (2004, April). An investigation of motivation filtering in a statewide achievement testing program. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163-183.

Wise, S. L., Ma, L., Kingsbury, G. G., & Hauser, C. (2010). An investigation of the relationship between time of testing and test-taking effort. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.
Table 1
Descriptive Statistics for the Four Threshold Methods

Content Area  Threshold Method  Minimum Threshold (s)  Maximum Threshold (s)  Median Threshold (s)  Proportion of Thresholds Set at 10 Seconds  Proportion of Responses Classified as Rapid Guesses
Math          Common3           3.00                   3.00                   3.00                  0.00                                        0.007
Math          NT10              1.15                   10.00                  5.31                  0.10                                        0.014
Math          NT15              1.73                   10.00                  7.96                  0.32                                        0.023
Math          NT20              2.31                   10.00                  10.00                 0.55                                        0.033
Reading       Common3           3.00                   3.00                   3.00                  0.00                                        0.014
Reading       NT10              1.33                   10.00                  6.49                  0.15                                        0.026
Reading       NT15              2.00                   10.00                  9.74                  0.49                                        0.034
Reading       NT20              2.67                   10.00                  10.00                 0.66                                        0.040

Table 2
The Numbers of Test Events Flagged for Low ISV in Fall and/or Spring Testing, by Threshold Method

Content Area  Threshold Method  In Neither Testing Session  In Fall Testing Only  In Spring Testing Only  In Both Testing Sessions
Math          Common3           272,415 (95.2%)             7,867 (2.7%)          4,953 (1.7%)            1,026 (0.4%)
Math          NT10              264,702 (92.5%)             12,128 (4.2%)         7,052 (2.5%)            2,379 (0.8%)
Math          NT15              258,556 (90.3%)             15,575 (5.4%)         8,457 (3.0%)            3,673 (1.3%)
Math          NT20              252,246 (88.1%)             18,839 (6.6%)         10,133 (3.5%)           5,043 (1.8%)
Reading       Common3           257,332 (89.4%)             16,715 (5.8%)         10,485 (3.6%)           3,158 (1.1%)
Reading       NT10              242,034 (84.1%)             23,737 (8.3%)         13,922 (4.8%)           7,997 (2.8%)
Reading       NT15              236,340 (82.2%)             26,594 (9.2%)         14,766 (5.1%)           9,990 (3.5%)
Reading       NT20              232,838 (80.9%)             28,252 (9.8%)         15,394 (5.4%)           11,206 (3.9%)

Note. The columns indicate when low ISV occurred. The values in parentheses are row percentages.
Table 3
The Numbers of Examinees in Various Growth Score Intervals, of Examinees Who Exhibited Low ISV in the Fall Testing Only

Content Area  Threshold Method  < -30  -30 to -20  -20 to -10  -10 to +10  +10 to +20  +20 to +30  > +30
Math          Common3           1      7           110         3,513       2,555       1,150       531
Math          NT10              2      15          179         5,778       3,949       1,614       591
Math          NT15              2      17          219         7,508       5,173       1,975       681
Math          NT20              1      13          241         9,226       6,345       2,279       734
Reading       Common3           13     76          518         8,531       4,528       2,033       1,016
Reading       NT10              12     96          668         12,439      6,669       2,719       1,134
Reading       NT15              10     91          675         13,772      7,655       3,124       1,267
Reading       NT20              9      80          694         14,475      8,260       3,387       1,347

Note. Column headings are growth score ranges.

Table 4
Percentage Decreases in the Number of Examinees From Various Growth Score Intervals When Low-ISV Test Events Were Removed, of Examinees Who Exhibited Low ISV in the Fall Testing Only

Content Area  Threshold Method  < -30  -30 to -20  -20 to -10  -10 to +10  +10 to +20  +20 to +30  > +30
Math          Common3           1%     2%          3%          2%          3%          7%          29%
Math          NT10              2%     4%          4%          3%          5%          10%         33%
Math          NT15              2%     5%          6%          4%          6%          13%         38%
Math          NT20              1%     4%          6%          5%          8%          15%         42%
Reading       Common3           5%     7%          6%          4%          7%          17%         39%
Reading       NT10              5%     11%         8%          6%          11%         23%         47%
Reading       NT15              5%     11%         8%          7%          13%         27%         54%
Reading       NT20              4%     10%         8%          8%          14%         29%         58%

Note. Column headings are growth score ranges; the standard errors of growth scores for Math and Reading were approximately four and five points, respectively. Examinees who exhibited low ISV in the fall testing only would be expected to exhibit high positive growth. Examinees triggering at least one effort flag during both the fall and spring testing sessions were excluded.
Table 5
The Numbers of Examinees in Various Growth Score Intervals, of Examinees Who Exhibited Low ISV in the Spring Testing Only

Content Area  Threshold Method  < -30  -30 to -20  -20 to -10  -10 to +10  +10 to +20  +20 to +30  > +30
Math          Common3           103    197         514         2,949       968         202         20
Math          NT10              99     217         628         4,491       1,341       250         26
Math          NT15              90     210         724         5,580       1,552       277         24
Math          NT20              88     214         827         6,885       1,812       283         24
Reading       Common3           197    413         1,180       6,133       1,924       526         112
Reading       NT10              170    424         1,503       8,635       2,421       633         136
Reading       NT15              166    444         1,609       9,343       2,483       609         112
Reading       NT20              167    451         1,665       9,930       2,500       589         92

Note. Column headings are growth score ranges.

Table 6
Percentage Decreases in the Number of Examinees From Various Growth Score Intervals When Low-ISV Test Events Were Removed, of Examinees Who Exhibited Low ISV in the Spring Testing Only

Content Area  Threshold Method  < -30  -30 to -20  -20 to -10  -10 to +10  +10 to +20  +20 to +30  > +30
Math          Common3           87%    51%         12%         2%          1%          1%          1%
Math          NT10              94%    60%         16%         3%          2%          2%          1%
Math          NT15              96%    61%         18%         3%          2%          2%          1%
Math          NT20              97%    66%         21%         4%          2%          2%          1%
Reading       Common3           73%    39%         13%         3%          3%          4%          4%
Reading       NT10              77%    47%         17%         4%          4%          5%          6%
Reading       NT15              79%    51%         19%         5%          4%          5%          5%
Reading       NT20              82%    54%         20%         5%          4%          5%          4%

Note. Column headings are growth score ranges; the standard errors of growth scores for Math and Reading were approximately four and five points, respectively. Examinees who exhibited low ISV in the spring testing only would be expected to exhibit high negative growth. Examinees triggering at least one effort flag during both the fall and spring testing sessions were excluded.
Table 7
Accuracy Rates for Item Responses Under Solution Behavior and Rapid-Guessing Behavior, by Threshold Method

Math Items (5 response options)
  Threshold Method  Accuracy of Solution Behaviors  Accuracy of Rapid-Guessing Behaviors
  Common3           50.2%                           21.2%
  NT10              50.5%                           20.9%
  NT15              50.7%                           24.4%
  NT20              50.7%                           31.0%
  Additional rapid guesses gained when moving from NT10 to NT15: 31.0%
  Additional rapid guesses gained when moving from NT15 to NT20: 47.0%

Reading Items (4 response options)
  Threshold Method  Accuracy of Solution Behaviors  Accuracy of Rapid-Guessing Behaviors
  Common3           52.2%                           25.5%
  NT10              52.6%                           27.1%
  NT15              52.7%                           29.1%
  NT20              52.7%                           31.8%
  Additional rapid guesses gained when moving from NT10 to NT15: 38.0%
  Additional rapid guesses gained when moving from NT15 to NT20: 51.0%

Table 8
Fall-Spring RIT Score Correlations After Removal of Low-ISV Scores, by Threshold Method

Content Area  Correlation When No Scores Were Removed  Common3  NT10  NT15  NT20
Math          .91                                      .92      .92   .92   .92
Reading       .89                                      .90      .91   .90   .90

Note. The Common3, NT10, NT15, and NT20 columns show the correlation when low-ISV scores were removed under each threshold method.
Figure 1. Histograms of item thresholds for each NT method, by content area (panels: Mathematics and Reading). Note the differences in vertical scales across graphs.