KINGSBURY RESEARCH CENTER
                       Setting Response Time Thresholds for a CAT Item Pool:
                                 The Normative Threshold Method

                                               Steven L. Wise and Lingling Ma
                                              Northwest Evaluation Association

Paper presented at the 2012 annual meeting of the National Council on
Measurement in Education, Vancouver, Canada
Send correspondence to:
Northwest Evaluation Association (NWEA)
121 NW Everett St.
Portland, OR 97209
(503) 915-6118


                                                                    Abstract

     Identification of rapid-guessing test-taking behavior is important because it indicates the presence of non-

     effortful item responses that can negatively bias examinee proficiency estimates. Identifying rapid

     guesses requires the specification of a time threshold for each item. This requirement can be practically

     challenging for computerized adaptive tests (CATs), which use item pools that may contain thousands of

     items. This study investigated a new method—termed the normative threshold method—for identifying

     item thresholds. The results showed that the normative threshold method could markedly outperform a

     common three-second threshold (which has been used in previous research) in identifying non-effortful

     behavior.


     Setting Response Time Thresholds for a CAT Item Pool: The Normative Threshold Method

                The goal of educational measurement is to provide test scores that validly indicate what

     examinees know and can do. To attain this goal, test givers have traditionally focused on developing tests

     that are of suitable length and contain items that are sufficiently representative of the content domain of

     interest. In addition, attention is paid to how these tests are given. Standardized test administration

procedures are typically followed, in settings that are conducive to obtaining valid scores from examinees.

                In practice, however, obtaining valid scores from a test administration requires more than simply

     presenting items to examinees under standardized conditions. A valid score also requires a motivated

     examinee, who behaves effortfully throughout a test event. It is not uncommon, though, for some

     examinees to exhibit test-taking behavior that is non-effortful during a standardized test. For these

     examinees, the resulting scores are less valid because they are likely to underestimate what the examinees

     actually know and can do.

                The idea that score validity can vary across examinees is consistent with the concept of individual

     score validity (ISV; Hauser, Kingsbury, & Wise, 2008; Kingsbury & Hauser, 2007; Wise, Kingsbury, &

     Hauser, 2009). ISV conceptualizes an examinee’s test performance as being potentially influenced by

     one or more construct-irrelevant factors (Haladyna & Downing, 2004). ISV for a particular test event is

     driven in part by the degree to which the resulting score is free of these factors. Thus, to obtain a valid

     score from a particular examinee, the test giver has the dual challenge of (a) developing a test that is

     capable of producing valid scores and (b) conducting a test event that minimizes the impact of construct-

     irrelevant factors on that examinee’s test performance.

                Unfortunately, examinee effort is a construct-irrelevant factor that lies largely outside the test

     giver’s control. In particular, whenever the test stakes are low from an examinee’s perspective, there is a

     realistic chance that the examinee will not become or remain engaged in devoting effort to the test. If

     enough disengagement occurs during the test event, low ISV will result. The impact of low effort on test

     performance can be sizable; a synthesis of 15 studies of test-taking motivation indicated that less


     motivated examinees showed mean test performance to be .58 standard deviations lower than that of their

     more motivated peers (Wise & DeMars, 2005).

                Although there appears to be a generalized recognition among test givers that low examinee effort

threatens the validity of test scores, until recently methods have not been available to help them manage the problem. All of these methods require that examinee effort be measured. The current study

     looks at the analysis of examinee effort in the context of a computerized adaptive test (CAT).

     Measurement of Examinee Effort

                Self-Report Measures. A popular method for measuring effort is to give examinees a self-report

     instrument about the effort they gave to a test they had just taken. Such instruments are easy to

     administer, and require relatively few items to attain adequate reliability. For example, the Student

Opinion Scale (Sundre & Moore, 2002), which uses five Likert-style items to measure examinee effort, commonly exhibits internal consistency reliability estimates in the .80s.

                Despite its ease of use, self-reported effort has several limitations. First, it is unclear how truthful

     examinees will be when asked about their test-taking effort. Self-report measures are potentially

     vulnerable to response biases. Examples include low-effort examinees who report high effort because

they fear punishment from test givers and high-effort examinees who report low effort because they felt they

     did poorly on the test and tend to attribute failure to low effort (Pintrich & Schunk, 2002). Second, self-

     report measures—which typically ask examinees for an overall assessment of their test-taking effort—are

     not sensitive to changes in effort during test events, which often occur (Wise & Kong, 2005). A third

     limitation of self-report measures is that they require the awkward assumption that examinees who did not

     give good effort to their test were sufficiently motivated to seriously complete the self-report instrument

     regarding their lack of effort.

                Response Time-Based Measures. The idea that response time is associated with test-taking

     effort has its roots in the work of Schnipke and Scrams (1997, 2002), who studied examinee behavior at

     the end of timed high-stakes tests. They found that, as the time limit approached, some examinees

     switched from trying to work out the answer to items (which they termed solution behavior) to rapidly


     answering remaining items in hopes of getting some correct by chance (termed rapid-guessing behavior).

     Schnipke and Scrams found that responses under such rapid guessing behavior tended to be correct at a

     rate consistent with random responding.

                Wise and Kong (2005) observed that rapid-guessing behavior also occurred during untimed low-

     stakes tests. They showed that in this context rapid guessing indicated instances where examinees were

     not motivated to give their best effort. Similar to the findings of Schnipke and Scrams (1997, 2002),

     rapid guesses on low-stakes tests have accuracy rates resembling those expected by random responding

     (Wise & Kong, 2005; Wise, 2006). In addition, rapid-guessing behavior has been found to be relatively

     unrelated to examinee ability.

                Regardless of the stakes of the test, rapid-guessing behavior indicates that the examinee has

     ceased trying to show what he knows and can do. The motives, however, differ with test stakes. In a

     high-stakes test, a good test performance helps the examinee obtain something he wants (e.g., high grade,

     graduation, job certification). In this case, rapid-guessing behavior is used strategically by the examinee

     in the hopes of raising his score by guessing some items correct that would otherwise be left unanswered

(and therefore be scored as incorrect). In contrast, an examinee who responds rapidly on a low-stakes test is ostensibly trying to get the test over with, rather than trying to maximize

     his score.

                It is helpful to clarify what is meant by solution behavior, because there is an asymmetry to the

     inferences being made. It is assumed that all rapid guesses are non-effortful. It is not assumed, however,

     that all responses classified as solution behavior are effortful. We recognize that there may be non-

     effortful responses made slowly by an examinee, and such responses are not readily distinguishable from

     slowly-made effortful responses. Hence, we characterize effort analysis using response time as

     identifying item responses that should not be trusted to be informative about an examinee’s proficiency

     level.

                Response time-based measurement of test-taking effort has three advantages. First, it is based on

     student behavior that can be objectively and unobtrusively collected by the computer. It does not rely on


     self-report data, which may be biased. Second, it has been shown to have high internal consistency.

     Third, it can evaluate effort on an item-by-item basis, which allows us to assess the possibility that a

     student’s level of effort changes during the test event. The primary limitation of response time-based

     measures is that the collection of response time data requires that the test under study be computer

     administered.

                The identification of rapid-guessing behavior is important because it indicates the presence of

     item responses that are uninformative about an examinee’s proficiency level. Worse, rapid guesses tend

     to exert a negative bias on a proficiency estimate because rapid guesses are correct at a rate that is usually

     markedly lower than what would have been the case had the examinee exhibited solution behavior. The

     more rapid guesses occur during a test event, the more negative bias is likely present in a test score.

     Thus, the presence of rapid-guessing behavior—if it is pervasive enough—provides an indicator that a

     score has low ISV.

                In conducting an analysis of response time-based effort on a test, there are two basic operational

     questions that must be addressed. First, exactly what comprises rapid-guessing behavior on an item? To

     keep the effort analysis as objective as possible, it is important to specify an operational definition of

     rapid guessing. Second, how extensive does rapid-guessing behavior have to be during a test event to

     warrant action on the part of the test giver? For example, if a test giver decided to identify test scores

     reflecting low ISV on score reports, how many rapid guesses would be required to warrant such a label?

     The current study focuses on the first question in the context of a CAT.

     Identification of Rapid Guesses

                The typical approach to identifying rapid guessing has been to use item response time to classify

     each item response as reflecting either solution behavior or rapid-guessing behavior. Differentiating

     between rapid-guessing behavior and solution behavior requires the establishment of a time threshold for

     each item such that all responses occurring faster than that threshold are classified as rapid guesses, and

     responses occurring slower are deemed solution behaviors.
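As a minimal illustration of this classification rule (the function name and the threshold and response-time values below are hypothetical, not taken from the study), the decision reduces to a single comparison per item response:

    def classify_response(response_time_sec, threshold_sec):
        """Label one item response as rapid-guessing or solution behavior."""
        return "rapid_guess" if response_time_sec < threshold_sec else "solution_behavior"

    # Hypothetical values: responses against a 4.0-second item threshold
    classify_response(2.1, 4.0)    # -> "rapid_guess"
    classify_response(12.5, 4.0)   # -> "solution_behavior"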


                There are two principles underlying the choice of a time threshold. First, it is desirable to identify

     as many instances of non-effortful item responses as possible. Second, it is important to avoid classifying

     effortful responses as non-effortful. There is a tension between the two principles; the first encourages us

     to choose a longer threshold, while the second encourages us to choose a shorter one. How we balance

     the two principles depends on our data-analytic goals. For example, if test data from a group of

     examinees is being used to calibrate item parameters and we wanted to cleanse the data of as many rapid

     guesses as possible, the first principle might be of higher concern. In contrast, if test scores are used to

     make inferences about the instructional needs of individual examinees, fairness considerations for the

     examinees might suggest that the second principle would be the predominant concern.

                Several different methods have been examined for identifying time thresholds for a given item.

     Schnipke and Scrams (1997) noted that, while a frequency distribution of response times to an item

     typically has a positively skewed unimodal distribution, the presence of rapid guessing results in an initial

     frequency spike occurring during the first few seconds the item is displayed. Schnipke and Scrams

     conceptualized this as indicating that rapid-guessing behavior and solution behavior exhibit distinctly

     different response time distributions. Based on this, they used a two-state mixture model to identify a time

     threshold for each item. Wise and Kong (2005) based their time thresholds on surface features of the

     items (e.g., the number of characters in the stem and options), reasoning that items requiring more reading

     should have longer thresholds. Wise (2006) alternatively proposed that the time threshold for an item

     could be identified through visual inspection of its response time frequency distribution, with the time

     threshold corresponding to the end of the initial frequency spike. Kong, Wise, and Bhola (2007)

     compared these three variable-threshold identification methods, finding that they produced similar

     thresholds.

     Threshold Identification in a CAT

                Threshold identification becomes more challenging in an adaptive testing context. A CAT selects

items from pools that contain hundreds, if not thousands, of items. In addition, these item pools typically

     are dynamic, with new items being continually added. Thus, relative to a conventional fixed test, the task


     of identifying variable thresholds for CAT items is necessarily more time and resource intensive.

     Because of this, much of the early effort analysis research with CAT has used a common three-second

     threshold (Wise, Kingsbury, Thomason, & Kong, 2004; Wise, Kingsbury & Hauser, 2009; Wise, Ma,

     Kingsbury & Hauser, 2010), because it could be immediately applied to any item pool.

                Despite its ease of use, a common threshold may not be adequate for operational use with CATs

     such as the Measures of Academic Progress (MAP), which uses item pools in assessing the academic

     growth of primary and secondary school students. Hauser and Kingsbury (2009) argued that a three-

second threshold is too conservative for many items that contain a lot of reading, such as those that an engaged student with strong reading skills would need more than a minute to read. For these items, use of

     a three-second threshold would violate the first threshold principle described earlier. On the other hand,

     an engaged student might be able to select the correct answer for some basic math items (e.g., “What is 4

     x 5?”) effortfully in less than three seconds. In these instances, a three-second threshold could classify

     effortful test-taking behavior as non-effortful, which would violate the second threshold principle. In

     addition, Kong et al. (2007) found that variable-threshold methods performed somewhat better than a

     common three-second threshold in terms of both threshold agreement and the impact of motivation

     filtering (Sundre & Wise, 2003; Wise & DeMars, 2005) on convergent validity. For these reasons, a

common threshold was deemed unacceptable with MAP, and a practical variable-threshold method was needed.

                Ma, Wise, Thum, and Kingsbury (2011) studied the use of several variable-threshold methods for

     MAP data. Three versions of visual inspection were investigated: inspection of response time

     distributions, inspection of both response time and response accuracy distributions, and inspection of

     response time and response accuracy distributions along with item content. They found that the

     agreement between judges was low for all three of the inspection methods. In addition, they explored the

     use of two statistical threshold identification methods (mixture models and a non-parametric method),

     finding that neither method exhibited practical utility.


                Because inspection methods and statistical methods have not been found sufficiently useful in

     identifying thresholds for MAP data, and because the calculation of thresholds based on surface features

     was considered too resource intensive, there is a need for a threshold identification method that can be

     practically applied to large and frequently changing item pools. The current study introduces and

     investigates a new variable-threshold method in the analysis of examinee effort on the adaptive MAP

     assessment.

     The Normative Threshold Method

                Items vary in the time spent by examinees in responding to them. This variation can be attributed

to multiple factors, such as the amount of reading required by the items or how mentally taxing they

     are. Similarly, what might be called a rapid response varies according to the time demands of the items.

In the Normative Threshold (NT) method, the time threshold for a particular item is defined as a percentage of the item's mean response time (i.e., the average time that elapses between when the item is displayed and when it is answered), up to a maximum threshold value of 10 seconds. For example, if it takes

     examinees an average of 40 seconds to respond to a particular item, a 10 percent threshold (NT10) would

     be 4 seconds (that is, 10% of 40 seconds), whereas an NT15 threshold would be 6 seconds. A maximum

     threshold value of 10 seconds was chosen following the observation by Setzer, Wise, van den Heuvel, and

     Ling (in press) that it might be problematic to credibly characterize a response taking longer than 10

     seconds as a rapid guess.

                The NT method requires only (a) the mean response time for each item in a pool (which can be

     readily calculated and stored based on earlier administrations of the item) and (b) a chosen percentage

     value. Establishing a suitable percentage value involves balancing the two threshold principles. The goal

     is to identify as many non-effortful responses as possible with the constraint that responses deemed rapid

     guesses should be correct at a rate that is not higher than one would expect from random responses. In

     the current study, three threshold values were studied: NT10, NT15, and NT20.
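Because the NT calculation involves only an item's mean response time, a chosen percentage, and the 10-second cap, it can be sketched in a few lines; the function name and the mean response times below are illustrative rather than values from the MAP pools.

    MAX_THRESHOLD_SEC = 10.0   # cap chosen following Setzer et al. (in press)

    def nt_threshold(mean_response_time_sec, percentage):
        """Normative threshold: a percentage of the item's mean response time, capped at 10 seconds."""
        return min(mean_response_time_sec * percentage / 100.0, MAX_THRESHOLD_SEC)

    # Worked example from the text: an item averaging 40 seconds per response
    nt_threshold(40.0, 10)    # -> 4.0 seconds (NT10)
    nt_threshold(40.0, 15)    # -> 6.0 seconds (NT15)
    nt_threshold(120.0, 15)   # -> 10.0 seconds (hits the cap)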


                                                                     Method

     Data Set and Sample

                All of the tests administered were part of the Northwest Evaluation Association’s (NWEA’s)

     MAP testing system. MAP tests are untimed interim CATs, with the tests in Math being generally 50

     multiple-choice items in length, while those in Reading are generally 40 multiple-choice items in length.

MAP proficiency estimates are expressed as RIT scores on a common scale that allows growth to

     be assessed when examinees are tested at multiple times. Over 285,000 examinees in grades 3-9 from a

     single U.S. state were administered MAP in both the fall and spring of the 2010-2011 academic year. For

each examinee, a growth score was generated (calculated as RIT_Spring − RIT_Fall). MAP growth scores were

     generated for both Math and Reading.

     Outcome Measures

                Ideally, a criterion would be available that specified which of the item responses were non-

     effortful and which were not. This would allow a direct comparison between the three NT methods

     relative to the two threshold principles discussed earlier. In the absence of such a criterion, a set of test

     events that indicated the presence of non-effortful test-taking behavior was studied, and the degree to

     which those test events were identified using the three NT methods was assessed.

                A practical measurement problem frequently encountered with MAP data is the occurrence of

     growth scores whose values are difficult to interpret because their direction and/or magnitude are not

     considered credible. These types of scores come in two forms. During a school year, some students

exhibit negative growth values that are much larger in magnitude than can be attributed to measurement

     error. Such negative values are problematic because they imply that an examinee knows substantially less

     in the spring than they did the previous fall. Wise et al. (2009) showed that students exhibiting effortful

     test-taking behavior during fall testing, but not during spring testing, could explain many instances of

     large negative growth values. Conversely, some students exhibit unrealistically large positive growth

     values, which can be explained by students exhibiting effortful test-taking behavior in the spring, but not

     in the fall.
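A rough sketch of how such growth scores might be screened is given below; the function names and the ±20-point cutoffs (roughly four to five standard errors of a growth score) are illustrative assumptions for this example, not the credibility rule used in the study.

    # Growth score and an illustrative (hypothetical) credibility screen
    def growth_score(rit_fall, rit_spring):
        return rit_spring - rit_fall

    def is_credible(growth, lower=-20, upper=20):
        """Flag growth values whose magnitude exceeds an assumed plausible range."""
        return lower <= growth <= upper

    g = growth_score(rit_fall=212, rit_spring=188)   # hypothetical RIT scores
    g                # -> -24
    is_credible(g)   # -> False: an implausibly large negative growth value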


                Non-credible growth scores (either positive or negative) were used as indicators of non-effortful

     test events. One of the current study’s outcome measures was the percentage of non-credible growth

scores that were accounted for by applying a set of response time-based flagging criteria (examples of which were described in Wise et al., 2009) for each of the threshold methods. This allowed a comparison of

     the NT methods relative to the first threshold principle.

                The accuracy rate of item responses classified as rapid guesses comprised the second outcome

     measure. Rapid guesses should show an accuracy rate similar to that expected by chance through

     random responding (Wise & Kong, 2005). Accuracy rates exceeding chance would indicate the inclusion

     of effortful responses (whose accuracy rates were substantially higher than chance). Hence, the accuracy

     rate provides an indication of the extent to which the second threshold principle had been violated by a

     particular threshold method.

                For Math MAP items, which contain five response options, the random accuracy rate should be

     .20; for four-choice Reading items, the accuracy rate should be .25. This is an oversimplification,

     however, because test takers exhibiting rapid-guessing behavior do not actually respond randomly. It has

     been shown that examinees are somewhat less likely to guess either the first or the last response option

     and somewhat more likely to guess one of the middle options (Attali & Bar-Hillel, 2003; Wise, 2006).

     This is an example of “middle bias”, which is a well-known psychological phenomenon that explains why

     people sometimes have difficulty behaving randomly (even when instructed to). Middle bias would

     suggest that the expected rapid guessing accuracy rate for an item depends in part on the location of the

     correct answer. For example, random guesses to a Math item whose answer is option “a” might have an

     accuracy rate of .17, while those to a Math item whose answer is “b” might have an accuracy rate of .24.

                If the positions of the correct answers are reasonably balanced across a set of items, however, the

     overall accuracy rate of rapid guesses should be close to what would be expected under random

     responding. An analysis of the MAP item pools revealed that the pools were slightly imbalanced, with a

     disproportionately higher number of items with correct answers in the second and third response option


position (item writers are themselves vulnerable to middle bias, which would result in their tending to more often construct items with the correct answers in the middle response positions; Attali & Bar-Hillel, 2003). Thus, it was expected that rapid guesses would be correct at a rate slightly above that expected purely by chance.
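To make this concrete, the expected rapid-guess accuracy for a pool can be computed by weighting, for each option position, the probability that a rapid guesser chooses that position by the proportion of items keyed there. The guessing probabilities and key proportions below are illustrative assumptions, not estimates from the MAP pools.

    # Hypothetical five-option (Math) example with middle bias in guessing
    guess_prob = {"a": 0.17, "b": 0.22, "c": 0.24, "d": 0.21, "e": 0.16}   # sums to 1.0
    key_prop   = {"a": 0.18, "b": 0.22, "c": 0.22, "d": 0.20, "e": 0.18}   # sums to 1.0

    expected_accuracy = sum(guess_prob[pos] * key_prop[pos] for pos in guess_prob)
    round(expected_accuracy, 3)   # -> 0.203, slightly above the .20 expected under pure chance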

     Data Analysis

                Comparisons were made among the three normative-threshold methods (NT10, NT15, & NT20),

     along with a common three-second threshold (Common3). Mean response times were identified for items

     in both the Math and Reading item pools and thresholds for NT10, NT15, and NT20 were calculated for

     each item. Next, for each threshold method, each item response in a test event was classified as either

     rapid-guessing behavior or solution behavior. The test events were then evaluated for each threshold

     method using five flagging criteria designed to identify low-ISV scores (Wise et al., 2009). Three of the

     flagging criteria were based on the preponderance and pattern of rapid guesses in the data, with the other

     two criteria based on the accuracy rates of sets of item responses. A test event was classified as low ISV

     if it was flagged by at least one of the five criteria. Finally, the accuracy rates for all rapid guesses and

     solution behaviors were calculated (aggregating across both items and examinees) for each threshold

     method.
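A minimal sketch of the accuracy-rate aggregation step is shown below. The record layout and field names are hypothetical, and the five ISV flagging criteria from Wise et al. (2009) are not reproduced here.

    from collections import defaultdict

    def accuracy_rates(responses, methods):
        """Accuracy of rapid guesses and solution behaviors, aggregated per threshold method.

        Each response is assumed to be a dict with keys 'time' (seconds), 'correct' (bool),
        and 'thresholds' (mapping each method name to that item's threshold in seconds).
        """
        tallies = defaultdict(lambda: {"rapid_guess": [0, 0], "solution": [0, 0]})  # [n correct, n total]
        for r in responses:
            for m in methods:
                kind = "rapid_guess" if r["time"] < r["thresholds"][m] else "solution"
                tallies[m][kind][0] += int(r["correct"])
                tallies[m][kind][1] += 1
        return {m: {k: (c / n if n else float("nan")) for k, (c, n) in tallies[m].items()}
                for m in methods}

Restricting the same tallies to responses whose times fall between an item's NT10 and NT15 thresholds (or between its NT15 and NT20 thresholds) yields the accuracy of the additional rapid guesses gained when moving between methods, the comparison reported with Table 7.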

                                                                     Results

                Descriptive statistics for the four threshold methods are shown in Table 1. Each of the NT

     methods yielded a range of threshold values, with some values less than three seconds and some

     exceeding three seconds. The median threshold value exceeded three seconds for each NT method and,

     as the NT percentage increased, increases were observed in both the median threshold value and the

     number of values set at the maximum value of ten seconds. Figure 1 shows the distributions of the

     threshold values for the NT methods.

                The generally higher threshold values for the NT methods imply that they would identify rapid

     guesses more often than would Common3. Table 1 shows that the proportion of responses classified as



     rapid guesses varied substantially across the methods, with Common3 identifying rapid guesses about 1%

     of the time, and NT20 about 3-4% of the time.

                Table 2 shows the results of applying the ISV flagging criteria based on rapid guesses identified

     under the various threshold methods. More test events were flagged by the NT methods, with the

     numbers of flagged test events increasing with the NT percent value. In math, a greater number of test

events were flagged in the fall testing session only (ranging from 2.7% to 6.6%) than in the spring testing session only (ranging from 1.7% to 3.5%). In reading, the pattern of results was similar except that

     the prevalence of low ISV was generally at least 80% higher than that found for math. The overall

     numbers of flagged test events suggested that there were numerous instances of extreme positive or

     negative growth scores in the data for both math and reading.

                The distributions of growth scores from examinees whose scores were flagged for low ISV during

     fall testing only are shown in Table 3. These were examinees for whom the low ISV in the fall would be

     expected to exert a positive bias on the growth scores. The majority of these growth scores were positive,

     as their values were subject to the joint influence of the true proficiency gains of the students and positive

     bias introduced by the presence of low ISV in the fall session. Regardless of threshold method, there

     were a substantial number of high positive growth scores that were flagged. In math, there were at least

     1,600 scores that exceeded +20; in reading, at least 3,000 exceeded +20.

                Removing the low ISV test events identified in Table 3 from the overall sample had a differential

     impact on the growth score intervals. Table 4 shows the impact for each threshold method. As expected,

     the greatest impact was seen for the most extreme positive scores (i.e., those that were +20 or higher).

     The NT methods accounted for substantially more extreme positive scores than did Common3 and the

     NT20 method accounted for the most, followed by NT15 and then NT10. In addition, higher percentages

     of the extreme positive scores were accounted for in reading than in math. Overall, however, the

     percentages of high positive growth scores that were accounted for by low ISV were moderate. For the

     most extreme positive growth (i.e., greater than +30), between 29% and 42% of the scores were

     accounted for by low ISV in math and between 39% and 58% in reading.


                Tables 5 and 6 show the same analyses for examinees whose scores were flagged for low ISV

     during spring testing only. Low ISV in the spring scores would be expected to exert a negative bias on

     growth scores. In Table 5, the distributions of growth scores for these examinees are shown. The values

     in the rows of Table 5 are more symmetric than those shown in Table 3, reflecting the combination of true

     growth (which tends to be positive) with the negative effects of low ISV.

                Table 6 shows the effects of removing examinees exhibiting low ISV in the spring only from the

     overall sample. The pattern of results is similar to that found in Table 4, except that, as expected, the

     majority of the scores removed were negative. Moreover, the percentages of high negative growth scores

     accounted for by low ISV were higher. For the most extreme negative growth (i.e., less than -30),

     between 87% and 97% of the scores were accounted for by low ISV in math and between 73% and 82%

     in reading.

                As Table 6 illustrates, the NT methods outperformed the Common3 method in accounting for

     negative growth scores. Furthermore, NT20 accounted for the highest percentage, followed by NT15, and

     then NT10. The relative performance of the NT methods, however, follows logically from their definition

     and should therefore not be surprising. Because each NT15 threshold is slightly higher than its NT10

     counterpart, the NT15 method classified as rapid guesses all of the item responses that were classified as

     such by NT10, plus some additional responses. Similarly, NT20 classified as rapid guesses all of the

     responses that were classified by NT15, plus some additional responses. This relationship between NT10,

     NT15, and NT20 was also observed when the ISV flagging criteria were applied to the test events.

                The higher the NT threshold value, the higher the percentage of non-credible growth scores that

     would be accounted for. This is consistent with the first threshold principle. It is clear, however, that as

     the NT threshold values increase, there will be an increased likelihood that effortful responses will be

     misclassified as rapid guesses (which violates the second threshold principle). The impact of the

     threshold methods relative to the second principle was evaluated by considering the accuracy rates of

     responses classified as either solution behavior or rapid guessing behavior.


                Table 7 shows the accuracy rates for each threshold method. In math, solution behaviors showed

     accuracy rates very close to the value of .50 expected under a CAT, for each of the methods. Rapid

     guesses for math items—which had five response options—should exhibit accuracy rates close to .20.

     This occurred for Common3 and NT10. In contrast, both NT 15 and NT20 exhibited accuracy rates that

     had moved noticeably higher than .20, indicating that some effortful responses (whose accuracy rates

     should be closer to .50) had been classified as rapid guesses. The presence of these effortful responses

     can be seen more clearly if only the additional rapid guesses that were gained by moving from one NT

     method to another are considered. When moving from NT10 to NT15, the additional rapid guesses

     gained were correct 31% of the time. When moving from NT15 to NT20, they were correct 47% of the

     time.

                The same pattern of results was found in reading (also shown in Table 7), whose items had four

     response options. The accuracy of solution behaviors was close, for each of the four methods, to the

     expected value of .50. Regarding rapid guessing behavior, Common3 and NT10 exhibited accuracy rates

     reasonably close to the expected rapid guessing accuracy rate of .25. NT15 and NT20 again exhibited

     signs that they were classifying some effortful responses as rapid guesses. When moving from NT10 to

     NT15, the additional rapid guesses gained were correct 38% of the time. When moving from NT15 to

     NT20, they were correct 51% of the time.

                A final data analysis examined the correlations between the fall and spring RIT scores. Wise et

     al. (2009) showed that the fall-spring RIT correlations increased after low-ISV scores were removed.

     These increased correlations were interpreted as indicating that construct-irrelevant variance—which had

     been attenuating the correlations—had been removed. Table 8 shows the fall-spring correlations in the

     current study, by threshold method. In math, the correlations increased from .91 to .92 for each of the

     methods. In reading, the correlation increased from .89 to .91 for NT10, but increased only to .90 with

     the other three methods.


                                                                   Discussion

                Identification of rapid-guessing behavior for a CAT requires that a practical, effective method be

     available for establishing time thresholds for pools of items. Choice of a threshold method should

     consider the two threshold principles. The first principle states that the method should account for as

     many non-effortful item responses as possible. The second principle states that the method should not

     classify effortful responses as non-effortful. The current study explored the validity of a new variable

     threshold identification method—the normative threshold method—and sought to identify a threshold

     value that best balances the two principles.

                As a point of reference, the Common3 method, which has been used previously for effort analysis

     with MAP scores, did a reasonably good job of balancing the two principles. It identified low-ISV test

     events that produced a large proportion of the extreme growth scores that were accounted for by any of

     the methods. Moreover, the accuracy rates of solution behaviors and rapid-guessing behaviors classified

by Common3 were very close to the expected values. Finally, removing its low-ISV scores from the

     sample increased the fall-spring RIT score correlations.

                Each of the NT methods studied identified markedly more low-ISV test events than Common3

     and, as a consequence, the NT methods identified higher percentages of non-credible growth scores.

     However, when accuracy rates were considered, only the NT10 method maintained accuracy for solution

     behaviors and rapid-guessing behaviors at their expected rates. The NT15 and NT20 methods showed

     clear signs that they were classifying effortful responses as rapid guesses. In addition, the NT10 method

     had the strongest positive effect (albeit slightly) on the fall-spring RIT correlations. Hence, overall, the

     NT10 method showed the best performance and is therefore recommended.

                The differences in the performance of the various NT methods illustrate the absence of a clear

     line between rapid guessing and effortful responding. The second threshold principle, that we should not

     classify as non-effortful any responses that were actually effortful, implies that we will necessarily act

     conservatively in our decisions about the validity of test events. For example, suppose that the time

     threshold for a particular item is 2.6 seconds. If a given student responded in 2.7 seconds, it may still be


more likely than not that the student was rapid guessing, but now there is also a realistic possibility that

     a student who reads, thinks, and responds very quickly could have engaged in solution behavior and

     answered the item effortfully. In this situation, the response would be classified as reflecting solution

     behavior even though it may be much more likely that it was actually a rapid guess. As a result, decisions

     are biased toward classifying ambiguous response times as reflecting effortful behavior, which suggests

     that we will sometimes not flag some test events whose scores actually reflect low ISV. This explains, at

     least in part, why higher percentages of extreme growth scores were not accounted for by the ISV

     flagging criteria.

                The general conclusion of this study is that the normative threshold method can provide a

     practical and effective way to establish time thresholds for identifying rapid-guessing behavior. In

     particular, the NT10 method was found to provide a good balance between the two threshold principles. It

     can be readily applied to a CAT item pool containing thousands of items, as it requires only the average

     response time that examinees spend on each item. Such item information can be routinely extracted from

     field test or operational data and stored as an attribute of each item in the pool.


                                                                   References

     Attali, Y., & Bar-Hillel, M. (2003). Guess where: The position of correct answers in multiple-choice test
         items as a psychometric variable. Journal of Educational Measurement, 40, 109–128.

     Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing.
        Educational Measurement: Issues and Practice, 23(1), 17-27.

Hauser, C., Kingsbury, G. G., & Wise, S. L. (2008, March). Individual validity: Adding a missing link.
        Paper presented at the annual meeting of the American Educational Research Association, New York,
        NY.

Hauser, C., & Kingsbury, G. G. (2009, April). Individual score validity in a modest-stakes adaptive
        educational testing setting. Paper presented at the annual meeting of the National Council on
        Measurement in Education, San Diego, CA.

     Kingsbury, G. G., & Hauser, C. (2007, April). Individual validity in the context of an adaptive test.
        Paper presented at the annual meeting of the American Educational Research Association, Chicago,
        IL.

     Kong, X., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to
        differentiate solution behavior from rapid-guessing behavior. Educational and Psychological
        Measurement, 67, 606-619.

Ma, L., Wise, S. L., Thum, Y. M., & Kingsbury, G. G. (2011, April). Detecting response time thresholds under the computer adaptive testing environment. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

     Pintrich, P. R., & Schunk, D. H. (2002). Motivation in education: Theory, research, and applications
(2nd ed.). Upper Saddle River, NJ: Merrill Prentice-Hall.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213-232.

Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behavior: Insights gained from
        response-time analyses. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-
        based testing: Building the foundation for future assessments. Mahwah, NJ: Lawrence Erlbaum
        Associates.

     Setzer, J. C., Wise, S. L., van den Heuvel, J. R., & Ling, G. (in press). An investigation of test-taking
       effort on a large-scale assessment. Applied Measurement in Education.

     Sundre, D. L., & Moore, D. L. (2002). The Student Opinion Scale: A measure of examinee motivation.
       Assessment Update, 14(1), 8-9.

     Sundre, D. L., & Wise, S. L. (2003, April). “Motivation Filtering”: An exploration of the impact of low
        examinee motivation on the psychometric quality of tests. Paper presented at the annual meeting of
        the National Council on Measurement in Education, Chicago, IL.

     Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes,
computer-based test. Applied Measurement in Education, 19, 95-114.


     Wise, S. L., & DeMars, C. E. (2005). Examinee motivation in low-stakes assessment: Problems and
        potential solutions. Educational Assessment, 10, 1-18.

     Wise, S. L., Kingsbury, G. G., & Hauser, C. (2009, April). How do I know that this score is valid? The
        case for assessing individual score validity. Paper presented at the annual meeting of the National
        Council on Measurement in Education, San Diego.

     Wise, S. L., Kingsbury, G. G., Thomason, J., & Kong, X. (2004, April). An investigation of motivation
        filtering in a statewide achievement testing program. Paper presented at the annual meeting of the
        National Council on Measurement in Education, San Diego, CA.

     Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in
        computer-based tests. Applied Measurement in Education, 18, 163-183.

Wise, S. L., Ma, L., Kingsbury, G. G., & Hauser, C. (2010). An investigation of the relationship between time of testing and test-taking effort. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.


Table 1

Descriptive Statistics for the Four Threshold Methods (threshold values in seconds)

Content   Threshold   Minimum   Maximum   Median   Proportion of Thresholds   Proportion of Item Responses
Area      Method                                   Set at 10 Seconds          Classified as Rapid Guesses

Math      Common3      3.00      3.00      3.00            0.00                        0.007
          NT10         1.15     10.00      5.31            0.10                        0.014
          NT15         1.73     10.00      7.96            0.32                        0.023
          NT20         2.31     10.00     10.00            0.55                        0.033

Reading   Common3      3.00      3.00      3.00            0.00                        0.014
          NT10         1.33     10.00      6.49            0.15                        0.026
          NT15         2.00     10.00      9.74            0.49                        0.034
          NT20         2.67     10.00     10.00            0.66                        0.040

Table 2

The Numbers of Test Events Flagged for Low ISV in Fall and/or Spring Testing, by Threshold Method

                                          When Did Low ISV Occur?
Content   Threshold   In Neither          In Fall           In Spring         In Both
Area      Method      Testing Session     Testing Only      Testing Only      Testing Sessions

Math      Common3     272,415 (95.2%)      7,867 (2.7%)      4,953 (1.7%)      1,026 (0.4%)
          NT10        264,702 (92.5%)     12,128 (4.2%)      7,052 (2.5%)      2,379 (0.8%)
          NT15        258,556 (90.3%)     15,575 (5.4%)      8,457 (3.0%)      3,673 (1.3%)
          NT20        252,246 (88.1%)     18,839 (6.6%)     10,133 (3.5%)      5,043 (1.8%)

Reading   Common3     257,332 (89.4%)     16,715 (5.8%)     10,485 (3.6%)      3,158 (1.1%)
          NT10        242,034 (84.1%)     23,737 (8.3%)     13,922 (4.8%)      7,997 (2.8%)
          NT15        236,340 (82.2%)     26,594 (9.2%)     14,766 (5.1%)      9,990 (3.5%)
          NT20        232,838 (80.9%)     28,252 (9.8%)     15,394 (5.4%)     11,206 (3.9%)

Note. The values in parentheses are row percentages.


Table 3

The Numbers of Examinees in Various Growth Score Intervals, of Examinees Who Exhibited Low ISV in the Fall Testing Only

                                                   Growth Score Range
Content   Threshold
Area      Method       < -30   -30 to -20   -20 to -10   -10 to +10   +10 to +20   +20 to +30   > +30

Math      Common3          1            7          110        3,513        2,555        1,150     531
          NT10             2           15          179        5,778        3,949        1,614     591
          NT15             2           17          219        7,508        5,173        1,975     681
          NT20             1           13          241        9,226        6,345        2,279     734

Reading   Common3         13           76          518        8,531        4,528        2,033   1,016
          NT10            12           96          668       12,439        6,669        2,719   1,134
          NT15            10           91          675       13,772        7,655        3,124   1,267
          NT20             9           80          694       14,475        8,260        3,387   1,347

Table 4

Percentage Decreases in the Number of Examinees From Various Growth Score Intervals When Low-ISV Test Events Were Removed, of Examinees Who Exhibited Low ISV in the Fall Testing Only

                                                   Growth Score Range¹
Content   Threshold
Area      Method       < -30   -30 to -20   -20 to -10   -10 to +10   +10 to +20   +20 to +30   > +30

Math      Common3         1%           2%           3%           2%           3%           7%     29%
          NT10            2%           4%           4%           3%           5%          10%     33%
          NT15            2%           5%           6%           4%           6%          13%     38%
          NT20            1%           4%           6%           5%           8%          15%     42%

Reading   Common3         5%           7%           6%           4%           7%          17%     39%
          NT10            5%          11%           8%           6%          11%          23%     47%
          NT15            5%          11%           8%           7%          13%          27%     54%
          NT20            4%          10%           8%           8%          14%          29%     58%

Note. Examinees who exhibited low ISV in the fall testing only would be expected to exhibit high positive growth.
Examinees triggering at least one effort flag during both the fall and spring testing sessions were excluded.
¹ The standard errors of growth scores for Math and Reading were approximately four and five points, respectively.


Table 5

The Numbers of Examinees in Various Growth Score Intervals, of Examinees Who Exhibited Low ISV in the Spring Testing Only

                                                   Growth Score Range
Content   Threshold
Area      Method       < -30   -30 to -20   -20 to -10   -10 to +10   +10 to +20   +20 to +30   > +30

Math      Common3        103          197          514        2,949          968          202      20
          NT10            99          217          628        4,491        1,341          250      26
          NT15            90          210          724        5,580        1,552          277      24
          NT20            88          214          827        6,885        1,812          283      24

Reading   Common3        197          413        1,180        6,133        1,924          526     112
          NT10           170          424        1,503        8,635        2,421          633     136
          NT15           166          444        1,609        9,343        2,483          609     112
          NT20           167          451        1,665        9,930        2,500          589      92

Table 6

Percentage Decreases in the Number of Examinees From Various Growth Score Intervals When Low-ISV Test Events Were Removed, of Examinees Who Exhibited Low ISV in the Spring Testing Only

                                                   Growth Score Range¹
Content   Threshold
Area      Method       < -30   -30 to -20   -20 to -10   -10 to +10   +10 to +20   +20 to +30   > +30

Math      Common3        87%          51%          12%           2%           1%           1%      1%
          NT10           94%          60%          16%           3%           2%           2%      1%
          NT15           96%          61%          18%           3%           2%           2%      1%
          NT20           97%          66%          21%           4%           2%           2%      1%

Reading   Common3        73%          39%          13%           3%           3%           4%      4%
          NT10           77%          47%          17%           4%           4%           5%      6%
          NT15           79%          51%          19%           5%           4%           5%      5%
          NT20           82%          54%          20%           5%           4%           5%      4%

Note. Examinees who exhibited low ISV in the spring testing only would be expected to exhibit high negative growth.
Examinees triggering at least one effort flag during both the fall and spring testing sessions were excluded.
¹ The standard errors of growth scores for Math and Reading were approximately four and five points, respectively.


Table 7

Accuracy Rates for Item Responses Under Solution Behavior and Rapid-Guessing Behavior, by Threshold Method

                                                                   Accuracy of          Accuracy of
Threshold Method                                                   Solution Behaviors   Rapid-Guessing Behaviors

Math Items (5 response options)
  Common3                                                                50.2%                 21.2%
  NT10                                                                   50.5%                 20.9%
  NT15                                                                   50.7%                 24.4%
  NT20                                                                   50.7%                 31.0%
  Additional rapid guesses gained when moving from NT10 to NT15                               31.0%
  Additional rapid guesses gained when moving from NT15 to NT20                               47.0%

Reading Items (4 response options)
  Common3                                                                52.2%                 25.5%
  NT10                                                                   52.6%                 27.1%
  NT15                                                                   52.7%                 29.1%
  NT20                                                                   52.7%                 31.8%
  Additional rapid guesses gained when moving from NT10 to NT15                               38.0%
  Additional rapid guesses gained when moving from NT15 to NT20                               51.0%

Table 8

Fall-Spring RIT Score Correlations After Removal of Low-ISV Scores, by Threshold Method

                                                 Correlation When Low-ISV Scores Removed
Content   Correlation When No
Area      Scores Were Removed        Common3      NT10      NT15      NT20

Math               .91                  .92        .92       .92       .92

Reading            .89                  .90        .91       .90       .90


Figure 1. Histograms of item thresholds for each NT method, by content area (Mathematics and Reading). Note the differences in vertical scales across graphs.
