THE SWEDISH DRIVING-LICENSE TEST
A Summary of Studies from the Department of Educational Measurement, Umeå University

Widar Henriksson
Anna Sundström
Marie Wiberg

Em No 45, 2004
ISSN 1103-2685
ISRN UM-PED-EM--45--SE
INTRODUCTION
THE DRIVER EDUCATION IN SWEDEN
HISTORY OF THE SWEDISH DRIVER EDUCATION AND DRIVING-LICENSE TESTS
CRITERION-REFERENCED AND NORM-REFERENCED TESTS
IMPORTANT ISSUES IN TEST DEVELOPMENT
   Test specification
   Item specifications
   Item format
   Evaluation of items
   Try-out
   Validity
   Reliability
   Parallel test versions
   Standard setting
   Test administration
   Item bank
EMPIRICAL STUDIES OF THE THEORY TEST
   A new curriculum and a new theory test in 1990
      Judgement of items – difficulty
      Parallel test versions
   A theoretical description of the test
      Test specifications
      Item format
      Try-out
      Standard setting in the theory test
   Traffic education in upper secondary school – an experiment
   Analysis of the structure of the curriculum and the theory test
      Judgement of items – the relation between the curricula and the content of the items
   Aspects of assessment in the practical driving-licence test
      A detailed curriculum
      A model for judgement of competencies
   The computerisation of the theory test
   Methods for standard setting
      Standard setting for the theory test used between 1990-1999
      Standard setting for the new theory test introduced in 1999
   Item bank for the theory test
   A sequential approach to the theory test
   Results of the Swedish driving-license test
      Parallel test versions and the relationship between the tests
      Private or professional education
      Validating the results
   Driver education's effect on test performance
   Driver education in the Nordic countries
   Curriculum, driver education and driver testing
      Assessing the quality of the tests – reliability and validity
      Assessment of attitudes and motives
FURTHER RESEARCH
Introduction

Since 1990, the Department of Educational Measurement at Umeå University has been commissioned by the Swedish National Road Administration (SNRA) to study the Swedish driving-license test. Over the past few years several studies have been conducted in order to develop and improve the Swedish driving-license test. The focus of the majority of the studies has been the theory test.

The aims of this paper were threefold: firstly, to describe the development of the driver education and the driving-license test in Sweden during the past century; secondly, to summarize the findings of our research, which is related to important issues in test development; and finally, to make some suggestions for further research.

The driver education in Sweden

The present driver education consists of a theory part and a practical part. Since the driver education is voluntary, learner-drivers have the choice of professional and/or private education. Driver instruction refers to professional education at a driving school and driving practice refers to lay-instructed driver training. In order to engage in driver instruction or driving practice the learner-driver needs a Learner's Permit. In September 1993 the age limit for driving practice was lowered from 17½ years to 16 years (SFS 1992:1765). It is common for learner-drivers in Sweden to combine driver instruction with driving practice (Sundström, 2004): the learner-drivers get intense driver instruction at the driving school and practise the exercises at home, for example under the supervision of their parents. There are certain criteria that a person has to meet in order to be approved as a lay instructor for a learner-driver; for example, the person has to be at least 24 years old and have held a driving license for a minimum of five years (SFS 1998:488).

The driver education reflects the curriculum, which consists of nine main parts (VVFS 1996:168). To determine if the student has gained enough competence according to the curriculum, a driving-license test is taken. The test consists of two examinations, a theory test and a practical test.
Five of the nine parts of the curriculum are tested in the theory test and the remaining four parts are tested in the practical test (see Table 1). Test-takers have to be 18 years old and pass the theory test before they are allowed to take the practical test.

Table 1. The nine content areas of the driving-license test.

Theory test                                    Practical test
Vehicle-related knowledge                      Vehicle-related knowledge
Traffic regulations                            Manoeuvring the vehicle
Risky situations in traffic                    Driving in traffic
Limitation of driver abilities                 Driving under special conditions
Special applications and other regulations

History of the Swedish driver education and driving-license tests

During the past century, the number of vehicles in traffic has increased rapidly, which has been reflected in many new constitutions and regulations. Franke, Larsson and Mårdsjö (1995) described the development of the Swedish driver-education system. The increasing motorism has caused a need for assessment of the driver's knowledge and abilities. The content of the theory education and the practical driver education, and the knowledge and abilities required to pass the driving-license test, have increased over time. The trend in the driver education and the driving-license test is that the focus has changed from teaching students about the construction and manoeuvring of the vehicle to judgement of risk and the driver's behaviour in traffic.

The first regulation for motor traffic was introduced in 1906. In order to drive a car the person needed a certificate. To obtain the certificate, the person had to demonstrate his or her theoretical knowledge and practical ability to a driving examiner. In 1916 the knowledge required for obtaining a driving license became more extensive. The driver now had to demonstrate his or her knowledge of the construction and management of the vehicle and of the most necessary traffic regulations. The requirements for obtaining a driving license got even stricter in 1923, when the driving examiner was required to judge if the person was suitable as a driver. In 1927, the opinion was that the practical test should be the main part of the driving-license test. The practical test should be conducted in different traffic situations so that the driving examiner could assess the driving skill, presence of mind and judgement of the test-taker (Molander, 1997).
In 1948 the education in driving theory was supplemented with some new parts that dealt with the responsibilities of the driver and accidents in traffic; these new areas were also reflected in the driving-license test. At the time the driving-license test consisted of three examinations: a written test, an oral examination and a practical test. The written test consisted of twenty-five items that should be completed in fifteen minutes. The purpose of the oral examination was to check the test-taker's understanding of traffic-related problems. The practical test involved at least ten minutes of actual driving where either the test-taker or the driving examiner decided the route. The results of the three tests were considered in the final judgement of the test-taker. There were clear directives on what knowledge was required in order to pass the test. Later, it was stated that there were some problems with the practical test. It was found that the difficulty of the practical test varied a great deal depending on when the test was taken (Franke et al., 1995).

In the 1950s the responsibilities of the driver were emphasised to a greater extent than before. This change was based on the opinion that the personality of the drivers affected their behaviour in traffic. The purpose of the theory test was to check that the learner-driver had knowledge that improved his or her judgement in traffic. For a long time, the focus of the education in driving theory had been how the vehicle was constructed. Now, consideration of other road users and judgement in traffic were considered the most important parts of the theory test. Even though it was important to improve the judgement of the learner-driver in traffic, the focus was still the practical education. At the end of the 60s the theory education and the practical education were integrated. It became important that the learner-driver understood the content of the education in driving theory, rather than just learning it (Franke et al., 1995).

In 1971 a new curriculum was introduced and two years later a new differentiated theory test was employed. The theory test was composed of a basic test and one or more supplementary tests. The basic test had to be taken by all test-takers, irrespective of the type of certificate applied for. The supplementary tests were selected according to the type of certificate (motorcycle, car/light truck, heavy truck etc.) applied for. The theory test was a written test that contained 80 multiple-choice items for AB-applicants (car/light truck). The basic test comprised 60 items and the cut-off score was set to 51 (85%).
The supplementary test for AB-applicants consisted of 20 items and the cut-off score was 15 (75%). The scoring model was conjunctive, which means that the test-taker had to pass both theory tests. The items consisted of a question and three options. Only one of the options was correct. The content of the test was not changed very often, so eventually test-takers came to know many of the items before the test (Franke et al., 1995; Spolander, 1974).

In 1989 the curriculum was changed (Trafiksäkerhetsverket, 1988) and both the practical and the theory test were altered. The practical test was meant to cover the content of the curriculum to a greater extent than before. Five areas of competence (speed, manoeuvring, placement, traffic behaviour and attentiveness) were introduced. The judgement of traffic situations in the practical test should be related to these competences. The judgement of the practical test was changed from an assessment where the test-taker obtained a grade on a scale from one to five, to an assessment where the test-taker either passed or failed the test.

In the field of the theory education, two new content areas were introduced (Trafiksäkerhetsverket, 1988). These areas focused on risky situations in traffic and the limitations of driver abilities. The driver education was extended and the theory education was planned to be more effective. The new objectives of the curriculum had to be covered by the new test, so at the same time as the new curriculum was introduced a new theory test was constructed. When the new curriculum was introduced, it was decided that the test-takers had to pass the theory test before they were allowed to take the practical test (Mattsson, 1990).

The new theory test was introduced in January 1990, nine months after the introduction of the curriculum (Mattsson, 1990). The test was administered in six versions and each version consisted of forty items. All items, except for one, were multiple-choice items. The one item that had a different item format consisted of four descriptions of the meaning of four traffic signs that should be paired with four out of eight pictures of traffic signs. The number of options in the new test was increased from three to four and the test-takers did not know how many options were correct. In order to get one point on an item the test-taker had to identify all the correct options. The test-taker did not get a point if he or she answered three out of four options correctly.
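To illustrate the all-or-nothing scoring of this item format, the small sketch below (Python) awards one point only when the selected options exactly match the keyed options; the option labels are invented for illustration.

```python
def score_multi_select_item(selected, correct):
    """All-or-nothing scoring of a 1990-style theory-test item.

    The test-taker earns one point only if the set of selected options
    is exactly the set of correct options; any missed or extra option
    gives zero points. Option labels are purely illustrative.
    """
    return 1 if set(selected) == set(correct) else 0

# Hypothetical item where options B and D are keyed as correct.
print(score_multi_select_item({"B", "D"}, {"B", "D"}))       # 1 point
print(score_multi_select_item({"B"}, {"B", "D"}))            # 0 points (one correct option missed)
print(score_multi_select_item({"A", "B", "D"}, {"B", "D"}))  # 0 points (extra option selected)
```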
The item format used in the test is rarely used in other countries, partly because of its inconsistency: the number of correct answers is sometimes known and sometimes unknown.

The content areas of the theory test were given different weights according to the curriculum. The weights of the different parts were regulated through the number of items. The content area that contained the most items was "traffic regulations". In order to pass the theory test, most of the criteria in the curriculum should be met. The test-taker was not allowed to be found lacking in any content area. The scoring of the test was both compensatory and conjunctive, which means that the test-takers could pass the test in two ways. One way to pass the test was if the test-taker's score was 36 out of 40 (90% of the total score) or higher. Another way to pass the test was if the test-taker reached the specific cut-off score for each content area and had a total score of 30 or more (Mattsson, 1993).

Table 2. Number of items and cut-off score for the different content areas of the theory test (1990-1999).

Content area                                   Number of items   Cut-off score
Traffic regulations                            14                11 (79%)
Risky situations in traffic                    8                 5 (63%)
Limitation of driver abilities                 8                 5 (63%)
Vehicle-related knowledge                      3                 1 (33%)
Special applications and other regulations     7                 4 (57%)
                                                                 + 4 items correct
Total                                          40                30 (75%) or 36/40 (90%)

The curriculum introduced in 1989 was used until 1996, when the curriculum was revised to include more environmental aspects (VVFS 1996:168). In June 1999 a new theory test was introduced. The new test had the same content areas as the old test but a different item format (VVFS 1999:32). The new test consists of sixty-five multiple-choice items with only one correct option for each item. The items mainly have four options, and they are distributed proportionally over the five content areas with the old theory test as a model, i.e. the relation between the content areas is the same in the new test as in the old theory test. "Traffic regulations" is still the area that contains the most items.
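To make the 1990-1999 pass rule concrete, the sketch below (Python) implements the two routes described above, using the cut-off scores from Table 2; the example score profile is invented.

```python
# A minimal sketch of the 1990-1999 pass rule described above and in Table 2:
# pass with 36/40 or more, or pass with 30/40 or more provided every
# content-area cut-off is also met. The example scores are hypothetical.

AREA_CUTOFFS = {
    "Traffic regulations": 11,                          # of 14 items
    "Risky situations in traffic": 5,                   # of 8 items
    "Limitation of driver abilities": 5,                # of 8 items
    "Vehicle-related knowledge": 1,                     # of 3 items
    "Special applications and other regulations": 4,    # of 7 items
}

def passes_theory_test_1990(area_scores):
    total = sum(area_scores.values())
    if total >= 36:                       # compensatory route: 90% of 40 items
        return True
    all_areas_met = all(area_scores[a] >= c for a, c in AREA_CUTOFFS.items())
    return total >= 30 and all_areas_met  # conjunctive route: 75% plus every area cut-off

# Hypothetical test-taker: 31 correct in total and every area cut-off met -> pass.
example = {
    "Traffic regulations": 12,
    "Risky situations in traffic": 6,
    "Limitation of driver abilities": 5,
    "Vehicle-related knowledge": 2,
    "Special applications and other regulations": 6,
}
print(passes_theory_test_1990(example))  # True
```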
Five try-out items that do not count toward the score are also put into each test. The cut-off score is set to 52 out of the 65 (80%), and the basis for this decision was that there should not be any change in the level of difficulty between the old and the new theory test. The scoring model is compensatory: lack of knowledge in some area can be compensated for with greater knowledge in other areas (Wolming, 2000b). There are various methods to use for standard setting, but the decision to set the cut-off score at 52 was not based on any of these methods (Wiberg & Henriksson, 2000). Instead, a statistical model based on data for the same test-takers taking both the old and the new theory test was used.

A practical test is set in order to test the four main parts of the curriculum that relate to practical driving. The performance of the test-taker is assessed with respect to five competences (VVFS 1996:168) that are related to the driver's awareness of risks in traffic. The first competence is the driver's speed adaptation in different situations in traffic. The second competence is the driver's ability to manoeuvre the vehicle. The third area of competence is the driver's placement of the vehicle in traffic. The fourth area is the driver's traffic behaviour and the fifth competence is the driver's attentiveness to various situations in traffic.

During the practical test different traffic situations are observed. These traffic situations are divided into five types: handling the vehicle, driving in a built-up area, driving in a non-built-up area, a combination of driving in a built-up and a non-built-up area, and driving in special conditions, e.g. darkness and slippery roads. The performance in these situations is related to the five competences mentioned earlier. If the test-taker fails in any competence the driving examiner notes in what traffic situation the error occurred. One error is sufficient to fail the test-taker.

In the following sections the process of test construction and important issues in test development will be considered. When constructing a test it is important to consider whether the performance of the test-taker is to be compared with the performance of other test-takers or with some external criterion, i.e. whether the test is a criterion- or a norm-referenced test.
Criterion-referenced and norm-referenced tests

In general, tests can provide information that aids individual and institutional decision-making. Tests can also be used to describe individual differences and the extent of mastery of basic knowledge and skills. These two general areas of test application lead to two approaches to measurement and, as a consequence, also two kinds of test: norm-referenced tests (NRT) and criterion-referenced tests (CRT). This formal differentiation of two general approaches to test construction and interpretation has its origin in an article by Glaser (1963), which outlines the two approaches.

The main difference between these approaches is that a CRT is used to ascertain a test-taker's status with respect to a well-defined criterion, whereas an NRT is used to ascertain a test-taker's status with respect to the performance of other test-takers on that test.

From a more detailed perspective, Popham (1990) also defines two major distinctions between CRT and NRT. The first relates to the criterion, with CRT focusing mostly on a well-defined criterion and a well-defined content domain. The specification and description of the domain, and the concept of domain, is described in terms of learner behaviours. The specification of the instructional objectives associated with these behaviours is central in CRT. The criterion is performance with regard to the domain. NRT focuses more on general content or process domains such as vocabulary and reading comprehension. Thus, the difference rests on a tighter, more complete definition of the domain for CRT, as compared to NRT. In some cases CRT also includes specification of the performance standard; this performance standard may, for example, take the form of specifying the number of items to be answered correctly or the number of objectives to be mastered.

The other major distinction relates to the interpretation of a test-taker's score. CRT describes the score with respect to a criterion and NRT describes the score with respect to the scores of other test-takers. An example of an NRT in Sweden is the SweSAT, which is used for selection to higher education (Andersson, 1999), and an example of a CRT is the national tests that are used as an aid in the grading procedure for teachers in upper secondary school (Lindström, Nyström & Palm, 1996).
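To illustrate the difference in score interpretation, here is a small sketch contrasting a criterion-referenced reading of a score (against a fixed cut-off) with a norm-referenced reading (against a group of other test-takers); the cut-off and the norm group are invented for illustration.

```python
import numpy as np

def crt_interpretation(score, cut_off):
    """Criterion-referenced reading: mastery is judged against a fixed
    cut-off, regardless of how other test-takers performed."""
    return "pass" if score >= cut_off else "fail"

def nrt_interpretation(score, norm_group_scores):
    """Norm-referenced reading of the same score: the percentile rank
    within a reference group of test-takers."""
    norm = np.asarray(norm_group_scores)
    return 100.0 * np.mean(norm < score)

# Illustrative numbers only (not taken from the theory test).
norm_group = np.random.default_rng(0).binomial(n=40, p=0.7, size=500)
print(crt_interpretation(31, cut_off=30))   # 'pass' against the criterion
print(nrt_interpretation(31, norm_group))   # percentile rank in the norm group
```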
A closer look at the theory test, the instructional objectives of the driver education (as they are defined in the curriculum issued by the SNRA) and the interpretation of test scores leads to the conclusion that the theory test can be characterised as a CRT. The curriculum represents the criterion and the theory test consists of five different parts (Table 1) that are connected to the curriculum. The purpose of the test is to determine if a test-taker has acquired a certain level of knowledge compared with the defined criterion, and standard setting is used to define this level of knowledge (Mattsson, 1993).

Important issues in test development

The first and most important step in test development is to define the purpose of the test or the nature of the inferences intended from test scores. The measurement literature is filled with breakdowns and classifications of purposes of tests, and in most cases the focus is on the decision, i.e. the decision that is made on the basis of the test information (see for example Bloom, Hastings and Madaus, 1971; Mehrens & Lehman, 1991; Gronlund, 1998; Thissen & Wainer, 2001). The setting for the theory test is that this test is used to make decisions about test-taker performance with reference to an identified curricular domain. A curricular domain is defined here as the skills and knowledge intended or developed as a result of formal, or non-formal, instruction on identifiable curricular content.

Test specification

When the purpose of the test is clarified, the next logical step in test development is to specify important attributes of the test. Test content is, in most cases, the main attribute. Other important attributes include, for example, test and item specifications, item format and design of the whole test, as well as psychometric characteristics, evaluation and selection of items, and standard-setting procedures. These attributes are also dependent on external factors, such as how much testing time is available and how the test can be administered. Millman & Green (1989), for example, distinguished between external contextual factors (for instance who will be taking the test and how the test will be administered) and internal test attributes (for instance, desired dimensionality of the content and distribution among content components, item formats, evaluation of items, and desirable psychometric characteristics of both individual items and the whole test).
With reference to internal attributes, Henriksson (1996a) also made a distinction between two kinds of models, a theoretical model and an empirical model. The theoretical model is based mainly on judgements but also on statements about, for example, the number of items in the test and the item type, and the empirical model is based on empirical data describing psychometric characteristics of the items as well as of the whole test. The theoretical and the empirical model are summarised in test specifications.

One effective way to ensure adequate representation of items in a test is to develop a two-way grid called a test blueprint or a table of specifications (Nitko, 1996; Haladyna, 1997). In most cases the two-way grid includes content and the types of mental behaviours required of the test-taker when responding to each item. Haladyna (1999), for example, suggested that all content can be classified as representing one of four categories: fact, concept, principle, or procedure. He also defined five cognitive operations: recall, understanding, prediction, evaluation and problem solving. Another well-known hierarchical system is the taxonomy by Bloom (1956), consisting of six major categories. This hierarchical system has also been elaborated (Andersson et al., 2001). However, the behaviour dimensions should not be too complex, and it can be claimed that the Bloom taxonomy has never achieved any great success as a tool for test construction, maybe because it is too complex. Perhaps the revised model will be a step forward in that respect?

But, as Henriksson (1996a) pointed out, the matrix schemes for the composition of a test need not be limited to the dimensions of content and process. More dimensions can be added by considering, for example, the item's reading level, the amount of text and the formation of distractors. Other factors that can also be considered are surplus information, degree of non-verbal information, abstract versus concrete content, and so forth. The effort to create these theoretical attributes and to establish a theoretical model for the test is based on judgements by experts. It can also be added that these added dimensions give more guidance for the test developer and, at the same time, make the model for the whole test more exact.
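As an illustration of such a two-way grid, the sketch below crosses the theory test's five content areas with Haladyna's cognitive operations; the item counts are invented and serve only to show the structure of a blueprint.

```python
# A two-way grid (test blueprint) as a simple nested mapping:
# rows are the theory test's content areas, columns are Haladyna's (1999)
# cognitive operations. All item counts below are invented for illustration.
CONTENT_AREAS = [
    "Traffic regulations",
    "Risky situations in traffic",
    "Limitation of driver abilities",
    "Vehicle-related knowledge",
    "Special applications and other regulations",
]
COGNITIVE_OPERATIONS = ["recall", "understanding", "prediction", "evaluation", "problem solving"]

blueprint = {area: {op: 0 for op in COGNITIVE_OPERATIONS} for area in CONTENT_AREAS}
blueprint["Traffic regulations"]["recall"] = 8            # hypothetical allocation
blueprint["Traffic regulations"]["understanding"] = 6
blueprint["Risky situations in traffic"]["prediction"] = 5

def items_per_area(bp):
    """Row totals: how many items each content area receives in the blueprint."""
    return {area: sum(cells.values()) for area, cells in bp.items()}

print(items_per_area(blueprint))
```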
Item specifications

There is a dependence between test and item specifications, since the theoretical as well as the empirical model for a certain test are related to the attributes of the items. Therefore, most of the item specifications are outlined when the test specifications are defined. An item specification includes sources of item format, item content, descriptions of the problem situations, characteristics of the correct response and, in the case of multiple-choice items, characteristics of the incorrect responses. The use of item specifications is particularly advantageous when a large item pool is to be created and when different item writers will construct the items. If each writer sticks to the item specification, a large number of parallel items can be generated for an objective within a relatively short time (Crocker & Algina, 1986).

Different types of information should be stored for each item. First, information used to access the item from a number of different points of view should be stored. This information usually consists of keywords describing the item content, its curricular content, its behavioural classification and any other salient features, for example the textual and graphical portions of the item. Different kinds of judgements by experts give this theoretical information (Henriksson, 1996b). Second, psychometric data should be stored, such as the item difficulty and item discrimination indices. Third, and of relevance to the theory test, the number of times the item has been used in a given period, the date of the last use of the item, and identification of the last test version the item appeared in, i.e. different indices of exposure for each item.

It should also be noted that the storage of empirical item statistics represents a measurement problem. Under classical test theory, item statistics are group dependent and, therefore, must be interpreted within the context of the group tested (Linn, 1989). It should also be mentioned that when item response theory (IRT) is used as a basis for empirical item statistics this disadvantage of group dependence is eliminated, i.e. it is possible to characterise or describe an item independently of any sample of test-takers who might respond to the item (see for example Lord, 1980; Hambleton et al., 1991; Thissen & Orlando, 2001).
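The three kinds of information described above could be collected in a single item record; the sketch below is one possible layout, with field names and values invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class ItemRecord:
    """One entry in an item pool, mirroring the three kinds of stored
    information discussed above. Field names are invented for illustration."""
    # 1. Descriptive (theoretical) information from expert judgement
    item_id: str
    content_keywords: List[str]
    curricular_area: str
    behavioural_classification: str
    # 2. Psychometric data from try-outs or operational use
    difficulty_p: Optional[float] = None          # proportion answering correctly
    discrimination_rpbis: Optional[float] = None  # point-biserial with total score
    # 3. Exposure indices, of particular relevance to the theory test
    times_used: int = 0
    last_used: Optional[date] = None
    last_test_version: Optional[str] = None

item = ItemRecord(
    item_id="TR-0042",
    content_keywords=["right of way", "roundabout"],
    curricular_area="Traffic regulations",
    behavioural_classification="understanding",
    difficulty_p=0.78,
    discrimination_rpbis=0.35,
    times_used=3,
    last_used=date(2003, 11, 5),
    last_test_version="version F",
)
```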
Item format

Generally speaking, the test developer faces the issue of what to measure and how to measure it. For most large-scale testing programmes, test blueprints and cognitive demands specify content and demands in terms of what to measure. Regarding the question of how to measure, one dilemma facing test developers is the choice of item format. This issue is, according to Rodriguez (2002), significant in a number of ways. One factor is that interpretations vary according to item format, and a second factor is that the cost of scoring open-ended questions can be enormous compared with multiple-choice items. A third factor is that the consequences of using a certain item format may affect instruction in ways that foster, or hinder, the development of the cognitive skills measured by tests. The significance of format selection is also related to validity, either as a unitary construct (Frederiksen & Collins, 1989; Messick, 1989) or as an aspect of consequential validity (Messick, 1994).

In view of the statements in the previous paragraph, the conclusion is that it is useful to distinguish between what is measured and how it is measured; between substance and form; between content and format. The two are not independent, for form affects substance and, to some extent, substance dictates form. Nevertheless, the emphasis here is on form; on how items are presented. First, a set of attributes of item formats is offered that can serve to classify item types. Second, the importance of an item's format is discussed: its relationship to what is measured and its effect on item parameters (Linn, 1989).

The issues surrounding item format selection, and test design more generally, are also critically tied to the nature of the construct being measured. In line with this statement, Martines (1999), reviewing the literature on cognition and the question of item format, concluded that no single format is appropriate for all educational purposes. Referring to the driving-license test, we might assert that driving ability can (and should) be measured via a driving-ability performance test and not a multiple-choice exam, but knowledge about driving (procedures, regulations and local laws) can be measured by a multiple-choice exam.

The item format is described in the item specifications. For optimal performance tests (for example the theory test) there is a variety of item formats that could be considered.
The item formats can be divided into two major categories: those that require the test-taker to generate the response and those that provide two or more possible responses and require the test-taker to make a selection. Because the latter can be scored with little subjectivity, they are often called objective test items (Crocker & Algina, 1986).

It is also worth mentioning that open-ended questions, i.e. questions for which the test-taker constructs the answer using his or her own words, are often preferred because of a belief that they may directly measure some cognitive process more readily, or because of a belief that they may more readily tap a different aspect of the outcome domain. The consequence has been that popular notions of authentic and direct assessment have politicised the item-writing profession (Rodriguez, 2002). This tendency to include less objective formats in tests gives rise to subjectivity; this conclusion is based on the fact that multiple-choice items can be scored with considerable certainty and with objectivity. But the crucial question is whether multiple-choice items and open-ended items measure the same cognitive behaviour or not. Rodriguez (2002, p. 214) briefly formulated his standpoint in the following way: "They do if we write them to do so".

In line with the arguments for multiple-choice items, Ebel (1951, 1972) suggested that every aspect of cognitive educational achievement is testable through the multiple-choice format (or true-false items). His conclusion is also that the things measured by these items are far more determined by their content than by their form. Many recent authors refer to the wise advice in Ebel's writing regarding test development and item writing; see for example Carey (1994); Osterhof (1994); Kubiszyn & Borich (1996); Payne (1997); McDonald (1999).

Evaluation of items

The problem of deciding which items to use in a test is related to the theoretical and empirical model as well as to the test and item specifications. The summarised conclusion is that quality items are desired. Consequently, evaluation and judgement procedures based on theoretical and empirical data are important to weed out flawed items.
An often-used procedure in item construction is that external item writers deliver items, which are then examined and scrutinised by test developers. This model is used by the SNRA. The result of this is that in many cases the item writers' proposals have to be changed in one way or another in order to meet the requirements for good items. The test developer, who is an expert in test and item construction, makes these changes and improvements. When this process is finished, item evaluation is the next step.

The term theoretical evaluation is used for the process in which the items are judged against stated and defined criteria. The procedure requires that the items are written but not necessarily administered to a representative sample of test-takers in a try-out. Common to all methods for theoretical evaluation is that one or more judges evaluate items against the criteria. A decision must be made about which criterion or criteria should be addressed, and the priority between those criteria. Techniques and methods for evaluating the judgements must be decided upon as well. This process of judgement can relate to the item per se as well as to the theoretical and empirical model for the test. Henriksson (1996b) defined and described accuracy, difficulty, importance, bias and conformity as assessment criteria. The judgement can also be focused on the classification of items according to item parameters. These item parameters are included in the theoretical component of the total model for the test, and in this respect the basic aim of the judgement and evaluation is to get indications about the reliability of classification.

To obtain information about certain items, item analysis is used. Item analysis is the computation of the statistical properties of an item response distribution. Item difficulty (p) is the proportion of test-takers answering the item correctly. Item discrimination is used to assess how well performance on one item relates to some other criterion, e.g. the total test score. Two statistics that are commonly used are the point-biserial correlation (rpbis) and the biserial correlation coefficient (rbis) (Crocker & Algina, 1986).
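As a small illustration of these statistics, the sketch below computes item difficulty p and the point-biserial correlation between each item and the total score; the 0/1 response matrix is invented try-out data.

```python
import numpy as np

def item_analysis(responses):
    """Classical item analysis on a 0/1 matrix (rows = test-takers, columns = items).

    Returns item difficulty p (proportion correct) and the point-biserial
    discrimination, i.e. the Pearson correlation between each item score
    and the total test score, as described in the text above.
    """
    responses = np.asarray(responses, dtype=float)
    p = responses.mean(axis=0)
    total = responses.sum(axis=1)
    n_items = responses.shape[1]
    r_pbis = np.array([np.corrcoef(responses[:, j], total)[0, 1] for j in range(n_items)])
    return p, r_pbis

# Invented try-out data: 6 test-takers, 4 items.
data = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
difficulty, discrimination = item_analysis(data)
print(difficulty)       # e.g. the fourth item is answered correctly by 4 of 6 test-takers
print(discrimination)
```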
Try-out

It is important to pre-test items before they are put into the actual test, since it is difficult to anticipate how an item will work there. Before the try-out is carried out it is important to describe what information the try-out should result in. It is also important to be aware that some items will probably not be good enough and that several try-outs may be necessary to end up with a collection of good items. If the test will consist of several parallel test versions, an extensive domain of pre-tested items is required.

When selecting the group for the try-out it is important to consider whether they are representative of the group that takes the actual test. One should also consider their motivation to do the test and the size and availability of the group. Of course there are many reasons why the apparent difficulty might be expected to change between item try-out and actual testing. One might, for example, expect the test-takers to be more motivated during the actual testing, or one might believe that there were changes in instruction during the intervening period.

The try-out can be done separately from the actual test, or in combination with items in the actual test. If the try-out items are a part of the actual test the test-takers can either be informed that they are working with try-out items or not. The advantage of this design is that the try-out is done in the proper group of test-takers and that they are probably fully motivated.

Validity

The traditional approach to validity implies that validity is classified into three different types: content-related evidence of validity, criterion-related evidence of validity and construct-related evidence of validity.

Content-related evidence of validity refers to the extent to which the content of test items represents the entire body of content. This body of content is often called the content universe or domain. The basic issue in content validation is representativeness. In other words, how adequately does the content of the test represent the entire body of content to which the test user intends to generalise? The word "content" refers, in this context and according to Anastasi (1988), to both the subject matter included in the test and the cognitive processes that test-takers are expected to apply to the subject matter. Hence, in collecting content-related evidence of validity it is necessary to determine what kinds of mental operations are elicited by the problems presented in the test, as well as what subject-matter topics have been included or excluded.
The key ingredient in securing content-related evidence of validity is human judgement.

Criterion-related evidence of validity is based on the extent to which the test score allows inferences about performance on a criterion variable. In this context the criterion is the variable of primary interest. If the information about the criterion is available at the same time as the test information, the validity is called concurrent validity. Concurrent evidence of validity is, for example, frequently used to establish that a new test is an acceptable substitute for a more expensive measure. If the criterion information is available after a certain time, for example a year or more, the validity is called predictive validity. Thus, predictive evidence of validity refers to how well a test predicts or estimates some future performance on a certain criterion. The degree to which scores on the test being validated predict successful performance on the criterion is estimated by a correlation coefficient. This coefficient is called the validity coefficient.

Construct-related evidence of validity refers to the relation between the test score and a theoretical construct, i.e. a measure of a psychological characteristic of interest. Theoretical constructs are, for example, intelligence, critical thinking, creativity, introversion, self-esteem, aggressiveness and achievement motivation. Reasoning ability, reading comprehension, mathematical reasoning ability and scholastic aptitude are other examples of constructs. Such characteristics are referred to as constructs because they are theoretical constructions about the nature of human behaviour.

Construct validation is the process of collecting evidence to support the assertion that a test measures the construct that it is supposed to measure. Construct-related evidence of validity can seldom be inferred from a single empirical study or from one logical analysis of a measure. Rather, judgements of validity must be based on an accumulation of evidence. Construct-related evidence of validity is investigated through rational, analytical, statistical and experimental procedures. The development or use of theory that relates various elements of the construct under investigation is central. Hypotheses based on theory are derived and predictions are made about how the test scores should relate to specified variables. In a classic article, Cronbach and Meehl (1955) suggested five types of evidence that might be assembled in support of construct validity. These types were also succinctly stated by Helmstadter (1964) and Payne (1997).
Both content-related evidence of validity and criterion-related evidence of validity are used in this process. In that sense, content validation and criterion validation become part of construct validation.

This latter conclusion (i.e. that content-related, criterion-related and construct-related evidence of validity are not separate and independent types of validity, but rather different categories of evidence that are each necessary and cumulative) represents the integrated view of validity. This integrated and unitary view of validity is described, for example, in Messick's (1989) treatment of validity. Recent trends in validation research have also stressed that validity is a unitary concept (see, for example, Wolming, 2000a; Nyström, 2004). Thus, validity-related evidence concerns the extent to which test scores lead to adequate and appropriate inferences, decisions and actions. It concerns evidence for test use and judgements about the potential consequences of score interpretation and use. However, it can also be added that, in a very real sense, validity is not strictly a characteristic of the instrument itself but of the inferences that are to be made from the test scores derived from the instrument.

Reliability

When a test is administered, the test user would like some assurance that the test is reliable and that the results could be replicated if the same individuals were tested again under similar conditions (Crocker & Algina, 1986). Reliability refers to the degree to which test scores are free from errors of measurement. There are several procedures for estimating test-score reliability.

The alternate-form method requires constructing two similar versions of a test and administering both versions to the same group of test-takers. In this case, the errors of measurement that primarily concern test users are those due to differences in content between the test versions. The correlation coefficient between the two sets of scores is then computed (Crocker & Algina, 1986). If two versions of a test measured exactly the same trait and measured it consistently, the scores of a group of individuals on the two test versions would show perfect correlation. The lack of perfect correlation between test versions is due to errors of measurement. The greater the errors of measurement, the lower the correlation (Wainer et al., 1990).
The test-retest method is used to check how consistently test-takers respond to the test at different times. In this situation the measurement errors of primary concern are fluctuations of a test-taker's observed score around the true score because of temporary changes in the test-taker's state. To estimate the test-retest reliability the test constructor administers the test to a group of test-takers, waits, and readministers the same test to the same group. Then the correlation coefficient between the two sets of scores is estimated.

Internal consistency is an index of both item homogeneity and item quality. In most testing situations the examiner is interested in generalizing from the specific items to a larger content domain. One way to estimate how consistently the performance of the test-takers relates to the domain of items that might have been asked is to determine how consistently the test-takers performed across items or subsets of items on a single test version. The internal consistency estimation procedures estimate the correlation between separately scored halves of a test. It is reasonable to think that the correlation between subsets of items provides some information about the extent to which they were constructed according to the same specifications. If test-takers' performance is consistent across subsets of items within a test, the examiner can have some confidence that this performance would generalize to other possible items in the content domain (Crocker & Algina, 1986).

The techniques for estimating reliability mentioned above have been developed largely for norm-referenced measurement. Other techniques have been suggested for criterion-referenced tests. Crocker and Algina (1986) presented some reliability coefficients for criterion-referenced measurement. Wiberg (1999a) found that the statistical techniques used to evaluate reliability in norm-referenced tests could also be used to evaluate reliability in criterion-referenced tests. However, the usage and interpretation of the results must be handled with caution. The variation in test scores among test-takers constitutes an important foundation for the statistical techniques estimating reliability in norm-referenced tests. Only when the items in a criterion-referenced test fulfil the assumptions underlying classical test theory would it be advisable to use these statistical methods.
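The alternate-form and split-half procedures can be sketched in a few lines; the scores and response matrix below are simulated purely for illustration, and the Spearman-Brown step-up applied to the split-half correlation is standard practice rather than something prescribed in the text above.

```python
import numpy as np

def alternate_form_reliability(scores_version_a, scores_version_b):
    """Alternate-form estimate: the correlation between the same group's
    scores on two versions of the test."""
    return np.corrcoef(scores_version_a, scores_version_b)[0, 1]

def split_half_reliability(responses):
    """Internal-consistency estimate: correlate separately scored halves
    (here odd- vs even-numbered items) of one 0/1 response matrix, then
    apply the Spearman-Brown step-up to full test length (an assumption,
    not spelled out in the text)."""
    responses = np.asarray(responses, dtype=float)
    half_1 = responses[:, 0::2].sum(axis=1)
    half_2 = responses[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(half_1, half_2)[0, 1]
    return 2 * r_half / (1 + r_half)

# Simulated data: total scores on two versions, and a 50 x 20 response matrix
# generated from a common ability so that the items hang together.
rng = np.random.default_rng(1)
version_a = rng.binomial(40, 0.7, size=50)
version_b = version_a + rng.integers(-3, 4, size=50)   # similar but not identical scores
ability = rng.normal(size=(50, 1))
prob_correct = 1 / (1 + np.exp(-(ability - rng.normal(size=(1, 20)))))
responses = rng.binomial(1, prob_correct)

print(alternate_form_reliability(version_a, version_b))
print(split_half_reliability(responses))
```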
Parallel test versions

If a test has two or more versions and the test-taker's score from the test is used for decisions (which is the case for the theory test), all of them must be parallel. This means that the different versions contain different items but are built to the same test and item specifications and the same models. From the perspective of a test-taker this means that the obtained test result should be exactly the same, irrespective of the version that is administered. The need for parallel test versions is motivated by the need for test security and for the sake of fairness. It is also a fundamental requirement if repeated test taking is permitted.

There are formal and theoretical definitions of parallel test forms (see for example Thissen & Wainer, 2001) and sometimes a distinction is made between parallel, equivalent and alternate forms. But it can also be added that, for example, Hanna (1993) used parallel, equivalent and alternate forms synonymously. Thus, parallel, equivalent or alternate forms¹ have identical weight allocations among topics and mental processes, but the particular test questions differ. Ideally, parallel test versions should have equivalent raw score means, variability, distribution shapes, reliabilities, and correlations with other variables. To estimate the reliability between two or more versions of the same test the alternate-form method is used (Crocker & Algina, 1986). If the versions are parallel regarding item difficulty there is a high correlation between them.

When put into practice, however, the construction and evaluation of parallel test versions give rise to a number of problems, and it is necessary to examine the properties two versions should have that would qualify them for interchangeable use. The concept of parallel versions sets the ground for a discussion of practical problems in constructing two (or more) parallel test versions that we are willing to regard as interchangeable.

¹ The term parallel test versions is used in this report.
Standard setting

The idea of standard setting is to find a method that minimises the number of wrong decisions about the test-taker. There are two types of wrong decision. The first is that a test-taker who does not have the knowledge passes the test. The other is that a test-taker who has the knowledge fails the test (Berk, 1996).

The cut-off score of a test represents a line between confirmed knowledge and a lack of knowledge in a certain area. If the test-taker's total score is equal to or higher than the cut-off score, he or she has the knowledge that is measured by the test. If the test-taker's total score is less than the cut-off score, he or she does not have the knowledge measured by the test (Crocker & Algina, 1986).

There are various methods that can be used in standard setting, and different methods are suitable depending on the format of the test (Berk, 1986). The methods can be categorised according to their definition of competence. Some methods assume that the test-takers either have the knowledge or they do not. Other methods view competence as a characteristic that is continuously distributed, so that a test-taker's knowledge can be seen as a value within an interval in this distribution. These latter methods of standard setting can be divided into different groups depending on the amount of judgement in the decision. Jaeger (1989) proposed two main categories that are based on performance data of the test-takers: test-centred continuum models, which are mainly based on judgements, and examinee-centred continuum models, which are mainly based on the test-takers' performance on the test. In addition to these models there are judgemental continuum models that are mainly based on judgement. In the last few years a fourth category, "multiple models", has been introduced. This category is used for standard setting when the test has multiple item formats or multiple cut-off scores.

It can also be added that there are basically three general methods for applying standards: disjunctive, conjunctive and compensatory (see for example Gulliksen, 1950; Mehrens, 1990; Haladyna & Hess, 1999). In the disjunctive and conjunctive approaches, performance standards are set separately for the individual assessments, for example a subtest. In the compensatory procedure, performance standards are set for a composite or index that reflects a combination of subtest scores.
With the disjunctive model, test-takers are classified as an overall pass if they pass any one of the subtests by which they are assessed. This approach is applied rather seldom and seems most appropriate when the subtests involved in a test battery are parallel versions, or in some other way are believed to measure the same construct. Haladyna & Hess (1999), for example, pointed out that the disjunctive approach is employed in assessment programmes that allow a test-taker to retake a failed test.

With a conjunctive model for decision-making, test-takers are classified as having passed only if they pass each of the subtests by which they are assessed. The use of the conjunctive approach seems most appropriate when the subtests assess different constructs, or aspects of the same construct, and each aspect of the construct is highly valued. Failing only one assessment yields an overall fail because the content standards measured by each assessment are considered essential to earn an overall pass. The application of a conjunctive strategy to standard setting results in test-takers being classified into the lowest category attained on any one measure employed.

With a compensatory model, test-takers are classified as pass or fail based on performance standards set on a combination of the separate subtests employed. Data are combined in a compensatory approach by means of an additive algorithm that allows high scores on some subtests to compensate for low scores on others. The use of a compensatory strategy seems, according to Ryan (2002), appropriate when the composite of the separate subtests has important substantive meaning, a meaning that is not represented by the subtests taken separately.

A useful combination of the compensatory and conjunctive models can also be employed. Such an approach sets minimal standards on each subtest that are applied in a conjunctive fashion. This means that the test-taker must reach a minimal pass level on each subtest before a compensatory approach is applied and a final rating is determined. This combined conjunctive-compensatory approach sets minimum standards that are necessary on each subtest but not sufficient for the subtests taken together. This approach prevents very low levels of performance on one subtest being balanced by exceptional performance on other subtests (Mehrens, 1990).
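The three general models, and the combined conjunctive-compensatory variant, can be expressed as small decision functions; the subtest names, subtest cut-offs and composite cut-off below are invented purely for illustration.

```python
# Decision rules for the four approaches discussed above. Subtest names,
# cut-offs and the composite cut-off are invented for illustration.
SUBTEST_CUTOFFS = {"subtest_1": 10, "subtest_2": 12, "subtest_3": 8}
COMPOSITE_CUTOFF = 36

def disjunctive_pass(scores):
    # Overall pass if ANY subtest is passed.
    return any(scores[s] >= c for s, c in SUBTEST_CUTOFFS.items())

def conjunctive_pass(scores):
    # Overall pass only if EVERY subtest is passed.
    return all(scores[s] >= c for s, c in SUBTEST_CUTOFFS.items())

def compensatory_pass(scores):
    # Overall pass if the additive composite reaches its cut-off;
    # high scores on some subtests can offset low scores on others.
    return sum(scores.values()) >= COMPOSITE_CUTOFF

def conjunctive_compensatory_pass(scores):
    # A minimal standard on every subtest is necessary but not sufficient;
    # the composite cut-off must also be reached.
    return conjunctive_pass(scores) and compensatory_pass(scores)

scores = {"subtest_1": 11, "subtest_2": 13, "subtest_3": 14}
print(disjunctive_pass(scores), conjunctive_pass(scores),
      compensatory_pass(scores), conjunctive_compensatory_pass(scores))
```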
Test administration

There are basically three different ways of presenting items to the test-takers: by paper-and-pencil tests, computerised tests or computerised adaptive tests.

In a paper-and-pencil test all test-takers get the same number of items. The items are answered with a pencil on paper. The test is often administered to a large number of test-takers a limited number of times because of item exposure. The item analysis is mainly done with classical test theory (Wainer et al., 1990).

A computerised test is largely the same as a paper-and-pencil test. The difference is that a computerised test is carried out on a computer, which makes it possible to randomise the order of the items and the options for each test-taker. An advantage of computerised tests is that the administration takes less time, since the scoring can be done during the test. Another advantage is that the security of the test is increased when there are no paper copies of the test.

With computerised tests there is the possibility of using new innovative types of items (van der Linden & Glas, 2000). Different types of items are created from combinations of item format, response actions, media and interactivity. An example of an item format is a multiple-choice item. A response action could be that the test-taker answers the item with a joystick. The items can contain different media, for example animations and sound. Media can be used both in the item and in the options. An example of interactivity is an item where the test-taker can answer the item by marking a text or a point.

It is important to be aware of new measurement errors that can occur in computerised testing. For example, a computerised test could imply problems for test-takers who are not used to working with computers. Another possible measurement error is that bad graphics on the computer monitor can result in blurred pictures.

In computerised adaptive tests (CAT) the test-taker obtains items of different difficulty depending on how the person answered the previous items in the test.
CAT makes it possible to give the test-takers a test that fits their ability (Umar, 1997). Which items are selected for a test-taker also depends on the content and difficulty of the test and on the item discrimination. For each response the test-taker gives, the computer program estimates the test-taker's ability and how reliable the estimate is. When the predetermined reliability is achieved the test is finished and the test-taker obtains the final estimate of his or her ability level (Wainer et al., 1990).

Tests based on CAT are often analysed with item response theory (IRT). IRT can be used to describe test-takers, items and the relation between them. IRT takes into account that the items in a test can vary with respect to item difficulty. There are different IRT models that can be used to create scale points. The one-parameter logistic (or Rasch) model is the simplest model, where only an "item difficulty" parameter is estimated. The two-parameter logistic model estimates not only a "difficulty" parameter but also a "discrimination" parameter. The three-parameter logistic model includes a "guessing" parameter as well as "discrimination" and "difficulty" parameters (Birnbaum, 1968).

There are three basic assumptions in IRT (Crocker & Algina, 1986). The test has to be unidimensional, which means that all items measure the same trait. The assumption of local independence means that the answer on one item by a randomly picked test-taker is independent of his or her answers on other items. The third assumption is that the relationship between the proportion of test-takers who answered an item correctly and the latent trait can be described by an item characteristic curve for each item. With IRT models we can determine the relationship between a test-taker's score on the test and the latent trait, which is assumed to determine the test-taker's result on the test. A test-taker with higher ability is more likely to answer an item correctly than a test-taker with lower ability. If these three conditions are met, test-takers can be compared even if they did not take parallel test versions.
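A brief sketch of the three logistic models as item characteristic curves follows; the parameter values are invented, and the one- and two-parameter models appear as special cases of the three-parameter function.

```python
import numpy as np

def three_pl(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct answer under the three-parameter logistic model.

    With guessing=0.0 this reduces to the two-parameter model, and with
    guessing=0.0 and discrimination=1.0 to the one-parameter (Rasch) model.
    The parameter values used below are invented for illustration.
    """
    return guessing + (1.0 - guessing) / (1.0 + np.exp(-discrimination * (theta - difficulty)))

abilities = np.linspace(-3, 3, 7)
print(three_pl(abilities, difficulty=0.5))                                    # 1PL / Rasch
print(three_pl(abilities, difficulty=0.5, discrimination=1.3))                # 2PL
print(three_pl(abilities, difficulty=0.5, discrimination=1.3, guessing=0.2))  # 3PL
```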
Item bank

An item bank is a collection of items that can be used to construct a computerised test or a test based on CAT. An item bank should consist of a large number of pre-tested items so that varied tests can be constructed.