Prototype and Exemplar-Based Information in Natural Language Categories
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Journal of Memory and Language 42, 51–73 (2000) Article ID jmla.1999.2669, available online at http://www.idealibrary.com on Prototype and Exemplar-Based Information in Natural Language Categories Gert Storms, Paul De Boeck, and Wim Ruts University of Leuven, Leuven, Belgium Two experiments are reported in which four dependent variables; typicality ratings, response times, category-naming frequencies, and exemplar-generation frequencies of natural language concepts, were predicted by two sorts of prototype predictors and by an exemplar predictor related to Heit and Barsalou’s (1996) instantiation principle. In the first experiment, the exemplar predictor was com- pared to a prototype predictor calculated as in Hampton (1979). The four dependent variables were either predicted better by the exemplar measure than by the prototype predictor or the predictive value was about equal. In the second experiment, a new prototype predictor was calculated based on Rosch and Mervis’ (1975) classic family resemblance measure. The results showed that the exemplar predictor accounted better for the dependent variables than Hampton’s and Rosch and Mervis’ prototype measures. The differences between the prototype measures were not significant. © 2000 Academic Press Key Words: natural language concepts; prototype theory; exemplar models; family resemblance. The classical view of semantic concepts based on whether an item possesses enough of states that a concept can be described in terms these features. The exemplar view essentially of defining features that are singly necessary states that a category is represented by particu- and jointly sufficient (e.g., Sutcliffe, 1993). lar instances that have previously been encoun- Ample evidence has been provided against this tered. A new item is assumed to be judged an view (see, e.g., Komatsu, 1992; Smith & Me- instance of a category to the extent that it is din, 1981, for overviews.) Two major alterna- sufficiently similar to one or more of the in- tive theories have been formulated: the proto- stance representations stored in memory. type view and the exemplar view. In the Different procedures have been proposed to prototype view (Rosch, 1975a, 1978, 1983; derive prototype measures for natural language Rosch & Mervis, 1975), it is assumed that cat- concepts (e.g., Hampton, 1979; Rosch & egories are represented by a set of features Mervis, 1975; see also Barsalou, 1990). Also, which may carry more or less weight in the different variants of the exemplar view have definition of the prototype, and categorization is been presented in the literature, depending on the assumptions made about the number and This research project was supported by Grant 2.0073.94 nature of the instances stored, about the pres- from the Belgian National Science Foundation (Fundamen- ence or absence of forgetting, and so on (Bar- tal Human Sciences) to P. De Boeck, I. Van Mechelen, and D. Geeraerts. Part of this research was conducted while the salou, 1990). In some theories it is assumed, for first author was visiting Doug Medin at Northwestern Uni- example, that every instance that is encountered versity. Their hospitality is gratefully acknowledged. We is also stored (e.g., Reed, 1972), while in others thank the following for discussion and advice in the course it is assumed that only the most typical in- of the research and in the preparation of this manuscript: Lawrence Barsalou, Dedre Gentner, Lloyd Komatsu, Doug- stances are stored (e.g., Rosch, 1975b) or that las Medin, Iven Van Mechelen, Ed Wisniewski, and one many instances are stored to varying degrees of anonymous reviewer. All the data described in this manu- completeness (see, Komatsu, 1992, for more script can be obtained from the first author (in Excel files) details). upon simple request. Address reprint requests to Psychology Department, Uni- The evidence for exemplar models mostly versity of Leuven, Tiensestraat 102, B-3000 Leuven, Belgium. consists of category learning data, obtained with E-mail address: Gert.Storms@psy.kuleuven.ac.be. tasks in which subjects learn new categories of 51 0749-596X/00 $35.00 Copyright © 2000 by Academic Press All rights of reproduction in any form reserved.
52 STORMS, DE BOECK, AND RUTS artificial stimuli. In the context of natural lan- servation of real-world objects, and thus, that guage concepts, prototype and exemplar-based information is encoded in the same way (Malt & predictors have, to our best knowledge, not yet Smith, 1984). been compared. To our knowledge, the study of Heit and In this paper, we want to focus on the extent Barsalou (1996) is the only attempt to test a to which category-based variables (typicality model compatible with most exemplar models ratings, latency of category decision, exemplar- in the context of natural language concepts. Heit generation frequencies, and category-naming and Barsalou proposed an instantiation model in frequencies) derived from natural language which it is essentially assumed that people gen- concepts can be predicted from prototype mea- erate instantiations of a category to base cate- sures and exemplar-based measures. Before de- gory-related decisions on. More specifically, scribing the empirical work, we will elaborate Heit and Barsalou predicted the typicality of on the differences between natural language lower level concepts (like, e.g., mammals) concepts as they are used by adult language within a higher level concept animal from the users on the one hand and concepts related to typicality of instances of the lower level concept categories as they appear in laboratory studies (like, e.g., dog, horse, cat) within the higher in which subjects learn new categories of arti- level concept animal. ficial stimuli on the other hand. In contrast with the exemplar view, much EXEMPLAR AND PROTOTYPE MODELS evidence in favor of the prototype view stems FOR ARTIFICIAL CATEGORIES AND from natural language concepts. As Hampton NATURAL LANGUAGE CONCEPTS (1979) and Rosch and Mervis (1975) demon- strated, prototype-based predictions succeed in In the past, many studies have compared pro- explaining a considerable portion of the vari- totype and exemplar-based predictions to ex- ance in typicality judgments for natural lan- plain categorization and category learning, guage concepts. Applying the exemplar models mostly using perceptual stimuli (see Nosofsky, to natural language categories, however, is not 1992). Only a few studies have been published straightforward. An important problem in order in which the exemplars of the (new) categories to be learned were verbally described (e.g., a set for the exemplar view to be tested for natural of symptoms of fictitious patients in Medin, categories is that “it is not entirely clear what an Altom, Edelson, & Freko, 1982, or a set of exemplar representation is” (Komatsu, 1992, p. features of fictitious persons in Hayes-Roth & 507). At one extreme, an exemplar representa- Hayes-Roth, 1977), and, although the stimuli tion might be a family resemblance representa- were verbal, they are still somewhat artificial, as tion that abstracts across different specific in- were the categories. As it cannot be taken for stances. In this view, the concept fish may granted that the nature of categorization learn- consist of the set of representations of trout, ing using artificial categories parallels the learn- goldfish, shark, and so on, which are themselves ing of natural language concepts, a generaliza- abstractions. At the other extreme, exemplar tion of the results to the mental representation of representations may involve no abstraction at everyday natural language categories, such as all, with representations consisting only of spe- trees, furniture, or games, in adult language cific memory traces of particular previously en- users, seems problematic. Most people learn countered instances. (See, e.g., Medin, 1986, 1 about birds and flowers erratically and from 1 Though the latter view may be more popular among many different sources, contrary to the learning researchers that have studied artificial category learning, situation in a laboratory experiment where a one can doubt whether a theory which assumes no abstrac- category is learned in an explicit learning phase. tion at all has ever been tested in any of the laboratory studies. In designing these experiments, researchers assume Also, whereas subjects in an artificial category that the representation of the presented stimuli consists of learning experiment are instructed to encode the the dimensions that they manipulate (e.g., color, form, size, exemplars in detail, there is no similar guaran- and position of the stimulus), but if subjects do not abstract tee that people are equally careful in their ob- these dimensions from other information that may in prin-
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 53 and for a strong critique of nonabstracting ex- views are exemplar views. Both of these exem- emplar models, see Barsalou, Huttenlocher, & plars views are compatible with Heit and Bar- Lamberts, 1998.) salou’s (1996) instantiation principle. Heit and Barsalou’s (1996) instantiation prin- The present paper describes two experiments ciple described above is not explicit about ex- in which two different prototype predictions actly how exemplar information is stored in and an exemplar prediction inspired by Heit and memory. The model assumes that category- Barsalou’s (1996) instantiation will be com- based decisions about concepts like birds, fish, pared to predict category-based decisions for mammals, etc. are based on information stored eight natural language concepts. at the level of instantiations that are retrieved EXPERIMENT 1 from memory (i.e., at the level of robin, trout, and horse). However, the proposed instantiation In the first experiment, a first prototype pre- principle is equally compatible with a model dictor and the exemplar-based predictors will be that assumes no abstraction at all as with a compared for the four different category-related model that assumes abstraction across specifi- dependent variables. We will first motivate the cally encountered examples at the level of the choice of the four dependent variables and then instantiated exemplars. elaborate on the prototype and exemplar-based To summarize, in the context of natural lan- measures, which will be used as predictors for guage categories like fruits, vegetables, vehi- the dependent variables. cles, etc., three different theoretical views may Typicality Ratings be distinguished depending on the levels at Typicality has been shown to be a very in- which abstraction does or does not take place. fluential variable in a wide variety of cognitive The first view assumes that no abstraction what- tasks (Hampton, 1993; Malt & Smith, 1984), soever takes place and that only memory traces such as speeded categorization (Hampton, of particular encountered instances are stored. 1979), inductive inference (Rips, 1975), pro- Any category-related judgment is based on ductive tasks (Hampton & Gardiner, 1983), these memory traces, as no abstract information priming effects (Rosch, 1975b), semantic sub- is stored with verbal concepts. The second view stitutability (Rosch, 1977), and memory inter- assumes that abstraction may take place, but ference effects (Keller & Kellas, 1978). Accord- only at a level lower than the concepts studied, ing to the prototype view, variations in category that is, at the level of tomatoes in case vegeta- typicality reflect differences in similarity to the bles are studied. The representation of the stud- prototype in terms of the features it shares with ied natural language concepts, like vegetables the prototype representation of the concept. The and vehicles, is comprised of lower level con- extent to which items are characterized by fea- cepts like tomatoes and bikes, respectively. Fi- tures that are important in deciding on category nally, the third view states that abstraction membership has been reported to be a good (also) takes place at the level of the studied predictor by Hampton (1979), Malt and Smith natural language concepts and that (characteris- (1984), and Rosch and Mervis (1975). Accord- tic) features of their exemplars are directly ing to the exemplar view, variations in category stored at this level. The latter view can be typicality reflect varying degrees of similarity to labeled the prototype view, and the first two stored exemplars of the category. The instantia- tion principle, proposed by Heit and Barsalou ciple be stored (like, e.g., the trial number and slight differ- (1996), can be adapted to account for typicality ences in illumination due to uncontrollable events), they ratings within basic level categories like birds, might not be able to learn the categories. Thus, each training fish, and mammals. In evaluating the typicality exemplar is, so to speak, a sort of a prototype consisting of of a particular instance X within a category Y, a a set of abstracted features that, in principle, can apply also to other exemplars that differ on some other, irrelevant, subject might first generate one or more stored features (D. L. Medin, personal communication, May 19, exemplars of the category Y. Next, the typical- 1997). ity of instance X within Y might be based on the
54 STORMS, DE BOECK, AND RUTS similarity of instance X toward the instantiated the subjects can give a “Yes” response. If, how- exemplars stored earlier under Y. In summary, ever, the number of retrieved items that are not both views explain typicality in terms of their stored exemplars reaches a critical value first, basic assumptions about how categories are rep- then a “No” response is assumed to be emitted. resented. Therefore, finding which of the two It follows from this hypothesized process that, views best predicts typicality ratings may shed within a category, response times should de- light on the representation of semantic catego- crease with increasing similarity toward stored ries. good exemplars of the category. (See Nosofsky & Palmeri, 1997, for another exemplar-based Response Times model of response times.) The second measure to be predicted by both models is response time in a speeded categori- Category-Naming and Exemplar-Generation zation task. A very robust finding in the litera- Frequency ture on semantic verification tasks is that the Category-naming and exemplar-generation time to verify category membership differs sig- frequency are, in a way, each other’s opposite: nificantly among the members of the same cat- In an exemplar-generation task, participants are egory (Larochelle & Pineau, 1994). This differ- given the category label and are asked to name ence in response times can be explained both by exemplars, while in a category-naming task, the prototype model and by exemplar models. participants are given exemplars and they are Following Hampton’s (1979) assumptions asked to name the category. about category prototypes, different character- As for category naming as a dependent vari- istic features of the concept can be selected able, the fact that some exemplars of a category successively in time and a “Yes” response can are labeled with the category name more fre- be emitted as soon as the feature overlap be- quently than other exemplars can be explained tween the stored features of the prototype and by the exemplar-based models when assuming the features of a presented word reaches a cer- that the link between a category and an exem- tain threshold. Likewise, subjects can be as- plar is of a probabilistic nature. The prototype sumed to give a “No” response as soon as the model can account for this finding in assuming set of nonmatching features reaches a certain that the feature pattern associated with the given criterion. Hampton’s results show that a consid- exemplar somehow activates the feature pattern erable percentage of the variance in the re- of the category (which then allows the subject to sponse times can be accounted for by a proto- retrieve the category label) and that this is more type-based measure. (See also McCloskey & likely the more the feature patterns of the ex- Glucksberg, 1979.) emplar and of the category resemble each other. Exemplar-based predictions about response The difference in both views is whether or not times in a speeded categorization follow a find- the activation goes through a feature pattern and ing of Hines, Czerwinski, Sawyer, and Dwyer from there to the category label or goes directly (1986), who found that good category members to the category label. However, exactly how this prime other category members (regardless of activation would work is not clear. Besides, their association level), while medium category there is another complicating factor in that ex- exemplars fail to produce semantic priming for emplars can be linked to more than one category within-category members. Based on these find- label. Neither the prototype nor the exemplar- ings, it could be assumed in the exemplar model based models have dealt in an explicit way with that a category member presented in a speeded the problem of overlapping categories. categorization task might activate other exem- The situation is even less clear in the exem- plars based on a simple mechanism in which plar-generation task. Exemplar-based models similar exemplars are retrieved first. If the num- assume that categories are learned by associat- ber of retrieved items that are directly stored ing the category label to given exemplars and by exemplars of the category (i.e., good members evaluating the similarity of new stimuli to of the category) reaches a threshold value, then stored exemplars. Thus, these models have to be
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 55 extended to account for the fact that subjects concepts were obtained from Hampton’s study. can easily do the reverse, that is, generate ex- Next, the applicability of each characteristic emplars when given a category label. The pro- feature of the category is evaluated for different totype view again has to rely on a feature-based items and a sum of these feature applicabilities activation process in which the feature pattern is used to predict category-related decisions for associated with the category somehow activates the corresponding items. the feature pattern of category exemplars. In sum, neither the prototype nor the exem- Method plar-based model is very clear on the process We will first give an overview of the material underlying the responses in a category-naming used in this experiment. Next, we will describe task and in an exemplar-generation task. Given the different tasks used to gather data on which that both types of models are not really elabo- the dependent and the independent variables rated with the types of tasks under consider- were based. ation, it is useful to collect data of this kind to help to develop the models further. Material All concepts and items used were in Dutch Prototype and Exemplar Based Predictors and all subjects were native Dutch speakers. In this first experiment, we compared an ex- Eight common categories, previously studied by emplar-based predictor with a prototype predic- Hampton (1979), were used in this experiment: tor. We considered a version of the exemplar kitchen utensils, furniture, vehicle, sport, fruit, view in which subjects use the stored informa- vegetable, fish, and bird. The set of categories tion about the “best exemplars.” Note that our contained natural kinds, artifacts and activities, conception differs somewhat from the instantia- with natural kinds and nonnatural kinds equally tion model proposed by Heit and Barsalou balanced (four of each), which was desirable (1996). Heit and Barsalou assume that subjects given the potential effects of this nature of the instantiate a concept with only one single ex- concepts studied (e.g., Malt & Johnson, 1992). emplar, possibly a different one depending on For each of these categories, a list of 36 items the subject, that is, the instantiation that first was selected from an exemplar generation study comes to their mind, while we assume that (Storms, De Boeck, Van Mechelen, & Ruts, subjects can instantiate a concept using more 1996): 24 presumed exemplars and 12 nonex- than one exemplar. Though the exemplar acti- emplars that were related to the category. The vation process is not observable, it can be as- exemplar set always included the most fre- sumed that it is possible to derive an approxi- quently generated exemplar, but also items with mation of the sampling distribution of the presumably varying degrees of typicality within concept instantiations over subjects from the the categories. The 12 related nonexemplars of results of an exemplar-generation task. Since each of the eight categories were selected from we had a list of concept instantiations from an the results of an exemplar-generation task, earlier exemplar-generation task (Storms et al., where subjects were asked to write down exem- 1996), we could evaluate the impact of taking plars of superordinates of the eight categories, into account different numbers of exemplars in excluding the category itself (e.g., of food that deriving the exemplar-based predictor. Thus, is not fruit, of animals that are not fish, etc.). different predictors could be constructed by Features of the eight categories were taken summing the similarity toward an increasing from Hampton’s study. Hampton gathered these number of “best” exemplars of the category, features by interviewing 32 undergraduate stud- “best” meaning the most frequently generated ies extensively. The interview consisted of a exemplars, weighted for their production fre- first part, in which the participants gave free quency. descriptions of the eight concepts, and a second The prototype prediction used in Experiment part, in which they were given seven questions 1 is derived using Hampton’s (1979) procedure. to further encourage them to generate as many More specifically, characteristic features of the features as possible. Features generated by less
56 STORMS, DE BOECK, AND RUTS than 25% of the participants were excluded. For so as to cover the whole range of varying typi- the eight categories, respectively 13, 11, 12, 14, calities in the concept.) The task took approxi- 13, 9, 16, and 12 features were selected to mately 30 min. derive the prototype prediction. (Note that the participants in Hampton’s study had English as Item by Feature Applicability (Matrix Filling) their mother tongue, while our participants were Task Dutch speaking. However, data gathered re- Participants. Eighty students from the Uni- cently in the context of another study, where the versity of Leuven participated for course credit. same eight categories were used, indicated that Procedure. Participants were given a matrix the feature set for the Dutch translation of the where the rows were labeled with the 36 items eight concept labels yielded virtually identical and where the columns were labeled with the feature sets.) category features taken from Hampton (1979). They were asked to fill out all entries in the Procedure matrix with a 1 or a 0 to indicate whether or not To derive the prototype and exemplar-based a feature was considered present in the item predictor variables, two different tasks were corresponding to the row of the entry. Comple- given to different groups of subjects: a similar- tion of the applicability matrix took about 50 ity rating task and a feature applicability task. min. Four other tasks were administered in the first Similarity Rating Task experiment to obtain the four dependent vari- Participants. Two hundred and fifty students ables. from the University of Leuven participated for course credit. Typicality-Rating Task Procedure. Participants indicated the similar- Participants. Ten students from the Univer- ity of 36 listed items (24 presumed exemplars sity of Leuven participated for course credit. and 12 related nonexemplars) toward a key- Procedure. The participants received stan- word. The similarity judgments were given on a dard instructions for typicality ratings. A 10-point rating scale (ranging from 1 “no sim- 7-point rating scale was used (ranging from 23 ilarity at all,” to 10 “highly similar”). No indi- for very atypical or unrelated items to 13 for cation was given as to the point of view from very typical items). All participants rated typi- which similarity had to be considered. As key cality for the item sets of all eight categories. words, the 25 most frequently generated exem- The task took approximately 30 min. plars of the category were used. (The 10 most frequently generated exemplars for every cate- Speeded Categorization Task gory are given in the Appendix). The partici- Participants. Eighteen last-year psychology pants each received eight lists of 36 items, one students from the University of Leuven partic- per concept, in order to judge the similarity of ipated voluntarily. the items of each list to one of the 25 key words Procedure. The participants were seated in for that list, that is, one of the 25 most fre- front of a computer screen and read the instruc- quently generated exemplars. Key words were tions from the screen. They were asked to per- distributed randomly over participants, with the form a speeded categorization task for nine cat- restriction that all participants rated all item lists egories (the eight concepts studied plus one for their similarity to only one key word. Note concept used to acquaint the participants with that the 25 most frequently generated exemplars the procedure). They saw the name of a cate- of the category that were used as target stimuli gory printed in bold in the middle of the screen in the similarity rating task usually contained and were instructed that words would appear the majority of, but not all of, the 24 category right under the category name. They were asked exemplars of the item list. (The 24 items con- to decide as quickly as possible whether or not tained the 10 most frequently generated exem- the word shown under the category name be- plars, but the remaining 14 items were selected longed to the category. In order to avoid ex-
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 57 treme response bias, four additional nonmem- different exemplars of each of the eight con- bers of the categories were included. As a cepts studied. They were asked to write the consequence, the item set of every concept exemplars in the order that they thought of comprised 24 members and 16 (12 1 4) non- them. members. Assignment of the two response keys, labeled “Yes” and “No,” was counterbalanced Results over participants. The participants kept the in- dex fingers of both hands on the response keys Reliability at all times, except during the pauses in between The reliability of all the data that we used in two categories. The experiment started with a this paper was evaluated by the split-half cor- practice category (flowers). Participants were relations corrected with the Spearman–Brown asked to write down the words they had re- formula, with the halves referring to halves of sponded to incorrectly, right after the comple- the subjects that completed the task in question. tion of every category. They were told that All reliability coefficients are derived over the doing the task as fast and as accurate as possible 36 items associated with a category. Similarity was much more important, however, than trying ratings with the 25 best exemplars of each cat- to remember incorrect responses. As in Hamp- egory had a mean reliability estimate of .94 ton (1979), all response times for trials marked (with only 15 out of the 200 ratings having an as mistakes by the participants and all response estimated reliability below .90). Reliabilities times of “No” answers for members were dis- were estimated for all features (i.e., all columns) carded. Participants were informed that they in the applicability matrices of the eight con- could pause in between two categories. The cepts. Mean estimates of .96, .97, .98, .96, .88, order of the eight categories was determined .93, .91, and .95 were found for fruit, birds, following a Latin square design. The experi- vehicles, sports, furniture, fish, vegetables, and ment typically took 35 to 40 min to be com- kitchen utensils, respectively. Estimated reli- pleted. abilities of the typicality ratings of all categories were above .985. For the response times, esti- Category Naming Task mates of the reliability of the response times Participants. Three hundred and sixty stu- were .49, .89, .81, .64, .87, .89, .41, and .84, dents from the University of Leuven partici- respectively. For category naming, reliabilities pated voluntarily. were all well over .93, except for kitchen uten- Procedure. The participants were given a list sils (.85). Finally, the estimates for the data of of eight items, each being one of the 36 items the exemplar-generation task were .92, .85, .95, associated with a different category (as there .92, .97, .95, .83, and .89, respectively. were eight categories studied), either as a mem- Prototype predictions. The prototype predic- ber or as a nonmember. The participants were tions were calculated as in Hampton (1979). asked to write down, for each item, the first The procedure described below is also illus- category they thought the item belonged to. The trated in Fig. 1. Recall that the applicability task was administered collectively at the begin- matrix has frequencies as its entries, corre- ning of a class and took only a couple of min- sponding to the number of participants (out of utes. There were 36 different lists of eight items 10) that judged the corresponding feature appli- and ten different participants completed every cable to the corresponding item. Different pro- list. totype predictions were calculated based on the item by feature applicability matrix. A first Exemplar Generation Task measure used no weighting of the features. Participants. Fifteen graduate students of the Thus, the prediction here consisted simply of Psychology Department participated voluntar- summing over the different features, the number ily. of subjects that credited the item with the cor- Procedure. The participants were given a responding feature. Three other measures were booklet in which they had to write down ten derived by first weighting the features before
58 STORMS, DE BOECK, AND RUTS FIG. 1. Calculating Hampton’s (1979) prototype prediction. summing the applicability frequencies. The results of an earlier study described in Storms et weights were based (1) on the rated feature al. (1996). The first predictor simply consists of importance for defining the concept, (2) on rat- the rated similarity toward the first ordered ex- ings of how characteristic the features are for emplar of the category. For the second predic- the concept, and (3) on the production fre- tor, similarity ratings toward the two best-or- quency of the features. All three weights were dered exemplars were summed. The two terms taken from Hampton (1979). Given the a priori of the sum were weighted using the production nature of the weights, no (estimated) free pa- frequency of the corresponding items. The re- rameters were used. maining predictors were constructed by adding Exemplar prediction. To our best knowledge, each time the similarity ratings toward the next there are no guidelines to be found in the liter- most frequently generated exemplar of the cat- ature concerning the number of exemplars that egory, weighted by the production frequency. are activated by a concept label for natural Note that no (estimated) free parameters were language categories. Therefore, several exem- used in the exemplar-based predictors. Further- plar-based predictors were tried, which differed more, we remind the reader that, in the similar- from each other in that an increasing number of ity-rating task on which the exemplar prediction exemplars were assumed to be activated. A is based, no indication was given as to the point comparison of these predictors enabled us to of view from which similarity had to be consid- evaluate the effect of enlarging the set of exem- ered. However, due to the composition of the plars. item list (which contained 24 members of a Correlations were calculated between 25 dif- particular category and 12 related nonmem- ferent exemplar predictors and the four depen- bers), it is plausible that participants have rated dent variables. The 25 exemplar predictors were similarity based at least partially on the features constructed based on an ordering of exemplars related to the category. derived from generation frequency. Note that Prediction of the four dependent variables. these frequencies were determined based on the For each of the different prototype and exem-
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 59 TABLE 1 Correlations of the Four Dependent Variables with the Hampton Prototype Predictors for the Category Members Correlation of Hampton prototype predictors versus Mean response Category-naming Exemplar-generation Concept Typicality times frequencies frequencies Fruits .55** 2.29 .20 .10 Birds .94** 2.66** .74** .55** Vehicles .78** 2.57** .71** .47* Sports .92** 2.66** .72** .40* Furniture .79** 2.76** .64** .51** Fish .68** 2.77** .44* .25 Vegetables .66** 2.53** .32 .03 Kitchen utensils .62** 2.72** .23 .35* Mean .79** 2.64** .54** .34 Note. *p , .05, **p , .01. All others not significant. plar-based measures, all analyses on the data served for totally unrelated nonmembers and were done with and without weights to deter- slow “No” responses are emitted for related mine predictor variables. However, concerning nonmembers (Hampton, 1979). For the exem- the prototype predictor, only the results for the plar-generation task, evidently, almost all of the unweighted prototype predictor will be given, nonmembers were never generated as exem- since none of the three different feature weight- plars of the categories, and in the category- ings improved the prediction of the unweighted naming task, the concept label was almost never prototype measure. (Averaged over the four de- given in response to nonmembers of the cate- pendent variables and over the eight concepts, gories. Therefore, there was no variation in 12 the unweighted prototype did explain a larger of the 36 items, yielding an odd kind of distri- proportion of the variance than the weighted bution. Furthermore, by excluding the nonmem- prototypes, but the difference was small and not bers, it is even more difficult to obtain high significant.) In the remainder of the article, we predictive levels. Thus, if the predictive level is will refer to the unweighted sum of feature high all the same, the evidence in favor of the applicability frequencies with the term “proto- model is considerably stronger. type prediction.” For the exemplar-based pre- Prototype predictions. Table 1 shows the cor- dictors, only results for the weighted versions relations of the prototype predictor with the four will be reported, since here the weighting did dependent variables for the eight concepts stud- improve the prediction somewhat. ied. The prototype predictor is better in explain- The prototype and the exemplar-based mea- ing the typicality ratings than the other depen- sures were correlated with all four dependent dent variables. All values are significant at the measures. All correlations were based on the 24 a 5 .01 level. Also, the correlations for the members of the categories only, excluding the response times are quite high and significant at related nonmembers of the categories from the the a 5 .01 level for all concepts except fruits. analyses. Only for typicality ratings the analy- Finally, the correlations for the generation fre- ses were also done for the complete list of 36 quencies and for the category-naming frequen- items. It would be problematic for response cies are somewhat lower and significant for only times to include nonmembers, as fast “Yes” five concepts. responses are observed for typical members and The exemplar-based predictors. In Fig. 2, the slow “Yes” responses are emitted for atypical correlations for the 25 exemplar predictors are members; while fast “No” responses are ob- displayed graphically. Thirty-two diagrams
60 STORMS, DE BOECK, AND RUTS FIG. 2. Correlations of typicality, mean response times, category-naming frequencies, and exemplar- generation frequencies with the sum of 1 to 25 exemplar-based predictions for the category members only. (The dotted line gives the correlation with the prototype predictions.)
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 61 TABLE 2 Correlations of the Four Dependent Variables and Prototype Predictors with the Exemplar Predictors Based on Weighted Sum of Similarities Toward the Ten Most Frequently Generated Exemplars for the Category Members Correlation weighted sum of similarities toward the ten most frequently generated exemplars versus Mean response Category-naming Exemplar-generation Hampton prototype Concept Typicality times frequencies frequencies predictors Fruits .77** 2.65** .59** .70** .59** Birds .89** 2.62** .77** .65** .91** Vehicles .93** 2.79** .80** .72** .88** Sports .72** 2.58** .53** .67** .70** Furniture .76** 2.70** .43* .51** .79** Fish .93** 2.84** .63** .47** .69** Vegetables .68** 2.55** .47* .36* .49** Kitchen utensils .60** 2.60** .24 .72** .56** Mean .82** 2.68** .59** .61** .74** Note. *p , .05. **p , .01. All others not significant. show the correlations for the exemplar-based It is remarkable how similar the diagrams in predictors of the eight concepts (corresponding the first two columns of Fig. 2 are, meaning that to the eight rows) and the four dependent vari- the predictions of the typicality ratings and of ables (corresponding to the four columns). The the speeded categorization task are very similar. abscise of each of the diagrams corresponds to As can be seen in Table 2, the correlations of the increasing number of best exemplars taken the exemplar-based measure for the 10 most into account in the predictor. The ordinate cor- frequently generated exemplars correlated sig- responds to the value of the correlation found nificantly with typicality and with response between the dependent variable and the predic- times for all eight concepts (p , .01). In gen- tor. (Note that the correlation values in the eral, the predictions for the typicality-rating task second column, corresponding to the response are a little better, partly because the response times, are negative.) To ease the comparison of times are less reliable than the typicality ratings. the exemplar-based predictors with the proto- The resemblance of the result patterns of the type predictor, a horizontal line is drawn in each response times and of the rated typicalities is graph, indicating the level that corresponds to higher for a concept the more reliable the re- the correlation of the prototype predictor. sponse times of the concept are. In most of the It can be seen clearly that the different dia- diagrams, the exemplar-based measure predicts grams are similar in showing that the exemplar the dependent variable better or equally good as predictors improve when taking into account the prototype predictor, with the exception of more exemplars. The improvement is strong the typicality ratings for sports and the response when comparing the predictors based on up to times for kitchen utensils, which were some- seven exemplars. However, the improvement what better predicted by the prototype measure. decreases as the number of exemplars taken into A similar pattern of results showed up for the account increases. There is almost no improve- prediction of the category-naming frequencies. ment in the predictive power after adding more Again the exemplar prediction was better than than ten exemplars. The first five columns of the prototype predictor in the majority of the Table 2 show the correlations of the exemplar concepts studied (with the exception of sports predictor based on the weighted sum of the 10 and of furniture). The correlations shown in most frequently generated exemplars with the Table 2 (based on the weighted sum of the 10 four dependent variables for the eight concepts “best” exemplars) was significant (p , .05) for studied. all concepts except kitchen utensils. Finally, for
62 STORMS, DE BOECK, AND RUTS the exemplar-generation frequencies, the exem- TABLE 3 plar-based measure is clearly superior to the Correlations of Typicality with the Hampton Prototype prototype predictor. The latter reaches the level Predictors and the Exemplar Predictors Based on Weighted of the exemplar predictor only for furniture. Sum of Similarities for the Complete Set of 36 Items: The values in Table 2 are significant for all eight Category Members and Nonmembers concepts (p , .05). Correlation of typicality versus An important consideration for interpreting the correlations of the two predictors with the Hampton prototype Exemplar dependent variables is whether the prototype Concept predictors predictors predictor and the exemplar-based predictor can be differentiated. The last column of Table 2 Fruits .97 .97 Birds .98 .98 shows the correlations between the prototype Vehicles .97 .96 predictors and the exemplar predictors based on Sports .98 .85 weighted sum of similarities toward the ten Furniture .86 .93 most frequently generated exemplars. The val- Fish .98 .99 ues show that there is some overlap, but the two Vegetables .90 .96 Kitchen utensils .85 .77 predictors are not at all indistinguishable. Only for the concepts birds and vehicles, both predic- Mean .96 .95 tors are very much the same. Averaged over the Note. All correlations are significant at the p , .001 level. eight concepts, the exemplar-based predictor and the prototype predictor have 51% of their variance in common. The percentage of com- quencies and the category-naming frequencies mon variance does not increase when taking (t(18) 5 2.60, p , .01). Though the category- similarities toward all 25 exemplars into ac- naming frequencies showed a larger percentage count. of explained variance than the exemplar-gener- To obtain a more detailed view of the impact ation frequencies, the difference was not signif- of the predictors in explaining the dependent icant. Furthermore, the interaction between the variables for the different concepts, an analysis predictor factor and the dependent variable fac- of variance (ANOVA) with a split-plot factorial tor was also significant. Contrasts for interac- design (Kirk, 1982) was conducted, where the tions showed that the exemplar-based measure eight concepts function as blocks, where the differed only significantly from the prototype predictor variables (prototype and exemplar- measure in predicting the exemplar-generation based predictors) and the four dependent vari- frequencies (t(18) 5 6.68, p , .01), with the ables (typicality ratings, response times, cate- exemplar-based measure predicting the fre- gory naming, and exemplar generation quencies better. All other effects and interaction frequencies) are within-block factors, and effects were not significant, though the differ- where the sort of concepts (natural and nonnat- ence between the exemplar-based measure and ural kinds) is a between-block factor. The de- the prototype measure approached significance pendent variable in the ANOVA was the per- for the typicality ratings (t(18) 5 1.42, p , .10), centage of explained variance (the square of the with the exemplar-based measure yielding bet- correlations shown in Tables 1 and in the sec- ter predictions. ond through the fifth column of Table 2). As mentioned previously, the results of the The analysis yielded a significant effect of the typicality ratings can also be analyzed when dependent variable factor (F(3, 18) 5 12.42, including the related nonmembers of the cate- p , .01). A posteriori contrasts revealed that the gories. Table 3 shows the correlations between typicality ratings could be predicted better than the two predictor variables and the typicality the other three dependent variables (t(18) 5 ratings based on the complete set of 36 items for 5.43, p , .01) and that the response times from the eight concepts studied. These correlations the speeded categorization task could be pre- are a lot higher than the correlations based on dicted better than the exemplar-generation fre- the members only (see Tables 1 and 2). These
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 63 differences can easily be explained given a re- with predictions derived from prototype theory. striction of range in the analyses excluding non- The results obtained in our experiment suggest members. An analysis of variance with a split- that, although for all four dependent variables plot factorial design was conducted, with the the exemplar-based predictions were on the av- eight concepts as blocks, the sorts of concepts erage better than the prototype predictions, the (artifacts versus natural kinds) as a between- difference was not always significant. Note also blocks factor, and the two predictor variables that 31 out of the 32 correlations for the (prototype measure and exemplar-based mea- summed similarity toward 10 exemplars were sure with 10 exemplars) as a within-blocks fac- significant, while only 25 of the corresponding tor. The percentage of variance accounted for correlations for the prototype predictor reached was again used as the dependent variable in the significance. We will now focus on the four analysis. On the average, there was almost no different dependent variables separately. difference in explained variance between the The typicality ratings, which could better be two predictor variables (87 and 86% for the predicted than any of the other dependent vari- prototype and the exemplar-based predictor, re- ables, were significantly predicted by both pre- spectively). Though the mean percentage of ex- dictors for all eight concepts. The ratings plained variance was 93 and 81% for the natural showed a considerably higher correlation with kind concepts and for the artifact concepts, re- the exemplar predictor for fruits, vehicles, and spectively, the difference was not significant. fish, but the prototype predictor explained the typicalities of furniture better. For the remain- Discussion ing concepts, the difference between the two First of all, the main idea of an instantiation predictors was rather small. On the average, the principle, proposed by Heit and Barsalou (1996) difference between the two predictors in ex- to account for typicalities of superordinate nat- plaining typicality was only marginally signifi- ural language categories (like birds, fish, mam- cant. When predicting the typicalities of the mals, etc.) within higher level concepts (like whole item set, that is, including related non- animals) was shown to work well in predicting members, the prediction based on both mea- typicalities of exemplars within these superor- sures increased considerably, but no significant dinate categories as well. The correlations of the difference between the two predictors could be dependent variables with the predictions based observed. on the exemplar model were rather high. All but As to the response times in a speeded cate- one of the 32 correlations (8 concepts by 4 gorization task, the exemplar-based measure dependent variables) were significant. Note, was also a better predictor for fruits and vehi- however, that the exemplar model proposed cles, whereas the prototype predictor better ac- here allowed that a concept can be instantiated counted for the response times for kitchen uten- with more than one exemplar, whereas Heit and sils. Averaged over the eight concepts, the Barsalou’s (1996) model assumes only one in- difference between the two predictors in pre- stantiation. For a detailed comparison of Heit dicting the response times was again not signif- and Barsalou’s model and the model proposed icant. here, see Storms, De Wilde, De Boeck, and Ruts Looking at the correlations for the category- (1999). naming task, the exemplar-based predictor was The results showed also very clearly that the better for fruits, fish, and vegetables, but the predictive power of the exemplar-based mea- prototype measure was better for sports and sure increases when more exemplars are taken furniture. Averaged over the eight concepts, the into account. However, this increase was not better predictions of the exemplar-based mea- linear. The gain in predictive power slows down sure were not significantly different from the from seven exemplars on and on the average no prototype predictions. The results of the reverse improvement is found beyond ten exemplars. process though, where subjects were given the Heit and Barsalou (1996) have not compared concept label and were asked to generate exem- the predictions of the instantiation principle plars, were significantly better predicted by the
64 STORMS, DE BOECK, AND RUTS exemplar-based measure. Only for furniture the prototype measure and an exemplar-based mea- prototype predictor did as good as the exemplar- sure is that the former is based on one repre- based predictor. A possible explanation for the sentation for the whole concept, while the latter clearly better exemplar-based predictor for is based on many representations— one for ev- these data lies in the similarity of the method for ery stored exemplar—which are summarized or gathering data: Both the exemplar-based predic- averaged. Assuming that the representations tor and the exemplar-generation frequencies consist of features or attributes, we can rephrase start from production frequencies. this by stating that the prototype predictors are The analyses also showed that the results of based on one single vector of attribute values, the different concepts, regardless of the task, while exemplar predictors are based on the av- were predictable to a different extent. The data erage of many different vectors of attribute val- for birds and for vehicles could be predicted ues. The different procedures to calculate pro- rather well, while the data for fruits, vegetables, totype predictors differ in the way to obtain the and kitchen utensils seemed to resist prediction attributes of which the concept vector of at- most. The differences, however, were not sig- tribute values consists. Hampton’s (1979) pro- nificant. cedure, which was used in Experiment 1, gath- The reader may also recall that several pro- ered these attributes starting from the category totype predictors were calculated and com- label, by simply asking participants to generate pared. The best prototype prediction was a sim- features of the concept. 2 Since the results of ple unweighted sum of the applicability Experiment 1 showed that the exemplar-based frequencies, summed over all features. This measure predicted the dependent variables a does not support the suggestion of Rosch and little better than Hampton’s prototype proce- Mervis (1975) that the features possessed by the dure, another experiment was conducted to different items are weighted by their cue valid- evaluate how consistent this advantage of the ity to determine degree of category member- exemplar predictor was when varying the pro- ship. The results are in line, however, with cedure to derive the prototype. Experiment 2 findings of Hampton (1979), who also tried two was set up to test another procedure, based on different sorts of feature weighting, production Rosch & Mervis (1975) classical family resem- frequency and rated feature importance, and blance procedure. found that none of these weightings improved EXPERIMENT 2 the prediction of typicality ratings. Our findings that weighting did not improve the prototype Rosch and Mervis (1975, p. 575) have de- prediction thus confirm Hampton’s results. scribed the prototype structure of semantic cat- The prototype measure in this first experi- egories as follows: “the basic hypothesis was ment was derived using the same procedure that members of a category come to be viewed used by Hampton (1979). He used the prototype as prototypical of the category as a whole in measure to predict responses in a speeded cat- proportion to the extent to which they bear a egorization task. In general, the corresponding 2 According to Hampton (1979), when asked to give correlations obtained in Experiment 1 were features of a concept, participants are assumed to be able to somewhat higher than the values reported by activate this information directly, without first instantiating Hampton. Especially for kitchen utensils, the exemplars of the concept. This procedure can therefore be criticized because exactly what goes on in a subject’s mind difference was considerable, since Hampton’s when he or she is asked to give features of a concept is not correlation for this concept was rather low. The directly observable. It is very well conceivable that subjects higher values obtained in our experiment reas- instantiate exemplars, activate features of these exemplars, sure us of the soundness of the response times and respond with those features that apply to enough of the from our speeded categorization task as well as instantiated exemplars. We thank Larry Barsalou for point- ing this out. However, if the activation of these “concept of the prototype predictor. features” really happens after exemplar instantiation, one However, different procedures have been should still assume that participants first apply a sort of proposed in the literature to calculate prototype filter, since no features idiosyncratic for just one (or a few) predictors. Essentially, the difference between a exemplars are given.
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES 65 family resemblance to (have attributes which Thus, the purpose of Experiment 2 is similar overlap those of) other members of the cate- to the purpose of Experiment 1: Again the four gory.” While this description gives a generally dependent variables (typicality ratings, response accepted conception of a prototype of a seman- times from the speeded categorization task, cat- tic category, it is not clear at all how the rele- egory-naming frequencies, and exemplar-gener- vant attributes on which the prototype is based ation frequencies) were correlated with the new can be obtained empirically. For an elaborate family resemblance measure, which was then discussion of this difficulty, see Smith & Medin compared with the prototype measure based on (1981). Hampton’s procedure and with the exemplar- The solution used by Hampton (1979), which based predictor of Experiment 1. we adopted in Experiment 1, is to simply ask subjects to generate features that characterize Method the concepts under study. More in detail, Hamp- ton asked participants to give descriptions of the Participants. There were six different partic- eight categories. For each category, a set of ipants in this experiment: Three of them partic- seven different questions was used in order to ipated in the attribute generation task and four encourage the participants to generate as many judged the attribute applicabilities for every different properties as they could. These ques- item. All participants were graduate students of tions were, for example, why some items only the Psychology Department of the University of “loosely speaking” belong to a particular cate- Leuven. gory, or why a certain item might be considered Material. The same eight concepts from Ex- a very typical example of the category. Impor- periment 1 were used. Because of the very high tantly, in all questions involving the consider- correlations obtained for the typicality predic- ation of particular examples, no specific exam- tions including related nonmembers in Experi- ple was ever given by the experimenter to avoid ment 1, and because of the time-consuming biasing the participants. nature of the matrix filling task in the second In Experiment 2, a prototype measure was experiment, data were gathered for the 24 con- computed using another procedure, which is cept members only (thus dropping the 12 related more in line with the prototype theory as for- nonmembers of the item list). mulated by Rosch (1975a, 1975b, 1977, 1978), Procedure. In the attribute generation task, 1983). The procedure of Rosch and Mervis three participants generated features for all 24 (1975) was applied, where the attributes of the members of five or six concepts. For every category were gathered starting from the exem- concept, two different participants took part in plars of the category. Next, for each category, the attribute generation task. They were given all attributes of the exemplars were listed and all 24 items of every concept together and they judged for their applicability for every exem- were instructed to write down as many at- plar. Each attribute then received a weight, tributes as possible for each of the items, but based on the number of items that had been credited with that attribute. Finally, a family they did not have to repeat an attribute that was resemblance score could be calculated for every already written down for a previous item of the item by summing the weights of the attributes same concept. that were judged to apply to the item. (Another In the attribute applicability judgment task, version of this new prototype measure was cal- four different participants filled up all entries of culated, without weighting the different fea- the eight grids, each consisting of 24 rows for tures. Averaged over the four dependent vari- the items and 26 to 73 columns for the at- ables and the eight concepts, the unweighted tributes. The participants were allowed to work version explained less variance than the on the task at different moments, but they were weighted version, but like for the weighted and asked to complete a grid before pausing, once unweighted prototype versions in Experiment 1, they started it. Completing a single grid took the difference was not significant.) between 25 and 50 min.
You can also read