Prototype and Exemplar-Based Information in Natural Language Categories

Journal of Memory and Language 42, 51–73 (2000)
Article ID jmla.1999.2669, available online at http://www.idealibrary.com on

                            Prototype and Exemplar-Based Information
                                  in Natural Language Categories

                                 Gert Storms, Paul De Boeck, and Wim Ruts
                                          University of Leuven, Leuven, Belgium

              Two experiments are reported in which four dependent variables (typicality ratings, response times,
           category-naming frequencies, and exemplar-generation frequencies of natural language concepts)
           were predicted by two sorts of prototype predictors and by an exemplar predictor related to Heit and
           Barsalou’s (1996) instantiation principle. In the first experiment, the exemplar predictor was com-
           pared to a prototype predictor calculated as in Hampton (1979). The four dependent variables were
           either predicted better by the exemplar measure than by the prototype predictor or the predictive value
           was about equal. In the second experiment, a new prototype predictor was calculated based on Rosch
           and Mervis’ (1975) classic family resemblance measure. The results showed that the exemplar
           predictor accounted better for the dependent variables than Hampton’s and Rosch and Mervis’
           prototype measures. The differences between the prototype measures were not significant. © 2000
           Academic Press
             Key Words: natural language concepts; prototype theory; exemplar models; family resemblance.

   The classical view of semantic concepts                           based on whether an item possesses enough of
states that a concept can be described in terms                      these features. The exemplar view essentially
of defining features that are singly necessary                       states that a category is represented by particu-
and jointly sufficient (e.g., Sutcliffe, 1993).                      lar instances that have previously been encoun-
Ample evidence has been provided against this                        tered. A new item is assumed to be judged an
view (see, e.g., Komatsu, 1992; Smith & Me-                          instance of a category to the extent that it is
din, 1981, for overviews.) Two major alterna-                        sufficiently similar to one or more of the in-
tive theories have been formulated: the proto-                       stance representations stored in memory.
type view and the exemplar view. In the                                 Different procedures have been proposed to
prototype view (Rosch, 1975a, 1978, 1983;                            derive prototype measures for natural language
Rosch & Mervis, 1975), it is assumed that cat-                       concepts (e.g., Hampton, 1979; Rosch &
egories are represented by a set of features                         Mervis, 1975; see also Barsalou, 1990). Also,
which may carry more or less weight in the                           different variants of the exemplar view have
definition of the prototype, and categorization is                   been presented in the literature, depending on
                                                                     the assumptions made about the number and
                                                                     nature of the instances stored, about the pres-
                                                                     ence or absence of forgetting, and so on (Bar-
                                                                     salou, 1990). In some theories it is assumed, for
                                                                     example, that every instance that is encountered
                                                                     is also stored (e.g., Reed, 1972), while in others
                                                                     it is assumed that only the most typical in-
                                                                     stances are stored (e.g., Rosch, 1975b) or that
                                                                     many instances are stored to varying degrees of
                                                                     completeness (see Komatsu, 1992, for more
                                                                     details).
                                                                        The evidence for exemplar models mostly
                                                                     consists of category learning data, obtained with
                                                                     tasks in which subjects learn new categories of

   This research project was supported by Grant 2.0073.94
from the Belgian National Science Foundation (Fundamen-
tal Human Sciences) to P. De Boeck, I. Van Mechelen, and
D. Geeraerts. Part of this research was conducted while the
first author was visiting Doug Medin at Northwestern Uni-
versity. Their hospitality is gratefully acknowledged. We
thank the following for discussion and advice in the course
of the research and in the preparation of this manuscript:
Lawrence Barsalou, Dedre Gentner, Lloyd Komatsu, Doug-
las Medin, Iven Van Mechelen, Ed Wisniewski, and one
anonymous reviewer. All the data described in this manu-
script can be obtained from the first author (in Excel files)
upon simple request.
   Address reprint requests to Psychology Department, Uni-
versity of Leuven, Tiensestraat 102, B-3000 Leuven, Belgium.
E-mail address: Gert.Storms@psy.kuleuven.ac.be.
                                                                51                                        0749-596X/00 $35.00
                                                                                             Copyright © 2000 by Academic Press
                                                                                   All rights of reproduction in any form reserved.
52                                  STORMS, DE BOECK, AND RUTS

artificial stimuli. In the context of natural lan-     servation of real-world objects, and thus, that
guage concepts, prototype and exemplar-based           information is encoded in the same way (Malt &
predictors have, to our best knowledge, not yet        Smith, 1984).
been compared.                                            To our knowledge, the study of Heit and
   In this paper, we want to focus on the extent       Barsalou (1996) is the only attempt to test a
to which category-based variables (typicality          model compatible with most exemplar models
ratings, latency of category decision, exemplar-       in the context of natural language concepts. Heit
generation frequencies, and category-naming            and Barsalou proposed an instantiation model in
frequencies) derived from natural language             which it is essentially assumed that people gen-
concepts can be predicted from prototype mea-          erate instantiations of a category to base cate-
sures and exemplar-based measures. Before de-          gory-related decisions on. More specifically,
scribing the empirical work, we will elaborate         Heit and Barsalou predicted the typicality of
on the differences between natural language            lower level concepts (like, e.g., mammals)
concepts as they are used by adult language            within a higher level concept animal from the
users on the one hand and concepts related to          typicality of instances of the lower level concept
categories as they appear in laboratory studies        (like, e.g., dog, horse, cat) within the higher
in which subjects learn new categories of arti-        level concept animal.
ficial stimuli on the other hand.                         In contrast with the exemplar view, much
 EXEMPLAR AND PROTOTYPE MODELS                         evidence in favor of the prototype view stems
  FOR ARTIFICIAL CATEGORIES AND                        from natural language concepts. As Hampton
   NATURAL LANGUAGE CONCEPTS                           (1979) and Rosch and Mervis (1975) demon-
                                                       strated, prototype-based predictions succeed in
   In the past, many studies have compared pro-
                                                       explaining a considerable portion of the vari-
totype and exemplar-based predictions to ex-
                                                       ance in typicality judgments for natural lan-
plain categorization and category learning,
                                                       guage concepts. Applying the exemplar models
mostly using perceptual stimuli (see Nosofsky,
                                                       to natural language categories, however, is not
1992). Only a few studies have been published
                                                       straightforward. An important problem in order
in which the exemplars of the (new) categories
to be learned were verbally described (e.g., a set     for the exemplar view to be tested for natural
of symptoms of fictitious patients in Medin,           categories is that “it is not entirely clear what an
Altom, Edelson, & Freko, 1982, or a set of             exemplar representation is” (Komatsu, 1992, p.
features of fictitious persons in Hayes-Roth &         507). At one extreme, an exemplar representa-
Hayes-Roth, 1977), and, although the stimuli           tion might be a family resemblance representa-
were verbal, they are still somewhat artificial, as    tion that abstracts across different specific in-
were the categories. As it cannot be taken for         stances. In this view, the concept fish may
granted that the nature of categorization learn-       consist of the set of representations of trout,
ing using artificial categories parallels the learn-   goldfish, shark, and so on, which are themselves
ing of natural language concepts, a generaliza-        abstractions. At the other extreme, exemplar
tion of the results to the mental representation of    representations may involve no abstraction at
everyday natural language categories, such as          all, with representations consisting only of spe-
trees, furniture, or games, in adult language          cific memory traces of particular previously en-
users, seems problematic. Most people learn            countered instances.¹ (See, e.g., Medin, 1986,
about birds and flowers erratically and from
many different sources, contrary to the learning
situation in a laboratory experiment where a
category is learned in an explicit learning phase.
Also, whereas subjects in an artificial category
learning experiment are instructed to encode the
exemplars in detail, there is no similar guaran-
tee that people are equally careful in their ob-

   ¹ Though the latter view may be more popular among
researchers that have studied artificial category learning,
one can doubt whether a theory which assumes no abstrac-
tion at all has ever been tested in any of the laboratory
studies. In designing these experiments, researchers assume
that the representation of the presented stimuli consists of
the dimensions that they manipulate (e.g., color, form, size,
and position of the stimulus), but if subjects do not abstract
these dimensions from other information that may in prin-
PROTOTYPES AND EXEMPLARS IN NATURAL LANGUAGE CATEGORIES                                         53

and for a strong critique of nonabstracting ex-                    views are exemplar views. Both of these exem-
emplar models, see Barsalou, Huttenlocher, &                       plar views are compatible with Heit and Bar-
Lamberts, 1998.)                                                   salou’s (1996) instantiation principle.
   Heit and Barsalou’s (1996) instantiation prin-                     The present paper describes two experiments
ciple described above is not explicit about ex-                    in which two different prototype predictions
actly how exemplar information is stored in                        and an exemplar prediction inspired by Heit and
memory. The model assumes that category-                           Barsalou’s (1996) instantiation principle will be
based decisions about concepts like birds, fish,                   compared as predictors of category-based
mammals, etc. are based on information stored                      decisions for eight natural language concepts.
at the level of instantiations that are retrieved
                                                                                  EXPERIMENT 1
from memory (i.e., at the level of robin, trout,
and horse). However, the proposed instantiation                       In the first experiment, a first prototype pre-
principle is equally compatible with a model                       dictor and the exemplar-based predictors will be
that assumes no abstraction at all as with a                       compared for the four different category-related
model that assumes abstraction across specifi-                     dependent variables. We will first motivate the
cally encountered examples at the level of the                     choice of the four dependent variables and then
instantiated exemplars.                                            elaborate on the prototype and exemplar-based
   To summarize, in the context of natural lan-                    measures, which will be used as predictors for
guage categories like fruits, vegetables, vehi-                    the dependent variables.
cles, etc., three different theoretical views may                                 Typicality Ratings
be distinguished depending on the levels at
                                                                      Typicality has been shown to be a very in-
which abstraction does or does not take place.
                                                                   fluential variable in a wide variety of cognitive
The first view assumes that no abstraction what-
                                                                   tasks (Hampton, 1993; Malt & Smith, 1984),
soever takes place and that only memory traces
                                                                   such as speeded categorization (Hampton,
of particular encountered instances are stored.
                                                                   1979), inductive inference (Rips, 1975), pro-
Any category-related judgment is based on
                                                                   ductive tasks (Hampton & Gardiner, 1983),
these memory traces, as no abstract information                    priming effects (Rosch, 1975b), semantic sub-
is stored with verbal concepts. The second view                    stitutability (Rosch, 1977), and memory inter-
assumes that abstraction may take place, but                       ference effects (Keller & Kellas, 1978). Accord-
only at a level lower than the concepts studied,                   ing to the prototype view, variations in category
that is, at the level of tomatoes in case vegeta-                  typicality reflect differences in similarity to the
bles are studied. The representation of the stud-                  prototype in terms of the features it shares with
ied natural language concepts, like vegetables                     the prototype representation of the concept. The
and vehicles, is comprised of lower level con-                     extent to which items are characterized by fea-
cepts like tomatoes and bikes, respectively. Fi-                   tures that are important in deciding on category
nally, the third view states that abstraction                      membership has been reported to be a good
(also) takes place at the level of the studied                     predictor by Hampton (1979), Malt and Smith
natural language concepts and that (characteris-                   (1984), and Rosch and Mervis (1975). Accord-
tic) features of their exemplars are directly                      ing to the exemplar view, variations in category
stored at this level. The latter view can be                       typicality reflect varying degrees of similarity to
labeled the prototype view, and the first two                      stored exemplars of the category. The instantia-
                                                                   tion principle, proposed by Heit and Barsalou
                                                                   (1996), can be adapted to account for typicality
                                                                   ratings within basic level categories like birds,
                                                                   fish, and mammals. In evaluating the typicality
                                                                   of a particular instance X within a category Y, a
                                                                   subject might first generate one or more stored
                                                                   exemplars of the category Y. Next, the typical-
                                                                   ity of instance X within Y might be based on the

ciple be stored (like, e.g., the trial number and slight differ-
ences in illumination due to uncontrollable events), they
might not be able to learn the categories. Thus, each training
exemplar is, so to speak, a sort of a prototype consisting of
a set of abstracted features that, in principle, can apply also
to other exemplars that differ on some other, irrelevant,
features (D. L. Medin, personal communication, May 19,
1997).

similarity of instance X toward the instantiated     the subjects can give a “Yes” response. If, how-
exemplars stored earlier under Y. In summary,        ever, the number of retrieved items that are not
both views explain typicality in terms of their      stored exemplars reaches a critical value first,
basic assumptions about how categories are rep-      then a “No” response is assumed to be emitted.
resented. Therefore, finding which of the two        It follows from this hypothesized process that,
views best predicts typicality ratings may shed      within a category, response times should de-
light on the representation of semantic catego-      crease with increasing similarity toward stored
ries.                                                good exemplars of the category. (See Nosofsky
                                                     & Palmeri, 1997, for another exemplar-based
                Response Times                       model of response times.)
   The second measure to be predicted by both
models is response time in a speeded categori-        Category-Naming and Exemplar-Generation
zation task. A very robust finding in the litera-                   Frequency
ture on semantic verification tasks is that the         Category-naming and exemplar-generation
time to verify category membership differs sig-      frequency are, in a way, each other’s opposite:
nificantly among the members of the same cat-        In an exemplar-generation task, participants are
egory (Larochelle & Pineau, 1994). This differ-      given the category label and are asked to name
ence in response times can be explained both by      exemplars, while in a category-naming task,
the prototype model and by exemplar models.          participants are given exemplars and they are
Following Hampton’s (1979) assumptions               asked to name the category.
about category prototypes, different character-         As for category naming as a dependent vari-
istic features of the concept can be selected        able, the fact that some exemplars of a category
successively in time and a “Yes” response can        are labeled with the category name more fre-
be emitted as soon as the feature overlap be-        quently than other exemplars can be explained
tween the stored features of the prototype and       by the exemplar-based models when assuming
the features of a presented word reaches a cer-      that the link between a category and an exem-
tain threshold. Likewise, subjects can be as-        plar is of a probabilistic nature. The prototype
sumed to give a “No” response as soon as the         model can account for this finding in assuming
set of nonmatching features reaches a certain        that the feature pattern associated with the given
criterion. Hampton’s results show that a consid-     exemplar somehow activates the feature pattern
erable percentage of the variance in the re-         of the category (which then allows the subject to
sponse times can be accounted for by a proto-        retrieve the category label) and that this is more
type-based measure. (See also McCloskey &            likely the more the feature patterns of the ex-
Glucksberg, 1979.)                                   emplar and of the category resemble each other.
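
The threshold accounts of response time discussed in this section share a simple structure: evidence arrives one unit at a time, and a response is emitted as soon as either the count of matching units or the count of nonmatching units reaches its criterion. The sketch below illustrates only that shared structure. The function name `race_to_threshold`, the criterion values, and the assumption that each unit takes one time step are illustrative choices, not parameters reported in the article or in Hampton (1979).

```python
from typing import Iterable, Optional, Tuple


def race_to_threshold(evidence: Iterable[bool],
                      yes_criterion: int,
                      no_criterion: int) -> Tuple[Optional[str], int]:
    """Return (response, steps) for a stream of match/mismatch units.

    Each element of `evidence` is True for a matching unit (a shared
    feature in the prototype account, a retrieved category exemplar in
    the exemplar account) and False for a nonmatching unit.
    """
    matches = 0
    mismatches = 0
    steps = 0
    for unit in evidence:
        steps += 1
        if unit:
            matches += 1
        else:
            mismatches += 1
        if matches >= yes_criterion:
            return "Yes", steps
        if mismatches >= no_criterion:
            return "No", steps
    return None, steps  # evidence ran out before either criterion was met


# Hypothetical evidence streams for a typical and a less typical member.
typical = [True, True, True, False, True]
atypical = [True, False, True, False, True, True]
print(race_to_threshold(typical, 3, 3))   # ('Yes', 3)
print(race_to_threshold(atypical, 3, 3))  # ('Yes', 5)
```

Under these assumptions, the more typical item reaches the "Yes" criterion in fewer steps, which is the qualitative pattern that both the prototype and the exemplar account predict.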
   Exemplar-based predictions about response         The two views differ in whether
times in a speeded categorization follow a find-     the activation goes through a feature pattern and
ing of Hines, Czerwinski, Sawyer, and Dwyer          from there to the category label or goes directly
(1986), who found that good category members         to the category label. However, exactly how this
prime other category members (regardless of          activation would work is not clear. Besides,
their association level), while medium category      there is another complicating factor in that ex-
exemplars fail to produce semantic priming for       emplars can be linked to more than one category
within-category members. Based on these find-        label. Neither the prototype nor the exemplar-
ings, it could be assumed in the exemplar model      based models have dealt in an explicit way with
that a category member presented in a speeded        the problem of overlapping categories.
categorization task might activate other exem-          The situation is even less clear in the exem-
plars based on a simple mechanism in which           plar-generation task. Exemplar-based models
similar exemplars are retrieved first. If the num-   assume that categories are learned by associat-
ber of retrieved items that are directly stored      ing the category label to given exemplars and by
exemplars of the category (i.e., good members        evaluating the similarity of new stimuli to
of the category) reaches a threshold value, then     stored exemplars. Thus, these models have to be

extended to account for the fact that subjects        concepts were obtained from Hampton’s study.
can easily do the reverse, that is, generate ex-      Next, the applicability of each characteristic
emplars when given a category label. The pro-         feature of the category is evaluated for different
totype view again has to rely on a feature-based      items and a sum of these feature applicabilities
activation process in which the feature pattern       is used to predict category-related decisions for
associated with the category somehow activates        the corresponding items.
the feature pattern of category exemplars.
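
The exemplar-based predictor used in Experiment 1 (summed similarity toward the most frequently generated exemplars, weighted for production frequency) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the dictionary layout, and all numbers are hypothetical, and because the text does not say whether the weighted sum was normalized, no normalization is applied here.

```python
from typing import Dict


def exemplar_predictor(similarity: Dict[str, float],
                       generation_freq: Dict[str, int],
                       k: int) -> float:
    """Frequency-weighted sum of an item's similarity to the k most
    frequently generated exemplars of a category."""
    # Rank exemplars by production frequency, most frequent first.
    ranked = sorted(generation_freq, key=generation_freq.get, reverse=True)
    best = ranked[:k]
    return sum(generation_freq[e] * similarity[e] for e in best)


# Hypothetical data for one item ("kiwi") in the category fruit:
# generation frequencies of four exemplars and mean rated similarity
# of the item to each of them (10-point scale).
freq = {"apple": 40, "banana": 30, "pear": 20, "plum": 10}
sim_to_kiwi = {"apple": 6.0, "banana": 5.0, "pear": 7.0, "plum": 6.5}

# Predictors based on an increasing number of "best" exemplars.
for k in (1, 2, 3):
    print(k, exemplar_predictor(sim_to_kiwi, freq, k))
```

Varying `k` yields the family of predictors that take increasing numbers of "best" exemplars into account.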
   In sum, neither the prototype nor the exem-                             Method
plar-based model is very clear on the process            We will first give an overview of the material
underlying the responses in a category-naming         used in this experiment. Next, we will describe
task and in an exemplar-generation task. Given        the different tasks used to gather data on which
that neither type of model is really elaborated       the dependent and the independent variables
for the types of tasks under consideration, it        were based.
is useful to collect data of this kind to help
to develop the models further.                        Material
                                                         All concepts and items used were in Dutch
  Prototype and Exemplar-Based Predictors             and all subjects were native Dutch speakers.
   In this first experiment, we compared an ex-       Eight common categories, previously studied by
emplar-based predictor with a prototype predic-       Hampton (1979), were used in this experiment:
tor. We considered a version of the exemplar          kitchen utensils, furniture, vehicle, sport, fruit,
view in which subjects use the stored informa-        vegetable, fish, and bird. The set of categories
tion about the “best exemplars.” Note that our        contained natural kinds, artifacts and activities,
conception differs somewhat from the instantia-       with natural kinds and nonnatural kinds equally
tion model proposed by Heit and Barsalou              balanced (four of each), which was desirable
(1996). Heit and Barsalou assume that subjects        given the potential effects of this nature of the
instantiate a concept with only one single ex-        concepts studied (e.g., Malt & Johnson, 1992).
emplar, possibly a different one depending on         For each of these categories, a list of 36 items
the subject, that is, the instantiation that first    was selected from an exemplar generation study
comes to their mind, while we assume that             (Storms, De Boeck, Van Mechelen, & Ruts,
subjects can instantiate a concept using more         1996): 24 presumed exemplars and 12 nonex-
than one exemplar. Though the exemplar acti-          emplars that were related to the category. The
vation process is not observable, it can be as-       exemplar set always included the most fre-
sumed that it is possible to derive an approxi-       quently generated exemplar, but also items with
mation of the sampling distribution of the            presumably varying degrees of typicality within
concept instantiations over subjects from the         the categories. The 12 related nonexemplars of
results of an exemplar-generation task. Since         each of the eight categories were selected from
we had a list of concept instantiations from an       the results of an exemplar-generation task,
earlier exemplar-generation task (Storms et al.,      where subjects were asked to write down exem-
1996), we could evaluate the impact of taking         plars of superordinates of the eight categories,
into account different numbers of exemplars in        excluding the category itself (e.g., of food that
deriving the exemplar-based predictor. Thus,          is not fruit, of animals that are not fish, etc.).
different predictors could be constructed by             Features of the eight categories were taken
summing the similarity toward an increasing           from Hampton’s study. Hampton gathered these
number of “best” exemplars of the category,           features by interviewing 32 undergraduate
“best” meaning the most frequently generated          students extensively. The interview consisted of a
exemplars, weighted for their production fre-         first part, in which the participants gave free
quency.                                               descriptions of the eight concepts, and a second
   The prototype prediction used in Experiment        part, in which they were given seven questions
1 is derived using Hampton’s (1979) procedure.        to further encourage them to generate as many
More specifically, characteristic features of the     features as possible. Features generated by less

than 25% of the participants were excluded. For          so as to cover the whole range of varying typi-
the eight categories, respectively 13, 11, 12, 14,       calities in the concept.) The task took approxi-
13, 9, 16, and 12 features were selected to              mately 30 min.
derive the prototype prediction. (Note that the
participants in Hampton’s study had English as           Item by Feature Applicability (Matrix Filling)
their mother tongue, while our participants were            Task
Dutch speaking. However, data gathered re-                  Participants. Eighty students from the Uni-
cently in the context of another study, where the        versity of Leuven participated for course credit.
same eight categories were used, indicated that             Procedure. Participants were given a matrix
the feature set for the Dutch translation of the         where the rows were labeled with the 36 items
eight concept labels yielded virtually identical         and where the columns were labeled with the
feature sets.)                                           category features taken from Hampton (1979).
                                                         They were asked to fill out all entries in the
Procedure                                                matrix with a 1 or a 0 to indicate whether or not
   To derive the prototype and exemplar-based            a feature was considered present in the item
predictor variables, two different tasks were            corresponding to the row of the entry. Comple-
given to different groups of subjects: a similar-        tion of the applicability matrix took about 50
ity rating task and a feature applicability task.        min.
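
A prototype predictor in the style of Hampton (1979) can be derived from the applicability matrices by summing the feature applicabilities for each item. The sketch below assumes that the 0/1 matrices are first summed over participants, which the text does not state explicitly. The optional feature weights default to 1, and all names and numbers are illustrative.

```python
from typing import List, Optional, Sequence

Matrix = List[List[int]]  # rows = items, columns = features, entries 0 or 1


def prototype_predictor(matrices: Sequence[Matrix],
                        weights: Optional[Sequence[float]] = None) -> List[float]:
    """Sum (optionally weighted) feature applicabilities per item,
    pooled over all participants' matrices."""
    n_items = len(matrices[0])
    n_features = len(matrices[0][0])
    if weights is None:
        weights = [1.0] * n_features  # assumption: unweighted sum
    scores = [0.0] * n_items
    for matrix in matrices:
        for i, row in enumerate(matrix):
            scores[i] += sum(w * cell for w, cell in zip(weights, row))
    return scores


# Two hypothetical participants, three items, four category features.
p1 = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 1]]
p2 = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
print(prototype_predictor([p1, p2]))  # [7.0, 4.0, 1.0]
```

Items to which more of the category's characteristic features apply receive higher scores, so the third hypothetical item would be predicted to be the least typical.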
                                                            Four other tasks were administered in the first
Similarity Rating Task                                   experiment to obtain the four dependent vari-
   Participants. Two hundred and fifty students          ables.
from the University of Leuven participated for
course credit.                                           Typicality-Rating Task
   Procedure. Participants indicated the similar-           Participants. Ten students from the Univer-
ity of 36 listed items (24 presumed exemplars            sity of Leuven participated for course credit.
and 12 related nonexemplars) toward a key-                  Procedure. The participants received stan-
word. The similarity judgments were given on a           dard instructions for typicality ratings. A
10-point rating scale (ranging from 1 “no sim-           7-point rating scale was used (ranging from -3
ilarity at all,” to 10 “highly similar”). No indi-       for very atypical or unrelated items to +3 for
cation was given as to the point of view from            very typical items). All participants rated typi-
which similarity had to be considered. As key            cality for the item sets of all eight categories.
words, the 25 most frequently generated exem-            The task took approximately 30 min.
plars of the category were used. (The 10 most
frequently generated exemplars for every cate-           Speeded Categorization Task
gory are given in the Appendix). The partici-               Participants. Eighteen last-year psychology
pants each received eight lists of 36 items, one         students from the University of Leuven partic-
per concept, in order to judge the similarity of         ipated voluntarily.
the items of each list to one of the 25 key words           Procedure. The participants were seated in
for that list, that is, one of the 25 most fre-          front of a computer screen and read the instruc-
quently generated exemplars. Key words were              tions from the screen. They were asked to per-
distributed randomly over participants, with the         form a speeded categorization task for nine cat-
restriction that all participants rated all item lists   egories (the eight concepts studied plus one
for their similarity to only one key word. Note          concept used to acquaint the participants with
that the 25 most frequently generated exemplars          the procedure). They saw the name of a cate-
of the category that were used as target stimuli         gory printed in bold in the middle of the screen
in the similarity rating task usually contained          and were instructed that words would appear
the majority of, but not all of, the 24 category         right under the category name. They were asked
exemplars of the item list. (The 24 items con-           to decide as quickly as possible whether or not
tained the 10 most frequently generated exem-            the word shown under the category name be-
plars, but the remaining 14 items were selected          longed to the category. In order to avoid ex-

treme response bias, four additional nonmem-         different exemplars of each of the eight con-
bers of the categories were included. As a           cepts studied. They were asked to write the
consequence, the item set of every concept           exemplars in the order that they thought of
comprised 24 members and 16 (12 + 4)                 them.
nonmembers. Assignment of the two response keys,
labeled “Yes” and “No,” was counterbalanced                                Results
over participants. The participants kept the in-
dex fingers of both hands on the response keys       Reliability
at all times, except during the pauses in between       The reliability of all the data that we used in
two categories. The experiment started with a        this paper was evaluated by the split-half cor-
practice category (flowers). Participants were       relations corrected with the Spearman–Brown
asked to write down the words they had re-           formula, with the halves referring to halves of
sponded to incorrectly, right after the comple-      the subjects that completed the task in question.
tion of every category. They were told that          All reliability coefficients are derived over the
doing the task as fast and as accurately as possible   36 items associated with a category. Similarity
was much more important, however, than trying        ratings with the 25 best exemplars of each cat-
to remember incorrect responses. As in Hamp-         egory had a mean reliability estimate of .94
ton (1979), all response times for trials marked     (with only 15 out of the 200 ratings having an
as mistakes by the participants and all response     estimated reliability below .90). Reliabilities
times of “No” answers for members were dis-          were estimated for all features (i.e., all columns)
carded. Participants were informed that they         in the applicability matrices of the eight con-
could pause in between two categories. The           cepts. Mean estimates of .96, .97, .98, .96, .88,
order of the eight categories was determined         .93, .91, and .95 were found for fruit, birds,
following a Latin square design. The experi-         vehicles, sports, furniture, fish, vegetables, and
ment typically took 35 to 40 min to be com-          kitchen utensils, respectively. Estimated reli-
pleted.                                              abilities of the typicality ratings of all categories
                                                     were above .985. For the response times, the
Category Naming Task                                 reliability estimates
   Participants. Three hundred and sixty stu-        were .49, .89, .81, .64, .87, .89, .41, and .84,
dents from the University of Leuven partici-         respectively. For category naming, reliabilities
pated voluntarily.                                   were all well over .93, except for kitchen uten-
   Procedure. The participants were given a list     sils (.85). Finally, the estimates for the data of
of eight items, each being one of the 36 items       the exemplar-generation task were .92, .85, .95,
associated with a different category (as there       .92, .97, .95, .83, and .89, respectively.
were eight categories studied), either as a mem-        Prototype predictions. The prototype predic-
ber or as a nonmember. The participants were         tions were calculated as in Hampton (1979).
asked to write down, for each item, the first        The procedure described below is also illus-
category they thought the item belonged to. The      trated in Fig. 1. Recall that the applicability
task was administered collectively at the begin-     matrix has frequencies as its entries, corre-
ning of a class and took only a couple of min-       sponding to the number of participants (out of
utes. There were 36 different lists of eight items   10) that judged the corresponding feature appli-
and ten different participants completed every       cable to the corresponding item. Different pro-
list.                                                totype predictions were calculated based on the
                                                     item by feature applicability matrix. A first
Exemplar Generation Task                             measure used no weighting of the features.
   Participants. Fifteen graduate students of the    Thus, the prediction here consisted simply of
Psychology Department participated voluntar-         summing over the different features, the number
ily.                                                 of subjects that credited the item with the cor-
   Procedure. The participants were given a          responding feature. Three other measures were
booklet in which they had to write down ten          derived by first weighting the features before

                         FIG. 1. Calculating Hampton’s (1979) prototype prediction.
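
As an illustration only, and not the authors' own code, the summation described for the prototype prediction can be sketched in Python. The applicability values and the feature weights below are hypothetical; only the rule itself (sum the applicability frequencies of an item over all features, optionally after weighting each feature) is taken from the text.

```python
import numpy as np

def prototype_prediction(applicability, weights=None):
    """Sum applicability frequencies over features, one score per item.

    applicability: items x features matrix; each entry is the number of
                   judges (out of 10) who found the feature applicable.
    weights:       optional vector with one weight per feature, e.g. rated
                   importance or production frequency (hypothetical here).
    """
    applicability = np.asarray(applicability, dtype=float)
    if weights is None:
        # Unweighted measure: every feature counts equally.
        weights = np.ones(applicability.shape[1])
    weights = np.asarray(weights, dtype=float)
    return applicability @ weights

# Hypothetical applicability matrix: 3 items x 4 features.
applicability = np.array([[10, 8, 9, 2],
                          [6, 7, 1, 0],
                          [3, 0, 2, 1]])

unweighted = prototype_prediction(applicability)

# Hypothetical a priori feature weights (no free parameters are estimated).
feature_weights = np.array([2.0, 1.0, 0.5, 0.0])
weighted = prototype_prediction(applicability, feature_weights)

print(unweighted)  # [29. 14.  6.]
print(weighted)    # [32.5 19.5  7. ]
```

Because the weights are fixed in advance, the weighted and unweighted versions differ only in the vector passed to the function; neither involves fitting.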

summing the applicability frequencies. The             results of an earlier study described in Storms et
weights were based (1) on the rated feature            al. (1996). The first predictor simply consists of
importance for defining the concept, (2) on rat-       the rated similarity toward the first ordered ex-
ings of how characteristic the features are for        emplar of the category. For the second predic-
the concept, and (3) on the production fre-            tor, similarity ratings toward the two best-or-
quency of the features. All three weights were         dered exemplars were summed. The two terms
taken from Hampton (1979). Given the a priori          of the sum were weighted using the production
nature of the weights, no (estimated) free pa-         frequency of the corresponding items. The re-
rameters were used.                                    maining predictors were constructed by adding
   Exemplar prediction. To our best knowledge,         each time the similarity ratings toward the next
there are no guidelines to be found in the liter-      most frequently generated exemplar of the cat-
ature concerning the number of exemplars that          egory, weighted by the production frequency.
are activated by a concept label for natural           Note that no (estimated) free parameters were
language categories. Therefore, several exem-          used in the exemplar-based predictors. Further-
plar-based predictors were tried, which differed       more, we remind the reader that, in the similar-
from each other in that an increasing number of        ity-rating task on which the exemplar prediction
exemplars were assumed to be activated. A              is based, no indication was given as to the point
comparison of these predictors enabled us to           of view from which similarity had to be consid-
evaluate the effect of enlarging the set of exem-      ered. However, due to the composition of the
plars.                                                 item list (which contained 24 members of a
   Correlations were calculated between 25 dif-        particular category and 12 related nonmem-
ferent exemplar predictors and the four depen-         bers), it is plausible that participants have rated
dent variables. The 25 exemplar predictors were        similarity based at least partially on the features
constructed based on an ordering of exemplars          related to the category.
derived from generation frequency. Note that              Prediction of the four dependent variables.
these frequencies were determined based on the         For each of the different prototype and exem-

                                                      TABLE 1

    Correlations of the Four Dependent Variables with the Hampton Prototype Predictors for the Category Members

                                               Correlation of Hampton prototype predictors versus

                                              Mean response           Category-naming               Exemplar-generation
    Concept               Typicality             times                  frequencies                    frequencies

Fruits                      .55**                 −.29                      .20                           .10
Birds                       .94**                 −.66**                    .74**                         .55**
Vehicles                    .78**                 −.57**                    .71**                         .47*
Sports                      .92**                 −.66**                    .72**                         .40*
Furniture                   .79**                 −.76**                    .64**                         .51**
Fish                        .68**                 −.77**                    .44*                          .25
Vegetables                  .66**                 −.53**                    .32                           .03
Kitchen utensils            .62**                 −.72**                    .23                           .35*

Mean                        .79**                 −.64**                    .54**                         .34

  Note. *p < .05, **p < .01. All others not significant.

plar-based measures, all analyses on the data                 served for totally unrelated nonmembers and
were done with and without weights to deter-                  slow “No” responses are emitted for related
mine predictor variables. However, concerning                 nonmembers (Hampton, 1979). For the exem-
the prototype predictor, only the results for the             plar-generation task, evidently, almost all of the
unweighted prototype predictor will be given,                 nonmembers were never generated as exem-
since none of the three different feature weight-             plars of the categories, and in the category-
ings improved the prediction of the unweighted                naming task, the concept label was almost never
prototype measure. (Averaged over the four de-                given in response to nonmembers of the cate-
pendent variables and over the eight concepts,                gories. Therefore, there was no variation in 12
the unweighted prototype did explain a larger                 of the 36 items, yielding an odd kind of distri-
proportion of the variance than the weighted                  bution. Furthermore, by excluding the nonmem-
prototypes, but the difference was small and not              bers, it is even more difficult to obtain high
significant.) In the remainder of the article, we             predictive levels. Thus, if the predictive level is
will refer to the unweighted sum of feature                   high all the same, the evidence in favor of the
applicability frequencies with the term “proto-               model is considerably stronger.
type prediction.” For the exemplar-based pre-                    Prototype predictions. Table 1 shows the cor-
dictors, only results for the weighted versions               relations of the prototype predictor with the four
will be reported, since here the weighting did                dependent variables for the eight concepts stud-
improve the prediction somewhat.                              ied. The prototype predictor is better in explain-
   The prototype and the exemplar-based mea-                  ing the typicality ratings than the other depen-
sures were correlated with all four dependent                 dent variables. All values are significant at the
measures. All correlations were based on the 24               α = .01 level. Also, the correlations for the
members of the categories only, excluding the                 response times are quite high and significant at
related nonmembers of the categories from the                 the α = .01 level for all concepts except fruits.
analyses. Only for typicality ratings the analy-              Finally, the correlations for the generation fre-
ses were also done for the complete list of 36                quencies and for the category-naming frequen-
items. It would be problematic for response                   cies are somewhat lower and significant for only
times to include nonmembers, as fast “Yes”                    five concepts.
responses are observed for typical members and                   The exemplar-based predictors. In Fig. 2, the
slow “Yes” responses are emitted for atypical                 correlations for the 25 exemplar predictors are
members; while fast “No” responses are ob-                    displayed graphically. Thirty-two diagrams

       FIG. 2. Correlations of typicality, mean response times, category-naming frequencies, and exemplar-
     generation frequencies with the sum of 1 to 25 exemplar-based predictions for the category members only. (The
     dotted line gives the correlation with the prototype predictions.)
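
The construction of the series of exemplar predictors plotted in Fig. 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the similarity ratings, production frequencies, and typicality scores are invented, and the function names are hypothetical. The k-th predictor is taken to be the sum of the similarity ratings toward the k most frequently generated exemplars, each weighted by its production frequency.

```python
import numpy as np

def exemplar_predictors(similarity, production_freq):
    """Build one predictor per number of exemplars taken into account.

    similarity:      items x exemplars matrix of rated similarities, with
                     columns ordered by decreasing generation frequency.
    production_freq: production frequency of each exemplar (same order).
    Returns an items x exemplars matrix whose column k-1 is the weighted
    sum over the first k exemplars.
    """
    similarity = np.asarray(similarity, dtype=float)
    production_freq = np.asarray(production_freq, dtype=float)
    # Weighting the single first exemplar only rescales the ratings,
    # so its correlation with any dependent variable is unchanged.
    return np.cumsum(similarity * production_freq, axis=1)

def correlations_by_k(predictors, dependent):
    """Correlate each predictor column with one dependent variable."""
    dependent = np.asarray(dependent, dtype=float)
    return np.array([np.corrcoef(predictors[:, k], dependent)[0, 1]
                     for k in range(predictors.shape[1])])

# Hypothetical data: 4 items rated against 3 exemplars.
similarity = np.array([[9, 8, 7],
                       [6, 7, 5],
                       [3, 4, 6],
                       [2, 1, 3]])
production_freq = np.array([30, 20, 10])
typicality = np.array([6.5, 5.0, 3.0, 1.5])

predictors = exemplar_predictors(similarity, production_freq)
r_by_k = correlations_by_k(predictors, typicality)
```

In the study each category has 36 items and 25 exemplars, so the corresponding matrix would be 36 x 25 and each diagram in Fig. 2 would plot the 25 resulting correlations against k.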

                                                      TABLE 2

   Correlations of the Four Dependent Variables and Prototype Predictors with the Exemplar Predictors Based on
   Weighted Sum of Similarities Toward the Ten Most Frequently Generated Exemplars for the Category Members

                    Correlation weighted sum of similarities toward the ten most frequently generated exemplars versus

                                  Mean response       Category-naming     Exemplar-generation      Hampton prototype
    Concept         Typicality       times              frequencies          frequencies              predictors

Fruits                .77**           −.65**               .59**                 .70**                   .59**
Birds                 .89**           −.62**               .77**                 .65**                   .91**
Vehicles              .93**           −.79**               .80**                 .72**                   .88**
Sports                .72**           −.58**               .53**                 .67**                   .70**
Furniture             .76**           −.70**               .43*                  .51**                   .79**
Fish                  .93**           −.84**               .63**                 .47**                   .69**
Vegetables            .68**           −.55**               .47*                  .36*                    .49**
Kitchen utensils      .60**           −.60**               .24                   .72**                   .56**

Mean                  .82**           −.68**               .59**                 .61**                   .74**

  Note. *p < .05. **p < .01. All others not significant.
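
The overlap between the two predictors can be checked directly from the last column of Table 2. The short sketch below assumes that common variance is obtained by squaring each correlation and averaging the squares over the eight concepts; that assumption is ours, and the variable names are illustrative.

```python
# Correlations between the exemplar predictor and the Hampton prototype
# predictor, copied from the last column of Table 2 (fruits through
# kitchen utensils).
r_prototype_exemplar = [0.59, 0.91, 0.88, 0.70, 0.79, 0.69, 0.49, 0.56]

# Proportion of shared variance per concept is the squared correlation.
shared_variance = [r ** 2 for r in r_prototype_exemplar]

# Average over the eight concepts (assumed averaging rule).
mean_shared = sum(shared_variance) / len(shared_variance)

print(f"{100 * mean_shared:.0f}%")  # 51%
```

Under this assumption the tabled values give about 51% common variance, which leaves roughly half of the variance in each predictor unshared with the other.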

show the correlations for the exemplar-based                   It is remarkable how similar the diagrams in
predictors of the eight concepts (corresponding             the first two columns of Fig. 2 are, meaning that
to the eight rows) and the four dependent vari-             the predictions of the typicality ratings and of
ables (corresponding to the four columns). The              the speeded categorization task are very similar.
abscissa of each of the diagrams corresponds to             As can be seen in Table 2, the predictions of
the increasing number of best exemplars taken               the exemplar-based measure for the 10 most
into account in the predictor. The ordinate cor-            frequently generated exemplars correlated sig-
responds to the value of the correlation found              nificantly with typicality and with response
between the dependent variable and the                      times for all eight concepts (p < .01). In
predictor. (Note that the correlation values in the         general, the predictions for the typicality-rating task
second column, corresponding to the response                are a little better, partly because the response
times, are negative.) To ease the comparison of             times are less reliable than the typicality ratings.
the exemplar-based predictors with the proto-               The resemblance of the result patterns of the
type predictor, a horizontal line is drawn in each          response times and of the rated typicalities is
graph, indicating the level that corresponds to             higher for a concept the more reliable the re-
the correlation of the prototype predictor.                 sponse times of the concept are. In most of the
   It can be seen clearly that the different                diagrams, the exemplar-based measure predicts
diagrams are similar in showing that the exemplar           the dependent variable better than or as well as
predictors improve when taking into account                 the prototype predictor, with the exception of
more exemplars. The improvement is strong                   the typicality ratings for sports and the response
when comparing the predictors based on up to                times for kitchen utensils, which were some-
seven exemplars. However, the improvement                   what better predicted by the prototype measure.
decreases as the number of exemplars taken into                A similar pattern of results showed up for the
account increases. There is almost no improve-              prediction of the category-naming frequencies.
ment in the predictive power after adding more              Again the exemplar prediction was better than
than ten exemplars. The first five columns of               the prototype predictor in the majority of the
Table 2 show the correlations of the exemplar               concepts studied (with the exception of sports
predictor based on the weighted sum of the 10               and of furniture). The correlations shown in
most frequently generated exemplars with the                Table 2 (based on the weighted sum of the 10
four dependent variables for the eight concepts             “best” exemplars) were significant (p < .05) for
studied.                                                    all concepts except kitchen utensils. Finally, for

the exemplar-generation frequencies, the exem-                                 TABLE 3
plar-based measure is clearly superior to the           Correlations of Typicality with the Hampton Prototype
prototype predictor. The latter reaches the level     Predictors and the Exemplar Predictors Based on Weighted
of the exemplar predictor only for furniture.         Sum of Similarities for the Complete Set of 36 Items:
The values in Table 2 are significant for all eight   Category Members and Nonmembers
concepts (p < .05).                                                             Correlation of typicality versus
   An important consideration for interpreting
the correlations of the two predictors with the                               Hampton prototype           Exemplar
dependent variables is whether the prototype              Concept                predictors               predictors
predictor and the exemplar-based predictor can
be differentiated. The last column of Table 2         Fruits                          .97                    .97
                                                      Birds                           .98                    .98
shows the correlations between the prototype          Vehicles                        .97                    .96
predictors and the exemplar predictors based on       Sports                          .98                    .85
weighted sum of similarities toward the ten           Furniture                       .86                    .93
most frequently generated exemplars. The val-         Fish                            .98                    .99
ues show that there is some overlap, but the two      Vegetables                      .90                    .96
                                                      Kitchen utensils                .85                    .77
predictors are not at all indistinguishable. Only
for the concepts birds and vehicles, both predic-     Mean                            .96                    .95
tors are very much the same. Averaged over the
                                                        Note. All correlations are significant at the p < .001 level.
eight concepts, the exemplar-based predictor
and the prototype predictor have 51% of their
variance in common. The percentage of                 quencies and the category-naming frequencies
common variance does not increase when taking         (t(18) = 2.60, p < .01). Though the category-
similarities toward all 25 exemplars into ac-         naming frequencies showed a larger percentage
count.                                                of explained variance than the exemplar-gener-
   To obtain a more detailed view of the impact       ation frequencies, the difference was not signif-
of the predictors in explaining the dependent         icant. Furthermore, the interaction between the
variables for the different concepts, an analysis     predictor factor and the dependent variable fac-
of variance (ANOVA) with a split-plot factorial       tor was also significant. Contrasts for interac-
design (Kirk, 1982) was conducted, where the          tions showed that the exemplar-based measure
eight concepts function as blocks, where the          differed only significantly from the prototype
predictor variables (prototype and exemplar-based      measure in predicting the exemplar-generation
predictors) and the four dependent variables           frequencies (t(18) = 6.68, p < .01), with the
(typicality ratings, response times, category          exemplar-based measure predicting the
naming, and exemplar generation                        frequencies better. All other effects and interaction
frequencies) are within-block factors, and             effects were not significant, though the
where the sort of concepts (natural and                difference between the exemplar-based measure and
nonnatural kinds) is a between-block factor. The       the prototype measure approached significance
dependent variable in the ANOVA was the                for the typicality ratings (t(18) = 1.42, p < .10),
percentage of explained variance (the square of the    with the exemplar-based measure yielding
correlations shown in Table 1 and in the               better predictions.
second through the fifth column of Table 2).              As mentioned previously, the results of the
   The analysis yielded a significant effect of the   typicality ratings can also be analyzed when
dependent variable factor (F(3, 18) = 12.42,          including the related nonmembers of the
p < .01). A posteriori contrasts revealed that the    categories. Table 3 shows the correlations between
typicality ratings could be predicted better than     the two predictor variables and the typicality
the other three dependent variables (t(18) =          ratings based on the complete set of 36 items for
5.43, p < .01) and that the response times from       the eight concepts studied. These correlations
the speeded categorization task could be pre-         are a lot higher than the correlations based on
dicted better than the exemplar-generation fre-       the members only (see Tables 1 and 2). These

differences can easily be explained given a re-      with predictions derived from prototype theory.
striction of range in the analyses excluding non-    The results obtained in our experiment suggest
members. An analysis of variance with a split-       that, although for all four dependent variables
plot factorial design was conducted, with the        the exemplar-based predictions were on the av-
eight concepts as blocks, the sorts of concepts      erage better than the prototype predictions, the
(artifacts versus natural kinds) as a between-       difference was not always significant. Note also
blocks factor, and the two predictor variables       that 31 out of the 32 correlations for the
(prototype measure and exemplar-based mea-           summed similarity toward 10 exemplars were
sure with 10 exemplars) as a within-blocks fac-      significant, while only 25 of the corresponding
tor. The percentage of variance accounted for        correlations for the prototype predictor reached
was again used as the dependent variable in the      significance. We will now focus on the four
analysis. On the average, there was almost no        different dependent variables separately.
difference in explained variance between the            The typicality ratings, which could better be
two predictor variables (87 and 86% for the          predicted than any of the other dependent vari-
prototype and the exemplar-based predictor, re-      ables, were significantly predicted by both pre-
spectively). Though the mean percentage of ex-       dictors for all eight concepts. The ratings
plained variance was 93 and 81% for the natural      showed a considerably higher correlation with
kind concepts and for the artifact concepts, re-     the exemplar predictor for fruits, vehicles, and
spectively, the difference was not significant.      fish, but the prototype predictor explained the
                                                     typicalities of furniture better. For the remain-
                   Discussion                        ing concepts, the difference between the two
   First of all, the main idea of an instantiation   predictors was rather small. On the average, the
principle, proposed by Heit and Barsalou (1996)      difference between the two predictors in ex-
to account for typicalities of superordinate nat-    plaining typicality was only marginally signifi-
ural language categories (like birds, fish, mam-     cant. When predicting the typicalities of the
mals, etc.) within higher level concepts (like       whole item set, that is, including related non-
animals) was shown to work well in predicting        members, the prediction based on both mea-
typicalities of exemplars within these superor-      sures increased considerably, but no significant
dinate categories as well. The correlations of the   difference between the two predictors could be
dependent variables with the predictions based       observed.
on the exemplar model were rather high. All but         As to the response times in a speeded cate-
one of the 32 correlations (8 concepts by 4          gorization task, the exemplar-based measure
dependent variables) were significant. Note,         was also a better predictor for fruits and vehi-
however, that the exemplar model proposed            cles, whereas the prototype predictor better ac-
here allowed that a concept can be instantiated      counted for the response times for kitchen uten-
with more than one exemplar, whereas Heit and        sils. Averaged over the eight concepts, the
Barsalou’s (1996) model assumes only one in-         difference between the two predictors in pre-
stantiation. For a detailed comparison of Heit       dicting the response times was again not signif-
and Barsalou’s model and the model proposed          icant.
here, see Storms, De Wilde, De Boeck, and Ruts          Looking at the correlations for the category-
(1999).                                              naming task, the exemplar-based predictor was
   The results showed also very clearly that the     better for fruits, fish, and vegetables, but the
predictive power of the exemplar-based mea-          prototype measure was better for sports and
sure increases when more exemplars are taken         furniture. Averaged over the eight concepts, the
into account. However, this increase was not         better predictions of the exemplar-based mea-
linear. The gain in predictive power slows down      sure were not significantly different from the
from seven exemplars on and on the average no        prototype predictions. The results of the reverse
improvement is found beyond ten exemplars.           process though, where subjects were given the
   Heit and Barsalou (1996) have not compared        concept label and were asked to generate exem-
the predictions of the instantiation principle       plars, were significantly better predicted by the

exemplar-based measure. Only for furniture did the       prototype measure and an exemplar-based
prototype predictor do as well as the                    measure is that the former is based on one
exemplar-based predictor. A possible explanation for the representation for the whole concept, while the latter
clearly better exemplar-based predictor for           is based on many representations— one for ev-
these data lies in the similarity of the method for   ery stored exemplar—which are summarized or
gathering data: Both the exemplar-based predic-       averaged. Assuming that the representations
tor and the exemplar-generation frequencies           consist of features or attributes, we can rephrase
start from production frequencies.                    this by stating that the prototype predictors are
   The analyses also showed that the results of       based on one single vector of attribute values,
the different concepts, regardless of the task,       while exemplar predictors are based on the av-
were predictable to a different extent. The data      erage of many different vectors of attribute val-
for birds and for vehicles could be predicted         ues. The different procedures to calculate pro-
rather well, while the data for fruits, vegetables,   totype predictors differ in the way to obtain the
and kitchen utensils seemed to resist prediction      attributes of which the concept vector of at-
most. The differences, however, were not sig-         tribute values consists. Hampton’s (1979) pro-
nificant.                                             cedure, which was used in Experiment 1, gath-
   The reader may also recall that several pro-       ered these attributes starting from the category
totype predictors were calculated and com-            label, by simply asking participants to generate
pared. The best prototype prediction was a sim-       features of the concept. 2 Since the results of
ple unweighted sum of the applicability               Experiment 1 showed that the exemplar-based
frequencies, summed over all features. This           measure predicted the dependent variables a
does not support the suggestion of Rosch and          little better than Hampton’s prototype proce-
Mervis (1975) that the features possessed by the      dure, another experiment was conducted to
different items are weighted by their cue valid-      evaluate how consistent this advantage of the
ity to determine degree of category member-           exemplar predictor was when varying the pro-
ship. The results are in line, however, with          cedure to derive the prototype. Experiment 2
findings of Hampton (1979), who also tried two        was set up to test another procedure, based on
different sorts of feature weighting, production      Rosch and Mervis’ (1975) classic family
frequency and rated feature importance, and           resemblance procedure.
found that none of these weightings improved
                                                                         EXPERIMENT 2
the prediction of typicality ratings. Our findings
that weighting did not improve the prototype             Rosch and Mervis (1975, p. 575) have de-
prediction thus confirm Hampton’s results.            scribed the prototype structure of semantic cat-
   The prototype measure in this first experi-        egories as follows: “the basic hypothesis was
ment was derived using the same procedure             that members of a category come to be viewed
used by Hampton (1979). He used the prototype         as prototypical of the category as a whole in
measure to predict responses in a speeded             proportion to the extent to which they bear a
categorization task. In general, the corresponding
                                                         2 According to Hampton (1979), when asked to give
correlations obtained in Experiment 1 were            features of a concept, participants are assumed to be able to
somewhat higher than the values reported by           activate this information directly, without first instantiating
Hampton. Especially for kitchen utensils, the         exemplars of the concept. This procedure can therefore be
                                                      criticized because exactly what goes on in a subject’s mind
difference was considerable, since Hampton’s
                                                      when he or she is asked to give features of a concept is not
correlation for this concept was rather low. The      directly observable. It is quite conceivable that subjects
higher values obtained in our experiment reas-        instantiate exemplars, activate features of these exemplars,
sure us of the soundness of the response times        and respond with those features that apply to enough of the
from our speeded categorization task as well as       instantiated exemplars. We thank Larry Barsalou for point-
                                                      ing this out. However, if the activation of these “concept
of the prototype predictor.
                                                      features” really happens after exemplar instantiation, one
   However, different procedures have been            should still assume that participants first apply a sort of
proposed in the literature to calculate prototype     filter, since no features idiosyncratic to just one (or a few)
predictors. Essentially, the difference between a     exemplars are given.

family resemblance to (have attributes which            Thus, the purpose of Experiment 2 is similar
overlap those of) other members of the cate-         to the purpose of Experiment 1: Again the four
gory.” While this description gives a generally      dependent variables (typicality ratings, response
accepted conception of a prototype of a seman-       times from the speeded categorization task, cat-
tic category, it is not clear at all how the rele-   egory-naming frequencies, and exemplar-gener-
vant attributes on which the prototype is based      ation frequencies) were correlated with the new
can be obtained empirically. For an elaborate        family resemblance measure, which was then
discussion of this difficulty, see Smith and Medin   compared with the prototype measure based on
(1981).                                              Hampton’s procedure and with the exemplar-
   The solution used by Hampton (1979), which        based predictor of Experiment 1.
we adopted in Experiment 1, is to simply ask
subjects to generate features that characterize      Method
the concepts under study. In more detail, Hampton
asked participants to give descriptions of the          Participants. There were six different
eight categories. For each category, a set of        participants in this experiment: Three of them
seven different questions was used in order to       participated in the attribute generation task and four
encourage the participants to generate as many       judged the attribute applicabilities for every
different properties as they could. These ques-      item. All participants were graduate students of
tions were, for example, why some items only         the Psychology Department of the University of
“loosely speaking” belong to a particular cate-      Leuven.
gory, or why a certain item might be considered         Material. The same eight concepts from Ex-
a very typical example of the category. Impor-       periment 1 were used. Because of the very high
tantly, in all questions involving the consider-     correlations obtained for the typicality predic-
ation of particular examples, no specific exam-      tions including related nonmembers in Experi-
ple was ever given by the experimenter to avoid      ment 1, and because of the time-consuming
biasing the participants.                            nature of the matrix filling task in the second
   In Experiment 2, a prototype measure was          experiment, data were gathered for the 24 con-
computed using another procedure, which is           cept members only (thus dropping the 12 related
more in line with the prototype theory as            nonmembers of the item list).
formulated by Rosch (1975a, 1975b, 1977, 1978,          Procedure. In the attribute generation task,
1983). The procedure of Rosch and Mervis             three participants generated features for all 24
(1975) was applied, where the attributes of the      members of five or six concepts. For every
category were gathered starting from the exem-
                                                     concept, two different participants took part in
plars of the category. Next, for each category,
                                                     the attribute generation task. They were given
all attributes of the exemplars were listed and
                                                     all 24 items of every concept together and they
judged for their applicability for every exem-
                                                     were instructed to write down as many at-
plar. Each attribute then received a weight,
                                                     tributes as possible for each of the items, but
based on the number of items that had been
credited with that attribute. Finally, a family      they did not have to repeat an attribute that was
resemblance score could be calculated for every      already written down for a previous item of the
item by summing the weights of the attributes        same concept.
that were judged to apply to the item. (Another         In the attribute applicability judgment task,
version of this new prototype measure was            four different participants filled in all entries of
calculated without weighting the different           the eight grids, each consisting of 24 rows for
features. Averaged over the four dependent           the items and 26 to 73 columns for the
variables and the eight concepts, the unweighted     attributes. The participants were allowed to work
version explained less variance than the             on the task at different times, but they were
weighted version but, as with the weighted and       asked to complete a grid before pausing once
unweighted prototype versions in Experiment 1,       they had started it. Completing a single grid took
the difference was not significant.)                 between 25 and 50 min.
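
The family resemblance computation described for Experiment 2 can be illustrated with a minimal sketch. It assumes that the applicability judgments for one concept have already been reduced to a single binary items-by-attributes grid; how the judgments of the four participants were combined is not specified here, so that step is left out. The function name `family_resemblance_scores`, the `weighted` flag, and the toy grid are hypothetical and serve only as an illustration.

```python
def family_resemblance_scores(applicability, weighted=True):
    """Return one family resemblance score per item.

    applicability[i][a] is 1 if attribute a was judged to apply to
    item i, and 0 otherwise (assumed binary grid, one row per item).
    """
    n_attributes = len(applicability[0])
    # Weight of an attribute: the number of items credited with it.
    weights = [sum(row[a] for row in applicability) for a in range(n_attributes)]
    if not weighted:
        # Unweighted variant: every attribute counts equally.
        weights = [1] * n_attributes
    # Score of an item: sum of the weights of the attributes that apply to it.
    return [
        sum(w for w, applies in zip(weights, row) if applies)
        for row in applicability
    ]


# Toy grid with invented values: 4 items (rows) by 3 attributes (columns).
grid = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
weighted_scores = family_resemblance_scores(grid)
unweighted_scores = family_resemblance_scores(grid, weighted=False)
```

In this toy grid the three attributes receive weights 3, 2, and 2, so the first item, which possesses all three, obtains the highest weighted score. Setting `weighted=False` corresponds to the unweighted version mentioned in the parenthetical remark above.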