A Comparison of Four Association Engines in Divergent Thinking Support Systems on Wikipedia
Kobkrit Viriyayudhakorn, Susumu Kunifuji, and Mizuhito Ogawa
Japan Advanced Institute of Science and Technology, Japan
kobkrit@jaist.ac.jp, kuni@jaist.ac.jp, mizuhito@jaist.ac.jp

Abstract. Associative information, e.g., associated documents, associated keywords, freelinks, and categories, is a potential source for divergent thinking support. This paper compares four divergent thinking support engines that use associative information extracted from Wikipedia. The first two engines adapt the association search engine GETA [1], and the last two find associations by using the document structure. Their quality is compared by experiments in both quantitative and qualitative evaluations using the Eureka! interface, which is inspired by the “Hasso-Tobi 2” divergent thinking support groupware [2].

1 Introduction

The brainstorming process proposed by Osborn [3] is one of the most popular methodologies for enhancing creative thinking ability. The divergent thinking phase, a major phase of brainstorming, generates as many ideas as possible under the following rules [4].

1. Produce a large quantity of ideas without any criticism.
2. Unusual ideas are highly welcome.
3. Adapting or modifying previously suggested ideas is encouraged.

The convergent thinking phase then filters, fuses, and derives ideas to make them concrete.

A divergent thinking support system can be classified into three types (from naïve to sophisticated) by the level of generated suggestions [2].

1. Free association
2. Forced association
3. Analogy conception

The first category is an almost random association, which is believed to be useless. There are several examples of the second category. Watanabe [5] proposed “Idea Editor”, which extracts associated keywords that are moderately unusual to the user.
Associative keywords are also used in the “Group Idea Processing System” proposed by Kohda et al. [6]. In both systems, associations among keywords are statically precomputed using the structure of documents. Kawaji and Kunifuji [2] proposed the “Hasso-Tobi 2” divergent thinking support groupware, which extracts freelinks from Wikipedia (in Japanese) after analyzing an input sentence with a morphological analyzer. Wang et al. [7] proposed “Idea Expander”,
which extracts images by querying an input sentence on Microsoft Bing Image. In both systems, the association relies on existing structures on the internet. The third category is the most difficult, and only a few examples exist. Young [8] proposed the “metaphor” machine, which constructs metaphors of noun entities by querying those that use the same predicate in a corpus.

This paper compares four methodologies of forced association in divergent thinking by experiments on Wikipedia. The first two adapt the association search, which dynamically finds quantitative associative relations among documents by statistical computation. The latter two adapt the more conventional informative entity extraction, which finds a maximal matching between an input sentence and entities of Wikipedia. We prepare an interface, Eureka!, inspired by “Hasso-Tobi 2” [2], and implement it using the association search engine GETA [1]. The quality of the four methodologies is compared by experiments on Wikipedia in both quantitative and qualitative evaluations.

Section 2 explains the association search and the informative entity extraction. Section 3 briefly reviews the association search engine GETA [1]. Section 4 presents our divergent thinking support system, including the Eureka! interface. Section 5 describes the four divergent thinking support engines and the experimental setting. Section 6 gives experimental results and our observations, and Section 7 concludes the paper.

2 Structured Documents

2.1 Structure of Documents and Information Source

For divergent thinking systems, we focus on multi-layered structured documents (e.g., Fig. 1) as a knowledge-base. Each node of a structure is a non-empty sequence of tokens. Typically, a token is a word, and a node is either a category, a title, or a document. A child node may contain hyperlinks that point to a node in a higher layer.
For instance, Fig. 1 describes the structure of Wikipedia, in which “Category” is a parent of “Title”s, “Title” is a parent of “Content”, and a “Freelink” 1 in a content points to a “Title”.

Fig. 1: An example structure of target knowledge-bases (a category layer on top, a title layer below it, and a content layer at the bottom, with freelinks pointing back to titles).

Our instance of a structured document is Wikipedia (English version), which is one of the largest collective intelligence knowledge-bases and contains a huge collection of encyclopedia articles. Each article in Wikipedia consists of a title, content, freelinks

1 http://en.wikipedia.org/wiki/Wikipedia:Free links
(a) Freelinks in the JAIST Wikipedia page. (b) Categories labeled at the bottom of the JAIST Wikipedia page.

Fig. 2: The associative information in a Wikipedia article and its source code

(see Fig. 2a), and category labels (see Fig. 2b). We dumped the whole English Wikipedia website 2 and deployed it into our local knowledge-base, which is used in all experiments.

2.2 Association Search

In an association search, an article is regarded as a multiset [9] of tokens (typically, an article is a document and a token is a word). A query is a (multi)set of tokens, and the search result is a ranking among tokens with respect to a given similarity measure. Let ID1 be a set of articles, and let ID2 be a set of tokens.

Definition 1. An association system is a quadruplet A = (ID1, ID2, a, SIM) where a is an association function and SIM is a similarity function such that

  a : ID1 × ID2 → N
  SIM : ID2 × MP(ID1) → R≥0

where N is the set of natural numbers, R≥0 is the set of non-negative real numbers, and MP(X) is the set of non-empty multisubsets of X. We say that At = (ID2, ID1, at, SIMt) is the transpose of A, where at(y, x) = a(x, y) and SIMt : ID1 × MP(ID2) → R≥0 is given.

For X ⊆ ID1 (resp. ID2) and n ∈ N, let A(X, n) (resp. At(X, n)) be the function collecting the top n elements in ID2 (resp. ID1) with respect to the similarity SIM(y, X) (resp. SIMt(y, X)) for y ∈ ID2. An association search is At({y | (y, v) ∈ A(X, m)}, n) for m ∈ N.

In the definition of an association search, the number m is not specified. From empirical study, m is set to 200 in GETA (see Section 3) by its developers, balancing efficiency and precision. Note that during an association search, we first compute A(X, m); its result is regarded as a summary that characterizes X. Typical examples of association searches are:

– ID1 is the set of documents, ID2 is the set of words, and a(d, w) is the number of occurrences of a word w in a document d.
In this case, an association search is documents-to-documents. 2 http://download.wikimedia.org/enwiki/20100730/
– ID1 is the set of words, ID2 is the set of documents, and at(w, d) = a(d, w). In this case, an association search is words-to-words.

GETA permits a similarity function SIM of the form

  SIM(d, q) = ( Σ_{t∈q} wq(t, q) · wd(t, d) ) / norm(d, q)

with the assumptions that wd(t, d) = 0 if t ∉ d and wq(t, q) = 0 if t ∉ q. Typically,

– the value of norm(d, q) depends only on d (in such cases, SIMt is obtained by simply swapping wq and wd), and
– both wd and wq are defined depending on the association function a.

For an efficient association search implementation (e.g., GETA), we assume SIM(y, X) = 0 if a(x, y) = 0 for each x ∈ X ⊆ ID1 and y ∈ ID2. Note that an association search does not require structured documents. The key observation is a dual relationship between words and documents: ID1 and ID2 are swapped by regarding a document as a multiset of words and a word as a multiset of the documents that contain it, with multiplicity. Thus the association search ignores the ordering of words; it does not distinguish, say, “Weather is not always fine” and “Weather is always not fine”.

2.3 Informative Entity Extraction

Informative entity extraction assumes a structured document, in which a token is a word. A query is a phrase (i.e., a sequence of tokens), and the search result is the set of titles (i.e., sequences of tokens) that appear in the phrase as contiguous subsequences. We call such titles informative entities.

Definition 2. Let W be a set of words, and let T (⊆ W*) be a set of titles (sequences of words). For an input word sequence ψ ∈ W*, we define

  Subseq(ψ) = {ψ′ ≠ ε | ∃ψ1, ψ2. ψ1 ψ′ ψ2 = ψ}
  A(ψ) = Subseq(ψ) ∩ T

Subseq(ψ) is the set of non-empty contiguous subsequences of ψ (sorted in decreasing order of length), and A(ψ) is the set of informative entities.
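Definition 2 admits a direct implementation. The following is a minimal sketch of ours (not the authors' code): it enumerates the non-empty contiguous subsequences of the input word sequence, longest first, and intersects them with the title set.

```python
def subseq(psi):
    """All non-empty contiguous subsequences of the word sequence psi,
    sorted longest first (so the most specific entity is reported first)."""
    n = len(psi)
    spans = {tuple(psi[i:j]) for i in range(n) for j in range(i + 1, n + 1)}
    return sorted(spans, key=len, reverse=True)

def informative_entities(psi, titles):
    """A(psi) = Subseq(psi) ∩ T, keeping the longest-first ordering."""
    return [s for s in subseq(psi) if s in titles]
```

Here `titles` would be the set of Wikipedia titles, each represented as a tuple of words. Enumerating all spans is quadratic in the sentence length, which is acceptable for short queries.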
For example, from “Reduce electricity usage in Japan Advanced Institute of Science and Technology”, “Japan Advanced Institute of Science and Technology” is extracted, which is an item 3 of Wikipedia. Since Subseq(ψ) is sorted, the extraction reports the longest informative entity first, from which we can expect a more specific match.

3 GETA - Generic Engine for Transposable Association Computation

The Generic Engine for Transposable Association Computation 4 (GETA) is an association search engine developed at NII [1]. A key feature of GETA is its scalability;

3 http://en.wikipedia.org/wiki/Japan Advanced Institute of Science and Technology
4 http://getassoc.cs.nii.ac.jp
(a) A sample WAM:

                “twitter”  “tennis”  “dollar”  “facebook”
IT news             2          0         1          4
Sport news          0          2         1          0
Economic news       0          0         2          0

(b) A sample query vector:

          “twitter”  “dollar”
Query         1          1

(c) The associated documents of query “twitter dollar”:

                Similarity Score
IT news                3
Sport news             1
Economic news          2

(d) The summary of query “twitter dollar”:

            Similarity Score
“facebook”        12
“dollar”           8
“twitter”          6
“tennis”           2

Fig. 3: An example of a WAM and sample results.

it quickly handles a dynamic association search on more than ten million documents, as in Webcat Plus 5, Imagine 6, and Cultural Heritage Online 7.

The key data structure of GETA is a WAM (Word Article Matrix), which represents an association function in Definition 1. A WAM is usually a huge sparse matrix whose rows are indexed by names of documents and whose columns are indexed by words. When ID1 is the set of documents and ID2 is the set of words, the cross point of the row of a document d and the column of a word w is a(d, w), the number of occurrences of the word w in the document d. The transpose at(w, d) is then obtained as the transposed WAM. In the GETA implementation, the huge sparse WAM is compressed both vertically and horizontally; these two compressed matrices enable us to compute the association functions a and at, respectively.

For example, “twitter dollar” is a query over the sample WAM in Fig. 3a. This query is regarded as a two-word document (Fig. 3b). If we adopt the inner product of two column vectors as the similarity function, the result is as shown in Fig. 3c. We can also obtain a summary of the query, as in Fig. 3d; this list collects candidates for the next association search.

GETA provides several sophisticated similarity measures by default, such as TF/IDF and the Smart measure by Singhal et al. [10]. It also accepts user-defined similarity functions, but during the experiments we simply adopt the Smart measure.
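The Fig. 3 example can be reproduced in a few lines. This is a toy sketch of the association-search chain (documents scored against the query, then words scored against those documents) using the plain inner product; GETA's real implementation works on compressed sparse matrices and is not shown here.

```python
# The sample WAM of Fig. 3a: rows are documents, columns are words.
WAM = {
    "IT news":       {"twitter": 2, "tennis": 0, "dollar": 1, "facebook": 4},
    "Sport news":    {"twitter": 0, "tennis": 2, "dollar": 1, "facebook": 0},
    "Economic news": {"twitter": 0, "tennis": 0, "dollar": 2, "facebook": 0},
}

def doc_scores(query):
    """Inner-product similarity of each document to the query (Fig. 3c)."""
    return {d: sum(row.get(w, 0) * c for w, c in query.items())
            for d, row in WAM.items()}

def summary(query):
    """Word summary of the query (Fig. 3d): each word weighted by the
    similarity scores of the documents that contain it."""
    scores = doc_scores(query)
    words = {w for row in WAM.values() for w in row}
    return {w: sum(WAM[d].get(w, 0) * s for d, s in scores.items())
            for w in words}

q = {"twitter": 1, "dollar": 1}
# doc_scores(q) -> {"IT news": 3, "Sport news": 1, "Economic news": 2}
# summary(q)    -> {"facebook": 12, "dollar": 8, "twitter": 6, "tennis": 2}
```

The summary collects “facebook” with the highest score even though it never occurs in the query, which is exactly the kind of forced association the engines below exploit.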
To improve retrieval accuracy, stemming, which reduces inflected words to their root form (for example, “fishing” and “fisher” to just “fish”), is applied as preprocessing on Wikipedia. We use the Snowball 8 English stemmer.

5 http://webcatplus.nii.ac.jp
6 http://imagine.bookmap.info
7 http://bunka.nii.ac.jp
8 http://snowball.tartarus.org/
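As an illustration of the idea only, a toy suffix-stripping stemmer follows; the actual preprocessing uses the Snowball English stemmer, which handles far more suffixes and exceptions than this sketch.

```python
def toy_stem(word):
    """Naive suffix stripping: maps a few inflected forms to a common root.
    A stand-in for illustration, not the Snowball algorithm."""
    for suffix in ("ing", "er", "ed", "s"):
        root = word[: -len(suffix)]
        if word.endswith(suffix) and len(root) >= 3:
            return root
    return word

# toy_stem("fishing") and toy_stem("fisher") both yield "fish".
```

After stemming, the WAM counts occurrences of roots rather than surface forms, so “fishing” and “fisher” contribute to the same column.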
4 Divergent Thinking Support System

4.1 Structure of System

A divergent thinking support system supports and achieves the goals of divergent thinking. There are three levels of divergent thinking support systems, defined by Young [11].

1. The secretariat level: a system only stores and displays the log of the user’s thought, such as a word processor.
2. The framework-paradigm level: a system has the secretariat-level ability and also provides the user with a paradigm appropriate to the user’s thought.
3. The generative level: a system has the framework-paradigm-level ability, and automatically constructs and displays new ideas corresponding to previous ideas.

A divergent thinking support system consists of two main components, an association engine and an interface. The engine constructs suggestions for the user. The interface manages the interaction between the user and the engine.

4.2 Workflow of System

First, a user inputs a topic sentence to start the system, as in Fig. 4. Next, the association engine reads it, produces a list of associative information by consulting the knowledge-base, and forwards it as a suggestion list to the interface. Then, the interface displays the list to the user. Based on the suggestions, the user inputs the next query sentence, which is forwarded to the engine to obtain the next suggestion list. This process repeats until the user is satisfied with the discovered ideas.

Fig. 4: The workflow of a divergent thinking support system (the user’s topic or query sentence goes to the engine, which consults the knowledge-base and returns a suggestion list through the interface).
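The workflow just described is a read-suggest-display loop. It can be sketched as follows; the `engine`, `read_input`, and `display` callables are hypothetical stand-ins for the engine and interface components, not the system's actual API.

```python
def run_session(engine, read_input, display, max_rounds=10):
    """Drive one divergent thinking session: the engine turns each input
    sentence into a suggestion list until the user stops entering ideas."""
    sentence = read_input("Topic sentence: ")
    for _ in range(max_rounds):
        suggestions = engine(sentence)   # consult the knowledge-base
        display(suggestions)
        sentence = read_input("Next idea (empty to finish): ")
        if not sentence:                 # user is satisfied
            break
```

Swapping the `engine` callable is all that distinguishes the four engines compared in Section 5; the loop and the interface stay the same.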
Fig. 5: A Eureka! screenshot and functions

4.3 Eureka! Interface

For the experiments, we prepared the interface “Eureka!” (Fig. 5). Our focus is to evaluate the performance of the different association engines, rather than the effect and influence of social interactions. Thus, we restrict “Eureka!” to a single-user interface.

The topic is always shown at the top of the screen in a big font to prevent off-topic thinking. The suggestions are displayed in the right and bottom sidebars. The right sidebar displays the suggestions in a small font to allow a quick glance by the user. In the bottom sidebar, the suggestions move horizontally in a big font, right to left (like headline news); top-ranked suggestions are shown first so that users can find them easily. All previously input ideas (query sentences) are displayed as rectangular labels in the central scrollable area, without overlaps between labels.

Table 1: The comparison among three divergent thinking support interfaces

Interface      Environment  Suggested    Maximum No.  Focusing  Idea Label Holding   Can Recall Previous
                            Information  of Users     on Topic  Area (% of Screen)   Suggestions
Eureka!        Web          Text         1            Yes       75%                  Yes
Hasso-Tobi 2   Web          Text         10           No        55%                  No
Idea Expander  Client       Image        2            No        30%                  No
Such a layout intends two effects. First, a user easily notices previous ideas, which prevents duplicated ideas. Second, it stimulates idea recycling: whenever a previous idea label is clicked, the suggestion list produced from that idea is loaded and displayed in the right and bottom sidebars.

Eureka! is mainly inspired by “Hasso-Tobi 2” [2], a collaborative divergent thinking support system. The comparison among the interfaces of “Eureka!”, “Hasso-Tobi 2” [2], and “Idea Expander” [7] is summarized in Table 1.

5 Experimental Setting

5.1 Four Divergent Thinking Support Engines

Below are the association engines used during the experiments. All are used under the Eureka! interface, and the only visible differences are the produced suggestion lists.

– GD: GETA’s most related Documents. A user input sentence is a query for retrieving associated documents by GETA. The titles of the search results are suggestions, sorted by their similarity scores.
– GK: GETA’s most related Keywords. A user input sentence is a query for retrieving the summary of the query by GETA. The resulting summary (associated keywords) gives the suggestions, sorted by their similarity scores.
– WF: Wikipedia’s Freelinks. A user input sentence is a query for extracting informative entities. The freelinks in their contents are suggestions, sorted by the length of the entities; that is, a freelink with a longer parent entity comes first. In our current implementation, if parent entities have the same length, a freelink whose parent entity appears earlier in the input sentence comes first.
– WC: Wikipedia’s Categories. A user input sentence is a query for extracting informative entities. All titles in the same categories as the extracted entities are suggestions, sorted by the length of the entities; that is, a category with a longer child entity comes first.
In our current implementation, if entities have the same length, the ordering among categories and among titles in a category follows that of Wikipedia.
– NE: No engine. No associative information is supplied (no suggestion lists).

5.2 Evaluation Methods

The quality of an engine is measured by the degree of creativity that users exhibit. As proposed by Guilford [12], the degree of creativity can be measured by the fluency, flexibility, and originality of ideas. Neupane et al. proposed the following four measures [13].

– Number of ideas: the total number of input ideas.
– Fluency of ideas: the total number of input ideas, excluding those judged to be off-topic, redundant, impossible, and/or useless.
– Flexibility of ideas: the total number of viewpoints covered by the input ideas. Viewpoints are defined prior to the experiments.
– Originality of ideas: the total number of distinct ideas. When ideas are very close or identical, they are grouped together and counted once.
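Once the evaluators have annotated each idea, the four measures above reduce to counting. A minimal sketch, assuming a per-idea record with a validity flag, a viewpoint label, and an originality group id (this record format is our illustration, not from the paper):

```python
def creativity_measures(ideas):
    """ideas: list of dicts with 'valid' (passed the fluency judgement),
    'viewpoint' (one of the predefined viewpoints), and 'group'
    (ideas judged near-identical share a group id)."""
    valid = [i for i in ideas if i["valid"]]  # off-topic etc. excluded
    return {
        "number": len(ideas),
        "fluency": len(valid),
        "flexibility": len({i["viewpoint"] for i in valid}),
        "originality": len({i["group"] for i in valid}),
    }
```

For the umbrella log below (Table 2), where the third and fifth ideas share a group, ten valid ideas would give an originality of nine under this scheme.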
For example, consider the problem “Apart from avoiding rain, write down any other uses of an umbrella”. Table 2 shows the ideas generated by a participant, and Table 3 shows that they cover four viewpoints. The third and fifth ideas in Table 2 are grouped together. This classification is performed after the experiments by three human evaluators.

5.3 Procedure of Experiment

The experiments are conducted with five groups of users, each consisting of two users (ten in total). Participants ranged from bachelor to doctoral students. Before the experiments, every participant is informed of:

1. System procedure and usage.
2. Divergent thinking, its rules and examples.
3. Q&A

The following five topics are assigned to all participants. For each topic, ten viewpoints (Table 4) are prepared in advance, and all resulting ideas are classified into the most related viewpoints, though there are a few difficult cases.

1. If all human beings had a third hand on their back, write down the advantages of that hand.
2. Apart from avoiding rain, write down any other uses of an umbrella.
3. What steps should the concerned authorities take to increase the number of foreign tourists in Japan?
4. How could you contribute to power saving at your school?
5. Imagine that you are a product designer; please design new products likely to be sold to teenagers.

The experiments use the four engines and no engine (for comparison) as described in Section 5.1, and the maximum number of suggestions is limited to 30 for each.

Table 2: A log of ideas during a Eureka! interaction

Idea No.  Generated Idea
1         Avoiding the sun
2         Collect rain water (upside down)
3         Used as a ruler
4         To lock doors by jamming it between the two handles
5         Used as a measuring stick
6         Use the handle to grab something
7         Used as a basket (upside down)
8         A cane for elderly people
9         Dry socks by putting it upside down and hanging socks on the frame
10        Disassemble the umbrella and use the frame as sticks for cooking

Table 3: A list of viewpoints during the Eureka! interaction

Idea Viewpoint   Idea No.
Furniture        9
Tool             1, 3, 4, 5, 6, 8
Recycle          10
Accessories      -
Interior         -
Plaything        -
Container        2, 7
Using Materials  -
Clothes          -
Social           -
Table 4: A list of idea viewpoints for Topics 1-5.

No.  Topic 1          Topic 2     Topic 3          Topic 4             Topic 5
1    Furniture        Security    Information      Habit               Sport
2    Tool             Fame        Economy          Attitude            Electronic
3    Recycle          Production  Quality          Consumption         Entertainment
4    Accessories      Trick       Attitude         Power Sources       Appearance
5    Interior         Social      Regulation       Social Enforcement  Toy
6    Plaything        Education   Quantity         Management          Health
7    Container        Novel       Ability Support  Economy             Security
8    Using Materials  Economy     Security         Regulation          Transportation
9    Clothes          Appearance  Attention        Presentation        Education
10   Social           Health      Promotion        Promotion           Production

Table 5: Topic and engine assignment

Group  Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
A      GD       GK       WF       WC       NE
B      GK       WF       WC       NE       GD
C      WF       WC       NE       GD       GK
D      WC       NE       GD       GK       WF
E      NE       GD       GK       WF       WC

Table 6: Average quantitative measurement results (per user per time period)

Measurement           GD    GK    WF    WC    NE
Number of ideas       12.8  15.9  11.8  14.8  12.7
Fluency of ideas      12.5  15.0  11.2  12.0  11.2
Flexibility of ideas  4.7   4.6   4.8   4.6   3.9
Originality of ideas  11.1  11.7  10.2  11.1  9.8

To avoid the effect of tool experience, the topics and the engines are assigned to the groups in different orders, as in Table 5. The timeout for each topic is set to 15 minutes. Three human evaluators, who did not participate in the experiments, judged the fluency, the flexibility, and the originality of the ideas. A majority vote is taken if a conflict occurs.

6 Experimental Result and Observation

6.1 Quantitative Result

The quality of the divergent thinking engines is estimated by the four measures in Section 5.2. Their average scores over the ten sessions are shown in Table 6. As a result, the most original ideas are discovered when supported by GK. Although WF yields the highest flexibility, all flexibility scores stay around 4.6-4.8 viewpoints per session (out of ten viewpoints), which is not enough to establish any conclusion.

Our observation on the advantage of GK is, first, that GETA accurately summarizes keywords.
Second, each item of the summary is just a single word, which requires less time to understand and can be loosely interpreted in various ways. Therefore, users generate ideas most fluently when GK is used.
6.2 Qualitative Result

The qualitative evaluation is a survey answered by the participants. The survey consists of four questions. The first three questions are multiple choice (Q1-Q3), and the last question is answered by an individual comment (Q4). The first question is asked immediately after finishing each topic. All others are asked after finishing all five topics.

Q1. Please evaluate the usefulness of this divergent thinking support engine on a scale from 1 (Poor) to 5 (Excellent).

  GD   GK   WF   WC   NE
  3.0  3.9  2.6  2.7  1.0

Q2. After using all five divergent thinking support engines, which one do you consider the MOST USEFUL for generating your ideas?

  GD  GK  WF  WC  NE  None
  2   6   0   0   1   1

Q3. After using all five divergent thinking support engines, which one do you consider the MOST USELESS for generating your ideas?

  GD  GK  WF  WC  NE  None
  2   0   2   3   2   1

According to the survey, GK obtains the most satisfaction from users, which is consistent with the quantitative result.

After analysis, we observed that most users avoid inputting lengthy informative entities. Instead, they replace them with simpler words or abbreviations. For example, instead of inputting “Japan Advanced Institute of Science and Technology”, they input “University” or just “JAIST”. Thus, the informative entity extraction (see Section 2.3), which reports the longest informative entity first, fails. This is a major difficulty that drops the quality of both WF and WC.

Q4. Is the content displayed to stimulate your ideas useful to you? If so, please describe when.

– Not useful. (2)
– Useful, when I have no or few ideas. (2)
– Useful, when seeking to trigger the next idea from an existing one. (4)
– Useful, when seeking fresh new ideas. (2)

Users found the suggestions most useful when they felt that existing ideas could be made more sophisticated or developed into new ideas.

7 Conclusion

This paper compared four divergent thinking engines based on forced association.
The first two adapt the association search engine GETA, while the latter two use the more conventional informative entity extraction. The quality of the four engines was evaluated by experiments on Wikipedia using the Eureka! interface. GK (the most related keywords
by GETA, as a summary of an input sentence) is the most effective and gives the highest satisfaction to users.

We think that dynamic association creation over a large-scale database (e.g., with GETA) is useful for divergent thinking support, whereas previous systems used either statically precomputed (relatively small) associations or manually prepared (huge numbers of) freelinks on the internet. We believe that empirical statistical computation can find unexpected associations beyond biased human thinking.

Acknowledgment

This research is partially supported by a Grant-in-Aid for Scientific Research (20680036) from the Ministry of Education, Culture, Sports, Science and Technology of Japan and by the Japan Advanced Institute of Science and Technology (JAIST) under the JAIST overseas training program. We thank Prof. Akihiko Takano and Prof. Shingo Nishioka at the National Institute of Informatics (NII) for fruitful suggestions.

References

1. Akihiko Takano. Association computation for information access. In Discovery Science, pages 33–44. Springer, 2003.
2. Takahiro Kawaji and Susumu Kunifuji. Divergent thinking supporting groupware by using collective intelligence. In Proceedings of the Third International Conference on Knowledge, Information and Creativity Support Systems (KICSS2008), pages 192–199.
3. Alex Osborn. Your Creative Power. New York: Charles Scribner’s Sons, 1948.
4. Susumu Kunifuji, Naotaka Kato, and Andrzej Wierzbicki. Creativity support in brainstorming. In Andrzej Wierzbicki and Yoshiteru Nakamori, editors, Creative Environments, volume 59 of Studies in Computational Intelligence, pages 93–126. Springer, 2007.
5. Isamu Watanabe. Idea Editor: Overview. Research Report IIAS-RR-90-2E, Fujitsu Laboratories Limited, October 1990.
6. Youji Kohda, Isamu Watanabe, Kazuo Misue, Shinichi Hiraiwa, and Motoo Masui. Group Idea Processing System: GrIPS. Transactions of the Japanese Society for Artificial Intelligence, 8:601–601, 1993.
7.
Hao-Chuan Wang, Dan Cosley, and Susan R. Fussell. Idea Expander: Supporting group brainstorming with conversationally triggered visual thinking stimuli. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pages 103–106. ACM, 2010.
8. Lawrence F. Young. The metaphor machine: A database method for creativity support. Decision Support Systems, 3(4):309–317, 1987.
9. Nachum Dershowitz and Zohar Manna. Proving termination with multiset orderings. Communications of the ACM, 22(8):465–476, 1979.
10. Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. pages 21–29. ACM Press, 1996.
11. Lawrence F. Young. Idea processing support: Definitions and concepts. Chapter 8 of Decision Support and Idea Processing Systems, Wm. C. Brown Publishers, pages 243–268, 1988.
12. Joy Paul Guilford. Traits of creativity. In Creativity and its Cultivation, pages 142–161. Harper, 1959.
13. Ujjuwal Neupane, Motoki Miura, Tessai Hayanma, and Susumu Kunifuji. Distributed environment for brain writing support groupware and its evaluation, tree and sequences. Journal of Japanese Creativity Society, 10:74–86, 2006. (In Japanese.)