Using Market Basket Analysis to Integrate and Motivate Topics in Discrete Structures
Michael R. Wick and Paul J. Wagner
Department of Computer Science
University of Wisconsin-Eau Claire
Eau Claire, WI 54701
{wickmr, wagnerpj}@uwec.edu

ABSTRACT
Nearly every computer science curriculum includes a course called “Discrete Structures” or “Discrete Mathematics”. Over the past few years, considerable attention has been paid to this course in an attempt to overcome the misperception by students that the material is mathematics and not related to computer science. Most of these efforts attempt to explicitly show students the application of discrete mathematics within computer science. We present an application that adds to the efforts of this community by giving instructors a modern, powerful, and elegant example to motivate student engagement in discrete structures.

Categories and Subject Descriptors
K.3.2 [Computers & Education]: Computer & Information Science Education - Computer Science Education.

General Terms
Computer Science Education

Keywords
Market Basket Analysis, Sets, Dynamic Programming, Discrete Structures.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGCSE’06, March 1-5, 2006, Houston, TX, USA.
Copyright 2006 ACM 1-59593-259-3/06/0003...$5.00

1 INTRODUCTION
Recent literature in computer science education has highlighted a problem that has plagued most instructors of discrete mathematics courses within a computer science curriculum [4, 10]. In particular, educators report that students perceive a significant disconnect between the topics of a discrete structures course and the topics of the other courses within the computer science curriculum. While not all institutions have this problem (particularly those that emphasize a more theoretical or mathematical approach to computer science), we have experienced it at our institution and have attempted to modify the discrete structures course to connect more explicitly with other courses in our curriculum.

Following the lead of other educators [4], we have modified our course by moving it from being taught in the Mathematics department to being taught in the Computer Science department. Further, we have infused into the course some topics typically delegated to a theory of algorithms course (for example, divide and conquer, and dynamic programming). Likewise, as some have done [4], we have infused into the course some topics typically delegated to a data structures course (for example, implementation of a Set abstract data type). However, we have of course maintained other core topics in the course such as formal logic, counting, and proof techniques.

Overall, we have found that the students appreciate the course more after these changes and are more readily accepting of the potential importance of discrete structures in computer science. However, our overall curriculum is highly applied and as such our students tend to reserve their most favorable impressions for those courses that solve problems they see as directly applicable to the “real world”. Therefore, we have worked to find a “real-world” application for inclusion in our discrete structures course that integrates several of the topics of the course and does so in a way that convinces the students of the added value of each of these topics to their overall computer science education. In particular, we have developed lecture materials based on an approach to market basket analysis that has significantly improved the students’ perception of the discrete structures course as relevant to and important in their knowledge arsenal.

2 MARKET BASKET ANALYSIS
Binary market basket analysis is a form of data mining [5] in which an automated system attempts to find and use previously unknown associations between items purchased from a store (binary reflects the fact that the number of each item purchased is not recorded, just 0 for none and 1 for at least 1). The classic example of market basket analysis is online retail suggestive sell (like that used by major online retailers such as amazon.com and bestbuy.com). Here, a computer program analyzes a large set of purchase records (transactions) to find sets of items that are frequently purchased together. Such sets are called frequent item sets. The definition of “frequent” is based on a user-provided frequency and is called the necessary support. Once the frequent item sets are known, a separate process can use these associations to help suggest additional items that a current customer might purchase. This is the suggestive sell. Each suggestive sell rule indicates that when certain items are held in a customer’s basket, based on past experience, there is a reasonable chance that the customer would also be interested in purchasing some other
item(s). This definition of “reasonable chance” is based on a user-provided threshold and is called the necessary confidence.

Let’s take a simple example. Assume that we have a log of 1000 previous customer purchases from our online hardware store. Assume further that we establish a minimum necessary support of 250. This means that in order for any set of items to be considered frequent, that set of items must have been purchased (perhaps in addition to other items at the same time) in at least 250 of the 1000 transactions. Suppose that our transaction log contains 300 purchases that included a hammer, a screwdriver, and a tape measure. The set {hammer, screwdriver, tape measure} is therefore a frequent item set as it occurs in at least the minimum number of our transactions.

From the frequent item set {hammer, screwdriver, tape measure}, the following rules are possible:
Rule 1: If the customer has placed a hammer in the basket, suggest a screwdriver and a tape measure.
Rule 2: If the customer has placed a screwdriver in the basket, suggest a hammer and a tape measure.
Rule 3: If the customer has placed a tape measure in the basket, suggest a hammer and a screwdriver.
Rule 4: If the customer has placed a hammer and a screwdriver in the basket, suggest a tape measure.
Rule 5: If the customer has placed a hammer and a tape measure in the basket, suggest a screwdriver.
Rule 6: If the customer has placed a screwdriver and a tape measure in the basket, suggest a hammer.

The decision as to which of these rules to learn is based on the user-specified minimum necessary confidence. Let’s assume that the user has set the confidence at 75%. This means that in order to be learned as an association rule, the number of transactions that include both the item(s) in the condition and the item(s) in the suggestion of the rule must be at least 75% of the number of transactions that include at least the item(s) in the condition of the rule. So, for example, if we had 600 transactions that include a hammer, then for Rule 1 to be learned at least 450 (75% of 600) of those 600 transactions must also include a screwdriver and a tape measure.

Once we have all the learned rules, these rules can be used to suggest items that are likely to be purchased with the items in a customer’s basket.

3 THE APRIORI ALGORITHM
As you can see from above, there are several steps involved in market basket analysis. Before a new customer ever enters the store, we must perform the following transaction-analysis tasks.
1. Analyze the transactions for frequent item sets that meet the minimum necessary support.
2. Construct the candidate rules from the frequent item sets.
3. Prune the candidate rules that fail to meet the minimum necessary confidence.
Once a new customer enters the store and places an item into the basket, we must perform the following basket-analysis tasks.
1. Find all learned rules that match the current content of the customer’s basket.
2. Select one or more rules to apply in suggesting additional items to the customer.

A classic algorithm for finding frequent item sets (Step 1 of the transaction-analysis tasks) is called the Apriori algorithm [1]. In the remainder of this paper, we primarily focus on the discrete structure aspects of this algorithm. However, we also present examples of how the other tasks involved in market basket analysis can be used to help connect computer science students to topics and concepts in a discrete structures course.

Before we dive into the details of the Apriori algorithm, it is important to remember our goal: to show the students a “real-world” application of the topics from discrete structures in order to help them connect those topics to their own experiences and the content of other seemingly more applied courses. What follows is a description of the way in which we present market basket analysis within the discrete structures course. We typically introduce this application as a case study in applied discrete structures, and it is typically given in the course after the students have already studied logic, proof, sets, and dynamic programming.

3.1 A SET THEORETIC DEFINITION OF THE APRIORI ALGORITHM
The Apriori algorithm is a set-based algorithm that is exceptionally efficient at finding all possible frequent item sets from a set of transactions. We present our students with the following definitions that summarize the algorithm.

Let $I = \{a, b, c, \ldots\}$ be the set of all items available at our store.
Let $R \subseteq I$ be a transaction record.
Let $T = \{S \mid S \subseteq I\}$ be the set of all transactions.
Let $\mathit{support}(S) = |\{B \mid B \in T \wedge S \subseteq B\}|$.
Let $L_1 = \{\{i\} \mid i \in I \wedge \mathit{support}(\{i\}) \geq \mathit{min\_support}\}$
and
$$
\begin{aligned}
\forall k\, [\, (k > 1) \wedge (L_{k-1} \neq \emptyset) \rightarrow L_k = \{\, S_i \cup S_j \mid\ & (S_i \in L_{k-1}) \wedge (S_j \in L_{k-1}) \wedge {} \\
& (|S_i - S_j| = 1 \wedge |S_j - S_i| = 1) \wedge {} \\
& \forall S\, [((S \subseteq S_i \cup S_j) \wedge (|S| = k-1)) \rightarrow S \in L_{k-1}] \wedge {} \\
& (\mathit{support}(S_i \cup S_j) \geq \mathit{min\_support}) \,\} \,]
\end{aligned}
$$
then
$L = \bigcup_k L_k$ is the set of all frequent item sets.

Figure 1: A Set Theoretic Definition of the Apriori Algorithm

Notice how this definition uses sets, set operations, and formal logic together in one application. For the vast majority of our students, this is the first time they have ever considered using
anything other than pseudo-code to describe the essence of an algorithm. For most of them, the above definition is both obscure and intriguing (most of our students are excited about algorithms and how to represent them). To remove the obscurity of the definition, we walk through each element.

The set $I$ simply represents the set of all items we sell at our store and is easily understood by the students.

The set $R$ represents a single transaction record and therefore contains a subset of the items that we carry in our store (i.e., the set $I$).

The set $T$ represents the transaction log of our store. Clearly, a transaction log is simply a collection of single transactions. So as to enable this collection to form a set (and thus have no duplicates), we tell the students that each element of $T$ has a unique id representing a transaction number. Alternatively, we could introduce the concept of a bag of items.

The definition of $\mathit{support}(S)$ represents the number of times that the items in $S$ were purchased as part of one of the records in the transaction set. Therefore, $\mathit{support}(S)$ is simply the cardinality of the set of all sets from our transaction log ($T$) that include $S$ as a (possibly proper) subset.

$L_1$ is the set of all frequent item sets of cardinality 1. That is, it is the set of all single items that have been purchased at least $\mathit{min\_support}$ times (keeping in mind that some of the purchases may have included other items as well). As the algorithm executes, each $L_i$ will hold the set of all frequent item sets of cardinality $i$ (i.e., all sets of $i$ items that have been purchased alone or with other items at least $\mathit{min\_support}$ times). Therefore, the entire collection of frequent item sets of any size is given by the union of $L_i$ for all $i$.

That leaves the implication to discuss. Up to this point, the students readily see how the formal set definitions represent the given concepts from the application. The implication, however, usually takes a bit more explanation than the other aspects of the overall definition. The basic idea of the implication is that the existence of frequent item sets of cardinality $k$ can be determined from the existence of frequent item sets of cardinality $k-1$. This makes sense if you think about it. For example, a four-element frequent item set must contain as a subset a three-element frequent item set, since for all four items to have been purchased together at least $\mathit{min\_support}$ times, certainly three of the four items must have been purchased together at least $\mathit{min\_support}$ times. Students are quick to see that this definition is ripe for implementation using dynamic programming (recall that we introduce this application after they have already studied dynamic programming).

With this “big picture” as a backdrop, we then ask the students to consider each part of the implication in turn. The antecedent of the implication, $(k > 1) \wedge (L_{k-1} \neq \emptyset)$, indicates that the implication holds whenever our most recently produced set of frequent item sets is non-empty. This is intuitive since you can’t have any frequent item sets of $k$ elements when you don’t have any frequent item sets of $k-1$ elements.

The consequent of the implication gives the rule for using $L_{k-1}$ to establish $L_k$. In particular, the implication defines $L_k$ as being constructed from the union of sets in $L_{k-1}$ subject to three constraints. The first constraint, $(|S_i - S_j| = 1 \wedge |S_j - S_i| = 1)$, indicates that the two sets chosen from $L_{k-1}$ must differ from each other in exactly one element each. This constraint ensures that the union of the two sets produces a set with cardinality $k$ ($k-2$ elements common to the sets plus 1 element from the first set and 1 element from the second set). This is important as we are attempting to find frequent item sets of size $k$ exactly. Subsequent iterations will find larger frequent item sets if they exist. The second constraint, again defined as a logical implication, acts as a filter on the resulting unions. This constraint eliminates any sets of size $k$ that contain even one subset of size $k-1$ that is not contained in $L_{k-1}$. Again, after some reflection, this makes sense. There is no way that a set of, say, four elements can be purchased at least $\mathit{min\_support}$ times if a subset of three of those four elements exists that wasn’t purchased $\mathit{min\_support}$ times. Notice that neither of these constraints requires inspecting the original set of transactions $T$. This is important as that set is typically extremely large and often held in secondary storage, making access to that set an efficiency bottleneck for the entire system. Essentially, the constraints on the members of $L_k$ have allowed us to generate a superset of all possible members of $L_k$, some of which can be filtered by finding infrequent subsets. Unfortunately, the final constraint does require inspection of the transaction set $T$ to ensure that the remaining sets are in fact frequent.

4 APRIORI AS DYNAMIC PROGRAMMING
In the previous section we illustrated how the Apriori algorithm from market basket analysis can be succinctly and correctly defined using sets and formal logic. In this section, we use a dynamic programming approach to implement this set theoretic definition.

Dynamic programming is a wonderfully efficient approach to solving optimization problems where problems are solved by caching sub-problem solutions rather than re-computing them [6]. At the heart of dynamic programming is the principle of optimality, which states “components of a globally optimal solution are themselves globally optimal” [7]. But how can we use dynamic programming when we are not solving an optimization problem? Or are we? An optimization problem is defined as “a computational problem in which the object is to find a solution in the feasible region which has the minimum (or maximum) value of the objective function.” [8]. If we define our objective function as the cardinality of each $\cup L_k$, then our goal is to find the set $\cup L_k$ that has the maximum cardinality. In this light, finding the frequent item sets for market basket analysis is an optimization problem.

To be able to effectively apply dynamic programming to its solution, we must prove two properties of the problem.
1) An optimal solution to the problem of finding $\cup L_{k-1}$ is a subset of the optimal solution to the problem of finding $\cup L_k$. This is the principle of optimality.
2) Every element of $\cup L_k - \cup L_{k-1}$ can be constructed as the union of elements from $\cup L_{k-1}$. This is the property that problems have overlapping subproblems.

The proof of (1) is a simple proof-by-contradiction. Assume that $\cup L_k$ is optimal and there exists a set $A \in \cup L_{k-1}$ such that $A \notin \cup L_k$. Since membership in $\cup L_{k-1}$ implies that $A$ is a frequent
item set with $k-1$ elements or fewer, then we could build a larger set $\cup L_k \cup \{A\}$, contradicting the assumption that $\cup L_k$ is optimal.

The proof of (2) uses the fact that every subset of a frequent item set must itself be frequent. Therefore, for every $A \in L_{k+1}$ we can find two sets $B, C \in L_k$ such that $|B - C| = |C - B| = 1$ and $A = B \cup C$.

Given that these two properties hold for the problem of producing $\cup L_k$, a dynamic programming approach to the problem is appropriate.

At this point, we engage our students in an investigation of appropriate data structures for our implementation. We ask our students to inspect our set theoretic definition for operations that will be required of our data structures. Almost immediately, the students suggest using some form of a hash table for storing each $L_k$. They justify this decision based on the fact that the constraint $\forall S\, [((S \subseteq S_i \cup S_j) \wedge (|S| = k-1)) \rightarrow S \in L_{k-1}]$ requires that we be able to quickly determine membership in $L_{k-1}$. Next, the students typically turn their collective attention to the required set operations. The constraint $(|S_i - S_j| = 1 \wedge |S_j - S_i| = 1)$ requires that our algorithm be able to find all pairs of sets in $L_{k-1}$ that overlap in all but one element each. Further, the membership constraint mentioned above also requires that we be able to find all subsets of size $k-1$ for a set of size $k$. Finally, the constraint $\mathit{support}(S_i \cup S_j) \geq \mathit{min\_support}$ requires that we be able to effectively determine if a given set is a subset of another set (i.e., is $S_i \cup S_j$ a subset of each transaction record in $T$). Developing an optimally efficient data structure for these requirements turns out to be quite a challenge and far beyond the backgrounds of our students in discrete structures (for more information on efficient data structures for the Apriori algorithm see [2, 9]). For our purpose, we introduce the students to a bit representation in which each set in our system is represented as an $n$-bit binary string in which $n$ is the cardinality of $I$ (our set of all items in our store) and a “1” in the binary string indicates that the corresponding item is a member of this item set. While this is certainly not an optimal choice, this simple approach is intuitive for the students and allows them to apply logical operations in the implementation of our set operations.

5 ASSOCIATION RULE LEARNING
Thus far we have discussed how we use the “find frequent item sets” step of market basket analysis to integrate propositional logic, sets, optimality proof, data structures, and dynamic programming within a discrete structures course. This section explores how we use the second phase of market basket analysis (association rule learning) to reinforce these same concepts.

5.1 THE BRUTE FORCE APPROACH
Recall from Section 2 that a given frequent item set can lead to a large number of possible association rules. Also recall that not all of these possible rules will be effective and thus we must use the confidence threshold to filter out ineffective rules. The brute force approach, therefore, would be to generate all possible rules from each frequent item set and test each such rule against the transaction set $T$ by dividing the frequency of the full item set (the antecedent together with the consequent) by the frequency of the antecedent of the rule. This sounds like a lot of work, but how much is it really? Let’s start with a set theoretic definition of the association rules as shown in Figure 2 (which assumes the definitions shown in Figure 1).

Let $R$ be the set of all association rules built from the frequent item sets of $L$.
Let $\langle A, C \rangle$ be an ordered pair representing an association rule with antecedent $A$ (a set) and consequent $C$ (a set).
Let $F \in L$ (a frequent item set).
Let $2^F$ be the powerset of $F$.
Then,
$$
\begin{aligned}
R = \{\, \langle A, C \rangle \mid\ & A \in 2^F \wedge (C = F - A) \wedge {} \\
& (A \neq \emptyset) \wedge (C \neq \emptyset) \wedge {} \\
& (\mathit{support}(F)/\mathit{support}(A) \geq \mathit{confidence}) \,\}
\end{aligned}
$$

Figure 2: A Set Theoretic Definition of Association Rules

Notice that the brute force implementation of this definition implies that we must consider all possible subsets of each item set, resulting in an exponential algorithm for association rule learning. We have found this analysis to be an excellent way to reinforce to students the value of algorithm analysis, order of magnitude functions, and big-theta estimations.

5.2 AN IMPROVED APPROACH
Clearly, the above approach is unacceptable for even modestly sized problems. To help motivate a more effective approach, consider the following double-consequent association rule with two items in each of the antecedent and the consequent.
Rule1: If a and b then c and d.
Next, consider the two single-consequent rules formed from the above rule.
Rule2: If a and b then c.
Rule3: If a and b then d.
Clearly, Rule1 cannot meet the minimum necessary confidence unless both Rule2 and Rule3 meet this confidence. After all, if c doesn’t follow sufficiently frequently from a and b, then certainly c and d will not. Notice we can therefore build candidate double-consequent rules from single-consequent rules, triple-consequent rules from double-consequent rules, and so on. This seems familiar. The generation of association rules is just another application of our dynamic programming approach. In fact, we use the development of the set theoretic definition of this process as a follow-up exercise to the discussion of the Apriori algorithm. This built-in follow-up activity is just another of the interesting features of using market basket analysis as an integrating application. The resulting definition of the association rule learning phase is shown in Figure 3.
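The pruning argument of this section can be checked on a small example. The following is a minimal Python sketch under stated assumptions, not the implementation used in the course: the function names (`support`, `brute_force_rules`, `levelwise_rules`), the use of `frozenset` instead of the bit-string representation, and the six-record toy log are all hypothetical choices made for illustration. It compares the exhaustive enumeration of Section 5.1 with a level-wise construction that grows consequents one item at a time and discards any candidate that has a non-confident sub-consequent.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def brute_force_rules(frequent, transactions, min_confidence):
    """Try every split of `frequent` into non-empty A and C = F - A and keep
    the splits whose confidence support(F) / support(A) meets the threshold."""
    f_support = support(frequent, transactions)
    rules = []
    for size in range(1, len(frequent)):
        for items in combinations(sorted(frequent), size):
            antecedent = frozenset(items)
            a_support = support(antecedent, transactions)
            if a_support == 0:
                continue  # antecedent never purchased: no evidence for a rule
            confidence = f_support / a_support
            if confidence >= min_confidence:
                rules.append((antecedent, frequent - antecedent, confidence))
    return rules


def levelwise_rules(frequent, transactions, min_confidence):
    """Grow consequents level by level: a k-item consequent is only tested
    if all of its (k-1)-item subsets were confident consequents."""
    f_support = support(frequent, transactions)

    def confident(consequent):
        a_support = support(frequent - consequent, transactions)
        return a_support > 0 and f_support / a_support >= min_confidence

    level = set()
    if len(frequent) > 1:
        level = {frozenset({i}) for i in frequent if confident(frozenset({i}))}
    rules, k = [], 1
    while level:
        rules.extend((frequent - c, c) for c in level)
        k += 1
        if k >= len(frequent):
            break  # the antecedent must stay non-empty
        candidates = {ci | cj for ci in level for cj in level if len(ci | cj) == k}
        level = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
            and confident(c)
        }
    return rules


# Hypothetical toy log, far smaller than the 1000-record example of Section 2.
LOG = [
    frozenset({"hammer", "screwdriver", "tape"}),
    frozenset({"hammer", "screwdriver", "tape"}),
    frozenset({"hammer", "screwdriver", "tape"}),
    frozenset({"hammer", "nails"}),
    frozenset({"screwdriver", "tape"}),
    frozenset({"hammer", "screwdriver"}),
]
F = frozenset({"hammer", "screwdriver", "tape"})

for antecedent, consequent, confidence in brute_force_rules(F, LOG, 0.75):
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

Both functions should return the same set of rules for a given item set; the level-wise version simply avoids testing consequents such as {c, d} once c alone has failed the threshold. The pruning is sound because shrinking the consequent enlarges the antecedent, which can only lower the antecedent's support and therefore raise the confidence.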
Let $L = \bigcup_k L_k$.
Let $T = \{S \mid S \subseteq I\}$ be the set of all transactions.
Let $\langle A, C \rangle$ be an association rule with antecedent $A$ and consequent $C$.
Let $\mathit{confid}(\langle A, C \rangle) = |\{B \mid B \in T \wedge (A \cup C) \subseteq B\}| \,/\, |\{B \mid B \in T \wedge A \subseteq B\}|$.
Let $R_1 = \{\langle F - \{a\}, \{a\} \rangle \mid F \in L \wedge a \in F \wedge \mathit{confid}(\langle F - \{a\}, \{a\} \rangle) \geq \mathit{min\_confid}\}$
and
$$
\begin{aligned}
\forall k\, [\, (k > 1) \wedge (R_{k-1} \neq \emptyset) \rightarrow R_k = \{\, \langle A_i \cap A_j, C_i \cup C_j \rangle \mid\ & (\langle A_i, C_i \rangle \in R_{k-1}) \wedge (\langle A_j, C_j \rangle \in R_{k-1}) \wedge {} \\
& (|C_i - C_j| = 1 \wedge |C_j - C_i| = 1) \wedge {} \\
& (\forall S\, [((S \subseteq C_i \cup C_j) \wedge (|S| = k-1)) \rightarrow \langle (A_i \cup C_i) - S, S \rangle \in R_{k-1}]) \wedge {} \\
& (\mathit{confid}(\langle A_i \cap A_j, C_i \cup C_j \rangle) \geq \mathit{min\_confid}) \,\} \,]
\end{aligned}
$$
then
$R = \bigcup_k R_k$ is the set of all confident association rules.

Figure 3: Set Theoretic Definition of Association Rule Finding

Given space constraints, we will forego a detailed analysis of this definition. However, the analysis of this definition is directly analogous to the analysis given for the set theoretic definition of the Apriori algorithm.

6 SUMMARY AND CONCLUSION
Market-basket analysis and suggestive sell strategies are commonplace in today’s world, especially in connection with online retail stores. As such, students are well-motivated and engaged by the application. In this paper we have presented ways in which the market basket analysis application can be used to illustrate and integrate many of the typical topics in the discrete structures course of an undergraduate computer science curriculum. The topics involved in the definition and implementation of market-basket analysis include formal logic, sets, set operations, power sets, proof methods, dynamic programming, algorithm analysis, and data structures. Further, market-basket analysis can be divided into phases that allow lecture material and classroom experiences to focus on the application of these topics to the construction of frequent item sets and allow subsequent theoretical and programming assignments to apply these same concepts to the second phase of association rule learning.

While our discussion of market-basket analysis has focused on the use of this application in a discrete structures course, the richness of market-basket analysis provides a plethora of other possible uses in an undergraduate curriculum including (but not limited to) databases for storing learned associations, artificial intelligence for applying association rules, and advanced data structures for efficient implementations of sets using P-Trees and T-Trees [3].

REFERENCES
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I., “Fast Discovery of Association Rules”, from “Advances in Knowledge Discovery and Data Mining”, AAAI/MIT Press, 1996, pp. 307-328.
2. Cerin, C., Gay, J-S., Mahec, G. and Koskas, M., “Efficient Data Structures and Parallel Algorithms for Association Rules Discovery”, http://www-lipn.univ-paris13.fr/~cerin/documents/cerin_c_enc04.pdf
3. Coenen, F., Leng, P. and Ahmed, S., “Data Structure for Association Rule Mining: T-Trees and P-Trees”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 6, June 2004, pp. 774-778.
4. Decker, A. and Ventura, P., “We Claim this Class for Computer Science: A Non-Mathematician’s Discrete Structures Course”, ACM SIGCSE Bulletin, Proceedings of the 35th SIGCSE Technical Symposium on Computer Science Education, Vol. 36, No. 1, March 2004, pp. 442-446.
5. Han, J. and Kamber, M., “Data Mining, Concepts and Techniques”, Academic Press, 2001.
6. National Institute of Standards and Technology, http://www.nist.gov/dads/HTML/dynamicprog.html
7. National Institute of Standards and Technology, http://www.nist.gov/dads/HTML/principle.html
8. National Institute of Standards and Technology, http://www.nist.gov/dads/HTML/optimization.html
9. Park, J., Chen, M. and Yu, P., “An Effective Hash-Based Algorithm for Mining Association Rules”, Proceedings of the 1995 ACM-SIGMOD International Conference on the Management of Data (SIGMOD ’95), San Jose, CA, May 1995, pp. 175-186.
10. Tomer, D. S., Baldwin, D., Smith, C. H., Henderson, P. B. and Vadisigi, V., “CS1 and CS2: Foundations of Computer Science and Discrete Mathematics”, Panel presented at the 31st SIGCSE Technical Symposium on Computer Science Education, Austin, Texas, Vol. 32, No. 1, March 2000, pp. 397-398.