What's in the Box? Exploring the Inner Life of Neural Networks with Robust Rules
Jonas Fischer 1, Anna Oláh 1, Jilles Vreeken 2

1 Max Planck Institute for Informatics, Germany. 2 CISPA Helmholtz Center for Information Security, Germany. Correspondence to: Jonas Fischer.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We propose a novel method for exploring how neurons within neural networks interact. In particular, we consider activation values of a network for given data, and propose to mine noise-robust rules of the form X → Y, where X and Y are sets of neurons in different layers. We identify the best set of rules by the Minimum Description Length principle, as those rules that together are most descriptive of the activation data. To learn good rule sets in practice, we propose the unsupervised ExplaiNN algorithm. Extensive evaluation shows that the patterns it discovers give clear insight into how networks perceive the world: they identify shared and class-specific traits, compositionality, as well as locality in convolutional layers. Moreover, they are not only easily interpretable, but also super-charge prototyping by identifying which neurons to consider in unison.

1. Introduction

Neural networks achieve state-of-the-art performance in many settings. However, how they perform their tasks, how they perceive the world, and especially how the neurons within the network operate in concert remains largely elusive. While there exists a plethora of methods for explaining neural networks, most of these either focus on the mapping between input and output (e.g. model distillation) or only characterize a given set of neurons, but cannot identify which set to look at in the first place (e.g. prototyping). In this paper, we introduce a new approach to explain how the neurons in a neural network interact. In particular, we consider the activations of neurons in the network over a given dataset, and propose to characterize these in terms of rules X → Y, where X and Y are sets of neurons in different layers of the network. A rule hence represents that neurons Y are typically active when neurons X are. For robustness we explicitly allow for noise, and to ensure that we discover a succinct yet descriptive set of rules that captures the regularities in the data, we formalize the problem in terms of the Minimum Description Length principle (Grünwald & Roos, 2019). To discover good rule sets, we propose the unsupervised ExplaiNN algorithm and show that the rules we discover give clear insight into how networks perform their tasks. As we will see, these identify what the network deems similar and different between classes, how information flows within the network, and which convolutional filters it expects to be active where. Our rules are easily interpretable, give insight into the differences between datasets, show the effects of fine-tuning, and super-charge prototyping as they tell which neurons to consider in unison.

Explaining neural networks is of widespread interest, and especially important with the emergence of applications in healthcare and autonomous driving. In the interest of space we only shortly introduce the work most relevant to ours, and refer to surveys for more information (Adadi & Berrada, 2018; Ras et al., 2018; Xie et al., 2020; Gilpin et al., 2018). There exist several proposals for investigating how networks arrive at a decision for a given sample, with saliency mapping techniques for CNNs among the most prominent (Bach et al., 2015; Zhou et al., 2016; Sundararajan et al., 2017; Shrikumar et al., 2017). Although these provide insight into what parts of the image are used, they are inherently limited to single samples, and do not reveal structure across multiple samples or classes. For explaining the inner workings of a CNN, research mostly focuses on feature visualization techniques (Olah et al., 2017) that produce visual representations of the information captured by neurons (Mordvintsev et al., 2015; Gatys et al., 2015). Although these visualizations provide insight into how CNNs perceive the world (Øygard, 2016; Olah et al., 2018), it has been shown that concepts are often encoded over multiple neurons, and that inspecting individual neurons does not provide meaningful information about their role (Szegedy et al., 2013; Bau et al., 2017). How to find such groups of neurons, and how information is routed between the layers of the network, however, remains unsolved.
An orthogonal approach is model distillation, where we train easy-to-interpret white-box models to mimic the decisions of a neural network (Ribeiro et al., 2016; Frosst & Hinton, 2017; Bastani et al., 2017; Tan et al., 2018). Rules of the form if–then are easily interpretable, and hence a popular technique for model distillation (Taha & Ghosh, 1999; Lakkaraju et al., 2017). Existing techniques (Robnik-Šikonja & Kononenko, 2008; Özbakır et al., 2010; Barakat & Diederich, 2005) aim for rules that directly map input to output, rather than providing insight into how information flows through the network. Tran & d'Avila Garcez (2018) restrict themselves to Deep Belief Networks only, and for these propose to mine all sufficiently strong association rules. As such, their method suffers from the well-known pattern explosion. In contrast, Chu et al. (2018) propose to explain NNs by deriving the decision boundaries of a network using polytope theory. While this approach permits strong guarantees, it is limited to very small (< 20 hidden neurons) piecewise linear NNs. In sum, existing methods either do not give insight into what happens inside a neural network, and/or are not applicable to the type or size of state-of-the-art convolutional neural networks. Zhang et al. (2018) show how to gain insight into convolutional layers of neural networks by building an explanatory graph over sets of neurons. In contrast to what we propose, their method does not elucidate the relation between such filters and subsequent dense layers, nor to the network output.

Instead, we propose to mine sets of rules to discover groups of neurons that act together across different layers in feedforward networks, and so reveal how information is composed and routed through the network to arrive at the output. To discover rules over neuron activations, we need an unsupervised approach. While many rule mining methods exist, based either on frequency (Agrawal & Srikant, 1994; Bayardo, 1998; Moerchen et al., 2011) or on statistical testing (Hämäläinen, 2012; Webb, 2010), these typically return millions of rules even for small datasets, thus thwarting the goal of interpretability. We therefore take a pattern set mining approach similar to Grab (Fischer & Vreeken, 2019), where we are after the set of rules that maximizes a global criterion, rather than treating each rule independently. Although it provides succinct and accurate sets of rules, Grab is limited to conjunctive expressions. This is too restrictive for our setting, as we are also after rules that explain patterns shared between classes and that are robust to the inherently noisy activation data, both of which require a more expressive pattern language of conjunctions, approximate conjunctions, and disjunctions. We hence present ExplaiNN, a non-parametric and unsupervised method that learns sets of such rules efficiently.

2. Theory

We first informally discuss how to discover association rules between neurons. We then formally introduce the concept of robust rules and how to find them for arbitrary binary datasets, and last, we show how to combine these ideas to reveal how neurons are orchestrated within feedforward networks.

2.1. Patterns of neuron co-activation

Similar to neurons in the brain, artificial neurons send information along their outgoing edges when they are active. To understand the flow of information through the network, it is hence essential to understand the activation patterns of neurons between layers. Our key idea is to use recent advances in pattern mining to discover a succinct and non-redundant set of rules that together describe the activation patterns found for a given dataset. For two layers Ii, Ij, these rules X → Y, X ⊂ Ii, Y ⊂ Ij express that the set of neurons Y is usually co-activated when the neurons X are co-activated. That is, such a rule provides local information about co-activations within layers, as well as about the dependence of neurons between layers. Starting from the output layer, we discover rules between consecutive layers Ij, Ij−1. Discovering overlapping rules X → Y and Y → Z between layers, with X ⊂ Ij, Y ⊂ Ij−1, Z ⊂ Ij−2, allows us to trace how information flows through the entire network.

Before we can mine rules between two sets of neurons – e.g. layers – Ii and Ij of a network, we have to obtain its binarized activations for a given data set D = {dk = (sk, ok)}. In particular, for each sample sk and neuron set Ii, we take the tensor of activations φi and binarize it to φi^b. For networks with ReLU activations, which binarize naturally at threshold 0, we might lose some information about activation strength that is eventually used by subsequent layers. This binarization, however, allows us to derive crisp, symbolic, and directly interpretable statements on how neurons interact. Furthermore, binarization reflects the natural on/off state of biological neurons, which is also captured by smooth step functions such as the sigmoid or tanh used in artificial neural networks. We gather the binarized activations into a dataset D where each row tk corresponds to the concatenation of φi^b and φj^b of Ii and Ij for sk, i.e., tk ∈ D is a binary vector of length |Ii| + |Ij|. See Fig. 1 for a toy example.
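To make this preprocessing concrete, the following is a minimal sketch of how such a binary database could be collected. It assumes a PyTorch feedforward model, intermediate activations exposed via forward hooks, a data loader yielding (inputs, labels) batches, and the ReLU threshold of 0; the layer names in the usage comment are purely illustrative and this is not the authors' implementation.

```python
import torch

def binarize_activations(model, layers, loader, threshold=0.0):
    """Collect binarized (0/1) activations of the given layers over a dataset.

    Returns a {layer_name: BoolTensor of shape (n_samples, n_units)} dict;
    concatenating the per-layer matrices column-wise gives the binary database D
    of Section 2.1. `layers` maps a name to an nn.Module inside `model`.
    """
    captured = {}
    hooks = [
        module.register_forward_hook(
            lambda _m, _in, out, name=name: captured.__setitem__(name, out.detach())
        )
        for name, module in layers.items()
    ]
    rows = {name: [] for name in layers}
    model.eval()
    with torch.no_grad():
        for batch, _ in loader:
            model(batch)
            for name in layers:
                act = captured[name]
                # flatten feature maps; how to pool the spatial positions of a
                # convolutional filter into one item is a separate design choice
                rows[name].append((act.flatten(1) > threshold).cpu())
    for h in hooks:
        h.remove()
    return {name: torch.cat(chunks) for name, chunks in rows.items()}

# Usage sketch (layer names are illustrative):
# acts = binarize_activations(net, {"fc6": net.fc6, "fc7": net.fc7}, val_loader)
# D = torch.cat([acts["fc6"], acts["fc7"]], dim=1)   # items of I_i followed by I_j
```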
Figure 1: Overview. For a given network (left), binarized activations are gathered for the layers Ii, Ij for each sample and summarized in the binary database D (right). Rules are discovered over D; a good rule set M is shown at the bottom right, with rules X → Y, X ⊂ Ii, Y ⊂ Ij.

Next, given binary activation data D, our goal is to find the set of rules that together succinctly describe the observed activations. The Minimum Description Length (MDL) principle lends itself as an objective to find such sets. MDL is a statistically well-founded and computable approximation of Kolmogorov complexity (Li & Vitányi, 1993). First introduced by Rissanen (1978), the essential idea is that the model M* ∈ M that best describes data D is the model that losslessly encodes D using the fewest bits, M* = arg min_{M ∈ M} L(D, M). Here, our model class M is the superset of all possible rules over D, and by MDL we identify the best model M* as the one that compresses the data best. Traditionally, rule mining is restricted to conjunctions over items, which is not sufficient for our application: neuron activations tend to be noisy, and labels are inherently mutually exclusive. We hence consider an extended language that allows for partial disjunctions of items (neurons, labels), and introduce a codelength function L(D, M) to instantiate MDL for our model class of rule sets.

2.2. MDL for Robust Rules

Our goal is to find a set of rules M that, in terms of description length L(D, M), best describes a binary database D = {t | t ⊆ I} that consists of transactions t that are subsets of items I. Each rule is of the form X → Y, X, Y ⊂ I, and indicates that Y is strongly associated with X, i.e. occurs mostly in transactions where X is present. We say a rule X → Y applies to a transaction t iff X ⊂ t, and that it holds for t if additionally Y ⊂ t. We denote the corresponding transaction sets as T_X = {i | t_i ∈ D, X ⊂ t_i}, respectively T_{Y|X} = {i | t_i ∈ T_X, Y ⊂ t_i}. Based on these definitions, we can now formally introduce our codelength function L(D, M).

Baseline model  Our base model M_ind = {∅ → I | ∀I ∈ I} consists of singleton rules only, i.e. it models that all items I are generated independently. To send the n transactions of D using M_ind, we simply send for each item I in which of all transactions in the database it appears. We can do so optimally using a log-binomial code, which is given by

\[ \log \binom{|T_\emptyset|}{|T_I|} = \log \binom{n}{|T_I|} \, . \]

To unambiguously decode, the recipient needs to know each |T_I|, which we can optimally encode via the parametric complexities of the binomials, defined as

\[ L_{pc}(n) = \log \sum_{k=0}^{n} \frac{n!}{(n-k)!\,k!} \left(\frac{k}{n}\right)^{k} \left(\frac{n-k}{n}\right)^{n-k} , \]

which can be computed in linear time (Kontkanen & Myllymäki, 2007). We thus have

\[ L(D, M_{ind}) = \sum_{I \in \mathcal{I}} \left( \log \binom{n}{|T_I|} + L_{pc}(n) \right) . \]

M_ind serves as our baseline model, and its singleton rules are a required part of any more complex model, as they ensure that we can always send any data over I.

Non-trivial models  A non-trivial model M contains rules of the form X → Y, X, Y ⊂ I that are not part of M_ind. The idea is that we first transmit the data for where these non-trivial rules hold, and then send the remaining data using M_ind. To determine where such a rule applies, the receiver needs to know where X holds, and hence the data over X needs to be transmitted first. To ensure that we can decode the data, we only consider models M for which the directed graph G = (I, E) is acyclic, where there is an edge between two items i1, i2 iff they occur in the head and tail of a rule, that is, E = {(i1, i2) | ∃X → Y ∈ M. i1 ∈ X ∧ i2 ∈ Y}. We thus get a codelength

\[ L(D \mid M \cup M_{ind}) = \sum_{X \to Y \in M} \log \binom{|T_X|}{|T_{Y|X}|} \;+\; \sum_{\emptyset \to I \in M_{ind}} \log \binom{n}{|T'_I|} \, , \]

where T'_I = {t ∈ D | (I ∈ t) ∧ (∀X → Y ∈ M. I ∈ Y ⟹ t ∉ T_{Y|X})} is a modified transaction set containing the transactions with item I that are not covered by any non-trivial rule.

In addition to the parametric complexities of the binomial codes, the model cost of a non-trivial model also includes the cost of transmitting the non-trivial rules. To transmit a rule X → Y, we first send the cardinalities of X resp. Y using the universal code for integers L_N (Rissanen, 1983). For z ≥ 1, this is defined as L_N(z) = log* z + log c_0 with log*(z) = log z + log log z + ..., summing only over the positive components (Rissanen, 1983). To satisfy the Kraft inequality up to equality, we set c_0 = 2.865064. Knowing the cardinalities, we can then send the items of X resp. Y one by one using an optimal prefix code, given by

\[ L(X) = -\sum_{x \in X} \log \frac{|T_x|}{\sum_{I \in \mathcal{I}} |T_I|} \, . \]

For a particular rule X → Y ∈ M, the model cost of a rule, respectively of the full model, thus amounts to

\[ L(X \to Y) = L_{\mathbb{N}}(|X|) + L_{\mathbb{N}}(|Y|) + L(X) + L(Y) + L_{pc}(|T_X|) \, , \]
\[ L(M \cup M_{ind}) = |\mathcal{I}| \times L_{pc}(n) + L_{\mathbb{N}}(|M|) + \sum_{X \to Y \in M} L(X \to Y) \, . \]

We provide an example calculation in Supp. A.1. With these definitions, we have an MDL score that identifies the best rule set M* for data D as

\[ M^* = \underset{M \in \mathcal{M}}{\arg\min}\; L(M \cup M_{ind}) + L(D \mid M \cup M_{ind}) \, , \]

where M contains all possible rule sets over the items in D.

Robust rules  In real-world applications, we need a score that is robust against noise. The key problem with noisy data is that a single missing item in a transaction can cause a whole rule to neither hold nor apply. To discover rules that generalize well, we need to explicitly account for noise. The idea is to let rules apply, and hold, also when some items of the head, respectively the tail, are missing. By specifying how many items l, respectively k, out of all items in the rule head, respectively tail, need to be part of a transaction, we relax the original rule definition to account for missing items – in other words, noise.

Furthermore, as output neurons – the classes – are only active mutually exclusively, rules need to be able to model disjunctions. Setting l = 1 and k = 1 means that only one of the items of the head, respectively tail, needs to be present, which corresponds to a disjunction of the items in the head and tail of the rule X → Y and thus allows us to model output neurons correctly, while l = |X| and k = |Y| correspond to the original stringent rule definition. Varying between the two extremes accounts for varying levels of noise. The optimal l and k are those that minimize the MDL score.

To ensure a lossless encoding, we need to make sure that the receiver can reconstruct the original data. Thus, for the relaxed definition of when rules hold and apply introduced above, we send for each rule the corresponding number of items l that need to be present for it to apply, using L_N(l) bits. Knowing each l, the receiver can reconstruct where each rule applies. Sending where a rule holds now leaves the receiver with an approximation of the data. To be able to reconstruct the actual data, Fischer & Vreeken (2019) introduced error matrices that, when XORed with the approximation, yield the original data. These two matrices, X⁻_{X→Y} and X⁺_{X→Y}, correct for the errors made in the part where the rule applies and holds, respectively where it applies but does not hold. These error matrices are part of the model M and have to be transmitted with an accordingly adapted L(D, M). We provide examples and a short review of how to adapt the codelength function in Supp. A.

Complexity of the search  To discover rules over the activations of layers Ii, Ij, we have to explore all rules formed by a subset of neurons in Ii for the head combined with any subset of neurons of Ij for the tail. There exist 2^{|Ii|} × 2^{|Ij|} such rules, and hence 2^{2^{|Ii|+|Ij|}} distinct models would need to be explored. Fischer & Vreeken (2019) showed that the rule set search space does not lend itself to efficient search, as it is neither monotone nor submodular, and the counterexamples also hold for our model definition. In fact, for robust rules we additionally have to consider where rules should apply, respectively hold – optimizing k and l – which results in approximately 2^{|Ii| × |Ij| × 2^{|Ii|+|Ij|}} models (details in Supp. A.4). Exhaustive search is therewith infeasible, which is why we present ExplaiNN, a heuristic algorithm to efficiently discover good sets of rules.

2.3. Discovering good rule sets with ExplaiNN

ExplaiNN is based on the idea of iteratively refining the current model by merging and refining already selected rules. The key insight of the algorithm is that for a rule X → Y to summarize the data well, rules X → Y′ with only part of the tail, Y′ ⊂ Y, should also summarize the data well, as all tail items should co-occur similarly with head X. Starting from the baseline model M_ind, we iteratively and greedily search for better models until we can no longer improve the MDL score. As search steps we consider either introducing a new rule to M, by taking a good set of items X ⊂ Ii for the head and a single item A ∈ Ij for the tail and refining the model to M′ = M ⊕ {X → A} if this decreases the overall MDL score, or merging two existing rules r1 = X → Y1 ∈ M and r2 = X → Y2 ∈ M to form a new rule r′ = X → Y1 ∪ Y2, refining the model to M′ = M ⊕ {r′} = (M \ {r1, r2}) ∪ {r′}. For a rule r′ = X → Y, the refinement operator ⊕ adds r′ to M, removes the merged rules that led to r′, if any, and updates the singleton transaction lists T_A for all items A ∈ Y, removing all transactions where r′ holds.

To permit scaling up to the size of a typical neural net, we next discuss how to efficiently search for candidate rules with heads that can express anything from conjunctions to disjunctions. Immediately after, we present the full algorithm ExplaiNN for mining high-quality rule sets for two arbitrary sets of neurons (e.g. layers) of a network.
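The codelength terms of Section 2.2 are simple to evaluate. The sketch below assumes the data is available as a dense boolean NumPy matrix and that a rule is given as two lists of column indices; it computes the log-binomial data cost and the parametric complexity L_pc by direct summation (itself linear time; Kontkanen & Myllymäki (2007) give an equivalent recursion), and illustrates the score rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_binom(n, k):
    """log2 of the binomial coefficient C(n, k)."""
    return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)) / np.log(2)

def l_pc(n):
    """Parametric complexity of a binomial over n transactions, in bits.

    Direct O(n) evaluation of log2 sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k).
    """
    if n == 0:
        return 0.0
    k = np.arange(n + 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_terms = (
            gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
            + np.where(k > 0, k * np.log(k / n), 0.0)
            + np.where(n - k > 0, (n - k) * np.log((n - k) / n), 0.0)
        )
    return logsumexp(log_terms) / np.log(2)

def rule_data_cost(D, head, tail):
    """Log-binomial cost of transmitting where the rule head -> tail holds.

    D is a boolean (n x |I|) matrix; head and tail are lists of column indices.
    Uses the strict rule definition (all head items present, all tail items present).
    """
    applies = D[:, head].all(axis=1)            # transactions in T_X
    holds = applies & D[:, tail].all(axis=1)    # transactions in T_{Y|X}
    return log_binom(applies.sum(), holds.sum())
```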
Exploring the Inner Life of Neural Networks with Robust Rules Searching for candidates A key component of E X - low merging of similar heads and can thus revert too greedy PLAI NN is the candidate generation process, which imple- decisions introduced earlier. Furthermore, by observing that ments the two possible steps of generating new and merging independent rules X1 → Y1 , X2 → Y2 , Y1 ∪Y2 = ∅ do not existing rules. Given two layers Ii , Ij , to efficiently discover influence each others impact on codelength, we can add all rules that are both robust to noise, and may include disjunc- independent rules with the highest respective gain at once. tively active neurons in the head, we can not enumerate all We provide pseudocode for candidate generation and the possible rule heads for each individual tail neuron, as this E XPLAI NN algorithm in Supp. A.5. would result in |Ij | × 2|Ii | many rules. Instead, we keep a list Hy for each item y ∈ Ij , storing all head neurons x ∈ Ii for which y is frequently active when x is active, that is Complexity of E XPLAI NN The generation of new rules |T ∩T | σx,y = x|Tx | y > θ, where θ is a confidence threshold. We results in time O(n × |Ij | × |Ii |3 ), by iterating over each consider a rule X → Y to be good, if when neurons X are neuron in Ij , and considering each subset of the most active, the neurons Y are also likely to be active, which is overlapping neurons in Ii , and considering each threshold directly represented by the confidence θ. With parameter k = 1 . . . |Ii | for when the rule should apply, and the factor µ we account for early decisions on rule merging that later n from intersecting transaction lists T to compute the gain. hinder us to see a more general trend. The lists are sorted We can have at most |Ij | generated rules before considering decreasing on σ. We search in each Hy for the rule with rule merges, and in every iteration of merging we combine highest gain over all unions of first t = 1 . . . |Hy | neurons two rules, reducing the rule set size by 1. In each such step, in the list. We add that rule X → y with highest gain to the we consider |Ij |2 merges, for each of which we compute the candidate list. To compute the gain, we consider all possible gain considering noisy head and tail. We thus have a worst values k = 1 . . . |X| to determine for which transactions case runtime of O(n × |Ij |4 × |Ii |). As MDL ensures we TXk = {t ∈ D | |X ∩ t| ≥ k} the rule should robustly apply, consider models that tend to be succinct and hence capture where k = 1 corresponds to disjunction and k = |X| to only relevant structure in the data, E XPLAI NN is in prac- conjunction of neurons. tice much faster and easily scales to several thousands of neurons. For an individual neuron y, such a rule would be optimal, but, our goal is to discover groups of neurons that act in con- cert. To this end we hence iteratively merge rules with simi- 3. Experiments lar heads – similar, rather than same, as this gives robustness In this section we empirically evaluate E XPLAI NN on syn- both against noise in the data, as well as earlier merging de- thetic data with known ground truth and real world data to cisions of the algorithm. For two rules X1 → Y1 , X2 → Y2 explore how CNNs perceive the world. Other approaches with symmetric difference X1 X2 = (X1 \X2 )∪(X2 \X1 ), to discover patterns based on e.g. 
frequency measures or we consider possible candidate rules X1 ∪ X2 → Y1 ∪ Y2 statistical testing have already been shown to yield millions and X1 ∩ X2 → Y1 ∪ Y2 , iff |X1 X2 | ≤ µ for some or billions of rules or patterns, most spurious and redundant, threshold µ ∈ N. For example, µ = 1 corresponds to the and many more than anyone would be willing to investi- case that one head has one label more than the other, all gate, see e.g. (Fischer & Vreeken, 2019), we hence focus other labels are the same. on evaluating our method for the task of finding activation Both parameters θ and µ are simple, yet effective runtime patterns. Here, we look at CNNs as they count towards the optimizations. The best results with respect to MDL will most widespread use of feedforward networks and naturally always be obtained with the largest search space, i.e. with lend themselves for visualization, which helps us to interpret θ and µ set to 0, respectively |X1 | + |X2 |. Besides im- the discovered rules. We compare to traditional prototyp- pacting run-time, many of those rules may be uninteresting ing and activation map approaches on MNIST (LeCun & from a user-perspective, µ and θ allow to directly instruct Cortes, 2010), and examine which information is used how E XPLAI NN to ignore such rules. to arrive at classification for ImageNet (Russakovsky et al., 2015). Finally, we investigate the effect of fine-tuning in transfer learning on the Oxford Flower data (Nilsback & E XPLAI NN Assembling the above, we have E XPLAI NN, Zisserman, 2008). The implementation of E XPLAI NN is which given two sets of neurons Ii , Ij and a database of publicly available.1 For the below experiments, running on activations of these neurons, yields a heuristic approxima- commodity hardware E XPLAI NN took minutes for MNIST tion to the MDL optimal model M ∗ . By first introducing all and Flower, and up to 6 hours for ImageNet— yielding from relevant single neuron rules, it then proceeds by iteratively a few hundred up to 3000 rules, for the smaller, respectively merging existing rules using the approach described above, larger networks, and earlier, respectively later layers. until it can achieve no more gain. For efficiency, we separate the generation of the new rules from the merging of existing 1 http://eda.mmci.uni-saarland.de/ rules. In practice, this does not harm performance, as we al- explainn/
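The candidate generation for a single tail neuron described above – the confidence-sorted list H_y, prefixes of increasing length t, and robustness thresholds k ranging from disjunction to conjunction – can be sketched as follows. The MDL gain is abstracted into a callback, since it depends on the full model bookkeeping, and the default value of θ is an arbitrary placeholder rather than a value from the paper.

```python
import numpy as np

def candidate_head(D, head_cols, tail_col, gain_fn, theta=0.1):
    """Generate one candidate rule X -> {tail_col} for a single tail neuron.

    D is a boolean (n x |I|) activation matrix; head_cols is a list of column
    indices of the head layer, tail_col a column index of the tail layer.
    Head neurons x are ranked by confidence sigma(x, y) = |T_x ∩ T_y| / |T_x|,
    those above theta form H_y, and every prefix of H_y is evaluated with every
    robustness threshold k. gain_fn(X, k, applies, y) stands in for the MDL gain
    of adding the rule to the current model.
    """
    y = D[:, tail_col]
    support_x = D[:, head_cols].sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma = (D[:, head_cols] & y[:, None]).sum(axis=0) / support_x
    sigma = np.nan_to_num(sigma)
    H_y = [head_cols[i] for i in np.argsort(-sigma) if sigma[i] > theta]

    best = (None, None, 0.0)
    for t in range(1, len(H_y) + 1):
        X = H_y[:t]
        overlap = D[:, X].sum(axis=1)      # number of head items present per row
        for k in range(1, t + 1):          # k = 1: disjunction, k = t: conjunction
            applies = overlap >= k         # robust T_X^k
            gain = gain_fn(X, k, applies, y)
            if gain > best[2]:
                best = (X, k, gain)
    return best
```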
Exploring the Inner Life of Neural Networks with Robust Rules Figure 2: Evaluation of rule quality. Top: Performance of E XPLAI NN as precision and recall on data with varying number of planted rules with mutual exclusive head items (left) and co-occurring head items with varying noise (right). 10% noise corresponds to more noise than signal in the data. We provide the average (bar) and distribution (boxplot) across 10 repetitions. Bottom: Accuracy per class of VGG-S before (yellow) and after (blue) intervention on weights connecting neurons to class given by a rule, and 90% quantile of accuracies obtained for randomized intervention (red). 3.1. Recovering ground truth features across several classes. We compare to average ac- tivation maps and single neuron prototypes. Whereas the To evaluate how well E XPLAI NN can recover the average activation maps per class do not reveal the purpose ground truth from data, we first generate synthetic of a filter, we see that the rules learned by E XPLAI NN, binary data sets of 10000 samples and introduce clearly identify which pixels together trigger a filter. For {10, 50, 100, 200, 300, 500} rules with up to 5 items in example, in filter 2 in layer 1 the prototype looks like a maze head and tail, respectively. For each rule, the frequency and does not reveal any insight, and average activation maps is drawn from U (.02, .08), the confidence is drawn from just show the number given by the class, whereas the dis- U (.5, 1). We introduce noise by flipping 0.1% of the en- covered rules identify shared structure, such as curvatures tries chosen uniformly at random, and add 5 noise features shared between digits. For filter 36 in layer 2, the discovered with frequency equal to those of rules. Fischer & Vreeken rules show that it detects horizontal edges in a class specific (2019) showed that a similar MDL model works for con- manner, whereas prototyping and activation maps again fail junctive rules, hence we will focus on investigating perfor- to reveal this information. Interestingly, the discovered rules mance for mutually exclusive rule heads and noise. In the indicate that certain filters learn a negative, with activated first set of experiments, we set head items mutual exclu- areas corresponding to the imprint of the digit. We provide sively, in line of finding rules over the NN output labels. images visualizing rules, prototypes, and average activations E XPLAI NN achieves high recall and precision (see Fig- in Supp. B.1. ure 2) in terms of retrieving exact ground truth rules, and does not retrieve any redundant rules. Next, we investi- gate the impact of noise on the performance, generating How information flows To understand the inner life of data of 10000 samples and 100 rules similar to above, with neural networks in a more complex setting, we examine the head items now set co-occuring, varying the level of noise activations for the ImageNet data set of pretrained VGG- in {0.1%, 0.5%, 1%, 3%, 5%, 10%} bitflips in the matrix, S and GoogLeNet architectures (Chatfield et al., 2014; where 10% noise means more noise than actual signal. E X - Szegedy et al., 2015). We focus on analyzing the VGG-S re- PLAI NN is robust to noise, even when facing almost the sults for which an optimized and highly interpretable proto- same amount of noise and signal (see Fig. 2). typing method to visualize multiple neurons exists (Øygard, 2016), and provide results for GoogLeNet in Supp. B.2.1. 3.2. 
How neural networks perceive the world Here, we focus on particular rules, and provide a larger and more diverse set of results in Supp. Fig. 16, 17. We see How information is filtered We first consider the MNIST that rule-derived prototypes generally show highly inter- data of handwritten digits. We train a simple CNN of 2 con- pretable features for the corresponding classes. Mining for volutional and one fully connected layer using Keras, achiev- rules from the output to the last layer, E XPLAI NN yields ing 99% classification accuracy on test data (see Supp. B.1 rules with individual heads spanning multiple labels and for details). We are interested in what the individual filters tails spanning multiple neurons, which together encode the learn about the digits, and how E XPLAI NN reveals shared information shared between labels. Examples include the
Exploring the Inner Life of Neural Networks with Robust Rules covered for the individual classes. Beside discovering what shared and distinct traits the network has learned for classes, we also find out when it learns differences across samples of the same class. As one example, for the dog breed Great Danes, we discover three rules that upon visualization each correspond to visually very different sub-breeds, whereas a simple class prototype does not reveal any such information (Supp. Fig. 15). Next we investigate the information flow within the network, by iteratively finding rules between adjacent layers, starting with rules X → Y from output layer to last fully connected layer FC7. Based on this set of rules, we then apply E X - (a) Visualization for the whole tail PLAI NN to discover rules Y → Z between FC7 and FC6, where heads Y are groups of neurons found as tails in the previous iteration. We recursively apply this process until we arrive at a convolutional layer. This gives us traces of neuronal activity by chaining rules X → Y → Z → · · · discovered in the iterative runs. We visualize two such traces in Fig. 5, which give insight in how the network perceives different classes, passing on information from layer to layer. One example of a discovered trace is for the class totem pole (Fig. 5a). We observe that the set of neurons discov- ered for FC7 and FC6 each yield prototypes that clearly resemble the animalistic ornaments of such totem poles, which can also be found in the training data. Interestingly, we see that the neuron sets found for different filters of the (b) Visualization for the units in the tail individually last convolutional layer CONV5 together detect parts of the object, including the vertical pole, and the rooftop-like Figure 3: Characteristic faces. From the data for all structure, decorated with animalistic shapes with eyes, that dog breed categories, E XPLAI NN discovered the rule be- is typically found at the top of a totem. These filters act in tween the labels {Japanese spaniel, Pekinese, a highly specific manner, detecting only specific parts of Shih-Tzu, Lhasa, Affenpinscher, Pug, the image, such as thinner or wider vertical structures in the Brabancon griffon}, and 5 units from the FC7 layer, center, or objects at the top center of the image. for which a prototype is given in the top image. The units together capture the characteristic face of these breeds, We also find signs of overfitting, e.g. when considering the wherease individual units (bottom) give only little insight information trace for a set of dog breeds (Fig. 5b). Note that about the encapsulated information. due to space, we here only show a subset of the discovered rules. We observe that the prototypes for FC7 and FC6 both show side-views of animals. The networks seems to learn faces of certain dog breeds, for which, if we visualize these features that are specific to side photos of dogs, which are neurons individually (Fig. 3), it is hard to extract anything prevalent in the training data, also indicated by the filter meaningful from the images: the information is really en- prototypes. For the filters, we see that the network acts coded in the set of neurons that act together. on very specific parts of the image, detecting structures at the bottom that resemble paws and pairs of front and We also observe cases where rules describe how the network hind legs, and at the top of the image, which resemble discriminates between similar classes. We give an example dog faces and clouds. 
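The recursive tracing procedure described above – mine rules between the output and FC7, reuse the discovered tails as heads between FC7 and FC6, and so on down to a convolutional layer – can be expressed as a simple loop. Here, mine_rules is a hypothetical stand-in for a run of ExplaiNN between two neuron sets, so treat this as a sketch of the chaining logic only.

```python
def trace_information_flow(activations, layer_order, mine_rules, start_heads):
    """Chain rules across consecutive layers into information-flow traces.

    activations: dict layer_name -> boolean (n x |layer|) matrix of binarized
    activations; layer_order: layers from the output towards the input, e.g.
    ["labels", "fc7", "fc6", "conv5"]; mine_rules(D_upper, D_lower, heads) is a
    stand-in for running the rule miner between two layers, returning a list of
    (head, tail) pairs where head and tail are tuples of neuron indices;
    start_heads: the initial heads, e.g. one singleton tuple per class label.
    """
    traces = [[h] for h in start_heads]
    heads = start_heads
    for upper, lower in zip(layer_order, layer_order[1:]):
        rules = mine_rules(activations[upper], activations[lower], heads)
        extended = []
        for trace in traces:
            for head, tail in rules:
                # extend a trace whenever its last element is the head of a rule
                if trace[-1] == head:
                    extended.append(trace + [tail])
        traces = extended or traces
        heads = [tail for _, tail in rules]   # tails become the next-level heads
    return traces
```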
We also find more abstract features in Fig. 4 for the neurons E XPLAI NN discovers to be asso- with groups of filters detecting horizontal edges, which ciated with just huskies, just malamutes, and both of these reminds of the back of the dog in side-view. While there is classes together. These dog breeds are visually similar, shar- room for interpretation of prototypes, the discovered traces ing a black–white fur pattern, as well as head and ear shapes. provide evidence on how the network perceives the world, as These traits are reflected by the neurons corresponding to the information from prototypes can be interpreted across layers, rule for both breeds. Looking closer, we can see that distinct and in combination with the spatial location of activations traits, the more pointy ears of the husky, respectively the in the filters. fluffy fur of the malamute, are picked up by the neurons dis-
Exploring the Inner Life of Neural Networks with Robust Rules Siberian husky Siberian husky Malamute AND NOT AND NOT Malamute AND Malamute Siberian husky (a) Top: Malamute, Bottom: Siberian Husky (b) Rule prototypes for FC7 Figure 4: Neurons discriminating Huskies and Malamutes. a) Huskies and Malamutes are very similar looking dog breeds. b) Prototypes for rules X → Y discovered for classes X, Siberian husky (red frame), class Malamute (yellow frame), resp. both (orange frame) and neurons Y in FC7. The neurons associated with both classes represent typical features shared between the two classes, those associated only with Siberian huskies show their slightly sharper, more defined head, while those associated only with Malamutes capture their more fluffy fur. Rules carry class information To quantitatively assess for the original VGG-S show almost identical shapes and the rules that E XPLAI NN discovers, we here consider the pattern, but with less intense colour, and in both observe VGG-S network for ImageNet and intervene on those neu- prototypes with animal-like features such as eyes or beaks. rons in the last fully connected layer that E XPLAI NN finds to be class-associated. For each class c, we set incoming 4. Discussion and Conclusion weights from neurons y to 0, for which we have discovered a rule X → Y, c ∈ X, y ∈ Y , comparing classification The experiments show that E XPLAI NN is able to discover rate before and after intervention. As baseline, we addi- distinct groups of neurons that together capture traits shared tionally intervene on an equally sized random subset of all and distinct between classes, within-class heterogeneity, and weights leading to class c, again measuring classification how filters are used to detect shared features, segment back- rate after intervention. We see that for all classes, perfor- ground, or detect edges locally. Neither of these are revealed mance drops much more strongly for the actual interven- by activation maps, which miss the local information that tions than for the random ones, in most cases even to 0 (see patterns provide, nor by saliency maps, which investigate Fig. 2 bottom). This gives evidence that the discovered network attention for an individual image alone. Proto- rules capture information necessary for classification. We typing is a great tool for visualizing neuron information further observe that under intervention the model often pre- content, but, by itself is limited by the massive number of dicts closely related classes, e.g. Fire Salamander to possible combinations of neurons, requiring thousands of Spotted Salamander, Barbell to Dumbbell, or hours to painstakingly handpick and connect the information Palace to Monastery, which gives insight towards sim- of just individual neurons (Olah et al., 2020). Combining ilarity of classes, robustness of predictions, and therewith E XPLAI NN with prototyping permits exploring networks sensitivity to adversarial attacks. beyond single neurons, by automatically discovering which neurons act in concert, which information they encode, and how information flows through the network. The effect of fine tuning Finally, we show that E X - In particular, we discover distinct groups of neurons in fully PLAI NN provides insight into the effect of fine-tuning in connected layers that capture shared respectively distinct transfer learning. 
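Returning to the weight intervention described above, one concrete reading of it is sketched below: in a copy of the network we zero the final-layer weights that connect the rule's tail neurons to the class, and compare against zeroing an equally sized random subset. It assumes a PyTorch model whose last layer is an nn.Linear exposed as model.classifier and a hypothetical eval_class_accuracy helper; both names are stand-ins, not the authors' code.

```python
import copy
import torch

def intervene_on_class(model, class_idx, tail_neurons, eval_class_accuracy, loader):
    """Zero the weights from rule-tail neurons into one class and measure the drop.

    Assumes model.classifier is the final nn.Linear mapping the last fully
    connected layer to the class logits. Returns per-class accuracy for the
    intact model, the rule-based intervention, and a random intervention.
    """
    base_acc = eval_class_accuracy(model, loader, class_idx)

    # intervention on the neurons that the discovered rule associates with the class
    rule_model = copy.deepcopy(model)
    with torch.no_grad():
        rule_model.classifier.weight[class_idx, tail_neurons] = 0.0
    rule_acc = eval_class_accuracy(rule_model, loader, class_idx)

    # baseline: zero an equally sized random subset of the incoming weights
    rand_model = copy.deepcopy(model)
    n_in = rand_model.classifier.weight.shape[1]
    rand_idx = torch.randperm(n_in)[: len(tail_neurons)]
    with torch.no_grad():
        rand_model.classifier.weight[class_idx, rand_idx] = 0.0
    rand_acc = eval_class_accuracy(rand_model, loader, class_idx)

    return base_acc, rule_acc, rand_acc
```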
For this we consider Oxford Flower traits across classes, which helps in understanding how the data (Nilsback & Zisserman, 2008), which consists of 8k network learns generality but still can discriminate between images of flowers of 102 classes. For investigation, we con- classes. Due to the local information that our rules provide, sider both the vanilla VGG-S network trained on ImageNet we can also detect differences in the perception across sam- from above, and a fine-tuned version from the Caffee model ples of a single class, where for example different groups of zoo.2 We run E XPLAI NN to obtain rules between the output neurons describe visually different sub-breeds of a class of and the final layer of both networks. We provide examples dogs. By connecting rules that we find across several layers, in Supp. Fig. 18. The visualizations show, as expected, a we trace how information is gathered and combined to ar- strong emphasis on colour and shape of the corresponding rive at a classification, from filters that detect typical class flower. Interestingly, the visualizations of the same neurons specific features in the image, through fully connected lay- 2 https://github.com/jimgoo/caffe-oxford102 ers where multiple neurons together encode the combined
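The rule prototypes referred to throughout these experiments are obtained by optimizing an input to excite the rule's tail neurons jointly rather than one at a time. The sketch below is a bare-bones activation-maximization loop in PyTorch; it deliberately omits the regularization (jitter, blurring, frequency penalties) that prototyping methods such as Øygard's rely on for crisp images, so it is an assumption-laden illustration rather than the procedure used for the figures in the paper.

```python
import torch

def prototype_for_neurons(model, layer, neuron_idx, steps=200, lr=0.05,
                          input_shape=(1, 3, 224, 224)):
    """Gradient-ascend an input image that jointly activates a set of neurons.

    layer is the nn.Module whose (flattened) outputs contain the neurons of the
    rule tail, neuron_idx their indices. No image priors are applied here, so
    the result will be noisier than published prototypes.
    """
    captured = {}
    hook = layer.register_forward_hook(
        lambda _m, _in, out: captured.__setitem__("act", out)
    )
    image = (0.1 * torch.randn(input_shape)).requires_grad_(True)
    optimizer = torch.optim.Adam([image], lr=lr)
    model.eval()
    for _ in range(steps):
        optimizer.zero_grad()
        model(image)
        acts = captured["act"].flatten(1)[0]
        loss = -acts[neuron_idx].sum()      # maximize the joint activation
        loss.backward()
        optimizer.step()
    hook.remove()
    return image.detach()
```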
Exploring the Inner Life of Neural Networks with Robust Rules CONV5 Labels FC7 FC6 Totem pole (a) Information flow for class Totem Pole. CONV5 Labels FC7 FC6 Dogs* (b) Part of an information flow for *{Black-and-tan coonhound, english foxhound, borzoi, ibizan hound, saluki, scottish deerhound, curly-coated retriever, entle bucher, mexican hairless}. Figure 5: Information flow. Example rule cascades discovered for ImageNet. For each rule X → Y , the group of neurons of tail Y are used to generate a prototype (images in colored frames). To discover these rule cascades, we first mine rules between output and FC7. We use the tails of these rules (neurons of FC7) as heads to mine rules to the next layer (FC6). Finally, we use the tails of those rules to mine rules between FC6 and CONV5. information, up to the final classification output. Apply- Acknowledgements ing E XPLAI NN to investigate the impact of fine-tuning in transfer learning, we found that for groups of neurons in the Anna Olah and Jonas Fischer are supported by scholarships given fine-tuned CNN, surprisingly, the contained informa- from the International Max Planck Research School for tion is almost identical to the original CNN, but capturing Computer Science (IMPRS-CS). the traits of the new classes almost perfectly. For the given task, fine-tuning thus mostly resulted in routing information References differently, rather than learning to detect new features. Adadi, A. and Berrada, M. Peeking inside the black-box: A Overall, E XPLAI NN performs well and finds surprising re- survey on Explainable Artificial Intelligence (XAI). IEEE sults that help to understand how CNNs perceive the world. Access, 6:52138–52160, 2018. While many important tasks are solved by such networks, attention based architectures play an important role in e.g. Agrawal, R. and Srikant, R. Fast algorithms for mining language processing. Although rules can likely also help to association rules. In Proceedings of the 20th International understand what these models learn, these networks encode Conference on Very Large Data Bases (VLDB), Santiago an entirely different type of information that is inherently de Chile, Chile, pp. 487–499, 1994. hard to understand and visualize, and hence an exciting challenge for future work. Here, our main interest was char- Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, acterizing information flow through neural networks, and K.-R., and Samek, W. On pixel-wise explanations for hence, we focused on subsequent layers. E XPLAI NN, how- non-linear classifier decisions by layer-wise relevance ever, operates on arbitrary sets of neurons, thus naturally propagation. PloS one, 10(7):e0130140, 2015. allows investigating e.g. residual networks, where the previ- ous two layers contribute information. Currently scaling to Barakat, N. and Diederich, J. Eclectic rule-extraction from thousands of neurons, it will make for engaging future work support vector machines. International Journal of Com- to scale to entire networks at once. putational Intelligence, 2(1):59–62, 2005. Bastani, O., Kim, C., and Bastani, H. Interpreting blackbox
Exploring the Inner Life of Neural Networks with Robust Rules models via model extraction. CoRR, abs/1705.08504, Lakkaraju, H., Kamar, E., Caruana, R., and Leskovec, J. 2017. Interpretable & explorable approximations of black box models. arXiv preprint arXiv:1707.01154, 2017. Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of LeCun, Y. and Cortes, C. MNIST handwritten digit deep visual representations. In Proceedings of the IEEE database. 2010. URL http://yann.lecun.com/ conference on computer vision and pattern recognition, exdb/mnist/. pp. 6541–6549, 2017. Li, M. and Vitányi, P. An Introduction to Kolmogorov Com- Bayardo, R. Efficiently mining long patterns from databases. plexity and Its Applications. Springer-Verlag, Berlin, In Proceedings of the ACM International Conference on Heidelberg, 1993. ISBN 0387940537. Management of Data (SIGMOD), Seattle, WA, pp. 85–93, 1998. Moerchen, F., Thies, M., and Ultsch, A. Efficient mining of all margin-closed itemsets with applications in temporal Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. knowledge discovery and classification by compression. Return of the devil in the details: Delving deep into con- Knowledge and Information Systems, 29(1):55–80, 2011. volutional nets. arXiv preprint arXiv:1405.3531, 2014. Mordvintsev, A., Olah, C., and Tyka, M. Inceptionism: Chu, L., Hu, X., Hu, J., Wang, L., and Pei, J. Exact Going deeper into neural networks. 2015. and consistent interpretation for piecewise linear neu- ral networks: A closed form solution. In Proceedings Nilsback, M.-E. and Zisserman, A. Automated flower clas- of the 24th ACM SIGKDD International Conference on sification over a large number of classes. In 2008 Sixth Knowledge Discovery and Data Mining, KDD ’18, pp. Indian Conference on Computer Vision, Graphics & Im- 1244–1253. Association for Computing Machinery, 2018. age Processing, pp. 722–729. IEEE, 2008. ISBN 9781450355520. Olah, C., Mordvintsev, A., and Schubert, L. Feature vi- Fischer, J. and Vreeken, J. Sets of robust rules, and how to sualization. Distill, 2017. doi: 10.23915/distill.00007. find them. In Proceedings of the European Conference on https://distill.pub/2017/feature-visualization. Machine Learning and Principles and Practice of Knowl- Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, edge Discovery in Databases (ECML PKDD), Würzburg, L., Ye, K., and Mordvintsev, A. The building blocks of Germany. Springer, 2019. interpretability. Distill, 2018. doi: 10.23915/distill.00010. Frosst, N. and Hinton, G. Distilling a neural network into https://distill.pub/2018/building-blocks. a soft decision tree. arXiv preprint arXiv:1711.09784, Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, 2017. M., and Carter, S. Zoom in: An introduction to cir- Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algo- cuits. Distill, 2020. doi: 10.23915/distill.00024.001. rithm of artistic style. arXiv preprint arXiv:1508.06576, https://distill.pub/2020/circuits/zoom-in/. 2015. Øygard, A. M. Peeking inside convnets. Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., https://www.auduno.com/2016/06/18/ and Kagal, L. Explaining explanations: An overview of peeking-inside-convnets/, June 2016. interpretability of machine learning. In 2018 IEEE 5th Özbakır, L., Baykasoğlu, A., and Kulluk, S. A soft International Conference on data science and advanced computing-based approach for integrated training and analytics (DSAA), pp. 80–89. IEEE, 2018. 
rule extraction from artificial neural networks: Difaconn- Grünwald, P. and Roos, T. Minimum description length miner. Applied Soft Computing, 10(1):304–317, 2010. revisited. International Journal of Mathematics for In- Ras, G., van Gerven, M., and Haselager, P. Explanation dustry, 11(01):1930001, 2019. methods in deep learning: Users, values, concerns and Hämäläinen, W. Kingfisher: an efficient algorithm for challenges. In Explainable and Interpretable Models searching for both positive and negative dependency rules in Computer Vision and Machine Learning, pp. 19–36. with statistical significance measures. Knowledge and Springer, 2018. information systems, 32(2):383–414, 2012. Ribeiro, M. T., Singh, S., and Guestrin, C. ”Why should I Kontkanen, P. and Myllymäki, P. A linear-time algorithm for trust you?” Explaining the predictions of any classifier. computing the multinomial stochastic complexity. Infor- In Proceedings of the 22nd ACM SIGKDD international mation Processing Letters, 103(6):227–233, 2007. ISSN conference on knowledge discovery and data mining, pp. 0020-0190. 1135–1144, 2016.
Exploring the Inner Life of Neural Networks with Robust Rules Rissanen, J. Modeling by shortest data description. Auto- Xie, N., Ras, G., van Gerven, M., and Doran, D. Explainable matica, 14(5):465–471, 1978. deep learning: A field guide for the uninitiated. arXiv preprint arXiv:2004.14545, 2020. Rissanen, J. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, Zhang, Q., Cao, R., Shi, F., Wu, Y. N., and Zhu, S. Inter- 11(2):416–431, 1983. preting CNN knowledge via an explanatory graph. In Proceedings of the Thirty-Second AAAI Conference on Robnik-Šikonja, M. and Kononenko, I. Explaining classifi- Artificial Intelligence, (AAAI-18), the 30th innovative Ap- cations for individual instances. IEEE Transactions on plications of Artificial Intelligence (IAAI-18), and the 8th Knowledge and Data Engineering, 20(5):589–600, 2008. AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., February 2-7, 2018, pp. 4454–4463, 2018. Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Imagenet large scale Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, visual recognition challenge. International Journal of A. Learning deep features for discriminative localiza- Computer Vision, 115(3):211–252, 2015. tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929, 2016. Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differ- ences. In Proceedings of the 34th International Confer- ence on Machine Learning - Volume 70, ICML’17, pp. 3145–3153. JMLR.org, 2017. Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attri- bution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 3319–3328. JMLR.org, 2017. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015. Taha, I. A. and Ghosh, J. Symbolic interpretation of artificial neural networks. IEEE Transactions on knowledge and data engineering, 11(3):448–463, 1999. Tan, S., Caruana, R., Hooker, G., Koch, P., and Gordo, A. Learning global additive explanations for neural nets us- ing model distillation. arXiv preprint arXiv:1801.08640, 2018. Tran, S. N. and d’Avila Garcez, A. S. Deep logic net- works: Inserting and extracting knowledge from deep belief networks. IEEE Transactions on Neural Networks and Learning Systems, 29(2):246–258, 2018. Webb, G. I. Self-sufficient itemsets: An approach to screen- ing potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data, 4(1):1–20, 2010.
Exploring the Inner Life of Neural Networks with Robust Rules A. MDL for robust rules ric complexity. We thus get In this section we will give extended examples on how to X compute the MDL score for a given database and set of L(M ∪ Mind ) =|I| × Lpc (|D|) + LN (|X|) rules, elaborate on the error encoding for the rule tails, and X→Y ∈M give a visual toy example on the impact of the extended ! pattern language for the rule head. + LN (|Y |) + L(X) + L(Y ) + Lpc (TX ) A.1. Computing MDL for rules =4 × Lpc (100) + LN (1) + LN (2) For the given example in Fig. 6, we will now compute the 40 35 33 codelength L(D, M ) = L(M ) + L(D | M ) of transmitting − log − log − log + Lpc (40) the whole database D using M ∪ Mind . Here, we will stick 143 143 143 with the simple encoding without error matrices, to make the process of computation more understandable. For reference, ≈14.88 + 1.52 + 2.52 + 1.84 we first compute the baseline model, which is given by + 2.03 + 2.12 + 3.11 =28.02. X |D| Hence, the model with the complex rule has a smaller code- L(D, Mind ) =|I| × Lpc (|D|) + log |TI | lengh than the baseline, with L(D, M ∪ Mind ) = 284.44 I∈I 100 bits. =4 × Lpc (100) + log 40 A.2. The error encoding for tails 100 100 + 2 log + log For the error encoding for tails, which allow to discover 35 33 ≈14.88 + 93.47 + 179.64 + 87.93 = 375.92. rules in noisy settings (compare Fig. 7a,b), we send where a rule X → Y approximately holds according to some parameter k, which defines the number of items of the tail that have to be present in the transaction. The errors made by Thus, sending the data with just the baseline model costs this approximation are then accounted for by sending error − + 375.92 bits. Now, we will compute L(D, M ∪ Mind ), we correcting matrices XX→Y and XX→Y , which account for will start with the costs of sending the data L(D | M ∪ the destructive, respectively additive noise in the are where Mind ) the rule applies (compare Fig. 7c). Let us first assume we are given a k, we will later show ! how we can optimize for k. We redefine the transaction X |TX | sets TY |X = {t ∈ D | (X ⊂ t) ∧ (|Y ∩ t| ≥ k)}, L(D | M ∪ Mind ) = log |TY |X | which corresponds to the transactions where the rule ap- X→Y ∈M ! proximately holds. We will now slightly abuse notation X |D| and indicate the binary input matrix that correspond to D + log |TI0 | by D, and we subset this matrix using the transaction id I∈I lists and item subsets. Both of these are sets of indices that 40 100 100 indicate which rows, respectively columns to use of the ma- = log + log + log 30 40 5 trix. For example, the submatrix where X holds is given by D[TX , X]. We can now define the error correcting matrices − − = D[TY |X , Y ] × 1|TY |X |×|Y | , and XX→Y 100 100 to be XX→Y = + log + log 3 35 D[T X \ TY |X , Y ], where × is the element-wise XOR oper- i×j ≈29.66 + 93.47 + 26.17 + 17.30 + 89.82 ator and 1 is a matrix of size i × j filled with ones. The receiver, knowing TX and TY |X , can then reconstruct the − original data D[TY |X , Y ] = 1|TY |X |×|Y | × XX→Y , respec- =256.42. + tively D[TX \ TY |X , Y ] = XX→Y . While this explains the concept of how error correcting The model costs are composed of the parametric complexi- matrices can be used to reconstruct the original input, which ties for the (adapted) baseline rules, plus the costs of trans- hence define a lossless encoding, we are mainly interested mitting what the rule is composed of along with its paramet- in the codelength functions. To adapt the data costs, we now
Exploring the Inner Life of Neural Networks with Robust Rules Figure 6: Example database and model. A toy database D with blocks indicating where the items A, B, C, D occur in D, margins and relevant joint counts are given on the right. A sensible rule set M ∪ Mind = A → BC ∪ Mind is given on the right, the part of the database where the rule applies and holds is indicated by a light respectively dark orange area. additionally send the two error matrices, which we can do as is the data cost. That means we can optimize a k for using binomial codes. Hence, we get each rule separately. Thus, for a given rule X → Y we can ! enumerate all |Y | many models for the different thresholds X |TX | k and let MDL decide which one fits the data best. L(D | M ) = log |TY |X | X→Y ∈M ! A.3. The impact of the extended pattern language X |D| + log Extending the pattern language for rule heads is crucial to |TI0 | I∈I be applicable for tracing activation patterns through a neural |TY |X | × |Y | network. First of all, we need to start from labels, which + log − |XX→Y | are inherently activated mutually exclusive, as we only have |TX \ TY |X | × |Y | a single label as classification. To find shared features of + log + , labels, it is essential to be able to express disjunctions with |XX→Y | rule heads. Furthermore, the data as well as activation with the second line providing the codelength of the error patterns across the data are very noisy. Thus, determining matrices, and |X | indicating the number of ones in X . where a rule applies just based on conjunctions of features can give a very twisted look of the data at hand, as visualized Our model M now not only consists of rules M ∪ Mind , but in Fig. 8. That is the reason to introduce a more flexible also of the set of error correcting matrices. As the submatrix language similar to approximate rule tails, which solves to which we need to apply the matrix is fully defined by these issues. TX , TY |X , and Y of the corresponding rule, also defining its size, the only adaptation we need for the model costs A.4. Search complexity is the parametric complexities induced by the codes for transmitting the data. This yields The size of the search space implied by our model class |Ii |+|Ij | M is in O(2|Ii |×|Ij |×2 ). For two layers Ii , Ij , we X L(M ) =|I| × Lpc (|D|) + L(X → Y ) X→Y ∈M ! + Lpc (|TY |X | × |Y |) + Lpc (|TX \ TY |X | × |Y |) . This completes the MDL costs for rules robust to noise in the tail for a given k. To optimize k, the crucial insight is that the codelength of individual complex rules are independent,
Exploring the Inner Life of Neural Networks with Robust Rules Figure 7: Example of tail error encoding. For a given database D given in a, where blocks indicate the occurrence of items, a good rule is given byA → BCDE. The part of the database where the rule applies is indicated by the orange area. In b we show the part of the transaction were the rule holds for varying number k of tail items that have to be present in a transaction, from all items on the left – corresponding to a conjunction – towards just a single item on the right, which corresponds to a disjunction. In c we visualize the error encoding used to transmit the data for k = 3. We first transmit the data where the rule holds, resulting in the area that is indicated by the gray block. XORing the error matrix X − with this block, it is possible to reconstruct the original data for the part where the rule holds. Using X + , we reconstruct the original data in the area where the rule applies but does not hold. enumerate all possible rules by encoding. As in theory we can have any subset of these rules |Ii |+|Ij | |Ii | X |Ij | as a model, we thus get approximately 2(|Ij |×|Ii |×2 ) X |Ii | |Ij | many different models. k× × l× k l k=0 l=0 | {z } | Possibilities for head {z } Possibilities for tail A.5. Algorithm Pseudocode |Ii | |Ij | E XPLAI NN explores the search space of rule sets in an itera- X |Ii | X |Ij | ≤ |Ii | × |Ij | tive fashion, either generating new rules with a single item in k l k=0 l=0 the tail, or merging two existing rules, thus generating more |Ii | |Ij | = |Ii | 2 × |Ij | 2 = |Ij | |Ii | 2|Ii |+|Ij | , complex rules with multiple items in the tail. Using these two steps, we can generate all potential candidate rules to where the first sum enumerates all heads of size k, the bi- add to the model, and evaluate their respective gain in terms nomial coefficient describes the ways of drawing heads of of MDL. For a rule r0 , we will say model M 0 = M ⊕ r0 is such size, and the term k is the number of models given by the refined model, with the refinement operator ⊕ adding the robust head encoding. Similarly, the second sum enu- the rule r0 = X → Y to M , removing the merged rules that merates all tails of size l, the binomial coefficient describes led to r0 , if any, and updating the singleton transaction lists the drawing of such tails, and the term l is the number of TA for all items in the tail A ∈ Y . Here, we will provide ways to place the error correcting matrices for the robust tail