SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
1 SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets Sayyed-Ahmad Naghavi-Nozad and Maryam Amir Haeri and Gianluigi Folino Abstract—This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Differently from the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk-by-chunk within the confines of a limited memory buffer. At the first phase, a temporary arXiv:2006.07616v6 [cs.LG] 26 Oct 2020 clustering model is built, then it is incrementally updated by analyzing consecutive memory-loads of points. Subsequently, at the end of scalable clustering, the approximate structure of original clusters is obtained. Finally, by another scan of the entire dataset and using a suitable criterion, an outlying score is assigned to each object, which is called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient compared to best-known conventional density-based methods, which need to load all data into the memory; and also, to some fast distance-based methods, which can perform on data resident in the disk. Index Terms—local outlier detection, massive-scale datasets, scalable, density-based clustering, anomaly detection F 1 I NTRODUCTION outliers only deviate significantly w.r.t. a specific neighbor- hood of the object [1], [5]. In [6], [7], it is noted that the Outlier detection, which is a noticeable and open line of concept of local outlier is more comprehensive than that of research [1], [2], [3], is a fundamental issue in data min- global; i.e., a global outlier could also be considered as a ing. Outliers refer to rare objects that deviate from well- local one, but not vice versa. This is the reason that makes defined notions of expected behavior and discovering them finding local outliers much more cumbersome. is sometimes compared with searching for a needle in In recent years, advances in data acquisition have made a haystack, because the rate of their occurrence is much massive collections of data, which contain valuable informa- smaller than normal objects. Outliers often interrupt the tion in diverse fields like business, medicine, society, gov- learning procedure from data for most of analytical models ernment etc. As a result, the common traditional software and thus, capturing them is very important, because it can methods for processing and management of this massive enhance model accuracy and reduce computational loads of amount of data will no longer be efficient, because most the algorithm. However, outliers are not always annoying of these methods assume that the data is memory-resident and sometimes, they become of special interest for the and their computational complexity for large-scale datasets data analyst in many problems such as controlling cellular is really expensive. phones activity to detect fraudulent usage, like stolen phone To overcome the above-mentioned issues, we propose airtime. 
Outlier detection methods could also be considered a new scalable and density-based clustering method for as a preprocessing step, useful before applying any other local outlier detection in massive-scale datasets that cannot advanced data mining analytics, and it has a wide range of be loaded into memory at once, employing a chunk-by- applicability in many research areas including intrusion de- chunk load procedure. In practice, the proposed approach tection, activity monitoring, satellite image analysis, medical is a clustering method for very large datasets, in which condition monitoring etc [2], [4]. outlier identification comes after that as a side effect; and Outliers can be generally partitioned into two different it is inspired by a scaling clustering algorithm for very categories, namely global and local. Global outliers are large databases, named after its authors, BFR [8]. BFR has objects, which show significant abnormal behavior in com- a strong assumption on the structure of existing clusters, parison to the rest of the data, and thus in some cases they which ought to be Gaussian distributed with uncorrelated are considered as point anomalies. On the contrary, local attributes, and more importantly, it is not introducing noise. In short, this clustering algorithm works as follows. First, it • Sayyed-Ahmad Naghavi-Nozad is with the Computer Engineering De- reads the data as successive (preferably random) samples, partment, Amirkabir University of Technology, Tehran, Iran; Email: so that each sample can be stored in the memory buffer, sa_na33@aut.ac.ir and then it updates the current clustering model over the Maryam Amir Haeri is with the Department of Computer Science, Univer- sity of Kaiserslautern, Kaiserslautern, Germany; Email: haeri@cs.uni-kl.de contents of buffer. Based on the updated model, singleton Gianluigi Folino is with ICAR-CNR, Rende, Italy; Email: gian- data are classified in three groups: some of them can be dis- luigi.folino@icar.cnr.it carded over the updates to the sufficient statistics (discard Sayyed-Ahmad Naghavi-Nozad is the corresponding author. set DS); some can be moderated through compression and
2 abstracted as sufficient statistics (compression set CS); and 2 R ELATED WORKS some demand to be retained in the buffer (retained set RS). Like BFR, our proposed method operates within the Outlier detection methods can be divided into following six confines of a limited memory buffer. Thus, by assuming that categories [13]: extreme value analysis, probabilistic meth- an interface to the database allows the algorithm to load an ods, distance-based methods, information-theoretic meth- arbitrary number of requested data points, whether sequen- ods, clustering-based methods, and density-based methods. tially or randomized, we are forced to load the data chunk- In extreme value analysis, the overall population is sup- by-chunk, so that there is enough space for both loading posed as having a unique probability density distribution, and processing each chunk at the same time. The proposed and only those objects at the very ends of it are considered approach is based on clustering, and therefore, it must avoid as outliers. In particular, these types of methods are useful that outliers play any role in forming and updating clusters. to find global outliers [13], [14]. After processing each chunk, the algorithm should combine In probabilistic methods, we assume that the data were its approximate results with those of previous chunks, in generated from a mixture of different distributions as a a way that the final approximate result will compete with generative model, and we use the same data to estimate the same result obtained by processing the entire dataset the parameters of the model. After determining the spec- at once. An algorithm, which is capable of handling data ified parameters, outliers will be those objects with a low in such an incremental way and finally provides an ap- likelihood of being generated by this model [13]. Schölkopf proximate result, from an operational perspective is called a et al. [15] propose a supervised approach, in which a prob- scalable algorithm [9]. Moreover, from an algorithmic point abilistic model w.r.t. the input data is provided, so that it of view, scalability means that algorithm complexity should can fit normal data in the best possible way. In this manner, be nearly linear or sublinear w.r.t. the problem size [10]. the goal is to discover the smallest region that contains most of normal objects; data outside this region are supposed to In more detail, the proposed method comprises of three be outliers. This method is, in fact, an extended version of steps. In the first step, a primary random sampling is carried Support Vector Machines (SVM), which has been improved out to create an abstraction of the whole data on which to cope with imbalanced data. In practice, in this way, a the algorithm works. Then, an initial clustering model is small number of outliers is considered as belonging to the built and some information required for the next phase rare class and the rest of data as belonging to normal objects. of incremental clustering will be acquired. In the second step, on the basis of currently loaded chunk of data into In distance-based methods, distances among all objects memory, a scalable density-based clustering algorithm is are computed to detect outliers. An object is assumed to executed in order to identify dense regions, which leads to be a distance-based outlier, iff it is 0 distance away from building incrementally some clusters, named mini-clusters at least fraction 0 of other objects in the dataset [16]. or sub-clusters. 
When all chunks are processed, the final Bay and Schwabacher [17] propose an optimized nested- clustering model will be built by merging the information loop algorithm based on the k nearest neighbor distances obtained through these mini-clusters. Finally, by applying among objects, that has a near linear time complexity, and a Mahalanobis distance criterion [11], [12] to the entire is shortly named ORCA. ORCA shuffles the input dataset in dataset, an outlying score is assigned to each object. random order using a disk-based algorithm, and processes it in blocks, as there is no need to load the entire data into In summary, the main contributions of our proposed memory. It keeps looking for a set of user-defined number method are listed as follows: of data points as potential anomalies, as well as for their anomaly scores. The cut-off value is set as the minimum • There is no need to know the real number of original outlierness score of the set, and it will get updates if there clusters. is a data point having a higher score in other blocks of data. • It works better with Gaussian clusters having corre- For data points which obtain a lower score than the cut-off, lated or uncorrelated features, but it can also work they should be pruned; this pruning scheme will only expe- well with convex-shaped clusters following an arbi- dite the process of distance computation in the case of data trary distribution. being ordered in an uncorrelated manner. ORCA’s worst • It has a linear time complexity with a low constant. case time-complexity is 2 , and the I/O cost for the • In spite of working in a scalable manner and op- data accesses is quadratic. For the anomaly definition, it can erating on chunks of data, in terms of detection use either the kth nearest neighbor or average distance of accuracy, it is still able to compete with conventional k nearest neighbors. Furthermore, in [18], a Local Distance- density-based methods, which maintain all the data based Outlier Factor (LDOF) is proposed to find outliers in memory; and also, with some fast distance-based in scattered datasets, which employs the relative distance methods, which do not require to load the entire data from each data object to its nearest neighbors. Sp [19] is a into memory at their training stage. simple and rapid distance-based method that utilizes the Nearest Neighbor distance on a small sample from the The paper is organized as follows: Section 2 discusses dataset. It takes a small random sample of the entire dataset, some related works in the field of outlier detection. In and then assigns an outlierness score to each point, as the Section 3, we present detailed descriptions of the proposed distance from the point to its nearest neighbor in the sample approach. In Section 4, experimental results and analysis set. Therefore, this method has a linear time complexity on various real and synthetic datasets are provided. Finally, in the number of objects, dimensions, and samples, and conclusions are given in Section 5. also a constant space complexity, which makes it ideal for
3 analyzing massive datasets. each point, the density around it is estimated employing a Information-theoretic methods could be particularly Weighted Kernel Density Estimation (WKDE) strategy with deemed at almost the same level as distance-based and a flexible kernel width. Moreover, the Gaussian kernel func- other deviation-based models. The only exception is that tion is adopted to corroborate the smoothness of estimated in such methods, a fixed concept of deviation is determined local densities, and also, the adaptive kernel bandwidth en- at first, and then, anomaly scores are established through hances the discriminating ability of the algorithm for better evaluating the model size for this specific type of deviation; separation of outliers from inliers. ODRA is specialized for unlike the other usual approaches, in which a fixed model is finding local outliers in high dimensions and is established determined initially and employing that, the anomaly scores on a relevant attribute assortment approach. There are two are obtained by means of measuring the size of deviation major phases in the algorithm, which in the preliminary one, from such model [13]. Wu and Wang [20] propose a single- unimportant features and data objects are pruned to achieve parameter method for outlier detection in categorical data better performance. In other words, ODRA firstly tries to using a new concept of Holoentropy and by utilizing that, distinguish those attributes which contribute substantially a formal definition of outliers and an optimization model to anomaly detection from the other irrelevant ones; and for outlier detection is presented. According to this model, also, a significant number of normal points residing in the a function for the outlier factor is defined which is solely dense regions of data space will be discarded for the sake based on the object itself, not the entire data, and it could be of alleviating the detection process. In the second phase, updated efficiently. like RDOS, a KDE strategy based on kNN, RNN and SNN Clustering-based methods use a global analysis to detect of each data object is utilized to evaluate the density at crowded regions, and outliers will be those objects not the corresponding locations, and finally assign an outlying belonging to any cluster [13]. A Cluster-Based Local Outlier degree to each point. In NaNOD, a concept of Natural Factor (CBLOF) in [21], and a Cluster-Based Outlier Factor Neighbor (NaN) is used to adaptively attain a proper value (CBOF) in [22] are presented. In both of them, after the for k (number of neighbors), which is called here the Natural clustering procedure is carried out, in regard to a specific Value (NV), while the algorithm does not ask for any input criterion, clusters are divided into two large and small parameter to acquire that. Thus, the parameter choosing groups; it is assumed that outliers lie in small clusters. At challenge could be mitigated through this procedure. More- last, the distance of each object to its nearest large cluster is over, a WKDE scheme is applied to the problem to assess the used in different ways to define the outlier score. local density of each point, as it takes benefit of both kNN In density-based methods, the local density of each ob- and RNN categories of nearest neighbors, which makes ject is calculated in a specific way and then is utilized to the anomaly detection model become adjustable to various define the outlier scores. Given an object, the lower its local kinds of data patterns. 
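For concreteness, the sampling-based scoring used by Sp in the discussion of distance-based methods above can be sketched in a few lines of Python (NumPy only). The function name is illustrative, and the default sample size of 20 simply mirrors the author-suggested value reported later in Section 4; this is a reading of the idea, not the reference implementation.

import numpy as np

def sp_scores(X, sample_size=20, rng=None):
    """Score each point by its distance to the nearest point of a small
    random sample (the Sp idea): the larger the distance, the more outlying."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    S = X[idx]                                      # the random sample
    # distances from every point to every sampled point
    d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)
    # ignore the zero distance of a sampled point to itself
    d[np.arange(len(X))[:, None] == idx[None, :]] = np.inf
    return d.min(axis=1)                            # nearest-sample distance

# usage: scores = sp_scores(data); ranking = np.argsort(scores)[::-1]

The constant memory footprint claimed for Sp comes from the fact that only the small sample, not the whole dataset, has to be kept for the distance computations.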
density compared to that of its neighbors, the more likely Beyond the mentioned six categories of outlier detection it is that the object is an outlier. Density around the points methods, there is another state-of-the-art ensemble method could be calculated by using many techniques, which most based on the novel concept of isolation, named iForest [29], of them are distance-based [6], [13]. For example, Breunig [30]. The main idea behind iForest is elicited from a well- et al. [6] propose a Local Outlier Factor (LOF) that uses known ensemble method called Random Forests [31], which the distance values of each object to its nearest-neighbors is mostly employed in classification problems. In this ap- to compute local densities. However, LOF has a drawback proach, firstly it is essential to build an ensemble of isolation which is that the scores obtained through this approach are trees (iTrees) for the input dataset, then outliers are those not globally comparable between all objects in the same objects with a short average path length on the correspond- dataset or even in different datasets. The authors of [23] ing trees. For making every iTree, after acquiring a random introduce the Local Outlier Probability (LoOP), which is an sub-sample of the entire data, it is recursively partitioned enhanced version of LOF. LoOP gives each object a score by randomly choosing an attribute, and then randomly in the interval [0,1], which is the probability of the object specifying a split value among the interval of minimum and being an outlier, and is widely interpretable among various maximum values of the chosen attribute. Since this type of situations. An INFLuenced Outlierness (INFLO) score is partitioning can be expressed by utilizing a tree structure, presented in [24], which adopts both nearest neighbors thus the number of times required to partition the data and reverse nearest neighbors of an object to estimate its for isolating an object is equal to the path length from the relative density distribution. Moreover, Tang and He [25] starting node to the ending node in the respective iTree. In propose a local density-based outlier detection approach, such a situation, the tree branches which bear outliers have concisely named RDOS, in which the local density of each a remarkably lower depth, because these outlying points object is approximated with the local Kernel Density Estima- fairly deviate from the normal conduct in data. Therefore, tion (KDE) through nearest-neighbors of it. In this approach, data instances which possess substantially shorter paths not only the k-Nearest-Neighbors (kNN) of an object are from the root in various iTrees are more potential to be taken into account but, in addition, the Reverse-Nearest- outliers. Neighbors (RNN) and the Shared-Nearest-Neighbors (SNN) One chief issue about using such isolation-based method are considered for density distribution estimation as well. is that with the increase in the number of dimensions, Furthermore, Wahid and Rao have recently proposed if an incorrect attribute choice for the splitting matter is three different density-based methods, shortly named RK- made at the higher levels of the iTree, then the probability DOS [26], ODRA [27] and NaNOD [28]. RKDOS is mostly of detection outcomes getting misled, grows potentially. similar to RDOS, in which by using both kNN and RNN of Nevertheless, the advantage of iForest is that by considering
4 the idea of isolation, it can utilize the sub-sampling in a mization algorithm to find the optimal values for these pa- more sophisticated way than the existing methods, and rameters. Here, we prefer to use the evolutionary algorithm provide an algorithm with a low linear time complexity PSO [45]. and a low memory requirement [32]. Moreover, in [33], Remark 3. Another important difference between the [34], an enhanced version of iForest, named iNNE is pro- proposed method and BFR concerns the volume of struc- posed which tries to overcome some drawbacks of the main tural information, which they need to store for clustering method, including the incapability of efficiently detecting procedures. As the proposed method, differently from BFR, local anomalies, anomalies with a low amount of pertinent can handle Gaussian clusters with correlated attributes too, attributes, global anomalies that are disguised through axis- thus the covariance matrix will not always be diagonal parallel projections, and finally anomalies in a dataset con- and could have many non-zero elements. Therefore, the taining various modals. proposed method will consume more space than BFR for Remark 1. According to the fact that during scalable building the clustering structures. clustering, we use the Mahalanobis distance measure to Since the Mahalanobis distance criterion is crucially assign each object to a mini-cluster; and besides, the size based on the covariance matrix, hence, this matrix will be of temporary clusters is much smaller than that of original literally the most prominent property of each sub-cluster. clusters, it would be worth mentioning an important matter But according to the high computational expense of comput- here. With respect to [12], [35], [36], [37], in the case of ing Mahalanobis distance in high-dimension spaces, thus, high-dimensional data, classical approaches based on the as in [35], [46], we will use the properties of Principal Mahalanobis distance are usually not applicable. Because, Components (PCs) in the transformed space. Therefore, the when the cardinality of a cluster is less than or equal to its covariance matrix of each mini-cluster will become diagonal dimensionality, the sample covariance matrix will become and by transforming each object to the new space of the singular and not invertible [38]; hence, the corresponding mini-cluster, like BFR, we can calculate the Mahalanobis Mahalanobis distance will no longer be reliable. distance without the need to use matrix inversion. Therefore, to overcome such problem, in a preprocess- According to [47], when the covariance matrix is di- ing step, we need to resort to dimensionality reduction agonal, the corresponding Mahalanobis distance becomes approaches. However, due to the serious dependence of the same normalized Euclidean distance. Moreover, we can some dimensionality reduction methods like PCA [39] to establish a threshold value for defining the Mahalanobis original attributes, and the consequent high computational radius. If the value of this threshold is, e.g. 4, it means load because of the huge volume of the input data, we need that all points on this radius are as far as four standard to look for alternative methods to determine a basis for data deviations from the mean; if we just denote the number of projection. dimensions by p, the size of this Mahalanobis radius is equal √ A simple and computationally inexpensive alternative is to 4 . the exercise of random basis projections [40], [41], [42]. 
The main characteristic of these types of methods is that they will approximately preserve pairwise euclidean distances 3 P ROPOSED APPROACH between data points, and, in addition, the dimension of the The proposed method consists of three major phases. In transformed space is independent of the original dimension, the first phase, a preliminary random sampling is con- and only relies logarithmically on the number of data points. ducted in order to obtain the main premises on which Finally, after such a preprocessing step, we can be optimistic the algorithm works, i.e. some information on the original that the singularity problem will not be present during clusters and some parameters useful for the incremental clustering procedures; or in the case of it would be present, clustering. In the second phase, a scalable density-based we would have a suitable mechanism to handle it. clustering algorithm is carried out in order to recognize Remark 2. As stated earlier, our proposed approach dense areas, on the basis of currently loaded chunk of data is inspired by BFR. However, BFR by default, uses the K- points in memory. Clusters built incrementally in this phase, means algorithm [43] in almost all of its clustering proce- are called mini-clusters or sub-clusters, and they form the dures. In addition to this drawback of the K-means algo- temporary clustering model. After loading each chunk of rithm, which is being seriously depending on foreknowing data, according to the points already loaded in memory and the true number of original clusters, in the case of the pres- those undecided from previous chunks, and by employing ence of outliers, K-means performs poorly, and therefore, we the Mahalanobis distance measure and in respect to the need to resort to a clustering approach which is resistant to density-based clustering criteria, we update the temporary anomalies in data. Here in this paper, we prefer to employ clustering model, which consists of making some changes the density-based clustering method, DBSCAN [44]1 . Other to existing mini-clusters or adding new sub-clusters. noise-tolerant clustering algorithms could be utilized as a Note that, in the whole scalable clustering procedure, substitute option too, but definitely with their own specific our endeavor is aimed to not let outliers participate actively configurations. in forming and updating any mini-cluster, and thus, after However, DBSCAN is strongly dependent on the choice processing the entire chunks, there will be some objects in of its parameters. Thus, we are forced to utilize some opti- buffer remained undecided. Some of these data are true outliers, while some others are inliers, which, due to con- 1. Although we will demonstrate that sometimes throughout scalable straints, have failed to play any effective role in forming a clustering, even DBSCAN may produce some mini-clusters that cannot abide outliers, because of some data assignment restraints; hence, we sub-cluster. Finally, all these undecided points are cleared will be forced to use the same K-means algorithm to fix the issue. from the buffer, while only the structural information of
5 temporary clusters is maintained in memory. Then, at the TABLE 1 last part of scalable clustering, depending on the mini- Major notations clusters associated to each initial mini-cluster out of the Notation Description "Sampling" stage, we combine them to obtain the final [X ] × Input dataset X with objects and dimensions clusters, which their structure will be approximately the ∈X An instance in X same as of the original clusters. S ⊂X A random sample of X X⊂X A set of some points in buffer At last, in the third phase of the proposed approach, Y ⊂X A chunk of data w.r.t. the final clustering model gained out of the second Ω A partition of some data points in memory into mini- clusters as {X1 , · · · , Xk } phase, once again, we process the entire dataset in chunks, ℭ Number of objects in Y to give each object an outlying score, according to the {Γ}8×L Information array of temporary clusters with 8 proper- ties for each mini-cluster same Mahalanobis distance criterion. Fig. 1 illustrates the {Γ }8×1 ∈ Γ Information standing for the th mini-cluster in Γ, 1 ≤ software architecture of the approach. Moreover, in Table 1, ≤L Mini-cluster points associated with Γ , which are re- the main notations used in the paper are summarized. moved from buffer < Retained set of objects in RAM buffer
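Since the Mahalanobis criterion of Remark 3 is the workhorse of all remaining phases, the following sketch shows how it reduces to a normalized Euclidean distance once a point is projected onto a mini-cluster's superior principal components. Function and variable names are illustrative, not the paper's, and the radius value 4 is only the example used in Remark 3.

import numpy as np

def mahalanobis_in_pc_space(x, mean, pc_coeffs, pc_sqrt_vars):
    """Mahalanobis distance of x to a mini-cluster, computed in the space of
    its superior PCs: project, then take a normalized Euclidean distance."""
    z = (x - mean) @ pc_coeffs            # coordinates in the superior-PC space
    return np.sqrt(np.sum((z / pc_sqrt_vars) ** 2))

def is_member(x, mean, pc_coeffs, pc_sqrt_vars, radius=4.0):
    """A point on the radius lies `radius` standard deviations from the mean
    along every superior component; beyond it, the point is not absorbed."""
    return mahalanobis_in_pc_space(x, mean, pc_coeffs, pc_sqrt_vars) <= radius

Because the covariance matrix is diagonal in the transformed space, no matrix inversion is needed, which is exactly the computational advantage argued for in Remark 3.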
6 Algorithm 1: Framework of SDCOR these mini-clusters, we will obtain the approximate struc- Input : [X ] × - The by input dataset X; - Random ture of the original clusters. sampling rate; Λ - PC total variance ratio; - As we adopt PSO to attain optimal values of parame- Membership threshold; - Pruning threshold; C - ters S and S for DBSCAN to cluster the sampled data, Sampling Epsilon coefficient; C - Sampling MinPts coefficient it would be essential to stipulate this truth that as the Output: The outlying score for each object in X density of the sampled distribution is much less than that 1 Phase 1 — Sampling: of the original data, we cannot use the same parameters 2 Step 1. Take a random sample S of X according to the for the original distribution, while applying DBSCAN on sampling rate objects loaded in memory, during scalable clustering. Thus, 3 Step 2. Employ the PSO algorithm to find the optimal values we have to use a coefficient in the interval [0,1] for each for DBSCAN parameters S and S , required for clustering S parameter. It is also necessary to mention that is much 4 Step 3. Run DBSCAN on S using the obtained optimal more sensitive than , as with a slight change in the value parameters, and reserve the count of discovered mini-clusters of , we may observe a serious deviation in the clustering as T, as the true number of original clusters in data result, but this does not apply to . We show the mentioned 5 Step 4. Build the very first array of mini-clusters information (temporary clustering model) out of the result of step 3, w.r.t. coefficients with C and C , for S and S respectively, and Algorithm 2 their values could be obtained through the user; but the best 6 Step 5. Reserve the covariance determinant values of the initial values gained out of our experiments are 0.5 and 0.9, for C sub-clusters as a vector ® = [ 1 , · · · , T ], for the maximum and C , respectively. covariance determinant condition 7 Step 6. Clear S from the RAM and maintain the initial Now, before building the first clustering model, we need temporary clustering model in buffer to extract some information out of the sampled clusters 8 Phase 2 — Scalable Clustering: obtained through DBSCAN, and store them in a special 9 Prepare the input data to be processed chunk by chunk, so array. As stated earlier about the benefit of using properties that each chunk could be fit and processed in RAM buffer at of principal components for high-dimensional data, we need the same time to find those PCs that give higher contributions to the cluster 10 Step 1. Load the next available chunk of data into the RAM 11 Step 2. Update the temporary model of clustering over the representation. To this aim, we sort PCs on the basis of their contents of buffer, w.r.t. Algorithm 3 corresponding variances in descending order, and then we 12 Step 3. If there is any available unprocessed chunk, go to step 1 choose the topmost PCs having a share of total variance at 13 Step 4. Build the final clustering model, w.r.t. Algorithm 7, least equal to Λ percent. We call these PCs superior com- using the temporary clustering model obtained out of the previous steps ponents and denote their number as 0 . 
Let be an object 14 Phase 3 — Scoring: among total objects in dataset [X] × , belonging to the temporary cluster [X ] × ; then the information about this 15 According to the final clustering model, for each data point ∈ X, employ the Mahalanobis distance criterion to find the sub-cluster as {Γ }8×1 , in the array of temporary clustering closest final cluster, and lastly, assign to that cluster and model {Γ}8×L , is as follows: use the criterion value as the object outlierness score ∑︁ 1) Mean vector in the original space, X = 1 ∑︁ ∈X 2) Scatter matrix in the original space, SX = ( − ∈X X ) · ( − X ) of samples from each original cluster, and thus the clustering 3) 0 superior components, 1 , · · · , 0 , derived from structure obtained after applying the sampling may not be Í the covariance matrix X = −1 1 SX , which form the representative of the original one. Hence, outliers could be columns of the transformation matrix AX misclassified over scalable clustering. In such a situation, 4) Mean vector in the transformed space, X0 = X AX obtaining a satisfactory random sample requires processing 5) Square root 0 the entire dataset. In the following, the percentage of sam- √ √︁ of the top PC variances, 1 , · · · , 0 pled data, is indicated as . 6) Size of the mini-cluster, After obtaining the sampled data, we conduct the DB- 7) Value of 0 SCAN algorithm to cluster them, and assume that the 8) Index to the nearest initial mini-cluster, ∈ number of mini-clusters obtained through this, is the same {1, · · · , T } as the number of original clusters T in the main dataset. Algorithm 2 demonstrates the process of obtaining and We reserve T for later use. Besides, we presume that the adding this information per each mini-cluster attained location (centroid) and the shape (covariance structure) through either of the "Sampling" and "Scalable Clustering" of such sub-clusters are so close to the original ones. In stages, to the temporary clustering model. As for the eighth Section 4, we will show that even by using a small rate property of the information array, for every sampled cluster, of random sampling, the mentioned properties of sampled it is set as the corresponding original cluster number; and clusters are similar enough to those of the original ones. for any of the sub-clusters achieved out of incremental The idea behind making these primary sub-clusters, which clustering, it is set as the corresponding index to the nearest are so similar to the original clusters in the input dataset in initial sub-cluster2 . terms of basic characteristics, is that, we intend to determine a Mahalanobis radius, which collapses a specific percentage 2. It is clear that all those sampled points, which the initial clustering of objects belonging to each original cluster; and let other model is built upon them, will be met again throughout scalable clustering. But as they are totally randomly sampled, hence, it could sub-clusters be created around this folding area during suc- be asserted that they will not impair the structure of final clustering cessive memory-loads of points; and ultimately, by merging model.
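A non-authoritative sketch of how the eight properties listed above could be computed for one mini-cluster is given below (NumPy; the dictionary layout and variable names are our own, and the index of the nearest initial mini-cluster is passed in rather than computed).

import numpy as np

def make_minicluster_info(Xi, var_share=0.9, nearest_init_idx=0):
    """Build the 8-field information record for one mini-cluster Xi (ni x p)."""
    ni, p = Xi.shape
    mean = Xi.mean(axis=0)                              # 1) mean vector
    centered = Xi - mean
    scatter = centered.T @ centered                     # 2) scatter matrix
    cov = scatter / (ni - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
    ratio = np.cumsum(eigvals) / eigvals.sum()
    p_sup = int(np.searchsorted(ratio, var_share) + 1)  # number of superior PCs
    A = eigvecs[:, :p_sup]                              # 3) superior PC coefficients
    return {"mean": mean,
            "scatter": scatter,
            "pcs": A,
            "mean_t": mean @ A,                         # 4) mean in transformed space
            "sqrt_vars": np.sqrt(eigvals[:p_sup]),      # 5) sqrt of top PC variances
            "size": ni,                                 # 6) mini-cluster size
            "p_sup": p_sup,                             # 7) number of superior PCs
            "nearest_init": nearest_init_idx}           # 8) nearest initial mini-cluster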
7 Algorithm 2: [Γ] = MiniClustMake(Γ, Ω, Λ, ) on the Mahalanobis radius, within reason, is hindering Input : Γ - Current array of mini-clusters information; outliers to be accepted as a member. In other words, when Ω = {X1 , · · · , Xk } - A partition of some data points in the Mahalanobis distance of an outlier is more than the memory into mini-clusters; Λ - PC share of total predefined radius threshold, it cannot be assigned to the variance; - Index to the nearest initial mini-cluster Output: Γ - Updated temporary clustering model sub-cluster, if and only if that threshold is set to a fair value. 1 ←L Hence, we do not check on the covariance determinant of 2 foreach mini-cluster X , 1 ≤ ≤ k do sub-clusters, while they are growing. 3 Apply PCA on X and obtain its PC coefficients and Here, the first phase of the proposed approach is fin- variances. Then choose 0 as the number of those top PC variances, for which their share of total variance is at ished, and we need to clear RAM buffer of any sampled least Λ percent data, and only maintain the very initial information ob- 4 Γ {1, + } ← Mean vector of X tained about the existing original clusters. 5 Γ {2, + } ← Scatter matrix of X 6 Γ {3, + } ← Top 0 PC coefficients corresponding to the top 0 PC variances 3.2 Scalable Clustering 7 Γ {4, + } ← Transformed mean vector, as Γ {1, + } · Γ {3, + } During this phase, we have to process the entire dataset 8 Γ {5, + } ← Square root of the top 0 PC variances chunk-by-chunk, as for each chunk, there is enough space 9 Γ {6, + } ← Number of objects in X 10 Γ {7, + } ← Value of 0 in memory for both loading and processing it all at the same 11 if ≡ 0 then /* "Sampling" stage */ time. After loading each chunk, we update the temporary 12 Γ {8, + } ← clustering model according to the data points, which are 13 else /* "Scalable Clustering" stage */ 14 Γ {8, + } ← currently loaded in memory from the current chunk, or 15 end retained from other previously loaded chunks. Finally, after 16 end processing the entire chunks, the final clustering model is built out of the temporary clustering model. A detailed description of this phase is provided as follows. According to [48], [49], when a multivariate Gaussian distribution is contaminated with some outliers, then the 3.2.1 Updating the temporary clustering model w.r.t. con- corresponding covariance determinant is no longer robust tents of buffer and is significantly more than that of the main cluster. After loading each chunk into memory, the temporary clus- Following this contamination, the corresponding (classical) tering model is updated upon the objects belonging to the Mahalanobis contour line3 will also become wider than currently loaded chunk and other ones sustained from the that of the pure cluster (robust one), as it contains also formerly loaded data. First of all, the algorithm checks for abnormal data. So, it makes sense that there is a direct the possible membership of each point of the present chunk relationship between the value of covariance determinant of to any existing mini-cluster in the temporary clustering an arbitrary cluster and the wideness of its tolerance ellipse, model4 . which could also be referred to as the spatial volume of the Then, after the probable assignments of current chunk cluster. Moreover, by being contaminated, this volume could points, there are some primary and secondary information increase and become harmful. of modified sub-clusters that shall be updated. 
After this Since during scalable clustering, new objects are coming update, the structure of altered sub-clusters will change, over time and mini-clusters are growing gradually, so, it is and thus they might still be capable of absorbing more possible for a mini-cluster with an irregular shape to accept inliers. Therefore, the algorithm checks again for the likely some outliers. Then, the following covariance matrix will memberships of sustained points in memory to updated no longer be robust, and the corresponding Mahalanobis sub-clusters. This updating and assignment checking cycle contour line will keep getting wider too, which could cause will keep going until there is not any retained point that the absorption of many other outliers. Therefore, to impede could be assigned to an updated mini-cluster. the creation process of these voluminous non-convex sub- When the membership evaluation of the present chunk clusters, which could be contaminated with myriad outliers, and retained objects is carried out, the algorithm tries to we have to put a limit on the covariance determinant cluster the remaining sustained objects in memory, regard- of every sub-cluster, which is discovered through scalable ing the density-based clustering criteria which have been clustering. Here, we follow a heuristic approach and employ constituted at the "Sampling" phase. After the new mini- the covariance determinant of the nearest initial sub-cluster clusters were created out of the last retained points, there is obtained out of the "Sampling" stage, as the limit. this probability that some sustained inliers in buffer might This problem that outliers can be included in clusters not be capable of participating actively in forming new sub- and could no longer be detected, is called "masking effect". clusters, because of the density-based clustering standards, Note that we are using the mentioned constraints only when though could be assigned to them, considering the firstly sub-clusters are created for the first time, not while they are settled membership measures. Hence, the algorithm goes growing over time. The justification is that, while an object another time in the cycle of assignment and updating pro- is about to be assigned to a mini-cluster, the other constraint cedure, like what was done in the earlier steps. 3. Since the terms "Mahalanobis contour line" and "tolerance ellipse" 4. After this step, the unassigned objects of the lastly and previously are the same in essence, thus, we will use them indifferently in this loaded chunks, will be all together considered as retained or sustained paper. objects in buffer.
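The membership-and-update cycle just described can be pictured roughly as follows. This is a simplified sketch rather than the paper's Algorithm 3: the DBSCAN pass over the leftovers and the covariance-determinant check at mini-cluster creation are omitted, and the rank-1 update of the scatter matrix is a standard Welford-style formula assumed here for illustration.

import numpy as np

def absorb(mc, x):
    """Add point x to mini-cluster mc, updating its sufficient statistics
    (size, mean, scatter) with a rank-1 Welford-style update."""
    old_mean = mc["mean"]
    mc["size"] += 1
    mc["mean"] = old_mean + (x - old_mean) / mc["size"]
    mc["scatter"] += np.outer(x - old_mean, x - mc["mean"])
    # the superior PCs and sqrt_vars are refreshed from the updated scatter
    # matrix once the cycle over the current chunk has finished

def update_cycle(buffer, miniclusters, radius=4.0):
    """Repeat membership checks until no retained point can join an updated
    mini-cluster; returns the points that are still undecided."""
    changed = True
    while changed:
        changed, leftovers = False, []
        for x in buffer:
            for mc in miniclusters:
                z = (x - mc["mean"]) @ mc["pcs"]
                if np.sqrt(np.sum((z / mc["sqrt_vars"]) ** 2)) <= radius:
                    absorb(mc, x)
                    changed = True
                    break
            else:
                leftovers.append(x)
        buffer = leftovers
    return np.asarray(buffer)

The points returned here are the retained set: they either seed new mini-clusters through the density-based step or stay in the buffer until the end of scalable clustering.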
Algorithm 3 demonstrates the steps necessary for updating the temporary clustering model, w.r.t. an already loaded chunk of data and the other undecided objects retained in memory from before; the following subsections describe these steps. … does, some information connected to the corresponding sub-cluster shall be updated.
Algorithm 4: [Γ, …]
Algorithm 5: [Γ, …]
10 4 respectively. It is clear that even incoherent subdivided sub- clusters, with a covariance determinant less than or equal 2 to the predefined threshold though, could be hazardous as non-convex sub-clusters with a high value of covariance x2 0 determinant, as their tolerance ellipses could get out of the -2 scope of the following main cluster, and suck outliers in10 . -4 An alternative for this is to use hierarchical clustering -10 -5 0 5 10 algorithms, which typically present a higher computational x1 (a) The Whole Cluster with a Distinct Non-convex Subcluster and Some Outliers Around It load than K-means. For this matter, we decide to adopt a 4 4 K-means variant, which, for every subdivided sub-cluster 2 2 obtained through K-means, we apply DBSCAN again to verify its cohesion. Finally, we increase the value of K for K-means, from 2 till a value for which11 , three conditions x2 x2 0 0 -2 -2 for every subdivided sub-cluster are met: to be coherent, not to be singular, and not having a covariance determinant -4 -4 less than or equal to . -10 -5 0 -10 -5 0 x1 x1 Fig. 3c illustrates a scenario in which, three smaller mini- (b) Dividing the Non-convex Subcluster Using Kmeans with K=2 (c) Dividing the Non-convex Subcluster Using Kmeans with K=3 clusters, are represented as the K-means output in different colors; centroids and tolerance ellipses are shown as in Fig. 3b. The covariance determinant values are equal to 0.12, Fig. 3. Breaking a non-convex mini-cluster with a very wide tolerance 1.02 and 0.10 for the green, red and blue subdivided sub- ellipse into smaller pieces by the K-means algorithm6 clusters respectively. As it is obvious, all subdivided sub- clusters are coherent and not singular, with much smaller determinants than the threshold, and much tighter spatial illustrated with magenta pentagons7 . volumes as such. Ultimately, after attaining acceptable sub- As it is evident, the irregular mini-cluster can absorb clusters, it is time to update the temporary clustering model some local outliers, as its tolerance ellipse is covering a w.r.t. them, with regard to Algorithm 2. Algorithm 6 shows remarkable space out of the containment area by the original all the steps required to cluster data points retained in cluster. Moreover, the covariance determinant value for the memory buffer. irregular mini-cluster is equal to 15.15, which is almost 3.2.1.5 Trying to assign retained objects in buffer twice that of the initial mini-cluster equal to 7.88. Thus, for to newly created mini-clusters: After checking on retained fixing this concern, by considering the proportion between objects in buffer in the case of being capable of forming the covariance determinant of an arbitrary cluster and its a new mini-cluster, it would be necessary to examine the spatial volume, we decide to divide the irregular mini- remaining retained objects once more, w.r.t. Algorithm 5; cluster to smaller coherent pieces, with smaller covariance i.e., it ought to be inspected whether or not they could be determinants, and also, more limited Mahalanobis radii as assigned to the newly created mini-clusters, in the same well. Therefore, we heuristically set the threshold value for cycle of membership checking and mini-cluster updating, the covariance determinant of any newly discovered sub- like what was done in subsection 3.2.1.3. The reason for cluster or subdivided sub-cluster, as that of the nearest this concern is that, due to limitations connected to the initial mini-cluster8 . 
utilized density-based clustering algorithm, such objects For the division process of an irregular sub-cluster, we may not have been capable of being an active member of prefer to adopt the K-means algorithm. However, K-means any of the newly created sub-clusters, even though they can cause some incoherent subdivided sub-clusters in such lie in the associated accepted Mahalanobis radius. Hence, cases9 , as shown in Fig. 3b. In Fig. 3b, two smaller mini- it becomes essential to check again the assignment of these clusters, produced as the K-means result, are represented latter retained objects. in different colors, with covariance determinants of 1.63 Fig. 4 demonstrates an intuitive example of such sit- and 7.71 for the coherent blue and incoherent red mini- uation, in which, some objects, according to the density clusters respectively. The associated centroids and tolerance restrictions cannot be assigned to a cluster, even though ellipses are denoted as black crosses and black solid lines, they lie in the accepted Mahalanobis radius of that cluster. Objects that have had the competence to form a cluster are 7. Here, for challenging the performance of our method, we are taking outliers so close to the original cluster. But in reality, it is not shown with black solid circles; and those which are not a usually like that and outliers often have a significant distance from part of the cluster, but reside in its accepted Mahalanobis every normal cluster in data. radius, which is represented by a blue solid line, are denoted 8. This threshold is denoted as , 1 ≤ ≤ T, for any sub-cluster as red triangles. discovered near the th initial mini-cluster, through scalable clustering. The proximity measure for this nearness is as the Mahalanobis distance of the new discovered sub-cluster centroid, from the initial mini- 10. However, the divided mini-cluster could be significantly smaller clusters. We assume any sub-cluster with a covariance determinant in size and determinant, though because of the lack of coherency, some greater than such threshold, as a candidate for a non-convex sub-cluster, PC variances could be very larger than others; therefore, the regarding whose spatial volume could cover some significant space out of the tolerance ellipse will be more stretched in those PCs, and thus harmful. scope of the related original cluster. 11. Here, the upper bound for K is b |
Algorithm 6: [Γ, …] (clustering of the data points retained in the memory buffer; its inputs include the vector of initial covariance determinants, Eps_S, MinPts_S, C_Eps and C_MinPts)
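The splitting heuristic for a suspect non-convex sub-cluster (Fig. 3) can be sketched as below, with scikit-learn's KMeans and DBSCAN standing in for the authors' MATLAB routines. The upper bound on K and the coherence test are our reading of the truncated footnote 11 and are illustrative only.

import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def split_irregular_subcluster(X, det_threshold, eps, min_pts, k_max=None):
    """Split a non-convex sub-cluster X into smaller coherent pieces: increase K
    until every piece is coherent, non-singular, and has det(cov) <= threshold."""
    n, p = X.shape
    k_max = k_max or max(2, n // (p + 1))        # keep pieces large enough
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        pieces = [X[labels == j] for j in range(k)]
        ok = True
        for P in pieces:
            if len(P) <= p:                      # would give a singular covariance
                ok = False
                break
            if np.linalg.det(np.cov(P, rowvar=False)) > det_threshold:
                ok = False
                break
            # coherence check: DBSCAN on the piece should yield one cluster, no noise
            db = DBSCAN(eps=eps, min_samples=min_pts).fit(P)
            if len(set(db.labels_)) != 1 or -1 in db.labels_:
                ok = False
                break
        if ok:
            return pieces
    return pieces                                # fall back to the last attempted split

Each returned piece would then be registered in the temporary model through the same routine used for ordinary mini-clusters (Algorithm 2).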
12 , and a covariance matrix Σ . Hence, w.r.t. Algorithm 7, distance [13] is assigned to the object as its outlying score. if a final cluster contains only one temporary cluster, then The higher the distance, the more likely it is that the object the final centroid and the final covariance matrix will be the is an outlier. Here, we name such score obtained out of same as for the temporary cluster. Otherwise, in the case of our proposed approach, SDCOR, which stands for "Scalable containing more than one temporary cluster, we utilize the Density-based Clustering Outlierness Ratio"12 . sizes and centroids of the associated mini-clusters, to obtain the final mean. For acquiring the final covariance matrix of a final cluster 3.4 Algorithm complexity comprising of multiple temporary clusters, for each of the Here, at first, we analyze the time complexity of the pro- associated mini-clusters and w.r.t. its centroid and covari- posed approach. For the first two phases, "Sampling" and ance structure, we afford to regenerate a specific amount of "Scalable Clustering", the most expensive operations are the fresh data points with Gaussian distribution. We define the application of DBSCAN to the objects residing in memory regeneration size of each mini-cluster equal to the product after loading each chunk, and the application of PCA to each of the sampling rate—which was used at the "Sampling" mini-cluster to obtain and update its secondary information. stage—and the cardinality of that mini-cluster, as · ; this Let ℭ be the number of data points contained in a is necessary for saving some free space in memory, while chunk. Considering the three-sigma rule of thumb, in every regenerating the approximate structure of an original clus- memory-load of points, the majority of these points lies ter. We consider all regenerated objects of all sub-clusters in the accepted Mahalanobis neighborhood of the current belonging to a final cluster, as a unique and coherent cluster temporary clusters and is being assigned to them—and and afford to obtain the final covariance structure out of it. this will escalate over time by the increasing number of But before using the covariance matrix of such regener- sub-clusters, which are being created during each memory ated cluster, we need to mitigate the effect of some generated process—and also, by utilizing an indexing structure for outliers, which could be created unavoidably during the the k-NN queries, the time complexity of the DBSCAN regeneration process, and can potentially prejudice the final algorithm will be ℭ log ( ℭ ) ; where, is a low constant accuracy outcomes. For this purpose, we need to prune close to zero. Note that exercising an indexing structure to this transient final cluster, according to the Mahalanobis decrease the computational complexity of answering the k- threshold , obtained through the user. Thus, regenerated NN queries is only applicable to lower dimensions; for high- √ objects having a Mahalanobis distance more than · , dimensional data, we need some sort of sequential scans to from the regenerated cluster, will be obviated. Now, we can handle the k-NN queries, which leads to an average time compute the ultimate covariance matrix out of such pruned complexity of 2ℭ for DBSCAN. regenerated cluster, and then remove this transient cluster. This procedure is conducted in sequence for every final Applying PCA on mini-clusters is 3 , 3ℭ [49], cluster which consists of more than one temporary cluster. 
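A sketch of the merge-by-regeneration step described above follows. The regeneration size as the product of the sampling rate and the mini-cluster cardinality, the Gaussian resampling from each mini-cluster's mean and covariance, and the pruning at a radius proportional to the square root of the dimensionality are our reading of the text; the constant alpha stands in for the user-supplied membership threshold and is an assumption.

import numpy as np

def merge_final_cluster(miniclusters, sampling_rate, alpha=2.0, rng=None):
    """Approximate the mean and covariance of a final cluster composed of
    several mini-clusters by regenerating Gaussian points from each of them."""
    rng = np.random.default_rng(rng)
    sizes = np.array([mc["size"] for mc in miniclusters], dtype=float)
    final_mean = sum(mc["size"] * mc["mean"] for mc in miniclusters) / sizes.sum()
    regen = []
    for mc in miniclusters:
        m = max(2, int(round(sampling_rate * mc["size"])))   # regeneration size
        cov = mc["scatter"] / (mc["size"] - 1)
        regen.append(rng.multivariate_normal(mc["mean"], cov, size=m))
    R = np.vstack(regen)
    # prune regenerated points lying too far out, in Mahalanobis terms
    diff = R - R.mean(axis=0)
    inv = np.linalg.pinv(np.cov(R, rowvar=False))
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv, diff))
    kept = R[d <= alpha * np.sqrt(R.shape[1])]
    final_cov = np.cov(kept, rowvar=False)
    return final_mean, final_cov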
as p stands for the dimensionality of the input dataset. Fig. 5a demonstrates a dataset consisting of two dense But according to our strong assumption that < ℭ , thus, Gaussian clusters with some local outliers around them. applying PCA will be 3 . Hence, the first two phases of This figure is in fact, a sketch of what the proposed method 2 the algorithm will totally take ℭ , . 3 looks like at the final steps of scalable clustering and before The last phase of the algorithm, w.r.t. this concern that building the final clustering model. The green dots are only consists of calculating the Mahalanobis distance of normal objects belonged to a mini-cluster. The red dots are the total n objects in the input dataset to T final clusters, temporary outliers, and the magenta square points represent is ( T ); and regarding that T , hence, the time the temporary centroids. Fig. 5b demonstrates both tempo- complexity of this phase will be ( ). The overall time rary means and final means, represented by solid circles and complexity is thus at most 2ℭ , 3 + . However, triangles respectively, with a different color for each final cluster. Fig. 5c colorfully demonstrates pruned regenerated in addition to being a truly tiny constant, it is evident that data points for every final cluster besides the final means, both p and ℭ values are negligible w.r.t. n; therefore, we can denoted as dots and triangles respectively. state that the time complexity of our algorithm is linear. Now, after obtaining the final clustering model, the Analysis of the algorithm space complexity is twofold. second phase of the proposed approach is finished. In the First, we consider the stored information in memory for the following subsection, the third phase named "Scoring" is clustering models; and second, the amount of space required presented, and we will describe how to give each object, a for processing the resident data in RAM. With respect to the score of outlierness, w.r.t. the final clustering model obtained fact that the most voluminous parts of the temporary and of out of scalable clustering. the final clustering models are the scatter and covariance matrices respectively, and that L T , thus, the space complexity of the first part will be L 2 . For the second 3.3 Scoring part, according to this matter that in each memory-load of At this phase, w.r.t. the final clustering model which was points, the most expensive operations belong to the cluster- obtained through scalable clustering, we give each data ing algorithms, DBSCAN and K-means, and regarding the point an outlying rank. Therefore, like phase two, once more, we need to process the whole dataset in chunks, 12. Due to the high computational loads associated with calculating Mahalanobis distance in high-dimensions, one can still gain the benefit and use the same Mahalanobis distance criterion to find the of using properties of principal components for computing outlying closest final cluster to each object. This local Mahalanobis scores, like what was done during scalable clustering.
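Phase three then amounts to one more chunked scan in which each object receives, as its SDCOR score, its Mahalanobis distance to the closest final cluster. A minimal sketch is given below; chunking and the PC-based shortcut of footnote 12 are omitted for brevity, and the function name is our own.

import numpy as np

def sdcor_scores(X, final_means, final_covs):
    """Outlying score of each object: Mahalanobis distance to its closest
    final cluster (the higher the score, the more outlying the object)."""
    scores = np.full(len(X), np.inf)
    for mean, cov in zip(final_means, final_covs):
        inv = np.linalg.pinv(cov)
        diff = X - mean
        d = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv, diff))
        scores = np.minimum(scores, d)    # keep the distance to the closest cluster
    return scores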
13 20 20 20 10 10 10 x2 x2 x2 0 0 0 -10 -10 -10 -20 -20 -20 -20 -15 -10 -5 -20 -15 -10 -5 -20 -15 -10 -5 x1 x1 x1 (a) After Processing the Last Chunk (b) Temporary Means and Final Means (c) Regenerated Objects and Final Means Fig. 5. Proposed method appearance at the last steps of scalable clustering linear space complexity of these methods, hence, the overall for outliers in each dataset, their share of total objects in the space complexity will be ℭ + L 2 . corresponding dataset is reported in percentage terms. In all of our experiments, the Area Under the Curve (AUC) (curve of detection rate and false alarm rate) [1], [2] is used to 4 E XPERIMENTS evaluate the detection performance of compared algorithms. In this section, we conduct effectiveness and efficiency tests The AUC and runtime results14 of different methods are to analyze the performance of the proposed method. All the summarized in Table 3 and Table 4, respectively. The bold- experiments were executed on a laptop having a 2.5 GHz faced AUC and runtime indicate the best method for a Intel Core i5 processor, 6 GB of memory and a Windows 7 particular dataset. operating system. The code was implemented using MAT- LAB 9, and for the sake of reproducibility, it is published on TABLE 2 GitHub13 . Properties of the datasets used in the efficacy experiments For the effectiveness test, we analyze the accuracy and stability of the proposed method on some real-life and syn- Dataset #n #p #o (%o) Shuttle 49,097 9 3,511 (7.15%) thetic datasets, comparing to two state-of-the-art density- Smtp (KDDCUP99) 95,156 3 30 (0.03%) Real based methods, namely LOF and LoOP; a fast K-means Datasets ForestCover 286,048 10 2,747 (0.96%) variant optimized for clustering large-scale data, named X- Http (KDDCUP99) 567,498 3 2,211 (0.38%) means [50]; and two state-of-the-art distance-based meth- Data1 500,000 30 5,000 (1.00%) Synth. Data2 1,000,000 40 10,000 (1.00%) ods, namely Sp and ORCA. Consequently, for the efficiency Datasets Data3 1,500,000 50 15,000 (1.00%) test, experiments are conducted on some synthetic datasets Data4 2,000,000 60 20,000 (1.00%) in order to assess how the final accuracy varies when the number of outliers is increased. In addition, more ex- For the competing density-based methods, the parame- periments are executed to appraise the scalability of the ters are set as suggested, i.e. = 10, = 50 proposed algorithm, and the effect of random sampling rate in LOF and = 30, = 3 in LoOP. For X-means, the on the final detection validity. minimum and maximum number of clusters are set to 1 and 15 respectively; maximum number of iterations and number of times to split a cluster are set to 50 and 6 respectively as 4.1 Accuracy and stability analysis well. Here, the evaluation results of the experiments conducted In ORCA, the parameter denotes the number of near- on various real and synthetic datasets, with a diversity of est neighbors, which by using higher values for that, the size and dimensionality, are presented in order to demon- execution time will also increase. Here, the suggested value strate the accuracy and stability of the proposed approach. of , equal to 5, is utilized in all of our experiments. The parameter specifies the maximum number of anomalies to 4.1.1 Real datasets Some public and large-scale real benchmark datasets, taken 14. 
For the proposed method, as we know the true structural char- acteristics of all the real and synthetic data, we compute the accurate from UCI [51], and preprocessed and labeled by ODDS [52], anomaly score, based on the Mahalanobis distance criterion, for each are used in our experiments. They are representative of object and report the following optimal AUC next to the attained result different domains in science and humanities. Table 2 shows by SDCOR. Moreover, we do not take into account the required time for the characteristics of all test datasets, namely the numbers finding the optimal parameters of DBSCAN, as a part of total runtime. Furthermore, for X-means, as it assumes the input data free of noise, of objects (#n), attributes (#p) and outliers (#o). In addition, thus after obtaining the final clustering outcome, the Euclidean distance of each object to the closest centroid is assigned to it as an outlier score, 13. https://github.com/sana33/SDCOR and hence, the following AUC could be calculated.
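For reference, the AUC used throughout Section 4 can be computed from any method's outlyingness scores and the ground-truth labels, e.g. with scikit-learn (illustrative only):

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_auc(y_true, scores):
    """AUC of the detection-rate / false-alarm-rate curve, given binary
    ground-truth labels (1 = outlier, 0 = inlier) and outlyingness scores."""
    return roc_auc_score(np.asarray(y_true), np.asarray(scores))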
14 be reported. If is set to a small value, then ORCA increases from the normal clusters, and it is clear from its average the running cut-off quickly and therefore, more searches will AUC for real datasets, which shows that outliers are totally be pruned off, which will result in a much faster runtime. misclassified. Hence, as the correct number of anomalies is not assumed to Furthermore, Table 4 reveals that SDCOR preforms be preknown in the algorithm, we set = 8 , as a reasonable much better than other competing methods in terms of value, where stands for the cardinality of the input data. execution time, except for Sp , which is slightly faster than For Sp , the sample size is set to the default value as the proposed method. Although Sp is the fastest among the suggested by the author, equal to 20, in our experiments. compared algorithms, as it was noted, its AUC suffers from Furthermore, for SDCOR, each dataset is divided into 10 large variations and is lower than SDCOR. Moreover, w.r.t. chunks, to simulate the capability of the algorithm for the two state-of-the-art density-based methods, it is evident scalable processing. Moreover, for the algorithms that have that there is a huge difference on consuming time between random elements, including Sp and the proposed method, SDCOR and these methods; it is due to the fact that in their detection results are reported in the format of ± , SDCOR, it is not required to compute the pairwise distances over 40 independent runs, as and stand for the mean of the total objects in a dataset, differently from LOF and value and the standard deviation of AUC values. LoOP. Additionally, as long as we are trying to not let outliers As stated earlier at the beginning of this paper, the participate actively in the process of forming and updating strong assumption of SDCOR is on the structure of existing mini-clusters during scalable clustering, two of the input clusters in the input dataset, which should have Gaussian parameters of the proposed algorithm are more critical, distribution. In practice though, w.r.t. [53], with appreciation listed in the following: the random sampling rate , which to the Central Limit Theorem, a fair amount of real world influences the parameter , ® the boundary on the volume of data are following the Gaussian distribution. Furthermore, mini-clusters at the time of creation; and the membership w.r.t. [54], [55], [56], even when original variables are not threshold , which is useful to restrain the volume of mini- Normal, employing properties of PCs for detecting outliers clusters, while they are incrementally growing over time. is possible and the corresponding results will be reliable. Note that the sampling rate should not be set too Since, given that PCs are linear functions of p random low, as by which, the singularity problem might happen variables, a call for the Central Limit Theorem may vindicate during the "Sampling" phase, or even some original clusters the approximate Normality for the PCs, even in the cases may not take the initial density-based form, for the lack of that the original variables are not Normal. Regarding this enough data points15 . The same problem concerns , as a concern, it is plausible to set up more official statistical tests too low value could bring to the problem that the number of for outliers based on PCs, supposing that the PCs have sub-clusters which are being created over scalable clustering Gaussian distributions. 
Moreover, the use of Mahalanobis will become too large, which leads to a much higher com- distance criterion for outlier detection is only viable for putational load. In addition, a too high value for brings convex-shaped clusters. Otherwise, outliers could be as- the risk of outliers getting joined to normal sub-clusters, signed to an irregular (density-based) cluster under masking and this will escalate over time, which increases the "False effect and thus, will be misclassified. Negative" rate. Here, in all experiments, is set to 2. 4.1.2 Synthetic datasets As for the real datasets, Table 3 shows that in terms of AUC, SDCOR is more accurate than all the other competing Experiments on synthetic datasets are conducted in an ideal methods, with the exception of Smtp dataset, in which LoOP setting, since these datasets are following the strong as- has achieved the best result, though not with a remarkable sumptions of our algorithm on the structure of existing clus- difference. Moreover, it is obvious that the attained results ters, which should be Gaussians. Moreover, the generated by the proposed approach are almost the same as the opti- outliers are typically more outstanding than those in the mal ones, and also, the average line indicates that SDCOR real data, and the outliers "truth" can be utilized to evaluate performs overall much better than all the other methods. whether an outlier algorithm is eligible to locate them. More importantly, SDCOR is effective on the largest dataset The experiments carried out on four artificial datasets are Http. In addition, by considering the standard deviations of reported at the bottom of Table 3, along with their execution AUC values for Sp and SDCOR, it is evident that SDCOR is times in Table 4. Each dataset consists of 6 Gaussian clusters, much more stable than Sp , as the average value of standard and outliers take up 1 percent of its volume. deviations for SDCOR is 0.007, which is really dispensable In more detail, for each dataset having p dimensions, we comparing to that of Sp , equal to 0.116. And this is due to build Gaussian clusters with arbitrary mean vectors, so that using one very small sample only by Sp , which causes the they are quite far away from each other to hinder possible algorithm go through large variations on the final accuracy. overlappings among the multidimensional clusters. As for For X-means, as it is not compatible with anomalies in the covariance matrix, first, we create a matrix [A] × , the input data, thus it severely fails on separating outliers whose elements are uniformly distributed in the interval [0,1]. Then, we randomly select half the elements in A and 15. Hence, one can state that there is a straight relationship between make them negative. Finally, the corresponding covariance Í the random sampling rate, and the true number of original clusters in matrix is obtained in the form of [ ] × = A A. Now, data. In other words, for datasets with a high frequency and variety w.r.t. to the mean vector (location) of each cluster and its of clusters, we are forced to take higher ratios of random sampling, to covariance matrix (shape of the cluster), we can generate an avoid both problems of singularity and misclustering, in the "Sampling" stage. Here, regarding our pre-knowledge about real and synthetic arbitrary number of data points from a p-variate Gaussian data, we have set to 0.5%. distribution. Moreover, to eliminate marginal noisy objects
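A sketch of the synthetic cluster generation just described is shown below. The cluster means, sizes and dimensionality are illustrative, the covariance is formed as A transposed times A as in the text, and the marginal-noise elimination step, whose description is cut off above, is not included.

import numpy as np

def make_gaussian_cluster(mean, p, size, rng):
    """Generate one synthetic Gaussian cluster with a random full covariance."""
    A = rng.uniform(0.0, 1.0, size=(p, p))            # elements uniform in [0,1]
    flip = rng.choice(A.size, size=A.size // 2, replace=False)
    A.ravel()[flip] *= -1.0                           # make half the entries negative
    sigma = A.T @ A                                   # covariance matrix
    return rng.multivariate_normal(mean, sigma, size=size)

# illustrative usage: six well-separated 30-dimensional clusters
rng = np.random.default_rng(0)
clusters = [make_gaussian_cluster(mean=rng.uniform(-100, 100, size=30),
                                  p=30, size=80_000, rng=rng) for _ in range(6)]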