Hierarchical Bitmap Indexing for Range and Membership Queries on Multidimensional Arrays

Luboš Krčál, Czech Technical University in Prague, Czech Republic, lubos.krcal@fit.cvut.cz
Shen-Shyang Ho, Rowan University, Glassboro, NJ, USA, hos@rowan.edu
Jan Holub, Czech Technical University in Prague, Czech Republic, jan.holub@fit.cvut.cz

arXiv:2108.13735v1 [cs.DB] 31 Aug 2021

ABSTRACT

Traditional indexing techniques commonly employed in database systems perform poorly on multidimensional array scientific data. Bitmap indices are widely used in commercial databases for processing complex queries, due to their effective use of bit-wise operations and their space efficiency. However, bitmap indices apply natively only to relational or linearized datasets, which is especially notable in binned or compressed indices.

We propose a new method for multidimensional array indexing that overcomes these dimensionality-induced inefficiencies. The hierarchical indexing method is based on n-dimensional sparse trees for dimension partitioning, with a bounded number of individual, adaptively binned indices for attribute partitioning. The index performs well on range queries involving both dimensions and attributes, as it prunes the search space early, avoids reading the entire index data, and performs at most a single index traversal. Moreover, the index is easily extensible to membership queries.

The indexing method was implemented on top of the state-of-the-art bitmap indexing library Fastbit. We show that the hierarchical bitmap index outperforms conventional bitmap indexing built on an auxiliary attribute for each dimension. Furthermore, the adaptive binning significantly reduces the number of bins and therefore the memory requirements.

Categories and Subject Descriptors: H.2.8 [Information Systems]: Database Management—Database Applications, Scientific Databases

Keywords: bitmap indexing, multidimensional arrays, range queries, scientific datasets, Fastbit

1. INTRODUCTION

Research in many areas, such as geoscience or model simulation, produces large scientific datasets that are stored in multidimensional arrays of arbitrary size, dimensionality, and cardinality, such as the QuikSCAT satellite data [16]. Efficient processing of such data is challenging because of its multidimensional nature, yet most analysis techniques apply to relational datasets or require a strict linearization of the data.

To query multidimensional array data, one needs an effective system to index and subsequently query the data. The majority of current systems rely on linearization of the array data, i.e., mapping the data into one dimension, which enables many one-dimensional access methods to be used. Others, such as array databases [27, 2], work natively with multidimensional arrays.

A popular and very effective method for indexing arbitrary data is bitmap indexing, where an index consists of a set of bitmaps (bitvectors) with associated metadata. Bitmap indices leverage hardware support for fast bit-wise operations (AND, OR, NOT, XOR) and are very space-efficient, especially for low-cardinality attributes; the cardinality limitation has been partially overcome by sophisticated multi-level and multi-component indices. Bitmap indices are used in the majority of commercial relational databases [9, 22, 23, 8].

The major disadvantage of bitmap indices for multidimensional array data is their linear nature. Even run-length compression, of which the most well-known variant is WAH, only partially suppresses this issue.
Our major contribution is a new method of bitmap indexing for multidimensional arrays that overcomes the dimensionality-induced inefficiencies. The method is based on n-dimensional sparse trees for dimension partitioning, and on attribute partitioning using adaptively binned indices. We demonstrate the performance on range queries involving both dimensions and attributes. We also show the effectiveness of our hierarchical indexing method: it prunes the search space early, avoids reading the entire index data, and performs at most a single index traversal.

The paper is organized as follows. In Section 2, we briefly describe previous work on bitmap indexing, scientific applications, and multidimensional arrays. In Section 3, we describe the preliminaries to our work, including bitmap indexing, the array data model, and array queries. In Section 4, we introduce our hierarchical bitmap array index, discuss its concepts, and explain its construction. In Section 5, we describe the query evaluation process for mixed attribute and dimension range queries. In Section 6, we demonstrate the effectiveness on multiple queries and compare our index to other solutions. In Section 7, we conclude with several notes on future research and development directions.
2. RELATED WORK

Traditional indexing methods like B-trees and hashing are not effective for indexing multiple attributes in a single index, and have been replaced for this purpose by multidimensional indexing methods such as R-trees [10], R*-trees [3], KD-trees, and n-dimensional trees (quadtrees, octrees, etc.) [19, 20]. These methods are not very effective for high-dimensional arrays. A broad treatment of multidimensional indexing algorithms is given in [21], though the majority of its focus is on traditional spatial data instead of multidimensional arrays.

The drawbacks of traditional indexing algorithms led to the introduction of bitmap indices [6] and their application to scientific data [25]. Bitmap indices are naturally based on linear data, which is ideal for relational databases. Space-filling curves, such as the Z-order and Hilbert curves [14, 13], were used for linearization and subsequent querying of multidimensional data: Hilbert curves were used in [13], while Z-order curves were used in [17], a system for querying spatial data (not arrays) using compressed hierarchical bitmap indices. Hierarchically organized bitmap indices were also used for star queries on data with hierarchically organized dimensions [7]. Bitmap indices have further been used for approximate aggregations [29], contrast set mining [36], subgroup discovery [30], and correlation analysis [28]; all of these use bitmap indices on auxiliary attributes made from dimensions (see Section 3.3). Other works utilize bitmap indexing for spatial applications, but do not model the data as multidimensional arrays [15, 24, 26].

The boom of multidimensional scientific array data gave birth to open-source multidimensional array-based data management and analytics systems, namely RasDaMan [2] and SciDB [27]. These databases work natively with multidimensional arrays, but lack some of the effective query processing methods implemented in other databases. On the other hand, SciDB has been established as a foundation for many multidimensional array processing tasks. Searchlight [12] is a SciDB-based system for range queries with aggregation constraints, using constraint programming on top of an array synopsis, a lossy representation of small array chunks.

3. PRELIMINARIES

We first introduce the multidimensional array data model, then describe the types of queries commonly used on arrays, with some examples. Next, we introduce bitmap indexing on linear data, binning types, encoding types, and compression schemes.
3.1 Array Data Model

An array A consists of cells, with dimensions indexed by d1, …, dn. Each cell is a tuple of several attributes a1, …, am. We assume the structure of the attributes is the same for all cells in the array. The array is denoted as A⟨a1, …, am⟩[d1, …, dn]. For example, satellite data may have latitude, longitude, altitude, and time as dimensions, and precipitation, temperature, wind speed, etc. as attributes.

We form queries on arrays based on constraints. A dimension or attribute constraint is a constraint on a dimension or attribute in one of the following formats: a one-sided range query, y ≤ 45; a two-sided range query, 23.4 ≤ y < 73.2; an equality query, y = 89; or a membership query, y ∈ {2, 4, 6, 8, 10}, where y is either a dimension or an attribute of the array. Figure 1 shows a query with a two-sided constraint on an attribute a and a one-sided constraint on dimension d2 over a 2-dimensional array, together with the (shaded) query outcome. Note that an equality query is a special case of a membership query, and that all queries can be rewritten as a set of range queries. Mixed queries are queries that pose constraints on at least one dimension and one attribute.

An example query on the array SatelliteArray[latitude, longitude, altitude, time] may look like this:

SELECT * FROM SatelliteArray WHERE 50.68 ≤ latitude ≤ 50.88 AND 14.37 ≤ longitude ≤ 14.57 AND 30.0 ≤ snowfall

The result would then be a possibly empty subarray of the same format as SatelliteArray.

Figure 1: An example of a range query on a two-dimensional array A⟨a⟩[d1, d2]; the query SELECT * FROM A WHERE 2 ≤ a ≤ 4 AND d2 ≤ 2 produces the shaded subarray A′.

3.2 Distributed Arrays

Due to the large size of scientific data, it is often necessary to split the data into subarrays called chunks. There are two commonly used strategies. The first is regularly gridded chunking, where all chunks have equal shape and do not overlap; this array data model is known in SciDB as MAC (Multidimensional Array Clustering) [27]. It works well for coarse dimension-based queries, but requires either additional indexes or filtering for fine dimension-based queries and for any attribute-based queries. This array data model is the foundation (the lowest level) of our hierarchical bitmap array index. The second strategy is irregularly gridded chunking, which is one of the chunking options in RasDaMan [2].

3.3 Bitmap Indexing

Bitmap indices, originally introduced in [6], were shown to be very effective for read-only or append-only data, and are used in many relational databases and for scientific data management [9, 22, 23, 8].

Bitmaps can be created either for a single attribute value, called low-level bitmaps, or for multiple values, called high-level bitmaps, where the bitmap is set to 1 for the cells of the array whose indexed value falls into the value range of that bitmap.

The structure of high-level bitmaps is determined by a binning strategy. For high-cardinality attributes, binning is the essential minimum to keep the size of the index reasonable [35, 34]. Binning effectively reduces the overall number of bitmaps required to index the data, but increases the number of cells that have to be verified later; this verification is called a candidate check. The two most common binning strategies are equi-width binning, which divides the attribute domain into equal intervals, and equi-depth binning, which divides the attribute domain into intervals covering an equal (or nearly equal) number of cells. Equi-width binning is highly prone to excessive candidate checks, especially on skewed data.
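To make the two binning strategies concrete, the following Python sketch (our illustration; the function names and the use of NumPy are assumptions, not part of the paper) computes equi-width and equi-depth bin boundaries for a sample of attribute values:

    import numpy as np

    def equi_width_bins(values, n_bins):
        # Split the attribute domain [min, max] into n_bins equal-length intervals.
        lo, hi = float(np.min(values)), float(np.max(values))
        return np.linspace(lo, hi, n_bins + 1)

    def equi_depth_bins(values, n_bins):
        # Choose boundaries so each bin covers a (nearly) equal number of cells.
        return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        skewed = rng.exponential(scale=2.0, size=10_000)  # a skewed attribute
        print("equi-width:", np.round(equi_width_bins(skewed, 8), 2))
        print("equi-depth:", np.round(equi_depth_bins(skewed, 8), 2))

On the skewed sample, the equi-width boundaries concentrate most cells in the first one or two bins, which is exactly what makes candidate checks expensive.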
Another crucial aspect of bitmap indexing is encoding [6], which determines how a set of bins B of the attribute domain is encoded in each bitmap and, consecutively, in the bitmap index. The simplest encoding, called equality encoding, encodes each bin with one bitmap, for a total of |B| bitmaps. Processing an equality query reads a single bitmap, but processing a range query has to read up to half of all the bitmaps. Range encoding uses |B| − 1 bitmaps, where each bitmap Ri encodes the range of bins [B1, Bi]; processing a range query over a range encoded bitmap index reads at most two bitmaps. Interval encoding [5] uses |B|/2 bitmaps, where each bitmap Ii is based on the range encoded bitmaps Ri ⊕ Ri+|B|/2. Interval encoding also uses at most two bitmaps to process range queries, while using only half the space of range encoding. Figure 2 shows an example of equality, range, and interval bitmaps for the array in Figure 1.

Figure 2: Bitmap index for attribute a of the array A from Figure 1: empty bitmask EBM, equality encoded index E, range encoded index R, and interval encoded index I.

Based on the number of bins, bitmap indices may take up to |B| · C space, where C is the cardinality of the indexed attribute, so even a small number of bins can make an uncompressed index exceed the size of the raw data. Binary run-length compression algorithms are therefore usually applied to bitmap indices to reduce their overall size. However, an additional requirement is posed on these compression algorithms: it must be possible to run bit-wise operations effectively on the compressed bitmaps. There are two representative compression algorithms, namely the Byte-aligned Bitmap Code (BBC) [1] and Word-Aligned Hybrid (WAH) compression [32].

In order to handle high-cardinality attributes effectively, with space-efficient indices and fast querying, two composite methods were introduced. The first is multi-component indexing, where the attribute value is decomposed into multiple components that are then indexed independently; an example of a multi-component index is the bit-sliced index [18], where each component corresponds to one bit of the value. The second composite method is multi-level indexing [23], where the binning of the attribute becomes progressively more precise with increasing levels.

A thorough performance analysis of bitmap indexing, especially of multi-level and multi-component indices, both uncompressed and compressed, is presented in [33]. The open-source bitmap indexing framework Fastbit [31] implements most of the currently existing indexing schemes, mainly two-level indices.
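The three encodings can be illustrated in a few lines of Python. The sketch below is our own toy illustration (bin identifiers, bit order, and the particular interval-encoding variant are assumptions); it builds equality-, range-, and interval-encoded bitmaps for an already binned attribute, in the spirit of the E, R, and I indices of Figure 2:

    import numpy as np

    def equality_bitmaps(bin_ids, n_bins):
        # One bitmap per bin: E[k][i] is set iff cell i falls into bin k.
        return [bin_ids == k for k in range(n_bins)]

    def range_bitmaps(bin_ids, n_bins):
        # |B| - 1 bitmaps: R[k][i] is set iff cell i falls into one of the bins 0..k.
        return [bin_ids <= k for k in range(n_bins - 1)]

    def interval_bitmaps(bin_ids, n_bins):
        # Roughly |B|/2 bitmaps: I[k] covers the consecutive bins k .. k + n_bins//2
        # (one common interval-encoding variant; exact conventions differ).
        half = n_bins // 2
        return [(bin_ids >= k) & (bin_ids <= k + half)
                for k in range((n_bins + 1) // 2)]

    if __name__ == "__main__":
        bins = np.array([3, 2, 0, 1, 3, 2, 1, 0])      # toy bin ids of 8 cells, 4 bins
        E, R, I = equality_bitmaps(bins, 4), range_bitmaps(bins, 4), interval_bitmaps(bins, 4)
        # A two-sided range query over bins 1..2 reads at most two range-encoded bitmaps:
        print((R[2] & ~R[0]).astype(np.uint8))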
4. HIERARCHICAL BITMAP ARRAY INDEX

We first briefly discuss a common way of indexing multidimensional arrays using an additional bitmap index for each dimension, and then describe the structure of our hierarchical bitmap array index.

Arrays A⟨a1, …, am⟩[d1, …, dn] are usually stored in a linearized representation, most commonly the C-style row-major representation. One can create an index Idi=k(d1, …, dn) for each dimension di, set to 1 for the cells of array A where di equals the value k. This allows filtering results based on dimensions using binary AND. Note that the dimension index Idi=k(d1, …, dn) does not necessarily have to use equality encoding; based on the expected queries, we may choose a better combination of binning, encoding, and compression. This approach is used in [30, 36] with equi-depth binning, and in [29] with v-optimized binning based on v-optimal histograms [11]; C-style row-major linearization is used in [28].

Unfortunately, a dimension bitmap index is not effectively compressible. Consider row-major ordering of a 5x5 array: the dimension index for column = 1 is 01000 01000 01000 01000 01000, which cannot be compressed effectively using either BBC or WAH, since the compression context of both is a single bit. This can be partially mitigated by stretching dimensions to multiples of bytes or words and extending the run-length compression to use a byte or word as its compression context instead of single bits. Another option is to use Z-order or Hilbert space-filling curves to increase the locality of the dimensions. Neither, however, solves the problem entirely.

4.1 Partitioning of Arrays

Non-partitioned data require much finer binning, since the domain of each dimension is larger than in its partitioned counterpart; thus a higher number of bins is required. We therefore partition the array A⟨a1, …, am⟩[d1, …, dn] into a set of regularly gridded chunks C in the Multidimensional Array Clustering fashion described in Section 3.2, such that

Ci[o1, o2, …, on, e1, e2, …, en] = A⟨a1, …, am⟩[o1 ≤ d1 < e1, …, on ≤ dn < en].

All chunks in our data model have the same shape, i.e., for all chunks Ci, Cj of array A it holds that Ci[ek] − Ci[ok] = Cj[ek] − Cj[ok] for all dimensions k; chunks do not overlap and completely cover the whole array A. In the chunk notation, ok stands for the offset and ek for the end of the chunk along dimension k (exclusive boundary).

By chunking the array, we limit the domain of both attributes and dimensions in a given partition. Our adaptively binned indices exploit the fact that the domain of the attribute varies with location.

The first problem arising from the equal-size chunking model is that within a single chunk, we still need either indexing or at least aggregate information on the attributes, such as min and max for precise queries, or histograms for probabilistic queries and data exploration. We choose to use bitmap indexing on both attributes and dimensions within the chunk. Note that the dimension indices are the same for all chunks in the array, since for each chunk we can simply subtract its offset from the dimension query constraints.

The second problem lies in the overall structure of the chunks: there is no direct, high-level index of the attributes across chunks. One must either scan through the synopses of all individual chunks or generate a hierarchical synopsis; the latter has been utilized in [12] in the form of a graph generated by merging sub-arrays.

We propose a unified solution that addresses both the problem with dimension attributes and the problem of chunk synopses. Our solution is a hierarchical bitmap index on top of an n-dimensional tree (such as an octree for 3 dimensions), with variable binning for each node in the tree.
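A minimal sketch of the regular chunking step (our own helper, not the paper's code): it enumerates the offsets ok and exclusive ends ek of equally shaped chunks that cover an array, clipping the boundary chunks at the global array shape exactly as the leaf boundaries are clipped later:

    import itertools

    def regular_chunks(array_shape, chunk_shape):
        # Yield (offsets, ends) of non-overlapping chunks that cover the array;
        # ends are exclusive, and boundary chunks are clipped to the array shape.
        assert len(array_shape) == len(chunk_shape)
        starts = [range(0, s, c) for s, c in zip(array_shape, chunk_shape)]
        for offsets in itertools.product(*starts):
            ends = tuple(min(o + c, s)
                         for o, c, s in zip(offsets, chunk_shape, array_shape))
            yield offsets, ends

    if __name__ == "__main__":
        for o, e in regular_chunks((6, 7), (4, 4)):
            print("chunk offset", o, "end", e)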
4.2 Structure of the Array Chunk Index

The index is built separately for each attribute of the array A. Let us fix an attribute α; all the following functions refer to this attribute.

Each chunk C(o1, o2, …, on) of array A⟨a1, …, am⟩[d1, …, dn] is associated with exactly one leaf node Nℓ(o1, o2, …, on). Independently, each leaf uses an equi-depth binning index with at most BINS bins, where the bin boundaries bins(Nℓ) are based on an exact histogram of the chunk's values. Note that this assumes a uniform distribution of queries; if we had prior knowledge of the attribute queries, we would instead opt for a weighted histogram to construct the binning. The leaf's dimension boundaries correspond to its associated chunk's boundaries, clipped by the global shape of the array A.

Empty values (missing cells in A) are accounted for using a special bitmask, known as the empty bitmask, for a total of BINS + 1 indices. Only leaves with at least E · BINS non-empty cells are indexed, where the constant E depends on the data structure used for the leaf representation, i.e., we do not use bitmap indexing if simply listing the values is more space efficient.

The encoding of the leaf indices is left as a parameter to the user, as bitmap indexing performance heavily depends on the cardinality of the array attribute, the desired number of bins, and the query types. For generality, we assume high-cardinality attributes, such as integers and doubles, and a small number of bins, such as BINS ≤ 16.

Except for very narrow dimension range queries, a dimension query will either cover the whole span of a leaf node, or result in a one-sided dimension range query once the query processing reaches a single chunk. Thus, the ideal encodings for chunks are range and interval encodings [5]. Our default encoding is interval encoding, since it uses half the memory that range encoding does. The encoding of inner nodes is more complicated and is described in Section 4.5.
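The following sketch shows how a single leaf (one chunk) could be indexed along the lines of this section: an empty bitmask plus at most BINS equi-depth bins derived from the exact distribution of the chunk's non-empty values. The names, the use of NaN as the empty marker, and NumPy itself are our assumptions; the actual implementation builds on Fastbit's binned indices (Section 6.1).

    import numpy as np

    BINS = 16

    def build_leaf_index(chunk_values):
        # chunk_values: flat array of one attribute inside one chunk; NaN marks empty cells.
        empty_bitmask = np.isnan(chunk_values)
        present = chunk_values[~empty_bitmask]
        # Equi-depth boundaries from the exact distribution of the chunk's values.
        boundaries = np.quantile(present, np.linspace(0.0, 1.0, BINS + 1))
        # Assign each cell to a bin and build one bitmap per bin (BINS + 1 with the EBM).
        bin_ids = np.searchsorted(boundaries, chunk_values, side="right") - 1
        bin_ids = np.clip(bin_ids, 0, BINS - 1)
        bitmaps = [(~empty_bitmask) & (bin_ids == k) for k in range(BINS)]
        return empty_bitmask, boundaries, bitmaps

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        chunk = rng.normal(size=1024)
        chunk[rng.random(1024) < 0.1] = np.nan        # roughly 10% empty cells
        ebm, bounds, bmps = build_leaf_index(chunk)
        print(len(bmps), "bin bitmaps + 1 empty bitmask;", int(ebm.sum()), "empty cells")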
4.3 Structure and Construction of the Hierarchical Bitmap Array Index

To provide the higher-level index, we create a special composite index on a tree similar to an n-dimensional tree. Each internal node of the index has at most F children, where F is called the fanout. Note that, unlike in quadtrees, octrees, or other n-dimensional trees, F is not necessarily 2^n, where n is the number of dimensions. Our bitmap indices are based on the fanout, and we want to utilize binary operations as much as possible; for this reason, the fanout F should be a multiple of the processor word size W, or as close to it as possible.

The overall internal node fanout F can be expressed in terms of a fanout F_dk for a single dimension k as

F = ∏_{k=1}^{n} F_dk ≤ (max_{1≤k≤n} F_dk)^n.

Assuming that the dimension fanout F_dk is the same for all dimensions, we get F_dk = ⌊F^{1/n}⌋. As we will see in Section 5.2, in order to facilitate efficient dimension range queries, F cannot be too large, since the size of the precomputed dimension clipping bitmaps depends on F.

The index tree is constructed in a bottom-up fashion, where the leaf nodes are indexed first. This allows both data appending and modification (see Section 4.7). Each internal node N is constructed from at most F direct children and has at most BINS attribute bins, with one additional index for the empty bitmask. Each child node Ni of an internal node N provides its attribute's min(Ni) and max(Ni) values; these values are used for the construction of the bitmap index of N.

Let B = {(min(N1), max(N1)), …, (min(NF), max(NF))} be the set of intervals ranging from the minimum to the maximum value of the indexed attribute α among the child nodes Ni. The set B induces a set of bins: the individual interval boundaries are the delimiters at which the attribute value a switches between the attribute domains of different child nodes. Formally, let nodesin(a) be a function of a value a of attribute α that returns a subset of the child nodes:

Ni ∈ nodesin(a) ⟺ min(Ni) ≤ a ≤ max(Ni).

The set nodesin(a) is used to construct the binning for the index of this internal node; we describe the encoding of this bitmap index in Section 4.5. The index bins are aligned with the bins from B. This guarantees that no two indices for different bins are identical, i.e., represent the same set of children. It also directly implies that adding more boundaries to B would be pointless.

4.4 Bin Boundaries Merging in Parent Nodes

For the majority of internal nodes N, the number of bins collected from all F child nodes is higher than BINS, so it is necessary to reduce the set of bins B to a subset R ⊂ B such that |R| = BINS. There are several strategies for choosing R; an example of such a binning reduction is shown in Figure 3.

The first strategy is an equi-width distribution of the bins. This is the ideal choice when the attribute part of the query is uniformly distributed, or when there is no prior knowledge about the attribute queries and the data distribution is not skewed.
The second strategy is equi-depth binning. This is ideal if the attribute distribution of the child nodes is skewed. It is possible to maintain the weights of the bins for leaf nodes, since those have direct access to the data; internal nodes, however, can only estimate the weights of merged bins. In each internal node and leaf, we store a weight estimate w(b) for each b ∈ B. The weighted square error of a bin b is

wse(b) = (w(b) − W/BINS)²,

where W = Σ_{b∈B} w(b) is the total weight, and the weighted sum square error is

wsse(B) = Σ_{b∈B} wse(b).

To estimate the weight of a merged bin r ∈ R ⊂ B, we assume a uniform distribution of values over the intervals of the bins b ∈ B. The estimated weight of r is then

w(r) = Σ_{b∈B} w(b) · sizeof(b ∩ r),

where sizeof(b ∩ r) is the size of the intersection of r and b, taken relative to b under the uniformity assumption.

We cannot use the trivial algorithm for equi-depth binning, because we can only iterate over bins of variable weight instead of iterating over single data points. We therefore approximate equi-depth binning using a simple iterative algorithm; the details of selecting the approximately equi-depth bins R ⊂ B are shown in Algorithm 1. We first start with equi-width binning (line 1). Then we generate the sets of all possible bin splits and merges (lines 2-3), set up two priority queues, and evaluate all possible splits and merges in terms of the weighted sum square error (lines 4-11). After that, we keep performing one valid split and one merge on the binning as long as this improves the overall binning (lines 14-18); this preserves the total number of bins.

Algorithm 1: Iterative equi-depth binning approximation
Input: set of bins B, set of weights w(b), b ∈ B, number of output bins BINS
Result: approximately equi-depth bins R ⊂ B, |R| = BINS
1  R ← equi-width bins from B, |R| = BINS;
2  B_S ← all possible split bins of R;
3  B_M ← all possible merged bins of R;
4  Q_SPLIT ← priority queue();
5  Q_MERGE ← priority queue();
6  for s ∈ B_S do                         // bins to split
7      add (s, ∆wse(s)) to Q_SPLIT;
8  end
9  for (m, m′) ∈ B_M do                   // bins to merge
10     add ((m, m′), ∆wse((m, m′))) to Q_MERGE;
11 end
12 (s, ∆wse(s)) ← min(Q_SPLIT);           // split that decreases wsse the most
13 ((m, m′), ∆wse((m, m′))) ← min(Q_MERGE);    // merge that increases wsse the least
14 while ∆wse((m, m′)) > ∆wse(s) do
15     split s;
16     merge (m, m′);
17     update R, B_S, B_M, Q_MERGE, Q_SPLIT;
18 end

Figure 3: Example of merging |B| = 8 bin boundaries into |R| = 4 bin boundaries for 4 child nodes. False positive ranges are marked in red. Two-sided range encoded bitmaps are generated for R.

In case a node has a low-cardinality attribute throughout all its child nodes, we create bins mapped to single values of the attribute and their corresponding bitmaps. Note that v-optimal binning does not work in our case, since the individual data values are not available during the construction of the internal nodes, although we could approximate it using uniformly or normally distributed estimates within the bins of the child nodes, or by propagating at least a basic data synopsis.
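Algorithm 1 operates on bins of variable weight rather than on raw values. As a simplified stand-in (a greedy single pass over the fine boundaries instead of the paper's split-and-merge loop with priority queues; the names and the greedy strategy are ours), the sketch below reduces a weighted set of child bin boundaries to BINS approximately equi-depth bins:

    def approx_equi_depth(boundaries, weights, bins):
        # boundaries: sorted list of len(weights) + 1 numbers delimiting the fine bins;
        # weights: estimated cell count of each fine bin;
        # returns bins + 1 boundaries with roughly equal weight per output bin.
        total = float(sum(weights))
        out = [boundaries[0]]
        cum = 0.0
        for right_edge, w in zip(boundaries[1:-1], weights[:-1]):
            cum += w
            # Close output bin k once the cumulative weight passes k * total / bins.
            if cum >= len(out) * total / bins and len(out) < bins:
                out.append(right_edge)
        out.append(boundaries[-1])
        return out

    if __name__ == "__main__":
        fine = list(range(17))              # 16 fine bins ...
        w = [1] * 8 + [10] * 8              # ... with skewed weights
        print(approx_equi_depth(fine, w, 4))    # -> [0, 10, 12, 14, 16]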
Similarly, since we don’t have the individual data values available dur- we encode |R− | + 1 bitmaps for values r− using inverse ing construction of the internal nodes, although we could range encoding, i.e., children, whose attribute range max- approximate this using uniformly or normally distributed imum max(Ni ) is > to r− are encoded by 0 in the bitmap, estimates within the bins of child nodes, or by propagating representing children that have already ended before or in at least basic data synopsis. the interval r− . These two bitmaps easily allow evaluation of partial and 4.5 Double Range Encoding of Bitmap Indices complete matches (see Section 5.1) using only two bitmap in Internal Nodes reads and one logical operation for both partial and complete Unlike in bitmap indexing in leaves where one encodes query. positions of individual values, we encode sets of child nodes nodesin(a) for attribute values a in the internal nodes. Our 4.6 Locality of the Hierarchical Index binning B has the property that for all attribute values In order to preserve locality of the data during queries, we ab , a00b ∈ b ∈ B it holds that nodesin(ab ) = nodesin(a0b ). store the whole index in a locality preserving linearization of Note that this does not hold for intervals r ∈ R (See Figure an n-dimensional tree. For each query, blocks of the index 3 for an example). are loaded sequentially and sparsely, based on the parame- We will now describe an effective bitmap encoding of ters in the query. Thus, only one traversal, possibly incom- nodesin(a), a ∈ r ∈ R. Let’s have two adjacent intervals plete, of the index data is needed. The index data consist r ∈ R and r0 ∈ R, such that rh = r`0 Note that since of bin boundaries, weight estimates and bitmap indices. R ⊂ B, we have nodesin(r) 6= nodesin(r0 ). If nodesin(r0 ) ⊃ We use space filling curves, namely the Z-order curve to nodesin(r), then r0 corresponds to a bin, where nodes are linearize the multidimensional array index. We choose not
4.6 Locality of the Hierarchical Index

In order to preserve locality of the data during queries, we store the whole index in a locality-preserving linearization of an n-dimensional tree. For each query, blocks of the index are loaded sequentially and sparsely, based on the parameters of the query. Thus, only one traversal of the index data, possibly incomplete, is needed. The index data consist of bin boundaries, weight estimates, and bitmap indices.

We use space-filling curves, namely the Z-order curve, to linearize the multidimensional array index. We choose not to use recursive multi-level Z-order curves, as this would force the query processing to be based on a pre-order traversal of the index tree. We also choose not to use row-major ordering, since it has poor locality and would slow down retrieving the locations of child nodes and partitions. The Hilbert curve has perfect locality, but it does not preserve the ordering of dimensions, which means we would need to precompute the bitmaps for dimension constraints separately for each block of the Hilbert curve. The Z-order curve allows fast child and parent node index computations, preserves dimensionality between different levels, and has good locality.

The order Zℓ of the Z-order curve at level ℓ is determined by the maximal fanout F_max = max_{1≤k≤n} F_dk, where F_dk is the fanout of dimension k:

Zℓ = ℓ · ⌈log₂ F_max⌉.

Assuming F_dk is the same for all dimensions, the order of the Z-order curve is

Zℓ = ℓ · ⌈log₂ ⌊F^{1/n}⌋⌉,

and such a Z-order curve has length (Zℓ)^n.

Several of the higher levels are stored in a dense vector, as specified by a user parameter; these vectors are expected to be densely filled. The remaining levels are stored as non-overlapping intervals on the Z-order (1D) dimension in continuous blocks, indexed by a binary search tree. This is a compromise between a sparse single-node map and the full vector used for the higher levels. Note that the blocks may not be sequential in memory, but at most a single transition is guaranteed, i.e., no block is read twice during the processing of a single query.

4.7 Appending and Modifying Data

Scientific data is often considered either fixed or append-only; our indexing approach allows both appending and data modification, although the latter is not convenient. To append data along any dimension, we apply the same bottom-up procedure to update the index. It is necessary to update the dimension bounds of internal nodes (which may previously have been clipped by the global shape of the array) and the bitmap indices (to include the new child nodes). Note that we do not have to update the weight estimates and bin boundaries (except min and max) to ensure index correctness; however, to maintain the approximately equi-depth binning, we need to run the bin merge algorithm again on the affected nodes.
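The Z-order linearization relies on Morton encoding and decoding of node coordinates, which also makes parent/child index computations cheap. A small self-contained sketch (plain bit interleaving in Python; production code would use word-level bit tricks):

    def morton_encode(coords, bits):
        # Interleave the low `bits` bits of each coordinate into one Z-order key.
        key = 0
        for b in range(bits):
            for d, c in enumerate(coords):
                key |= ((c >> b) & 1) << (b * len(coords) + d)
        return key

    def morton_decode(key, ndims, bits):
        coords = [0] * ndims
        for b in range(bits):
            for d in range(ndims):
                coords[d] |= ((key >> (b * ndims + d)) & 1) << b
        return tuple(coords)

    if __name__ == "__main__":
        key = morton_encode((3, 5), bits=3)
        assert morton_decode(key, ndims=2, bits=3) == (3, 5)
        parent_key = key >> 2      # moving one level up drops n = 2 bits of the key
        print(key, parent_key)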
5. QUERYING DIMENSIONS AND ATTRIBUTES

In this work, we focus on selection queries over the dimensions and attributes of an array. Such a query consists of a set of dimension constraints and attribute constraints. We specify a query q over an array A⟨a1, …, am⟩[d1, …, dn] as a set of ranges over dimensions q_D and attributes q_A:

q = q_A ∪ q_D = {(a, a_ℓ, a_h)} ∪ {(d_j, j_ℓ, j_h), …},

where (a, a_ℓ, a_h) is a triple specifying an attribute constraint: the attribute, its lower bound, and its (exclusive) upper bound; the same holds for dimensions. In this work we focus on single-attribute queries, so we simplify q_A to (a_ℓ, a_h). A query may leave some dimensions unconstrained, in which case we fill q with the remaining dimensions to obtain a complete query: dimensions that were not specified are filled with (d_j, min(d_j), max(d_j)) triples, and one-sided range constraints are extended in a similar manner.

The core of the query algorithm is a breadth-first descent through the index tree. At each level, the search space is pruned according to both dimension and attribute values.

Let N be the currently searched node and Ni its child nodes, 0 ≤ i < F, and let the multidimensional range D_N be the set of dimension boundaries associated with node N, in the format [D_N[d]_ℓ, D_N[d]_h], where d is a dimension, ℓ designates the lower bound, and h the upper bound.

Throughout the query processing, we maintain a queue of partially matched nodes P and a set of completely matched nodes C. We start at the root node N_r, setting P = {N_r}, assuming that the node's boundaries and the query are not disjoint: D_{N_r} ∩ q_D ≠ ∅ and (min(N_r), max(N_r)) ∩ q_A ≠ ∅; otherwise the node belongs to neither P nor C.

Let p, p′, p∗ and c, c′, c∗ be bitmaps of size F: p indicates partial attribute matches among the children of node N, p′ indicates partial dimension matches, and p∗ indicates partial matches; similarly, c, c′, c∗ indicate complete matches. We now set these bitmaps according to the query q for the first node in the queue P. The computation of the partial and complete match bitmaps is also described in Algorithm 2 and illustrated in Figure 4.

Algorithm 2: Evaluation of partial and complete match bitmaps for a single node
Input: query q = {(a_ℓ, a_h), (d_1, d_ℓ, d_h), …} with DIMS dimension constraints; node N; node children N_1, …, N_F; boundaries [D_N[d]_ℓ, D_N[d]_h] for N and all N_i and all dimensions d
Result: partial matches p∗; complete matches c∗
1  P_{N,S}, C_{N,S} ← load index for node N;
2  P′_{S,d}, C′_{S,d};                        // precomputed
3  p ← {0}^F, p′ ← {0}^F, p∗;
4  c ← {1}^F, c′ ← {1}^F, c∗;
5  if a_h < min(N) or a_ℓ > max(N) then
6      return p∗ ← {0}^F, c∗ ← {0}^F
7  c ← c & C_{N,S}(a_ℓ, a_h);
8  p ← p | (P_{N,S}(a_ℓ, a_h) & ∼c);
9  for dimensions d, 1 ≤ d ≤ DIMS do
10     if d_h < D_N[d]_ℓ or d_ℓ > D_N[d]_h then
11         return p∗ ← {0}^F, c∗ ← {0}^F
12     if d_ℓ > D_N[d]_ℓ then
13         p′ ← p′ | P′_{S,d}(d_ℓ);
14     if d_h < D_N[d]_h then
15         p′ ← p′ | P′_{S,d}(d_h);
16     c′ ← c′ & C′_{S,d}(d_ℓ, d_h);
17 end
18 p′ ← p′ & c′;
19 c′ ← c′ & ∼p′;
20 c∗ ← c & c′;
21 p∗ ← (p | c) & (p′ | c′) & ∼c∗;
22 return p∗, c∗

5.1 Attribute-based Matches

In this subsection we explain how the attribute bitmasks are set; it details lines 5-8 of Algorithm 2.
If a_h < min(N) or a_ℓ > max(N), there are neither partial nor complete attribute matches, and we terminate processing the current node.

Let P_{N,S}(a_ℓ, a_h) be a partial attribute match bitmask specific to node N for an array of shape S, with bit i set to one for the children Ni whose attribute range intersects the query range, [a_ℓ, a_h] ∩ [min(Ni), max(Ni)] ≠ ∅:

P_{N,S}(a_ℓ, a_h)[i] = 1 ⟺ P_{B|N,S}(a_h)[i] ∧ P_{E|N,S}(a_ℓ)[i]
P_{B|N,S}(a)[i] = 1 ⟺ min(Ni) ≤ a
P_{E|N,S}(a)[i] = 1 ⟺ max(Ni) ≥ a

The second expression describes a bitmap set to 1 for the children whose range has started before or at value a, and the third describes the children whose range ends at or after a; the first expression combines both.

To evaluate P_{N,S}(a_ℓ, a_h), we first use binary search on R+ and R− to find the two bins L ∈ R+ and H ∈ R− such that a_ℓ ∈ L and a_h ∈ H; these bins mark the attribute boundary bins. Then P_{B|N,S}(a_h) is identical to R+[H], and ¬P_{E|N,S}(a_ℓ) is identical to R−[L], where R+ and R− are the bitmap indices described in Section 4.5, each queried for a single bin. We then add P_{N,S}(a_ℓ, a_h) to p using bitwise OR.

We process complete candidates in a similar fashion. Let C_{N,S}(a_ℓ, a_h) be a complete attribute match bitmask specific to node N for an array of shape S, with bit i set when the child's range covers the whole query range, [a_ℓ, a_h] ∩ [min(Ni), max(Ni)] = [a_ℓ, a_h]:

C_{N,S}(a_ℓ, a_h)[i] = 1 ⟺ P_{B|N,S}(a_ℓ)[i] ∧ P_{E|N,S}(a_h)[i]

This expression is very similar to P_{N,S}(a_ℓ, a_h); it describes the children that have started at or before a_ℓ and have not ended before a_h. To evaluate C_{N,S}(a_ℓ, a_h), we query R+[L] and R−[H]. We then add the result to c using bitwise OR and remove the complete matches from p, i.e., p = p ∧ ¬c.

Note that the partial and complete attribute candidates together use a total of 4 index queries. An example of an attribute query is displayed in the bottom row of Figure 4.
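To illustrate lines 5-8 of Algorithm 2 against the double range encoded index, the following sketch (assumed names; bitmaps as Python integers) locates the boundary bins by binary search and resolves the query endpoints conservatively, so that partial candidates are over-approximated and complete matches under-approximated, which the later candidate check tolerates:

    import bisect

    def attribute_matches(q_lo, q_hi, r_plus, r_minus, boundaries, n_children):
        # boundaries: sorted bin upper bounds; r_plus[k]: children with min <= boundaries[k];
        # r_minus[k]: children with max <= boundaries[k] (integer bitmaps, bit i = child i).
        full = (1 << n_children) - 1

        def started_by(x):     # children that may have min <= x (over-approximation)
            k = bisect.bisect_left(boundaries, x)
            return r_plus[k] if k < len(boundaries) else full

        def ended_before(x):   # children that certainly have max < x
            k = bisect.bisect_left(boundaries, x) - 1
            return r_minus[k] if k >= 0 else 0

        partial = started_by(q_hi) & ~ended_before(q_lo) & full
        k_lo = bisect.bisect_right(boundaries, q_lo) - 1       # surely started by q_lo
        k_hi = bisect.bisect_left(boundaries, q_hi)            # surely not ended by q_hi
        surely_started = r_plus[k_lo] if k_lo >= 0 else 0
        surely_alive = ~r_minus[k_hi] & full if k_hi < len(boundaries) else 0
        complete = surely_started & surely_alive
        return partial & ~complete, complete

    if __name__ == "__main__":
        children = [(1, 6), (3, 8), (1, 3), (5, 8)]
        bounds = [3, 6, 8]
        rp = [sum(1 << i for i, (lo, _) in enumerate(children) if lo <= b) for b in bounds]
        rm = [sum(1 << i for i, (_, hi) in enumerate(children) if hi <= b) for b in bounds]
        print(attribute_matches(3, 6, rp, rm, bounds, len(children)))  # -> (13, 2), i.e. 0b1101, 0b0010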
An example of attribute query \ 0 0 is displayed in the bottom row in Figure 4. CS (d` , dh )[i] = CS,d [i] 1≤n≤DIMS 5.2 Dimension based Matches Complete dimension matches are evaluated using two pre- Next, we explain how the dimension masks are set. This computed bitmap indices corresponding to subsection further describes lines 9–17 in Algorithm 2. If for any dimension d it holds that dh < DNi [d]` or a` > 0 CB|S,d (b)[i] = 1 ⇐⇒ b ≤ DNi [d] DNi [d]h , there are neither partial nor complete dimension 0 matches and we terminate processing the current node. CE|S,d (b)[i] = 1 ⇐⇒ b ≥ DNi [d] Unlike attribute query, the evaluation of dimension query similarly to bitmaps used for partial matches. There is a is the same for all nodes N , so all the bitmaps for processing total of 2 · Fd · d bitmaps of size F for complete matches. We dimensions queries are precomputed. query these bitmaps for all dimensions and combine them 0 Let PS,d (d` , dh ) be a partial dimension match, where d is a using AND into c0 . dimension in the query constraint (d, d` , dh ), for an array of We now combine the partial dimension matches bitmap c0 shape S, indicating child nodes Ni such that the intersection with p0 , such that p0 = p0 ∧ c0 . Then, we clip the complete [DNi [d]` , DNi [d]h , ] ∩ [d` , dh ] 6= ∅. dimension bitmap by the partial bitmap as c0 = c0 ∧ ¬p0 . Let’s fix a dimension d for which we evaluate partial mat- During the evaluation of dimension matches, we used a total 0 c0hes PS,d (d` , dh ): of 3 · d index queries. An example of dimension query is 0 PS,d (d` )[i] = 1 ⇐⇒ d` ∈ DNi [d] ∧ d` 6= DNi [d]` displayed in the top row in Figure 4. 0 PS,d (dh )[i] = 1 ⇐⇒ dh ∈ DNi [d] ∧ dh 6= DNi [d]h 5.3 Partial and Complete Matches 0 0 0 PS,d (d` , dh )[i] = 1 ⇐⇒ PS,d (d` )[i] ∨ PS,d (dh )[i] Now that we have both attribute and dimension, and both 0 [ 0 partial and complete candidates, we may proceed to merging PS (d` , dh )[i] = PS,d [i] the candidates and generating a bitmap representing the set 1≤d≤DIMS ∗ of result node children CN,S and a bitmap representing the ∗ The first expression describes which children Ni have di- set of potential node children PN,S that will be recursively mension d range such that the query limit d` falls inside the explored. This subsection further describes lines 18–22 in range, but it is not equal to the lower limit of that range. Algorithm 2.
The C∗_{N,S} bitmap is easier to obtain, as it is simply the intersection of the two complete bitmaps:

C∗_{N,S} = C_{N,S} ∧ C′_S

We obtain the set of partial candidates P∗_{N,S} by joining the dimension-based partial candidates with the attribute-based candidates and clipping both by the complete candidates:

P∗_{N,S} = (P_{N,S} ∨ C_{N,S}) ∧ (P′_S ∨ C′_S) ∧ ¬C∗_{N,S}

We then iterate through the results, adding the child nodes from C∗_{N,S} to the result set C and the partial candidates P∗_{N,S} to the queue P for subsequent processing. This is done on top of the Z-order indices, as it is trivial to generate the Z-order indices corresponding to nodes at the lower levels. The Z-order ordering of the inner nodes together with the breadth-first traversal also ensures a single traversal through the index.

Figure 4: Processing of the query SELECT * FROM A WHERE 2 ≤ a ≤ 4 AND 1.3 ≤ d1 AND d2 ≤ 2.5 in a single node of the hierarchical index. The top row represents dimension constraints (partial matches P′, complete matches C′), the bottom row attribute constraints (partial matches P, complete matches C); the bottom right is the final node output. Blue nodes represent partial matches and green nodes represent complete matches.

Running the algorithm for multiple queries, or for multiple attribute constraints in a single query, can in the worst case be implemented by iterating through the constraints.

5.4 Estimating Cardinality of Results; Membership Queries

It is straightforward to output estimates of the minimal and maximal number of matching cells by iterating over a bounded number of levels of the index: the minimum is the size of the nodes in C, while the maximum is the size of the nodes in C ∪ P. Using the weight estimates w(b), we may also provide estimates of aggregates over the attribute, based on a bin-wise linear approximation.

A simple modification of the algorithm supports membership queries (see Section 3.1 for details about membership queries). On top of the two-sided range indices P_{N,S} and C_{N,S} for attribute queries, we keep equality indices and iterate through the attribute constraint. For dimension membership queries, we precompute an index for all dimension values (within a single chunk), as opposed to the buckets corresponding to child nodes that are used in P′_{S,d} and C′_{S,d}.
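The merging step of Section 5.3 and the simple cardinality bounds above reduce to a handful of bitwise expressions; the sketch below (names are ours) treats the per-node bitmaps as Python integers:

    def combine_matches(p, c, p_dim, c_dim, fanout):
        # p, c: partial/complete attribute bitmaps; p_dim, c_dim: dimension bitmaps.
        full = (1 << fanout) - 1
        c_star = c & c_dim & full
        p_star = (p | c) & (p_dim | c_dim) & ~c_star & full
        return p_star, c_star

    def cardinality_bounds(complete_nodes, partial_nodes, cells_per_node):
        # Minimum counts only completely matched nodes; maximum adds all partial candidates.
        low = sum(cells_per_node[n] for n in complete_nodes)
        return low, low + sum(cells_per_node[n] for n in partial_nodes)

    if __name__ == "__main__":
        p_star, c_star = combine_matches(p=0b0110, c=0b1001,
                                         p_dim=0b0011, c_dim=0b1100, fanout=4)
        print(bin(p_star), bin(c_star))    # 0b111 0b1000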
6. EXPERIMENTAL EVALUATION

We tested our implementation against several other solutions. None of them is specifically tailored to mixed attribute and dimension range queries, but they are the only readily available solutions that involve bitmap indices and are capable of executing range queries.

We measured the time and space efficiency of each individual query, i.e., the total query execution time and the space requirements of the index. Timing was measured as an average of 3 runs with data preloaded into memory. For Fastbit queries, we use Fastbit's internal wall-time measurement, meaning that certain pre- and post-processing steps, such as query string parsing, are not included in the time measurements. Space requirements were measured as the disk space required to store the bitmap index together with all relevant metadata.

The experiments were run on a single physical machine – Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz, 16 GB RAM, 1TB 7.2K RPM SATA 6Gbps – running Ubuntu 14.04.1 (3.19.0-32 kernel).

We test our queries on a synthetic dataset: a randomly generated multidimensional sum-of-Gaussians distribution, SumGauss. Its only attribute a_G is a sum of G randomly initialized Gaussian distributions in D dimensions:

a_G(d) = Σ_{i=1}^{G} 1/√((2π)^D |Σ_i|) · exp(−(d − µ_i)ᵀ Σ_i⁻¹ (d − µ_i) / 2),

where µ_i and Σ_i are a randomly generated mean vector and a bounded, symmetric, positive definite covariance matrix of the i-th Gaussian. For sparse arrays, a threshold on the Gaussian functions is used: the attribute is treated as empty if its value is below this threshold, and only partitions with at least one non-empty value are generated.
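A sketch of generating the SumGauss attribute on a regular integer grid (a minimal NumPy version under our own assumptions; the paper's generator, its covariance bounds, and the partition handling are not reproduced here):

    import numpy as np

    def sum_gauss(shape, n_gauss=5, threshold=1e-4, seed=0):
        # Sum of n_gauss random Gaussians evaluated on an integer grid of `shape`;
        # cells below `threshold` are marked empty (NaN), as in the sparse variant.
        rng = np.random.default_rng(seed)
        dims = len(shape)
        axes = [np.arange(s, dtype=float) for s in shape]
        grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
        values = np.zeros(shape)
        for _ in range(n_gauss):
            mu = rng.uniform(np.zeros(dims), np.array(shape, dtype=float))
            a = rng.normal(size=(dims, dims))
            cov = a @ a.T + np.eye(dims)            # symmetric positive definite
            inv, det = np.linalg.inv(cov), np.linalg.det(cov)
            diff = grid - mu
            quad = np.einsum("...i,ij,...j->...", diff, inv, diff)
            values += np.exp(-0.5 * quad) / np.sqrt(((2 * np.pi) ** dims) * det)
        values[values < threshold] = np.nan
        return values

    if __name__ == "__main__":
        arr = sum_gauss((64, 64), n_gauss=3)
        print(arr.shape, int(np.count_nonzero(~np.isnan(arr))), "non-empty cells")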
6.1 Fastbit Integration

Fastbit [31] is an open-source library that implements bitmap indexing. It is not a complete database management system but rather a data processing tool, as its main purpose is to facilitate selection queries and estimates. Fastbit's key technological features are WAH bitmap compression and multi-component and multi-level indices with many different combinations of encoding and binning schemes.

We use Fastbit's partitions to set up the lowest level of our indices (the leaves), and base our binning indices on Fastbit's single-level binning index. This approach requires preprocessing the data into evenly shaped partitions and generating empty bitmasks and shape metadata. Once a table is preprocessed into even partitions, it is indexed as described in Section 4. The index generation processes one partition at a time, and once processed, a partition is never accessed again during index generation.

6.2 Bitmap Indexing Methods

BoxClip represents a naive algorithm using 32 equi-depth binned indices, interval encoding, and WAH compression. The result bitmask of the attribute query is transformed into a set of "line" hyperrectangles (the size of the hyperrectangle is 1 in all but one dimension), which are filtered by the dimension query and then merged into a set of result hyperrectangles. All steps except the filtering are built on top of Fastbit's mesh query; the filtering is implemented using a recursive sweeping-line algorithm.

DimsAtts uses indexed uint auxiliary attributes made from the dimensions (see Section 4). The dimension query is preprocessed into attribute constraints and then run as a multi-constraint query in Fastbit. The configuration is the same as in BoxClip, using 32 binned indices, range encoding, and WAH compression on all attributes.

ArrayBit represents our hierarchical multidimensional index. We use 16 equi-depth binned indices, range encoding, and WAH compression to index the partitions, and 16 approximately equi-depth binned indices (described in Section 4.4) with two-sided range encoding and no compression for the hierarchical index. Note that, compared to BoxClip and DimsAtts, we use only half the number of bins in the partition index. This is sufficient in our algorithm because the bin boundaries are adapted to the actual data in each partition, and because we need to store the bin boundaries within the partitions.

6.3 Range Queries

In our work, we focus on mixed attribute and dimension queries. Regardless of the dataset, we categorize the queries by the overall ratio of the size of the query result to the total array size.

Figure 5: Query execution time and disk space required to store the indices for different array sizes (8 MB, 128 MB, 1 GB) for BoxClip, DimsAtts, and ArrayBit.

Figure 5 shows the time required to return all results. The index file is preloaded into memory prior to the test for all the systems used. We used a 2D array for this experiment and a query with ≈ 10% hit ratio. Both BoxClip and DimsAtts run slower than ArrayBit: in the case of BoxClip, the reason is that all the attribute query results had to be processed, while for DimsAtts the reason is that the attribute made from the second dimension did not compress effectively. In terms of space requirements, all of the algorithms store an attribute index. ArrayBit uses fewer bins in the leaves, but stores bin boundaries for all leaves and internal nodes, plus bitmaps for the internal nodes, effectively taking up the same space as BoxClip. On the other hand, DimsAtts stores indices for all dimension attributes. Row-major ordering is used in this measurement.
Figure 6: Query execution time for 2D, 3D and 4D queries of various hit ratios. Queries contained an attribute constraint and all dimension constraints, each constraint with approximately the same domain reduction.

Figure 6 demonstrates the dependency of the query processing time on the hit ratio of the query, i.e., the ratio of selected cells to the total number of cells in the array. The BoxClip algorithm does not prune the search space based on the dimensions, so its number of hits depends on the attribute only, and filtering them is time intensive. DimsAtts depends linearly on the total number of dimensions, because there is an additional attribute for each dimension; there is also a small dependency on the hit ratio, where the increase is due to the retrieval of results. ArrayBit achieves very good results for low and high hit-rate queries, due to a large number of complete matches and fast pruning of the search space. For medium hit-rate queries, the algorithm has a relatively high number of candidate nodes to explore, but still manages to prune the search space faster.

6.4 Parameterization

We also experimented with different setups of our hierarchical index. The major objectives remain the same: query execution time and the space requirements of the index.

First, the partition size determines the ratio of the partition index to the hierarchical index. We set it in equilibrium with the number of index bins, which increases the precision of the binning and results in a higher probability of pruning the search space earlier.

Another important parameter is the node fanout. If we use a smaller fanout (the smallest possible is 2^D), we may not fill a single memory word with the index, which significantly impairs bit parallelism; furthermore, the index becomes larger due to a much deeper index tree. If the fanout is too high, we do not prune infeasible candidates fast enough. We obtained optimal results with a fanout close to a multiple of the word size, such as 8² = 64 for 2D arrays, 4³ = 64 for 3D, 4⁴ = 256 for 4D, 3⁵ = 243 for 5D, etc.

7. CONCLUSIONS AND FUTURE WORK

Most of the work on bitmap indexing to date focuses on improving space efficiency and speed, while a few works applied bitmap indices to multidimensional data. However, the linear form of bitmap indices was never adapted to support multidimensional array data.

We have proposed a bitmap indexing method that is designed for multidimensional arrays and focuses on overcoming the dimensionality issue. The hierarchical nature of the proposed method allows continuous results and estimates to be output as intermediate results. Our approach effectively prunes the search space and uses data-adaptive, approximately equi-depth binning. Furthermore, the index supports partitioned array data and allows distributed storage.

Our experimental results show that the proposed bitmap indexing method outperforms standard linearized approaches for mixed attribute and dimension range query processing. A possible caveat is that more complex multi-level and multi-component indices exist; none of these indices overcomes the problem of dimensionality, but due to their effectiveness they delay the threshold (in terms of the number of dimensions and the size of the array) at which the drawbacks become noticeable.

Future work includes adapting the tree structure based on the dimensions, such as the adaptive mesh refinement widely used in physical simulations [4]. Another interesting possibility is a multi-attribute index in a single hierarchical structure. Finally, we want to use better approximation algorithms to determine feasible regions from finer attribute bins.

8. ACKNOWLEDGEMENT

This research was supported in part by AcRF Grant RG-18/14.
9. REFERENCES
[1] G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Conference, 1995. DCC '95. Proceedings, page 476. IEEE, 1995.
[2] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The multidimensional database system RasDaMan. In ACM SIGMOD Record, volume 27, pages 575–577. ACM, 1998.
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles, volume 19. ACM, 1990.
[4] M. J. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. Journal of Computational Physics, 82(1):64–84, 1989.
[5] C. Chan and Y. Ioannidis. An efficient bitmap encoding scheme for selection queries. ACM SIGMOD Record, 1999.
[6] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. In ACM SIGMOD Record, volume 27, pages 355–366. ACM, 1998.
[7] J. Chmiel, T. Morzy, and R. Wrembel. Time-HOBI: indexing dimension hierarchies by means of hierarchically organized bitmaps. In Proceedings of the ACM 13th International Workshop on Data Warehousing and OLAP (DOLAP '10), page 69. ACM Press, 2010.
[8] J. Chou, M. Howison, B. Austin, K. Wu, J. Qiang, E. Bethel, A. Shoshani, O. Rübel, R. D. Ryne, et al. Parallel index and query for large scale data analysis. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 30. ACM, 2011.
[9] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel. HDF5-FastQuery: Accelerating complex queries on HDF datasets using fast bitmap indices. In Scientific and Statistical Database Management, 2006. 18th International Conference on, pages 149–158. IEEE, 2006.
[10] A. Guttman. R-trees: a dynamic index structure for spatial searching, volume 14. ACM, 1984.
[11] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, volume 98, pages 24–27, 1998.
[12] A. Kalinin, U. Cetintemel, and S. Zdonik. Searchlight: enabling integrated search and exploration over large multidimensional data. Proceedings of the VLDB Endowment, 8(10):1094–1105, 2015.
[13] J. Lawder and P. King. Querying multi-dimensional data indexed using the Hilbert space-filling curve. ACM SIGMOD Record, 2001.
[14] J. K. Lawder and P. J. King. Using space-filling curves for multi-dimensional indexing. In Advances in Databases, pages 20–35. Springer, 2000.
[15] T. L. Lopes Siqueira, R. R. Ciferri, V. C. Times, and C. D. de Aguiar Ciferri. A spatial bitmap-based index for geographical data warehouses. In Proceedings of the 2009 ACM Symposium on Applied Computing, pages 1336–1342. ACM, 2009.
[16] T. Lungu and P. S. Callahan. QuikSCAT science data product user's manual: Overview and geophysical data products. D-18053-Rev A, version 3:91, 2006.
[17] P. Nagarkar, K. Candan, and A. Bhat. Compressed spatial hierarchical bitmap (cSHB) indexes for efficiently processing spatial range query workloads. Proceedings of the VLDB Endowment, 2015.
[18] P. O'Neil and D. Quass. Improved query performance with variant indexes. In ACM SIGMOD Record, volume 26, pages 38–49. ACM, 1997.
[19] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187–260, 1984.
[20] H. Samet. Applications of spatial data structures. 1990.
[21] H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
[22] R. R. Sinha, S. Mitra, and M. Winslett. Bitmap indexes for large scientific data sets: A case study. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, pages 10–pp. IEEE, 2006.
[23] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientific data. ACM Transactions on Database Systems (TODS), 32(3):16, 2007.
[24] T. L. L. Siqueira, C. D. de Aguiar Ciferri, V. C. Times, and R. R. Ciferri. The SB-index and the HSB-index: efficient indices for spatial data warehouses. Geoinformatica, 16(1):165–205, 2012.
[25] K. Stockinger. Bitmap indices for speeding up high-dimensional data analysis. In Database and Expert Systems Applications, pages 881–890. Springer, 2002.
[26] K. Stockinger and K. Wu. Bitmap indices for data warehouses. Data Warehouses and OLAP: Concepts, Architectures and Solutions, page 57, 2006.
[27] M. Stonebraker, P. Brown, D. Zhang, and J. Becla. SciDB: A database management system for applications with complex analytics. Computing in Science and Engineering, 15(3):54–62, 2013.
[28] Y. Su, Y. Wang, and G. Agrawal. In-situ bitmaps generation and efficient data analysis based on bitmaps. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 61–72. ACM, 2015.
[29] Y. Wang, Y. Su, and G. Agrawal. A novel approach for approximate aggregations over arrays. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, page 4. ACM, 2015.
[30] Y. Wang, Y. Su, G. Agrawal, and T. Liu. SciSD: Novel subgroup discovery over scientific datasets using bitmap indices. Ohio State CSE Technical Report, 2015.
[31] K. Wu, S. Ahern, E. W. Bethel, J. Chen, H. Childs, E. Cormier-Michel, C. Geddes, J. Gu, H. Hagen, B. Hamann, et al. FastBit: interactively searching massive data. In Journal of Physics: Conference Series, volume 180, page 012053. IOP Publishing, 2009.
[32] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1):1–38, 2006.
[33] K. Wu, A. Shoshani, and K. Stockinger. Analyses of multi-level and multi-component compressed bitmap indexes. ACM Transactions on Database Systems (TODS), 35(1):2, 2010.
[34] K. Wu, K. Stockinger, and A. Shoshani. Breaking the curse of cardinality on bitmap indexes. In International Conference on Scientific and Statistical Database Management, pages 348–365. Springer, 2008.
[35] K.-L. Wu and P. S. Yu. Range-based bitmap indexing for high cardinality attributes with skew. In COMPSAC '98. Proceedings. The Twenty-Second Annual International, pages 61–66. IEEE, 1998.
[36] G. Zhu, Y. Wang, and G. Agrawal. SciCSM: novel contrast set mining over scientific datasets using bitmap indices. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, page 38. ACM, 2015.