COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics (Extended Version)
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics (Extended Version) Tarique Siddiqui Surajit Chaudhuri Vivek Narasayya Microsoft Research {tasidd, surajitc, viveknar}@microsoft.com ABSTRACT Data analysis often involves comparing subsets of data across many dimensions for finding unusual trends and patterns. While the com- parison between subsets of data can be expressed using SQL, they tend to be complex to write, and suffer from poor performance over large and high-dimensional datasets. In this paper, we pro- pose a new logical operator C OMPARE for relational databases that concisely captures the enumeration and comparison between sub- sets of data and greatly simplifies the expressing of a large class of comparative queries. We extend the database engine with op- (a) Seedb [47] timization techniques that exploit the semantics of C OMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a com- mercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that C OMPARE results in a significant speedup over existing approaches, including physical plans gener- ated by today’s database systems, user-defined functions (UDFs), as well as middleware solutions that compare subsets outside the databases. 1. INTRODUCTION Comparing subsets of data is an important part of data explo- (b) Zenvisage [43] ration [8, 30, 43, 19, 47], routinely performed by data scientists to find unusual patterns and gain actionable insights. For instance, Figure 1: Examples of comparative queries from visual analytic tools: a) market analysts often compare products over different attribute com- Finding socio-economic indicators that differentiate married and unmarried binations (e.g., revenue over week, profit over week, profit over couples in Seedb [47].The user specifies the subsets (A) after which the tool outputs a pair of attributes (B) along with corresponding visualizations country, quantity sold over week, etc.) to find the ones with sim- (C) that differentiates the subsets the most. b) A comparative query in Zen- ilar or dissimilar sales. However, as the size and complexity of visage [43] for finding states with comparable housing price trends. the dataset increases, this manual enumeration and comparison of subsets becomes challenging. To address this, a number of visual- ization tools [47, 49, 19, 43] have been proposed that automatically to improve performance and scalability of comparative queries? compare subsets of data to find the ones that are relevant. Figure Supporting such queries within relational databases also makes them 1a depicts an example from Seedb [47] where the user specifies the broadly accessible via general-purpose data analysis tools such as subsets of population (e.g., based on marital status, race) and the PowerBI [3], Tableau [5], and Jupyter notebooks [27]. All of these tool automatically find a socio-economic indicator (e.g., education, tools let users directly write SQL queries and execute them in the income, capital gains) on which the subsets differ the most. Sim- DBMS to reduce the amount of data that is shipped to the client. ilarly, Figure 1b depicts an example from Zenvisage [43] for find- One option for in-database execution is to extend DBMS with ing states with similar house pricing trends. Unfortunately, most custom user-defined functions (UDF) for comparing subsets of data. of these tools perform comparison of subsets in a middleware and However, UDFs incur invocation overhead and are typically exe- as depicted in Figure 2, with the increase in size and number of cuted as a batch of statements where each statement is run sequen- attributes in the dataset, these tools incur large data movement as tially one after other with limited resources (e.g., parallelism, mem- well as serialization and deserialization overheads. This result in ory). Consequently, the performance of UDFs does not scale with poor latency and scalability.Additionally, these tools require data the increase in the number of tuples (see Figure 2). Furthermore, scientists to learn and switch to their domain-specific querying in- UDFs have limited interoperability with other operators, and are terfaces, thereby limiting their broad adoption. less amenable to logical optimizations, e.g., PK-FK join optimiza- tions over multiple tables. The question we pose in this work is: can we efficiently perform While comparative queries can be expressed using regular SQL, comparison between subsets of data within the relational databases such queries require complex combination of multiple subqueries. 1
Middleware (Outside DB) UDF-based Comparison (w.r.t. SQL SERVER) 75 COMPARE(Our approach) Improvement (%) 50 25 0 25 50 75 103 104 105 106 107 108 Number of Tuples Figure 2: Relative performance of different execution approaches for a comparative query w.r.t unmodified SQL S ERVER execution time (higher the better). The query finds a pair of origin airports that have the most similar departure delays over week trends in the flight dataset [1] The complexity makes it hard for relational databases to find ef- ficient physical plans, resulting in poor performance. While prior work have proposed extensions [14, 13, 22, 18, 44] such as group- ing variables, GROUPING SETS, CUBE; as we discussed in the later sections, expressing and optimizing grouping and comparison simultaneously remains a challenge. To describe the complexity using regular SQL, we use the following example. Figure 3: A SQL query for comparing subsets of data over different at- tribute combinations, depicting the complexity of specification using exist- Example. Consider a market analyst exploring sales trends across ing SQL expressions. different cities. The analyst generates a sample of visualizations depicting different trends, e.g., average revenue over week, aver- result for each comparison; however the join results in large inter- age profit over week, average revenue over country, etc., for a few mediate data—one tuple for each pair of matching tuples between cities. She notices that trends for cities in Europe look different the two sets, resulting in substantial overheads. from those in Asia. To verify whether this observation generalizes, she looks for a counterexample by searching for pairs of attributes 1.1 Overview of Our Approach over which two cities in Asia and Europe have most similar trends. Often, an Lp norm-based distance measure (e.g., Euclidean dis- In this paper, we take an important step towards making specifi- cation of the comparative queries easier and ensuring their efficient tance, Manhattan distance) that measures deviation between trends and distributions is used for such comparisons [43, 47, 19]. processing. To do so, we introduce a logical operator and exten- Figure 3 depicts a SQL query template for the above example. sions to the SQL language, as well as optimizations in relational The query involves multiple subqueries, one for each attribute pair. databases, described below. Within each subquery, subsets of data (one for each city) are ag- Groupwise comparison as a first class construct (Section 2 and gregated and compared via a sequence of self-join and aggregation 3). We introduce a new logical operation, C OMPARE (Φ), as a first functions that compute the similarity (i.e., sum of squared differ- class relational construct, and formalize its semantics that help cap- ences). Finally, a join and filter is performed to output the tuples ture a large class of frequently used comparative queries. We pro- of subsets with minimum scores. Clearly, the query is quite ver- pose extensions to SQL syntax that allows intuitive and more con- bose and complex, with redundant expressions across subqueries. cise specification of comparative queries. For instance, the compar- While comparative queries often explore and compare a large num- ison between two sets of cities C1 and C2 over n pairs of attributes: ber of attribute pairs [47, 28], we observe that even with only a few (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ) using a comparison function F can attribute pairs, the SQL specification can become extremely long. be succinctly expressed as C OMPARE [C1 C2 ][ (x1 , y1 ), (x2 , Furthermore, the number of groups to compare can often be y2 ), ..., (xn , yn )] USING F. As illustrated earlier, expressing the large—determined by the number of possible constraints (e.g., cities), same query using existing SQL clauses requires a UNION over n pairs of attributes, and aggregation functions—which grow signifi- subqueries, one for each (xi , yi ) where each subquery itself tends cantly with the increase in dataset size or number of attributes. This to be quite complex. Overall, while C OMPARE does not give ad- results in many subqueries with each subquery taking substantially ditional expressive power to the relational algebra, it reduces the long time to execute. In particular, while there are large oppor- complexity of specifying comparative queries and facilitates opti- tunities for sharing computations (e.g., aggregations) across sub- mizations via query optimizer and the execution engine. queries, the relational engines execute subqueries for each attribute- Efficient processing via optimizations (Section 4 and 5). We pair separately resulting in substantial overhead in both runtime as exploit the semantics of C OMPARE to share aggregate computa- well as storage. Furthermore, as depicted in subquery 1 in Figure tions across multiple attribute combinations, as well as partition 3, while each pair of groups (e.g., set of tuples corresponding to and compare subsets in a manner that significantly reduces the pro- each city) can be compared independently, the relational engines cessing time. While these optimizations work for any comparison perform an expensive self-join over a large relation consisting of function, we also introduce specific optimizations (by introducing all groups. The cost of doing this increases super-linearly as the a new physical operator) that exploit properties of frequently used number and size of subsets increases (discussed in more detail in comparison functions (e.g., Lp norms). These optimizations help Section 4.1). Finally, in many cases, we only need the aggregated prune many subset comparisons without affecting the correctness. 2
product =‘Inspiron’ product = ‘Inspiron’ city = ‘Tokyo’ revenue region = ‘Asia’ city =‘Tokyo’ revenue & region = ‘Asia’ city = ‘Paris’ city = ‘Paris’ revenue revenue & region = ‘Asia’ revenue revenue profit region = ‘Asia’ week week week week week week category revenue product = ‘G7’ & region = ‘Asia’ product = ‘Inspiron’ city = ‘London’ & region = ‘Asia’ city = ‘London’ city = ‘Seoul’ revenue region = ‘Asia’ city = ‘Beijing’ revenue revenue revenue profit profit profit week week country country week week category week product = ‘XPS’ & region = ‘Asia’ product = ‘Inspiron’ city = ‘Zurich’ city = ‘Zurich’ revenue city =‘Beijing’ revenue revenue & region = ‘Asia’ city = ‘Seoul’ revenue region = ‘Asia’ revenue revenue revenue week month month week week week week 1a: One to many comparisons over fixed X and 1b: One to one comparisons over varying X and Y 2a: Many to many comparisons over fixed X and 2b: Many to many comparisons over varying Y attributes attributes Y attributes X and Y attributes denotes comparison Figure 4: Illustrating comparative queries described in Section 2.1 Inter-operator optimizations (Section 6). We introduce new revenue over week trend in Asia is very dissimilar (typically mea- transformation rules that transform the logical tree containing the sured using Lp norms) to that of the Asia’s overall trend. There are C OMPARE operator along with other relational operators into equiv- visualization systems [48, 12, 43, 28] that support similar queries. alent logical trees that are more efficient. For instance, the at- Example 1b. In the above example, the analyst finds that the trend tributes referred in C OMPARE may be spread across multiple ta- for product ‘Inspiron’ is different from the overall trend for the re- bles, involving PK-FK joins between fact and dimension tables. To gion ‘Asia’. She finds it surprising and wants to see the attributes optimize such cases, we show how we can push C OMPARE below for which trends or distributions of Inspiron and Asia deviate the join that reduces the number of tuples to join. Similarly, we de- most. More precisely, she wants to compare ‘Inspiron’ and ‘Asia’ scribe how aggregates can be pushed below C OMPARE, how multi- over multiple pairs of attributes (e.g., average profit over country, ple C OMPARE operators can reordered and how we can detect and average quantitysold over week, ..., average profit over week) and translate an equivalent sub-plan expressed using existing relational select the one where they deviate the most. Such comparisons have operators to C OMPARE. inspired features such as Explain Data[4] in Tableau and tool such Implementation inside commercial database engine (Section 7). as SeeDB [47], Zenvisage [43], Voyager [49]. We have prototyped our techniques in Microsoft SQL S ERVER en- Example 2a. Consider another scenario: the analyst visualizes the gine, including the physical optimizations. Our experiments show revenue trends of a few cities in Asia and in Europe, and finds that that even over moderately-sized datasets (e.g., 10–20 GB) C OM - while most cities in Asia have increasing revenue trends, those in PARE results in up to 4× improvement in performance relative to Europe have decreasing trends. Again, as a counterexample to this, alternative approaches including physical plans generated by SQL she wants to find a pair of cities in these regions where this pattern S ERVER, UDFs, and middlewares (e.g., Zenvisage, SeeDB). With does not hold, i.e., they have the most similar trends. Such tasks the increase in the number of tuples and attributes, the performance involving search for similar pair of items are ubiquitous in data difference grows quickly, with C OMPARE giving more than a order mining [36] and time series [35, 10, 8, 30]. of magnitude better performance. Example 2b. In the above example, the analyst finds that the out- put pair of visualizations look different, supporting her intuition that perhaps no two cities in Europe and Asia have similar revenue 2. CHARACTERIZING COMPARATIVE over week trends. To verify whether this observation generalizes QUERIES when compared over other attributes, she searches for pairs of at- In this section, we first characterize comparative queries with the tributes (similar to ones mentioned in Example 1b) for which two help of additional examples drawn from visualization tools [19, 43, cities in Asia and Europe have most similar trends or distributions. 47] and data mining [35, 10, 8, 30]. Then, we give a formal defini- Such queries are common in tools such as Zenvisage [43], Quick- tion that concisely captures the semantics of comparative queries. Insights [19] that support finding outlier visualizations over a large set of attributes. 2.1 Examples In summary, the comparative queries in above examples fast- forward the analyst to a few visualizations that depict a pattern she We return to the example scenario discussed in introduction: a wants to verify—thereby allowing her to skip the tedious and time- market analyst is exploring sales trends of products with the help consuming process of manual comparison of all possible visualiza- of visualizations to find unusual patterns. The analyst first looks at tions. As illustrated in Figure 4, each query involves comparisons a small sample of visualizations, e.g., average revenue over week between two sets of visualizations (henceforth referred as Set1 and trends for a few regions (e.g., Asia, Europe) and for a subset of Set2) to find the ones which are similar or dissimilar. Each visu- cities and products within each region. She observes some unusual alization depicting a trend is represented via two attributes (X at- patterns and wants to quickly find additional visualizations that ei- tribute, e.g., week and a Y attribute, e.g., average revenue) and a set ther support or disprove those patterns (without examining all pos- of tuples (specified via a constraint, e.g., product = ‘Inspiron’). We sible visualizations). Note that we use the term "trend" to refer to now present a succinct representation to capture these semantics. a set of tuples in a more general sense where both categorical (e.g., country) or ordinal attributes (e.g., week) can be used for alignment for comparison. We consider several examples below in increasing 2.2 Formalization order of complexity. Figure 4 illustrates each of these examples, We formalize our notion of comparative queries and propose a depicting the differences in how the comparison is performed. concise representation for specifying such queries. Example 1a. The analyst notes that the average revenue over week trends for Asia as well as for a subset of products in that region look similar. As a counterexample, she wants to find a product whose 2.2.1 Trend 3
A trend is a set of tuples that are compared together as one unit. Definition 6 [Scorer]. Given two trends t1 and t2 , a scorer is any Formally, function that returns a single scalar value called ‘score’ measuring how t1 compares with t2 . Definition 1 [Trend]. Given a relation R, a trend t is a set of tuples derived from R via the triplet: constraint c, grouping g, measure m While we can accept any function that satisfies the above defi- and represented as (c)(g, m). nition as a scorer; as mentioned earlier, two trends are often com- pared using distance measures such as Euclidean distance, Manhat- Definition 2 [Constraint]. Given a relation R, a constraint is a con- tan distance [31, 47, 43]. Such functions are also called aggregated junctive filter of the form: (p1 = α1 , p2 = α2 , ..., pn = αn ) that distance functions [34]. All aggregated distance functions use a selects a subset of tuples from R. Here, p1 , p2 , ..., pn are attributes function DIFF(.) as defined below. in R and αi is a value of pi in R. One can use ‘ALL’ to select all values of pi , similar to [22]. Definition 7 [DIFF(m1 , m2, p)]. 1 Given a tuple with measure value m1 and grouping value gi in trend t1 and another tuple with mea- Definition 3 [(Grouping, Measure)]. Given a set of tuples selected sure value m2 and the same grouping value gi , DIFF(m1 , m2 , p) via a constraint, all tuples with the same value of grouping are = |m1 − m2 |p where p ∈ Z+ . Tuples with non-matching grouping aggregated using measure. A tuple in one trend is only compared values are ignored. with the tuple in another trend with the same value of grouping. Since m1 and m2 are clear from the definition of t1 and t2 and In example 1a, (R.region = ‘Asia’)(R.week, AVG(R.revenue)) is tuples across trends are compared only when they have same group- a trend in Set1, where (region = ‘Asia’) is a constraint for the trend ing and measure expressions, we succinctly represent and all tuples with the same value of grouping:‘week’ are aggre- DIFF(m1 , m2 , p) = DIFF(p) gated using the measure: ‘AVG(revenue)’. We currently do not Definition 8 [Aggregated Distance Function]. An aggregated dis- support range filters for constraint. tance function compares trends t1 : (ci )(gi , mi ) and t2 : (cj ) 2.2.2 Trendset (gi , mi ) in two steps: (i) first DIFF(p) is computed between every pairs of tuples in t1 and t2 with same values of gi , and (ii) all values A comparative query involves two sets of trends. We formalize of DIFF(p) are aggregated using an aggregate function AGG such this via trendset. as SUM, AVG, MIN, and MAX to return a score. An aggregated distance function is represented as AGG OVER DIFF(p). Definition 4 [Trendset]. A trendset is a set of trends. A trend in one trendset is compared with a trend in another trendset. For example, Lp norms2 such as Euclidean distance can be spec- ified using SUM OVER DIFF(2), Manhattan distance using SUM In example 1a, the first trendset consists of a single trend: (R.region OVER DIFF(1), Mean Absolute Deviation as AVG OVER DIFF(1), = ‘Asia’)(R.week, AVG(R.revenue))}, while the second trendset Mean Square Deviation as AVG OVER DIFF(2). consists of as many trends as there are are unique products in R: {(R.region = ‘Asia’, R.product = ‘Inspiron’) (R.week, AVG(R.revenue)), 2.2.4 Comparison between Trendsets (R.region = ‘Asia’, R.product = ‘XPS’)(R.week, AVG(R.revenue)), We extend Definition 5 to the following observation over trend- ..., (R.region = ‘Asia’, R.product = ‘G7’) (R.week, AVG( R.revenue))}. sets. As is the case in the above example, often a trendset contains one trend for each unique value of an attribute (say p) as a constraint, Observation 1 [Comparability between two trendsets] Given two all sharing the same (grouping, measure). Such a trendset can be trendsets T1 and T2 , a trend (ci )(gi , mi ) in T1 is compared with succinctly represented using only the attribute name as constraint, only those trends (cj )(gj , mj ) in T2 where gi = gj and mi = mj . i.e., [p][(g1 , m1 )]. If α1 , α2 , ..αn represent all unique values of p, Thus, given two trendsets, we can automatically infer which trends then, between the two trendsets need to be compared. We use T 1T 2 [p][(g1 , m1 )] ⇒ {(p = α1 )(g1 , m1 ), (p = α2 )(g1 , m1 ), ..., to denote the comparison between two trendsets T1 and T2 . For (p = αn )(g1 , m1 )} (⇒ denotes equivalence) example, the comparison in example 1a can be represented as: Similarly, [p1 , p2 = β][(g1 , m1 )] ⇒ {(p1 = α1 , p2 = β)(g1 , [region = ‘Inspiron’][(week, AVG(revenue))] [region = ‘Asia’, m1 ), (p1 = α2 , p2 = β)(g1 , m1 ), ..., (p1 = αn , p2 = β)(g1 , m1 )} product][ (week, AVG (revenue))] Alternatively, a trendset consisting of different (grouping, mea- If both T1 and T2 consist of the same set of grouping and mea- sure) combinations but the same constraint (e.g., p = α1 ) can be sure expressions say {(g1 , m1 ), ..., (gn , mn )} and differ only in succinctly written as: constraint, we can succinctly represent T1 T2 as follows: [(p = α1 )][(g1 , m1 ), ..., (gn , mn )] ⇒ {(p = α1 )(g1 , m1 ), ..., [c1 ][(g1 , m1 ), ..., (gn , mn )] [c2 ][(g1 , m1 ), ..., (gn , mn )] ⇒ (p = α1 )(gn , mn )} [c1 c2 ][(g1 , m1 ), ..., (gn , mn )] Thus, the comparison between trendsets in example 1a can be 2.2.3 Scoring succinctly expressed as: [(region = ‘Asia’) (region = ‘Asia’, product) ][(week, AVG(revenue))] We first define our notion of ‘Comparability’ that tells when two trends can be compared. Similarly, the following expression represents the comparison in example 1b. Definition 5 [Comparability of two trends]. Two trends t1 : (c1 )(g1 , [(region = ‘Asia’) (region = ‘Asia’, product = ‘Inspiron’)][(week, m1 ) and t2 : (c1 )(g2 , m2 ) can be compared if g1 = g2 and m1 = AVG(revenue)), (country, AVG(profit)), ... , (month, AVG(revenue))] m2 , i.e., they have the same grouping and measure. We can now define a comparative expression using the notions For example, a trend (R.product = ‘Inspiron’) (R.week, AVG( introduced so far. R.revenue)) and a trend (R.product = ‘XPS’)( R.month, AVG(R.profit)) 1 Note that DIFF is distinct from another operator [6] with similar cannot be compared since they differ on grouping and measure. name. 2 Next, we define a function scorer for comparing two trends. We ignore the pth root as it does not affect the ranking of subsets. 4
Table 1: Output of C OMPARE in Example 1a Table 2: Output of C OMPARE in Example 1b R1 P W C M V O score R1 P W V score Asia Inspiron True False False True False 40 Asia Inspiron False True False True False 20 Asia XPS True True 30 ... ... ... ... ... ... ... ... Asia Inspiron False False True True False 10 Asia Inspiron True True 24 ... ... ... ... ... Table 1 illustrate the output of this query. The first two columns Asia G8 True True 45 R1 and P identify the values of constraint for compared trends in T1 and T2. The columns W and V are Boolean valued denoting whether R.week and AVG(R.revenue) were used for the compared trends. Thus, the values of (R1, P, W, V) together identify the pairs Definition 9 [Comparative expression]. Given two trendsets T1 of trends that are compared. Since R.week and AVG(R.revenue) T2 over a relation R, and a scorer F, a comparative expression are grouping and measure for all trends in this example, their values computes the scores between trends (ci )(gi , mi ) in T1 and (cj ) are always True. Finally, the column score specifies the scores com- (gj , mj ) in T2 where gi = gj and mi = mj . puted using Euclidean distance, expressed as SUM OVER DIFF(2). Now, consider below the query for example 1b that compares 3. THE COMPARE OPERATOR tuples where (R.region = Asia) with tuples where (R.region = Asia) In this section, we introduce a new operator C OMPARE, that and (R.product = ’Inspiron’) over a set of (grouping, measure): makes it easier for data analysts and application developers to ex- press comparative queries. We first explain the syntax and seman- Listing 2: COMPAREXPR1B tics of C OMPARE and then show how C OMPARE inter-operates SELECT R1, P, W, C, V, ..., M, score with other relational operators to express top-k comparative queries FROM sales R as discussed in Section 2.1. COMPARE [((R.region = Asia) AS R1) (R1, (R.product = ’Inspiron’) AS P)][(R.week AS W, AVG(R.revenue) AS V), (R.country AS 3.1 Syntax and Semantics C, AVG(R.profit) AS O), ..., (R.month AS M, V)] USING SUM OVER DIFF(2) AS score C OMPARE, denoted by Φ, is a logical operator that takes as input a a comparative expression specifying two trendsets T1 T2 over relation R along with a scorer F and returns a relation R0 . Table 2 depicts the output for this query. The columns R1 and P Φ(R, T1 T2 , F) → R0 are always set to "Asia" and "Inspiron" since the constraint for all 0 R consists of scores for each pair of compared trends between trends in T1 and T2 are fixed. W, C, M, V, and P consist of Boolean the two trendsets. For instance, the table below depicts the out- values telling which columns among R.week, R.country, R.month, put schema (with an example tuple) for the C OMPARE expression AVG(R.revenue), and AVG(R.profit) were used as (grouping, mea- [c1 c2 ][(g1 , m1 ), (g2 , m2 )]. The values in the tuple indicate sure) for the pair of compared trends. that the trend (c1 = α1 )(g1 , m1 ) is compared with the trend (c2 = From above examples, it is easy to see that we can write queries α2 )(g1 , m1 ) and the score is 10. with C OMPARE expression for examples 2a and 2b as follows: c1 c2 g1 m1 g2 m2 score Listing 3: COMPAREXPR2A α1 α2 True True False False 10 SELECT R1, C1, R2, C2, W, V, score ... ... ... ... ... ... ... FROM sales R COMPARE [((R.Region = Asia) AS R1, (R.city) AS C1) ((R.Region = Europe) AS R2, (R.city) AS C2)][R.week AS W, We express the C OMPARE operator in SQL using two extensions: AVG(R.revenue) AS V] C OMPARE, and USING: USING SUM OVER DIFF(2) AS score COMPARE T1 T2 Listing 4: COMPAREXPR2B USING F SELECT R1, C1, R2, C2, W, C, V, ..., M, score For instance, for example 1a, the comparison between the AVG(revenue) FROM sales R over week trends for the region ‘Asia’ and each of the products in COMPARE [((R.Region = Asia) AS R1, (R.city) AS C1) ((R.Region = Europe) AS R2, (R.city) AS C2)][(R.week AS W, AVG(R.revenue) AS region ’Asia’ can be succinctly expressed as follows: V), (R.country AS C, AVG(R.profit) AS O), ..., (R.month AS M, V)] USING SUM OVER DIFF(2) AS score Listing 1: COMPAREXPR1A SELECT R1, P, W, V, score Note that C OMPARE is semantically equivalent to a standard FROM sales R COMPARE [((R.region = Asia) AS R1) (R1, R.product AS P)] relational expression consisting of multiple sub-queries involving [R.week AS W, AVG (R.revenue) AS V] union, group-by, and join operators as illustrated in introduction. USING SUM OVER DIFF(2) AS score As such, C OMPARE does not add to the expressiveness of relational algebra SQL language. The purpose of C OMPARE is to provide a Here T1 = [((R.region = Asia) AS R1)][R.week AS W, AVG succinct and more intuitive mechanism to express a large class of (R.revenue) AS V] and T2 = [((R.region = Asia) AS R1, R.product frequently used comparative queries as shown above. For example, AS P)][R.week AS W, AVG (R.revenue) AS V]. Observe that T1 expressing the query in Listing 2 using existing SQL clauses (see and T2 share the same set of (grouping, measure) and the filter Figure 3) is much more verbose, requiring a complex sub-query for predicate (R.region = Asia) in their constraints, thus it is concisely each (grouping, measure). Prior work have proposed similar suc- expressed as [((R.region = Asia) AS R1)(R1, R.product AS cinct abstractions such as GROUPING SETs [17] and CUBE [22] P)][R.week AS W, AVG (R.revenue) AS V]. (both widely adopted by most of the databases) and more recently 5
DIFF [6], which share our overall goal that with an extended syn- The SELECT clause only outputs the values of columns for which tax, complex analytic queries are easier to write and optimize. corresponding trends has the highest score, setting NULL for other Furthermore, the input to C OMPARE is a relation, which can ei- columns to indicate that those columns were not part of top-1 pair ther be a base table or an output from another logical operator (e.g., of trends. This idea of setting NULL is borrowed from prior work join over multiple tables); similarly the output relation from C OM - on Gray et al. on computing CUBE [22]. Nevertheless, an alterna- PARE can be an input to another logical operator or the final output. tive is to output values of all columns, and add (S.W, S.M, S.C, S.P, Thus, C OMPARE can inter-operate with other operators. In order S.V) (as in the previous example) to the output to indicate which to illustrate this, we discuss how we C OMPARE inter-operates with columns were part of the comparison between top-1 pair of trends. other operators such as join, filter to select top-k trends. While C OMPARE outputs the scores for each pair of compared trends, 4. OPTIMIZING COMPARATIVE QUERIES comparative queries often involve selection of top-k trends based In this section, we discuss how we optimize a logical query plan on their scores (Section 2.1). consisting of a C OMPARE operation. Microsoft SQL S ERVER uses a cost-based optimizer for generating the physical plan for an in- 3.2 Expressing Top-k Comparative Queries put logical plan. We extend the optimizer to replace C OMPARE In this section, we show how we can use the above-listed C OM - with a sub-plan of existing physical operators such that cost of the PARE sub-expressions (referred by COMPAREXPR1A, COMPAR- sub-plan is least. This is done in two steps: we first transform EXPR1B, COMPAREXPR2A, and COMPAREXPR1B) with LIMIT C OMPARE into a sub-plan of existing logical operators. These log- and join to select tuples for trends belonging to top-k. ical operators are then transformed into physical operators using Example 1a. The following query selects the tuples of a prod- existing rules to compute the cost of C OMPARE. The cost of the uct in region ‘Asia’ that has the most different AVG(revenue) over sub-plan for C OMPARE is combined with costs of other physical week trends compared to that of region ‘Asia’ overall. COMPAR- operators to estimate the total cost of the query. We state our prob- EXPR1A refers to the sub-expression in Listing 1. lem formally: SELECT T.product, T.week, T.revenue, S.score FROM sales T JOIN Problem 4.1. Given a logical query plan consisting of C OMPARE (SELECT * FROM COMPAREXPR1A operation: Φ(R, [c1 c2 ] [(d1 , m1 ), ..., (dn , mn )], F) → R0 , ORDER BY score DESC replace C OMPARE with a sub-plan of physical operators with the LIMIT 1) AS S lowest cost. WHERE T.product = S.P For ease of exposition, we assume that both trendsets contain the The ORDER BY and LIMIT clause select the top-1 row in Ta- same set of trends, one for each unique value of c, i.e., c1 = c2 = c. ble 1 with the highest score with P consisting of the most similar product. Next, a join is performed with the base table to select all 4.1 Basic Execution tuples of the most similar product along with its score. We start with a simple approach that transforms C OMPARE into Example 2a. The query for example 2a differs from example 1a a sub-plan of logical operators. The sub-plan is similar to the one in that both trendsets consist of multiple trends. Here, one may be generated by database engines when comparative queries are ex- interested in selecting tuples of both cities that are similar, thus we pressed using existing SQL clauses (discussed in Section 1). We use the WHERE condition (T.city = S.C1 AND T.Region = S.R1) perform the transformation using the following steps, with each OR (T.city = S.C2 AND T.Region = S.R2). (S.R1, S.R2, S.C1, step using the output from previous step as input. S.C2) in SELECT clause identifies the pair of compared trends. (1) ∀(di , mi ): Ri ← Group-byc,di Aggmi (R) (2) ∀ Ri : Rij ←1Ri .c!=Ri .c,Ri .di =Ri .di (Ri ) SELECT T.Region, T.city, T.week, T.revenue, S.R1, S.C1, S.R2, S.C2, (3) ∀ Rij : Rijk ← Group-byci ,cj AggUDAF (Rij ) // ci , cj are S.score aliases of column c FROM sales T JOIN (SELECT * FROM COMPAREXPR2A (4) R0 ← Union All(Rijk ) i,j,k ORDER BY score First, for each (grouping, measure) combination, we add a group- LIMIT 1) AS S by aggregate (e.g., GROUP BY product, week, AGG on AVG(revenue)) WHERE (T.city = S.C1 AND T.Region = S.R1) OR (T.city = S.C2 AND T.Region = S.R2) to create trendsets. Then, on the output of group-by aggregate, we join tuples between two trends that have the same value of grouping (e.g., 1R’.product !=R’.product, R’.week = R’.week )). Finally, the score between Examples 1b and 2b. These examples extend the first two exam- each pair of trends is computed by applying a user-defined aggre- ples to multiple attributes. We show the query for example 2b; it’s gate (UDA) F. This is done by first partitioning the join output a complex version of (example 1b) where trends in each trendsets to group together each pair of trends. Each partition consisting of are created by varying all three: constraint, grouping, measure (ex- pair of trends is then aggregated using F. Finally, the scores for all ample 1b has a fixed constraint for each trendset). pairs of trends are aggregated via Union All. SELECT T.city, S.R1, S.R2, S.C1, S.C2, Unfortunately, this approach does not scale well as the size of CASE WHEN S.W THEN T.week ELSE NULL END, the input dataset and the number of (grouping, measure) combi- ... nations become large. There are two issues. First, aggregations CASE WHEN S.V THEN T.revenue ELSE NULL END, across (grouping, measure) are performed separately, even when S.score there are overlaps in the subset of tuples being aggregated. Second, FROM sales T JOIN (SELECT * FROM COMPAREXPR2B while each trend is compared with other trends independently, the ORDER BY score second step typically involves a join at the granularity of trendsets, LIMIT 1) AS S i.e., a relation consisting of one trendset is joined with a relation WHERE (T.city = S.C1 AND T.Region = S.R1) OR (T.city = S.C2 AND consisting of another trendset. The cost of join increases rapidly T.Region = S.R2) as the number of trends being compared and the size of each trend 6
3. Trend-wise UNION ALL comparison ...... ...... ...... ...... ...... Latency (s)50 ⨝ ⨝ ......⨝ ⨝ ......⨝ ......⨝ ......⨝ ......⨝ .... Grp-by {c1d1} Grp-by {c2d1} .... Grp-by {cnd1} ......... ......... ......... 25 Agg on {m1} Agg on {m1} Agg on {m1} ......... Partition Partition Partition ......... ......... on c on c on c cd6m6 cd4m4 cd1m1 20 18 16 14 12 10 8 6 4 21 Partition Horizontal Vertical Partition on (dimi) ......... Partitioning on (dimi) Number of group-by expressions Partitioning 2. Partitioning ......... after merging TrendSet into Trends 1. Merging Group-by {c d1d6d4} Aggregate on {m1m6m4} Group-by {c d2dn} Aggregate on {m2mn} ......... aggregates (a) Variation in performance as we merge group-by ag- R gregates to share computations Figure 6: Optimized query plan generated after applying merging and par- Trendsetwise Join titioning on basic query plan in Figure 7. 300 Partitioning + Trendwise Join Latency (s) impact of merging on the cost of subsequent comparison between 200 Only Partitioning trends; ignoring which can lead to sub-optimal plans as we describe 100 shortly. We first introduce the second optimization for comparison. Trendwise Comparison via Partitioning. The second optimiza- 101 102 103 104 105 tion is based on the observation that pairwise joins of multiple Number of trends per trendset smaller relations is much faster than the a single join between two large relations. This is because the cost of join increases super- (b) Improvements due to trendwise join after partition- linearly with the increase in the size of the trendsets. In addi- ing trendset into trends (the size of each trend is fixed to tion to improvement in complexity, trend-wise joins are also more 1000 tuples) amenable to parallelization than a single join between two trend- sets. Figure 5b depicts the difference in latency for these two ap- Figure 5: Improvement in performance due to merging group-by proaches as we increase the number of trends from 10 to 105 (each aggregates and trendwise comparison (via partitioning) of size 1000). The black dotted line shows the partitioning over- head incurred while creating partitions for each trend, showing that the overhead is small (linear in n) compared to the gains due to increases (see Figure 5b). We next discuss how we address these trendwise join. Moreover, this is much smaller than the overhead issues via merging and partitioning optimizations. incurred when partitioning is performed on the join output (∝ n2 ) in the basic plan (see step 3 in Figure 5). 4.2 Merging and Partitioning Optimization Figure 7 depicts the query plan after applying the above two op- To generate a more efficient plan, we adapt the sub-plan gener- timizations on the basic query plan. First,we merge multiple group- ated above using two steps. First, we merge group-by aggregates by aggregates to share computations (using the approach discussed into a fewer number of group-by aggregates. Next, we partition below). Then, we partition the output of merged group-by aggre- the output of group-by aggregates to create one relation for each gates into smaller relations, one for each trend. This is followed trend, which allows parallel join and comparison for scoring. We by joining and scoring between each pair of trends independently describe each of these optimizations and then present an algorithm and in parallel. Observe that the merging of group-by aggregates that jointly optimizes them to find an overall efficient plan. results in multiple trends with overlapping (grouping, measure) in Merging group-by aggregates. The first optimization shares the the output relation. Hence, we apply the partitioning in two phases. computations across a set of group-by aggregates, one for each In the first phase, we partition it vertically, creating one relation (grouping, measure), by merging them into fewer group-by aggre- for each (grouping, measure). In the second phase, we partition gates. Many (grouping, measure) often share a common group- horizontally, creating one relation for each trend. ing column, e.g., [(day, AVG(revenue), (day, AVG(profit)] or have Joint Optimization of Merging and Partitioning. As depicted in correlated grouping columns (e.g., [(day, AVG(revenue), (month, Figure 6b, the cost of partitioning increases with the increase in the AVG(revenue)]) with high degree of overlapping tuples across trends. size of its input. The input size is proportional to the number of To illustrate, we considered a set of 20 group-by aggregates in the unique group-by values, which increases with the increase in the flights [1] dataset, computing avg(ArrivalDelay), avg(DepDelay), number of merging of group-by aggregates. Thus, when the input ..., avg(Duration) grouped by day, week, ..., airport. As depicted in becomes large, the cost of partitioning dominates the gains due to Figure 5a, by merging them (using an approach discussed shortly) merging. It is therefore important to merge group-by aggregates into 12 aggregates, the latency improves by 2×. Furthermore, we such that the overall cost of computing group-by aggregates, parti- can see that after a certain number of merges, the performance de- tioning and trendwise comparison together is minimal. grades. This is because the sharing of tuples decreases as we merge In order to find the optimal merging and partitioning, we follow additional grouping columns, which in turn results in the larger in- a greedy approach as outlined in Algorithm 1. Our key idea is to crease in the output size of group-by aggregates. merge at the granularity of sub-plans instead of the group-by ag- Finding the optimal merging of group-by aggregates is NP- Com- gregates. We start with a set of sub-plans, one for each (grouping, plete [7]. Prior work on optimizing GROUPING SETS compu- measure) as generated by the basic execution strategy discussed tation [17] have proposed best-first greedy approaches that merge earlier and merge two sub-plan at a time that lead to the maximum those group-by aggregates first that lead to maximum decrease in decrease in cost. the cost. Unfortunately, in our setting, we also need to consider the Formally, if the two sub-plans operate over (d1 , m1 ) and (d2 , 7
m2 ) respectively, we merge them using the following steps (illus- the two trends as we discuss below. Given the upper and lower trated in Figure 6): bound scores of each pair, we find a pruning threshold T on the (1) R1 ← Group-byc,d1 ,d2 Aggm2 ,m2 [R] // merge group-by aggre- lowest possible top k score, as the kth largest lower bound score gates across all pairs . Any pair with its upper bound score smaller than (2) ∀ (di , mi ): Ri ← Π(di ,mi ) (R1) // vertical partitioning T can thus be pruned. 3. Finally, we perform a join between the (3) ∀ i: Rij ← Partition Ri ON c // horizontal partitioning, one tuples of trends that are not pruned. partition for each value of c [28, 30] [15, 19] [26, 29] [28, 30] [26, 29] (4) ∀ i, j: Ri0 j 0 ← Group-bycj ,di Aggmi [Rij ] // aggregate again 3 ⨝ ⨝ ⨝ ⨝ ⨝ (5) ∀ i0 , j 0 , k: Ri0 j 0 k ← Ri0 j 0 1di Ri0 k //partitition-wise join 2 0 (6) ∀ i0 , j 0 , k: Ri0 j 0 k ← AggUDAF (Ri0 j 0 k ) // compute scores Bitmap + Summary Bitmap + Summary Bitmap + Summary 0 Aggregates Aggregates Aggregates 0 (7) R ← Union 0 0 All(Ri0 j 0 k ) 1 i ,j ,k P1 P2 P3 P1 P2 P3 We first merge group-by aggregates, followed by creating one partitions for each trend using both vertical and horizontal par- Figure 7: Illustrating pruning for DIFF-based comparisons titioning. Then, we join pairs of trends and compute the score. We empirically show in Section 8 that while the pruning incurs For computing the cost of the merged sub-plan, we use the opti- an overhead of first computing the summary aggregates and bitmap mizer cost model. The cost is computed as a function of avail- for each candidate trend, the gains from skipping tuple comparisons able database statistics (e.g., histograms, distinct value estimates), for pruned trends offsets the overhead. Moreover, the summary which also captures the effects of the current physical design, e.g., aggregates of each trend can be computed independently in parallel. indexes as well as degree of parallelism (DOP). We merge two sub Computing Bounds. The simplest approach is to create a sin- plans at a time until there is no improvement in cost. gle set of summary aggregates for each trend as depicted in Fig- ure 8b. The gray and yellow blocks depict the summary aggregates for two trends respectively, consisting of COUNT (computed using Algorithm 1 Merge-Partition Algorithm bitmaps), SUM, MIN, and MAX in order. 1: Let B be a basic sub-plan computed from Φ as described in Section 4.1 First, for deriving the lower bound, we prove the following useful 2: while true do 3: C ← OptimizerCost(B) property based on the convexity property of DIFF functions (see 4: Let si ∈ S be a sub-plan in B consisting of a sequence of group-by [2] for the proof). aggregate, join and partition operations over (di , mi ) 5: Let M P = Set of all sub-plans obtained by merging a pair of sub- Theorem 1. ∀ DIFF(m1 , m2 , p), plans in S as described in Section 4.2 AVG (DIFF(m1 , m2 , p)) ≥ DIFF(AVG (m1 ),AVG (m2 ), p) 6: Let Bnew be the sub-plan in M P with lowest cost (Cnew ) after This essentially allows us to apply DIFF on the average values merging two sub-plans si , sj of each trend to get a sufficiently tight lower bounds on scores. For 7: if Cnew > C then example, in Figure 8b, we get a lower bound of 1700 for a score of 8: break; 9: end if 1717 for the two trends shown in Figure 8b. 10: C ← Cnew For the upper bound, it is easy to see that the maximum value 11: B ← Bnew of DIFF(m1 , m2 , 2) between any pairs of tuples in R and S is 12: end while given by: MAX ( |MAX (m1 ) − MIN (m2 )|, |MAX (m2 ) − MIN 13: Return B (m1 )|). Given that DIFF(m1 , m2 , 2) is Non-negative and Mono- tonic, we can compute the upper bound on SUM by multiplying the the MAX (DIFF(m1 , m2 , 2)) by COUNT. For example, in Fig- 5. OPTIMIZING DIFF-BASED COMPARI- ure 8b, we get an upper bound of 6400. SON Multiple Piecewise Summaries. Given that the value of measure While the approach discussed in the previous section works for can vary over a wide range in each trend, using a single sum- any arbitrary scorer, we note that for top-k comparative queries in- mary aggregate often does not result in tight upper bound. Thus, volving aggregated distance functions (defined in Section 2.2) such to tighten the upper bounds, we create multiple summary aggre- as Euclidean distance, we can substantially reduce the cost of com- gates for each trend, by logically dividing each trend into a se- parison between pairs of trends. We first outline the three properties quence of l segments, where segment i represents tuples from in- of DIFF(.) function that we leverage for optimizations. dex: (i − 1) × nl + 1 to i × nl where n is the number of tuples 1. Non-negativity: DIFF( m1 , m2 , p) ≥ 0 in the trend. Instead of creating a single summary, we compute the set of same summary aggregates over each segment, called seg- 2. Monotonicity: DIFF(m1 , m2 , p) varies monotonically with the ment aggregates. For example, Figure 8c depicts two segment ag- increase or decrease in |m1 − m2 |. gregates for each trend, with each segment representing a range 3. Convexity: DIFF( m1 , m2 , p) are convex for all p. of 8 tuples. The bounds between a pair of matching segments is computed in the same way as we described above for a single sum- 5.1 Summarize → Bound → Prune mary aggregates. Then, we sum over the bounds across all pairs Overview. We introduce a new physical operator that minimizes of matching segments to get the overall bound (see [2] for formal the number of trends that are compared using the following three description). As we can see in Figure 8c, increasing the number of steps (illustrated in Figure 8). 1. We summarize each trend inde- segments from 1 to 2 improves the bounds from [1700, 6400] to pendently using a set of three aggregates: SUM, MIN and MAX [1702, 4624], substantially reducing the upper bound. and a bitmap corresponding to the grouping column. 2. Next, we To estimate the number of summary aggregates for each trend, intersect the bitmaps between trends to compute the COUNT of we use Sturges formula, i.e., (b1 + log2 (n)c) [42], which assumes matching tuples between them. We use the COUNT along with the normal distribution of measure values for each trend. Because three aggregates to compute the upper and lower bounds between of its low computation overhead and effectiveness in capturing the 8
18 18 14 18 18 16 14 14 10 14 12 10 13 13 14 14 16, 229, 10, 18 8, 129, 13, 18 8, 100, 10, 14 Score = 1717 Bounds = [1700, 6400] Bounds = [1702, 4624] 26 23 23 29 30 28 24 25 27 24 24 20 21 25 20 22 16, 394, 20, 30 8, 211, 23, 30 8, 183, 20, 27 (b) Bounds on score us- (c) Bounds on score using (a) Exact score on comparing two trends ing a single summary two-segment summaries Figure 8: Using summaries to bound scores. F = SUM OVER DIFF(2). Each value in (a) corresponds to a single tuple in a trend. distribution or trends of values, Sturges formula is widely used in Algorithm 2 Pruning Algorithm for DIFF-based Comparison the statistical packages for automatically segmenting or binning 1: Compute SegAgg and bitmaps for each candidate trend ci data points into fewer groups. We empirically show in Section 8 2: for each pair of trends ci , cj do (see Figure 11a) that (b1 + log2 (n)c) number of segments per trend 3: Compute bounds on scores (Section 5.1) results in sufficiently tight bounds for effective pruning, without 4: Update PQS 5: end for adding significant computational overhead. 6: for each candidate trend ci do 7: If ((ci , cj ) upper bound < PQS .Top()) Continue; 5.2 Early Termination 8: Initialize (ci , cj ) TState and push to PQP For early termination of the top-k trends search among the un- 9: end for 10: while size of PQP > k do pruned trends, we assign an utility to each of the unpruned trends 11: (ci , cj ) = PQP .Top() that tells how likely they are going to be in the top-k. 12: Compare a segment ci , with that of cj For estimating the utility of trends , we use the bounds computed 13: Update bounds and PQS using segment aggregates in the previous step. Specifically, for se- 14: If ((ci , cj ) upper bound < PQS .Top()) Continue; lecting top-k trends in descending order of their scores, a trend with 15: Push (ci , cj ) to PQP higher upper bound score has a higher utility. On the other hand, 16: end while 17: Return Top k trend pairs from PQP for top-k trends in ascending order of their scores, a trend with the smallest lower bound has a higher utility. The processing of higher utility trends leads to the faster improvement in the pruning thresh- fetch the trend with the highest upper bound score ((line 11)), and old T, thereby minimizing wastage of tuple comparisons over low following the process outlined in Section 5.2, compare one of its utility trends. segments with a segment from the reference trend (line 12). After Moreover, the utility of a trend can increase or decrease after processing a segment, we check if the current trend pair is pruned or comparing a few tuples in a candidate trend. Hence, instead of if there is another trend pair with higher upper bound (line 14–15). processing the entire trend in one go, we process one segment of This process is continued until we are left with k pairs of trends . a trend at a time, and then updates the bounds to check (i) if the Finally, we output values of k pairs of trends with highest scores trend can be pruned, or (ii) if there is another trend with better (line 17). upper bounds that it should switch to. As we show in Section 8, incrementally comparing high utility trends first leads to pruning Memory Overhead. Given a relation of n tuples consisting of of many trends without processing all of their tuples. p trends, Φp creates p × log(n/p) segment aggregates, each seg- ment aggregate consisting of 4 aggregates values. In addition, the 5.3 Putting It All Together operator maintains a TState consisting of a fixed number of ele- In order to support the above optimizations, we implemented a ments per trend, including the largest k lower bounds on scores. new physical operator, Φp , that takes as input the partitions (one for Thus, the overall space overhead is O(p × log(n/p)) + 10 × p each trend), and replaces the join and F in query plans discussed + k) ≈ O(p × log(n/p)). Thus, for an input relation, consist- in Section 4. It outputs a relation consisting of tuples that identify ing of 100 million tuples with 100 trends and 3 attributes, of total the top-k pairs of trends as discussed in Section 3. The algorithm size about 2.5 GB, Φp incurs an overhead of approximately 64K used by the operator makes use of four data structures: (1) SegAgg floating point numbers (< 1 MB). : An array where index i stores summary aggregates for segment i. There is one SegAgg per trend. (2) TState : The state of a trend, 6. ADDITIONAL ALGEBRAIC RULES consisting of the current upper and lowers bounds on the score, The query optimizer in Microsoft SQL S ERVER relies on alge- as well as the next segment within the trend to be compared next. braic equivalence rules for enumerating query plans to find the plan There is one TState for each trend, and is updated after comparing with the least cost. When C OMPARE occurs with other logical oper- each segment. (3) PQP : a MAX priority queue that keeps track ators, we present five transformation rules (see Table 3) that reorder of the trend pairs with the highest upper bound. It is updated after Φ with other operators to generate more efficient plans. comparing each segment. (4) PQS : a MIN priority queue that R1. Pushing Φ below join. Data warehouses often have a snowflake keeps track of the trend pairs with the smallest lowest bound. It is or star schema, where the input to C OMPARE operation may involve updated after comparing each segment. a PK-FK join between fact and dimension tables. If one or more Algorithm 2 depicts the pseudo-code for a single threaded im- columns in Φ are the PK columns or have functional dependen- plementation. We first compute the segment aggregates for trends cies on the PK columns in the dimension tables , Φ can be pushed (line 1). For each pair of trends, we compute the bounds on scores down below the join on fact table by replacing the dimension tables as discussed in Section 5.1, and update PQS to keep track of top columns with the corresponding FK columns in the fact table (see k lower bounds (lines 2—5). The upper bound for each trend pair Rule R1 in Table 3.) For instance, consider example 1a in Section is compared with PQS .Top() to check if it can be pruned (line 7). 2.1 that finds a product with a similar average revenue over week If not pruned, the TState is initialized and pushed to PQP (line trend to ‘Asia’. Here, revenue column would typically be in a fact 8). Once the TState of all unpruned trends are pushed to PQP , we table along with foreign key columns for region, product and year. 9
Table 3: Algebraic equivalence rules for Φ: COMPARE [P1 = α1 |P2=α2 ][(M1, M2)] USING A OVER F Rule Equivalence rule Pre-condition R1 Φ(R 1T S) ≡ (Φk (R)) 1T S T: PK in S; T ⊆ {P1,P2, M1, M2} ⊆ R ;Φk : Φ with PK columns replaced with FK columns R2 ΥG,A (Φ(R)) ≡ Φ(ΥG,A (R)) {P1, P2, M1, M2} ⊆ G and A ∈ {MAX,MIN } R3 σC (Φ(R)) ≡ Φ(σC (R)) C ⊆ {P1, P2} R4 Φ2 (Φ1 (R)) ≡ Φ1 (Φ2 (R))) Φ1 : P1 , P2 = Φ2 : P1 , P2 R5 (LIMITK Υ(ΠDIFF(R.X, L.X, p) (σP =α R) 1 R)) 1 R ≡ Φ(R) None In such cases, we can push Φ below the join by replacing dimen- DIFF-based comparisons can be further optimized by adding a new sion table columns (e.g., product, week) values with corresponding physical operator that first computes the upper and lower bounds on PK column values. the scores of each trend, which can then be used for pruning parti- R2. Pushing Group-by Aggregate (Υ) below Φ to remove du- tions without performing costly join. plicates. When an aggregate operation occurs above a C OMPARE Robustness to physical design changes. A large part of C OM - operation, in some cases we can push the aggregate operation be- PARE execution involves operators such as group-by, joins and par- low the C OMPARE to reduce the size of each partition. In particular, tition (See Figure 6). Hence, the effect of physical design changes consider an aggregate operation ΥG,A with group by attributes G on C OMPARE is similar to their effect on these operators. For in- and aggregate function A such that all columns used in Φ are in G. stance, since column-stores tend to improve the performance of Then, if all aggregation functions in Φ ∈ {MAX, MIN }, we can group-by operations, they will likely improve the performance of push Υ below Φ as per the Rule R2 in Table 3. Pushing aggregation C OMPARE. Similarly, if indexes are ordered on the columns used operation below Φ reduces the size of each partition by removing in constraints or grouping, the optimizer will pick merge join over the duplicate values. hash-join for joining tuples from two trends. Finally, if there is a R3. Predicate pushdown. A filter operation (σ) on partition col- materialized view for a part of the C OMPARE expression, modern umn (e.g., product) can be pushed down below Φ, to reduce the day optimizers can match and replace the part of the sub-plan with number of partitions to be compared. While predicate pushdown in a scan over the materialized view. We empirically evaluate the im- a standard optimization, we notice that optimizers are unable to ap- pact of indexes on C OMPARE implementation in Section 8. ply such optimizations when the C OMPARE are expressed via com- Applications of C OMPARE. C OMPARE is meant to be used by plex combination of operations as described in Section 1. Adding data analysts as well as applications to issue comparative queries an explicit logical C OMPARE, with a predicate pushdown rule makes over large datasets stored in relational databases. It has two advan- it easier for the optimizer to apply this optimization. Note that if σ tages over regular SQL and middleware approaches (e.g., Zenvis- involves any attribute other than the partitioning column, then we age, Seedb). First, it allows succinct specification of comparative cannot push it below Φ. This is because the number of tuples for queries which can be invoked from data analytic tools supporting partitions compared in Φ can vary depending on its location. SQL clients. Second, it helps avoid data movement and serializa- R4. Commutativity. Finally, a single query can consist of a chain tion and deserialization overheads, and is thus more efficient and of multiple C OMPARE operations for performing comparison based scalable. We classify the applications into three categories: on different metrics (e.g., comparing products first on revenue, and BI Tools. BI applications such as Tableau and Power BI do not pro- then on profit). When multiple Φ operations on the same parti- vide an easier mechanism for analysts to compare visualizations. tioning attribute, we can swap the order such that more selective However, for supporting complex analytics involving multiple joins C OMPARE operation is executed first. and sub-queries, these tools support SQL querying interfaces. For R5. Reducing comparative sub-plans to Φ. Finally, we extend comparative queries, users currently have to either write complex the optimizer to check for an occurrence of the comparative sub- SQL queries as discussed in Introduction, or generate all possible expression specified using existing relational operators to create an visualizations and compare them manually. With C OMPARE, users alternative candidate plan by replacing the sub-expression with Φ. can now succinctly express such queries (as illustrated in Section In order to do so, we add the equivalence rule R5 where the expres- 3) for in-database comparison. sion on the left side represents the sub-expression using existing Notebooks. For large datasets stored in relational databases, it is relational operators. This rule allows us to leverage physical op- inefficient to pull the data into notebook and use dataframe APIs timizations for comparative queries expressed without using SQL for processing. Hence, analysts often use a SQL interface to access extensions. and manipulate data within databases. While one can also expose Python APIs for comparative queries and automatically translate them to SQL, such features are limited to the users of the Python 7. DISCUSSION library. SQL extensions, on the other hand, can be invoked from We discuss the generalizability and robustness of our proposed multiple applications and languages that support SQL clients. Fur- optimizations as well as potential applications of C OMPARE. thermore, in the same query, one can use C OMPARE along with Generalizability of optimizations. Our proposed optimizations in other relational operators such as join and group-by that are fre- Section 4 deal with replacing C OMPARE to a sub-plan of logical quently used in data analytics (see Section 3.2). and physical operators within existing database engines. These op- Visual analytic tools. Finally, there are visual analytic tools such as timizations can be incorporated in other database engines support- as Zenvisage and Seedb that perform comparison between subsets ing cost-based optimizations and addition of new transformation of data in a middle-ware. With C OMPARE, such tools can scale to rules. Concretely, given a C OMPARE expression, one can generate large datasets and decrease the latency of queries as we show in a sub-plan using Algorithm 1 and transformation rules implement- Section 8. ing steps outlined in Section 4.1 and Section 4.2. Furthermore, we discuss additional transformation rules (see Table 3) in Section 6 that optimize the query when C OMPARE occurs along with other logical operators such as join, group-by, and filter. We show that 8. PERFORMANCE EVALUATION 10
You can also read