An
Effective Aggregation Methodology
Michael Kamfonas
Copyright © InfoKarta inc
The Double-Dipping Query Pattern
The classic form of the fact-dimension join works
for queries written against the base fact. In order to redirect queries to
pre-joined MQTs, the columns involved in the query must be embedded in the MQT.
This becomes impractical with names and long descriptions for several levels
per dimension since if embedded in the MQTs they would make their rows
extremely long. A more appropriate query pattern is to make a double dip to
each dimension. First, using a core set of identifier fields that are carried
in the MQTs to filter and group the data, and second to pick-up descriptions
and names, by rejoining back to the dimension after the grouping. This pattern
can be either explicitly introduced by the query designer or implicitly induced
by the rewrite step of the optimizer cued by informational functional
dependency constraints. This article explores this “double-dipping” query
pattern in the context of realistic cube design, and discuses the pros and cons
of designing it explicitly or relying on the optimizer to invoke it
implicitly.
Q1: Select
P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, C.YR, C.WK, S.SUM(AMT) From
SALES S join PROD P on S.UPC_ID=P.UPC_ID join LOC L on S.STR_ID=L.STR_ID join CAL C on S.DT_ID=C.DT_ID where
P.CAT_ID = 101 and L.RGN_ID = 205 and C.YR = 2005 and C.WK = 40 group
by P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, C.YR, C.WK |
A typical dimensional query pattern involves a fact table and one join per dimension. Let’s take a Sales fact, and three dimensions: location, product and calendar. The dimensions typically include multiple levels, each level characterized at least by an identifier and a name, often including a long description or special attributes particular to the level. For example, the product dimension may have UPC as the base level rolling into subclass (SCL), to class (CLS) and to category (CAT). Each of the levels has a token ID generated by the system, a name (NM) and a long description (DSC). The location dimension may include stores (STR) at the bottom level, which carry ZIP code, and open date (OPN_DT) as special attributes, with district (DST), region (RGN), division (DIV) etc. as its higher levels. A typical drill-down query would select a cell of the OLAP cube by filtering on each dimension, and summing quantities and amounts after grouping at the filtering level or lower along each dimension. For example, lets view week 40 sales of major category 101 for region 205 broken down by store and product class. Let’s assume that the displayed results must include class descriptions, store names and ZIP code. The query would look something like Q1.
Let’s also assume the existence of an MQT at the weekly level by store by subclass, which of course we want to use. Here are a couple MQT definitions that could be used to successfully redirect our query:
MQT01
as ( Select
P.SCL_ID,P.SCL_NM,P.SCL_DSC,P.CLS_ID,P.CLS_NM, P.CLS_DSC,P.CAT_ID..., S.STR_ID, C.WK_ID,
C.YR, C.WK, ..., S.SUM(AMT) as AMT From SALES S join PROD P on S.UPC_ID=P.UPC_ID
join CAL C on S.DT_ID=C.DT_ID group
by ...) ; MQT02
as ( Select
P.SCL_ID,P.SCL_NM,P.SCL_DSC,P.CLS_ID,P.CLS_NM, P.CLS_DSC,P.CAT_ID..., L.STR_ID, STR_NM, STR_DSC, L.ZIP, ..., L.DST_ID, ..., L.RGN_ID C.WK_ID,
C.YR, C.WK, ..., S.SUM(AMT) From
SALES S join PROD P on S.UPC_ID=P.UPC_ID join LOC L on S.STR_ID=L.STR_ID join CAL C on S.DT_ID=C.DT_ID group
by ...) ; MQT03
as ( Select
P.SCL_ID, P.CLS_ID, P.CAT_ID...,
C.YR, C.WK_ID, C.PER_ID, ..., S.SUM(AMT) From
SALES S join PROD P on S.UPC_ID=P.UPC_ID join CAL C on S.DT_ID=C.DT_ID group
by ...) ; MQT04
as ( Select
P.SCL_ID, P.CLS_ID, P.CAT_ID...,
L.STR_ID, L.DST_ID,
L.RGN_ID, L.DIV_ID, C.YR, C.WK_ID, C.PER_ID, ..., S.SUM(AMT) From
SALES S join PROD P on S.UPC_ID=P.UPC_ID join LOC L on S.STR_ID=L.STR_ID join CAL C on S.DT_ID=C.DT_ID group
by ...) ; |
Both MQT01 and MQT02 include each level of product starting from subclass and up, and each level of time starting from week and up. MQT02 also includes the complete location dimension. Also notice that they include IDs, names (NM), descriptions (DSC) and special attributes (e.g. ZIP). The optimizer will gladly match our query Q1 with either of the first two MQTs and rewrite the query like Q1.1 and Q1.2. In both cases the performance is likely to improve substantially because the amount of data in the two MQTs should be at least one, if not two levels of magnitude less than the base table, and two or three joins respectively are avoided.
If we are to include all names and descriptions in the MQT, however, the MQT row becomes too large. It is preferable to design MQTs that are not cluttered by long descriptive columns, but only with level identifiers and critical attributes that can facilitate fast filtering and grouping. MQT03 and MQT04 are examples of this approach with row size that could be hundreds, if not thousands, of bytes shorter than the first two MQTs respectively. These MQTs can only be used by a double-dipping query such as Q1.3 or Q1.4. These queries essentially use the subset of the columns that match the MQT for filtering and grouping, and then join back to a distinct subset of the dimensions to extract names and descriptions at the grouping level. This double-dipping to the dimension can be derived by the optimizer if certain functional dependencies are guaranteed and defined in the schema. Alternatively a double-dipping query pattern can be explicitly coded. The pros and cons will be discussed along with design alternatives and trade-offs.
Q1.1: Select
M.CLS_NM, M.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK, SUM(M.AMT) From
MQT01 M join LOC L on M.STR_ID=L.STR_ID where
M.MCAT_ID = 101 and L.RGN_ID = 205 and M.YR = 2005 and M.WK = 40 group
by M.CLS_NM, M.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK ; Q1.2: Select
M.CLS_NM, M.CLS_DSC, M.STR_NM, M.ZIP, M.YR, M.WK, SUM(M.AMT) From
MQT02 M where
M.MCAT_ID = 101 and M.RGN_ID = 205 and M.YR = 2005 and M.WK = 40 group
by M.CLS_NM, M.CLS_DSC, M.STR_NM, M.ZIP, M.YR, M.WK ; Q1.3: Select
P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK, SUM(M.AMT) From
MQT03 M join LOC L on M.STR_ID=L.STR_ID join
(select distinct CLS_ID, CLS_NM, CLS_DSC from PROD) P on
M.CLS_ID=P.CLS_ID where
M.MCAT_ID = 101 and L.RGN_ID = 205 and M.YR = 2005 and M.WK = 40 group
by M.CLS_NM, M.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK ; Q1.4: Select
M.CLS_NM, M.CLS_DSC, M.STR_NM, M.ZIP, M.YR, M.WK, M.SUM(AMT) From
MQT04 M join (select distinct CLS_ID, CLS_NM, CLS_DSC
from PROD) P on
M.CLS_ID=P.CLS_ID join
LOC P on M.STR_ID=L.ID where
M.MCAT_ID = 101 and M.RGN_ID = 205 and M.YR = 2005 and M.WK = 40 group
by P.CLS_NM, P.CLS_DSC, L.STR_NM, L.ZIP, M.YR, M.WK ; |