MySQL partition prunning and Mondrian

About one month ago, the performance of one of the Data Warehouses i maintain (MySQL based) started to slowly degrade as records build up (about 70M in one of the fact tables). The first thing to check were the indexes, nothing wrong there, no more and no less than were needed for querying, so the next step was to partition the table to keep the performance linear across the records.

Now the first big Q was how to partition the table (important huh). After some query analysis i realized most or actually all the queries done were time based so partitioning by year and month was probably the best option.

Regarding MySQL partition prunning (PP) functionality which makes, whenever possible, the optimizer determine the minimum set of partitions the query will need to access, there is one limitation that makes this a bit more tricky – the partitioning column must appear in the where clause for the PP to be computed. Well fair enough since the optimizer most likely doesn’t have a crystal ball around to guess what you want to do.

Since I’m using pentaho for BI implementations, the mondrian olap engine is the responsible for the automagic SQL generation for the MDX queries. A typical mondrian generated query will end up something like the following:

-- Example query
<fact measures>
fact_table as 'fact_table',
dim_table as 'dim_table'
fact_table.id_dim = dim_table.id_dim and
dim.field = something

So, all in all, mondrian will never slice the data with a fact table column but with a dimension, which makes total sense in an OLAP context. So, how do you effectively partition the fact table by date (month in this example) and make mysql make use of the PP functionality? Stay on.

The only common bit between the fact and the dimension table will be the technical key (TK) so this has to be the partition point if you ever want to make use of the PP functionality. The problem here is when you religiously keep with the meaningless TK schema induced by the sacred gods of Data Warehousing (DWing), which i fully agree on to one exception – the date dimension. If the key is meaningless there’s no efficient way (that i know of) that you can compute a function that will properly partition by date segments. Since i had to deal with this problem, i decided to change my date dimension TK key to something a bit more meaningful such as yyyymmdd (e.g. 20100101). This not only preserves the ID as being monotonic but you can still make use of the 0 or -1 TKs if you wish so for invalid dates and so on.

So with these changes, imagine the following MDX query example:

[Measures].Members on columns,
[Date Dim].[All Dates].[2010].[1].[1] on rows
from [Some Cube]

Assuming your mondrian schema file for the date dimension will look something along these lines:

<Dimension name="Date Dim" type="TimeDimension">
<Hierarchy hasAll="true" allMemberName="All Dates" primaryKey="date_tk">
<Table name="date_dim"/>
<Level name="Year" column="year" uniqueMembers="true" levelType="TimeYears" type="Numeric"/>
<Level name="Month" column="month" uniqueMembers="false" ordinalColumn="month" nameColumn="month_name" levelType="TimeMonths" type="Numeric"/>
<Level name="Day" column="day_in_month" uniqueMembers="false" ordinalColumn="day_in_month" nameColumn="day_name" levelType="TimeDays" type="Numeric"/>

The generated SQL query will be something like this:

<fact measures>
fact_table as 'fact_table',
date_dim as 'date_dim'
fact_table.date_tk = date_dim.date_tk and
date_dim.year = 2010 and
date_dim.month = 1 and
date_dim.day_in_month = 1

Which is what I’d expect a query to look like, and indeed it works, but… not with PP. Why? because there isn’t any reference to the date_tk column anywhere in the query as of yet. One thing i didn’t mention yet, is that MySQL can transitively apply the prunning optimization on the right side of a join, this means that if you filter the query on the dim_date.date_tk column, MySQL will know how to optimize on the fact table side. Great! So how do we do this?

Subtle change to the mondrian schema:

<Level name="Day" column="date_tk" uniqueMembers="false" ordinalColumn="day_in_month" nameColumn="day_name" levelType="TimeDays" type="Numeric"/>

And now the query generated will look like this:

<fact measures>
fact_table as 'fact_table',
date_dim as 'date_dim'
fact_table.date_tk = date_dim.date_tk and
date_dim.year = 2010 and
date_dim.month = 1 and
date_dim.date_tk = 20100101

Ok so this might seem a bit redundant but if you make use of the explain partitions function you will see the optimizer will only hit the partition where the 20100101 key lands on!

All automagic from now on…

hope this was helpful!

Edit: I didn’t mention but this approach is probably valid for some other database management systems, leave a comment, share your experience.

About these ads

About ITXpander
A Business IT Consultant, moderately crazy and a life lover!

6 Responses to MySQL partition prunning and Mondrian

  1. Trooper says:

    Not bad article, but I really miss that you didn’t express your opinion, but ok you just have different approach

  2. Luc Boudreau says:

    Good article. I like it. Thanks for writing this up.

  3. Ben Weinstein says:

    Thank you, this is a really helpful article. We have been trying to partition our fact table by year and month and we are using Mondrian as our OLAP engine. This works perfectly when the query is limited to a specific day but when I ask for a month it has to check every partition.

    Do you have any idea how to make the Partial Partitioning work with queries at the month or year level?

    • ITXpander says:

      Thanks Ben, glad to be helpful :-)

      I am actually being bitten by that problem at the moment. I’ll update the post or make a new one if I go around this.

  4. Ranish says:

    Cant we do that ,combining the PP+Mondrian Aggregate tables , by defining Aggregate tables at eh month level using the same smart key key approach for month (20120101 represents January)

  5. Raúl Micharet says:

    Ben Weinstein, ITXpander Something new about partition fact table by year insteed days?
    I have the same problem. A workaround is have a year dimensión and configure it as hidden. Later you can formulate measures to use the correct member (year) with ancestor(dailyDimension.currentMember, year).

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

%d bloggers like this: