Multiple output indexes in Pentaho CDA

Until today, if you wanted to reuse a query in a .cda file with a different output spec, you had to duplicate it as many times as you needed different outputs. Today I committed a code change that lets you specify multiple output specs directly in the same DataAccess XML node:


<CDADescriptor>
...
<DataAccess id="1" connection="1" type="sql" access="public" cache="true" cacheDuration="3600">
...
<Output indexes="1,0,2,3"/>
<Output id="2" indexes="1,0,2"/>
<Output id="3" indexes="0,2,3"/>
</DataAccess>
</CDADescriptor>

To use a different output, you just need to specify the outputIndexId parameter in the call to the doQuery servlet! :)
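For example, assuming a CDA file saved at /public/sample.cda on a local Pentaho server (path and host here are hypothetical), selecting the second output spec of the DataAccess above might look like this:

```
http://localhost:8080/pentaho/content/cda/doQuery?path=/public/sample.cda&dataAccessId=1&outputIndexId=2
```

Omitting outputIndexId falls back to the default output spec.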

All this code was developed internally at Xpand IT; I just wrapped it up, cleaned it and committed it back :-)

Cheers,
ITXpander

Counting days with mondrian and MDX

This is a problem almost everyone dealing with daily-average or similarly weighted measures has to face: knowing the exact number of days accounted for in a given calculation.
After some days (weeks) of fiddling with this I came up with some solutions…

Approach #1 (aka the naive, time-consuming, not-so-efficient approach):

In my DWs I usually keep a smart date key in the yyyymmdd format, which is great for these types of calculations… so, naively, you might think that a distinct-count measure in Mondrian would do the trick:

<Measure name="Number of Days" column="id_date" aggregator="distinct-count"/>

Ok, so this is great, and it works… but Mondrian will have to query your DW with a count(distinct id_date) operation for every dimension level you use in your queries (more on this here). As you can guess, doing this over a large amount of data is highly inefficient, and it can take a whole lot of aggregate tables, as described in the Mondrian docs.
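To make the cost concrete, here is a sketch of the kind of SQL Mondrian ends up issuing for such a measure — one statement per combination of dimension levels in play. Table and column names are illustrative, not the exact generated query:

```sql
-- Illustrative only: one query like this per dimension-level combination,
-- and count(distinct ...) cannot be rolled up from aggregate tables.
select
    dim_date.year,
    count(distinct fact_table.id_date) as number_of_days
from
    fact_table,
    dim_date
where
    fact_table.id_date = dim_date.id_date
group by
    dim_date.year
```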

Approach #2 (efficient but not so perfect solution):

So, after some time scouring the vast and immense internet for a solution, I found this post with a possible fix. In short, you hand the counting responsibility over to the OLAP layer alone by defining a calculated member that depends on the date dimension:

Count(Descendants([Date].CurrentMember, [Date].[Day]), EXCLUDEEMPTY)

This counts all the non-empty day-level members of the date dimension in the cube. Great, exactly what I needed, and no evil distinct-count measure required. Of course, this solution had to have a catch to give me some more trouble: it only works when all the day members appear in the query non-aggregated, which means the following won't work:

with member [Date].[Date Range] as 'Aggregate([Date].[2010].[11].[1]:[Date].[2010].[11].[5])'
select [Measures].[Number of Days] on columns
from [My Cube]
where [Date].[Date Range]

It will always return 1, because [Date].CurrentMember will always be a single day.

Approach #3 (Efficient and working solution. Elegant? Maybe not…):

After thinking this through, I figured out I had to make Mondrian always count a set of days even when they are not present, unaggregated, in the query. So why not make the date dimension a cube itself?

<Cube name="date cube">
<Table name="dim_date"/>
<DimensionUsage name="date" source="date" foreignKey="id_date"/>
<Measure name="Number of Days #3" column="id_date" aggregator="count"/>
</Cube>

Now the former query works, because Mondrian will query the date dimension, filtering the SQL query by the selected day-level members, and the count will return 5 as expected. But… there won't be any “empty” members, as this dummy cube will always contain all the members (and it should, because it's your date dimension after all). How do you solve this? By combining approaches #2 and #3!

Step 1: Build a virtual cube that combines your own cube with the date cube:

<VirtualCube name="My Combined Cube">
<CubeUsages>
<CubeUsage cubeName="date cube" ignoreUnrelatedDimensions="true"/> <!-- This will force mondrian to make measures of this cube valid in all non related dimensions -->
<CubeUsage cubeName="my cube"/>
</CubeUsages>
...
<VirtualCubeMeasure cubeName="date cube" name="[Measures].[Number of Days #3]"/>
...
</VirtualCube>

Step 2: Add the approach #2 measure as a calculated member:

<VirtualCube name="My Combined Cube">
...
<CalculatedMember name="Number of Days #2" formula="Count(Descendants([Date].CurrentMember, [Date].[Day]), EXCLUDEEMPTY)" dimension="Measures"/>
...
</VirtualCube>

Step 3: Add a final measure that combines both:

<VirtualCube name="My Combined Cube">
...
<CalculatedMember name="Number of Days" formula="IIf([Measures].[Number of Days #2] = 1, ValidMeasure([Measures].[Number of Days #3]), [Measures].[Number of Days #2])" dimension="Measures"/>
...
</VirtualCube>
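With the three steps in place, the range query from approach #2 can be re-run against the virtual cube; under this setup it should return 5 instead of 1:

```
with member [Date].[Date Range] as 'Aggregate([Date].[2010].[11].[1]:[Date].[2010].[11].[5])'
select [Measures].[Number of Days] on columns
from [My Combined Cube]
where [Date].[Date Range]
```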

So now, combining both approaches, all the day-counting cases are covered (or so I think…) :) Have a better solution for this? I'm all eyes and ears!

EDIT: In step #3 the [Measures].[Number of Days #3] measure must be wrapped inside the ValidMeasure MDX function to work correctly.

MySQL partition pruning and Mondrian

About one month ago, the performance of one of the data warehouses I maintain (MySQL based) started to slowly degrade as records built up (about 70M in one of the fact tables). The first thing to check was the indexes: nothing wrong there, no more and no fewer than were needed for querying. So the next step was to partition the table to keep performance linear across the records.

Now the first big question was how to partition the table (important, huh?). After some query analysis I realized that most, or actually all, of the queries were time based, so partitioning by year and month was probably the best option.

Regarding MySQL's partition pruning (PP) functionality, which makes the optimizer determine, whenever possible, the minimum set of partitions the query needs to access, there is one limitation that makes things a bit trickier: the partitioning column must appear in the where clause for PP to kick in. Fair enough, since the optimizer most likely doesn't have a crystal ball around to guess what you want to do.

Since I'm using Pentaho for BI implementations, the Mondrian OLAP engine is responsible for the automagic SQL generation for the MDX queries. A typical Mondrian-generated query ends up looking something like the following:

-- Example query
select
<fact measures>
from
fact_table as 'fact_table',
dim_table as 'dim_table'
where
fact_table.id_dim = dim_table.id_dim and
dim_table.field = something

So, all in all, Mondrian will never slice the data by a fact table column, only by a dimension, which makes total sense in an OLAP context. So how do you effectively partition the fact table by date (month in this example) and make MySQL use the PP functionality? Read on.

The only common bit between the fact table and the dimension table is the technical key (TK), so this has to be the partitioning point if you ever want to use PP. The problem arises when you religiously stick to the meaningless-TK schema induced by the sacred gods of data warehousing (DWing), which I fully agree with, with one exception: the date dimension. If the key is meaningless, there's no efficient way (that I know of) to compute a function that will properly partition by date segments. Since I had to deal with this problem, I decided to change my date dimension TK to something a bit more meaningful, such as yyyymmdd (e.g. 20100101). This not only keeps the ID monotonic, but you can still use the 0 or -1 TKs for invalid dates and so on if you wish.
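With a yyyymmdd key, monthly range partitioning becomes straightforward. A sketch of the DDL follows; partition names and boundaries are illustrative, and the syntax is MySQL 5.1-style:

```sql
-- Monthly range partitions on the smart date key (illustrative).
alter table fact_table
partition by range (date_tk) (
    partition p201001 values less than (20100201),
    partition p201002 values less than (20100301),
    partition p201003 values less than (20100401),
    partition pmax    values less than maxvalue
);
```

Keep in mind that MySQL requires the partitioning column to be part of every unique key on the table, which is another reason the TK is the natural choice here.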

So with these changes, imagine the following MDX query example:

select
[Measures].Members on columns,
[Date Dim].[All Dates].[2010].[1].[1] on rows
from [Some Cube]

Assuming your mondrian schema file for the date dimension will look something along these lines:

<Dimension name="Date Dim" type="TimeDimension">
<Hierarchy hasAll="true" allMemberName="All Dates" primaryKey="date_tk">
<Table name="date_dim"/>
<Level name="Year" column="year" uniqueMembers="true" levelType="TimeYears" type="Numeric"/>
<Level name="Month" column="month" uniqueMembers="false" ordinalColumn="month" nameColumn="month_name" levelType="TimeMonths" type="Numeric"/>
<Level name="Day" column="day_in_month" uniqueMembers="false" ordinalColumn="day_in_month" nameColumn="day_name" levelType="TimeDays" type="Numeric"/>
</Hierarchy>
</Dimension>

The generated SQL query will be something like this:

select
<fact measures>
from
fact_table as 'fact_table',
date_dim as 'date_dim'
where
fact_table.date_tk = date_dim.date_tk and
date_dim.year = 2010 and
date_dim.month = 1 and
date_dim.day_in_month = 1

Which is what I'd expect a query to look like, and indeed it works, but… not with PP. Why? Because there isn't any reference to the date_tk column anywhere in the query yet. One thing I didn't mention is that MySQL can transitively apply the pruning optimization on the right side of a join; this means that if you filter the query on the date_dim.date_tk column, MySQL will know how to optimize on the fact table side. Great! So how do we do this?

Subtle change to the mondrian schema:

<Level name="Day" column="date_tk" uniqueMembers="false" ordinalColumn="day_in_month" nameColumn="day_name" levelType="TimeDays" type="Numeric"/>

And now the query generated will look like this:

select
<fact measures>
from
fact_table as 'fact_table',
date_dim as 'date_dim'
where
fact_table.date_tk = date_dim.date_tk and
date_dim.year = 2010 and
date_dim.month = 1 and
date_dim.date_tk = 20100101

Ok, so this might seem a bit redundant, but if you use EXPLAIN PARTITIONS you will see that the optimizer only hits the partition where the 20100101 key lands!
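Assuming the monthly partitioning scheme sketched earlier (partition names are hypothetical), the check looks something like this:

```sql
-- The "partitions" column of the EXPLAIN output should list a
-- single partition instead of all of them.
explain partitions
select count(*)
from fact_table, date_dim
where fact_table.date_tk = date_dim.date_tk
  and date_dim.year = 2010
  and date_dim.month = 1
  and date_dim.date_tk = 20100101;
```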

All automagic from now on…

Hope this was helpful!

Edit: I didn't mention it, but this approach is probably valid for some other database management systems too. Leave a comment, share your experience.

The start

So I finally decided to start a blog to share my experiences while expanding someone's IT. Here you'll find tech tips on how to make the most out of your IT, mostly based on open source technologies. My main working area is Business Intelligence, so expect to see a lot of content related to it.

Stick around, maybe you'll find something useful.

ITXpander, out!
