Tag: date
February 11, 2010

Overview

A good date dimension is essential for any data warehouse or reporting database. If you don't have the right tools, you'll often find yourself generating this date dimension in Excel, which can be tedious.

I've created a reusable Pentaho Data Integration transformation that lets you create a date dimension with a user-defined start date and number of days.

Feel free to tweak to your heart's content. The download link is at the bottom of the blog entry.

Caveats

  • The first date of this date dimension starts in October 2009, which coincidentally (grin) matches up with the day Braintapper came into existence. You can change this date as indicated in the transformation instructions.
  • This transformation is for educational purposes only. It's up to you to make sure it's debugged if you're thinking of putting it into a production system.
  • This transformation is based on a standard calendar year. It doesn't factor anything like lunar dates, 4-5-4 calendars or fiscal years that don't start on January 1.

How it Works

A quick view of the transformation is here:

There are 5 basic steps used in this transformation.

  1. Using a "Generate Rows" step, I set the first day of the dimension to iterate an arbitrary number of times. In the sample, I create 10,000 lines, which means there are 10,000 days in the dimension.

  2. Next, I use an "Add Sequence" step to create a counter for each row.

  3. I use a "Formula" step to create each date in the dimension, using the counter from the previous step.
  4. I create the rest of the date dimension elements using another "Formula" step.
  5. Finally, I output the dimension table to an Excel file. You can change this to any other output type that you would like.

Date Dimension

In the date dimension, I like to create "In Dimension Counters". These are basically running counts of the Year, Quarter, Month and Day in the dimension. I also like to include Month and Weekday names and abbreviations.

I like to include these added elements because some database/reporting engines do not have fully fleshed date functions, so these added elements allow you to use plain old SQL to do conversions and calculations.

Elements

  • DimensionDate - The date, as a date type.
  • DateSid - The date, as an integer representation of YYYYMMDD.
  • Year - The year, as an integer.
  • Quarter - The quarter, as an integer.
  • Month - The month in the year, as an integer.
  • Day - The day in the month, as an integer.
  • DayInYear - The day in the year, as an integer between 1 and 366.
  • Weekday - The day of the week, starting on Sunday indexed at 1.
  • MonthName - The name of the month, in English.
  • MonthAbbreviation - Three letter abbreviation of the month, in English.
  • DayName - The weekday name, in English.
  • DayAbbreviation - Three letter abbreviation of the weekday, in English.
  • YearInDimension - The year counter for the entire dimension, starting at 1.
  • QuartersInDimension - The quarter counter for the entire dimension, starting at 1.
  • MonthInDimension - The month counter for the entire dimension, starting at 1.
  • DayInDimension - The day counter for the entire dimension, starting at 1.

Notable Formulas and Gotchas

Calculating Month/Weekday Names and Abbreviations

Since the Pentaho Formula step doesn't have month or weekday name date functions, I used a variant of the tip that I blogged about last November to generate these.

DateDifs and In Dimension Counters

Pentaho's DateDif function in the "Formula" step appeared to be a little buggy, so many of the "In Dimension" element calculations are a little more complicated than they needed to be.

Because my first year start date is not January 1 (which appears to be the cause of the bug), I had to create an exception for the first year counters. For example, the month in dimension calculation is as follows.

if
(
  // calculations for the first year
  year([FirstDate])=year([DimensionDate]);
  
  datedif([FirstDate];[DimensionDate];"m")+1;
  datedif([FirstDate];date(year([FirstDate]);12;31);"m")+1) 
  +
  // calculated from second year forward
  if(year([FirstDate])=year([DimensionDate]);0;
  datedif(date(year([FirstDate])+1;1;1) ;
  [DimensionDate];
  
  "m")+1
)

If the DateDif worked properly (I was getting something like 21 months between 2009/10 and 2010/1 - what???) then I'd be able to drop the first part of the formula above.

In any case, the formula now only suffers from a little redundancy, but it can accommodate any date dimension that doesn't start on New Year's day.

Transformation File

DateDimensionGenerator.ktr

keyword dimension  keyword date  
November 18, 2009
Whether you're working in Cognos, Pentaho or some other BI tool, you'll inevitably be given a data element with a date format your system doesn't like.

Usually it involves a month abbreviation and having to convert it to a number.

Not all formula languages provide this type of conversion with a built-in command.

The first solution that usually comes to mind is an if->elseif or case / switch block to handle the month abbreviation conversion.

There is a quicker way to do this, using the string location function in your tool of choice, which of course, is never the same between tools.

Here's a quick version in pseudocode:
FIND(UCASE("Jan");"###JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC"))/3

Let's walk through my pseudocode.

FIND is the command that I want to use for searching my string, and here's the syntax: FIND(needle;HAYSTACK)

It should return me with the first position of the string when found. In some formula languages, it's substr or substring.

Gotcha: Some FIND commands are indexed at 1, others at 0. That means the first character could return 0 or 1. I'm going to assume that FIND returns 0. If your FIND command returns 1, you'll have to adjust accordingly.

So let's look at the string positions:

###JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC
012345678901234567890123456789012345678

A couple of things worthy of note. I've indexed with zero, and I've left-padded my string with hashes (#). Because of how my pseudocode FIND works, searching for "JAN" will return 3, since that's the first character position it finds. "FEB" will return 6, "MAR" will return 9 and so on.

Since each of these results are divisible by 3, all I need to do is return the position divided by 3 to give me the month number. If your FIND command is indexed with 1, then the string positions would look like this:

##JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC
12345678901234567890123456789012345678

You should only pad your haystack string with 2 hashes in this case.

So to recap, perform an uppercase conversion on your abbreviation in case your FIND command is case sensitive, and divide the "found" position by 3.

Easy peasy.
keyword date  keyword algorithm