- air
- ajax
- algorithm
- apple
- bitbucket
- braintapper_exchange
- charts
- chumby
- codeigniter
- cognos
- complexity
- crosstab
- dash
- dashboard
- date
- dbvisualizer
- decisions
- dimension
- dogfood
- dona_wong
- edward_tufte
- feature_checklists
- feature_excellence
- filemaker
- firefox
- firewall
- flot
- flowing_data
- fogbugz
- football
- free
- freenas
- freshbooks
- gm
- google_charts
- iPad
- javascript
- jdbc
- jedox
- mac
- macbook
- maps
- marsedit
- mercurial
- metaweblog
- metrics
- microstrategy
- monowall
- moo
- nathan_yau
- open_source
- palo
- pentaho
- pfsense
- printing
- programmers_interfaces
- rapidweaver
- regex
- regexr
- rest
- smoothwall
- sony
- sqlpower
- stackoverflow
- statistics
- stephen_few
- svg
- tablet
- ticket_agent
- tip
- tm1
- transformer
- trick
- typographic grid
- usability
- visualization
- w3c
- web
- wiki
- wikkawiki
- work_management
- wsj
Nathan Yau at Flowing Data (Add it to your RSS feeds now. Yes, it's that good.) had a great entry this past week on how to roll your own map infographic using a Python script on an SVG.
That got me thinking, how would that be done in Pentaho Data Integration?
Not wanting to spend too much time on it, I threw together a quick proof-of-concept, and I thought I'd share it with you.
Before we start:
- This is a tutorial for creating the map, not on using Pentaho. I will assume that you know how to create jobs, transformations, steps, etc.
- Read the Flowing Data article carefully before you read this. While most of this tutorial is self-explanatory, I'm not going into as much detail as Nathan did in the original. I will try to follow Nathan's structure to make things a little easier to digest.
- This solution was done as a "quick and dirty" solution. There's more than one way to skin a cat, so feel free to tweak and optimize as you would like.
Step 0. System requirements
- A recent version of Java installed on your computer. Chances are, you've already got it.
- The latest version of Pentaho Data Integration.
- A copy of Inkscape, a freeware SVG editor. Snow Leopard users be warned, Inkscape won't work without a lot of tweaking, you might want to open it up on a Leopard machine or a Windows VM.
- A text editor would be handy, but Notepad or Textedit will do.
Step 1. Clean the Map
As Nathan mentioned, the map he used is an SVG, which is really an XML format. The Python library he used, Beautiful Soup, is very forgiving on XML files. Pentaho, not so much.
Inkscape native SVGs do not make for Pentaho friendly files. Fortunately, Inkscape can save some files that will almost work in Pentaho.
As in the Flowing Data article, we're only concerned about the County IDs (the FIPS number) and the background colour of each county. So first download the counties map.
Open it up in Inkscape. Go to Save as, and save it as a "Plain SVG" with the filename "USA_Counties_with_FIPS_and_names plain.svg" (you can name it whatever you want, but my Pentaho solution used this name). This is key to cleaning out the extraneous Inkscape crud in the SVG file that would otherwise give us problems in Pentaho.
Feeling lazy? Here's [the plain SVG][8] I made.
Step 2. Get the Data
Nathan kindly created a CSV file here, so let's reuse it. Remember that Pentaho can handle just about any data source, which is why this whole mapping exercise is so worthwhile. If you've got similar data by county, you can just substitute the input in the solution for your data source, and make dynamic maps accordingly.
Step 3. Take a quick Look at the Plain SVG
In a text editor (Notepad++ on Windows, or TextEdit on Mac), open USA_Counties_with_FIPS_and_names plain.svg. It'll look something like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?!-- Created with Inkscape (http://www.inkscape.org/) -->
<?svg
xmlns:svg="http://www.w3.org/2000/svg";
xmlns="http://www.w3.org/2000/svg";
<version="1.0"
<width="555.22198"
<height="351.66797"
<id="svg9559">
<defs
<id="defs9561" />
<path .../>
There are still a couple of troublesome things with the file above that will cause some hiccups with Pentaho, but we'll resolve them right in Pentaho with a preprocessing transformation.
The namespace attributes (xmlns, in red above) are going to cause Pentaho to choke, but we can fix this quite easily. We'll fix this with a preprocessing step, outlined in Step 5.
Step 4: Launch Pentaho
A gentle reminder guys (and gals): I'm not going to get into any detail on how to use Pentaho. I'm assuming that you'll be able to figure out how to create the steps, hops and linkages to make this work.
For the lazy among you, I've included a ZIP of the XML files you can import into Pentaho so you don't have to recreate all of this from scratch.
Important: Make sure you change the paths so that they work locally on your machine. I did this on a Mac, so my paths will definitely not work if you're on Windoze.
Step 5: Preprocessing Transformation
Even with the "plain svg" map, we still need to fix the file a little. Instead of doing this manually, I've created a preprocessing transformation. The thinking behind this is to allow you to tweak the solution for your own uses so that it's reusable.
To do this, I created a new Transformation called "00 Map Preprocessing":
In this Transformation, we need to perform three tasks:
- Load the Plain SVG file
- Parse out a couple of offending attributes
- Rewrite the SVG file as a temp file
Load the Plain SVG file
Create a Text file input step and point it to the Plain SVG file you created earlier. Make sure you click on Add.
In the content tab, there are a couple of things you've got to do. Leave it as a CSV, but make sure you change the delimiter to TAB (not visible in the screen grab) and set the format to Unix. If you do it as DOS, you're more likely to end up with unwanted errors.
Finally, in the Fields tab, just create a single field called Line and make sure it's a String type with an undefined length.
Parse out the offending attributes
- The next step is to create a Formula step (from the Scripting folder). Make sure you put a hop in from the first step.Create an entry called NewLine that is a string with the following formula:
SUBSTITUTE( SUBSTITUTE( [Line]; "xmlns:svg=""http://www.w3.org/2000/svg"""; ""); "xmlns=""http://www.w3.org/2000/svg"""; "")
If you look at the code, it will wipe out the xmlns namespace attributes with a blank.
Rewrite the SVG file as a temp file
Finally, we're ready to create the output step.
Create a Text file output step.
- Set the file as: USA_Counties_with_FIPS_and_names plain.svg
- Set the extension as tmp
- The file will be saved as "USA_Counties_with_FIPS_and_names plain.svg.tmp"
In the Content Tab, Set TAB as the delimiter and UNIX as file type.
Finally in the Fields tab:
Add NewLine as the output and make it a String with an undefined length.
Run the entire transformation to make sure it works.
Step 6. Create the main transformation
Now that you've got a transformation that creates a temp file, you can parse it in Pentaho, join it to the unemployment data, and output a colorful SVG file.
To do this, I created a new Transformation called "01 Map Processing":
There are two streams that will merge in the end.
The first stream is the load of the unemployment data, and the second stream is the load of the map. They'll be merged into a new map.
Load the Unemployment Data
Create a CSV file input step.
Point it to the path of the data file. Most of the defaults are fine.
Add the following fields:
- Ignore1, String
- Fips1, String
- Fips2, String
- Ignore4, String
- Ignore5, String
- Ignore6, String
- Ignore7, String
- Ignore8, String
- Rate, Number
Only Fips1, Fips2 and Rate will be read. As mentioned in the Flowing Data article, Fips1 and Fips2 are concatenated to generate the full Fips code.
Next, add a formula step:
There are two transformations that need to be done.
Fips1 and Fips2 need to be combined to make the county id. The formula is simple:
[Fips1]&[Fips2]We're going to replace the style of the county items based on the Rate value (using the same rules and colours as the Flowing Data article). Create another field called style_new. This formula is not hard, but it's long:
"font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:" & IF( [Rate] > 10 ; "#980043" ; IF([Rate] > 8; "#DD1C77";IF([Rate] > 6; "#DF65B0";IF([Rate] > 4; "#C994C7";IF([Rate] > 2; "#D4B9DA";"F1EEF6" ) ) ) ))Basically the formula above is the string for the style, with a conditional statement at the end setting the colour based on the value for the Rate.
Next, we'll sort the County Ids (Fips code) so that we can join them. Add a Sort Rows step to the Transformation. and sort by the field named id.
Load the Temporary Map SVG
Create a Get data from XML step.
Point it to the .tmp file you created and make sure you click Add.
Go to the Content tab. We're only going to load the Path elements from the xml file, so in the Loop XPath, click on Get XPath Nodes, and select "/svg/path" Set the encoding to UTF-8 to be safe.
Finally, in the fields tab:
Click on Get Fields. Make sure the order is the same as the screenshot above. Rename the style to style_original.
Create another sort step to sort by Id
Join the Data to the Map
Now we're ready to join the data to the map.
Create a Merge Join step.
Set the Sort Paths (xml file) as the first step, Sort Data step as the second step and make the join an LEFT outer. Use id as the key field for both steps.
Now we're going to assign the styles using a Formula step.
In case there are counties without data, we want to retain the original style from the map.
In the join, if the style_new step is null (there are a few), we'll take the original style.
Use the formula below:
IF(isblank( [style_new]) ; [style_original] ; [style_new])Finally we want to output the new merged map data to an xml file. The problem is that the XML Output step can't output attributes into the XML, but we can fix that in post processing.
Set the filename to paths_restyled and leave the file type as XML for now. The SVG data being written will still need to be cleaned up.
In the Content tab, set the encoding to UTF-8, the parent XML element to svg and the row XML element to path:
In the fields tab, make sure the fields are output as strings in this order: style, d, id. The order is important because of the way the output file will be parsed in the Post Processing stage.
Step 7. Post Processing
If you look at the file that was output in the previous transformation, it should look something like this.
<?xml version="1.0" encoding="UTF-8"?>
<svg>
<path>
<style>
font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;
stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:#DD1C77</style> <d>M 396.96898,238.579 L 397.37998,242.405 L 396.53598,243.339 L 395.20198,242.757 L 393.90798,242.91 L 393.43098,242.995 L 393.39698,243.055 L 393.10198,243.726 L 391.60198,239.67 L 392.68698,239.615 L 393.79598,239.489 L 393.73498,238.922 L 396.96898,238.579</d> <id>01001</id></path>
Notice the difference in the path elements. Instead of having attributes, it has child elements. This is easily fixed with some simple search and replace.
Finally, we need to create the post processing to fix the SVG file. Create a new transformation called 02 Map Postprocessing
Create a Text file input step and point it to the paths_restyled.xml file.
Set the content tab to be the same as the one in the PreProcessing stage.
Set the fields tab to be the same as the one in the PreProcessing stage.
Create a formula step with a new field called
LineOut1.Set the formula to the one below:
SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE([Line];"";" ";""" d=""");" ";"""/>")";""" id=""");" This formula parses each path line and converts the elements to attributes.
Create another formula step to re-embed those namespace attributes we took out earlier. Add a field called LineOut2.
SUBSTITUTE([LineOut1];"Note that the def tag was stripped out earlier in the main transformation since we only iterated through the paths.
Create a step for the Text file output and set the filename to be final_map, with the extension set to be svg.
In the fields tab, set LineOut2 as the final output field. Make sure it's a string with an undefined length.
Step 8. Put It All Together
To integrate all the Transformations, feel free to create a Job that links all three of the transformation steps.
Now, when you run it, you should be able to create the following SVG:
Next Steps
So now you know how to build this map in Pentaho. While the instructions are long, a seasoned Pentaho developer could put this together at least as quickly as using the Python method in the Flowing Data article.
So how can this be changed to make it more useful?
Well if you have any county level statistics in a database, Excel file or whatnot, you can easily port this solution to that. Better yet, if this data is in a database and changing regularly, you can schedule Pentaho to build the map at regular intervals.
The great thing is that you can do this with free tools. No expensive maps or software packages to buy.
Cool!
[8]: http://blog.braintapper.com/assets/images/pentaho-map/USA_Counties_with_FIPS_and_names plain.svg


























