http://qs321.pair.com?node_id=313934

A few months ago, dws, against the popular opinion promoting SQL statement builders, stated that hardcoded SQL is better and sometimes quicker to obtain. I heartily agree with his position, since I usually do the same thing, never trusting a tool to create SQL code for production purposes. However, there is an exception to this rule, and I rely on a special tool to produce complex queries for cross-tabulation. Since I am the author of that tool, I am less suspicious of its evil side effects on the rest of my applications.

Until a few months ago, the tool in question was just an ugly Perl script with a few dozen variables, which I used to create those exceedingly long queries that I then pasted into the production code. That was not a satisfactory solution, though, and when the requirements for truly dynamic queries became unavoidable, I created DBIx::SQLCrosstab, which allows the execution of dynamically created crosstab queries.

For those of you not familiar with the concept, let me point out that crosstab queries are quite peculiar, often involving hundreds of SQL lines, and unlikely to be handled by simple DBI wrappers. A recent article explains in full detail the database theory behind the creation of crosstab queries and gives some tips on how to use DBIx::SQLCrosstab to its best.

What the article doesn't cover is most of the "under-the-hood" stuff: the juicy details of how the engine works and how to expand it. Since the OnLamp editors wanted to keep the article focused on database matters, I stashed away all the Perl-relevant issues for PerlMonks, and here they come ...

DBIx::SQLCrosstab internals

One of the most intriguing aspects of DBIx::SQLCrosstab is the SQL builder engine, which, in addition to producing valid SQL code, must also be able to cope with a long list of requirements.

Requirements complexity

To create a realistic multi-level crosstab query, you must:

  • get a list of values for every level of headers;
  • combine the values of each level to create a condition;
  • create a sensible name for the resulting calculated column;
  • add columns (with appropriate calculations) for column sub-totals;
  • ensure that combinations of values from various levels are not impossible (e.g. allow UK/London, Germany/Berlin, and Italy/Rome, but avoid Italy/London, UK/Berlin, and Germany/Rome);
  • for each level, accept either instructions to get the values from the database or a pre-compiled list of values;
  • assemble the query lines together, taking into account mandatory clauses (FROM and GROUP BY) and optional ones (WHERE, HAVING, ORDER BY);
  • allow handlers for row sub-totals and multi-operator queries.

A simple, brute-force engine

To appreciate the complexity of these requirements, let's build a simpler engine that can only deal with a few of the rules. The following example creates a query from a list of values, the corresponding field names, and some information about the operation to perform.

(You may want to look at the database schema. The source data is available in the article.)

#!/usr/bin/perl -w
use strict;

my @values = (
    [ qw( pers dev sales ) ],                 # dept
    [ qw( consultant contractor employee ) ], # category
    [ qw( m f ) ],                            # gender
);
my @condition_names  = ('dept', 'category', 'gender');
my $operation        = 'SUM';
my $field_to_operate = 'salary';

# permute function written by Randal L. Schwartz, aka merlyn
# http://www.perlmonks.org/index.pl?node_id=24270
sub permute {
    my $last = pop @_;
    unless (@_) {
        return map [$_], @$last;
    }
    return map {
        my $left = $_;
        map [@$left, $_], @$last
    } permute(@_);
}

my @permuted = permute(@values);
# now @permuted = (
#    ['pers',  'consultant', 'm'],
#    ['pers',  'consultant', 'f'],
#    ['pers',  'contractor', 'm'],
#    ['pers',  'contractor', 'f'],
#    ['pers',  'employee',   'm'],
#    ['pers',  'employee',   'f'],
#    ['dev',   'consultant', 'm'],
#    ['dev',   'consultant', 'f'],
#    ['dev',   'contractor', 'm'],
#    ['dev',   'contractor', 'f'],
#    ['dev',   'employee',   'm'],
#    ['dev',   'employee',   'f'],
#    ['sales', 'consultant', 'm'],
#    ['sales', 'consultant', 'f'],
#    ['sales', 'contractor', 'm'],
#    ['sales', 'contractor', 'f'],
#    ['sales', 'employee',   'm'],
#    ['sales', 'employee',   'f'],
# );

my $query = qq{SELECT country, loc\n};
for my $line (@permuted) {
    my $ndx = 0;
    $query .= ","
        . $operation
        . "(CASE WHEN "
        . join(" AND ",
               map { $condition_names[$ndx++] . " = '" . $_ . "'" } @$line)
        . " THEN " . $field_to_operate
        . " ELSE NULL END ) AS "
        . "'" . join("#", @$line) . "'\n";
}
$query .= ",$operation($field_to_operate) AS total\n"
    . qq{FROM person
INNER JOIN depts ON (person.dept_id=depts.dept_id)
INNER JOIN categories ON (person.cat_id=categories.cat_id)
INNER JOIN locs ON (locs.loc_id=person.loc_id)
INNER JOIN countries ON (locs.country_id=countries.country_id)\n}
    . "GROUP BY country, loc\n";

print $query;

This script shows only the SQL builder part. The preparatory part would require a few additional queries, one for each level of header, to get the values from the appropriate database tables. Something along the lines of:

my $depts = $dbh->selectcol_arrayref(
    qq{SELECT DISTINCT dept FROM depts})
    or die "can't get dept values\n";
my $categories = $dbh->selectcol_arrayref(
    qq{SELECT DISTINCT category FROM categories})
    or die "can't get category values\n";
my $genders = $dbh->selectcol_arrayref(
    qq{SELECT DISTINCT gender FROM person})
    or die "can't get gender values\n";
my @values = ( $depts, $categories, $genders );

The heart of the engine lies in those two join calls, the first of which creates a composite condition. Each item is a "field name + value" combination, and several items are strung together with an "AND" operator. The resulting expression is wrapped in a "CASE" function (ANSI SQL-92), which passes to the "SUM" function either the relevant field or a NULL.

The second join creates the column name for the result set. What is missing in this crude version is a check on the suitability of the character used here as a field separator. Also, some database engines put limitations on column names, imposing a maximum length and restricting the characters allowed in them. The real module has a default mechanism that uses fake names (fld001, fld002, and so on) and a support hash to keep track of the real names. This trick makes the query more acceptable to fragile DBMS parsers.
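The module's real implementation is more elaborate, but a minimal sketch of such an aliasing mechanism (the helper and variable names below are my own, hypothetical ones, not the module's) could look like this:

# Minimal sketch of a column-aliasing mechanism. The real module
# works differently internally; this only illustrates the idea.
my %real_names;       # alias => original descriptive name
my $alias_count = 0;

sub safe_column_name {
    my ($descriptive_name) = @_;
    my $alias = sprintf "fld%03d", ++$alias_count;
    $real_names{$alias} = $descriptive_name;
    return $alias;
}

# Emit "AS fld001" instead of "AS 'pers#consultant#m'", then use
# %real_names to restore the descriptive header when formatting.
my $alias = safe_column_name('pers#consultant#m');   # 'fld001'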

The resulting query follows. Don't be afraid. It isn't the shortest you can get, but it's far from being the longest one. Crosstab queries with the tiny sample database that ships with the module (9 records) can be up to 600 lines long.

$query = qq{
SELECT country, loc
,SUM(CASE WHEN dept = 'pers' AND category = 'consultant' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'pers#consultant#m'
,SUM(CASE WHEN dept = 'pers' AND category = 'consultant' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'pers#consultant#f'
,SUM(CASE WHEN dept = 'pers' AND category = 'contractor' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'pers#contractor#m'
,SUM(CASE WHEN dept = 'pers' AND category = 'contractor' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'pers#contractor#f'
,SUM(CASE WHEN dept = 'pers' AND category = 'employee' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'pers#employee#m'
,SUM(CASE WHEN dept = 'pers' AND category = 'employee' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'pers#employee#f'
,SUM(CASE WHEN dept = 'dev' AND category = 'consultant' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'dev#consultant#m'
,SUM(CASE WHEN dept = 'dev' AND category = 'consultant' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'dev#consultant#f'
,SUM(CASE WHEN dept = 'dev' AND category = 'contractor' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'dev#contractor#m'
,SUM(CASE WHEN dept = 'dev' AND category = 'contractor' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'dev#contractor#f'
,SUM(CASE WHEN dept = 'dev' AND category = 'employee' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'dev#employee#m'
,SUM(CASE WHEN dept = 'dev' AND category = 'employee' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'dev#employee#f'
,SUM(CASE WHEN dept = 'sales' AND category = 'consultant' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'sales#consultant#m'
,SUM(CASE WHEN dept = 'sales' AND category = 'consultant' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'sales#consultant#f'
,SUM(CASE WHEN dept = 'sales' AND category = 'contractor' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'sales#contractor#m'
,SUM(CASE WHEN dept = 'sales' AND category = 'contractor' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'sales#contractor#f'
,SUM(CASE WHEN dept = 'sales' AND category = 'employee' AND gender = 'm'
     THEN salary ELSE NULL END ) AS 'sales#employee#m'
,SUM(CASE WHEN dept = 'sales' AND category = 'employee' AND gender = 'f'
     THEN salary ELSE NULL END ) AS 'sales#employee#f'
,SUM(salary) AS total
FROM person
INNER JOIN depts ON (person.dept_id=depts.dept_id)
INNER JOIN categories ON (person.cat_id=categories.cat_id)
INNER JOIN locs ON (locs.loc_id=person.loc_id)
INNER JOIN countries ON (locs.country_id=countries.country_id)
GROUP BY country, loc
};

Trees and combinations

The simple example in the previous section can solve many cases. But if we expand the requirements to include column sub-totals, that simple paradigm can't carry out the task successfully. We need either to modify the engine to take sub-totals into account, or to modify the permutation algorithm so that it produces the sub-total combinations in the same data structure used for the normal processing.

I chose the second solution, and I implemented a second permutation function that produces a modified @permuted array.

@permuted = (
    ['pers', 'consultant', 'm'],
    ['pers', 'consultant', 'f'],
    ['pers', 'consultant'],
    ['pers', 'contractor', 'm'],
    ['pers', 'contractor', 'f'],
    ['pers', 'contractor'],
    ['pers', 'employee', 'm'],
    ['pers', 'employee', 'f'],
    ['pers', 'employee'],
    ['pers'],
    ['dev', 'consultant', 'm'],
    ['dev', 'consultant', 'f'],
    ['dev', 'consultant'],
    ['dev', 'contractor', 'm'],
    ['dev', 'contractor', 'f'],
    ['dev', 'contractor'],
    ['dev', 'employee', 'm'],
    ['dev', 'employee', 'f'],
    ['dev', 'employee'],
    ['dev'],
    ['sales', 'consultant', 'm'],
    ['sales', 'consultant', 'f'],
    ['sales', 'consultant'],
    ['sales', 'contractor', 'm'],
    ['sales', 'contractor', 'f'],
    ['sales', 'contractor'],
    ['sales', 'employee', 'm'],
    ['sales', 'employee', 'f'],
    ['sales', 'employee'],
    ['sales'],
);

There were several ways to create this modified array. Although I got some clever tips on how to write this function, in the end I decided to implement it with Tree::DAG_Node, mostly because I would need that module anyway to process the cross-tabulation output.
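For illustration, here is a minimal recursive sketch of such a function (my own hypothetical code, not the module's Tree::DAG_Node implementation) that produces the same shape of array:

# Permute the value lists, adding a shorter row after each group
# of full combinations to stand for that prefix's sub-total.
sub permute_with_subtotals {
    my ($first, @rest) = @_;
    # deepest level: plain values, no sub-total entries
    return map [$_], @$first unless @rest;
    my @result;
    for my $item (@$first) {
        # full combinations for this prefix ...
        push @result, map { [$item, @$_] } permute_with_subtotals(@rest);
        # ... followed by the sub-total row for the prefix itself
        push @result, [$item];
    }
    return @result;
}

my @permuted = permute_with_subtotals(@values);

The shorter rows (['pers', 'consultant'], ['pers'], and so on) flow through the same CASE-building loop unchanged: they simply generate fewer AND conditions, and therefore a broader aggregate.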

When passed to the same engine, this data structure produces a query with a sub-total for each level of headers.

A (brief) digression about database algorithms

A more difficult case is when a level depends on the previous one, such as "countries / towns" or "University faculties / subjects". If I were processing such columns with the same algorithm used so far, I would end up with things like "Mathematics / Greek Philosophy", "Languages / calculus", or "History / operating systems."

The difficult thing about database programming is that an algorithm can involve steps in the client interspersed with steps in the server.

If you are used to designing algorithms on the client side, dealing with this mixed nature can be frustrating, since you could try to solve on the client side what you should do on the server side. On the other hand, if you are familiar with database solutions, you could be tempted to do on the server side simple operations that should be treated in the client.

When you reach the right balance, you can create sensible algorithms. The trouble is, sometimes you become too confident in what you've done, especially if you have just created a good, clever-looking, and efficient paradigm. When this happens, it is harder to find a fault in your work, and when you do find one, your mind is clouded by your previous success, making it all the more difficult to solve your latest problem.

In my case, when I found out that my algorithm did not cover this particular aspect, I spent some time fantasizing about very complicated variations on what I had already done, with intermediate data structures and acrobatic manipulations thereof. Then I decided that I was doing something wrong, and I let the matter rest for a while. Finally, the clean solution came to me: whenever there is a dependency among levels, I let the database engine find the legal permutations. Simply querying for the DISTINCT values of all the involved columns returns all the valid combinations. I added this variation as an option, and now the module seems to cover every angle of the matter.
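Using the schema from the example query above, such a preparatory query could look like this (a sketch, not the module's internal code):

# Let the database return only the legal country/location pairs,
# instead of permuting them on the client and producing
# impossible combinations such as Italy/London.
my $pairs = $dbh->selectall_arrayref(
    qq{SELECT DISTINCT country, loc
       FROM locs
       INNER JOIN countries ON (locs.country_id = countries.country_id)}
) or die "can't get country/loc pairs\n";

# Each row in @$pairs is already a valid combination, ready for
# the CASE-building loop shown earlier.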

Expanding the engine

Creating the query was only half of the problem. Returning the data in a format suitable for human consumption is a closely related issue, and I designed the module to facilitate this important part.

The format issue is not addressed directly by DBIx::SQLCrosstab. To keep the contents separated from their appearance, I created a descendant of the first module, DBIx::SQLCrosstab::Format, which can do everything its parent can do, and can also create some good-looking reports, like the one shown here.

Area  country  location |       pers        |        sales         |      dev       |  T
                        |  empl     contr T |  empl contr  cons  T |  empl  cons  T |
                        |  f  m  T  m  T    |  m  T  m  T  f  T    |  m  T  f  T    |
N     Germany  Berlin   |  1  0  1  0  0  1 |  0  0  0  0  0  0  0 |  1  1  0  0  1 |  2
               Bonn     |  0  0  0  0  0  0 |  1  1  0  0  0  0  1 |  0  0  0  0  0 |  1
               Munich   |  0  1  1  0  0  1 |  0  0  0  0  1  1  1 |  0  0  0  0  0 |  2
               total    |  1  1  2  0  0  2 |  1  1  0  0  1  1  2 |  1  1  0  0  1 |  5
      UK       London   |  0  0  0  1  1  1 |  0  0  1  1  0  0  1 |  0  0  0  0  0 |  2
               total    |  0  0  0  1  1  1 |  0  0  1  1  0  0  1 |  0  0  0  0  0 |  2
      total    total    |  1  1  2  1  1  3 |  1  1  1  1  1  1  3 |  1  1  0  0  1 |  7
S     Italy    Rome     |  0  1  1  0  0  1 |  0  0  0  0  0  0  0 |  0  0  1  1  1 |  2
               total    |  0  1  1  0  0  1 |  0  0  0  0  0  0  0 |  0  0  1  1  1 |  2
      total    total    |  0  1  1  0  0  1 |  0  0  0  0  0  0  0 |  0  0  1  1  1 |  2
total total    total    |  1  2  3  1  1  4 |  1  1  1  1  1  1  3 |  1  1  1  1  2 |  9

This table was created by the as_html method, with the help of a few data structures that I designed to deal with general-purpose tree-like reports, such as this peculiar HTML table or a rich XML document.

Data structures for format description

The above table was created using these two data structures, each one describing a tree-like header, one at the top of the table and one at its left.

The first one, $header_formats, is a description of the headers at the top. Each element in this array reference describes one line of headers as an array of hashes. For example, the first line says that 'Area' will occupy one column but span down three rows. Armed with this information, it is very easy to build an HTML table header.

$header_formats = [
    [
        { name => 'Area',     colspan => 1, rowspan => 3 },
        { name => 'country',  colspan => 1, rowspan => 3 },
        { name => 'location', colspan => 1, rowspan => 3 },
        { name => 'pers',     colspan => 6, rowspan => 0 },
        { name => 'sales',    colspan => 7, rowspan => 0 },
        { name => 'dev',      colspan => 5, rowspan => 0 },
        { name => 'total',    colspan => 1, rowspan => 3 },
    ],
    [
        { name => 'employee',   colspan => 3, rowspan => 0 },
        { name => 'contractor', colspan => 2, rowspan => 0 },
        { name => 'total',      colspan => 1, rowspan => 2 },
        { name => 'employee',   colspan => 2, rowspan => 0 },
        { name => 'contractor', colspan => 2, rowspan => 0 },
        { name => 'consultant', colspan => 2, rowspan => 0 },
        { name => 'total',      colspan => 1, rowspan => 2 },
        { name => 'employee',   colspan => 2, rowspan => 0 },
        { name => 'consultant', colspan => 2, rowspan => 0 },
        { name => 'total',      colspan => 1, rowspan => 2 },
    ],
    [
        { name => 'f',     colspan => 1, rowspan => 1 },
        { name => 'm',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
        { name => 'm',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
        { name => 'm',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
        { name => 'm',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
        { name => 'f',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
        { name => 'm',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
        { name => 'f',     colspan => 1, rowspan => 1 },
        { name => 'total', colspan => 1, rowspan => 1 },
    ],
];
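For instance, a minimal sketch of how the header rows could be rendered (my own illustration, not the module's as_html code) might be:

# Each inner array of $header_formats becomes one <tr> of <th>
# cells. A rowspan of 0 in the structure is taken here to mean
# "no explicit span", so the sketch falls back to 1.
for my $row (@$header_formats) {
    print "<tr>";
    for my $cell (@$row) {
        printf '<th colspan="%d" rowspan="%d">%s</th>',
               $cell->{colspan}, $cell->{rowspan} || 1, $cell->{name};
    }
    print "</tr>\n";
}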

The second data structure, $recs_formats, describes the row headers in a different way. It is a hash whose values are hashes of arrays. The keys of the outer hash are the levels, i.e. the column numbers. The inner hash keys are the same values that will show up in the data.

This structure is supposed to be used in a destructive way:

Reading the records, at column [0] I get the value "N". I check whether there is an "N" key in the hashref under "0". If I find it and the value is defined, I shift the first available item off the arrayref. Something along the lines of:

# $level = 0; $value = 'N';
print '<td rowspan="',
      shift @{ $recs_formats->{$level}->{$value} },
      '">', $value, '</td>';

The real code is a tad more complex than this, but this gives you the idea.

$recs_formats = {
    0 => {
        'N'    => [ 7 ],
        'S'    => [ 3 ],
        'zzzz' => [ 1 ],
    },
    1 => {
        'Germany' => [ 4 ],
        'UK'      => [ 2 ],
        'Italy'   => [ 2 ],
        'zzzz'    => [ 1, 1, 1 ],
    },
    2 => {
        'Berlin' => [ 1 ],
        'Bonn'   => [ 1 ],
        'Munich' => [ 1 ],
        'London' => [ 1 ],
        'Rome'   => [ 1 ],
        'zzzz'   => [ 1, 1, 1, 1, 1, 1 ],
    },
};

Descending from Trees

These structures didn't come out of the sky: they were created from two trees, which can either produce the simpler structures or serve as the basis for more demanding reports. For example, the XML output was generated directly from the co-operation between these two trees.

You can see how these trees relate to the simpler data structures shown before. Generating an array of hashes is a straightforward application of tree-traversal functions: with a post-order traversal you get the column-span values, and with a pre-order traversal you push the data into the array of hashes. A similar treatment applies to the record descriptor. (A small sketch of the traversal follows the trees below.)

-- xtab { colspan => 22 }
|-- Area { colspan => 1 rowspan => 3 }
|-- country { colspan => 1 rowspan => 3 }
|-- location { colspan => 1 rowspan => 3 }
|-- pers { colspan => 6 }
|   |- employee { colspan => 3 }
|   |  |- f { colspan => 1 rowspan => 1 }
|   |  |- m { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   |- contractor { colspan => 2 }
|   |  |- m { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   +- total { colspan => 1 rowspan => 2 }
|-- sales { colspan => 7 }
|   |- employee { colspan => 2 }
|   |  |- m { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   |- contractor { colspan => 2 }
|   |  |- m { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   |- consultant { colspan => 2 }
|   |  |- f { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   +- total { colspan => 1 rowspan => 2 }
|-- dev { colspan => 5 }
|   |- employee { colspan => 2 }
|   |  |- m { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   |- consultant { colspan => 2 }
|   |  |- f { colspan => 1 rowspan => 1 }
|   |  +- total { colspan => 1 rowspan => 1 }
|   +- total { colspan => 1 rowspan => 2 }
+-- total { colspan => 1 rowspan => 3 }

-- recs { rowspan => 11 }
|-- N { rowspan => 7 }
|   |- Germany { rowspan => 4 }
|   |  |- Berlin { rowspan => 1 }
|   |  |- Bonn { rowspan => 1 }
|   |  |- Munich { rowspan => 1 }
|   |  +- zzzz { rowspan => 1 }
|   |- UK { rowspan => 2 }
|   |  |- London { rowspan => 1 }
|   |  +- zzzz { rowspan => 1 }
|   +- zzzz { rowspan => 1 }
|      +- zzzz { rowspan => 1 }
|-- S { rowspan => 3 }
|   |- Italy { rowspan => 2 }
|   |  |- Rome { rowspan => 1 }
|   |  +- zzzz { rowspan => 1 }
|   +- zzzz { rowspan => 1 }
|      +- zzzz { rowspan => 1 }
+-- zzzz { rowspan => 1 }
    +- zzzz { rowspan => 1 }
       +- zzzz { rowspan => 1 }
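As an illustration of the post-order pass, here is a hypothetical sketch using Tree::DAG_Node (not the module's actual code), assuming $header_tree holds the root of the 'xtab' tree above:

use Tree::DAG_Node;

# Compute each node's colspan bottom-up: a leaf counts as one
# column, an inner node spans the sum of its daughters' spans.
# Tree::DAG_Node's attributes() returns a hashref that serves
# as per-node scratch space.
sub compute_colspan {
    my ($node)    = @_;
    my @daughters = $node->daughters;
    my $span      = 0;
    if (@daughters) {
        $span += compute_colspan($_) for @daughters;
    }
    else {
        $span = 1;
    }
    $node->attributes->{colspan} = $span;
    return $span;
}

compute_colspan($header_tree);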

The same features are useful for creating an XML document: a pre-order traversal creates the opening tags, while a post-order traversal creates the closing tags. I preferred this approach over XML::Simple because this special structure can also carry information about the data from before the DBMS processing. The data set reports 'pers' and 'f', but the module's internal memory knows that 'pers' is a 'department' and 'f' is a 'gender'. The resulting XML is enriched by this knowledge, which would be lost in a simple conversion.
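With Tree::DAG_Node this pairing comes almost for free, since walk_down accepts both a pre-order and a post-order callback. A bare-bones sketch (my own illustration; the real output also carries the enriching attributes mentioned above):

# walk_down fires `callback` on the way down (pre-order, opening
# tags) and `callbackback` on the way back up (post-order,
# closing tags). The callback must return true to keep descending.
$header_tree->walk_down({
    callback     => sub { my ($node) = @_; print '<',  $node->name, ">\n"; 1 },
    callbackback => sub { my ($node) = @_; print '</', $node->name, ">\n"; 1 },
});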

Try it!

Where does all this talk lead?

The module is designed for expansion, and you can extend it to suit your needs if it doesn't meet them now. You could create a new format, or adapt existing formats to accept cross tabulations. You could integrate crosstabs into graphical interfaces, statistical packages, decision-making applications, or whatever takes your fancy today.

It is not that hard. The above structures are easily available to a descendant of DBIx::SQLCrosstab::Format. Your derived object should be used like the ones in the official examples until you get the records. After that, the private method _find_headers() will create the data structures described in this node. From that point on, it's all yours.

Enjoy.


Thanks to tye, who adjusted PM allowed tags and made the formatting of this node much easier.

 _  _ _  _  
(_|| | |(_|><
 _|