Architectural question...

devnul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I've started to hit a wall with SQL and sort-of had this idea of moving a certain piece of functionality out of the SQL query and into something which operates more quickly. Speed is vital here, as I will explain.

First of all, what I am doing involves running a query against a data table and joining/grouping that result by what is stored in a category table. Thus giving me a count of all the items in the data table which fall into a given category.

Something like:

category|count
--------|-----
a       |22
a.b     |10
a.b.a   |2
a.b.a.a |1
a.b.a.b |1
a.b.b   |8
a.b.b.a |6
a.b.b.b |2
a.c     |12
a.c.a   |4
a.c.a.a |2
a.c.a.b |2
a.c.b   |8
a.c.b.a |6
a.c.b.b |2
[download]

.. I just typed this, it is not real data.. Hopefully I got all the math right and things add up correctly.

While doing this sort of query works well for smaller tables, the data has grown dramatically over the past several months and I now have close to 100,000 rows in this table. On my fast production systems, performance is "adequate" (2 or 3 seconds). On my incredibly slow development system it is unacceptable (30+ seconds). What I am trying to achieve is "adequate" performance on my development system.

One way I thought about doing this was to have a separate process/server running that when fed a list of IDs would look up each of them in its cache, requerying the database for any which have grown stale/etc.

The first order of business was to determine if this mountain can be moved. My test script is:

use strict;
use Data::Dumper;
use Time::HiRes qw(time);
print STDERR "Building cats\n";
my %cats = ();
foreach my $one ('a'..'d')
{
  $cats{"a.$one"} = "a.$one";
  foreach my $two ('a'..'z')
  {
    $cats{"a.$one.$two"} = "a.$one.$two";
    foreach my $three ('a'..'z')
    {
      $cats{"a.$one.$two.$three"} = "a.$one.$two.$three";
    }
  }
}
my @keys = sort keys %cats;
my $cnt = scalar @keys;
print STDERR "Building array from $cnt possibilities\n";
my $n = 100000;
my %ids = ();
foreach my $id (0..$n)
{
  my $rand = int(rand($cnt));
  $ids{$id} = $cats{$keys[$rand]};
}
@keys = keys %ids;
$cnt = scalar @keys;
my %ret = ();
my $x = 100000;
my $start_time = time();
print STDERR "Returning mapped categories for values from a pool of $c
+nt\n";
foreach my $id (0..$x)
{
  my $cat = $ids{$id};
  if(!$ret{$ids{$id}}{'count'})
  {
    $ret{$ids{$id}}{'count'} = 0;
    $ret{$ids{$id}}{'head_count'} = 'head_count';
    $ret{$ids{$id}}{'cat_count'} = 'cat_count';
    $ret{$ids{$id}}{'subcat_count'} = 'subcat_count';
  }
  $ret{$ids{$id}}{'count'}++;
}
@keys = keys %ret;
$cnt = scalar @keys;
my $end_time = time();
my $total_time = $end_time - $start_time;
print STDERR "Built $cnt entries to start with, $total_time seconds\n"
+;
while(my($key,$val) = each(%ret))
{
  my @parts = split(/\./, $key);
  my $id = '';
  foreach my $add (@parts)
  {
    $id .= $add;
    if($id eq $key)
    {
      next;
    }
    if(!$ret{$id}{'count'})
    {
      $ret{$id}{'count'} = 0;
      $ret{$id}{'head_count'} = 'head_count';
      $ret{$id}{'cat_count'} = 'cat_count';
      $ret{$id}{'subcat_count'} = 'subcat_count';
    }
    $ret{$id}{'count'} += $ret{$key}{'count'};
    $id .= '.';
  }

}
$end_time = time();
$total_time = $end_time - $start_time;
my @sorted_keys = sort keys(%ret);
my $tot = $ret{'a'}{'count'};
print STDERR "Completed in $total_time seconds ($tot)\n";
foreach my $key (@sorted_keys)
{
  my $cnt = $ret{$key}{'count'};
    #print "KEY: $key $cnt\n";
}
print STDERR "Sleeping...\n";
sleep(1000);
[download]

Most of this script is just being used to setup the data/cache for later use. The real work I am trying to measure begins with the "Returning mapped categories for values..." and ends at the "Completed in $total_time" message.

As you can probably tell from the code I am setting up 2812 categories and testing how quickly I can map 100,001 of them.

The last sleep is just so I can look at the process and guage how much memory it is using.

So with all this being said, I get a run time of about 2.1 seconds on my development system. Do you see any obvious ways that this section of the code could be improved, considering the importance of raw performance here?

My second question is... The recipient of what this server would output would actually be a PHP script. I wrote a PHP version of this same test script, but the performance was much worse (3.6 seconds). What might be a quick and efficient means to connect this up to PHP/Apache?

.... Of course, I am very open to any other possible solutions to this problem, this is just one I thought of tonight, but I'm sure there must be a better way...

Thank you for your help and consideration!

- dEvNuL

Comment on Architectural question... Select or Download Code

Replies are listed 'Best First'.
Re: Architectural question... by BrowserUk (Patriarch) on Jul 05, 2005 at 09:13 UTC
Could you fix your code. The while loop doesn't terminate and as the indenting has gone ary somewhere in posting, it's not immediately obvious where the missing '}' should be? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l]
Re^2: Architectural question... by devnul (Monk) on Jul 05, 2005 at 09:18 UTC
I re-pasted it, please let me know if it still isn't working..... ... If you cut and paste the code from here you may need to make 2 edits on line 33 and 52 because those lines wrap and have + symbols on them... - dEvNuL	[reply]
Re^3: Architectural question... by BrowserUk (Patriarch) on Jul 05, 2005 at 10:58 UTC
I don't get this code? You take the keys in turn and break them into their constituant parts. `while(my($key,$val) = each(%ret)) { my @parts = split(/\./, $key); my $id = ''; foreach my $add (@parts) {` [download] Then you iterate over those parts and concatenate them back together, piece by piece, into $id. `$id .= $add;` [download] And then, if $id (so far) doesn't match $key, you skip to the next iteration `if($id eq $key) { next; }` [download] Which means you'll never reach this code, until you've completely re-built $id to match $key. `if(!$ret{$id}{'count'}) { $ret{$id}{'count'} = 0; $ret{$id}{'head_count'} = 'head_count'; $ret{$id}{'cat_count'} = 'cat_count'; $ret{$id}{'subcat_count'} = 'subcat_count'; } $ret{$id}{'count'} += $ret{$key}{'count'}; $id .= '.'; } }` [download] Which, unless I'm missing something (quite possible), there is no need for split or the for loop as this would do the same thing: `while(my($key,$val) = each(%ret)) { $id = $key; if(!$ret{$id}{'count'}) { $ret{$id}{'count'} = 0; $ret{$id}{'head_count'} = 'head_count'; $ret{$id}{'cat_count'} = 'cat_count'; $ret{$id}{'subcat_count'} = 'subcat_count'; } ## Though this doesn't do much? $ret{$id}{'count'} += $ret{$key}{'count'}; }` [download] What did I miss? BTW. It won't help much towards your performance goal, but you can save a bit of memory by changing: `my @keys = sort keys %cats; my $cnt = scalar @keys; ... @keys = keys %ids; $cnt = scalar @keys; ... @keys = keys %ret; $cnt = scalar @keys;` [download] to `my $cnt = keys %cat; ... $cnt = keys %ids; ... $cnt = keys %ret;` [download] which does the same thing without copying all the keys to an array or sorting one set of them for no reason. Actually, if your low on memory, avoiding that sort and the (re-)allocation of those arrays might help your performance. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l] [select]
Re^4: Architectural question... by devnul (Monk) on Jul 05, 2005 at 17:50 UTC
Re: Architectural question... by raafschild (Novice) on Jul 05, 2005 at 14:05 UTC
I think you're comparing apples and oranges. The SQL program reads data from disk and groups that, the perl programs groups data that is already in memory. When you put the perl program in production it should also read the data from the database. Do I understand correctly that all that you do is something like: `select a,b,c,d,count(*) from x group by a,b,c,d` [download] Or do you also sum subgroups? Can you please show the SQL query? Raafschild	[reply] [d/l]
Re^2: Architectural question... by devnul (Monk) on Jul 05, 2005 at 18:02 UTC
Yes... The problem is my orange is not performing well, so I'm switching to an apple. The question is: which apple should I use? The code, in this case, does (or rather will, if I go forward with it) indeed read from the database as that is how it will populate the data structure I am using. The point is, this "server" would be persistent. So each request would only need to execute the code I am timing (and to fetch any data not yet in its cache or which is "dirty". I wanted to avoid getting into the SQL specifics, because I use some Postgres-specific stuff which is not immediately apparent. My query is something like: `SELECT count(*) as count,category.category,nlevel(category.category) A +S level, subpath(category.category,0,nlevel(category.category)-1) as parent, category.head_title, category.cat_title, category.subcat_title FROM da +ta, category WHERE data.category <@ category.category GROUP BY category.category, category.head_title, category.cat_title, c +ategory.subcat_title` [download] The category table looks like: `Table "category" Column \| Type --------------+------------------- head_title \| character varying cat_title \| character varying subcat_title \| character varying category \| ltree` [download] The data table looks like: `Table "data" Column \| Type + +---------------------------------------------- id \| integer category \| ltree[]` [download] ... there are other columns here but these are the ones relevant to the query. - dEvNuL	[reply] [d/l] [select]
Re^3: Architectural question... by raafschild (Novice) on Jul 05, 2005 at 21:04 UTC
This certainly is a fruit of a different kind ;-). You might want to store the results of the query in a summary table and have that table maintained by a set of triggers on the data table. That way the persistence and the caching is handled by the database, so you don't have to write it yourself. Queries on this summary table should by quick, and all changes in the data table are immediately available in the summary table. Sorry for a not very perlish suggestion in this forum. Raafschild	[reply]
Re^4: Architectural question... by devnul (Monk) on Jul 06, 2005 at 02:30 UTC
Re: Architectural question... by InfiniteLoop (Hermit) on Jul 05, 2005 at 14:12 UTC
Try using memcached	[reply]
Re^2: Architectural question... by devnul (Monk) on Jul 05, 2005 at 18:12 UTC
This is a fascinating tool. I'm going to need to study this for a while.. :) Might you know what happens if I start a process with say "200mb" of memory and I try to stuff 300mb into it? Does the call to "set" fail? - dEvNuL	[reply]
Re: Architectural question... by cbrandtbuffalo (Deacon) on Jul 05, 2005 at 17:40 UTC
You don't mention the database you are running on, but you might consider tuning your SQL too. You say you've 'hit a wall' so maybe you've already done all you can. However, if you haven't, some databases have automatic tuning tools and tuned SQL that can really improve performance. For example, there is a tuner in the TOAD package (not free) that you can run against Oracle. You can also get help from SQL gurus. There are often tips for each database that dramatically improve performance by moving things around a bit. If you've truly done all you can on the tuning front, then please disregard. :)	[reply]
Re^2: Architectural question... by devnul (Monk) on Jul 05, 2005 at 18:05 UTC
Thank you for the timely comment.. :) The database is Postgres using the ltree extension. I've done all I can in the tuning side for now, I've fooled with various ways of doing the query, but its just a matter of the type of work to be done. - dEvNuL	[reply]
Re^3: Architectural question... by nevyn (Monk) on Jul 06, 2005 at 19:37 UTC
But, as you said, you aren't doing the same kind of work in perl ... so it's likely that if you do the work differently in SQL it'll be faster. I'd put money on postgresql vs. perl for calculating a table of numbers from another set of numbers. Saying that it matters a _lot_ what version of postgresql you are using, so an upgrade to postgresql-8.0 could provide the speedup you need ... and an increase in other parts :). -- James Antill	[reply]
Re^3: Architectural question... by devnul (Monk) on Jul 06, 2005 at 22:49 UTC
I am runing postgres 7.4. I don't see much on the Postgres site about performance improvements, except for the somewhat vague statement "This release has a more intelligent buffer replacement strategy, which will make better use of available shared buffers and improve performance.". The situation (on my development box) is one of SEVERE disk thrashing. Just doing a "select count(category),category from data" and I can hear the disk drive sounding like it is going to take off. While I agree that Postgres can summarize rows faster then I could ever hope to, it lacks the sophisticated cache configurations that are possible with other RDBMS (notably Sybase or Informix, which if I could use either of those here could probably make this problem a non-issue). That being said, I will give upgrading Postgres a shot, but I am not optimistic as it still relies on the operating system to cache files and thus there is no way to ensure certain tables have a higher caching priority. - dEvNuL	[reply]
Re: Architectural question... by aquarium (Curate) on Jul 07, 2005 at 20:45 UTC
it looks like you are doing strange joins with the SQL. rather than using strange-and-wonderful postgres specific functions, i would re-evaluate the design of the database....as from a cursory glance at the sql, you seem to be doing things of a hierarchial/recursive nature. not something relational databases are designed for. the hardest line to type correctly is: stty erase ^H	[reply]
Re^2: Architectural question... by devnul (Monk) on Jul 07, 2005 at 22:43 UTC
Dealing with heirarchal data is precisely what ltree was designed to do (using GiST indexes). I do tend to agree that this is not ideally done in a relational database, which is the purpose of this post. I'm trying to find a different/better solution but so far am coming up empty. :( - dEvNuL	[reply]