Re^3: Time series normalization

by jrsimmon (Hermit)
on Jul 16, 2009 at 16:34 UTC ( [id://780769] )


in reply to Re^2: Time series normalization
in thread Time series normalization

Perhaps I'm not understanding, but inserting undefs for missing values doesn't seem that bad. To ensure a value for each host/timestamp pair, you have two options:

  • Use a JOIN in your SQL to ensure you get a row for every timestamp, regardless of whether each host has a value for it
  • Populate a HoH with your data, using the timestamp as the first key and each hostX as your subhash key.
The hash approach seems like it would be the most efficient to me, but that depends on the amount of data, the efficiency of the database, etc. If you did go that route, this is what your data structure would look like:

$VAR1 = '08:03';
$VAR2 = {
          'host3' => 4,
          'host2' => 4,
          'host1' => 4
        };
$VAR3 = '08:02';
$VAR4 = {
          'host3' => undef,
          'host2' => 3,
          'host1' => 3
        };
$VAR5 = '08:01';
$VAR6 = {
          'host3' => undef,
          'host2' => 2,
          'host1' => 2
        };
$VAR7 = '08:00';
$VAR8 = {
          'host3' => 1,
          'host2' => undef,
          'host1' => 1
        };
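A minimal sketch of how you might populate that HoH, assuming the query returns (timestamp, host, value) rows; the @rows array and host list below are illustrative stand-ins for the real database fetch:

use strict;
use warnings;
use Data::Dumper;

# Illustrative input: (timestamp, host, value) triples, e.g. as fetched
# from the database. Real code would read these from a DBI statement handle.
my @rows = (
    [ '08:00', 'host1', 1 ], [ '08:00', 'host3', 1 ],
    [ '08:01', 'host1', 2 ], [ '08:01', 'host2', 2 ],
    [ '08:02', 'host1', 3 ], [ '08:02', 'host2', 3 ],
    [ '08:03', 'host1', 4 ], [ '08:03', 'host2', 4 ], [ '08:03', 'host3', 4 ],
);
my @hosts = qw(host1 host2 host3);

# First key is the timestamp, subhash key is the host.
my %data;
for my $row (@rows) {
    my ( $ts, $host, $value ) = @$row;
    $data{$ts}{$host} = $value;
}

# Backfill undef for any host that has no sample at a given timestamp.
for my $ts ( keys %data ) {
    for my $host (@hosts) {
        $data{$ts}{$host} = undef unless exists $data{$ts}{$host};
    }
}

print Dumper(%data);    # produces $VAR1/$VAR2 pairs like the output above

Sorting keys %data then gives one row per interval in timestamp order, with undef wherever a host had no sample.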

Re^4: Time series normalization
by 0xbeef (Hermit) on Jul 16, 2009 at 18:37 UTC
    I am using your method (except for formatting differences as per what GD::Graph wants) for daily graphs, but the graphs I am referring to here are for a long-term trend per managed system.

    Each managed system could easily contain 20 logical partitions, and for a 3-month trend that could be about 3500-5000 values per LPAR.

    Using the "select just the 100% common times" method takes about 30-odd seconds to produce such a graph for nearly 20 members of the managed system, and getting even that result took quite a bit of SQLite3 tuning.
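    For reference, a hedged sketch of what such a "100% common times" selection might look like in SQLite via DBI; the samples(ts, host, value) schema and file name are assumptions, since the actual query isn't shown in this thread:

use strict;
use warnings;
use DBI;

# Hypothetical schema: samples(ts, host, value). Keep only timestamps for
# which every host in the managed system has a sample.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=stats.db', '', '',
    { RaiseError => 1 } );

my $n_hosts = $dbh->selectrow_array('SELECT COUNT(DISTINCT host) FROM samples');

my $rows = $dbh->selectall_arrayref( <<'SQL', undef, $n_hosts );
SELECT ts, host, value
FROM   samples
WHERE  ts IN (
         SELECT ts
         FROM   samples
         GROUP  BY ts
         HAVING COUNT(DISTINCT host) = ?
       )
ORDER  BY ts, host
SQL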

    The really big problem with just inserting undefs would be the number of samples. If even just one server were set to gather stats at a short interval, every other server running at a different interval would have to include extra empty values for each of those timestamps.

    I would therefore be inclined to discard the times for which fewer than x% of hosts have values, but is this the best solution? Since I have the number of samples per data-series, is there no way to fit each data-series between a start and end time using some sort of approximation or mathematical transform?
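    One way to read that last idea is to resample every series onto a common time grid by linear interpolation before graphing; a minimal, purely illustrative sketch (the resample() helper, grid step and sample data are assumptions, not anything from this thread):

use strict;
use warnings;

# Resample one series of [time, value] pairs (sorted by time, e.g. seconds
# since epoch) onto a common grid by linear interpolation.
sub resample {
    my ( $series, $start, $end, $step ) = @_;
    my @out;
    my $i = 0;
    for ( my $t = $start ; $t <= $end ; $t += $step ) {
        # Advance to the segment that brackets $t.
        $i++ while $i < $#$series && $series->[ $i + 1 ][0] < $t;
        my ( $t0, $v0 ) = @{ $series->[$i] };
        my ( $t1, $v1 ) = @{ $series->[ $i + 1 ] // $series->[$i] };
        my $v =
            $t1 == $t0
          ? $v0
          : $v0 + ( $v1 - $v0 ) * ( $t - $t0 ) / ( $t1 - $t0 );
        push @out, [ $t, $v ];
    }
    return \@out;
}

# Example: a 5-minute series and a 1-minute series, both mapped to 60s steps
# over the same start/end window, so they line up point for point.
my $coarse = [ [ 0, 10 ], [ 300, 20 ], [ 600, 10 ] ];
my $fine   = [ map { [ $_ * 60, $_ ] } 0 .. 10 ];
my $grid_a = resample( $coarse, 0, 600, 60 );
my $grid_b = resample( $fine,   0, 600, 60 );
printf "t=%3d  a=%.1f  b=%.1f\n",
    $grid_a->[$_][0], $grid_a->[$_][1], $grid_b->[$_][1]
    for 0 .. $#$grid_a;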

    Niel

      I'm still not convinced that you're not making the problem harder than it should be. A select that has to compare all entries for one key vs all entries of every other key, even if the db is indexed by that key, is fairly intensive. It doesn't surprise me that it required some tuning to get the time down. Simply populating a hash, though, with the values you get and undef as placeholders, should be quite efficient.

      That said, you might check Chart::Plot to see if it will do what you want. It does not require uniform-length data sets, per the docs.
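      A rough sketch of how two series of different lengths might be handed to Chart::Plot, based on its synopsis; the image size, style strings and output handling are assumptions to verify against the module's documentation:

use strict;
use warnings;
use Chart::Plot;

# Two series with different sample counts; each setData() call is an
# independent data set, so the x arrays do not need to line up.
my @x1 = ( 0, 60, 120, 180, 240 );
my @y1 = ( 1, 2, 3, 4, 5 );
my @x2 = ( 0, 300 );
my @y2 = ( 2, 4 );

my $img = Chart::Plot->new( 640, 480 );
$img->setData( \@x1, \@y1, 'blue solidline' )        or die $img->error();
$img->setData( \@x2, \@y2, 'red dashedline points' ) or die $img->error();

# draw() returns the rendered image data (format depends on the local GD).
open my $out, '>', 'trend.png' or die $!;
binmode $out;
print {$out} $img->draw();
close $out;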
