
Time series normalization

by 0xbeef (Hermit)
on Jul 16, 2009 at 14:29 UTC ( #780721=perlquestion: print w/replies, xml ) Need Help??

0xbeef has asked for the wisdom of the Perl Monks concerning the following question:


Be warned in advance, I have no education in statistics and might be using the wrong terminology... :)

I store system performance metrics as a (epoch) time-series in a SQLite database which may have missing data-points at a given time. It could be simplified like this:

Time   Val1  Val2  Val3
08:00  11    70    0
08:20  22          10
08:40  56    80    25

I'm using GD::Graph to plot each individual data-set, and in some cases along with the summed values in a line graph. The problem is that just feeding the Val2 series as-is to GD will result in a shorter line graph over time than the Val1 series, with the actual values not corresponding to the correct time on the x-axis.

The way I have it now is a complex SELECT that fetches values for just the common time points and feeds those to GD, but it is quite inaccurate, and slow due to the number of data-points involved. So some sort of mathematical transformation could probably work better (a pity I know very little about them).

I'm guessing I need to normalize each data-set across the overall time-range, and do a linear interpolation for each missing value, but the normalization is a bit beyond me.
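To illustrate, this is roughly what I think the interpolation would need to do. It's only a sketch with made-up names (%series, @grid); I haven't wired it up to my real schema:

```perl
use strict;
use warnings;

# interpolate(\%series, \@grid) -> arrayref of values aligned to @grid.
# %series maps epoch time => value, with some times missing;
# @grid is the full, sorted list of times we want to plot.
sub interpolate {
    my ($series, $grid) = @_;
    my @known = sort { $a <=> $b } keys %$series;
    my @out;
    for my $t (@$grid) {
        if (exists $series->{$t}) {
            push @out, $series->{$t};
            next;
        }
        # nearest known points below and above $t
        my ($lo) = grep { $_ < $t } reverse @known;
        my ($hi) = grep { $_ > $t } @known;
        if (defined $lo && defined $hi) {
            my $frac = ($t - $lo) / ($hi - $lo);
            push @out, $series->{$lo}
                     + $frac * ($series->{$hi} - $series->{$lo});
        }
        else {
            push @out, undef;   # don't extrapolate past either end
        }
    }
    return \@out;
}
```

The undef at the ends matters: GD::Graph simply skips undef points, so a series that starts late just starts late on the graph instead of being stretched.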

It seems Data::TimeSeries could do something like this, but my data times could be as granular as a few minutes and this module only seems to support HOURS as a period.

I've read a bit about RRDTool and it sounds like it might be a great alternative to using a database altogether, especially for reducing disk space usage, but rewriting my code seems more involved than I'd prefer right now.

Does anyone perhaps know of another fairly efficient way to normalise data like this?


Replies are listed 'Best First'.
Re: Time series normalization
by jrsimmon (Hermit) on Jul 16, 2009 at 14:46 UTC

    From the GD::Graph documentation:

    If you don't have a value for a point in a certain dataset, you can use undef, and the point will be skipped.

    Are you setting the empty value to undef before plotting it?
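    For example, here's a sketch of aligning a sparse series onto the full timestamp list before plotting (the names @times and %val2_by_time are made up), so every series ends up with one entry, possibly undef, per x-axis point:

    ```perl
    use strict;
    use warnings;

    my @times        = ('08:00', '08:20', '08:40');
    my %val2_by_time = ('08:00' => 70, '08:40' => 80);   # 08:20 missing

    # one value (or undef) per timestamp, in x-axis order
    my @val2 = map { exists $val2_by_time{$_} ? $val2_by_time{$_} : undef }
               @times;

    # now ready for GD::Graph: $graph->plot([ \@times, \@val2 ]);
    ```

    GD::Graph will leave a gap at 08:20 instead of compressing the series to the left.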

      Alas, things are not as simple as this:

      My usual objective is to produce line graphs for PCPU - a metric that stores the physical CPUs consumed by a server over a given time interval. I would also need to plot the sum of pcpu values for all of the servers on the same graph. The data-series array is acquired similarly to this:

      $common_times = "SELECT timestamp FROM pcpu
                       WHERE (nethost IN ($host_list))
                       GROUP BY timestamp
                       HAVING (count(*) >= $percent_overlap_required)";
      $result = $$dbh->selectall_arrayref(
          "SELECT host,timestamp,val FROM pcpu
           WHERE (host IN ($host_list) AND timestamp IN ($common_times))
           GROUP BY nethost,timestamp;"
      );

      So host1 result could contain time/value pairs for 08:01, 08:02, 08:03, 08:04 while host2 result might contain 08:00, 08:01 and 08:04. There are no undefined values to begin with, and to interpolate the missing times with undefined values at the SQL level would very likely make the SELECT even more expensive. And make my head hurt while trying to understand how to do it... :)

      I could also easily ensure that there would never be any undef value if I set $percent_overlap_required to 100, but that is unfortunately not realistic in terms of how things work - a server might be offline, or have a slightly different time than another, and as more servers are involved the greater the probability of missing a metric.

      After the SELECT, I could build a list that contains just those times/values in common on 80% of the hosts, but there are so many values involved that memory becomes a concern.

      My gut feel is that the way forward would involve normalizing the values, but I am struggling to understand how this is done given what I have to start out with.


        Perhaps I'm not understanding, but inserting undefs for missing values doesn't seem that bad. To ensure a value for each host/timestamp pair, you have two options:

        • Use a JOIN statement in your SQL to ensure you get a row for each timestamp regardless of other values
        • Populate a HoH with your data, using the timestamp as the first key and each hostX as your subhash key.
        The hash approach seems like it would be the most efficient to me, but that would depend on the amount of data, efficiency of the database, etc. If you did go that route, this is what your data structure would look like:

        $VAR1 = '08:03';
        $VAR2 = { 'host3' => 4,     'host2' => 4,     'host1' => 4 };
        $VAR3 = '08:02';
        $VAR4 = { 'host3' => undef, 'host2' => 3,     'host1' => 3 };
        $VAR5 = '08:01';
        $VAR6 = { 'host3' => undef, 'host2' => 2,     'host1' => 2 };
        $VAR7 = '08:00';
        $VAR8 = { 'host3' => 1,     'host2' => undef, 'host1' => 1 };
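        A minimal sketch of building that HoH from the selectall_arrayref rows and then flattening it into per-host arrays for GD::Graph. The sample rows are made up, and the row layout is assumed to be [host, timestamp, val]:

        ```perl
        use strict;
        use warnings;

        my $rows = [
            [ 'host1', '08:00', 1 ], [ 'host1', '08:01', 2 ],
            [ 'host2', '08:01', 2 ], [ 'host2', '08:02', 3 ],
        ];

        my (%by_time, %hosts);
        for my $r (@$rows) {
            my ($host, $ts, $val) = @$r;
            $by_time{$ts}{$host} = $val;
            $hosts{$host} = 1;
        }

        # flatten: one arrayref per host, aligned to the sorted
        # timestamp list, with undef wherever a host has no value
        my @times = sort keys %by_time;
        my %series;
        for my $host (sort keys %hosts) {
            $series{$host} = [ map { $by_time{$_}{$host} } @times ];
        }
        # e.g. plot with $graph->plot([ \@times, @series{sort keys %hosts} ]);
        ```

        Memory use is one hash entry per (timestamp, host) pair that actually has data, so sparse series don't cost anything for their gaps until the final flatten.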

Node Type: perlquestion [id://780721]
Approved by Corion