Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Re^2: Time series normalization

by 0xbeef (Hermit)
on Jul 16, 2009 at 16:04 UTC ( #780763=note: print w/replies, xml ) Need Help??

in reply to Re: Time series normalization
in thread Time series normalization

Alas, things are not as simple as this:

My objective commonly would be to produce line graphs for PCPU - a metric that stores the physical CPUs consumed by a server over a given time interval. I would also need to plot the sum of pcpu values for all of the servers on the same graph. The data-series array is acquired similar to this:

$common_times = "SELECT timestamp FROM pcpu WHERE (nethost IN ($host_l +ist)) GROUP BY timestamp HAVING (count(*) >= $percent_overlap_require +d)"; $result = $$dbh->selectall_arrayref("SELECT host,timestamp,val FROM pc +pu WHERE (host IN ($host_list) AND timestamp IN ($common_times)) GROU +P BY nethost,timestamp;");

So host1 result could contain time/value pairs for 08:01, 08:02, 08:03, 08:04 while host2 result might contain 08:00, 08:01 and 08:04. There are no undefined values to begin with, and to interpolate the missing times with undefined values at the SQL level would very likely make the SELECT even more expensive. And make my head hurt while trying to understand how to do it... :)

I could also easily ensure that there would never be any undef value if I set $percent_overlap_required to 100, but that is unfortunately not realistic in terms of how things work - a server might be offline, or have a slightly different time than another, and as more servers are involved the greater the probability of missing a metric.

After the SELECT, I could build a list that contains just those times/values in common on 80% of the hosts, but there are so many values involved that memory becomes a concern.

My gut feel is that the way forward would involve normalizing the values, but I am struggling to understand how this is done given what I have to start out with.


Replies are listed 'Best First'.
Re^3: Time series normalization
by jrsimmon (Hermit) on Jul 16, 2009 at 16:34 UTC

    Perhaps I'm not understanding, but inserting undefs for missing values doesn't seem that bad. To ensure a value for each host/timestamp pair, you have two options:

    • Use JOIN statement in your SQL to ensure you get a row for each timestamp regardless of other values
    • Populate a HoH with your data, using the timestamp as the first key and each hostX as your subhash key.
    The hash approach seems like it would be the most efficient to me, but that would depend on the amount of data, efficiency of the database, etc. If you did go that route, this is what your data structure would look like:

    $VAR1 = '08:03'; $VAR2 = { 'host3' => 4, 'host2' => 4, 'host1' => 4 }; $VAR3 = '08:02'; $VAR4 = { 'host3' => undef, 'host2' => 3, 'host1' => 3 }; $VAR5 = '08:01'; $VAR6 = { 'host3' => undef, 'host2' => 2, 'host1' => 2 }; $VAR7 = '08:00'; $VAR8 = { 'host3' => 1, 'host2' => undef, 'host1' => 1 };
      I am using your method (except for formatting differences as per what GD::Graph wants) for daily graphs, but the graphs I am referring to here is for a long term trend per managed system.

      Each managed system could easily contain 20 logical partitions, and for a 3 month trend it could be about 3500 - 5000 values per LPAR.

      Using the "select just the 100% common times" method takes about 30 odd seconds to do such a graph for nearly 20 members of the managed system, but to get this result took quite a bit of SQLite3 tuning.

      The really big problem with just inserting undefs would be the number of samples. If even just one server was set to gather stats at a short interval, each of the other servers involved running at a different interval would now have to include extra empty values.

      I would therefore be inclined to discard the times for which less than x % of hosts have values, but is this the best solution? Since I have the number of samples per data-series, is there no way to fit each data-series between a start and end time using some sort of approximation or mathematical transform?


        I'm still not convinced that you're not making the problem harder than it should be. A select that has to compare all entries for one key vs all entries of every other key, even if the db is indexed by that key, is fairly intensive. It doesn't surprise me that it required some tuning to get the time down. Simply populating a hash, though, with the values you get and undef as placeholders, should be quite efficient.

        That said, you might check Chart::Plot to see if it will do what you want. It does not require uniform length data sets, per the doc.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://780763]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (3)
As of 2022-05-22 11:38 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (80 votes). Check out past polls.