
Counting concurrent event jobs

by vagnerr (Prior)
on Apr 24, 2006 at 16:55 UTC ( #545325=perlquestion )

vagnerr has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks! ...

I am doing some performance analysis on some batch processing that we run (log processing). I need to determine how many concurrent jobs are running (and of what type) at set intervals over a day or month, based on a log of the start and finish times of each job. For argument's sake, let's say the log files are in the following format.
[<date>] <start|finish>: <command> <file>
for example
for example
    [Mon Apr 24 11:56:23 2006] start: split www1.log
    [Mon Apr 24 11:57:23 2006] start: filter www2.log
    [Mon Apr 24 12:50:23 2006] start: split www1.log
    [Mon Apr 24 13:59:23 2006] finish: filter www2.log
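Lines in that format can be picked apart with a short regex; here is a sketch (the `parse_line` name and field names are illustrative, not part of the format), using the core Time::Local module to turn the timestamp into epoch seconds:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Map month abbreviations to 0-based month numbers for timelocal().
my %MON;
@MON{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

# parse_line(): extract (epoch, action, command, file) from one log line,
# or return an empty list if the line doesn't match.
sub parse_line {
    my ($line) = @_;
    my ($mon, $mday, $h, $m, $s, $year, $action, $cmd, $file) =
        $line =~ /^\[\w+\s+(\w+)\s+(\d+)\s+(\d+):(\d+):(\d+)\s+(\d+)\]\s+(start|finish):\s+(\S+)\s+(\S+)/
        or return;
    return ( timelocal($s, $m, $h, $mday, $MON{$mon}, $year),
             $action, $cmd, $file );
}

my ($epoch, $action, $cmd, $file) =
    parse_line('[Mon Apr 24 11:56:23 2006] start: split www1.log');
print "$action $cmd $file\n";    # start split www1.log
```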
I need to be able to convert that to some sort of report along the lines of
    time,splits,filters,total
    11:55,0,0,0
    11:56,1,0,1
    11:57,1,1,2
    ...
    12:50,1,1,2
    12:51,0,1,1
    ...
    13:59,0,1,1
    14:00,0,0,0
The simplest solution would appear to be to create a nice big array, or hash. Each node would represent a 1-5 minute window (depending on the required granularity) referencing an array of the individual types. The logs would then be processed line by line. As we find a matching set of "start" and "finish" lines we update the counters for that type for all the time segments between the start and the finish times.
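That bucket-filling idea can be sketched roughly as below. The hard-coded event list, the 5-minute bucket size, and the variable names are illustrative only; real code would parse the events out of the logs instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $BUCKET = 300;    # seconds per time window (5-minute granularity)

# Illustrative pre-parsed events: [epoch, action, type, file]
my @events = (
    [ 100, 'start',  'split',  'www1.log' ],
    [ 160, 'start',  'filter', 'www2.log' ],
    [ 500, 'finish', 'split',  'www1.log' ],
    [ 800, 'finish', 'filter', 'www2.log' ],
);

my (%start_of, %count);
for my $e (@events) {
    my ($epoch, $action, $type, $file) = @$e;
    if ($action eq 'start') {
        # Remember when this job started, keyed on type + file.
        $start_of{"$type $file"} = $epoch;
    }
    elsif (defined( my $begin = delete $start_of{"$type $file"} )) {
        # Matching finish: bump the counter for every bucket spanned.
        $count{$_}{$type}++
            for int($begin / $BUCKET) .. int($epoch / $BUCKET);
    }
}

# Emit one CSV row per bucket: time,splits,filters,total
for my $b (sort { $a <=> $b } keys %count) {
    my $splits  = $count{$b}{'split'}  || 0;
    my $filters = $count{$b}{'filter'} || 0;
    printf "%02d:%02d,%d,%d,%d\n",
        $b * $BUCKET / 3600, $b * $BUCKET / 60 % 60,
        $splits, $filters, $splits + $filters;
}
```

The %count hash holds one entry per non-empty bucket, so its size grows with the covered time range times the number of job types, which is the "quite a large data structure" concern raised below.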

The problem as I see it with this solution is that, whilst it does work, it's a little messy and can create quite a large data structure. Does anyone have any suggestions on a better approach to the problem?


Remember that amateurs built Noah's Ark. Professionals built the Titanic.

Replies are listed 'Best First'.
Re: Counting concurrent event jobs
by gaal (Parson) on Apr 24, 2006 at 17:06 UTC
    If this is the level of detail you need, it looks like you don't have to correlate specific start and end events. So all you need is one counter per command type (e.g., "split" and "filter").

      In that case, your script would print out the results for each time interval and then go on to evaluate the next lines of the logfiles, like this:

      while (<>) {
          my ($timestring) = /$some_regexp/;        # capture the timestamp
          my $min = get_nr_of_mins($timestring);
          if (/start:/) {
              /split/  and ++$split;
              /filter/ and ++$filter;
          }
          elsif (/finish:/) {
              /split/  and --$split;
              /filter/ and --$filter;
          }
          if ($min % $granularity == 0) {
              print "$timestring: $split, $filter\n";
          }
      }

      (Note that this gives you the number of processes at the time of the last line read from the logfile, not an average value for the last interval.)

      OTOH, if you want to do more sophisticated analysis of the logfile, this approach might be too simple.

      Unfortunately that is not the case. The processing of these log files goes on for hours and involves hundreds of logs, each taking between a few minutes and an hour or two to run. We need to be able to graph the data (hence the CSV output) and see, for example, that we do a lot of split jobs at one time of day and a lot of filter jobs at another. We need to know because some of the jobs use a lot of CPU, others may use a lot of network bandwidth, and we want to be able to tune things to share the resources we have.

      Remember that amateurs built Noah's Ark. Professionals built the Titanic.
        That still doesn't seem to contradict gaal's approach, and only differs from mantadin's in choosing when and how to report. If I'm missing something, then tell me what is wrong with
        my $REPORT_INTERVAL = 300;    # seconds
        my %active = ( 'split' => 0, 'filter' => 0 );
        my $next_report = date_to_timestamp("...start of day...");
        my $last_report = date_to_timestamp("...end of day...");

        while (<>) {
            # Parse out the fields
            my ($date, $action, $jobtype, $logfile) = /.../;

            # Update current active job counts
            if ($action eq 'start') {
                ++$active{$jobtype};
            }
            elsif ($action eq 'finish') {
                --$active{$jobtype};
            }
            else {
                die "Huh? $_";
            }

            # Output counts for all report lines between the last
            # printed report and the time of this log line. Most of
            # the time, this will be empty because we won't have
            # reached the next report time yet.
            my $stamp = date_to_timestamp($date);
            while ($stamp > $next_report) {
                report_counts($next_report, \%active);
                $next_report += $REPORT_INTERVAL;
            }
        }

        # Finish off the report for the report periods at the end
        # of the reporting range.
        while ($next_report < $last_report) {
            report_counts($next_report, \%active);
            $next_report += $REPORT_INTERVAL;
        }

        Based on your proposed solution, it seems like you think that you have to correlate a finish event with the start event for that job -- but if all you want is the counts, then as gaal said, the correlation is unnecessary.

        If for some reason you do need to correlate them, then you can always keep all active jobs' state in the %active hash:

        ...
        if ($action eq 'start') {
            $active{$jobtype}{$logfile} = 1;
        }
        elsif ($action eq 'finish') {
            delete $active{$jobtype}{$logfile};
        }
        ...
        my $split_count = keys %{ $active{'split'} };
        ...

Node Type: perlquestion [id://545325]
Approved by marto