http://qs321.pair.com?node_id=267549

Earindil has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I don't believe it is possible to make the following code any worse. I've been hacking away at Perl for many years but am still less than an amateur.
Could you please take a look and suggest a better way of looping through the data (or avoiding the loop altogether)? In the worst case, I loop through several thousand lines of the same data 80+ times. Ugly.
This snippet of a larger script is used to draw web posts/reqs and opens for a large number of servers using the wonderful Ploticus graphing program.
@servers is a list of 60+ webservers
@DATA is web server data pulled from the DB in 5-minute increments

Each server in the DB is identified by a HOSTID, so I load %SERVER with all the HOSTIDs, e.g. $SERVER{WEB1} = 1234.
----

foreach $server (@servers) {
    chomp $server;
    open TMP, ">/tmp/$server.tmp";
    $ENV{"FILENAME"} = "/tmp/$server.tmp";

    # SUPER INEFFICIENT, looping through all of the data for each server.
    # Need to rethink this
    foreach (@DATA) {
        ($hostid, $time, $get, $post, $currconn) = split(/\t/);
        next if ($hostid ne $SERVER{$server});

        $time = ((int($time/300))*300)+10800;   # Add 3 hours for EST time
        ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($time);
        $year += 1900;
        $year  = substr($year,2);
        $mon++;
        $mon  = "0".$mon  if ($mon  < 10);
        $mday = "0".$mday if ($mday < 10);
        $hour = "0".$hour if ($hour < 10);
        $min  = "0".$min  if ($min  < 10);
        $time = "$mon/$mday/$year.$hour:$min";

        $total = $get + $post;
        print TMP "$time\t$server\t$get\t$post\t$total\t$currconn\n";

        $total_get{"$time"}  += $get;
        $total_post{"$time"} += $post;
        $total_conn{"$time"} += $currconn;
    }
    close TMP;

    if (!$ARGV[3]) {
        `$dir/plbin210/bin/pl $dir/WebHits.pol -o $output_dir/graphs/${server}_webhits.gif -gif title="$server Gets, Posts and Opens"`;
        `$dir/plbin210/bin/pl $dir/WebHits.pol -o $output_dir/graphs/${server}_Twebhits.gif -gif -scale 0.50 title="$server $hour:$min"`;
    }
}


Replies are listed 'Best First'.
Re: Super Duper Inefficient
by particle (Vicar) on Jun 20, 2003 at 13:13 UTC

    here's a clue:

    foreach $server (@servers) {
        chomp $server;
        open TMP, ">/tmp/$server.tmp";
        $ENV{"FILENAME"} = "/tmp/$server.tmp";

        foreach (@DATA) {
            ($hostid, $time, $get, $post, $currconn) = split(/\t/);
            next if ($hostid ne $SERVER{$server});
            ## more here...
        }
        close TMP;
    }

    this is the inefficiency. invert these loops to make this something like:

    ## iterate through record
    foreach (@DATA) {
        ($hostid, $time, $get, $post, $currconn) = split(/\t/);

        ## match against the server list and process accordingly
        foreach $server (@servers) {
            chomp $server;
            next unless $hostid eq $SERVER{$server};

            open TMP, ">>/tmp/$server.tmp";
            $ENV{"FILENAME"} = "/tmp/$server.tmp";
            ## more here...
            close TMP;

            ## done with this record, move on
            last;
        }
    }

    then you'll iterate over the data once, writing to multiple files depending on the $hostid variable. after each data record is processed, it moves on to the next line without evaluating any other servers in the list.

    if you haven't already, you might consider using warnings or -w. you might also consider using strict. it would take some work with this code, but it makes perl debugging so much easier.
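
    for instance, at the top of the script (just a two-line sketch of what i mean):

    use strict;
    use warnings;    ## or run the script with perl -w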

    ~Particle *accelerates*

      Would opening and closing files thousands of times be more of a hit than just 80 or so times?
      If generating 5-day graphs, @DATA would be approximately 288*5*80 lines.
      Would it perhaps be better to open all the files first before the loop and then close them all after?

        sure, you could open all files first, and store the filehandles in a hash, keyed by servername. depending on the size of the records, you could also process the data in one loop, store it in memory, and write it all out in a second loop. there are many ways to do it. you might try benchmarking a few to see what works best for you.
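
        a rough, untested sketch of the filehandle-hash idea (the reverse hostid-to-name map, the lexical filehandles, and the three-argument open are my additions, not anything from the original code):

        ## open every output file once, up front, keyed by server name
        my %fh;
        for my $server (@servers) {
            chomp $server;
            open $fh{$server}, '>', "/tmp/$server.tmp"
                or die "can't write /tmp/$server.tmp: $!";
        }

        ## reverse %SERVER so a record's hostid maps back to its server name
        my %host2server = reverse %SERVER;    ## e.g. 1234 => 'WEB1'

        for (@DATA) {
            my ($hostid, $time, $get, $post, $currconn) = split /\t/;
            my $server = $host2server{$hostid} or next;
            print { $fh{$server} } "...\n";    ## format the record here
        }

        close $_ for values %fh;

        that way each file is opened exactly once and the data is read exactly once.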

        ~Particle *accelerates*

Re: Super Duper Inefficient
by EvdB (Deacon) on Jun 20, 2003 at 13:10 UTC
    I note the line:
    next if ($hostid ne $SERVER{$server});
    Why not split @DATA into lots keyed by $hostid? Try:
    my %new_data = ();
    foreach ( @DATA ) {
        my ($hostid, $rest) = split /\t/, $_, 2;
        push @{$new_data{$hostid}}, "$hostid\t$rest";
    }
    Then, instead of running through all of @DATA for each server, you can run through just @{$new_data{$hostid}} once you have determined the $hostid.
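    For instance (a sketch only; the $SERVER{$server} name-to-HOSTID lookup is taken from the original post):

    foreach my $server ( @servers ) {
        chomp $server;
        my $hostid = $SERVER{$server};
        foreach ( @{ $new_data{$hostid} || [] } ) {
            # each element is still a "$hostid\t$time\t..." line,
            # so split and process it exactly as before
        }
    }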

    --tidiness is the memory loss of environmental mnemonics

      I was formulating this same response myself, but I'll provide a minor variation that does pretty much the same thing, using the automatic capture variables instead of temporary placeholder variables:
      my %new_data = ();
      foreach ( @DATA ) {
          /^(.*?)\t.*$/;
          push @{$new_data{$1}}, $_;
      }
      Either is equally valid in my eyes - I do not know which is faster, the split() or the //, so I'll leave that for better souls than mine to answer.

      The important thing I'd like to point out is that the above method - run as a step before the server loop - effectively reduces your redundancy to two reads: once to populate the hash, and then once for the server-specific loop.

Re: Super Duper Inefficient
by nite_man (Deacon) on Jun 20, 2003 at 13:42 UTC
    You can format the date an easier way using the strftime function from POSIX:
    use POSIX 'strftime';
    my $time = strftime '%m/%d/%Y %H:%M', localtime($time);
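    For example, the whole hand-rolled date block in the original loop could shrink to something like this (a sketch only; it keeps the original 300-second bucketing and the hard-coded three-hour EST offset, and uses %y for the two-digit year the original produced):

    use POSIX 'strftime';

    # round down to the 5-minute bucket and shift to EST, as in the original
    $time = int($time/300)*300 + 10800;

    # e.g. "06/20/03.13:05" - the same layout the manual zero-padding built
    $time = strftime '%m/%d/%y.%H:%M', localtime($time);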
          
    --------------------------------
    SV* sv_bless(SV* sv, HV* stash);
    
      That's great, thanks. I was already using POSIX earlier in the script for mktime to get my time in seconds, but didn't know about strftime. Looks much better and cuts out several lines. Thanks to everyone else too. I'm trying out each of your suggestions.
Re: Super Duper Inefficient
by benn (Vicar) on Jun 20, 2003 at 13:18 UTC
    When you read @DATA from your DB, you could split it up into a hash of arrays keyed on hostid...in fact, you could split all your data at this point, to avoid the split in your main loop...
    foreach (@DATA) {
        ($hostid, $time, $get, $post, $conn) = split(/\t/);
        push @{$munged_data{$hostid}},
            { time => $time, get => $get, post => $post, conn => $conn };
    }
    ...allowing you to then do something like...
    foreach my $server (@servers) {
        foreach ( @{$munged_data{$SERVER{$server}}} ) {
            # do stuff with $_->{time} etc.
        }
    }
    Alternatively, you could look into splitting the data when it gets written into the db in the first place if this is possible, which would presumably involve configuring whatever logging module it is you're using.

    Cheers, Ben.

Re: Super Duper Inefficient
by hmerrill (Friar) on Jun 20, 2003 at 13:23 UTC
    I'm not sure I understand - can you explain what data is in the database? Is it webserver 'access_log's? From one webserver, or from all 60 webservers?

    Hard to make a recommendation without a better understanding of the situation.

    My guess is that the database contains a conglomeration of 60+ webserver access(?) logs - your script separates the entries out into a separate '.tmp' file for each webserver. An easier way to accomplish this doesn't really come to mind - the only improvement I can see would be to have each of the webservers log into its own file from the start, instead of separating them out after the fact. If they're all in one database because you have 1 webserver serving 60+ virtual webservers, then I don't know how that could be done.

    Sorry I couldn't be more help(yet). Clarifying and describing your setup in more detail may spark an idea from either me or someone else ;-)