http://qs321.pair.com?node_id=267549

Earindil has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I don't believe it is possible to make the following code any worse. I've been hacking away at Perl for many years but am still less than an amateur.
Could you please take a look and suggest a better way of looping through the data (or avoiding the loop altogether)? In the worst case, I loop through several thousand lines of the same data 80+ times. Ugly.
This snippet of a larger script is used to draw web posts/reqs and opens for a large number of servers using the wonderful Ploticus graphing program.
@servers is a list of 60+ webservers
@DATA is web server data pulled from the DB in 5-minute increments

Each server in the DB is identified by a HOSTID, so I load %SERVER with all the HOSTIDs, e.g. $SERVER{WEB1} = 1234.
----

foreach $server (@servers) {
    chomp $server;
    open TMP, ">/tmp/$server.tmp";
    $ENV{"FILENAME"} = "/tmp/$server.tmp";

    # SUPER INEFFICIENT, looping through all of the data for each server.
    # Need to rethink this
    foreach (@DATA) {
        ($hostid, $time, $get, $post, $currconn) = split(/\t/);
        next if ($hostid ne $SERVER{$server});

        $time = ((int($time/300))*300)+10800;   # Add 3 hours for EST time
        ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($time);
        $year += 1900;
        $year  = substr($year,2);
        $mon++;
        $mon  = "0".$mon  if ($mon  < 10);
        $mday = "0".$mday if ($mday < 10);
        $hour = "0".$hour if ($hour < 10);
        $min  = "0".$min  if ($min  < 10);
        $time = "$mon/$mday/$year.$hour:$min";

        $total = $get + $post;
        print TMP "$time\t$server\t$get\t$post\t$total\t$currconn\n";

        $total_get{"$time"}  += $get;
        $total_post{"$time"} += $post;
        $total_conn{"$time"} += $currconn;
    }
    close TMP;

    if (!$ARGV[3]) {
        `$dir/plbin210/bin/pl $dir/WebHits.pol -o $output_dir/graphs/${server}_webhits.gif -gif title="$server Gets, Posts and Opens"`;
        `$dir/plbin210/bin/pl $dir/WebHits.pol -o $output_dir/graphs/${server}_Twebhits.gif -gif -scale 0.50 title="$server $hour:$min"`;
    }
}


Replies are listed 'Best First'.
Re: Super Duper Inefficient
by particle (Vicar) on Jun 20, 2003 at 13:13 UTC

    here's a clue:

    foreach $server (@servers) {
        chomp $server;
        open TMP, ">/tmp/$server.tmp";
        $ENV{"FILENAME"} = "/tmp/$server.tmp";

        foreach (@DATA) {
            ($hostid, $time, $get, $post, $currconn) = split(/\t/);
            next if ($hostid ne $SERVER{$server});
            ## more here...
        }
        close TMP;
    }

    this is the inefficiency. invert these loops to make this something like:

    ## iterate through record
    foreach (@DATA) {
        ($hostid, $time, $get, $post, $currconn) = split(/\t/);

        ## match against the server list and process accordingly
        foreach $server (@servers) {
            chomp $server;
            next unless $hostid eq $SERVER{$server};

            open TMP, ">>/tmp/$server.tmp";
            $ENV{"FILENAME"} = "/tmp/$server.tmp";
            ## more here...
            close TMP;

            ## done with this record, move on
            last;
        }
    }

    then you'll iterate over the data once, writing to multiple files depending on the $hostid variable. after each data record is processed, it moves on to the next line without evaluating any other servers in the list.

    if you haven't already, you might consider using warnings or -w. you might also consider using strict. it would take some work with this code, but it makes perl debugging so much easier.
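
    for instance, at the top of the script (just a two-line sketch of what i mean):

    use strict;
    use warnings;    ## or run the script with perl -w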

    ~Particle *accelerates*

      Would opening and closing files thousands of times be more of a hit than just 80 or so times?
      If generating 5-day graphs, @DATA would be approximately 288*5*80 lines.
      Would it perhaps be better to open all the files first before the loop and then close them all after?

        sure, you could open all files first, and store the filehandles in a hash, keyed by servername. depending on the size of the records, you could also process the data in one loop, store it in memory, and write it all out in a second loop. there are many ways to do it. you might try benchmarking a few to see what works best for you.
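
        a rough, untested sketch of the filehandle-hash idea (the reverse hostid-to-name map, the lexical filehandles, and the three-argument open are my additions, not anything from the original code):

        ## open every output file once, up front, keyed by server name
        my %fh;
        for my $server (@servers) {
            chomp $server;
            open $fh{$server}, '>', "/tmp/$server.tmp"
                or die "can't write /tmp/$server.tmp: $!";
        }

        ## reverse %SERVER so a record's hostid maps back to its server name
        my %host2server = reverse %SERVER;    ## e.g. 1234 => 'WEB1'

        for (@DATA) {
            my ($hostid, $time, $get, $post, $currconn) = split /\t/;
            my $server = $host2server{$hostid} or next;
            print { $fh{$server} } "...\n";    ## format the record here
        }

        close $_ for values %fh;

        that way each file is opened exactly once and the data is read exactly once.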

        ~Particle *accelerates*

Re: Super Duper Inefficient
by EvdB (Deacon) on Jun 20, 2003 at 13:10 UTC
    I note the line:
    next if ($hostid ne $SERVER{$server});
    Why not split @DATA into lots keyed by $hostid? Try:
    my %new_data = ();
    foreach ( @DATA ) {
        my ($hostid, $rest) = split /\t/, $_, 2;
        push @{$new_data{$hostid}}, "$hostid\t$rest";
    }
    Then, instead of running through all of @DATA for each server, you can run through just @{$new_data{$hostid}} once you have determined the $hostid.
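    For instance (a sketch only; the $SERVER{$server} name-to-HOSTID lookup is taken from the original post):

    foreach my $server ( @servers ) {
        chomp $server;
        my $hostid = $SERVER{$server};
        foreach ( @{ $new_data{$hostid} || [] } ) {
            # each element is still a "$hostid\t$time\t..." line,
            # so split and process it exactly as before
        }
    }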

    --tidiness is the memory loss of environmental mnemonics

      I was formulating this same response myself, but I'll provide a minor variation that does pretty much the same thing, using the automatic capture variables instead of temporary placeholder variables:
      my %new_data = ();
      foreach ( @DATA ) {
          /^(.*?)\t.*$/;
          push @{$new_data{$1}}, $_;
      }
      Either is equally valid in my eyes - I do not know which is faster, the split() or the //, so I'll leave that for better souls than mine to answer.

      The important thing I'd like to point out is that the above method - run as a step before the server loop - effectively reduces your redundancy to two reads: once to populate the hash, and then once for the server-specific loop.

Re: Super Duper Inefficient
by nite_man (Deacon) on Jun 20, 2003 at 13:42 UTC
    You can format the date an easier way using the strftime function from POSIX:
    use POSIX 'strftime';
    my $time = strftime '%m/%d/%Y %H:%M', localtime($time);
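    For example, the whole hand-rolled date block in the original loop could shrink to something like this (a sketch only; it keeps the original 300-second bucketing and the hard-coded three-hour EST offset, and uses %y for the two-digit year the original produced):

    use POSIX 'strftime';

    # round down to the 5-minute bucket and shift to EST, as in the original
    $time = int($time/300)*300 + 10800;

    # e.g. "06/20/03.13:05" - the same layout the manual zero-padding built
    $time = strftime '%m/%d/%y.%H:%M', localtime($time);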
          
    --------------------------------
    SV* sv_bless(SV* sv, HV* stash);
    
      That's great, thanks. I was already using POSIX earlier in the script for mktime to get my time in seconds, but didn't know about strftime. Looks much better and cuts out several lines. Thanks to everyone else too. I'm trying out each of your suggestions.
Re: Super Duper Inefficient
by benn (Vicar) on Jun 20, 2003 at 13:18 UTC
    When you read @DATA from your DB, you could split it up into a hash of arrays keyed on hostid...in fact, you could split all your data at this point, to avoid the split in your main loop...
    foreach (@DATA) {
        ($hostid, $time, $get, $post, $conn) = split(/\t/);
        push @{$munged_data{$hostid}},
            { time => $time, get => $get, post => $post, conn => $conn };
    }
    ...allowing you to then do something like...
    foreach my $server (@servers) {
        foreach ( @{$munged_data{$SERVER{$server}}} ) {
            # do stuff with $_->{time} etc.
        }
    }
    Alternatively, you could look into splitting the data when it gets written into the db in the first place if this is possible, which would presumably involve configuring whatever logging module it is you're using.

    Cheers, Ben.

Re: Super Duper Inefficient
by hmerrill (Friar) on Jun 20, 2003 at 13:23 UTC
    I'm not sure I understand - can you explain what data is in the database? Is it webserver 'access_log's? From one webserver, or from all 60 webservers?

    Hard to make a recommendation without a better understanding of the situation.

    My guess is that the database contains a conglomeration of 60+ webserver access(?) logs - your script separates the entries out into a separate '.tmp' file for each webserver. An easier way to accomplish this doesn't really come to mind - the only improvement I can see would be to have each of the webservers log into its own file from the start, instead of separating them out after the fact. If they're all in one database because you have 1 webserver serving 60+ virtual webservers, then I don't know how that could be done.

    Sorry I couldn't be more help(yet). Clarifying and describing your setup in more detail may spark an idea from either me or someone else ;-)