Array loop is very inefficient, 100% CPU usage

waiterm has asked for the wisdom of the Perl Monks concerning the following question:

I'm having problems with the following sub routine, when it gets going my CPU usage shoots up to 100% for however long it takes to run, sometimes over an hour. I was just wondering why it was so inefficient??

I'm using Win XP Pro, and Active Perl 5.8.

Initially the routine opens up a small txt file which contains a file path, it checks if the file exists, if it does then it carries on, if not it writes the whole of the data file to $data which is then spat out in an email.

Then it is designed to open up the two text files (new data and old data) containing data in the following format:

101210::20041009::YYYYYYY::::1096930741::_1_
101210::20041016::YYYYYYY::::1096930741::_1_
101210::20041023::NNNNNNN::::1096930741::_1_
101210::20041030::NNNNNNN::::1096930741::_1_
101210::20041106::NNNNNNN::::1096930741::_1_
101212::20041009::NNNNNNN::::1096930741::_2_
101212::20041016::NNNNNNN::::1096930741::_2_
101212::20041023::NNNNNNN::::1096930741::_2_
101212::20041030::YYYYYYY::::1096930741::_2_
101212::20041106::YYYYYYY::::1096930741::_2_
[download]

It writes the two sets of data into an array each split by the endline character. Then goes through line by line and compares them using the 'cid' and 'date' sections at the beginning of each line i.e. 101212::20041106::, and then checks whether the 'avs' i.e. YYYYYYY is the same in each file where the 'cid' and 'date' are the same. Hope that makes sense!

sub compare_avs  {
    
    # find the name of the file created at last running

open(DIFF_OLD, $path.$agent."/AVS/DIFF_OLD.txt") || die "ERROR: Unable
+ to open ".$agent." - DIFF_OLD_FILE.txt";
    $DIFF_OLD = <DIFF_OLD>;
    close DIFF_OLD;

        # check if file exists if not, write contents of new file to $
+data
    
    if ($DIFF_OLD eq "")  {
        print "DIFF_OLD file - Does not exist\n"; 
        open(DIFF, "$EMAILfile") || die "ERROR: Unable to open ".$agen
+t." - ".$EMAILfile;
        while(my @AVSn = split/\n/, <DIFF>) {
            foreach $AVn(@AVSn) {
                $data = $data.$AVn."\n";
            }
        }
        close DIFF;
    }  else  {
        print "DIFF_OLD file - ".$DIFF_OLD."\n"; sleep(5);
        
                # write contents of old file to an array

        open(CHECK, $path.$agent."/AVS/".$DIFF_OLD) || die  "ERROR: Un
+able to open ".$agent." - ".$DIFF_OLD;
        @AVSo = <CHECK>;
        close CHECK;
        
        # write contents of new file to array and loop through it

                open(DIFF, "$EMAILfile") || die "ERROR: Unable to open
+ ".$agent." - ".$EMAILfile;
        while(my @AVSn = split/\n/, <DIFF>) {
            foreach $AVn(@AVSn) {
                $AVSchecked = 0;
                ($SIDn,$DATEn,$AVSn,$NOTHINGn,$EPOCHDATEn,$SERIALn) = 
+split/::/,$AVn;
                        # compare contents of old list with new list

                                        foreach $AVo(@AVSo)  {
                    $AVo =~ s/\n|\r//g;
                    ($SIDo,$DATEo,$AVSo) = split/::/,$AVo;
                    if ($SIDn eq $SIDo && $DATEn eq $DATEo && $AVSn ne
+ $AVSo)  {
                        $data = $data.$SIDn."::".$DATEn."::".$AVSn."::
+::".$EPOCHDATEn."::".$SERIALn."\n";
                        print $SIDn."::".$DATEn."::".$AVSn."::::".$EPO
+CHDATEn."::".$SERIALn."\n";
                        $AVSchecked = 1;
                    }  elsif  ($SIDn eq $SIDo && $DATEn eq $DATEo && $
+AVSn eq $AVSo)  {
                        $AVSchecked = 1;
                    }
                }
                if ($AVSchecked == 0)  {
                    $data = $data.$SIDn."::".$DATEn."::".$AVSn."::::".
+$EPOCHDATEn."::".$SERIALn."\n";
                    print $SIDn."::".$DATEn."::".$AVSn."::::".$EPOCHDA
+TEn."::".$SERIALn."\n";
                }
            }
        }
        close DIFF;
    }
    
    $DIFF_OLD_filex = $path.$agent."/AVS/DIFF_OLD.txt";
    open(DIFF_OLD,">".$path.$agent."/AVS/DIFF_OLD.txt") || die "ERROR:
+ Unable to open ".$agent." - DIFF_OLD_FILE.txt to update";
    print DIFF_OLD $agent.$$.".AVS";
    close DIFF_OLD;
    
}
[download]

Comment on Array loop is very inefficient, 100% CPU usage Select or Download Code

Replies are listed 'Best First'.
Re: Array loop is very inefficient, 100% CPU usage by tmoertel (Chaplain) on Oct 06, 2004 at 15:20 UTC
You're comparing every old element against every new element, which is expensive. If you have M old elements and N new elements, this requires MN* comparisons. However, using a hash, you can reduce the total to M+N comparison-like operations. The following code shows the idea: `my %AVSo_by_sig; sub cid_date { @cid_and_date = split /::/, $_[0], 2; return "@cid_and_date"; } sub avs { return (split /::/, $_[0], 3)[2]; } # create a hash of cid_date->elem for the old elements for (@AVSo) { $AVSo_by_sig{cid_date($_)} = $_; } # compare new elements to old via their cid_dates for my $new (@AVSn) { my $old = $AVSo_by_sig{cid_date($new)}; if ($old && avs($old) ne avs($new)) { # changed from old to new print $new; } }` [download] Cheers, Tom Tom Moertel : Blog / Talks / CPAN / LectroTest / PXSL / Coffee / Movie Rating Decoder	[reply] [d/l]
Re: Array loop is very inefficient, 100% CPU usage by dragonchild (Archbishop) on Oct 06, 2004 at 14:48 UTC
First off, your code doesn't work. `$DIFF_OLD = <DIFF_OLD>;` reads the first line of the file, not the whole file. If you want to slurp it, set $/ to undef. Second, you're on Unix (at least, I'm assuming that based on your directory separator). So, why aren't you using the some of the commandline utilities to help you out? For example, I'm assuming that the only thing that will change is the YYYY / NNNN, plus the _1_ / _2_. So, why not create a temporary file for each one, with the crud stripped out, then using sort and diff? You can using Unix's sort to sort based on various keys, split using your own delimiter. Even with the temp files, this shouldn't take more than a second per 20k rows, and probably much less than that. Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply] [d/l]
Re: Array loop is very inefficient, 100% CPU usage by jdporter (Paladin) on Oct 06, 2004 at 14:56 UTC
*(Updated)* `my %diff; my %new; open F, "< $old_file" or die "read $old_file: $!"; while (<F>) { chomp; my( $cid, $date, $avs ) = split /::/; $diff{"$cid::$date"}{$avs}++; } close F; open F, "< $new_file" or die "read $new_file: $!"; while (<F>) { my( $cid, $date, $avs ) = split /::/; my $key = join '::', $cid, $date; $diff{$key}{$avs}++; $new{$key} = $_; } close F; while ( my( $key, $val_hr ) = each %diff ) { keys %$val_hr > 1 and $data .= $new{$key}; }` [download]	[reply] [d/l]


Syntactic Confectionery Delight
	PerlMonks