Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Array loop is very inefficient, 100% CPU usage

by waiterm (Acolyte)
on Oct 06, 2004 at 14:33 UTC ( [id://396996]=perlquestion: print w/replies, xml ) Need Help??

waiterm has asked for the wisdom of the Perl Monks concerning the following question:

I'm having problems with the following sub routine, when it gets going my CPU usage shoots up to 100% for however long it takes to run, sometimes over an hour. I was just wondering why it was so inefficient??

I'm using Win XP Pro, and Active Perl 5.8.

Initially the routine opens up a small txt file which contains a file path, it checks if the file exists, if it does then it carries on, if not it writes the whole of the data file to $data which is then spat out in an email.

Then it is designed to open up the two text files (new data and old data) containing data in the following format:
101210::20041009::YYYYYYY::::1096930741::_1_ 101210::20041016::YYYYYYY::::1096930741::_1_ 101210::20041023::NNNNNNN::::1096930741::_1_ 101210::20041030::NNNNNNN::::1096930741::_1_ 101210::20041106::NNNNNNN::::1096930741::_1_ 101212::20041009::NNNNNNN::::1096930741::_2_ 101212::20041016::NNNNNNN::::1096930741::_2_ 101212::20041023::NNNNNNN::::1096930741::_2_ 101212::20041030::YYYYYYY::::1096930741::_2_ 101212::20041106::YYYYYYY::::1096930741::_2_
It writes the two sets of data into an array each split by the endline character. Then goes through line by line and compares them using the 'cid' and 'date' sections at the beginning of each line i.e. 101212::20041106::, and then checks whether the 'avs' i.e. YYYYYYY is the same in each file where the 'cid' and 'date' are the same. Hope that makes sense!
sub compare_avs { # find the name of the file created at last running open(DIFF_OLD, $path.$agent."/AVS/DIFF_OLD.txt") || die "ERROR: Unable + to open ".$agent." - DIFF_OLD_FILE.txt"; $DIFF_OLD = <DIFF_OLD>; close DIFF_OLD; # check if file exists if not, write contents of new file to $ +data if ($DIFF_OLD eq "") { print "DIFF_OLD file - Does not exist\n"; open(DIFF, "$EMAILfile") || die "ERROR: Unable to open ".$agen +t." - ".$EMAILfile; while(my @AVSn = split/\n/, <DIFF>) { foreach $AVn(@AVSn) { $data = $data.$AVn."\n"; } } close DIFF; } else { print "DIFF_OLD file - ".$DIFF_OLD."\n"; sleep(5); # write contents of old file to an array open(CHECK, $path.$agent."/AVS/".$DIFF_OLD) || die "ERROR: Un +able to open ".$agent." - ".$DIFF_OLD; @AVSo = <CHECK>; close CHECK; # write contents of new file to array and loop through it open(DIFF, "$EMAILfile") || die "ERROR: Unable to open + ".$agent." - ".$EMAILfile; while(my @AVSn = split/\n/, <DIFF>) { foreach $AVn(@AVSn) { $AVSchecked = 0; ($SIDn,$DATEn,$AVSn,$NOTHINGn,$EPOCHDATEn,$SERIALn) = +split/::/,$AVn; # compare contents of old list with new list foreach $AVo(@AVSo) { $AVo =~ s/\n|\r//g; ($SIDo,$DATEo,$AVSo) = split/::/,$AVo; if ($SIDn eq $SIDo && $DATEn eq $DATEo && $AVSn ne + $AVSo) { $data = $data.$SIDn."::".$DATEn."::".$AVSn.":: +::".$EPOCHDATEn."::".$SERIALn."\n"; print $SIDn."::".$DATEn."::".$AVSn."::::".$EPO +CHDATEn."::".$SERIALn."\n"; $AVSchecked = 1; } elsif ($SIDn eq $SIDo && $DATEn eq $DATEo && $ +AVSn eq $AVSo) { $AVSchecked = 1; } } if ($AVSchecked == 0) { $data = $data.$SIDn."::".$DATEn."::".$AVSn."::::". +$EPOCHDATEn."::".$SERIALn."\n"; print $SIDn."::".$DATEn."::".$AVSn."::::".$EPOCHDA +TEn."::".$SERIALn."\n"; } } } close DIFF; } $DIFF_OLD_filex = $path.$agent."/AVS/DIFF_OLD.txt"; open(DIFF_OLD,">".$path.$agent."/AVS/DIFF_OLD.txt") || die "ERROR: + Unable to open ".$agent." - DIFF_OLD_FILE.txt to update"; print DIFF_OLD $agent.$$.".AVS"; close DIFF_OLD; }

Replies are listed 'Best First'.
Re: Array loop is very inefficient, 100% CPU usage
by tmoertel (Chaplain) on Oct 06, 2004 at 15:20 UTC
    You're comparing every old element against every new element, which is expensive. If you have M old elements and N new elements, this requires M*N comparisons. However, using a hash, you can reduce the total to M+N comparison-like operations.

    The following code shows the idea:

    my %AVSo_by_sig; sub cid_date { @cid_and_date = split /::/, $_[0], 2; return "@cid_and_date"; } sub avs { return (split /::/, $_[0], 3)[2]; } # create a hash of cid_date->elem for the old elements for (@AVSo) { $AVSo_by_sig{cid_date($_)} = $_; } # compare new elements to old via their cid_dates for my $new (@AVSn) { my $old = $AVSo_by_sig{cid_date($new)}; if ($old && avs($old) ne avs($new)) { # changed from old to new print $new; } }

    Cheers,
    Tom

Re: Array loop is very inefficient, 100% CPU usage
by dragonchild (Archbishop) on Oct 06, 2004 at 14:48 UTC
    First off, your code doesn't work. $DIFF_OLD = <DIFF_OLD>; reads the first line of the file, not the whole file. If you want to slurp it, set $/ to undef.

    Second, you're on Unix (at least, I'm assuming that based on your directory separator). So, why aren't you using the some of the commandline utilities to help you out? For example, I'm assuming that the only thing that will change is the YYYY / NNNN, plus the _1_ / _2_. So, why not create a temporary file for each one, with the crud stripped out, then using sort and diff? You can using Unix's sort to sort based on various keys, split using your own delimiter. Even with the temp files, this shouldn't take more than a second per 20k rows, and probably much less than that.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Array loop is very inefficient, 100% CPU usage
by jdporter (Paladin) on Oct 06, 2004 at 14:56 UTC
    (Updated)
    my %diff; my %new; open F, "< $old_file" or die "read $old_file: $!"; while (<F>) { chomp; my( $cid, $date, $avs ) = split /::/; $diff{"$cid::$date"}{$avs}++; } close F; open F, "< $new_file" or die "read $new_file: $!"; while (<F>) { my( $cid, $date, $avs ) = split /::/; my $key = join '::', $cid, $date; $diff{$key}{$avs}++; $new{$key} = $_; } close F; while ( my( $key, $val_hr ) = each %diff ) { keys %$val_hr > 1 and $data .= $new{$key}; }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://396996]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (1)
As of 2024-04-24 13:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found