very slow processing

by sandy105 (Scribe)
on Aug 20, 2014 at 16:30 UTC ( [id://1098132] )

sandy105 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to process a largish (~50 MB) log file, but with the current code it's taking way too long.

I am basically collecting the unique IDs from the second bracketed field, then looping through the whole file for each ID, looking for keywords and writing matches to an output file. But it's taking hours.

my @id;
my $date;
my $id;
my $keyword;
my $mess;
my @uniqueid;
my %seen;

#read the file (PLEASE PROVIDE INPUT FILE PATH)
open(hanr, "d:/Log.txt") or die "error $!\n";

#digesting the lines
@lines = <hanr>;

#iterating through the lines
foreach $line (@lines) {
    $line =~ /\[(.+?)\] .* \[(.+?)\] .* \[[^]]+\] \s+ (.*) /x or next;
    $id = $2;
    push (@id, $id);    #pushing id to array
}

#for getting unique user id's
foreach $value (@id) {
    if (! $seen{$value}++) {
        push @uniqueid, $value;
    }
}

#OPENING OUTPUT FILE;PROVIDE PATH
open (myfile, ">D:\\output\\op.txt") or die("error:cannot create $! \n");

foreach $uniquevalue (@uniqueid) {
    foreach $line (@lines) {
        $line =~ /\[(.+?)\] .* \[(.+?)\] .* \[[^]]+\] \s+ (.*) /x or next;
        $date    = $1;
        $id      = $2;
        $keyword = $3;
        if ($uniquevalue eq $2 && $keyword eq "Orchestration Started") {
            print myfile "$date,$id,$keyword \n";
            next;
        }
        if ($uniquevalue eq $2 && $keyword =~ /^Input Message/) {
            print myfile "$date,$id," . "Input Message to P5 \n";
            next;
        }
        if ($uniquevalue eq $2 && $keyword =~ /^TP Service Request/) {
            print myfile "$date,$id," . "Service Request \n";
            next;
        }
        if ($uniquevalue eq $2 && $keyword =~ /^P5 Move request :/) {
            print myfile "$date,$id," . "Move request \n";
            next;
        }
        if ($uniquevalue eq $2 && $keyword =~ /^ProcessName:/) {
            $mess = substr $keyword, 12;
            print myfile "$date,$id,$mess \n";
            next;
        }
        if ($uniquevalue eq $2 && $keyword =~ /^Process Message :/) {
            my $mess = substr $keyword, 17;
            print myfile "$date,$id,$mess \n";
            next;
        }
    }
}

The search for unique IDs is fast enough, but in the second block the if tests search for the keywords of each unique ID through the WHOLE file. It's painfully slow. How can I improve the speed?

PS: As requested, a sample of the log and its exact signature:

[2014-06-24 15:10:10,936] DEBUG - [27421b9e-fbd3-11e3-943c-ff89c266c130] [PreProcess] Ended successfully.

[2014-06-24 15:10:10,936] DEBUG - [27421b9e-fbd3-11e3-943c-ff89c266c130] [PreProcess] Orchestration Started

[date] text - [uniqueid] [sometext] keyword, e.g. ProcessName:

Some more information: I don't need the IDs to be in order in the output file, and for each unique ID like 27421b9e-fbd3-11e3-943c-ff89c266c130 I want to log the date, the ID and the "keyword", so from the two lines above I would pick up only the second one.

Replies are listed 'Best First'.
Re: very slow processing
by RonW (Parson) on Aug 20, 2014 at 17:25 UTC

    When you make the first pass, instead of pushing each ID into an array then filtering out duplicates, use a hash with ID as the key. For the value, concatenate your formatted output. For the second pass, loop on the keys of the hash, printing the strings in the hash:

    my %urecs;
    for my $line (@lines) {
        $line =~ /your regex/;
        my $date    = $1;
        my $id      = $2;
        my $keyword = $3;
        $urecs{$id} .= "$date,$id,$keyword \n";
    }
    print $urecs{$_} for keys %urecs;

    Displamer: Not tested.

      The IDs are repeated, so I need to check for unique IDs and keywords.

        Hash keys are always unique. In my example, if an ID has already been seen, the new string is appended to the previous content of the value for that ID.

        I could have written:

        for my $line (@lines) {
            $line =~ /your regex/;
            if (exists $urecs{$2}) {
                $urecs{$2} .= "$1,$2,$3\n";
            }
            else {
                $urecs{$2} = "$1,$2,$3\n";
            }
        }

        but that is not necessary because Perl treats appending to an undefined value the same as appending to an empty string.

        Another thing you could do: $ids{$2}++ would give you a hash of IDs seen (the keys) and how many (the values) times each was seen (again, no need to check for existence first).

        As for checking the keywords, I left that out so to focus my example on the use of the hash.

Re: very slow processing
by philipbailey (Curate) on Aug 20, 2014 at 19:23 UTC

    My eye is drawn to this:

        $line =~ /\[(.+?)\] .* \[(.+?)\] .* \[[^]]+\] \s+ (.*) /x or next ;

    It seems to me that the repeated .* is likely to result in a lot of backtracking during the match of the regex. I suggest you try replacing the first two .* expressions with something like [^\[]+ and add a $ anchor at the end. That would change that line to:

        $line =~ /\[(.+?)\] [^\[]+ \[(.+?)\] [^\[]+ \[[^]]+\] \s+ (.*)$ /x or next ;

    There may be something more elegant you could do (and somebody will chip in), but I bet that will result in a speed improvement.

    Mind you, the O(n²) algorithm won't help.

Re: very slow processing
by GrandFather (Saint) on Aug 20, 2014 at 21:16 UTC

    When you profiled the code where did the time get spent? Take a look at Devel::Profile if you are not using it already. If your code really is slow (you've not actually told us how long it takes - really "hours"?) the time you spend learning to use the profiling tool will pay for itself in short order.

    As an aside, are you sure your tests are correct? I'd be a bit worried about using $2 like that, especially as you use further matches after the first match.

    It would help if you showed us a sample of your input file so we can see its structure and maybe suggest a better way to process it. I suspect a single pass through the lines should suffice, but we can't tell from the data you've not shown us.

    Almost always the big time wins come from changing your algorithm rather than fine-grained tweaking of existing code.

    Perl is the programming world's equivalent of English

      Good thinking and good advice.

      I will point out that Sandy said in the original post:

      the search for unique id's is fast enough , but for the second block

      Umm, the time is spent in the inner loop, because for each ID it searches the whole log file again, which at 50 MB is huge; so I am basically making N+1 sweeps, where N is the number of unique IDs. It didn't finish in 1.5 hours, so I killed it; checked with a smaller file, it took 10 minutes.

      For the profiling module, is "perl -d:Profile test.pl" right, i.e. only the -d parameter is needed?

      And yes, I was worried about $2 too, but the output is as expected.

Re: very slow processing
by oiskuu (Hermit) on Aug 21, 2014 at 00:17 UTC

    (Pretty much what RonW said.) You require fast lookup over the $id field. Populate your data structure accordingly! Hash will be a good choice here.

    If I understood correctly, the task is about grouping lines in a log file by the ID field, plus some additional munging. The following skeleton code might give you some ideas:

    use strict;
    use warnings;

    # @ID to keep unique id's (in order they are seen)
    # %T to group things by id. Make it HoA (Hash of Arrays)
    my (@ID, %T);
    ...
    while (<HANR>) {
        my ($date, $id, $kw) = /\[(.+?)\]/g;
        my $txt;

        next unless $kw;
        $txt = "$+ to P5" if $kw =~ /^(Input Message)/;
        $txt = $+         if $kw =~ /^(Orchestration Started)$/;
        $txt = $+         if $kw =~ /^ProcessName:(.*)/;
        # note the opportunity to merge some regexes above
        next unless $txt;

        push @ID, $id unless $T{$id};
        push @{ $T{$id} }, "$date,$id,$txt \n";
    }
    for my $group (@T{ @ID }) {    # this is a hash slice!
        print OUT for @$group;
    }
    ps. I'm trying to puzzle out what hanr might stand for. And what is a displamer?

    Update.

    In above example, the @ID array is only to keep IDs ordered by their first encounter. If that's unimportant, replace the hash slice with just (values %T). The $+ is documented in perlvar. But my regex to split fields doesn't quite cut it if kw is not in brackets - needs a fix.

    Having a firm handle on (perl) data structures is of great utility; it helps one in making the right algorithmic decisions. For starters, perldsc is a good read.

    pps. I'm awfully suspecting that Hanr shot first. With a displamer.

      Could you please explain your code further?

      hanr and hanw: I use them for read and write file handles. And "displamer" should be "disclaimer".

Re: very slow processing
by GrandFather (Saint) on Aug 21, 2014 at 21:10 UTC

    Looks like you only need to read the file once:

    use strict;
    use warnings;

    my $fInName  = "d:/Log.txt";
    my $fOutName = "d:/Output/op.txt";

    open my $fIn,  '<', $fInName  or die "Can't open '$fInName': $!\n";
    open my $fOut, '>', $fOutName or die "Can't create '$fOutName': $!\n";

    my %ids;

    while (<$fIn>) {
        my ($date, $id, $keyword) =
            /\[(.+?)\] .* \[(.+?)\] .* \[[^]]+\] \s+ (.*) /x or next;
        push @{$ids{$id}}, [$date, $id, $keyword];
    }

    for my $id (sort keys %ids) {
        for my $entry (@{$ids{$id}}) {
            my ($date, $id, $keyword) = @$entry;

            if ($keyword eq "Orchestration Started") {
                print $fOut "$date,$id,$keyword \n";
                next;
            }
            if ($keyword =~ /^Input Message/) {
                print $fOut "$date,$id," . "Input Message to P5 \n";
                next;
            }
            if ($keyword =~ /^TP Service Request/) {
                print $fOut "$date,$id," . "Service Request \n";
                next;
            }
            if ($keyword =~ /^P5 Move request :/) {
                print $fOut "$date,$id," . "Move request \n";
                next;
            }
            if ($keyword =~ /^ProcessName:/) {
                my $mess = substr $keyword, 12;
                print $fOut "$date,$id,$mess \n";
                next;
            }
            if ($keyword =~ /^Process Message :/) {
                my $mess = substr $keyword, 17;
                print $fOut "$date,$id,$mess \n";
                next;
            }
        }
    }

    Completely untested because you didn't supply any sample data, but there's a fair chance it'll just work.

    Note use of strictures, lexical file handles, three parameter open and informative open failure diagnostics. Also note that variables are declared where they are first needed so that scope is managed correctly and accidental reuse is more likely to be noticed.

    Oh, and the regex only needs to be given once.

    Perl is the programming world's equivalent of English
Re: very slow processing
by sandy105 (Scribe) on Aug 21, 2014 at 10:47 UTC

    Thank you so much, guys, especially RonW. Taking your cue, I now sweep through the file just once, capturing the three data items and pushing them into a hash with the ID as key. It takes a few seconds now.
