
Re: Comparing succeeding lines in two files.

by BrowserUk (Patriarch)
on Sep 27, 2002 at 21:52 UTC


in reply to Comparing succeeding lines in two files.

An algorithm that might work is this. Build an array from the lines of the small file, having removed the timestamps.

    open SMALL, '<smallfile' or die "Couldn't open smallfile; $!\n";
    # s/// returns the substitution count, so return $_ explicitly from the map
    my @find = map { s/^\d{2}:\d{2}(.*)$/$1/; $_ } <SMALL>;
    close SMALL or warn "Couldn't close smallfile; $!\n";

Then it's just a case of running through the large file one line at a time, stripping the timestamp and checking it against the next line in the array. If the line matches, you increment the match count; if it fails, you reset it to zero and continue.

    open BIG, '<bigfile' or die "Couldn't open bigfile; $!\n";
    my ($match, $matched) = (0, 0);
    while ( <BIG> ) {
        s/^\d{2}:\d{2}(.*)$/$1/;
        # On a mismatch, reset the count and move on to the next line.
        $match = 0, next unless $_ eq $find[$match++];
        next unless $match == scalar @find;
        # If you got this far, you've matched all the lines in the smallfile
        # as contiguous lines in the bigfile. So do something...
        # If you need to know where the sequence of matching lines started
        # (in the bigfile), $. - scalar(@find) + 1 will tell you.
        $matched = 1;
        $match = 0;    # be ready for another occurrence further on
    }
    close BIG or warn "Couldn't close bigfile; $!";
    print "Didn't find the contents of smallfile in bigfile\n" unless $matched;
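One thing to watch with a reset-to-zero loop: it can miss a sequence that begins on, or overlaps, the lines where a partial match failed. A minimal sketch of one way around that is to keep a rolling window of the last N stripped lines and compare the whole window each time. The names find_sequence and strip_stamp and the sample lines are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a leading HH:MM timestamp, as in the post.
sub strip_stamp {
    my ($line) = @_;
    $line =~ s/^\d{2}:\d{2}//;
    return $line;
}

# Return the 1-based line number in @$big where the lines of @$small
# first appear contiguously (timestamps ignored), or 0 if they never do.
sub find_sequence {
    my ($small, $big) = @_;
    my @find = map { strip_stamp($_) } @$small;
    return 0 unless @find;
    my @window;
    for my $i (0 .. $#$big) {
        # Keep only the most recent scalar(@find) lines.
        push @window, strip_stamp($big->[$i]);
        shift @window if @window > @find;
        next unless @window == @find;
        # The window matches if no position differs from the pattern.
        my $same = !grep { $window[$_] ne $find[$_] } 0 .. $#find;
        return $i - $#find + 1 if $same;
    }
    return 0;
}

# Invented sample data, only to exercise the function.
my @small = ("10:01 abc encountered a problem\n", "10:02 abc did a restart\n");
my @big   = (
    "09:00 xyz started\n",
    "09:05 abc encountered a problem\n",
    "09:06 abc encountered a problem\n",
    "09:07 abc did a restart\n",
);
print find_sequence(\@small, \@big), "\n";
```

This costs a full window comparison per line, which should be fine for 10 to 20 lines in the smallfile. For a real file you would feed the window from while (<BIG>) instead of an array.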

Two possible problems arise because you were vague with the requirements.

  1. You mention 10 .. 20 lines in the small file but only show 2 lines in your sample. If this was to save space and you always want to match every line in the smallfile before you decide you have a match, great, this will work OK. If you need to match any one of a series of sequences held in the smallfile, you'd need to decide how you will determine how many lines make a sequence.

    An array of arrays (AoA) could be one way forward if this is the case. You would then also need another scalar to index your way through the AoA.

  2. You mention but don't elucidate upon the idea of approximate matching. Without further information on how approximate and in what way, this is difficult to address, but for example you might build a regex from the words contained in each of the lines in the smallfile, possibly excluding common and/or small words, something like this.

(Assume you already populated the @find array as above.)

    my @excluded = qw( a did encountered process ); # tailor as appropriate
    for my $line (@find) {
        local $" = '.*?';
        # break the line into an array of words minus exclusions
        my @words = grep { my $w = lc; !grep { $_ eq $w } @excluded } $line =~ m/\b\w+\b/g;
        # replace each line with a fuzzy matching compiled regex
        $line = qr"@words"i;
    }

Then the line

    $match = 0, next unless $_ eq $find[$match++];

becomes

    $match = 0, next unless $_ =~ $find[$match++];

Using this process, your two line sample would become regexes

    (?i-xsm:abc.*?problem)
    (?i-xsm:abc.*?restart)

and would case-independently match any line containing the process name and the second word in that order regardless of intervening words.

By tailoring the list of excluded words to your trace file, this should be a fairly powerful fuzzy-match mechanism.
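To see the idea in isolation, here is a small sketch that builds one such regex and tries it on two lines. The helper name fuzzy_regex, the hash lookup for exclusions, and both test lines are my own assumptions for illustration.

```perl
use strict;
use warnings;

# Exclusion list from the post, held in a hash for exact-word lookups.
my %excluded = map { $_ => 1 } qw( a did encountered process );

# Turn one timestamp-free line into a case-insensitive regex that needs
# the remaining words in order, with anything at all in between.
sub fuzzy_regex {
    my ($line) = @_;
    my @words = grep { !$excluded{ lc $_ } } $line =~ /\b\w+\b/g;
    my $pattern = join '.*?', map { quotemeta($_) } @words;
    return qr/$pattern/i;
}

my $re = fuzzy_regex(" abc encountered a problem\n");

# Same words in the same order, different case and filler: matches.
print "fuzzy match\n" if "12:34 ABC hit an unexpected Problem" =~ $re;

# Same words in the opposite order: does not match.
print "no match\n" unless "12:34 problem reported by abc" =~ $re;
```

Note that the order of the words still matters with this approach, so a trace line that mentions the same words the other way round will not match.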

Another idea I had would be to construct a regex from the lines something like

    for my $line (@find) {
        # break the line into an array of words minus exclusions
        my @words = grep { my $w = lc; !grep { $_ eq $w } @excluded } $line =~ m/\b\w+\b/g;
        # Word breaks to avoid partial word matches
        local $" = '\b|\b';
        # replace each line with a multi-matching compiled regex
        $line = qr"\b@words\b"i;
    }

then use /g to match as many words as possible and obtain a count of the number.

    my $re = $find[$match++];
    my $n = () = $_ =~ m/$re/g;    # count how many of the words matched
    $match = 0, next unless $n > $than_some_predetermined_number;

With this method you would probably need to work out a 'minimum words to be matched number' on a line by line basis in the smallfile. They would probably be best appended to each line and parsed at the same time the timestamp is stripped.
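A rough sketch of how that per-line minimum might be parsed and applied follows. The "|N" suffix on each smallfile line is purely an assumed format, as are the names parse_rule and line_matches and the sample lines; pick whatever marker suits your data. It also treats N as a minimum (>=) rather than a number to be exceeded.

```perl
use strict;
use warnings;

my %excluded = map { $_ => 1 } qw( a did encountered process );

# Hypothetical smallfile format: "HH:MM text|N", where N is the minimum
# number of word hits needed for that line to count as matched.
sub parse_rule {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^\d{2}:\d{2}//;                      # strip the timestamp
    my $min = $line =~ s/\|(\d+)\s*$// ? $1 : 1;    # strip and keep N, default 1
    my @words = grep { !$excluded{ lc $_ } } $line =~ /\b\w+\b/g;
    my $alt = join '|', map { quotemeta($_) } @words;
    return { re => qr/\b(?:$alt)\b/i, min => $min };
}

# Count the word hits on a bigfile line and compare with the threshold.
sub line_matches {
    my ($rule, $text) = @_;
    my $re = $rule->{re};
    my $n = () = $text =~ /$re/g;
    return $n >= $rule->{min};
}

my $rule = parse_rule("10:01 abc encountered a serious disk problem|2\n");
print "matched\n"     if     line_matches($rule, "09:00 abc reports disk trouble");
print "not matched\n" unless line_matches($rule, "09:00 xyz reports disk trouble");
```

The first test line hits two of the four retained words and passes; the second hits only one and fails.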


Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
