Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Searching Array To Hold RegEx Stack Is Order Dependant

by ~~David~~ (Hermit)
on Nov 21, 2008 at 20:16 UTC ( [id://725221]=perlquestion: print w/replies, xml ) Need Help??

~~David~~ has asked for the wisdom of the Perl Monks concerning the following question:

I have to search many files that can be upwards of 1M lines. I need to find about 10 fields through the entire file store their value in a hash.
The method I am using works fine, but I think it is very fragile. If anything would be out of order, my method fails.
What I have done is this:
Store my fields in an array in the order I am guessing they appear
my %SEARCH_FIELDS = ( '1.X' => [qw!RESULT_T LOTID DEVICEID SETUPID STEPID OMARK D +IEPITCH WAFERID SLOTID CENTER DEFECTS!], '1.8' => [qw!LOTID WAFERID RESULT_T CENTER SLOTID DEFECTS D +EVICEID DIEPITCH OMARK SETUPID STEPID!], );
I store my regex in a hash for searching:
my %KLARF_REGEXP = ( '1.X' => { 'LOTID' => qr/LotID "(.+)";/i, 'DEVICEID' => qr/DeviceID "(\w+)";/i, 'STEPID' => qr/StepID "(.+)";/i, 'SLOTID' => qr/Slot (\d+);/i, 'DEFECTS' => qr/DefectList/i, 'RESULT_T' => qr/ResultTimestamp (.+);/i, 'WAFERID' => qr/WaferID "(.+)";/i, 'SETUPID' => qr/SetupID (.+);/i, 'OMARK' => qr/OrientationMarkLocation (.+);/i, 'DIEPITCH' => qr/DiePitch (.+);/i, 'CENTER' => qr/SampleCenterLocation (.+);/i, }, '1.8' => { 'LOTID' => qr/LotRecord "(.+)"/i, 'DEVICEID' => qr/DeviceID 1 \{"(\w+)"\}/i, 'STEPID' => qr/StepID 1 \{"(.+)"\}/i, 'SLOTID' => qr/Field SlotNumber 1 \{(\d+)\}/i, 'DEFECTS' => qr/DefectList/i, 'WAFERID' => qr/WaferRecord "(.+)"/i, 'RESULT_T' => qr/Field ResultTimestamp \d \{(.+)\ +}/i, 'SETUPID' => qr/Field RecipeID 3 \{(.+)\}/i, 'OMARK' => qr/Field OrientationMarkLocation 1 +\{(.+)\}/i, 'DIEPITCH' => qr/Field DiePitch \d \{(.+)\}/i, 'CENTER' => qr/Field SampleCenterLocation \d \{ +(.+)\}/i, }, );
I then shift off the current $search_field and run until I successfully have a match. Upon a successful match, I shift off the next value.
while ( <FILE> ){ if ( $_ =~ $KLARF_REGEXP{$KLARF_VERSION}{$current_state} ){ $summary->{$current_state} = $1 if $1; LogMsg( "Found $current_state $summary->{$current_state}" ) if + $summary->{$current_state}; $current_state = shift @{$SEARCH_FIELDS{$KLARF_VERSION}}; } }
Is there another better faster stronger way to do what I am doing and not have to hard code the search order? I thought about iterating every line over every possibe regex, but I don't know if that is the best method. The machine that runs this process is already heavily utilized, so I am looking for a memory / processor efficient way to do this.

Replies are listed 'Best First'.
Re: Searching Array To Hold RegEx Stack Is Order Dependant
by ikegami (Patriarch) on Nov 21, 2008 at 20:43 UTC
    my %to_find = map { $_ => 1 } keys %{ $KLARF_REGEXP{$KLARF_VERSION} }; while ( <FILE> ){ for my $field ( keys %to_find ) { next if !/$KLARF_REGEXP{$KLARF_VERSION}{$field}/; if ( @- > 1 ) { $summary->{$field} = $1; LogMsg( "Found $field $summary->{$field}" ); } else { LogMsg( "Found $field" ); } delete $to_find{$field}; } } die if keys %to_find;

    You'll get a slow down since you now have to check every pattern for every line instead of just one.

    An alternative implementation is to load the entire file into memory if it fits.

    my $file; { local $/; $file = <FILE>; } for my $field ( keys %{ $KLARF_REGEXP{$KLARF_VERSION} ) { die if !/$KLARF_REGEXP{$KLARF_VERSION}{$field}/; if ( @- > 1 ) { $summary->{$field} = $1; LogMsg( "Found $field $summary->{$field}" ); } else { LogMsg( "Found $field" ); } }
Re: Searching Array To Hold RegEx Stack Is Order Dependant
by genehack (Beadle) on Nov 21, 2008 at 20:36 UTC
    If I understand what you're asking, each field -- each thing you're searching for -- will only appear once in the input, right? If that's the case, you can iterate over your hash of regexps for each line, BUT after you find a match, delete that particular key from the hash. When your hash is empty, you're done searching. Alternatively, you could iterate over the hash of regexps, but only try the match if $summary->{$key} wasn't set.
Re: Searching Array To Hold RegEx Stack Is Order Dependant
by gone2015 (Deacon) on Nov 21, 2008 at 22:01 UTC

    The other way is to stitch the regexes together, so that you run a single regex per line, and then sort out which field(s) you've found when you get a match. Along the lines of:

    use strict ; use warnings ; my %KLARF_REGEXP = ( '1.X' => { 'LOTID' => qr/LotID "(.+)";/i, 'DEVICEID' => qr/DeviceID "(\w+)";/i, 'STEPID' => qr/StepID "(.+)";/i, 'SLOTID' => qr/Slot (\d+);/i, 'DEFECTS' => qr/DefectList/i, 'RESULT_T' => qr/ResultTimestamp (.+);/i, 'WAFERID' => qr/WaferID "(.+)";/i, 'SETUPID' => qr/SetupID (.+);/i, 'OMARK' => qr/OrientationMarkLocation (.+);/i, 'DIEPITCH' => qr/DiePitch (.+);/i, 'CENTER' => qr/SampleCenterLocation (.+);/i, }, '1.8' => { 'LOTID' => qr/LotRecord "(.+)"/i, 'DEVICEID' => qr/DeviceID 1 \{"(\w+)"\}/i, 'STEPID' => qr/StepID 1 \{"(.+)"\}/i, 'SLOTID' => qr/Field SlotNumber 1 \{(\d+)\}/i, 'DEFECTS' => qr/DefectList/i, 'WAFERID' => qr/WaferRecord "(.+)"/i, 'RESULT_T' => qr/Field ResultTimestamp \d \{(.+)\ +}/i, 'SETUPID' => qr/Field RecipeID 3 \{(.+)\}/i, 'OMARK' => qr/Field OrientationMarkLocation 1 +\{(.+)\}/i, 'DIEPITCH' => qr/Field DiePitch \d \{(.+)\}/i, 'CENTER' => qr/Field SampleCenterLocation \d \{ +(.+)\}/i, }, ); my $vers = '1.8' ; my $r_items = $KLARF_REGEXP{$vers} ; my $r = join('|', values(%$r_items)) ; my $rx = qr/($r)/ ; my %result = () ; while (my $line = <DATA>) { ITEM: while ($line =~ m/$rx/g) { my $value = $1 ; $value =~ s/\s*\z// ; foreach my $id (keys %$r_items) { if ($value =~ m/$r_items->{$id}/) { $result{$id} = $1 ; # Worry about multiple values ? next ITEM ; } ; } ; die "Found '$value', but no match ...??" ; } ; } ; foreach my $id (keys %result) { print "$id = '", $result{$id} || '', "'\n" ; } ; __DATA__ gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg LotRecord "Don't Look Back" gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg DeviceID 1 {"The Device"} gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg StepID 1 {"Staircase"} Field SlotNumber 1 {497562} DefectList gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg WaferRecord "Sandwich" gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg Field ResultTimestamp 7 {25-Oct-1952} Field RecipeID 3 {First catch your rabbit} Field OrientationMarkLocation 1 {This way up} gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg gjhfgljhgwdchgdjlwhgjhdgsljh sg hjhg sljhg sljhg ljhjlhg sljg Field DiePitch 3 {Beetle} Field SampleCenterLocation 4 {Gravitas} ghcgjhcgjhghjgd sljh dsg ;kj sh;kj shkj dhk sjdh

    I can think of ways to improve the inner loop, which is iterating over the item regexes -- but I wouldn't worry about that unless this is still not fast enough.


    PS: incidentally, I'd look at the regexes and see if:

    • you want to replace single spaces by '\s+', to allow a little lee-way.

    • you can replace '.*' and '.+' by '[^?]*' and '[^?]+', where '?' is the expected terminator (eg '"' or '}' etc).

    The very rough code above makes a mess of dealing with the Field DiePitch line, because the regex gets greedy !

Re: Searching Array To Hold RegEx Stack Is Order Dependant
by roboticus (Chancellor) on Nov 22, 2008 at 13:14 UTC
    ~~David~~:

    Others have already suggested ways to speed your search. I'm just going to throw in a technique I occasionally find useful when dealing with large files. Sometimes you can easily differentiate between different types of data in your file, and take a quick pass over the file, discarding the obviously uninteresting stuff, creating a much smaller extract that fits in memory. You can then process that extract. For example, suppose your file looked something like this:

    This header section contains a bunch of left-justified lines that are all shorter than 60 characters. Only the headers and summary lines are interesting. Col1 Col2 Col3 Notes ---- ---- ---- ------------- 1 123 123 Foo bar baz 2 741 200 Barbaz Foobar 3 100 0 Boofar fazboo Total: 323 Next header section, etc.

    So you could discard all unless they are left-justified and shorter than 60 characters or don't contain 'Total:'. You can then use the faster in-memory techniques to dig through the extract. Something like so:

    my @extract; while (<$INF>) { # Ignore useless lines next if length($_) > 60; next unless /^(\w|\s*Total:)/; push @extract, $_; } # Now you can process @extract for the interesting stuff print @extract;

    ...roboticus

    Time flies like an arrow ... fruit flies like a banana.

    --Groucho Marx

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://725221]
Approved by ikegami
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (1)
As of 2024-04-25 04:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found