Shell Script Woes

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All:
I am in the process of converting a lot of shell scripts to Perl. I am having a problem with the following:

#!/usr/bin/ksh
GrepList=`sed '/^ *$/d;/^#/d' traplist.in 2>/dev/null`
grep -ilF "$GrepList" out/do* 2>/dev/null | xargs -i mv {} ./capture 2>/dev/null

This isn't the entire script, but it is in a continual loop.
The mv is in a race condition that is beyond my control and the GrepList changes over time.

Some of the limitations of the shell script implementation are:
1. The item in the GrepList that matched isn't recorded
2. The grep fails if the string is wrapped (contains an embedded newline)

I came up with the following (with help from the CB) as a start (code doesn't exactly match)

#!/usr/bin/perl
my @checklist;
my @testfile;
my $rulenames;
open (FILE,"filename");
while (<FILE>) {
 next if ( $_ =~ /^ *$/ || $_ =~ /^#/ );
 push @checklist , $_;
}
chomp @checklist;
close (FILE);
my $GrepList = join '|', map "($_)", @checklist;

@ARGV = <*>;
if (@ARGV) {
 undef $/;
 while (<>) {
  $_ =~ s/\n//;
  next unless ($_ =~ /$GrepList/i);
  print "$ARGV : $+\n";
 }
}

The final product will store the time stamp of the traplist file, reload the list only when it changes, and fix the obvious errors, but for now I am more concerned with the raw functionality.

What I am trying to accomplish is the following:

Match on any item in list

Stop after first match

Remember what matched

Be as fast as possible using the least amount of resources (I know - you trade one for the other)

What I can't do is:

Slurp in the file

Build an ever increasing string

Make the assumption that the string won't wrap 3 or more lines (though unlikely)

I originally envisioned what I do when I do a word search. You scan using your finger until you come across a letter that starts a word you are looking for. You follow it until you get a match or until you realize it is a dead end. If it is a dead end, you keep on going, having forgotten everything that came before.

Unfortunately, I don't believe there is a way to tell RegEx to start paying attention on a partial match.

Any ideas?

Limbic~Region


Re: Shell Script Woes (tye's try) by tye (Sage) on Jan 02, 2003 at 22:59 UTC
You need to quotemeta since you are trying to match a list of fixed strings, not a list of regular expressions. This also makes it easy to figure out the maximum length of "carry over" you need between buffers.

#!/usr/bin/perl -w
use strict;
my @checklist;
my @testfile;
my $rulenames;
open( FILE, "wordlist" )
    or die "Can't read wordlist: $!\n";
my $maxLen = 0;
while( <FILE> ) {
    next if $_ =~ /^ *$/ || $_ =~ /^#/;
    $maxLen = length($_) if $maxLen < length($_);
    push @checklist, $_;
}
chomp @checklist;
close(FILE);
my $bufSize= 8*1024;
$bufSize= 2*$maxLen if $bufSize < 2*$maxLen;
my $GrepList = join '|', map quotemeta $_, @checklist;
$GrepList = qr/($GrepList)/i;

@ARGV = grep -f $_, <*>;
$/= \$bufSize; # Have <> read $bufSize bytes
if( @ARGV ) {
    my $prev= "";
    while( <> ) {
        $_ =~ s/\n//g;
        if( ($prev.$_) =~ /$GrepList/ ) {
            print "$ARGV : $1\n";
            close( ARGV );
            $prev= "";
        } elsif( eof ) {
            $prev= "";
        } else {
            $prev = substr( $_, -$maxLen );
        }
    }
}

Tested and works. Note that this assumes that you don't have huge runs of newlines in the middles of your matches.

- tye

Updated: I originally left out the setting of $/.
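
To illustrate the quotemeta point in isolation, here is a minimal sketch. The sub names (build_matcher, first_match) and the sample strings are made up for the example and do not come from the thread; only quotemeta, join, map and qr// are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical fixed strings, as they might appear in a trap list.
my @checklist = ('a.b', 'cost (USD)');

# Build one case-insensitive alternation in which every string is literal.
sub build_matcher {
    my @strings = @_;
    my $alternation = join '|', map { quotemeta($_) } @strings;
    return qr/($alternation)/i;
}

# Return the text that matched, or undef if nothing matched.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

my $re = build_matcher(@checklist);
for my $text ('total COST (usd) due', 'axb') {
    my $hit = first_match($re, $text);
    print defined $hit
        ? "matched '$hit' in '$text'\n"
        : "no match in '$text'\n";
}
```

Without quotemeta, the "." in 'a.b' would match any character (so 'axb' would be a hit) and the parentheses in 'cost (USD)' would be treated as grouping rather than as literal characters.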
Re: Re: Shell Script Woes (tye's try) by Limbic~Region (Chancellor) on Jan 09, 2003 at 20:14 UTC
tye,
Thank you very much for the insight. I have "borrowed" a great deal of your code and came up with the following:

#!/usr/bin/perl -w
use strict;
use Time::Local;
chdir "/var/spool/wt400/gateways/$ARGV[0]" or exit;
mkdir "capture", 0755 unless (-d "capture");
my $Dir = $ARGV[1];
my $ListTime = 0;
my $BufferSize = 64 * 1024;
my $MaxLen = 0;
my %Traps;
my @GrepList;
my $GrepString;
my $Counter = 1;
my $Size;
my $Prev;
my $Now;
my $NF;

while (1) {
 if ($Counter > 20 || ! @GrepList) {
  if ( (stat("traplist.$Dir"))[9] gt $ListTime ) {
   $ListTime = (stat(_))[9];
   open (LIST,"traplist.$Dir");
   while (<LIST>) {
    next if ($_ =~ /^Created\t\tExpires/ || $_ =~ /^ *$/);
    my @Fields = split "\t" , $_;
    my($mon, $day, $year, $hour, $min) = split ?[-/:]? , $Fields[1];
    my $Expiration = timelocal(0, $min, $hour, $day, $mon -1, $year + 100);
    $Traps{"$Fields[6]"} = [ $Expiration,$Fields[2],$Fields[5],$Fields[7] ];
   }
   close (LIST);
  }
 }
 @GrepList = ();
 $Now = time;
 foreach my $trap (keys %Traps) {
  push @GrepList,$Traps{$trap}[3] unless (($Traps{$trap}[0] < $Now && $Traps{$trap}[1]) || $trap eq "SIZE");
 }
 map { $MaxLen = length($_) if length($_) > $MaxLen } @GrepList;
 $BufferSize = 2 * $MaxLen if ($BufferSize < 2 * $MaxLen);
 if (exists $Traps{"SIZE"} && $Traps{"SIZE"}[1]) {
  $Size = $Traps{"SIZE"}[2] unless ($Traps{"SIZE"}[0] < $Now && $Traps{"SIZE"}[2] > 0);
 }
 exit unless (@GrepList || $Size);
 $GrepString = join '|', map quotemeta $_, @GrepList;
 $GrepString = qr/($GrepString)/i;
 if ($Dir eq "out") {
  @ARGV = <out/do*>;
 }
 elsif ($Dir eq "in") {
  @ARGV = <in/di*>;
 }
 else {
  @ARGV = <out/do* in/di*>
 }
 if (@ARGV) {
  $/=\$BufferSize;
  $Prev= "";
  while (<>) {
   $_ =~ tr/\n//d;
   if(($Prev.$_) =~ /$GrepString/) {
    ($NF = "$ARGV-$+") =~ s/^.*\///;
    rename $ARGV , "capture/$NF";
    close (ARGV);
    $Prev = "";
   }
   if (eof) {
    $Prev = "";
   }
   else {
    $Prev = substr($_,-$MaxLen);
   }
  }
 }
 $/ = "\n";
 ++$Counter;
 sleep 3
}

This provides 10X more functionality than the original shell script did.
I would appreciate any advice on how it could be made to go faster and still be efficient.

L~R
Re^3: Shell Script Woes (review) by tye (Sage) on Jan 09, 2003 at 21:30 UTC
First, I'd use more than one space for indentation. I use 4 because I like the way it discourages overly deep nesting of code. Even 2 or 3 would be quite a bit better than 1, IMO.

$Traps{"$Fields[6]"} = [ $Expiration,$Fields[2],$Fields[5],$Fields[7] ];

can be written

$Traps{$Fields[6]} = [ $Expiration, @Fields[2,5,7] ];

Putting in too many quotes can bite you (though using it as a hash key also does the stringification, which would bite you in the same way in this case -- changing an object into a string), so be careful of it.

You can make the code clearer using a few constants:

sub EXPIRES()  { 0 }
sub WHATEVER() { 1 }
sub FOOBAR()   { 2 }
sub FILENAME() { 3 }

(these make no difference in the running time of the code since they get optimized away at compile time).

I'd also avoid 'unless', so

push @GrepList,$Traps{$trap}[3] unless (($Traps{$trap}[0] < $Now && $Traps{$trap}[1]) || $trap eq "SIZE");

becomes

push @GrepList, $Traps{$trap}[FILENAME]
    if  $trap ne "SIZE"
    and ! $Traps{$trap}[WHATEVER]
     || $Now <= $Traps{$trap}[EXPIRES];

for example (I find spacing more effective at conveying grouping than parens, YMMV).

Don't use map unless you want the list that it builds:

map { $MaxLen = length($_) if length($_) > $MaxLen } @GrepList;

becomes

for( @GrepList ) {
    $MaxLen = length($_) if length($_) > $MaxLen;
}

If you really have a need for single-line code (which is a mistake in my book), then remove the newlines.

Use `local( $/ )= \$BufferSize;` and you can drop the `$/ = "\n";` line.

So no speed improvements to offer. (:

- tye
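
The `local( $/ )` suggestion can be tried on its own with a small sketch like the following. The sub name read_chunks and the in-memory filehandle are assumptions made for the example; the thread's code reads real files through <>.

```perl
use strict;
use warnings;

# Read a filehandle in fixed-size chunks. Because $/ is localized,
# its previous value comes back automatically when the sub returns.
sub read_chunks {
    my ($fh, $size) = @_;
    local $/ = \$size;    # reference to an integer: read records of that many bytes
    my @chunks;
    while (my $chunk = <$fh>) {
        push @chunks, $chunk;
    }
    return @chunks;
}

# Sample data read through an in-memory filehandle (illustration only).
my $data = "abcdefghij";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @chunks = read_chunks($fh, 4);
close $fh;

print join(',', @chunks), "\n";
print "record separator is back to a newline\n" if $/ eq "\n";
```

Localizing $/ inside the smallest scope that needs it means no later code has to remember to reset it, which is why the explicit `$/ = "\n";` line becomes unnecessary.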
Re: Re^3: Shell Script Woes (review) by Limbic~Region (Chancellor) on Jan 10, 2003 at 23:14 UTC
Re: Shell Script Woes by traveler (Parson) on Jan 02, 2003 at 23:00 UTC
"Make the assumption that the string won't wrap 3 or more lines (though unlikely)"

I am a bit confused by this. It looked as though "filename" contained a list of strings or REs to match. It seems as though the remaining files contain lines of data. I guess what this means is that the string to match could be very long: possibly 3+ lines. Is that correct?

Some ideas that might get you going:

Could you modify tcgrep to do what you want? It seems to be pretty close.

Using the "finger scan" is really inefficient. This is one reason the RE engine is nice. Lots of credit goes to Aho. Here is an article describing the math. It gets complex, but that is why we have the Perl RE engine and why we have study.

It seems as though what you need is a special kind of string that looks to the RE engine like a string, but reads from STDIN when it needs to and throws out data in the process: a circular buffer. I searched CPAN and could not find such a beast. Maybe this is a good project?

HTH, --traveler
Re: Shell Script Woes by runrig (Abbot) on Jan 02, 2003 at 23:57 UTC
"Make the assumption that the string won't wrap 3 or more lines (though unlikely)"

Why assume that unlikely == impossible? Starting with tye's answer, count the number of newlines in the grep string with the most newlines, buffer that many lines, and use that (+ the current line) to match your pattern.
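
The line-buffering idea above could be sketched roughly as follows. This is only one reading of the suggestion: the sub name first_match_in_window, the $keep parameter (how many earlier lines to retain) and the sample data are all hypothetical, and newlines are dropped before matching, as in the code earlier in the thread.

```perl
use strict;
use warnings;

# Scan a filehandle line by line, keeping only the last $keep lines
# plus the current one, and return the first listed string found.
sub first_match_in_window {
    my ($fh, $re, $keep) = @_;
    my @window;
    while (my $line = <$fh>) {
        chomp $line;
        push @window, $line;
        shift @window while @window > $keep + 1;   # forget older lines
        return $1 if join('', @window) =~ $re;     # stop at the first match
    }
    return;
}

# Hypothetical search list, quoted so each entry is matched literally.
my @strings = ('needle in haystack');
my $alternation = join '|', map { quotemeta($_) } @strings;
my $re = qr/($alternation)/i;

# Sample data in which the string is wrapped across three lines.
my $data = "first line\nsome nee\ndle in hay\nstack here\nlast line\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $hit = first_match_in_window($fh, $re, 2);
close $fh;

print defined $hit ? "matched: $hit\n" : "no match\n";
```

Memory use is bounded by $keep + 1 lines, so nothing grows without limit, but a string that wraps across more lines than the window holds will be missed.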