
Shell Script Woes

by Limbic~Region (Chancellor)
on Jan 02, 2003 at 21:59 UTC ( [id://223874]=perlquestion )

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All:
I am in the process of converting a lot of shell scripts to Perl. I am having a problem with the following:

#!/usr/bin/ksh
GrepList=`sed '/^ *$/d;/^#/d' traplist.in 2>/dev/null`
grep -ilF "$GrepList" out/do* 2>/dev/null | xargs -i mv {} ./capture 2>/dev/null

This isn't the entire script, but it is in a continual loop.
The mv is in a race condition that is beyond my control and the GrepList changes over time.

Some of the limitations of the shell script implementation are:
1. The item in the GrepList that matched isn't recorded
2. The grep fails if the string is wrapped across lines (contains an embedded newline)

I came up with the following (with help from the CB) as a start (code doesn't exactly match)

#!/usr/bin/perl
my @checklist;
my @testfile;
my $rulenames;
open (FILE,"filename");
while (<FILE>) {
    next if ( $_ =~ /^ *$/ || $_ =~ /^#/ );
    push @checklist , $_;
}
chomp @checklist;
close (FILE);
my $GrepList = join '|', map "($_)", @checklist;
@ARGV = <*>;
if (@ARGV) {
    undef $/;
    while (<>) {
        $_ =~ s/\n//;
        next unless ($_ =~ /$GrepList/i);
        print "$ARGV : $+\n";
    }
}

The final product will store the timestamp of the traplist file, reload it only on change, and fix the obvious errors, but for now I am more concerned with the raw functionality.

What I am trying to accomplish is the following:

  • Match on any item in list
  • Stop after first match
  • Remember what matched
  • Be as fast as possible using the least amount of resources (I know - you trade one for the other)

    What I can't do is:

  • Slurp in the file
  • Build an ever increasing string
  • Make the assumption that the string won't wrap 3 or more lines (though unlikely)

    I originally envisioned what I do when I do a word search. You scan using your finger until you come across a letter that starts a word you are looking for. You follow it until you get a match or until you realize it is a dead end. If it is a dead end, you keep on going, having forgotten everything that came before.

    Unfortunately, I don't believe there is a way to tell RegEx to start paying attention on a partial match.

    Any ideas?

    Limbic~Region

    Re: Shell Script Woes (tye's try)
    by tye (Sage) on Jan 02, 2003 at 22:59 UTC

      You need to quotemeta since you are trying to match a list of fixed strings not a list of regular expressions. This also makes it easy to figure out the maximum length of "carry over" you need between buffers.

      #!/usr/bin/perl -w
      use strict;
      my @checklist;
      my @testfile;
      my $rulenames;
      open( FILE, "wordlist" ) or die "Can't read wordlist: $!\n";
      my $maxLen = 0;
      while( <FILE> ) {
          next if $_ =~ /^ *$/ || $_ =~ /^#/;
          $maxLen = length($_) if $maxLen < length($_);
          push @checklist, $_;
      }
      chomp @checklist;
      close(FILE);
      my $bufSize= 8*1024;
      $bufSize= 2*$maxLen if $bufSize < 2*$maxLen;
      my $GrepList = join '|', map quotemeta $_, @checklist;
      $GrepList = qr/($GrepList)/i;
      @ARGV = grep -f $_, <*>;
      $/= \$bufSize; # Have <> read $bufSize bytes
      if( @ARGV ) {
          my $prev= "";
          while( <> ) {
              $_ =~ s/\n//g;
              if( ($prev.$_) =~ /$GrepList/ ) {
                  print "$ARGV : $1\n";
                  close( ARGV );
                  $prev= "";
              } elsif( eof ) {
                  $prev= "";
              } else {
                  $prev = substr( $_, -$maxLen );
              }
          }
      }
      Tested and works. Note that this assumes that you don't have huge runs of newlines in the middles of your matches.

                      - tye

      Updated: I originally left out the setting of $/.
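The carry-over idea above can be tried without any files on disk. The following is a minimal sketch, not tye's code: `first_match`, the sample data, and the 16-byte buffer are made up for illustration, and it carries the tail of the combined window rather than only the tail of the last chunk, which is a small variation on the approach above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the first fixed string
# from @$words found behind $fh, reading $buf_size bytes at a time.
sub first_match {
    my ($fh, $words, $buf_size) = @_;

    my $max_len = 0;
    for my $w (@$words) {
        $max_len = length $w if length $w > $max_len;
    }
    $buf_size = 2 * $max_len if $buf_size < 2 * $max_len;

    # quotemeta makes each entry match literally, not as a regex
    my $alt = join '|', map { quotemeta($_) } @$words;
    my $re  = qr/($alt)/i;

    local $/ = \$buf_size;          # fixed-size records, for this scope only
    my $prev = '';
    while (my $chunk = <$fh>) {
        $chunk =~ tr/\n//d;         # treat wrapped text as one run
        my $window = $prev . $chunk;
        return $1 if $window =~ $re;
        # keep just enough tail to complete a match that straddles reads
        $prev = length($window) > $max_len
              ? substr($window, -$max_len)
              : $window;
    }
    return;
}

# In-memory "file" whose match straddles a read boundary and a newline.
my $data = "xxxxxxxxxxxxneed\nle.in.haystack";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $hit = first_match($fh, ['needle', 'a.b'], 16);
print defined $hit ? "matched: $hit\n" : "no match\n";
```

With this sample input the first 16-byte read ends in "need" and the second starts with a newline followed by "le", so the match is only visible once the carried tail is prepended.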

        tye,
        Thank you very much for the insight. I have "borrowed" a great deal of your code and come up with the following:

        #!/usr/bin/perl -w
        use strict;
        use Time::Local;
        chdir "/var/spool/wt400/gateways/$ARGV[0]" or exit;
        mkdir "capture", 0755 unless (-d "capture");
        my $Dir = $ARGV[1];
        my $ListTime = 0;
        my $BufferSize = 64 * 1024;
        my $MaxLen = 0;
        my %Traps;
        my @GrepList;
        my $GrepString;
        my $Counter = 1;
        my $Size;
        my $Prev;
        my $Now;
        my $NF;
        while (1) {
         if ($Counter > 20 || ! @GrepList) {
          if ( (stat("traplist.$Dir"))[9] gt $ListTime ) {
           $ListTime = (stat(_))[9];
           open (LIST,"traplist.$Dir");
           while (<LIST>) {
            next if ($_ =~ /^Created\t\tExpires/ || $_ =~ /^ *$/);
            my @Fields = split "\t" , $_;
            my($mon, $day, $year, $hour, $min) = split ?[-/:]? , $Fields[1];
            my $Expiration = timelocal(0, $min, $hour, $day, $mon -1, $year + 100);
            $Traps{"$Fields[6]"} = [ $Expiration,$Fields[2],$Fields[5],$Fields[7] ];
           }
           close (LIST);
          }
         }
         @GrepList = ();
         $Now = time;
         foreach my $trap (keys %Traps) {
          push @GrepList,$Traps{$trap}[3] unless (($Traps{$trap}[0] < $Now && $Traps{$trap}[1]) || $trap eq "SIZE");
         }
         map { $MaxLen = length($_) if length($_) > $MaxLen } @GrepList;
         $BufferSize = 2 * $MaxLen if ($BufferSize < 2 * $MaxLen);
         if (exists $Traps{"SIZE"} && $Traps{"SIZE"}[1]) {
          $Size = $Traps{"SIZE"}[2] unless ($Traps{"SIZE"}[0] < $Now && $Traps{"SIZE"}[2] > 0);
         }
         exit unless (@GrepList || $Size);
         $GrepString = join '|', map quotemeta $_, @GrepList;
         $GrepString = qr/($GrepString)/i;
         if ($Dir eq "out") {
          @ARGV = <out/do*>;
         }
         elsif ($Dir eq "in") {
          @ARGV = <in/di*>;
         }
         else {
          @ARGV = <out/do* in/di*>
         }
         if (@ARGV) {
          $/=\$BufferSize;
          $Prev= "";
          while (<>) {
           $_ =~ tr/\n//d;
           if(($Prev.$_) =~ /$GrepString/) {
            ($NF = "$ARGV-$+") =~ s/^.*\///;
            rename $ARGV , "capture/$NF";
            close (ARGV);
            $Prev = "";
           }
           if (eof) {
            $Prev = "";
           }
           else {
            $Prev = substr($_,-$MaxLen);
           }
          }
         }
         $/ = "\n";
         ++$Counter;
         sleep 3
        }

        This provides 10X more functionality than the original shell script did.
        I would appreciate any advice on how it could be made to go faster and still be efficient.

        L~R

          First, I'd use more than one space for indentation. I use 4 because I like the way it discourages overly deep nesting of code. Even 2 or 3 would be quite a bit better than 1, IMO.

          $Traps{"$Fields[6]"} = [ $Expiration,$Fields[2],$Fields[5],$Fields[7] ];
          can be written
          $Traps{$Fields[6]} = [ $Expiration, @Fields[2,5,7] ];
          Putting in too many quotes can bite you (though using it as a hash key also does the stringification which would bite you in the same way in this case -- changing an object into a string) so be careful of it.
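As a quick illustration of the two points above, here is a small sketch with made-up field values (the sample `@Fields` contents and the epoch number are assumptions, not data from the thread). It shows the array slice building the same record, and what interpolating a reference into quotes does to it.

```perl
use strict;
use warnings;

# Made-up sample row; only the positions matter here.
my @Fields = ('created', 'expires', 'active', 'x', 'y', 'size', 'NAME', 'pattern');
my $Expiration = 1041544740;    # arbitrary epoch value for illustration

my %Traps;
# The slice @Fields[2,5,7] expands to ($Fields[2], $Fields[5], $Fields[7]).
$Traps{$Fields[6]} = [ $Expiration, @Fields[2,5,7] ];
print join(',', @{ $Traps{NAME} }), "\n";

# Quoting a reference turns it into a plain string such as "ARRAY(0x...)",
# which can no longer be dereferenced.
my $ref  = [1, 2, 3];
my $copy = "$ref";
print ref($ref), " vs ", (ref($copy) || 'plain string'), "\n";
```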

          You can make the code clearer using a few constants:

          sub EXPIRE() { 0 }
          sub WHATEVER() { 1 }
          sub FOOBAR() { 2 }
          sub FILENAME() { 3 }
          (these make no difference in the running time of the code since they get optimized away at compile time). I'd also avoid 'unless' so
          push @GrepList,$Traps{$trap}[3] unless (($Traps{$trap}[0] < $Now && $Traps{$trap}[1]) || $trap eq "SIZE");
          becomes
          push @GrepList, $Traps{$trap}[FILENAME]
              if  $trap ne "SIZE"
              and ! $Traps{$trap}[WHATEVER] || $Now <= $Traps{$trap}[EXPIRE];
          for example (I find spacing more effective at conveying grouping than parens, YMMV).
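To check that the rewritten condition really is the negation of the original `unless`, one can run both over every combination of inputs. This is only a sketch: `keep_original` and `keep_rewritten` are hypothetical wrappers, and the constant names are the placeholders used above.

```perl
use strict;
use warnings;

sub EXPIRE()   { 0 }
sub WHATEVER() { 1 }    # placeholder name, as in the reply
sub FILENAME() { 3 }

# The condition as written with 'unless' in the original script.
sub keep_original {
    my ($name, $t, $now) = @_;
    my $keep = 1;
    $keep = 0 if ($t->[0] < $now && $t->[1]) || $name eq "SIZE";
    return $keep;
}

# The rewritten form: 'and' binds more loosely than '||', so this reads
# as  A and ( (not B) or C ).
sub keep_rewritten {
    my ($name, $t, $now) = @_;
    return 1 if  $name ne "SIZE"
             and ! $t->[WHATEVER] || $now <= $t->[EXPIRE];
    return 0;
}

my $now = 100;
my $mismatches = 0;
for my $name ('SIZE', 'other') {
    for my $flag (0, 1) {
        for my $expire ($now - 1, $now, $now + 1) {
            my $t = [ $expire, $flag, 0, 'pattern' ];
            $mismatches++
                if keep_original($name, $t, $now) != keep_rewritten($name, $t, $now);
        }
    }
}
print "mismatches: $mismatches\n";
```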

          Don't use map unless you want the list that it builds:
          map { $MaxLen = length($_) if length($_) > $MaxLen } @GrepList;
          becomes

          for( @GrepList ) {
              $MaxLen = length($_) if length($_) > $MaxLen;
          }
          If you really have a need for single-line code (which is a mistake in my book), then remove the newlines.

          Use local( $/ )= \$BufferSize; and you can drop the $/ = "\n"; line.
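A small sketch of why `local` removes the need for the manual reset: the change to `$/` lasts only until the enclosing block or sub exits. The helper `read_chunks` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read a string in fixed-size records.
sub read_chunks {
    my ($string, $size) = @_;
    open my $fh, '<', \$string or die "Can't open in-memory file: $!";
    local $/ = \$size;      # restored automatically when the sub returns
    my @chunks = <$fh>;     # each record is at most $size bytes
    return @chunks;
}

my $text   = "abcdefghij\nklm\n";
my @chunks = read_chunks($text, 4);
print scalar(@chunks), " chunks; \$/ is ",
      ($/ eq "\n" ? "newline again" : "still changed"), "\n";
```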

          So no speed improvements to offer. (:

                          - tye
    Re: Shell Script Woes
    by traveler (Parson) on Jan 02, 2003 at 23:00 UTC
      Make the assumption that the string won't wrap 3 or more lines (though unlikely)
      I am a bit confused by this. It looked as though "filename" contained a list of strings or REs to match. It seems as though the remaining files contain lines of data. I guess what this means is that the string to match could be very long: possibly 3+ lines. Is that correct?

      Some ideas that might get you going:

      • Could you modify tcgrep to do what you want? It seems to be pretty close.
      • Using the "finger scan" is really inefficient. This is one reason the re engine is nice. Lots of credit goes to Aho. Here is an article describing the math. It gets complex, but that is why we have the perl re engine and why we have study.
      • It seems as though what you need is a special kind of string that looks to the re engine like a string, but reads from STDIN when it needs to and throws out data in the process: a circular buffer. I searched CPAN and could not find such a beast. Maybe this is a good project?
      HTH, --traveler
    Re: Shell Script Woes
    by runrig (Abbot) on Jan 02, 2003 at 23:57 UTC
      Make the assumption that the string won't wrap 3 or more lines (though unlikely)

      Why assume that unlikely == impossible? Starting with tye's answer, count the number of newlines in the grep string with the most newlines, buffer that many lines, and use that (+ the current line) to match your pattern.
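One way to read this suggestion, sketched under the assumption that the grep strings themselves contain the literal newlines (the thread does not say so explicitly): keep a sliding window of the last N+1 lines, where N is the largest newline count in any pattern. `first_match_by_lines` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: patterns are fixed strings that may contain "\n".
sub first_match_by_lines {
    my ($fh, $patterns) = @_;

    # Largest number of newlines in any single pattern.
    my $max_nl = 0;
    for my $p (@$patterns) {
        my $n = ($p =~ tr/\n//);    # tr with no replacement only counts
        $max_nl = $n if $n > $max_nl;
    }

    my $alt = join '|', map { quotemeta($_) } @$patterns;
    my $re  = qr/($alt)/i;

    my @window;
    while (my $line = <$fh>) {
        push @window, $line;
        shift @window while @window > $max_nl + 1;   # keep N previous lines + current
        return $1 if join('', @window) =~ $re;
    }
    return;
}

my $data = "alpha\nfoo\nbar baz\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my $hit = first_match_by_lines($fh, ["foo\nbar", "zzz"]);
print defined $hit ? "matched a pattern spanning lines\n" : "no match\n";
```

Memory use is bounded by the window size rather than the file size, at the cost of re-scanning each line up to N+1 times.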

    Node Type: perlquestion [id://223874]
    Approved by joe++
    Front-paged by tye