PerlMonks  

Between-text range operator problem

by jlongino (Parson)
on May 17, 2002 at 01:01 UTC ( [id://167160]=perlquestion )

jlongino has asked for the wisdom of the Perl Monks concerning the following question:

I came across a situation similar to this while working on a project. It is not a large project, but the program involved will be kicked off every 5 minutes. To my dismay, the "integrity" of the input files is not of the highest standards, but then they do originate from a governmental source.

The program opens several files, and is to grab the text between header/footer tags (they're not html) including the header/footer. I thought that using the range operator would be perfect but that's when I got hit by the "integrity" factor. Apparently it is not uncommon for a footer to be left off when it is the last section in a file. So much for elegance.

My question is, is there a clever/elegant solution to this problem (i.e., slight modification to the range statement) or will it require a more brute force approach? I'm not asking that anyone rewrite the entire program to make it work, I can do that myself. It seems to me that this must be a common problem and that the use of the range operator as I've done is too fragile for any but the most controlled circumstances.

file2.txt:

    FILE2
    AAA
    text1
    text2
    text3
    EOAAA
    BBB
    text4
    EOBBB
    CCC
    text5
    text6
    text7
    text8
    EOCCC

file1.txt:

    FILE1
    SEGMENT1
    text1
    text2
    text3
    EOS1
    SEGMENT2
    text4
    EOS2
    SEGMENT3
    text5
    text6
    text7
    text8

file3.txt:

    FILE3
    P201
    text1
    text2
    text3
    EOP201
    P333
    text4
    EOP333
    P588
    text5
    text6
    text7
    text8
    EOP588

Program:

    use strict;

    my @jobs = (
        'file2.txt|AAA|EOAAA',
        'file1.txt|SEGMENT3|EOS3',
        'file3.txt|P333|EOP333'
    );

    for (@jobs) {
        my ($file, $beg, $end) = split /\|/;
        my $first_line = 1;
        my @lines = ();
        print "Opening file: '$file'\n",
              " beg: '$beg' end: '$end'\n";
        open (INFILE, "<$file") or die "Could not open '$file': $!\n";
        while (<INFILE>) {
            if (/$beg/ .. /$end/) {
                chomp;
                print " first: '$_'\n\n" if $first_line;
                $first_line = 0;
                push (@lines, $_);
            }
        }
        print "$_\n" for @lines;
        print "-" x 30, "\n\n";
        undef @lines;
    }

Output:

    Opening file: 'file2.txt'
     beg: 'AAA' end: 'EOAAA'
     first: 'AAA'

    AAA
    text1
    text2
    text3
    EOAAA
    ------------------------------

    Opening file: 'file1.txt'
     beg: 'SEGMENT3' end: 'EOS3'
     first: 'SEGMENT3'

    SEGMENT3
    text5
    text6
    text7
    text8
    ------------------------------

    Opening file: 'file3.txt'
     beg: 'P333' end: 'EOP333'
     first: 'FILE3'

    FILE3
    P201
    text1
    text2
    text3
    EOP201
    P333
    text4
    EOP333
    ------------------------------

I tried what seemed like reasonable approaches, but none of them worked, such as the following and variations thereof:

    if ((/$beg/ .. /$end/) or (/$beg/ .. eof())) {

Thanks much,

--Jim
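
The third job in the output above starts at FILE3 rather than P333. A likely reading is that the range operator keeps its true/false state between loop iterations: file1.txt never supplies EOS3, so the range is still "open" when file3.txt is read. The following is a minimal sketch of that effect, assuming shortened, hypothetical copies of the sample files held in in-memory filehandles instead of on disk:

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-ins for the sample files.
my %files = (
    'file1.txt' => "FILE1\nSEGMENT3\ntext5\n",                  # EOS3 footer missing
    'file3.txt' => "FILE3\nP201\nEOP201\nP333\ntext4\nEOP333\n",
);
my @jobs = (
    [ 'file1.txt', 'SEGMENT3', 'EOS3'   ],
    [ 'file3.txt', 'P333',     'EOP333' ],
);

my %got;
for my $job (@jobs) {
    my ($name, $beg, $end) = @$job;
    open my $fh, '<', \$files{$name} or die "Could not open '$name': $!\n";
    while (<$fh>) {
        chomp;
        # There is one range operator in the source, hence one shared state.
        # If /$end/ never matches, it is still true when the next file starts.
        push @{ $got{$name} }, $_ if /$beg/ .. /$end/;
    }
    close $fh;
}

print "$_: @{ $got{$_} }\n" for sort keys %got;
# file1.txt: SEGMENT3 text5
# file3.txt: FILE3 P201 EOP201 P333 text4 EOP333
```

The second job picks up every line from the top of file3.txt until EOP333 finally closes the range that file1.txt left open.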

Replies are listed 'Best First'.
•Re: Between-text range operator problem
by merlyn (Sage) on May 17, 2002 at 01:28 UTC
      Well, chalk one up for shaving serendipities. This morning while I was shaving, I had a flash--maybe I should've explicitly used eof(INFILE) instead of eof(). Unfortunately, I didn't have time to test it until I got home from work this evening.

      Both of the following will work:


          if (/$beg/ .. (/$end/ || eof )) { ... }
          if (/$beg/ .. (/$end/ || eof(INFILE) )) { ... }

      but the original will not:

          if (/$beg/ .. (/$end/ || eof() )) { ... }

      After figuring this out, I checked a few resources (the best explanation came from perlfunc eof):

      An eof without an argument uses the last file read as argument. Using eof() with empty parentheses is very different. It indicates the pseudo file formed of the files listed on the command line, i.e., eof() is reasonable to use inside a while (<>) loop to detect the end of only the last file.

      Obviously, I made some faulty assumptions as to how eof() works. Given how seldom I've actually used eof, I should've checked the docs when I first encountered the "freeze-up" problem.

      --Jim
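
For reference, here is a self-contained sketch of the working form. It assumes a lexical filehandle and a hypothetical helper named grab_section (neither is in the original program), and reads from in-memory strings so it can run without the sample files. The "freeze-up" with eof() is probably not a true loop: with nothing on the command line, eof() ends up checking STDIN and waits for input there.

```perl
use strict;
use warnings;

# Hypothetical sample data: the last section (SEGMENT3) has no EOS3 footer.
my $text = join "\n", qw(FILE1 SEGMENT1 text1 EOS1 SEGMENT3 text5 text6), '';

# Hypothetical helper: returns the lines from the first /$beg/ match through
# /$end/, or through the last line of the file when the footer is missing.
sub grab_section {
    my ($fh, $beg, $end) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        # eof($fh) is true once the line just read was the last one, so the
        # range always closes at end of file and cannot stay open for the
        # next file that is processed.
        if ($line =~ /$beg/ .. ($line =~ /$end/ || eof($fh))) {
            push @lines, $line;
        }
    }
    return @lines;
}

open my $fh, '<', \$text or die "Could not open in-memory file: $!\n";
my @section = grab_section($fh, 'SEGMENT3', 'EOS3');
close $fh;

print "$_\n" for @section;    # SEGMENT3, text5, text6
```

Naming the handle explicitly, as in eof($fh) or eof(INFILE), also makes it obvious which file is being tested.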

      Thanks for replying merlyn. Actually, this is one of the variations I tried, but it goes into an infinite loop in the first file after hitting the range statement on the fourth iteration (I believe when the successful /$end/ match occurs).

      --Jim

Re: Between-text range operator problem
by tadman (Prior) on May 17, 2002 at 01:12 UTC
    First, this type of thing was discussed recently in read between two strings, so maybe that will be of help.

    Secondly, if people are going to be fools, then you probably have to do something besides use the range operator. A regular state system seems to work well:

        my %pair = (
            'foo'   => 'bar',
            'jack'  => 'daniels',
            'Tom'   => 'Jerry',
            'BEGIN' => 'END',
        );

        my %data;
        my $tail;
        my $type;

        while (<FILE>) {
            chomp;
            s/\s+$//;                         # Clean up invisible stuff
            if ($pair{$_}) {                  # Header: start a new section
                $type = $_;
                $tail = $pair{$_};
            }
            next unless defined $type;        # Skip lines outside any section
            push (@{$data{$type}}, "$_\n");   # Store header, body and footer
            undef $type if $_ eq $tail;       # Footer: close the section
        }

    Now you can get the stuff out of the %data hash, or deal with it some other way. YMMV.
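
As a rough illustration of getting the sections back out, here is a small variation on the same state approach. The header/footer table and the input are hypothetical, the input comes from an in-memory string rather than a real file, and the final section deliberately lacks its footer:

```perl
use strict;
use warnings;

# Hypothetical header => footer table and sample input; the final
# section (CCC) is missing its EOCCC footer.
my %pair  = ( AAA => 'EOAAA', CCC => 'EOCCC' );
my $input = join "\n", qw(FILE2 AAA text1 EOAAA CCC text5 text6), '';

my (%data, $type);
open my $fh, '<', \$input or die "Could not open in-memory file: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    $type = $line if exists $pair{$line};     # header opens a section
    next unless defined $type;                # ignore text outside sections
    push @{ $data{$type} }, $line;
    undef $type if $line eq $pair{$type};     # footer closes it
}
close $fh;

# Each section is available by header name, whether or not it had a footer.
for my $header (sort keys %data) {
    print "$header: @{ $data{$header} }\n";
}
```

Because a new header also resets the current section, a missing footer in the middle of a file is handled the same way as one at the end.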
      Thanks for the reply tadman. The link, though somewhat related, doesn't really address the header/footer failure aspect. I think I understand your example but I'll have to play with it some to be sure.

      --Jim

Re: Between-text range operator problem
by jlongino (Parson) on May 17, 2002 at 03:16 UTC
    I would still like to learn how to implement the between-text range operator elegantly to handle this type of situation, but in the meantime I've rewritten the program to use a state variable (thanks tadman for the suggestion). TIMTOWTDI:

        use strict;

        my @jobs = (
            'file2.txt|CCC|EOCCC',
            'file1.txt|SEGMENT3|EOS3',
            'file3.txt|P333|EOP333'
        );

        for (@jobs) {
            my ($file, $beg, $end) = split /\|/;
            my $first_line = 1;
            my $state = 0;
            my @lines = ();
            print "Opening file: '$file'\n",
                  " beg: '$beg' end: '$end'\n";
            open (INFILE, "<$file") or die "Could not open '$file': $!\n";
            while (<INFILE>) {
                chomp;
                $state = 1 if (/$beg/);
                if ($state) {
                    print " first: '$_'\n\n" if $first_line;
                    $first_line = 0;
                    push (@lines, $_);
                }
                $state = 0, last if /$end/;
            }
            ## no matching /$end/, so add one
            push (@lines, $end) if $state;
            print "$_\n" for @lines;
            print "-" x 30, "\n\n";
        }

    --Jim
