Parse Loops with flat text files. (code)

deprecated has asked for the wisdom of the Perl Monks concerning the following question:

I've had the pleasure of hacking through three types of flat text file recently. They are

the entire pile of RFC's from http://www.faqs.org

3000 alcoholic beverage recipes (from somewhere I, mysteriously, cannot remember)

So I have been parsing a lot of flat text files. The RFCs are HTML, but there's a lot of fluff at the beginning and at the end, so I've been using the loop below to extract the 'meat'. The drinks are also HTML, but have different crap around them. The CIF files are not HTML, and I can't really strip a lot of data from them -- but I want to be able to strip them out of other data.

So with these three programs (in about 2 weeks) I have had to use some sort of "start parsing, parse, stop parsing" loop three times. I've even pondered writing a small module to do it for me (not for CPAN; I would probably post it here, but it's just something to keep in my homedir to ease future scripts). This is something that has undoubtedly been done zillions of times. After all, what is perl but a parser?

So whilst working on making my code readable, I stumbled upon the use of arrays of qr!!, iterating through them when matching text (see Using arrays of qr!! to simplify larger RE's for readability (code) and Optimization for readability and speed (code)). This allows some flexibility (i.e., multiple "start" and "finish" conditions), and it is also pretty clear to read (as it reduces the size of the individual regular expressions).

But looking over the code, I don't get a good "satisfied" feeling re-using it. So, here it is, and I'd like to know what others would do instead:

my @beginnings =
( qr{This is valid</a>},
  qr{[So]+(?:IS|isnt) this}, 
);

my @endings =
( qr{(?:we) Should not [be] [Pp]arsing after},
  qr{either (o|f) these [Ll]ines},
);

sub isbeg {
  my $test = shift;
  foreach (@beginnings) { return undef unless $test =~ $_ }
  $test;
}

sub isend {
  my $test = shift;
  foreach (@endings) { return undef unless $test =~ $_}
  $test;
}

# here is the part I dislike, partially because of the $parsing variable
# it just doesnt seem as "clean" as something some of
# you would write.

  my $parsing;
  foreach my $line (@lines) {
    $parsing++ if isbeg( $line );
    push @extracted, "$line\n" if $parsing;
    last if isend( $line );
  }

I'm familiar with HTML::TokeParser and HTML::Parser, but since I do this a lot on non-HTML files, I'd like to extract the good parts with my loop and use the parsing modules to parse the stuff I want to parse (rather than the gristle).

thanks
hermano deppon

--
Laziness, Impatience, Hubris, and Generosity.


Replies are listed 'Best First'.
Re: Parse Loops with flat text files. (code) by danger (Priest) on May 13, 2001 at 23:20 UTC
A couple of points. First of all, your return logic seems off in your two test subs. Hazarding a guess, I'd say you just wrapped the foreach loops around your previous version that dealt with only one pattern -- but now a line must match all the patterns to succeed (it'll return 'undef' if any don't match). I'm thinking you want isbeg() (and isend()) to return true if any one of the re's matches the line:

sub isbeg {
  my $test = shift;
  foreach (@beginnings) { return 1 if $test =~ /$_/ }
  return;
}

sub isend {
  my $test = shift;
  foreach (@endings) { return 1 if $test =~ /$_/ }
  return;
}

As for your parsing loop, you can use the range operator in scalar context (flip-flop op):

foreach my $line (@lines) {
  push @extracted, $line if isbeg($line) .. isend($line);
  last if isend($line);
}

You could drop the 'last' statement if the data might have more than one valid section you want to grab.

Also, a style note about your use of 'return undef' -- to return a generic false value just use 'return' with no arguments: it'll return undef in scalar context and an empty list in list context. Thus, not only is it shorter to type, it is more versatile as well.
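A minimal, self-contained sketch of the flip-flop approach described above. The marker patterns, the sample lines, and the helper names matches_any() and extract_sections() are made up for illustration and are not from the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical markers, chosen only for this illustration.
my @beginnings = ( qr{^BEGIN\b}, qr{^START\b} );
my @endings    = ( qr{^END\b},   qr{^STOP\b}  );

# True if any pattern in the list matches the line.
sub matches_any {
    my ($line, @patterns) = @_;
    foreach my $re (@patterns) {
        return 1 if $line =~ $re;
    }
    return;    # bare return: undef in scalar context, empty list in list context
}

# Collect every section, markers included. The flip-flop becomes true on
# the line where the left test succeeds and stays true through the line
# where the right test succeeds.
sub extract_sections {
    my @lines = @_;
    my @extracted;
    foreach my $line (@lines) {
        push @extracted, $line
            if matches_any($line, @beginnings) .. matches_any($line, @endings);
    }
    return @extracted;
}

my @lines = ( 'junk', 'BEGIN one', 'keep 1', 'END one',
              'noise', 'START two', 'keep 2', 'STOP two', 'trailer' );

print "$_\n" for extract_sections(@lines);
```

Because there is no 'last', both sections are collected. One thing to keep in mind: the flip-flop remembers its state between evaluations, so input that ends in the middle of a section leaves it "on" for the next call.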
Re:(2) Parse Loops with flat text files. (code) by deprecated (Priest) on May 13, 2001 at 23:51 UTC
Hm, I knew of the flip-flop operator but I haven't ever used it. Time to go read perlop.

With regard to returning undef: I use it just as a rule of thumb because it will evaluate to false. However, in subs where I am returning the value of the tested object, I want to be able to return 0 if the test is good -- so I always check to see if defined test( $foo ). dig?

brother dep.

--
Laziness, Impatience, Hubris, and Generosity.
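A small sketch of the convention described here, where a test sub hands back the tested value and the caller checks definedness rather than truth, so that 0 still counts as a pass. The sub name valid_count() and the sample inputs are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical validator: returns the value itself when it is a
# non-negative integer, and nothing (undef in scalar context) otherwise.
sub valid_count {
    my $value = shift;
    return $value if defined $value && $value =~ /^\d+$/;
    return;
}

foreach my $input (0, 7, 'abc') {
    my $count = valid_count($input);
    if (defined $count) {
        print "accepted $count\n";    # 0 is accepted, even though it is false
    }
    else {
        print "rejected $input\n";
    }
}
```

A plain truth test (if ($count)) would wrongly reject the 0 here, which is why the defined check matters for this style of sub.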
Re: Re:(2) Parse Loops with flat text files. (code) by danger (Priest) on May 14, 2001 at 00:39 UTC
My only point was regarding when you simply want to return a false value to indicate a subroutine failure. Consider the following contrived example where we only want to process strings beginning with a particular pattern:

#!/usr/bin/perl -w
use strict;

my @patterns = (qr/^foo/, qr/^qux/);
my @strings  = ('foo bar', 'bar bar', 'qux bar');
my @stuff;

foreach my $string (@strings){
  if(@stuff = dice_it($string)){
    print "Processing: $string\n";
    process_stuff(@stuff);
  }
}

sub dice_it {
  my $string = shift;
  foreach my $pat (@patterns) {
    return split //, $string if $string =~ /$pat/;
  }
  return undef;
}

sub process_stuff {
  foreach (@_) {
    print "<$_>";
  }
  print "\n";
}

I'm not suggesting this is a terribly common problem (or that the above is a good way to approach this particular example). I just wanted to point out that returning 'undef' as a failure mode isn't always appropriate -- and people who do so may forget that an array containing one undefined element still evaluates to true, so they may bang their head for a while before they realize why they are processing the string 'bar bar' and getting a warning. Changing the last line of dice_it() to just a bare 'return' statement alleviates the problem because it returns the 'right' thing depending on context.
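The difference can be seen in isolation with a short sketch. The two sub names are made up for this illustration:

```perl
use strict;
use warnings;

sub fail_with_undef { return undef }   # a one-element list (undef) in list context
sub fail_with_bare  { return }         # an empty list in list context

my @from_undef = fail_with_undef();
my @from_bare  = fail_with_bare();

printf "return undef: %d element(s), array is %s\n",
    scalar @from_undef, (@from_undef ? 'true' : 'false');
printf "bare return:  %d element(s), array is %s\n",
    scalar @from_bare, (@from_bare ? 'true' : 'false');

# In scalar context both behave the same: the result is undef.
my $scalar_undef = fail_with_undef();
my $scalar_bare  = fail_with_bare();
print "scalar context: both undefined\n"
    if !defined $scalar_undef && !defined $scalar_bare;
```

The array filled by 'return undef' holds one element and so tests true, which is exactly the trap in the dice_it() example.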
Re: Re:(2) Parse Loops with flat text files. (code) by Anonymous Monk on May 15, 2001 at 00:07 UTC
I was also thinking the flip-flop operator would be good for this. It's what I always try and use for this kind of "in the middle" test.
Re (tilly) 1: Parse Loops with flat text files. (code) by tilly (Archbishop) on May 14, 2001 at 02:18 UTC
Two comments.

First of all isbeg and isend are names that grate on me. In general if you provide visual separation for what are supposed to be separate words, people find it easier to read. So I would be inclined to name them is_beg and is_end. A more subtle issue that I have found is that when I abbreviate, I sometimes abbreviate inconsistently. I have enough code, and I am surrounded by enough, that I usually don't worry about it. But I am seriously considering longer names. In which case I would write is_begin instead of is_beg. (I haven't tried that last though, I may feel differently on it tomorrow.)

Beyond that you have a lot of state that seems spread out across multiple places of the code. And what is worse is that the state is done in a way where you have no choice but to slurp up the whole file. Depending on whether you ever expect to hit a large file, you may not care. But if you do then you may want to consider how to incrementally process the file.

With this you would want to consider your filehandle as an infinite stream, wrap that in an object that reads through the filehandle, reading as needed, which knows how to filter and returns wanted values as long as there are more of them. And then in the main body of your code you could loop over the output of your filter. (If you wanted you could even make this a tied handle.)

Now here are three random links that each relate in some way to what I just outlined above. Be warned. None of them apply to how to implement the above in Perl (which you should be able to figure out), but each one applies in some way to either the problem or the proposed solution. Each one opens up some issue or concern. And if all else fails, I found each one interesting...
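One possible reading of the streaming idea, sketched with a closure rather than a full object or tied handle. The helper name make_section_reader(), its arguments, and the sample data are assumptions made for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: returns an iterator (a closure) that reads $fh one
# line at a time and hands back only the lines inside a begin..end section.
sub make_section_reader {
    my ($fh, $is_begin, $is_end) = @_;
    my $inside = 0;    # the parsing state lives here, not in the caller
    return sub {
        while (defined(my $line = <$fh>)) {
            $inside = 1 if !$inside && $is_begin->($line);
            next unless $inside;
            $inside = 0 if $is_end->($line);
            return $line;
        }
        return;        # input exhausted
    };
}

# Demonstration on an in-memory filehandle; a real file opens the same way.
my $data = join '', map { "$_\n" }
    'header', 'start', 'wanted 1', 'wanted 2', 'end', 'footer';
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

my $next_line = make_section_reader(
    $fh,
    sub { $_[0] =~ /^start/ },
    sub { $_[0] =~ /^end/ },
);

while (defined(my $line = $next_line->())) {
    print $line;
}
close $fh;
```

Only one line is held in memory at a time, and the main loop no longer needs a $parsing variable because the state is kept inside the iterator.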
Re: Parse Loops with flat text files. (code) by MeowChow (Vicar) on May 14, 2001 at 11:41 UTC
My first impression is that your code is lacking in abstraction and encapsulation, but it's difficult to recommend specific changes based on this little snippet. I would recommend reading the thread starting at SAS log scanner, which discusses using "|" concatenation to make more efficient regexes. And since you've already got the entire file snarfed into memory, you may as well operate on it as one large string buffer, instead of an array, which would eliminate all of the undesirable code. Though, if your files are large, I would suggest that you parse them as streams, as tilly has already recommended.

Your API / return value conventions really irk me, but we've gone over this before, and we agreed to disagree. I think it's worth mentioning, however, that other monks whose perl-fu is strong (hdp, danger) have taken you to task on this very issue, and that I've seen you make several mistakes directly attributable to confusing return value logic. I have the feeling that this is a nasty habit carried over from a past of writing shell scripts. Well, when in Perl, do as the perlmonks do, or something like that...

Anyway, enough discussion :) here's a code sample which demonstrates some of the ideas I've suggested in this node:

my @beginnings =
( qr{This is valid</a>},
  qr{[So]+(?:IS|isnt) this},
  qr{start},
);

my @endings =
( qr{(?:we) Should not [be] [Pp]arsing after},
  qr{either (?:o|f) these [Ll]ines},
  qr{end},
);

my ($re_begin, $re_end, $re_extract);
{
  local $" = '|';
  $re_begin   = qr{@beginnings};
  $re_end     = qr{@endings};
  $re_extract = qr{$re_begin(.*?)$re_end}s;
}

# is_begin and is_end are no longer needed, but
# you can see how trivial they've become after
# concatenating the regexes with '|'
sub is_begin { shift =~ $re_begin }
sub is_end   { shift =~ $re_end   }

# extract is equally simple
sub extract  { shift =~ $re_extract }

my $text;
{
  local $/;
  $text = <DATA>;
}

my ($extracted) = extract $text;
print $extracted;

__DATA__
blah blah blah
start
4 5 6
end
more blah blah

MeowChow
s aamecha.s a..a\u$&owag.print

