Re: Re3: Regexes on Streams - Revisited!

by tilly (Archbishop)
on Oct 16, 2003 at 14:38 UTC ([id://299761])


in reply to Re3: Regexes on Streams - Revisited!
in thread Regexes on Streams - Revisited!

You don't see it as buggy that the way $/ matches a record depends, in some rather complex way, on the layout of the data and your buffer size, with the only guarantee being that it works as expected if you suck in the whole stream? It isn't enough to say, "The program does what it is coded to." If you are going to promise people that they can use regexes to match input, then make some attempt to do it consistently, and if you can't, then at least make some attempt to fail in an easily explainable way. For instance, I can be OK with, "I can produce strange results if $/ is supposed to match more than one block." If you can't deliver it, then don't even seem to promise it.

If you want to attempt to deliver on the promise, then one idea is the approach that I suggested at Re: Regexes on Streams, which the code above implements a variation on. If the programmer wants to give a potentially infinite match, you do your best to satisfy them. Perhaps it is qr/.*/ and life sucks. Perhaps it is /[\r\n](?:\s*[\r\n]|)/ (i.e., match the end of a line and any following blank lines, with either Unix or DOS line endings), and even though it is potentially infinite, with real data it is also pretty sensible and shouldn't be broken too easily. (Note: the $/ example that Dominus used in his chapter was potentially infinite...)
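
To make that second terminator concrete, here is a minimal sketch that applies it to a complete string, once with Unix and once with DOS line endings. The helper name split_records is my own for illustration, and the sketch deliberately sidesteps the block-boundary problem by working on the whole input at once.

```perl
use strict;
use warnings;

# The terminator from the post: one line ending, optionally followed by
# any run of blank lines. The group is non-capturing on purpose.
my $terminator = qr/[\r\n](?:\s*[\r\n]|)/;

# Hypothetical helper: split a complete string into records, keeping each
# terminator attached to the record it ends.
sub split_records {
    my ($text) = @_;
    my @parts = split /($terminator)/, $text;
    my @records;
    while (@parts > 1) {
        push @records, shift(@parts) . shift(@parts);
    }
    # Anything left over is a final record with no terminator.
    push @records, @parts if @parts && length $parts[0];
    return @records;
}

my @unix = split_records("one\ntwo\n\n\nthree\n");
my @dos  = split_records("one\r\ntwo\r\n\r\n\r\nthree\r\n");

printf "unix: %d records, dos: %d records\n", scalar @unix, scalar @dos;
```

In both cases the blank lines after "two" are swallowed into that record's terminator, which is the sensible behaviour with real data. The trouble only starts when a run of blank lines straddles a block boundary.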

A sample program to play with is the following. Save it and feed it different buffer sizes. The end-of-record expression is greedy and matches 12 characters in this data. Yet for buffer sizes from 1 to 17, it produces the result that Dominus' original description would lead you to expect only twice. With a longer data example it would continue to mess up, and there is no simple way to say that it shouldn't.

```perl
#! /usr/bin/perl -w
use strict;

sub blocks {
  my $fh = shift;
  my $blocksize = shift || 8192;
  sub {
    return unless read $fh, my($block), $blocksize;
    return $block;
  }
}

sub records {
  my $blocks = shift;
  my $terminator = @_ ? shift : quotemeta($/);
  my @records;
  my ($buf, $finished) = ("");
  sub {
    while (@records == 0 && ! $finished) {
      if (defined(my $block = $blocks->())) {
        $buf .= $block;
        my @newrecs = split /($terminator)/, $buf;
        while (@newrecs > 2) {
          push @records, shift(@newrecs).shift(@newrecs);
        }
        $buf = join "", @newrecs;
      }
      else {
        @records = $buf;
        $finished = 1;
      }
    }
    return shift(@records);
  }
}

my $iter_block = blocks(\*DATA, shift || 10);
my $iter_record = records($iter_block, "(?:foo\n)+");
while (my $record = $iter_record->()) {
  print "GOT A RECORD:\n$record\n";
}

__DATA__
hello
foo
foo
foo
world
foo
foo
foo
!
```

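
If you would rather not rerun the program by hand for every buffer size, something like the following sketch sweeps the sizes in one go. It is only an illustration: it reads the same data from an in-memory filehandle instead of DATA, the helper records_for_size is a name I made up, and blocks/records are trimmed copies of the routines above with the splitting logic left alone.

```perl
use strict;
use warnings;

# Same data as the __DATA__ section of the sample program.
my $data = "hello\n" . ("foo\n" x 3) . "world\n" . ("foo\n" x 3) . "!\n";

# Block iterator: hands back the stream $blocksize characters at a time.
sub blocks {
    my ($fh, $blocksize) = @_;
    return sub {
        my $block;
        return unless read $fh, $block, $blocksize;
        return $block;
    };
}

# Record iterator: same split-and-pair logic as the sample program.
sub records {
    my ($blocks, $terminator) = @_;
    my @records;
    my ($buf, $finished) = ("", 0);
    return sub {
        while (@records == 0 && !$finished) {
            if (defined(my $block = $blocks->())) {
                $buf .= $block;
                my @newrecs = split /($terminator)/, $buf;
                while (@newrecs > 2) {
                    push @records, shift(@newrecs) . shift(@newrecs);
                }
                $buf = join "", @newrecs;
            }
            else {
                @records = $buf;
                $finished = 1;
            }
        }
        return shift @records;
    };
}

# Hypothetical helper: collect every record produced for one block size.
sub records_for_size {
    my ($size) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    my $iter = records(blocks($fh, $size), "(?:foo\n)+");
    my @got;
    while (defined(my $record = $iter->())) {
        push @got, $record;
    }
    return @got;
}

# Sizes 1..17 as in the post, plus one size that reads the whole stream.
for my $size (1 .. 17, length $data) {
    my @got = records_for_size($size);
    printf "block size %2d: %d records\n", $size, scalar @got;
}
```

The last line of output is the whole-stream case, which is the one situation where the result is guaranteed to be what you expect. Comparing the other counts against it shows how often the block boundaries change the answer.
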
As for using capturing parens in a terminator pattern, there is no real reason to want to do so, but it is incredibly easy to do by accident when you aren't thinking about it. If you don't know a fair amount about regexes, it could take you a good while to figure out why things look weird and what to do to fix it. (Confession: when writing the above sample code I originally had capturing parens and was somewhat surprised at the results. This is how I became aware of that issue...)
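
For anyone who wants to see the capturing-parens surprise in isolation, here is a small sketch. The helper split_fields and the test string are mine, not from the sample program. It only shows what split returns once the terminator is wrapped in the extra pair of parens that records() adds.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text the way records() does, with the
# terminator wrapped in one extra pair of capturing parens.
sub split_fields {
    my ($terminator, $text) = @_;
    return split /($terminator)/, $text;
}

my $text = "hello\nfoo\nfoo\nworld\n";

# Non-capturing group: fields alternate record, terminator, record, ...
my @plain = split_fields("(?:foo\n)+", $text);

# Capturing group: split also returns what the inner group captured, so
# every terminator contributes two fields instead of one.
my @captured = split_fields("(foo\n)+", $text);

printf "non-capturing: %d fields, capturing: %d fields\n",
    scalar @plain, scalar @captured;
```

Since records() glues fields together two at a time, that one extra field per terminator is enough to throw the pairing out of step.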

Replies are listed 'Best First'.
Re5: Regexes on Streams - Revisited!
by Hofmator (Curate) on Oct 16, 2003 at 14:49 UTC
    Ahh, now I see light (and the problems :). Thanks for your thorough explanation, tilly++ !!

    -- Hofmator
