Matching against a partially known string

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Here I have a RE which I will call $pattern; it begins with a ^. Over there is a string which I will call $fullstring. I would like to know if $pattern matches $fullstring; that is: $fullstring =~ /$pattern/. The challenge is that I only know $buff, a prefix of $fullstring. I want a programmatic way to answer the following question:

Can I be certain that, no matter what I add to $buff, that $pattern will not match?

Asked again, more formally and in the inverse (I don't care what $remainder is, I just want to know if it's possible for $pattern to match with some oracle-given $remainder.):

Does there exist a string $remainder such that ($buff.$remainder) =~ /$pattern/?

Please ponder the above for a minute before I clutter your mind with the practical bits and my particular ideas for a solution.

Okay, now I'll tell a bit more about the practical side. I am breaking a stream coming over a network into tokens, each token matching one of a set of patterns. I want to be able to eliminate patterns that can't match the next token, so that when I have eliminated all the patterns I recognize that there has been a protocol error. Most of the patterns are quite simple, usually fixed-length and fixed-position:

# simple keyword
/^KEYWORD/

# simple keyword with fixed-width integer parameter
/^INTVALUE(\d{4})/

# keyword with integer parameter specifying length of binary blob
/^CHUNKOFGARBAGE(\d{4})/ and then /^.{$1}/
[download]

Ideally, I'd like to get some kind of hook into the RE engine, such that when the engine reaches the end of $buff, rather than backtracking, it should immediately report a (potential) match. Alas, nothing in man pages, my Nutshell books, CPAN, Usenet searches, or Perl Monks has hinted at a way to call my code from within the RE engine (In particular, yes, I've read man overload and "Creating custom RE engines" in man perlre.)

Is there a way to programmatically transform $pattern into another RE that I can apply against $buff and get an answer to my formal question?

One way to transform $pattern would be to replace every node (named $node) of its AST with ( $ | $node ). After such a transformation, the RE engine would claim a match whenever it hit the end of $buff. I can't think of an elegant way to perform such a transformation without making my own RE parser.

Can I do something with $^R, (?{ code }), and/or (??{ code })?

The important thing for me right now is that it be correct. Once I get that, I'll worry about making it work efficiently.

Comment on Matching against a partially known string Select or Download Code

Replies are listed 'Best First'.
Re: Matching against a partially known string (link) by tye (Sage) on Sep 19, 2003 at 18:53 UTC
You might find Applying regexes to streams: Perl enhancement idea informative. - tye	[reply]
Re: Re: Matching against a partially known string (link) by wonkozen (Initiate) on Sep 19, 2003 at 21:30 UTC
Yes, the /z modifier would precisely fit what I'm looking for. Let me clarify my problem by contrasting it with your sample code: The stream should consist exclusively of a sequence of matchable texts. That is, $` should always be ''. (Remember the ^ at the beginning of my $pattern.) I want to know as soon as possible that a pattern won't be able to match the buffer. It's somewhat easier for me because I know that my pattern begins with ^. Dare I open up the perl source code and try to implement this now?	[reply]
Re: Re: Re: Matching against a partially known string (link) by wonkozen (Initiate) on Oct 10, 2003 at 01:22 UTC
I poked around in regexec.c. Whoa, heavy magic! Maybe --just maybe-- if the perl RE engine weren't so optimized, I would have the confidence to change stuff and implement /z. But not on the existing RE engine. I'm now considering using Sandfly's suggested technique with YAPE::Regexp to provide the list of atoms. Thanks to all for the useful comments and suggetions.	[reply]
Re: Matching against a partially known string by sandfly (Beadle) on Sep 19, 2003 at 21:58 UTC
If the patterns are all simple, one approach would be to represent each pattern as a list of atoms, and test whether a partial pattern can consume the buffer: `my @re= qw( [abc] b+ c ); my @tests = qw( a ab ba bb bbbbbcdef bbbbbbg ); TEST: foreach my $buff ( @tests ) { my @tre = @re; my $tre = ""; while (@tre > 1) { $tre .= shift @tre; if ( $buff =~ m/^$tre$/ ) { print "Test string '$buff' matched partial pattern, $tre\n +"; next TEST; } } $tre .= shift @tre; if ( $buff =~ m/^$tre/ ) { print "Test string '$buff' matched whole pattern, $tre\n"; } else { print "Test string '$buff' did not match pattern $tre\n"; } }` [download] This method will fail if you use alternation, or atoms with a minumum length > 1, such as "a{5}".	[reply] [d/l]
Re: Re: Matching against a partially known string by wonkozen (Initiate) on Sep 20, 2003 at 15:51 UTC
Ahh, not a horrible idea for simple patterns. I would need to add a ^ at the beginning and maintain a $ at the end of the RE. For a simple enough pattern I could even generate the partial REs programatically. Alas, it would be tougher to deal with quantifiers, capturing groups, etc, and, alas, I am quite the fan of quantifiers. (It took me a while to recognize what you were suggesting. I didn't notice the ".=" in `$tre .= shift @tre`.)	[reply] [d/l]
Re: Matching against a partially known string by kvale (Monsignor) on Sep 19, 2003 at 18:33 UTC
If you have fixed text that you are matching on, you could use Text::Abbrev to test whether your `$buff` equals one of the possible abbreviations. If you know the fixed length of the buffer, simply either create a fixed length version of your regexp or work out manually, if you can, all the ways of expanding the regexp out to the length of the buffer. For fixed strings, use `substr`. This will be much, much easier, that trying to interface to the perl regexp code. -Mark	[reply] [d/l] [select]
Re: Matching against a partially known string by wonkozen (Initiate) on Sep 19, 2003 at 18:05 UTC
Whoops, this silly new user posted anonymously, even after he created an account just to ask this question. Oh well, maybe he'll learn something else soon.	[reply]
Re: Re: Matching against a partially known string by wonkozen (Initiate) on Sep 19, 2003 at 22:17 UTC
(I am using terms "parser", "tokenizer / lexer", "abstract syntax tree (AST)" in a compiler context.) I was pointed to the YAPE::Regex module "Yet Another Parser/Extractor for Regular Expressions". From the documentation and my experiments in the debugger, I believe it to be only a lexer, not a proper parser. As far as I can tell, it breaks down the text of a RE into strings of characters with particular meanings (tokens), but it doesn't assemble those meanings into a hierarchy (abstract syntax tree). For a quick look at the AST of a RE, you can see the indented text in "Debugging regular expressions" of man perldebguts or `perl -Mre=debug -e '$re = qr/^a(b(cd?)?)?/;'` [download] . To do one of the transformations that many of us have thought up, I should be operating on the AST, not just on the individual tokens. Also, I would have to break each EXACT node into a sequence of single-character EXACT nodes.	[reply] [d/l]


laziness, impatience, and hubris
	PerlMonks