HOP::Lexer not doing what I expected

bart has asked for the wisdom of the Perl Monks concerning the following question:

I'm working my way through the HOP::Lexer tutorial on perl.com, and it just isn't working as I expected. No matter what I do, I can't make it do what I want. Please tell me what I am doing wrong. I'm currently really thinking it's a bug in the module.

Here's a reduced (for brevity of the output), modified (to do more different things) version of the code in the article.

use strict;
my $sql = <<'--SQL--';
select case
  when a=b then 'c'
  else 'd'
  end "tough_one",
  e as "even tougher"
from mytable
--SQL--

use HOP::Lexer 'make_lexer';
my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },  # iterator
    [ WORD      => qr/\w+/i        ],
    [ DQWORD    => qr/"\w+"/       ],
    [ DQUOTED   => qr/"[^"]+"/     ],
    [ QUOTED    => qr/'[^']*'/     ],
    [ COMMA     => qr/,/           ],
    [ SPACE     => qr/\s+/, sub {} ],
);

# parse
my @out;
push @out, $_ while $_ = $lexer->();

# Data::Dump the output (elaborated in order to produce compact results)
use Data::Dumper;
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse = 1;
($\, $,) = ("\n", ",\n");
print map { Dumper $_ } @out;

Here's what it produces:

['WORD','select'],
['WORD','case'],
['WORD','when'],
['WORD','a'],
'=',
['WORD','b'],
['WORD','then'],
'\'',
['WORD','c'],
'\'',
['WORD','else'],
'\'',
['WORD','d'],
'\'',
['WORD','end'],
'"',
['WORD','tough_one'],
'"',
['COMMA',','],
['WORD','e'],
['WORD','as'],
'"',
['WORD','even'],
['WORD','tougher'],
'"',
['WORD','from'],
['WORD','mytable']

and here is what I expected:

['WORD','select'],
['WORD','case'],
['WORD','when'],
['WORD','a'],
'=',
['WORD','b'],
['WORD','then'],
['QUOTED', '\'c\''],
['WORD','else'],
['QUOTED','\'d\''],
['WORD','end'],
['DQWORD','"tough_one"'],
['COMMA',','],
['WORD','e'],
['WORD','as'],
['DQUOTED','"even tougher"'],
['WORD','from'],
['WORD','mytable']

I hope it's obvious why: if you go down the lexer list in order and grab the first regexp that matches on the leftmost item, then DQUOTED should have precedence over '"' (which no rule matches), and DQWORD over DQUOTED, in the result. And it doesn't recognize the single-quoted strings either.

So, why is it not doing what I want?

Update for the people who are too impatient to read the whole thread, I'll now reveal the solution to the mystery (thanks to cmarcelo): HOP::Lexer is not trying to find the leftmost matching token, unlike what the rest of the world tends to do. It tries to match the most important type of token first, and then tries to find the other types of tokens in what remains on its left. It never backtracks. So, you always should put the rules for the tokens you don't ever want to be split up by other rules, first. Put a string matcher before a word matcher.

That's still problematic as a solution for possibly overlapping rules, such as quoted strings and comments.
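The behaviour described in the update can be imitated in a few lines of plain Perl. This is only a sketch of the idea, not HOP::Lexer's actual code: `priority_lex` is a made-up helper, and the third element of a rule is simplified to a true/false "discard" flag instead of the callback that `make_lexer` takes. Each rule, in priority order, carves its matches out of whatever text is still unlexed; nothing is ever re-examined afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not the module's code).
# Rules are applied in priority order; every rule cuts its matches out of the
# chunks of text that no earlier rule has claimed.
sub priority_lex {
    my ($text, @rules) = @_;
    my @items = ($text);                  # plain strings = still unlexed text
    for my $rule (@rules) {
        my ($label, $re, $discard) = @$rule;
        my @next;
        for my $item (@items) {
            if (ref $item) { push @next, $item; next }   # already a token
            my $pos = 0;
            while ($item =~ /$re/g) {
                my ($start, $end) = ($-[0], $+[0]);
                push @next, substr($item, $pos, $start - $pos) if $start > $pos;
                push @next, [ $label, substr($item, $start, $end - $start) ]
                    unless $discard;
                $pos = $end;
            }
            push @next, substr($item, $pos) if $pos < length $item;
        }
        @items = @next;
    }
    return @items;
}

my $input = 'e as "even tougher"';
my @word_first = priority_lex($input,
    [ WORD => qr/\w+/ ], [ DQUOTED => qr/"[^"]+"/ ], [ SPACE => qr/\s+/, 1 ]);
my @quoted_first = priority_lex($input,
    [ DQUOTED => qr/"[^"]+"/ ], [ WORD => qr/\w+/ ], [ SPACE => qr/\s+/, 1 ]);

for my $result (\@word_first, \@quoted_first) {
    print join(' ', map { ref $_ ? "$_->[0]($_->[1])" : $_ } @$result), "\n";
}
# WORD(e) WORD(as) " WORD(even) WORD(tougher) "
# WORD(e) WORD(as) DQUOTED("even tougher")
```

With WORD first, the words inside the quotes are claimed before DQUOTED ever sees them, so the quote characters are left over as unlexed text. With DQUOTED first, the whole quoted string is claimed in one piece.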


Replies are listed 'Best First'.
Re: HOP::Lexer not doing what I expected by cmarcelo (Scribe) on Nov 11, 2006 at 21:51 UTC
Looking at the code of HOP::Lexer, I found some interesting things. First an example (most of the code is the same as in the original post; I only changed the data and the lexer rules):

use strict;
my $sql = <<'--SQL--';
aaaa a baaaab a
--SQL--

use HOP::Lexer 'make_lexer';
my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },  # iterator
    [ A     => qr/a+/i         ],
    [ BAB   => qr/ba+b/i       ],
    [ SPACE => qr/\s+/, sub {} ],
);

This gives us:

['A','aaaa'],
['A','a'],
'b',
['A','aaaa'],
'b',
['A','a']

which seems wrong. But note that rule `A` matches every time rule `BAB` matches (not exactly the same match, but both match something) and, here's the surprise, HOP::Lexer uses `split` instead of matching at the start of the string. This makes sense because you can have garbage or non-matched data at the start of the buffer; e.g. in the original post's example there's `=`, which isn't matched by any rule. Now it's easy to see why the rules work like that, for example with:

[ WORD      => qr/\w+/i        ],
[ DQWORD    => qr/"\w+"/       ],

Since `split` is used and `WORD` has precedence, `"` will always end up as what I called garbage. And that's why giving higher priority to `DQWORD` works (as I replied in the thread): otherwise `WORD` would match the `\w+` inside the double-quoted one. As a rule of thumb: if a rule has another rule inside it, give it higher priority.
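The effect of `split` is easy to see in isolation. The snippet below is a plain-Perl illustration under the assumption, taken from the reply above, that each rule's regex is used as a `split` pattern; it does not call HOP::Lexer at all, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $line = q{end "tough_one",};

# A capturing group makes split return the separators (the matches) as well
# as the text between them. grep drops the empty leading field.
my @pieces = grep { length } split /(\w+)/, $line;
# @pieces is now: 'end', ' "', 'tough_one', '",'

# What is left for lower-priority rules are the non-word pieces, and none of
# them contains a complete "..." any more.
my @leftover    = grep { !/^\w+$/ } @pieces;          # ' "' and '",'
my $dqword_hits = grep { /"\w+"/ } @leftover;         # 0

# Splitting on the DQWORD pattern first keeps the quoted word in one piece.
my @dq_first = grep { length } split /("\w+")/, $line;
# @dq_first is now: 'end ', '"tough_one"', ','

print join('|', @pieces), "\n";      # end| "|tough_one|",
print "DQWORD hits after WORD: $dqword_hits\n";
print join('|', @dq_first), "\n";    # end |"tough_one"|,
```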
Re^2: HOP::Lexer not doing what I expected by bart (Canon) on Nov 11, 2006 at 22:48 UTC
Right, OK, got it. This even seems to work as I want: Read more... (2 kB) As to your rule of thumb, it's not always feasible, especially with possibly overlapping matches. For example, in Perl a string can contain a "#" symbol, and a comment can contain quotes. So, which to match first, the comment or the string?
Re^3: HOP::Lexer not doing what I expected by cmarcelo (Scribe) on Nov 12, 2006 at 00:49 UTC
Indeed, my rule of thumb isn't that good after all :-(. This snippet illustrates what you said about string vs. comment: Read more... (791 Bytes) Here it is ambiguous what to do, and both orders give a bad result. If `STRING` comes first, it finds strings inside comments, and if `COMMENT` comes first, it finds comments inside strings. (Well, there's a workaround similar to what the original article used to deal with parentheses, which involves another parsing phase, but this is a little bit cheating I guess ;-) Read more... (2 kB)
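One way around the string-versus-comment dilemma, outside of HOP::Lexer, is to scan left to right with a single alternation, so that whichever construct starts first wins. The following is a minimal sketch of that different technique, not a HOP::Lexer feature and not the code hidden behind the collapsed snippets above; the sample input and variable names are invented for the example, and an unterminated string simply stops the scan.

```perl
use strict;
use warnings;

my $src = <<'END';
my $x = "not # a comment"; # a "comment" here
END

my $string  = qr/"(?:[^"\\]|\\.)*"/s;   # double-quoted string with escapes
my $comment = qr/#[^\n]*/;              # from '#' to end of line

my @tokens;
# \G anchors each attempt where the previous match ended, so a '#' inside a
# string is consumed as part of the string and is never tried as a comment,
# and a quote inside a comment is consumed as part of the comment.
while ($src =~ /\G(?:($string)|($comment)|([^"#]+))/gc) {
    if    (defined $1) { push @tokens, [ STRING  => $1 ] }
    elsif (defined $2) { push @tokens, [ COMMENT => $2 ] }
    # group 3 is ordinary code between strings and comments; ignored here
}

print "$_->[0]: $_->[1]\n" for @tokens;
# STRING: "not # a comment"
# COMMENT: # a "comment" here
```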
Re: HOP::Lexer not doing what I expected by GrandFather (Saint) on Nov 11, 2006 at 18:47 UTC
Order is important:

------------- 8< ---------
my $lexer = make_lexer(
    sub { shift @sql },  # iterator
    [ DQUOTED   => qr/"[^"]+"/     ],
    [ QUOTED    => qr/'[^']*'/     ],
    [ DQWORD    => qr/"\w+"/       ],
    [ WORD      => qr/\w+/i        ],
    [ COMMA     => qr/,/           ],
    [ SPACE     => qr/\s+/, sub {} ],
);
------------- 8< ---------

Prints:

['WORD','select'],
['WORD','case'],
['WORD','when'],
['WORD','a'],
'=',
['WORD','b'],
['WORD','then'],
['QUOTED','\'c\''],
['WORD','else'],
['QUOTED','\'d\''],
['WORD','end'],
['DQUOTED','"tough_one"'],
['COMMA',','],
['WORD','e'],
['WORD','as'],
['DQUOTED','"even tougher"'],
['WORD','from'],
['WORD','mytable']

DWIM is Perl's answer to Gödel
Re^2: HOP::Lexer not doing what I expected by Corion (Patriarch) on Nov 11, 2006 at 18:50 UTC
But your order is wrong: `['DQUOTED','"tough_one"'],` shouldn't be a `DQUOTED` but a `DQWORD`. Your code won't ever match a `DQWORD`, which I think was the original goal.
Re^3: HOP::Lexer not doing what I expected by bart (Canon) on Nov 11, 2006 at 18:55 UTC
Yes, but at least he got QUOTED to match, which is something I couldn't do. I just can't make sense of the ordering rules.
Re^4: HOP::Lexer not doing what I expected by cmarcelo (Scribe) on Nov 11, 2006 at 19:41 UTC
Re^2: HOP::Lexer not doing what I expected by ikegami (Patriarch) on Nov 11, 2006 at 18:49 UTC
Why? WORD and DQUOTED can't match the same thing. Or is the match unanchored (i.e. preceded by `.*?`)?
Re^3: HOP::Lexer not doing what I expected by GrandFather (Saint) on Nov 11, 2006 at 19:03 UTC
I'm wondering that myself! I installed HOP::Lexer to have a play and learn (I've not used the module before). It seemed that WORD was matching first, and changing the order fixed that as I expected. It was not obvious why it should match first, however. I've now skim-read the documentation (including HOP::Lexer::Article) and still don't understand why WORD was matching! Maybe time to trawl through the code?
Re: HOP::Lexer not doing what I expected by bart (Canon) on Nov 11, 2006 at 21:58 UTC
OK guys, it's looking worse all the time. I found a much simpler example of something that I think is going terribly wrong, and I'd like you to chew it over.

use HOP::Lexer 'string_lexer';
my $text  = 'xselectx';
my $lexer = string_lexer(
    $text,
    [ KEYWORD => qr/select/i ],
    [ WORD    => qr/\w+/     ],
);

(N.B. `string_lexer` is just a routine in the module that wraps the input string in an iterator and then calls `make_lexer`, so we don't have to do it by hand. The code we have to write just becomes a bit simpler.) Tell me that the result it parses into is what you think makes sense. Because it doesn't make any sense to me at all: Read more... (1180 Bytes)
Re^2: HOP::Lexer not doing what I expected by cmarcelo (Scribe) on Nov 11, 2006 at 22:11 UTC
Sorry, but I don't get what the problem is.

[ KEYWORD => qr/select/i ],
[ WORD    => qr/\w+/     ],

What exactly were you expecting as the result of the rules above for the string `xselectx`? Are you expecting it to deal with word boundaries, like matching `KEYWORD` only when it's separated by spaces or something, so that it doesn't match inside `xselectx`? And according to my explanation, this is the right order, since `WORD` matches whatever `KEYWORD` matches, but `KEYWORD` is more specific, so it goes up.
Re^3: HOP::Lexer not doing what I expected by bart (Canon) on Nov 11, 2006 at 22:23 UTC
Word boundaries? Hmm... interesting take. It's not something that's been mentioned in the docs, or in the perl.com article. Where it really does go wrong, in my opinion, is that it doesn't make any attempt to find a leftmost match. That's what all lexers are supposed to do. So while you can rightfully argue that it must find "select" in the string "selectx", it makes no sense to skip the first "x" in "xselectx". No other lexer or parser in the world would do that, not by design.
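For comparison, here is what the leftmost-match behaviour argued for above could look like. It is a sketch of a conventional `\G`-anchored lexer, explicitly not how HOP::Lexer works; `leftmost_lex` is a hypothetical helper, and the rules are assumed never to match the empty string.

```perl
use strict;
use warnings;

# Hypothetical helper: at each position, try the rules in order and take the
# first one that matches right there; never look further to the right.
sub leftmost_lex {
    my ($text, @rules) = @_;
    my @tokens;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $rule (@rules) {
            my ($label, $re) = @$rule;
            if ($text =~ /\G($re)/gc) {        # /c keeps pos() on failure
                push @tokens, [ $label, $1 ];
                next TOKEN;
            }
        }
        # No rule matches here: pass one character through unlexed.
        push @tokens, substr($text, pos($text), 1);
        pos($text) = pos($text) + 1;
    }
    return @tokens;
}

my @rules = ([ KEYWORD => qr/select/i ], [ WORD => qr/\w+/ ]);

my @whole = leftmost_lex('xselectx', @rules);
my @split = leftmost_lex('selectx',  @rules);

print join(' ', map { "$_->[0]($_->[1])" } @whole), "\n";   # WORD(xselectx)
print join(' ', map { "$_->[0]($_->[1])" } @split), "\n";   # KEYWORD(select) WORD(x)
```

In `xselectx` the keyword rule fails at position 0, so the word rule takes the whole string; in `selectx` the keyword rule matches first and the trailing `x` becomes a separate word.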
Re^4: HOP::Lexer not doing what I expected by cmarcelo (Scribe) on Nov 11, 2006 at 22:39 UTC
Re: HOP::Lexer not doing what I expected- OT by Anonymous Monk on Nov 12, 2006 at 08:17 UTC
When will this book be put online for free? Or did the plan for this get canceled?
Re^2: HOP::Lexer not doing what I expected- OT by bart (Canon) on Nov 12, 2006 at 08:29 UTC
No, it didn't get cancelled. You can check the progress status here. You could even volunteer to help. :) Apparently it's a bit slow at the moment: the last status update was a year ago, and the last mailing list message is half a year old.
