PerlMonks  

HOP::Lexer not doing what I expected

by bart (Canon)
on Nov 11, 2006 at 17:49 UTC ( [id://583502]=perlquestion )

bart has asked for the wisdom of the Perl Monks concerning the following question:

I'm working my way through the HOP::Lexer tutorial on perl.com, and it just isn't working as I expected. No matter what I do, I can't make it do what I want. Please tell me what I am doing wrong. I'm currently really thinking it's a bug in the module.

Here's a reduced (for brevity of the output), modified (to do more different things) version of the code in the article.

use strict;

my $sql = <<'--SQL--';
select case when a=b then 'c' else 'd' end "tough_one", e as "even tougher"
from mytable
--SQL--

use HOP::Lexer 'make_lexer';

my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },    # iterator
    [ WORD    => qr/\w+/i     ],
    [ DQWORD  => qr/"\w+"/    ],
    [ DQUOTED => qr/"[^"]+"/  ],
    [ QUOTED  => qr/'[^']*'/  ],
    [ COMMA   => qr/,/        ],
    [ SPACE   => qr/\s+/, sub {} ],
);

# parse
my @out;
push @out, $_ while $_ = $lexer->();

# Data::Dump the output (elaborated in order to produce compact results)
use Data::Dumper;
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse  = 1;
($\, $,) = ("\n", ",\n");
print map { Dumper $_ } @out;

Here's what it produces:
['WORD','select'], ['WORD','case'], ['WORD','when'], ['WORD','a'], '=', ['WORD','b'], ['WORD','then'], '\'', ['WORD','c'], '\'', ['WORD','else'], '\'', ['WORD','d'], '\'', ['WORD','end'], '"', ['WORD','tough_one'], '"', ['COMMA',','], ['WORD','e'], ['WORD','as'], '"', ['WORD','even'], ['WORD','tougher'], '"', ['WORD','from'], ['WORD','mytable']
and here is what I expected:
['WORD','select'], ['WORD','case'], ['WORD','when'], ['WORD','a'], '=', ['WORD','b'], ['WORD','then'], ['QUOTED', '\'c\''], ['WORD','else'], ['QUOTED','\'d\''], ['WORD','end'], ['DQWORD','"tough_one"'], ['COMMA',','], ['WORD','e'], ['WORD','as'], ['DQUOTED','"even tougher"'], ['WORD','from'], ['WORD','mytable']
I hope it's obvious why: if you go down the lexer list in order and grab the first regexp that matches on the leftmost item, then DQUOTED should have precedence over '"' (not found), and DQWORD over DQUOTED, in the result. And it doesn't recognize the single-quoted string either.

So, why is it not doing what I want?

Update: for the people who are too impatient to read the whole thread, I'll now reveal the solution to the mystery (thanks to cmarcelo): HOP::Lexer is not trying to find the leftmost matching token, unlike what the rest of the world tends to do. It tries to match the most important type of token first, and then tries to find the other types of tokens in what remains on its left. It never backtracks. So you should always put the rules for the tokens you never want split up by other rules first: put a string matcher before a word matcher.

That's still problematic as a solution for possibly overlapping rules, such as quoted strings and comments.
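The behaviour described in the update can be reproduced without the module. What follows is only a simplified sketch, not HOP::Lexer's actual code: `layered_lex` and `render` are hypothetical helpers, the rule patterns are assumed to contain no capture groups of their own, and the input is a shortened piece of the SQL above. Each rule, in priority order, uses `split` on whatever text the earlier rules left unclaimed.

```perl
use strict;
use warnings;

# Hypothetical helper, not HOP::Lexer's code: apply the rules in priority
# order, splitting only the text that earlier rules left unclaimed.
# Assumes the rule patterns contain no capture groups of their own.
sub layered_lex {
    my ($text, @rules) = @_;
    my @items = ($text);                 # plain strings are still unlexed
    for my $rule (@rules) {
        my ($label, $re) = @$rule;
        my @next;
        for my $item (@items) {
            if (ref $item) { push @next, $item; next }   # already a token
            my @parts = split /($re)/, $item;
            for my $i (0 .. $#parts) {
                next if $parts[$i] eq '';
                # split with one capture group puts the matches at odd indexes
                push @next, $i % 2 ? [ $label, $parts[$i] ] : $parts[$i];
            }
        }
        @items = @next;
    }
    return @items;
}

# Render tokens compactly, dropping SPACE tokens as the "sub {}" rule would.
sub render {
    return join ' ',
        map  { ref($_) ? "$_->[0]($_->[1])" : $_ }
        grep { !ref($_) || $_->[0] ne 'SPACE' } @_;
}

my $sql     = q{e as "even tougher"};
my $word    = [ WORD    => qr/\w+/     ];
my $dquoted = [ DQUOTED => qr/"[^"]+"/ ];
my $space   = [ SPACE   => qr/\s+/     ];

my $word_first   = render( layered_lex($sql, $word, $dquoted, $space) );
my $string_first = render( layered_lex($sql, $dquoted, $word, $space) );

print "$word_first\n";    # WORD(e) WORD(as) " WORD(even) WORD(tougher) "
print "$string_first\n";  # WORD(e) WORD(as) DQUOTED("even tougher")
```

With WORD first, the quotes are stranded as unmatched single characters, just as in the output of the original post. With DQUOTED first, the quoted string survives as one token.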

Replies are listed 'Best First'.
Re: HOP::Lexer not doing what I expected
by cmarcelo (Scribe) on Nov 11, 2006 at 21:51 UTC

    Looking at the code of HOP::Lexer, I found some interesting things. First, an example (most of the code is the same as in the original post; I only changed the data and the lexer rules):

    use strict;

    my $sql = <<'--SQL--';
    aaaa a baaaab a
    --SQL--

    use HOP::Lexer 'make_lexer';

    my @sql   = $sql;
    my $lexer = make_lexer(
        sub { shift @sql },    # iterator
        [ A     => qr/a+/i   ],
        [ BAB   => qr/ba+b/i ],
        [ SPACE => qr/\s+/, sub {} ],
    );

    This gives us:

    ['A','aaaa'], ['A','a'], 'b', ['A','aaaa'], 'b', ['A','a']
    which seems wrong, but note that rule A matches every time rule BAB matches (not exactly the same text, but both match something) and, here's the surprise, HOP::Lexer uses split instead of matching at the start of the string. This makes sense because you can have garbage or unmatched data at the start of the buffer; e.g. in the original post's example there's an = which isn't matched by any rule.
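To see the effect in isolation, here is core Perl's `split` with a capturing group, applied to the tail of the sample data. This is a sketch of the mechanism only, not a call into HOP::Lexer; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $text  = 'a baaaab a';

# A capturing group makes split return the separators as well, so the
# matches of rule A sit at the odd indexes and the leftovers at the even ones.
my @parts = split /(a+)/i, $text;
print join('|', map { "<$_>" } @parts), "\n";   # <>|<a>|< b>|<aaaa>|<b >|<a>

# Only the leftovers are handed on to lower-priority rules such as BAB.
my @leftover      = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];
my $bab_can_match = grep { /ba+b/i } @leftover;
print "BAB matches in leftovers: $bab_can_match\n";   # 0
```

Once rule A has claimed every run of "a", the pieces left for BAB are only " b" and "b ", so BAB can never fire.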

    Now it's easy to see why the rules work like that, for example with:

    [ WORD => qr/\w+/i ], [ DQWORD => qr/"\w+"/ ],

    So, given that split is used and WORD has precedence, the " will always end up as what I called garbage. That's why giving higher priority to DQWORD works (as I replied in the thread): otherwise WORD would match the \w+ inside the double-quoted one.

    As a rule of thumb: if one rule can match inside another rule's match, give the outer rule higher priority.

      Right, OK, got it. This even seems to work as I want:

      As to your rule of thumb, it's not always feasible, especially with possibly overlapping matches. For example, in Perl a string can contain a "#" symbol, and a comment can contain quotes. So, which to match first, the comment or the string?

        Indeed, my rule of thumb isn't that good after all :-(. This snippet illustrates what you said about string vs. comment:

        Here it is ambiguous what to do, and both orders give a bad result. If STRING comes first, it finds strings inside comments, and if COMMENT comes first, it finds comments inside strings.

        (Well, there's a workaround similar to what the original article used to deal with parentheses, which involves another parsing phase, but this is a little bit cheating I guess ;-)
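For comparison, a scanner that tries every rule at the current position (the leftmost-match strategy the original poster expected) has no such ambiguity. The sketch below is a different technique from HOP::Lexer's, written with plain `\G`-anchored regexes; `scan` and the token labels are hypothetical, and the string rule is deliberately naive (no escaped quotes).

```perl
use strict;
use warnings;

# Hypothetical leftmost-match scanner; this is not HOP::Lexer's API.
# At each position the rules are tried in order, anchored with \G.
sub scan {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if    ($src =~ /\G("[^"]*")/gc)  { push @tokens, [ STRING  => $1 ] }
        elsif ($src =~ /\G(#[^\n]*)/gc)  { push @tokens, [ COMMENT => $1 ] }
        elsif ($src =~ /\G\s+/gc)        { }    # skip whitespace
        elsif ($src =~ /\G([^"#\s]+)/gc) { push @tokens, [ OTHER   => $1 ] }
        else {
            # e.g. an unterminated quote: consume one character and move on
            $src =~ /\G(.)/gcs;
            push @tokens, [ OTHER => $1 ];
        }
    }
    return @tokens;
}

my @tokens = scan(q{print "a # b"; # say "hi"});
print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
# OTHER(print) STRING("a # b") OTHER(;) COMMENT(# say "hi")
```

Whichever of the quote or the "#" comes first in the input decides the token, so the "#" inside the string and the quotes inside the comment are both left alone.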

Re: HOP::Lexer not doing what I expected
by GrandFather (Saint) on Nov 11, 2006 at 18:47 UTC

    Order is important:

    ------------- 8< ---------
    my $lexer = make_lexer(
        sub { shift @sql },    # iterator
        [ DQUOTED => qr/"[^"]+"/  ],
        [ QUOTED  => qr/'[^']*'/  ],
        [ DQWORD  => qr/"\w+"/    ],
        [ WORD    => qr/\w+/i     ],
        [ COMMA   => qr/,/        ],
        [ SPACE   => qr/\s+/, sub {} ],
    );
    ------------- 8< ---------

    Prints:

    ['WORD','select'], ['WORD','case'], ['WORD','when'], ['WORD','a'], '=', ['WORD','b'], ['WORD','then'], ['QUOTED','\'c\''], ['WORD','else'], ['QUOTED','\'d\''], ['WORD','end'], ['DQUOTED','"tough_one"'], ['COMMA',','], ['WORD','e'], ['WORD','as'], ['DQUOTED','"even tougher"'], ['WORD','from'], ['WORD','mytable']

    DWIM is Perl's answer to Gödel

      But your order is wrong:

      ['DQUOTED','"tough_one"'],

      shouldn't be a DQUOTED but a DQWORD - your code won't ever match a DQWORD, which I think was the original goal.

        Yes but at least he got QUOTED to match, which is something I couldn't do.

        I just can't make sense of the ordering rules.

      Why? WORD and DQUOTED can't match the same thing. Or is the match unanchored (i.e. preceded by .*?)?

        I'm wondering that myself! I installed HOP::Lexer to have a play and learn (I've not used the module before). It seemed that WORD was matching first and changing the order fixed that as I expected. It was not obvious why it should match first however.

        I've now skim read the documentation (including HOP::Lexer::Article) and still don't understand why WORD was matching! Maybe time to trawl through the code?


        DWIM is Perl's answer to Gödel
Re: HOP::Lexer not doing what I expected
by bart (Canon) on Nov 11, 2006 at 21:58 UTC
    OK guys, it's getting to look worse all the time. I found a much simpler example of something that I think is going terribly wrong, and I'd like you to chew it over.
    use HOP::Lexer 'string_lexer';

    my $text  = 'xselectx';
    my $lexer = string_lexer(
        $text,
        [ KEYWORD => qr/select/i ],
        [ WORD    => qr/\w+/     ],
    );
    (n.b. string_lexer is just a routine in the module that wraps the input string in an iterator, and then calls make_lexer, so we don't have to do it by hand. The code we have to write just becomes a bit simpler.)

    Tell me that the result it parses into is what you think makes sense. Because it doesn't make any sense to me at all:

      Sorry, but I don't get what's the problem.
      [KEYWORD => qr/select/i], [WORD => qr/\w+/ ],

      What exactly were you expecting as the result of the rules above for the string xselectx? Are you expecting it to deal with word boundaries, like matching KEYWORD only when it's separated by spaces or something, so that it doesn't match inside xselectx?

      And according to my explanation, this is the right order, since WORD matches whatever KEYWORD matches, but KEYWORD is more specific, so goes up.

        Word boundaries? Hmm... interesting take. It's not something that's been mentioned in the docs, or in the perl.com article.

        Where it really does go wrong, in my opinion, is that it doesn't make any attempt to try and find a leftmost match. That's what all lexers are supposed to do. So while you can rightfully argue that it must find "select" in the string "selectx", it makes no sense to skip the first "x" in "xselectx". No other lexer or parser in the world would do that, not by design.
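The three readings discussed in this sub-thread can be compared using core Perl alone. This is a sketch under the assumption, taken from the replies above, that the module carves tokens out with `split`; nothing here calls HOP::Lexer, and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $text = 'xselectx';

# Priority-first, split-style: KEYWORD is carved out of the middle.
my @split_plain = split /(select)/i, $text;
print join('|', @split_plain), "\n";      # x|select|x

# Same rule with word boundaries: nothing matches, the text stays whole
# and would be left for the WORD rule.
my @split_bounded = split /(\bselect\b)/i, $text;
print join('|', @split_bounded), "\n";    # xselectx

# Leftmost-first: try each rule at the start of the text only.
my $leftmost = $text =~ /^select/i ? 'KEYWORD'
             : $text =~ /^\w+/     ? 'WORD'
             :                       'none';
print "$leftmost\n";                      # WORD
```

So adding `\b` to the keyword pattern is one way to get the whole-word result within the priority-first scheme, while a leftmost-match lexer gets there without it.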

Re: HOP::Lexer not doing what I expected- OT
by Anonymous Monk on Nov 12, 2006 at 08:17 UTC
    When will this book be put online for free? Or did the plan for this get canceled?


      No it didn't get cancelled. You can check the progress status here. You could even volunteer to help. :)

      Apparently it's a bit slow at the moment, the last status update was a year ago, the last mailing list message is 1/2 year old.

Node Type: perlquestion [id://583502]
Approved by Corion
Front-paged by Corion