Flex, converting regexes, and other Interesting Stuff.

Petruchio has asked for the wisdom of the Perl Monks concerning the following question:

A few questions for the smart people here. And a story for the rest of us. ;-) Skip to the end for the questions.

A recent dinner with AgentM and bastard left me with a number of things to ponder, and ideas to try out. One thing I decided was to make my home page more configurable, by offering the stylesheet by CGI, and allowing users to mess with the settings, saving the resulting preferences in a cookie.

Rather than write up a form to do this, I wanted to simply provide options for each setting, and have the CGI program automatically generate appropriate form elements. So I considered the format I should use to express possible settings... and I decided that an actual stylesheet would do very well. The "cascading" effect allows you to provide multiple values for each property, for the user agent to regard as being in descending order of importance. All the CGI program would have to do would be to parse a stylesheet, and create a form allowing the user to decide which values would be the first ones.

I reached immediately for CSS::Parser, but it didn't seem to be quite the thing; instead of taking in a stylesheet and handing back a tree, it lets you set up callback functions for specific tokens. Perhaps it would actually work, or it contains code I could use, but I moved on. The direction I took was interesting, if not strictly practical.

I went to the W3C, looked around the CSS2 spec, and came across a Flex specification for a CSS2 parser. I don't like to flatter myself, but I really am darned Lazy. I thought, "why even consider writing my own parser? Even if perl parsers are very easy, here is an Official Specification. All I have to do is find the program which turns Flex specs into Perl, rather than C, code.

I went back to CPAN, and promptly found Parse::Lex and its close bretheren. But on perusing the documentation, I found that it (apparently) didn't deal with Lex/Flex specifications at all, but with Perl stuff... much as Expect.pm doesn't read Expect scripts, but performs a similar function.

All the while, by the way, I had the feeling of someone who was wading into deeper and deeper water. Being a non-CS type, I soon reached the point where each step is a jump to get a gulp of air.

Next stop: the chatterbox. I waited until there was a respectable showing of local sages, asked a few questions, and was much surprised not to receive answers. So now I'm posting... and I have little doubt that the responses will be very, very interesting.

My questions:

Is there a means of converting Flex specifications (sans any embedded C code, naturally) into Perl code? If not, I assume it's quite difficult... why is this?

Is there a means of turning Flex regular expressions (DFA) into Perl regular expressions (Traditional NFA)? If one wished to make a flex2perl program, I would think this would be necessary. There was a very brief conversation here, when the place was much less well-populated, about the possibility of converting NFA expressions to DFA expressions, by the way.

Am I seeing Parse::Lex correctly? Or is it the tool I want after all?

Am I seeing CSS::Parser correctly? At this point, I'm interested in the other stuff for its own sake, but this remains a practical point.

Yes, I'm sure someone is going to reply with a one-line CSS2 parser. :-) That'll be cool, but not as cool as the education I hope to get from this post.

Comment on Flex, converting regexes, and other Interesting Stuff.

Replies are listed 'Best First'.
Re (tilly) 1: Flex, converting regexes, and other Interesting Stuff. by tilly (Archbishop) on Feb 03, 2001 at 21:35 UTC
Here be dragons. :-) The fundamental problem here lies in the differing NFA/DFA semantics. No, there is no easy way to convert one to the other. As Mastering Regular Expressions explains so well, NFAs are a brute-force recursive search. If there is a combinatorial explosion of possible ways to start trying to match pattern to text (think about `/^(\syada\s)+$/` for a second) you get a situation where an NFA will hit exponential failure modes. (No, Perl didn't fail to see that there was no match, you just didn't let it work for enough years.) But since an NFA is a brute force search, just adjusting the search pattern lets you specify which alternatives come first, and whether this wildcard pattern is greedy while that one is not. Plus things like backreferences are easy to implement. By contrast DFAs walk the string once, keeping track of all of the places they might be in the pattern at once. This is guaranteed to finish quickly, but good luck if you want to keep track of things like backreferences. (Well actually there are quadratic algorithms that will capture backreferences, but you cannot use them within the pattern.) By an implementation quirk, a DFA will always find the longest overall match. (They can also be designed the find the shortest, the Unix grep utility does this for speed.) So NFAs and DFAs find different matches by default. NFAs offer more features and finer control. DFAs better guaranteed performance. If you understand all of that then you should be able to follow this explanation of the dangers that lie in wait for someone attempting a native conversion of flex to using Perl's RE engine. Now in addition to the other ideas you have had (most of which will be workable, but will involve some learning) if you don't mind trying something which is possibly insane you could investigate whether Inline makes it feasible to use the Flex specification directly in Perl... UPDATE danger pointed out that I had said that a DFA was a brute force search when I clearly meant that an NFA was. Fixed.	[reply]
(tye)Re: Flex, converting regexes, and other Interesting Stuff. by tye (Sage) on Feb 03, 2001 at 21:15 UTC
Looking around a bit I'd recommend using Parse::RecDescent and taking 10 minutes to run through the regexes to make them Perl-compliant (change {h} to $h or $token{h}, for example). But none of that is based on a thorough understanding of the problem. ): - tye (but my friends call me "Tye")	[reply]
Re: (tye)Re: Flex, converting regexes, and other Interesting Stuff. by Crulx (Monk) on Feb 04, 2001 at 16:24 UTC
Good Idea. It was the first thing I thought of too. Unfortunately, it does not work... The documentation clearly states that Parse::RecDescent does not opperate like yacc. The first successful match terminates the search. Yacc uses the "prefer the longest match" to choose which rule to fire if there are multiple rules that match the incomming string. So the way the rules are set up in the flex code that he got will cause lots of problems. See my post in the main thread as to why. --- Crulx crulx@iaxs.net	[reply]
Re: Flex, converting regexes, and other Interesting Stuff. by Crulx (Monk) on Feb 04, 2001 at 16:15 UTC
To the best of my knowledge... The problem is not just that Perl uses a NFA algorithm for its regular expressions and Flex uses a DFA one. It is that flex maintains considerable special state information and processes all of the rules in parallel. Consider the following flex code.... `%x comment %% int line_num = 1; "/" BEGIN(comment); <comment>[^\n]* /* eat anything that's not a '' / <comment>""+[^/\n]* /* eat up ''s not followed by '/'s / <comment>\n ++line_num; <comment>""+"/" BEGIN(INITIAL);` [download] Now this will eat anything in a comment... while maintaing that line count var.... But implied in that code is A system for maintaining which runstate we are in. That is what remembers that we are in the Comment state. A system for matching the greediest RE in the list. That is what keeps `<comment>[^\n]` [download] from matching the in /. The existance of that "" + "/" rule. Of course nothing is stopping you from writing a program that does all of those things in perl. But you cannot just use the regular expressions cart blanch. There are slight differences in perl's re and flex's use of yacc's re system. Ie `"/" BEGIN(comment);` [download] is perl's `if ($bob =~ /^\/\/) { $comment = 1; }` [download] But again.. you could write something that properly escaped these things. Here is some perl code to do that flex code. #!/usr/bin/perl use strict; my $bob = "/* THis comment is\n\nMy\ngreatest\n Comment** /"; my $comment = 0; my $line_num = 1; while($bob) { my @match_list; print "STARTLOOP matching $bob\n"; if ($bob =~ /^\/\/) { push @match_list,{rest=>$',code=>"\$comment=1",matched=>$&}; print "start comment matched: $&\n"; } if($comment && $bob =~ /^[^\n]/) { push @match_list,{rest=>$',code=>"",matched=>$&}; print "not star matched: $&\n"; } if($comment && $bob =~ /^\[^\n]/) { push @match_list,{rest=>$',code=>"",matched=>$&}; print "star with no slash matched: $&\n"; } if($comment && $bob =~ /^\n/) { push @match_list,{rest=>$',code=>"++\$line_num;",matched=>$&}; print "newline matched: $&\n"; } if($comment && $bob =~ /^\\//) { push @match_list,{rest=>$',code=>"\$comment=0",matched=>$&}; print "end comment matched: $&\n"; } my $max_length = 0; my $match; #should be priority queue.. but I'm too tired to # use that module now... foreach $match (@match_list) { if(length($match->{matched}) > $max_length) { $max_length = length($match->{matched}); } } foreach $match (@match_list) { if(length($match->{matched}) == $max_length) { eval $match->{code}; $bob = $match->{rest}; } } } print "line num is $line_num\n"; [download] You should be able to see how you could make a program that took the flex code and made perl from it. It could just leave stub functions or "" in the code key of that hash. DFA's are just NFA's that just don't allow multiple moves from one state to another via the same character. It is hard to get an NFA in DFA form, but the converse is not true. Hope this makes sense. --- Crulx crulx@iaxs.net	[reply] [d/l] [select]


more useful options
	PerlMonks