Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
In Apocalypse 5, Larry Wall stated that our "regex culture has gone wrong in a variety of ways". When I first read that, I thought: "What on earth is he talking about? Perl regex do exactly what I need; so what's wrong with them?" Sometimes you need to see solution to realize how blatant the problem really is.

In general, readability and modularity is stressed to the extreme, yet, not for regexes. In fact, the complete opposite usually is. Far from clear and organized, our regexes tend to look like strings of crap. As a result, anything more advanced than mid-level-logic becomes seemingly impossible to do. Thats where perl 6 comes to the rescue.

Perl 6 has loads of new syntactical changes that help regex writers to clean up their act. New perl 6 regexes are now readable, modular, and even easy to write.

Advantages of Perl 6 Regex

  • Improved Syntax increases readability and writeability

    Larry has completely refactored and redone the perl 6 regex syntax from scratch. From the way you think about spacing, to the very way you structure your logic; almost everything is new. Check out the summary of changes.

  • Rules help create modular designs

    Rules are defined assertions. Similar to, but more powerful than, the (??{}) assertion from perl5. You can think of a rule as a form of limited subroutine. Arguments are passed in the header, and can then be used within the rule. Instead of containing code, the rule contains a regex. Modifiers can also be used with rules.

  • Grammars allow for abstraction

    As rules are to subroutines, grammars are to classes. Groups of rules and assertions can be grouped together into a regex class called a grammar. Grammars have similar attributes to classes; for instance, they create their own namespace, can use inheritance and polymorphism, and group together their members. Grammars make it easy to define common rules.

Enough of the hoopla; lets see it in action.

The Situation

Say you work at a web design establishment. Your current assignment is to catalog all javascript functions that have been used in many of the various sites the company has done over the years. Each function is to be catagorized as either complex or simple; a function is to be considered complex if it contains some sort of loop. The question is, how do you start?

Developing the Logic

The answer is to start mapping out logic and strategy for a parser. More precisely, a recursive parser. One strategy that we can use is to extract each nested set of information that we need, and then extract the next level from that. For instance, against the raw HTML, extract the <script></script> elements; from that, extract all functions; from that, check for the existance of a loop for each match. Each of these definitions can be broken into separate grammars, and common elements among them can be grouped into a base grammar from which each of the sub grammars can inherit from.

Even with the benefits of OO organization, writing a parser is no easy task. Perl6 provides many useful tools to make the job easier, but that doesn't change the fact logic needs to be mapped out before the regex is written. Or, at least, it should be. Anyways, before the parsing begins lets define our definitions:

(Note: the script definition, as well as a few definition to code explanations, are going to be skipped because their explanation is mundane for the purposes of this article; however, they is still encluded in the final code below)

The Definitions

First, a function:

- A header made up of: - the word function - a function name - an arguments list - A body made up of a block of code

Next, our loops.

A while loop:

- A header of: - the word while - a condition - A body of a block of code

A for loop:

- A header of: - the word for - enclosed in parenthesis: - a declaration and then a semicolon - a condition and then a semi-colon - an increment - A body of a block of code

A do loop:

- A header of the word do - A body of a block of code - A footer of: - the word while - a condition

Finding Common Ground

Now that the parsing is planned, the common elements can be located and modularized, such as:

- a header - a body - a block of code - a condition

Each of these common elements can now be placed into our base class. Although the header and body set for each object isn't quitethe same, a generic set can be defined in the base class, and the sub-grammars can overload if needed.

Defining a Block

Descending one level further, the code block and condition need to be defined. A code block is a set of balanced brackets, of which subsets may be nested. Borrowing from perl5's perlre, we can turn:

$block = qr/ \{ (?: (?> [^{}]* ) | (??{ $block }) )* \}


rule block { \{ [ <-[{}]>*: | <block> ]* \} }

Defining a condition is not much different than defining a code block. Since validating is of no concern, a condition can be defined as a set of balanced parenthesis; or, as:

rule block { \( [ <-[()]>*: | <block> ]* \) }

Since the block of code and condition are similarly defined, we can bind them to a single assertion that takes 2 arguments that define the delimeters.

rule block ($left,$right) { $left [ <-[$left $right]>*: | <block $left $right> ]* $right }

However, we'll also need to account for several other things to completely define our block; for instance, ignoring our delimiters within comments and strings, so that they don't interupt our block finding. Before any further progress can be made, comments and strings will need to be defined; their definitions will also be placed into the base class.

Defining Quotes

Quotes can be defined as:

- start quote - string of (non-quote or backslashed quote) characters - end quote

Which can easily be translated into:

rule quoted_string ($type) { " [ <-["]>+: | [ <after \\ > " ] ]* " }

However, two types of quoting exists in javascript; single and double. Instead of creating a rule for each, the above rule can easily be generalized into:

rule quoted_string ($type) { $type [ <-[$type]>+: | [ <after \\ > $type ] ]* $type }

Single Quotes can now be called as: <quoted_string '>
Dobule Quotes can be called as <quoted_string ">

We can now bind them as:

rule string { <quoted_string "> | <quoted_string '> }

Ignoring Your Comments

Javascript uses C++ style comments: // for single line, and /* */ for multi-line. Therefore, a comment is:

A single-line comment or A multi-line comment

A single-line comment is easy to define: its nothing more than a // and the rest of the line:

rule single_line_comment { // \N*: }

A multi-line comment is a bit more complicated. It can be defined as:

- A leading /* - A string of characters that includes non asterix and astrix not foll +owed by a / - A closing */

Translated to regex:

rule multi_line_comment { /\* [ <- [*] >* : | <before \*/ > | \* ]* \*/ }

Rounding Out the Block

Now that the sub-definitions are completed, the block definition can be finished. Its a simple matter of adding our new pieces to the alternation:

rule block ($left,$right) { $left [ <-[$left$right"'/]>*: | <comment> | <string> | <block $left $right> ]* $right }

Enough is Enough; Time for Action!

Enough with boring definitions; its pretty easy to match up our parts with the definitions we created earlier. Its time to see the completed parser, which is below. However, a few things to notice:

  • Look closer as to how the script elements are extracted. If you follow backwards, you'll notice that the arrayref is refering to an array that is tied to standard input! Since perl will concatanate all files supplied on the command line into stdin, a huge list of files can be placed into the input stream and perl will take care of them for you. Also note how the rules in the script grammar use <cut> instead of :; this is to snip the infinite string that is created by applying a regex directly on the input stream.
  • Instead of perl 5 style extraction of:

    @matches = $var =~ /(match)/g;

    Perl 6 style extraction is used by binding an array to the match itself:

    $var =~ m:e/@matches:=(match)/;

    This has the added advanatage of not "capturing" non captures as undefined values since perl 6 matches are treated as hypothetical variables.

The Parser

#!/usr/bin/perl -w use strict; grammar javascript { rule single_line_comment { // \N* : } rule multi_line_comment { /\* [ <- [*] >* : | <before \*/ > | \* ]* \*/ } rule comment { <single_line_comment> | <multi_line_comment> } rule quoted_string ($type) { $type [ <- [$type] >+ : | [ <after \\ > $type ] ]* $type } rule string { <quoted_string "> | <quoted_string '> } rule block ($left,$right) { $left [ <- [$left$right"'/] >* : | <comment> | <string> | <block $left $right> ]* $right } rule code { <block \{ \}> } rule cond { <block \( \)> } rule header { } rule body :w { <code> } rule match { <header> <body> } } grammar script is javascript { rule header :wi { \< script <- [>] >* <cut> \> [ \< ! -- ]? } rule body :wi { [ <- [<("'/] >* <cut> | <comment> | <quoted_string "> | <quoted_string '> | <cond> | <!before :: [ \</ ] | <null> > ]* } rule footer :wi { \</ script \> } rule match { <header> <body> <footer> } } grammar function is javascript { rule arg_list :w { \( [ \w+ <!before \) :: , | <null> > ]* \) } rule name { \w+ } rule header :wi { function <name> <arg_list> } } grammar while is javascript { rule header :wi { while <cond> } } grammar for is javascript { rule decl { <- [;] >* : } rule cond { <- [;] >* : } rule incr { <- [)] >* : } rule header :wi { for \( <decl> ; <cond> ; <incr> \) } } grammar do is javascript { rule header :wi { do } rule footer :wi { while <cond> } rule match { <header> <body> <footer> } } module main; my (@scripts, @functions, @html); @html := <>; my $html is from(\@html); $html =~ m:e/ @scripts := (<javascript::script::match>) /; @scripts =~ m:e/ @functions := (<javascript::function::match>) /; for @functions -> $function { given $function { when / <javascript::while::match> | <javascript::for::match> | <javascript::do::match> / { catalog_complex ($_) } default { catalog_simple ($_) } } } sub catalog_simple { ... } sub catalog_complex { ... }

A subset as Perl 5

A few of the main items described in terms of perl 5 regex. I tried doing a direct translation of the parser above with perl 5 objects; however, everything got incredibly messy :(

use re 'eval'; sub quoted_string { my $type = quotemeta shift; return qr/ $type (?: [^$type]+ | (?: (?<= \\ ) $type ) )* $type /x; } my $comment = qr~ (?: // (?> [^\n]* ) ) # single line | (?: # multi line /\* (?: (?> [^*]+ ) | (?= \*/ ) | \* )* \*/ ) ~x; sub block { my $left = quotemeta shift; my $right = quotemeta shift; my $block = qr! $left (?: (?> [^$left$right"'/] *) | (??{$comment}) | (??{ quoted_string( qq(") ) }) | (??{ quoted_string( qq(') ) }) | (??{$block}) )* $right !x; return $block; } my $arg_list = qr/ \( (?: [^,)]* (,|) )* \) /x; # ugly! my $block; my $code = qr/ (??{$block = block ( "\{" , "\}" ); "$block"}) /x; my $cond = qr/ (??{$block = block ( "(" , ")" ); "$block"}) /x;

Update: Removed the \Q\E and replaced with quotemeta in the dynamic assertions in the perl5 section. Also, there was a paste error with $block in the perl5 section that would have caused it to break had it been used.

Update: Fixed a mistake in the multi-line comment regex that was pointed out by kelen.

Update: Updated Perl6 code to accurate reflect changes in the language since this article was written (4-27-03).

In reply to Parsing with Perl 6 by jryan

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others romping around the Monastery: (5)
    As of 2020-09-25 00:04 GMT
    Find Nodes?
      Voting Booth?
      If at first I donít succeed, I Ö

      Results (135 votes). Check out past polls.