Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Help on using alternation grouping star versus dot star.

by Anonymous Monk
on Jun 10, 2001 at 06:49 UTC ( [id://87257]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm a little new at regular expression and my use of them is in JavaScript, so please bear with me. There are several regular expressions that can be used to replace all <div class="blockqte">...</div> tag-pairs with <pre>...</pre>. The content of this particular <div> pair contains no other <div>'s but does contain <span> tag pairs. Of course, all <div>'s that don't have the class="blockqte" attribute are left alone.

The three regexp that I've designed to do this in JavaScript are:

1. re = /<div class="blockqte">((?:[^<]|<\/?s)*)<\/div>/g; 2. re = /<div class="blockqte">((?:[^<]*|<\/?s)*)<\/div>/g; 3. re = /<div class="blockqte">((?:.|\n)*?)<\/div>/g;
Any one of these go with the statement:
str = str.replace(re, "<pre>$1</pre>");
My understanding and please correct me if I'm wrong, is that expression number 3 is inefficient since it leaves the grouping with each of its evaluations to check the remainder of the expression (i.e. </div>). However, in 1 and 2 the group is left only with the first failure in matching all of its alternatives. That seems to make 1 and 2 better choices and it's here that I'm not sure of the advantages of one over the other. Can any one help with this?

By the way if there are better ways to do all this, I would much appreciate the advice and any clarifications. Thank you in advance.

Replies are listed 'Best First'.
Re: Help on using alternation grouping star versus dot star.
by ZZamboni (Curate) on Jun 10, 2001 at 08:41 UTC
    Although your application is in JavaScript, the question itself is about regular expressions. Benchmark reports (output edited for briefness, code at the end):
    Benchmark: timing 100000 iterations of 1, 2, 3... 1: 22 wallclock secs @ 4553.73/s 2: 13 wallclock secs @ 7575.76/s 3: 15 wallclock secs @ 6765.90/s
    So it seems that method 2 is the best, but not by much, followed closely by method 3, and method 1 is a distant third. Almost the opposite of what you originally thought!

    I'll let the experts (if they want) explain why this is, in terms of how the regex parser operates, but my guess is that 2 is faster because most characters are not opening angle brackets, so the [^<]* absorbs them.

    I also have the feeling that the comparison may change when confronted with real data, because of the length and contents of whatever is inside the <div> tags.

    Update: and of course, all of this only applies to the Perl implementation of regular expressions, so these results are probably mostly useless, because different JavaScript implementations could do things differently.

    --ZZamboni

    use strict; use Benchmark; my $str=q(this is some text <div class="blockqte">a block <span>quote</span></div> some more text <b>maybe</b> some other tags <div class="blockqte">another block</div> we end here); sub method1 { my $s=shift; $s=~s@<div class="blockqte">((?:[^<]|<\/?s)*)<\/div>@<pre>$1</pre>@g; } sub method2 { my $s=shift; $s=~s@<div class="blockqte">((?:[^<]*|<\/?s)*)<\/div>@<pre>$1</pre>@g +; } sub method3 { my $s=shift; $s=~s@<div class="blockqte">((?:.|\n)*?)<\/div>@<pre>$1</pre>@g; } timethese(100000, {'1' => sub { method1($str) }, '2' => sub { method2($str) }, '3' => sub { method3($str) }});
(tye)Re: Help on using alternation grouping star versus dot star.
by tye (Sage) on Jun 10, 2001 at 11:59 UTC

    (3) will speed up if you replace "(?:.|\n)" with something simpler such as "[\s\S]", or best, add the "s" option to your regex and just use ".".

    (1) is slower than (2) because (1) has to dispatch a lot more regex opcodes. That is, (2) can just hang out in the "[^<]*" opcode while it gobbles quite a few characters while (1) has to leave "[^<]" for each character (leaving the alternation/parens to move to the "*" then come back in through the alternation/parens to get back to the "[^<]").

    You are correct (to my understanding) about the disadvantage of (3). But (3) has an advantage in that it is simpler than (1) and (2).

    According to ZZamboni's benchmarks, (3)'s disadvantage is only slightly greater than its advantage and slight compared to (1)'s disadvantage. But I suspect this is all rather dependant on the input used in the benchmarks. In particular, the length of the text to be matched and the frequency of <span> tag pairs within it will affect the relative performance characteristics.

    As for other alternatives, I see no reason to not simplify things greatly to:

    s#<div class="blockqte">#<pre>#g; s#</div>#</pre>#g;
    and not force the regex engine to match the intervening text at all.

    Finally, many regex libraries are built around deterministic finite state automata, which means that different regexes run at much closer to the same speed (though more complex ones take longer to compile), but memory consumption can grow out of bounds. So Java's regex performance characteristics could be drastically different than Perl's.

            - tye (but my friends call me "Tye")
      The JavaScript RE spec is modelled after Perl and includes full support for backreferences. Therefore they cannot use a DFA style engine.

      tye, well there is a reason not to simplify like you suggested:

      s#<div class="blockqte">#<pre>#g; # this is OK s#</div>#</pre>#g; # this creates the problems

      you replace all </div> with </pre> and if I understood it correctly, not all the <div> tags have the attribute class="blockqte" and should be replaced ... ;-)

      -- Hofmator

Re: Help on using alternation grouping star versus dot star.
by Abigail (Deacon) on Jun 10, 2001 at 17:17 UTC
    In Friedls book Mastering Regular Expressions this technique is called loop unrolling. Your second example is bad because of the nested * - Perl should warn you about them. I would write it as:
    m{<div\s+class\s*=\s*"blockqte"\s*>((?:[^<]+|</?s)*)</div>}

    -- Abigail

      Hi Abigail. Thank you for the advise. Could you please explain to this neophyte why changing the nested * to a nested + makes so much of a difference. Mark.
Re: Help on using alternation grouping star versus dot star.
by andreychek (Parson) on Jun 10, 2001 at 08:01 UTC
    Well, I'll leave the regular expression evaluating to the guru's here.. :-) However, you asked if there are other ways to handle what you are looking to do, so I'll offer two.

    There are two modules at CPAN, HTML::Parser and HTML::TokeParser.

    HTML::Parser is an event driven HTML code parser. It parses through your document, and calls a function of your choice upon seeing any HTML Start Tag. Whenever it hits an HTML End Tag, it also calls a function of your choice. You could use this method to rewrite your HTML code, by having the functions HTML::Parser calls test to see if the tag is named "div". If so, you can change the name. I'm not sure how it handles attribute tags though.

    I personally have had more luck with HTML::TokeParser. It would allow you to write code such as:
    $parser->get_tag("div");
    Which would find the next div start tag. If it returns something, you can first test it's attributes. If the class attribute equals "blackqte", you could have some code to rewrite the entire tag. HTH!
    -Eric

    Update: I may have misunderstood what you are looking for, but if you really are looking for a Javascript solution, and not Perl.. my particular answer won't help you at all. Sorry :-)
Re: Help on using alternation grouping star versus dot star.
by TStanley (Canon) on Jun 10, 2001 at 17:17 UTC
    You might also want to look at this as well.

    TStanley
    --------
    There's an infinite number of monkeys outside who want to talk to us
    about this script for Hamlet they've worked out
    -- Douglas Adams/Hitchhiker's Guide to the Galaxy
Re: Help on using alternation grouping star versus dot star.
by Anonymous Monk on Jun 14, 2001 at 09:20 UTC
    I probably missed the articles since what I've found really doesn't address the issue. Can you be more specific?
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://87257]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2024-04-25 23:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found