PerlMonks  

Re: Re: A few random questions from Learning Perl 3

by gjb (Vicar)
on Jan 06, 2003 at 06:09 UTC


in reply to Re: A few random questions from Learning Perl 3
in thread A few random questions from Learning Perl 3

It might be useful to read up a bit on the theory of formal languages. You'll see that there's a whole family of languages, each described by a certain mathematical formalism. Regular languages are one example, and as you can guess they're described by regular expressions. Unfortunately, HTML is not a regular language and hence cannot be described by regular expressions, since they're just not powerful enough.

By way of example, consider <em>hello beautiful HTML <em>world</em></em>: it's easy to write a regular expression to get the inner "world", isn't it? Now consider <em>hello <em>beautiful<em>HTML world</em></em></em>: if you want to match something, again you can write a regular expression... as long as you know the maximum number of times the <em>...</em> tags will be embedded.

HTML allows unbounded nesting of tags, so you can't write a general regular expression that describes every possible nesting situation. Regular expressions are simply not powerful enough; you'll need at least a context-free grammar, hence a tool such as HTML::Parser or, for general cases, something like Parse::RecDescent.
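To make the nesting argument concrete, here is a minimal sketch in which the regex only finds the tags and an ordinary counter does the remembering. The sub name max_em_depth is made up for this illustration, and it handles nothing but bare <em> and </em> tags, so it is no substitute for HTML::Parser on real HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the deepest <em> nesting level,
# or undef if the tags are not balanced.
sub max_em_depth {
    my ($html) = @_;
    my ($depth, $max) = (0, 0);
    # The regex only tokenizes; the counter tracks the nesting depth,
    # which is the part a true regular expression cannot do.
    while ($html =~ m{<(/?)em>}g) {
        if ($1) {                          # closing tag
            return undef if --$depth < 0;  # closed more than we opened
        }
        else {                             # opening tag
            $depth++;
            $max = $depth if $depth > $max;
        }
    }
    return $depth == 0 ? $max : undef;
}

# prints 3
print max_em_depth('<em>hello <em>beautiful<em>HTML world</em></em></em>'), "\n";
```

The counter works for any depth without changing the pattern, which is exactly what a fixed regular expression can't offer.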

Now you can argue:

  1. yeah right, but real world HTML is not that complicated, or
  2. you can fiddle with embedded code and cuts in regular expressions.
As to the first argument: you don't always know this in advance if you don't control the HTML generation yourself, people are bound to do weird things, mostly not even on purpose.
As to the second argument: true, but these are still experimental features (as the docs specify for 5.6.1) and they're not at all obvious to use, even up to the point that it is easier to use a more powerful tool than get the particular regular expression right. (Note from a formal language theory point of view: embedded code, cuts and the like increase Perl "regular expressions" beyond regular languages.)

Given this story, your claim that one can deal with all HTML problems by using regular expressions shows some unwarranted optimism on your part. Obviously there's no reason to believe me, so I'll suggest a number of references on the subject:

And who knows, maybe our own mstone will write a MOPT on the subject one of these days? (Hint, hint ;-)

Just my 2 cents, -gjb-

Update: Thanks TheHobbit for reiterating the points I actually mention in my text if you bother to read it carefully. (?{...}) and /e are called code embedding.

Re: Re: Re: A few random questions from Learning Perl 3
by TheHobbit (Pilgrim) on Jan 06, 2003 at 13:23 UTC

    Hi,
    I'll add some considerations which look needed. This will also be an answer to the 'Anonymous' below, who thinks he or she can hide and insult people without even troubling him or herself to register into the community...

    Strictly speaking, Perl regexes are really much more powerful than those described in the wonderful books you refer to. To understand regexes as they are used in Perl (but also in other languages & tools) I'd rather refer to

    A basic thing that one always sees written about regexes is that they cannot count, meaning that you must know the maximum number of times the <em>...</em> will be embedded.

    While this is true of 'standard' regexes, it is not true of Perl regexes. By using a careful combination of the /e modifier and the (?{}) programmatic pattern, you can do with a regex everything a parser will do.

    IMHO, using a regex or another approach is a matter of taste, and a carefully crafted and optimized regex will be more efficient than a sloppily written recursive descent parser.

    Just my 5 (euro) cents.

    Cheers


    Leo TheHobbit
      By using a careful combination of the /e modifier and the (?{}) programmatic pattern, you can do with a regex everything a parser will do.

      My guess is that you probably mean the (??{...}) assertion.

      (?{...}) merely executes, whereas
      (??{...}) executes and interpolates.

      (A possibly confusing mnemonic would be that one ? is like one q, which doesn't interpolate. Double ? is like double q, which interpolates. These are different types of interpolation (one interpolates into the construction, one interpolates its result), so ignore this if it doesn't make sense to you.)

      Generalizing a bit, you may say that:
      (?{...}) is used for debugging and/or setting state.
      (??{...}) is used for generating patterns at "match-time".

      Beware of using =~ inside either of these assertions though. The engine is known to often blow up on that.
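A small sketch of the difference, for what it's worth. The variable $count and the sub length_prefixed are names invented for this example; the counter is exact here only because this particular match never backtracks.

```perl
use strict;
use warnings;

our $count = 0;   # package variable shared with the regex code block

# (?{...}) only executes code: bump a counter each time an "a" is matched.
if ('aaa' =~ /^(?:a(?{ $count++ }))*$/) {
    print "matched, code block ran $count times\n";
}

# (??{...}) executes code and uses the returned string as a sub-pattern:
# the leading number says how many characters must follow the colon.
sub length_prefixed {
    my ($s) = @_;
    return $s =~ /^(\d+):(??{ sprintf('.{%d}', $1) })$/ ? 1 : 0;
}

for my $s ('3:abc', '3:ab') {
    print "$s ", (length_prefixed($s) ? "matches" : "does not match"), "\n";
}
```

The first pattern is the same with or without the code block; the second pattern is effectively rewritten at match time, once the digits are known.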

      Update:
      A good example that uses both of these assertions is to be found at Re: Capturing brackets within a repeat group [plus dynamic backreferences].

      Hope I've helped,
      ihb
      I am the previous anonymous monk.

      Perl 6's regex syntax will make using it for parsing reasonable. But while you theoretically can do that with a lot of care and using constructs that very few people know about, the odds are strong that the average programmer who thinks that they can do it simply misunderstands and underestimates the difficulties.

      Therefore on the odds I stand by my previous comments.

Re: Re: Re: A few random questions from Learning Perl 3
by theorbtwo (Prior) on Jan 07, 2003 at 04:37 UTC

    You're right, and you're wrong... I'm fairly certain that while ordinary regular expressions aren't up to parsing HTML, even on a theoretical basis, Perl regular expressions are a whole 'nother breed. Regular expressions with backreferences are NP-complete; it's been proven at least twice. (Well, three times, but one of them is buggy.) I suspect I'm missing something here... if anybody knows what (other than my mind), I'd love to hear it.


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      NP-completeness is a property of an algorithm. It implies that no algorithm is known to solve the problem in polynomial time.
      This means that if you increase the length of the input for the problem, the execution time will increase exponentially. (Of course there are input cases which are polynomial, but many of interest are not). Essentially, it means that brute force is the only known method to tackle the problem exactly.

      The question is about the relation between the behavior of an algorithm that decides a language and the class to which this language belongs. For regular languages and context free languages polynomial time algorithms are known, but does the proof that regular expressions with backreferences are NP-complete necessarily mean that the languages they describe are a superset of the regular and context free languages?

      It certainly means it is hard to decide whether or not a certain string is an element of the language described by a regular expression with backreferences. But what does it tell us about the expressive power?

      The expression /^(.*)\1$/ defines the language {ww | w in sigma*}, known to be neither regular nor context free. On the other hand, regular expressions with backreferences can't describe {a^n b^n | n >= 0}, which is definitely context free.

      So on the one hand, regular expressions with backreferences describe languages that are not context free, but they can't describe all context free languages either! This example illustrates that one has to be very careful when judging expressive power from algorithmic complexity. A high complexity is a sign that the expressive power must be high in some cases, but doesn't guarantee that everything can be done.
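For anyone who wants to poke at the {ww} case, a quick sketch using the backreference pattern quoted above. The sub name is_doubled is made up for the example.

```perl
use strict;
use warnings;

# True if $s is some string w written twice in a row (the language {ww}).
sub is_doubled {
    my ($s) = @_;
    return $s =~ /^(.*)\1$/ ? 1 : 0;
}

for my $s ('abab', 'aa', 'aba', '') {
    printf "%-6s %s\n", "'$s'", is_doubled($s) ? 'is ww' : 'is not ww';
}
```

The engine has to try each possible split point for w by backtracking, which hints at why matching gets expensive once backreferences are involved.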

      Incidentally, the code below shows two Perl regular expressions that describe non-regular languages:

      { a^n b^n | n >= 0} /^ (a*) (??{sprintf("b{%d}", (length($1)))}) $/x
      which is context free as mentioned above and
      { a^n b^n c^n | n >= 0 } /^ (a*) (??{sprintf("b{%d}", (length($1)))}) (??{sprintf("c{%d}", (length($1)))}) $/x
      which is context sensitive.
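The two patterns above can be wrapped up and tried out roughly like this. It is only a sketch; the sub names is_anbn and is_anbncn are invented here, and the patterns are the ones given above with whitespace added.

```perl
use strict;
use warnings;

# a^n b^n: capture the a's, then demand exactly that many b's.
sub is_anbn {
    my ($s) = @_;
    return $s =~ /^ (a*) (??{ sprintf("b{%d}", length($1)) }) $/x ? 1 : 0;
}

# a^n b^n c^n: the same trick applied twice, reusing the captured a's.
sub is_anbncn {
    my ($s) = @_;
    return $s =~ /^ (a*)
                    (??{ sprintf("b{%d}", length($1)) })
                    (??{ sprintf("c{%d}", length($1)) }) $/x ? 1 : 0;
}

for my $s ('aabb', 'aab', 'aabbcc', 'aabbc') {
    printf "%-7s anbn=%d anbncn=%d\n", $s, is_anbn($s), is_anbncn($s);
}
```

In both cases the counting is done by ordinary Perl code (length and sprintf) at match time, not by the regular expression formalism itself.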

      Just my 2 cents, -gjb-

        NP-completeness is a property of an algorithm. It implies that no algorithm is known to solve the problem in polynomial time.
        This means that if you increase the length of the input for the problem, the execution time will increase exponentially. (Of course there are input cases which are polynomial, but many of interest are not). Essentially, it means that brute force is the only known method to tackle the problem exactly.

        I think that this may be a little misleading. Right now (as 6 years ago), NP-completeness of a problem means that no polynomial-time algorithm is known, but that statement may eventually become false *. Maybe it's better to say “Computer scientists believe that, if a problem is NP-complete, then there is no polynomial-time algorithm to solve it”?

        Also, I'm not sure that it's fair to say that NP-completeness of a problem means that the time-complexity of the problem grows exponentially in the input. Again, we think that NP-completeness correlates with exponential time-complexity, but that could change *. For that matter, can't NP-complete problems have super-exponential complexity (like 2^(n^2))—or are you using ‘exponential’ in the generic sense of ‘faster-growing than polynomial’?

        * Although we all know that it won't really. :-)
