PerlMonks  

RegEx: Why is [.] not a valid character class?

by hoppfrosch (Scribe)
on Nov 17, 2004 at 14:41 UTC ( [id://408422]=perlquestion )

hoppfrosch has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

my problem:

I've got a single string containing the contents of a complete HTML file (including carriage returns). This string contains several <table>...</table> parts (with a lot of line breaks in the ... part).

What I want to do is to get the first table from the string.

What I tried:
    my $html = "<table>\nt1\n<\/table>\n<table>\nt2\n<\/table>\n";

    $html =~ m/<table>(.*?)<\/table>/;
    # Failed, since "the period '.' matches any character but \n"

    $html =~ m/<table>(.*?)<\/table>/s;
    # SUCCESS, since "s modifier (//s): Treat string as a single long
    # line. '.' matches any character, even \n"

    $html =~ m/<table>[a-zA-Z0-9 \r\n<>"!-=]*?<\/table>/;
    # SUCCESS - emulating '.' by defining my own character class,
    # containing 'any' characters (or a subset in this example) including
    # '\n'
    # NO s modifier (//s) needed

    $html =~ m/<table>[.\n]*?<\/table>/;
    # Same as above, but using '.' instead of explicitly listing all
    # characters ...
    # Failed, WHY? Why is '.' not allowed within a character class?

Further investigation shows that [.] is not a valid character class ...

My questions are:

Why is '.' not allowed within a character class?

It's clear to me now that my desired character class [.\n] can be achieved with the s modifier - but why is there such an "inconsistent way" using a modifier to emulate a character class?

Why is there no "super" character class - matching ALL characters including '\n'?

What's the reason excluding '\n' from '.'? (Why is '\n' handled in a special way?)

Hoppfrosch

Edit by castaway - use html entities instead of angled brackets

Replies are listed 'Best First'.
Re: RegEx: Why is [.] not a valid character class?
by bart (Canon) on Nov 17, 2004 at 14:57 UTC
    Why is '.' not allowed within a character class?

    It is a valid character class. It matches only a dot.

    You seem to be asking "Why is . not a metacharacter in a character class?" Just because. Meta stuff in character classes seems to be limited to backslash+letter.
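To make this concrete, here is a small sketch (the test strings are made up for illustration) showing that a dot inside brackets is literal, while backslash escapes such as \n and \d keep their meaning there. The last case is the pattern from the question: it fails because "t" and "1" are neither a dot nor a newline.

```perl
use strict;
use warnings;

# Inside a character class, '.' is an ordinary character.
my @results = (
    'a.b'  =~ /a[.]b/    ? 1 : 0,   # literal dot is present
    'axb'  =~ /a[.]b/    ? 1 : 0,   # 'x' is not a dot
    "a\nb" =~ /a[.\n]b/  ? 1 : 0,   # "\n" is listed explicitly in the class
    'a7b'  =~ /a[\d.]b/  ? 1 : 0,   # backslash+letter escapes still work
    # The class from the question: only dots and newlines are allowed,
    # so the letters in "t1" stop the match.
    "<table>\nt1\n</table>" =~ m{<table>[.\n]*?</table>} ? 1 : 0,
);

print join(',', @results), "\n";   # prints 1,0,1,1,0
```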

    It's clear to me now that my desired character class [.\n] can be achieved with the s modifier - but why is there such an "inconsistent way" using a modifier to emulate a character class?

    Why is there no "super" character class - matching ALL characters including '\n'?

    What's the reason excluding '\n' from '.'? (Why is '\n' handled in a special way?)

    All good questions. I have no answer to them... except that, from what I hear, Larry would change a few things in perl6, so you're not alone in your gripe. One idea was letting . match anything (including newline), and \N match anything but newline (the current .). I don't know what the current projections for perl6 are, as I don't actively follow its evolution.

    One excuse for things being the way they are is that Larry thought no normal string contains newlines: perl was originally intended mainly for line-by-line processing of files, so the only place you could have a newline was at the end of the string. That's why, for example, $ is allowed to match just in front of a newline at the end of a string, which is another example of how a newline is treated differently.

    BTW you can localize the effect of /s by using the (?s:PATTERN) syntax. That is: add options between the question mark and the colon, in the syntax for non-capturing parens. Put "-" in front of options you want disabled. You can experiment using

    print qr/./s;
    whenever you forget the syntax again: qr inserts those options into the pattern it prints.
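A minimal sketch of the localized form, reusing the sample string from the question (the variable names are illustrative only). It applies (?s:...) to just the part of the pattern that has to cross line breaks, and stringifies qr/./s to show the flags; the exact notation of that stringified form depends on the Perl version, so no output is claimed for it.

```perl
use strict;
use warnings;

my $html = "<table>\nt1\n</table>\n<table>\nt2\n</table>\n";

# (?s:...) turns on "dot matches newline" for the enclosed part only.
my ($first) = $html =~ m{<table>((?s:.*?))</table>};

# Compare a dot inside the group with a plain dot outside it.
my $grouped_dot_matches = "a\nb" =~ /a(?s:.)b/ ? 1 : 0;
my $plain_dot_matches   = "a\nb" =~ /a.b/      ? 1 : 0;

# Stringifying a compiled pattern shows the flags it carries.
my $shown = "" . qr/./s;

print "first table body: [$first]\n";
print "grouped dot: $grouped_dot_matches, plain dot: $plain_dot_matches\n";
print "qr/./s stringifies as: $shown\n";
```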
Re: RegEx: Why is [.] not a valid character class?
by larryp (Deacon) on Nov 17, 2004 at 17:33 UTC

    Quoting from Jeffrey Friedl's excellent book Mastering Regular Expressions, 2nd Ed.:

    Usually, dot does not match a newline. The original Unix regex tools worked on a line-by-line basis, so the thought of matching a newline wasn't even an issue until the advent of sed and lex. By that time, '.*' had become a common idiom to match "the rest of the line," so the new languages disallowed it from crossing line boundaries in order to keep it familiar.[1] Thus, tools that could work with multiple lines (such as a text editor) generally disallow dot from matching a newline. (Mastering Regular Expressions, Second Edition, p. 110.)
    [1] As Ken Thompson (ed's author) explained it to me, it kept '.*' from becoming "too unwieldy." (Mastering Regular Expressions, Second Edition, p. 110.)

    I strongly suggest this book for those fighting with regular expressions. It's a complete, well-written reference to the topic and it gives excellent examples. Furthermore, it addresses regular expressions as they relate to several languages including Perl, PHP, JavaScript, Java, and .NET among others.

    HTH,
    /Larry

Re: RegEx: Why is [.] not a valid character class?
by marcelo.magallon (Acolyte) on Nov 17, 2004 at 21:31 UTC

    A couple of fellow monks have answered your question...

    ... I just can't resist the urge to give you an orthogonal answer and point you towards HTML::TreeBuilder. Originally by Gisle Aas and then further improved by Sean Burke, HTML::TreeBuilder lets you answer queries like "give me the first table in the document" or "I want the second paragraph within the second table" with awesome ease.

    If I understand your underlying problem right, HTML::TreeBuilder might be just what you are looking for.
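As a rough sketch of that suggestion, assuming HTML::TreeBuilder is installed from CPAN (it is not part of core Perl), fetching the first table might look like the following. The sample document and variable names are made up for illustration; the table cells are spelled out with tr/td so the parser does not have to guess at missing tags.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, assumed to be installed

# Hypothetical sample document with two tables.
my $html = <<'HTML';
<html><body>
<table><tr><td>t1</td></tr></table>
<table><tr><td>t2</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context, look_down returns the first matching element.
my $first_table = $tree->look_down(_tag => 'table');

# as_text gives the text content; as_HTML would give the markup instead.
my $text = $first_table ? $first_table->as_text : '';
print "first table text: $text\n";

$tree->delete;   # release the parse tree explicitly
```

The parser copes with line breaks and nested markup on its own, so no regex modifier is involved at all.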

Re: RegEx: Why is [.] not a valid character class?
by ikegami (Patriarch) on Nov 17, 2004 at 16:17 UTC
    "Why is '\n' handled in a special way?" Because \n IS special, especially in the past when text was often line-oriented. Simply add /s (it has no other side effects) or use (?:.|\n). In Perl6, /s is on by default.

      (?:.|\n) is unnecessarily slow due to the alternation. (?s:.) doesn't suffer from that.

      ihb

      See perltoc if you don't know which perldoc to read!
      Read argumentation in its context!
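For comparison, the three spellings mentioned in this subthread can be checked against the sample string from the question. This is only a sketch showing that they capture the same text; it makes no timing measurement, and the hash keys are just labels. The core Benchmark module could be used to compare their speed on real input.

```perl
use strict;
use warnings;

my $html = "<table>\nt1\n</table>\n<table>\nt2\n</table>\n";

# Three ways to let the "any character" part cross line breaks.
my %pattern = (
    'alternation (?:.|\n)' => qr{<table>((?:.|\n)*?)</table>},
    'inline flag (?s:.)'   => qr{<table>((?s:.)*?)</table>},
    'trailing /s'          => qr{<table>(.*?)</table>}s,
);

my %captured;
for my $name (sort keys %pattern) {
    # List assignment stores the first capture group (the first table body).
    ($captured{$name}) = $html =~ $pattern{$name};
    my $body = defined $captured{$name} ? $captured{$name} : '(no match)';
    $body =~ s/\n/\\n/g;   # make newlines visible in the output
    print "$name => $body\n";
}
```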

Node Type: perlquestion [id://408422]
Approved by Corion
Front-paged by Old_Gray_Bear