RegEx: Why is [.] not a valid character class?

hoppfrosch has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

my problem:

I've got a single string containing the contents of a complete HTML file (including carriage returns). This string contains several <table>...</table> parts (with a lot of line breaks in the ... part).

What I want to do is to get the first table from the string.

What I tried:

my $html = "<table>\nt1\n<\/table>\n<table>\nt2\n<\/table>\n";

$html =~ m/<table>(.*?)<\/table>/;
# Failed, since 'The period '.' matches any character but ``\n''

$html =~ m/<table>(.*?)<\/table>/s;
# SUCCESS, since 's modifier (//s): Treat string as a single long 
# line. '.' matches any character, even "\n"'

$html =~ m/<table>[a-zA-Z0-9 \r\n<>"!-=]*?<\/table>/;
# SUCCESS - Emulating '.' with defining a own character class, 
# containing 'any' characters (or a subset in this example) including 
# '\n'
# NO s modifier (//s) needed

$html =~ m/<table>[.\n]*?<\/table>/;
# Same as above, but using '.' instead of explicitly listing all 
# characters ...
# Failed, WHY? Why is '.' not allowed within a character class?

Further investigation shows that [.] is not a valid character class ...

My questions are:

Why is '.' not allowed within a character class?

It's clear to me now that my desired character class [.\n] can be achieved with the s modifier - but why is there such an "inconsistent way" using a modifier to emulate a character class?

Why is there no "super" character class - matching ALL characters including '\n'?

What's the reason excluding '\n' from '.'? (Why is '\n' handled in a special way?)

Hoppfrosch

Edit by castaway - use html entities instead of angled brackets


Replies are listed 'Best First'.
Re: RegEx: Why is [.] not a valid character class? by bart (Canon) on Nov 17, 2004 at 14:57 UTC
> Why is '.' not allowed within a character class?

It is allowed, and `[.]` is a valid character class. It matches only a literal dot. You seem to be asking "Why is `.` not a metacharacter in a character class?" Just because. Apart from `^`, `-`, `]` and POSIX classes such as `[:alpha:]`, meta stuff in character classes seems to be limited to backslash+letter.

> It's clear to me now that my desired character class `[.\n]` can be achieved with the s modifier - but why is there such an "inconsistent way" using a modifier to emulate a character class?
> Why is there no "super" character class - matching ALL characters including '\n'?
> What's the reason excluding '\n' from '.'? (Why is '\n' handled in a special way?)

All good questions. I have no answer to them... except that, from what I hear, Larry would change a few things in perl6, so you're not alone in your gripe. One idea was letting `.` match anything (including newline), and `\N` match anything but newline (the current `.`). I don't know what the current projections for perl6 are, as I don't actively follow its evolution.

An excuse for things being the way they are is that Larry thought no normal string contains newlines, as perl was originally mainly intended to do line-by-line processing of files, and the only place you could have a newline is at the end of the string. That's why, for example, `$` is allowed to match just in front of a newline at the end of a string. That is another example of how a newline is treated differently.

BTW you can localize the effect of /s by using the `(?s:PATTERN)` syntax. That is: add options between the question mark and the colon, in the syntax for non-capturing parens. Put "-" in front of options you want disabled. Whenever you forget the syntax again, you can experiment using `print qr/./s;`, which prints the pattern with those options inserted.
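The points in this reply can be checked with a short script. This is only a minimal sketch: it reuses the sample string from the question, and the variable names (`$class_matches`, `$local_s` and so on) are made up for illustration.

```perl
use strict;
use warnings;

# Sample string from the question: two tables with newlines inside.
my $html = "<table>\nt1\n</table>\n<table>\nt2\n</table>\n";

# Inside a character class '.' is a literal dot, so [.\n] means
# "a dot or a newline" and cannot get past the letters in "t1".
my $class_matches = $html =~ m{<table>[.\n]*?</table>} ? 1 : 0;

# (?s:...) switches /s on only for the group it wraps.
my ($first) = $html =~ m{<table>((?s:.*?))</table>};

# Outside such a group '.' still refuses to match a newline.
my $plain_dot = "a\nb" =~ m/a.b/      ? 1 : 0;
my $local_s   = "a\nb" =~ m/a(?s:.)b/ ? 1 : 0;

# '$' may match just before a newline that ends the string.
my $dollar = "line\n" =~ m/line$/ ? 1 : 0;

print "class=$class_matches plain=$plain_dot local=$local_s dollar=$dollar\n";
print "captured the first table body\n" if defined $first;
```

The `(?s:...)` form is handy when only one part of a longer pattern should cross line boundaries and the rest should keep the default behaviour of `.`.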
Re: RegEx: Why is [.] not a valid character class? by larryp (Deacon) on Nov 17, 2004 at 17:33 UTC
Quoting from Jeffrey Friedl's excellent book Mastering Regular Expressions, 2nd Ed.:

> Usually, dot does not match a newline. The original Unix regex tools worked on a line-by-line basis, so the thought of matching a newline wasn't even an issue until the advent of sed and lex. By that time, '.' had become a common idiom to match "the rest of the line," so the new languages disallowed it from crossing line boundaries in order to keep it familiar.¹ Thus, tools that could work with multiple lines (such as a text editor) generally disallow dot from matching a newline. (Mastering Regular Expressions, Second Edition, p. 110.)
>
> ¹As Ken Thompson (ed's author) explained it to me, it kept '.' from becoming "too unwieldy." (Mastering Regular Expressions, Second Edition, p. 110.)

I strongly suggest this book for those fighting with regular expressions. It's a complete, well-written reference to the topic and it gives excellent examples. Furthermore, it addresses regular expressions as they relate to several languages including Perl, PHP, JavaScript, Java, and .NET among others.

HTH,
/Larry
Re: RegEx: Why is [.] not a valid character class? by marcelo.magallon (Acolyte) on Nov 17, 2004 at 21:31 UTC
A couple of fellow monks have answered your question... I just can't resist the urge to give you an orthogonal answer and point you towards HTML::TreeBuilder. Originally by Gisle Aas and then further improved by Sean Burke, HTML::TreeBuilder lets you answer queries like "give me the first table in the document" or "I want the second paragraph within the second table" with awesome ease. If I understand your underlying problem right, HTML::TreeBuilder might be just what you are looking for.
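As a rough sketch of what "give me the first table in the document" could look like with HTML::TreeBuilder (a CPAN module, so it has to be installed first): the sample document and the variable names below are made up for illustration, and the tables are written as well-formed rows and cells rather than the bare text used in the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample document with two tables.
my $html = '<html><body>'
         . '<table><tr><td>t1</td></tr></table>'
         . '<table><tr><td>t2</td></tr></table>'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element.
my $first_table = $tree->look_down(_tag => 'table');
my $first_text  = $first_table ? $first_table->as_text : undef;

print "first table text: $first_text\n" if defined $first_text;

# Release the tree once we are done with it.
$tree->delete;
```

Calling `$first_table->as_HTML` instead of `as_text` would return the table's markup, which is closer to what the regex in the question was trying to capture.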
Re: RegEx: Why is [.] not a valid character class? by ikegami (Patriarch) on Nov 17, 2004 at 16:17 UTC
"Why is '\n' handled in a special way?" Because \n IS special, especially in the past when text was often line-oriented. Simply add /s (it has no other side effects) or use `(?:.|\n)`. In Perl6, /s is on by default.
Re^2: RegEx: Why is [.] not a valid character class? by ihb (Deacon) on Nov 18, 2004 at 02:49 UTC
`(?:.|\n)` is unnecessarily slow due to the alternation. `(?s:.)` doesn't suffer from that.

ihb

See perltoc if you don't know which perldoc to read! Read argumentation in its context!
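To see that the spellings discussed in this subthread match the same characters, here is a small sketch. The `[\s\S]` class is not mentioned in the thread; it is added here as a commonly used way to write "any character, newline included" as a character class. The hash keys and variable names are made up, and the script checks equivalence only, not speed.

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# Three spellings of "any single character, newline included".
my %any = (
    'alternation' => qr/(?:.|\n)/,
    'inline /s'   => qr/(?s:.)/,
    'class'       => qr/[\s\S]/,
);

my %count;
for my $name (sort keys %any) {
    my $re = $any{$name};
    # Count how many single characters each pattern matches in $text.
    $count{$name} = () = $text =~ /$re/g;
    print "$name: $count{$name} of ", length($text), " characters\n";
}
```

Each pattern should match every character of the string, the newline included; to compare their speed, the core Benchmark module would be one option.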

