http://qs321.pair.com?node_id=274230

svsingh has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to pull the title and h1 out of an HTML file (local). I figured this is a little too simple to use any of the HTML parsing modules and I'm using a simple match. The HTML file is guaranteed to have only one h1.

Here's what I'd like to do ...

    $/ = '</h1>';
    my $chunk = <HTMFILE>;
    $chunk =~ m%<title>(.+)</title>.*<h1>(.+)</h1>%i;

... which returns a pair of undefs. If I split the match over a couple of lines, however, everything works out just fine. Here's what's working:

    $/ = '</h1>';
    my $chunk = <HTMFILE>;
    $chunk =~ m%<title>(.+)</title>%i;
    my $title = $1;
    $chunk =~ m%<h1>(.+)</h1>%i;
    my $heading = $1;

The best explanation I can think of is .* only matches up to a certain number of characters. My test file has 3750 characters between </title> and <h1>. Is that what's happening here?

Thanks for your help.

Replies are listed 'Best First'.
Re: Is there a Limit on Matching .*
by sauoq (Abbot) on Jul 15, 2003 at 00:01 UTC
    Is that what's happening here?

    No. There's no arbitrary limit on the number of characters dot-star can match.

    What's happening in your case, I'll bet, is that you have newlines in your $chunk and forgot that a dot doesn't match a newline unless you include the /s modifier on the regex.

    Be careful about setting your input record separator to '</h1>' too. That's an exact string and will be case sensitive.

    I guess I'd be remiss without including some standard scolding like, "you should parse HTML with an HTML parser, not a regex."

    -sauoq
    "My two cents aren't worth a dime.";
    
      But he's not "parsing" html at all, all he's doing is extracting a certain pattern. Sounds like what a regex was designed for to me.
        But he's not "parsing" html at all, all he's doing is extracting a certain pattern.

        It's a question of whether he should be parsing it instead of using a regex to extract a chunk. I don't know; I'm not working on his project. (And that's why I tossed it in as an afterthought.) TIMTOWTDI, YMMV, etc., etc., and so forth.

        -sauoq
        "My two cents aren't worth a dime.";
        
      That did it. Thank you everyone! Also, thanks for the tip on the input separator. The HTML files are being generated by RoboHELP and the tags are consistently lowercase. The insensitive match is more of a habit than anything else. I think I should take that out of this script for efficiency.
Re: Is there a Limit on Matching .*
by Elian (Parson) on Jul 15, 2003 at 03:38 UTC
    Newlines, as has been pointed out, are probably your problem.

    While the regex engine does have some limits to it, these are generally documented, and not small. The {n,m} style of repeat caps at 32K, for example, and there's a limit (IIRC) of 32K match variables, and there are some recursion depth issues, but it takes a lot to trip them. Normal regexes won't, generally speaking.

    .* is generally limited by memory and in pathological cases runtime. As an example, perl -e '$foo = "x" x 6000000; $foo =~ /(x*)/; print length($1), "\n"' outputs a length of 6000000.

Re: Is there a Limit on Matching .*
by nysus (Parson) on Jul 15, 2003 at 00:12 UTC
    I believe the answer to your question is that '.' is not matching over newlines. You must use the 's' modifier like so:

    $chunk =~ m%<title>(.+)</title>.*<h1>(.+)</h1>%is;
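
To see the effect of the /s modifier in isolation, here is a minimal sketch. The sample string is a made-up stand-in for the real file (an assumption, not the poster's data); the two patterns are the ones from this thread.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real file: newlines sit between </title> and <h1>.
my $chunk = "<html><head>\n<title>My Title</title>\n</head>\n<body>\n<h1>My Heading</h1>";

# Without /s, "." cannot cross a newline, so ".*" cannot bridge </title> to <h1>.
my @without_s = $chunk =~ m%<title>(.+)</title>.*<h1>(.+)</h1>%i;

# With /s, "." matches newlines too, so both captures are filled.
my @with_s = $chunk =~ m%<title>(.+)</title>.*<h1>(.+)</h1>%is;

print "without /s: ", scalar(@without_s), " captures\n";
print "with /s: @with_s\n" if @with_s;
```

One caveat: once /s is on, a greedy (.+) can also run across lines, so if a document could contain more than one closing tag of the same kind, the non-greedy (.+?) is probably the safer choice.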

    Update: Removed the first paragraph, which may have been wrong.

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff";
    $nysus = $PM . $MCF;

Re: Is there a Limit on Matching .*
by LazerRed (Pilgrim) on Jul 15, 2003 at 00:11 UTC
    Exactly what I was thinking, sauoq.

    $chunk =~ m%<title>(.+)</title>.*<h1>(.+)</h1>%i;
    is looking for everything on one line, so it will not match unless everything from <title> through </h1> is on the same line.

    Whereas: $chunk =~ m%<title>(.+)</title>%i;

    and: $chunk =~ m%<h1>(.+)</h1>%i;

    each match their own tag on its own line.
Re: Is there a Limit on Matching .*
by graff (Chancellor) on Jul 15, 2003 at 04:57 UTC
    Just a nit-pick: this use of $/ seems inappropriate:
    $/ = '</h1>';
    considering that you don't seem to expect the close tag to always be lower case... I presume you had a reason for including the "i" flag on this regex:
    $chunk =~ m%<h1>(.+)</h1>%i;
    And of course, the value of $/ cannot be treated as a regex -- it has to be a literal string.

    Actually, given that you can "guarantee" only one "h1" tag in an html file, if it happens to be capitalized, you'll just slurp the whole file into $chunk, and the remaining logic will work in any case. But don't fall into a false sense of safety about this sort of usage -- it'll trip you someday.
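
A small sketch of that slurp-the-whole-file behavior, assuming a made-up document whose closing tag is upper case and using an in-memory filehandle so no file on disk is needed:

```perl
use strict;
use warnings;

# Hypothetical sample document whose closing tag is upper case.
my $html = "<title>T</title>\n<H1>Heading</H1>\n<p>rest of the page</p>\n";

# In-memory filehandle, so the sketch needs no file on disk.
open my $fh, '<', \$html or die "cannot open in-memory handle: $!";

my $chunk;
{
    local $/ = '</h1>';   # literal, case-sensitive record separator
    $chunk = <$fh>;
}
close $fh;

# '</h1>' never occurs in the data, so the first "record" is the entire document.
print length($chunk) == length($html) ? "read whole file\n" : "stopped at </h1>\n";
```

Using local inside a block also restores $/ afterwards, so later reads in the same script are not affected by the unusual separator.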

(jeffa) Re: Is there a Limit on Matching .*
by jeffa (Bishop) on Jul 15, 2003 at 15:09 UTC
    I am actually a bit shocked that no-one mentioned using a negated character class to grab what you need. The idea is to grab everything that is not the character '<':
    my ($title) = $chunk =~ /<title>([^<]+)/;
    my @h1 = $chunk =~ /<h1>([^<]+)/g;
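
A side effect worth noting: a negated character class such as [^<] matches newlines as well, so this approach needs no /s modifier. A minimal sketch, assuming a made-up sample in which the title text wraps across two lines:

```perl
use strict;
use warnings;

# Hypothetical sample: the title text itself is wrapped across two lines.
my $chunk = "<title>A long\ntitle</title>\n<body>\n<h1>Only heading</h1>";

# [^<] matches any character except '<', newlines included, so no /s is needed.
my ($title)   = $chunk =~ /<title>([^<]+)/;
my ($heading) = $chunk =~ /<h1>([^<]+)/;

print "title: $title\n";
print "heading: $heading\n";
```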
    However, this is still not perfect. I personally think that nothing is too simple for a parser module, especially if that parser module is HTML::TokeParser::Simple:
    use strict;
    use warnings;
    use Data::Dumper;
    use HTML::TokeParser::Simple;

    my $d = do { local $/; <DATA> };
    my $p = HTML::TokeParser::Simple->new(\$d);

    my %hash;
    while ( my $token = $p->get_token ) {
        $hash{title} = $p->get_token->return_text
            if $token->is_start_tag('title');
        push @{$hash{h1}}, $p->get_token->return_text
            if $token->is_start_tag('h1');
    }
    print Dumper \%hash;

    __DATA__
    <html>
    <head>
    <title>foo</title>
    </head>
    <body>
    <h1>one</h1>
    <h1>two</h1>
    <h1>three</h1>
    </body>
    </html>

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)