PerlMonks  

Many matches to an array

by ciryon (Sexton)
on Jul 15, 2004 at 10:54 UTC ( [id://374600]=perlquestion )

ciryon has asked for the wisdom of the Perl Monks concerning the following question:

Venerable Monks,

I'm trying to extract some info from an HTML page. The information looks like <div>important_info_here</div>, and there are many div tags that I'm interested in. I want to store all of the info in an array. I have done something like:

...
$content = get_content($url);
@results = ($content =~ /<div>(.+)<\/div>/s);
foreach (@results) {
    print;
}

but I only get one match. What have I forgotten? I have searched through the monastery to no avail.

Update: Thanks for the replies! I followed tachyon's advice and did it this way:

@divs = $content =~ m/<\s*div[^>]*>(.+?)<\s*\/div>/sig;
I agree it would be better to use a real HTML parser, if I needed to extract more info.

Replies are listed 'Best First'.
Re: Many matches to an array
by tachyon (Chancellor) on Jul 15, 2004 at 11:07 UTC

    You are missing the /g so you only get one match. You must have missed my Perl Idioms Explained - @ary = $str =~ m/(stuff)/g tutorial :-)
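A minimal sketch of what /g changes in list context. The $content string here is a made-up sample, not the OP's page, and the pattern is written non-greedy so that only the effect of /g shows:

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the fetched page.
my $content = '<div>one</div><div>two</div><div>three</div>';

# Without /g: list context returns the captures of the first match only.
my @first_only = $content =~ /<div>(.+?)<\/div>/s;

# With /g: list context returns the captures of every match.
my @all = $content =~ /<div>(.+?)<\/div>/sg;

print scalar(@first_only), " match without /g: @first_only\n";
print scalar(@all), " matches with /g: @all\n";
```

Here @first_only holds just 'one', while @all holds 'one', 'two' and 'three'.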

    Now, even if you had added /g it would still have given you a single match, because you are using .+ which is greedy. You need .+? for it to work. You should also add /i to make it case-insensitive, and what about .... type syntax? Where we are heading is the usual suggestion to use HTML::Parser. The following should be reasonably reliable, if not best practice:

    @divs = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div!sig;
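As a rough illustration of greedy versus non-greedy, the pattern above can be run both ways against a small made-up string (the sample input is an assumption, chosen to include upper-case tags and an attribute):

```perl
use strict;
use warnings;

# Hypothetical sample input: mixed case, and an attribute on the first div.
my $content = '<DIV class="a">foo</DIV> <div>bar</div>';

# Greedy .+ runs on to the last closing tag, so even with /g there is
# only one big match.
my @greedy = $content =~ m!<\s*div[^>]*>(.+)<\s*/div!sig;

# Non-greedy .+? stops at the nearest closing tag: one capture per div.
my @lazy = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div!sig;

print "greedy: ($_)\n" for @greedy;
print "lazy:   ($_)\n" for @lazy;
```

The greedy version returns the single element 'foo</DIV> <div>bar'; the non-greedy version returns 'foo' and 'bar'.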

    I have posted an HTML::Parser example for you down here

    cheers

    tachyon

Re: Many matches to an array
by Paulster2 (Priest) on Jul 15, 2004 at 11:09 UTC

    Don't you want to put your entire content into an array and then check each line of that array for the content that you are looking for? It appears that you put all of the content into a string (or one element) then check that single element for your content. When you do that I would think that it would only return one hit. Change $content to an array and then search each line (foreach) for your content matches. That, I would think, will get you where you want to go. (I am assuming that the regex is correct.)

    Paulster2


    You're so sly, but so am I. - Quote from the movie Manhunter.
      How do I put entire content into an array?
        maybe start with

        my @content = get_content($url);
        foreach my $var ( @content ) {
            next if $var =~ / undesirable regex /;
            $var =~ / your regex / && do something;
            ..
        }
        I think this is what Paulster2 means.. at least now the @content you fetched is going to get processed line by line..
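A small sketch of the line-by-line idea, under the assumption that the page arrives as one string (get_content is not shown in the thread, so a made-up string and split stand in for it). It also shows a limitation: a div that spans more than one line is never seen whole, so it is missed.

```perl
use strict;
use warnings;

# Hypothetical page text; the second div spans two lines.
my $content = "<div>one</div>\n<p>skip</p>\n<div>two\nlines</div>\n";

# Break the string into lines and test each line on its own.
my @lines = split /\n/, $content;

my @results;
for my $line (@lines) {
    # /g in list context collects every div that sits entirely on this line.
    push @results, $line =~ /<div>(.+?)<\/div>/g;
}

print "($_)\n" for @results;
```

Only 'one' ends up in @results. Matching the whole string with /sg, as suggested in the other replies, does not have this problem.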

Re: Many matches to an array
by friedo (Prior) on Jul 15, 2004 at 11:14 UTC
    In addition to adding /g, you'll want to make your regex non-greedy, by changing (.+) to (.+?).
Re: Many matches to an array
by wfsp (Abbot) on Jul 15, 2004 at 11:53 UTC
    Building a regex to parse HTML is tricky. I subscribe to the "don't do that" school, especially as someone has already done it for you. This is one way. There are, of course, many others.
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new("index.html")
        or die "Can't open: $!";
    my @results;
    while (my $t = $p->get_tag( 'div' ) ){
        my $text = $p->get_trimmed_text( '/div' );
        push @results, $text;
    }
    print join "\n", @results;

    Update: Warning - See tachyon's reply below.

      Unfortunately that example does not work, unless the OP just wants the text between the div tokens. Div tags generally have all sorts of HTML between them, not just text. That example will lose it all. You can do it like this with HTML::Parser:

      {
          package MyParser;
          use base 'HTML::Parser';

          sub start {
              my($self, $tagname, $attr, $attrseq, $origtext) = @_;
              $self->{divs}->[-1] .= $origtext if $self->{dc};
              if ( $tagname eq 'div' ) {
                  push @{$self->{divs}}, '';
                  $self->{dc}++;
              }
          }

          sub end {
              my($self, $tagname, $origtext) = @_;
              $self->{dc}-- if $tagname eq 'div';
              $self->{divs}->[-1] .= $origtext if $self->{dc};
          }

          sub text {
              my($self, $origtext, $is_cdata) = @_;
              $self->{divs}->[-1] .= $origtext if $self->{dc};
          }

          sub comment {
              my($self, $origtext) = @_;
              $self->{divs}->[-1] .= "<!--$origtext-->" if $self->{dc};
          }
      }

      my $p = MyParser->new;
      $p->parse($content);

      # WARNING this array deref will die if we have not put anything
      # in (ie not divs) as we will try to deref an undefined value
      if ( exists $p->{divs} ) {
          print"($_)\n" for @{$p->{divs}};
          undef $p->{divs}; # prevent leaks, and accumulating in $p object
      }

      Try your example on this HTML

      $content = ' <html> <div>foo <!-- comment here --> </div> <div id="foo">bar <a href="hello">somestuff</a> </div> </html> ';

      cheers

      tachyon

        Yes, absolutely. And nested divs. And shed loads of whitespace. Which makes the regex route, IMHO, even more scary.
        I was attempting to make a point with a simple case. But, as you point out, it was probably more misleading than helpful.
        I must have a look at HTML::Parser. It appears to be the parser of choice 'round these parts.
        Thanks, wfsp
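To make the nested-div worry concrete, here is a minimal sketch using a made-up input string (an assumption, not taken from the OP's page). The non-greedy regex pairs the outer opening tag with the inner closing tag:

```perl
use strict;
use warnings;

# Hypothetical input with one div nested inside another.
my $content = '<div>outer <div>inner</div> tail</div>';

# Non-greedy match stops at the first closing tag it meets,
# which belongs to the inner div, not the outer one.
my @divs = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div>!sig;

print "($_)\n" for @divs;
```

The only capture is 'outer <div>inner', which is neither the outer div's content nor the inner one's. A regex like this has no notion of nesting depth, which is what the dc counter in the HTML::Parser example keeps track of.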
Re: Many matches to an array
by si_lence (Deacon) on Jul 15, 2004 at 11:16 UTC
    In addition to the /g modifier to get more than one match you might want to turn your (.+) pattern into (.+?) so it will not match everything between the first div and the last /div in $content.
