Many matches to an array

ciryon has asked for the wisdom of the Perl Monks concerning the following question:

Re: Many matches to an array by tachyon (Chancellor) on Jul 15, 2004 at 11:07 UTC
You are missing the /g, so you only get one match. You must have missed my Perl Idioms Explained - @ary = $str =~ m/(stuff)/g tutorial :-) Even if you had added the /g, it would still have given you a single match, because you are using .+, which is greedy. You need .+? if it is going to work. You should also add /i to make it case insensitive, and allow for whitespace and attributes inside the opening div tag. Where we are heading is the usual suggestion to use HTML::Parser. This should be reasonably reliable, if not best practice: `@divs = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div!sig;`
I have posted an HTML::Parser example for you further down this thread.
cheers tachyon
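A minimal sketch of the three cases described above: no /g, greedy .+ with /g, and non-greedy .+? with /g. The sample `$content` string is made up for illustration and is not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample input (an assumption, not the OP's data).
my $content = '<div>one</div><DIV class="x">two</DIV>';

# No /g: list context returns only the captures of the first match.
my @first = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div!si;

# Greedy .+ with /g: a single match that runs from the first
# opening div to the last closing div.
my @greedy = $content =~ m!<\s*div[^>]*>(.+)<\s*/div!sig;

# Non-greedy .+? with /g: one element per div.
my @divs = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div!sig;

print scalar(@first), " ", scalar(@greedy), " ", scalar(@divs), "\n";
print "$_\n" for @divs;
```

Only the last form fills the array with one entry per div; the other two each return a single element.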
Re: Many matches to an array by Paulster2 (Priest) on Jul 15, 2004 at 11:09 UTC
Don't you want to put your entire content into an array and then check each line of that array for the content that you are looking for? It appears that you put all of the content into a string (or one element) and then check that single element for your content. When you do that, I would think that it would only return one hit. Change `$content` to an array and then search each line (`foreach`) for your content matches. That, I would think, will get you where you want to go. (I am assuming that the regex is correct.)
Paulster2
You're so sly, but so am I. - Quote from the movie Manhunter.
Re^2: Many matches to an array by ciryon (Sexton) on Jul 15, 2004 at 11:38 UTC
How do I put the entire content into an array?
Re^3: Many matches to an array by hsinclai (Deacon) on Jul 15, 2004 at 11:53 UTC
Maybe start with

    # get_content() and both patterns are placeholders
    my @content = get_content($url);
    foreach my $var ( @content ) {
        next if $var =~ / undesirable regex /;
        $var =~ / your regex / && do something;
        ..
    }

I think this is what Paulster2 means. At least now the @content you fetched is going to get processed line by line.
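A runnable sketch of the line-by-line idea, assuming the page is already held in one string. The sample string and both patterns are made up for illustration. Note that this approach only finds a div that opens and closes on the same line.

```perl
use strict;
use warnings;

# Hypothetical page content held in a single string (an assumption).
my $content = "<html>\n<div>foo</div>\n<p>skip me</p>\n<div>bar</div>\n</html>\n";

# One array element per line.
my @content = split /\n/, $content;

my @hits;
foreach my $line (@content) {
    # Placeholder "undesirable" pattern: skip paragraph lines.
    next if $line =~ /<p>/i;
    # Placeholder "wanted" pattern: capture the body of a one-line div.
    push @hits, $1 if $line =~ m!<div[^>]*>(.+?)</div!i;
}

print "$_\n" for @hits;
```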
Re: Many matches to an array by friedo (Prior) on Jul 15, 2004 at 11:14 UTC
In addition to adding `/g`, you'll want to make your regex non-greedy, by changing `(.+)` to `(.+?)`.
Re: Many matches to an array by wfsp (Abbot) on Jul 15, 2004 at 11:53 UTC
Building a regex to parse HTML is tricky. I subscribe to the "don't do that" school, especially as someone has already done it for you. This is one way. There are, of course, many others.

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new("index.html")
        or die "Can't open: $!";

    my @results;
    # Skip ahead to each opening div tag in turn
    while (my $t = $p->get_tag( 'div' ) ){
        # Collect the text up to the next closing div tag
        my $text = $p->get_trimmed_text( '/div' );
        push @results, $text;
    }
    print join "\n", @results;

Update: Warning - See tachyon's reply below.
Re^2: Many matches to an array by tachyon (Chancellor) on Jul 15, 2004 at 11:59 UTC
Unfortunately that example does not work - unless the OP just wants the text between the div tokens. Div tags generally have all sorts of HTML between them, not just text. That example will lose it all. You can do it like this with HTML::Parser:

    {
        package MyParser;
        use base 'HTML::Parser';

        sub start {
            my($self, $tagname, $attr, $attrseq, $origtext) = @_;
            $self->{divs}->[-1] .= $origtext if $self->{dc};
            if ( $tagname eq 'div' ) {
                push @{$self->{divs}}, '';
                $self->{dc}++;
            }
        }

        sub end {
            my($self, $tagname, $origtext) = @_;
            $self->{dc}-- if $tagname eq 'div';
            $self->{divs}->[-1] .= $origtext if $self->{dc};
        }

        sub text {
            my($self, $origtext, $is_cdata) = @_;
            $self->{divs}->[-1] .= $origtext if $self->{dc};
        }

        sub comment {
            my($self, $origtext) = @_;
            $self->{divs}->[-1] .= "<!--$origtext-->" if $self->{dc};
        }
    }

    my $p = MyParser->new;
    $p->parse($content);

    # WARNING this array deref will die if we have not put anything
    # in (ie not divs) as we will try to deref an undefined value
    if ( exists $p->{divs} ) {
        print "($_)\n" for @{$p->{divs}};
        undef $p->{divs};   # prevent leaks, and accumulating in $p object
    }

Try your example on this HTML:

    $content = '
    <html>
    <div>foo
    <!-- comment here -->
    </div>
    <div id="foo">bar
    <a href="hello">somestuff</a>
    </div>
    </html>
    ';

cheers tachyon
Re^3: Many matches to an array by wfsp (Abbot) on Jul 15, 2004 at 12:28 UTC
Yes, absolutely. And nested divs. And shed loads of whitespace. Which makes the regex route, IMHO, even more scary. I was attempting to make a point with a simple case. But, as you point out, it was probably more misleading than helpful. I must have a look at HTML::Parser. It appears to be the parser of choice 'round these parts.
Thanks, wfsp
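To illustrate the nested-div concern, here is a small sketch using a non-greedy div regex of the kind suggested earlier in the thread. The input string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with one div nested inside another (an assumption).
my $content = '<div>outer <div>inner</div> tail</div>';

# The non-greedy capture stops at the first closing div it meets,
# so the outer div is cut short and " tail" is never captured.
my @divs = $content =~ m!<\s*div[^>]*>(.+?)<\s*/div!sig;

print scalar(@divs), " match(es)\n";
print "[$_]\n" for @divs;
```

The regex has no notion of nesting depth, which is one reason the replies here lean towards a real parser.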
Re^4: Many matches to an array by tachyon (Chancellor) on Jul 15, 2004 at 14:23 UTC
Re: Many matches to an array by si_lence (Deacon) on Jul 15, 2004 at 11:16 UTC
In addition to the /g modifier to get more than one match you might want to turn your (.+) pattern into (.+?) so it will not match everything between the first div and the last /div in $content.

