Isn't /m for multiline regex?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I must be missing something obvious here. I want to grab the text between the strings <blockquote> and </blockquote> so I tried this ...

$ cat test.pl 
#!/usr/bin/perl

my $data;
while(<DATA>) {
    $data .= $_;
}
my $stuff = "uninit";

if ( $data =~ /<blockquote>(.*)<\/blockquote>/m ) {
    $stuff = $1;    
}

print "$stuff\n";

__DATA__
</p>
<blockquote>This is a non-fiction collection of Maugham's observations of life in Asia in the early 20th Century. (Summary by BellonaTimes)
</blockquote>
<!-- if div.cd-cover --><div class="cd-cover">

</div><!-- end if -->

The output is simply ...

uninit

... I was hoping for ...

This is a non-fiction collection of Maugham's observations of life in Asia in the early 20th Century. (Summary by BellonaTimes)

... what am I doing wrong here?

Thanks!


Replies are listed 'Best First'.
Re: Isn't /m for multiline regex? by jwkrahn (Abbot) on Apr 19, 2010 at 04:52 UTC
The `/m` option affects the use of the `^` and `$` anchors, but you are not using those anchors in your pattern. You need to use the `/s` option so that the `.` character class will match a newline as well as every other character.
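To make the difference concrete, here is a minimal sketch that runs the same pattern with `/m` and with `/s`, and then shows what `/m` actually changes. The sample string and variable names are made up for illustration; they only mimic the shape of the data in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text modeled on the question's __DATA__ section (illustrative only).
my $data = "</p>\n<blockquote>First line\nsecond line\n</blockquote>\n<div>\n";

# /m only changes ^ and $; . still stops at newlines, so this does not match.
my $with_m = $data =~ /<blockquote>(.*)<\/blockquote>/m ? 1 : 0;

# /s lets . match newlines, so the capture can span several lines.
my ($with_s) = $data =~ /<blockquote>(.*)<\/blockquote>/s;

# /m in action: ^ matches at the start of every line inside the string.
my @line_starts = $data =~ /^(<\w+)/mg;

print "with /m: ", ( $with_m ? "matched" : "no match" ), "\n";
print "with /s: $with_s";
print "line starts: @line_starts\n";
```

The two modifiers are independent, so a pattern that needs both line anchors and a newline-crossing dot can use `/ms` together.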
Re: Isn't /m for multiline regex? by PeterPeiGuo (Hermit) on Apr 19, 2010 at 04:32 UTC
Change /m to /s. /m only lets ^ and $ match at the start and end of each line inside the string; it does not let . cross a newline, and no single line of your data contains both the opening and the closing blockquote tag. /s lets . match newlines as well, so the pattern can run from the opening tag to the closing tag. Peter (Guo) Pei
Re: Isn't /m for multiline regex? by afoken (Chancellor) on Apr 19, 2010 at 11:39 UTC
"what am I doing wrong here?" You try to parse HTML using regular expressions. That simply can't work, due to the way HTML is defined. Use an HTML parser; a CPAN search will list several. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^2: Isn't /m for multiline regex? by ww (Archbishop) on Apr 19, 2010 at 12:45 UTC
The parent node overreaches. While it's true that it's not generally a good idea to try to parse HTML with regexen, "(t)hat simply can't work" is not. It can be done... and often is for simple cases... but is fraught with so many difficulties that it's inadvisable. What's more, trying to parse HTML of any complexity with tools other than the well-tested modules referenced above flies in the face of the mantra 'don't re-invent the wheel.'
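For comparison, here is a rough sketch of the parser route. It assumes the CPAN module HTML::TreeBuilder is installed, which nobody in this thread named specifically; it is just one of several parsers a CPAN search turns up. The sample HTML is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module; assumed to be installed

# Illustrative input shaped like the question's data.
my $html = <<'END_HTML';
<p>intro</p>
<blockquote>This is a non-fiction collection
of observations. (Summary)
</blockquote>
<div class="cd-cover"></div>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every element matching the criteria.
my @quotes = map { $_->as_text } $tree->look_down( _tag => 'blockquote' );

$tree->delete;    # release the parse tree when done

print "$_\n" for @quotes;
```

The parser handles attributes on the tag, odd whitespace, and multiple blockquotes without any change to the code, which is where a hand-written regex tends to break down.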
Re^3: Isn't /m for multiline regex? by GertMT (Hermit) on Apr 20, 2010 at 07:18 UTC
From perlfaq6: Here's code that finds everything between START and END in a paragraph:

undef $/;    # read in whole file, not just one line or paragraph
while ( <> ) {
    while ( /START(.*?)END/sgm ) {
        print "$1\n";
    }
}
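The `.*?` in the perlfaq6 pattern matters as much as the `/s`. The sketch below, using a made-up string with two START/END pairs, contrasts the non-greedy and greedy forms. The `/m` is left out here because the pattern has no `^` or `$` anchors for it to affect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative input with two START ... END pairs, one spanning a newline.
my $text = "START one\nEND noise START two END\n";

# Non-greedy .*? stops at the nearest END; /s lets it cross newlines;
# /g in list context returns every capture.
my @lazy = $text =~ /START(.*?)END/sg;

# Greedy .* runs on to the last END and swallows everything in between.
my ($greedy) = $text =~ /START(.*)END/s;

print scalar(@lazy), " non-greedy matches\n";
print "greedy capture: $greedy\n";
```

With the question's single blockquote the greedy `(.*)` happens to work, but it would over-capture as soon as the page contained a second one.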
