http://qs321.pair.com?node_id=835388

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I must be missing something obvious here. I want to grab the text between the strings <blockquote> and </blockquote> so I tried this ...

$ cat test.pl
#!/usr/bin/perl
my $data;
while(<DATA>) {
    $data .= $_;
}
my $stuff = "uninit";
if ( $data =~ /<blockquote>(.*)<\/blockquote>/m ) {
    $stuff = $1;
}
print "$stuff\n";
__DATA__
</p>
<blockquote>This is a non-fiction collection of Maugham's observations of life in Asia in the early 20th Century. (Summary by BellonaTimes)
</blockquote>
<!-- if div.cd-cover --><div class="cd-cover">
</div><!-- end if -->

The output is simply ...
uninit
... I was hoping for ...
This is a non-fiction collection of Maugham's observations of life in Asia in the early 20th Century. (Summary by BellonaTimes)
... what am I doing wrong here?

Thanks!

Re: Isn't /m for multiline regex?
by jwkrahn (Abbot) on Apr 19, 2010 at 04:52 UTC

    The /m option affects only the ^ and $ anchors, but you are not using those anchors in your pattern.

    You need the /s option so that the . metacharacter will match a newline as well as every other character.
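
A minimal sketch of the difference between the two modifiers. The two-line string is made up for illustration and is not the poster's data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line sample string.
my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after an embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, . will not match "\n", so the pattern cannot span both lines.
my $dot_plain = $text =~ /first.*second/ ? 1 : 0;

# With /s, . matches any character, newline included.
my $dot_s = $text =~ /first.*second/s ? 1 : 0;

print "no /m: @plain\n";
print "/m:    @multi\n";
print "no /s: $dot_plain\n";
print "/s:    $dot_s\n";
```

Only the /m line changes what ^ finds, and only the /s line lets the dot cross the newline.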

Re: Isn't /m for multiline regex?
by PeterPeiGuo (Hermit) on Apr 19, 2010 at 04:32 UTC

    Change /m to /s.

    • /m treats the text as multiple lines: ^ and $ match at the start and end of each line, but . still stops at a newline. No single line here has both the opening and closing blockquote tag, so the pattern fails.
    • /s treats the text as one single line: . matches newlines too, so the opening and closing tags can meet each other.

    Peter (Guo) Pei
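
Applied to the script from the question, the change could look roughly like this. The heredoc stands in for the __DATA__ section, and the exact line breaks in it are an assumption; all that matters is that the two tags sit on different lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the __DATA__ section of test.pl (line breaks assumed).
my $data = <<'END_HTML';
</p>
<blockquote>This is a non-fiction collection of Maugham's observations of life in Asia in the early 20th Century. (Summary by BellonaTimes)
</blockquote>
<!-- if div.cd-cover --><div class="cd-cover">
</div><!-- end if -->
END_HTML

my $stuff = "uninit";

# /s instead of /m: . may now cross the newline before </blockquote>.
if ( $data =~ /<blockquote>(.*)<\/blockquote>/s ) {
    $stuff = $1;
}

print "$stuff\n";
```

The capture keeps the newline that precedes the closing tag, so chomp $stuff if that matters.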

Re: Isn't /m for multiline regex?
by afoken (Chancellor) on Apr 19, 2010 at 11:39 UTC
    what am I doing wrong here?

    You are trying to parse HTML using regular expressions. That simply can't work, due to the way HTML is defined. Use an HTML parser; a CPAN search will list several.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
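
For illustration, here is a sketch using one such parser, HTML::TokeParser. That choice is an assumption, since the post names no specific module. It ships with the HTML-Parser distribution on CPAN, is not part of core Perl, and must be installed. The input is a shortened, hypothetical version of the fragment in the question:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # from the HTML-Parser distribution on CPAN (not core)

# Hypothetical input resembling the fragment in the question.
my $html = <<'END_HTML';
</p>
<blockquote>This is a non-fiction collection.
(Summary by BellonaTimes)
</blockquote>
<div class="cd-cover"></div>
END_HTML

# Pass a reference so the string is parsed as a document, not opened as a file.
my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot set up parser";

my @quotes;
# get_tag skips ahead to the next <blockquote> start tag, if there is one.
while ( $parser->get_tag('blockquote') ) {
    # Collect the text up to the closing tag, with whitespace collapsed.
    push @quotes, $parser->get_trimmed_text('/blockquote');
}

print "$_\n" for @quotes;
```

Because the parser tracks tags itself, line breaks inside the blockquote need no regex modifier at all.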
      The parent node overreaches.

      While it's true that it's not generally a good idea to try to parse HTML with regexen, "(t)hat simply can't work" is not.

      It can be done... and often is for simple cases... but is fraught with so many difficulties that it's inadvisable. What's more, trying to parse HTML of any complexity with tools other than the well-tested modules referenced above flies in the face of the mantra 'don't re-invent the wheel.'

        from perlfaq6

        Here's code that finds everything between START and END in a paragraph:
        undef $/;   # read in whole file, not just one line or paragraph
        while ( <> ) {
            while ( /START(.*?)END/sgm ) {
                print "$1\n";
            }
        }
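
The non-greedy .*? in that snippet matters as soon as the input holds more than one START ... END pair. A small sketch with a made-up string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input with two START ... END regions spread over lines.
my $text = "START one\nEND junk START two\nEND";

# Greedy .* under /s runs to the last END, swallowing everything between.
my @greedy = $text =~ /START(.*)END/sg;

# Non-greedy .*? stops at the nearest END, so /g finds each region.
my @lazy = $text =~ /START(.*?)END/sg;

print scalar(@greedy), " greedy match\n";
print scalar(@lazy),   " non-greedy matches\n";
```

The /m in the perlfaq6 pattern is harmless but does nothing there, since the pattern contains no ^ or $.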