Regular expression for grabbing strings with multiple lines between tags

by Max_NL (Novice)
on Apr 05, 2016 at 09:25 UTC ( [id://1159602]=perlquestion )

Max_NL has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to extract strings from a data file in which the information I need sits between tags.
The regular expression works when a string is on a single line, but it fails when the string contains newlines (\n).
The data has the following structure (only <tag2> can span multiple lines):

<tag1>[abcd]</tag1>Blah, blah, blah?: <tag2>Yes</tag2>
<tag1>[efgh]</tag1>Yadah, yadah?: <tag2>1) Foo;
2) Bar;
3) Quux;
</tag2>
<tag1>[ijkl]</tag1>Blah?: <tag2>Yes</tag2>
<tag1>[mnop]</tag1>Blah, bleh?: <tag2>Yes, I will.
If this and that</tag2>

What I want to extract are the strings inside the <tag1> and <tag2> tags.
The code:

#!/usr/local/bin/perl -w
open (DATA, "data.txt") or die "Error";
undef $/; # slurp mode
$body=<DATA>;
close DATA;
while ( $body =~ /<tag1>\[(\w+)\]<\/tag1>.*<tag2>(.*)<\/tag2>/g ) {
    print "[$1] => [$2]\n";
}

It should print:

[abcd] => [Yes]
[efgh] => [1) Foo; 2) Bar; 3) Quux;]
[ijkl] => [Yes]
[mnop] => [Yes, I will. If this and that]

Instead it prints only [abcd] and [ijkl]; the regex fails whenever the string in <tag2> spans multiple lines.
I tried several combinations of the s and g modifiers and .*?, but couldn't get anything to work.
I think the problem is in the part made bold:

$body =~ /<tag1>\[(\w+)\]<\/tag1>.*<tag2>(.*)<\/tag2>/g

I'm overlooking something but don't know what... :(

Replies are listed 'Best First'.
Re: Regular expression for grabbing strings with multiple lines between tags
by Ratazong (Monsignor) on Apr 05, 2016 at 09:33 UTC
Re: Regular expression for grabbing strings with multiple lines between tags
by Eily (Monsignor) on Apr 05, 2016 at 09:46 UTC

    You can also read about the Common Regex Gotchas. The first one is not only a piece of advice you should follow; it also uses a problem similar to yours as its example, so you should be able to get your solution from it. The second probably applies to you as well: if you're not going to use a module to validate the whole XML file (meaning you trust the content to be what you expect), you just need to grab the text between </tag1> and <tag2>.

    Edit: I may be wrong about the second gotcha though, you probably want to capture the content of the tags as well as what's in between.

      Handy tips in the link, thank you!
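
One way to read the advice above is to avoid '.' altogether and match "anything that is not the start of a tag" instead. The following is only a sketch of that idea, not necessarily what the linked node recommends. The sub name parse_tags and the sample string are made up for illustration, and it assumes the text between tags never contains a literal '<'.

```perl
use strict;
use warnings;

# parse_tags is a hypothetical helper, not taken from the thread.
# Assumption: neither the filler text nor the <tag2> content contains '<'.
sub parse_tags {
    my ($text) = @_;
    my %answer;
    # [^<]* matches any run of characters except '<', newlines included,
    # so no /s modifier and no non-greedy quantifier are needed.
    while ( $text =~ m{<tag1>\[(\w+)\]</tag1>[^<]*<tag2>([^<]*)</tag2>}g ) {
        $answer{$1} = $2;
    }
    return %answer;
}

# Illustrative input modelled on the question's data.
my $sample = "<tag1>[efgh]</tag1>Yadah?: <tag2>1) Foo;\n2) Bar;\n</tag2>\n"
           . "<tag1>[ijkl]</tag1>Blah?: <tag2>Yes</tag2>\n";

my %answer = parse_tags($sample);
print "[$_] => [$answer{$_}]\n" for sort keys %answer;
```

Because a negated character class can never run past the next '<', each match stays inside its own record. The trade-off is that it breaks as soon as a value legitimately contains '<'.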
Re: Regular expression for grabbing strings with multiple lines between tags
by choroba (Cardinal) on Apr 05, 2016 at 09:52 UTC
    If the file is in a known format (HTML, XML), you might have better luck using an existing library that parses the format.

      Unfortunately the files are not in a standard format like XML.
Re: Regular expression for grabbing strings with multiple lines between tags
by Max_NL (Novice) on Apr 05, 2016 at 09:53 UTC

    Just before giving up I found it!

    I needed to make both '.*' non-greedy, so that each matches as little as possible ('.*?'):

    $body =~ /<tag1>\[(\w+)\]<\/tag1>.*?<tag2>(.*?)<\/tag2>/gs

    The '?' makes each quantifier non-greedy, so a match stops at the nearest closing tag, and the 's' modifier lets '.' match newlines, so strings that span multiple lines are captured too.

    Took me a few hours but that's the r.e. learning curve ;-)
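
For anyone who wants to try the fix, here is a minimal self-contained sketch. It embeds sample data instead of reading data.txt; the exact line breaks inside <tag2> are an assumption based on the question, and extract_pairs is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data modelled on the question; the positions of the line breaks
# inside <tag2> are assumed.
my $body = <<'END';
<tag1>[abcd]</tag1>Blah, blah, blah?: <tag2>Yes</tag2>
<tag1>[efgh]</tag1>Yadah, yadah?: <tag2>1) Foo;
2) Bar;
3) Quux;
</tag2>
<tag1>[ijkl]</tag1>Blah?: <tag2>Yes</tag2>
<tag1>[mnop]</tag1>Blah, bleh?: <tag2>Yes, I will.
If this and that</tag2>
END

# extract_pairs is a hypothetical helper: it returns a list of
# [key, value] array refs, one per <tag1>/<tag2> pair found.
sub extract_pairs {
    my ($text) = @_;
    my @pairs;
    # .*? : non-greedy, stops at the nearest following tag
    # /s  : lets '.' match newlines, so multi-line values are captured
    # /g  : continue from the end of the previous match on each iteration
    while ( $text =~ m{<tag1>\[(\w+)\]</tag1>.*?<tag2>(.*?)</tag2>}gs ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

for my $pair ( extract_pairs($body) ) {
    my ( $key, $value ) = @$pair;
    print "[$key] => [$value]\n";
}
```

Both changes are needed together. With /s but greedy '.*', the first '.*' swallows everything up to the last <tag2>, so the loop yields a single match pairing [abcd] with the final value. Without /s, '.' stops at every newline and the multi-line entries are skipped, which is the behaviour described in the question. The captured values keep their embedded newlines, so the printed output spans several lines for [efgh] and [mnop].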
