PerlMonks  

extracting a substring from a string - multiple variables

by walinsky (Scribe)
on Oct 27, 2007 at 22:02 UTC

walinsky has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

In a given string:
my $string = '...blah...<file fiop="foo" length="bar"/>baz</file>...blah...';
(where bar is an integer - if that helps)
I'd like to capture foo, bar and baz in variables, but I'd also like $string to end up
without the "<file fiop="foo" length="bar"/>baz</file>" part.

I've tried:
my ($fiop, $length, $data) = ($string =~ /\<file fiop=\"(.*)\" length=\"(.*)\"\/\>(.*)\<\/file>/);
(without success) - but even if it worked, I'd still have to rebuild the matched part in order to remove it from $string.
Could anyone point me in the right direction to get this done?

rgrds,
Walinsky

Re: extracting a substring from a string - multiple variables
by graff (Chancellor) on Oct 27, 2007 at 22:23 UTC
    Is the data some sort of home-grown imitation of XML? If it was "real" XML, there wouldn't be a slash before the first close-angle-bracket. (I guess since it isn't real XML, it wouldn't help to recommend an XML parsing module.)

    Do you mean something like this?

    my $string = '...blah...<file fiop="foo" length="bar"/>baz</file>...blah...';
    my ( $foo, $bar, $baz );
    if ( $string =~ s{<file fiop="([^"]+)" length="([^"]+)"/>([^<]+)</file>}{} ) {
        ( $foo, $bar, $baz ) = ( $1, $2, $3 );
        print "extracted $foo, $bar, $baz; left $string\n";
    }

      Actually you hit it right on the spot; it's home-grown XML from Cupertino...
      The baz part is raw binary data, inserted in the XML; that's why I want to extract it before parsing the valid XML.
      I hadn't even noticed the close-angle-bracket (thanks - but it's really there).

      I've tried your code; but it doesn't seem to get me there.
      Any further suggestions ?
        When I run my snippet as posted, I get the following output:
        extracted foo, bar, baz; left ...blah......blah...
        Do you get something different when you run it? Or do you want something different from that?

        When you try to use the "s{...}{}" expression in your own code, is it possible that your "raw binary data" (in "the baz part") might contain a byte value of 0x3C? This would be treated as a "<" character in the regex match, which would cause trouble. Something like this might work better in that case:

        s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s
        (update: added the "s" modifier at the end, in case the raw binary stuff might contain a line-feed)

        Note the question mark after ".*" -- that's the important thing that was missing from your initial attempt: it makes the wildcard match non-greedy (stops matching as soon as possible).
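A minimal sketch of the substitution above applied to a payload that contains both a "<" byte and a line feed. The payload bytes and the variable names are made up for illustration; only the pattern comes from the reply.

```perl
use strict;
use warnings;

# Hypothetical payload: 7 bytes, including "<" (0x3C) and a line feed (0x0A).
my $payload = "\x01<\x0A\xFFraw";
my $string  = '...blah...<file fiop="foo" length="7"/>' . $payload . '</file>...blah...';

my ( $fiop, $length, $data );

# .*? stops at the first </file>; the /s modifier lets "." match the line feed.
# The empty replacement removes the whole element from $string in the same step.
if ( $string =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
    ( $fiop, $length, $data ) = ( $1, $2, $3 );
}

printf "fiop=%s length=%s bytes=%d left=%s\n",
    $fiop, $length, length($data), $string;
```

With the earlier ([^<]+) version the same input would not match, because the capture cannot cross the "<" byte inside the payload.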

      I find it funny that the right solution ("use a parser") is shot down because this isn't exactly XML. Are you sure an HTML parser wouldn't parse it properly? Weird how everyone gets stuck on regex.

Re: extracting a substring from a string - multiple variables
by mwah (Hermit) on Oct 27, 2007 at 22:31 UTC

    Nobody answered after 16 min? Oh, graff did (and was faster than me) ;-)

    ...
    my ($fiop, $length, $data) = $string =~
        m{<file                  # tag anchor
          \s+ fiop="([^"]+)"     # (foo)
          \s+ length="([^"]+)"   # (bar)
          />                     # end: start file tag
          \s* (.*?)              # (baz) - note the "nongreedyness" .*?
          </file>                # end: file tag
         }x;
    print "$fiop, $length, $data\n";
    ...

    Addendum: forgot the tag-cleaning part:

    ...
    (my $notags = $string) =~ s{<file.+?</file>}{};
    print "$notags\n";
    ...

    Your mistake was basically to use the greedy quantifier (.*), which first matches all the way to the end of the string, then backtracks, and so ends up matching from the rear ...
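To make the greedy versus non-greedy difference visible, here is a small sketch on a made-up string with two file elements (the input and variable names are assumptions, not taken from the thread):

```perl
use strict;
use warnings;

# Hypothetical input with two <file> elements.
my $string = 'a<file fiop="f1" length="1"/>X</file>b<file fiop="f2" length="1"/>Y</file>c';

# Greedy: .* runs to the end of the string, then backtracks to the LAST </file>.
my ($greedy) = $string =~ m{<file[^>]*/>(.*)</file>}s;

# Non-greedy: .*? extends one character at a time and stops at the FIRST </file>.
my ($lazy) = $string =~ m{<file[^>]*/>(.*?)</file>}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

With a single file element in the string both forms capture the same text; the difference only shows once a second closing tag (or the same byte sequence inside the data) appears later in the string.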

    Regards

    mwa

      I'm not sure if I should rephrase...
      The data I get is from a POST request; it's almost XML except for the <file/> part that I need to take out, as parser.pm barfs on it.
      in "<file fiop="foo" length="bar"/>baz</file>", baz is raw binary;
      that's why (I think) the preg matching (m^) doesn't work.
      FYI baz looks like: µÜ¡3õ§©AEurope/Amsterdam$...
        Ohh, *if* there is some binary within the tag and *if* the "length" field says something about its *length*, you could easily construct a regex that extracts binary data of that length:

        my $binary = pack 'F*', (3.141592) x 10; # make binary vector of length 80 bytes
        my $string = '...blah...<file fiop="foo" length="' . length($binary) .'"/>'
                   . $binary
                   . '</file>...blah...';

        # Note: as first posted this used \C for "one byte"; \C was removed in
        # Perl 5.24, so (?s:.) is used here, which is equivalent on a byte string.
        my ($fiop, $length, $data) = $string =~
            m{<file                      # tag anchor
              \s+ fiop="([^"]+)"         # (fiop)
              \s+ length="([^"]+)"       # (length)
              />                         # end: start file tag
              ((??{ "(?s:.{$2})" }))     # self modifying regex for binary stuff
              </file>                    # end: file tag
             }sx;

        print "$fiop, $length (data comes below)\n";
        print join ',', unpack("F*", $data); # extract binary data again

        # /s so that "." also matches a line feed inside the binary data
        (my $notags = $string) =~ s{<file.+?</file>}{}s;
        print "\n$notags\n";

        In the above I pack a binary sequence of 10 Pi-Numbers (double, 10 x 8 bytes) into the tag, match a binary sequence of its length ($2) and unpack it afterwards.
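The same length-driven idea can also be sketched without matching the payload by regex at all: match only the opening tag, then cut the data out with substr. This is an alternative under stated assumptions (the length attribute counts bytes and $string is a byte string), not the approach posted above; it also avoids the \C escape, which was removed in Perl 5.24. pack 'd*' is used here so that each value is a plain 8-byte double.

```perl
use strict;
use warnings;

my $binary = pack 'd*', (3.141592) x 10;    # 10 doubles, 80 bytes
my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
           . $binary . '</file>...blah...';

my ( $fiop, $length, $data );
if ( $string =~ m{<file\s+fiop="([^"]+)"\s+length="(\d+)"/>} ) {
    ( $fiop, $length ) = ( $1, $2 );
    my $start = $-[0];    # offset where the opening tag begins
    my $body  = $+[0];    # offset just past the opening tag

    # Take exactly $length bytes, whatever they contain.
    $data = substr( $string, $body, $length );

    # Only cut if the closing tag sits exactly where the length says it should.
    my $close = '</file>';
    if ( substr( $string, $body + $length, length $close ) eq $close ) {
        substr( $string, $start, $body + $length + length($close) - $start, '' );
    }
}

printf "%s, %s: %s\n", $fiop, $length, join( ',', unpack 'd*', $data );
print "$string\n";
```

Because the payload is never scanned for a terminator, a "</file>" byte sequence inside the binary data cannot end the extraction early.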

        Regards

        mwa

Re: extracting a substring from a string - multiple variables
by ww (Archbishop) on Oct 28, 2007 at 12:03 UTC

    Using the fubarred xml from your question in the CB:

    use warnings;
    use strict;

    my $string = q {...blah...<file fiop="foo" length="304507"/> SyncBookmarkRecordType12345·bjbjUU</file>...blah};
    my $foo = qr {<file fiop="([a-z]+)" length="(\d+)"/> (.+)(?=</file>)};

    if ($string =~ m#$foo#) {
        print "\nMatched \"fiop\" label: |$1|\nMatched Length: |$2|\nMatched data: |$3|\n";
    } else {
        print "no match \n";
    }
    print "\nDone \n\n";

    OUTPUT
    Matched "fiop" label: |foo|
    Matched Length: |304507|
    Matched data: |SyncBookmarkRecordType12345?+bjbjU?U?|

    Done

Node Type: perlquestion [id://647635]
Approved by ikegami