Re: extracting a substring from a string - multiple variables
by graff (Chancellor) on Oct 27, 2007 at 22:23 UTC
|
my $string = '...blah...<file fiop="foo" length="bar"/>baz</file>...blah...';
my ( $foo, $bar, $baz );
if ( $string =~ s{<file fiop="([^"]+)" length="([^"]+)"/>([^<]+)</file>}{} ) {
    ( $foo, $bar, $baz ) = ( $1, $2, $3 );
    print "extracted $foo, $bar, $baz; left $string\n";
}
|
Actually you hit it right on the spot; it's home-grown XML from Cupertino...
The baz part is raw binary data, inserted in the XML; that's why I want to extract
it before parsing the valid XML.
I hadn't even noticed the close-angle-bracket (thanks - but it's really there).
I've tried your code, but it doesn't seem to get me there.
Any further suggestions?
|
When I run my snippet as posted, I get the following output:
extracted foo, bar, baz; left ...blah......blah...
Do you get something different when you run it? Or do you want something different from that?
When you try to use the "s{...}{}" expression in your own code, is it possible that your "raw binary data" (in "the baz part") might contain a byte value of 0x3C? This would be treated as a "<" character in the regex match, which would cause trouble. Something like this might work better in that case:
s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s
(update: added the "s" modifier at the end, in case the raw binary stuff might contain a line-feed)
Note the question mark after ".*" -- that's the important thing that was missing from your initial attempt: it makes the wildcard match non-greedy (stops matching as soon as possible).
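To make the 0x3C point concrete, here is a minimal sketch comparing the two patterns on a payload that contains both a "<" byte and a line feed. The payload bytes and the length value are made up for illustration; they are not the poster's real data.

```perl
use strict;
use warnings;

# Hypothetical payload: arbitrary bytes, including "<" (0x3C) and a line feed.
my $payload = "\x01\x3C\x0A\xFF";
my $string  = '...blah...<file fiop="foo" length="4"/>' . $payload . '</file>...blah...';

# A negated character class stops at the first "<", so it cannot span this payload.
my $class_matches
    = $string =~ m{<file fiop="([^"]+)" length="([^"]+)"/>([^<]+)</file>} ? 1 : 0;

# Non-greedy ".*?" together with /s crosses "<" bytes and line feeds alike.
my ( $fiop, $length, $data );
my $lazy_matches = 0;
if ( $string =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
    ( $fiop, $length, $data ) = ( $1, $2, $3 );
    $lazy_matches = 1;
}

printf "negated class matched: %s\n", $class_matches ? 'yes' : 'no';
printf "non-greedy matched:    %s\n", $lazy_matches  ? 'yes' : 'no';
print  "left: $string\n";    # left: ...blah......blah...
```

One caveat: if the binary payload happens to contain the byte sequence "</file>" itself, the non-greedy match stops there, too early.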
Re: extracting a substring from a string - multiple variables
by mwah (Hermit) on Oct 27, 2007 at 22:31 UTC
|
Nobody answered after 16 min? Oh, graff did (and was faster than me) ;-)
...
my ($fiop, $length, $data)
    = $string =~ m{<file                # tag anchor
                   \s+ fiop="([^"]+)"    # (foo)
                   \s+ length="([^"]+)"  # (bar)
                   />                    # end: start file tag
                   \s* (.*?)             # (baz) - note the "nongreediness" .*?
                   </file>               # end: file tag
                  }x;
print "$fiop, $length, $data\n";
...
Addendum: forgot the tag-cleaning part:
...
(my $notags = $string) =~ s{<file.+?</file>}{};
print "$notags\n";
...
Your mistake was basically to use the greedy quantifier (.*), which
first matches all the way to the end of the string, then backtracks, and
so ends up matching from the rear (at the last "</file>", not the first) ...
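A small sketch of that difference, assuming a made-up string with two file elements (the attribute values are placeholders):

```perl
use strict;
use warnings;

# Hypothetical input with two <file> elements.
my $string = 'A<file fiop="x" length="1"/>1</file>B<file fiop="y" length="1"/>2</file>C';

# Greedy: .* runs to the end of the string, then backs up to the LAST </file>.
my ($greedy) = $string =~ m{<file[^>]*/>(.*)</file>}s;

# Non-greedy: .*? stops at the FIRST </file> it can reach.
my ($lazy) = $string =~ m{<file[^>]*/>(.*?)</file>}s;

# With /g the non-greedy form strips every element, one at a time.
( my $notags = $string ) =~ s{<file.+?</file>}{}gs;

print "greedy: $greedy\n";    # greedy: 1</file>B<file fiop="y" length="1"/>2
print "lazy:   $lazy\n";      # lazy:   1
print "notags: $notags\n";    # notags: ABC
```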
Regards
mwa
|
I'm not sure if I should rephrase...
The data I get is from a POST request; it's almost XML except for the <file/> part that I need to take out, as parser.pm barfs on it.
in "<file fiop="foo" length="bar"/>baz</file>", baz is raw binary;
that's why (I think) the preg matching (m^) doesn't work.
FYI baz looks like: µÜ¡3õ§©AEurope/Amsterdam$...
|
Ohh, *if* there is some binary within the tag and *if* the "length" field
gives its *length* in bytes, you could easily construct a regex that extracts
binary data of exactly that length:
my $binary = pack 'F*', (3.141592) x 10;    # make binary vector of length 80 bytes
my $string = '...blah...<file fiop="foo" length="' . length($binary) . '"/>'
           . $binary . '</file>...blah...';
# Note: this originally used \C{$2} to match raw bytes. \C was removed from
# regexes in Perl 5.24, so (?s:.) stands in for it here (fine for a byte string).
my ($fiop, $length, $data)
    = $string =~ m{<file                   # tag anchor
                   \s+ fiop="([^"]+)"       # (fiop)
                   \s+ length="([^"]+)"     # (length)
                   />                       # end: start file tag
                   ((??{ "(?s:.){$2}" }))   # self modifying regex for binary stuff
                   </file>                  # end: file tag
                  }sx;
print "$fiop, $length (data comes below)\n";
print join ',', unpack("F*", $data);        # extract binary data again
(my $notags = $string) =~ s{<file.+?</file>}{}s;
print "\n$notags\n";
In the above I pack a binary sequence of 10 Pi-Numbers (double, 10 x 8 bytes) into the tag, match a binary sequence of its length ($2) and unpack it afterwards.
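The same length-driven idea can also be sketched without a deferred regex, by matching only the start tag and then cutting the payload out with substr. This assumes, as above, that "length" is the exact byte count of the payload and that the string holds bytes rather than decoded characters; the payload here is invented and deliberately contains "</file>" to show that the length, not the closing tag, decides where the data ends.

```perl
use strict;
use warnings;

# Hypothetical payload that even contains the closing-tag text itself.
my $payload = "\x00<\x0A</file>\xFF";
my $string  = '...blah...<file fiop="foo" length="' . length($payload) . '"/>'
            . $payload . '</file>...blah...';

my ( $fiop, $length, $data );
if ( $string =~ m{<file\s+fiop="([^"]+)"\s+length="(\d+)"/>} ) {
    ( $fiop, $length ) = ( $1, $2 );
    my ( $tag_start, $data_start ) = ( $-[0], $+[0] );    # offsets of the start tag
    my $close = '</file>';

    # Take exactly $length bytes, then insist that the closing tag follows.
    $data = substr $string, $data_start, $length;
    if ( substr( $string, $data_start + $length, length $close ) eq $close ) {
        # 4-argument substr removes the whole element in place.
        my $total = $data_start + $length + length($close) - $tag_start;
        substr( $string, $tag_start, $total, '' );
    }
}

printf "%s, %d bytes extracted\n", $fiop, length $data;
print  "left: $string\n";    # left: ...blah......blah...
```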
Regards
mwa
Re: extracting a substring from a string - multiple variables
by ww (Archbishop) on Oct 28, 2007 at 12:03 UTC
|
use warnings;
use strict;
my $string = q {...blah...<file fiop="foo" length="304507"/> SyncBookmarkRecordType12345·bjbjUU</file>...blah};
my $foo = qr {<file fiop="([a-z]+)" length="(\d+)"/> (.+)(?=</file>)};
if ($string =~ m#$foo#) {
    print "\nMatched \"fiop\" label: |$1|\nMatched Length: |$2|\nMatched data: |$3|\n";
} else {
    print "no match \n";
}
print "\nDone \n\n";
OUTPUT
Matched "fiop" label: |foo|
Matched Length: |304507|
Matched data: |SyncBookmarkRecordType12345?+bjbjU?U?|
Done