
Re^5: extracting a substring from a string - multiple variables

by mwah (Hermit)
on Oct 28, 2007 at 16:19 UTC ( [id://647706] )


in reply to Re^4: extracting a substring from a string - multiple variables
in thread extracting a substring from a string - multiple variables

Hmmm .. you are hitting the "quantifier length limit" of your perl implementation (which should be 0xffff) (?).

(1) How long is your binary chunk at all (above message says "304507" - dooh!) and (2) what number is in the ... length="xxx" ... field? Really *that* large?
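A minimal sketch of how the limit shows up, assuming the length is interpolated into the pattern at run time (the variable name `$length` is only illustrative). The exact limit depends on the perl version and build, so the sketch prints whatever message perl reports rather than assuming a number:

```perl
use strict;
use warnings;

my $length = 304507;    # the figure quoted from the error message

# Build the pattern at run time so the compile failure can be trapped by eval
my $re = eval { qr/(.{$length})/s };

if ( defined $re ) {
    print "pattern compiled\n";
}
else {
    print "pattern rejected: $@";
}
```

If the pattern is rejected, the message names the largest quantifier this perl accepts, which tells you whether a `length="..."` value can ever be used inside `{...}`.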


update:

How to read arbitrarily big binary chunks from within regular expressions ...

You could advance until you hit the data (after the closing of the start tag) and simply read the data that follow. This implies you have one ... ...<file>..</file> ... entry per string at this point.

    ...
    my $binary = pack 'F*', (3.141592) x 8001;   # this will dump a 64K+ binary chunk
    my $string = '...blah...<file fiop="foo" length="' . length($binary) .'"/>' . $binary . '</file>...blah...';

    my ($fiop, $length, $data);
    if( $string =~ m{<file \s+ fiop="([^"]+)" \s+ length="([^"]+)" />}gx ) {
       ($fiop, $length) = ($1, $2);                    # extract tag properties as usual
       $data = substr $string, pos($string), $length   # extract data by direct string copy
    }
    print "$fiop, $length\n";
    print join ',', unpack("F*", $data);

    (my $notags = $string) =~ s{<file.+</file>}{};
    print "\n$notags\n";
    ...

Regards

mwa

Replies are listed 'Best First'.
Re^6: extracting a substring from a string - multiple variables
by walinsky (Scribe) on Oct 28, 2007 at 18:57 UTC
    since:
    if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
        ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );

    works, I wondered if we couldn't just back reference like:
    if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.{$2})</file>}{}s ) {
        ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );

    Wouldn't something like that be possible; that would also leave out the implication that there's just one <file></file> pair
      Whoa... let's take a step back.
      • You are trying to handle POSTed data, so there's a reasonable chance that you can't trust the 'length="..."' information.
      • There is also a concern (because it's POSTed data) of corruptions involving "file" tags somehow being present within the binary data.
      • The binary chunks are apparently rather large, so you might run into memory issues if your approach involves having too many copies of too much data in perl variables.
      • Now you seem to be hinting that a given POST might contain two or more segments within "file" tags.
      • You haven't said much about the content outside the "file" tags, but apparently it's supposed to be valid XML once the "file" tags are removed.

      I think you'd be better off if your client(s) used ftp to transfer the binary stuff as data files (with distinct file names), and then just put references to the file names in the XML stream that gets posted. This way, there's nothing in the XML stream except valid XML, and doing stuff with the binary data will be easier, putting less load on the overall process.

      But if there's no chance of doing it sensibly like that, then you just need to use a while loop for handling more than one <file/>...</file> element in the data, and hope for the best:

      while ( $indata =~ s{<file fiop="([^"]+)" length="(\d+)"/>(.*?)</file>}{}s ) {
          ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
          # do something with $fileData, possibly after checking that
          # $fileLength == length( $fileData ), if that matters to you
      }
      if ( $indata =~ m{<file fiop=|</file>} ) {
          # there's something wrong with the posted data, so it's still
          # not suitable for XML parsing...
      }
      walinsky
      Wouldn't something like that be possible; that would also leave out the implication that there's just one <file/></file> pair

      To make the problem clearer:

      • we have a text like
        <file fiop="fiop_name" length="333333"/>#333K binary chunk goes here#}</file>
      • the "binary chunk" may have any length and may contain any data, possibly (with a lower probability) even the ending tag ... \x02\xc5</file>\x64\xf4  ...
      • per string $string, there is more than one of such sequences <file .../> ...</file> to be expected

      The only chance I'd see here would be to advance to the start of data, extract the data by substr($string, pos($string), $length) and update the string's pos($string) behind the data: pos($string) += $length. At that point, it could be checked for the expected ending tag </file>. All this happens in a while loop under /g until no more <file> blocks can be found.

      Could the above text describe problem and solution?
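The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `extract_file_blocks` and the hash keys are made up for the example, the `length` attribute is assumed to be a plain decimal byte count, and the sketch simply stops at the first block whose closing tag is not where the length says it should be.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every <file .../>DATA</file> block in $string,
# trusting the length attribute instead of scanning the data for </file>.
sub extract_file_blocks {
    my ($string) = @_;
    my @blocks;
    my $close = '</file>';

    # /g in scalar context leaves pos($string) just behind each start tag
    while ( $string =~ m{<file \s+ fiop="([^"]+)" \s+ length="(\d+)" \s* />}gx ) {
        my ( $fiop, $length ) = ( $1, $2 );
        my $start = pos($string);

        # give up if the declared length runs past the end of the string
        last if $start + $length + length($close) > length($string);

        # extract the data by direct string copy
        my $data = substr $string, $start, $length;

        # the expected ending tag must sit directly behind the data
        last unless substr( $string, $start + $length, length($close) ) eq $close;

        push @blocks, { fiop => $fiop, length => $length, data => $data };

        # resume the search behind the closing tag
        pos($string) = $start + $length + length($close);
    }
    return @blocks;
}

# Example input: the first chunk deliberately contains the ending tag
my $binary = "\x02\xc5</file>\x64\xf4" x 3;
my $string = 'head<file fiop="a" length="' . length($binary) . '"/>' . $binary . '</file>'
           . 'mid<file fiop="b" length="3"/>xyz</file>tail';

for my $block ( extract_file_blocks($string) ) {
    printf "%s: %d bytes\n", $block->{fiop}, length $block->{data};
}
```

Because the data is copied with `substr` and never matched by the regex engine, neither the quantifier limit nor a stray `</file>` inside the chunk affects the result.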

      Regards

      mwa

      Replying to myself, mwah and graff

      I'm handling POSTed data, actually I'm reverse engineering .Mac services. This is why I'm not afraid the 'length' information can't be trusted; programmers from Cupertino wouldn't fool themselves _that_ much.
      Also, as far as I've seen, there's never been more than 1 <file></file> pair. It would just have been the cherry on the cake to take that chance out.
      As
      if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*?)</file>}{}s ) {
          ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );

      works (for now), but I definitely want to rule out the chance that the binary data contains </file>, so I'd just like to optimize the regex.
      I _do_ know it can be done with substr, but (knowing, though not completely understanding, the power of regex) I just wondered if back-referencing the length within the regex would be possible.
      Is something like:
      if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.{$2})</file>}{}s ) {
          ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
      possible ?
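For what it's worth, a plain `.{$2}` is interpolated when the pattern is compiled, so it sees whatever `$2` held before the match started, not the length just captured. One way to defer that, sketched here as an assumption rather than a recommendation, is a postponed subexpression `(??{ ... })`, which builds its sub-pattern at match time. The generated `.{N}` is still subject to the quantifier limit discussed above, so this only helps for small chunks; the sample payload is made up for the demonstration.

```perl
use strict;
use warnings;

my $payload = "\x02\xc5</file>\x64\xf4";    # sample data containing the ending tag
my $indata  = 'pre<file fiop="foo" length="' . length($payload) . '"/>'
            . $payload . '</file>post';

my ( $fiop, $fileLength, $fileData );

# (??{ ... }) runs during matching, so $2 is the length captured just before it;
# (?s: ... ) lets "." match newlines inside the generated sub-pattern
if ( $indata =~ s{<file fiop="([^"]+)" length="(\d+)"/>((??{ sprintf '(?s:.{%d})', $2 }))</file>}{} ) {
    ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
    printf "%s: declared %d, got %d bytes\n", $fiop, $fileLength, length $fileData;
}
```

For chunks of several hundred kilobytes the substr approach remains the practical choice.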

      update:
      As regex matching is limited to a given quantifier length (64K or so), we decided to assume there's only one occurrence of a <file/> node, so we can match greedily:
      if ( $indata =~ s{<file fiop="([^"]+)" length="([^"]+)"/>(.*)</file>}{}s ) {
          ( $fiop, $fileLength, $fileData ) = ( $1, $2, $3 );
      (matching the final occurrence of </file>)
