PerlMonks  

Faster way to parse binary stream

by Anonymous Monk
on Jun 19, 2007 at 19:16 UTC (#622081=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a script to read and interpret a binary file, which consists of a variable number of variable length strings.
There are count values embedded in the file for the number of strings and the length of each string, e.g. (artificial example):

    NumberOfStrings (1 byte)
    Str1Length (1 byte)
    Str1Characters (Str1Length bytes)
    Str2Length (1 byte)
    Str2Characters (Str2Length bytes)
    ...
    StrNLength (1 byte)
    StrNCharacters (StrNLength bytes)
The script itself contains something like this:

    # Create an example byte sequence
    $count = 3;
    $str1 = "First string";
    $str2 = "Second string";
    $str3 = "Third string";
    $bytes = pack("C C A* C A* C A*", $count,
                  length($str1), $str1,
                  length($str2), $str2,
                  length($str3), $str3);

    # Read the string count, then remove it from the front of the buffer
    $count = unpack("C", $bytes);
    $bytes = substr($bytes, 1);

    foreach $i (1 .. $count) {
        $length = unpack("C", $bytes);
        $str = unpack("xA$length", $bytes);    # "x" skips the length byte
        $bytes = substr($bytes, $length + 1);  # drop what has been read
        print "$str\n";
    }

My question is: how can I avoid the constant substr calls to remove the bytes I have already read? I know I could use the "C/A*" template to read the length and string in one go, but that doesn't handle the variable number of strings.

In the real-life example, there may be millions of strings, and there may actually be other data types (ints) interspersed. I do read the binary file in chunks, so the number of bytes that substr is acting on is relatively small, but there must be a more efficient way? Can I have a pointer into the byte stream and call unpack on that? Maybe some of the modules that allow a file to be accessed as a scalar variable might help?

Note: I did also try replacing the substr with a variant of the unpack, but it didn't seem to improve performance, and it seemed to cause extra bytes to be consumed:

    ($str, $bytes) = unpack("x C/A* A*", $bytes)

Thanks!

Replies are listed 'Best First'.
Re: Faster way to parse binary stream
by BrowserUk (Patriarch) on Jun 19, 2007 at 19:50 UTC

    If the entire string (after the first 'no of strings' byte) is made up of length/string pairs, then (subject to using an unembalmed version of Perl) let unpack take care of it. Skip the first byte and use a repeated () group to handle the pairs. E.g.:

    $count = 3;
    $str1 = "First string";
    $str2 = "Second string";
    $str3 = "Third string";
    $bytes = pack("C C A* C A* C A*", $count,
                  length($str1), $str1,
                  length($str2), $str2,
                  length($str3), $str3);

    print "$_\n" for unpack 'x(c/a*)*', $bytes;

    First string
    Second string
    Third string

    If there is other stuff following the strings, then you'll need to unpack the first byte and then generate the pattern:

    $n = unpack 'C', $bytes;
    print "$_\n" for unpack 'x (c/a*)' . $n, $bytes;

    First string
    Second string
    Third string
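A minimal sketch of the "other stuff following the strings" case, assuming the strings are followed by a 16-bit and a 32-bit big-endian integer. The trailing fields, their values, and the names $flags and $id are made up for illustration. It uses C rather than c so that length bytes above 127 are read as unsigned.

```perl
use strict;
use warnings;

# Hypothetical layout: count byte, length-prefixed strings, then two ints.
my @strings = ('First string', 'Second string', 'Third string');
my $bytes   = pack('C (C/a*)*', scalar(@strings), @strings)
            . pack('n N', 7, 622081);

# Read the count first, then build a template that stops after $n strings
# so the remaining items can pick up the trailing integers.
my $n   = unpack 'C', $bytes;
my @got = unpack "x (C/a*)$n n N", $bytes;

# The last two values are the integers; the rest are the strings.
my ($flags, $id) = splice @got, -2;

print "$_\n" for @got;
print "flags=$flags id=$id\n";
```

Because the group has an explicit repeat count instead of *, unpack stops consuming strings at the right place and the rest of the template sees the bytes that follow.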

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      You can even combine the two calls to unpack:

      my @strings = unpack 'c/(c/a*)', $bytes;
      I hadn't seen the slash template item before. To quote from the docs:

      The / template character allows packing and unpacking of strings where the packed structure contains a byte count followed by the string itself. You write length-item/string-item.

      Good solution, BrowserUk.
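A small round-trip sketch of the slash template, assuming one count byte followed by length-prefixed strings as in the question. The sample strings are invented. It also shows one difference worth knowing: on unpack, A strips trailing spaces and nulls from each string, while a returns the bytes exactly as stored.

```perl
use strict;
use warnings;

# Sample data: the first string deliberately ends in two spaces.
my @strings = ('pad  ', 'plain');

# pack: count byte, then each string preceded by its own length byte.
my $bytes = pack 'C (C/a*)*', scalar(@strings), @strings;

# unpack: the leading C supplies the repeat count for the group.
my @exact   = unpack 'C/(C/a*)', $bytes;   # bytes returned untouched
my @trimmed = unpack 'C/(C/A*)', $bytes;   # trailing spaces/nulls stripped

printf "a*: [%s]\n", join '|', @exact;     # a*: [pad  |plain]
printf "A*: [%s]\n", join '|', @trimmed;   # A*: [pad|plain]
```

Both templates consume the same number of bytes, since the length byte decides how much is read; only the returned values differ. If the stored strings can legitimately end in whitespace, a is the safer choice.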


      Caution: Contents may have been coded under pressure.
        Thanks to all who replied - the "C/(C/A*)*" looks like the way to go :)
Re: Faster way to parse binary stream
by ikegami (Patriarch) on Jun 19, 2007 at 19:50 UTC

    Instead of removing parsed chars from the string, I'd use pos to keep track of where you are in the string. That way, you can intermix the use of regexps and substr to extract data from the packed string.

    for ($packed) {   # alias $_ = $packed;
        pos = 0;

        # /s lets . match any byte, including "\n" (a count or length of 10)
        /\G (.) /xsgc or die;
        my $count = unpack('C', $1);

        for my $i (1..$count) {
            /\G (.) /xsgc or die;                # Extract data using a regexp
            my $length = unpack('C', $1);

            my $str = substr($_, pos, $length);  # Extract data using substr
            pos($_) += $length;                  # Don't forget to update pos

            push @strings, $str;
        }

        # Make sure there's nothing extra at the end.
        /\G \z /xgc or die;
    }

    Another advantage to this method is that you can break your parser down into multiple functions.

    sub extract_string {
        # /s lets . match any byte, including "\n" (a length of 10)
        /\G (.) /xsgc or die;
        my $length = unpack('C', $1);

        my $str = substr($_, pos, $length);
        pos($_) += $length;

        return $str;
    }

    sub parse {
        for ($_[0]) {   # alias $_ = $_[0];
            pos = 0;

            /\G (.) /xsgc or die;
            my $count = unpack('C', $1);

            my @strings;
            for my $i (1..$count) {
                push @strings, extract_string();
            }

            # Make sure there's nothing extra at the end.
            /\G \z /xgc or die;

            return @strings;
        }
    }

    my @strings = parse($packed);

    The final advantage is backtracking. Since the string isn't being destroyed, parts of it can be re-parsed.

    Update: Added advantages and second snippet.
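The same keep-a-position idea can also be sketched with unpack alone, which is close to the "pointer into the byte stream" the question asks about. This is only an illustration under assumptions: the $offset cursor, the read_* helper names, and the trailing 32-bit integer are all hypothetical. It relies on the @ template item, which moves unpack to an absolute position in the string.

```perl
use strict;
use warnings;

# Example stream: count byte, length-prefixed strings, then a 32-bit int.
my @input = ('First string', 'Second string', 'Third string');
my $bytes = pack('C (C/a*)*', scalar(@input), @input) . pack('N', 622081);

my $offset = 0;    # hypothetical cursor; $bytes itself is never modified

# Read one unsigned byte at the cursor and step past it.
sub read_u8 {
    my $value = unpack '@' . $offset . ' C', $bytes;
    $offset += 1;
    return $value;
}

# Read a length-prefixed string at the cursor and step past both parts.
sub read_string {
    my $str = unpack '@' . $offset . ' C/a*', $bytes;
    $offset += 1 + length $str;
    return $str;
}

# Read a 32-bit big-endian integer at the cursor and step past it.
sub read_u32 {
    my $value = unpack '@' . $offset . ' N', $bytes;
    $offset += 4;
    return $value;
}

my $count   = read_u8();
my @strings = map { read_string() } 1 .. $count;
my $trailer = read_u32();

print "$_\n" for @strings;
print "$trailer\n";
```

Since nothing is removed from $bytes, resetting $offset to an earlier value re-parses from that point, which gives the same backtracking property as the pos-based version. Whether this is faster than substr for a given chunk size would need to be measured.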

Node Type: perlquestion [id://622081]
Approved by wazoox
Front-paged by Roy Johnson