Faster way to parse binary stream

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a script to read and interpret a binary file, which consists of a variable number of variable length strings.
There are count values embedded in the file for the number of strings, and the length of each string, e.g (artificial example):

 
NumberOfStrings (1 byte)
Str1Length (1 byte) Str1Characters (Str1Length bytes)
Str2Length (1 byte) Str2Characters (Str2Length bytes)
...
StrNLength (1 byte) StrNCharacters (StrNLength bytes)
[download]

The script itself contains something like this:

# Create an example byte-sequence
$count = 3;
$str1 = "First string";
$str2 = "Second string";
$str3 = "Third string";
$bytes = pack("C C A* C A* C A*", $count, length($str1),
   $str1, length($str2), $str2, length($str3), $str3);

unpack("C", $count);
$bytes = substr($bytes, 1);
foreach $i (1 .. $count) {
   $length = unpack("C", $bytes);
   $str = unpack("xA$length", $bytes); # null byte skips length
   $bytes = substr($bytes, $length + 1);
   print "$str\n";
}
[download]

My question is - how can avoid the constant substr calls to remove the bytes I have already read?. I know I could use the "C/A*" template to read the length and string in one go, but that doesn't handle the variable number of strings.
In the real-life example, there may be millions of strings, and there may actually be other data types (ints) interspersed. I do actually read the binary file in chunks, so the actual number of bytes that substr is acting on is relatively small, but there must be a more efficient way? Can I have a pointer into the byte stream and call unpack on that? Maybe some of the modules that allow a file to be accessed as a scalar variable might help? Note: I did also try replacing the substr with a variant of the unpack, but it didn't seem to improve performance, and seemed to cause extra bytes to be consumed?: ($str, $bytes) = unpack("x C/A* A*), $bytes) Thanks!

Comment on Faster way to parse binary stream Select or Download Code

Replies are listed 'Best First'.
Re: Faster way to parse binary stream by BrowserUk (Patriarch) on Jun 19, 2007 at 19:50 UTC
If the entire string (after the first 'no of strings' byte) is made up length/string pairs, then (subject to using an unenbalmed version of Perl) let unpack take care of it. Ignore/skip the first byte and use the ()repeat pattern to deal with it. Eg: `$count = 3; $str1 = "First string"; $str2 = "Second string"; $str3 = "Third string"; $bytes = pack("C C A* C A* C A", $count, length($str1), $str1, length($str2), $str2, length($str3), $str3);; print for unpack 'x(c/a)', $bytes;; First string Second string Third string` [download] If there is other stuff following the strings, then you'll need to unpack the first byte and then generate the pattern: `$n = unpack 'C', $bytes;; print for unpack 'x (c/a)' . $n, $bytes;; First string Second string Third string` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: Faster way to parse binary stream by ikegami (Patriarch) on Jun 19, 2007 at 20:29 UTC
You can even combine the two calls to `unpack`: `my @strings = unpack 'c/(c/a*)', $bytes;` [download]	[reply] [d/l] [select]
Re^3: Faster way to parse binary stream by BrowserUk (Patriarch) on Jun 19, 2007 at 21:29 UTC
Cool++ Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^2: Faster way to parse binary stream by Roy Johnson (Monsignor) on Jun 19, 2007 at 20:29 UTC
I hadn't seen the slash template item before. To quote from the docs: The / template character allows packing and unpacking of strings where the packed structure contains a byte count followed by the string itself. You write length-item/string-item. Good solution, BrowserUK. Caution: Contents may have been coded under pressure.	[reply]
Re^3: Faster way to parse binary stream by Anonymous Monk on Jun 19, 2007 at 21:27 UTC
Thanks to all who replied - the "C/(C/A)" looks like the way to go :)	[reply]
Re: Faster way to parse binary stream by ikegami (Patriarch) on Jun 19, 2007 at 19:50 UTC
Instead of removing parsed chars from the strings, I'd use `pos` to keep track of where you are in the string. That way, you can intermix the use regexps and `substr` to extract data from the packed string. `for ($packed) { # alias $_ = $packed; pos = 0; /\G (.) /xgc or die; my $count = unpack('C', $1); for my $i (1..$count) { /\G (.) /xgc or die; # Extract data using a re my $length = unpack('C', $1); my $str = substr($_, pos, $length); # Extract data using substr pos($_) += $length; # Don't forget to upd pos push @strings, $str; } # Make sure there's nothing extra at the end. /\G \z /xgc or die; }` [download] Another advantage to this method is that you can break your parser down into multiple functions. `sub extract_string { /\G (.) /xgc or die; my $length = unpack('C', $1); my $str = substr($_, pos, $length); pos($_) += $length; return $str; } sub parse { for ($_[0]) { # alias $_ = $_[0]; pos = 0; /\G (.) /xgc or die; my $count = unpack('C', $1); my @strings; for my $i (1..$count) { push @strings, extract_string(); } # Make sure there's nothing extra at the end. /\G \z /xgc or die; return @strings; } } my @strings = parse($packed);` [download] The final advantage is backtracking. Since the the string isn't being destroyed, parts of it can be re-parsed. Update: Added advantages and second snippet.	[reply] [d/l] [select]

Back to Seekers of Perl Wisdom