PerlMonks  

Pattern matching in binary mode

by punchcard_don (Beadle)
on Mar 19, 2004 at 12:40 UTC ( [id://337977]=perlquestion )

punchcard_don has asked for the wisdom of the Perl Monks concerning the following question:

Magniferous Monks,

Related to another thread, I need to do simple pattern matching and replacing on a file in binary mode (because it's a binary file!).

I'm looking for, for example, '77777' (decimal), which is '37 37 37 37 37' (hex), or '0011 0111 0011 0111 0011 0111 0011 0111 0011 0111' (binary).

How does one construct a simple pattern matching and replacing regex in binmode?

Another wrinkle - in binary files, is there such a thing as 'lines', requiring that we worry about matching over two lines?

Thanks.

Replies are listed 'Best First'.
Re: Pattern matching in binary mode
by Abigail-II (Bishop) on Mar 19, 2004 at 13:07 UTC
    It's important to realize that "binary files" and "text files" are things humans find important. For a computer, there's no difference. It just sees the byte 0x37, and that has no further meaning. Humans might interpret that as the character '7' though.

    "binmode" is only relevant for some OSes, and has to do with translation of end-of-line markers.

    Hence, whether you want to replace the ASCII character 7, or the byte 0x37, you'd do the same.

    s/7/this is a seven/;
    and
    s/\x37/this is a seven/;
    will do the same thing.
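
A quick way to check this for yourself is sketched below. It is only an illustration: the variable names and the sample bytes around the sevens are made up, not taken from the original question.

```perl
use strict;
use warnings;

# Two spellings of the same bytes: 0x00, five 0x37 bytes ('7'), 0xFF.
my $by_char = "\x00" . "77777" . "\xFF";
my $by_hex  = "\x00\x37\x37\x37\x37\x37\xFF";

# The same substitution, once written as a character and once as a hex escape.
(my $char_out = $by_char) =~ s/7/[seven]/;
(my $hex_out  = $by_hex)  =~ s/\x37/[seven]/;

print "inputs identical\n"  if $by_char  eq $by_hex;
print "outputs identical\n" if $char_out eq $hex_out;
```

Both substitutions replace only the first 0x37 byte, since neither uses the g modifier.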

    Abigail

Re: Pattern matching in binary mode (I/O)
by tye (Sage) on Mar 19, 2004 at 17:15 UTC
    Related to another thread, I need to do simple pattern matching and replacing on a file in binary mode (because it's a binary file!).

    The main difference will not be in the replacing, but in the reading and writing. Most substituting can be done by reading one line, substituting, writing, repeat. This allows very large text files to be processed quickly (no allocating huge buffers to hold the entire file contents or the file just being too big to even fit in memory).

    For a binary file, you could get a similar process quite easily with $/ = \4096;, which would cause <IN> to read a 4096-byte chunk each time. Unfortunately, '77777' could end up with the first two characters at the end of one buffer and the last three characters at the beginning of the next buffer (for example), so s/77777/.../ would fail to substitute that case.
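
The boundary problem can be reproduced on a small scale. The sketch below assumes an 8-byte record size and an in-memory file handle purely so the example stays short; the data is invented so that '77777' straddles the boundary.

```perl
use strict;
use warnings;

# Hypothetical data: '77777' starts at offset 6, so it straddles an 8-byte boundary.
my $data = ("\x00" x 6) . "77777" . ("\x00" x 5);

open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";
binmode($in);

my $hits = 0;
{
    local $/ = \8;    # fixed-size records of 8 bytes instead of lines
    while (my $chunk = <$in>) {
        $hits++ while $chunk =~ /77777/g;
    }
}
close($in);

# Matching against the whole string for comparison.
my $whole_hits = () = $data =~ /77777/g;
print "chunked: $hits, whole: $whole_hits\n";
```

The chunked loop sees "77" at the end of one record and "777" at the start of the next, so it never matches, while the whole-string match finds the pattern once.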

    If your binary files are small enough to fit into memory (preferably fit into physical memory but fitting into virtual memory may still be 'fast enough'), then you can just slurp the whole file into a single scalar quite easily (using a 'slurp' module or setting $/ to undef, etc.).
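
For the slurp case, a minimal sketch might look like the following. The helper name slurp_subst, the temporary files, and the sample bytes are all assumptions made for illustration; only the undef $/ idiom comes from the text above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as raw bytes, substitute, write a new file.
sub slurp_subst {
    my ($infile, $outfile, $regex, $repl) = @_;
    open(my $in, '<:raw', $infile) or die "Cannot read $infile: $!";
    my $data = do { local $/; <$in> };    # undef $/ reads everything at once
    close($in);
    $data = '' unless defined $data;      # an empty file yields undef
    my $count = ($data =~ s/$regex/$repl/g) || 0;
    open(my $out, '>:raw', $outfile) or die "Cannot write $outfile: $!";
    print {$out} $data;
    close($out) or die "Cannot close $outfile: $!";
    return $count;
}

# Usage on a throwaway file containing two runs of five 0x37 bytes:
my ($fh, $src) = tempfile(UNLINK => 1);
binmode($fh);
print {$fh} "\x00\x0177777\x0A\xFF77777";
close($fh);
my (undef, $dst) = tempfile(UNLINK => 1);

my $n = slurp_subst($src, $dst, qr/\x37{5}/, "SEVENS");
print "replaced $n occurrence(s)\n";
```

Because the whole file is one scalar, no match can be split across reads; the cost is memory proportional to the file size.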

    If your binary files are too big, then things get trickier. Probably the most general solution is to use a sliding window. Pick a string length that you are pretty sure is longer than any substring that you'll run into that matches your pattern:

    sub binSubst {
        my( $infile, $outfile, $regex, $repl, $maxlen, $bufsize )= @_;
        binmode($infile);
        binmode($outfile);
        $bufsize ||= 16*1024;
        my $buf= '';
        # Read the next chunk, appending to any left-over bytes:
        while( sysread( $infile, $buf, $bufsize, length($buf) ) ) {
            my( $out, $pos )= ( '', 0 );
            # Substitute by hand so the offsets refer to the unmodified buffer:
            while( $buf =~ /$regex/g ) {
                $out .= substr( $buf, $pos, $-[0]-$pos ) . $repl;
                $pos= $+[0];
            }
            # How much to write out, unless...
            my $end= length($buf)-$maxlen;
            # ... we matched after that point and so
            # should write up to the end of the last match:
            $end= $pos if $end < $pos;
            # Write out what we can, removing it from the buffer:
            print $outfile $out, substr($buf,$pos,$end-$pos);
            substr($buf,0,$end,'');
        }
        # Write out any left overs:
        print $outfile $buf;
    }

    - tye        

      Appreciated this comment, as it's one of the few useful things Google returned when searching for "perl sliding window string replace". I used it to make something similar that uses substr() instead of a regex, and bumps the buffer up if the search string is larger than the window size.

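      # note: only lightly tested, ymmv
      use IO::Handle;    # method calls on file handles (needed before Perl 5.14)

      sub sliding_replace {
          my( $srcfile, $dstfile, $search, $replace ) = @_;
          die "Search string must not be empty\n" unless length $search;
          if ( ! -e $srcfile ) {
              die "File [$srcfile] does not exist\n";
          }
          open( my $src, '<:raw', $srcfile ) or die "Cannot read [$srcfile]: $!\n";
          open( my $dst, '>:raw', $dstfile ) or die "Cannot write [$dstfile]: $!\n";
          my $winsize = 4096;
          # Bump the window up if the search string is larger than it:
          $winsize = length($search) if length($search) > $winsize;
          my $buf = '';
          while (1) {
              my $bytecount = $src->sysread( $buf, $winsize*2, length($buf) );
              die "Read error on [$srcfile]: $!\n" unless defined $bytecount;
              while (1) {
                  my $index = index( $buf, $search );
                  if ( $index >= 0 ) {    # index() returns -1 when not found; 0 is a hit
                      substr( $buf, $index, length($search), $replace );
                      my $len = $index + length($replace);
                      $dst->print( substr( $buf, 0, $len, '' ) );
                  }
                  else {
                      # No match: keep the last length($search)-1 bytes, which
                      # could be the start of a match completed by the next read
                      my $flush = length($buf) - ( length($search) - 1 );
                      $dst->print( substr( $buf, 0, $flush, '' ) ) if $flush > 0;
                      last;
                  }
              }
              last if $bytecount == 0;
          }
          # print any leftovers
          $dst->print($buf);
          $src->close();
          $dst->close();
      }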
Re: Pattern matching in binary mode
by tachyon (Chancellor) on Mar 19, 2004 at 13:42 UTC

    In addition to the above: with a binary file you have a stream of bytes. Some of those will be what we would call newline chars (10, 0xA, \012) if we had a text file, so if you do:

    open(F, '<', $file) or die $!;
    binmode F;
    while (<F>) {
        # $_ will contain a randomish length of string
        # which simply depends on where the newline chars fall
    }

    If you don't want to find a pattern like \032\012\032 which would appear in two different reads from <F> then you have no issue. Typically you read binary files using read and ask for X number of bytes to be read, but for your purposes <F> should probably be fine. If you do need to match strings that contain \012 then you will need to read and buffer. This is more complex.
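
To make the difference between line-oriented reads and read() concrete, here is a small sketch. The sample bytes, the in-memory file handles, and the 4-byte block size are assumptions chosen to keep the output short.

```perl
use strict;
use warnings;

# Hypothetical data: 15 bytes, two of which happen to be 0x0A.
my $data = "\x1A\x0A\x1A" . ("\x00" x 10) . "\x0A\x7F";

# Line-oriented reading splits wherever a 0x0A byte happens to fall.
open(my $by_line, '<', \$data) or die "Cannot open in-memory file: $!";
binmode($by_line);
my @line_lengths = map { length($_) } <$by_line>;
close($by_line);

# read() hands back the requested number of bytes until the data runs out.
open(my $by_block, '<', \$data) or die "Cannot open in-memory file: $!";
binmode($by_block);
my @block_lengths;
while (my $got = read($by_block, my $block, 4)) {
    push @block_lengths, $got;
}
close($by_block);

print "line reads:  @line_lengths\n";
print "block reads: @block_lengths\n";
```

Here the pattern \032\012\032 at the start of the data is split between the first two line reads, which is the case described above where <F> is not enough.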

    Don't forget to binmode your output handle as well.

    cheers

    tachyon

Re: Pattern matching in binary mode
by matija (Priest) on Mar 19, 2004 at 13:16 UTC
    In general, Perl doesn't care if the string you're doing matches over contains funny characters or not - and it doesn't care if the string you're searching for contains such characters.

    As the previous poster said, if you can write out the values in the \xNN notation, you're all set.

    There is no concept of lines in regexp per se, but if you will be using the "." wildcard, it is better if you put the s qualifier at the end of the search - that way, . will match any newlines that might be in the data. As long as you don't also add the m qualifier, ^ and $ will match at the beginning and the end of the string rather than at a "pretend" newline that happens to be somewhere in the string. Note that $ also matches just before a trailing newline; use \z if you need the true end of the string.
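
The effect of the s qualifier on "." can be seen in a short sketch like this one. The byte values are invented for the example, and the \z comparison is included only to show how $ treats a trailing 0x0A byte.

```perl
use strict;
use warnings;

# Hypothetical record: a marker byte, two arbitrary bytes (one is 0x0A), a marker byte.
my $data = "\x01\x0A\x42\x02";

my $without_s = $data =~ /\x01..\x02/  ? 1 : 0;   # "." refuses to match 0x0A
my $with_s    = $data =~ /\x01..\x02/s ? 1 : 0;   # "." matches any byte

# "$" also matches just before a trailing newline; \z matches only at the very end.
my $tail   = "\x37\x0A";
my $dollar = $tail =~ /\x37$/  ? 1 : 0;
my $z      = $tail =~ /\x37\z/ ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
print "\$ matches: $dollar, \\z matches: $z\n";
```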

Re: Pattern matching in binary mode
by Hofmator (Curate) on Mar 19, 2004 at 12:58 UTC
    Does s/\x37{5}/whatever/g work?

    -- Hofmator

Node Type: perlquestion [id://337977]
Approved by Tomte