http://qs321.pair.com?node_id=516706

crenz has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm looking for an efficient way to search for all occurrences of a string in a binary file. The problem is: Since the file is binary, I need to read it in blocks rather than lines. This way, however, it is easy to miss occurrences of the string on a block boundary (unless you use a sliding-window approach).

Is there a module that implements a search that takes this into account?

Replies are listed 'Best First'.
Re: Searching in binary files
by Corion (Patriarch) on Dec 14, 2005 at 18:05 UTC

    Use a sliding window:

    my @lines;
    $/ = \80; # assume a block size of 80
    while (<$file>) {
        push @lines, $_;
        my $str = join "", @lines;
        if ($str =~ /searchword/) {
            my $loc = tell $file - pos $str;
            print "Found a match starting after $loc.\n";
        };
        if (@lines > 2) { shift @lines };
    };

    Instead of looking for a match in just one "line", you look for a match in the "line" and the "line" after it.

      Thanks to Frankfurt!

      There are a few bugs in your code, though. I changed the if to a while loop and fixed a few other problems:

      while ($str =~ /searchword/g) {
          my $loc = tell($fh) - length($str) + pos($str);
          print "Found a match starting after $loc.\n";
      }

      Update: That doesn't quite work either... it finds too many occurrences...

        while ($str =~ /$pattern/g) {
            my $loc = tell($fh) - length($str) + pos($str) - length($pattern);
            print "Found a match starting after $loc.\n";
        }
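The "finds too many occurrences" problem above comes from searching the same joined blocks more than once. A minimal sketch of one way around it, assuming a literal search string rather than a regex: carry only the last `length($needle) - 1` bytes across each block boundary, so a complete match can never sit entirely inside the carried-over tail. The helper name `find_all_offsets` and the default block size are assumptions for illustration, not code from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the byte offset of every
# occurrence of the literal string $needle in the file handle $fh, reading
# $block_size bytes at a time.
sub find_all_offsets {
    my ($fh, $needle, $block_size) = @_;
    die "empty search string" unless length $needle;
    $block_size ||= 65536;                 # assumed default block size

    my $keep   = length($needle) - 1;      # bytes carried across a boundary
    my $buffer = '';
    my $base   = 0;                        # file offset of $buffer's first byte
    my @offsets;

    while (1) {
        my $read = read($fh, my $chunk, $block_size);
        die "read failed: $!" unless defined $read;
        last if $read == 0;                # end of file
        $buffer .= $chunk;

        my $pos = 0;
        while (($pos = index($buffer, $needle, $pos)) >= 0) {
            push @offsets, $base + $pos;
            $pos++;                        # step by one: overlapping matches count
        }

        # Drop everything except the tail that could still start a match
        # spanning the next block boundary.
        if (length($buffer) > $keep) {
            my $drop = length($buffer) - $keep;
            substr($buffer, 0, $drop, '');
            $base += $drop;
        }
    }
    return @offsets;
}
```

Called as `find_all_offsets($fh, "searchword")` on a handle opened with `'<:raw'`, this keeps at most one block plus `length($needle) - 1` bytes in memory. A regex search would need a different overlap rule, since the maximum match length is not known in advance.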
Re: Searching in binary files
by jdporter (Paladin) on Dec 14, 2005 at 18:05 UTC

    You could do this:

    local $/ = "find this string";
    while (<FILE>) {
        if ( chomp ) {
            # you know you found an occurrence
        }
    }

    We're building the house of the future together.
      That is rather a nice idea, except: whether one really could do this depends on file size, and whether the process happens to reach either the target string or end-of-file before it runs out of memory.

      That looks rather nice, but unfortunately it is also rather slow... I'm dealing with files that could potentially be hundreds of megabytes in size.
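For reference, the record-separator idea can also report where each occurrence starts. This is a minimal sketch, assuming a literal search string; the helper name `offsets_via_separator` is hypothetical. It relies on `chomp` returning the number of characters removed, and on `tell` giving the position just past the separator that ended the record.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): use $needle as the input record
# separator and report the offset at which each occurrence starts.
sub offsets_via_separator {
    my ($fh, $needle) = @_;
    die "empty search string" unless length $needle;
    local $/ = $needle;                    # each "line" now ends at $needle
    my @offsets;
    while (my $record = <$fh>) {
        # chomp returns the number of characters removed, so it is true
        # only when the record really ended with the separator.
        push @offsets, tell($fh) - length($needle) if chomp $record;
    }
    return @offsets;
}
```

The memory caveat raised above still applies: each record is held in memory whole, so a file with few or no occurrences is read in very large pieces. Overlapping occurrences are also not reported, because the separator is consumed once it has been matched.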

Re: Searching in binary files
by BrowserUk (Patriarch) on Dec 14, 2005 at 19:22 UTC

    You could adapt this:

    #! perl -slw
    use strict;

    open I, '<:raw', $ARGV[0] or die $!;
    my $regex = $ARGV[1] or die 'No search pattern supplied.';

    my $o = 0;
    my $buffer;
    ## Read into the buffer after any residual copied from the last chunk
    while( my $read = read I, $buffer, 4096, pos( $buffer )||0 ) {
        while( $buffer =~ m[$regex]gc ) {
            ## Print the offset, the matched text plus (following) context
            print $o + $-[0], ':', substr $buffer, $-[0], 100;
        }
        ## Slide the unsearched remainder to the front of the buffer.
        substr( $buffer, 0, pos( $buffer ) ) = substr $buffer, pos( $buffer );
        $o += $read; ## track the overall offset.
    }
    close I;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Searching in binary files
by jdporter (Paladin) on Dec 15, 2005 at 15:51 UTC

    You don't say what kind of file it is; so, if you do know anything about the structure of the file, you might do well to take a look at Tie::MmapArray. It works well for files which are strictly arrays of C-struct type data records; in general, such things look like binary to perl. Then you can simply iterate over the array of structs, and test the various fields for your pattern. You may even know which of the fields might and might not contain what you're searching for.

Re: Searching in binary files
by pileofrogs (Priest) on Dec 14, 2005 at 19:01 UTC

    This is probably an uncool suggestion, but you could always pipe your data through strings if you're on a unixy system.

Re: Searching in binary files
by GrandFather (Saint) on Dec 15, 2005 at 02:54 UTC

    What do you actually want to do? Check that the string exists? Count the number of occurrences? Find a string that matches some pattern? Find a prefix string and extract some trailing text?

    A neat way to perform some of those searches is:

    local $/ = "the string to match";
    while (<fileHandle>) {
        #do stuff with the "line" in $_
        #chomp will remove "the string to match";
    }

    DWIM is Perl's answer to Gödel

      That works well until the one occurrence of the string is close to the end of your n GB file.


Re: Searching in binary files
by zentara (Archbishop) on Dec 15, 2005 at 12:32 UTC
    How about a "sniffing dog" (or fingerprint) approach? Break your search string into small fragments, each long enough to be fairly unique. Then search for those fragments in small sliding chunks of the big file. If a fragment is found, check whether the adjacent fragments are there too. If the fragment is found near the beginning or end of the sliding chunk, load in the appropriate adjacent chunk and retest.

    I'm not really a human, but I play one on earth. flash japh

      Sorry zentara, I don't quite follow. What's the advantage in searching for several smaller bits in the buffer over searching for one bigger bit?


        The advantage is that the regex you are running over the sliding binary chunks will be smaller, and (I think) therefore faster. So I figured it would be more efficient on big files. If it's a really big file, most chunks won't match, so why test a regex string of length 1000, when a test on length 250 will give a quick "sniff", indicating whether you should dig deeper at that general area.

        But as always, I defer to your greater experience and wisdom. :-)
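The "sniff first, then dig deeper" idea can be sketched for a single in-memory buffer as below. The helper name `sniff_offsets` and the choice of using the first `$sniff_len` bytes of the search string as the fragment are assumptions for illustration. Whether this is actually faster than one `index` call on the full string is not established in this thread and would need benchmarking.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): look for a short prefix of
# $needle first, and compare the full $needle only where the prefix hits.
sub sniff_offsets {
    my ($buffer, $needle, $sniff_len) = @_;
    die "empty search string" unless length $needle;
    die "sniff length must be at least 1" unless $sniff_len >= 1;

    my $sniff = substr($needle, 0, $sniff_len);   # the short "scent"
    my @offsets;
    my $pos = 0;
    while (($pos = index($buffer, $sniff, $pos)) >= 0) {
        # Dig deeper: verify the whole search string at this position.
        push @offsets, $pos
            if substr($buffer, $pos, length $needle) eq $needle;
        $pos++;
    }
    return @offsets;
}
```

A prefix hit near the end of the buffer fails the full comparison here, so in a chunked reader the tail of each chunk would still have to be carried into the next one before retesting.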

