Search hex string in very large binary file

westrock2000 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Search hex string in very large binary file by davido (Cardinal) on Feb 07, 2015 at 00:04 UTC

> BTW: index may be a good alternative to regex

I agree with LanX: you're searching for a specific sequence, not a pattern. There is no need to fire up the regex engine to search for something that isn't a pattern. `index` is a good start.

I don't know anything about the MV4 file format, but wouldn't the string you're searching for be in a header near the beginning of the file? That may also simplify your search.

Dave
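If the flag does turn out to live near the start of the file, the search can stop early. The following is only a sketch of that idea: the sub name `find_in_head` and the `$limit` cut-off are hypothetical, and nothing in this thread confirms where the bytes actually sit.

```perl
use strict;
use warnings;

# Search only the first $limit bytes of $file for the byte string $sig.
# Returns the offset of the first match, or -1 if it is not found there.
# ($limit is an assumed cut-off, not something taken from the file format.)
sub find_in_head {
    my ($file, $sig, $limit) = @_;
    open my $in, '<:raw', $file or die "Cannot open $file: $!";
    my $head = '';
    # read() may return fewer bytes than requested, so loop until we have
    # $limit bytes or reach end of file.
    while (length($head) < $limit) {
        my $got = read($in, $head, $limit - length($head), length($head));
        die "Read error on $file: $!" unless defined $got;
        last if $got == 0;
    }
    close $in;
    return index($head, $sig);
}
```

A call might look like `find_in_head($path, pack('H*', '68647664'), 1024 * 1024)`, where `$path` and the 1 MB limit are placeholders. A result of -1 only says the bytes are not in that leading region, not that they are absent from the file.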
Re: Search hex string in very large binary file by LanX (Saint) on Feb 06, 2015 at 23:46 UTC

general answers:

> 1. How can I read the file from disk?

Use a sliding-window technique: you only need to hold at least twice the length of the searched string in memory. See the length argument of `read`. Chunks that are multiples of 4 KB seem reasonable, though.

> And then 2nd how would I do a regular expression for binary instead of searching for text?

Strings are just binaries; you just need to convert¹ your hex to them.

BTW: `index` may be a good alternative to regex

Cheers Rolf

PS: Je suis Charlie!

¹) e.g. `DB<115> join "", map {chr} 0x20,0x41,0x42 => " AB"`

See also `pack` for a direct approach: `DB<123> pack 'H*', '204142' => " AB"`
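To make the hex conversion and the `index` versus regex point concrete, here is a small sketch. The helper name `hex_to_bytes` and the sample data are made up for illustration; only `pack 'H*'`, `index`, and `\Q...\E` are standard Perl.

```perl
use strict;
use warnings;

# Convert a hex string such as '20 41 42' into the bytes it denotes.
sub hex_to_bytes {
    my ($hex) = @_;
    $hex =~ tr/0-9A-Fa-f//cd;      # delete spaces and any other non-hex characters
    die "Odd number of hex digits" if length($hex) % 2;
    return pack 'H*', $hex;
}

my $needle   = hex_to_bytes('00 41 2e 42');    # the four bytes "\0A.B"
my $haystack = "xx\0AzB yy\0A.B zz";

# index() compares bytes literally.
my $by_index = index($haystack, $needle);

# A regex works too, but the needle must be quoted with \Q...\E; otherwise
# the "." byte (0x2e) acts as a wildcard and matches the "z" at offset 4.
my $by_regex = $haystack =~ /\Q$needle\E/ ? $-[0] : -1;
my $unquoted = $haystack =~ /$needle/     ? $-[0] : -1;

print "index: $by_index, quoted regex: $by_regex, unquoted regex: $unquoted\n";
# prints "index: 9, quoted regex: 9, unquoted regex: 2"
```

The unquoted regex reports a false match at offset 2, which is one practical reason to prefer `index` for a fixed byte sequence.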
Re: Search hex string in very large binary file by BrowserUk (Patriarch) on Feb 07, 2015 at 05:03 UTC

Try this (I was bored:):

    #! perl -slw
    use strict;

    # Buffer size in units of 4 KB; override with -BUFN=n (uses the -s switch).
    our $BUFN //= 1024; $BUFN *= 4096;
    # Signature as hex bytes; override with -SIG="..."
    our $SIG //= '68 64 76 64 00 00 00 11 64 61 74 61 00 00 00 15 00 00 00 00 02 00 00 00';
    $SIG =~ tr[ ][]d;           # strip the spaces
    $SIG = pack 'H*', $SIG;     # hex digits to raw bytes

    open my $in, '<:raw', $ARGV[0] or die $!;

    my( $offset, $buffer ) = ( 0, '' );
    # Each read appends to the tail kept from the previous pass, so a
    # signature that straddles two reads is still found.
    while( sysread( $in, $buffer, $BUFN, length $buffer ) ) {
        my $pos = 1+index( $buffer, $SIG );
        if( $pos ) {
            print "Found signature at offset: ", $offset + $pos - 1;
            exit;
        }
        $offset += length( $buffer ) - length( $SIG );
        $buffer = substr $buffer, - length $SIG;
    }
    close $in;
    print "Signature not found";
    __END__
    02/02/2015  15:42    10,737,418,241 big.csv

    C:\test>junk71 -BUFN=4096 -SIG="52 5f d7 58 22 0d 0a 61 68 73 68 77 65 2c 38 30 33 37 31 37 38 35 2c 46" big.csv
    Found signature at offset: 1073741817

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
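The script above stops at the first hit. A variant that collects every offset is sketched below as a sub, so that it can be tried on a small file where the signature straddles a chunk boundary. This is not BrowserUk's code: the name `find_all` and the `$chunk` parameter are hypothetical, and it keeps one byte less than the signature length between reads so that no match is reported twice.

```perl
use strict;
use warnings;

# Return the offsets of every occurrence of $sig in $file, reading the file
# in pieces of $chunk bytes ($chunk is an assumed tuning parameter).
sub find_all {
    my ($file, $sig, $chunk) = @_;
    die "Empty signature" unless length $sig;
    my $keep = length($sig) - 1;    # longest tail that can still begin a match
    open my $in, '<:raw', $file or die "Cannot open $file: $!";
    my ($base, $buffer, @hits) = (0, '');
    while (1) {
        # Append the next chunk to whatever tail was kept from the last pass.
        my $got = sysread($in, $buffer, $chunk, length $buffer);
        die "Read error on $file: $!" unless defined $got;
        last if $got == 0;
        my $pos = 0;
        while (($pos = index($buffer, $sig, $pos)) >= 0) {
            push @hits, $base + $pos;   # $base is the file offset of $buffer's first byte
            $pos++;                     # step by one so overlapping matches are reported too
        }
        # Drop everything except the tail that the next chunk could complete.
        my $drop = length($buffer) - $keep;
        if ($drop > 0) {
            $base += $drop;
            substr($buffer, 0, $drop, '');
        }
    }
    close $in;
    return @hits;
}
```

Because the kept tail is shorter than the signature, a complete match can never sit entirely inside it, which is what prevents duplicates across reads.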
Re: Search hex string in very large binary file by GrandFather (Saint) on Feb 07, 2015 at 07:23 UTC

Take a look at Video::Dumper::QuickTime. I wrote it when I needed to pull apart MP4 files, and it's designed to have new metatag decoders plugged into it. Be warned: it's quite a few years since I last looked at it, and the documentation may not be as good now as I thought it was when I understood the code!

Perl is the programming world's equivalent of English
Re: Search hex string in very large binary file by Anonymous Monk on Feb 07, 2015 at 14:42 UTC

Just to point out an issue with your approach: if you are simply looking for a specific sequence of 24 bytes within up to 20,000,000,000 bytes, what about false positives? To avoid them, you'd actually have to parse the file and only look in the appropriate places for that flag. Doing that yourself would mean a lot of reading specs and writing code, so it really is best to use an existing tool.

You're in luck! Someone actually submitted a patch for MP4::Info to add support for the HDVD tag: https://rt.cpan.org/Public/Bug/Display.html?id=101016

There's a quick & really dirty way to patch the module on your system (you'll probably need to do this as root):

    wget -nv https://rt.cpan.org/Ticket/Attachment/1444239/767837/0001-add-support-for-HDVD-tag.patch -O- | patch `perldoc -l MP4::Info`

However, a somewhat cleaner way would be to patch the module before installation:

    # in the shell:
    $ cd /tmp
    $ wget http://www.cpan.org/authors/id/J/JH/JHAR/MP4-Info-1.13.tar.gz
    $ tar xzf MP4-Info-1.13.tar.gz
    $ cd MP4-Info-1.13/
    $ wget -nv https://rt.cpan.org/Ticket/Attachment/1444239/767837/0001-add-support-for-HDVD-tag.patch -O- | patch

... and then install to a local module repository separate from your system's modules. For example, see the instructions under "I don't have permission to install a module on the system!" in A Guide to Installing Modules.
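For a sense of what "parse the file" involves, the sketch below walks only the top-level atoms of an MP4-style file, assuming the usual layout of a 4-byte big-endian size followed by a 4-byte type. It is not how MP4::Info does it, the sub name `top_level_atoms` is made up, and it does not descend into container atoms, which is where a tag like the one discussed here would have to be looked up.

```perl
use strict;
use warnings;

# List the top-level atoms of $file as [type, offset, size] triples.
# Assumes each atom starts with a 32-bit big-endian size and a 4-byte type.
sub top_level_atoms {
    my ($file) = @_;
    open my $in, '<:raw', $file or die "Cannot open $file: $!";
    my $file_size = -s $in;
    my ($offset, @atoms) = (0);
    while ($offset + 8 <= $file_size) {
        seek($in, $offset, 0) or die "Seek failed: $!";
        read($in, my $header, 8) == 8 or die "Short read at offset $offset";
        my ($size, $type) = unpack 'N a4', $header;
        if ($size == 1) {                 # a 64-bit size follows the type field
            read($in, my $large, 8) == 8 or die "Short read at offset $offset";
            $size = unpack 'Q>', $large;  # needs a Perl built with 64-bit integers
        }
        elsif ($size == 0) {              # atom runs to the end of the file
            $size = $file_size - $offset;
        }
        die "Implausible atom size $size at offset $offset" if $size < 8;
        push @atoms, [$type, $offset, $size];
        $offset += $size;                 # jump over the payload without reading it
    }
    close $in;
    return @atoms;
}
```

Since the payloads are skipped with `seek`, even a 20 GB file is walked in a handful of small reads. A real solution would still have to recurse into the relevant containers, which is the part an existing module saves you from writing.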
Re^2: Search hex string in very large binary file by BrowserUk (Patriarch) on Feb 07, 2015 at 15:03 UTC

It's a point; but I wonder how many .mv4s you'd have to search before you found "hdvd" & "data" separated by exactly 4 bytes that wasn't part of the required 24 bytes?

To clarify: in totally random data, there are 256**24 (6.2771e+57) permutations of 24 bytes. A 20GB file has 21474836473 sets of 24 bytes. So the odds of one of them being a false hit are 3.4211e-48 (0.00000000000000000000000000000000000000000000034211%).

And every restriction on those bytes increases the odds.

Pretty good odds that any hit is a good one.
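The arithmetic can be checked with a few lines of Perl. This is a sketch under the same assumption of uniformly random data. The sub name `false_hit_odds` is made up, and "20GB" is taken to mean 20 * 2**30 bytes. The window count it derives (file size minus signature length plus one) differs slightly from the count quoted in the post, which does not change the result at this precision.

```perl
use strict;
use warnings;

# Expected number of chance matches of a $sig_bytes-long signature in
# $file_bytes of uniformly random data.
sub false_hit_odds {
    my ($file_bytes, $sig_bytes) = @_;
    my $windows  = $file_bytes - $sig_bytes + 1;   # possible start positions
    my $patterns = 256 ** $sig_bytes;              # distinct byte strings of that length
    return $windows / $patterns;
}

# 24-byte signature in a 20 * 2**30 byte file
printf "%.4e\n", false_hit_odds(20 * 2**30, 24);
```

The printed value agrees with the 3.4211e-48 quoted in the post to the precision shown.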
Re^3: Search hex string in very large binary file by Anonymous Monk on Feb 07, 2015 at 15:57 UTC

Agreed! The point was also meant to be more general about the selection of the solution: personally, my Plan A would be "see if there's a module to do it 'right'", and Plan B would be "meh, I'll just grep the whole file", not the other way around (as the OP seems to imply).
Re^4: Search hex string in very large binary file by BrowserUk (Patriarch) on Feb 07, 2015 at 16:18 UTC
Re^5: Search hex string in very large binary file by Anonymous Monk on Feb 07, 2015 at 19:56 UTC
Some notes below your chosen depth have not been shown here

