Re: Efficient search through a huge dataset

by fergal (Chaplain)
on Oct 20, 2004 at 09:31 UTC [id://400790]


in reply to Efficient search through a huge dataset

After writing everything below, it occurred to me that your original solution might work just fine if you used a DB_File tied hash rather than a Perl in-memory hash. Roughly:

use DB_File;

my %hash;
tie %hash, 'DB_File', "first_file.hash";

# Record every line of the first file as a key in the on-disk hash.
while (<FIRST_FILE>) {
    $hash{$_} = undef;
}

# Print the lines of the second file that also appeared in the first.
while (<SECOND_FILE>) {
    print $_ if exists $hash{$_};
}

Below is what I wrote first of all; it's still interesting, as it explains a bit about how to make your own hash.

Your idea of using a hash is correct, but as you say, Perl's internal hash is not suitable here. Instead, you could build your own hash.

First you need a hash function. This can be fairly simple; for example, just add up the ASCII values of the characters in each line and then take the remainder mod 256:

my $h = 0;
foreach (split("", $line)) {
    $h += ord($_);
}
$h = $h % 256;
With this you can assign each line to a particular bucket, and hopefully there will be a fairly even spread over the 256 possible buckets (if not, you can use something more complicated for your hash function). Another good choice is part of a time value - if each line contains a timestamp, then taking the seconds value, or the minutes and seconds together, might give you a fairly even spread (see the sketch below).
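For example, here's a minimal sketch of the timestamp idea, assuming purely for illustration that each line begins with an HH:MM:SS timestamp:

# Hypothetical: assumes each line starts with an HH:MM:SS timestamp.
sub bucket_for {
    my ($line) = @_;
    if ($line =~ /^\d\d:(\d\d):(\d\d)/) {
        return $1 * 60 + $2;    # minutes and seconds: buckets 0 .. 3599
    }
    return 0;                   # catch-all bucket for lines with no timestamp
}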

The key point now is that two lines with differing hash values must be different, so there is no need to compare them character by character. Only if the hash values are the same do you have to compare the lines themselves. Comparing the hash values first therefore eliminates a large amount of work.

What you have now is exactly the same as a Perl internal hash, except that Perl's hash buckets are stored in memory while yours are stored on disk. Also, Perl uses a more complicated hash function to make sure that things are spread evenly across all the buckets.

There are a couple of different things you can do now.

You could make 2 directories, one for each big file. Each directory will have 256 files - one for each bucket. Then you go through the big files, writing each line out into its correct bucket file. You'll be left with 256 pairs of files, something like

file1/0 file1/1 ... file1/255
file2/0 file2/1 ... file2/255
Now you need to compare file1/0 and file2/0 to find common lines, then file1/1 and file2/1, and so on. Each bucket file should be about 256 times smaller than the original files, so hopefully they're small enough that they can be compared easily using Perl's internal hash. If dividing by 256 isn't enough then you might need a different hash function. A rough sketch of this approach is below.
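Here is a rough sketch of that first approach. The file and directory names ("first_file", "file1", and so on) are just placeholders, and the simple ASCII-sum hash from above is assumed:

use strict;
use warnings;

sub split_into_buckets {
    my ($big_file, $dir) = @_;
    mkdir $dir unless -d $dir;

    # Open one output file per bucket.
    my @out;
    for my $b (0 .. 255) {
        open $out[$b], '>', "$dir/$b" or die "can't write $dir/$b: $!";
    }

    # Send each line of the big file to its bucket.
    open my $in, '<', $big_file or die "can't read $big_file: $!";
    while (my $line = <$in>) {
        my $h = 0;
        $h += ord($_) for split //, $line;
        print { $out[ $h % 256 ] } $line;
    }
    close $_ for $in, @out;
}

split_into_buckets("first_file",  "file1");
split_into_buckets("second_file", "file2");

# Now each pair of bucket files is small enough for an ordinary in-memory hash.
for my $b (0 .. 255) {
    my %seen;
    open my $f1, '<', "file1/$b" or die $!;
    $seen{$_} = 1 while <$f1>;

    open my $f2, '<', "file2/$b" or die $!;
    while (<$f2>) {
        print if $seen{$_};
    }
}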

Another approach is to split just one of the files into buckets and then go through the other file line by line, figuring out which bucket each line would be in and then checking whether it's actually there. With flat bucket files this will be pretty slow. To speed it up, instead of each bucket being a flat file, you could make each bucket a DB_File file; then you can use DB_File's hash lookup to see if a particular line is there. Basically you're using a 2-layered hash. A sketch follows.
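Here is a minimal sketch of that second approach. Again, "buckets/", "first_file" and "second_file" are placeholder names, and the ASCII-sum hash from above is assumed:

use strict;
use warnings;
use DB_File;

sub bucket_for {
    my ($line) = @_;
    my $h = 0;
    $h += ord($_) for split //, $line;
    return $h % 256;
}

# Tie 256 on-disk DB_File hashes, one per bucket.
mkdir "buckets" unless -d "buckets";
my @buckets;
for my $b (0 .. 255) {
    $buckets[$b] = {};
    tie %{ $buckets[$b] }, 'DB_File', "buckets/$b"
        or die "can't tie buckets/$b: $!";
}

# Pass 1: load every line of the first file into its bucket.
open my $first, '<', "first_file" or die $!;
while (my $line = <$first>) {
    $buckets[ bucket_for($line) ]{$line} = 1;
}

# Pass 2: look each line of the second file up in its bucket.
open my $second, '<', "second_file" or die $!;
while (my $line = <$second>) {
    print $line if exists $buckets[ bucket_for($line) ]{$line};
}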
