http://qs321.pair.com?node_id=400708

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm looking for wisdom on the following searching/matching problem: I have two files, each containing some records (say, one string per line). I'd like to go through the second file and find out whether each record is in the first file. One easy way to do it is to load the first file into a hash, then just iterate through the second. My problem is that the files are huge, and putting the whole thing in memory is pushing the limits. Is there a less resource-demanding approach? Of course, speed is also a great concern. (The files currently contain about 10 million records each, but they're growing. If the whole thing can be done within a few hours to a day, that would be fine.) Thanks.
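
For reference, a minimal sketch of the in-memory hash approach described above might look like this (the file names, the chomp step, and the 1/0 output format are illustrative assumptions, not part of the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($file1, $file2) = ('file1', 'file2');    # placeholder file names

    # Load every record of the first file into a hash.
    my %seen;
    open my $fh1, '<', $file1 or die "Can't open $file1: $!";
    while (my $line = <$fh1>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $fh1;

    # For each record of the second file, print 1 if it is in the first file, else 0.
    open my $fh2, '<', $file2 or die "Can't open $file2: $!";
    while (my $line = <$fh2>) {
        chomp $line;
        my $flag = exists $seen{$line} ? 1 : 0;
        print "$flag\t$line\n";
    }
    close $fh2;

With roughly 10 million records, the %seen hash alone can easily consume hundreds of megabytes or more of RAM, which is exactly the memory pressure described above.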

Replies are listed 'Best First'.
•Re: Efficient search through a huge dataset
by merlyn (Sage) on Oct 19, 2004 at 23:58 UTC
      I guess you meant to store the first file as database records in a table, and then for each record in the second file do a select from the table? That will be millions of selects; can a Perl/batch approach be faster? Thanks.

        Millions of selects? Not true at all!

        You can do it with a single query. Say you want to find all records that are in table1 but not in table2; then you can do either:

        select blah from table1 t1 where not exists (select 1 from table2 t2 where t1.blah = t2.blah )

        Or (if MINUS is supported):

        select blah from table1 minus select blah from table2

        It should be easy for you to modify the query a bit to satisfy your needs.

      DBD::SQLite might be the answer in this particular case, but it is slow when you try to insert into a table with an index. (I'm not sure what happens without an index; generally speaking, inserts are slower with an index, while selects are faster with one. That's OT.)

      I compared DBD::SQLite with ODBC, using the same table structure and index: insert 1000 rows, then select. It took ODBC 2 seconds to insert the 1000 rows and 0 (which means less than 1) seconds to select; but it took SQLite 310 seconds to insert (way too long) and 0 seconds to select (which is virtually the same):

      use DBI;
      use Data::Dumper;
      use strict;
      use warnings;

      my $dbh = DBI->connect("dbi:SQLite:dbname=dbfile","","");
      #my $dbh = DBI->connect("dbi:ODBC:everything","","");

      =document
      #$dbh->do('create table table1(col1 number(10), col2 number(10))');
      $dbh->do('create table table1(col1 int, col2 int)');
      $dbh->do('create index index1 on table1(col1)');
      #$dbh->do('create table table2(col1 number(10), col2 number(10))');
      $dbh->do('create table table2(col1 int, col2 int)');
      $dbh->do('create index index2 on table2(col1)');
      =cut

      $dbh->do('delete from table1');
      $dbh->do('delete from table2');

      my $st1 = $dbh->prepare('insert into table1(col1, col2) values(?, ?)');
      my $st2 = $dbh->prepare('insert into table2(col1, col2) values(?, ?)');

      print time, "\n";
      for my $i (1..1000) {
          $st1->execute($i, $i * 2);
          if ($i % 2) {
              $st2->execute($i, $i * 3);
          }
      }
      print time, "\n";

      {
          my $st3 = $dbh->prepare('SELECT t1.col1 FROM table1 t1 LEFT OUTER JOIN table2 t2 ON (t1.col1 = t2.col1) WHERE t1.col1 IS NOT NULL AND t2.col1 IS NULL');
          $st3->execute();
          my $ref3 = $st3->fetchall_arrayref;
          #print Dumper($ref3);
      }
      print time, "\n";

      $dbh->disconnect();
        it took SQLite 310 seconds to insert

        The SQLite docs say what you should do to improve speed: group statements into transactions and, if necessary, adjust the "synchronous" pragma.

        If you add $dbh->do('begin'); before your loop and $dbh->do('commit'); at the end, the insertion will take less than one second, as it should.

        use Time::HiRes qw(gettimeofday tv_interval);

        my $start_time = [gettimeofday];
        $dbh->do('begin');
        for my $i (1..1000) {
            $st1->execute($i, $i * 2);
            if ($i % 2) {
                $st2->execute($i, $i * 3);
            }
        }
        $dbh->do('commit');
        my $elapsed = tv_interval($start_time, [gettimeofday]);
        print "insertion time: $elapsed\n";

        __END__
        insertion time: 0.263462
Re: Efficient search through a huge dataset
by tmoertel (Chaplain) on Oct 20, 2004 at 02:33 UTC
    Although not a Perl solution, if you're on a Unix-like platform, you have access to the standard tool comm, which does exactly what you want, provided that the input files are sorted. Comm can tell you which lines are shared by the files and which are unique to either file. For example, here is how you would find the records that are shared by the files:
    $ comm -12 <(sort -u file1) <(sort -u file2)
    (If the files are already sorted, you can just pass them directly to comm, without first processing them with sort. Here, I'm using the bash shell's <(command) syntax to avoid having to deal with temporary files for holding the sorted records.)

    Here's how to find the records that are unique to the first file:

    $ comm -23 <(sort -u file1) <(sort -u file2)
    Most sort implementations are fast and will use external (file-based) sorting algorithms when the input is large, so you don't need to worry about input size.

    Cheers,
    Tom

Re: Efficient search through a huge dataset
by lhoward (Vicar) on Oct 20, 2004 at 00:16 UTC
    If the files can be stored in sorted order (or you can maintain an index that lets you access them in sorted order quickly, a la B-tree, or you don't mind the overhead of sorting both files before performing the comparison) based on the fields you want to compare, then you can step through the two of them in lock-step fashion, basically like the merge step of the mergesort algorithm. Pseudo-Perl code (based on the assumption that the entire line is the key you want to match):

    open FILE1, "<", "file1" or die $!;
    open FILE2, "<", "file2" or die $!;
    my $key1 = <FILE1>;
    my $key2 = <FILE2>;
    while (!eof(FILE1) && !eof(FILE2)) {
        if ($key1 gt $key2) {
            # do something when you find a key in FILE2 and not in FILE1
            # read a line from FILE2
            $key2 = <FILE2>;
        }
        elsif ($key1 lt $key2) {
            # do something when you find a key in FILE1 and not in FILE2
            # read a line from FILE1
            $key1 = <FILE1>;
        }
        else {
            # found a match, read a line from FILE1 and FILE2
            # this behavior may vary depending on how you want to
            # handle multiple matches, i.e. a given line is in both
            # files more than once
            $key1 = <FILE1>;
            $key2 = <FILE2>;
        }
    }
    close FILE1;
    close FILE2;

    That way, doing the check for common records is as fast as reading each file once, and you never have to hold more than one record from each file in memory at a time.

    L

Re: Efficient search through a huge dataset
by fergal (Chaplain) on Oct 20, 2004 at 09:31 UTC

    After writing everything below it occurred to me that your original solution might work just fine if you used a DB_File tied hash rather than a Perl in-memory hash. Roughly:

    use DB_File;

    my %hash;
    tie %hash, 'DB_File', "first_file.hash";

    # assumes FIRST_FILE and SECOND_FILE are already open on the two big files
    while (<FIRST_FILE>) {
        $hash{$_} = undef;
    }
    while (<SECOND_FILE>) {
        print $_ if exists $hash{$_};
    }

    Below is what I wrote first of all; it's still interesting, as it explains a bit about how to make your own hash.

    Your idea of a hash is correct but as you say, Perl's internal hash is not suitable. Instead, you could build your own hash.

    First you need a hash function. This could be fairly simple; for example, just add up the ASCII values of the characters in each line and then take the remainder mod 256:

    my $h = 0;
    foreach (split //, $line) { $h += ord($_) }
    $h = $h % 256;
    With this you can assign each line to a particular bucket, and hopefully there will be a fairly even spread over the 256 possible buckets (if not, you can use something more complicated for your hash function). Another good one to use is part of a time value: if each line contains a timestamp, then taking the seconds value, or the minutes and seconds together, might give you a fairly even spread.

    The point now is that two lines with differing hash values must be different; there is no need to compare them character by character. If the hash values are the same, you do have to compare the lines character by character. So comparing the hash values first eliminates a large amount of work.

    What you have now is exactly the same as a Perl internal hash, except that Perl's hash buckets are stored in memory while yours are stored on disk. Also, Perl uses a more complicated function to make sure that things are spread evenly around all the buckets.

    There are a couple of different things you can do now.

    You could make two directories, one for each big file. Each directory will have 256 files, one for each bucket. Then you go through the big files, writing each line out into its correct bucket file. You'll be left with 256 pairs of files, something like

    file1/0 file1/1 ... file1/255 file2/0 file2/1 ... file2/255
    Now you need to compare file1/0 and file2/0 to find common lines, then file1/1 and file2/1, and so on. Each bucket file should be about 256 times smaller than the original files, so hopefully each pair is small enough to be compared easily using a normal in-memory Perl hash. If dividing by 256 isn't enough, then you might need a different hash function.
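
    Here is a rough sketch of that bucketing-and-comparing approach (the input file names and the split_into_buckets helper are made up for illustration; the bucket directories file1 and file2 match the listing above):

    use strict;
    use warnings;

    # Split one big file into 256 bucket files under the given directory,
    # using the simple sum-of-ASCII-values hash shown above.
    sub split_into_buckets {
        my ($big_file, $dir) = @_;
        mkdir $dir unless -d $dir;

        # Open all 256 bucket files up front.
        my @bucket;
        for my $n (0 .. 255) {
            open $bucket[$n], '>', "$dir/$n" or die "Can't write $dir/$n: $!";
        }

        open my $in, '<', $big_file or die "Can't read $big_file: $!";
        while (my $line = <$in>) {
            my $h = 0;
            $h += ord($_) for split //, $line;
            print { $bucket[ $h % 256 ] } $line;
        }
        close $in;
        close $_ for @bucket;
    }

    split_into_buckets('big_file_1', 'file1');    # placeholder input file names
    split_into_buckets('big_file_2', 'file2');

    # Then compare each pair of bucket files with an ordinary in-memory hash.
    for my $n (0 .. 255) {
        my %seen;
        open my $fh1, '<', "file1/$n" or die $!;
        while (my $line = <$fh1>) { $seen{$line} = 1 }
        close $fh1;

        open my $fh2, '<', "file2/$n" or die $!;
        while (my $line = <$fh2>) {
            print $line if $seen{$line};    # line appears in both big files
        }
        close $fh2;
    }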

    Another approach is to split just one of the files into buckets and then go through the other file line by line, figuring out which bucket each line would be in and then checking to see if it is there. This will be pretty slow. To speed it up, instead of each bucket being a flat file, you could make each bucket a DB_File file; then you can use DB_File's hash lookup to see if a particular line is there. Basically you're using a two-layered hash, as in the sketch below.
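
    Under the same assumptions (the bucket directory name, input file names, and the 1/0 output format are all made up for illustration), that two-layered variant might look roughly like this:

    use strict;
    use warnings;
    use DB_File;

    # The same sum-of-ASCII-values hash function as above.
    sub bucket_of {
        my ($line) = @_;
        my $h = 0;
        $h += ord($_) for split //, $line;
        return $h % 256;
    }

    # Tie 256 on-disk DB_File hashes, one per bucket.
    mkdir 'buckets' unless -d 'buckets';
    my @bucket;
    for my $n (0 .. 255) {
        $bucket[$n] = {};
        tie %{ $bucket[$n] }, 'DB_File', "buckets/$n.db"
            or die "Can't tie buckets/$n.db: $!";
    }

    # Load the first big file into its buckets...
    open my $first, '<', 'big_file_1' or die $!;
    while (my $line = <$first>) {
        $bucket[ bucket_of($line) ]{$line} = 1;
    }
    close $first;

    # ...then probe with each line of the second file, printing 1 or 0.
    open my $second, '<', 'big_file_2' or die $!;
    while (my $line = <$second>) {
        my $flag = exists $bucket[ bucket_of($line) ]{$line} ? 1 : 0;
        print $flag, "\t", $line;    # $line still ends with its newline
    }
    close $second;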

Re: Efficient search through a huge dataset
by Anonymous Monk on Oct 20, 2004 at 13:50 UTC
    A few hours to a day? Oh my, what kind of hardware are you running? For two 10-million-record files, it takes me less than 8 minutes to find out which records are in the second file but not in the first. Just using shell commands (but using Perl to create the large files):
    $ # Create two large files to work with.
    $ perl -e 'printf "%08d\n", int rand 100_000_000 for 1 .. 10_000_000' > big1
    $ perl -e 'printf "%08d\n", int rand 100_000_000 for 1 .. 10_000_000' > big2
    $ # Sort them, make them unique.
    $ time sort -u big1 > big1.s

    real    4m0.489s
    user    2m4.360s
    sys     0m7.200s

    $ time sort -u big2 > big2.s

    real    3m24.848s
    user    1m55.430s
    sys     0m6.460s

    $ # Report the number of lines that are in the second file, and not in the first.
    $ time comm -13 big1.s big2.s | wc -l
    8611170

    real    0m14.278s
    user    0m12.850s
    sys     0m0.400s
    Total elapsed time: less than 8 minutes, of which almost half is spent in disk I/O. There's no point in using a database for such puny-sized sets.
Re: Efficient search through a huge dataset
by artist (Parson) on Oct 19, 2004 at 23:57 UTC
    I don't have a solution at this point, but let's be sure about exactly what you desire:

  • A: Are you trying to find out whether your second file is a subset of the first file?
  • Or B: Do you (also) want to find the records in the second file which are not in the first file?

      B. Actually for each record in the second file, I'd like to mark it 1/0 depending on whether it's in the first.
Re: Efficient search through a huge dataset
by TedPride (Priest) on Oct 20, 2004 at 06:18 UTC
    The simplest solution (assuming you don't have comm) is to buy a few more blocks of memory and use the hash.