Re: Best way to look-up a string amongst 100 million of its peers

When in doubt, use brute force!

The usual way of going about this is to fan the list out into many smaller files, and then identify the right file when you search.

The following code will split out a list of 250 000 words between 1 and 24 characters long into a three level directory structure in about 5 seconds on my machine. Words shorter than 3 are collected in a file at the top level of subdirs. For the rest, the words are stored in one file in the leaf directory.

#! /usr/local/bin/perl -w

use strict;
use warnings;

use File::Path 'mkpath';

my %fd;
my $root = 'cache';

$|=1;
while (<>) {
    chomp;
    my @arr = split //, $_, 4;
    my $fd;
    if (@arr < 4) {
        if (!$fd{1}{$arr[0]}) {
            my $dir = "$root/$arr[0]";
            mkpath $dir;
            open $fd{1}{$arr[0]}, '>', "$dir/list.txt" or die "open $d
+ir/list.txt";
        }
        $fd = $fd{1}{$arr[0]};
    }
    else {
        if (!$fd{n}{$arr[0]}{$arr[1]}{$arr[2]}) {
            my $dir = "$root/$arr[0]/$arr[1]/$arr[2]";
            mkpath $dir;
            open $fd{n}{$arr[0]}{$arr[1]}{$arr[2]}, '>', "$dir/list.tx
+t" or die "open $dir/list.txt";
        }
        $fd = $fd{n}{$arr[0]}{$arr[1]}{$arr[2]};
    }
    print $fd "$_\n";
    print "." unless $. % 5000;
}
[download]

This is a bit wasteful; it would be better to hoist all the 'list' files up into the parent directory, but you get the idea (hey! it's free code). Also, if all your words are longer than the fan-out then matters are simplified.

It is then a simple matter of programming to take a word, figure out what file you should be looking at, and then see if the word found or not.

If you play around with the right fan-out so that no file is particularly big, you can slurp it into memory, and treat it as a string. Then you search for /\b$target\b and the regexp engine will run a fast Boyer-Moore search on the string. I have used newlines to separate the words, but there is nothing stopping you from using \x0 (binary zero) or some other sentinel.

If your words can use the all the byte patterns from 0 to 255 you'll have to do something a bit smarter.

You can also play with a cache of slurped files, and keep the last n files opened on hand. If there is some locality of reference in the search space, you'll save on throwing away a slurped file just to reopen it again on the next word.

• another intruder with the mooring in the heart of the Perl

Comment on Re: Best way to look-up a string amongst 100 million of its peers Download Code


Perl Monk, Perl Meditation
	PerlMonks