
Did you ever settle upon a solution?

For grins, I just ran a test that looked up 1000 randomly generated 10-digit telephone numbers (nnn-nnn-nnnn) in a flatfile database containing approximately 6.6% (2e6 / 3e7) of the 1e10 numbers:

c:\test>572961
9991230061
9991230061 is not found
9991230062
9991230062 is found
9991230063
9991230063 is not found
Terminating on signal SIGINT(2)

c:\test>perl -wle"printf qq[%03d%03d%04d\n], int( rand 1000 ), int( rand 1000 ), int( rand 10000 ) for 1 .. 1e3" | perl 572961.pl >nul
File for area code '000' not found at 572961.pl line 12, <STDIN> line 57.
999 trials of lookup (32.287s total), 32.319ms/trial

Each lookup takes around 32 ms, which ought to be quick enough for most purposes.

The disk files (for all 999 possible area codes) require 10 GB, though that could trivially be reduced to 2.5 GB. Each area code is stored in a separate file, with one line of 10,000 characters for each of the 999 subarea codes, and each byte in a line represents a single telephone number as a simple '0' or '1'.
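
For anyone wanting to build a file in that layout, here is a minimal sketch. The helper names (make_line, write_area_file) are made up for illustration, and the explicit "\r\n" is an assumption inferred from the 10002-byte stride used in the lookup code: 10,000 flag bytes plus a 2-byte line ending. It also assumes numbers and subareas are counted from 1, as the lookup does.

```perl
use strict;
use warnings;

# Hypothetical helper: build one subarea line of 10,000 '0'/'1' flags.
# The flag for number $no sits at offset $no - 1 (1-based layout, assumed).
sub make_line {
    my @numbers = @_;                  # 4-digit numbers present in this subarea
    my $line = '0' x 10_000;
    substr( $line, $_ - 1, 1 ) = '1' for @numbers;
    return $line;
}

# Hypothetical helper: write the file for one area code.
# %$listed maps a subarea (1 .. 999) to an array ref of its listed numbers.
sub write_area_file {
    my( $path, $listed ) = @_;
    open my $fh, '>', $path or die "Cannot write '$path': $!";
    binmode $fh;                       # we emit the 2-byte line ending ourselves
    for my $subarea ( 1 .. 999 ) {
        print {$fh} make_line( @{ $listed->{ $subarea } || [] } ), "\r\n";
    }
    close $fh or die "Cannot close '$path': $!";
}
```

Calling write_area_file( './tele/999', { 123 => [ 62 ] } ) would produce a file of 999 records of 10,002 bytes each, with only 999-123-0062 flagged.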

The lookup process is:

Split the number into its 3 component parts (nnn-nnn-nnnn).
Open the appropriate area code file.
Seek to the appropriate subarea line and read it.
substr the appropriate byte of the line; its value tells you whether the number is 'found' or 'not found'.
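
As a worked example of those steps, the sketch below computes where one of the numbers from the session above lives on disk. The helper name locate is made up for illustration; the arithmetic is the same as in the script that follows.

```perl
use strict;
use warnings;

# Hypothetical helper: map a 10-digit number to (file name, seek position,
# offset within the subarea line). Returns an empty list for invalid input.
sub locate {
    my( $number ) = @_;
    my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$]
        or return;
    return ( $area, ( $subarea - 1 ) * 10_002, $no - 1 );
}

my( $file, $seek, $offset ) = locate( '9991230062' );
print "file=$file seek=$seek offset=$offset\n";
# prints: file=999 seek=1220244 offset=61
```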

Care to trade 10 MB (or 2.5 MB) of disk space per area code for a 32 ms lookup time, regardless of how the application grows?
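
How the smaller files would be packed is not spelled out here. One possible approach, offered only as a sketch, is to store one bit per number using Perl's built-in vec. The helper names and the RECORD_BYTES constant are made up, and the record format (fixed 1,250-byte binary records with no line ending) is an assumption, so the resulting sizes need not match the figure quoted above.

```perl
use strict;
use warnings;

# Assumed record size: 10,000 one-bit flags, no line ending.
use constant RECORD_BYTES => 10_000 / 8;

# Hypothetical helper: pack a list of 4-digit numbers (1-based) into one record.
sub pack_record {
    my $record = "\0" x RECORD_BYTES;
    vec( $record, $_ - 1, 1 ) = 1 for @_;
    return $record;
}

# Hypothetical helper: look a number up in an open file of packed records.
# Returns 1 or 0, or nothing if the record could not be read.
sub is_listed {
    my( $fh, $subarea, $no ) = @_;
    seek( $fh, ( $subarea - 1 ) * RECORD_BYTES, 0 ) or return;
    my $record;
    my $got = read $fh, $record, RECORD_BYTES;
    return unless defined $got and $got == RECORD_BYTES;
    return vec( $record, $no - 1, 1 );
}
```

The lookup keeps the same shape as the byte-per-number version: one seek, one fixed-size read, one test. The files should be opened with binmode, since the records are binary data.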

#! perl -slw
use strict;
use Benchmark::Timer;

my $T = Benchmark::Timer->new;


while( my $number = <STDIN> ) {
    chomp $number;
    $T->start( 'lookup' );
    if( my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$] ) {
        ## One file per area code
        open my $fh, '<', "./tele/$area"
            or warn "File for area code '$area' not found" and next;
        ## Each subarea record is 10,000 flag bytes plus 2 bytes of line ending
        seek $fh, ( $subarea - 1 ) * 10002, 0;
        my $mask = <$fh>;
        ## The flag for number $no is the '0' or '1' at offset $no - 1
        print "$number is ", ( substr $mask, ( $no - 1 ), 1 )
            ? 'found' : 'not found';
    }
    else {
        print "Invalid telephone number: $number";
    }
    $T->stop( 'lookup' );
}

$T->report;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Searching text files by BrowserUk
in thread Searching text files by SteveS832001
