http://qs321.pair.com?node_id=572961

SteveS832001 has asked for the wisdom of the Perl Monks concerning the following question:

My company has to use a Do Not Call list that is generated for us by another company; they list all the numbers in text files.

Example
9991112222
9991112223
9991112224

We have 3 area codes that we have lists for, so we have 3 text files with a grand total of a little over 2 million phone numbers.
They want me to create a program with a GUI that searches the text files for the number they want, which I have created. But the program uses a lot of memory and processor time, and takes around 15-20 seconds to fail, which takes the longest since a failed search has to scan the most.
My question is: how do I make this instant, or at least faster, say 2 or 3 seconds? I have tried slurping the files into an array and searching that way, but it only improves things by a second or two.
use Tk;
use Tk::Dialog;
use Win32::GUI;

$count = 1;
my $mw = MainWindow->new;
my $c = $mw->Canvas( -width => 400, -height => 200, -background => "white" );
$c->pack;    # Main window

$map = $mw->Photo( -file => 'bos.gif', -height => 40, width => 400 );
$map_label = $mw->Label( -image => $map );
$map_label->place( -x => 0, -y => 0 );    # Inserts the bank of sullivan image

$quit = $mw->Button( -text => "Quit", -padx => "8", -command => \&Quit );
$quit->place( -x => 350, -y => 180 );     # adds the quit button

$Find = $mw->Button( -text => "Search", -command => \&Search );
$Find->place( -x => 300, -y => 180 );

$lab = $mw->Label( -text => "Enter Phone Number", -background => "white" );
$lab->place( -x => 5, -y => 50 );         # Label

$Entry = $mw->Entry( -textvariable => \$num, -background => "white" );
$Entry->place( -x => 5, -y => 68 );       # Entry

$text = $mw->Text( -height => 5, -width => 55, -wrap => char );
$text->place( -x => 5, -y => 90 );

$labWarning = $mw->Label( -text => 'Example XXX - XXX - XXXX', -background => "white" );
$labWarning->place( -x => 140, -y => 68 );

$index = 1;

sub Quit {
    my $d = $mw->Dialog(
        -text    => "Are you sure you want to quit?",
        -buttons => [ "Yes", "No" ]
    );
    my $answer = $d->Show();
    close(UNLOCK);
    exit if ( $answer eq "Yes" );
}

sub Search {
    $found  = 0;
    @number = split( /-/, $num );
    $file1  = 'f:\donotcall\314.txt';
    $file2  = 'f:\donotcall\573.txt';
    $file3  = 'f:\donotcall\636.txt';
    $file4  = 'f:\donotcall\Cell.txt';
    open( FIRST,  "<$file1" ) or die "Can't open File $!";
    open( SECOND, "<$file2" ) or die "Can't open File $!";
    open( THIRD,  "<$file3" ) or die "Can't open File $!";
    open( Cell,   "<$file4" ) or die "Can't open File $!";
    $text->insert( 0.1, "Searching\n" );
    $mw->update();
    if ( $number[0] == 314 ) {
        while ( my $line = <FIRST> ) {
            if ( $line =~ /$number[0]$number[1]$number[2]/ ) {
                close(FIRST); close(SECOND); close(THIRD); close(Cell);
                $found = 1;
            }
        }
    }
    if ( $number[0] == 573 ) {
        while ( my $line = <SECOND> ) {
            if ( $line =~ /$number[0]$number[1]$number[2]/ ) {
                close(FIRST); close(SECOND); close(THIRD); close(Cell);
                $found = 1;
            }
        }
    }
    if ( $number[0] == 636 ) {
        while ( my $line = <THIRD> ) {
            if ( $line =~ /$number[0]$number[1]$number[2]/ ) {
                close(FIRST); close(SECOND); close(THIRD); close(Cell);
                $found = 1;
            }
            $count++;
            if ( $count == 2000 ) {
                $mw->update();
                $count = 0;
            }
            chomp($line);
            $text->insert( 0.1, "$line " );
        }
    }
    while ( my $line = <Cell> ) {
        if ( $line =~ /$number[0]$number[1]$number[2]/ ) {
            close(FIRST); close(SECOND); close(THIRD); close(Cell);
            $found = 1;
        }
    }
    if ( $found == 1 ) {
        $text->insert( "0.1", "The number ($num) was found\n" );
        $text->insert( "0.2", "DO NOT CALL\n" );
        $text->insert( "0.3", " \n" );
        $text->insert( "0.4", " \n" );
        $text->insert( "0.5", " \n" );
    }
    else {
        $text->insert( "0.1", "The number ($num) was NOT found\n" );
        $text->insert( "0.2", " \n" );
        $text->insert( "0.3", " \n" );
        $text->insert( "0.4", " \n" );
        $text->insert( "0.5", " \n" );
    }
}
MainLoop;

Replies are listed 'Best First'.
Re: Searching text files
by davido (Cardinal) on Sep 14, 2006 at 18:58 UTC

    SQLite would be an excellent database alternative for your needs. 2-million phone numbers isn't all that big of a deal for a lightweight database like SQLite. And it's self contained; all you need is DBI and DBD::SQLite.
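    As a rough sketch of this approach (the table name, column name, and database file name here are mine, not from the post), loading the numbers once into an indexed SQLite table makes each lookup a primary-key query:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# One-time setup: load the numbers into an indexed SQLite table.
# 'dnc.db' and the dnc/phone names are illustrative.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=dnc.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );
$dbh->do('CREATE TABLE IF NOT EXISTS dnc (phone TEXT PRIMARY KEY)');

my $ins = $dbh->prepare('INSERT OR IGNORE INTO dnc (phone) VALUES (?)');
$ins->execute($_) for qw(9991112222 9991112223 9991112224);  # normally read from the files

# Per-search: a primary-key lookup, effectively instant.
my ($hit) = $dbh->selectrow_array(
    'SELECT 1 FROM dnc WHERE phone = ?', undef, '9991112223' );
print $hit ? "DO NOT CALL\n" : "ok to call\n";
$dbh->disconnect;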

    But I wonder if even a lightweight database is needed. What you could use is a series of area-code specific flat files. Maintain each flat file in sorted order. Put one phone number per line in the file, so that each line is the same length. That gives you a simple starting point for doing binary searches. If each line is twelve characters long (11 digit phone plus newline), you can use seek to read from offsets within the file.

    The idea behind a binary search is to start at the halfway point in the file. If that record is too 'high', look at the midpoint of the first half; if it is too 'low', look at the midpoint of the second half, and so on. There are lots of examples on the net of binary search algorithms. The good thing about a binary search is that it takes O(log n) average time, whereas a linear search (what you're doing now) takes O(n) time, and it avoids the complexity of a database.

    ...just another thought. But honestly, have we forgotten our computer science 101 classes? A binary search of a fixed-width-record flat file is "how it's done" programmatically. Using a database is convenient, but it shouldn't be the only option considered. As for slurping flat files, that doesn't scale well, and with 2 million (and growing) records, you need to think about that.


    Dave

        Seems almost like overkill when it can be done in just a few lines. The function below searches three million records very quickly.
        # Usage: record_search($search_string, $filename, 11);
        # Returns the record number if the record is found, undef otherwise.
        # Searched file must be sorted.
        sub record_search {
            my ($key, $filename, $recordsize) = @_;
            my $fh = IO::File->new($filename, 'r')
                or croak("Could not open `$filename': $!");
            my $hi = ($fh->stat)[7] / $recordsize;
            my $lo = 0;
            while ($lo < $hi) {
                my $mid = int(($lo + $hi) / 2);
                $fh->seek($mid * $recordsize, 0);
                my $target = $fh->getline();
                chomp($target);
                if    ($target < $key)  { $lo = $mid + 1 }
                elsif ($target == $key) { return $mid }
                else                    { $hi = $mid }
            }
        }
        Update: Forgot to mention that even though this function opens the file every time, it can successfully search 3 million records about 2000 times per second on a 3 GHz Pentium 4. I can't see much justification for anything more complicated. Threads? Bit strings? Crazy! :)
Re: Searching text files
by ikegami (Patriarch) on Sep 14, 2006 at 19:39 UTC

    There is a lot you can do to make your code better, as shown below. Note that this will not make your code faster.

    • Localize your variables (using my).
    • Localize file handles too, after making them lexicals. (Now that they're scoped, close is no longer needed.)
    • Use strict to make sure your variables are localized.
    • Use warnings to detect errors.
    • Don't open four files when you only need to open one or two.
    • Indent properly.
    • Use the safer 3-arg open.
    • Don't hardcode the area codes.
    • Don't use regexp to search for strings. Use index for a great speed boost.
    use strict;
    use warnings;
    use File::Spec qw( );
    use IO::Dir    qw( );

    my $dir  = 'E:\\Documents and Settings\\ebrine';
    my $cell = File::Spec->catfile($dir, 'Cell.txt');

    my %area_codes =
        map { /(\d+)/; $1 => File::Spec->catfile($dir, $_) }
        grep /^\d+\.txt\z/i,
        IO::Dir->new($dir)->read();

    my $num;

    ... Tk code ...

    sub search {
        my $found = 0;

        my @number      = split(/-/, $num);
        my $search_text = join '', @number;

        my $file = $area_codes{$number[0]};
        if (defined $file) {
            open(my $fh, '<', $file)
                or die("Unable to open file for areacode $number[0]: $!\n");
            $found = search_file($fh, $search_text);
        }

        if (not $found) {
            open(my $fh, '<', $cell)
                or die("Unable to open file for cellphones: $!\n");
            $found = search_file($fh, $search_text);
        }

        ... Show results ...
    }

    sub search_file {
        my ($fh, $search_text) = @_;
        while (<$fh>) {
            return 1 if index($_, $search_text) >= 0;
            ... Update ticker ...
        }
        return;
    }
Re: Searching text files
by rminner (Chaplain) on Sep 14, 2006 at 18:29 UTC
    Different recommendations:
    1. Only open a file if you want to read it; then you won't have to close a handle you never needed to open in the first place.
    2. If you want to keep the data in the text files and parse them every time, consider File::Slurp, which is the fastest way to slurp I know of.
    3. If you want it really fast, have a look at the first chapter of Programming Pearls by Jon Bentley. The basic idea is to use a bit string (initialized with zeroes) and toggle to 1 (plain old binary OR) the bit which represents each telephone number. As telephone numbers are unique, and you don't have any data associated with them, it's a valid approach for checking the existence of a number. If you want, you can create that bit string once and only read it in afterwards (you could use Storable, or plain spewing using File::Slurp). This should drastically reduce your memory consumption, and speed should be lightning fast. Update: using Bit::Vector::Array would probably be the simplest way to achieve the bit string (simply use the telephone number as an index). Like that you would only need one op for a lookup (no searching needed, just check whether that index of the bit vector is 1 or 0). Initialization of the bit vector would be just as easy: simply toggle the bits with the correct index to 1.
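    As a core-Perl sketch of the bit-string idea (no modules needed), the built-in vec function can index one bit per 7-digit local number; the sample numbers here stand in for reading an area-code file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One bit string per area code: bit N is 1 when local number N
# (the 7 digits after the area code) is on the do-not-call list.
# Covering all 10**7 local numbers would take ~1.2 MiB.
my $bits = '';

for my $local (qw(1112222 1112223 1112224)) {   # normally read from e.g. 314.txt
    vec($bits, $local, 1) = 1;                  # set the bit for this number
}

# Lookup is a single bit test:
print vec($bits, 1112223, 1) ? "DO NOT CALL\n" : "ok to call\n";
```

    The string could be written out once with Storable or File::Slurp, as suggested above, and slurped back at startup.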

    Hope this gives you some ideas.

    Cheers Roland

      #3 is the idea I like most

      I don't know much about american phone numbers, but if they all have a fixed length of 10, you'd just need slightly more than 1GB disk space to store one bit for each existing number.

      I wouldn't create this bit vector in memory. Just create a big enough file, initialized with zeros and then go through your text file and position with fseek to $phone_num >> 3 and set bit number $phone_num & 7.

      do the same positioning for read access, but check the bit.

      I think searching will be done in less than a second.

      Update: Of course you can couple this with the idea of splitting for each area code. This should reduce the summed size of your three files to 1/333 (about 4MB) if the area code has 3 numbers.

      Update #2: If you have 10 digits in each phone number and 2 million numbers, you already have 21 MB of disk space used. So the bit vector on disk will save you 16 MB.
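      A minimal sketch of this on-disk bit-file idea in core Perl, using seek to position at byte $num >> 3 and testing bit $num & 7 (the file name 'dnc.bits' and the use of 7-digit local numbers, one bit file per area code, are my assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bitfile = 'dnc.bits';    # hypothetical per-area-code bit file

# Set the bit for one listed number (run once per number when building).
sub set_bit {
    my ($num) = @_;
    open my $fh, '+<', $bitfile or open $fh, '+>', $bitfile or die $!;
    binmode $fh;
    seek $fh, $num >> 3, 0;                       # byte holding this number's bit
    my $byte = '';
    read $fh, $byte, 1;                           # '' past EOF; ord('') == 0
    $byte = chr( ord($byte) | ( 1 << ( $num & 7 ) ) );
    seek $fh, $num >> 3, 0;
    print $fh $byte;
    close $fh;
}

# Same positioning for read access, but check the bit instead.
sub is_listed {
    my ($num) = @_;
    open my $fh, '<', $bitfile or die $!;
    binmode $fh;
    seek $fh, $num >> 3, 0;
    my $byte = '';
    read $fh, $byte, 1;
    close $fh;
    return ord($byte) & ( 1 << ( $num & 7 ) ) ? 1 : 0;
}

set_bit(1112223);
print is_listed(1112223) ? "DO NOT CALL\n" : "ok to call\n";
```

    A lookup is one seek plus a one-byte read, so it stays fast no matter how many numbers are listed.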


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        Update: I misread your post: your 4 MiB are for 3 area codes, thus the same result (1.2 MiB per area code). Being able to read clearly is an advantage. Sorry.

        As you stated, the most practical approach would be to split it by area code. The point where I disagree is that you think it would eat up 4 MB of space per area code.
        rminner@Rosalinde:~$ bc
        bc 1.06
        Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
        This is free software with ABSOLUTELY NO WARRANTY.
        For details type `warranty'.
        obase=1024
        10^7-1
         0009 0549 0639
        last/8
         0001 0196 0719
        I get 10 million bits (minus 1) for 7 digits. That would mean roughly 1.2 MiB, not 4 MiB. Depending on the amount of memory available, you could load only a limited number of area codes. Like this the data structure for all do-not-call numbers in one area code should be just a little more than those 1.2 MiB. Thus 5 area codes would only eat up 6 MiB, and as I said, a lookup would be instantaneous (from a user perspective) as it requires only checking one bit. One could allocate a limited number of slots for area codes and free them using whatever replacement algorithm one prefers (for example LRU or LFU). Loading should also be fast using File::Slurp, as directly slurping 1 MiB into memory using sysread should be really fast when DMA is active. (You could also seek directly in the file, as stated by skeeve; reducing it to a single seek statement is also possible, keeping memory consumption even lower and requiring just a single hd seek.)
        The caching of the bit string could be done very easily. Simply store the bit string in a file with the same name but, for example, the extension .bin. Afterwards set the same mtime for the .bin file as for the .txt file. Later, if the mtimes are identical, you can use your precomputed bit string; if the mtimes differ, the txt file has been modified, and the .bin file can be recreated from scratch (which also shouldn't take more than 1-2 seconds). Like this your data would always be up to date using plain .txt files, but speed should still be more or less instantaneous.
Re: Searching text files
by hgolden (Pilgrim) on Sep 14, 2006 at 18:14 UTC
    A database would be better. If you wanted to stick with text files, you could create subfiles that contained all the numbers with a given exchange (i.e. the first three numbers after the area code). You can then only open the file that matches that area-code and exchange, and you'll do much less searching.
Re: Searching text files
by Tanktalus (Canon) on Sep 14, 2006 at 18:24 UTC

    Combining the above two great ideas, I would suggest not just a straightforward database; I would also split up the numbers into area code, exchange, and extension. (And, if you want, country code, but it sounds like you're just dealing with the US, so this would always be '1'.) This would allow the database to set up indexes on area code and exchange for faster lookups. You do have to tell the database to create the indexes, but then you get the fast lookups.

      Off Topic a Bit

      Area codes, exchanges and extensions are US/Canada/Caribbean concepts. Very few other areas use them. Germany, for example, uses over four thousand prefixes to represent local dialing zones, but they are not analogous to exchanges nor to area codes (and are of variable lengths). If you design your software with 3-3-4 in mind, your software isn't going to work anywhere else in the world except maybe Switzerland (who use the NA system internally). My own telephone number is 11 digits, for instance, and that's NOT including the country code. I know of people whose entire number is only 6 digits.

      I would say that indexing based on the full CLI is the best, and most portable.

      ---
      $world=~s/war/peace/g

Re: Searching text files
by runrig (Abbot) on Sep 14, 2006 at 18:06 UTC
    Is there anything preventing you from using a database? It would be faster than scanning large text files.
      No, there is not. I was just trying to find the best way to do this, since my way sucks.
        If you are just looking for exact matches, then a non-SQL database like DB_File (the tied-hash interface) is probably adequate; if not, then a SQL database using 1 table, 1 column, and an index. Then of course you have to develop processes to update the database and keep in sync with the text files.
        There is also File::SortedSeek which I have no experience with, but might be "good enough" if the files are sorted, and would save you from keeping the database in sync with the text files.
        It appears that you are installing and running this on each desktop. This is highly inefficient; a client-server model would be a lot better, as it would decrease the workload on the desktop machines.
        The simplest way of doing this is using a database, and you get the performance benefits of server-side caching and indexes.
        Your current code appears to do a straight substring match, similar to a SQL LIKE. Indexes are not as good at handling these kinds of searches because the whole string is not matched. If you are matching a complete string, an index is very fast.
        If you need fast lookups on partial numbers you might look at either reverse indexes that sort based on the digits in reverse order or preprocessing the numbers and extracting sequences of digits and then indexing them.

        UnderMine
Re: Searching text files
by zentara (Archbishop) on Sep 14, 2006 at 20:51 UTC
    Hi, I modified your code to use threads, and I used numbers1..4 as the filenames, which I filled with
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $file = shift or die;
    open (FH, ">$file") or die;
    for (1 .. 1000) {
        my $first  = sprintf('%03d', int rand 999);
        my $second = sprintf('%03d', int rand 999);
        my $third  = sprintf('%04d', int rand 9999);
        print FH "$first-$second-$third\n";
    }
    for testing.

    Now here is a way to search them for a single number, which stops all the threads when a number is found and reports the line number. Your Tk code could use some minor fixing, but that is up to you. There is also a lot of room for code cleanup and opportunities for ingenious juggling of the return data, like searching certain threads only for certain prefixes, or looking for the same number on multiple lists. But I made this as simple as I could to demonstrate the basics: the timer to watch the shared variables, and how to clean up the threads when exiting. You will also need to work out proper messaging and resetting the threads after an empty search.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Tk;
    use Tk::Dialog;
    use threads;
    use threads::shared;

    my @sfiles = qw(numbers1 numbers2 numbers3 numbers4);

    # make threads first before any tk code
    my %shash;
    my %thash;
    my $numworkers = scalar @sfiles;

    my $searchnum;
    share $searchnum;
    $searchnum = '002-339-7473';

    foreach my $dthread (1 .. $numworkers) {
        share( $shash{$dthread}{'go'} );
        share( $shash{$dthread}{'die'} );
        share( $shash{$dthread}{'file'} );
        share( $shash{$dthread}{'return'} );

        $shash{$dthread}{'go'}     = 0;
        $shash{$dthread}{'file'}   = shift @sfiles;
        $shash{$dthread}{'die'}    = 0;
        $shash{$dthread}{'return'} = '';

        $thash{$dthread}{'thread'} = threads->new( \&work, $dthread );
    }

    my $count = 1;
    my $mw = MainWindow->new;
    my $c = $mw->Canvas( -width => 400, -height => 200, -background => "white" );
    $c->pack;    # Main window

    my $quit = $mw->Button( -text => "Quit", -padx => "8", -command => \&Quit );
    $quit->place( -x => 350, -y => 180 );    # adds the quit button

    my $Find = $mw->Button( -text => "Search", -command => \&Search );
    $Find->place( -x => 300, -y => 180 );

    my $lab = $mw->Label( -text => "Enter Phone Number", -background => "white" );
    $lab->place( -x => 5, -y => 50 );        # Label

    my $Entry = $mw->Entry( -textvariable => \$searchnum, -background => "white" );
    $Entry->place( -x => 5, -y => 68 );      # Entry

    my $text = $mw->Text( -height => 5, -width => 55, -wrap => 'char' );
    $text->place( -x => 5, -y => 90 );

    my $labWarning = $mw->Label( -text => 'Example XXX - XXX - XXXX', -background => "white" );
    $labWarning->place( -x => 140, -y => 68 );

    my $index = 1;

    sub Quit {
        my $d = $mw->Dialog(
            -text    => "Are you sure you want to quit?",
            -buttons => [ "Yes", "No" ]
        );
        my $answer = $d->Show();
        close(UNLOCK);

        # cleanup threads
        if ( $answer eq "Yes" ) {
            for ( 1 .. $numworkers ) {
                $shash{$_}{'die'} = 1;
                $thash{$_}{'thread'}->join;
            }
            exit;
        }
    }

    sub Search {
        my $found = 0;
        for ( 1 .. $numworkers ) { $shash{$_}{'go'} = 1 }

        # setup a timer to watch for returns thru shared variables
        my $timer;    # declare outside of block so you can kill it in callback
        $timer = $mw->repeat(
            10,
            sub {
                for ( 1 .. $numworkers ) {
                    if ( length $shash{$_}{'return'} > 0 ) {
                        if ( $shash{$_}{'return'} > 0 ) {
                            my $message =
                              "thread $_ detected number at line $shash{$_}{'return'} \n";
                            print "$message\n";
                            $text->insert( 'end', "$message\n" );
                        }

                        # stop other threads
                        for ( 1 .. $numworkers ) { $shash{$_}{'return'} = 0 }

                        # stop timer
                        $timer->cancel;
                    }
                }
            }
        );
    }
    MainLoop;

    ####################################################################
    sub work {
        my $dthread = shift;
        $|++;
        my $file = $shash{$dthread}{'file'};
        open( FH, "< $file" ) or die "$!\n";
        print "thread $dthread created\n";

        while (1) {
            if ( $shash{$dthread}{'die'} == 1 ) { goto END }
            if ( $shash{$dthread}{'go'} == 1 ) {
                seek FH, 0, 0;    # reset to top
                $shash{$dthread}{'return'} = '';
                print "thread $dthread going\n";
                while ( my $line = <FH> ) {
                    chomp $line;
                    if ( $line eq $searchnum ) {
                        $shash{$dthread}{'return'} = $.;    # $. = current input line
                        print "$dthread found $shash{$dthread}{'return'}\n";
                        goto RESET;
                    }
                    if ( $shash{$dthread}{'go'} == 0 )  { goto RESET }
                    if ( $shash{$dthread}{'die'} == 1 ) { goto END }
                }
              RESET:
                $shash{$dthread}{'go'} = 0;    # turn off self before returning
                print "thread $dthread going back to sleep\n";
            }
            else {
                select( undef, undef, undef, .1 );    # tenth second sleep, can be smaller
            }
        }
      END:
    }

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Searching text files
by nferraz (Monk) on Sep 14, 2006 at 18:57 UTC

    My first suggestion is that you not start an application by creating the GUI. Start with a simple command-line script which solves your problem, and make sure it works; then you can play with the UI.

    Now, the fastest solution is, IMO, the use of CDB_File. It will return data in a fraction of a millisecond.

    It usually takes two scripts: the first one is used to create a CDB file from your data:

    use CDB_File;
    $t = new CDB_File ('t.cdb', "t.$$") or die ...;
    $t->insert('key', 'value');
    $t->finish;

    The second script simply ties the CDB to a hash and look for a key:

    use CDB_File;
    tie %data, 'CDB_File', $filename
        or die "$0: can't tie to $filename: $!\n";
    if (defined $data{}) {
        # ...
    }

    That's it! The fact that it doesn't require a relational database will make it easier to set up your application. Think about that!

    Finally, take a look at Tie::File::AsHash -- it can be useful.

Re: Searching text files
by aquarium (Curate) on Sep 14, 2006 at 18:26 UTC
    not sure if you need anything fancy....something like (unix code)
    grep $phone_num file1 file2 file3
    ...sitting behind a primitive GUI should do it. You can get a (standalone, without Cygwin) grep.exe for Windows as well. If you must call it from Perl, just wrap it inside backticks and assign the output to a variable:
    my $result = `grep $phone_num file1 file2 file3`;
    if ($result) {
        print "number is in blocked list";
    }
    the hardest line to type correctly is: stty erase ^H
      you can get a <...> grep.exe for windows as well...

      Why bother, when windows has a native tool that serves a similar purpose:

      my $result = `findstr $phone_num file1 file2 file3`;
      if ( $result ) {
          print "DO NOT CALL $phone_num\n";
      }
      else {
          print "You can annoy $phone_num with your gratuitous cold calling\n";
      }

        grep and findstr are both O(n) linear searches. Maybe I'm missing something here, but I think that's what the OP is trying to avoid; a linear search that takes O(n) time to fail.


        Dave

Re: Searching text files
by wfsp (Abbot) on Sep 14, 2006 at 18:25 UTC
    I go with the db suggestion too.

    Perhaps have a look at DBM::Deep? It's pure Perl, well suited to large lookup tables and, according to the docs, "pretty fast".

    The docs also discuss speed/memory issues which you may find suit your requirements.

Re: Searching text files
by duckyd (Hermit) on Sep 14, 2006 at 19:00 UTC
    If you have enough memory to slurp the files into arrays, you might consider slurping them into hashes instead. Then checking whether a number is on the list will be very fast, e.g.
    if ( $do_not_call{ $test_number } ) {
        print "Do Not Call $test_number\n";
    }
    As far as database options, you might consider DBD::SQLite. It's very fast.

      Slurping the files into hashes will consume (assuming two million entries) about 22 megabytes JUST for the data. That doesn't seem unreasonable by today's standards, but it won't scale; someday someone will be sorry they used that approach as the dataset grows. 22 megs of phone numbers will take around 44 megs of total space including Perl's internal overhead (very approximately, assuming 11 bytes per number, plus hash and SV overhead).

      Plus, if the files are slurped, each time someone starts the program it has to do all that work over again. If it's slurping into an array, the array has to be sorted so it can be searched efficiently (the sort is O(n log n), and searches are O(log n) if a binary search is used). If it's slurping into a hash, building the hash costs O(n) every time the program starts, and then searches are O(1) after that. Every time the script starts, it has to re-slurp and re-organize.


      Dave

Re: Searching text files
by zentara (Archbishop) on Sep 14, 2006 at 19:08 UTC
    Hi, since you are opening 4 files, have you thought about opening each file in a separate thread, so they can be searched concurrently? Even go a step further, and bust the 4 files into a series of smaller ones, and use even more threads. There would be a small startup-time cost in creating, say 8 threads, but after that, it would be 8 times as fast.

    Of course, with Tk and threads, you must take precautions so that all threads are started before the main Tk thread is coded, and don't try to access any Tk widget from the threads. But you can use shared variables to communicate, and just have your search code in the threads.

    If you can wait a bit, I'll post a minimal example.


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Searching text files
by Anonymous Monk on Sep 15, 2006 at 02:50 UTC
    Search::Dict
      Hmmm, a fun chance to play with Entry validate, and to simplify a UI. I'm getting around 8 milliseconds for the look() call on a file with just over two million numbers in it, so it makes sense to call it whenever the number is valid, and thus no "Search" button is needed.
      #!/usr/bin/perl
      use Search::Dict;
      use Tk;
      use strict;

      my $file = 'd.search';    # combined *sorted* file with phone numbers
      open my $fh, '<', $file or die "$0: $! opening $file";

      my $mw = new MainWindow;
      my $label = $mw->Label(
          -font   => 'fixed 40',
          -height => 3,
      )->pack(-fill => 'x');

      $mw->Entry(
          -font     => 'fixed 40',
          -validate => 'key',
          -vcmd     => \&test,
      )->pack->focus;

      test('');    # do once before first key

      MainLoop;

      ############################################
      sub test {
          tr/0-9//cd for my $n = shift;
          $label->configure(
                length $n == 0 ? (-bg => 'gold',   -text => 'enter number below')
              : length $n < 10 ? (-bg => 'yellow', -text => 'need more digits')
              : length $n > 10 ? (-bg => 'yellow', -text => 'too many digits')
              : look($fh, $n, 0, 0) >= 0 && <$fh> == $n
                               ? (-bg => 'red',    -text => 'DO NOT CALL !!')
                               : (-bg => 'green',  -text => 'go ahead and call'));
          return 1;
      }
Re: Searching text files
by bpoag (Monk) on Sep 14, 2006 at 21:31 UTC
    Sorry for asking the obvious, but, why don't you just split the large file into several smaller files, and then import the corresponding piece depending upon what number the user enters?
      You mean like I suggested way up at the top? :)
Re: Searching text files
by f00li5h (Chaplain) on Sep 15, 2006 at 11:27 UTC
    Why not just make a file for each number, in a directory
    /home/phones/1/2/3/1/1/3/5/5/6
    then all you've gotta do is
    if ( -f "$directory/" . join '/', split //, $number ) {
        print "Lordy! don't call them!";
    }
    will: do{ perl programing } for ($cash);
      ... A mere two million new directories. Hope you've got this on its own partition!
        Heck, you could even build it all in a ram disk, then it'll be good and quick to access!
        will: do{ perl programing } for ($cash);
Re: Searching text files
by NiJo (Friar) on Sep 15, 2006 at 20:25 UTC
    Bloom filters are very efficient, especially with low hit rates as in your case.

    Speed, scalability and low memory are major advantages. Disadvantages include a tunable rate of false positives and not telling which phone number did match.

    Your application could use a precomputed and 'store'ed filter. Reimplement the algorithm using Bit::Vector if you need more speed.
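    A minimal Bloom-filter sketch in core Perl plus Digest::MD5 (the filter size and the choice of four hash functions here are illustrative, not tuned to the OP's data):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $m    = 8 * 1024 * 1024;    # filter size in bits (illustrative, not tuned)
my $bits = '';                 # the filter itself, as a Perl bit string

# Derive k=4 bit positions for a key from one MD5 digest.
sub positions {
    my ($key) = @_;
    return map { $_ % $m } unpack 'N4', md5($key);
}

sub add {
    vec($bits, $_, 1) = 1 for positions($_[0]);
}

sub lookup {
    for my $p ( positions($_[0]) ) {
        return 0 unless vec($bits, $p, 1);    # a clear bit means definitely absent
    }
    return 1;    # "probably listed": false positives possible, false negatives not
}

add('9991112223');
print lookup('9991112223') ? "probably listed\n" : "not listed\n";
```

    As the reply notes, the false-positive rate is tunable (via the filter size and number of hash functions), and a hit tells you only that the number is probably on the list.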

Re: Searching text files
by johndageek (Hermit) on Sep 15, 2006 at 18:49 UTC
    Just a thought for fun; something like the following might work (it is stripped down to show the general idea). Create an array with 9,999,999 entries. Create a bit mask for each area code, and place the appropriate bitmask in the array entry corresponding to the phone number; this allows you to store 8 area codes per byte of data stored. All edit checks, warnings, etc. are stripped.
    #!/usr/bin/perl

    open in, "987.txt" or die "could not open 987.txt as input";

    # assign bit mask for this area code and pack it to binary
    $a = "00000001";
    $b = pack('b8', $a);

    # load an array with the phone number less the area code as the subscript
    @arr;
    while (<in>) {
        chomp;
        $arr[$_] = $b;
    }

    # takes time to load: about 10 seconds for all 10 million recs
    print "\n\narray loaded\n";

    # response for lookup is sub-second
    while (true) {
        print "enter phone number: ";
        $pni = <>;
        chomp $pni;
        $pna = unpack('b8', $arr[$pni]);
        if (substr($pna, 7, 1) eq "1") {
            print "on file\n";
        }
        else {
            print "not on file\n";
        }
    }

    Enjoy!
    Dageek

Re: Searching text files
by Dervish (Friar) on Sep 18, 2006 at 03:58 UTC
    Just as a comment, your tests seem to show that reading your entire database (or at least the entire database for a particular area code) takes a fixed amount of time -- 30 seconds or so. So your goal would be to reduce the total amount of reading time, either by sorting the database files and doing a binary search, or saving that effort and using some kind of database to do the same thing for you. In either case, the goal is to reduce the total number of bytes checked for a failed search.
Re: Searching text files
by DrHyde (Prior) on Sep 18, 2006 at 09:23 UTC
    The three most obvious speedups you can attempt are to look for exact matches instead of using a regex to see if the number is in your list; to return immediately you find a match instead of continuing to plough through the rest of the list (this will halve your average lookup time, assuming calls are evenly distributed across the numbers); and to put the list into a hash - using the numbers as keys, the actual values don't matter - and use the exists() function to look your prospective victim up.
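    The hash-plus-exists() idea above can be sketched in a few lines of core Perl (the literal numbers stand in for reading the OP's area-code files):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load every listed number once; only the keys matter, not the values.
# In practice these would be read line by line from 314.txt etc.
my %listed;
$listed{$_} = 1 for qw(9991112222 9991112223 9991112224);

# Exact-match lookup is a single O(1) hash probe:
my $victim = '9991112223';
print exists $listed{$victim} ? "DO NOT CALL\n" : "ok to call\n";
```

    Loading the hash is the only expensive step; every search after that returns immediately, with no regex scan at all.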

    If you know that your data mostly consists of long runs of consecutive numbers rather than occasional numbers scattered at random through the number space, then you'll get better results by changing the algorithm to look at number prefixes, like I do in Number::Phone::UK. The scripts I use for creating the database it uses are in the tarball on the CPAN.

Re: Searching text files
by MidLifeXis (Monsignor) on Sep 18, 2006 at 17:23 UTC

    Basic data structures and algorithms question: use a binary search.

    Assumption: Files are sorted lines of 10 numbers followed by a newline.

    Your record size is 10+length('\n') (this is a text file by definition)

    See seek(), Wikipedia, Google, and a book on the topic. Most introductory CS texts will also address this.

    Your search time should be O(log2(n)), which is excellent: for 3 million records you are looking at a maximum of 22 comparisons.

    I hate to say this, but most of the answers on this thread are way too complex.

    --MidLifeXis

Re: Searching text files
by BrowserUk (Patriarch) on Sep 18, 2006 at 05:44 UTC

    Did you ever settle upon a solution?

    For grins, I just ran a test that looked up 1000 randomly generated 10-digit telephone numbers (nnn-nnn-nnnn) in a flatfile database containing approximately 6.6% (2e6 / 3e7) of the 1e10 numbers:

    c:\test>572961
    9991230061
    9991230061 is not found
    9991230062
    9991230062 is found
    9991230063
    9991230063 is not found
    Terminating on signal SIGINT(2)

    c:\test>perl -wle"printf qq[%03d%03d%04d\n], int( rand 1000 ), int( rand 1000 ), int( rand 10000 ) for 1 .. 1e3" | perl 572961.pl >nul
    File for area code '000' not found at 572961.pl line 12, <STDIN> line 57.
    999 trials of lookup (32.287s total), 32.319ms/trial

    Each lookup takes around 33 ms which ought to be quick enough for most purposes.

    The disk files (for all 999 possible area codes) require 10 GB, though that could trivially be reduced to 2.5 GB. Each area code is stored in a separate file, with one line of 10,000 characters for each of the 999 subarea codes; and each byte in the line representing a single telephone number by a simple '0' or '1'.

    The lookup process is:

    1. Split the number into its 3 component parts (nnn-nnn-nnnn).
    2. Open the appropriate areacode file.
    3. Seek to the appropriate subarea line and read it.
    4. substr the appropriate byte of the line; its value tells you whether the number is 'found' or 'not found'.

    Care to trade 10 MB (2.5 MB) of disk space per area code for a 32 ms lookup time regardless of how the application grows?

    #! perl -slw
    use strict;

    use Benchmark::Timer;
    my $T = new Benchmark::Timer;

    while( my $number = <STDIN> ) {
        chomp $number;
        $T->start( 'lookup' );
        if( my( $area, $subarea, $no ) = $number =~ m[^(\d{3})(\d{3})(\d{4})$] ) {
            open FILE, '<', "./tele/$area"
                or warn "File for area code '$area' not found" and next;
            seek FILE, ( $subarea - 1 ) * 10002, 0;
            my $mask = <FILE>;
            print "$number is ",
                ( substr $mask, ( $no - 1 ), 1 ) ? 'found' : 'not found';
        }
        else {
            print "Invalid telephone number: $number";
        }
        $T->stop( 'lookup' );
    }

    $T->report;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.