PerlMonks  

Re^2: Searching text files

by Skeeve (Parson)
on Sep 14, 2006 at 23:52 UTC ( [id://573037]=note )


in reply to Re: Searching text files
in thread Searching text files

#3 is the idea I like most

I don't know much about American phone numbers, but if they all have a fixed length of 10 digits, you'd need just slightly more than 1GB of disk space to store one bit for each possible number.

I wouldn't create this bit vector in memory. Just create a big enough file, initialized with zeros, then go through your text file, position with seek to $phone_num >> 3, and set bit number $phone_num & 7.

Do the same positioning for read access, but check the bit instead of setting it.
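
A minimal sketch of the seek-and-set idea, written in Python rather than Perl. The function names, the file name, and the small 7-digit vector used in the demo are assumptions made for illustration; a full 10-digit vector would need 10**10 // 8 bytes.

```python
import os
import tempfile


def create_vector(path, size):
    """Create a file of `size` bytes; extending with truncate zero-fills it."""
    with open(path, "wb") as f:
        f.truncate(size)


def set_number(f, phone_num):
    """Set the bit for phone_num: byte offset phone_num >> 3, bit phone_num & 7."""
    f.seek(phone_num >> 3)
    byte = f.read(1)[0]
    f.seek(phone_num >> 3)
    f.write(bytes([byte | (1 << (phone_num & 7))]))


def has_number(f, phone_num):
    """Same positioning as set_number, but only check the bit."""
    f.seek(phone_num >> 3)
    return bool(f.read(1)[0] & (1 << (phone_num & 7)))


if __name__ == "__main__":
    # Demo with 7-digit numbers (about 1.2 MiB) so it runs quickly.
    path = os.path.join(tempfile.mkdtemp(), "numbers.bin")  # hypothetical file name
    create_vector(path, 10**7 // 8)
    with open(path, "r+b") as f:
        set_number(f, 5551234)
        print(has_number(f, 5551234), has_number(f, 5551235))
```

Each lookup is one seek plus a one-byte read, which is why the file never has to be loaded into memory.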

I think searching will be done in less than a second.

Update: Of course you can couple this with the idea of splitting by area code. This should reduce the summed size of your three files to 1/333 of the full vector (about 4MB) if the area code has 3 digits.

Update #2: If each phone number has 10 digits and you have 2 million numbers, the text file already uses 21MB of disk space. So the bit vector on disk will save you about 17MB.
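
The sizes can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic under stated assumptions: exactly three area codes, 7-digit local numbers, one newline byte per line in the text file, and sizes reported in MiB.

```python
MIB = 1024 * 1024

full_vector = 10**10 // 8          # one bit per possible 10-digit number
per_area_code = 10**7 // 8         # one bit per possible 7-digit local number
three_area_codes = 3 * per_area_code
text_file = 2_000_000 * (10 + 1)   # assumption: 10 digits plus a newline per line
saving = text_file - three_area_codes

for label, size in [
    ("full vector", full_vector),
    ("one area code", per_area_code),
    ("three area codes", three_area_codes),
    ("text file", text_file),
    ("saving", saving),
]:
    print(f"{label:17s} {size / MIB:8.2f} MiB")
```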


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re^3: Searching text files
by rminner (Chaplain) on Sep 15, 2006 at 05:18 UTC
    Update: I misread your post; your 4MiB are for 3 area codes, thus the same result (1.2MiB per area code). Being able to read clearly is an advantage. Sorry.

    As you stated, the most practical approach would be to split it by area code. The point where I disagree is that you think it would eat up 4MB of space per area code.
        rminner@Rosalinde:~$ bc
        bc 1.06
        Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
        This is free software with ABSOLUTELY NO WARRANTY.
        For details type `warranty'.
        obase=1024
        10^7-1
         0009 0549 0639
        last/8
         0001 0196 0719
    I get 10 million bits (minus 1) for 7 digits. That means roughly 1.2 MiB, not 4 MiB. Depending on the amount of memory available, you could load only a limited number of area codes. This way the data structure for all do-not-call numbers in one area should be just a little more than those 1.2 MiB, so 5 area codes would only eat up 6 MiB, and as I said, a lookup would be instantaneous (from a user's perspective), as it only requires checking one bit. One could allocate a limited number of slots for area codes and free them using whatever replacement algorithm one prefers (for example LRU or LFU). Loading should also be fast using File::Slurp, as slurping 1 MiB directly into memory using sysread should be really fast when DMA is active. (You could also seek directly in the file, as stated by Skeeve; reducing a lookup to a single seek statement is also possible, which keeps memory consumption even lower and requires just a single hard disk seek.)
    Caching the bitstring could be done very easily. Simply store the bitstring in a file with the same name but a different extension, for example .bin. Afterwards, set the same mtime on the .bin file as on the .txt file. Later, if the mtimes are identical, you can use your precomputed bitstring; if the mtimes differ, the .txt file has been modified and the .bin file can be recreated from scratch (which also shouldn't take more than 1-2 seconds). This way your data would always be up to date using just plain .txt files, but speed should still be more or less instantaneous.
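
A rough sketch of the slot idea with LRU replacement, in Python rather than Perl. The class name, the `loader` callback, and the default of 5 slots are assumptions for illustration; `loader` stands in for whatever reads one area code's bitstring from disk.

```python
from collections import OrderedDict


class AreaCodeCache:
    """Keep at most `slots` per-area-code bit vectors in memory (LRU eviction)."""

    def __init__(self, loader, slots=5):
        self.loader = loader          # hypothetical callable: area_code -> bytes
        self.slots = slots
        self.vectors = OrderedDict()  # least recently used entry comes first

    def _vector(self, area_code):
        if area_code in self.vectors:
            self.vectors.move_to_end(area_code)   # mark as most recently used
        else:
            if len(self.vectors) >= self.slots:
                self.vectors.popitem(last=False)  # free the least recently used slot
            self.vectors[area_code] = self.loader(area_code)
        return self.vectors[area_code]

    def is_listed(self, phone_num):
        """Split a 10-digit number into area code and 7-digit local part, then check one bit."""
        area_code, local = divmod(phone_num, 10**7)
        vec = self._vector(area_code)
        return bool(vec[local >> 3] & (1 << (local & 7)))
```

With 5 slots of about 1.2 MiB each, the cache stays around 6 MiB no matter how many area codes exist on disk.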
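
The mtime check could look roughly like this in Python. It is a sketch under assumptions: one 7-digit local number per line in the .txt file, and the helper names `build_bits` and `load_bits` are made up for illustration.

```python
import os


def build_bits(txt_path, digits=7):
    """Build the bitstring from a text file with one number per line (assumed format)."""
    bits = bytearray(10**digits // 8)
    with open(txt_path) as f:
        for line in f:
            line = line.strip()
            if line:
                n = int(line)
                bits[n >> 3] |= 1 << (n & 7)
    return bytes(bits)


def load_bits(txt_path):
    """Return the cached .bin bitstring if its mtime matches the .txt file, else rebuild it."""
    bin_path = os.path.splitext(txt_path)[0] + ".bin"
    txt_stat = os.stat(txt_path)
    if os.path.exists(bin_path) and os.stat(bin_path).st_mtime_ns == txt_stat.st_mtime_ns:
        with open(bin_path, "rb") as f:
            return f.read()
    bits = build_bits(txt_path)
    with open(bin_path, "wb") as f:
        f.write(bits)
    # Stamp the cache with the text file's mtime so the next call can compare them.
    os.utime(bin_path, ns=(txt_stat.st_atime_ns, txt_stat.st_mtime_ns))
    return bits
```

Comparing for equality rather than "newer than" also catches a .txt file that was replaced by an older copy.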
