
Re^3: Searching text files

by rminner (Chaplain)
on Sep 15, 2006 at 05:18 UTC ( [id://573056]=note )


in reply to Re^2: Searching text files
in thread Searching text files

Update: I misread your post. Your 4 MiB are for 3 area codes, so we arrive at the same result (1.2 MiB per area code). Being able to read clearly is an advantage. Sorry.

As you stated, the most practical approach would be to split the data by area code. Where I disagree is the assumption that it would eat up 4 MB of space per area code.
    rminner@Rosalinde:~$ bc
    bc 1.06
    Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
    This is free software with ABSOLUTELY NO WARRANTY.
    For details type `warranty'.
    obase=1024
    10^7-1
     0009 0549 0639
    last/8
     0001 0196 0719
I get 10 million bits (minus 1) for 7 digits. Divided by 8, that is roughly 1.2 MiB and not 4 MiB. Depending on the amount of memory available, you could load only a limited number of area codes. This way the data structure for all do-not-call numbers in one area code should be just a little more than those 1.2 MiB. Thus 5 area codes would only eat up 6 MiB and, as I said, a lookup would be instantaneous (from a user's perspective), because it only requires checking one bit.

One could allocate a limited number of slots for area codes and free them using whatever replacement algorithm one prefers (for example LRU or LFU). Loading should also be fast using File::Slurp, as directly slurping 1 MiB into memory using sysread should be really fast when DMA is active. (You could also seek directly in the file, as stated by skeeve. Reducing the lookup to a single seek statement is also possible, which keeps memory consumption even lower and requires just a single hard-disk seek.)
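The one-bit lookup can be sketched with Perl's built-in vec. This is only a minimal sketch, not a finished implementation: the function names build_bitstring, is_listed and is_listed_on_disk are made up for illustration, and it assumes that a 7-digit number is used directly as the bit offset within its area code. The on-disk variant shows the single-seek idea: it reads only the one byte that holds the bit.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $NUMBERS = 10**7;    # 0000000 .. 9999999 within one area code

# Build the bitstring for one area code from a list of 7-digit numbers.
sub build_bitstring {
    my @numbers = @_;
    my $bits = "\0" x ($NUMBERS / 8);    # 1,250,000 bytes, all bits clear
    vec($bits, $_, 1) = 1 for @numbers;
    return $bits;
}

# In-memory lookup: test a single bit.
sub is_listed {
    my ($bits, $number) = @_;
    return vec($bits, $number, 1);
}

# On-disk lookup: seek to the byte that holds the bit and read only that byte.
sub is_listed_on_disk {
    my ($path, $number) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    defined sysseek($fh, int($number / 8), 0)
        or die "Cannot seek in $path: $!";
    my $read = sysread($fh, my $byte, 1);
    die "Cannot read from $path" unless defined $read && $read == 1;
    close $fh;
    # vec counts bits from the low end of each byte, so the offset
    # inside the single byte is simply $number modulo 8.
    return vec($byte, $number % 8, 1);
}

# Demonstration with made-up numbers and a temporary file.
my $bits = build_bitstring(5551234, 1234567);
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} $bits;
close $tmp_fh or die "Cannot close $tmp_path: $!";

printf "%d bytes per area code\n", length $bits;
for my $number (5551234, 5551235) {
    printf "%07d: memory=%d disk=%d\n", $number,
        is_listed($bits, $number), is_listed_on_disk($tmp_path, $number);
}
```

Keeping the bitstring in memory costs about 1.2 MiB per loaded area code. The on-disk variant costs almost no memory, but pays for one open and one seek per lookup.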
Caching the bitstring could be done very easily. Simply store the bitstring in a file with the same name but, for example, the extension .bin. Afterwards set the mtime of the .bin file to that of the .txt file. Later, if the mtimes are identical, you can use your precomputed bitstring. If the mtimes differ, the .txt file has been modified, and the .bin file can be recreated from scratch (which also shouldn't take more than 1-2 seconds). This way your data would always be up to date while using just plain .txt files, but speed should still be more or less instantaneous.
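The mtime check described above might look roughly like this. It is a sketch under assumptions: the function name load_bitstring is made up, and the .txt file is assumed to hold one 7-digit number per line, which the thread does not specify. Only core functions (stat, utime) are used.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return the bitstring for one area code, rebuilding the .bin cache
# only when its mtime no longer matches that of the .txt file.
sub load_bitstring {
    my ($txt) = @_;
    (my $bin = $txt) =~ s/\.txt\z/.bin/
        or die "Expected a .txt file: $txt";

    my $txt_mtime = (stat($txt))[9];
    defined $txt_mtime or die "Cannot stat $txt: $!";
    my $bin_mtime = (stat($bin))[9];    # undef if there is no cache yet

    if (defined $bin_mtime && $bin_mtime == $txt_mtime) {
        # Cache is current: slurp the precomputed bitstring.
        open my $in, '<:raw', $bin or die "Cannot open $bin: $!";
        local $/;
        my $bits = <$in>;
        return $bits;
    }

    # Cache is missing or stale: rebuild it from the text file.
    # Assumed format: one 7-digit number per line.
    my $bits = "\0" x (10**7 / 8);
    open my $in, '<', $txt or die "Cannot open $txt: $!";
    while (my $line = <$in>) {
        vec($bits, $1, 1) = 1 if $line =~ /^(\d{7})\s*$/;
    }
    close $in;

    open my $out, '>:raw', $bin or die "Cannot write $bin: $!";
    print {$out} $bits;
    close $out or die "Cannot close $bin: $!";

    # Give the .bin file the same mtime as the .txt file.
    utime $txt_mtime, $txt_mtime, $bin
        or die "Cannot set mtime on $bin: $!";
    return $bits;
}

# Demonstration in a temporary directory with made-up numbers.
my $dir = tempdir(CLEANUP => 1);
my $txt = "$dir/area.txt";
my $bin = "$dir/area.bin";
open my $fh, '>', $txt or die "Cannot write $txt: $!";
print {$fh} "5551234\n1234567\n";
close $fh or die "Cannot close $txt: $!";

my $first  = load_bitstring($txt);    # builds area.bin
my $second = load_bitstring($txt);    # reuses area.bin
print "cache reused\n" if $first eq $second;
```

Note that mtime has only one-second resolution here, so a .txt file edited twice within the same second as the cache was built would not be detected. Comparing the file size as well is not an option, since the bitstring always has the same length.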
