Re: Efficient search through a huge dataset

by Anonymous Monk
on Oct 20, 2004 at 13:50 UTC


in reply to Efficient search through a huge dataset

A few hours to a day? Oh my, what kind of hardware are you running? For two ten-million-record files, it takes me less than 8 minutes to find out which records are in the second file but not in the first. Just using shell commands (but using Perl to create the large files):
$ # Create two large files to work with.
$ perl -e 'printf "%08d\n", int rand 100_000_000 for 1 .. 10_000_000' > big1
$ perl -e 'printf "%08d\n", int rand 100_000_000 for 1 .. 10_000_000' > big2
# Sort them, make them unique.
$ time sort -u big1 > big1.s

real    4m0.489s
user    2m4.360s
sys     0m7.200s
$ time sort -u big2 > big2.s

real    3m24.848s
user    1m55.430s
sys     0m6.460s
# Report the number of lines that are in the second file, and not in the first.
$ time comm -13 big1.s big2.s | wc -l
8611170

real    0m14.278s
user    0m12.850s
sys     0m0.400s
Total elapsed time: less than 8 minutes, of which almost half is spent in disk I/O. No point in using a database for such puny-sized sets.
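The same set difference also fits in a few lines of pure Perl with a hash, trading the sort's disk I/O for memory. A rough sketch (the script name and file arguments are just placeholders), assuming ten million hash keys fit comfortably in RAM:

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl comm13.pl big1 big2
# Counts the distinct lines that appear in the second file but
# not in the first -- the same answer `comm -13` gives on
# sorted, uniqued input.
my ($file1, $file2) = @ARGV;

my %seen;
open my $fh1, '<', $file1 or die "Can't open $file1: $!";
$seen{$_} = 1 while <$fh1>;              # mark every line of the first file
close $fh1;

my %only_in_2;
open my $fh2, '<', $file2 or die "Can't open $file2: $!";
while (<$fh2>) {
    $only_in_2{$_} = 1 unless $seen{$_}; # keep lines absent from the first file
}
close $fh2;

print scalar(keys %only_in_2), "\n";

No sorting pass is needed, but the two hashes will eat a fair chunk of memory on sets this size, so the sort/comm pipeline stays the safer bet when RAM is tight.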
