Re^3: search a large text file

by jethro (Monsignor)
on Feb 08, 2011 at 13:34 UTC ( [id://886963]=note )


in reply to Re^2: search a large text file
in thread search a large text file

This is the ideal application for a hash tied to a file. You might like to take a look at DBM::Deep. This is a well-tested and well-liked implementation of a disk based hash.

Just use a script to generate your hash once (that will take a while); after that, any search will be nearly as fast as a single disk access. Store multiple values either concatenated as a string or, better, in an array. Since DBM::Deep supports multilevel structures, storing a hash of arrays is no problem.
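A minimal sketch of the build-once, search-later approach described above. The file names, the sample data, and the helper names build_index and lookup are all made up for illustration; the input is assumed to be tab-separated "key, value" lines in which a key may repeat.

```perl
use strict;
use warnings;
use DBM::Deep;
use File::Temp qw(tempdir);

# Hypothetical sample data: "key<TAB>value" lines, keys may repeat.
my $dir      = tempdir( CLEANUP => 1 );
my $txt_file = "$dir/ngrams.txt";
my $db_file  = "$dir/ngrams.db";

open my $out, '>', $txt_file or die "Cannot write $txt_file: $!";
print {$out} "big file\t12\n", "text search\t7\n", "big file\t40\n";
close $out;

# Build step: run once. Reads line by line so memory use stays small.
sub build_index {
    my ( $txt, $dbf ) = @_;
    my $db = DBM::Deep->new($dbf);
    open my $in, '<', $txt or die "Cannot read $txt: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $key, $value ) = split /\t/, $line, 2;
        next unless defined $value;    # skip lines without a tab
        $db->{$key} = [] unless exists $db->{$key};
        push @{ $db->{$key} }, $value; # hash of arrays, stored on disk
    }
    close $in;
    return;
}

# Search step: reopen the database file and fetch the values for one key.
sub lookup {
    my ( $dbf, $key ) = @_;
    my $db = DBM::Deep->new($dbf);
    return exists $db->{$key} ? @{ $db->{$key} } : ();
}

build_index( $txt_file, $db_file );
my @hits = lookup( $db_file, 'big file' );
print join( ', ', @hits ), "\n";
```

In real use the build step and the search step would probably live in two separate scripts, so that the slow build only happens once.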

Re^4: search a large text file
by perl_lover_always (Acolyte) on Feb 08, 2011 at 14:41 UTC
    My hash will run out of memory! That was the main problem: I could not even generate the hash in the first place. Is there any sample code I could use as a clue?

      If you use a disk based hash, for example a DBM::Deep hash, you won't run out of memory. The main reason to have a disk-based hash is that you can create hashes that are bigger than your memory (another reason to use it is that the hash is permanent).

      For sample code, just look at the perldoc page of DBM::Deep, under 'Synopsis'. Basically you just call DBM::Deep to link a hash name with a file on disk, and after that you use that hash like any other hash, except that everything you store in it is (behind the scenes) transferred directly to disk.
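As a rough illustration of linking a hash to a file, here is a small sketch using the tie interface of DBM::Deep. The file name and the keys are invented for the example; the two blocks stand in for two separate runs of a program.

```perl
use strict;
use warnings;
use DBM::Deep;
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/demo.db";    # hypothetical file name

{
    # Link %h to the file; every store goes to disk.
    tie my %h, 'DBM::Deep', $file;
    $h{apple} = 'red';
    $h{sizes} = [ 1, 2, 3 ];    # nested structures are allowed
}    # %h goes out of scope here; the data stays in the file

my ( $colour, $count );
{
    # A fresh tie (or a later run of the program) sees the same data.
    tie my %again, 'DBM::Deep', $file;
    $colour = $again{apple};
    $count  = scalar @{ $again{sizes} };
}
print "$colour has $count sizes\n";
```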

        I tried to use it! Why are the results not correct when I use it this way?

        sub to_hash {
            my $file = shift;
            my $db = DBM::Deep->new( "$file.db" );
            open(my $fh, '<', $file) or die "Cannot open $file: $!";
            # Read line by line; foreach over <FILE> would load the whole file into memory first.
            while (my $l = <$fh>) {
                my ($ngram, $line) = split /\t/, $l;
                push(@{ $db->{$ngram} }, $line);
            }
            close $fh;
            return $db;
        }

        For example, when I search for a key, I get the correct value several times instead of just once or twice!
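One possible explanation, offered only as a guess: because the .db file is permanent, calling to_hash again on the same input (or rerunning the script) pushes every value onto arrays that already hold the values from the previous run. The sketch below reproduces that effect with made-up sample data; the second parameter $fresh is a hypothetical addition that deletes the old database file before rebuilding.

```perl
use strict;
use warnings;
use DBM::Deep;
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/sample.txt";    # hypothetical input file
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} "new york\t3\n", "new york\t9\n";
close $out;

sub to_hash {
    my ( $file, $fresh ) = @_;
    # Assumption: when $fresh is true, an empty database is wanted.
    unlink "$file.db" if $fresh && -e "$file.db";
    my $db = DBM::Deep->new("$file.db");
    open my $in, '<', $file or die "Cannot read $file: $!";
    while ( my $l = <$in> ) {
        chomp $l;
        my ( $ngram, $line ) = split /\t/, $l, 2;
        $db->{$ngram} = [] unless exists $db->{$ngram};
        push @{ $db->{$ngram} }, $line;
    }
    close $in;
    return $db;
}

to_hash($file);                                       # first run
my $twice = @{ to_hash($file)->{'new york'} };        # second run appends again
my $clean = @{ to_hash( $file, 1 )->{'new york'} };   # rebuilt from scratch
print "after rerun: $twice values; after fresh rebuild: $clean values\n";
```

If the extra values have a different cause (for example, the key really does occur on several lines of the input), this guard will not change anything.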
