Re^3: search a large text file
by chrestomanci (Priest) on Feb 08, 2011 at 13:31 UTC
So in short, you have a static 5GB dataset that you need to search frequently.
I think your best bet would be to use a database to index the data, and let it worry about how to create an optimised index.
I would put the entire file contents into the database and discard the original file. If each line also contains lots of other stuff that you will not be searching on, I would still keep it in the database, but in a separate column without an index, so as not to bloat the database too much.
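For instance, a minimal sketch of that layout using SQLite via DBI (the file name, table name, and tab-separated format are assumptions for illustration):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=words.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# One indexed column for the search key, one unindexed column
# for everything else on the line.
$dbh->do('CREATE TABLE lines (key TEXT, payload TEXT)');

my $sth = $dbh->prepare('INSERT INTO lines (key, payload) VALUES (?, ?)');
open my $fh, '<', 'big_file.txt' or die "open big_file.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $payload) = split /\t/, $line, 2;
    $sth->execute($key, $payload);
}
close $fh;

# Index only the key column, after the load, so the payload
# does not bloat the index.
$dbh->do('CREATE INDEX idx_key ON lines (key)');
$dbh->commit;
$dbh->disconnect;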
Re^3: search a large text file
by BrowserUk (Patriarch) on Feb 08, 2011 at 13:33 UTC
This really does sound like a perfect application for a database, especially if you are generating the file yourself and can load the data directly into the DB, cutting out the middleman file.
That said, loading the DB via the tool's bulk loader is often faster than loading it via DBI one record at a time.
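For example, with SQLite that would mean feeding the file to the sqlite3 shell instead of inserting rows through DBI. A rough sketch, assuming a reasonably recent sqlite3 binary and a tab-separated input file (all names here are illustrative):

# Run the sqlite3 bulk loader from Perl.
system(q{sqlite3 words.db <<'EOF'
CREATE TABLE lines (key TEXT, payload TEXT);
.mode tabs
.import big_file.txt lines
CREATE INDEX idx_key ON lines (key);
EOF
}) == 0 or die "sqlite3 bulk load failed: $?";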
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: search a large text file
by jethro (Monsignor) on Feb 08, 2011 at 13:34 UTC
This is the ideal application for a hash tied to a file. You might like to take a look at DBM::Deep, a well-tested and well-liked implementation of a disk-based hash.
Just use a script to generate your hash once (that will take a while); after that, any search will be nearly as fast as a single disk access. Store multiple values either concatenated into a string or, better, in an array. Since DBM::Deep is multi-level, storing a hash of arrays is no problem, as the sketch below shows.
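A rough sketch of the one-time generation step, assuming input lines of the form "phrase<TAB>value" (the file names are made up):

use strict;
use warnings;
use DBM::Deep;

# Tie a hash to a file on disk; the hash can grow far beyond RAM.
my $db = DBM::Deep->new('words.dbm');

open my $fh, '<', 'big_file.txt' or die "open big_file.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($phrase, $value) = split /\t/, $line, 2;
    # Multi-level support: each key holds an array of values.
    push @{ $db->{$phrase} }, $value;
}
close $fh;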
My hash will run out of memory! That was the main problem: I could not generate the hash in the first stage. Any sample code as a clue?
If you use a disk-based hash, for example a DBM::Deep hash, you won't run out of memory. The main reason to use a disk-based hash is that you can create hashes that are bigger than your memory (another reason is that the hash is permanent).
For sample code, just look at the perldoc page of DBM::Deep, under 'Synopsis'. Basically, you call DBM::Deep to link a hash with a file on disk, and after that you use that hash like any other hash, except that everything you store in it is (behind the scenes) transferred directly to disk.
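A lookup then looks like ordinary hash access; here is a minimal sketch along the lines of the Synopsis, reusing the (made-up) file name from the sketch above:

use strict;
use warnings;
use DBM::Deep;

my $db = DBM::Deep->new('words.dbm');

# Ordinary hash operations; the data is read from disk, not memory.
my $phrase = 'the session';
if (exists $db->{$phrase}) {
    print "$phrase => @{ $db->{$phrase} }\n";
}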
Re^3: search a large text file
by moritz (Cardinal) on Feb 08, 2011 at 13:58 UTC
Sounds like a perfect match for dictd, which is very fast, and has a Perl client on CPAN: Net::Dict
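A minimal client sketch, assuming a dictd server running on localhost that already serves your data as a database called 'mydata' (both are assumptions; check the Net::Dict docs for the exact return format):

use strict;
use warnings;
use Net::Dict;

my $dict = Net::Dict->new('localhost');

# define() returns a reference to a list of [database, definition] pairs.
my $entries = $dict->define('pleasant', 'mydata');
foreach my $entry (@{ $entries || [] }) {
    my ($db, $definition) = @{ $entry };
    print "$db: $definition\n";
}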
Well, I have no clue how to manage that! I have my data in this form:
pleasant 3
festive 2
period 2
i declare 5
declare resumed 7
resumed the 15
the session 9
session of 13
How can I do that?