Very large text file - simple indexing

by Anonymous Monk
on Apr 09, 2003 at 19:57 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I apologize if this is an utterly clueless question, but I'm a newbie so here goes:

All I want to do is correlate two entries in a very large text file. For instance, the file has this format:

    12	4433
    13	4433
    14	4476
    15	4477
    16	4477

...and so on with tabs separating the columns. I'm going to be accessing this array hundreds of thousands of times. Now the approach for this that seemed obvious to me was just like this:

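    # Test the read itself for EOF; chomp's return value is only a removed-character count.
    while (defined($input=$handle->getline)) {
        chomp($input);
        @temp=split("\t",$input);
        $index[$temp[0]]=$temp[1];
    }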

Which is great: now we have $index(field 1 value) = field 2 value, which is what I wanted, and it works fine for the original 40 MB text file.

The problem comes with the new text file, which is almost 400 MB (many millions of entries). My system slows to a crawl and eventually seizes up. I assume this is most likely a memory issue (only 256 MB of physical RAM due to a RAM blowout, plus another 500 MB of swap), but I'm not sure.

Is there a better way to do this kind of thing for a huge text file like this?

Replies are listed 'Best First'.
Re: Very large text file - simple indexing
by zby (Vicar) on Apr 09, 2003 at 20:13 UTC
    You can use a tied hash for this. DB_File is one module that implements it (or you can use the BerkeleyDB module).

      BerkeleyDB is even nicer once you move into non-trivial amounts of data (I call that hundreds of megs). If you need an ordered database, see BerkeleyDB::Btree; BerkeleyDB::Hash does not keep keys in order.
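
A minimal sketch of the tied-hash idea, assuming DB_File is installed and that the data file holds one tab-separated key/value pair per line. The helper names `open_index` and `load_pairs` and the file names in the usage note are made up for illustration. Passing `$DB_BTREE` instead of `$DB_HASH` should give a database that keeps its keys sorted.

```perl
use strict;
use warnings;
use DB_File;                      # ties a hash to an on-disk Berkeley DB file
use Fcntl qw(O_RDWR O_CREAT);

# Tie a hash to $db_path (a hypothetical file name chosen by the caller).
# Returns a reference to the tied hash; entries are stored on disk.
sub open_index {
    my ($db_path) = @_;
    my %index;
    tie %index, 'DB_File', $db_path, O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie $db_path: $!";
    return \%index;
}

# Read tab-separated "key<TAB>value" lines from $fh into the tied hash.
sub load_pairs {
    my ($index, $fh) = @_;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        next unless defined $value;   # skip blank or malformed lines
        $index->{$key} = $value;
    }
    return;
}
```

Typical use would be to call `open_index('index.db')` once, run `load_pairs` over a handle opened on the text file, and afterwards read values with `$index->{14}`. The one-time load is slow for millions of rows, but later runs can reopen the same database file without re-reading the text.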

Re: Very large text file - simple indexing
by BrowserUk (Patriarch) on Apr 09, 2003 at 20:52 UTC

    Your post is a little confusing. In the sample data, you show field_1 as numeric and apparently incrementing by one, but it doesn't start from either zero or one. In the sample code, you use this numeric field as the index into an array ($index[$temp[0]]=$temp[1];), then immediately afterwards you show $index(field 1 value) = field 2 value, using parens rather than square brackets.

    • Is field_1 always numeric?
    • Are they sequential?
    • Is the file (or can it be) sorted by this first field in ascending order?
    • Does field_1 start from 0 or 1?

    If the answer to all these questions is yes, then possibly the easiest solution to the problem would be to use Tie::File. Read the excellent documentation for this module for the full nitty-gritty, but simply stated, it allows you, with a single statement, to treat a file as an array. Once you have tied the array to the file, you can just use the array as if it were entirely in memory and it takes care of caching, flushing, opening & closing it. You can specify how much memory you wish to allocate to the caching of the file and thereby make your own choices about the trade-off between memory use and performance.

    The only downside given your file format is that each array element would contain both fields, but it would be a fairly trivial process to modify the module for your own purposes to remove field_1 on the FETCHes and replace it on STOREs.

    If not all the answers to the 4 questions above are yes, for example if the sequence numbers do not start from 0 or 1, or if the sequences have large gaps, then you would need to make more substantial changes to the module to map the sequence numbers to record numbers. That may be more work than you want to do, but it's worth considering if there is an algorithmic relationship involved.
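
A rough sketch of the Tie::File approach described above, without modifying the module: the file is tied read-only, field_1 is stripped in a small lookup helper instead of inside FETCH, and a constant offset maps field_1 to a record number. The names `open_records` and `lookup` and the `$first_key` parameter are assumptions for illustration, and the offset trick only works if field_1 is sequential with no gaps.

```perl
use strict;
use warnings;
use Tie::File;
use Fcntl qw(O_RDONLY);

# Tie an array to the data file read-only. $cache_bytes is passed as the
# 'memory' option, which limits the read cache (not every internal structure).
sub open_records {
    my ($path, $cache_bytes) = @_;
    my @records;
    tie @records, 'Tie::File', $path,
        mode   => O_RDONLY,
        memory => $cache_bytes
        or die "Cannot tie $path: $!";
    return \@records;
}

# Return field_2 for a given field_1 value. Assumes field_1 is sequential
# and that $first_key is the field_1 value on the first line of the file.
sub lookup {
    my ($records, $first_key, $key) = @_;
    return if $key < $first_key;
    my $line = $records->[$key - $first_key];
    return unless defined $line;
    my ($found_key, $value) = split /\t/, $line;
    return $found_key == $key ? $value : undef;   # guard against gaps
}
```

With the sample data, `lookup($records, 12, 14)` would read the third line of the file. Tie::File's documentation notes that some of its internal structures are not charged against the `memory` limit, so it is worth measuring real memory use on a file with millions of lines before relying on this.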


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Very large text file - simple indexing
by waxmop (Beadle) on Apr 09, 2003 at 20:06 UTC
    Do you really need all the rows loaded in memory at once? If so, I suggest stuffing all this stuff into a MySQL database.
Re: Very large text file - simple indexing (seek)
by tye (Sage) on Apr 10, 2003 at 16:31 UTC

    Those records are so simple, I'd make sure they were fixed length (rewriting the file if necessary) and simply seek( FILE, $index*$reclen, 0 ) then readline (<FILE>).

                    - tye
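
One way tye's suggestion could look in practice, as a sketch under stated assumptions: the file is rewritten once so that every record has the same length, and a lookup is then a single seek plus a readline. Here the rewritten records hold only field_2, padded to a fixed width chosen by the caller; that layout and the helper names `write_fixed` and `fetch_fixed` are illustrative choices, not part of the original reply. Record number n corresponds to the n-th input line (0-based), so with keys starting at 12 the record for key 14 is number 2.

```perl
use strict;
use warnings;

# Rewrite tab-separated "key<TAB>value" lines as fixed-length records:
# the value padded with spaces to $width characters, plus a newline.
sub write_fixed {
    my ($in, $out, $width) = @_;
    while (defined(my $line = <$in>)) {
        chomp $line;
        my (undef, $value) = split /\t/, $line;
        next unless defined $value;          # skip blank or malformed lines
        die "Value '$value' is wider than $width"
            if length($value) > $width;
        printf {$out} "%-${width}s\n", $value;
    }
    return;
}

# Fetch the value in record number $n (0-based) by seeking straight to it.
sub fetch_fixed {
    my ($fh, $n, $width) = @_;
    my $reclen = $width + 1;                 # payload plus "\n"
    seek($fh, $n * $reclen, 0) or return;
    my $line = <$fh>;
    return unless defined $line;             # past end of file
    chomp $line;
    $line =~ s/\s+\z//;                      # strip the padding
    return $line;
}
```

Only one record is in memory at a time, so memory use stays flat regardless of file size. The record length arithmetic assumes a one-byte newline, so on platforms that translate line endings both handles would need `binmode`.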
