Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Out of Memory

by instinct_4ever (Initiate)
on Oct 25, 2009 at 04:30 UTC ( [id://803105]=perlquestion: print w/replies, xml ) Need Help??

instinct_4ever has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am working on a couple of scripts for word processing/counting. One of them involves a hashmap, which could have around 41 million keys in the worst case. The trouble is that while running the script always runs out of memory. I cant flush the hashmap at some intervals because all the information in that is necessary. Is there some solution to this?

I am posting a snippet of my code below:

while(<FH4>) { if(exists $wordCount{$_}) { $wordCount{$_} = $wordCount{$_} + 1; } else { $wordCount{$_} = 1;
<FH4> can contain around 41 million lines on an average.

Replies are listed 'Best First'.
Re: Out of Memory
by ikegami (Patriarch) on Oct 25, 2009 at 04:39 UTC

    Use a trie instead of a hash table.


    For future reference,

    if(exists $wordCount{$_}) { $wordCount{$_} = $wordCount{$_} + 1; } else { $wordCount{$_} = 1; }

    can be shortened to

    ++$wordCount{$_};

    It won't help your memory problem, just your code clarity.

        That is indeed an implementation of the previously mentioned data structure. I've never used it or other modules, thus the lack of recommendation.
Re: Out of Memory
by virtualsue (Vicar) on Oct 25, 2009 at 10:57 UTC
Re: Out of Memory
by afoken (Chancellor) on Oct 25, 2009 at 09:02 UTC

    Use an RDBMS, e.g. SQLite or PostgreSQL. Stuff the lines read into a table, create an index on the column, and use something like SELECT line,COUNT(*) FROM table GROUP BY line to fetch the counts.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Out of Memory
by JavaFan (Canon) on Oct 25, 2009 at 10:40 UTC
    Without knowing why you are putting so many keys in a hash, it's hard to say what to do. (Note that it matters less how many lines there are in the file, as well as the number of different words). One obvious savings you can do is chopping off the newline - that would save you a couple of Mb.

    But you might consider using a disk bound datastructure. Perhaps a database, or one of the DB files. A trie was suggested as well, but I'm not sure how much it will save. Obviously, the amount of string data is reduced, but at the cost of introducing more hashes (or arrays), which themselves come with quite a lot of memory overhead. It will depend on the prefix duplication in the data set.

Re: Out of Memory
by BrowserUk (Patriarch) on Oct 25, 2009 at 16:07 UTC

    If all you're doing is counting the unique lines, sorting the file first and then counting the consecutively similar lines will require very little memory and probably be quicker than loading the data into a db.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Out of Memory
by Jenda (Abbot) on Oct 26, 2009 at 11:17 UTC

    use BerkeleyDB;

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

[DUP] Re: Out of Memory
by ikegami (Patriarch) on Oct 25, 2009 at 04:41 UTC

    [ Duplicate post. Please ignore ]

    Use a trie instead of a hash table.


    For future reference,

    if(exists $wordCount{$_}) { $wordCount{$_} = $wordCount{$_} + 1; } else { $wordCount{$_} = 1; }

    can be shortened to

    ++$wordCount{$_};

    It won't help your memory problem, just your code clarity.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://803105]
Approved by ikegami
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2024-04-26 00:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found