PerlMonks  

Re: Slurping BIG files into Hashes

by dws (Chancellor)
on Jun 18, 2003 at 18:26 UTC


in reply to Slurping BIG files into Hashes

But it seems to be taking about half an hour to do the initial processing. Is there a faster way to do it?

A quick back-of-the-envelope: 30 minutes to load ~160,000 records is roughly 90 records/second. That seems pretty slow. Have you tried instrumenting the code to take some timings? If you dumped a timestamp (or a delta) every 1K records, you might see an interesting slowdown pattern. Correlating this with a trace of your system's memory availability might show whether memory is an issue, particularly if the system starts swapping at some point during the load.
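Here is a rough sketch of the kind of instrumentation I mean. It is only an illustration: load_with_timings and its batch-size argument are names I made up, and the 13-character key split is an assumption about your record layout. It sticks to the built-in time(), whose one-second resolution is plenty when 1K records take around 11 seconds.

```perl
use strict;

# Hypothetical helper (not from the original script): load records from a
# filehandle into a hash and report how long each batch of records took.
# Assumes the first 13 characters of each line are the key.
sub load_with_timings {
    my ($fh, $batch) = @_;
    $batch ||= 1000;                      # report every 1K records by default
    my (%lookup, @deltas);
    my $count = 0;
    my $last  = time;                     # built-in time(), 1-second resolution
    while (defined(my $line = <$fh>)) {
        $lookup{ substr($line, 0, 13) } = substr($line, 13);
        if (++$count % $batch == 0) {
            my $now = time;
            push @deltas, $now - $last;   # seconds spent on this batch
            print STDERR "$count records, last $batch took ",
                         $now - $last, "s\n";
            $last = $now;
        }
    }
    return (\%lookup, \@deltas);
}
```

Call it with an open filehandle, e.g. load_with_timings(\*CONFIG, 1000), and watch whether the per-batch deltas stay flat or climb as the hash grows. Climbing deltas alongside falling free memory would point at swapping; climbing deltas with plenty of free memory would point at the hash itself.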

Can you say more about the form of the keys and values? There might be something about their nature that you could exploit to find a different data structure.

Replies are listed 'Best First'.
Re: Re: Slurping BIG files into Hashes
by waswas-fng (Curate) on Jun 18, 2003 at 18:56 UTC
    Looks like you have something goofy going on there. Look at the time report at the bottom of this post for my runtime on a 2 proc sun box.

    open (CONFIG, "<iaout.txt") || die "Couldn't open config file: $!";
    my %lookup;
    while (<CONFIG>) {
        $lookup{substr($_, 0, 13)} = substr ($_, 13);
    }

    # script used to generate data in the form of:
    # 21 random alpha chars per line
    #
    #use Data::Random qw(:all);
    #open IA, ">iaout.txt";
    #for $x (1 .. 160000) {
    #    my @random_chars = rand_chars( set => 'alphanumeric', min => 21, max => 21 );
    #    print IA @random_chars, "\n";
    #}

    [1:50pm] 161 [/var/tmp]: time perl t
    2.95u 0.18s 0:03.30 94.8%

    How similar are the key parts of the data in your file? I am wondering if you are getting a very high collision rate on the key for some reason? Either that or memory is my best guess.

    -Waswas

      Aha, t'was written...

      "I am wondering if you are getting a very high collision rate on the key for some reason?"

      I reckon that this is what the problem is as the key values are very similar all the way through and there's not a lot which can be done about it. Hmmm... I'm trying to think of a better data structure. Thanks for the help everybody who contributed.

      Elgon

      PS - The box is an 8 processor Sun server running Solaris with 8GB of RAM. Neither the IO nor the memory seem to be the problem from continuous observation of the stats.

      update - Thanks to BrowserUK et al. for their help. Unfortunately, the version of Perl we are using is 5.004_5 and I am not allowed to change it. Oh well. I'm trying to find a workaround as we speak...

      update 2 - Thanks to jsprat, the script now runs in about a minute. Ta to all...

      Please, if this node offends you, re-read it. Think for a bit. I am almost certainly not trying to offend you. Remember - Please never take anything I do or say seriously.

        Try presizing the hash - keys %lookup = 160_000;

        If it is hash collisions, this might solve the problem.

        dominus has an interesting bit at perl.plover.com called When Hashes Go Wrong.

        Update: Meant to ask you to "print scalar %lookup;" after all is done. scalar %hash will give you the number of used buckets / number of allocated buckets. If the number of used buckets is low (like 1/16) all your hash items have been put in the same bucket!
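A small sketch of the presize-then-check idea, under some assumptions: bucket_usage is a hypothetical helper name, and the keys loaded here are stand-in data rather than the real file. On perls before 5.26, scalar(%hash) returns the "used/allocated" string described above; from 5.26 on it returns the key count instead, and Hash::Util::bucket_ratio provides the old behaviour, so the helper tries that first.

```perl
use strict;

# Hypothetical helper: return the "used/allocated" bucket string for a hash.
sub bucket_usage {
    my ($href) = @_;
    # Perl 5.26+ changed scalar(%hash) to return the key count, so prefer
    # Hash::Util::bucket_ratio whenever it is available.
    if (eval { require Hash::Util; 1 } && defined &Hash::Util::bucket_ratio) {
        return &Hash::Util::bucket_ratio($href);  # & form bypasses the \% prototype
    }
    return scalar(%$href);                        # older perls: "used/allocated"
}

my %lookup;
keys(%lookup) = 160_000;               # presize; perl rounds up to a power of two
$lookup{"key$_"} = $_ for 1 .. 1000;   # stand-in data, not the poster's file
print "buckets used/allocated: ", bucket_usage(\%lookup), "\n";
```

If the first number stays tiny while the hash holds many thousands of keys, the keys are piling into a handful of buckets and collisions are the likely culprit.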

Re: Re: Slurping BIG files into Hashes
by waswas-fng (Curate) on Jun 18, 2003 at 18:46 UTC
    Very slow, considering I used Data::Random to generate a file to the specification listed (160,000 lines, 21 chars each) and it only took me 3.5 minutes to generate the file. I agree it is sounding like a memory issue. What OS are you running this on? What are the specs on the box?

    -Waswas
