http://qs321.pair.com?node_id=266924

Elgon has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I have a bit of a query: I am trying to do some relatively straightforward file transformations as part of some testing we're doing on a project at work. The first stage of this involves reading in a big lookups file, circa 160,000 records (~3.4MB), and converting it into a hash.

The format of the file is basically fixed-width records twenty-one characters long, where the first thirteen characters are the key for the remaining eight. I need to go through the file and read it into a hash for use as a lookup table. I am doing this thusly...

open (CONFIG, "<$config_file)" || die "Coulnd't open config file!"; my %lookup; while (<config>) { $lookup{substr($_, 0, 13)} = substr ($_, 13); }

This may seem hugely inefficient; however, the overhead of setting this up is offset by the fact that we'll be processing some 700 files, each about a megabyte in length. This is being done on a BIG box, needless to say, and memory is not going to be a problem.

But it seems to be taking about half an hour to do the initial processing. Is there a faster way to do it?

Thanks in advance...

Elgon

Please, if this node offends you, re-read it. Think for a bit. I am almost certainly not trying to offend you. Remember - Please never take anything I do or say seriously.

Re: Slurping BIG files into Hashes
by hardburn (Abbot) on Jun 18, 2003 at 17:50 UTC

    Slurp it into an array first, then move to a hash:

    my %lookup = map { substr($_, 0, 13) => substr($_, 13) } <CONFIG>;

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

      hardburn,

      Thanks - 15 minutes as opposed to 30.

      Elgon

      Please, if this node offends you, re-read it. Think for a bit. I am almost certainly not trying to offend you. Remember - Please never take anything I do or say seriously.

Re: Slurping BIG files into Hashes
by dws (Chancellor) on Jun 18, 2003 at 18:26 UTC
    But it seems to be taking about half an hour to do the initial processing. Is there a faster way to do it?

    A quick back-of-the-envelope: 30 minutes to load ~160,000 records is roughly 90 records/second. That seems pretty slow. Have you tried instrumenting the code to take some timings? If you dumped a timestamp (or a delta) every 1K records, you might see an interesting slowdown pattern. Correlating this with a trace of your system's memory availability might show whether memory is an issue, particularly if the system starts swapping at some point during the load.
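
    A minimal sketch of that instrumentation, assuming the CONFIG handle and field widths from the original post (the core time() only has one-second resolution, but that is plenty at this scale):

    # Print a cumulative count and a per-1000-record delta so any
    # slowdown pattern shows up as the hash grows.
    my %lookup;
    my $count = 0;
    my $last  = time;
    while (<CONFIG>) {
        $lookup{ substr($_, 0, 13) } = substr($_, 13);
        next if ++$count % 1000;
        my $now = time;
        printf "%7d records  +%ds\n", $count, $now - $last;
        $last = $now;
    }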

    Can you say more about the form of the keys and values? There might be something about their nature that you could exploit to find a different data structure.

      Looks like you have something goofy going on there. Look at the time report at the bottom of this post for my runtime on a 2-processor Sun box.
      open (CONFIG, "<iaout.txt") || die "Coulnd't open config file!"; my %lookup; while (<CONFIG>) { $lookup{substr($_, 0, 13)} = substr ($_, 13); } #script used to generate data in the form of: # 21 random alpha chars per line # # #use Data::Random qw(:all); #open IA, ">iaout.txt"; #for $x (1 .. 160000) { #my @random_chars = rand_chars( set => 'alphanumeric', min => 21, max +=> 21 ); #print IA @random_chars, "\n";; # #}[1:50pm] 161 [/var/tmp]: time perl t 2.95u 0.18s 0:03.30 94.8%
      How similar are the key parts of the data in your file? I am wondering if you are getting a very high collision rate on the key for some reason (a quick way to check is sketched after this post); either that or memory is my best guess.

      -Waswas
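
      One quick way to test the collision theory: a minimal sketch, assuming a perl of this era (anything before 5.26), where a non-empty hash in scalar context reports used/allocated buckets. The iaout.txt name is just the test file from the post above.

      # Load the lookup table, then compare keys stored against hash
      # buckets actually used.  A "used/allocated" report with very few
      # used buckets points at pathological key collisions.
      open(CONFIG, "<iaout.txt") || die "Couldn't open config file: $!";
      my %lookup;
      while (<CONFIG>) {
          $lookup{ substr($_, 0, 13) } = substr($_, 13);
      }
      close CONFIG;

      print "keys:    ", scalar(keys %lookup), "\n";
      print "buckets: ", scalar(%lookup), "\n";    # e.g. "131072/262144"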

        Aha, t'was written...

        "I am wondering if you are getting a very high collision rate on they key for some reason?"

        I reckon that this is the problem, as the key values are very similar all the way through and there's not a lot that can be done about it. Hmmm... I'm trying to think of a better data structure. Thanks for the help, everybody who contributed.

        Elgon

        PS - The box is an 8-processor Sun server running Solaris with 8GB of RAM. Neither the I/O nor the memory seems to be the problem from continuous observation of the stats.

        update - Thanks to BrowserUK et al. for their help; unfortunately the version we are using is 5.004_5 and I am not allowed to change it. Oh well. I'm trying to find a workaround as we speak (one possibility is sketched below)...

        update 2 - Thanks to jsprat, the script now runs in about a minute. Ta to all...

        Please, if this node offends you, re-read it. Think for a bit. I am almost certainly not trying to offend you. Remember - Please never take anything I do or say seriously.
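
        If the near-identical keys really are defeating the hashing function, one illustrative workaround (a sketch only; not necessarily what jsprat suggested, and worth verifying with the bucket report above) is to perturb the keys before they go into the hash -- for example, reversing each 13-character key, which may spread near-identical strings across more buckets:

        # Illustrative sketch only: store and look up reversed keys so the
        # hash sees different strings than the near-identical originals.
        my %lookup;
        while (<CONFIG>) {
            $lookup{ scalar reverse substr($_, 0, 13) } = substr($_, 13);
        }

        # ...later, when processing the data files ($raw_key is hypothetical):
        my $value = $lookup{ scalar reverse $raw_key };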

      Very slow, considering I used Data::Random to generate a file to the specification listed (160,000 lines, 21 chars) and it only took me 3.5 minutes to _generate_ the file. I agree it is sounding like a memory issue. What OS are you running this on? What are the specs on the box?

      -Waswas
Re: Slurping BIG files into Hashes
by BrowserUk (Patriarch) on Jun 18, 2003 at 20:23 UTC

    I have to concur with waswas-fng: your data, or rather your keys, must be such that you're approaching worst-case behaviour. I tried your original code on my 233MHz P2 and loading the hash took a little under 8 seconds.

    To achieve the load times of 30 minutes you are quoting, one of the following must be true:

    1. You are extremely unlucky.
    2. You are using an Apple II or ZX 81 :)
    3. Your data has been picked to deliberately induce the pathological behaviour.

    If case 1 applies, my commiserations, but I do have a workaround for you. If you are currently using 5.8, revert to 5.6.1. If you're currently using 5.6.1 or earlier, switch to 5.8. The hashing function changed between these two builds, and if your data is genuinely inducing the O(n^2) insertion behaviour on one build, it will perform as you would hope on the other.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      ++. Also, if the data is not valuable or meaningful, put it up on a site and send me a message with the URL; I will verify the behaviour across platforms (plus it might make a good dataset for seeing how the hash keys are vulnerable).

      -Waswas
Re: Slurping BIG files into Hashes
by BazB (Priest) on Jun 18, 2003 at 18:12 UTC

    This does seem a little slow, but it might be down to your hardware.

    I perform a similar operation, but on a properly big file: 27 million records, 8 character keys and values.
    That only takes about 20 minutes, using exactly the same method as you - substr()'ing out the keys and values from a flat file and stuffing them into a hash.

    I can't think of a better way of doing it - unpack is unlikely to be much faster, but you might want to do some benchmarking.
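
    For what it's worth, a rough benchmark sketch along those lines -- untested, assuming a perl with Benchmark::cmpthese (5.6 or later; timethese() would do on older perls), and using synthetic records that just mimic the 13+8 layout:

    use Benchmark qw(cmpthese);

    # Synthetic fixed-width records: 13-char key, 8-char value, newline.
    my @lines = map { sprintf("%013d%08d\n", $_, $_) } 1 .. 100_000;

    cmpthese(-5, {
        substr => sub {
            my %h;
            $h{ substr($_, 0, 13) } = substr($_, 13) for @lines;
        },
        unpack => sub {
            my %h;
            for (@lines) {
                # note: unpack's value has no trailing newline, unlike substr($_, 13)
                my ($k, $v) = unpack 'A13 A8', $_;
                $h{$k} = $v;
            }
        },
    });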


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

Re: Slurping BIG files into Hashes
by cciulla (Friar) on Jun 18, 2003 at 17:51 UTC

    I'd be tempted to do it thusly, as it appears that you're throwing away the value part:

    while (<CONFIG>) {
        chomp;
        my ($key, $value) = /^(.{1,13})(.+)$/;
        $lookup{$key} = $value;
    }
    ...or, am I missing something?

    Update: perldoc -f substr has shown me the light.
    Update the Second: As has hardburn.

      Regexen are almost always going to be slower than substr. Normally, I'd put a big notice here about premature optimization being the root of all something or other, but in this case we actually do need to optimize. Of course, benchmarks need to be done to be sure.

      ----
      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      Note: All code is untested, unless otherwise stated