PerlMonks  

write hash to disk after memory limit

by hailholyghost (Novice)
on Mar 13, 2015 at 12:44 UTC [id://1119948]

hailholyghost has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I've got a script that processes genomic data and uses about 17 GB of RAM, most of it in a single hash. I have tried "undef" on the biggest hashes, but this does not free memory to the OS. The computer has 8 GB of RAM. How can I get the script to write the excess data (say, anything over 6.5 GB) to disk instead of RAM while the program is executing, and delete that spilled-over hash data afterwards? Thanks, -DEC

Replies are listed 'Best First'.
Re: write hash to disk after memory limit
by BrowserUk (Patriarch) on Mar 13, 2015 at 17:22 UTC
    I have tried "undef" on the biggest hashes but this does not free memory to the OS.

    You haven't explained your reasoning for wanting to "free this memory back to the OS"; but I think you are aiming for the wrong goal.

    Once you've undef'd the hash, the memory it occupied will no longer be accessible to your program; thus, its place in physical memory will quickly be taken over by data you are still using, exchanged in from the swap file.

    That is to say: once a process moves into swapping, the memory you are using will be kept in physical RAM, and the memory you are not using -- on a least-recently-used basis -- will be 'swapped out' to disk in the system swap file. Once there, it will have no effect on the performance of your process (or your system) unless it is accessed again, at which point it needs to be exchanged with something else currently in physical RAM.

    So, once you've finished with your hash, you are better off letting it get swapped out to disk as the system sees fit than trying to reclaim it. This is because the very act of undefing the hash will cause Perl's garbage collection mechanism to visit every key and value (and subkey and value; and every element of those arrays of arrays) in order to decrement their reference counts and (if appropriate) free them back to the process memory pool (heap). To do that, any parts of the hash -- every last dusty corner of them -- that may have been benignly lying dormant in the swap file for ages will need to be swapped back in just to be freed; and in the process, memory that you do need to access may get swapped out, only to have to be brought back in -- swapped with the now inaccessible pages that used to contain the redundant hash -- almost immediately.

    Upshot: If you cannot avoid your process moving into swapping in the first place -- and you haven't supplied enough details about the nature and size of your data for us to help you do that -- then *DO NOT attempt to free it*. Far better to let the system deal with making sure that the memory your process needs -- along with what all the other running processes need -- is available when it is needed.

    And if you are concerned with how long your process takes to end -- after you've finished your processing -- because it spends ages reloading swapped-out, redundant pages during final global clean-up, then call POSIX::_exit(), which will bypass that clean-up and terminate your process quickly and efficiently.

    (BUT only do so once you've manually closed any and all output files, data ports, etc.; otherwise you could lose data!)
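    For illustration, the end of such a run might look something like this (just a sketch; $out_fh stands in for whatever output handles your script actually has open):

        use POSIX ();

        # ... all the real processing is finished at this point ...

        close $out_fh or die "close failed: $!";   # flush and close every output first
        # (likewise any sockets, database handles, etc.)

        POSIX::_exit(0);   # end the process without Perl's global destruction pass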

    I'd also urge you to more fully describe your data -- a few sample records -- and volumes; because it is nearly always possible to substantially reduce the memory requirements; even at the cost of a little up-front performance. Avoiding moving into swapping will speed your process by 1 or 2 orders of magnitude in many cases; so it leaves a lot of scope for trading a little speed for a lot of memory and coming out a net winner.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: write hash to disk after memory limit
by LanX (Saint) on Mar 13, 2015 at 12:51 UTC
    I'm not sure if I understand your question completely...

    ... but I once had a problem with a giant hash constantly swapping and solved it by splitting the hash up into a two-tier HoH (hash of hashes).

    If you can organize the upper tier roughly according to the timeline of your process, your system will only swap the necessary lower hashes on demand.

    I already described this here; I'll update the link once I've found it.

    HTH! :)

    update

    see Re: Small Hash a Gateway to Large Hash?
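    Roughly, the idea looks like this (only a sketch; $group, $item and process_group() are placeholders for whatever fits your data):

        # One flat hash with a composite key touches pages all over the place:
        #   $flat{"$group:$item"} = $value;

        # A two-tier HoH whose top key follows the order you process in keeps
        # each working set together:
        $data{$group}{$item} = $value;

        # Then work through, and drop, one lower hash at a time, so only the
        # currently active $group needs to sit in physical RAM:
        for my $group ( sort keys %data ) {
            process_group( $data{$group} );   # made-up per-group step
            delete $data{$group};
        }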

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)

    PS: Je suis Charlie!

      Thanks a lot. I've been using a hash-hash-array-array structure in order to keep memory use down. I think array access is also faster than hash access, so I did this:
      foreach my $rat (@directories) {
          print "Reading Merged_99$rat/bs_seeker-CG.tab ...\n";
          open(FH, "<Merged_99$rat/bs_seeker-CG.tab")
              or die "cannot read Merged_99$rat/bs_seeker-CG.tab: $!";
          while (<FH>) {
              if (/M/) {
                  next;
              } elsif ((/^chr(\S+)\s+(\d+)\s+\d+\s+(\d)\.(\d+)\s+(\d+)/) && ($1 ~~ @CHROMOSOMES) && ($5 >= $MINIMUM_COVERAGE)) {
                  # chromosome $1, methylated C $2, percent $3.$4 and coverage $5
                  $DATA{$1}{$2}[$set][$replicate] = "$3.$4";
              } elsif ((/^chr(\S+)\s+(\d+)\s+\d+\s+(\d)\s+(\d+)/) && ($1 ~~ @CHROMOSOMES) && ($4 >= $MINIMUM_COVERAGE)) {
                  $DATA{$1}{$2}[$set][$replicate] = $3;
              }
          }
          close FH;
          $replicate++;
      }
        As I said, better

        > > organize the upper tier roughly according to the timeline of your process

        No idea where $set comes from, but $replicate could be such a top tier.

        so $data[$set][$replicate]{$1}{$2} should have far fewer memory-swapping problems (AFAIS).

        (BTW, better to reserve uppercase var-names for Perl built-ins.)

        If this structure doesn't fit into your future plans, you most likely want to use a DB anyway.
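        For example (just a sketch of the re-ordered structure, reusing the variable names from the code above):

            # "Timeline" dimensions on top, per-position hashes underneath:
            $data[$set][$replicate]{$chromosome}{$position} = $percent;

            # Each $data[$set][$replicate] sub-structure is filled (and later read)
            # in one contiguous stretch of the run, so the pages belonging to other
            # set/replicate combinations can stay swapped out untouched.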

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)

        PS: Je suis Charlie!

        Do you later use the value as a string or as a number? If you use it as a number, I believe you could save quite a bit of memory by forcing a conversion before storing the data. The way you do it, you end up with a scalar containing both the string and (as soon as you use the number for the first time) the number.

        ...
                $DATA{$1}{$2}[$set][$replicate] = 0 + "$3.$4";
            } elsif ((/^chr(\S+)\s+(\d+)\s+\d+\s+(\d)\s+(\d+)/) && ($1 ~~ @CHROMOSOMES) && ($4 >= $MINIMUM_COVERAGE)) {
                $DATA{$1}{$2}[$set][$replicate] = 0 + $3;
        ...
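        If you want to see the difference for yourself, something like this should do (assuming Devel::Size is installed; the exact byte counts vary by perl build):

            use Devel::Size qw(total_size);

            my $dual   = "0.87";        # stored as a string ...
            my $forced = 0 + "0.87";    # ... vs. forced to a plain number

            my $n = $dual + 0;          # first numeric use: $dual now holds string *and* number

            printf "string+number scalar: %d bytes\n", total_size( \$dual );
            printf "number-only scalar:   %d bytes\n", total_size( \$forced );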

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

Re: write hash to disk after memory limit
by ww (Archbishop) on Mar 13, 2015 at 12:52 UTC

    Standard answers (and none of these may be applicable, but you haven't given us much to go on):

    • Add RAM
    • Read and process the data line-by-line or paragraph-by-paragraph (read about the special variable $/, the "input record separator"); see the sketch after this list.
    • AND BEST OF ALL: tell us more about the required processing. That may permit Monks to offer more specific and readily applicable approaches and/or algorithms.
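    For the second point, a minimal sketch (the file name and process_record() are placeholders):

        my $file = 'input.tab';                       # placeholder path
        open my $fh, '<', $file or die "cannot read $file: $!";
        {
            local $/ = '';                            # paragraph mode: one blank-line-separated record per read
            while ( my $record = <$fh> ) {
                process_record($record);              # hypothetical per-record step
            }
        }
        close $fh;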
Re: write hash to disk after memory limit
by MidLifeXis (Monsignor) on Mar 13, 2015 at 12:57 UTC

    To echo ww's comment - we don't have enough information.

    A couple more potential solutions might be a "better" algorithm or a disk-based hash store (DBM::Deep, for example; see the sketch below). Whether either helps with your use case depends on many factors, none of which we have at this point.
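    A minimal DBM::Deep sketch, in case it fits (the file name and keys here are invented):

        use DBM::Deep;

        # Tie a multi-level hash to a file on disk; only the parts you touch
        # are read into memory.
        my $db = DBM::Deep->new( 'methylation.db' );

        $db->{'1'}{12345} = '0.87';         # written straight to disk
        print $db->{'1'}{12345}, "\n";      # read back on demand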

    --MidLifeXis

      For DBM::Deep, I recall needing 64-bit Perl for large files (>4GB). See DBM::Deep Large File Support for more info.

      But, as said above, this is only one of several options.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        64-bit perl != largefile support, except on their test machines. I can attest that a 32-bit perl built with largefile support, on a filesystem that also supports it, does work.

        --MidLifeXis

Re: write hash to disk after memory limit
by FloydATC (Deacon) on Mar 13, 2015 at 21:59 UTC

    Swapping is the very act of writing to disk after the physical memory limit has been reached, is it not? When choosing what chunks of memory to swap out, the operating system will usually pick those that have not been in recent use. Unless you can write a significantly smarter algorithm, I'd expect the performance to be worse if you try to swap manually.

    Only if you can make better guesses on what chunks of data you won't be needing any time soon will you be able to outperform the memory manager. But then, if you knew you wouldn't be needing parts of the data in memory, you probably wouldn't have bothered placing it there to begin with, right?

    If the data set was 10 times larger, maybe I'd spend some time trying to come up with a completely different approach. Today, for a 17 GB data structure I'd seriously consider just buying more RAM so I could get back to work.

    The sad truth is, one day wasted on writing, testing and debugging clever code costs far more than a 16 GB stick these days.

    -- FloydATC

    Time flies when you don't know what you're doing

      the sad truth is, one day wasted on writing, testing and debugging clever code costs far more than a 16 GB stick these days.

      True. But only if the hardware is capable of accommodating it.

      Now the choice is to upgrade the motherboard to one that can accommodate the "extra stick"; but that usually means also upgrading the CPU because later motherboards that can handle more memory have different, later cpu sockets. So now we're looking at anything from 3 to 10 times the price.

      But does the version of the OS we're using support that new hardware? Does it have drivers available for everything? Does the new hardware still support the legacy ports and drivers needed for the other processes that run on the same box?

      Is the, now required, OS upgrade covered by the current license? Is it approved by your company/organisation? What are the costs involved in that upgrade? How many other processes will need to be compatibility tested with it? How long will the integration/testing/approval process take and how much will it cost?

      What if this process is run concurrently on a cluster -- 16 to 32 machines -- or a farm -- 100s or 1000s of machines. How much does that "extra stick" cost now?

      So, sod the upgrade, farm it out to AWS. Fine, but what are the security and legal implications of doing so? Is the data in whole or in part identifiable as customer data? Can a European company legally transmit customer data to US (sited or controlled) servers? How much will the test case in the European Court of Human Rights cost?

      Or, maybe we could just do some bit-twiddling™ and compress the data representation some, and avoid the whole issue. At least until we get an ECoHR hearing date.
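      As one possible flavour of that bit-twiddling (purely illustrative; it assumes a dense 0..100 value per position, one byte each, and sparse data would need a different layout):

          my %packed;    # $packed{$chromosome} = one long byte string

          sub store_percent {
              my ( $chr, $pos, $percent ) = @_;
              $packed{$chr} = '' unless defined $packed{$chr};
              vec( $packed{$chr}, $pos, 8 ) = int( $percent * 100 + 0.5 );   # 0..100 fits in one byte
          }

          sub fetch_percent {
              my ( $chr, $pos ) = @_;
              return vec( $packed{$chr}, $pos, 8 ) / 100;
          }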


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        I agree, sometimes it's not as easy as simply throwing a stick of memory at the problem. I'm just saying it's usually worth considering before you start reinventing fundamental parts of the operating system.

        Sometimes, even a forklift upgrade of the entire data center can be the most sensible thing to do.

        -- FloydATC

        Time flies when you don't know what you're doing

      > Today, for a 17 GB data structure I'd seriously consider just buying more RAM so I could get back to work.

      And when his laboratory gets expanded to output 170 GB, is he supposed to run out and buy 10 times more RAM?

      Clever algorithms pay off by scaling silently without causing such troubles.

      Only counting the day you spend designing is a miscalculation...

      Look at the code he showed us and how just re-sorting the dimensions of his data structure will reduce any swapping dramatically.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)

      PS: Je suis Charlie!

        And when his laboratory gets expanded to output 170GB he's supposed to run and buy 10 times more RAM?

        Like I said, if the data set was 10 times bigger...

        I don't disagree with any of what you say; it's just that having a working data set of 17 GB that simply isn't suitable for anything other than keeping it all in RAM is not unheard of in this day and age.

        Assuming for a moment that this isn't a problem that needs to scale across an entire datacenter, and that we're not talking about reprogramming a deep space probe launched 20 years ago, also helps reduce the need for throwing man-hours at the problem.

        If it turns out that in this particular case the data set wasn't really 17 GB after all but only expanded to this size as it was read into memory, that's great :-)

        I was merely trying to illustrate why replacing OS swapping with home baked swapping would probably not be worth the effort.

        -- FloydATC

        Time flies when you don't know what you're doing

