
Force perl to release memory back to the operating system

by Roger (Parson)
on Sep 25, 2003 at 04:59 UTC ( [id://294065] )

Roger has asked for the wisdom of the Perl Monks concerning the following question:

Hi Brothers,

I have written a script that parses 20 million+ customer transaction records stored in text files. These text files are extracts from various source systems. The purpose is to find the top 30 customers and extract their details.

The following is extracted from my main module -
...
# Find top 30 customers
# Build a hash for top 30 customers in the
# form of { 'cust1' => balance1, 'cust2' => balance2, ... }
my $top30_customers = &GetTopCustomers(\@input_files, $fx_rates, 30);

# Retrieve details about these top 30 customers
my $customer_details = &GetCustomerDetails(\@input_files, $top30_customers);
...

sub GetTopCustomers()
{
    my ($inputfiles, $fx_rates, $nth) = @_;
    my %balances;
    ...
    foreach my $file (@{$inputfiles})
    {
        my $f = new IO::File $file, "r";
        ...
        while (<$f>)
        {
            # $cust, $bal, $ccy are extracted from each line
            ...
            $balances{$cust} += $bal * $fx_rates->{$ccy};
            ...
        }
    }
    ...
    # Sort the customers by descending order of balance
    my @sorted_cust_list = reverse
                           map  { $_->[0] }
                           sort { $a->[1] <=> $b->[1] }
                           map  { [$_, $balances{$_}] }
                           keys %balances;

    my %topNth = map { $_ => $balances{$_} } (@sorted_cust_list)[0 .. $nth-1];

    # Try to force the garbage collector to work ?@$!%#
    undef %balances;
    undef @sorted_cust_list;

    return \%topNth;
}
As you can see, at the end of the sub "GetTopCustomers" I tried to "force" the garbage collector to work. But of course this is not going to work, as the perl interpreter does not give up its virtual memory until the process terminates. But I really want to free up memory for this process, because by the time it gets to the end of "GetTopCustomers", the process has eaten up ~2Gb of memory.

I dug up the following node http://www.perlmonks.com/index.pl?node_id=126591 from Perl Monks. Towards the end of the discussion node, Tye recommended restarting the process with exec { $^X, $0, @ARGV }.

Ummm, what does exec { $^X, $0, @ARGV } do? I have a feeling that if I do the exec, I will lose all the variables. Is that right?

Ah, this suddenly gave me an idea - what if I do the following:
...
# Find top 30 customers
# Build a hash for top 30 customers in the
# form of { 'cust1' => balance1, 'cust2' => balance2, ... }
my $top30_customers = &GetTopCustomers(\@input_files, $fx_rates, 30);

# Try to free up some memory here by starting
# a child process.
my $pid = fork() or die("Can not do fork!");
if ($pid) {
    exit(0);    # kill the parent process
}

# Retrieve details about these top 30 customers
my $customer_details = &GetCustomerDetails(\@input_files, $top30_customers);
...
I haven't tried the fork yet. But my gut feeling is that this will not work. I can think of two scenarios:

Scenario 1 - by killing the parent process, will the child process die with it???

Scenario 2 - OK, perhaps the child process will not die, but it will inherit all the virtual memory from the parent, which means all this effort was for nothing???

Oh, dear brothers, what shall I do? Can you please give me some advice on how to free up process memory?

Thank you very much, your help is greatly appreciated!

Replies are listed 'Best First'.
Re: Force perl to release memory back to the operating system
by tachyon (Chancellor) on Sep 25, 2003 at 05:41 UTC

    It seems to me that this is a rather inefficient way to do it. In fact, 20 million lines of data in flat files is a very solid indication that it is time to put the data into an RDBMS. Given that MySQL and Postgres are pretty good and also free, and that Perl has the excellent DBI module, there are not many reasons not to. If you just put the customer records into a real DB you would get the following benefits:

    1. You would not need to parse the flat text files every time you wanted to check the results.
    2. Once the initial load is done all you need to do is add the incremental changes.
    3. You would not need gigabytes of memory. Ever.
    4. It would be a hell of a lot faster. You can load 10,000+ records/sec into most DBs using their native text load facility. The actual query and dump will probably only take a few seconds (perhaps much faster depending on how you structure the DB).
    5. When the bosses decide they want top 10/20/30/50 by state/zip code/hair color or whatever, you only have to write a single line of SQL to get the answer. You can even dump this to a tab-separated text file ready to import straight into Excel for a PPT by the PHB to the visiting VP :-) with a line like SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30 INTO OUTFILE '/tmp/less_work_is_good.txt'

    You could use a hash tied to an underlying file, but this limits you to a single key/value pair. All that really does is solve your memory issue (at the expense of losing a lot of speed), so I will let someone else suggest that. Here is one of the many intros to DBI, A short guide to DBI, to get you started.
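    For illustration only, a minimal sketch of what running that kind of query through DBI might look like (the DSN, credentials, and table layout here are assumptions, not anything from the original post):

    use strict;
    use warnings;
    use DBI;

    # Connect to a hypothetical MySQL database holding the customer balances.
    my $dbh = DBI->connect( 'dbi:mysql:database=customers', 'user', 'password',
                            { RaiseError => 1 } );

    # Let the database do the sorting and the limiting.
    my $top30 = $dbh->selectall_arrayref(
        'SELECT customer, balance FROM customers ORDER BY balance DESC LIMIT 30'
    );

    printf "%-20s %15.2f\n", @$_ for @$top30;

    $dbh->disconnect;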

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      In fact 20 million lines of data in flat files is a very solid indication it is time to put the data into a RDBMS.

      On the contrary. Not putting them in a DB can make things _much_ more efficient. The real deciding factor is not the data load size, but the data access requirements and volatility of the data.

      For instance, I receive about 2 gigs worth of records every day that I have to process. Never in their life do these records see a DB. Just loading them into a DB and indexing them is significantly slower than the processing I need to do. And I only need to do that processing once (or very rarely twice). An RDBMS is not, IMO, suitable for managing large volumes of data that are only going to be accessed once or twice, never change, and can be disposed of once processed.

      Anyway, just thought I'd throw that in there.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


        Yes, I did make a couple of assumptions about what look like financial transactions in multiple currencies that are accumulated and then indexed against customer details to get the most active customer list by transaction value. I can't think why that looked like a DB task :-)

        But of course you are right. I won't be putting my web or squid logs into a DB anytime soon, although I do rotate them daily and use YAML to serialize the data we parse out into flat files so we can get it back as required (rarely, and for static HTML anyway).

        As always, tools and tasks...

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        I agree. I have done match/merge processing with large lists much faster in flat files with unix commands than loading them into Oracle, for example, and trying to do the processing there. The data was very thin, less than 10 columns, but fairly large, about 1M records.
      Hi tachyon, thanks for your suggestion of the database approach! And yes, I do have access to a Sybase database.

      Unfortunately I can't load them into the database. All these files come from who-knows-what systems, and I have no control over their creation (and I don't want to know, either). I specifically avoided loading them into a database because I was told not to.

      Sorry, I forgot to mention that I am only allowed to work on the flat files, and the extraction is run only once per day. Why can't these data be loaded into a database in the first place? I don't know. That's probably an old business decision.

        Are you not allowed to use any DB, or is the decision just to not allow you to use the company Sybase, so you don't bother the DBA? If you use the DBI with DBD::SQLite there will be no need for a DB server (it's included in the module) and the whole DB will be a single file. This removes any administration, the DBA is happy, your boss should be happy too, and you gain scalability and the convenience of using SQL to work on your data.

        SQLite is really pretty fast, and ideally suited for this kind of single-user application. Just check that the DB file doesn't grow past your OS's file size limit, if there is one, and you should be fine.
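        A minimal sketch of how the DBD::SQLite route might look (the file name, table layout, and placeholder data are assumptions):

        use strict;
        use warnings;
        use DBI;

        # A single-file, serverless database - nothing for a DBA to administer.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=balances.db', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );

        $dbh->do( 'CREATE TABLE balances ( cust TEXT PRIMARY KEY, bal REAL )' );

        # Load the aggregated balances (placeholder data stands in for the real hash).
        my %balances = ( cust_a => 1_234.56, cust_b => 789.01 );
        my $sth = $dbh->prepare( 'INSERT INTO balances ( cust, bal ) VALUES ( ?, ? )' );
        $sth->execute( $_, $balances{$_} ) for keys %balances;
        $dbh->commit;

        # The top 30 then becomes a one-line query.
        my $top30 = $dbh->selectall_arrayref(
            'SELECT cust, bal FROM balances ORDER BY bal DESC LIMIT 30'
        );

        $dbh->disconnect;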

        It might be time to rethink that business decision. Anyway you can save some memory if you think about what you are doing.

        # get the customer => balance hash mapping - if you can't RDBMS you are
        # stuck with this
        $cust_bal = ......

        # now all we are interested in is the top 30
        # we don't need to sort them all using a Schwartzian
        # we only need to sort the values. This saves us a lot of memory as
        # [ cust, bal ] will take up a lot more space than just bal
        # we can also exit at this point and write a new temp flat file
        # with all the cust => bal mappings and this reclaims the memory.
        # either way we just want those top 30 so just sort the balances:
        my @bals = sort { $b <=> $a } values %$cust_bal;

        # we are only interested in the top 30, which is a balance greater than....
        my $gt = $bals[30];

        # so now we iterate over our hash again (or the flat file)
        my $top_30;
        for my $cust ( keys %$cust_bal ) {
            next unless $cust_bal->{$cust} > $gt;
            $top_30->{$cust} = $cust_bal->{$cust};
        }
        # now the top_30 are in a hash ready to be sorted for mapping to cust
        # details and output

        This will save you roughly 25%, and up to 50% (by using a temp file), of the memory that the original algorithm used, which might get you home. If not, you will have to tie your hash to a file. dbmopen might be the DB you have when you're not having a DB :-) If you can't do it in memory and have to tie to disk it will be really slow, but I guess you have got all day :-)

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        I vote for the database as well, even if you have to install MySQL yourself. But if you must, you can try something like this.

        #!/usr/bin/perl
        #
        # foo - called by bar. receives files to search on its
        #       input, calculates top N, retrieves the details for
        #       the top N and prints them to STDOUT in some easily
        #       parseable format
        #
        use strict;
        use warnings;

        $|++;

        my @files = <>;
        chomp @files;

        for (@files) {
            # process files
            warn "processing $_$/";
        }

        my @top = (                      # get top N
            [ a => 10 ],
            [ b => 5 ],
            [ c => 3 ],
        );

        my %detail = (                   # and detail info
            a => [ 1, 1, 4, 2, 1, 1 ],
            b => [ 1, 2, 2 ],
            c => [ 1, 2 ],
        );

        for (@top) {
            printf "%s %d$/", @$_;
            printf "@{$detail{$_->[0]}}$/";
        }

        exit;

        #!/usr/bin/perl
        #
        # bar - calls foo as a child process and collects tidbits
        #       of data for future use. foo's memory will go back
        #       to the system soon after the answers are read.
        #
        use strict;
        use warnings;
        use IPC::Open2;

        my @files = qw( xxx xxy xxz );
        my ($read, $write, $child);

        $child = open2($read, $write, './foo');
        print $write $_,$/ for @files;
        close $write;

        my @info = <$read>;
        close $read;
        waitpid $child, 0;

        print @info;
Re: Force perl to release memory back to the operating system
by Abigail-II (Bishop) on Sep 25, 2003 at 07:42 UTC
    Whether or not a process can give back memory to the OS depends on the OS. And, AFAIK, on some OSes, Perl does give back memory to the OS.

    In your case, however, you are trying to fix your problem in the wrong way. There's no need to slurp all the data into memory if you only want to keep track of the top 30.

    Here's an outline of a better algorithm:

    Read in the first 30 records and sort them. Then process the other records one by one. If the record is larger than the smallest of the 30 records so far, remove the smallest and add the newest (keeping the list in order).
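    A rough sketch of that outline in Perl (the record format is made up, and note that it assumes each record already carries a per-customer total rather than individual transactions):

    use strict;
    use warnings;

    my $n = 30;
    my @top;                        # [ customer, balance ] pairs, smallest balance first

    while ( my $line = <> ) {
        chomp $line;
        my ( $cust, $bal ) = split /,/, $line;          # assumed record layout

        if ( @top < $n ) {
            push @top, [ $cust, $bal ];
            @top = sort { $a->[1] <=> $b->[1] } @top;   # keep the short list in order
        }
        elsif ( $bal > $top[0][1] ) {
            shift @top;                                 # drop the current smallest
            push @top, [ $cust, $bal ];
            @top = sort { $a->[1] <=> $b->[1] } @top;
        }
    }

    printf "%-20s %15.2f\n", @$_ for reverse @top;      # largest first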

    Abigail

      That's a damned good idea: a much better implementation, as long as there is no record aggregation needed prior to sorting.
Re: Force perl to release memory...(use less and sort faster too!)
by BrowserUk (Patriarch) on Sep 25, 2003 at 10:20 UTC

    Rather than trying to force the release of memory back to the OS (which does happen naturally under the right circumstances, as I demonstrated here and here), it would be better to avoid using it in the first place. Sorting 20,000,000 elements is going to take huge amounts of additional memory over and above that used to just store the data, especially when using the ST (Schwartzian transform).

    As you only require the top N values, you can sort the data in chunks, adding the current top N to the chunk each time and then extracting the new top N. In the end you have your required result, but any given sort is operating on a much smaller set, and the memory consumed sorting that set is available for re-use in the next sort. Memory growth is minimised.

    Finding the balance between memory consumption and speed requires a little experimentation, but it would appear that there is little performance gain to be had by using 100 sorts of 10,000+N elements over 1,000 sorts of 1,000+N, and the latter uses far less memory.

    #! perl -slw
    use strict;
    use vars qw[ $SIZE $N $CHUNK ];

    $SIZE  ||= 100_000;
    $N     ||= 30;
    $CHUNK ||= 100;

    srand( 1 );

    sub getTopN {
        my( $aref, $n ) = @_;
        my @top = @{ $aref }[ 0 .. ( $n - 1 ) ];

        for( my $p = $CHUNK; $p < @$aref; $p += $CHUNK ) {
            my $end = $p + $CHUNK < @$aref ? $p + $CHUNK : $#{ $aref };
            @top = ( sort{ $b <=> $a } @top, @{ $aref }[ $p .. $end ] )[ 0 .. ( $n - 1 ) ];
        }
        return @top;
    }

    print scalar localtime;

    my @numbers;
    $#numbers = $SIZE;
    $numbers[ $_ ] = int rand $SIZE for 0 .. $SIZE;

    print scalar localtime;

    print "@{[ getTopN( \@numbers, $N ) ]}";

    print scalar localtime;

    Collating some results for different size arrays and chunk sizes, it shows that using this method, the sort times scale fairly linearly with array size and using a chunk size of 1000 gives about the best results, though you might want to investigate chunk sizes between the 1000 and 10,000 I tested.

    Array size | gen.(s) | sort(s) | sort(s) | sort(s) | sort(s)
               |         |  (30)   |  (100)  | (1000)  | (10000)
    -----------+---------+---------+---------+---------+---------
       100_000 |    1    |    4    |    3    |    2    |    2
     1_000_000 |   12    |   55    |   38    |   33    |   36
     2_000_000 |   29    |  138    |  102    |   88    |   93

    By way of comparison, sorting the 1_000_000 numbers using a straight numeric sort takes 154 seconds which compares very unfavourably with the 33 seconds it takes to sort them in 1,000 passes of 1,030 elements at a time.

    Some (truncated) raw results


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      (@_@) Thanks brother! This technique looks great. I will think about how to apply this technique to sort my %balances hash table tomorrow. I am learning something new every day!
Re: Force perl to release memory back to the operating system
by dws (Chancellor) on Sep 25, 2003 at 08:06 UTC
    On the assumption that an RDBMS isn't in the cards, let's look at where you're chewing through memory, and see if we can't figure out a way to get by with less.
    $balances{$cust} += $bal * $fx_rates->{$ccy};
    is central to what you're doing. Unless there's some heuristic you can apply to exclude some customers (i.e., "forget these, since they'll never be anywhere near the top N"), you may be stuck here. How many customers do you have?

    Moving on,

    my @sorted_cust_list = reverse map {$_->[0]} sort {$a->[1]<=>$b->[1]} map{[$_,$balances{$_}]}keys %balances;
    is the second hit, and it's a big one. First, keys %balances builds a big array, then you build another one via sort, then another via reverse. (Not all of the arrays are live all the way through the pipeline, but still...) There's a way around this, though. Instead of using a Schwartzian transform, make a low-footprint pass through the hash, using
    while ( my($cust,$balance) = each %balances ) { ... }
    and keep a smaller data structure that records the top N customers by balance. (I'll leave the choice of algorithm as an exercise for the motivated reader.) This is a bit more work, but it might run faster given the reduction in memory demands and attendant reduction in swapping/thrashing.
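    One possible shape for that low-footprint pass, with placeholder data standing in for the real %balances (the bookkeeping here is just one way to do it, not the only one):

    use strict;
    use warnings;

    my $n = 30;
    my %balances = map { "cust$_" => rand 1_000_000 } 1 .. 10_000;   # stand-in data

    my @top;    # up to $n [ customer, balance ] pairs, smallest balance first

    while ( my ( $cust, $balance ) = each %balances ) {
        next if @top == $n and $balance <= $top[0][1];   # can't displace anything
        push @top, [ $cust, $balance ];
        @top = sort { $a->[1] <=> $b->[1] } @top;        # $n is tiny, so this is cheap
        shift @top if @top > $n;                         # keep only the best $n
    }

    my %topN = map { $_->[0] => $_->[1] } @top;          # same shape as the original %topNth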

    At this point, the big memory sink, assuming you have lots of customers, is %balances. You can undef this at the end of the subroutine, but, as far as I know, this won't result in Perl releasing memory back to the OS (and it goes away at the end of the routine anyway). The last time I looked into it, which was a few releases of Perl ago, Perl indirectly used a "raise the watermark if necessary" memory allocation strategy, with no provision for ever lowering the watermark.

      The first part of the collection process -
      $balances{$cust} += $bal * $fx_rates->{$ccy};
      calculates the total balance for each customer, where each customer can have transactions in multiple currencies. I have thought about applying certain heuristics; unfortunately I can't make assumptions about customer behaviour (not to mention the consequences of producing an approximate report).

      I am stuck with storing balances in memory. I don't want to export these balances into temporary files, because that would introduce significant penalties for writing 10+ million individual accounts to disk and reading them back in again. I will get rid of the memory-hungry Schwartzian transform and use a sorting technique that balances memory usage and speed.

      At the end of the selection process, I will have a list of the top 30 customers. Come to think of it, I will probably join the 30 customers' info into a string of comma-separated values and restart the perl script with the exec { $^X, $0, @ARGV } technique.
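      A hypothetical sketch of that pack-and-exec idea (the argument name and packing format are made up, and it assumes customer names contain no commas or equals signs):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Hand the top-30 hash to a fresh copy of this script on its command
      # line, so the bloated parsing process simply goes away.
      my %top30 = ( cust_a => 1_234.56, cust_b => 789.01 );   # placeholder data

      if ( @ARGV and $ARGV[0] eq '--top30' ) {
          # Second run: rebuild the hash from the packed argument and carry on.
          %top30 = map { split /=/, $_, 2 } split /,/, $ARGV[1];
          # ... GetCustomerDetails(...) would run from here in the fresh process ...
      }
      else {
          # First run: pack the hash and replace this process with a new perl.
          my $packed = join ',', map { "$_=$top30{$_}" } keys %top30;
          exec( $^X, $0, '--top30', $packed ) or die "exec failed: $!";
      }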

      I will go to sleep tonight with these ideas in my mind, I will probably get enlightened in my dreams. :-)

      And thanks again for your suggestions.
        Have you thought about using a tied hash? DB_File would work perfectly since you are using a large hash that is bigger than RAM. The Berkeley DB library handles all the nasty details of saving the data that won't fit in memory. In addition, DB_File supports in-memory databases that spill to disk as needed, so you wouldn't even have to worry about creating and deleting a temp file.

        Since you are selecting a limited number of the highest values, keeping a list of the highest values seen so far while scanning through the entire hash will be much more efficient than operating on everything at once.

Re: Force perl to release memory back to the operating system
by Beechbone (Friar) on Sep 25, 2003 at 07:59 UTC
    You are right, fork()ing will not release memory. But think about this:
    use Data::Dumper;

    @ARGV = ( '--restart', Data::Dumper::Dumper(\%whatever) );
    exec( $^X, $0, @ARGV );

    RESUME:   # only if you really need to!
    and at the beginning of your program:
    if ($ARGV[0] eq '--restart') {
        %whatever = %{ eval $ARGV[1] };
        goto RESUME;   # only if you really need to!
    }
    Or you can use a temporary file to store your variables. There is not that much space in @ARGV...
      This looks like a great technique! I will keep that in mind tomorrow when I am back at the office. Even if I am not going to do this in my current project, I will certainly try it out later!
Re: Force perl to release memory back to the operating system
by demerphq (Chancellor) on Sep 25, 2003 at 11:59 UTC

    It looks like the problem is that %balances is growing unwieldy. So the simplest solution is to tie the hash to an on-disk representation. You could do this via DB_File, for instance, and only have to change two or three lines in your code.

    You could also use an RDBMS for this, but given what I've read so far I don't see the point, and this may not be an option for you.
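    A sketch of what those two or three extra lines might look like with DB_File (the scratch file name is made up):

    use Fcntl;
    use DB_File;

    my %balances;
    tie %balances, 'DB_File', 'balances.dbf', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie balances.dbf: $!";

    # ... build up $balances{$cust} += $bal * $fx_rates->{$ccy} exactly as before;
    #     the hash now lives on disk instead of in the process's memory ...

    untie %balances;
    unlink 'balances.dbf';    # throw the scratch file away when finished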


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


Design issue
by DentArthurDent (Monk) on Sep 25, 2003 at 13:13 UTC
    I think your problem is one of design. Rather than trying to save all the records and sorting them, just iterate through all of them, keeping only the ones with the thirty highest balances, if all you want to do is find the top clients.
      You missed part of the point: The data sought is the top thirty customers by total, where each customer has multiple transactions. There's no way to know that the last transaction in the file, plus something you've already discarded as inadequate, isn't enough to put a customer back in the top 30.

      The previous suggestions of sorting address this neatly. I was hoping to have something clever to add, but as usual the more prompt monks have covered this one thoroughly.

      --
      Spring: Forces, Coiled Again!
Re: Force perl to release memory back to the operating system
by Roger (Parson) on Sep 25, 2003 at 23:52 UTC
    Oh brothers! Thank you very much everyone for offering your great advice!

    I took the advice and applied some very basic assumptions to reduce the data set first.

    My new assumption is that the top 30 customers will most likely have balances greater than 0.5 billion dollars. So as a first step, I added the following code to reduce my data set, having previously obtained a hash table of customers & balances.
    while (...) {
        $balances{$cust} += $bal * $fx_rates->{$ccy};
    }
    ...
    # Find the customers whose total balance is greater than $500000000.00
    my %top_balances;
    my $top_threshold = 510000000.00;   # default threshold $0.5 bil
    my $top_count;                      # keep a count of how many found
    do {
        # lower threshold by 10 million every time
        $top_threshold -= 10000000.00;
        die("This is impossible, the data has errors!")
            if ($top_threshold < 0.00);

        $top_count = 0;
        while ( my ($cust, $balance) = each %balances ) {
            if ($balance > $top_threshold) {
                $top_balances{$cust} = $balance;
                $top_count++;
            }
        }
    } while ($top_count < $nth);

    # Safety check - how big is our list? Is it > n?
    if ($VERBOSE) {
        print "$top_count customers have ",
              "balances greater than \$${top_threshold}.\n";
    }

    # Sort only the top customers by descending order of balance
    my @sorted_cust_list = reverse
                           map  { $_->[0] }
                           sort { $a->[1] <=> $b->[1] }
                           map  { [$_, $top_balances{$_}] }
                           keys %top_balances;
    As it turned out, the %balances structure occupied less than 400Mb in memory. It was the Schwartzian transform on the entire %balances table that thrashed the machine.

    When I ran the new code, it showed that the reduced set %top_balances contained fewer than 100 customers. So I was able to keep the Schwartzian transform, knowing that I only had to sort about 100 customers in the reduced set.

    As it turned out, I didn't have to fork/restart/reclaim perl memory at all. The modified script was also able to finish everything (including the extraction of the top 30 customer details) in under 6 minutes! In comparison, it used to run for more than 20 minutes and was killing the machine.

    A big THANK YOU to everyone for the great help!
      You have accounts receivable of 6,000,000,000 dollars and you're not using a database to store transactions? Which cartel do you work for?
