System call doesn't work when there is a large amount of data in a hash

by Nicolasd (Acolyte)
on Apr 28, 2020 at 19:13 UTC

Nicolasd has asked for the wisdom of the Perl Monks concerning the following question:

I have a script where I need to do a system call. But when a lot of virtual memory (250 GB) is used by storing data in a hash, the system call doesn't work, although there is still 250 GB of RAM available.

I tried many different system calls; all work in low-memory jobs, none work in high-memory jobs. Any idea how this can be resolved? Thanks.

Replies are listed 'Best First'.
Re: System call doesn't work when there is a large amount of data in a hash
by swampyankee (Parson) on Apr 28, 2020 at 20:55 UTC

    I think you are going to need to supply more information for any of the monks here to provide a useful answer, like the version of Perl, the O/S you are using, and which system call is being made.

    Be forewarned that I'll not be the only Monk asking: why do you have 250 GB of data in a hash?


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

      Thanks for the reply. The Perl version is v5.26.2 and the O/S is CentOS 7. I need these large hashes to store genetic data; it's for a genome assembly tool. I want to add a new module, but I need a system call for that, and I can't get it to work when I run it on large datasets. I am not an informatician, so I have limited knowledge. Any help would be greatly appreciated. https://github.com/ndierckx/NOVOPlasty

        Hi,

        " I need these large hashes to store genetic data in a hash"

        That's a bit like saying " I need these hashes because I need these hashes."

        See:

        Also see:

        Does your "genome assembly tool" accept Perl data hashes as input? Of course it does not. Therefore you must be somehow serializing your massive input to the program in your system call. Perhaps you need to write a file, or provide a data stream to a server? As noted by my learned colleague swampyankee, it's hard to conceive of why you need to store 250 GB of data in an in-memory hash. There are myriad techniques to avoid doing so, depending on your context; why don't you explain a bit more about that, and show some code?

        Hope this helps!


        The way forward always starts with a minimal test.

        I'm not a bioinformatician either, but that repo has some problems: filenames using the ':' character, and a single Perl file over 1 MB with more than 23K lines, a quick glance at which shows room for improvement. I'm not sure if part of the relatively popular BioPerl suite of tools can address your requirements. Regardless, all of this is good advice. You don't need to store everything in memory, even if you are just planning to call some external command-line tool. Consider an alternative such as a database.

Re: System call doesn't work when there is a large amount of data in a hash
by aitap (Curate) on Apr 29, 2020 at 14:48 UTC
    But when there is a lot of virtual memory (250 GB) used by storing it in a hash, the system call doesn't work

    This may have to do with the operating system you are using (you said it's CentOS 7) and the way it is tuned. You see, the only way to run a program on most Unix-like systems is to fork() a copy of the existing process, then exec() inside the child process to replace the currently running image with a different one. (There is another way involving the vfork() system call, which freezes the parent and makes it undefined behaviour to do almost anything in the child before the exec(), but almost no-one uses it except some implementations of the posix_spawn() standard library function.) Copying the entire 250 GB address space of a process just to throw it away on the following exec() would be wasteful, so when fork() happens, the Linux kernel makes the child process refer to the same physical memory that the parent uses, only making a copy when one of the processes tries to change the contents ("copy on write").
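    For reference, the fork()/exec() pattern that system() relies on looks roughly like this in Perl (a simplified sketch; the real system() also handles shell metacharacters, signal dispositions and so on):

    use strict;
    use warnings;

    my $pid = fork();                # the child initially shares our pages copy-on-write
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {               # child process
        exec('echo', 'hello')        # replace the child's image with the new program
            or die "exec failed: $!";
    }

    waitpid($pid, 0);                # parent waits for the child to finish
    print "child exited with status ", $? >> 8, "\n";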

    This optimisation makes it possible to fork() processes occupying more than 50% of the memory, at the same time introducing a way to break the promise of the allocated process memory: now if both parent and child try to use all of their rightfully allocated (or inherited) address space, the system will run out of physical memory and will have to start swapping. Some people disable this behaviour because they prefer some memory allocation requests (including allocating memory for a fork() of a large process) to fail instead of letting them be paged out or killed by OOM-killer. What is the value of overcommit settings on the machine you are running this code on?
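    If you want to check from Perl rather than a shell, a quick sketch that reads the relevant Linux knobs:

    use strict;
    use warnings;

    # print the current overcommit settings (Linux-specific /proc paths)
    for my $knob (qw( overcommit_memory overcommit_ratio )) {
        open my $fh, '<', "/proc/sys/vm/$knob" or next;
        chomp( my $value = <$fh> );
        print "$knob = $value\n";
        close $fh;
    }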

    There is a kludge you can use to work around this behaviour: at the beginning of the program, fork() a child process that never does anything besides reading command lines to launch over a pipe from the parent and feeding them to system. This way, the parent stays small enough to have a good chance fork() succeeding, even after the grand-parent grows huge.
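    A minimal sketch of that kludge in plain Perl might look like the following (the QUIT marker and the echo command are only illustrative, and a real helper would also want to send $? back over a second pipe, as the MCE-based replies below do):

    use strict;
    use warnings;
    use IO::Handle;                       # for autoflush on the lexical filehandle

    pipe( my $reader, my $writer ) or die "pipe failed: $!";

    my $pid = fork();                     # fork the helper while the process is still small
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {                    # child: reads command lines, runs them
        close $writer;
        while ( my $cmd = <$reader> ) {
            chomp $cmd;
            last if $cmd eq 'QUIT';
            system($cmd);                 # executed from the small helper process
        }
        exit 0;
    }

    close $reader;
    $writer->autoflush(1);

    # ... the parent now builds its huge hash ...
    my %big_hash;

    # later, hand command lines to the helper instead of calling system() directly
    print {$writer} "echo Hello from the helper\n";

    print {$writer} "QUIT\n";             # tell the helper to stop, then reap it
    close $writer;
    waitpid($pid, 0);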

      Thanks so much! What you explained seems to be the problem. I just tested a script (this is also the answer to a previous comment), and it seems to verify what you explained.

      I tried this script and it worked fine on my laptop: I put 12 GB of the 16 GB available in a hash and the system call still works.

      I did get varying results on the CentOS 7 machine (450 GB of RAM); I also monitored it with top to see if there was a memory increase.

      20 GB, 50 GB, 100 GB, 150 GB and 200 GB all worked fine, and I didn't see any memory increase either.

      But with 230 GB (more than half of the available RAM) I ran out of memory (Cannot allocate memory), so it seems I need as much free memory as there is in the hash.

      I also looped the system call 10 times, and the bigger the hash, the slower the system call starts.

      Now I have to try to figure out your suggestion; it's not easy to understand as a non-informatician :)

      overcommit is set to 0 I think; I checked it like this: less /proc/sys/vm/overcommit_memory
      The problem is that the tool has a lot of users on GitHub, so I have to keep in mind that the usage has to be straightforward.

      So should I look into vfork() or into the last suggestion you gave?

      Thanks again for the help; the new tool is to be used for research on genetic disorders in children, so it's for a good cause!

        overcommit is set to 0 I think, I checked it like this:less /proc/sys/vm/overcommit_memory
        Huh, I thought it was 2. I bet your desktop also has 0 there, but for some reason it works there. I am not sure what other settings could influence this behaviour. You could set it to 1 if you have root access and it may even help, but at the cost of potentially summoning OOM-Killer later.
        So I should look in to vfork() or in the last suggestion you gave?
        There is POSIX::RT::Spawn that might use vfork() under the hood. Try it first. Creating your own child spawn helper is harder, but you could copy the code from Bidirectional Communication with Yourself and start from there. Both options are specific to *nix-like systems and should be avoided if $^O eq 'MSWin32' at least.
Re: System call doesn't work when there is a large amount of data in a hash
by marioroy (Parson) on Apr 29, 2020 at 17:19 UTC

    Hi Nicolasd,

    One may spin up a worker early and communicate via a channel. The worker makes the system call and notifies once completed. The 'Simple' channel is specified due to just one background worker (i.e. no mutex locking needed). Otherwise, 'Mutex' is the default when not specified. Lastly, this works on Unix OSes and Windows.

    Update: Pass back the status. See system on PerlDocs for how to inspect the status.
    Update: Including error message.

    use strict;
    use warnings;

    use MCE::Child;
    use MCE::Channel;

    my $chnl = MCE::Channel->new( impl => 'Simple' );

    # spin up worker early before creating big hash
    mce_child {
        while ( my ($cmd, @args) = $chnl->recv ) {
            local ($?, $!);
            system($cmd, @args);
            $chnl->send2($?, $!);
        }
    };

    # create big hash
    my %big_hash;

    my ($status, $errmsg);

    # pass command and optionally args to worker
    $chnl->send('ls');
    ($status, $errmsg) = $chnl->recv2;

    # ditto, sleep for 2 seconds
    $chnl->send('sleep', '2');
    ($status, $errmsg) = $chnl->recv2;

    # notify no more work, then reap worker
    $chnl->end;
    MCE::Child->waitall;

    The background worker waits for the next system call to make. Waiting involves no CPU time. There is no hash copy either, because the worker is spun up early.

    See also MCE::Child on MetaCPAN.

    Regards, Mario

      Hi again,

      I made an improved version in which the worker also sends back the error message. Replace system with syscmd in your script; syscmd sends the command and any arguments to the background worker, which runs the command and sends the status and error message back to the main process.

      syscmd:

      use strict;
      use warnings;

      use MCE::Child;
      use MCE::Channel;

      my $chnl = MCE::Channel->new( impl => 'Simple' );

      # spin up worker early before creating big hash
      mce_child {
          local $SIG{__WARN__} = sub {};
          while ( my ($cmd, @args) = $chnl->recv ) {
              local ($?, $!);
              system($cmd, @args);
              $chnl->send2($?, $!);
          }
      };

      sub syscmd {
          my $cmd = shift;
          return unless $cmd;

          $chnl->send($cmd, @_);
          my ($status, $errmsg) = $chnl->recv2;

          if ($status == -1) {
              print "SYSTEM: failed to execute ($cmd): $errmsg\n";
          }
          elsif ($status & 127) {
              printf "SYSTEM: $cmd died with signal %s, %s coredump\n",
                  ($status & 127), ($status & 128) ? 'with' : 'without';
          }
          else {
              printf "SYSTEM: $cmd exited with status %d\n", $status >> 8;
          }
      }

      # create big hash
      my %big_hash;

      # pass command and optionally args
      syscmd('ls');

      # attempt to run a command not found
      syscmd('something');

      # sleep for 2 seconds
      syscmd('sleep', '2');

      # notify no more work, then reap worker
      $chnl->end;
      MCE::Child->waitall;

      output:

      (ls output from syscmd)
      SYSTEM: ls exited with status 0
      SYSTEM: failed to execute (something): No such file or directory
      SYSTEM: sleep exited with status 0

      Regards, Mario

        Using the memory gobbling technique by Corion here, let's do some testing. I divided by 2 to reflect the actual memory consumption desired.

        My virtual CentOS 7 machine has 4 GB of RAM allocated to it. The script creates a 3 GB scalar value to consume 75% of it. The system command called by the main process fails (no ls output the second time), while calling syscmd succeeds because the background worker was spun up early.

        syscmd

        use strict;
        use warnings;

        use MCE::Child;
        use MCE::Channel;

        my $chnl = MCE::Channel->new( impl => 'Simple' );

        # spin up worker early before creating big hash
        mce_child {
            local $SIG{__WARN__} = sub {};
            while ( my ($cmd, @args) = $chnl->recv ) {
                local ($?, $!);
                system($cmd, @args);
                $chnl->send2($?, $!);
            }
        };

        sub syscmd {
            my $cmd = shift;
            return unless $cmd;

            $chnl->send($cmd, @_);
            my ($status, $errmsg) = $chnl->recv2;

            if ($status == -1) {
                print "SYSTEM: failed to execute ($cmd): $errmsg\n";
            }
            elsif ($status & 127) {
                printf "SYSTEM: $cmd died with signal %s, %s coredump\n",
                    ($status & 127), ($status & 128) ? 'with' : 'without';
            }
            else {
                printf "SYSTEM: $cmd exited with status %d\n", $status >> 8;
            }
        }

        # My CentOS VM has 4 GB of RAM
        # create big hash
        my $memory_eaten = 3 * 1024*1024*1024 / 2;   # 3 GB, adjust to fit
        my %memory_eater = (
            foo => scalar( ' ' x $memory_eaten ),
        );

        # pass command and optionally args
        syscmd('ls');     # this one works; see status that it succeeded
        system('ls');     # this one fails; no ls output the 2nd time

        # attempt to run a command not found
        syscmd('something');

        # sleep for 2 seconds
        syscmd('sleep', '2');

        # busy loop, see top output in another terminal
        # notice the memory consumption (i.e. RES)
        # press Ctrl-C to exit or let it finish
        1 for 1..3e8;

        # notify no more work, then reap worker
        $chnl->end;
        MCE::Child->waitall;

        output:

        (ls output from syscmd)
        SYSTEM: ls exited with status 0
        SYSTEM: failed to execute (something): No such file or directory
        SYSTEM: sleep exited with status 0

        Regards, Mario

Re: System call doesn't work when there is a large amount of data in a hash
by bliako (Prior) on Apr 29, 2020 at 12:26 UTC

    What is a "system call"? And how do you pass in-memory data (the hash) to said system call?

    Do you mean something like:

    my %hash = ( a => 1, b => 2 );
    system("echo", $hash{a}) == 0 or die;

    or something like:

    use File::Copy;
    my %hash = ( a => 1, b => 2 );
    copy($hash{a}, $hash{b}) or die $!;
      The hash and the system call are in the same script, but they are not directly related.

      The hash is data from large genomic data files that have to be accessed very fast.

      But once the hash is loaded, a system call doesn't work. I need system("blastn ...."), but even system("echo Hello") does not work. It does work when I run it on a small dataset (where the hash takes 10 GB of RAM). qx/$command/ doesn't work either.

      I am testing what the limit on the hash size is for it to work, but I don't understand why a system call doesn't work when I have a large hash in memory.

        Searching your repo, I get no hit for 'system' or 'qx'. Despite repeated requests that you take the time to explain what you are doing, there seems to be no real question here that anyone can realistically help with, beyond the advice you have had to date. Regarding "...does not work either...": see Tutorials -> Debugging and Optimization -> Basic debugging checklist.

        It is very unlikely that blastn takes a GB-size sequence as input from the command line! Most likely what the command expects from you is the name of the file which holds that huge data.

        So, in all likelihood, you must 1) write your hash to a file if it is indeed in the memory of the Perl script (and not in a file already!) and then 2) make the "system call" and provide the filename to it, as part of the command arguments.

        Make sure, if the expected output is huge, to instruct blastn to write its output to a file. Do not read it back from the output of the command (stdout)! Perhaps use the -o outfile option, or simply redirect your command to a file, which is not an elegant solution if you are doing it via Perl's system command.
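        A minimal sketch of that two-step approach, assuming a hypothetical FASTA layout, a database called my_database and an output filename (adapt the blastn arguments to your actual setup):

        use strict;
        use warnings;
        use File::Temp qw(tempfile);

        my %hash = ( seq1 => 'ACGTACGT', seq2 => 'TTGCATTA' );   # your sequences

        # 1) write the hash out as a FASTA file
        my ( $fh, $query_file ) = tempfile( SUFFIX => '.fasta', UNLINK => 1 );
        for my $id ( sort keys %hash ) {
            print {$fh} ">$id\n$hash{$id}\n";
        }
        close $fh;

        # 2) call blastn with file names only; let it write its own output file
        my $out_file = 'blast_results.txt';
        system( 'blastn', '-query', $query_file, '-db', 'my_database', '-out', $out_file ) == 0
            or die "blastn failed: $?";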

        The above procedure is acceptable if you create/calculate/transform that hash in the Perl script. Just to make sure: if you merely read the hash from a file, do not change it in any way, and then run blastn on it (which implies writing it back to a file, as I recommend above), then you are doing something wrong.

        Since you have a lot of RAM available, it is worth investigating either storing the data on a RAM disk (which you have to create first; in fact all your data could go there, including temporary files), OR using memory-mapped files; perhaps read up on File::Map.
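        For instance, a minimal sketch with File::Map (the filename and pattern are illustrative):

        use strict;
        use warnings;
        use File::Map qw(map_file);

        map_file my $genome, 'reads.fasta';      # $genome now views the file's bytes

        # scan the mapped data like any other Perl string, without slurping a copy
        my $count = () = $genome =~ /ACGT/g;
        print "ACGT occurs $count times\n";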

        bliako

Re: System call doesn't work when there is a large amount of data in a hash
by 1nickt (Abbot) on Apr 29, 2020 at 13:02 UTC

    OK, this must be said.

    You came to the monastery with a (badly-formed) question. Half a dozen monks each with decades of experience have pointed out that your basic premise is Wrong. Yet in every response you cling to that premise (without explaining the reasons for your misguided belief). I, for one, am finished trying to help you; as a matter of fact I don't think you will get any help here that will be of use to you until you open your mind to the possibility that you have made some incorrect assumptions.

    "I have script where I need to do a system call ... the system call doesn't work"

    Later ...

    "it's a bit of a mess, but it works great so that's the most important"

    🙄


    The way forward always starts with a minimal test.
      I don't understand your hostility. I just wanted to know whether a system call copies all the memory of the hash, and if there is an alternative for that. Some answers were useful, and I could indeed formulate the question better now, but there is no reason to become hostile.

      If you don't know the answer, that is OK.

Re: System call doesn't work when there is a large amount of data in a hash
by roboticus (Chancellor) on Apr 30, 2020 at 20:30 UTC

    Nicolasd:

    It looks like you're already getting some help on solving the problem you posted, so I won't elaborate on that.

    However, your program looks like it could be fun to play around in and try to optimize a bit. To that end, I'd like to run the program, but I don't know enough about your field or the terminology to be able to figure out how to come up with a configuration file that will actually run and do something. Can you post a few simple config files that set up some simple runs using the test dataset you provided? If you can do that, I may be able to do some tweaking on your program to improve things a bit, and send a few pull requests your way.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hi,

      Improvements are always welcome! Respect if you can read that huge file of messy code! :)
      I did make a new version with some comments in the code; should I upload that one? It's maybe a tiny bit clearer.
      The test datasets also have config files that are ready to use (feel free to ask any additional questions).

      Those test datasets are very small, so they will run fast, but most users will have very large datasets (and thus large hashes).
      Loading all the data (which can be around 600 GB of raw data) into the hashes is relatively slow, but I'm not sure if much improvement is possible there.
      A huge improvement would be parallelisation of the code after loading the hashes. I tried a few methods, but they either slowed down the process or were impossible because they would duplicate the hash (a similar problem as before).

      I know nobody who knows Perl, so I am the only one who has looked at the code. Help is always welcome, and if you see something that would greatly improve the speed or memory efficiency, I can add you to the next paper. To make improvements in what the tool does, I think you need a genetics background.

      Greets

        Hi again,

        I'll just suggest once more that you let go of the idea that you must load all your data into an in-memory hash in order for your program to be fast. For one very fast approach please look at mce_map_f in MCE::Map (also by the learned marioroy) which is written especially for optimized parallel processing of huge files.
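        For illustration, a minimal sketch using mce_map_f on a hypothetical reads.fasta, counting lines that contain 'ACGT' (adapt the block to whatever per-line work your tool actually needs):

        use strict;
        use warnings;
        use MCE::Map;

        # each line of the file is passed to the block as $_, processed in parallel chunks
        my @flags = mce_map_f { /ACGT/ ? 1 : 0 } 'reads.fasta';

        my $total = 0;
        $total += $_ for @flags;
        print "lines containing ACGT: $total\n";

        MCE::Map->finish;    # shut down the worker pool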

        (As an aside, have you profiled your code? I would think that Perl could load data from anywhere (file, database, whatever) faster than a shell call to an external analytical program would return ... or does your program not expect a response?)

        As far as your finding that

        "parallelisation of the code after loading the hashes ... turned out slowing down the process or impossible because it would duplicate the hash"
        ... please see MCE::Shared::Hash.
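        A minimal sketch of a hash shared across workers via MCE::Shared (the keys, values and worker count are placeholders). Note the trade-off: the data lives in the shared-manager process, so nothing is duplicated into the children, but each get/set goes over IPC and is slower than a native hash lookup:

        use strict;
        use warnings;
        use MCE::Child;
        use MCE::Shared;

        my $shared = MCE::Shared->hash();      # data lives in the shared-manager process

        # populate from the parent
        $shared->set( seq1 => 'ACGTACGT' );
        $shared->set( seq2 => 'TTGCATTA' );

        # workers read the same hash instead of each getting a private copy
        my @workers = map {
            MCE::Child->create( sub {
                my $seq = $shared->get('seq1');
                return length $seq;            # some per-worker work
            } );
        } 1 .. 4;

        $_->join for @workers;                 # reap the workers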

        Hope this helps!


        The way forward always starts with a minimal test.
