http://qs321.pair.com?node_id=655381

siva kumar has asked for the wisdom of the Perl Monks concerning the following question:

I want to read the content of a huge file (4GB) and store it in a variable to pass to a generic function.
ls -lh /path/2024.sql
-rw-r--r-- 1 user group 609M Oct 17 20:49 2024.sql
I have used the code below, but got an "Out of memory" error.
open(FH,"/path/2024.sql"); $/=EOF; $var = <FH>; &callGenericfunction($var);
Even tried using Tie::File.
use Tie::File;
$filename = "/path/2024.sql";
my $obj = tie @array, "Tie::File", $filename;
$var = join " ", @array;
&callfunction($var);
Tried with buffering.
open (FH, "</path/2024.sql") or die "bah"; binmode FH; $buffer = 1024 * 1024 * 2; my $data = ""; while (read (FH,$data, $buffer) ){ $data .= $data; } &callfunction($var);

Replies are listed 'Best First'.
Re: Reading huge file content
by graff (Chancellor) on Dec 06, 2007 at 13:08 UTC
    Looks like you need to add more RAM to that machine. Holding all of a huge file in memory when you don't have enough RAM for it is kind of a non-starter.

    I'm confused -- you say the file is 4 GB, but you show a listing for a 609 MB file. In any case, if/when you have more memory on the machine, the first option is likely to work best, since it adds the smallest amount of storage overhead (assuming that Perl's internal storage handling can manage a single scalar whose length is close to, or possibly greater than, 2**32 -- I don't know).

    So perhaps a different/better question to ask is why pass such a huge amount of data as a parameter in a subroutine call? What is the sub supposed to do with that? (If, heaven forbid, it involves making a copy of the data, you may still have a problem.) Update: As indicated by moritz's reply below, just passing a scalar string as a subroutine arg will create a copy of that string in memory, so you will need a lot more RAM to do that; alternatively, the sub would at least need to accept a reference to a scalar string (but if you're going to change the sub, change it to accept a file handle or name instead...). End update.
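    For example, a minimal sketch of the pass-a-reference alternative (the sub name is made up here, and the sub has to be written to expect a reference):

    sub process_sql {
        my ($text_ref) = @_;               # receive a reference, not a copy
        my $length = length $$text_ref;    # dereference to reach the data
        print "got $length characters\n";
    }

    process_sql( \$var );                  # pass \$var instead of $var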

    Is the sub a piece of code that you wrote? If so, you should consider altering it so that it can use a file handle or file name as its input parameter, and have it handle the file reading in a reasonable way (so the whole file is not stored in memory at all). Or you need to reconsider your algorithm for achieving whatever it is you are trying to achieve. There is generally a way to break it down to work on portions of a large data set, and work around hardware limitations.
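    To illustrate the file-handle approach, something along these lines (an untested sketch; process_file and its per-line work stand in for whatever your real sub needs to do):

    sub process_file {
        my ($fh) = @_;                   # the sub takes an open file handle
        while ( my $line = <$fh> ) {     # only one line is in memory at a time
            # ... do the real work on $line here ...
        }
    }

    open my $fh, '<', '/path/2024.sql' or die "Cannot open: $!";
    process_file($fh);
    close $fh;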

Re: Reading huge file content
by moritz (Cardinal) on Dec 06, 2007 at 13:08 UTC
    Well, you try to read more than 4GB of data into memory, and then you copy it (perl subs use call-by-value, which makes a copy).

    Do you have 8GB of memory available? Do you really want to use that much?

      (perl subs use call-by-value, which makes a copy)

      I don't think this is true.

      report_mem( 'program start' );
      my $big = 'x' x 100_000_000;
      report_mem( 'after making big string' );
      take_big_arg( $big );

      sub take_big_arg {
          printf "got %d characters\n", length $_[0];
          report_mem( 'passed to function' );
      }

      sub report_mem {
          my ( $msg ) = @_;
          printf "%d %s\n", my_mem(), $msg;
      }

      sub my_mem {
          my ($proc_info) = grep { $_->[2] == $$ }
                            map  { [ split ] }
                            `ps l | tail -n +2`;
          return $proc_info->[6];
      }
      __END__
      13124 program start
      208444 after making big string
      got 100000000 characters
      208444 passed to function

      The truth is Perl gives the sub aliases to the variables you pass. Common practice is to copy them after that (my ($copy) = @_), but that's something else.
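      A small illustration of the difference (hypothetical sub names, purely for demonstration):

      sub modifies_caller {
          $_[0] .= '!';            # @_ elements alias the caller's variables
      }

      sub leaves_caller_alone {
          my ($copy) = @_;         # this is where the copy is made
          $copy .= '!';            # only the copy changes
      }

      my $str = 'hello';
      modifies_caller($str);       # $str is now 'hello!'
      leaves_caller_alone($str);   # $str is still 'hello!'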

        You are right, I stand corrected. And I'm a bit ashamed that I don't know such things after more than a year of Perl programming and half a year in the monastery...

        In the general case I wouldn't count on the fact that a subroutine doesn't copy one of its arguments, though.

Re: Reading huge file content
by bart (Canon) on Dec 06, 2007 at 13:44 UTC
    I think you'd better rewrite your function so it can read directly from the filehandle. Reading gigs of data is no fun; it can easily take many, many minutes. And then it hasn't done anything yet.
      In my experience, Perl has some problems reading very large amounts of data into a single variable. It may be better if you can load each line into an entry of an array (if you've got plenty of memory), or better yet, try to just process a line at a time. For very large jobs like this, I find it preferable to take one large job and make it into several small jobs which usually just scan through the file. It's probably worthwhile to spend some time thinking about your algorithm to see if this is possible.
        or better yet, try to just process a line at a time.

        Indeed. Especially since even reading this much data (never mind processing it) will take considerable time. Reading one line, processing it, then reading the next would also allow you to store a counter on disk somewhere, or display it on the screen, with the number of bytes successfully read and processed (obtained using tell), so that, if the process dies partway through, it can pick up from where it left off when restarted instead of having to go back to the beginning of the file and repeat what's already been done.
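        A rough sketch of that idea (untested; process_line and the checkpoint file name are placeholders):

        open my $fh, '<', '/path/2024.sql' or die "Cannot open: $!";

        while ( my $line = <$fh> ) {
            process_line($line);                 # placeholder for the real work

            # record how far we have got, so a restart can seek() back here
            # (rewriting the file every line is slow, but keeps the sketch simple)
            open my $ckpt, '>', 'progress.txt' or die $!;
            print $ckpt tell($fh), "\n";
            close $ckpt;
        }
        close $fh;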

Re: Reading huge file content
by jrsimmon (Hermit) on Dec 06, 2007 at 13:03 UTC
    Perhaps you would have better luck passing the file name and location to the generic function? For a file that large you may only be able to process parts of the file at a time.
Re: Reading huge file content
by naChoZ (Curate) on Dec 06, 2007 at 13:24 UTC

    I think you should probably question your motives and methods as other people have suggested. But I believe I remember brian_d_foy mentioning DBM::Deep was good at this sort of thing.
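    For the record, a minimal DBM::Deep sketch (file name and keys are made up; see the module's documentation for the details):

    use DBM::Deep;

    # the hash lives on disk, so it is not limited by available RAM
    my $db = DBM::Deep->new( 'big_data.db' );

    $db->{line_count}++;
    $db->{last_seen} = 'some value';

    print $db->{line_count}, "\n";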

    --
    naChoZ

    Therapy is expensive. Popping bubble wrap is cheap. You choose.

Re: Reading huge file content
by johngg (Canon) on Dec 06, 2007 at 14:17 UTC
    my $data = ""; while (read (FH,$data, $buffer) ){ $data .= $data; }

    I don't think that's going to work very well. Each time you do the read in the while loop you just clobber what you read before, because read puts the new chunk straight into $data before you append. Perhaps you should append $data to a different scalar. I would also cut down the size of your reads and possibly introduce an updating counter to see how far you have successfully read through the file.
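    Perhaps something more like this (an untested sketch; note it still accumulates the whole file in memory, so the original problem remains):

    my $chunk  = '';
    my $data   = '';
    my $buffer = 64 * 1024;                       # smaller reads
    my $sofar  = 0;

    while ( my $got = read( FH, $chunk, $buffer ) ) {
        $data  .= $chunk;                         # append to a *different* scalar
        $sofar += $got;
        print STDERR "read $sofar bytes\r";       # updating counter
    }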

    Cheers,

    JohnGG

Re: Reading huge file content
by BrowserUk (Patriarch) on Dec 06, 2007 at 13:00 UTC
Re: Reading huge file content
by KurtSchwind (Chaplain) on Dec 06, 2007 at 14:00 UTC

    As with others, I'm going to have to ask 'why'. We need the 'why' to give you a good solution.

    If we take the premise that you 'just need to', then the solution is to get a machine with 10G of memory on it and then you won't get that error. The error of running out of system memory isn't Perl-specific. You'd get that error in C or any other language. You've hit a physical limit.

    If it turns out that you really don't NEED to have it all slammed into memory at once and passed around, you have other options, as people have pointed out. Passing file handles is a good solution. Another good solution is to use Sys::Mmap to memory-map your file. You can also use PerlIO with the :mmap layer. At any rate, give us some more info and we can provide a solution to the problem, even if it's one you won't like. :)
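    For example, the PerlIO route might look something like this (a sketch; whether the :mmap layer is available depends on how your perl was built):

    open my $fh, '<:mmap', '/path/2024.sql'
        or die "Cannot open: $!";

    while ( my $line = <$fh> ) {
        # the file is mapped into the process address space rather than
        # read into one giant Perl scalar, but you still work line by line
    }
    close $fh;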

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
Re: Reading huge file content
by cdarke (Prior) on Dec 06, 2007 at 14:51 UTC
    As others have said - not a good idea; better to pass the filename, or maybe pass the data record by record through a pipe?
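    A rough sketch of the pipe idea (cat and handle_record are just placeholders for whatever really produces and consumes the records):

    # stream the file through a child process and handle one record at a time
    open my $pipe, '-|', 'cat', '/path/2024.sql'
        or die "Cannot start child: $!";

    while ( my $record = <$pipe> ) {
        handle_record($record);       # placeholder for the real processing
    }
    close $pipe;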

    On the memory issue, adding RAM may not help if you are only on a 32-bit machine or OS. The largest address on 32-bit is (2**32)-1, which gives a process address space of 4GB. Of this the kernel reserves a sizeable chunk - up to half (2GB) on some operating systems. Then you have the run-time libraries, perl itself, etc. using up address space. A 64-bit machine (in theory) has a limit of 16 exabytes.

    Then some operating systems (Windows) pre-allocate the heap size at link time, which by default is only 1MB with Visual Studio. There are ugly hacks around this limit, but I doubt perl employs them.

    Back to the drawing board I think.
Re: Reading huge file content
by GrandFather (Saint) on Dec 06, 2007 at 21:41 UTC

    Despite comments to the contrary above, available RAM is not a factor in how much stuff you can read into "memory". Virtual memory means that the physical RAM available is a performance limiting factor rather than an absolute limit. However, there are absolute limits and, depending on OS and build of Perl, 4 GB is likely to be beyond the maximum chunk of (virtual) memory you can use.

    The answer, as suggested elsewhere, is to restructure your code so that the analysis sub either takes a file handle and reads the data directly, uses a database, or uses something like Tie::File to avoid loading all of the file into "memory".
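    Used that way, Tie::File only keeps an index of line offsets, not the lines themselves, and fetches each line on demand (a sketch; the per-line work is a placeholder):

    use Tie::File;

    tie my @lines, 'Tie::File', '/path/2024.sql'
        or die "Cannot tie file: $!";

    for my $i ( 0 .. $#lines ) {
        my $line = $lines[$i];        # only this line is read from disk
        # ... process $line here ...
    }
    untie @lines;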


    Perl is environmentally friendly - it saves trees

      Are you sure?

      I'm fairly certain that your physical RAM limits how much you can read into a single memory segment. That is, if you want to read everything into a single variable, you'll need the physical RAM to do it. Virtual memory can only swap out what you aren't currently accessing.

      Am I wrong on this?

      --
      I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

        Yes, you are wrong. ;)

        Approximately: virtual memory systems give a process a chunk of memory that looks to the process like one large contiguous piece of memory. The process cannot tell where or how its process space memory is mapped to physical memory. Indeed, that mapping may change from access to access. The OS takes care of ensuring that, when a process accesses a chunk of process space memory, the access succeeds if possible. That may entail writing a chunk of physical memory to disk, reading another chunk of disk to get the accessed process memory's contents, then fixing up tables to map process memory addresses to physical addresses. Hitting on the VM doesn't come cheap!


        Perl is environmentally friendly - it saves trees