http://qs321.pair.com?node_id=186838

TheFifthDeuce has asked for the wisdom of the Perl Monks concerning the following question:

Hello folks. I've got a problem that I can't solve. I am working on a pretty cool encryption/decryption system. OK, I am reading a text file which consists of just 0's and 1's... no newlines or whitespace. The file can range in size and get to be VERY large. For this example, the file I am reading is 2,387,250 bytes in size. I need to get every byte of the file, so here are 3 different methods I tried using, and each one eats up a LOT of RAM:
sub test_loop_1{
    ######## RAM used: 190 MB
    my(@all, $elements);
    @all = ();
    open(FILE, $file) or die;
    while(<FILE>){
        push @all, /\d/og;
    }
    close(FILE);
    $elements = @all;
    print $elements;    # Just for confirmation - prints 2387250
}

sub test_loop_2{
    ######## RAM used: 185 MB
    my(@all, @all2, $all, $elements);
    open(FILE, $file) or die;
    @all = <FILE>;
    close(FILE);
    $all = join('', @all);
    @all2 = split('', $all);
    $elements = 0;
    foreach(@all2){
        $elements++;
    }
    print $elements;    # Just for confirmation - prints 2387250
}

sub test_loop_3{
    ######## RAM used: 120 MB
    my(@all, @all2, $all, $elements);
    open(FILE, $file) or die;
    @all = <FILE>;
    close(FILE);
    $all = join('', @all);
    @all2 = ();
    for(my $i = 0; $i < length($all); $i++){
        push @all2, substr($all, $i, 1);
    }
    $elements = @all2;
    print $elements;    # Just for confirmation - prints 2387250
}
Is there any way around this hogging of RAM, or being that the file is just so large in size, am I gonna have to deal with it?

Thanks for any advice,
David
http://www.trixmaster.com

Re: Eating RAM problem
by particle (Vicar) on Aug 01, 2002 at 17:02 UTC
    if your file contains only binary data, why is it a text file? you can vastly compress it by using one bit per bit instead of one byte per bit. use vec and binmode. for a more friendly interface, you can tie your bit vector to an array with Tie::VecArray
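    For instance, a minimal sketch of that packing step (the file names here are placeholders, not from this thread):

    my $bits = '';
    my $pos  = 0;
    open my $in, '<', '01file.txt' or die $!;          # hypothetical '0'/'1' text file
    while ( read $in, my $buf, 8192 ) {
        vec($bits, $pos++, 1) = $_ for split //, $buf; # one bit per input character
    }
    close $in;

    open my $out, '>', 'packed.bin' or die $!;         # hypothetical packed output
    binmode $out;                                      # raw bytes, no newline translation
    print $out $bits;
    close $out;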

    ~Particle *accelerates*

      Thanks, but that is not an option. It cannot be compressed any further. I am working on an encryption system where each ASCII char is assigned a certain number of bits, so for example, if the text to be encrypted is 1000 bytes, then after encryption that text will be converted to 36000 bytes consisting of just 0's and 1's.

        Bytes always consist of 0's and 1's. ;-)

        Frankly speaking, I am not sure I understand you here. Let me rephrase: for each block of n bytes, you are going to replace it with a block of m bytes, where m > n. Your input data and output data are files consisting of 0's and 1's. I don't understand why, but I accept that. Is that correct?

        If yes, you can perform any mathematical operations with any of the following three representations of the data:

        • A @list of bits, e.g. @list = (0, 1, 0, 0, 0, 0, 0, 1); # 'A'; this is what you are using.
        • A $bitstring, e.g. $bitstring = '01000001'; # 'A'.
        • Binary data, e.g. $data = 'A'; (that is, read directly from the file using e.g. $data = <file>). Obviously, this representation uses the least amount of space. This is not really a compression (for my definition of compression), it is just the 'natural' representation of the data. On the contrary, the other two representations are (probably unnecessary) expansions.

        These three representations are equivalent; you just need to use different syntax to access them. For example, to access the third bit in the data, you would use

        • $third_bit = $list[2];
        • $third_bit = substr($bitstring, 2, 1);
        • $third_bit = vec($data, 5, 1); (this one is a bit more tricky, see the documentation for vec)

        To access whole bytes or blocks of bytes, you would use splice, substr and substr, respectively. All the operations you will need to perform can be expressed in all three data representations -- but the last one will only use about 2 MB of memory... Plus, for the last one, you can use perl's binary or, and etc., whereas for the @list and $bitstring, you'll have to emulate the bitwise operations (using the above-mentioned substr, vec etc.)
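        For instance (a sketch with made-up data, not from this thread), the compact representation lets you apply bitwise operators to whole strings at once:

        my $data  = 'AB';                # 2 raw bytes instead of 16 '0'/'1' characters
        my $key   = "\x0f" x length($data);
        my $xored = $data ^ $key;        # bitwise XOR over the entire string in one go
        printf "%08b\n", ord substr($xored, 0, 1);   # 01001110, i.e. 'A' ^ 0x0F
        print vec($data, 5, 1), "\n";    # the third bit of 'A', as above -- prints 0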

        You said: I am working on an encryption system where each ASCII char is assigned a certain number of bits, so for example, if the text to be encrypted is 1000 bytes, then after encryption that text will be converted to 36000 bytes consisting of just 0's and 1's.

        Is it the case that the encryption system requires access to the entire data stream in order to work at all? If encrypting, say, 10 sets of 100 bytes (producing 10 sets of 3600 bytes) works as well as cranking a lump of 1000 bytes into 36000, then you should just read, process and output a small portion of data at a time, rather than trying to hold an entire file -- with massive amounts of wasted bits -- in memory at one time.

        Apart from that -- I'm sorry but... -- if memory consumption is an issue, and forcing some particular method of bit padding is a requirement, I'd use C rather than Perl.

        update: Maybe what you want is sysread, to bring a stated number of bytes into an input scalar variable; e.g.:

        while ( ( my $n_bytes_read = sysread(FILE, $inpbuf, 32) ) > 0 ) {
            if ( $n_bytes_read < 32 ) {
                # must be the last chunk
                # ... maybe this needs special treatment
            }
            process_input_bytes( $inpbuf );
        }
Re: Eating RAM problem
by Abigail-II (Bishop) on Aug 01, 2002 at 17:54 UTC
    Your problem isn't so much the file size, your problem is that you want to make an array element for every single character. This is Perl, not C, so this is going to be costly - you'll get the overhead of a "Perl value" for each character.

    Do you really need that? Can't you use substr? Do you have to have all the characters of the file at the same time? Isn't the encryption/decryption algorithm made such that it encrypts/decrypts blocks of some decent size?
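    One possible shape for that (a sketch, not code from the thread; the 32-character block size is taken from the discussion below): slurp the file once into a single scalar and walk it with substr, instead of building a multi-million-element array:

    local $/;                     # slurp mode: read the whole file in one go
    open(FILE, $file) or die;
    my $all = <FILE>;
    close(FILE);
    for ( my $off = 0; $off < length($all); $off += 32 ) {
        my $block = substr($all, $off, 32);   # one 32-char block, no huge array
        # ... encrypt/decrypt $block here
    }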

    Abigail

      Well yes, the algorithm does work on blocks. I was using an analogy of every byte of a 2 MB file, because that would still be the realistic equivalent of 32-byte blocks of a 64 MB file. I guess I could put a max restriction on the data that can be entered to encrypt. lol

      The point is that I HAVE to have each chunk of 32 chars from the file to work with... whether I am using an array or not. How can I do something like this using substr as you suggest, or, for that matter, ANY way without draining RAM! lol
      sub get_data{
          my(@chunks_of_32);
          @chunks_of_32 = ();
          open(FILE, $file) or die;
          while(<FILE>){
              push @chunks_of_32, /\d{32}/og;
          }
          close(FILE);
      }
      Thanks
        Eh, why don't you just read in 32 characters, process them, write the output and then read in the next 32 characters?

        If you don't need the entire file at once, don't read it all at once.
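        Something like this, for instance (a sketch; process_block() and the output file name are placeholders, not from the thread):

        open my $in,  '<', $file     or die $!;
        open my $out, '>', 'out.txt' or die $!;
        while ( read($in, my $block, 32) ) {
            print $out process_block($block);   # only one 32-char block in RAM at a time
        }
        close $in;
        close $out;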

        Abigail

Re: Eating RAM problem
by chromatic (Archbishop) on Aug 01, 2002 at 17:43 UTC
    sub by_string {
        my $file = shift;
        local *IN;
        local $/;                # undef $/ => slurp the whole file at once
        open( IN, $file ) or die "Cannot open '$file': $!";
        return <IN>;
    }

    Access each element with substr. Memory savings? Several bytes per character, because Perl doesn't have to create a new SV for each character.

      Chromatic, I need to get each element into an array. If the file size is 2 million bytes, then the array should have 2 million elements. How can I do that without draining RAM? Using your sub I get:
      @blah = by_string($file);
      $i = 0;
      foreach (@blah) {
          $i++;
      }
      print $i;   # Prints 1... not what I am looking for
      Thanks

        Then use length $blah[0] instead. You can do anything with strings that you can do with an array of 0's and 1's. The syntax is just a little different.
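        For instance (a sketch, assuming $blah[0] holds the slurped string):

        my $data  = $blah[0];
        my $count = length $data;           # 2387250 "elements"
        my $bit5  = substr($data, 4, 1);    # read, like $all[4] in the array version
        substr($data, 4, 1) = '1';          # write, like $all[4] = '1'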

        Ron Steinke rsteinke@w-link.net
        Why do you need to get each element into an array?
        Could you use Tie::VecArray; ?
Re: Eating RAM problem
by Cine (Friar) on Aug 01, 2002 at 18:13 UTC
    sub test_loop_4 {
        ####### Uses a lot less RAM, but still a lot,
        ####### because there are 2 mil+ elems in @all2...
        ####### A wild guess would be about 20-25 * filesize in RAM usage
        open(FILE, $file) or die $!;
        my $buf = '';
        my @all2 = ();
        while (read FILE, $buf, 1) {
            push @all2, $buf;
        }
    }


    T I M T O W T D I
      Thanks Cine, but your example still uses 120 MB of RAM. With everybody's input, I now realize WHY RAM is being eaten alive. lol I gotta work on a buffer scheme or multiple reads/writes from the file. Anybody comes up with anything, please post!

      Thanks,
      David
        It is quite difficult to come up with a caching scheme for a usage pattern that is unknown ;)
        I suggest you make a new question where you state what you need.

        T I M T O W T D I