PerlMonks
Memory efficient design

by harangzsolt33 (Chaplain)
on Aug 26, 2022 at 03:25 UTC [id://11146435]

harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am trying to load several big files into memory by reading the entire file. (They're between 5MB and 100MB each.) I want to figure out the best way to design my code so it won't waste memory. After pulling them into memory, I do some work (generate more data), and then I will need to unload some of them once they're no longer needed to free up memory. And then I read some more files into memory. Then finally I write the data I generate to a file. What's the best way to do this? Does this look good?:

my $Data1 = ReadFile($filename1);
my $Data2 = ReadFile($filename2);
my $Data3 = ReadFile($filename3);
my $OutputData = '';

$OutputData .= ......

undef $Data2;
my $Data4 = ReadFile($filename4);

...

CreateFile($output_filename, \$OutputData); # Pass by ref to prevent double copy
exit;

undef should release that memory. But will it cause large memory blocks to be moved around, since the freed block sits in the middle? How does Perl deal with memory fragmentation? And what happens when there's a memory hole and I create a new string, slurping a file with sysread() to fill it? I imagine that as the string grows, it outgrows that hole and has to be moved to another location. Right? I'm trying to understand what's going on in the background so I can design my code to be efficient.
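For example, here is a sketch of what I mean (names are mine; it rests on the assumption, which I have not verified, that Perl reuses a scalar's existing buffer when it is already large enough, so pre-extending it with vec() before sysread() should avoid the grow-and-move scenario):

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';

# Hypothetical sketch: pre-extend the target scalar to the file's size
# before sysread(), so the buffer is (assumed to be) allocated once and
# never regrown mid-read.
sub slurp_prealloc {
    my ($filename) = @_;
    my $size = -s $filename or return '';
    sysopen(my $fh, $filename, O_RDONLY) or return '';
    binmode $fh;
    my $data = '';
    vec($data, $size - 1, 8) = 0;    # pre-extend buffer to $size zero bytes
    sysread($fh, $data, $size);      # overwrite from offset 0
    close $fh;
    return $data;
}
```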

- I don't want to end up with creating double copies of the same data in memory.
- I don't want to hog memory by not releasing it when I'm done with it.
- I don't want to waste resources unnecessarily.

I would appreciate any helpful advice!!

The rest of the code (which is irrelevant):

# Usage: STRING = ReadFile(FILENAME, START, LENGTH) - Reads an entire file or part of a file in binary mode. Returns the file contents as a string. An optional second argument will move the file pointer before reading, and an optional third argument limits the number of bytes to read.
sub ReadFile {my$F=defined$_[0]?$_[0]:'';$F=~tr#><*%$?\r\n\"\0|##d;-e$F||return'';-f$F||return'';my$S=-s$F;$S||return'';my$L=defined$_2?$_2:$S;$L>0||return'';local*H;sysopen(H,$F,0)||return'';binmode H;my$P=defined$_1?$_1:0;$P>=0or $P=0;$P<$S||return'';$P<1||sysseek(H,$P,0);my$D='';sysread(H,$D,$L);close H;return$D;}

Replies are listed 'Best First'.
Re: Memory efficient design
by Discipulus (Canon) on Aug 26, 2022 at 07:12 UTC
    Hello harangzsolt33,

from the little I know, it is very rare for memory to actually be released back to the OS. Some of it can be reused by your perl program, but I suspect it will not be released (maybe this depends on the OS?).

While waiting for more expert answers, you can use a memory-profiling module to test your assumptions: Devel::SizeMe and Devel::Size seem to be appropriate choices.

You can also consider not loading the file into memory at all: File::Map does exactly this. See memory-map-files-instead-of-slurping-them.

    L*

    PS please add some <code> tags around your code

PPS see also Mini-Tutorial: Perl's Memory Management, which surprised me with the sentence: You are more likely to see memory being released to the OS on Windows

    PPPS Perl and Garbage Collection linking to Do subroutine variables get destroyed?

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Memory efficient design
by eyepopslikeamosquito (Archbishop) on Aug 26, 2022 at 07:46 UTC
      Wow, you have a great list there!! I browsed through it, and I learned quite a few things from there. Thanks!

      I use TinyPerl 5.8 on Windows, and I noticed that when I create a large variable, the size of the process expands, but when I use undef $variable, then the size of the process shrinks (but not necessarily by the same amount! lol).

      I noticed that if I do not pass by reference, then an extra copy of the string is created in memory. I see that because I have the Task Manager open on the side, and the memory usage of the TinyPerl.exe application jumps to twice the expected size. But of course, I think, that's normal, and that's how it should work.

      Some unexpected behavior happens when I create a variable and initialize it like so: my $VAR1 = 'A' x 10000000;

      In this case, TWICE the amount of memory is used! However, if I initialize it in two steps using the vec() function, then only the exact amount of memory is used:

      my $VAR2 = ''; vec($VAR2, 9999999, 8) = 65;

      The latter however fills the string with zero bytes and puts a letter 'A' at the end, while the former fills it with letter A's all the way. I could use vec() in a for loop to fill the variable with letter A's, but that takes MUCH longer than using the 'x' operator.
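A possible middle ground (my own sketch, not something I have measured): preallocate once with vec(), then fill in place with 4-arg substr() in 64 KB blocks, so the temporary fill data never exceeds one block:

```perl
use strict;
use warnings;

# Untested sketch: preallocate the full length with a single vec(),
# then overwrite in place block by block with 4-arg substr(). The only
# temporary string is the 64 KB $block, so the assumption is that peak
# memory stays close to $len instead of doubling as with 'A' x $len.
my $len   = 1_000_000;
my $s     = '';
vec($s, $len - 1, 8) = 0;                 # preallocate exactly $len zero bytes
my $block = 'A' x 65536;                  # one 64 KB block of fill data
for (my $pos = 0; $pos < $len; $pos += 65536) {
    my $n = $len - $pos < 65536 ? $len - $pos : 65536;
    substr($s, $pos, $n, substr($block, 0, $n));   # in-place overwrite
}
```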

      Also, when it's time to get rid of the variables and I do undef $VAR1; and undef $VAR2; an unexpected thing happens! undef $VAR1; releases only half of the memory, while undef $VAR2; releases the entire amount. So, when using the 'x' operator to create a large variable, not only does that take up twice as much memory, but when I undef it, it does not release all of the used memory, which looks like a memory leak in TinyPerl 5.8. I don't know how Perl in Linux and MacOS handle these two scenarios, but it's very suspicious the way it is handled in Windows.

I have tested how storing large strings in individual variables differs from storing them in a large array, and it makes no difference in memory usage. I did notice, however, that storing MANY short strings (say, a million 8-byte strings) is handled a lot more efficiently when it's all packed into ONE string as opposed to an array with one million elements, each containing an 8-byte string. So, when someone deals with lots of tiny pieces, it's better to pack them into a string than to keep a giant array.
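To illustrate the packed layout (the names are mine, and it's scaled down to ten records for readability):

```perl
use strict;
use warnings;
use constant RECLEN => 8;   # fixed record width in bytes

# Fixed-width records stored back to back in one scalar, indexed by
# substr() arithmetic instead of a million-element array.
my $store = '';
$store .= sprintf("%-*s", RECLEN, "r$_") for 0 .. 9;

sub get_record {
    my ($store_ref, $i) = @_;    # take a reference to avoid copying the big string
    return substr($$store_ref, $i * RECLEN, RECLEN);
}
```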

      I used this test program:

#!/usr/bin/perl
use strict; use warnings;
$| = 1;

print "\nSTEP 1: Reserve memory\n";
#my @GLOBAL_DATA;
#$GLOBAL_DATA[0] = '';
#vec($GLOBAL_DATA[0], 9999999, 8) = 44; # Reserves just the right amount.
#$GLOBAL_DATA[1] = 'x' x 10000000; # Reserves 20000000 bytes
#$GLOBAL_DATA[2] = 'y' x 100000;   # Reserves 200000 bytes
#$GLOBAL_DATA[3] = 'z' x 1000;     # Reserves 2000 bytes
my $AA = 'H';
vec($AA, 9999999, 8) = 44; # Reserves just the right amount.
#my $BB = 'g' x 10000000;  # Reserves double memory!!!
$a = <STDIN>;

print "\nSTEP 2: Do something with the data\n";
DoSomething(\$AA); # Pass by reference.
PrintIt(\$AA);     # Pass by reference.
$a = <STDIN>;

print "\nSTEP 3: Free up memory\n";
undef $AA; # Frees up all of it
#undef $BB; # Frees up only half of it!
#undef $GLOBAL_DATA[0]; # Frees up all of it
#undef $GLOBAL_DATA[1]; # Frees up only half of it!
#splice(@GLOBAL_DATA, 1, 1); # Frees up half
$a = <STDIN>;

print "\nSTEP 4: Pause\n$a";
exit;

sub DoSomething
{
  my $X = $_[0];
  for (my $i = 0; $i < 60; $i++)
  {
    vec($$X, $i, 8) = $i + 48; # This operation doesn't use more memory
  }
  for (my $i = 9990000; $i < 9999999; $i++)
  {
    vec($$X, $i, 8) = 65; # This operation doesn't use more memory
  }
  print "\nDONE.";
}

sub PrintIt
{
  my $X = $_[0];
  print "\nLENGTH OF STRING = ", length($$X), "\n";
  print "\nPREVIEW = ", substr($$X, 0, 60), "\n";
}

After I initialized the $BB variable using the 'x' operator, I opened the HxD memory viewer in Windows to look at TinyPerl's data segment, and as I scrolled through it, I found a large region of memory from 024F0020 to 038096B0 filled with letter g's. That's approximately 20,000,000 bytes. So why does Perl create double the amount of letters when using the 'x' operator? This is so weird.

In JavaScript, for example, every string is stored in Unicode format, so one character takes up 2 bytes in memory. So, if I create a 10 MB string, that will normally take up double the space in memory. And if I look at it with the HxD hex viewer, I see that its data segment is filled with 00 67 00 67 00 67 00 67 00 67, which is normal; that's how a string of "gggggggggg" is stored by JavaScript. That's what I thought Perl does too (storing characters in Unicode format), but that's not the case! It literally creates a string that is twice the size.

        I guess Perl first creates the result of the operation, and then copies the value to the variable. In today's Perl, the result is created at compile time if both the string and the number are constant (not sure about your ancient version).

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Memory efficient design
by eyepopslikeamosquito (Archbishop) on Aug 26, 2022 at 11:30 UTC

    > I want to figure out the best way to design my code so it won't waste memory ...
    > I'm trying to understand what's going on in the background so I can design my code to be efficient

    While those are fun and instructive things to do, just a friendly word of caution. Your primary focus should be writing simple, clear and correct code (here's my top ten list). If your simple, clear, easy to maintain code is fast enough, job done. If not, resist the urge to change it until you've benchmarked it!

    Don’t Optimize Code -- Benchmark It

    -- from Ten Essential Development Practices by Damian Conway

    While guessing or speculating might be fun, there is no substitute for measuring. Even experts often get a surprise when they measure.

    Making the problem trickier still, what is "fast" changes over time as hardware evolves. One recent example I remember illustrates how crucial cache misses are on modern CPUs. Though certainly no expert, I still remember falling off my chair as it dawned on me (while measuring with Intel VTune) that almost every memory access into my elegant 4GB lookup table missed the L1 cache, missed the L2 cache, missed the L3 cache, then waited for the cache line to be loaded into all three caches from main memory, while often incurring a TLB cache miss, just to rub salt into the wounds. :)

    Some simple examples of using Perl's core Benchmark module can be found at point number 10 in Conway's article. From CPAN, Devel::NYTProf is also highly recommended.
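For instance, a minimal sketch using the core Benchmark module, comparing the two string-building approaches discussed elsewhere in this thread (your numbers will vary by machine):

```perl
use strict;
use warnings;
use Benchmark 'cmpthese';

# Compare the 'x' operator against a single vec() preallocation for
# building a 1 MB string; cmpthese prints a rate-comparison table.
cmpthese(-1, {
    x_op => sub { my $s = 'A' x 1_000_000 },
    vec  => sub { my $s = ''; vec($s, 999_999, 8) = 65 },
});
```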

Re: Memory efficient design
by NERDVANA (Deacon) on Aug 26, 2022 at 20:21 UTC

As long as you don't need to modify the contents of the data you loaded from the file, you should give File::Map a try. It memory-maps the file into your process as a Perl scalar. Best of all, it doesn't even add that memory to Perl (if read-only), because Linux can just memory-map the filesystem cache into your process. When you unmap it, it definitely gets released back to the OS (which may or may not choose to evict it from the filesystem cache).

One catch: that module operates on a plain scalar, and if you do ANYTHING with that scalar other than use it directly or take a reference to it, Perl is likely to make a copy of it, defeating your optimization. I recommend creating a reference to a scalar, mapping the scalar it references, and then passing around the reference, possibly as an object.

    Example

package OpenGL::Sandbox::MMap;
use strict;
use warnings;
use File::Map 'map_file';

# ABSTRACT: Wrapper around a memory-mapped scalar ref
our $VERSION = '0.120'; # VERSION

sub size { length(${(shift)}) }

sub new {
  my ($class, $fname)= @_;
  my $map;
  my $self= bless \$map, $class;
  map_file $map, $fname;
  $self;
}

1;
Re: Memory efficient design
by eyepopslikeamosquito (Archbishop) on Aug 28, 2022 at 08:43 UTC

    After re-reading your root post, I'm really puzzled, especially by your ReadFile subroutine listed in italics at the end.

    As noted in my earlier reply, programmers need to adapt their coding style over the years as hardware evolves. In mainstream computing (as opposed to Embedded software) memory has nowadays become so abundant and cheap that it's seldom a concern. At least, that's what I see.

    > I am trying to load several big files into memory by reading the entire file. (They're between 5MB and 100MB each.) I want to figure out the best way to design my code so it won't waste memory.

    It's hard to buy a laptop today with less than 8 GB of memory ... so you could comfortably load forty 100 MB files into memory simultaneously on a low end laptop. If you need more, you can buy 32 GB of PC memory for around $200, a bargain compared to programmer's salaries. ;-)

    To avoid eyestrain, I've reformatted your ReadFile subroutine (in italics at the end of the root node) below:

# Usage: STRING = ReadFile(FILENAME, START, LENGTH)
# Reads an entire file or part of a file in binary mode.
# Returns the file contents as a string.
# An optional second argument will move the file pointer before reading,
# and an optional third argument limits the number of bytes to read.
sub ReadFile {
    my $F = defined $_[0] ? $_[0] : '';
    $F =~ tr#><*%$?\r\n\"\0|##d;
    -e $F || return '';
    -f $F || return '';
    my $S = -s $F;
    $S || return '';
    my $L = defined $_2 ? $_2 : $S;
    $L > 0 || return '';
    local *H;
    sysopen(H, $F, 0) || return '';
    binmode H;
    my $P = defined $_1 ? $_1 : 0;
    $P >= 0 or $P = 0;
    $P < $S || return '';
    $P < 1 || sysseek(H, $P, 0);
    my $D = '';
    sysread(H, $D, $L);
    close H;
    return $D;
}

    Though you stated this code is "irrelevant", I disagree. Certainly, if this appeared in a code review at work, serious questions would be asked, especially around error handling and interface (e.g. going through each bullet point in the "API Design Checklist" section at On Coding Standards and Code Reviews).

    • What are $_1 and $_2?
    • What general approach do you take to error handling in your suite of programs?

Re: Memory efficient design
by LanX (Saint) on Aug 28, 2022 at 09:52 UTC
    Unfortunately your final paragraph is unreadable for me, because you didn't use <code> tags (please edit)

    > I do some work (generate more data)

    It really depends on what kind of work you are doing, and you keep us guessing.

If you are just processing the input files linearly (from one end to the other), consider a sliding window, or more generally iterators that read in chunks. (see Re: Memory Leak when slurping files in a loop (sliding window explained) )
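A minimal sketch of what I mean by reading in chunks (the names are only illustrative): the callback sees one fixed-size block at a time, so memory use stays flat no matter how big the file is.

```perl
use strict;
use warnings;

# Process a file in fixed-size blocks instead of slurping it whole.
sub process_in_chunks {
    my ($filename, $callback, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;                      # default to 1 MB blocks
    open my $fh, '<:raw', $filename or return 0;
    my $buf;
    while (read($fh, $buf, $chunk_size)) {
        $callback->($buf);                        # only one block in memory here
    }
    close $fh;
    return 1;
}
```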

If you need to access the data randomly, you "might" be better off building a lookup index beforehand, so you can identify and load only the relevant parts into memory. (see also this approach to split up hashes and rely on swapping, though that will push against memory limits)
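A rough sketch of such an index for newline-terminated records (my own illustrative code, not a drop-in solution): record every line's byte offset once, then seek() straight to record $i later without holding the file in memory.

```perl
use strict;
use warnings;

# One pass to record where each line starts.
sub build_line_index {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or return;
    my @offsets = (0);                        # record 0 starts at byte 0
    push @offsets, tell($fh) while <$fh>;     # offset after each line read
    pop @offsets;                             # drop the trailing EOF offset
    close $fh;
    return \@offsets;
}

# Random access to record $i via a single seek() and readline.
sub fetch_record {
    my ($filename, $index, $i) = @_;
    open my $fh, '<:raw', $filename or return;
    seek($fh, $index->[$i], 0);
    my $line = <$fh>;
    close $fh;
    return $line;
}
```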

    Of course there are dedicated solutions for the latter, called database-servers.

    I agree with eyepops that memory is not that much of an issue nowadays, but if you plan on scaling things up in the future, a memory-frugal design would be reasonable.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Node Type: perlquestion [id://11146435]
Front-paged by Corion