PetaMem has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
it seems I have - again - stumbled across an example of Perl's "obscene memory consumption habits". Basically, I try to tokenize a 10MB file in memory, and when it crashed my computer I gave it a closer look:
Take emails (simple text, no HTML, no attachments), concatenate them into a 10MB file, then do something like
my $content = slurp 'file';
print size($content),"\n";
print total_size([split m{\p{IsSpace}}ms, $content]),"\n";
Using Devel::Size to determine who is the culprit gives the numbers 10485544 (file size) and 370379304 (result of split). While the two numbers are within expectations, the script takes more than 1.8GB of RAM before it can print out the second number, which I think is somewhat insane. It's a 64bit 5.8.8 on x86_64.
Of course I am aware of String::Tokenizer and other iterative approaches to tokenizing tasks. I would just like to know from someone who is more knowledgeable about Perl's internals why there is a *hidden* memory consumption by a factor of 5 that I cannot explain. Is it something special about split? Some wild copying happening?
edit:
I've learned from this: don't use split on large strings. I.e. when you have a whole file, try to process it line by line or in similar chunks. In other words: make sure the string you feed to split has a guaranteed maximum length, or your machine will choke someday.
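A minimal sketch of that chunked approach (the sample data and the in-memory filehandle are placeholders; a real file handle works the same way). Splitting one line at a time keeps split's temporary list small and bounded:

```perl
use strict;
use warnings;

# simulated mail file; in practice: open my $fh, '<', 'file' or die $!;
my $mail = "From: a\nTo: b\n\nhello world\n" x 3;
open my $fh, '<', \$mail or die "open: $!";

my $count = 0;
while (my $line = <$fh>) {
    # each split only ever sees one line, so the temporary list stays small
    my @tokens = grep { length } split /\p{IsSpace}+/, $line;
    $count += @tokens;    # process tokens here instead of accumulating them all
}
close $fh;
print "$count tokens\n";
```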
Re: Tokenising a 10MB file trashes a 2GB machine
by BrowserUk (Patriarch) on Jul 16, 2008 at 09:10 UTC
Be aware that using Devel::Size::total_size() itself can consume prodigious amounts of memory.
In the process of examining the structure of the anonymous array you are creating, it builds a 'tracking hash' to allow it to avoid counting duplicate references to data or internal magic etc. Whether that's the cause of your memory consumption here is not clear, but it would be worth eliminating Devel::Size from the equation.
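One way to take Devel::Size out of the equation is to read the process's memory size straight from /proc and only count the tokens (a Linux-only sketch; on other systems the sub simply returns undef and the growth line is skipped):

```perl
use strict;
use warnings;

# read VmSize from /proc so the measurement itself allocates almost nothing
sub vsize_kb {
    open my $fh, '<', "/proc/$$/status" or return undef;
    while (my $l = <$fh>) {
        return $1 if $l =~ /^VmSize:\s+(\d+)\s+kB/;
    }
    return undef;
}

my $before = vsize_kb();
my @tokens = split /\p{IsSpace}+/, 'word ' x 100_000;
my $after  = vsize_kb();

print "tokens: ", scalar @tokens, "\n";
printf "VmSize growth: %d kB\n", $after - $before if defined $before;
```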
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Tokenising a 10MB file trashes a 2GB machine
by moritz (Cardinal) on Jul 16, 2008 at 09:23 UTC
I can't reproduce the problem here. I took my 20MB junk-mail file and split it, and virtual memory usage was about 210MB, both for perl 5.8.8 and perl 5.10.0 (on Linux).
Do you do anything else in your script? What's your OS, and which perl version do you use?
#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

my $content = decode('UTF-8', 'tralala ' x 1E6);
print size($content), "\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
procinfo();

sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;
    if (open(STAT, '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/, <STAT>;
        close STAT;
    }
    else {
        die "procinfo: Unable to open stat file.\n";
    }
    printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
    print "RSS  : $stat[23] pages\n";
}
The only difference I see is that the 32bit architecture takes half the space the 64bit one takes. But still, there is a factor of 5 between the virtual memory taken and the total size of the split list.
- Perl 5.8.8. on i686 (gcc 4.3.1 compiled, -march=686 -O2)
# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS : 79087 pages
- Perl 5.8.8. on x86_64 (gcc 4.3.1 compiled, -march=core2 -O2)
# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS : 130586 pages
- Perl 5.8.8. on x86_64 (gcc 4.1.2 compiled, -O2)
$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS : 130571 pages
So no matter what, memory usage is always about five times higher than it should be (30MB of that is Perl overhead, present even when the data is just a few bytes). Which makes me very unhappy...
On a 32-bit system, there is an approx 32 byte overhead per string (not including the string itself). Also, if you create a list (eg with split) and then assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment the temporary copy will be freed for perl to reuse, but not freed back to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:
my $content = decode('UTF-8', 'tralala ' x 1E6);
my @a;
$#a = 10_000_000;    # presize array
for (1..5) {
    print "ITER $_\n";
    push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
    procinfo();
}
which on my system gives the following output:
ITER 1
Vsize: 248.18 MiB ( 260235264)
RSS : 62362 pages
ITER 2
Vsize: 317.14 MiB ( 332550144)
RSS : 80000 pages
ITER 3
Vsize: 393.71 MiB ( 412839936)
RSS : 99598 pages
ITER 4
Vsize: 579.46 MiB ( 607612928)
RSS : 147156 pages
ITER 5
Vsize: 625.23 MiB ( 655597568)
RSS : 158895 pages
which averages about 94MB growth per iteration, or 47 bytes per string pushed onto @a; allowing 32 bytes of overhead per string (the SV and PV structures) leaves 15 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead etc., looks reasonable.
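Spelling out the arithmetic behind those figures (the 2-million-strings-per-iteration count is my assumption: 1E6 words plus 1E6 captured separators, since the split pattern has a capturing group):

```perl
use strict;
use warnings;

# back-of-envelope check of the numbers above
my $growth_per_iter  = 94_000_000;   # ~94MB VM growth per iteration
my $strings_per_iter = 2_000_000;    # 1E6 words + 1E6 captured separators
my $bytes_per_string = $growth_per_iter / $strings_per_iter;
my $payload          = $bytes_per_string - 32;   # minus SV+PV struct overhead

printf "%d bytes/string, %d bytes left for the string itself\n",
    $bytes_per_string, $payload;
```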
Dave.
I have Debian GNU/Linux on a boring 32 bit i386 machine.
perl 5.8.8:
8000028
68000056
Vsize: 322.68 MiB ( 338354176)
RSS : 79112 pages
perl 5.10.0:
8000036
84000100
Vsize: 270.80 MiB ( 283951104)
RSS : 68365 pages
The OS is 64bit Gentoo Linux and the Perl is 5.8.8. Nevertheless, I had my Perl interpreter under suspicion, because it's compiled with GCC 4.3.1 -march=core2.
Unfortunately the problem reproduced so consistently on all machines that there is no "conservative" perl left as a reference. My bad. As all of Perl's own test cases also ran OK, this could be a candidate for a test case, or even a new class of tests (expected memory consumption). Maybe this could be carried to perl5-porters.
$ perl ./tokenizer.pl
8000028
68000056
Vsize: 322.05 MiB ( 337694720)
RSS : 79143 pages
OS:
uname -a
Linux ubuntu 2.6.24-19-generic #1 SMP Fri Jul 11 23:41:49 UTC 2008 i686 GNU/Linux
GCC version:
gcc -v
gcc-Version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)
Re: Tokenising a 10MB file trashes a 2GB machine
by Anonymous Monk on Jul 16, 2008 at 09:36 UTC
Try use re 'debug'; and see if it sheds any light. This replicates your results:
perl -Mre=debug -e"$f = q~a b c~ x 1;$g = [split m{\p{IsSpace}}ms, $f];"
perl -e"die 1E7"
perl -Mre=debug -e"$f = q~a b c~ x 1E7;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2
perl -Mre=debug -e"$f = q~a b c~ x 1E4;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2
The sheer size of the debugger output makes it impossible to run with the 1E7 multiplier, and although I still do not know how to interpret the output, maybe someone here does.
Multiplier | Size of debugger output
-----------+------------------------
         1 | 14 KiB
        10 | 21 KiB
       100 | 137 KiB
       1E3 | 5.7 MiB
       1E4 | 507 MiB
       1E5 | 50 GiB
Therefore I predict the debugger output would be (at least) about 5TiB for 1E6. The size comes from the fact that the debugger always prints out the complete remaining dataset that will be matched against, and that dataset shrinks by only one token per match. That is also why the numbers above halve if we use e.g. q{1234 } instead of q{a b c}.
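The table shows roughly quadratic growth (each 10x step in the multiplier grows the output ~100x, since the debugger reprints the whole remaining string on every match), which is where an estimate like 5TiB for 1E6 comes from. A quick extrapolation sketch using the measured values:

```perl
use strict;
use warnings;

# measured debugger output sizes from the table above, in MiB
my %size_mib = (
    '1E3' => 5.7,
    '1E4' => 507,
    '1E5' => 50 * 1024,
);

# each 10x step in the multiplier grows output ~100x (quadratic),
# so extrapolate one more step from 1E5 to 1E6
my $predicted_1e6_tib = $size_mib{'1E5'} * 100 / (1024 * 1024);
printf "predicted for 1E6: %.1f TiB\n", $predicted_1e6_tib;
```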
In between these printouts there is always the same output:
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 1234 "
Setting an EVAL scope, savestack=6
49969 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
49970 <1234 > <1234 12> | 13: END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 "
Setting an EVAL scope, savestack=6
49974 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
49975 <1234 > <1234 12> | 13: END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 "
Setting an EVAL scope, savestack=6
49979 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
49980 <1234 > <1234 12> | 13: END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 "
(this is taken from near the end of the debugger output to keep the size of the data sections small)
So unfortunately I do not see much in this output that could give me a hint about the additional memory consumption. Except maybe the "savestack=6", but I guess that is the same on every other perl interpreter. I'll try to compile Perl conservatively with an old GCC and a generic CPU architecture (maybe the new gcc does some wasteful alignment for the Core2 architecture).