Tokenising a 10MB file trashes a 2GB machine

PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Re: Tokenising a 10MB file trashes a 2GB machine
by BrowserUk (Patriarch) on Jul 16, 2008 at 09:10 UTC

Be aware that using Devel::Size::total_size() can itself consume prodigious amounts of memory.

In the process of examining the structure of the anonymous array you are creating, it builds a 'tracking hash' to allow it to avoid counting duplicate references to data or internal magic etc. Whether that's the cause of your memory consumption here is not clear, but it would be worth eliminating Devel::Size from the equation.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Tokenising a 10MB file trashes a 2GB machine

by PetaMem (Priest) on Jul 16, 2008 at 09:23 UTC

It should be obvious that I added Devel::Size AFTER I found my script trashing the machine. I wanted to write in my original post that Devel::Size is completely neutral in this case, but considered that information redundant.

Bye
PetaMem All Perl: MT, NLP, NLU

Re: Tokenising a 10MB file trashes a 2GB machine
by moritz (Cardinal) on Jul 16, 2008 at 09:23 UTC

Do you do anything else in your script? What's your OS, which perl version do you use?

Re^2: Tokenising a 10MB file trashes a 2GB machine

by PetaMem (Priest) on Jul 16, 2008 at 12:11 UTC

Assuming you have some Linux flavour as your OS, could you please try the following script on your machine and report its output?

#!/usr/bin/perl

use warnings;
use strict;

use Devel::Size qw(size total_size);
use Encode;

my $content = decode('UTF-8', 'tralala ' x 1E6);

print size($content),"\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]),"\n";

procinfo();

sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;

    if (open( STAT , '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/ , <STAT>;
        close STAT ;
    }
    else {
        die "procinfo: Unable to open stat file.\n";

    }

    print sprintf "Vsize: %3.2f MiB (%10d\)\n", $stat[22]/$MiB, $stat[22];
    print "RSS  : $stat[23] pages\n";
}

The only difference I see is that the 32-bit architecture takes roughly 60% of the space the 64-bit one takes. But still, there is a factor of about 5 between the virtual memory taken and the total size of the split list.

Perl 5.8.8. on i686 (gcc 4.3.1 compiled, -march=686 -O2)

# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS  : 79087 pages

Perl 5.8.8. on x86_64 (gcc 4.3.1 compiled, -march=core2 -O2)

# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS  : 130586 pages

Perl 5.8.8. on x86_64 (gcc 4.1.2 compiled, -O2)

$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS  : 130571 pages

So no matter what, memory usage is always about 5 times higher than it should be (about 30MB is perl overhead, which is present even if the data is just a few bytes). Which makes me very unhappy...
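
As a quick check of that "factor of 5", here is a minimal sketch that recomputes the ratio of Vsize to total_size from the three runs quoted above. The numbers are copied from those outputs; the hash keys and the helper name overhead_ratio are only illustrative, and the fixed interpreter overhead is not subtracted.

```perl
use strict;
use warnings;

# Figures copied from the three runs reported above (bytes).
my @runs = (
    { label => 'i686, gcc 4.3.1',   total_size => 68_000_056,  vsize => 338_231_296 },
    { label => 'x86_64, gcc 4.3.1', total_size => 112_000_096, vsize => 563_724_288 },
    { label => 'x86_64, gcc 4.1.2', total_size => 112_000_096, vsize => 565_620_736 },
);

# Ratio of the process's virtual size to the size Devel::Size
# reports for the list produced by split (hypothetical helper).
sub overhead_ratio {
    my ($run) = @_;
    return $run->{vsize} / $run->{total_size};
}

for my $run (@runs) {
    printf "%-18s Vsize/total_size = %.2f\n",
        $run->{label}, overhead_ratio($run);
}
```

All three ratios come out close to 5, so the overhead scales with the data on both architectures rather than being a fixed cost.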

Bye
PetaMem All Perl: MT, NLP, NLU

Re^3: Tokenising a 10MB file trashes a 2GB machine

by dave_the_m (Monsignor) on Jul 16, 2008 at 13:19 UTC

my $content = decode('UTF-8', 'tralala ' x 1E6);

my @a;
$#a = 10_000_000; # presize array
for (1..5)
{
    print "ITER $_\n";
    push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
    procinfo();
}

ITER 1
Vsize: 248.18 MiB ( 260235264)
RSS  : 62362 pages
ITER 2
Vsize: 317.14 MiB ( 332550144)
RSS  : 80000 pages
ITER 3
Vsize: 393.71 MiB ( 412839936)
RSS  : 99598 pages
ITER 4
Vsize: 579.46 MiB ( 607612928)
RSS  : 147156 pages
ITER 5
Vsize: 625.23 MiB ( 655597568)
RSS  : 158895 pages

Dave.

Re^3: Tokenising a 10MB file trashes a 2GB machine

by moritz (Cardinal) on Jul 16, 2008 at 12:22 UTC

perl 5.8.8:
8000028
68000056
Vsize: 322.68 MiB ( 338354176)
RSS  : 79112 pages

perl 5.10.0:
8000036
84000100
Vsize: 270.80 MiB ( 283951104)
RSS  : 68365 pages

Re^2: Tokenising a 10MB file trashes a 2GB machine

by PetaMem (Priest) on Jul 16, 2008 at 09:31 UTC

The OS is 64-bit Gentoo Linux and the Perl is 5.8.8. Nevertheless, I had my Perl interpreter under suspicion because it's compiled with GCC 4.3.1 -march=core2.

Unfortunately it behaved so well on all machines that there is no "conservative" perl left as a reference. My bad. Since all of Perl's own test cases also ran OK, this could be a candidate for a test case, or even for a new class of tests (expected memory consumption). Maybe this could be carried to perl-porters.

Bye
PetaMem All Perl: MT, NLP, NLU

Re^2: Tokenising a 10MB file trashes a 2GB machine

by Anonymous Monk on Jul 16, 2008 at 16:19 UTC

Here's one more output:

$ perl ./tokenizer.pl 
8000028
68000056
Vsize: 322.05 MiB ( 337694720)
RSS  : 79143 pages

uname -a
Linux ubuntu 2.6.24-19-generic #1 SMP Fri Jul 11 23:41:49 UTC 2008 i686 GNU/Linux

GCC Version

gcc -v
gcc-Version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)

Re: Tokenising a 10MB file trashes a 2GB machine
by Anonymous Monk on Jul 16, 2008 at 09:36 UTC

use re 'debug';


perl -Mre=debug -e"$f = q~a b c~ x 1;$g = [split m{\p{IsSpace}}ms, $f ];"


perl -e"die 1E7"

perl -Mre=debug -e"$f = q~a b c~ x 1E7;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2

Re^2: Tokenising a 10MB file trashes a 2GB machine

by PetaMem (Priest) on Jul 16, 2008 at 10:39 UTC

perl -Mre=debug -e"$f = q~a b c~ x 1E4;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2

Multiplier    Size of debugger output
1             14KiB
10            21KiB
100           137KiB
1E3           5.7MiB
1E4           507MiB
1E5           50GiB

Therefore I predict the debugger output would be (at least) about 5TiB for 1E6. The size comes from the fact that every time the regexp matches, the complete remaining data to be matched against is printed again, shortened by one token per match. The numbers mentioned above therefore halve if we use e.g. q{1234 } instead of q{a b c}.
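
The prediction can be sanity-checked with a small sketch that fits a growth exponent to the last three measurements and extrapolates to 1E6. It assumes the 1E3 entry is 5.7 MiB and that the growth stays quadratic beyond 1E5; the helper names growth_exponent and extrapolate_mib are made up for this illustration.

```perl
use strict;
use warnings;

# [multiplier, measured debugger output in MiB], taken from the table above.
my @measured = ( [ 1e3, 5.7 ], [ 1e4, 507 ], [ 1e5, 50 * 1024 ] );

# Exponent k in size ~ multiplier**k, estimated between two measurements.
sub growth_exponent {
    my ( $small, $large ) = @_;
    return log( $large->[1] / $small->[1] )
         / log( $large->[0] / $small->[0] );
}

# Extrapolate from one measured point to another multiplier,
# assuming size ~ multiplier**$exponent.
sub extrapolate_mib {
    my ( $point, $multiplier, $exponent ) = @_;
    return $point->[1] * ( $multiplier / $point->[0] )**$exponent;
}

for my $i ( 1 .. $#measured ) {
    printf "%g -> %g: exponent %.2f\n",
        $measured[ $i - 1 ][0], $measured[$i][0],
        growth_exponent( $measured[ $i - 1 ], $measured[$i] );
}

my $tib = extrapolate_mib( $measured[-1], 1e6, 2 ) / ( 1024 * 1024 );
printf "Predicted for 1E6: %.1f TiB\n", $tib;
```

Both exponents come out close to 2, which fits the explanation: the number of matches and the length of each printout both grow linearly with the multiplier.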

In between these printouts there is always the same output:

Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 1234 "
  Setting an EVAL scope, savestack=6
49969 < 1234> < 1234 1>    |  1:  ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
49970 <1234 > <1234 12>    | 13:  END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 "
  Setting an EVAL scope, savestack=6
49974 < 1234> < 1234 1>    |  1:  ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
49975 <1234 > <1234 12>    | 13:  END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 "
Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 "
  Setting an EVAL scope, savestack=6
49979 < 1234> < 1234 1>    |  1:  ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
49980 <1234 > <1234 12>    | 13:  END
Match successful!
Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 "

(this is taken from near the end of the debugger output to keep the size of the data sections small)

So unfortunately I do not see much in this output that could give me a hint about the additional memory consumption, except perhaps the "savestack=6", but I guess that is the same on every other perl interpreter. I'll try to compile Perl conservatively with an old GCC and a generic CPU architecture (maybe the new gcc does some wasteful alignment for the Core2 architecture).

Bye
PetaMem All Perl: MT, NLP, NLU
