Tokenising a 10MB file trashes a 2GB machine

by PetaMem (Priest)
on Jul 16, 2008 at 08:48 UTC

PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

it seems I have, again, stumbled across an example of Perl's "obscene memory consumption habits". Basically I tried to tokenize a 10MB file in memory, and when it crashed my computer I took a closer look:

Take emails (simple text, no HTML, no attachments), concatenate them into a 10MB file, then do something like

    my $content = slurp 'file';
    print size($content),"\n";
    print total_size([split m{\p{IsSpace}}ms, $content]),"\n";

Using Devel::Size to find the culprit gives 10485544 (file size) and 370379304 (result of split). While the two numbers are within expectation, the script takes more than 1.8GB of RAM before it can print the second number, which I think is somewhat insane. This is 64-bit Perl 5.8.8 on x86_64.

Of course I am aware of String::Tokenizer and other iterative approaches to tokenizing. I would just like to hear from someone more knowledgeable about Perl's internals why there is a *hidden* memory consumption by a factor of 5 that I cannot explain. Is it something special about split? Some wild copying going on?

edit:
What I've learned from this: don't use split on large strings. If you have a whole file, process it line by line or in similar chunks. In other words: make sure the string you feed to split has a guaranteed maximum length, or your machine will choke someday.
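The line-by-line approach can be sketched roughly as follows. This is only an illustration, not code from the thread: tokenize_stream is a hypothetical helper, the in-memory filehandle stands in for the real mail file, and the pattern uses \p{IsSpace}+ (runs of whitespace) instead of the single-character pattern in the question, so runs of blanks do not produce empty tokens.

```perl
use strict;
use warnings;

# Tokenise a filehandle one line at a time, so that split only ever
# sees a string of bounded length. Calls $on_token for each token and
# returns the number of tokens seen.
# (tokenize_stream is a hypothetical helper, not from the thread.)
sub tokenize_stream {
    my ($fh, $on_token) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        for my $token (split m{\p{IsSpace}+}, $line) {
            next if $token eq '';    # leading whitespace yields one empty field
            $on_token->($token);
            $count++;
        }
    }
    return $count;
}

# Demo on an in-memory filehandle; a real script would open the mail file.
my $text = "Dear monks,\n  it seems I have\nstumbled\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my %freq;
my $n = tokenize_stream($fh, sub { $freq{ $_[0] }++ });
close $fh;
print "$n tokens, ", scalar(keys %freq), " distinct\n";
```

Because a newline is itself whitespace, no token can span two lines, so this produces the same non-empty tokens as splitting the whole file at once while only holding one line's worth of fields at a time.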

Bye
 PetaMem
    All Perl:   MT, NLP, NLU

Replies are listed 'Best First'.
Re: Tokenising a 10MB file trashes a 2GB machine
by BrowserUk (Patriarch) on Jul 16, 2008 at 09:10 UTC

    Be aware that using Devel::Size::total_size() itself can consume prodigious amounts of memory.

    In the process of examining the structure of the anonymous array you are creating, it builds a 'tracking hash' to allow it to avoid counting duplicate references to data or internal magic etc. Whether that's the cause of your memory consumption here is not clear, but it would be worth eliminating Devel::Size from the equation.
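One way to take Devel::Size out of the equation is to read the kernel's own figures for the process instead. The following is a minimal sketch that assumes a Linux /proc filesystem; status_field_kib and current_vm_kib are hypothetical helper names, and VmPeak is one of the fields Linux reports in /proc/<pid>/status.

```perl
use strict;
use warnings;

# Extract a "VmXXX:   12345 kB" field from the text of /proc/<pid>/status.
# Returns the value in KiB, or undef if the field is absent.
sub status_field_kib {
    my ($status_text, $field) = @_;
    return $status_text =~ m{^\Q$field\E:\s+(\d+)\s+kB}m ? $1 : undef;
}

# Read the current process's status file (Linux only); undef elsewhere.
sub current_vm_kib {
    my ($field) = @_;
    open my $fh, '<', "/proc/$$/status" or return undef;
    local $/;                        # slurp mode
    my $text = <$fh>;
    close $fh;
    return status_field_kib($text, $field);
}

# Measure peak virtual memory around a split, without Devel::Size.
my $before = current_vm_kib('VmPeak');
my @tokens = split m{\p{IsSpace}}, 'tralala ' x 1E5;
my $after  = current_vm_kib('VmPeak');
if (defined $before && defined $after) {
    printf "VmPeak grew by %d KiB for %d tokens\n",
        $after - $before, scalar @tokens;
}
else {
    print "no /proc status file here; skipping measurement\n";
}
```

Since nothing walks the data structure, the measurement itself adds no tracking hash; the price is that it only shows process-wide totals, not the size of a single variable.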


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      It should be obvious that I added Devel::Size AFTER I found my script trashing the machine. I wanted to write in my original post that Devel::Size is completely neutral in this case, but found that information redundant.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Re: Tokenising a 10MB file trashes a 2GB machine
by moritz (Cardinal) on Jul 16, 2008 at 09:23 UTC
    I can't reproduce the problem here. I took my 20MB junk mail file and split it, and virtual memory usage was about 210MB, both for perl 5.8.8 and perl 5.10.0 (on Linux).

    Do you do anything else in your script? What's your OS, which perl version do you use?

      Assuming you have some Linux flavour as your OS: could you please run the following script on your machine and post its output?

      #!/usr/bin/perl
      use warnings;
      use strict;
      use Devel::Size qw(size total_size);
      use Encode;

      my $content = decode('UTF-8', 'tralala ' x 1E6);
      print size($content),"\n";
      print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]),"\n";
      procinfo();

      sub procinfo {
          my @stat;
          my $MiB = 1024 * 1024;
          if (open( STAT , '<:utf8', "/proc/$$/stat")) {
              @stat = split /\s+/ , <STAT>;
              close STAT ;
          }
          else {
              die "procinfo: Unable to open stat file.\n";
          }
          print sprintf "Vsize: %3.2f MiB (%10d\)\n", $stat[22]/$MiB, $stat[22];
          print "RSS : $stat[23] pages\n";
      }

      The only difference I see is that the 32-bit build takes about 60% of the space the 64-bit build takes. But there is still a factor of 5 between the virtual memory used and the total size of the split list.

      • Perl 5.8.8 on i686 (gcc 4.3.1 compiled, -march=686 -O2)
          # ./tokenizer.pl
          8000028
          68000056
          Vsize: 322.56 MiB ( 338231296)
          RSS : 79087 pages
      • Perl 5.8.8 on x86_64 (gcc 4.3.1 compiled, -march=core2 -O2)
          # ./tokenizer.pl
          8000048
          112000096
          Vsize: 537.61 MiB ( 563724288)
          RSS : 130586 pages
      • Perl 5.8.8 on x86_64 (gcc 4.1.2 compiled, -O2)
          $ tokenizer.pl
          8000048
          112000096
          Vsize: 539.42 MiB ( 565620736)
          RSS : 130571 pages

      So no matter what, memory usage is always about five times higher than it should be (about 30MB is perl overhead, which is present even if the data is just a few bytes). Which makes me very unhappy...

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

        On a 32-bit system, there is roughly a 32-byte overhead per string (not including the string itself). Also, if you create a list (e.g. with split) and then assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment the temporary copy is freed for perl to reuse, but not returned to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:
        my $content = decode('UTF-8', 'tralala ' x 1E6);
        my @a;
        $#a = 10_000_000; # presize array
        for (1..5) {
            print "ITER $_\n";
            push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
            procinfo();
        }
        which on my system gives the following output:
        ITER 1
        Vsize: 248.18 MiB ( 260235264)
        RSS : 62362 pages
        ITER 2
        Vsize: 317.14 MiB ( 332550144)
        RSS : 80000 pages
        ITER 3
        Vsize: 393.71 MiB ( 412839936)
        RSS : 99598 pages
        ITER 4
        Vsize: 579.46 MiB ( 607612928)
        RSS : 147156 pages
        ITER 5
        Vsize: 625.23 MiB ( 655597568)
        RSS : 158895 pages
        which averages about 94MB growth per iteration, or 47 bytes per string pushed onto @a. Allowing 32 bytes of overhead per string (SV and PV structures) leaves 15 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead and so on, looks reasonable.
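The per-string estimate can be checked directly from the five Vsize byte counts in the output above. This is a minimal sketch; the figure of 2E6 strings per iteration is an assumption derived from the test string (one word plus one captured separator per repetition of 'tralala '). Because procinfo reports MiB (2^20 bytes), the exact result comes out a little above the rounded 47 bytes, which does not change the conclusion.

```perl
use strict;
use warnings;

# Vsize in bytes after each of the five iterations, copied from the output.
my @vsize = (260235264, 332550144, 412839936, 607612928, 655597568);

# Assumption: 'tralala ' x 1E6 split with a capturing group yields one
# word and one separator per repetition, i.e. 2E6 strings per iteration.
my $strings_per_iter = 2 * 1E6;

# Average growth over the four steps between the five measurements.
my $growth_per_iter  = ($vsize[-1] - $vsize[0]) / (@vsize - 1);
my $bytes_per_string = $growth_per_iter / $strings_per_iter;

printf "growth per iteration: %.1f MiB\n", $growth_per_iter / (1024 * 1024);
printf "bytes per string:     %.1f\n", $bytes_per_string;
printf "left after 32-byte SV/PV overhead: %.1f\n", $bytes_per_string - 32;
```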

        Dave.

        I have Debian GNU/Linux on a boring 32 bit i386 machine.
        perl 5.8.8:
        8000028
        68000056
        Vsize: 322.68 MiB ( 338354176)
        RSS : 79112 pages

        perl 5.10.0:
        8000036
        84000100
        Vsize: 270.80 MiB ( 283951104)
        RSS : 68365 pages

      The OS is 64-bit Gentoo Linux and the Perl is 5.8.8. Nevertheless, I suspected my Perl interpreter, because it is compiled with GCC 4.3.1 -march=core2.

      Unfortunately it behaved so well on all machines that there is no "conservative" perl left as a reference. My bad. As all of Perl's own test cases also ran OK, this could be a candidate for a test case, or even a new class of tests (expected memory consumption). Maybe this could be taken to perl5-porters.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

      Here's one more output:

      $ perl ./tokenizer.pl
      8000028
      68000056
      Vsize: 322.05 MiB ( 337694720)
      RSS : 79143 pages

      OS

      uname -a
      Linux ubuntu 2.6.24-19-generic #1 SMP Fri Jul 11 23:41:49 UTC 2008 i686 GNU/Linux

      GCC Version

      gcc -v
      gcc-Version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)
Re: Tokenising a 10MB file trashes a 2GB machine
by Anonymous Monk on Jul 16, 2008 at 09:36 UTC
    Try use re 'debug'; and see if it sheds any light. This replicates your results:
    perl -Mre=debug -e"$f = q~a b c~ x 1;$g = [split m{\p{IsSpace}}ms, $f];"
    perl -e"die 1E7"
    perl -Mre=debug -e"$f = q~a b c~ x 1E7;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2
      perl -Mre=debug -e"$f = q~a b c~ x 1E4;$g = [split m{\p{IsSpace}}ms, $f ];" 2>2
      The sheer size of the debugger output makes it impossible to run with the 1E7 multiplier, and although I still do not know how to interpret the output, maybe someone here does.
      Multiplier   Size of debugger output
      1            14 KiB
      10           21 KiB
      100          137 KiB
      1E3          5.7 MiB
      1E4          507 MiB
      1E5          50 GiB

      Therefore I predict the debugger output would be (at least) about 5TiB for 1E6. The size comes from the fact that each match attempt prints the complete remaining string, which gets shorter by one token every time the regexp matches. The numbers above therefore halve if we use e.g. q{1234 } instead of q{a b c}.
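The 5TiB prediction follows from the roughly quadratic growth of the measured sizes: ten times the input gives about a hundred times the output. A minimal sketch of that extrapolation, using only the three largest measurements reported in this post (growth_exponent is a hypothetical helper name):

```perl
use strict;
use warnings;

# [multiplier, debugger output in MiB], taken from the measurements above.
my @measured = (
    [ 1E3, 5.7 ],
    [ 1E4, 507 ],
    [ 1E5, 50 * 1024 ],
);

# Exponent k in size ~ multiplier**k, estimated between two measurements.
sub growth_exponent {
    my ($lo, $hi) = @_;
    return log($hi->[1] / $lo->[1]) / log($hi->[0] / $lo->[0]);
}

printf "exponent 1E3 -> 1E4: %.2f\n", growth_exponent(@measured[0, 1]);
printf "exponent 1E4 -> 1E5: %.2f\n", growth_exponent(@measured[1, 2]);

# Extrapolate one more decade using the most recent exponent.
my $k             = growth_exponent(@measured[1, 2]);
my $predicted_mib = $measured[2][1] * 10 ** $k;
printf "predicted for 1E6:   %.1f TiB\n", $predicted_mib / (1024 * 1024);
```

Both exponents come out close to 2, which is what one would expect if every one of the N matches prints a remaining string whose length is proportional to N.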

      In between these printouts there is always the same output:

      Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 1234 "
      Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 1234 "
      Setting an EVAL scope, savestack=6
      49969 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
      49970 <1234 > <1234 12> | 13: END
      Match successful!
      Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 1234 "
      Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 1234 "
      Setting an EVAL scope, savestack=6
      49974 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
      49975 <1234 > <1234 12> | 13: END
      Match successful!
      Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 1234 "
      Matching stclass "ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]" against "1234 1234 1234 1234 1234 "
      Setting an EVAL scope, savestack=6
      49979 < 1234> < 1234 1> | 1: ANYOF[{unicode}\11-\15 \302\205\302\240...+utf8::IsSpace]
      49980 <1234 > <1234 12> | 13: END
      Match successful!
      Matching REx "\p{IsSpace}" against "1234 1234 1234 1234 "

      (this is taken from near the end of the debugger output to keep the size of the data sections small)

      So unfortunately I do not see much in this output that hints at the additional memory consumption, except perhaps the "savestack=6", but I guess that is the same on every other perl interpreter. I'll try to compile Perl conservatively with an old GCC and a generic CPU architecture (maybe the new gcc wastes space on alignment for the Core2 architecture).

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Node Type: perlquestion [id://697895]
Approved by Corion
Front-paged by Corion