PerlMonks  


Hi Monks!

As I was asking in the CB, I have a Perl script that slurps a 25 MB file into $_ (by undef'ing $/) and runs a lot of regex matches against it, like a state machine.
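To make the setup concrete, here is a minimal sketch of the slurp-and-scan pattern described above. It is not the full script: `slurp` and `tokenize` are hypothetical helper names, and the two-token grammar (tags and plain text) is a simplification for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar ("slurp" mode).
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "'$path' couldn't be opened: $!";
    local $/;               # $/ is undef only until this sub returns
    my $text = <$in>;
    close $in;
    return $text;
}

# Hypothetical scanner: walk the string left to right with \G and /gc.
# \G anchors each match at pos(); /c keeps pos() when a match fails,
# so the next alternative is tried at the same offset.
sub tokenize {
    my ($text) = @_;        # note: this copies the string; fine for a sketch
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length($text)) {
        if    ($text =~ /\G<([^<>]*)>/gc) { push @tokens, [TAG  => $1] }
        elsif ($text =~ /\G([^<>]+)/gc)   { push @tokens, [TEXT => $1] }
        else  { die "unexpected input at offset " . pos($text) }
    }
    return @tokens;
}
```

Called as `tokenize('<Larry> da <Silva>')`, this yields three tokens: a TAG, a TEXT and another TAG.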

Now I know (thanks, diotalevi) that the regex matches are cloning my whole string, so I'd like to ask you all for suggestions on lowering the memory consumption :-)
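One direction that might be worth measuring, sketched under an assumption I have not verified on this data: that on older perls (before 5.20) the copy of the target string is made when a successful match uses capture groups or the $& family. The sketch matches without parentheses and cuts the token out with substr and the offsets in @- and @+. `next_tag` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of the next <...> tag at pos(),
# or undef if there is no tag there. The pattern has no capture groups;
# the token is cut out with substr() using the match offsets.
sub next_tag {
    my ($ref) = @_;         # a reference, so the big string is not copied per call
    return undef unless $$ref =~ /\G<[^<>]*>/gc;
    # $-[0] and $+[0] are the start and end offsets of the whole match
    return substr($$ref, $-[0] + 1, $+[0] - $-[0] - 2);
}

my $text = '<Larry><Silva> rest';
pos($text) = 0;
my @names;
while (defined(my $name = next_tag(\$text))) {
    push @names, $name;
}
print "@names\n";           # prints "Larry Silva"
```

Whether this actually lowers the peak memory depends on the perl version, so it should be checked against the real 25 MB input.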

Below is my script:

(disclaimer - this is my full code; if needed, I'll update the node and cut the unneeded parts)

(ps - the code reads all *.ms files in the directory, but currently I'm testing with only one file)

(ps2 - the examples and text were in Portuguese, but I've translated them here)

(ps3 - the stoplist is just a (very tiny) list of so-called "stopwords" - words like "de", "do", "da" (of), etc.)
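For readers who have not seen the idiom, the stoplist ends up as a hash used as a set. A minimal sketch, with an in-memory list instead of the real file; `is_connector` is a hypothetical helper name, and "das" is an illustrative extra entry that is not taken from the real stoplist.

```perl
use strict;
use warnings;

# Hypothetical in-memory stoplist; the real one is read from a file.
# "das" is an illustrative addition, not from the original list.
my @stopwords = qw(de do da das);

# Hash slice: every stopword becomes a key, the values stay undef.
my %stopword;
@stopword{@stopwords} = ();

# Hypothetical helper: true when a word may sit inside a name, i.e. it is
# at most two characters long or appears in the stoplist.
sub is_connector {
    my ($word) = @_;
    $word = lc $word;
    return length($word) <= 2 || exists $stopword{$word};
}
```

With this, `is_connector('das')` is true and `is_connector('Silva')` is false.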

# Unify name tags
#   <Larry> da <Silva> <Wall> X <John> S. <Doe>
# becomes
#   <Larry da Silva Wall> X <John S. Doe>

use strict;
use warnings;
use diagnostics;
use Storable;

my $debug = 0;
$" = '';   # interpolate arrays with no separator

if (@ARGV < 1) {
  print <<INFO
Usage: perl $0 [dir]
Where [dir] is the directory where the .ms files are located
INFO
  ;exit 1;
}

my ($dir) = @ARGV;
$dir ||= './';

# Load the stoplist into a hash used as a set
open SW, "/matchsimile/stoplist";
my @stopwords = <SW>;
close SW;
@stopwords = map { split /\s/ } @stopwords;
my %stopword;
@stopword{@stopwords} = undef;

opendir DIR, $dir;
my @files = map { $dir.'/'.$_ } grep { /\.ms$/ } sort readdir(DIR);
closedir DIR;

undef $/;   # slurp mode: each file is read in one go

my $caps = qr/[A-ZÄÅÆÇÈÒÉÜÓÊÝÔËðÕÌßÖÍÎØÏÙÐþÚÑÛÀÁÂÃ]/;
my $texto = qr/[A-Za-zÄÅÆÇÈÒÉÜÓÊæÝÔËðçÞÕÌúñèßÖÍûòéàÎüóêáØÏýôëâÙÐþõìãÚÑÿöíäÛÀîåÁøïÂùÃ\s]+/;
my @buffer;

for my $file (@files) {
  @buffer = ();
  open IN, "< $file" or die "'$file' couldn't be opened";
  $_ = <IN>;
  close IN;
  open OUT, "> $file.new";
  my $aux = select(OUT); $|=1; select($aux);
  print STDERR "Processing $file\n";
  my $state = 'TEXT';
  s/(<[^>]*)<([^<>]*>)/$1$2/g; # <<foo> <bar> baz <quux>> becomes <foo bar baz quux>
  # Measured after the substitution above, which can shorten the string
  my $total_size = length($_);
  study $_;
  /^/gc;   # set pos($_) to 0
  my $tick = time;
  until(pos($_) == $total_size) {
    print STDERR sprintf "\r%10d bytes",$total_size - pos($_) if time > $tick && ($tick = time);
    if ($debug) { print STDERR "[@buffer]"; print STDERR snippet(); print STDERR "\n"; }
    if ($state eq 'TEXT') {
      if (/\G(\s*<$caps[^<>]*>\s*)/gc) { $state = 'NAME'; push @buffer, $1; next; }
      if (/\G<([^<>]*)>/gc || /\G([^<>\s]*\s*)/gc) { print OUT $1; next; }
      die "STRANGE FOO ".snippet();
    }
    if ($state eq 'NAME') {
      if (/\G(<$caps[^<>]*>\s*)/gc) { push @buffer, $1; next }
      if (/\G<([^<>]*)>/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; }
      if (/\G(\s*([A-Z]\s*\.|(?!\s)$texto(?<!\s))\s*)/gc) {
        my ($token,$subtoken) = ($1,$2);
        if ($subtoken =~ /\b(?:x|e|ou)\b/i) { flush_name(); $state = 'TEXT'; print OUT $token; next; }
        # Strip accents, then lowercase, before the stoplist lookup
        $subtoken =~ tr/ÄÅÆÇÈÒÉÜÓÊæÝÔËðçÞÕÌúñèßÖÍûòéàÎüóêáØÏýôëâÙÐþõìãÚÑÿöíäÛÀîåÁøïÂùÃ/AAACEOEUOEaYOEecTOIunesOIuoeaIuoeaOIyoeaUEtoiaUNyoiaUAiaAoiAuA/;
        $subtoken =~ tr/A-Z/a-z/;
        if (length($subtoken) <= 2 || exists $stopword{$subtoken}) { push @buffer, $token; next; }
        flush_name(); $state = 'TEXT'; print OUT $token; next;
      }
      if (/\G([^<>\s]+\s*)/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; }
      die "STRANGE FOO ".snippet();
    }
    die "STRANGE FOO ".snippet();
  }
  close OUT;
}
print STDERR "\n";
exit 0;

# Show up to 42 characters at the current position, for debugging
sub snippet {
  my $text = "{".substr($_,pos($_),42)."}";
  $text =~ s/[\r\n]/|/g;
  return $text;
}

# Print the buffered tokens: merged into one tag if the buffer holds at
# least two tags, otherwise with the angle brackets stripped
sub flush_name {
  my $count = 0;
  for my $token (@buffer) {
    if ( index($token,'<') >= 0 ) {
      if (++$count == 2) {
        my $buffer = "@buffer";
        $buffer =~ s/>([^<>]*)</$1/g;
        print OUT $buffer;
        @buffer = ();
        return;
      }
    }
  }
  for my $tk (@buffer) {
    $tk =~ s/[<>]//g;
    print OUT $tk;
  }
  @buffer = ();
  return;
}
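As a worked example of the merge step in flush_name, here is a minimal standalone sketch. `merge_name` is a hypothetical name, and it returns the string instead of printing to OUT.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the merge step: join the buffered
# tokens with no separator, then drop every ">...<" boundary between
# tags so that only the outermost brackets survive.
sub merge_name {
    my (@buffer) = @_;
    my $joined = join '', @buffer;
    $joined =~ s/>([^<>]*)</$1/g;
    return $joined;
}

print merge_name('<Larry> ', 'da ', '<Silva> ', '<Wall> '), "X\n";
# prints "<Larry da Silva Wall> X"
```

The join with an empty separator mirrors what "@buffer" does in the script once $" is set to the empty string.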
-- 6x9=42

In reply to Regexes eating too much RAM by Articuno
