PerlMonks  

Memory Leak? i'm clueless.

by downer (Monk)
on Feb 23, 2008 at 20:32 UTC ( [id://669763]=perlquestion )

downer has asked for the wisdom of the Perl Monks concerning the following question:

I have written some code that extracts all the words from an HTML page, according to some heuristics. The output of my code is correct, but there is some mysterious memory leak. I am really trying to hone my Perl skills, but this has me banging my head.

The input is one very long file containing many HTML pages, separated by the HTTP header and some additional information. Some of the code may be a little weird, it's kind of slapped together, but the output is what I want. Without further ado:
use strict;
use warnings;
use HTML::Parse;
use HTML::FormatText;
use File::Slurp;
use Lingua::Stem::Snowball;

$/ = 'warc/0.9 ';
my $sep = 'warc/0.9 ';

open FILE, shift;
open OUT, '>', shift;
open DATA, '>', shift;

my $total = 0;
my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

while(<FILE>) {
    my $line = $_;
    $line =~ s/$sep//;
    if($line) {
        $line =~ /(\d+)\sresponse\s(\S+)\s/i;
        my $id = $1;
        my $url = $2;
        #print "$1, $2\n+++++\n";
        $line =~ s/[^<]*//; #remove everything up to the 1st html tag (header, etc)
        my $len = length($line);
        if($len < 600000) {
            print DATA "$id\t$url\t$total\t$len\n";
            $total += $len;
            #print OUT "$line";
            my $plain_text = HTML::FormatText->new->format(parse_html($line));
            $plain_text =~ s/\[image\]//ig;
            $plain_text =~ s/(\S)/\L$1/ig;
            my @words = $plain_text =~ /\b\S+\b/ig;
            #my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );
            $stemmer->stem_in_place( \@words );
            foreach my $x (@words) {
                if($x =~ /^[A-Za-z0-9]+$/ and $x !~ /(http:\/\/(\S+)\b)|(&?nbsp)|(\b.*\d{5, }.*\b)|( ^\d+$)|(\S{32,})/) {
                    print OUT "$x ";
                }
            }
            print OUT "\n";
            print OUT '+-+-+-+-+', "\n";
        }
    }
}
Can any monks give me suggestions here?

Replies are listed 'Best First'.
Re: Memory Leak? i'm clueless.
by ikegami (Patriarch) on Feb 23, 2008 at 21:22 UTC

    HTML-Tree doesn't use weak refs, so you must explicitly destroy the trees and elements it creates. (It's right there in the Synopsis and the detailed Description of the Synopsis.) Change the line that builds $plain_text to

    my $html_tree = parse_html($line);
    $html_tree->eof();     # This was missing too.
    my $plain_text = HTML::FormatText->new()->format($html_tree);
    $html_tree->delete();  # This was missing.

    or

    use Object::Destroyer;
    my $html_tree = parse_html($line);
    $html_tree = Object::Destroyer->new($html_tree, 'delete');
    $html_tree->eof();
    my $plain_text = HTML::FormatText->new->format($html_tree);

    Since we're changing it anyway, might as well change HTML::Parse to HTML::TreeBuilder, since the former is a deprecated indirection layer for the latter. Also, there's no need to repeatedly create HTML::FormatText objects. You can create one and reuse it.

    use HTML::TreeBuilder;
    my $formatter = HTML::FormatText->new();
    my $html_tree = HTML::TreeBuilder->new_from_content($line);
    my $plain_text = $formatter->format($html_tree);
    $html_tree->delete();

    or

    use HTML::TreeBuilder;
    use Object::Destroyer;
    ...
    my $formatter = HTML::FormatText->new();
    ...
    my $plain_text = $formatter->format(
        Object::Destroyer->new(
            HTML::TreeBuilder->new_from_content($line),
            'delete'
        )
    );
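To show the idea behind the Object::Destroyer variants, here is a minimal sketch of a scope guard written with core Perl only. The ScopeGuard and FakeTree packages are hypothetical stand-ins made up for this illustration. This is not how Object::Destroyer is actually implemented, and unlike the real module the sketch does not forward method calls to the wrapped object. It only shows that a wrapper's DESTROY can call a cleanup method when the wrapper goes out of scope.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the idea behind Object::Destroyer: wrap an
# object and call a named cleanup method on it when the wrapper is destroyed.
package ScopeGuard;

sub new {
    my ($class, $object, $method) = @_;
    return bless { object => $object, method => $method }, $class;
}

sub DESTROY {
    my ($self) = @_;
    my $method = $self->{method};
    $self->{object}->$method() if $self->{object};
}

# Hypothetical resource that records whether its cleanup method ran.
package FakeTree;

our @log;

sub new    { return bless {}, shift }
sub delete { push @log, 'delete called' }

package main;

{
    my $guard = ScopeGuard->new(FakeTree->new, 'delete');
    push @FakeTree::log, 'working with tree';
}   # $guard goes out of scope here, so ScopeGuard::DESTROY calls ->delete

print join(', ', @FakeTree::log), "\n";
```

The point of the pattern is that cleanup no longer depends on remembering to call delete on every exit path of the loop body.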

    Updated: Added missing call to eof. Please read the documentation of functions before using them. The requirements to call eof and delete are clearly documented.
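For readers wondering why a tree without weak refs leaks, the following is a small sketch using only core Perl. The Node class is hypothetical and is not HTML::Element; it just mimics a tree in which parents point at children and children point back at parents. It counts live objects to compare three cases: a strong cycle left alone, a cycle broken by an explicit delete, and a parent link weakened with Scalar::Util::weaken.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical tree node that counts live objects so leaks are visible.
package Node;

our $live = 0;

sub new {
    my ($class) = @_;
    $live++;
    return bless { parent => undef, children => [] }, $class;
}

sub DESTROY { $live-- }

# Strong links in both directions: parent -> child and child -> parent.
sub add_child {
    my ($self, $child) = @_;
    $child->{parent} = $self;
    push @{ $self->{children} }, $child;
    return $child;
}

# Break the links by hand, similar in spirit to calling ->delete on a tree.
sub delete {
    my ($self) = @_;
    $_->delete for @{ $self->{children} };
    $self->{children} = [];
    $self->{parent}   = undef;
}

package main;

# 1. Strong cycle, no cleanup: both nodes survive leaving the block.
{
    my $root = Node->new;
    $root->add_child(Node->new);
}
my $leaked = $Node::live;

# 2. Same structure, but the cycle is broken explicitly before scope exit.
{
    my $root = Node->new;
    $root->add_child(Node->new);
    $root->delete;
}
my $after_delete = $Node::live - $leaked;

# 3. Same structure, but the child's parent link is a weak reference.
{
    my $root = Node->new;
    my $kid  = $root->add_child(Node->new);
    weaken($kid->{parent});
}
my $after_weaken = $Node::live - $leaked;

print "leaked=$leaked after_delete=$after_delete after_weaken=$after_weaken\n";
```

In a loop that parses one page per iteration, the first case repeats for every page, which is consistent with memory growing steadily over a long input file.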
