PerlMonks  

HTML::Tidy - uses up RAM at crazy rate

by Anonymous Monk
on Mar 13, 2016 at 03:18 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! I've got a Perl script that runs perl tidy to clean up a bunch of HTML files. It's a huge resource hog. I was wondering if there might be a better way to do what I'm doing. Here's my code:

my $call_dir = "Bing/1Parsed/Html3";
my $contents_of_file = 1;
#my $tidy = HTML::Tidy->new();
my $tidy = HTML::Tidy->new({
    tidy_mark => 1,
    #output_xhtml => 1, # yes
    #output_xhtml => 1, # yes
    add_xml_decl => 1, # no
    wrap => 76,
    error_file => errs.txt,
    char_encoding => utf8,
    indent_cdata => 1,
    clean => 1,
    fix_bad_comments =>1
});
my @files = glob "$call_dir/*.html";
printf "Got %d files\n", scalar @files;
for my $file (@files) {
    open my $in_fh, '<', $file or die "Could not open $file : $!";
    my $contents_of_file = do { local $/;<$in_fh> };
    close $in_fh;
    $tidy->parse( $file, $contents_of_file );
    open OUT,'>',$file or die "$!";
    print OUT $tidy->clean( $file, $contents_of_file );
    print "cleaning" . $file;
    for my $message ( $tidy->messages ) {
        #print $message->as_string;
    }

Thanks in advance for any help and/or insight!

Replies are listed 'Best First'.
Re: HTML::Tidy - uses up RAM at crazy rate
by graff (Chancellor) on Mar 13, 2016 at 03:42 UTC
    The OP code won't compile as posted. Show us a script that actually runs.

    Apart from that, have you tried use strict; use warnings; ? Also, with regard to handling a bunch of files, have you tried calling HTML::Tidy->new inside the loop on file names? (I don't know that it would make a difference, but maybe by parsing many files using the same instance of the object, there might be some accumulation of content?)

    Finally, what particular evidence are you seeing that leads you to regard it as "a huge resource hog"? CPU load? Memory footprint? Something else?

    Sorry - I realize RAM is in the title (which should mention "HTML::Tidy", rather than "Perl tidy", which is something else entirely). So, how much RAM are you talking about?
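To put a number on the memory footprint, one option is to print the process's resident set size after each file. A minimal sketch, assuming Linux (it reads /proc/self/status); rss_kb is a hypothetical helper, not part of HTML::Tidy, and the loop body is only a stand-in for the real per-file work:

```perl
use strict;
use warnings;

# Hypothetical helper: resident set size of this process in kB,
# read from /proc (Linux only). Returns undef where that is unavailable.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Stand-in for the per-file work; swap in the real loop body.
my @keep;
for my $i ( 1 .. 5 ) {
    push @keep, 'x' x 1_000_000;    # simulate data that is never released
    my $rss = rss_kb();
    printf "after item %d: %s kB\n", $i, defined $rss ? $rss : 'n/a';
}
```

If the number climbs steadily from file to file, something is accumulating; if it plateaus after the first few files, the footprint is probably just the size of the largest input.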

      Yes, you are right: it's an HTML::Tidy issue, not a perl tidy one. I mistyped. Here's the current code, trying your suggestion of calling new inside the for loop:

      use strict;
      use warnings;
      use HTML::Tidy;
      my $call_dir = "Bing/1Parsed/Html3";
      my $contents_of_file = 1;
      #my $tidy = HTML::Tidy->new();
      #commented out for now
      #my $tidy = HTML::Tidy->new({
      #    tidy_mark => 1,
      #    #output_xhtml => 1, # yes
      #    #output_xhtml => 1, # yes
      #    add_xml_decl => 1, # no
      #    wrap => 76,
      #    error_file => 'errs.txt',
      #    char_encoding => 'utf8',
      #    indent_cdata => 1,
      #    clean => 1,
      #    fix_bad_comments =>1
      #});
      my @files = glob "$call_dir/*.html";
      printf "Got %d files\n", scalar @files;
      for my $file (@files) {
          #added new Tidy piece here to test:
          my $tidy = HTML::Tidy->new({
              tidy_mark => 1,
              #output_xhtml => 1, # yes
              #output_xhtml => 1, # yes
              add_xml_decl => 1, # no
              wrap => 76,
              error_file => 'errs.txt',
              char_encoding => 'utf8',
              indent_cdata => 1,
              clean => 1,
              fix_bad_comments =>1
          });
          open my $in_fh, '<', $file or die "Could not open $file : $!";
          my $contents_of_file = do { local $/;<$in_fh> };
          close $in_fh;
          $tidy->parse( $file, $contents_of_file );
          open OUT,'>',$file or die "$!";
          print OUT $tidy->clean( $file, $contents_of_file );
          print "cleaning" . $file. "\n";
          for my $message ( $tidy->messages ) {
              #print $message->as_string;
          }
      }

      Compiles fine now. I'm testing speed with ->new before the for and after.
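If a single shared instance is kept outside the loop, one thing that may be worth ruling out is the message list. As far as I recall from the HTML::Tidy documentation, messages from parse() and clean() stay on the object until clear_messages() is called, and clean() takes only the HTML string(s), not a filename. A rough sketch under those assumptions; clean_file is a hypothetical helper, and $tidy is assumed to behave like an HTML::Tidy object:

```perl
use strict;
use warnings;

# Hypothetical helper: tidy one file with a shared tidy object, write the
# result to $out_path, and return the number of messages produced.
# $tidy is assumed to provide clean, messages and clear_messages,
# as an HTML::Tidy instance is documented to do.
sub clean_file {
    my ( $tidy, $in_path, $out_path ) = @_;

    open my $in_fh, '<', $in_path or die "Could not open $in_path: $!";
    my $html = do { local $/; <$in_fh> };    # slurp the whole file
    close $in_fh;

    my $cleaned  = $tidy->clean($html);      # HTML only, no filename
    my @messages = $tidy->messages;
    $tidy->clear_messages;                   # drop them before the next file

    open my $out_fh, '>', $out_path or die "Could not write $out_path: $!";
    print {$out_fh} $cleaned;
    close $out_fh or die "Could not close $out_path: $!";

    return scalar @messages;
}

# Usage, assuming HTML::Tidy is installed and both directories exist:
#   use HTML::Tidy;
#   my $tidy = HTML::Tidy->new( { wrap => 76 } );
#   clean_file( $tidy, $_, "out/" . ( split m{/}, $_ )[-1] ) for glob "in/*.html";
```

Lexical filehandles are closed explicitly here, so nothing is left open between iterations. Whether the message list is really what eats the RAM in the original script is a guess; comparing memory with and without the clear_messages call would settle it.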

        Thanks. This version runs ok - I tried it on a directory containing 77 html files, and it was pretty quick. What quantity of data are you handling? How big is the memory footprint?

        Just a couple other suggestions:

        • fix your indentation - making the code more legible really helps.
        • create the cleaned output html files in a separate directory, leaving the original input files unaltered, so that you can do multiple runs on the same input data, compare input to output, and compare outputs using different setups
        • use @ARGV for selecting input and output paths
        Something like this:
        ...
        unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] ) {
            die "Usage: $0 input/path output/path\n"
        }
        my ( $indir, $outdir ) = @ARGV;
        my @files = glob "$indir/*.html";
        die "No html files found in $indir\n" unless ( @files );
        ...
        for my $file ( @files ) {
            ...
            # anchor and quote $indir so it is matched literally at the start
            ( my $ofile = $file ) =~ s{^\Q$indir\E}{$outdir};
            open OUT, '>', $ofile or die "$!";
            ...
        }
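An alternative to rewriting the path with a substitution is to build the output path from the file name alone, which sidesteps any regex metacharacters in the directory name. A small sketch using the core modules File::Basename and File::Spec; output_path is a hypothetical helper and the paths in the loop are made up for illustration:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# Hypothetical helper: map an input file to the same file name under $outdir.
sub output_path {
    my ( $file, $outdir ) = @_;
    return File::Spec->catfile( $outdir, basename($file) );
}

# Made-up example paths, only to show the mapping.
for my $file ( 'in/page1.html', 'in.old/page2.html' ) {
    printf "%s -> %s\n", $file, output_path( $file, 'out' );
}
```

Since only the base name is kept, this suits the flat glob "$indir/*.html" case; a recursive directory walk would need to carry the relative subdirectory along as well.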
