PerlMonks  

Re: HTML::Tidy - uses up RAM at crazy rate

by graff (Chancellor)
on Mar 13, 2016 at 03:42 UTC ( [id://1157580]=note )


in reply to HTML::Tidy - uses up RAM at crazy rate

The OP code won't compile as posted. Show us a script that actually runs.

Apart from that, have you tried use strict; use warnings; ? Also, with regard to handling a bunch of files, have you tried calling HTML::Tidy->new inside the loop on file names? (I don't know that it would make a difference, but maybe by parsing many files using the same instance of the object, there might be some accumulation of content?)

Finally, what particular evidence are you seeing that leads you to regard it as "a huge resource hog"? CPU load? Memory footprint? Something else?

Sorry - I realize RAM is in the title (which should mention "HTML::Tidy", rather than "Perl tidy", which is something else entirely). So, how much RAM are you talking about?
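One way to put a number on the memory question is to have the script report its own resident set size as it runs. The following is a minimal sketch, assuming a Linux system where /proc/self/status exposes a VmRSS line; rss_kb() is a hypothetical helper name, not part of HTML::Tidy, and the array allocation is only a stand-in for the per-file work.

```perl
use strict;
use warnings;

# Assumption: Linux, where /proc/self/status reports VmRSS in kB.
# rss_kb() is a hypothetical helper, not part of HTML::Tidy.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;    # no VmRSS line found
}

my $before  = rss_kb();
my @ballast = (1) x 1_000_000;    # stand-in for the per-file work
my $after   = rss_kb();

if ( defined $before and defined $after ) {
    printf "RSS before: %d kB, after: %d kB\n", $before, $after;
}
else {
    print "VmRSS not available on this platform\n";
}
```

Calling such a helper every hundred files or so inside the loop would show whether the footprint keeps growing or levels off.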

Replies are listed 'Best First'.
Re^2: HTML::Tidy - uses up RAM at crazy rate
by Anonymous Monk on Mar 13, 2016 at 18:42 UTC

    Yes, you are right, it's an HTML::Tidy issue, not a perl tidy issue. I mistyped. Here's the current code, trying your suggestion of calling new inside the for loop:

    use strict;
    use warnings;
    use HTML::Tidy;

    my $call_dir = "Bing/1Parsed/Html3";
    my $contents_of_file = 1;
    #my $tidy = HTML::Tidy->new(); #commented out for now
    #my $tidy = HTML::Tidy->new({
    #    tidy_mark => 1,
    #    #output_xhtml => 1, # yes
    #    #output_xhtml => 1, # yes
    #    add_xml_decl => 1, # no
    #    wrap => 76,
    #    error_file => 'errs.txt',
    #    char_encoding => 'utf8',
    #    indent_cdata => 1,
    #    clean => 1,
    #    fix_bad_comments =>1
    #});

    my @files = glob "$call_dir/*.html";
    printf "Got %d files\n", scalar @files;

    for my $file (@files) {
        #added new Tidy piece here to test:
        my $tidy = HTML::Tidy->new({
            tidy_mark => 1,
            #output_xhtml => 1, # yes
            #output_xhtml => 1, # yes
            add_xml_decl => 1, # no
            wrap => 76,
            error_file => 'errs.txt',
            char_encoding => 'utf8',
            indent_cdata => 1,
            clean => 1,
            fix_bad_comments =>1
        });
        open my $in_fh, '<', $file or die "Could not open $file : $!";
        my $contents_of_file = do { local $/;<$in_fh> };
        close $in_fh;
        $tidy->parse( $file, $contents_of_file );
        open OUT,'>',$file or die "$!";
        print OUT $tidy->clean( $file, $contents_of_file );
        print "cleaning" . $file. "\n";
        for my $message ( $tidy->messages ) {
            #print $message->as_string;
        }
    }

    Compiles fine now. I'm testing speed with ->new before the for and after.

      Thanks. This version runs ok - I tried it on a directory containing 77 html files, and it was pretty quick. What quantity of data are you handling? How big is the memory footprint?

      Just a couple other suggestions:

      • fix your indentation - making the code more legible really helps.
      • create the cleaned output html files in a separate directory, leaving the original input files unaltered, so that you can do multiple runs on the same input data, compare input to output, and compare outputs using different setups
      • use @ARGV for selecting input and output paths
      Something like this:
      ...
      unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] ) {
          die "Usage: $0 input/path output/path\n"
      }
      my ( $indir, $outdir ) = @ARGV;
      my @files = glob "$indir/*.html";
      die "No html files found in $indir\n" unless ( @files );
      ...
      for my $file ( @files ) {
          ...
          ( my $ofile = $file ) =~ s{$indir}{$outdir};
          open OUT, '>', $ofile or die "$!";
          ...
      }
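Note that s{$indir}{$outdir} treats $indir as a regex pattern, so a directory name containing characters such as "." or "+" would be interpreted as pattern syntax. An alternative sketch, using only core modules, builds the output path from the file's base name instead; out_path() is a hypothetical helper name introduced here for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# out_path() is a hypothetical helper: keep the file name, swap the directory.
sub out_path {
    my ( $file, $outdir ) = @_;
    return File::Spec->catfile( $outdir, basename($file) );
}

# Example with made-up paths: the name a.html is kept, the directory changes.
print out_path( 'in/a.html', 'out' ), "\n";
```

Inside the loop this would replace the substitution line, e.g. my $ofile = out_path( $file, $outdir );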

        Thanks for your suggestions on format changes and code improvements - I'll work on putting them in. Regarding size, I'm running the script on about 3,000 files. It's OK at first, but gets much slower as it continues. Any ideas? Thanks!
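To confirm where the slowdown sets in, it may help to record how long each file takes. This is a minimal sketch using the core Time::HiRes module; time_each() is a hypothetical helper, and the callback here is a stand-in workload, not the actual parse/clean/write steps.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# time_each() is a hypothetical helper: run $work on each item and
# return a list of [ item, elapsed seconds ] pairs.
sub time_each {
    my ( $work, @items ) = @_;
    my @timings;
    for my $item (@items) {
        my $start = time();
        $work->($item);
        push @timings, [ $item, time() - $start ];
    }
    return @timings;
}

# Stand-in workload; in the real script the callback would hold the
# parse/clean/write steps for one file.
my $work    = sub { my $n = 0; $n += $_ for 1 .. 10_000; return $n };
my @timings = time_each( $work, qw(a.html b.html c.html) );

printf "%-10s %.6f s\n", @$_ for @timings;
```

If the per-file times climb steadily over the 3,000 files, that points at something accumulating across iterations rather than at a few unusually large inputs.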
