Re^2: HTML::Tidy - uses up RAM at crazy rate

by Anonymous Monk
on Mar 13, 2016 at 18:42 UTC


in reply to Re: HTML::Tidy - uses up RAM at crazy rate
in thread HTML::Tidy - uses up RAM at crazy rate

Yes, you are right - it's an HTML::Tidy issue, not a Perl::Tidy one; I mistyped. Here's the current code, trying your suggestion of calling ->new inside the for loop:

use strict;
use warnings;
use HTML::Tidy;

my $call_dir = "Bing/1Parsed/Html3";

#my $tidy = HTML::Tidy->new();   # commented out for now
#my $tidy = HTML::Tidy->new({
#    tidy_mark        => 1,
#    #output_xhtml    => 1,   # yes
#    add_xml_decl     => 1,   # no
#    wrap             => 76,
#    error_file       => 'errs.txt',
#    char_encoding    => 'utf8',
#    indent_cdata     => 1,
#    clean            => 1,
#    fix_bad_comments => 1,
#});

my @files = glob "$call_dir/*.html";
printf "Got %d files\n", scalar @files;

for my $file (@files) {

    # added new Tidy piece here to test:
    my $tidy = HTML::Tidy->new({
        tidy_mark        => 1,
        #output_xhtml    => 1,   # yes
        add_xml_decl     => 1,   # no
        wrap             => 76,
        error_file       => 'errs.txt',
        char_encoding    => 'utf8',
        indent_cdata     => 1,
        clean            => 1,
        fix_bad_comments => 1,
    });

    open my $in_fh, '<', $file or die "Could not open $file : $!";
    my $contents_of_file = do { local $/; <$in_fh> };
    close $in_fh;

    # parse() takes the filename only for labelling diagnostics;
    # clean() takes just the HTML to be cleaned
    $tidy->parse( $file, $contents_of_file );

    open my $out_fh, '>', $file or die "Could not open $file for writing: $!";
    print $out_fh $tidy->clean( $contents_of_file );
    close $out_fh;

    print "cleaning " . $file . "\n";

    for my $message ( $tidy->messages ) {
        #print $message->as_string;
    }
}

It compiles fine now. I'm testing the speed with ->new before the for loop versus inside it.

Replies are listed 'Best First'.
Re^3: Perl tidy - uses up RAM at crazy rate
by graff (Chancellor) on Mar 13, 2016 at 23:06 UTC
    Thanks. This version runs OK - I tried it on a directory containing 77 HTML files, and it was pretty quick. What quantity of data are you handling? How big is the memory footprint?

    Just a couple other suggestions:

    • fix your indentation - making the code more legible really helps.
    • create the cleaned output HTML files in a separate directory, leaving the original input files unaltered, so that you can do multiple runs on the same input data, compare input to output, and compare outputs from different setups.
    • use @ARGV for selecting input and output paths
    Something like this:
    ...
    unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] ) {
        die "Usage: $0 input/path output/path\n";
    }
    my ( $indir, $outdir ) = @ARGV;
    my @files = glob "$indir/*.html";
    die "No html files found in $indir\n" unless ( @files );
    ...
    for my $file ( @files ) {
        ...
        ( my $ofile = $file ) =~ s{$indir}{$outdir};
        open OUT, '>', $ofile or die "$!";
        ...
    }

      Thanks for your suggestions on format changes and code improvements - I'll work on putting them in. Regarding size, I'm running the script on about 3,000 files. It's OK at first, but gets much slower as it continues. Any ideas? Thanks!

        Iterate over the list of files, then process each one in a child process using system.
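        For example (a minimal sketch - the helper script name tidy_one_file.pl is hypothetical, and would contain the per-file HTML::Tidy code shown above):

            # parent process: clean each file in a child, so whatever memory
            # HTML::Tidy/libtidy holds is released when that child exits
            for my $file (@files) {
                system( $^X, 'tidy_one_file.pl', $file ) == 0
                    or warn "child failed on $file: $?";
            }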
        So you've tried two versions - one with HTML::Tidy->new outside the file loop, and one with it inside the loop? Was there actually no difference in behavior?

        I notice that there's a "clear_messages" function. Have you tried calling that at the end of each iteration? (I assume you've read the man page for this module...)
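        Something like this at the bottom of your for loop (a sketch against HTML::Tidy's documented API, assuming you keep a single $tidy created before the loop):

            for my $message ( $tidy->messages ) {
                #print $message->as_string;
            }
            # discard the diagnostics collected for this file so they
            # don't pile up in memory across the whole run
            $tidy->clear_messages();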

        I forgot to include that I've got 8 GB of DDR3 RAM and a 1.5 GHz Intel Core i5 processor. Thanks.
