PerlMonks  

Re^3: Perl tidy - uses up RAM at crazy rate

by graff (Chancellor)
on Mar 13, 2016 at 23:06 UTC


in reply to Re^2: HTML::Tidy - uses up RAM at crazy rate
in thread HTML::Tidy - uses up RAM at crazy rate

Thanks. This version runs OK. I tried it on a directory containing 77 html files, and it was pretty quick. What quantity of data are you handling? How big is the memory footprint?

Just a couple other suggestions:

  • fix your indentation - making the code more legible really helps.
  • create the cleaned output html files in a separate directory, leaving the original input files unaltered, so that you can do multiple runs on the same input data, compare input to output, and compare outputs using different setups
  • use @ARGV for selecting input and output paths
Something like this:
    ...
    unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] ) {
        die "Usage: $0 input/path output/path\n";
    }
    my ( $indir, $outdir ) = @ARGV;
    my @files = glob "$indir/*.html";
    die "No html files found in $indir\n" unless ( @files );
    ...
    for my $file ( @files ) {
        ...
        # \Q...\E keeps any regex metacharacters in the input path literal
        ( my $ofile = $file ) =~ s{^\Q$indir\E}{$outdir};
        open OUT, '>', $ofile or die "$!";
        ...
    }

Replies are listed 'Best First'.
Re^4: Perl tidy - uses up RAM at crazy rate
by Anonymous Monk on Mar 14, 2016 at 04:30 UTC

    Thanks for your suggestions on format changes and code improvements - I'll work on putting them in. Regarding size, I'm running the script on about 3,000 files. It's OK at first, then gets much slower as it continues. Any ideas? Thanks!

      Iterate over the list of files, then process each one in a child process using system.
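
A rough sketch of that idea, in case it helps. It assumes a separate worker script, here called tidy_one.pl (a made-up name, not something posted in this thread), that cleans exactly one file given an input path and an output path. The sub name process_in_children is also just illustrative. The point is only that each child exits after one file, so whatever memory it used goes back to the OS.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run one child process per input file. $cmd is an array ref holding the
# worker command; the input path and output path are appended to it.
# Returns the list of files whose child exited with a non-zero status.
sub process_in_children {
    my ( $cmd, $outdir, @files ) = @_;
    my @failed;
    for my $file (@files) {
        ( my $name = $file ) =~ s{.*/}{};    # strip the directory part
        # List form of system: no shell involved, so odd file names are safe.
        my $status = system( @$cmd, $file, "$outdir/$name" );
        push @failed, $file if $status != 0;
    }
    return @failed;
}

sub main {
    my ( $indir, $outdir ) = @_;
    die "Usage: $0 input/path output/path\n"
        unless @_ == 2 and -d $indir and -d $outdir;
    my @files = glob "$indir/*.html";
    die "No html files found in $indir\n" unless @files;

    # ASSUMPTION: tidy_one.pl is a hypothetical one-file worker script.
    my @failed = process_in_children( [ $^X, 'tidy_one.pl' ], $outdir, @files );
    warn "failed: $_\n" for @failed;
}

main(@ARGV) if @ARGV;
```

Starting a new perl for every file costs some startup time, so for 3,000 files it may be worth handing each child a batch of files instead of a single one.
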
      So you've tried two versions - one with HTML::Tidy->new outside the file loop, and one with it inside the loop? Was there actually no difference in behavior?

      I notice that there's a "clear_messages" function. Have you tried calling that at the end of each iteration? (I assume you've read the man page for this module...)
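
For what it's worth, here is a minimal sketch of what I mean: one HTML::Tidy object created outside the loop, with clear_messages called at the end of each iteration. I have not tested whether this actually cures the memory growth; it is only how I would try it. The sub name tidy_files and the slurp-and-write details are my own assumptions, not the OP's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Clean each file with one shared tidy object, clearing its stored
# messages after every file so that list cannot grow across iterations.
# $tidy is expected to behave like an HTML::Tidy object
# (it needs clean and clear_messages). Returns the number of files written.
sub tidy_files {
    my ( $tidy, $outdir, @files ) = @_;
    my $count = 0;
    for my $file (@files) {
        open my $in, '<', $file or die "$file: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close $in;

        my $clean = $tidy->clean($html);

        ( my $name = $file ) =~ s{.*/}{};     # strip the directory part
        open my $out, '>', "$outdir/$name" or die "$outdir/$name: $!";
        print {$out} $clean;
        close $out or die "$outdir/$name: $!";

        $tidy->clear_messages;    # drop the warnings/errors kept for this file
        $count++;
    }
    return $count;
}

# Only runs when called with an input and an output directory.
if ( @ARGV == 2 ) {
    require HTML::Tidy;
    my ( $indir, $outdir ) = @ARGV;
    tidy_files( HTML::Tidy->new, $outdir, glob "$indir/*.html" );
}
```

If memory still climbs with this version, that would suggest the growth is not in the message list, and running each file in its own child process would be the safer workaround.
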

      I forgot to mention that I've got 8 GB of DDR3 RAM and a 1.5 GHz Intel Core i5 processor. Thanks.
