in reply to HTML::Tidy - uses up RAM at crazy rate

The OP code won't compile as posted. Show us a script that actually runs.

Apart from that, have you tried use strict; use warnings; ? Also, with regard to handling a bunch of files, have you tried calling HTML::Tidy->new inside the loop over the file names? (I don't know whether it would make a difference, but perhaps parsing many files through the same object instance lets content accumulate.)

Finally, what particular evidence are you seeing that leads you to regard it as "a huge resource hog"? CPU load? Memory footprint? Something else?

Sorry - I realize RAM is in the title (which should say "HTML::Tidy" rather than "Perl tidy", which is something else entirely). So, how much RAM are you talking about?
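If you want a concrete number to report, something along these lines could log the process's memory footprint per file (a minimal, Linux-only sketch that reads /proc/self/status; the rss_kb helper is mine, not part of your script):

    use strict;
    use warnings;

    # Return this process's resident set size in kB (Linux only).
    sub rss_kb {
        open my $fh, '<', '/proc/self/status' or return;
        while ( my $line = <$fh> ) {
            return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
        }
        return;
    }

    # Inside the file loop, after cleaning each file:
    # printf "After %s: RSS = %s kB\n", $file, rss_kb() // 'n/a';

If the RSS figure keeps climbing with each file, that would point at per-object accumulation; if it plateaus, the problem is elsewhere.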

Re^2: HTML::Tidy - uses up RAM at crazy rate
by Anonymous Monk on Mar 13, 2016 at 18:42 UTC

    Yes, you are right - it's an HTML::Tidy issue, not a Perl::Tidy one. I mistyped. Here's the current code, trying your new-inside-the-loop suggestion:

    use strict;
    use warnings;
    use HTML::Tidy;

    my $call_dir = "Bing/1Parsed/Html3";

    #my $tidy = HTML::Tidy->new();    # commented out for now
    #my $tidy = HTML::Tidy->new({
    #    tidy_mark        => 1,
    #    #output_xhtml    => 1,    # yes
    #    add_xml_decl     => 1,    # no
    #    wrap             => 76,
    #    error_file       => 'errs.txt',
    #    char_encoding    => 'utf8',
    #    indent_cdata     => 1,
    #    clean            => 1,
    #    fix_bad_comments => 1,
    #});

    my @files = glob "$call_dir/*.html";
    printf "Got %d files\n", scalar @files;

    for my $file (@files) {
        # added new Tidy piece here to test:
        my $tidy = HTML::Tidy->new({
            tidy_mark        => 1,
            #output_xhtml    => 1,    # yes
            add_xml_decl     => 1,    # no
            wrap             => 76,
            error_file       => 'errs.txt',
            char_encoding    => 'utf8',
            indent_cdata     => 1,
            clean            => 1,
            fix_bad_comments => 1,
        });

        open my $in_fh, '<', $file or die "Could not open $file : $!";
        my $contents_of_file = do { local $/; <$in_fh> };
        close $in_fh;

        $tidy->parse( $file, $contents_of_file );

        open my $out_fh, '>', $file or die "Could not write $file : $!";
        print {$out_fh} $tidy->clean( $file, $contents_of_file );
        close $out_fh;

        print "cleaning " . $file . "\n";

        for my $message ( $tidy->messages ) {
            #print $message->as_string;
        }
    }

    Compiles fine now. I'm testing speed with ->new before the for loop versus inside it.
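    A per-file timer would show whether later files really take longer (a sketch using the core Time::HiRes module; process_file here is a hypothetical stand-in for the parse/clean work, not a function from the script above):

        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        sub process_file { }    # hypothetical stand-in for the tidy work

        for my $file ( glob "Bing/1Parsed/Html3/*.html" ) {
            my $t0 = [gettimeofday];
            process_file($file);
            printf "%s took %.3f s\n", $file, tv_interval($t0);
        }

    If per-file times grow steadily, the slowdown is cumulative; if only certain files are slow, it's the input.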

      Thanks. This version runs OK - I tried it on a directory containing 77 html files, and it was pretty quick. What quantity of data are you handling? How big is the memory footprint?

      Just a few other suggestions:

      • Fix your indentation - making the code more legible really helps.
      • Write the cleaned output html files to a separate directory, leaving the original input files unaltered, so that you can do multiple runs on the same input data, compare input to output, and compare outputs from different setups.
      • Use @ARGV for selecting the input and output paths.
      Something like this:
      ...
      unless ( @ARGV == 2 and -d $ARGV[0] and -d $ARGV[1] ) {
          die "Usage: $0 input/path output/path\n";
      }
      my ( $indir, $outdir ) = @ARGV;

      my @files = glob "$indir/*.html";
      die "No html files found in $indir\n" unless @files;
      ...
      for my $file ( @files ) {
          ...
          ( my $ofile = $file ) =~ s{$indir}{$outdir};
          open OUT, '>', $ofile or die "$!";
          ...
      }

        Thanks for your suggestions on format changes and code improvements - I'll work on putting them in. Regarding size, I'm running the script on about 3,000 files. It's OK at first but gets much slower as it continues. Any ideas? Thanks!