Save parsed text to file

pritchard12 has asked for the wisdom of the Perl Monks concerning the following question:

I am working with the HTML::Parser example that extracts plain text from HTML. After parsing the HTML, what is the best way to strip out the special characters and extra blank lines, then save the plain text to a file? Thanks for your help.


#!/usr/bin/perl -w

# Extract all plain text from an HTML file

use strict;
use HTML::Parser 3.00 ();

my %inside;

sub tag
{
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # not for all tags
}

sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}

HTML::Parser->new(api_version => 3,
          handlers    => [start => [\&tag, "tagname, '+1'"],
                  end   => [\&tag, "tagname, '-1'"],
                  text  => [\&text, "dtext"],
                 ],
          marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";

Re: Save parsed text to file by poolpi (Hermit) on Jul 17, 2009 at 06:46 UTC
See HTML::TokeParser::get_trimmed_text.

From the doc:

    Any entities will be converted to their corresponding character... ( HTML::Entities )
    ...Leading and trailing white space is removed.

hth,
PooLpi

'Ebry haffa hoe hab im tik a bush'. Jamaican proverb
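
The get_trimmed_text suggestion could be sketched roughly as follows. This is only one way to do it, not a tested drop-in replacement for the original script. The names extract_text and save_text are made up for illustration, and the input file is assumed to be UTF-8. The loop skips script and style content as the original does, and lets get_trimmed_text decode entities and collapse white space for each run of text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# extract_text() and save_text() are hypothetical helper names, not part of
# any module. $source may be a file name, a file handle, or a reference to a
# string of HTML, as accepted by HTML::TokeParser->new.
sub extract_text {
    my ($source) = @_;
    my $p = HTML::TokeParser->new($source)
        or die "Can't open HTML source: $!\n";

    my @chunks;
    while (my $token = $p->get_token) {
        my $type = $token->[0];

        # Skip everything inside <script> and <style>, like the original.
        if ($type eq 'S' && $token->[1] =~ /^(?:script|style)$/) {
            $p->get_tag("/$token->[1]");
            next;
        }
        next unless $type eq 'T';

        # Put the text token back so that get_trimmed_text can decode
        # entities and collapse runs of white space.
        $p->unget_token($token);
        my $text = $p->get_trimmed_text;
        push @chunks, $text if length $text;
    }

    # One line per run of text; runs that were only white space are dropped.
    return join("\n", @chunks) . "\n";
}

sub save_text {
    my ($in_file, $out_file) = @_;

    # Assumption: the input is UTF-8; change the layer if it is not.
    open my $in, '<:encoding(UTF-8)', $in_file
        or die "Can't read $in_file: $!\n";
    my $text = extract_text($in);
    close $in;

    open my $out, '>:encoding(UTF-8)', $out_file
        or die "Can't write $out_file: $!\n";
    print {$out} $text;
    close $out or die "Can't close $out_file: $!\n";
    return;
}

# Usage: perl html2txt.pl input.html output.txt  (script name is arbitrary)
save_text(@ARGV) if @ARGV == 2;
```

Note that get_trimmed_text by default also turns img and applet tags into their alt text, so a page with images may need the parser's textify setting adjusted.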