pritchard12 has asked for the wisdom of the Perl Monks concerning the following question:

I am working with the HTML::Parser example that extracts plain text from HTML. After parsing the HTML, what is the best way to strip out the special characters and extra line spacing, and then save the plain text to a file? Thanks for your help.
#!/usr/bin/perl -w

# Extract all plain text from an HTML file

use strict;
use HTML::Parser 3.00 ();

my %inside;

sub tag
{
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";  # not for all tags
}

sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}

HTML::Parser->new(api_version => 3,
                  handlers    => [start => [\&tag, "tagname, '+1'"],
                                  end   => [\&tag, "tagname, '-1'"],
                                  text  => [\&text, "dtext"],
                                 ],
                  marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";

Re: Save parsed text to file
by poolpi (Hermit) on Jul 17, 2009 at 06:46 UTC

    See HTML::TokeParser::get_trimmed_text

    From the doc:

    Any entities will be converted to their corresponding character...
    ( HTML::Entities )
    ...Leading and trailing white space is removed.
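
    A rough sketch of how that could look, in case it helps. The helper name html_to_text and the two command-line arguments are made up for illustration, and the UTF-8 encoding on both files is an assumption; adjust it to match your input. It pushes each text token back and lets get_trimmed_text decode the entities and collapse the whitespace, skips script and style content, and writes one chunk of text per line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# html_to_text() is a hypothetical helper, not part of HTML::TokeParser.
# It takes an HTML string and returns the visible text, one chunk per line.
sub html_to_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!\n";

    my @chunks;
    while (my $token = $p->get_token) {
        my ($type, $name) = @$token;

        # Skip everything from <script>/<style> up to the matching end tag.
        if ($type eq 'S' && ($name eq 'script' || $name eq 'style')) {
            $p->get_tag("/$name");
            next;
        }
        next unless $type eq 'T';

        # Push the text token back so get_trimmed_text can consume it.
        # It decodes entities, trims both ends and collapses runs of
        # white space into single spaces, stopping at the next tag.
        $p->unget_token($token);
        my $text = $p->get_trimmed_text;
        push @chunks, $text if length $text;
    }
    return join("\n", @chunks) . "\n";
}

# Usage (both paths are placeholders): perl script.pl input.html output.txt
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;

    # Assumption: the input is UTF-8 encoded.
    open my $ifh, '<:encoding(UTF-8)', $in or die "Can't read $in: $!\n";
    my $html = do { local $/; <$ifh> };
    close $ifh;

    open my $ofh, '>:encoding(UTF-8)', $out or die "Can't write $out: $!\n";
    print {$ofh} html_to_text($html);
    close $ofh or die "Can't close $out: $!\n";
}
```

    If by "special characters" you mean more than entities (control characters, non-ASCII and so on), you would still need your own substitution on each chunk before it is pushed onto the list.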


    hth,
    PooLpi

    'Ebry haffa hoe hab im tik a bush'. Jamaican proverb