bradcathey has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monasterians:

I am learning to use HTML::Parser to simply strip all the tags from a page and return the plain text. I have found examples of more complicated operations, but nothing quite this simple (yesterday's node got me started). Maybe it is my lack of understanding of how modules are used, or there are particulars of H::P that are eluding me.

The following prints bless( { '_hparser_xs_state' => \138993616 }, 'HTML::Parser' ), which looks like a dereferencing issue, but I'm not sure. Ideas? Thanks!

#!/usr/bin/perl -w
use warnings;
use CGI::Carp qw(fatalsToBrowser);
use HTML::Parser;
use Data::Dumper;

my $p = HTML::Parser->new(api_version => 3);
my $text = $p->parse_file("../pages/about.html") || die print "$!";
print "Content-type: text/html\n\n";
print Dumper ($text);
print $text."\n";

Update: added CPAN tag


—Brad
"The important work of moving the world forward does not wait to be done by perfect men." George Eliot

Re: Using HTML::Parser for simple tag removal
by jeffa (Bishop) on May 25, 2005 at 20:33 UTC

    What does parse_file return? The stripped text? No ... if only it were that easy. Actually, it is that easy with HTML::TokeParser::Simple (just see the first example), but you want to learn this module. I'll give you a hint -- you have to specify a callback subroutine for HTML::Parser so that when it processes a text 'event' it knows what to do with it. If all you want is the text, then this should be enough:

    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub {print shift}, "dtext" ],
    );
    $p->parse_file('somefile.html') || die "could not parse HTML file\n";

    HTML::Parser is not easy. I recommend using HTML::TokeParser or HTML::TokeParser::Simple if you just "want to get it done," otherwise, you have a lot more reading to do. :) Try a Super Search here at the Monastery for "HTML::Parser" and you might find a lot of useful examples.
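
If you want the stripped text in a variable rather than printed as it is parsed, the same text handler can append to a string instead. This is only a minimal sketch: the sample HTML string and the $text variable are made up for illustration, and the ignore_elements call is an optional extra that keeps script and style content out of the result.

```perl
use strict;
use warnings;
use HTML::Parser;

# Accumulates the decoded text of the document (illustrative name).
my $text = '';

my $p = HTML::Parser->new(
    api_version => 3,
    # Called for every text event; "dtext" passes entity-decoded text.
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);

# Skip both the tags and the content of these elements.
$p->ignore_elements(qw(script style));

# Made-up sample input; parse_file('somefile.html') works the same way.
$p->parse('<p>Hello <b>world</b> &amp; friends</p><script>var x = 1;</script>');
$p->eof;    # signal end of document so buffered text is flushed

print $text, "\n";
```

With parse() you have to call eof() yourself when the input is finished; parse_file() takes care of that for you.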

    Update: I just did a quick Super Search on my nodes, perhaps (jeffa) Re: Regexp to ignore HTML tags will be of use to you.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      Thanks jeffa, worked first time. I will definitely look at HTML::TokeParser::Simple as an alternative.

      Since I came here almost 2 years ago, the mantra I've heard is "use modules," but, as you hinted, I'm finding some modules much easier to use than others, and the docs are usually not written for graphic designers ;-)


      —Brad
      "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re: Using HTML::Parser for simple tag removal
by brian_d_foy (Abbot) on May 25, 2005 at 21:36 UTC

    Are you trying to do something like HTML::Strip?

    When I have to subclass HTML::Parser, I usually go back to one of the simple subclasses such as HTML::LinkExtor. You didn't define any handlers in your code.

    The output you see is the parser object. You have to write some code that tells the module what to do. The parse_file() method just tells the parser where to get the input, not what to do with it.

    --
    brian d foy <brian@stonehenge.com>
Re: Using HTML::Parser for simple tag removal
by davidrw (Prior) on May 25, 2005 at 20:29 UTC
    The parse_file method (see the docs for HTML::Parser) returns a reference to the parser object. So $text is an HTML::Parser object, as you suspected. I haven't used H::P before, so I don't know how to print the content you want. Glancing at the docs, you may need to set up a handler with $p->handler() that just prints as it goes.

    Update: Look in the EXAMPLES section of the HTML::Parser man page for the example about printing the text that's inside the <title> element.
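
A rough sketch along the lines of that documentation example, adapted rather than copied: the sample HTML, the $title variable and the start_handler name are assumptions made for illustration. A start handler waits for the title element, then installs text and end handlers that collect the text and stop parsing at the closing tag.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collects the title text (illustrative name).
my $title = '';

sub start_handler {
    my ($tagname, $parser) = @_;
    return unless $tagname eq 'title';

    # Inside <title>: start collecting text events.
    $parser->handler( text => sub { $title .= shift }, 'dtext' );

    # Stop parsing as soon as </title> is seen.
    $parser->handler(
        end => sub {
            my ($end_tag, $self) = @_;
            $self->eof if $end_tag eq 'title';
        },
        'tagname,self',
    );
}

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, 'tagname,self' );

# Made-up sample document.
$p->parse('<html><head><title>About Us</title></head><body><p>Body text</p></body></html>');
$p->eof;

print "$title\n";
```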
Re: Using HTML::Parser for simple tag removal
by suaveant (Parson) on May 25, 2005 at 21:25 UTC
    Maybe HTML::Scrubber would help, though I'm not sure. I use it to allow only certain tags.

                    - Ant
                    - Some of my best work - (1 2 3)

Re: Using HTML::Parser for simple tag removal
by tcf03 (Deacon) on May 25, 2005 at 20:16 UTC
    #!/usr/bin/perl -w
    is the same thing as
    use warnings;
    Maybe you typo-ed, but I think what you want is use strict;
    You may also want use CGI qw/:standard/;
    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson

      That's not true. use warnings is block-scoped (or file-scoped if not in a block), while -w affects modules too.

      >type script.pl
      use warnings;
      use Bla;
      Bla->test();

      >type bla.pm
      package Bla;

      sub test
      {
         print(undef);
      }

      1;

      >perl script.pl

      >perl -w script.pl
      Use of uninitialized value in print at Bla.pm line 5.

      -w is actually equivalent to BEGIN { $^W = 1 } at the top of your script.

      What you should have said is that use warnings is redundant when using -w.
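
The scoping difference can also be seen in a single file. The following is only a sketch using core Perl; the sub names and the count_warnings helper are made up for illustration. Setting $^W by hand stands in for running with or without -w.

```perl
use strict;

# Collect warnings instead of printing them to STDERR.
my @caught;
$SIG{__WARN__} = sub { push @caught, $_[0] };

# No warnings pragma is in scope here, so this sub behaves like code in a
# module that never says "use warnings": only $^W (set by -w) controls it.
sub legacy_concat {
    my $undef;
    return "value: " . $undef;
}

# The pragma is lexically scoped to this block and does not depend on $^W.
sub lexical_concat {
    use warnings;
    my $undef;
    return "value: " . $undef;
}

# Run $code with $^W forced to $w_flag and return how many warnings fired.
sub count_warnings {
    my ($code, $w_flag) = @_;
    @caught = ();
    local $^W = $w_flag;
    $code->();
    return scalar @caught;
}

my $legacy_off  = count_warnings(\&legacy_concat,  0);
my $legacy_on   = count_warnings(\&legacy_concat,  1);
my $lexical_off = count_warnings(\&lexical_concat, 0);

print "no pragma, \$^W = 0:    $legacy_off warning(s)\n";
print "no pragma, \$^W = 1:    $legacy_on warning(s)\n";
print "use warnings, \$^W = 0: $lexical_off warning(s)\n";
```

The sub with no pragma in scope should warn only when $^W is set, while the sub compiled under use warnings should warn either way.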

      #!/usr/bin/perl -w
      is the same thing as
      use warnings;
      Not exactly. If you are using an old version of perl such as 5.005, you need -w rather than "use warnings;", because the warnings pragma only arrived in perl 5.6.

      You should "use strict;"

      -------------------------------------
      Nothing is too wonderful to be true
      -- Michael Faraday

      Not sure either one of those makes a difference. And use warnings; is not a typo. And I was under the impression that -w was not the same in that it was a lot more broad in scope. Anyway.... And didn't know HTML::Parser needed CGI.


      —Brad
      "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re: Using HTML::Parser for simple tag removal
by rlucas (Scribe) on May 26, 2005 at 19:23 UTC
    Is this an attempt at HTML to text conversion? If so, this is what I was working on yesterday. Here's my code for solving that:

    use HTML::TreeBuilder;
    use HTML::FormatText;
    use Encode;

    sub HTML_to_text {
        my $content = shift;
        my $html = HTML::TreeBuilder->new;
        $content = '<body>' . $content . '</body>' unless $content =~ m/<body[^>]*>/i;
        # this is necessary, otherwise UTF8 chars get hamburgered
        $html->parse( decode("utf8", $content) );
        my $formatter = HTML::FormatText->new;
        # trim is a selective trimmer that preserves some kinds of whitespace;
        # delete this if you don't need it.
        my $out = _trim($formatter->format($html));
        return $out;
    }

    (The ridiculous number of lines and named variables are due to a step-by-step debugging inspection where I was trying to solve the utf8 problem -- essentially, make sure you have the latest HTML::Parser installed if you anticipate utf8 characters, and you may need to have them "marked" as such by going through the decode() function.)

    If you only need to remove *some* tags, try HTML::TagFilter.

      HTML::FromText is pretty good if you want to turn a chunk of HTML into *formatted* text (rather than just stripping out the tags). It does lists and so on as well.