Using HTML::Parser for simple tag removal

bradcathey has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Using HTML::Parser for simple tag removal by jeffa (Bishop) on May 25, 2005 at 20:33 UTC
What does `parse_file` return? The stripped text? No ... if only it were that easy. Actually, it is that easy with HTML::TokeParser::Simple (just see the first example), but you want to learn this module. I'll give you a hint -- you have to specify a callback subroutine for HTML::Parser so that when it processes a text 'event' it knows what to do with it. If all you want is the text, then this should be enough: `my $p = HTML::Parser->new( api_version => 3, text_h => [ sub {print shift}, "dtext" ], ); $p->parse_file('somefile.html') || die "could not parse HTML file\n";` HTML::Parser is not easy. I recommend using HTML::TokeParser or HTML::TokeParser::Simple if you just "want to get it done," otherwise, you have a lot more reading to do. :) Try a Super Search here at the Monastery for "HTML::Parser" and you might find a lot of useful examples. Update: I just did a quick Super Search on my nodes; perhaps (jeffa) Re: Regexp to ignore HTML tags will be of use to you. jeffa
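For readers who want the stripped text in a variable rather than printed, the callback idea above can be sketched roughly as follows. This is a minimal sketch, not jeffa's code: the `strip_tags` name and the sample markup are made up for illustration, and it assumes HTML::Parser 3.x is installed.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text content of an HTML string.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the callback entity-decoded text; append each chunk.
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );

    # Without this, the contents of <script> and <style> arrive as text events.
    $p->ignore_elements(qw(script style));

    $p->parse($html);   # parse() takes a string; parse_file() takes a file name or handle
    $p->eof;            # flush any buffered text
    return $text;
}

print strip_tags('<p>Hello <b>world</b> &amp; friends</p>'), "\n";
```

Collecting into a lexical via a closure keeps the handler short; a larger program might prefer subclassing or HTML::TokeParser, as suggested above.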
Re^2: Using HTML::Parser for simple tag removal by bradcathey (Prior) on May 25, 2005 at 20:53 UTC
Thanks jeffa, worked first time. I will definitely look at HTML::TokeParser::Simple as an alternative. Since I came here almost 2 years ago, the mantra I've heard is "use modules," but, as you hinted, I'm finding some modules much easier to use than others, and the docs are usually not written for graphic designers ;-) -- Brad "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re: Using HTML::Parser for simple tag removal by brian_d_foy (Abbot) on May 25, 2005 at 21:36 UTC
Are you trying to do something like HTML::Strip? When I have to subclass HTML::Parser, I usually go back to one of the simple subclasses such as HTML::LinkExtor. You didn't define any handlers in your code. The output you see is the parser object. You have to write some code that tells the module what to do. The parse_file() method just tells the parser where to get the input, not what to do with it. -- brian d foy <brian@stonehenge.com>
Re: Using HTML::Parser for simple tag removal by davidrw (Prior) on May 25, 2005 at 20:29 UTC
The `parse_file` method (see the docs for HTML::Parser) returns a reference to the parser object, so `$text` is an HTML::Parser object, as you suspected. I haven't used H::P before, so I don't know how to print the content you want. Glancing at the docs, you may need to register a handler with `$p->handler()` that just `print`s as it goes. Update: Look in the EXAMPLES section of the HTML::Parser man page for the example about printing the text that's inside the <title> element.
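A rough illustration of registering handlers with `$p->handler()` after construction, here used to pull out the text inside `<title>`. This is not the example from the HTML::Parser man page, just a simpler flag-based sketch; the `title_text` name and the sample document are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text found inside <title>...</title>.
sub title_text {
    my ($html) = @_;
    my $in_title = 0;
    my $title    = '';

    my $p = HTML::Parser->new( api_version => 3 );

    # handler( event => callback, argspec ): "tagname" is the lower-cased element name.
    $p->handler( start => sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' );
    $p->handler( end   => sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' );

    # Only keep text seen while the flag is set.
    $p->handler( text  => sub { $title .= $_[0] if $in_title }, 'dtext' );

    $p->parse($html);
    $p->eof;
    return $title;
}

print title_text('<html><head><title>My &amp; Page</title></head><body><p>Body</p></body></html>'), "\n";
```

The same three handlers, minus the flag, would print every text chunk as it goes, which is the behaviour described above.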
Re: Using HTML::Parser for simple tag removal by suaveant (Parson) on May 25, 2005 at 21:25 UTC
Maybe this would help, though I'm not sure: HTML::Scrubber. I use it to allow only certain tags. - Ant
Re: Using HTML::Parser for simple tag removal by tcf03 (Deacon) on May 25, 2005 at 20:16 UTC
`#!/usr/bin/perl -w` is the same thing as `use warnings;`. Maybe you made a typo, but I think what you want is `use strict;`. You may also want `use CGI qw/:standard/;`. Ted -- "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved." --Ralph Waldo Emerson
Re^2: Using HTML::Parser for simple tag removal by ikegami (Patriarch) on May 25, 2005 at 21:16 UTC
That's not true. `use warnings` is block-scoped (or file-scoped if not in a block), while `-w` will affect modules too. Take a `script.pl` containing `use warnings; use Bla; Bla->test();` and a `bla.pm` containing `package Bla; sub test { print(undef); } 1;`. Running `perl script.pl` prints no warning, but `perl -w script.pl` prints `Use of uninitialized value in print at Bla.pm line 5.` `-w` is actually equivalent to `BEGIN { $^W = 1 }` at the top of your script. What you should have said is that `use warnings` is redundant when using `-w`.
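The scoping difference can also be seen in a single file. The sketch below is an illustration only (the sub names are made up): it counts warnings through a `__WARN__` handler and assumes the script is run as plain `perl`, without `-w`.

```perl
use strict;   # note: strict does not turn warnings on

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Compiled with no warnings pragma in scope, so only $^W (the -w flag) governs it.
sub legacy_concat { my $x; return "value: " . $x }

{
    use warnings;   # lexical: applies only to code compiled inside this block
    sub checked_concat { my $x; return "value: " . $x }
}

legacy_concat();                       # no pragma, $^W off
my $after_legacy = @warnings;

checked_concat();                      # under "use warnings"
my $after_checked = @warnings;

{ local $^W = 1; legacy_concat(); }    # what -w does, limited to this block at run time
my $after_flag = @warnings;

print "warnings so far: $after_legacy, $after_checked, $after_flag\n";
```

Run without `-w`, the three counts should come out as 0, 1 and 2: the pragma only reaches code compiled inside its block, while `$^W` reaches any code that is not under a `warnings` pragma, including code in other modules.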
Re^2: Using HTML::Parser for simple tag removal by freddo411 (Chaplain) on May 25, 2005 at 20:30 UTC
Quoting tcf03: "`#!/usr/bin/perl -w` is the same thing as `use warnings;`" Not exactly. If you are using an old version of perl such as 5.005, which predates the `warnings` pragma (added in 5.6), you need to use `-w` and not `use warnings;`. You should `use strict;`. Nothing is too wonderful to be true -- Michael Faraday
Re^2: Using HTML::Parser for simple tag removal by bradcathey (Prior) on May 25, 2005 at 20:30 UTC
Not sure either one of those makes a difference. `use warnings;` is not a typo, and I was under the impression that `-w` was not the same, in that it is a lot broader in scope. Anyway.... I also didn't know HTML::Parser needed CGI. -- Brad "The important work of moving the world forward does not wait to be done by perfect men." George Eliot
Re: Using HTML::Parser for simple tag removal by rlucas (Scribe) on May 26, 2005 at 19:23 UTC
Is this an attempt at HTML to text conversion? If so, this is what I was working on yesterday. Here's my code for solving that: `use HTML::TreeBuilder; use HTML::FormatText; use Encode; sub HTML_to_text { my $content = shift; my $html = HTML::TreeBuilder->new; $content = '<body>' . $content . '</body>' unless $content =~ m/<body[^>]*>/i; $html->parse( decode("utf8", $content) ); $html->eof; my $formatter = HTML::FormatText->new; my $out = _trim($formatter->format($html)); return $out; }` The `decode` call is necessary, otherwise UTF-8 characters get hamburgered. `_trim` is a selective trimmer (not shown) that preserves some kinds of whitespace; delete that call if you don't need it. (The ridiculous number of lines and named variables are due to a step-by-step debugging inspection where I was trying to solve the utf8 problem. Essentially, make sure you have the latest HTML::Parser installed if you anticipate utf8 characters, and you may need to have them "marked" as such by going through the decode() function.) If you only need to remove some tags, try HTML::TagFilter.
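The post does not show `_trim`, so anyone trying the code needs to supply one or drop the call. The following is only a guess at what a "selective trimmer" might do, not rlucas's actual helper: it removes leading blank lines and trailing whitespace but keeps indentation and single blank lines between paragraphs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the _trim helper, which the post leaves out.
sub _trim {
    my ($text) = @_;
    return '' unless defined $text;

    $text =~ s/\A(?:[ \t]*\n)+//;   # drop leading blank lines, keep the first line's indentation
    $text =~ s/\s+\z//;             # drop all trailing whitespace
    $text =~ s/[ \t]+$//mg;         # drop trailing spaces and tabs on each line
    $text =~ s/\n{3,}/\n\n/g;       # collapse runs of blank lines to a single blank line
    return $text;
}

print _trim("\n\n   Hello  \n\n\n\nWorld\n\n"), "\n";
```

Keeping the leading indentation matters here because HTML::FormatText indents its output to a left margin, and stripping it from the first line only would leave the paragraphs misaligned.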
HTML::FromText by marnanel (Beadle) on May 26, 2005 at 20:57 UTC
HTML::FromText is pretty good if you want to turn a chunk of HTML into formatted text (rather than just stripping out the tags). It does lists and so on as well.