cs202083 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks:
I am trying to use a Perl script to simulate a browser.
I am using LWP with HTTP::Request and HTTP::Response,
and now I need to remove the HTML tags from the response's content.
One way I have tried is to use lynx's -dump switch,
like this:
...
open(FILE, ">tmp.html") or die "Cannot write tmp.html: $!";
print FILE $response->content;
close(FILE);    # flush to disk before lynx reads the file
$text = `lynx -dump tmp.html`;
print $text;
...
But this method needs a temp file.
My question is:
Is there (there certainly is)
any other way to get rid of the tags?
I mean, if I don't want to use a temp file, or if
I don't want to use lynx.

Replies are listed 'Best First'.
Re: how to remove html tags
by kutsu (Priest) on Mar 16, 2004 at 18:47 UTC
Re: how to remove html tags
by saintmike (Vicar) on Mar 16, 2004 at 18:57 UTC
    HTML::FormatText does a nice job if the HTML is not too complicated and you'd like some plain text formatting:

    use strict;
    use HTML::FormatText;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new();
    $tree->parse("<H1>hello</H1>");
    $tree->eof();    # tell the parser the input is complete
    my $formatter = HTML::FormatText->new();
    print $formatter->format($tree);
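
    To connect this to the original question, here is a minimal sketch that formats an HTML string held in memory, so neither a temp file nor lynx is involved. The helper name html_to_text and the margin values are assumptions made for illustration; in the asker's script the argument would presumably be $response->content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# html_to_text() is a hypothetical helper name, not part of either module.
sub html_to_text {
    my ($html) = @_;

    # Build the parse tree directly from the string; no file is written.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # The margin values are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}

print html_to_text('<h1>hello</h1><p>Some <b>bold</b> text.</p>');
```

    Because the formatter lays the text out (headings, paragraphs, margins), the result is closer to what lynx -dump prints than a plain tag strip would be.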
Re: how to remove html tags
by amw1 (Friar) on Mar 16, 2004 at 18:43 UTC
    I haven't used it at all, but you may want to look at HTML-Strip on CPAN. It looks like it will do what you want.
Re: how to remove html tags
by davido (Cardinal) on Mar 17, 2004 at 07:15 UTC
    I haven't seen anyone suggest this one yet, but it seems the logical solution: HTML::Strip. From the POD for that module, you'll see this simple example:

    use HTML::Strip;

    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

    It couldn't be easier when you use the right tool for the job.
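
    As a rough sketch of how this could replace the temp file and lynx steps from the question: the helper name strip_tags and the sample HTML below are made up for illustration, and in the real script the string would presumably come from $response->content.

```perl
use strict;
use warnings;
use HTML::Strip;

# strip_tags() is a hypothetical helper name, not part of HTML::Strip.
sub strip_tags {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);    # works on the in-memory string
    $hs->eof;                        # reset the parser state
    return $text;
}

# Sample input standing in for $response->content.
my $html = '<html><body><h1>hello</h1><p>Some <b>bold</b> text.</p></body></html>';
print strip_tags($html), "\n";
```

    Unlike lynx -dump, this only removes the markup; it does not try to lay the text out.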


    Dave

Re: how to remove html tags
by cormac (Acolyte) on Mar 16, 2004 at 20:21 UTC