Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have been able to fetch the pages I want with LWP::Simple, and now I want to extract the text of the page from the HTML tags. But HTML::Tree doesn't give plain text, and a regex that removes < > tags sometimes removes text as well. Thanks for any advice.
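The regex problem described above can be reproduced with a small sketch. The helper name `strip_tags_naive` and the two sample strings are made up for illustration; the pattern is only a guess at the kind of regex the poster tried.

```perl
use strict;
use warnings;

# Naive tag stripper (hypothetical helper): deletes everything
# from a '<' up to the next '>'.
sub strip_tags_naive {
    my ($html) = @_;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return $text;
}

# A '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the output.
print strip_tags_naive('<a title="a > b">link</a>'), "\n";

# A bare '<' ... '>' pair in ordinary text is treated as a tag,
# so real text between them is deleted.
print strip_tags_naive('<p>if 1 < 2 and 3 > 2 then ok</p>'), "\n";
```

Both cases suggest why a real HTML parser tends to be a safer choice than a pattern match for this job.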

Replies are listed 'Best First'.
Re: how do I extract text from web site
by pzbagel (Chaplain) on Jan 07, 2004 at 00:44 UTC
Re: how do I extract text from web site
by Roger (Parson) on Jan 07, 2004 at 01:00 UTC
    I always use HTML::Strip.

    use strict;
    use warnings;
    use HTML::Strip;

    my $html = .....;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $html );
    $hs->eof;
    print $clean_text;
Re: how do I extract text from web site
by dominix (Deacon) on Jan 07, 2004 at 00:40 UTC
    To simply get the text, you can use lynx this way:
     lynx -dump -nolist URL > mytextfile
    but I would do it like this (à la Tom Christiansen):
    perl -MHTML::Parse -MLWP::Simple -MHTML::FormatText -e 'print HTML::FormatText->new->format(parse_html(get($ARGV[0])))' http://perlmonks.org >perlmonks.txt
    --
    dominix
Re: how do I extract text from web site
by borisz (Canon) on Jan 07, 2004 at 01:08 UTC
    or try this:
    perl -MHTML::TreeBuilder -MLWP::Simple -e 'print HTML::TreeBuilder->new_from_content(get "http://perlmonks.org/")->format'
Re: How do I extract text from web site?
by DigitalKitty (Parson) on Jan 07, 2004 at 15:42 UTC
    Hi.

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTML::TokeParser;

    print 'Enter site: ';
    chomp( my $site = <STDIN> );

    my $ua = new LWP::UserAgent();
    my $request = new HTTP::Request( 'GET' => $site );
    my $response = $ua->request( $request );
    my $data = $response->content();

    my $page = new HTML::TokeParser( \$data );
    while( my $token = $page->get_token() ) {
        my $type = shift @{ $token };
        my $text = shift @{ $token };
        if( $type eq "T" ) {
            print $text;
        }
    }


    The output isn't flawless, since it 'stumbles' over HTML comments and a few entities.

    Hope this helps though,
    -Katie
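
One possible refinement of the token loop above, offered as a sketch rather than a definitive fix: decode entities with HTML::Entities and skip text that the parser flags as raw script/style data. The helper name `html_to_text` and the sample markup are made up for illustration, and the sketch assumes the third element of a "T" token is the raw-data flag, as described in the HTML::TokeParser documentation.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my $out = '';
    while ( my $token = $parser->get_token ) {
        my ( $type, $text, $is_data ) = @{$token};
        next unless $type eq 'T';    # keep text tokens only; tags and comments are skipped
        next if $is_data;            # assumed flag for raw <script>/<style> content
        $out .= decode_entities($text);    # e.g. &amp; becomes &
    }
    return $out;
}

print html_to_text('<p>Fish &amp; chips<!-- hidden --></p><script>var x = 1;</script>'), "\n";
```

To use it with the script above, pass `$data` (the fetched page content) to `html_to_text` instead of looping over the tokens by hand.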