http://qs1969.pair.com?node_id=752400

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
use strict;
use warnings;
use HTML::Parse;
use HTML::FormatText;
use LWP::Simple;

my $url  = "http://www.perlmonks.org";
my $html = get($url);
defined $html or die "Can't fetch HTML from: ", $url;
my $ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;
This code doesn't give me the text that I see with my browser. How do I do this? Thanks!

Replies are listed 'Best First'.
Re: Tricking website into thinking you're a browser
by linuxer (Curate) on Mar 22, 2009 at 16:20 UTC

    If you don't get the text you see with your browser, what do you see instead?

    I tried it with LWP::Simple alone, leaving out the HTML::Parse and HTML::FormatText parts, and the result looks quite OK.

    So, what is the result you are getting?

    If you want to set the User-Agent string, you should use LWP::UserAgent instead of LWP::Simple.
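
A minimal sketch of that suggestion, assuming LWP::UserAgent is installed. The agent string, the 30-second timeout, and the helper names build_agent and fetch_page are illustrative choices, not something from the original post:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical browser-like identification string; substitute your own.
my $agent_string = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0';

# Build a user agent that sends the chosen User-Agent header.
sub build_agent {
    return LWP::UserAgent->new(
        agent   => $agent_string,
        timeout => 30,              # seconds; an arbitrary choice
    );
}

# Fetch a URL and return the decoded body, or die with the HTTP status line.
sub fetch_page {
    my ($url) = @_;
    my $response = build_agent()->get($url);
    die "Can't fetch HTML from $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Only touch the network when a URL is given on the command line.
print fetch_page( $ARGV[0] ) if @ARGV;
```

Unlike get() from LWP::Simple, the response object also exposes the status line, which helps when a site answers a script differently from a browser.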

Re: Tricking website into thinking you're a browser
by Anonymous Monk on Mar 22, 2009 at 16:26 UTC
    This code doesn't give me the text that I see with my browser. How do I do this? Thanks!
    Why do you think it would give you what you see in your browser?
      It gives me -
      [TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN] This page brought to you by the kind folks at The Everything Development Company and maintained by Tim Vroom. PerlMonks is a proud member of the The Perl Foundation. Wonderful Web Servers and Bandwidth Generously Provided by pair Networks

        Please check the content of $html. It should contain the complete HTML content fetched by LWP::Simple.

        Then check what HTML::Parse and HTML::FormatText do to that content.

        So, what's the result after parse_html($html)?
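
One way to run those checks is to print each stage separately. The sketch below uses a small hypothetical HTML sample in place of the fetched page, so it runs without network access. As far as I can tell, HTML::FormatText does not render tables and emits a placeholder instead, which would explain the output shown above for a page laid out with tables:

```perl
use strict;
use warnings;
use HTML::Parse;
use HTML::FormatText;

# Hypothetical sample standing in for the fetched $html.
my $html = '<html><body><table><tr><td>cell text</td></tr></table>'
         . '<p>footer text</p></body></html>';

# Stage 1: the raw markup as fetched.
print "raw length: ", length($html), "\n";

# Stage 2: the parse tree. dump() writes an indented outline to STDOUT.
my $tree = parse_html($html);
$tree->dump;

# Stage 3: the formatter output.
my $ascii = HTML::FormatText->new->format($tree);
print $ascii;
```

If the cell text appears in the dump but not in the formatter output, the text is lost in the formatting step, not in the fetch.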

        As I haven't used those *::Parser modules very often, I wonder whether you should heed the warning in the documentation of HTML::Parse itself:

        Disclaimer: This module is provided only for backwards compatibility with earlier versions of this library. New code should not use this module, and should really use the HTML::Parser and HTML::TreeBuilder modules directly, instead.
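
Following that advice, a rough sketch with HTML::TreeBuilder used directly might look like this. The helper name page_text and the sample markup are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return all text nodes of an HTML string, table cells included.
sub page_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;   # concatenates text nodes, skips script/style
    $tree->delete;               # release the tree explicitly
    return $text;
}

# Hypothetical sample markup.
my $sample = '<html><body><table><tr><td>cell text</td></tr></table>'
           . '<p>footer text</p></body></html>';
print page_text($sample), "\n";
```

Note that as_text joins text nodes without inserting separators, so words from adjacent elements can run together. It is a quick way to see whether the text is present, not a layout-preserving formatter.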

        Maybe you should use other modules for extracting the plain text information (as I assume that is what you want to do...)

        Check out the examples of HTML::Parser. They provide a script named htext, which does the following job: "# Extract all plain text from an HTML file"

        You can find it, for example, at http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.60/eg/

        (Please note the module versions; they may differ between your system and cpan.)
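
For orientation, here is a short sketch in the spirit of that kind of script, written from scratch and not copied from the htext example. It assumes HTML::Parser with the version 3 API; html_to_text and the sample markup are hypothetical names and data:

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text of an HTML string, skipping <script> and <style> content.
sub html_to_text {
    my ($html) = @_;
    my %inside = ( script => 0, style => 0 );   # nesting depth per tag name
    my $text   = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Count open tags and put a space where each tag stood.
        start_h => [ sub { $inside{ $_[0] }++; $text .= ' ' }, 'tagname' ],
        end_h   => [ sub { $inside{ $_[0] }--; $text .= ' ' }, 'tagname' ],
        text_h  => [
            sub {
                return if $inside{script} > 0 || $inside{style} > 0;
                $text .= $_[0];
            },
            'dtext',    # text with entities decoded
        ],
    );
    $parser->parse($html);
    $parser->eof;

    $text =~ s/\s+/ /g;     # collapse runs of whitespace
    $text =~ s/^ | $//g;    # trim leading and trailing space
    return $text;
}

# Hypothetical sample markup.
my $sample = '<html><head><style>p { color: red }</style></head>'
           . '<body><table><tr><td>cell &amp; text</td></tr></table>'
           . '<p>footer</p></body></html>';
print html_to_text($sample), "\n";
```

Because the handlers see every text event, table cells are included, which is the part HTML::FormatText leaves out.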

        Update: fixed minor typo