Tricking website into thinking you're a browser

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

use strict;
use warnings;

use HTML::Parse;
use HTML::FormatText;
use LWP::Simple;

my $url = "http://www.perlmonks.org";
my $html = get($url);

defined $html or die "Can't fetch HTML from: ",$url;

my $ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;

This code doesn't give me the text that I see with my browser. How do I do this? Thanks!


Replies are listed 'Best First'.
Re: Tricking website into thinking you're a browser by linuxer (Curate) on Mar 22, 2009 at 16:20 UTC
If you don't get the text you see with your browser, what do you see instead? I tried that with LWP::Simple, leaving out the HTML::Parse and HTML::FormatText parts, and it looks quite OK. So, what is the result you are getting? If you want to set the User-Agent string, you should use LWP::UserAgent instead of LWP::Simple.
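As a minimal sketch of the LWP::UserAgent suggestion above: the helper name `make_browser_ua` and the agent string are made up for illustration, and any browser-like string could be substituted. The script only touches the network when a URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that sends the given
# User-Agent header instead of the default "libwww-perl/x.y".
sub make_browser_ua {
    my ($agent) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    $ua->agent($agent);
    return $ua;
}

# Illustrative agent string, not taken from the thread.
my $agent_string = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0';
my $ua = make_browser_ua($agent_string);

# Fetch only if a URL was passed as the first argument.
if (@ARGV) {
    my $url      = $ARGV[0];
    my $response = $ua->get($url);
    $response->is_success
        or die "Can't fetch HTML from: $url (", $response->status_line, ")\n";
    print $response->decoded_content;
}
```

Unlike `get` from LWP::Simple, the response object also exposes the status line, which helps tell a fetch failure apart from a formatting problem.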
Re: Tricking website into thinking you're a browser by Anonymous Monk on Mar 22, 2009 at 16:26 UTC
"This code doesn't give me the text that I see with my browser. How do I do this? Thanks!"
Why do you think it would give you what you see in your browser?
Re^2: Tricking website into thinking you're a browser by Anonymous Monk on Mar 22, 2009 at 17:02 UTC
It gives me: `[TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN][TABLE NOT SHOWN] This page brought to you by the kind folks at The Everything Development Company and maintained by Tim Vroom. PerlMonks is a proud member of the The Perl Foundation. Wonderful Web Servers and Bandwidth Generously Provided by pair Networks`
Re^3: Tricking website into thinking you're a browser by linuxer (Curate) on Mar 22, 2009 at 17:28 UTC
Please check the content of `$html`. It should contain the complete HTML fetched by LWP::Simple. Then check what HTML::Parse and HTML::FormatText do to that content. What is the result after `parse_html($html)`?

I haven't used those *::Parser modules very often, but you should probably heed the warning in the documentation of HTML::Parse itself:

"Disclaimer: This module is provided only for backwards compatibility with earlier versions of this library. New code should not use this module, and should really use the HTML::Parser and HTML::TreeBuilder modules directly, instead."

Maybe you should use other modules for extracting the plain text (I assume that is what you want to do). Check out the examples shipped with HTML::Parser. They include a script named `htext`, which does the following job: "# Extract all plain text from an HTML file". You can find it, for example, at http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.60/eg/ (Please note the module versions; they may differ between your system and CPAN.)

Update: fixed minor typo
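One way to follow the disclaimer's advice is sketched below with HTML::TreeBuilder. This is not the `htext` script itself; the helper name `html_to_text` and the sample markup are made up for illustration. The `[TABLE NOT SHOWN]` markers in the output quoted earlier suggest the formatter skips `<table>` elements, so this sketch uses the tree's `as_text` method, which collects text from all elements, table cells included.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text content of an HTML string.
# as_text() concatenates the text nodes of the whole tree, so the
# result has no layout, only the words.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

# Illustrative input with a table, the case that the formatter skipped.
my $sample = '<html><body><table><tr><td>cell text</td></tr></table>'
           . '<p>footer text</p></body></html>';

print html_to_text($sample), "\n";
```

The trade-off is that `as_text` drops all formatting, whereas HTML::FormatText tries to lay out paragraphs and headings, so which one fits depends on whether layout matters.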
Re^3: Tricking website into thinking you're a browser by Anonymous Monk on Mar 22, 2009 at 17:18 UTC
What is `[TABLE NOT SHOWN]`? With ?displaytype=print;node_id=752400, I get what I'd expect.

$ pmvers HTML::Parse HTML::FormatText LWP::Simple
HTML::Parse: 2.71
HTML::FormatText: 2.04
LWP::Simple: 5.810
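The print-view request from the reply above could be built like this, as a small sketch using the URI module. The base URL is the one from the original question and the parameters are those quoted in the reply. Note that URI joins parameters with `&` by default, while the reply wrote them with `;`.

```perl
use strict;
use warnings;
use URI;

# Start from the site URL used in the original question.
my $uri = URI->new('http://www.perlmonks.org/');

# Add the query parameters quoted in the reply.
$uri->query_form(displaytype => 'print', node_id => 752400);

print "$uri\n";
```

The stringified `$uri` can then be passed to `get` from LWP::Simple in place of the plain `$url` in the original code.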
