LWP::Simple question

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: LWP::Simple question by davido (Cardinal) on Apr 20, 2005 at 20:42 UTC
LWP::Simple just grabs whatever its HTTP request returns for the address you give it, i.e., raw HTML. HTML::Strip will strip away the HTML and leave you with fairly plain text. But it has its limitations, and by default it introduces a fair amount of whitespace into the resulting output.

Dave
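For illustration, here is a minimal sketch of the HTML::Strip approach described above. It assumes HTML::Strip is installed from CPAN. The sample HTML string and the helper name `html_to_text` are made up for the example; in practice the string would be whatever `get()` from LWP::Simple returned. The whitespace cleanup at the end is just one way to deal with the extra spacing, not something the module does for you.

```perl
use strict;
use warnings;
use HTML::Strip;

# Hypothetical helper: reduce an HTML string to plain text,
# then collapse the extra whitespace HTML::Strip leaves behind.
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;                    # reset the parser state for reuse
    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

# Stand-in for the string that LWP::Simple's get() would return.
my $html = '<html><body><h1>Contact</h1>'
         . '<p>Mail <b>the admin</b> today.</p></body></html>';

print html_to_text($html), "\n";
```

This only removes markup. It does not run any JavaScript, so text that a script writes into the page at load time will not show up in the result.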
Re: LWP::Simple question by johnnywang (Priest) on Apr 20, 2005 at 20:36 UTC
LWP::Simple and the other LWP modules only download the source; they don't do any parsing. You can use other modules in the HTML namespace for parsing, but getting only the viewable text doesn't seem like an easy problem.
Re: LWP::Simple question by Anonymous Monk on Apr 20, 2005 at 20:48 UTC
There is a JS source code jumbler that jumbles email addresses in the page source, but on screen they show up as normal text. I wanted to see how I could take just the text and therefore prove that the JS encoder was useless. I was thinking it would be easy, but I guess I was wrong.
Re^2: LWP::Simple question by Popcorn Dave (Abbot) on Apr 21, 2005 at 03:35 UTC
Why not scrape it and find out? Javascript should be between <script> and </script> tags, shouldn't it? That should be fairly easy to do a simple regex for and pull what you're after. Something like:

    use strict;
    use warnings;
    use LWP::Simple;

    my $html = get('http://www.myurl.com');
    die "Could not fetch page\n" unless defined $html;
    my @page = split /\n/, $html;

    my $index = 0;
    # advance to the first line containing a script tag
    $index++ while $index < @page && $page[$index] !~ /<script/i;
    # print script lines until the closing script tag
    while ($index < @page && $page[$index] !~ m!</script!i) {
        print "$page[$index]\n";
        $index++;
    }

But I have to wonder what is preventing you from just looking at the source of the page in a text editor? :)

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
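If the page has more than one script block, or a block that opens and closes on the same line, a line-by-line scan gets awkward. A rough alternative is to match against the whole page at once. The sketch below uses a made-up HTML string in place of a real download, and `extract_scripts` is just a name picked for the example. A regex like this is fine for a quick look but is not a real HTML parser.

```perl
use strict;
use warnings;

# Return the body of every <script>...</script> block in an HTML string.
# /s lets . match newlines, /i ignores tag case, /g finds every block.
sub extract_scripts {
    my ($html) = @_;
    my @scripts;
    while ($html =~ m{<script\b[^>]*>(.*?)</script\s*>}gis) {
        push @scripts, $1;
    }
    return @scripts;
}

# Stand-in for the string that LWP::Simple's get() would return.
my $html = <<'HTML';
<html><head>
<script type="text/javascript">var user = "someone";</script>
</head><body>
<script>
document.write(user + "@" + "example.com");
</script>
</body></html>
HTML

my @scripts = extract_scripts($html);
print scalar(@scripts), " script block(s) found\n";
print "---\n$_\n" for @scripts;
```

Scripts loaded through a `src` attribute come back as empty strings here, since their code lives in a separate file that would need its own request.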
Re: LWP::Simple question by salva (Canon) on Apr 21, 2005 at 08:54 UTC
"That's pretty much all I'm wondering is whether I can take the text on the screen as a whole without scanning the source code."

On Windows, you could use Internet Explorer via COM to open the page and then save it as text, or select and copy everything to the clipboard. On Unix, you could use a text browser (links, w3m, lynx, etc.) to download the page and save it as text.
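For the Unix route, the text browsers mentioned above each accept a `-dump` option that prints the rendered page to standard output. A minimal sketch of driving one from Perl follows. It assumes the chosen browser is installed and on the PATH, and the helper names `dump_command` and `rendered_text` are invented for the example.

```perl
use strict;
use warnings;

# Build the command line that asks a text browser to print the rendered page.
sub dump_command {
    my ($browser, $url) = @_;
    my %known = map { $_ => 1 } qw(lynx w3m links);
    die "Unsupported browser: $browser\n" unless $known{$browser};
    return ($browser, '-dump', $url);
}

# Run the browser and return everything it printed.
sub rendered_text {
    my @cmd = dump_command(@_);
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!\n";
    local $/;                       # slurp mode: read all output at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Only fetch anything when a URL is given on the command line.
if (@ARGV) {
    my ($url, $browser) = @ARGV;
    print rendered_text($browser || 'lynx', $url);
}
```

Saved as, say, `pagetext.pl`, it could be run as `perl pagetext.pl http://www.myurl.com w3m`. As far as I know these browsers generally do not execute JavaScript, so text that a script writes into the page may be missing from the dump.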