Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to extract just text from the viewable page without it scraping the source code? Or does LWP::Simple use the source code to get the text?

All I'm really wondering is whether I can take the text on the screen as a whole, without scanning the source code.

Re: LWP::Simple question
by davido (Cardinal) on Apr 20, 2005 at 20:42 UTC

    LWP::Simple just grabs whatever it gets back from its HTTP request to the address you give it; i.e., raw HTML.

    HTML::Strip will strip away the HTML and leave you with fairly plain old text. But it has its limitations, and by default introduces a fair amount of whitespace into the resulting output.
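
    As a rough illustration of the HTML::Strip approach mentioned above, here is a minimal sketch. The sample HTML and the html_to_text helper are made up for the example; in practice the input would be whatever LWP::Simple's get() returned. The whitespace cleanup at the end is just one way to deal with the extra spaces, not something the module requires.

```perl
use strict;
use warnings;
use HTML::Strip;

# Hypothetical sample page; in practice this would be the string
# returned by LWP::Simple's get().
my $html = <<'HTML';
<html>
  <head><title>Contact</title></head>
  <body>
    <h1>Contact us</h1>
    <p>Write to the <b>webmaster</b> for help.</p>
  </body>
</html>
HTML

# Hypothetical helper: strip the markup, then tidy the whitespace.
sub html_to_text {
    my ($raw) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($raw);
    $hs->eof;                 # reset parser state before any further parse() call

    # HTML::Strip leaves whitespace where the tags were, so tidy the result.
    $text =~ s/\s+/ /g;       # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;  # trim both ends
    return $text;
}

print html_to_text($html), "\n";
```

    This only removes markup from the source that was downloaded; anything a browser would generate on screen by running JavaScript is not reproduced.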


    Dave

Re: LWP::Simple question
by johnnywang (Priest) on Apr 20, 2005 at 20:36 UTC
    LWP::Simple and the other LWP modules only download the source; they don't do any parsing. You can use other modules in the HTML namespace for parsing, but getting only the viewable text doesn't seem like an easy problem.
Re: LWP::Simple question
by Anonymous Monk on Apr 20, 2005 at 20:48 UTC
    There is a JS source code jumbler that jumbles email addresses in the source, but on screen they show up as normal text. I wanted to see how I could take just the text and therefore prove that the JS encoder was useless.

    I was thinking it would be easy but I guess I was wrong.

      Why not scrape it and find out? Javascript should be between <script> and </script> tags, shouldn't it? That should be fairly easy to do a simple regex for and pull what you're after.

      Something like:

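      use strict;
      use warnings;
      use LWP::Simple;

      # Fetch the page (placeholder URL) and split it into lines.
      my $html = get('http://www.myurl.com');
      die "Could not fetch page\n" unless defined $html;
      my @page = split /\n/, $html;

      # Advance to the first line containing an opening script tag.
      my $index = 0;
      $index++ while $index < @page && $page[$index] !~ /<script/i;

      # Print script lines up to and including the closing script tag.
      while ($index < @page) {
          print "$page[$index]\n";
          last if $page[$index] =~ m!</script!i;
          $index++;
      }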

      But I have to wonder what is preventing you from just looking at the source of the page in a text editor? :)

      Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: LWP::Simple question
by salva (Canon) on Apr 21, 2005 at 08:54 UTC
    All I'm really wondering is whether I can take the text on the screen as a whole, without scanning the source code.

    On Windows, you could use Internet Explorer via COM to open the page and then save it as text, or select and copy everything to the clipboard.

    On Unix, you could use a text browser (links, w3m, lynx, etc.) to download the page and save it as text.
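
    The text-browser route could be driven from Perl along these lines. This is only a sketch: the helper names dump_command and rendered_text are invented for the example, it assumes the chosen browser is installed and on the PATH, and it relies on the -dump option that lynx, w3m and links provide for printing the rendered page to standard output.

```perl
use strict;
use warnings;

# Build the argument list for a text browser's dump mode.
# lynx, w3m and links all accept -dump to print the rendered page to stdout.
sub dump_command {
    my ($browser, $url) = @_;
    my %known = map { $_ => 1 } qw(lynx w3m links);
    die "Unsupported browser: $browser\n" unless $known{$browser};
    return ($browser, '-dump', $url);
}

# Run the browser and return the rendered text as one string.
sub rendered_text {
    my ($browser, $url) = @_;
    my @cmd = dump_command($browser, $url);

    # List-form pipe open avoids passing the URL through a shell.
    open my $fh, '-|', @cmd or die "Cannot run $cmd[0]: $!\n";
    local $/;                    # slurp mode: read all output at once
    my $text = <$fh>;
    close $fh or warn "$cmd[0] exited with status $?\n";
    return $text;
}

# Only fetch when a URL is given on the command line:
#   perl dump_page.pl http://www.myurl.com w3m
if (@ARGV) {
    my ($url, $browser) = @ARGV;
    print rendered_text($browser || 'lynx', $url);
}
```

    Keep in mind that these text browsers generally do not run JavaScript, so text that only appears after a script executes may still be missing from the dump.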