Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

What is the best module for extracting the text that you see on a webpage?

Re: Html to text
by Corion (Patriarch) on Mar 22, 2009 at 11:53 UTC

    Web::Scraper

    Now, if you were a bit more forthcoming with your criteria for "best", you might get an answer more tailored to your needs.

      OK thanks, I wish to extract all the words (in Unicode format) as presented to the user, plus the frequency of these words. The words presented to the user from this page include:
      more useful options PerlMonks Html to text by Anonymous Monk Log in Create a new user The Monastery Gates Super Search Seekers of Perl Wisdom Meditations PerlMonks Discussion Snippets Obfuscation Reviews Cool Uses For Perl Perl News Tutorials Code Poetry Recent Threads Newest Nodes Donate What's New on Mar at perlquestion print replies xml Need Help Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question What is the best module for extracting the text that you see on a webpage Comment on Html to text Html to text Corion Archbishop on ...

        As you seem to have retrieved the page already, maybe something like HTML::TokeParser, or still Web::Scraper, is the tool to use. For the word frequency and stopwords, you will have to write the code yourself. Try these and come back once you encounter problems.
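
        A minimal sketch of that approach, assuming the page is already in a string: HTML::TokeParser collects the text tokens while skipping <script> and <style> content, and a plain hash counts the words. The stopword list, the sub names and the sample HTML are made up for illustration. For real pages, pass in a decoded character string so that \w matches Unicode letters.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical stopword list; replace it with one suited to your text.
my %STOPWORDS = map { $_ => 1 } qw(a an and by for of on the to);

# Collect the text tokens of an HTML string, skipping <script> and <style>.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't create parser";
    my ($text, $hidden) = ('', 0);
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' || $type eq 'E') {
            # Only script/style tags change whether text is visible.
            next unless $token->[1] =~ /^(?:script|style)$/;
            $hidden = $type eq 'S' ? 1 : 0;
        }
        elsif ($type eq 'T' && !$hidden) {
            # get_token returns raw text, so decode entities such as &amp;
            $text .= decode_entities($token->[1]) . ' ';
        }
    }
    return $text;
}

# Count case-folded words, ignoring stopwords.
sub word_freq {
    my ($text) = @_;
    my %count;
    while ($text =~ /(\w+(?:'\w+)*)/g) {
        my $word = lc $1;
        next if $STOPWORDS{$word};
        $count{$word}++;
    }
    return \%count;
}

# Sample input, invented for this sketch.
my $html = <<'HTML';
<html><head><title>Html to text</title><style>p { color: red }</style></head>
<body><p>The Monastery Gates &amp; the Perl Monks</p>
<script>var hidden = "perl perl perl";</script>
<p>Seekers of Perl Wisdom</p></body></html>
HTML

# Print words by descending frequency, then alphabetically.
my $freq = word_freq(visible_text($html));
for my $word (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    printf "%-10s %d\n", $word, $freq->{$word};
}
```

        The text inside the script block never reaches the counter, which is the main difference from simply stripping tags with a regex.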

        Note, though, that PerlMonks is not a site that should be scraped. If you have a specific need for the content of this site, contact the gods. Other automated mass access to this site is discouraged, and we block badly written scripts that put an undue load on the site.

Re: Html to text
by zentara (Cardinal) on Mar 22, 2009 at 12:12 UTC
    I like to use command-line tools like lynx, elinks, etc.

    my $url = "http://perlmonks.org";
    my $html_code = `lynx -source $url`;
    my $text_data = `lynx -dump $url`;

    The libwww-perl (LWP) modules from CPAN provide a more powerful way to do this. They don't require lynx, but like lynx, can still work through proxies:

    # simplest version
    use LWP::Simple;
    my $content = get($url);

    # or print HTML from a URL
    use LWP::Simple;
    getprint "http://www.linpro.no/lwp/";

    # or print ASCII from HTML from a URL
    # also need HTML-Tree package from CPAN
    use LWP::Simple;
    use HTML::Parse;      # exports parse_html (HTML::Parser does not)
    use HTML::FormatText;
    my ($html, $ascii);
    $html = get("http://www.perl.com/");
    defined $html or die "Can't fetch HTML from http://www.perl.com/";
    $ascii = HTML::FormatText->new->format(parse_html($html));
    print $ascii;
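
    As a hedged alternative to the last snippet: parse_html comes from HTML::Parse, whose own documentation describes it as kept for backwards compatibility and points new code to HTML::TreeBuilder. The sketch below builds the tree with HTML::TreeBuilder and formats it with HTML::FormatText. The sub name html_to_text, the margins and the sample HTML are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to plain text wrapped at column 72.
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text      = $formatter->format($tree);
    $tree->delete;    # free the tree explicitly (matters on older HTML-Tree releases)
    return $text;
}

# Sample input, invented for this sketch.
print html_to_text(
    '<html><body><h1>Html to text</h1>'
  . '<p>Seekers of <b>Perl</b> Wisdom</p></body></html>'
);
```

    The input can come from LWP::Simple's get() exactly as above; only the parsing step changes.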

    I'm not really a human, but I play one on earth My Petition to the Great Cosmic Conciousness
Re: Html to text
by targetsmart (Curate) on Mar 22, 2009 at 12:12 UTC
    Not exactly a Perl solution, but you can try 'lynx -dump' or 'www-browser -dump' on a Unix-based machine (provided the lynx and www-browser commands are installed).

    Vivek
    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
      use strict;
      use warnings;
      use HTML::Parse;
      use HTML::FormatText;
      use LWP::Simple;

      my $url  = "http://www.perlmonks.org";
      my $html = get($url);
      defined $html or die "Can't fetch HTML from: ", $url;
      my $ascii = HTML::FormatText->new->format(parse_html($html));
      print $ascii;

      This isn't giving me much :( Do I need to pretend that I'm a browser and not a bot? How do I do that? Thank you!
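
      One way to control how a script identifies itself is to replace LWP::Simple with LWP::UserAgent and set the agent string explicitly. This is only a sketch: the thread does not establish that the User-Agent header is what caused the poor output, the agent string and sub names are made up, and the earlier note about not scraping PerlMonks still applies. Checking the response status also shows why a fetch failed, which get() from LWP::Simple cannot report.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent with an explicit User-Agent header and a timeout.
sub make_agent {
    my ($agent_string) = @_;
    return LWP::UserAgent->new(
        agent   => $agent_string,
        timeout => 30,
    );
}

# Fetch a URL and return the decoded (character) content,
# or die with the HTTP status line.
sub fetch_html {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    die "Can't fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Made-up agent string; use one that honestly identifies your script.
my $ua = make_agent('MyTextExtractor/0.1 (+mailto:you@example.com)');
print "Requests will be sent as: ", $ua->agent, "\n";

# Pass a URL on the command line to actually fetch it, for example:
#   perl fetch.pl http://www.example.com/
print fetch_html($ua, $ARGV[0]) if @ARGV;
```

      decoded_content returns a character string, which is what a Unicode-aware word count needs as input.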
Re: Html to text
by Your Mother (Archbishop) on Mar 24, 2009 at 03:54 UTC

    I stumbled upon this snippet recently, Re: Strip HTML tags again, and discovered that it works really well (because it allows for broken HTML by checking for allowed tags only). I put in an updated example too, because the underlying module's behavior has drifted: Re^2: Strip HTML tags again.

Re: Html to text
by ig (Vicar) on Mar 24, 2009 at 01:24 UTC

    You might try one of the html2txt programs.