stan131 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need to fetch the contents of a webpage and store the individual words in a hash. Basically: get the webpage, strip all the tags, newlines, and comments, extract just the content, and store the words in a hash.

I have gone through the documentation for LWP::Simple and HTML::Filter but am still confused about how to approach this.
Could you please provide some code/pointers that will help me proceed.
Thanks,
Stan

Re: web content parser
by przemo (Scribe) on May 02, 2009 at 19:24 UTC

    You may start with something like below. This is my first attempt with HTML::Parser, so treat it as a starter only.

    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::Parser;

    # Get the question node
    my $doc = get('http://perlmonks.org/?node_id=761525');
    die "Couldn't get the document!" unless defined $doc;

    # Parse it, skipping all but the text
    my @lines;
    my $parser = HTML::Parser->new(
        text_h    => [ sub { push @lines, shift }, 'text' ],
        default_h => [ "" ],
    );
    $parser->parse($doc);
    $parser->eof;   # flush any buffered text

    # Create a keyword => counter hash; split ' ' (rather than /\s+/)
    # avoids an empty leading field when a chunk starts with whitespace
    my %hsh;
    for my $l (@lines) {
        my @f = split ' ', $l;
        next unless @f;
        ++$hsh{$_} for @f;
    }
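
    One thing to watch for, an assumption about typical pages rather than something stan131 mentioned: the contents of <script> and <style> elements are also delivered as text events, so JavaScript and CSS fragments would end up in the hash. HTML::Parser can drop those elements wholesale with ignore_elements(), and the 'dtext' argspec hands the handler text with entities such as &amp; already decoded:

    # Skip script/style content entirely (call before parsing)
    $parser->ignore_elements(qw(script style));

    # Alternatively, request entity-decoded text in the handler:
    # text_h => [ sub { push @lines, shift }, 'dtext' ]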

      You can skip the array and just store the words directly into a hash:

      # Parse it, skipping all but the text, counting words as they arrive
      my %words;
      my $parser = HTML::Parser->new(
          text_h    => [ sub { $words{ $_ }++ for split ' ', shift }, 'text' ],
          default_h => [ '' ],
      );
      $parser->parse( $doc );
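
      The split ' ' here is the special single-space form, which splits on runs of whitespace and discards any leading empty field, so no guard against empty chunks is needed. If you want to eyeball the result, a minimal sketch (my illustration, not part of the reply above):

      # Print words sorted by descending count
      for my $w ( sort { $words{$b} <=> $words{$a} } keys %words ) {
          printf "%5d %s\n", $words{$w}, $w;
      }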