Re: web content parser

You could start with something like the code below. This is my first attempt with HTML::Parser, so treat it as a starting point only.

use warnings;
use strict;

use LWP::Simple;
use HTML::Parser;

# Get the question node
my $doc = get('http://perlmonks.org/?node_id=761525');
die "Couldn't get the document!" unless defined $doc;

# Parse it skipping all but the text
my @lines;
my $parser = HTML::Parser->new(
    text_h => [ sub { push @lines, shift }, 'text'],
    default_h => [ "" ]
);
$parser->parse($doc);

# Create keyword => counter hash
my %hsh;
for my $l (@lines) {
    my @f = split ' ', $l;    # ' ' skips leading whitespace, so no empty-string key
    next unless @f;
    ++$hsh{$_} for @f;
}
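
To actually look at the counts, something along these lines might follow. It is only a sketch: top_words is a name I made up, and the sample hash stands in for the %hsh built above so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n most frequent words from a
# word => count hash reference, ties broken alphabetically.
sub top_words {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                 keys %$counts;
    $n = @sorted if $n > @sorted;    # do not run past the end of the list
    return @sorted[0 .. $n - 1];
}

# Made-up sample data standing in for the real %hsh
my %hsh = (perl => 3, parser => 2, html => 2, the => 5);

printf "%-10s %d\n", $_, $hsh{$_} for top_words(\%hsh, 3);
```

On a real page you would probably also want to lowercase the words and strip punctuation before counting, but that depends on what you count as a keyword.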

Replies are listed 'Best First'.

Re^2: web content parser
by jwkrahn (Abbot) on May 02, 2009 at 20:04 UTC

You can skip the array and just store the words directly into a hash:

# Parse it skipping all but the text
my %words;
my $parser = HTML::Parser->new(
    text_h => [ sub { $words{ $_ }++ for split ' ', shift }, 'text' ],
    default_h => [ '' ],
);
$parser->parse( $doc );
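
One side effect worth noting: split ' ' skips leading whitespace, whereas split /\s+/ returns an empty leading field when the text starts with whitespace, and that empty string would end up as a key in the hash. A small sketch with a made-up sample string shows the difference:

```perl
use strict;
use warnings;

# Made-up sample text with leading whitespace, as text chunks between tags often have
my $text = "  web content\tparser\n";

my @by_regex = split /\s+/, $text;    # keeps an empty leading field
my @by_space = split ' ',   $text;    # skips leading whitespace

print scalar(@by_regex), " fields with /\\s+/\n";    # prints 4
print scalar(@by_space), " fields with ' '\n";       # prints 3
```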