What exactly are you asking? Are you saying that, given a URL, you already know how to fetch the page and count the words, but you can't figure out how to keep that count separate from the counts for the other pages? That seems so trivial that I must be misunderstanding, even though that's what you wrote. Can you post some code to clarify what you're really asking? It seems like what you want to do is really simple:
use HTML::TokeParser::Simple 3.13;

my %words;
foreach my $url (@urls) {
    $words{$url} = 0;
    my $parser = HTML::TokeParser::Simple->new(url => $url);
    while (my $token = $parser->get_token) {
        next unless $token->is_text;    # assumes you only search visible text
        $words{$url} += some_word_counting_function($token->as_is);
    }
}
# %words now has the count of words per url
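In case it helps, here's a minimal sketch of what some_word_counting_function might look like. The name is just the placeholder from the code above, and splitting on whitespace is one rough definition of a "word":

sub some_word_counting_function {
    my $text = shift;
    # split ' ' splits on runs of whitespace and ignores leading whitespace
    my @words = split ' ', $text;
    return scalar @words;
}

You could count matches of /\w+/g instead if you want a stricter notion of what a word is.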
Admittedly, that uses a different module, but it shows how easy it is to track a word count per URL. Did I misunderstand what you were asking?