perlquestion
wfsp
I have adopted the following definition of a keyword:
<p>
<h2>Word</h2>
<ul>
<li>Case ignored.
<li>Min length: 3, Max length: 20
<li>Can include a hyphen or an apostrophe but not at either end (these are stripped). Possessive 's (cat's) also stripped. (can't, won't, hasn't etc. are in the stop list).
<li>All other punctuation ignored.
<li>Four-digit numbers between 1000 and 3000 are kept, with an optional trailing s (1960s). Anything else with a digit in it is skipped.
<li>Skip common words (stop words).
</ul>
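The number rule can be checked in isolation. This is a minimal sketch, assuming the same pattern as the script further down; <c>keep_number</c> is a hypothetical helper that is not part of that script. Note that the pattern accepts 1000 to 2999, so 3000 itself is skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the script below): true if a token
# consisting of a four-digit year, optionally followed by "s", is kept.
sub keep_number {
    my ($word) = @_;
    return $word =~ /^[12]\d{3}s?$/ ? 1 : 0;
}

# Print a keep/skip verdict for a few sample tokens.
for my $w (qw(1960s 1066 2999 3000 999 12345 mp3)) {
    printf "%-7s %s\n", $w, keep_number($w) ? 'keep' : 'skip';
}
```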
<h2>HTML</h2>
To preserve the apostrophe, <c>&rsquo;</c> and <c>&#39;</c> are replaced with '.
<p>
All other HTML entity punctuation is then removed and the HTML decoded. Apart from punctuation it is all Latin1 (I've checked it - at length!).
<p>
<b>Update 2</b><br>
I should have mentioned that the text has already been stripped from an HTML file. (!)<br>
Apologies for any confusion<br>
<readmore>
<c>
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

my $config = {
    minw => 3,
    maxw => 20,
};
my $stop = get_stop();
my $punc = get_punc();

my $text = q|
cat's dogs
O&rsquo;Reilly
ad-hoc
&ldquo;broken&rdquo;
hyphen-
the and
|;

my $words_all = {};

# contrived loop to show usage
for my $t ($text){
    get_word($t);
}
print "$_\n" for keys %{$words_all};

sub get_word{
    my ($text, $file_key) = @_;
    my ($min, $max) = ($config->{minw}, $config->{maxw});

    for ($text){
        # turn apostrophe entities into a plain ' so they survive the next step
        s/&rsquo;|&#39;/'/g;
        # blank out known punctuation entities, leave the rest for decoding
        s/(&#?\w+;)/exists $punc->{$1}?' ':$1/eg;
    }
    decode_entities($text);

    # anything that is not a word character, apostrophe or hyphen separates words
    $text =~ s/[^\w'-]/ /g;
    my @words = split ' ', $text;

    for (@words){
        s/^['-]+//;    # strip leading apostrophes and hyphens
        s/['-]+s?$//;  # strip trailing ones, plus a possessive 's
        next if length() < $min or length() > $max;
        next if exists $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;  # keep only years (1960, 1960s)
        next if /--/;
        push @{$words_all->{$_}}, $file_key;
    }
}

sub get_stop{
    # sample
    return {
        qw(
            and  ''
            any  ''
            the  ''
            they ''
        )
    };
}

sub get_punc{
    # sample
    return {
        '&rsquo;' => undef,
        '&lsquo;' => undef,
        '&rdquo;' => undef,
        '&ldquo;' => undef,
    };
}
__DATA__
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
hyphen
cat
ad-hoc
O'Reilly
broken
dogs
> Terminated with exit code 0.
</c>
<p>
At the moment the full app is run locally on a copy of the web site.
<p>
It's generating 42k words but I'm working on the stop file to try and bring it down.
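<p>
For illustration, a rough sketch of how a stop file could be read into a hash like the one <c>get_stop()</c> returns. The file format (one word per line, <c>#</c> comments) and the name <c>load_stop</c> are assumptions, not what the app currently does. An in-memory file handle stands in for the real file so the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one stop word per line into a hash ref
# for fast exists() lookups. Blank lines and '#' comments are ignored.
sub load_stop {
    my ($fh) = @_;
    my %stop;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/#.*//;          # drop comments
        $line =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
        next unless length $line;
        $stop{$line} = 1;
    }
    return \%stop;
}

# In-memory file handle standing in for the real stop file (assumed format).
my $sample = "and\nany\n# pronouns\nthey\n\nthe\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $stop = load_stop($fh);
close $fh;

print join(', ', sort keys %$stop), "\n";
```

Keeping the list in a plain text file would make it easy to add words without touching the script.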
<p>
If this turns out to be fairly stable I'm considering compiling the regexes outside of the loop.
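<p>
A minimal sketch of what compiling outside the loop might look like with <c>qr//</c>, using patterns along the lines of those in the script above. The names <c>clean_words</c>, <c>$RE_LEAD</c>, <c>$RE_TRAIL</c> and <c>$RE_NUMBER</c> are made up for the example. Since Perl already compiles constant regex literals only once, the gain here may well be small; <c>qr//</c> tends to matter more when a pattern is built from variables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns compiled once, outside any loop (names are illustrative only).
my $RE_LEAD   = qr/^['-]+/;         # leading apostrophes/hyphens
my $RE_TRAIL  = qr/['-]+s?$/;       # trailing ones, plus a possessive 's
my $RE_NUMBER = qr/^[12]\d{3}s?$/;  # years such as 1960 or 1960s

# Hypothetical helper: apply the word rules to a list of raw tokens
# and return the ones that survive (stop words are not handled here).
sub clean_words {
    my ($min, $max, @words) = @_;
    my @kept;
    for my $w (@words) {
        $w =~ s/$RE_LEAD//;
        $w =~ s/$RE_TRAIL//;
        next if length($w) < $min or length($w) > $max;
        next if $w =~ /\d/ and $w !~ $RE_NUMBER;
        next if $w =~ /--/;
        push @kept, $w;
    }
    return @kept;
}

print join(' ', clean_words(3, 20, qw(cat's hyphen- ad-hoc 1960s mp3 it))), "\n";
```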
<p>
What do you reckon?
<p>
WinXP, ActiveState Perl 5.8
<p>
Update:<br>
Corrected get_stop() sub