http://qs1969.pair.com?node_id=485560


in reply to Extracting keywords from HTML

How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

As your code stands, those words are not equivalent. If they should be, take a look at this node I wrote today, which squishes accented letters into their non-accented equivalents.
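For readers who can't follow the link, here is a minimal sketch of one common way to fold accents. It is not necessarily what the linked node does. It assumes the core Unicode::Normalize module is available, and strip_accents is a name made up for this illustration.

```perl
use strict;
use warnings;
use utf8;                          # this source contains a literal accented character
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose the text, then drop the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}+//g;   # remove nonspacing (combining) marks
    return $decomposed;
}

print "same word\n" if strip_accents("resumé") eq "resume";
```

This only handles letters that decompose into a base letter plus combining marks. Characters such as "ø" or "ß" have no such decomposition and would need an explicit mapping table.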

Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:

sub get_stop {
    # sample
    my %stop = map { $_ => 1 } qw(and any the they);
    \%stop;
}

sub get_punc {
    # sample
    my %punc = map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;);
    \%punc;
}

Not only does this form make it easier to add new entries, it also makes the hashes easier to use in the rest of your code. You don't need all those calls to exists any more:

s/(&#?\w+;)/$punc->{$1}||$1/eg;
and
next if $stop->{$_};
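To see both lookups working together, here is a small self-contained sketch. The input string and the entity names used as keys are assumptions chosen for illustration, not taken from your script.

```perl
use strict;
use warnings;

# The two lookup tables, built the same way as above (sample entries only).
my $stop = { map { $_ => 1 }   qw(and any the they) };
my $punc = { map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;) };

# Hypothetical input line.
my $text = "the cat &amp; they said &ldquo;hello&rdquo;";

# Known entities become a space; unknown ones (&amp;) fall through unchanged.
$text =~ s/(&#?\w+;)/$punc->{$1} || $1/eg;

# A plain truth test replaces exists(): a missing key is simply undef.
my @kept = grep { !$stop->{$_} } split ' ', $text;

print join('|', @kept), "\n";    # prints: cat|&amp;|said|hello
```

The || fallback relies on every value in the punctuation hash being true. A single space qualifies, but an empty string or "0" would not, so in that case you would still want exists.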

Finally, your code as it stands doesn't quite do what you described. As a test, give it this data:

''words in double single quotes''

The fix of course is to change the regular expressions used to normalize the data:

for (@words){
    s/^['-]+//;
    s/['-]+s?$//;
    next if length() < $min or length() > $max;
    next if $stop->{$_};
    next if /\d/ and not /^[12]\d{3}s?$/;
    # next if /--/; # not needed anymore
    $words_all->{$_}->{$file_key} += 1;
}
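Here is that loop run against the test data, as a quick check. The values of $min, $max and $file_key, and the whitespace split that produces @words, are assumptions for this sketch. Your script sets these up in its own way.

```perl
use strict;
use warnings;

# Assumed values; the original script defines its own.
my ($min, $max) = (3, 20);
my $stop        = { map { $_ => 1 } qw(and any the they) };
my $file_key    = 'test.html';
my $words_all   = {};

# Assumption: words come from a simple whitespace split.
my @words = split ' ', "''words in double single quotes''";

for (@words) {
    s/^['-]+//;       # strip every leading quote or hyphen
    s/['-]+s?$//;     # strip trailing quotes or hyphens, plus an "s" after them
    next if length() < $min or length() > $max;
    next if $stop->{$_};
    next if /\d/ and not /^[12]\d{3}s?$/;
    $words_all->{$_}{$file_key} += 1;
}

print join(' ', sort keys %$words_all), "\n";   # prints: double quotes single words
```

The doubled quotes are removed from both "words" and "quotes" because the character class is followed by +, so any run of quotes or hyphens at either end is stripped in one go.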

Notice that above I also changed the structure of $words_all: it is now a hash of hashes that maps each word to the files it appears in, with a count per file. Any given word that appears in a file once is likely to appear there several times, and there's no need to keep a huge array with many repeated elements. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need the count, you have that too.
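A short sketch of querying that structure, with made-up words and file names:

```perl
use strict;
use warnings;

# Hypothetical index: word => { file => count }
my $words_all = {};
my @seen = (
    [ 'perl',  'a.html' ], [ 'perl',  'a.html' ],
    [ 'perl',  'b.html' ], [ 'regex', 'b.html' ],
);
$words_all->{ $_->[0] }{ $_->[1] } += 1 for @seen;

my $word  = 'perl';
my @files = sort keys %{ $words_all->{$word} };

print "$word: @files\n";                                    # prints: perl: a.html b.html
print "count in a.html: $words_all->{$word}{'a.html'}\n";   # prints: count in a.html: 2
```

Autovivification creates the inner hash the first time a word is seen, so the loop needs no setup code for new words or new files.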

-- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/