
How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

As your code stands, those two words are not equivalent. If they should be, take a look at this node I wrote today, which squishes accented letters into their non-accented equivalents.
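For illustration only, here is a minimal sketch of one common way to fold accents. It is not necessarily what the node mentioned above does. It assumes the core Unicode::Normalize module is available, and strip_accents is a made-up helper name. The idea is to decompose each character and then delete the combining marks:

```perl
use strict;
use warnings;
use utf8;                          # this source contains a literal accented letter
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose characters, then drop the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}+//g;   # remove the nonspacing (combining) marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents('resumé'), "\n";   # prints "resume"
```

Letters that have no canonical decomposition (for example "ø" or "ß") pass through unchanged, so this only covers the simple accented cases.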

Also, I'd suggest some tweaks to your existing code. For example, I'd change get_stop and get_punc as follows:

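sub get_stop {
  # sample list; each stop word maps to a true value
  my %stop =
    map { $_ => 1 }
    qw(and any the they);
  return \%stop;
}

sub get_punc {
  # sample list; each entity maps to its replacement (a space)
  my %punc =
    map { $_ => ' ' }
    qw(&rsquo; &lsquo; &rdquo; &ldquo;);
  return \%punc;
}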

Not only does this form make it easier to add new entries, it also makes the hashes easier to use in the rest of your code. You don't need all those calls to exists any more:

  s/(&#?\w+;)/$punc->{$1}||$1/eg;

and

  next if $stop->{$_};
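
To see the two lookups working together, here is a small self-contained sketch. The sample text and the @kept array are invented for the example, and the subs are repeated only so the snippet runs on its own:

```perl
use strict;
use warnings;

sub get_stop {
    my %stop = map { $_ => 1 } qw(and any the they);
    return \%stop;
}

sub get_punc {
    my %punc = map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;);
    return \%punc;
}

my $stop = get_stop();
my $punc = get_punc();

# Invented sample input.
my $text = 'the cat&rsquo;s toys &amp; they';

# Entities listed in %punc become a space; unlisted ones (&amp;) are kept.
$text =~ s/(&#?\w+;)/$punc->{$1} || $1/eg;

# A missing key is simply false, so no exists() call is needed.
my @kept = grep { !$stop->{$_} } split ' ', $text;
print "@kept\n";    # prints "cat s toys &amp;"
```

One caveat: because the substitution uses ||, it falls back to the original entity whenever the stored replacement is false, so a replacement of '' or '0' would be ignored. With a single space as the value that is not a problem.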

Finally, your code as it stands doesn't quite do what you described. As a test, give it this data:

''words in double single quotes''

The fix, of course, is to change the regular expressions used to normalize the data so that they strip any run of leading or trailing apostrophes and hyphens:

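  for (@words) {
    s/^['-]+//;       # strip any run of leading quotes and hyphens
    s/['-]+s?$//;     # strip trailing quotes/hyphens, plus a following "s"
    next if length() < $min or length() > $max;
    next if $stop->{$_};
    # skip anything with digits unless it looks like a year (1984, 1990s)
    next if /\d/ and not /^[12]\d{3}s?$/;
    #  next if /--/;  # not needed anymore
    $words_all->{$_}->{$file_key} += 1;   # per-file count for this word
  }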
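
As a quick check of those two substitutions against the test data, here is a small sketch. trim_word is a hypothetical helper that just wraps them so they can be run outside the loop:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions from the loop.
sub trim_word {
    my ($word) = @_;
    $word =~ s/^['-]+//;      # leading quotes and hyphens, however many
    $word =~ s/['-]+s?$//;    # trailing quotes/hyphens, plus a following "s"
    return $word;
}

my @words = map { trim_word($_) }
            split ' ', q{''words in double single quotes''};
print join('|', @words), "\n";    # prints "words|in|double|single|quotes"
```

The same helper turns "dog's" into "dog", since the trailing pattern also takes the "s" that follows the apostrophe.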

Note that I also changed the structure of $words_all above. A word that appears in a file once is likely to appear there several times, so there's no need to keep a huge array with many repeated elements. You can use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need the counts, the values of that same hash hold them.
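
Here is a small sketch of that structure in use. The file names and word lists in %words_in are invented sample data; only the shape of $words_all comes from the loop above:

```perl
use strict;
use warnings;

# Made-up sample data: file key => words already extracted from that file.
my %words_in = (
    'a.html' => [qw(perl hash perl)],
    'b.html' => [qw(perl regex)],
);

# word => { file key => number of occurrences in that file }
my $words_all = {};
for my $file_key (sort keys %words_in) {
    $words_all->{$_}->{$file_key} += 1 for @{ $words_in{$file_key} };
}

for my $word (sort keys %$words_all) {
    my $files = $words_all->{$word};
    print "$word: ",
        join(', ', map { "$_ ($files->{$_})" } sort keys %$files), "\n";
}
# prints:
#   hash: a.html (1)
#   perl: a.html (2), b.html (1)
#   regex: b.html (1)
```

Each file key is stored once per word no matter how often the word occurs, which is where the memory saving over a flat array of file names comes from.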

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/

In reply to Re: Extracting keywords from HTML by fizbin
in thread Extracting keywords from HTML by wfsp
