
How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

As your code stands, those words are not equivalent. If they should be, you'll want to look at this node I wrote today that squishes accented letters into their non-accented equivalents.
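For anyone who wants a rough idea of what such squishing can look like, here is a minimal sketch. It is not the node mentioned above, just one common approach: decompose each character with the core Unicode::Normalize module, then delete the combining marks. The name strip_accents is made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);   # core module

# Hypothetical helper: split each accented letter into base letter plus
# combining mark (NFD), then remove the marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;   # \p{Mn} = nonspacing marks, e.g. the acute accent
    return $decomposed;
}

# "\x{e9}" is e-acute, written as an escape so the file encoding doesn't matter.
print strip_accents("resum\x{e9}"), "\n";
```

Letters with no canonical decomposition, such as "ø" or "ß", pass through unchanged, so this only covers the letter-plus-accent cases.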

Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:

sub get_stop{
  # sample
  my %stop =
    map { $_ => 1 }
    qw(and any the they);
  \%stop;
}

sub get_punc{
  # sample
  my %punc = 
    map { $_ => ' ' }
    qw(&rsquo; &lsquo; &rdquo; &ldquo;);
  \%punc;
}

Not only does this form make it easier to add new entries, it makes it easier to use in the rest of your code - you don't need all those calls to exists any more:

  s/(&#?\w+;)/$punc->{$1}||$1/eg;

and

  next if $stop->{$_};
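To show the two lookups working together, here is a small self-contained sketch. The tables have the same shape as the ones get_stop and get_punc return; replace_entities and the sample sentence are made up for the example.

```perl
use strict;
use warnings;

# Sample tables in the same shape as get_stop and get_punc above.
my $stop = { map { $_ => 1 }   qw(and any the they) };
my $punc = { map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;) };

# Hypothetical helper: known entities become a space, unknown ones are kept.
sub replace_entities {
    my ($text) = @_;
    $text =~ s/(&#?\w+;)/$punc->{$1} || $1/eg;
    return $text;
}

my $line = replace_entities('the &ldquo;quick&rdquo; fox &amp; they');

# A plain truth test on the hash value replaces the call to exists.
my @kept = grep { !$stop->{$_} } split ' ', $line;
print "@kept\n";
```

One thing to keep in mind: the || fallback only works because every value in the table is true. If you ever map an entity to the empty string or to "0", the original entity would be put back, and you'd need an exists test (or the // operator on a newer perl) for that table again.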

Finally, your code as it stands doesn't do quite what you described. As a test, give it this data:

''words in double single quotes''

The fix of course is to change the regular expressions used to normalize the data:

  for (@words){
    s/^['-]+//;
    s/['-]+s?$//;
    next if length() < $min or length() > $max;
    next if $stop->{$_};
    next if /\d/ and not /^[12]\d{3}s?$/;
    #  next if /--/;  # not needed anymore
    $words_all->{$_}->{$file_key} += 1;    
  }

Notice that above I also changed the structure of $words_all. Any given word is likely to appear several times in a file if it appears there once, and there's no need to keep a huge array with many elements repeated. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need to know the count, you have that too.
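Here is a rough, runnable sketch of that structure in action, using the test data from above. The values of $min and $max, the file keys, and the helpers add_words and files_for are assumptions made for the example; your script has its own.

```perl
use strict;
use warnings;

# Assumed settings; the real script defines its own limits and stop list.
my ($min, $max) = (3, 20);
my $stop = { map { $_ => 1 } qw(and any the they) };

my $words_all = {};

# Hypothetical wrapper around the loop above, for one file's worth of words.
sub add_words {
    my ($file_key, @words) = @_;
    for (@words) {
        s/^['-]+//;       # strip leading quotes and hyphens
        s/['-]+s?$//;     # strip trailing quotes and hyphens, and a possessive 's
        next if length() < $min or length() > $max;
        next if $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;   # keep years such as 1990s
        $words_all->{$_}{$file_key} += 1;
    }
}

# Hypothetical helper: the files a word appears in, sorted by name.
sub files_for {
    my ($word) = @_;
    return sort keys %{ $words_all->{$word} || {} };
}

add_words('a.html', split ' ', "''words in double single quotes'' words 1990s");
add_words('b.html', split ' ', "the quotes' dog's 42nd");

for my $word (sort keys %$words_all) {
    my @files = files_for($word);
    print "$word: @files\n";
}
```

Each word costs one hash entry per file, however many times it occurs there, and the per-file count is simply $words_all->{$word}{$file_key}.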

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/

In reply to Re: Extracting keywords from HTML by fizbin
in thread Extracting keywords from HTML by wfsp
