
How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

As your code stands, those words are not treated as equivalent. If they should be, have a look at this node I just wrote today, which squishes accented letters into their non-accented equivalents.
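For readers without the linked node to hand, here is a minimal sketch of one common way to fold accents. It is not necessarily what that node does. It assumes the core Unicode::Normalize module and input that has already been decoded to Perl characters; strip_accents is a hypothetical name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from the thread): decompose each character
# into base letter + combining marks, then drop the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;    # \p{Mn} = nonspacing (combining) marks
    return $decomposed;
}

# "\x{e9}" is e-acute, written as an escape to keep this file ASCII-only.
print strip_accents("resum\x{e9}"), "\n";
```

Letters that have no decomposition (for example o-slash or the German sharp s) pass through unchanged, so a lookup table may still be needed for those.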

Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:

    sub get_stop{
      # sample
      my %stop = map { $_ => 1 } qw(and any the they);
      \%stop;
    }

    sub get_punc{
      # sample
      my %punc = map { $_ => ' ' } qw(’ ‘ ” “);
      \%punc;
    }

Not only does this form make it easier to add new entries, it is also easier to use in the rest of your code: you don't need all those calls to exists any more. For example:

    s/(&#?\w+;)/$punc->{$1}||$1/eg;

and

    next if $stop->{$_};
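To see the two lookups working together, here is a small self-contained sketch. The entity names used as keys in get_punc are an assumption on my part (the substitution matches entity syntax, so the keys have to look like entities), and keep_words is a hypothetical wrapper, not part of the original script.

```perl
use strict;
use warnings;

sub get_stop {
    my %stop = map { $_ => 1 } qw(and any the they);
    return \%stop;
}

sub get_punc {
    # Assumed sample keys: named entities for the curly quotes.
    my %punc = map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;);
    return \%punc;
}

# Hypothetical wrapper: replace known entities, then drop stop words.
sub keep_words {
    my ($text) = @_;
    my ($stop, $punc) = (get_stop(), get_punc());
    # Known entities become a space; unknown ones are left untouched.
    $text =~ s/(&#?\w+;)/$punc->{$1}||$1/eg;
    return grep { !$stop->{$_} } split ' ', lc $text;
}

print join(',', keep_words('&ldquo;They&rdquo; saw the cat &amp; dog')), "\n";
```

One caveat with the || form: a replacement value of '' or '0' counts as false, so such an entity would be left in place. If you ever need to map an entity to the empty string, an exists test (or the // operator on perl 5.10 and later) is the safer choice.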

Finally, your code as it stands doesn't quite do what you described. As a test, give it this data:

''words in double single quotes''

The fix of course is to change the regular expressions used to normalize the data:

    for (@words){
      s/^['-]+//;
      s/['-]+s?$//;
      next if length() < $min or length() > $max;
      next if $stop->{$_};
      next if /\d/ and not /^[12]\d{3}s?$/;
      # next if /--/; # not needed anymore
      $words_all->{$_}->{$file_key} += 1;
    }
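Here is that loop wrapped up so it can be run against the test data. The values of $min and $max, the file name, and the index_words wrapper are assumptions made for the sake of a runnable sketch; the loop body is the one from the post.

```perl
use strict;
use warnings;

my ($min, $max) = (3, 32);    # assumed limits, not taken from the thread
my $stop      = { map { $_ => 1 } qw(and any the they) };
my $words_all = {};

# Hypothetical wrapper around the normalising loop.
sub index_words {
    my ($file_key, @words) = @_;
    for (@words) {
        s/^['-]+//;       # strip any run of leading quotes or hyphens
        s/['-]+s?$//;     # strip trailing quotes or hyphens (and a possessive s)
        next if length() < $min or length() > $max;
        next if $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;    # keep only year-like numbers
        $words_all->{$_}{$file_key} += 1;
    }
    return $words_all;
}

index_words('a.html', split ' ', "''words in double single quotes''");
print join(' ', sort keys %$words_all), "\n";
```

With these assumed limits, "in" is dropped for being too short and the doubled quotes are stripped from both ends of the phrase.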

Notice that above I also changed the structure of $words_all: it is now a hash of hashes that maps each word to the files it appears in, with a count per file. Any given word is likely to appear several times in a file if it appears there once, and there's no need to keep a huge array with many elements repeated. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need to know the count, you have that too.
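As a quick illustration of reading that structure back, here is a sketch with made-up sample data. The helper names files_for and count_in are hypothetical.

```perl
use strict;
use warnings;

# Made-up sample data in the word => file => count shape.
my $words_all = {
    perl  => { 'a.html' => 3, 'b.html' => 1 },
    regex => { 'b.html' => 2 },
};

# List of files a word appears in (empty list for unknown words).
sub files_for {
    my ($word) = @_;
    return sort keys %{ $words_all->{$word} || {} };
}

# Occurrences of a word in one file; the "|| {}" avoids autovivifying
# an entry for a word that was never indexed.
sub count_in {
    my ($word, $file) = @_;
    return ($words_all->{$word} || {})->{$file} || 0;
}

print join(',', files_for('perl')), "\n";
print count_in('perl', 'a.html'), "\n";
```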

-- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

In reply to Re: Extracting keywords from HTML by fizbin
in thread Extracting keywords from HTML by wfsp
