Re: Extracting keywords from HTML

by fizbin (Chaplain)
on Aug 21, 2005 at 18:11 UTC


in reply to Extracting keywords from HTML

How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

Right now, as your code stands, those words are not equivalent. If they should be equivalent, you'll want to look at this node I just wrote today that squishes accented letters into their non-accented equivalents.
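One common way to do this kind of folding, offered only as a sketch and not necessarily what the linked node does, is to decompose each character with the core Unicode::Normalize module and then drop the combining marks. The name strip_accents is made up for illustration, and the code assumes the text has already been decoded into a character string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: fold accented letters to their base letters.
# Assumes $text is a decoded character string, not raw bytes.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e-acute becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}//g;     # drop the nonspacing combining marks
    return $decomposed;
}

my $folded = strip_accents("resum\x{e9}");
print "$folded\n";
```

Letters that have no canonical decomposition (for example "ø" or "ß") pass through unchanged, so a hand-written mapping may still be needed for those.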

Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:

sub get_stop{
    # sample
    my %stop = map { $_ => 1 } qw(and any the they);
    \%stop;
}

sub get_punc{
    # sample
    my %punc = map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;);
    \%punc;
}

Not only does this form make it easier to add new entries, it makes it easier to use in the rest of your code - you don't need all those calls to exists any more:

s/(&#?\w+;)/$punc->{$1}||$1/eg;
and
next if $stop->{$_};
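To see the two lookups working together, here is a small self-contained sketch. The sample text and the table entries are made up, and the entity names (&rsquo; and friends) are an assumption about what the punctuation table holds.

```perl
use strict;
use warnings;

# Illustrative tables: words or entities as keys, lookup value as the payload.
my $stop = { map { $_ => 1 }   qw(and any the they) };
my $punc = { map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;) };

my $text = 'the cat&rsquo;s toy and R&amp;D';

# Known entities become a space; unknown ones (here &amp;) are left alone.
$text =~ s/(&#?\w+;)/$punc->{$1}||$1/eg;

# A plain hash lookup replaces the call to exists.
my @kept = grep { !$stop->{$_} } split ' ', $text;

print "$text\n";
print "@kept\n";
```

The `||$1` fallback works because the replacement value is a single space, which Perl treats as true. A table that mapped an entity to the empty string or to "0" would need a `defined` or `exists` check instead.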

Finally, your code as it stands doesn't do quite what you described. As a test, give it this data:

''words in double single quotes''

The fix of course is to change the regular expressions used to normalize the data:

for (@words){
    s/^['-]+//;
    s/['-]+s?$//;
    next if length() < $min or length() > $max;
    next if $stop->{$_};
    next if /\d/ and not /^[12]\d{3}s?$/;
    # next if /--/; # not needed anymore
    $words_all->{$_}->{$file_key} += 1;
}
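As a quick check of those regular expressions, the sketch below wraps the same tests in a function and feeds it the sample data. The function name normalize_words, the length limits and the stop list are illustrative assumptions, not values from the original script.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above: returns the words that survive.
sub normalize_words {
    my ($stop, $min, $max, @words) = @_;
    my @kept;
    for (@words) {                  # @words is a private copy, safe to edit
        s/^['-]+//;                 # strip leading quotes and hyphens
        s/['-]+s?$//;               # strip trailing quotes/hyphens and a possessive s
        next if length($_) < $min or length($_) > $max;
        next if $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;   # digits only allowed in years
        push @kept, $_;
    }
    return @kept;
}

my $stop  = { map { $_ => 1 } qw(and any the they) };
my @input = split ' ', "''words in double single quotes'' the 1990s 3com dog's";
my @kept  = normalize_words($stop, 3, 20, @input);
print "@kept\n";
```

With these assumed limits, "in" is dropped for being too short, "the" is a stop word, and "3com" fails the year test, while the doubled quotes and the possessive are trimmed away.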

Notice that above I also changed the structure of words_all - any given word is likely to appear several times in a file if it appears there once, and there's no need to keep a huge array with many elements repeated. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need to know the count, you have that too.
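A minimal sketch of that structure, using two made-up file names and word lists, shows how both the file list and the per-file count fall out of the same hash:

```perl
use strict;
use warnings;

# Made-up input: file key => words found in that file.
my %files = (
    'a.html' => [qw(perl regex perl hash)],
    'b.html' => [qw(perl html)],
);

# word => { file key => number of occurrences in that file }
my $words_all = {};
for my $file_key (sort keys %files) {
    $words_all->{$_}->{$file_key} += 1 for @{ $files{$file_key} };
}

my @files_with_perl = sort keys %{ $words_all->{perl} };
my $count_in_a      = $words_all->{perl}{'a.html'};

print "perl appears in: @files_with_perl\n";
print "perl count in a.html: $count_in_a\n";
```

Each file key appears once per word no matter how often the word repeats, so no separate bookkeeping is needed to avoid duplicates.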

-- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

Replies are listed 'Best First'.
Re^2: Extracting keywords from HTML
by wfsp (Abbot) on Aug 21, 2005 at 18:54 UTC
    Thanks for your response.

    resumé resume
    Yes I've wrestled with that. At the moment there are both. Which is correct? What could you expect to be typed in? When it's ready I'll ask the site maintainers what they want! Thanks for the pointer.

    Subs
    The subs shown were for the benefit of the post. There are 752 stop words (I intend to increase this) and 182 html punctuation entities which are read in from separate files. I take your point though.

    Double single quotes
    Good point, thanks. I'd actually gone through the html, found the double single quotes (there were many) and removed them. There were also ` (back tick) quotes. Single and double :-)

    words_all hash ref
    I made a mistake there preparing it for the post. In the app I use a hash (%seen) to keep track of words found in each file and then:

    push @{$words_all->{$_}}, $file_key unless exists $seen{$_};
    I think your method is better. Later, the count could contribute to some form of weighting system.

    Again, many thanks, John
