Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to be able to eventually take a URL and determine the keyword density of the page. For starters, I am using just a simple scalar to play with.

What I want to do is go up to three keywords at a time (one word, two words and three word phrases). The one word is easy, I'll rip through the array and populate a $hash++ with it. But now moving on to two and three keyword phrases.

I was thinking about tossing in a regex in the split (assuming we can do that) to match 2 words and 3 words. Unfortunately my regex skills aren't up to par quite yet and not sure how to go about that.

The first problem I am having is it's breaking up on my apostraphes making the following 's' a new word I don't want that.

my $content = qq(Three blind mice. Three blind mice. See how they ru +n. See how they run. The butcher's wife came after them with a knif +e, three blind mice.); my @words = split(/\W/, $content); chomp(@words); foreach (@words) { print "$_\n"; }
I would like to see some advice on how to go about doing this the way I am trying to before seeing other methods on how to do this. I know there are ways to produce shorter code, but this is probably easiest for me to understand right now.

Replies are listed 'Best First'.
Re: Word density
by jdporter (Paladin) on Mar 19, 2006 at 19:41 UTC

    I think that is one wheel you really don't want to re-invent. See Text::ExtractWords. Also, you won't want to forget to account for stop words. See Lingua::StopWords.

    There are also some "N-gram" modules that you might find useful for your main algorithm. (Ignore the "tangram" hits.)

    Also, it looks to me like what you're trying to do can be achieved, more or less, by a combination of Lingua::EN::Summarize and Lingua::EN::Keywords.

    We're building the house of the future together.
Re: Word density
by ikegami (Patriarch) on Mar 19, 2006 at 19:00 UTC

    Here are two basic regexps. One that's inclusive:

    $word = qr/ [[:alpha:]] # Start with a letter. (?: [:^space:]* # Hyphens, apostrophes, etc [[:alpha:]] # Don't end on a punctuation mark. )? # Catch single letter words. /x;

    One that's restrictive:

    $word = qr/ [[:alpha:]] # Start with a letter. (?: [[:alpha:]'-]+ # Allowed characters. [:alpha:] # Don't end on a punctuation mark. )? # Catch single letter words. /x;

    Here's how you use them:

    my $last1; my $last2; while ($content =~ /($word)/g) { my $word = $1; ++$hash{ $word }; ++$hash{ "$last1 $word"} if defined $last1; ++$hash{"$last2 $last1 $word"} if defined $last2; $last2 = $last1; $last1 = $word; }

    Update: Instead of just returning the data, I've updated my code to actually process it.

      For your second sample code, is there an error somewhere? When I run it, my arrays are returned as ARRAY(0x18740bc), etc.

      Your second code is much easier to understand, thank you!

        No, that's correct. What was the the second code (shown below) returned pairs of words as an array. You might want to take a peek at Data::Dumper to easily display structures. In any case, I've since modified my post such that the code does what you want.

        my @words; while ($content =~ /($word)/g) { push(@words, $1); } my @words_bi; my @words_tri; foreach (0..$#words) { next if $_ < 1; push(@words_bi, [ @words[$_-1 .. $_] ] ); next if $_ < 2; push(@words_tri, [ @words[$_-2 .. $_] ] ); }
Re: Word density
by blokhead (Monsignor) on Mar 19, 2006 at 19:00 UTC
    Here's a nice rule of thumb when splitting up a string into components: If it's easier to write what you want, use m/what_you_want/g. If it's easier to write what you don't want, use split /what_you_dont_want/.

    In your case, it's easier to write a regex for what you want (a word) than what you don't want (all between-word sequences):

    my @words = $content =~ m/([A-Za-z]+(?:\'[A-Za-z]+)?)/g
    This matches alphabetic characters followed by an optional apostrophe + alphabetics. This is obviously preliminary. Adjust as necessary according to your definition of a "word" ..

    blokhead

Re: Word density
by GrandFather (Saint) on Mar 19, 2006 at 21:10 UTC

    This should get you started - extract the words into a stack (array). Then in a loop: pull the first word off the stack, process it, check that there is a word remaining, combine with the previous top word and process, check that there is a third word, combine and process.

    use warnings; use strict; use Text::ExtractWords; my $text = do {local $/; <DATA>}; # Slurp the text my @words; my %hash; words_list (\@words, $text); while (@words) { ++$hash{$_ = shift @words}; ++$hash{"$_ $words[0]"} if exists $words[0]; ++$hash{"$_ $words[0] $words[1]"} if exists $words[1]; } print "$_: $hash{$_}\n" for sort keys %hash; __DATA__ Three blind mice. Three blind mice. See how they run. See how they +run. The butcher's wife came after them with a knife, three blind mice.

    Prints:

    a: 1 a knife: 1 a knife three: 1 after: 1 after them: 1 after them with: 1 blind: 3 blind mice: 3 blind mice see: 1 blind mice three: 1 butcher's: 1 butcher's wife: 1 butcher's wife came: 1 came: 1 came after: 1 came after them: 1 how: 2 how they: 2 how they run: 2 knife: 1 knife three: 1 knife three blind: 1 mice: 3 mice see: 1 mice see how: 1 mice three: 1 mice three blind: 1 run: 2 run see: 1 run see how: 1 run the: 1 run the butcher's: 1 see: 2 see how: 2 see how they: 2 the: 1 the butcher's: 1 the butcher's wife: 1 them: 1 them with: 1 them with a: 1 they: 2 they run: 2 they run see: 1 they run the: 1 three: 3 three blind: 3 three blind mice: 3 wife: 1 wife came: 1 wife came after: 1 with: 1 with a: 1 with a knife: 1

    DWIM is Perl's answer to Gödel
Re: Word density
by ambrus (Abbot) on Mar 19, 2006 at 20:22 UTC
Re: Word density
by sulfericacid (Deacon) on Mar 19, 2006 at 18:57 UTC
    For starters, you could break on whitespace instead of word boundaries.

    my @words = split(/\s+/, $content);
    The + means it'll work on the double spaces after your punctuation.

    Now for the other part of the problem. I know there is an easier way than this but the only idea I have is to retrieve the first three elements of the area and then pop off the first element. Do it in a loop until all elements are gone leaving you every two and three word phrase.



    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid
Re: Word density
by planetscape (Chancellor) on Mar 20, 2006 at 03:52 UTC
Re: Word density
by eric256 (Parson) on Mar 21, 2006 at 04:10 UTC

    Yet another way to do it. This one is written to stop on the end of sentences (sort of. ;) ). This way phrase don't count if they are accroos a !.? boundary. It probably would have been better to use some modules but it was easy enough to hack together as is. $stop_words should be an array ref with stop words in it.

    sub process_keywords { my ($text, $weight, $stop_words, $phrase_length, $key_phrases) = @ +_; $text =~ s/&[a-z]+;*/ /g; $text =~ s/[",']//g; $text =~ s/[.!?]/ . /g; $text =~ s/\s+/ /gs; my @words = map { lc($_) } split(' ', $text); @words = grep { length($_) } @words; my $stops = { map { $_ => 1 } @$stop_words }; for my $word (0 .. scalar @words - 1) { next if exists $stops->{$words[$word]}; for my $length (0 .. $phrase_length) { next unless defined @words[$word + $length]; next if exists $stops->{$words[$word+$length]}; my $phrase = join(' ', @words[$word .. $word+$length]); $key_phrases->{$phrase} += $weight; } } return $key_phrases; }

    ___________
    Eric Hodges