Word density

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Word density by jdporter (Paladin) on Mar 19, 2006 at 19:41 UTC
I think that is one wheel you really don't want to re-invent. See Text::ExtractWords. Also, you won't want to forget to account for stop words. See Lingua::StopWords. There are also some "N-gram" modules that you might find useful for your main algorithm. (Ignore the "tangram" hits.) Also, it looks to me like what you're trying to do can be achieved, more or less, by a combination of Lingua::EN::Summarize and Lingua::EN::Keywords. We're building the house of the future together.	[reply]
Re: Word density by ikegami (Patriarch) on Mar 19, 2006 at 19:00 UTC
Here are two basic regexps. One that's inclusive: `$word = qr/ [[:alpha:]] # Start with a letter. (?: [:^space:]* # Hyphens, apostrophes, etc [[:alpha:]] # Don't end on a punctuation mark. )? # Catch single letter words. /x;` [download] One that's restrictive: `$word = qr/ [[:alpha:]] # Start with a letter. (?: [[:alpha:]'-]+ # Allowed characters. [:alpha:] # Don't end on a punctuation mark. )? # Catch single letter words. /x;` [download] Here's how you use them: `my $last1; my $last2; while ($content =~ /($word)/g) { my $word = $1; ++$hash{ $word }; ++$hash{ "$last1 $word"} if defined $last1; ++$hash{"$last2 $last1 $word"} if defined $last2; $last2 = $last1; $last1 = $word; }` [download] Update: Instead of just returning the data, I've updated my code to actually process it.	[reply] [d/l] [select]
Re^2: Word density by Anonymous Monk on Mar 19, 2006 at 19:20 UTC
For your second sample code, is there an error somewhere? When I run it, my arrays are returned as ARRAY(0x18740bc), etc. Your second code is much easier to understand, thank you!	[reply]
Re^3: Word density by ikegami (Patriarch) on Mar 19, 2006 at 19:25 UTC
No, that's correct. What was the the second code (shown below) returned pairs of words as an array. You might want to take a peek at Data::Dumper to easily display structures. In any case, I've since modified my post such that the code does what you want. `my @words; while ($content =~ /($word)/g) { push(@words, $1); } my @words_bi; my @words_tri; foreach (0..$#words) { next if $_ < 1; push(@words_bi, [ @words[$_-1 .. $_] ] ); next if $_ < 2; push(@words_tri, [ @words[$_-2 .. $_] ] ); }` [download]	[reply] [d/l]
Re^4: Word density by Anonymous Monk on Mar 19, 2006 at 19:29 UTC
Re^5: Word density by sulfericacid (Deacon) on Mar 19, 2006 at 19:36 UTC
Re^5: Word density by ikegami (Patriarch) on Mar 21, 2006 at 01:26 UTC
Re: Word density by blokhead (Monsignor) on Mar 19, 2006 at 19:00 UTC
Here's a nice rule of thumb when splitting up a string into components: If it's easier to write what you want, use `m/what_you_want/g`. If it's easier to write what you don't want, use `split /what_you_dont_want/`. In your case, it's easier to write a regex for what you want (a word) than what you don't want (all between-word sequences): `my @words = $content =~ m/([A-Za-z]+(?:\'[A-Za-z]+)?)/g` [download] This matches alphabetic characters followed by an optional apostrophe + alphabetics. This is obviously preliminary. Adjust as necessary according to your definition of a "word" .. blokhead	[reply] [d/l] [select]
Re: Word density by GrandFather (Saint) on Mar 19, 2006 at 21:10 UTC
This should get you started - extract the words into a stack (array). Then in a loop: pull the first word off the stack, process it, check that there is a word remaining, combine with the previous top word and process, check that there is a third word, combine and process. `use warnings; use strict; use Text::ExtractWords; my $text = do {local $/; <DATA>}; # Slurp the text my @words; my %hash; words_list (\@words, $text); while (@words) { ++$hash{$_ = shift @words}; ++$hash{"$_ $words[0]"} if exists $words[0]; ++$hash{"$_ $words[0] $words[1]"} if exists $words[1]; } print "$_: $hash{$_}\n" for sort keys %hash; __DATA__ Three blind mice. Three blind mice. See how they run. See how they +run. The butcher's wife came after them with a knife, three blind mice.` [download] Prints: a: 1 a knife: 1 a knife three: 1 after: 1 after them: 1 after them with: 1 blind: 3 blind mice: 3 blind mice see: 1 blind mice three: 1 butcher's: 1 butcher's wife: 1 butcher's wife came: 1 came: 1 came after: 1 came after them: 1 how: 2 how they: 2 how they run: 2 knife: 1 knife three: 1 knife three blind: 1 mice: 3 mice see: 1 mice see how: 1 mice three: 1 mice three blind: 1 run: 2 run see: 1 run see how: 1 run the: 1 run the butcher's: 1 see: 2 see how: 2 see how they: 2 the: 1 the butcher's: 1 the butcher's wife: 1 them: 1 them with: 1 them with a: 1 they: 2 they run: 2 they run see: 1 they run the: 1 three: 3 three blind: 3 three blind mice: 3 wife: 1 wife came: 1 wife came after: 1 with: 1 with a: 1 with a knife: 1 [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Word density by ambrus (Abbot) on Mar 19, 2006 at 20:22 UTC
See stripped punctuation on how to split a text to words.	[reply]
Re: Word density by sulfericacid (Deacon) on Mar 19, 2006 at 18:57 UTC
For starters, you could break on whitespace instead of word boundaries. `my @words = split(/\s+/, $content);` [download] The + means it'll work on the double spaces after your punctuation. Now for the other part of the problem. I know there is an easier way than this but the only idea I have is to retrieve the first three elements of the area and then pop off the first element. Do it in a loop until all elements are gone leaving you every two and three word phrase. "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us" sulfericacid	[reply] [d/l]
Re: Word density by planetscape (Chancellor) on Mar 20, 2006 at 03:52 UTC
As jdporter wisely said, you don't want to go reinventing this particular wheel. With that in mind, may I suggest Ted Pedersen's Ngram Statistics Package? HTH, planetscape	[reply]
Re: Word density by eric256 (Parson) on Mar 21, 2006 at 04:10 UTC
Yet another way to do it. This one is written to stop on the end of sentences (sort of. ;) ). This way phrase don't count if they are accroos a !.? boundary. It probably would have been better to use some modules but it was easy enough to hack together as is. $stop_words should be an array ref with stop words in it. sub process_keywords { my ($text, $weight, $stop_words, $phrase_length, $key_phrases) = @ +_; $text =~ s/&[a-z]+;*/ /g; $text =~ s/[",']//g; $text =~ s/[.!?]/ . /g; $text =~ s/\s+/ /gs; my @words = map { lc($_) } split(' ', $text); @words = grep { length($_) } @words; my $stops = { map { $_ => 1 } @$stop_words }; for my $word (0 .. scalar @words - 1) { next if exists $stops->{$words[$word]}; for my $length (0 .. $phrase_length) { next unless defined @words[$word + $length]; next if exists $stops->{$words[$word+$length]}; my $phrase = join(' ', @words[$word .. $word+$length]); $key_phrases->{$phrase} += $weight; } } return $key_phrases; } [download] ___________ Eric Hodges	[reply] [d/l]