word association problem

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: word association problem by graff (Chancellor) on Aug 06, 2002 at 04:51 UTC
On taking a closer look at your question, I realized that my initial reply missed something in your description: the suggested script will produce more output than you wanted. For the words in wordlist.txt that contain one or more of the words in master.txt as a substring -- e.g. "accumbering", which contains both "accumb" and "accumber" -- you want to associate the wordlist word with the master word that constitutes the longest match, i.e. "accumbering" should be listed with "accumber", not with "accumb". Have I got that right? To do that, the approach is a little more detailed:

    use strict;   # mustn't forget that

    open(LIST, "wordlist.txt") or die "wordlist.txt: $!";
    open(MSTR, "master.txt")   or die "master.txt: $!";

    # get the wordlist
    my @wordlist = map { chomp; $_ } <LIST>;

    # get the master list, sorted by word length, longest words first
    my @master = sort { length($b) <=> length($a) } map { chomp; $_ } <MSTR>;

    # declare a hash to hold the findings:
    my %report;

    foreach my $lookfor ( @master ) {
        foreach my $lookat ( @wordlist ) {
            if ( $lookat =~ /$lookfor/ ) {
                $report{$lookfor} .= ",$lookat";
                $lookat = "";   # erases this word from @wordlist
            }
        }
    }

    foreach my $word ( sort keys %report ) {
        $report{$word} =~ s/,/ /;   # change initial comma to space
        print "$word:$report{$word}$/";
    }

Because the script seeks out the longest master words first and "erases" the hits from the wordlist array as it finds them, each wordlist element is listed only once, with the longest matching master word.

Update: Chmrr's correction to my initial response came in while I was working on this one. He's right: his version will be more efficient (and he helped me fix a typo).
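To see the longest-match-first idea on a tiny data set, here is a minimal sketch that wraps the same logic in a subroutine and feeds it in-memory lists instead of files. The sub name `associate` is hypothetical, the sample words other than "accumb", "accumber" and "accumbering" are made up, and the `\Q...\E` quoting is an addition so that master words are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash mapping each master word to an
# array ref of the wordlist entries for which it is the longest match.
sub associate {
    my ($master, $wordlist) = @_;
    my @remaining = @$wordlist;        # copy, so the caller's list is untouched
    my %report;
    for my $lookfor (sort { length($b) <=> length($a) } @$master) {
        next if $lookfor eq '';        # skip blank master entries
        my @left;
        for my $lookat (@remaining) {
            if ($lookat =~ /\Q$lookfor\E/) {
                push @{ $report{$lookfor} }, $lookat;   # claimed by the longest match
            }
            else {
                push @left, $lookat;   # still available for shorter master words
            }
        }
        @remaining = @left;
    }
    return %report;
}

my %report = associate(
    [qw(accumb accumber)],
    [qw(accumbent accumbering accumbers succumb)],
);
print "$_: @{ $report{$_} }\n" for sort keys %report;
# accumb: accumbent
# accumber: accumbering accumbers
```

Words that match no master word at all ("succumb" here) simply do not appear in the report. Instead of blanking array elements, this variant keeps a shrinking list of unclaimed words, which also means later passes have fewer entries to scan.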
Re: word association problem by atcroft (Abbot) on Aug 06, 2002 at 04:42 UTC
I fear I may be completely off on this, but it looked like a very intriguing problem. My apologies up front for any confusion I may inadvertently cause. Would something like Text::English be of help? Or Text::Metaphone or Text::Soundex for early approximations? This sounded a lot like what I found when I read rob_au's posting Natural Language Index Stemming and looked up the topic of "stemming," which sounds very close to what you are describing (although I have never worked with it, hence my preface above). Super Search returned several postings when I entered that term ("stemming"), so that might be helpful too. I haven't tried the Google search against the site, but I would guess it may also give some interesting results. I would be very interested to hear of your results.
Re: word association problem by graff (Chancellor) on Aug 06, 2002 at 03:18 UTC
I suppose the first thing I'd try would be to see whether both of these word lists fit in memory at the same time, because that makes things really easy:

    open(LIST, "wordlist.txt") or die "wordlist.txt: $!";
    open(MSTR, "master.txt")   or die "master.txt: $!";
    my @wordlist = map { chomp; $_ } <LIST>;
    my @master   = map { chomp; $_ } <MSTR>;

    # for each master word, list every wordlist entry that contains it
    foreach my $word ( @master ) {
        print "$word:", join(",", grep(/$word/, @wordlist)), $/;
    }

Update: oh yeah -- gotta use "chomp", not "chop".
Re: Re: word association problem by Chmrr (Vicar) on Aug 06, 2002 at 03:58 UTC
Looks like this will nearly do the trick, but not quite. The main problem is that, contrary to the details above, this will include "accumbering" in the "accumb:" line. Hence, we need to start from the longest master words, and remove words from the wordlist as they match. We can also get a slight speedup by using index instead of a regex.

    #!/usr/bin/perl -w
    use strict;

    open(LIST, "wordlist.txt") or die "$!";
    my @wordlist = map { chomp; $_ } <LIST>;
    close LIST;

    open(MSTR, "master.txt") or die "$!";
    my @master = sort {length $b <=> length $a} map { chomp; $_ } <MSTR>;
    close MSTR;

    foreach my $word ( @master ) {
        my @matches;
        for (@wordlist) {
            next unless defined $_ and index($_, $word) >= 0;
            push @matches, $_;
            $_ = undef;   # claimed: shorter master words can no longer match it
        }
        # replace the master word in place with [word, joined matches]
        $word = [$word, join(", ",@matches)];
    }

    print map {"$_->[0]: $_->[1].\n"} sort {$a->[0] cmp $b->[0]} @master;

perl -pe '"I lo`+$^X$\"$]!$/"=~m%(.)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
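One side effect of switching to index is worth spelling out: index compares the master word literally, whereas an unquoted `/$word/` treats it as a pattern. For plain dictionary words the two agree, so this only matters if a list contains regex metacharacters. A small sketch of the difference, with hypothetical helper names and made-up sample strings:

```perl
use strict;
use warnings;

# Hypothetical helpers for comparing the two matching styles.
sub by_regex { my ($word, @list) = @_; return grep { /$word/ } @list }
sub by_index { my ($word, @list) = @_; return grep { index($_, $word) >= 0 } @list }

my @list = qw(a.b axb);

# As a pattern, "a.b" lets the dot match any character.
print join(",", by_regex('a.b', @list)), "\n";   # a.b,axb

# index() looks for the literal three characters "a.b".
print join(",", by_index('a.b', @list)), "\n";   # a.b
```

Writing the regex as `/\Q$word\E/` would make it behave like the index version.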
Re: word association problem by nmerriweather (Friar) on Aug 06, 2002 at 03:28 UTC
I'm not too experienced with Perl, but I can suggest some things based on my naivety:
1) If file A has a base word, I'd regex words that contain it out of file B. It can get messy, though, because you'd probably have to test against multiple sections of that word: lost, lose, losing -- all stem from lose, but in different tenses/forms the spelling changes drastically, even in the root.
2) You could use the Soundex module to get similar-sounding words, and I believe there is a 'better than Soundex' module out there too.
Re: word association problem by hsmyers (Canon) on Aug 06, 2002 at 14:00 UTC
Should this not all fit into memory, then I'd suggest you solve the file-based lookup problem first. You might want to research 'tries'.
A trie (from retrieval), is a multi-way tree structure useful for storing strings over an alphabet. It has been used to store large dictionaries of English (say) words in spelling-checking programs and in natural-language "understanding" programs...₍₁₎
Try http://theoryx5.uwinnipeg.ca/CPAN/data/Tree-Trie/README.html
₍₁₎ from http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie.html.
--hsm
"Never try to teach a pig to sing...it wastes your time and it annoys the pig."
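To make the trie idea concrete, here is a minimal in-memory sketch built from nested hashes. It is not the Tree::Trie API, and the sub names `build_trie` and `longest_match` are made up for illustration. The master words go into the trie, and each wordlist entry is scanned from every start position to find the longest master word it contains.

```perl
use strict;
use warnings;

# Build a trie as nested hashes keyed by single characters.
# The multi-character key "END" marks the end of a stored word,
# so it cannot collide with a character key.
sub build_trie {
    my @words = @_;
    my %root;
    for my $word (@words) {
        my $node = \%root;
        $node = $node->{$_} ||= {} for split //, $word;
        $node->{END} = 1;
    }
    return \%root;
}

# Return the longest stored word occurring anywhere inside $text,
# or undef if none does.
sub longest_match {
    my ($trie, $text) = @_;
    my $best;
    my @chars = split //, $text;
    for my $start (0 .. $#chars) {
        my $node = $trie;
        for my $end ($start .. $#chars) {
            $node = $node->{ $chars[$end] } or last;   # no such branch: stop this walk
            if ($node->{END}) {
                my $len = $end - $start + 1;
                $best = substr($text, $start, $len)
                    if !defined $best or $len > length $best;
            }
        }
    }
    return $best;
}

my $trie = build_trie(qw(accumb accumber));
print longest_match($trie, "accumbering"), "\n";   # accumber
print longest_match($trie, "accumbent"), "\n";     # accumb
```

This version still lives entirely in memory, so it only illustrates the lookup structure and does not address disk-based storage. Its appeal is that each wordlist entry is checked against all master words in one pass, rather than once per master word.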
Re: Re: word association problem by jackdied (Monk) on Aug 08, 2002 at 00:28 UTC
There's also a Dr. Dobb's article on ternary search trees: http://www.ddj.com/documents/s=921/ddj9804a/9804a.htm
I've always meant to publish a ternary search tree module for CPAN. I guess waiting paid off: someone else has done it.
Re: word association problem by Ebany (Sexton) on Aug 06, 2002 at 17:33 UTC
I too am in a similar situation, and am on Win32, so my options are limited, but one module I ran across that seems to merit some looking into is Lingua::EN::Infinitive. It attempts to strip the conjugated ending and returns two candidate results, which you could then compare against your master list. So, in the case of 'swimming', it should return 'swim' as one of the choices. My best suggestion would be to look into that module and then combine it with pattern matching, and possibly Metaphone or DoubleMetaphone, mentioned earlier. I don't think there's any one overall module, but I think a combination approach will be helpful.