Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!

I have a problem and I need to find a solution fast ... nothing new here ;-)

Ok, I have a master file (master.txt) that contains a huge list of words.

master.txt contains (small sample):
===========================
...
accumb
accumber
...

 

And I have another file containing other words.
 

wordlist.txt contains (small sample):
===========================
...
accumbing
accumbed
reaccumb
overaccumb
foreaccumb
deaccumb
unaccumb
accumbe
accumbering
accumbered
reaccumber
overaccumber
foreaccumber
deaccumber
unaccumber
...

My job is to link each word in wordlist.txt with a word in master.txt.

So, out.txt should contain:
==============

...
accumb: accumbing, accumbed, reaccumb, overaccumb, foreaccumb, deaccumb, unaccumb, accumbe.
accumber: accumbering, accumbered, reaccumber, overaccumber, foreaccumber, deaccumber, unaccumber.
...

So that's the problem: I need help in making the association. I have to find the 'best' word to make the association with... is there an algorithm or a CPAN module that can help me?

thanks,

Jean-Daniel


 

Replies are listed 'Best First'.
Re: word association problem
by graff (Chancellor) on Aug 06, 2002 at 04:51 UTC
    On taking a closer look at your question, I realized that in my initial reply, I had missed something in your description, and the suggested script will produce more output than you wanted.

    For the words in wordlist.txt that contain one or more of the words in master.txt as a substring -- e.g. "accumbering", which contains both "accumb" and "accumber" -- you want to associate the wordlist word with the master word that constitutes the longest match -- i.e. "accumbering" should be listed with "accumber", not with "accumb". Have I got that right?

    To do that, the approach is a little more detailed:

        use strict;    # mustn't forget that

        open(LIST, "wordlist.txt") or die "wordlist.txt: $!";
        open(MSTR, "master.txt")   or die "master.txt: $!";

        # get the wordlist
        my @wordlist = map { chomp; $_ } <LIST>;

        # get the master list, sorted by word length, longest words first
        my @master = sort { length($b) <=> length($a) } map { chomp; $_ } <MSTR>;

        # declare a hash to hold the findings:
        my %report;

        foreach my $lookfor ( @master ) {
            next unless length $lookfor;    # skip blank lines in master.txt
            foreach my $lookat ( @wordlist ) {
                # \Q...\E matches the master word literally, not as a pattern
                if ( $lookat =~ /\Q$lookfor\E/ ) {
                    $report{$lookfor} .= ",$lookat";
                    $lookat = "";    # erases this word from @wordlist
                }
            }
        }

        foreach my $word ( sort keys %report ) {
            $report{$word} =~ s/,/ /;    # change initial comma to space
            print "$word:$report{$word}$/";
        }

    By seeking out the longest master words first, and "erasing" the hits from the wordlist array as you find them, each wordlist element will only be listed once, with the longest matching master word.
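
As a file-free check of the longest-match idea, here is a minimal sketch that runs on the sample words from the question. It is a variation rather than the script above: it loops over the wordlist words and stops at the first (longest) master word found, instead of erasing hits from the array. The helper name group_by_longest_match is hypothetical, and the data is hard-coded only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied from the question; real input would come from the files.
my @master   = qw(accumb accumber);
my @wordlist = qw(accumbing accumbed reaccumb overaccumb foreaccumb deaccumb
                  unaccumb accumbe accumbering accumbered reaccumber
                  overaccumber foreaccumber deaccumber unaccumber);

# Hypothetical helper: returns a hash ref mapping each master word to an
# array ref of the wordlist words whose longest contained master word it is.
sub group_by_longest_match {
    my ($master, $words) = @_;
    my %group = map { $_ => [] } @$master;

    # Longest master words first, so the first hit is the longest match.
    my @by_length = sort { length($b) <=> length($a) } @$master;

    for my $word (@$words) {
        for my $stem (@by_length) {
            next unless length $stem;          # ignore empty master entries
            if (index($word, $stem) >= 0) {    # literal substring test
                push @{ $group{$stem} }, $word;
                last;                          # keep only the longest match
            }
        }
    }
    return \%group;
}

my $group = group_by_longest_match(\@master, \@wordlist);
for my $stem (sort keys %$group) {
    print "$stem: ", join(", ", @{ $group->{$stem} }), ".\n";
}
```

On the sample data this prints the two lines shown in the question's out.txt. Words that contain no master word at all are silently dropped, which may or may not be what is wanted.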

    update: Chmrr's correction to my initial response came in while I was working on this one. He's right: his version will be more efficient (and he helped me fix a typo).

Re: word association problem
by atcroft (Abbot) on Aug 06, 2002 at 04:42 UTC

    I fear I may be completely off on this, but it looked like a very intriguing problem you inquired about. My apologies up front for any confusion I may inadvertently cause.

    Would something like Text::English be of help? Or Text::Metaphone or Text::Soundex for early approximations?

    This sounded a lot like what I found when I read rob_au's posting Natural Language Index Stemming and looked up the topic of "stemming," which sounds very close to what you are describing (although I have never worked with it, thus my preface above). Super Search gave several postings when I entered that term ("stemming"), so that might be helpful as well. I haven't tried the Google search against the site, but I would guess it may also give some interesting results. I would be very interested to hear of your results.

Re: word association problem
by graff (Chancellor) on Aug 06, 2002 at 03:18 UTC
    I suppose the first thing I'd try would be to see whether both of these word lists fit in memory at the same time, because that makes things really easy:
        open(LIST, "wordlist.txt");
        open(MSTR, "master.txt");
        my @wordlist = map { chomp; $_ } <LIST>;
        my @master = map { chomp; $_ } <MSTR>;
        foreach my $word ( @master ) {
            print "$word:",join(",",grep(/$word/,@wordlist)),$/;
        }

    update: oh yeah -- gotta use "chomp", not "chop".

      Looks like this will nearly do the trick, but not quite. The main problem with this is that, contrary to the details above, this will include "accumbering" in the "accumb:" line. Hence, we need to start from the longest master words, and remove words from the wordlist as they match. We can also get a slight speedup by using index instead of a regex.

          #!/usr/bin/perl -w
          use strict;

          open(LIST, "wordlist.txt") or die "$!";
          my @wordlist = map { chomp; $_ } <LIST>;
          close LIST;

          open(MSTR, "master.txt") or die "$!";
          my @master = sort {length $b <=> length $a} map { chomp; $_ } <MSTR>;
          close MSTR;

          foreach my $word ( @master ) {
            my @matches;
            for (@wordlist) {
              next unless defined $_ and index($_, $word) >= 0;
              push @matches, $_;
              $_ = undef;
            }
            $word = [$word, join(", ",@matches)];
          }

          print map {"$_->[0]: $_->[1].\n"} sort {$a->[0] cmp $b->[0]} @master;

      perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: word association problem
by nmerriweather (Friar) on Aug 06, 2002 at 03:28 UTC
    I'm not too experienced with Perl, but I can suggest some things based on my naivety:

    1) If file A has a base word, I'd regex the words that contain it out of file B. It can get messy, though, because you'd probably have to test against multiple sections of that word. "lost", "lose" and "losing" all stem from "lose", but the spelling changes drastically between tenses and forms, even in the root.

    2) You could use the Soundex module to get similar-sounding words, and I believe there is a "better than Soundex" module out there too.
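
One possible reading of "test against multiple sections of that word" is to compare leading characters instead of requiring the whole base word as a substring. The sketch below links each word to the base word sharing the longest common prefix. The helper names common_prefix_length and best_base are hypothetical, the word lists are made up for illustration, and the heuristic is an assumption rather than anything confirmed in this thread. In particular it cannot handle prefixed forms such as "reaccumb".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of leading characters two words share.
sub common_prefix_length {
    my ($x, $y) = @_;
    my $max = length($x) < length($y) ? length($x) : length($y);
    my $n = 0;
    $n++ while $n < $max && substr($x, $n, 1) eq substr($y, $n, 1);
    return $n;
}

# Hypothetical helper: the base word sharing the longest prefix with $word,
# or undef when no base word shares even the first character.
sub best_base {
    my ($word, @bases) = @_;
    my ($best, $best_len) = (undef, 0);
    for my $base (@bases) {
        my $len = common_prefix_length($word, $base);
        ($best, $best_len) = ($base, $len) if $len > $best_len;
    }
    return $best;
}

# Illustrative data only: "lost" and "losing" do not contain "lose",
# but both share the prefix "los" with it.
for my $word (qw(lost losing loser)) {
    my $base = best_base($word, qw(lose find));
    print "$word -> ", (defined $base ? $base : "(none)"), "\n";
}
```

A minimum prefix length would probably be needed on a real word list, since a single shared letter is enough to produce a link here.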
Re: word association problem
by hsmyers (Canon) on Aug 06, 2002 at 14:00 UTC
      Dr. Dobb's article on ternary search trees as well

      http://www.ddj.com/documents/s=921/ddj9804a/9804a.htm

      I've always meant to publish a ternary search tree module for CPAN, I guess waiting paid off - someone else has done it.

Re: word association problem
by Ebany (Sexton) on Aug 06, 2002 at 17:33 UTC
    I too am in a similar situation, and I am on Win32, so my options are limited. One module I ran across that seems to merit looking into is Lingua::EN::Infinitive. It attempts to take off the conjugated ending and returns 2 results, which you could then compare against your master list. So, in the case of 'swimming', it should return 'swim' as one of the choices. My best suggestion would be to look into that module, and then combine it with pattern matching, and possibly Metaphone or DoubleMetaphone, mentioned earlier. I don't think there's any one overall module, but I think a combination approach will be helpful.