Word Comparison

rooneyl has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Word Comparison by pc88mxer (Vicar) on Mar 26, 2008 at 22:48 UTC
There is a more-or-less conventional way to do it, and then there are those that take advantage of perl's regular expression capabilities. Here's one that uses the 's' operator: `sub contains { # return true if $_[0] is contained in $_[1] my ($s, $t) = @_; for (split('', $s)) { return 0 unless ($t =~ s/$_//); } 1; } print contains("dog", "good"), "\n"; # -> 1 print contains("food", "fodder"), "\n"; # -> 0` [download] Update: The 'conventional' way would be to perform a count of each letter and compare counts: `sub contains { my ($s, $t) = @_; my (%sc, %tc); for (split('', $s)) { $sc{$_}++ }; for (split('', $t)) { $tc($_}++ }; for my $k (keys %sc) { return 0 unless $sc{$k} <= $tc{$k}; } 1; }` [download] While not the most efficient in this situation, this code exemplifies common types of data processing that is done with perl, so it's valuable to know about.	[reply] [d/l] [select]
Re: Word Comparison by oko1 (Deacon) on Mar 26, 2008 at 23:09 UTC
Interesting how this keeps coming up - it just so happens that I've been working on this class of problems recently. `#!/usr/bin/perl -w use strict; my @words = ("dog good", "food fodder"); TOP: for (@words){ my ($first, $second) = split; for my $ltr (split //, $first){ unless ($second =~ s/$ltr//){ print "'$_' does not match.\n"; next TOP; } } print "'$_' matches.\n"; }` [download] Running it produces the following: 'dog good' matches. 'food fodder' does not match. It gets much more interesting when the words are made up of cards that can be either one or two letters long... :)	[reply] [d/l]
Re: Word Comparison by mobiusinversion (Beadle) on Mar 26, 2008 at 23:32 UTC
It looks like you want to establish word similarity without regard to order or case. Try a hash (it will give you the flexibility to modify the rules of the game later). `my $word = 'bantha fodder'; my @letters = split//, lc($word); my %counts; for(@letters){ $counts{$_}++; } for(sort {$counts{$b} <=> $counts{$a}} keys %counts){ my $t = $_ =~ /[a-z]/ ? 'letter' : $_ =~ /[0-9]/ ? 'number' : 'character'; my $s = $counts{$_} > 1 ? 's' : ''; printf " the %-10s: %-3s ocurred %-3s time%s\n", $t,$_,$counts{$_},$s }` [download] which prints `the letter : 'a' ocurred 2 times the letter : 'd' ocurred 2 times the letter : 'e' ocurred 1 time the letter : 'n' ocurred 1 time the letter : 'r' ocurred 1 time the character : ' ' ocurred 1 time the letter : 'h' ocurred 1 time the letter : 'b' ocurred 1 time the letter : 'f' ocurred 1 time the letter : 't' ocurred 1 time the letter : 'o' ocurred 1 time` [download] Now wrap that in a subroutine and use some loop control: `sub letter_counts { my @letters = split//, lc(shift); my %counts; for(@letters){ $counts{$_}++; } return %counts } my $word1 = 'bantha fodder'; my $word2 = 'banana slug'; my %counts1 = letter_counts($word1); my %counts2 = letter_counts($word2); my $result = 'are equivalent'; test_loop: for(keys %counts1, keys %counts2){ if(!defined $counts2{$_} \|\| !defined $counts1{$_}){ $result = 'are not equivalent'; last test_loop } } print "'$word1' and '$word2' $result\n";` [download] Now you can modify the loop / conditional as you see fit. For order dependent similarity, try Algorithm::Diff	[reply] [d/l] [select]
Re: Word Comparison by nefigah (Monk) on Mar 26, 2008 at 22:45 UTC
Sounds like a fun task, and regular expressions will be your friend here... but from the example you listed, I'm not sure if the rules are totally clear to me? Could you come up with a list of criteria for "what makes a word close enough to another word"? Then we can give some more concrete ideas :) I'm a peripheral visionary... I can see into the future, but just way off to the side.	[reply]
Re^2: Word Comparison by rooneyl (Sexton) on Mar 26, 2008 at 22:58 UTC
I'll try an explain how the game works. You are presented with a word, for example 'perlmonksisgreat'. You have to try and make words from this using only the letters that are in the word, you can only use each letter as many times as it appears. In the above example you could use 1 x 'p' and 2 x 'e' and so on. Some examples of valid words are 'perl', 'slop', 'term'. An example of a non-valid word would be 'pepper' (only 1 'p').	[reply]
Re^3: Word Comparison by nefigah (Monk) on Mar 26, 2008 at 23:01 UTC
Ah, in that case, use pc88's first example below, it does exactly what you want :) I'm a peripheral visionary... I can see into the future, but just way off to the side.	[reply]
Re: Word Comparison by ysth (Canon) on Mar 27, 2008 at 03:55 UTC
Maybe not super efficient in terms of time, but at least terse: `for ([qw/dog good/],[qw/food fodder/]) { my ($first, $second) = @$_; print "$first does "; print "not " unless join("",sort split //,$second) =~ join(".*",sort split // +,$first); print "contain $second\n"; }` [download] I do this in word twist, which you may want to look at. `join("",sort split //,$first) =~ ("^".join("?",sort(split //,$second)) +."?\\z")` [download] is possibly a faster way - I haven't benchmarked. -- Online Fortune Cookie Search	[reply] [d/l] [select]
Re: Word Comparison by hipowls (Curate) on Mar 27, 2008 at 05:01 UTC
Now a regex variant;) I don't know if you'd want to use it but it does demonstrate building a regex at run time. while ( my $line = <DATA> ) { my ($new, $old) = split /\s+/, $line; my $rx = regex($new); print "$new can ", ( $old =~ /$rx/? '': 'NOT ' ), "be made from $o +ld\n"; } sub regex { my %letters; $letters{$_}++ for split //, lc shift; my $regex = join '', map { "(?=(?:.${_}){$letters{$_}})" } keys % +letters; return qr/\A$regex/si; } __DATA__ dog good food fodder lot total fuse useful poor porridge root rotor __END__ dog can be made from good food can NOT be made from fodder lot can be made from total fuse can be made from useful poor can NOT be made from porridge root can be made from rotor [download] I'll show the regular expression generated from 'root' in detail `my $regex = qr{ \A # anchor at start for efficiency (?= # positive lookahead for (?: . # anything r # with an 'r' after it ){1} # at least once ) (?= # positive lookahead for (?:.t){1} # a 't' ) (?= # positive lookahead for (?:.o){2} # for two 'o's ) }six; # ignore case` [download] All the lookaheads have to succeed for the regex to match. On a side note I benchmarked the variants `(?:.r)`, `(?:[^r]+r)` and `(?:[^r*]r)` on different input strings and found the first was usually fastest.	[reply] [d/l] [select]