OK, so it appears that by “maximum partial match” you mean longest common substring. A search on that phrase found the thread finding longest common substring, from which I derived the following:
#! perl use strict; use warnings; use Algorithm::Diff qw( LCS ); use constant { CASE_SENSITIVE => 0, DICTIONARY_FILE => 'words.txt', MINIMUM_MATCH => 2, }; print "Enter the target word: "; chomp(my $target = <STDIN>); my @target = split //, CASE_SENSITIVE ? $target : lc $target; open(my $in, '<', DICTIONARY_FILE) or die "Cannot open file '" . DICTIONARY_FILE . "' for reading: $! +"; my %substrings; while (my $word = <$in>) { chomp $word; my @word = split //, CASE_SENSITIVE ? $word : lc $word; my @lcs = LCS(\@target, \@word); $substrings{ join('', @lcs) } = $word if @lcs >= MINIMUM_MATCH; } close $in or die "Cannot close file '" . DICTIONARY_FILE . "': $!"; if (%substrings) { my @matches = sort { length $a <=> length $b } keys %substrings; print "The closest match is: ", $substrings{ $matches[-1] }, "\n" +; } else { print "No matches found\n"; }
With “words.txt” containing:
fal falle fall awiSayaM awiSayanZ
and suitable input from the keyboard, the output is:
19:42 >perl 454_SoPW.pl Enter the target word: awiSayanA The closest match is: awiSayanZ 19:42 >perl 454_SoPW.pl Enter the target word: fallen The closest match is: falle 19:42 >
Note: You will need to adjust the values of CASE_SENSITIVE and MINIMUM_MATCH to suit your requirements.
Update 1 (1st January, 2013):
Algorithm::Diff is actually the wrong module for this, I should have used String::LCSS. The former finds non-contiguous sub-sequences; the latter finds substrings. Revised code:
#! perl use strict; use warnings; use String::LCSS; use constant { CASE_SENSITIVE => 0, DICTIONARY_FILE => 'words.txt', }; print 'Enter the target word: '; chomp(my $orig_target = <STDIN>); my $target = CASE_SENSITIVE ? $orig_target : lc $orig_target; open(my $in, '<', DICTIONARY_FILE) or die "Cannot open file '" . DICTIONARY_FILE . "' for reading: $! +"; my %substrings; while (my $orig_word = <$in>) { chomp $orig_word; my $word = CASE_SENSITIVE ? $orig_word : lc $orig_word; my @lcss = lcss($word, $target); $substrings{ $lcss[0] } = [ $orig_word, $lcss[1], $lcss[2] ] if $l +css[0]; } close $in or die "Cannot close file '" . DICTIONARY_FILE . "': $!"; print 'Target: ', $orig_target, "\n"; if (%substrings) { my $key = (sort { length $a <=> length $b } keys %substrings)[- +1]; my $match = $substrings{ $key }->[0]; my $index2 = $substrings{ $key }->[2]; my $substr = substr($orig_target, $index2, length $key); print 'Closest match: ', $match, "\n"; print 'Longest common substring: ', $substr, "\n"; } else { print "No matches found\n"; } sub lcss { my ($first, $second) = @_; $first .= '$'; # force strings to be different: $second .= '@'; # kludge required by String::LCSS::lcss my @results = String::LCSS::lcss($first, $second); return wantarray ? @results : $results[0]; }
Update 2 (1st January, 2013):
It appears that String::LCSS is more badly broken than I realised. Even simple matches can fail to find the longest common substring:
18:27 >perl -MString::LCSS=lcss -wE "say scalar lcss('abxabcy', 'abc') +;" ab 18:28 >
(And see http://cpanratings.perl.org/dist/String-LCSS.)
Better to replace sub lcss in the above script with the following by BrowserUk in Re: finding longest common substring:
sub lcss { my $strings = join "\0", @_; my $lcs; for my $n (1 .. length $strings) { my $re = "(.{$n})" . '.*\0.*\1' x (@_ - 1); last unless $strings =~ $re; $lcs = $1; } return $lcs; }
Update 3 (2nd January, 2013):
Discovered the thread Does String::LCSS work?. String::LCSS is indeed broken, but String::LCSS_XS seems to work correctly.
Hope that helps,
| Athanasius <°(((>< contra mundum | Iustus alius egestas vitae, eros Piratica, |
In reply to Re^13: partial match between 2 files
by Athanasius
in thread partial match between 2 files
by lakssreedhar
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |