shamat has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am a newbie in Pearl, and I'm struggling with a problem regarding natural language processing. I have a list of common misspellings, which I organized in something that looks like this:
$words[0] = "believe"; $words[1] = "beleive"; $words[2] = "beeliv"; $words[3] = "pelief";
The first entry in the list always refers to the correct spelling. I would like to find the mistakes in the entries 1-3, checking against the reference word. I would like to obtain an output like this:
0-1: ie ~ ei 0-2: e ~ ee; ie ~ i; v ~ 0-3: b ~ p; ve ~ f
So far I have written a very long and clumsy code, which I changed several times. I paste it here, but it actually does not work (the output is also very different from what I would like to have):

$words[0] = "believe"; $words[1] = "beleive"; $words[2] = "beeliv"; $words[3] = "pelief"; $reference_word = $words[0]; for ($n = 1; $n<$#words; $n++) { $z = 0; $l_count = 0; $r_count = 0; $l_common = ""; $r_common = ""; @char_a = split (//, $words[0]); @char_b = split (//, $words[$n]); #finding the largest part in common between two words on the le +ft for ($i=0;$i<=$#char_a;$i++) { #for ($j=0;$j<=$#char_b;$j++) { if ($char_a[$i] eq $char_b[$i]) { $l_count++; $l_common = $l_common.$char_a[$i]; ; } else { last } #} } #finding the largest part in common between two words on the r +ight #check parity of elements in the arrays if ($#char_a > $#char_b) { print "---PARITY BROKEN\n"; $diff = $#char_a > $#char_b; for ($k=1;$k<=$diff;$k++) { unshift (@char_b, "#") } } for ($i=$#char_a;$i>=0;$i--) { #for ($j=$#char_b;$j>=0;$j--) { if ($char_a[$i] eq $char_b[$i]) { $r_count++; $r_common = $r_common.$char_a[$i]; } else { last } #} } $r_common = reverse $r_common; print "$words[$n] ~ $words[$m] -> L_COMMON: >>$l_common<< -- R_COMM +ON: >>$r_common<< L_COUNT: $l_count - R_COUNT: $r_count\n"; if ($l_count ne $total_char) { $lenght_n = length($words[$n]); $lenght_m = length($words[$m]); $diff = ""; #print "1 -- TOTAL_CHAR: $total_char -- L_COUNT: $l_count\n"; #CASE1: magillum ~ magilla -> l_count= 6 r_count = 0 -> um ~ a +--- also ibilam ~ igilu if (!$r_common) { $xx = $total_char - $l_count; print "CASE1 -- TOTAL_CHAR: $total_char -- L_COUNT: $l_count +-- R_COUNT IS 0 -- TOT-LEFT: $xx\n"; $var1 = substr ($words[$n], $l_count); $var2 = substr ($words[$m], $l_count); $diff = $var1."~".$var2; $difference[$z] = "RIGHT_".$diff; print "CASE1 DIFFERENCE: $difference[$z] --- Z = $z\n"; $z++; $length_var1 = length ($var1); $length_var2 = length ($var2); if ($length_var1 > 2 || $length_var2 >2) { print "CASE1: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- L +ENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n"; #chopping first and last characters from var1 and var2 +#at this point we know that they do not match, ex. bilam ~ gilu $left_var = substr ($var1, 0, 1)."~".substr ($var2, 0, + 1); $right_var = substr ($var1, -1)."~".substr ($var2, -1) +; $difference[$z-1] ="LEFT_$left_var"; $difference[$z] ="RIGHT_$right_var"; $words[$n] = substr ($var1, 1, -1); $words[$m] = substr ($var2, 1, -1); $z++; foreach $d (@difference) { print "-----NEW DIFFERENCE:$d\n"; } goto START; } } } #CASE2: zahadin ~ sumhadin -> l_count = 0 r_count = 5 if (!$l_common) { $xx = $total_char - $r_count; print "CASE2 -- TOTAL_CHAR: $total_char -- R_COUNT: $r_count +-- TOT-LEFT: $xx\n"; $var1 = substr ($words[$n], -$lenght_n, -($r_count)); $var2 = substr ($words[$m], -$lenght_m, -($r_count)); $diff = $var1."~".$var2; $difference[$z] = $diff; print "CASE2 DIFFERENCE: $difference[$z] --- Z = $z\n"; $z++; $length_var1 = length ($var1); $length_var2 = length ($var2); if ($length_var1 > 2 || $length_var2 >2) { print "CASE2: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- L +ENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n"; #chopping first and last characters from var1 and var2 +#at this point we know that they do not match, ex. $left_var = substr ($var1, 0, 1)."~".substr ($var2, 0, + 1); $right_var = substr ($var1, -1)."~".substr ($var2, -1) +; $difference[$z-1] ="$left_var"; $difference[$z] ="$right_var"; $words[$n] = substr ($var1, 1, -1); $words[$m] = substr ($var2, 1, -1); $z++; foreach $d (@difference) { print "-----NEW DIFFERENCE:$d\n"; } goto START; } } #CASE3: ibila ~ igila -> l_count = 1 r_count = 3 if (($r_common) && ($l_common)) { print "CASE3 -- TOTAL_CHAR: $total_char -- R_COUNT: $r_count +-- TOT-LEFT: $xx\n"; $var1 = substr ($words[$n], $l_count, ($lenght_n - $r_count - + $l_count)); $var2 = substr ($words[$m], $l_count, ($lenght_m - $r_count - + $l_count)); $diff = $var1."~".$var2; $difference[$z] = $diff; print "CASE3 DIFFERENCE: $difference[$z] --- Z = $z\n"; $z++; $length_var1 = length ($var1); $length_var2 = length ($var2); if ($length_var1 > 2 || $length_var2 >2) { print "CASE2: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- L +ENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n"; #chopping first and last characters from var1 and var2 +#at this point we know that they do not match, ex. $left_var = substr ($var1, 0, 1)."~".substr ($var2, 0, + 1); $right_var = substr ($var1, -1)."~".substr ($var2, -1) +; $difference[$z-1] ="$left_var"; $difference[$z] ="$right_var"; $words[$n] = substr ($var1, 1, -1); $words[$m] = substr ($var2, 1, -1); $z++; foreach $d (@difference) { print "-----NEW DIFFERENCE:$d\n"; } goto START; } } foreach $element (@difference) { print "ELEMENT-->>$element<<-\n"; } }

My idea was to find the maximum portion of the mistaken string matching the reference one, on the left and right boundaries, return what does not match, and then iterate over a loop. I was wondering if there is a better approach, and most of all a more efficient code, or a Perl module that may help. Thanks for your suggestions!

Replies are listed 'Best First'.
Re: Help finding mistakes in spellings using Perl
by AnomalousMonk (Archbishop) on Oct 09, 2013 at 23:50 UTC

    This is only a very indirect and hand-waving response and you probably already have looked at material like this, but this seems to be essentially the sort of thing that Levenshtein distance or Edit distance computations are processing, so you would only need to capture the bits and pieces of string that are being extracted by those algorithms to use for your own purposes. See Text::LevenshteinXS et al.

    >perl -wMstrict -le "use Text::LevenshteinXS; ;; my $word = 'believe'; ;; for my $check (qw(believe relieve beleive beeliv pelief beehive)) { my $d = distance($word, $check); print qq{'$word' < $d > '$check'}; } " 'believe' < 0 > 'believe' 'believe' < 1 > 'relieve' 'believe' < 2 > 'beleive' 'believe' < 3 > 'beeliv' 'believe' < 3 > 'pelief' 'believe' < 3 > 'beehive'
      Hi Anomalous Monk, thanks for your reply. I was aware of the Levenstein distance module, but I don't see a way to capture the differences in spellings while processing the strings using that module - this is probably because of my poor knowledge of Perl.
Re: Help finding mistakes in spellings using Perl
by 2teez (Vicar) on Oct 09, 2013 at 23:50 UTC

    Hi shamat,

    Dear Monks, I am a newbie in Pearl,..
    Let me start by saying, if you are still new to Perl is not late to start good habit like use warnings and strict in your script.
    Secondly, make your script readable use perltidy
    The first entry in the list always refers to the correct spelling. I would like to find the mistakes in the entries 1-3, checking against the reference word. I would like to obtain an output like this:
    One way to go is to compare each of the string with the first entry, letter by letter.
    I will give a head up below.
    use warnings; use strict; my @words = qw(believe beleive beeliv pelief); spelling_check( $words[0], $_ ) for @words[ 1 .. $#words ]; sub spelling_check { no warnings 'uninitialized'; my @wrds; push @wrds, [ split //, $_ ] for @_; my ( $right, $wrong ) = q{} x 2; for ( 0 .. $#{ $wrds[0] } ) { if ( $wrds[0]->[$_] ne $wrds[1]->[$_] ) { $right .= defined( $wrds[1]->[$_] ) ? $wrds[1]->[$_] : qw' +-'; $wrong .= defined( $wrds[0]->[$_] ) ? $wrds[0]->[$_] : qw' +-'; } } print join( ' ~ ' => ( $right, $wrong ) ), $/; }
    Produces..
    ei ~ ie eli- ~ liee pf- ~ bve
    NOTE: Please, note that if either of the strings comparing per time is longer, you have "uninitialized value" thus I used no warnings 'uninitialized'
    2. The printing is not as the OP wanted, but simply showed the difference in the letter of each string compared.
    3. The OP is to use this as a guide.
    Hope this helps

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
      Hi 2teez, thank you very much for your suggestions. The difficult part to me comes when there are multiple misspellings, randomly scattered in the word to be checked against the reference one. In this case, it is crucial to capture sets of variations, which is the reason why I was using an array, to which a new element is appended when a new variation is found. In any case, I might use your code as a starting point to rewrite mine, which compared to yours looks really ugly. Philosophical consideration: The more I work on this script, the more I realize how flexible our brain is in dealing with language, and how difficult is to bring to conscience all the processes involved in this task.
Re: Help finding mistakes in spellings using Perl (aspell, ispell)
by Anonymous Monk on Oct 10, 2013 at 02:00 UTC

      oh, wow, thanks for the shout-out. i wonder if anyone still uses (or ever used, for that matter) my Lingua::Ispell. I was rather proud of it. :-)

      I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
      Sorry for the late replies guys, I could check back the site only this morning.