Dear Monks, I am a newbie in Perl, and I'm struggling with a problem regarding natural language processing. I have a list of common misspellings, which I organized in a structure that looks like this:
$words[0] = "believe";
$words[1] = "beleive";
$words[2] = "beeliv";
$words[3] = "pelief";
The first entry in the list always refers to the correct spelling. I would like to find the mistakes in entries 1-3, checking against the reference word. I would like to obtain an output like this:
0-1: ie ~ ei
0-2: e ~ ee; ie ~ i; v ~
0-3: b ~ p; ve ~ f
So far I have written some very long and clumsy code, which I have changed several times. I paste it here, but it does not actually work (the output is also very different from what I would like to have):

$words[0] = "believe";
$words[1] = "beleive";
$words[2] = "beeliv";
$words[3] = "pelief";

$reference_word = $words[0];

for ($n = 1; $n<$#words; $n++) {
    $z = 0;
    $l_count = 0;
    $r_count = 0;
    $l_common = "";
    $r_common = "";
    @char_a = split (//, $words[0]);
    @char_b = split (//, $words[$n]);

    #finding the largest part in common between two words on the left
    for ($i=0;$i<=$#char_a;$i++) {
        #for ($j=0;$j<=$#char_b;$j++) {
        if ($char_a[$i] eq $char_b[$i]) {
            $l_count++;
            $l_common = $l_common.$char_a[$i];
        }
        else { last }
        #}
    }

    #finding the largest part in common between two words on the right
    #check parity of elements in the arrays
    if ($#char_a > $#char_b) {
        print "---PARITY BROKEN\n";
        $diff = $#char_a > $#char_b;
        for ($k=1;$k<=$diff;$k++) {
            unshift (@char_b, "#")
        }
    }

    for ($i=$#char_a;$i>=0;$i--) {
        #for ($j=$#char_b;$j>=0;$j--) {
        if ($char_a[$i] eq $char_b[$i]) {
            $r_count++;
            $r_common = $r_common.$char_a[$i];
        }
        else { last }
        #}
    }
    $r_common = reverse $r_common;
    print "$words[$n] ~ $words[$m] -> L_COMMON: >>$l_common<< -- R_COMMON: >>$r_common<< L_COUNT: $l_count - R_COUNT: $r_count\n";

    if ($l_count ne $total_char) {
        $lenght_n = length($words[$n]);
        $lenght_m = length($words[$m]);
        $diff = "";
        #print "1 -- TOTAL_CHAR: $total_char -- L_COUNT: $l_count\n";

        #CASE1: magillum ~ magilla -> l_count= 6 r_count = 0 -> um ~ a --- also ibilam ~ igilu
        if (!$r_common) {
            $xx = $total_char - $l_count;
            print "CASE1 -- TOTAL_CHAR: $total_char -- L_COUNT: $l_count -- R_COUNT IS 0 -- TOT-LEFT: $xx\n";
            $var1 = substr ($words[$n], $l_count);
            $var2 = substr ($words[$m], $l_count);
            $diff = $var1."~".$var2;
            $difference[$z] = "RIGHT_".$diff;
            print "CASE1 DIFFERENCE: $difference[$z] --- Z = $z\n";
            $z++;
            $length_var1 = length ($var1);
            $length_var2 = length ($var2);
            if ($length_var1 > 2 || $length_var2 >2) {
                print "CASE1: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- LENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n";
                #chopping first and last characters from var1 and var2
                #at this point we know that they do not match, ex. bilam ~ gilu
                $left_var = substr ($var1, 0, 1)."~".substr ($var2, 0, 1);
                $right_var = substr ($var1, -1)."~".substr ($var2, -1);
                $difference[$z-1] ="LEFT_$left_var";
                $difference[$z] ="RIGHT_$right_var";
                $words[$n] = substr ($var1, 1, -1);
                $words[$m] = substr ($var2, 1, -1);
                $z++;
                foreach $d (@difference) {
                    print "-----NEW DIFFERENCE:$d\n";
                }
                goto START;
            }
        }
    }

    #CASE2: zahadin ~ sumhadin -> l_count = 0 r_count = 5
    if (!$l_common) {
        $xx = $total_char - $r_count;
        print "CASE2 -- TOTAL_CHAR: $total_char -- R_COUNT: $r_count -- TOT-LEFT: $xx\n";
        $var1 = substr ($words[$n], -$lenght_n, -($r_count));
        $var2 = substr ($words[$m], -$lenght_m, -($r_count));
        $diff = $var1."~".$var2;
        $difference[$z] = $diff;
        print "CASE2 DIFFERENCE: $difference[$z] --- Z = $z\n";
        $z++;
        $length_var1 = length ($var1);
        $length_var2 = length ($var2);
        if ($length_var1 > 2 || $length_var2 >2) {
            print "CASE2: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- LENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n";
            #chopping first and last characters from var1 and var2
            #at this point we know that they do not match, ex.
            $left_var = substr ($var1, 0, 1)."~".substr ($var2, 0, 1);
            $right_var = substr ($var1, -1)."~".substr ($var2, -1);
            $difference[$z-1] ="$left_var";
            $difference[$z] ="$right_var";
            $words[$n] = substr ($var1, 1, -1);
            $words[$m] = substr ($var2, 1, -1);
            $z++;
            foreach $d (@difference) {
                print "-----NEW DIFFERENCE:$d\n";
            }
            goto START;
        }
    }

    #CASE3: ibila ~ igila -> l_count = 1 r_count = 3
    if (($r_common) && ($l_common)) {
        print "CASE3 -- TOTAL_CHAR: $total_char -- R_COUNT: $r_count -- TOT-LEFT: $xx\n";
        $var1 = substr ($words[$n], $l_count, ($lenght_n - $r_count - $l_count));
        $var2 = substr ($words[$m], $l_count, ($lenght_m - $r_count - $l_count));
        $diff = $var1."~".$var2;
        $difference[$z] = $diff;
        print "CASE3 DIFFERENCE: $difference[$z] --- Z = $z\n";
        $z++;
        $length_var1 = length ($var1);
        $length_var2 = length ($var2);
        if ($length_var1 > 2 || $length_var2 >2) {
            print "CASE2: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- LENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n";
            #chopping first and last characters from var1 and var2
            #at this point we know that they do not match, ex.
            $left_var = substr ($var1, 0, 1)."~".substr ($var2, 0, 1);
            $right_var = substr ($var1, -1)."~".substr ($var2, -1);
            $difference[$z-1] ="$left_var";
            $difference[$z] ="$right_var";
            $words[$n] = substr ($var1, 1, -1);
            $words[$m] = substr ($var2, 1, -1);
            $z++;
            foreach $d (@difference) {
                print "-----NEW DIFFERENCE:$d\n";
            }
            goto START;
        }
    }

    foreach $element (@difference) {
        print "ELEMENT-->>$element<<-\n";
    }
}

My idea was to find the longest portions of the misspelled string that match the reference word at the left and right boundaries, return what does not match, and then repeat this in a loop. I was wondering if there is a better approach, and above all more efficient code, or a Perl module that may help. Thanks for your suggestions!
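The core of that idea (strip the longest common prefix and suffix, report what is left) can be sketched much more compactly. This is only a minimal sketch under stated assumptions, not a full solution: `diff_span` is a hypothetical helper name, and it reports a single differing span per word rather than the several fine-grained differences shown in the wanted output.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): returns the parts of $ref
# and $word that remain after removing their longest common prefix and
# their longest common suffix.
sub diff_span {
    my ($ref, $word) = @_;
    my $max = length($ref) < length($word) ? length($ref) : length($word);

    # Length of the common prefix.
    my $pre = 0;
    $pre++ while $pre < $max
        && substr($ref, $pre, 1) eq substr($word, $pre, 1);

    # Length of the common suffix; it must not overlap the prefix.
    my $suf = 0;
    $suf++ while $suf < $max - $pre
        && substr($ref, -1 - $suf, 1) eq substr($word, -1 - $suf, 1);

    return (
        substr($ref,  $pre, length($ref)  - $pre - $suf),
        substr($word, $pre, length($word) - $pre - $suf),
    );
}

my @words = qw(believe beleive beeliv pelief);
for my $n (1 .. $#words) {
    my ($r, $w) = diff_span($words[0], $words[$n]);
    print "0-$n: $r ~ $w\n";
}
```

For the sample list this prints `0-1: ie ~ ei`, `0-2: lieve ~ eliv` and `0-3: believe ~ pelief`, so only the first word matches the wanted output. Splitting the remaining span into several differences probably needs a real alignment of the two strings. The CPAN module Algorithm::Diff, which is based on longest common subsequences, may be worth a look for that, although it has not been tried here.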

In reply to Help finding mistakes in spellings using Perl by shamat
