Dear Monks, I am a Perl newbie, and I'm struggling with a natural language processing problem. I have a list of common misspellings, which I have organized like this:

$words[0] = "believe";
$words[1] = "beleive"; 
$words[2] = "beeliv"; 
$words[3] = "pelief";

The first entry in the list is always the correct spelling. I would like to find the mistakes in entries 1-3 by checking them against the reference word, and to obtain output like this:

0-1: ie ~ ei
0-2: e ~ ee; ie ~ i; v ~
0-3: b ~ p; ve ~ f
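
One way to get a listing of this shape is to align each misspelling with the reference word on their longest common subsequence (LCS) and report the characters left unmatched between the aligned letters. The following is only a minimal sketch under that assumption, not a definitive solution: `spelling_diffs` is a hypothetical helper name, and LCS alignment is a technique swapped in for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: aligns $word with $ref on their longest common
# subsequence and returns the unmatched chunks as "ref ~ word" strings.
sub spelling_diffs {
    my ($ref, $word) = @_;
    my @x = split //, $ref;
    my @y = split //, $word;

    # $lcs[$i][$j] = LCS length of @x[$i..$#x] and @y[$j..$#y].
    my @lcs = map { [ (0) x (@y + 1) ] } 0 .. scalar(@x);
    for my $i (reverse 0 .. $#x) {
        for my $j (reverse 0 .. $#y) {
            if ($x[$i] eq $y[$j]) {
                $lcs[$i][$j] = $lcs[$i + 1][$j + 1] + 1;
            } else {
                my ($down, $right) = ($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
                $lcs[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the start; unmatched characters accumulate in
    # $del (reference side) and $ins (misspelling side) until the next match.
    my ($i, $j, $del, $ins, @diffs) = (0, 0, '', '');
    my $flush = sub {
        push @diffs, "$del ~ $ins" if length($del) || length($ins);
        ($del, $ins) = ('', '');
    };
    while ($i < @x && $j < @y) {
        if ($x[$i] eq $y[$j]) {
            $flush->();
            $i++;
            $j++;
        } elsif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $del .= $x[$i++];
        } else {
            $ins .= $y[$j++];
        }
    }
    # Whatever remains on either side is one final unmatched chunk.
    $del .= join '', @x[$i .. $#x];
    $ins .= join '', @y[$j .. $#y];
    $flush->();
    return @diffs;
}

my @words = qw(believe beleive beeliv pelief);
for my $n (1 .. $#words) {
    print "0-$n: ", join('; ', spelling_diffs($words[0], $words[$n])), "\n";
}
```

With this alignment, pair 0-3 comes out as `b ~ p; ve ~ f`. A transposition such as `ie`/`ei` is reported as a separate deletion and insertion rather than as one chunk, so the other lines will not match the hand-made listing exactly.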

So far I have written a very long and clumsy script, which I have changed several times. I paste it here, but it does not actually work (its output is also very different from what I would like to have):

$words[0] = "believe";
$words[1] = "beleive"; 
$words[2] = "beeliv"; 
$words[3] = "pelief";

$reference_word = $words[0];

for ($n = 1; $n<=$#words; $n++)  {
    
      
      $z = 0;

      
      $l_count = 0;
      $r_count = 0;
      $l_common = "";
      $r_common = "";
      
      @char_a = split (//, $words[0]);
      
      @char_b = split (//, $words[$n]);
       
          
       #finding the largest part in common between two words on the left
        for ($i=0;$i<=$#char_a;$i++) {
            #for ($j=0;$j<=$#char_b;$j++) {
                 if ($char_a[$i] eq $char_b[$i]) {
                  $l_count++;
                  $l_common = $l_common.$char_a[$i];
                  ;
               } else {
                  last
               } 
            #}
         
        }            
        #finding the largest part in common between two words on the right
        
        #check parity of elements in the arrays
        if ($#char_a > $#char_b) {
           print "---PARITY BROKEN\n";
           $diff = $#char_a - $#char_b;
           for ($k=1;$k<=$diff;$k++) {
              unshift (@char_b, "#") 
           }
        }
        
        for ($i=$#char_a;$i>=0;$i--) {
            #for ($j=$#char_b;$j>=0;$j--) {    
                if ($char_a[$i] eq $char_b[$i]) {
                  $r_count++;
                  $r_common = $r_common.$char_a[$i];
               } else {
                  last
               } 
            
            #}
        }
   $r_common = reverse $r_common;
   
   print "$words[$n] ~ $words[$m] -> L_COMMON: >>$l_common<< -- R_COMMON: >>$r_common<< L_COUNT: $l_count - R_COUNT: $r_count\n";
      

   if ($l_count ne $total_char)    {
      $lenght_n = length($words[$n]);
      $lenght_m = length($words[$m]);
      $diff = "";
      #print "1 -- TOTAL_CHAR: $total_char -- L_COUNT: $l_count\n";
      
      
      #CASE1: magillum ~ magilla -> l_count= 6 r_count = 0 -> um ~ a --- also ibilam ~ igilu
      if (!$r_common) {
         $xx = $total_char - $l_count;
         print "CASE1 -- TOTAL_CHAR: $total_char -- L_COUNT: $l_count -- R_COUNT IS 0 -- TOT-LEFT: $xx\n";
         
         $var1 = substr ($words[$n], $l_count);
         $var2 = substr ($words[$m], $l_count);
         $diff = $var1."~".$var2;
         $difference[$z] = "RIGHT_".$diff;
         print "CASE1 DIFFERENCE: $difference[$z]  --- Z = $z\n";
         $z++;
         $length_var1 = length ($var1);
         $length_var2 = length ($var2);
            if ($length_var1 > 2 || $length_var2 >2) {
               print "CASE1: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- LENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n";

               #chopping first and last characters from var1 and var2; at this point we know that they do not match, ex. bilam ~ gilu
               $left_var =  substr ($var1, 0, 1)."~".substr ($var2, 0, 1);
               $right_var =  substr ($var1, -1)."~".substr ($var2, -1);
               $difference[$z-1] ="LEFT_$left_var";
               $difference[$z] ="RIGHT_$right_var";
               $words[$n] = substr ($var1, 1, -1);
               $words[$m] = substr ($var2, 1, -1);
               $z++;
               foreach $d (@difference) {
                  print "-----NEW DIFFERENCE:$d\n";
               }
               
               
               goto START;
              
               }
               
            }
      }
      
      
      #CASE2: zahadin ~ sumhadin -> l_count = 0 r_count = 5
      if (!$l_common) {
         $xx = $total_char - $r_count;
         print "CASE2 -- TOTAL_CHAR: $total_char -- R_COUNT: $r_count -- TOT-LEFT: $xx\n";
         
         $var1 = substr ($words[$n], -$lenght_n, -($r_count));
         $var2 = substr ($words[$m], -$lenght_m, -($r_count));
         $diff = $var1."~".$var2;
         $difference[$z] = $diff;
         print "CASE2 DIFFERENCE: $difference[$z]  --- Z = $z\n";
         $z++;
        
         $length_var1 = length ($var1);
         $length_var2 = length ($var2);
         
         if ($length_var1 > 2 || $length_var2 >2) {
               print "CASE2: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- LENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n";

               #chopping first and last characters from var1 and var2; at this point we know that they do not match, ex. 
               $left_var =  substr ($var1, 0, 1)."~".substr ($var2, 0, 1);
               $right_var =  substr ($var1, -1)."~".substr ($var2, -1);
               $difference[$z-1] ="$left_var";
               $difference[$z] ="$right_var";
               $words[$n] = substr ($var1, 1, -1);
               $words[$m] = substr ($var2, 1, -1);
               $z++;
                foreach $d (@difference) {
                  print "-----NEW DIFFERENCE:$d\n";
                }
         
               goto START;
              
               }
               
            
         
      }

      
      
      
      
      #CASE3: ibila ~ igila -> l_count = 1 r_count = 3
      if (($r_common) && ($l_common)) {
         print "CASE3 -- TOTAL_CHAR: $total_char -- R_COUNT: $r_count -- TOT-LEFT: $xx\n";
         
         $var1 = substr ($words[$n], $l_count, ($lenght_n - $r_count - $l_count));
         $var2 = substr ($words[$m], $l_count, ($lenght_m - $r_count - $l_count));
         
         $diff = $var1."~".$var2;
         $difference[$z] = $diff;
         print "CASE3 DIFFERENCE: $difference[$z]  --- Z = $z\n";
         $z++;
         
         $length_var1 = length ($var1);
         $length_var2 = length ($var2);
       
         
       
         
           if ($length_var1 > 2 || $length_var2 >2) {
               print "CASE3: LONG SEQUENCE FOUND IN VAR1 OR VAR2 --- LENGTH_VAR1 = $length_var1 LENGTH_VAR2 = $length_var2\n";

               #chopping first and last characters from var1 and var2; at this point we know that they do not match, ex. 
               $left_var =  substr ($var1, 0, 1)."~".substr ($var2, 0, 1);
               $right_var =  substr ($var1, -1)."~".substr ($var2, -1);
               $difference[$z-1] ="$left_var";
               $difference[$z] ="$right_var";
               $words[$n] = substr ($var1, 1, -1);
               $words[$m] = substr ($var2, 1, -1);
               $z++;
               foreach $d (@difference) {
                  print "-----NEW DIFFERENCE:$d\n";
               }
         
         
               goto START;
              
               }
        
         
        
      }

       foreach $element (@difference) {
                  print "ELEMENT-->>$element<<-\n";
   }
}

My idea was to find the longest portions of the misspelled string that match the reference word at the left and right boundaries, return what does not match, and then repeat this in a loop. Is there a better approach, and above all more efficient code, or a Perl module that may help? Thanks for your suggestions!
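
The boundary-trimming idea described above can be sketched far more compactly. This is a minimal sketch, assuming a single pass over each pair; `trim_common` is a hypothetical helper name, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what is left of $ref and $word after
# removing their longest common prefix and longest common suffix.
sub trim_common {
    my ($ref, $word) = @_;
    my @x = split //, $ref;
    my @y = split //, $word;

    # Drop matching characters from the left boundary.
    while (@x && @y && $x[0] eq $y[0]) {
        shift @x;
        shift @y;
    }
    # Drop matching characters from the right boundary.
    while (@x && @y && $x[-1] eq $y[-1]) {
        pop @x;
        pop @y;
    }
    return (join('', @x), join('', @y));
}

my @words = qw(believe beleive beeliv pelief);
for my $n (1 .. $#words) {
    my ($left, $right) = trim_common($words[0], $words[$n]);
    print "0-$n: $left ~ $right\n";
}
```

For `believe`/`beleive` this leaves `ie ~ ei`. For `believe`/`pelief` nothing matches at either end, so both words come back whole, which shows why trimming the boundaries alone cannot isolate several separate mistakes in one word.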

In reply to Help finding mistakes in spellings using Perl by shamat
