finding substrings that have been inserted into a string

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: finding substrings that have been inserted into a string by merlyn (Sage) on Sep 17, 2006 at 17:07 UTC
What you're looking for is the "edit distance", and there are many modules in the CPAN that implement that in different ways. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply]
Re^2: finding substrings that have been inserted into a string by Corion (Patriarch) on Sep 17, 2006 at 17:57 UTC
Also of interest is the Longest Common Subsequence, as implemented by Algorithm::Diff.	[reply]
Re: finding substrings that have been inserted into a string by Skeeve (Parson) on Sep 17, 2006 at 19:16 UTC
I don't know why you have a second output there, maybe I misunderstood something. Nevertheless: here is my attempt which assumes (= doesn't verify) that both strings are identical except for the number of dashes. `#!/usr/bin/perl use strict; use warnings; my $oldstring = 'ATTGC---AGTCCATGC------ATGC'; my $newstring = 'AT-TGC---AGTCCATGC--------ATGC'; my @oldstring= ($oldstring=~ /(-)(?:A\|C\|G\|T\|$)/g); my @newstring= ($newstring=~ /(-)(?:A\|C\|G\|T\|$)/g); my $pos= 0; for (my $i= 0; $i < @oldstring; ++$i) { if ( $oldstring[$i] ne $newstring[$i]) { print length($newstring[$i]) - length($oldstring[$i])," at pos +ition ",$pos,'-',$pos+length($oldstring[$i]),"\n"; } $pos+= length($oldstring[$i])+1; }` [download] Result `1 at position 2-2` `2 at position 17-23` It works by collecting all groups of dashes in front of any letter or the line end. Groups may heave length 0. Then both results are compared. `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re: finding substrings that have been inserted into a string by Anonymous Monk on Sep 17, 2006 at 17:40 UTC
OK, here's my first attempt at a solution #my $old_seq='ATGC---ACGT---TAGCAAGGTAAAT'; #my $new_seq='----AT-GC----ACGT----TAGCAA----------GGTAAAT---'; print "old seq is\n$old_seq\nnew seq is\n$new_seq\n"; my @old_array = split //, lc ($old_seq); my @new_array = split //, lc ($new_seq); my $n_old=0; my $n_new=0; my %gaps; while (my $old_base = shift @old_array){ my $new_base = shift @new_array; #print "base in oldseq is $old_base, base in newseq is $new_base\n"; if ($old_base eq $new_base){ # print "match!\n"; $n_old++; $n_new++; next; } else{ # print "no match! - must be a new gap at position $n_new new, $n_o +ld old\n"; my $new_gap_length=0; while ($new_base = shift @new_array){ $n_new++; $new_gap_length++; # print "newbase is $new_base\n"; if ($new_base eq $old_base){ # print "found it - length was $new_gap_length\n"; $gaps{$n_new-$new_gap_length} = $new_gap_length; $n_old++; $n_new++; last; } } } } foreach (sort {$a <=> $b} keys %gaps){ print "gap at position in new $_, length $gaps{$_}\n"; } [download] It always reports the new gap as being at the end of the existing gap in the cases where the new gap's position can't be unambiguously decided. Anyone spot any flaws in this?	[reply] [d/l]
Re^2: finding substrings that have been inserted into a string by graff (Chancellor) on Sep 17, 2006 at 22:05 UTC
Anyone spot any flaws in this? Well, is it a flaw that you don't report the final "---" at the end of the "new_string"? Your while loop is based on the length of the "old_string", and when that's done, you're not checking whether there's anything left in the new_string beyond the last match. It would be an easy thing to add an element to %gaps after the while loop, if the current value of $n_new is less than $#new_array.	[reply]