Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks; I have a string (of DNA) consisting of characters ATGC and gaps (-). I have a process that inserts additional runs of gaps into the string. I need to find the position and lengths of the new gaps. code:

my $oldstring = 'ATTGC---AGTCCATGC------ATGC' my $newstring = 'AT-TGC---AGTCCATGC------ATGC' desired output: gap of length 1 at position 3 gap of length 1 at somewhere between position 20 and 27
I'm going to take a stab at a solution after I post this, but it feels like it should be a common(ish) problem. Any ideas?

Replies are listed 'Best First'.
Re: finding substrings that have been inserted into a string
by merlyn (Sage) on Sep 17, 2006 at 17:07 UTC
Re: finding substrings that have been inserted into a string
by Skeeve (Parson) on Sep 17, 2006 at 19:16 UTC

    I don't know why you have a second output there, maybe I misunderstood something.

    Nevertheless: here is my attempt which assumes (= doesn't verify) that both strings are identical except for the number of dashes.

    #!/usr/bin/perl use strict; use warnings; my $oldstring = 'ATTGC---AGTCCATGC------ATGC'; my $newstring = 'AT-TGC---AGTCCATGC--------ATGC'; my @oldstring= ($oldstring=~ /(-*)(?:A|C|G|T|$)/g); my @newstring= ($newstring=~ /(-*)(?:A|C|G|T|$)/g); my $pos= 0; for (my $i= 0; $i < @oldstring; ++$i) { if ( $oldstring[$i] ne $newstring[$i]) { print length($newstring[$i]) - length($oldstring[$i])," at pos +ition ",$pos,'-',$pos+length($oldstring[$i]),"\n"; } $pos+= length($oldstring[$i])+1; }

    Result
    1 at position 2-2
    2 at position 17-23

    It works by collecting all groups of dashes in front of any letter or the line end. Groups may heave length 0. Then both results are compared.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: finding substrings that have been inserted into a string
by Anonymous Monk on Sep 17, 2006 at 17:40 UTC
    OK, here's my first attempt at a solution

    #my $old_seq='ATGC---ACGT---TAGCAAGGTAAAT'; #my $new_seq='----AT-GC----ACGT----TAGCAA----------GGTAAAT---'; print "old seq is\n$old_seq\nnew seq is\n$new_seq\n"; my @old_array = split //, lc ($old_seq); my @new_array = split //, lc ($new_seq); my $n_old=0; my $n_new=0; my %gaps; while (my $old_base = shift @old_array){ my $new_base = shift @new_array; #print "base in oldseq is $old_base, base in newseq is $new_base\n"; if ($old_base eq $new_base){ # print "match!\n"; $n_old++; $n_new++; next; } else{ # print "no match! - must be a new gap at position $n_new new, $n_o +ld old\n"; my $new_gap_length=0; while ($new_base = shift @new_array){ $n_new++; $new_gap_length++; # print "newbase is $new_base\n"; if ($new_base eq $old_base){ # print "found it - length was $new_gap_length\n"; $gaps{$n_new-$new_gap_length} = $new_gap_length; $n_old++; $n_new++; last; } } } } foreach (sort {$a <=> $b} keys %gaps){ print "gap at position in new $_, length $gaps{$_}\n"; }
    It always reports the new gap as being at the end of the existing gap in the cases where the new gap's position can't be unambiguously decided. Anyone spot any flaws in this?
      Anyone spot any flaws in this?

      Well, is it a flaw that you don't report the final "---" at the end of the "new_string"? Your while loop is based on the length of the "old_string", and when that's done, you're not checking whether there's anything left in the new_string beyond the last match. It would be an easy thing to add an element to %gaps after the while loop, if the current value of $n_new is less than $#new_array.