AdrianJ217 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm kind of new to Perl and I am comparing 2 strings of different sizes containing DNA nucleotides. I want the script to take the smaller string and locate it in the much larger string, allowing for mismatches, and give me the sequence it found in the larger string plus the 5 adjacent nucleotides on either side. For example, if I have 2 strings, #1 ATGATCCTG and #2 TCGAGTGGCCATGAACGTGCCAATTG, I want the script to take #1 and find the same sequence in #2, where it is present but with 2 mismatches, along with 5 nucleotides on either side. Thank you so much!

Re: Comparing 2 different-sized strings
by choroba (Cardinal) on Aug 08, 2013 at 15:42 UTC
    Crossposted on StackOverflow. It is considered polite to inform about crossposting so people not attending both sites do not waste their time hacking a solution to a problem already solved at the other end of the Internet.

    Please use the <code>...</code> tags not only for code, but for data samples as well. What have you tried? Algorithm::Diff might help you.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Comparing 2 different-sized strings
by mtmcc (Hermit) on Aug 08, 2013 at 17:40 UTC
    It sounds like you want to implement nucleotide local alignment. This is more complicated than it might seem. The classical method is the Smith-Waterman algorithm, but other variations exist. Your best bet is BioPerl. Have a look at Bio::Tools::dpAlign. Its documentation has some useful background information, and it includes a local alignment tool that might do what you want.

    Best of luck!

Re: Comparing 2 different-sized strings
by BrowserUk (Patriarch) on Aug 08, 2013 at 18:49 UTC

    This will knock the spots off most every other algorithm implemented in Perl, and many of them when implemented in C:

    #! perl -slw
    use strict;

    sub fuzzyMatch {
        my( $rHay, $rNee, $misses ) = @_;
        my $lNee = length $$rNee;
        my $min = $lNee - $misses;
        map {
            ( ( substr( $$rHay, $_, $lNee ) ^ $$rNee ) =~ tr[\0][] ) >= $min ? $_ : ()
        } 0 .. length( $$rHay ) - $lNee;
    }

    my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG';
    my $nee = 'ATGATCCTG';
    print substr( $hay, $_-5, length( $nee ) + 10 ) for fuzzyMatch( \$hay, \$nee, 3 );

    $hay = 'aacctgacctacgtttgacgatcgtacgtcagtcctccgtgctaactgacgtaaaaaaaatacgtcccccccc';
    $nee = 'acgtacgt';
    print substr( $hay, $_-5, length( $nee ) + 10 ) for fuzzyMatch( \$hay, \$nee, 3 );

    __END__
    C:\test>1048594
    TGGCCATGAACGTGCCAAT
    acctgacctacgtttgac
    gacctacgtttgacgatc
    gtttgacgatcgtacgtc
    gacgatcgtacgtcagtc
    atcgtacgtcagtcctcc
    gtcagtcctccgtgctaa
    tgctaactgacgtaaaaa
    aactgacgtaaaaaaaat
    aaaaaaaatacgtccccc
    aaaatacgtcccccccc

    The subroutine returns the offset where the fuzzily matched substrings are found in the primary; one for each match.
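The core of the subroutine above is the string XOR followed by tr. As a minimal sketch of that one step (the variable names $window, $xor, $matches and $mismatches are illustrative and not part of the original post): XORing two equal-length strings yields a NUL byte wherever the characters agree, and tr/\0// counts those bytes without modifying anything. It assumes the default string behaviour of ^, that is, the 'bitwise' feature is not enabled.

```perl
use strict;
use warnings;

# XOR two equal-length strings byte by byte: identical characters
# produce "\0", differing characters produce a non-zero byte.
my $window = 'ATGAACGTG';    # substring taken from the haystack at offset 10
my $needle = 'ATGATCCTG';    # the short query sequence

my $xor        = $window ^ $needle;
my $matches    = ( $xor =~ tr/\0// );           # count the NUL bytes
my $mismatches = length( $needle ) - $matches;

print "matches=$matches mismatches=$mismatches\n";
```

For the sample data this reports 7 matching positions and 2 mismatches, which agrees with the 2 mismatches mentioned in the question.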


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi, thank you so much for the help. Can you just explain to me what the double dollar sign in front of rNee means? Thank you.
        Can you just explain to me what the double dollar sign in front of rNee means?

        It means dereference the reference.

        Genomic work often involves very large strings, and passing large strings into subroutines causes them to be copied:

        sub something {
            my( $string ) = @_;    ## $string is a copy of the argument
        }

        my $hugeString = ........;
        something( $hugeString );

        Instead of passing the arguments directly, I pass references (kind of pointers) to them:

        fuzzyMatch( \$hay, \$nee, 3 );    ## pass references to needle and haystack

        Within fuzzyMatch(), it receives references to the two strings:

        sub fuzzyMatch {
            my( $rHay, $rNee, $misses ) = @_;    ## the 'r's are to remind that these are references

        So to get to the actual strings, I use a second $:

        my $lNee = length $$rNee;    ## read as: $lengthNeedle = length of the data $, referenced by $rNee

        So, $$rNee is shorthand for ${ $rNee }; if that clarifies things for you?
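A small self-contained sketch of the same idea, using made-up names ($seq, $ref) and a short stand-in string rather than real genomic data: taking a reference does not copy the string, both dereference spellings reach the same data, and writing through the reference changes the original variable.

```perl
use strict;
use warnings;

my $seq = 'ACGT' x 4;    # stand-in for a very large string
my $ref = \$seq;         # reference to $seq; the data itself is not copied

# $$ref and ${ $ref } are two spellings of the same dereference.
my $len_short = length $$ref;
my $len_long  = length ${ $ref };

# Writing through the reference changes the original variable.
$$ref .= 'N';

print "$len_short $len_long $seq\n";
```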


      Hi, thank you so much for your help. When I execute the script, I noticed that if I want 3 mismatches and set $misses to 3, I also get the ones that have 2 mismatches, which makes sense since inside the subroutine it asks for >= $min. However, if I want only 3 mismatches and NOT the ones with 2, I tried changing it to =$min without the greater-than sign, but then it gave me an error message:
      can't modify bitwise xor (^) in list assignment at rRNA_target.pl line 92, near ") }"
      Any ideas what I can do?
        I tried changing it to =$min without the greater than sign, but then it gave me an error message:

        A single = is assignment not comparison. You need to change >= to ==; it should then work as you require.
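As a sketch of that change, here is a hypothetical variant of fuzzyMatch() (the name exactMisses is invented for illustration, and grep is used in place of map with a ternary, which should be equivalent here) that keeps only offsets with exactly $misses mismatches:

```perl
use strict;
use warnings;

# Hypothetical variant of fuzzyMatch(): keeps only the offsets whose
# window differs from the needle in exactly $misses positions.
sub exactMisses {
    my( $rHay, $rNee, $misses ) = @_;
    my $lNee = length $$rNee;
    my $min  = $lNee - $misses;    # number of positions that must match
    return grep {
        ( ( substr( $$rHay, $_, $lNee ) ^ $$rNee ) =~ tr[\0][] ) == $min
    } 0 .. length( $$rHay ) - $lNee;
}

my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG';
my $nee = 'ATGATCCTG';

my @two   = exactMisses( \$hay, \$nee, 2 );
my @three = exactMisses( \$hay, \$nee, 3 );
print "exactly 2 mismatches at: @two\n";
print "exactly 3 mismatches at: @three\n";
```

With the sample strings from the question, the only window within 3 mismatches sits at offset 10 and has 2 of them, so the "exactly 3" list comes back empty.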


Re: Comparing 2 different-sized strings
by Skeeve (Parson) on Aug 08, 2013 at 17:24 UTC
    1. You didn't show what you've tried so far
    2. You didn't define how many mismatches are permitted
    3. You didn't tell what's a "nucleotide"
    4. You didn't define what to do if there are fewer than 5 nucleotides on either side
    5. You expect us to work for free?

    So to "solve" your problem in the easiest way that matches your description:

    my $s1='TCGAGTGGCCATGAACGTGCCAATTG';
    my $s2='ATGATCCTG';
    ($s1,$s2) = ($s2,$s1) if length($s1) > length($s2);
    my $len= length $s1;
    print "Matches:\n";
    for (my $i=0; $i+$len+10 <= length $s2; ++$i) {
        print substr($s2, $i, $len+10),"\n";
    }

    prints:

    Matches:
    TCGAGTGGCCATGAACGTG
    CGAGTGGCCATGAACGTGC
    GAGTGGCCATGAACGTGCC
    AGTGGCCATGAACGTGCCA
    GTGGCCATGAACGTGCCAA
    TGGCCATGAACGTGCCAAT
    GGCCATGAACGTGCCAATT
    GCCATGAACGTGCCAATTG

    I didn't bother to put the number of mismatches. In most cases it will be length($s1) ;)


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Comparing 2 different-sized strings
by roboticus (Chancellor) on Aug 08, 2013 at 17:11 UTC

    AdrianJ217:

    Something like this should work:

    my $s1='TCGAGTGGCCATGAACGTGCCAATTG';
    my $s2='ATGATCCTG';
    ($s1,$s2) = ($s2,$s1) if length($s1) > length($s2);
    my $loc = index $s2, $s1;
    if ($loc<0) {
        print "String not found!\n";
    }
    else {
        print "String found at location $loc\n";
    }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      The OP wants to allow for some mismatches. The index function is not suitable for that.
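To illustrate the difference, here is a rough sketch, not taken from any reply in this thread, that contrasts index() with a plain position-by-position mismatch count. The helper name mismatches, the tolerance of 2 and the flank width of 5 are assumptions based on the question. The start of the flank is clamped to 0, because a negative substr offset would otherwise count from the end of the string.

```perl
use strict;
use warnings;

my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG';
my $nee = 'ATGATCCTG';

# index() only finds exact copies, so it reports -1 here.
my $exact = index $hay, $nee;

# Hypothetical helper: number of positions at which two
# equal-length strings differ (the Hamming distance).
sub mismatches {
    my( $x, $y ) = @_;
    my $count = 0;
    for my $i ( 0 .. length( $x ) - 1 ) {
        $count++ if substr( $x, $i, 1 ) ne substr( $y, $i, 1 );
    }
    return $count;
}

my $allowed = 2;    # assumed tolerance, taken from the question
my $flank   = 5;    # nucleotides of context requested on either side
my @hits;
for my $pos ( 0 .. length( $hay ) - length( $nee ) ) {
    next if mismatches( substr( $hay, $pos, length $nee ), $nee ) > $allowed;
    # Clamp the start so a hit near the beginning does not wrap around
    # to the end of the string.
    my $start = $pos < $flank ? 0 : $pos - $flank;
    my $end   = $pos + length( $nee ) + $flank;
    push @hits, substr( $hay, $start, $end - $start );
}

print "index: $exact\n";
print "$_\n" for @hits;
```

A hit near the end of the larger string simply comes back shorter, since substr truncates at the end of the data.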