Comparing 2 different-sized strings

AdrianJ217 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Comparing 2 different-sized strings by choroba (Cardinal) on Aug 08, 2013 at 15:42 UTC
Crossposted on StackOverflow. It is considered polite to mention crossposting so that people who do not follow both sites do not waste their time hacking a solution to a problem already solved at the other end of the Internet. Please use the `<code>...</code>` tags not only for code, but for data samples as well. What have you tried? Algorithm::Diff might help you. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: Comparing 2 different-sized strings by Laurent_R (Canon) on Aug 08, 2013 at 21:32 UTC
Also crossposted on the Devshed forum: http://forums.devshed.com/perl-programming-6/comparing-2-different-sized-strings-in-perl-949907.html#post2896087 Update: and also on Perl Gurus: http://perlguru.com/gforum.cgi?post=75969;#75969
Re: Comparing 2 different-sized strings by mtmcc (Hermit) on Aug 08, 2013 at 17:40 UTC
It sounds like you want to implement nucleotide local alignment. This is more complicated than it might seem. The classical method is the Smith-Waterman algorithm, but other variations exist. Your best bet is BioPerl. Have a look at Bio::Tools::dpAlign. It has some useful background information and a local alignment tool that might do what you want. Best of luck!
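For readers who have not met it, the heart of Smith-Waterman is a small dynamic-programming recurrence. What follows is only a minimal, score-only sketch with a linear gap penalty. The function name `sw_score` and the scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions, and nothing here is meant to describe how BioPerl implements local alignment.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Smith-Waterman local alignment score (score only, no traceback).
# The default scoring values are assumptions chosen for illustration.
sub sw_score {
    my ( $s, $t, %opt ) = @_;
    my $match    = $opt{match}    // 2;
    my $mismatch = $opt{mismatch} // -1;
    my $gap      = $opt{gap}      // -2;

    my @a    = split //, $s;
    my @b    = split //, $t;
    my @prev = (0) x ( @b + 1 );    # previous row of the DP matrix
    my $best = 0;

    for my $i ( 1 .. scalar @a ) {
        my @cur = (0);              # column 0 is always 0
        for my $j ( 1 .. scalar @b ) {
            my $sub  = $a[ $i - 1 ] eq $b[ $j - 1 ] ? $match : $mismatch;
            my $diag = $prev[ $j - 1 ] + $sub;    # align the two characters
            my $up   = $prev[$j] + $gap;          # gap in $t
            my $left = $cur[ $j - 1 ] + $gap;     # gap in $s
            $cur[$j] = max( 0, $diag, $up, $left );    # 0 restarts the alignment
            $best = $cur[$j] if $cur[$j] > $best;
        }
        @prev = @cur;
    }
    return $best;
}

print sw_score( 'TCGAGTGGCCATGAACGTGCCAATTG', 'ATGATCCTG' ), "\n";
```

Keeping only two rows makes the memory use proportional to the shorter dimension, at the cost of not being able to recover the alignment itself; a full implementation would keep the whole matrix and trace back from the best cell.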
Re: Comparing 2 different-sized strings by BrowserUk (Patriarch) on Aug 08, 2013 at 18:49 UTC
This will knock the spots off most every other algorithm implemented in Perl, and many of them when implemented in C: `#! perl -slw use strict; sub fuzzyMatch { my( $rHay, $rNee, $misses ) = @_; my $lNee = length $$rNee; my $min = $lNee - $misses; map { ( ( substr( $$rHay, $_, $lNee ) ^ $$rNee ) =~ tr[\0][] ) >= $min ? $_ : () } 0 .. length( $$rHay ) - $lNee; } my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG'; my $nee = 'ATGATCCTG'; print substr( $hay, $_-5, length( $nee ) + 10 ) for fuzzyMatch( \$hay, \$nee, 3 ); $hay = 'aacctgacctacgtttgacgatcgtacgtcagtcctccgtgctaactgacgtaaaaaaaatacgtcccccccc'; $nee = 'acgtacgt'; print substr( $hay, $_-5, length( $nee ) + 10 ) for fuzzyMatch( \$hay, \$nee, 3 ); __END__ C:\test>1048594 TGGCCATGAACGTGCCAAT acctgacctacgtttgac gacctacgtttgacgatc gtttgacgatcgtacgtc gacgatcgtacgtcagtc atcgtacgtcagtcctcc gtcagtcctccgtgctaa tgctaactgacgtaaaaa aactgacgtaaaaaaaat aaaaaaaatacgtccccc aaaatacgtcccccccc` The subroutine returns the offsets at which the fuzzily matched substrings are found in the primary string, one for each match. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
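The trick inside fuzzyMatch may be easier to follow in isolation. XOR-ing two equal-length strings gives a NUL byte wherever the characters agree, and `tr` with an empty replacement list merely counts those NULs. The sketch below pulls that step out into a helper; the name `count_matches` is made up for this example, and it assumes the `bitwise` feature is not enabled (with it, `^` treats its operands as numbers).

```perl
use strict;
use warnings;

# Count the positions at which two equal-length strings hold the same
# character. String XOR yields "\0" wherever the bytes are identical, and
# tr/\0// with an empty replacement list only counts them.
sub count_matches {
    my ( $x, $y ) = @_;
    die "strings must have equal length\n"
        unless length($x) == length($y);
    return ( $x ^ $y ) =~ tr/\0//;
}

my $hay    = 'TCGAGTGGCCATGAACGTGCCAATTG';
my $needle = 'ATGATCCTG';
my $window = substr( $hay, 10, length $needle );    # the window at offset 10

my $matches    = count_matches( $window, $needle );
my $mismatches = length($needle) - $matches;
print "window=$window matches=$matches mismatches=$mismatches\n";
```

fuzzyMatch applies the same comparison to every window of the haystack and keeps the offsets whose match count reaches `$min`.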
Re^2: Comparing 2 different-sized strings by AdrianJ217 (Novice) on Aug 09, 2013 at 09:22 UTC
Hi, thank you so much for the help. Can you just explain to me what the double dollar sign in front of rNee means? Thank you.
Re^3: Comparing 2 different-sized strings by BrowserUk (Patriarch) on Aug 09, 2013 at 09:40 UTC
"Can you just explain to me what the double dollar sign in front of rNee means?" It means dereference the reference. Genomic work often involves very large strings, and passing large strings into subroutines causes them to be copied when the arguments are unpacked from @_: `sub something { my( $string ) = @_; ## $string is a copy of the argument } my $hugeString = ........; something( $hugeString );` Instead of passing the arguments directly, I pass references (kind of pointers) to them: `fuzzyMatch( \$hay, \$nee, 3 ); ## pass references to needle and haystack` fuzzyMatch() then receives references to the two strings: `sub fuzzyMatch { my( $rHay, $rNee, $misses ) = @_; ## the 'r's are to remind that these are references` So to get to the actual strings, I use a second $: `my $lNee = length $$rNee; ## read as: $lengthNeedle = length of the data $, referenced by $rNee` So `$$rNee` is shorthand for `${ $rNee }`, if that clarifies things for you.
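A small self-contained sketch of the same idea, for anyone who wants to experiment. The helper names `length_via_ref` and `append_via_ref` are invented for this example and do not appear in the code above.

```perl
use strict;
use warnings;

# Read a string through a reference; only the small reference is copied.
sub length_via_ref {
    my ($rStr) = @_;
    return length $$rStr;     # $$rStr is shorthand for ${ $rStr }
}

# Modify the caller's string through the reference.
sub append_via_ref {
    my ( $rStr, $suffix ) = @_;
    ${$rStr} .= $suffix;      # changes the original, not a copy
    return;
}

my $seq = 'ACGT';
print length_via_ref( \$seq ), "\n";    # length of the original string
append_via_ref( \$seq, 'TT' );
print "$seq\n";                         # the caller's variable has changed
```

Because the subroutine works on the original data, anything it changes through the reference is visible to the caller, which is worth keeping in mind when a sub is only supposed to read the string.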
Re^4: Comparing 2 different-sized strings by AdrianJ217 (Novice) on Aug 09, 2013 at 11:46 UTC
Re^5: Comparing 2 different-sized strings by BrowserUk (Patriarch) on Aug 09, 2013 at 12:00 UTC
Re^4: Comparing 2 different-sized strings by AdrianJ217 (Novice) on Aug 10, 2013 at 19:27 UTC
Re^5: Comparing 2 different-sized strings by BrowserUk (Patriarch) on Aug 10, 2013 at 21:27 UTC
Re^2: Comparing 2 different-sized strings by AdrianJ217 (Novice) on Aug 14, 2013 at 10:08 UTC
Hi, Thank you so much for your help. When I execute the script, I noticed that if I want 3 mismatches and set $misses to 3, I also get the ones that have 2 mismatches, which makes sense since inside the subroutine it asks for >= $min. However, I want only the ones with exactly 3 mismatches and NOT the ones with 2. I tried changing it to =$min without the greater-than sign, but then it gave me an error message: `can't modify bitwise xor (^) in list assignment at rRNA_target.pl line 92, near ") }"` Any ideas what I can do?
Re^3: Comparing 2 different-sized strings by BrowserUk (Patriarch) on Aug 14, 2013 at 10:46 UTC
"I tried changing it to =$min without the greater than sign, but then it gave me an error message:" A single `=` is assignment, not comparison. You need to change `>=` to `==`; it should then work as you require.
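With that change applied, a variant that keeps only the offsets with exactly the requested number of mismatches might look like the sketch below. The name `fuzzyMatchExact` is made up for illustration, and `grep` stands in for the original `map { ... ? $_ : () }` construct; otherwise it is intended to mirror fuzzyMatch.

```perl
use strict;
use warnings;

# Return the offsets at which $$rNee matches a window of $$rHay with
# exactly $misses mismatching characters (hypothetical variant of fuzzyMatch).
sub fuzzyMatchExact {
    my ( $rHay, $rNee, $misses ) = @_;
    my $lNee = length $$rNee;
    my $want = $lNee - $misses;    # required number of matching characters
    return grep {
        ( ( substr( $$rHay, $_, $lNee ) ^ $$rNee ) =~ tr[\0][] ) == $want
    } 0 .. length($$rHay) - $lNee;
}

my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG';
my $nee = 'ATGATCCTG';
for my $k ( 2, 3 ) {
    my @offsets = fuzzyMatchExact( \$hay, \$nee, $k );
    print "exactly $k mismatches: @offsets\n";
}
```

Calling it once per mismatch count, as in the loop, also makes it easy to group the hits by how many mismatches each one has.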
Re: Comparing 2 different-sized strings by Skeeve (Parson) on Aug 08, 2013 at 17:24 UTC
You didn't show what you've tried so far. You didn't define how many mismatches are permitted. You didn't tell us what a "nucleotide" is. You didn't define what to do if there are fewer than 5 nucleotides on either side. You expect us to work for free? So, to "solve" your problem in the easiest way that matches your description: `my $s1='TCGAGTGGCCATGAACGTGCCAATTG'; my $s2='ATGATCCTG'; ($s1,$s2) = ($s2,$s1) if length($s1) > length($s2); my $len= length $s1; print "Matches:\n"; for (my $i=0; $i+$len+10 <= length $s2; ++$i) { print substr($s2, $i, $len+10),"\n"; }` prints: `Matches: TCGAGTGGCCATGAACGTG CGAGTGGCCATGAACGTGC GAGTGGCCATGAACGTGCC AGTGGCCATGAACGTGCCA GTGGCCATGAACGTGCCAA TGGCCATGAACGTGCCAAT GGCCATGAACGTGCCAATT GCCATGAACGTGCCAATTG` I didn't bother to print the number of mismatches. In most cases it will be length($s1) ;) > `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Comparing 2 different-sized strings by roboticus (Chancellor) on Aug 08, 2013 at 17:11 UTC
AdrianJ217: Something like this should work: `my $s1='TCGAGTGGCCATGAACGTGCCAATTG'; my $s2='ATGATCCTG'; ($s1,$s2) = ($s2,$s1) if length($s1) > length($s2); my $loc = index $s2, $s1; if ($loc<0) { print "String not found!\n"; } else { print "String found at location $loc\n"; }` ...roboticus When your only tool is a hammer, all problems look like your thumb.
Re^2: Comparing 2 different-sized strings by Laurent_R (Canon) on Aug 08, 2013 at 17:16 UTC
The OP wants to allow for some mismatches. The index function is not suitable for that.