Re: Comparing 2 different-sized strings
by choroba (Cardinal) on Aug 08, 2013 at 15:42 UTC
|
Crossposted on StackOverflow. It is considered polite to mention crossposting, so that people who do not follow both sites do not waste their time hacking a solution to a problem already solved at the other end of the Internet.
Please use the <code>...</code> tags not only for code, but for data samples as well. What have you tried? Algorithm::Diff might help you.
Re: Comparing 2 different-sized strings
by mtmcc (Hermit) on Aug 08, 2013 at 17:40 UTC
|
It sounds like you want to implement nucleotide local alignment. This is more complicated than it might seem. The classical method is the Smith-Waterman algorithm, but other variations exist. Your best bet is BioPerl. Have a look at Bio::Tools::dpAlign, which has some useful background information and a local alignment tool that might do what you want.
Best of luck! | [reply] |
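For readers who want to see the idea behind Smith-Waterman before reaching for BioPerl, here is a minimal sketch that computes only the best local alignment score (no traceback). It is not BioPerl's implementation; the sub name sw_score and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Smith-Waterman local alignment, score only (no traceback).
# Scoring defaults are illustrative assumptions, not values from this thread.
sub sw_score {
    my ( $s, $t, $match, $mismatch, $gap ) = @_;
    $match    //= 2;
    $mismatch //= -1;
    $gap      //= -2;

    my @prev = (0) x ( length($t) + 1 );    # previous row of the DP matrix
    my $best = 0;
    for my $i ( 1 .. length $s ) {
        my @cur = (0);                      # first column is always 0
        for my $j ( 1 .. length $t ) {
            my $same = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 );
            my $diag = $prev[ $j - 1 ] + ( $same ? $match : $mismatch );
            my $up   = $prev[$j] + $gap;
            my $left = $cur[ $j - 1 ] + $gap;

            # Local alignment: a cell never drops below zero.
            my $cell = 0;
            for ( $diag, $up, $left ) { $cell = $_ if $_ > $cell }

            $cur[$j] = $cell;
            $best = $cell if $cell > $best;
        }
        @prev = @cur;
    }
    return $best;
}

print sw_score( 'TCGAGTGGCCATGAACGTGCCAATTG', 'ATGATCCTG' ), "\n";
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the aligned substrings would require storing the full matrix and tracing back from the best cell.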
Re: Comparing 2 different-sized strings
by BrowserUk (Patriarch) on Aug 08, 2013 at 18:49 UTC
|
This will knock the spots off most every other algorithm implemented in perl and many of them when implemented in C:
#! perl -slw
use strict;

sub fuzzyMatch {
    my( $rHay, $rNee, $misses ) = @_;
    my $lNee = length $$rNee;
    my $min = $lNee - $misses;
    map {
        (
            ( substr( $$rHay, $_, $lNee ) ^ $$rNee
        ) =~ tr[\0][] ) >= $min ? $_ : ()
    } 0 .. length( $$rHay ) - $lNee;
}

my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG';
my $nee = 'ATGATCCTG';
print substr( $hay, $_-5, length( $nee ) + 10 ) for fuzzyMatch( \$hay, \$nee, 3 );

$hay = 'aacctgacctacgtttgacgatcgtacgtcagtcctccgtgctaactgacgtaaaaaaaatacgtcccccccc';
$nee = 'acgtacgt';
print substr( $hay, $_-5, length( $nee ) + 10 ) for fuzzyMatch( \$hay, \$nee, 3 );
__END__
C:\test>1048594
TGGCCATGAACGTGCCAAT
acctgacctacgtttgac
gacctacgtttgacgatc
gtttgacgatcgtacgtc
gacgatcgtacgtcagtc
atcgtacgtcagtcctcc
gtcagtcctccgtgctaa
tgctaactgacgtaaaaa
aactgacgtaaaaaaaat
aaaaaaaatacgtccccc
aaaatacgtcccccccc
The subroutine returns the offsets at which the fuzzily matched substrings are found in the primary string, one for each match.
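The core of fuzzyMatch() is the string XOR followed by tr. As a small sketch of that one step: XOR-ing two equal-length strings gives a NUL byte ("\0") wherever the characters are identical, and tr with an empty replacement list counts those bytes. The helper name count_matches is hypothetical; the window below is the 9 characters at offset 10 of the first haystack.

```perl
use strict;
use warnings;

# Count the positions at which two equal-length strings agree.
# String XOR yields "\0" for identical bytes; tr/\0// counts them
# without modifying the string.
sub count_matches {
    my ( $x, $y ) = @_;
    die "strings must have equal length\n" unless length($x) == length($y);
    my $xor = $x ^ $y;
    return $xor =~ tr/\0//;
}

my $window = 'ATGAACGTG';    # substr( $hay, 10, 9 ) from the first example
my $needle = 'ATGATCCTG';
my $same   = count_matches( $window, $needle );
printf "%d of %d positions match, %d mismatches\n",
    $same, length($needle), length($needle) - $same;
```

This prints "7 of 9 positions match, 2 mismatches", which is why offset 10 is reported when up to 3 misses are allowed.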
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
Hi, thank you so much for the help. Can you just explain to me what the double dollar sign in front of rNee means?
Thank you.
| [reply] |
|
|
sub something {
my( $string ) = @_;
## $string is a copy of the argument
}
my $hugeString = ........;
something( $hugeString );
Instead of passing the arguments directly, I pass references (kind of pointers) to them:
fuzzyMatch( \$hay, \$nee, 3 ); ## pass references to needle and haystack
Within fuzzyMatch(), it receives references to the two strings:
sub fuzzyMatch {
    my( $rHay, $rNee, $misses ) = @_; ## the 'r's are to remind that these are references
So to get to the actual strings, I use a second $:
my $lNee = length $$rNee; ## read as: $lengthNeedle = length of the data ($) referenced by $rNee
So, $$rNee is shorthand for ${ $rNee }; if that clarifies things for you?
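A self-contained sketch of the same idea, in case it helps to run something: one sub reads a string through a reference and another modifies the caller's string through it. The sub names length_via_ref and append_via_ref are made up for this example.

```perl
use strict;
use warnings;

# Receives a reference, so the (possibly huge) string is not copied.
sub length_via_ref {
    my ($rStr) = @_;
    return length $$rStr;    # same as length ${ $rStr }
}

# Changes the caller's string through the reference.
sub append_via_ref {
    my ( $rStr, $suffix ) = @_;
    $$rStr .= $suffix;
    return;
}

my $dna = 'ACGT';
print length_via_ref( \$dna ), "\n";    # length of the original string
append_via_ref( \$dna, 'AC' );
print "$dna\n";                         # the caller's variable was changed
```

The second call shows the other consequence of passing a reference: the sub works on the original data, not on a copy.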
| [reply] [d/l] [select] |
Hi,
Thank you so much for your help. When I execute the script with $misses set to 3, I also get the matches that have 2 mismatches, which makes sense, since inside the subroutine it asks for >= $min. However, I want only the matches with exactly 3 mismatches and NOT the ones with 2. I tried changing it to =$min, without the greater-than sign, but then it gave me an error message:
can't modify bitwise xor (^) in list assignment at rRNA_target.pl line 92, near
")
}"
Any ideas what I can do?
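A likely explanation, offered as a sketch rather than an answer from the thread: in Perl a single = is assignment, so =$min tries to assign to the result of the XOR expression, which is what the error complains about. Numeric equality is ==. The variant below keeps only offsets with exactly $misses mismatches; the name fuzzyMatchExact is hypothetical, and grep with a temporary variable is used in place of the original map for readability.

```perl
use strict;
use warnings;

# Like fuzzyMatch(), but returns only the offsets where the number of
# mismatches is exactly $misses (note ==, not =).
sub fuzzyMatchExact {
    my ( $rHay, $rNee, $misses ) = @_;
    my $lNee = length $$rNee;
    my $want = $lNee - $misses;    # required number of matching positions
    return grep {
        my $xor = substr( $$rHay, $_, $lNee ) ^ $$rNee;
        ( $xor =~ tr/\0// ) == $want;
    } 0 .. length($$rHay) - $lNee;
}

my $hay = 'TCGAGTGGCCATGAACGTGCCAATTG';
my $nee = 'ATGATCCTG';
print "$_\n" for fuzzyMatchExact( \$hay, \$nee, 2 );
```

To get a range instead, for example 2 or 3 mismatches but not fewer, the comparison could be replaced by two tests on the same count.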
Re: Comparing 2 different-sized strings
by Skeeve (Parson) on Aug 08, 2013 at 17:24 UTC
|
- You didn't show what you've tried so far
- You didn't define how many mismatches are permitted
- You didn't tell what's a "nucleotid"
- You didn't define what to do if there are fewer than 5 nucleotides on either side
- You expect us to work for free?
So to "solve" your problem in the easiest way that matches your description:
my $s1='TCGAGTGGCCATGAACGTGCCAATTG';
my $s2='ATGATCCTG';
($s1,$s2) = ($s2,$s1) if length($s1) > length($s2);
my $len= length $s1;
print "Matches:\n";
for (my $i=0; $i+$len+10 <= length $s2; ++$i) {
print substr($s2, $i, $len+10),"\n";
}
prints:
Matches:
TCGAGTGGCCATGAACGTG
CGAGTGGCCATGAACGTGC
GAGTGGCCATGAACGTGCC
AGTGGCCATGAACGTGCCA
GTGGCCATGAACGTGCCAA
TGGCCATGAACGTGCCAAT
GGCCATGAACGTGCCAATT
GCCATGAACGTGCCAATTG
I didn't bother to put the number of mismatches. In most cases it will be length($s1) ;)
>
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
| [reply] [d/l] [select] |
Re: Comparing 2 different-sized strings
by roboticus (Chancellor) on Aug 08, 2013 at 17:11 UTC
|
my $s1='TCGAGTGGCCATGAACGTGCCAATTG';
my $s2='ATGATCCTG';
($s1,$s2) = ($s2,$s1) if length($s1) > length($s2);
my $loc = index $s2, $s1;
if ($loc<0) {
print "String not found!\n";
}
else {
print "String found at location $loc\n";
}
...roboticus
When your only tool is a hammer, all problems look like your thumb. | [reply] [d/l] |