Re: How do you match a stretch of at least N characters
by choroba (Cardinal) on Sep 11, 2017 at 14:02 UTC
|
Iterate over the possible positions in the longer string. Use XOR to get a string that contains zero bytes in positions where two strings are the same. Then just use tr to count the number of zeroes in substrings of length 10, report the positions if the number is 9 or 10.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $small_str = 'AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATT
+TCAAGAAT'
. 'GCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAATATACTGACAGTGTGCAGTGAC
+TCACAAAA'
. 'GATGATTA';
my $reference_str = 'ACAATGAGATCACATGGACACAGGAAGGGGAATATCACACTCTGGGGAC
+TGTGGTGG'
. 'GGTCGGGGGAGGGGGGAGGGATAGCATTGGGAGATATACCTAATGCTAGATGACGTCCATACT
+GAGAATCA'
. 'TGTTAACATTAGTGGGTGCAGCGCACAAGCATGGCACATGTATACATATGTAACTAACCTGCA
+CAATGTGC'
. 'ACATGTACCCTAAAACTTAGAGTATAATAAAAAAAAAAAAAAAAAAAAAAAAAAACACATTAA
+AAAAAAAA'
. 'AAAACAACAAAACAAAGCAAACATGGAAATGTTTGTTATTTTAATTGTTATGATGGTTTCATG
+GCTGTTTG'
. 'CATGTGTCAAAACTCATCAAATTTGTGTACGTTAAATATGTGAAACTTATTGTATGCTGGTTA
+CACCTCAA'
. 'TAAAGCTGTTAAATTTAAAAAAAAAAAAAAAAAAAAAAATCACCAATAGTTGCTGCTAGAAAT
+CCAGTGTC'
. 'ACAAAAGGCCAAAGTTTATTGACAAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAG
+AGAGAGCA'
. 'ATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAATCTAGCTGCA
+TCAGGATC'
. 'ATATCGTCGGGTCTTTTTTCCGGCTCAGTCATCGCCCAAGCTGGCGCTATCTGGGCATCGGGG
+AGGAAGAA'
. 'GCCCGTGCCTTTTCCCGCGAGGTTGAAGCGGCATGGAAAGAGTTTGCCGAGGATGACTGCTGC
+TGCATTGA'
. 'CGTTGAGCGAAAACGCACGTTTACCATGATGATTCGGGAAGGTGTGGCCATGCACGCCTTTAA
+CGGTGAAC'
. 'TGTTCGTTCAGGCCACCTGGGATACCAGTTCGTCGCGGCTTTTCCGGACACAGTTCCGGATGG
+TCAGCCCG'
. 'AAGCGCATCAGCAACCCGAACAATACCGGCGACAGCCGGAACTGCCGTGCCGGTGTGCAGATT
+AATGACAG'
. 'CGGTGCGGCGCTGGGATATTACGTCAGCGAGGACGGGTATCCTGGCTGGATGCCGCAGAAATG
+GACATGGA'
. 'TACCCCGTGAGTTACCCGGCGGGCGCGCTTGGCGTAATCATGGTCATAGCTGTTTCCTGTGTG
+AAATTGTT'
. 'ATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCT
+AATGAGTG'
. 'AGCTAACTCACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGC
+CAGCTGCA'
. 'TTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTCCTC
+GCTCACTG'
. 'ACTCGCTGCGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTCAAAGGCGGTAATAC
+GGTTATCC'
. 'ACAGAATCAGGGGATAACGCAGGAAAGAACATGTGAGCAAAAGGCCAGCAAAAGGCCAGGAAC
+CGTAAAAA'
. 'GGCCGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACG
+CTCAAGTC'
. 'AGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
+TGCGCTCT'
. 'CCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCG
+CTTTCTCA'
. 'TAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCA
+CGAACCCC'
. 'CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGAC
+ACGACTTA'
. 'TCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACA
+GAGTTCTT'
. 'GAAGTGGTGGCCTAACTACGGCTACACTAGAAGGACAGTATTTGGTATCTGCGCTCTGCTGAA
+GCCAGTTA'
. 'CCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTT
+TTTTTGTT'
. 'TGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAAGATCCTTTGATCTTTTCTACG
+GGGTCTGA'
. 'CGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTT
+CACCTAGA'
. 'TCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTG
+ACAGTTAC'
. 'CAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCC
+TGACTCCC'
. 'CGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACC
+GCGAGACC'
. 'CACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAA
+GTGGTCCT'
. 'GCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCG
+CCAGTTAA'
. 'TAGTTTGCGCAACGTTGTTGCCATTGCTACAGGCATCGTGGTGTCACGCTCGTCGTTTGGTAT
+GGCTTCAT'
. 'TCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGG
+TTAGCTCC'
. 'TTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCA
+GCACTGCA'
. 'TAATTCTCTTACTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAA
+GTCATTCT'
. 'GAGAATAGTGTATGCGGCGACCGAGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGC
+CACATAGC'
. 'AGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTA
+CCGCTGTT'
. 'GAGATCCAGTTCGATGTAACCCACTCGTGCACCCAACTGATCTTCAGCATCTTTTACTTTCAC
+CAGCGTTT'
. 'CTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAAAAGGGCGACACGGAAATG
+TTGAATAC'
. 'TCAT';
for my $start_pos (0 .. length($reference_str) - 11) {
my $xor = $small_str ^ substr $reference_str, $start_pos, length($
+reference_str) - 1;
for my $inner_pos (0 .. length($xor) - 10) {
if (9 <= substr($xor, $inner_pos, 10) =~ tr/\0//) {
say $start_pos + $inner_pos;
say substr $small_str, $inner_pos, 10;
say substr $reference_str, $start_pos + $inner_pos, 10;
}
}
}
TODO: Handle matches where the two strings don't fully overlap.
Update Probably like this? It wraps the longer string in characters that can never match, so some simple math is needed to get the original position back:
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] [select] |
|
|
Choroba, thank you very much for the code. I am bit confused, is the second part an addition to the previous one? Because - I think - I get only exact matches with the second part, or not?
| [reply] |
|
|
No, it's a replacement. It reports 107 matches for me, 92 of them are exact.
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] |
|
|
|
|
|
Re: How do you match a stretch of at least N characters
by hippo (Archbishop) on Sep 11, 2017 at 14:14 UTC
|
Let's see here. Your small string has 140 characters and your reference string has 3114. If you were just comparing exactly 10-character substrings that's 131x3105 which is 406755 comparisons. If you then make that >10 character substrings it becomes a lot heavier. eg. for 11, it's 130x3104 which is 403520 to be added on to your original 406755. Build that all the way up to 140 and it comes out as 26,471,170. Then you have the possibility that any one of those characters could be a non-match, so for a substring of length n that's n more matches. That gives a total by my calculations of 1,403,490,770 - and this is just the comparisons, nevermind all the slicing and dicing to achieve them.
Does your task have significant time constraints? If so, you'll have to start looking into some algorithm references to be a lot smarter about this, I fear.
Would I , for instance, start like this? if($reference_str=~/$small_str{10,}/)
No, that would be 10 or more repititions of the entirety of
$small_str with 10 or more instances of its last character at the end. (thanks haukex for pointing out my initial error).
| [reply] [d/l] [select] |
Re: How do you match a stretch of at least N characters
by Eily (Monsignor) on Sep 11, 2017 at 14:13 UTC
|
One way you could to it is use the ^ operator. When applied on strings, it will output 0 wherever the two are equal, and something else wherever they are different. That way searching for a series of identical characters becomes searching for a series of 0.
for my $position (0..(length($reference_str)-length($small_str)))
{
my $processed_str = $small_str ^ substr($reference_str, $position, l
+ength($small_str));
next unless $processed_str =~ m<(
\0{9}
| \0{1} . \0{8,}
| \0{2} . \0{7,}
| \0{3} . \0{6,}
| \0{4} . \0{5,}
| \0{5} . \0{4,}
| \0{6} . \0{3,}
| \0{7} . \0{2,}
| \0{8} . \0{1,}
)>xms;
print("Match found! ", substr($reference_str, $position+pos($proces
+sed_str),length($1))) and last;
}
Untested :) | [reply] [d/l] |
Re: How do you match a stretch of at least N characters
by stall (Novice) on Sep 12, 2017 at 07:55 UTC
|
| [reply] |
Re: How do you match a stretch of at least N characters
by QM (Parson) on Sep 12, 2017 at 08:02 UTC
|
Just thought I'd have a go at it myself, using regex instead of the (obviously better) XOR approach.
My rough check, with 1 mismatch, works out to the same order of magnitude as choroba's first try, but maybe 2x slower. With more mismatches, my solution goes exponential creating regexes, but should take a bit less time in the matching phase. (Hand-wavy arguments go here.)
Code:
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] [d/l] |
Re: How do you match a stretch of at least N characters
by Anonymous Monk on Sep 11, 2017 at 17:52 UTC
|
Do not waste your own time. Search CPAN for Bio::DB ... | [reply] |
|
|
Thanks Dr Wisenheimer.
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
| [reply] [d/l] |
|
|
That phrasing may not have been overly helpful... but it's not a bad idea. The SO thread that poj mentioned in the similar Re: Longest common substring with N mismatches points out Bio::Grep::Backend::Agrep, whose synopsis reads to me like it probably does what the OP wants. The parent module Bio::Grep is what actually gets use'd, and lists some of the abilities of the various back-end modules, so the OP might want to check out that page as well for other pertinent information.
| [reply] [d/l] |
|
|
Re: How do you match a stretch of at least N characters
by 1nickt (Canon) on Sep 11, 2017 at 13:59 UTC
|
How can a string simultaneously both match and be a mismatch? I suggest using more than one step/regex.
The way forward always starts with a minimal test.
| [reply] |
|
|
That is what I am a bit confused with. A colleague asked me if I can do that and I am still trying to figure out how this can be possible...
Ok, but how about that : how would one go in order to find all possible k-mers of the small substring that match the big one and then maybe see if they are some that are only separated by 1? Would that make sense? If yes, how do you do it? Split the small string and search it character-by-character?
| [reply] |
Re: How do you match a stretch of at least N characters
by paisani (Acolyte) on Sep 12, 2017 at 19:48 UTC
|
## Travel small_str along the length of reference_str -
## 107 matches, 92 exact in 3 seconds
my $match=0;
for my $i (0..length($reference_str)) {
for my $j (0..length($small_str)) {
my $ref = substr($reference_str, $i, 10); my @ref_arr = split(''
+, $ref);
my $small = substr($small_str, $j, 10); my @small_arr = split(''
+, $small);
## Don't bother if you have less than (9 or) 10 left ;-)
last if length($ref) < 10 or length($small) < 10;
if ( equal(\@ref_arr, \@small_arr) ) {
$match++;
print "Found match $match: @ref_arr, @small_arr\n";
}
}
}
print "Final tally - $match\n";
exit(0);
sub equal {
my ($first, $second) = @_;
my $mismatch=0;
foreach my $i (0..9) {
$mismatch++ if $first->[$i] ne $second->[$i];
return undef if $mismatch > 1; # Bail if you have more than 1 mismat
+ch in the sequence
}
return 1;
}
| [reply] [d/l] |
Re: How do you match a stretch of at least N characters
by ikegami (Patriarch) on Sep 11, 2017 at 21:45 UTC
|
| [reply] |
|
|
| [reply] |
|
|
Is this what you were looking for? The longest matching DNA string? You can save the max length found below to get your result.
Answer:
Found match of length (102) at position(506, 0):
AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAT
AAATTGGTGTATATGAAAGACCTCGACGCTATTTAGAAAGAGAGAGCAATATTTCAAGAATGCATGCGTCAATTTTACGCAGACTATCTTTCTAGGGTTAAA
my $i=-1;
while(++$i < length($reference_str)) {
my $j=-1;
while(++$j < length($small_str)) {
my $length = 9;
my $match=undef;
while ( $i+$length < length($reference_str) and
$j+$length < length($small_str) and
compareSnippet($i, $j, ++$length) ) {
$match++;
}
if ( $match) {
$length--;
print "Found match of length ($length) at position($i, $j):
+\n";
print substr($reference_str, $i, $length) . "\n";
print substr($small_str, $j, $length). "\n";
$i+= $length;
$j+= $length;
next;
}
}
}
sub compareSnippet {
my ( $pos_i, $pos_j, $length) = @_;
my $ref = substr($reference_str, $pos_i, $length); my @ref_arr = sp
+lit('', $ref);
my $small = substr($small_str, $pos_j, $length); my @small_arr = sp
+lit('', $small);
return undef if length($ref) < $length or length($small) < $length;
return equal(\@ref_arr, \@small_arr);
}
sub equal {
my ($first, $second) = @_;
my $mismatch=0;
foreach my $i (0..(@$first-1)) {
$mismatch++ if $first->[$i] ne $second->[$i];
return undef if $mismatch > 1;
}
# print "Woo! (@$first, @$second)\n" if $mismatch == 0;
return 1;
}
| [reply] [d/l] |
Re: How do you match a stretch of at least N characters
by locked_user sundialsvc4 (Abbot) on Sep 12, 2017 at 14:44 UTC
|
| |