FIJI42 has asked for the wisdom of the Perl Monks concerning the following question:

Hey everyone. So I have a nucleotide sequence to parse through in order to find all the matching pairs of dinucleotides (ex. AA, AT, CG, etc.). I'm having problems with the code for matching - I cannot figure out how to get matches for nucleotide pairs that overlap.

For example, here's a section from my nucleotide sequence and the code I utilize for matching 'AA':

$seq = AAGAATGCCAGTTTGTAAGTACTGACTTTGGAAAA sub dtNTfreq { my ($seq = ''; my $AA = 0; while ($myseq =~ /(AA)/gi ){ $AA =$AA+1; }

Using that perl code for matching, I find 5 possible matches for 'AA'.

However, the actual number of matches is 6, since the last section of $seq has 'AAAA' which match as follows:

[AA] [AA] or A [AA] A

I'm guessing I need to move through the sequence two nucleotides at a time to see if the match is true, but I'm unsure how to do so. Possibly by putting the while loop inside of a for loop that increments through the $seq string?

Any help is greatly appreciated.

Replies are listed 'Best First'.
Re: Identifying Overlapping Matches in Nucleotide Sequence
by kcott (Archbishop) on Oct 27, 2017 at 10:33 UTC

    G'day FIJI42,

    Biological data are typically huge. For reasons of efficiency, when dealing with this type of data, you should choose a fast solution over a slower one. Perl's string handling functions (index, substr, etc.) are measurably faster than regexes (try it yourself with Benchmark). String operators (e.g. $x eq 'some literal') will be far more efficient than an equivalent regex (e.g. $x =~ /^some literal$/).

    There are times when you will need a regex. In these cases, don't give the regex more work than necessary. In "/(AA)/gi", you're asking it to pointlessly capture matches (you never use the captured data); and you're also asking for a case-insensitive match (again pointless, as all characters are uppercase). Where you are dealing with mixed-case, or unknown case, canonicalise your data once then work with that (see fc, uc, and lc).

    With your specific question, index would be the best choice.

    Your ultimate goal is somewhat unclear. If you want to match "AA" once in "AAA", and twice in "AAAA":

    $ perl -E ' my ($x, $y, $c, $p) = qw{AxAAxAAAxAAAA AA 0 0}; $c++, $p += 2 while ($p = index $x, $y, $p) > -1; say $c; ' 4

    If you want to match "AA" twice in "AAA", and thrice in "AAAA":

    $ perl -E ' my ($x, $y, $c, $p) = qw{AxAAxAAAxAAAA AA 0 0}; $c++, $p++ while ($p = index $x, $y, $p) > -1; say $c; ' 6

    Note that the only difference between those two is: in the first, the position ($p) is incremented by two ($p += 2); in the second, it is incremented by one ($p++).

    That code is really only about quickly demonstrating the index function. It's not intended as a production-grade solution; for instance, pick better (more meaningful) variable names.

    — Ken

      kcott:

      When you mentioned the overhead of capturing and case-insensitivity, it piqued my interest, so I put together a quickie, and it shows:

      $ time perl ../ex_pattern_matching.pl <GGCGATGGCGTTCTAGCGCGTAAAA> x_cap_i: 7396 x_cap: 7396 x_i: 7396 x: 7396 poz: 9202 idx: 9202 ovr_cap_i: 9202 ovr_cap: 9202 ovr_i: 9202 ovr +: 9202 Rate ovr_cap ovr poz ovr_cap_i ovr_i x_cap x_cap_i x_i + idx x ovr_cap 11.1/s -- -11% -59% -60% -62% -71% -73% -73% + -91% -92% ovr 12.4/s 12% -- -54% -55% -57% -68% -69% -70% + -90% -91% poz 26.9/s 143% 117% -- -2% -7% -31% -34% -35% + -78% -79% ovr_cap_i 27.5/s 148% 122% 2% -- -5% -29% -32% -34% + -78% -79% ovr_i 28.8/s 160% 133% 7% 5% -- -26% -29% -31% + -77% -78% x_cap 38.8/s 250% 213% 44% 41% 35% -- -4% -7% + -69% -70% x_cap_i 40.5/s 266% 227% 51% 48% 41% 4% -- -3% + -67% -69% x_i 41.6/s 275% 236% 55% 51% 44% 7% 3% -- + -66% -68% idx 123/s 1012% 895% 358% 348% 327% 217% 204% 196% + -- -6% x 131/s 1079% 955% 386% 375% 353% 237% 222% 214% + 6% -- real 0m37.721s user 0m37.656s sys 0m0.015s

      The entries starting with "x" have the same problem that the OP had: the count was too low since they skip overlapping matches. The idx solution you provided was the clear winner (so long as you want correct results). I was surprised at the interaction between case insenstivity and capturing: turning on either makes it slow, but the combination isn't any slower.

      Since the simple (but incorrect) match was as quick as the index solution, I tried to "fix" the overlap by using the pos function on the simple match (the poz entry), but mucking about with the pos function put poz into the realm of the other overlapping match functions.

      The code:

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        G'day roboticus,

        [You should be safe to drink coffee while reading this one. :-) ]

        Thanks for writing that benchmark. I have previously posted benchmarks (as have others) comparing string functions with regexes: I believe the results are fairly well known. I left writing another one as an optional exercise for the OP.

        It was, however, interesting that the various combinations of capturing and case-sensitivity didn't make that much difference (as you noted). I got the same sort of results as you (using Perl v5.26.0):

        <ACACGAAGCGCTCGTGTGATTATCT> x_cap_i: 7521 x_cap: 7521 x_i: 7521 x: 7521 poz: 9496 idx: 9496 ovr_cap_i: 9496 ovr_cap: 9496 ovr_i: 9496 ovr +: 9496 Rate ovr_cap ovr ovr_cap_i poz ovr_i x_cap x_cap_i x_i + idx x ovr_cap 5.77/s -- -12% -69% -70% -71% -80% -81% -83% + -92% -92% ovr 6.60/s 14% -- -64% -66% -67% -77% -79% -80% + -91% -91% ovr_cap_i 18.4/s 219% 179% -- -4% -8% -36% -41% -45% + -75% -76% poz 19.1/s 231% 190% 4% -- -5% -34% -39% -43% + -74% -75% ovr_i 20.1/s 248% 204% 9% 5% -- -30% -36% -40% + -73% -74% x_cap 28.8/s 399% 337% 56% 51% 44% -- -7% -14% + -61% -62% x_cap_i 31.2/s 440% 372% 69% 63% 55% 8% -- -7% + -58% -59% x_i 33.3/s 477% 405% 81% 74% 66% 16% 7% -- + -55% -56% idx 74.1/s 1183% 1023% 302% 287% 269% 157% 138% 122% + -- -3% x 76.3/s 1222% 1057% 315% 299% 280% 165% 145% 129% + 3% --

        By the way, I thought this was rather clever:

        $cnt++ while $pos = 1 + index $str, 'AA', $pos;

        Out of interest, I added these two subroutines:

        sub lc_x_cap_i { my ($str, $cnt) = @_; ++$cnt while $str =~ /(aa)/gi; return $cnt; } sub lc_x_i { my ($str, $cnt) = @_; ++$cnt while $str =~ /aa/gi; return $cnt; }

        Then ran your benchmark code again with only the "lc_*" and "x*" routines. I ran it a few times; this seemed to be a fairly representative result:

        <AGGTATGATGGTGTAGAGTAACTAG> x_cap_i: 7506 lc_x_cap_i: 7506 x_cap: 7506 x_i: 7506 lc_x_i: 7506 + x: 7506 Rate lc_x_cap_i x_cap_i x_cap lc_x_i x_i + x lc_x_cap_i 27.7/s -- -1% -12% -13% -14% + -66% x_cap_i 28.0/s 1% -- -11% -12% -13% + -66% x_cap 31.6/s 14% 13% -- -0% -2% + -61% lc_x_i 31.7/s 15% 13% 0% -- -2% + -61% x_i 32.4/s 17% 16% 2% 2% -- + -60% x 81.3/s 193% 190% 157% 156% 151% + --

        So clearly "x" (/AA/g) was faster than the rest (being ~80/s whereas the others were ~30/s, in all runs). I'd say the others were too close to call: although there may appear to be a trend, in two runs "x_cap" was the slowest.

        — Ken

Re: Identifying Overlapping Matches in Nucleotide Sequence
by hippo (Archbishop) on Oct 27, 2017 at 08:09 UTC

      Yep, it's basically the same question. Sorry for the duplicate.

Re: Identifying Overlapping Matches in Nucleotide Sequence
by Eily (Monsignor) on Oct 27, 2017 at 07:59 UTC

    You can use the look ahead assertion (found in Extended Patterns), which makes perl search for a pattern and return to the previous position after the match (and /g doesn't allow perl to match exactly at the same position, so it will move by at least one char on the next attempt).

    ($myseq =~ /(?=AA)/gi ), or (?=(AA)) if you want to capture the matches (which isn't really useful when the pattern is AA).

      This worked perfectly. Thanks!

Re: Identifying Overlapping Matches in Nucleotide Sequence
by thanos1983 (Parson) on Oct 27, 2017 at 08:11 UTC

    Hello FIJI42,

    I am not a biologist or bioinformatics engineer but I will try to tackle your question based on simple logic. I am a bit confused as why the nucleoside sequence should be 6? If you want to count the number of occurrences of a string as 'AA' then the correct answer is 5. For example AAGAATGCCAGTTTGTAAGTACTGACTTTGGAAAA is equal to 5.

    In case you want to count the number of occurrences of sets of characters (which means individual sets even apart from each other not in a sequence) then yes the number of occurrences is 6.

    If this is the case you are trying to resolve then you can simply count the number of occurrences and divide by 2 (which is the number of sets/pairs) that you are interested. But because when you divide by odd number then you will have a remainder. In this case you can use modulo e.g. Modulo operation to return quotient and remainder.

    Sample of code:

    #!/usr/bin/perl use strict; use warnings; use feature 'say'; my $seq = 'AAGAATGCCAGTTTGTAAGTACTGACTTTGGAAAA'; my $x = 'A'; my $c = () = $seq =~ /$x/g; # $c is now 13 my $mod = 2; my ($quot, $rem) = (int $c / $mod, $c % $mod); say "Number of occurences: " . $c; say "Number of pairs: " . $quot; say "Remainder: " . $rem; __END__ $ perl test.pl Number of occurences: 13 Number of pairs: 6 Remainder: 1

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Identifying Overlapping Matches in Nucleotide Sequence
by karlgoethebier (Abbot) on Oct 27, 2017 at 15:15 UTC
    "... matching pairs..."

    By chance I found this and slightly adopted it:

    #!/usr/bin/env perl use strict; use warnings; use Data::Dump; use feature qw(say); my @s = split //, "AAAA"; my @m = map { $s[$_] . ' ' . $s[ $_ + 1 ] } 0 .. $#s - 1; dd \@m; say scalar @m; __END__

    I'm still wondering why it works ;-)

    Update: Getting the letters:

    dd grep { $_ ne "" } split /[^A]/, q(AAGAATGCCAGTTTGTAAGTACTGACTTTGGAAAA);

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help