Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

count backrefenence regex

by tmolosh (Initiate)
on Oct 10, 2021 at 20:15 UTC ( [id://11137405]=perlquestion: print w/replies, xml ) Need Help??

tmolosh has asked for the wisdom of the Perl Monks concerning the following question:

hi all - i have a workaround for this but want to understand why my 1st approach did not work. This relates to reporting on "direct repeats" in DNA|RNA sequences.

here is a command line example to generate a 1,000,000 sequence and try to find direct repeats of length > 3

perl -e '@a=qw(g a t c g a t c g a t c); for(1..1000000){$x=int(1000*r +and());$s.=$a[$x]} while($s=~/(?<repeat>\w{3,})\w*\g{repeat}/g){ $+{r +epeat};$c=()=$s=~/$+{repeat}/g;print "$+{repeat} : $c\n"}' agtcg : 0 agtcg : 0 agtcg : 0 ^C

if i remove the //g, it reports a count of 1 but no substring

perl -e '@a=qw(g a t c g a t c g a t c); for(1..1000000){$x=int(1000*r +and());$s.=$a[$x]} while($s=~/(?<repeat>\w{3,})\w*\g{repeat}/g){ $+{r +epeat};$c=()=$s=~/$+{repeat}/;print "$+{repeat} : $c\n"}' : 1 : 1 : 1 : 1

if i 1st assign $+{repeat} to a variable, i get the substring, but without //g, the count is still (as expected since it is not global)

perl -e '@a=qw(g a t c g a t c g a t c); for(1..1000000){$x=int(1000*r +and());$s.=$a[$x]} while($s=~/(?<repeat>\w{3,})\w*\g{repeat}/g){ $m=$ ++{repeat};$c=()=$s=~/$m/;print "$m : $c\n"}' gcacgt : 1 cga : 1 tggg : 1 ttg : 1 tta : 1 cct : 1

adding //g back gets back to the non-functional state:

perl -e '@a=qw(g a t c g a t c g a t c); for(1..1000000){$x=int(1000*r +and());$s.=$a[$x]} while($s=~/(?<repeat>\w{3,})\w*\g{repeat}/g){ $m=$ ++{repeat};$c=()=$s=~/$m/g;print "$m : $c\n"}' gctgca : 0 gctgca : 0 gctgca : 0 ^C

any explanation so i can better understand regexs would be greatly appreciated

thanks

Replies are listed 'Best First'.
Re: count backrefenence regex
by haukex (Archbishop) on Oct 10, 2021 at 21:01 UTC

    Oneliners are great, but not for readability, and when you want people to be able to help you easily, smashing this code into a oneliner is not the best way to present a Short, Self-Contained, Correct Example. Also, Use strict and warnings.

    if i 1st assign $+{repeat} to a variable, i get the substring

    The Variables related to regular expressions are reset by each successful regex operation and are dynamically scoped, so if you want to use them later, you should generally always store them into other variables, and only use them if the match is successful in the first place. (related recent thread in regards to the variables' scoping: Re: why is $1 cleared at end of an inline sub?)

    adding //g back gets back to the non-functional state

    As per Regexp Quote Like Operators:

    In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match. The position after the last match can be read or set using the pos() function; see "pos" in perlfunc. A failed match normally resets the search position to the beginning of the string, but you can avoid that by adding the /c modifier (for example, m//gc). Modifying the target string also resets the search position.

    The thing to note here is that pos is per string. You're matching against $s in the while's condition, but then also matching against $s again inside the while's body, each operation using and affecting $s's pos.

    As far as I can tell, what your current algorithm is trying to do is count the ocurrences of repeated substrings immediately, each time you find them. This seems quite inefficient.

    You've got a few other issues in your code: Your first two examples have "Useless use of hash element in void context" because you just have $+{repeat}; all on its own, and second, $x=int(1000*rand()) and then using $x as an index to @a is going to cause a ton of nonexistent array elements to be picked. Also, random strings are not usually a good idea for testing during development, since tests should be repeatable.

    Another issue that I see is that your current regex will consume not only the match (?<repeat>\w{3,}), but also all characters between that match (\w*) and the repetition itself (\g{repeat}), so all of those latter characters won't be checked for repetitions. This can be solved with zero-width Lookaround Assertions, however, you haven't specified what should happen if the sequences overlap and so on. That's why test cases are important.

    Anyway, here's a starting point for what I think you might want. Note how I'm simply using a hash to count occurrences.

    use warnings; use strict; use Test::More; sub count_reps { my $data = shift; my %seqs; while ( $data =~ m{ (?<repeat>\w{3,}) (?= \w* \g{repeat} ) }xg ) { $seqs{ $+{repeat} }++; } return \%seqs; } is_deeply count_reps('AGCAGC'), { AGC => 1 }; is_deeply count_reps('AATGCAATCGCAGCAGCA'), { AAT => 1, GCA => 3 }; is_deeply count_reps('AGCTACCCAGCTAGGGAGCTA'), { AGCTA => 2 }; done_testing;

    Minor edits for clarity.

      Thanks for the response. sorry about the formatting, obviously my amateur hacky is on full display. I throw things onto the command line as quick-and-dirty tests and prototyping. but I fully agree with your constructive critique.

      I will give your code a try next week, it does look like it is going in the right direction. this is something of a weekend hobby so it will have to wait a bit.

      T
Re: count backrefenence regex
by tybalt89 (Monsignor) on Oct 10, 2021 at 23:09 UTC

    Using up my quota of guesses for the day...

    This will get all valid substrings of longer valid strings.

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11137405 use warnings; my $random = join '', map qw(A C G T)[rand 4], 1 .. 50; for my $seq ( qw( AGCAGC AATGCAATCGCAGCAGCA AGCTACCCAGCTAGGGAGCTA AAA_x_AAA_x_BBB_x_AAA_x_AAA_x_BBB ), $random ) { my %found; $seq =~ /([A-Z]{3,}) .* \1 (?{ $found{$1}++ }) (*FAIL)/x; my %counts = map { $_, scalar(() = $seq =~ /$_/g) } keys %found; use Data::Dump 'dd'; dd "for sequence $seq", \%counts; }

    Outputs:

    ("for sequence AGCAGC", { AGC => 2 }) ( "for sequence AATGCAATCGCAGCAGCA", { AAT => 2, AGC => 2, CAG => 2, GCA => 4 }, ) ( "for sequence AGCTACCCAGCTAGGGAGCTA", { AGC => 3, AGCT => 3, AGCTA => 3, CTA => 3, GCT => 3, GCTA => 3 }, ) ( "for sequence AAA_x_AAA_x_BBB_x_AAA_x_AAA_x_BBB", { AAA => 4, BBB => 2 }, ) ( "for sequence ATGGACTGCCTGGAAGAATCATCCATCCTGGGGCCCGGATCTTTGTACCC", { ATC => 4, ATCC => 2, CAT => 2, CATC => 2, CCC => 2, CCT => 2, CCTG => 2, CCTGG => 2, CTG => 3, CTGG => 2, GAA => 2, GCC => 2, GGA => 3, TCC => 2, TGG => 3, TGGA => 2, }, )
Re: count backrefenence regex
by LanX (Saint) on Oct 10, 2021 at 20:41 UTC
    It's not clear to me what you want to achieve.

    I tried to translate your first oneliner (that's the "working" one?) to a readable, formated script with strict and warnings but found too many bugs.

    Lets say you have a string "AAA_x_AAA_x_BBB_x_AAA_x_AAA_x_BBB" what is the result you expect for AAA (and BBB )?

    • 2 ?
    • 3 ?
    • 4 ?

    Please provide a fixed input and tell us the expected output. See also SSCCE

    update

    and what about AAA_x_BBB which is also repeated...? should sub-results be counted too or ignored?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      I fully agree with your constructive comments. thanks. this was all supposed to be a quick-and-dirty time test to compare this attempted solution vs my work-around. but obviously dirty it was but not so quick.

      my apologies for not fully fleshing-out the "problem" that the code is trying to address. like many, I assume everyone else knows what I am thinking and just jump right in where my head is at the moment.

      in your scenario I would say you have 4 x "AAA_x_" but 2 "BBB" (not "BBB_x_")

      my thinking is: given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT", find the number of occurrences of each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once. "GATC" occurs 4 times, but twice with the extra "T" so I would call those 2 different substrings.

      BTW - I was using a 1,000,000 character string to make it take long enough to see a time difference.

      also, I mis-spoke above, what I ultimately would report is substring and its locations (I figure for that I would use $` from the regex matching). If locations are pushed into an array, and I decide I want the count, I would just use the length of the array.

        ... I was using a 1,000,000 character string ...

        Please see haukex's comment on this here in the paragraph beginning "You've got a few other issues in your code". In fact, the strings you were producing with the OPed code are only about 12,000 characters long.


        Give a man a fish:  <%-{-{-{-<

        > given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT",

        you didn't

        DB<269> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATC)/g 0 'GATC' 1 'GATC' 2 'GATC' DB<270> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATCT)/g 0 'GATCT' DB<271>

        > each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once.

        That's not solvable with a trivial regex because of the overlaps°, I suppose tybalt's complex solution with forced backtracking and embedded code for temporary results already nailed it.

        But I'm pretty sure we had this question here in the past. Maybe try super search

        Also seems identifying repeated sequences be a standard in BioInf, so some libraries should offer this.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        °) (AAA_[BBB_)CCC]_(AAA_BBB_)[BBB_CCC] brackets ( and [ for different repeated but overlapping sequences.

Re: count backrefenence regex
by tmolosh (Initiate) on Oct 16, 2021 at 14:05 UTC
    I wonder if this would be better achieved with a suffix tree and/or stemming

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11137405]
Approved by LanX
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (1)
As of 2024-04-19 00:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found