in reply to count backrefenence regex

It's not clear to me what you want to achieve.

I tried to translate your first one-liner (that's the "working" one?) into a readable, formatted script with strict and warnings, but found too many bugs.

Let's say you have the string "AAA_x_AAA_x_BBB_x_AAA_x_AAA_x_BBB": what result do you expect for AAA (and BBB)?

Please provide a fixed input and tell us the expected output. See also SSCCE

update

And what about AAA_x_BBB, which is also repeated? Should sub-results be counted too, or ignored?

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re^2: count backrefenence regex
by tmolosh (Initiate) on Oct 11, 2021 at 02:33 UTC

    I fully agree with your constructive comments, thanks. This was all supposed to be a quick-and-dirty time test to compare this attempted solution with my work-around, but obviously it turned out dirty and not so quick.

    My apologies for not fully fleshing out the "problem" that the code is trying to address. Like many, I assume everyone else knows what I am thinking and just jump right in where my head is at the moment.

    In your scenario I would say you have 4 x "AAA_x_" but 2 x "BBB" (not "BBB_x_").

    My thinking is: given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT", find the number of occurrences of each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once. "GATC" occurs 4 times, but twice with the extra "T", so I would call those 2 different substrings.

    BTW - I was using a 1,000,000 character string to make it take long enough to see a time difference.

    Also, I misspoke above: what I would ultimately report is each substring and its locations (I figure for that I would use $` from the regex matching). If the locations are pushed into an array and I decide I want the count, I would just use the number of elements in the array.
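To make that description concrete, here is a minimal brute-force sketch. The helper name repeated_substrings is made up for illustration, it counts overlapping occurrences too, and it does not try to treat "GATC" and "GATCT" as mutually exclusive, so take it as one possible reading of the problem rather than the intended solution. It is also far too slow and memory-hungry for a 1,000,000 character string.

```perl
use strict;
use warnings;

# Hypothetical helper: map every substring of length >= $min that occurs
# more than once to the list of its 0-based start offsets.
# Overlapping occurrences are counted as separate hits.
sub repeated_substrings {
    my ($str, $min) = @_;
    my %pos;
    my $n = length $str;
    for my $start (0 .. $n - $min) {
        for my $len ($min .. $n - $start) {
            push @{ $pos{ substr $str, $start, $len } }, $start;
        }
    }
    # Keep only substrings seen at least twice.
    delete @pos{ grep { @{ $pos{$_} } < 2 } keys %pos };
    return \%pos;
}

my $repeats = repeated_substrings('GATCGGGGACTTAGGATCCGATCT', 3);
for my $sub (sort keys %$repeats) {
    my @at = @{ $repeats->{$sub} };
    # The count is simply the number of stored offsets.
    printf "%-6s count=%d at=%s\n", $sub, scalar @at, join ',', @at;
}
```

Storing the offsets and deriving the count from them means only one data structure has to be maintained.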

      ... I was using a 1,000,000 character string ...

      Please see haukex's comment on this here in the paragraph beginning "You've got a few other issues in your code". In fact, the strings you were producing with the OPed code are only about 12,000 characters long.


      Give a man a fish:  <%-{-{-{-<

      > given a string, e.g., "GATCGGGGACTTAGGATCCGATCT" where (if I typed it right) the string has 2 x "GATC" and 2 "GATCT",

      you didn't

      DB<269> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATC)/g
      0  'GATC'
      1  'GATC'
      2  'GATC'
      DB<270> x "GATCGGGGACTTAGGATCCGATCT" =~ /(GATCT)/g
      0  'GATCT'
      DB<271>
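The same check can be done outside the debugger. This is only a sketch; the variable names are made up, and it uses $-[0] for the match start instead of measuring $`, which is one common way to get offsets.

```perl
use strict;
use warnings;

my $dna = 'GATCGGGGACTTAGGATCCGATCT';

# Count non-overlapping matches: the empty list forces list context,
# and assigning that list to a scalar yields the number of matches.
my $gatc  = () = $dna =~ /GATC/g;
my $gatct = () = $dna =~ /GATCT/g;

# Collect 0-based start offsets; $-[0] is the start of the last match.
my @offsets;
push @offsets, $-[0] while $dna =~ /GATC/g;

print "GATC: $gatc, GATCT: $gatct, offsets: @offsets\n";
```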

      > each unique substring of length >= some minimum length (I used 3 in my code) that occur more than once.

      That's not solvable with a trivial regex because of the overlaps°. I suppose tybalt's complex solution, with forced backtracking and embedded code for temporary results, already nailed it.

      But I'm pretty sure we had this question here in the past. Maybe try Super Search.

      Also, identifying repeated sequences seems to be a standard task in bioinformatics, so some libraries should offer this.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      °) (AAA_[BBB_)CCC]_(AAA_BBB_)[BBB_CCC] where ( ) and [ ] mark two different repeated sequences that overlap.
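A small sketch of the footnote's example, with the markers stripped from the string. The helper offsets_of is made up for illustration. It just shows that both sequences are repeated and that their first occurrences share characters, which is what a single left-to-right regex scan has trouble reporting.

```perl
use strict;
use warnings;

# Hypothetical helper: all 0-based start offsets of $needle in $hay,
# including overlapping ones (the search restarts one character later).
sub offsets_of {
    my ($hay, $needle) = @_;
    my @at;
    my $from = 0;
    while ((my $i = index($hay, $needle, $from)) >= 0) {
        push @at, $i;
        $from = $i + 1;
    }
    return @at;
}

my $s     = 'AAA_BBB_CCC_AAA_BBB_BBB_CCC';
my @paren = offsets_of($s, 'AAA_BBB_');   # the ( ) sequence
my @brack = offsets_of($s, 'BBB_CCC');    # the [ ] sequence

# Two spans overlap if each one starts before the other one ends.
my $overlap = $brack[0] < $paren[0] + length('AAA_BBB_')
           && $paren[0] < $brack[0] + length('BBB_CCC');

print "( ) at @paren, [ ] at @brack, first occurrences overlap: ",
      ($overlap ? 'yes' : 'no'), "\n";
```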