PerlMonks  

Re: count backrefenence regex

by haukex (Archbishop)
on Oct 10, 2021 at 21:01 UTC ( [id://11137407]=note )


in reply to count backrefenence regex

Oneliners are great, but not for readability. When you want people to be able to help you easily, smashing this code into a oneliner is not the best way to present a Short, Self-Contained, Correct Example. Also, Use strict and warnings.

if i 1st assign $+{repeat} to a variable, i get the substring

The Variables related to regular expressions are reset by each successful regex operation and are dynamically scoped, so if you want to use them later, you should generally always store them into other variables, and only use them if the match is successful in the first place. (related recent thread in regards to the variables' scoping: Re: why is $1 cleared at end of an inline sub?)
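As a minimal sketch of that advice (the helper name first_repeat and the sample string are made up for illustration, and the regex is only a guess at what you're matching), copy the capture into a lexical inside the branch that is only reached when the match succeeded:

```perl
use warnings;
use strict;

# Hypothetical helper: returns the first run of 3+ word characters
# that occurs again later in $s, or undef if there is none.
sub first_repeat {
    my ($s) = @_;
    my $repeat;
    # Only look at %+ when the match succeeded, and copy it right away,
    # because the next successful match will overwrite it.
    if ( $s =~ m{ (?<repeat>\w{3,}) \w* \g{repeat} }x ) {
        $repeat = $+{repeat};
    }
    return $repeat;
}

my $found = first_repeat('AGCTTAGC');
print defined $found ? "repeat: $found\n" : "no repeat\n";
```

The copy in $repeat stays valid no matter which regexes run afterwards.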

adding //g back gets back to the non-functional state

As per Regexp Quote Like Operators:

In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match. The position after the last match can be read or set using the pos() function; see "pos" in perlfunc. A failed match normally resets the search position to the beginning of the string, but you can avoid that by adding the /c modifier (for example, m//gc). Modifying the target string also resets the search position.

The thing to note here is that pos is per string. You're matching against $s in the while's condition, but then also matching against $s again inside the while's body, each operation using and affecting $s's pos.
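To illustrate, here is a rough sketch (not your code; count_each and its input are invented for the example) in which the inner match runs against a separate copy of the string, so the outer loop's pos($s) is left alone:

```perl
use warnings;
use strict;

# Hypothetical example: for each distinct character in $s, count how
# often it occurs in the whole string.
sub count_each {
    my ($s) = @_;
    my %count;
    my $copy = $s;    # a separate scalar has its own pos()
    while ( $s =~ /(\w)/g ) {    # scalar-context //g advances pos($s)
        my $ch = $1;             # save the capture before the next match
        next if exists $count{$ch};
        # The inner //g runs against $copy, so pos($s) is untouched.
        # A second m//g against $s here would move or reset pos($s)
        # and derail the outer loop.
        $count{$ch} = () = $copy =~ /\Q$ch\E/g;
    }
    return \%count;
}

my $counts = count_each('abca');
print "$_: $counts->{$_}\n" for sort keys %$counts;
```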

As far as I can tell, what your current algorithm is trying to do is count the occurrences of repeated substrings immediately, each time you find them. This seems quite inefficient.

You've got a few other issues in your code. First, your first two examples produce the warning "Useless use of hash element in void context" because you just have $+{repeat}; all on its own. Second, $x=int(1000*rand()) and then using $x as an index to @a is going to cause a ton of nonexistent array elements to be picked. Also, random strings are not usually a good idea for testing during development, since tests should be repeatable.
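If you do want random input, one possible sketch is below. The sub name random_seq, the ACGT alphabet, and the seed are my assumptions, not something from your post. Indexing with rand @bases keeps the index inside the array, and a fixed srand seed makes the string repeatable within a given perl:

```perl
use warnings;
use strict;

# Hypothetical generator; the alphabet and the seed are assumptions.
sub random_seq {
    my ( $len, $seed ) = @_;
    my @bases = qw( A C G T );
    srand($seed);    # fixed seed => same sequence on every call
    # rand @bases returns a number in [0, scalar(@bases)), and the
    # array subscript truncates it to a valid integer index.
    return join '', map { $bases[ rand @bases ] } 1 .. $len;
}

print random_seq( 20, 42 ), "\n";
```

For actual tests, hardcoded strings with known expected results are still the better choice.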

Another issue that I see is that your current regex will consume not only the match (?<repeat>\w{3,}), but also all characters between that match (\w*) and the repetition itself (\g{repeat}), so all of those latter characters won't be checked for repetitions. This can be solved with zero-width Lookaround Assertions; however, you haven't specified what should happen if the sequences overlap and so on. That's why test cases are important.
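To make the difference concrete, here is a small sketch that runs a consuming pattern and a lookahead pattern over the same string. The sub names and the test string are invented for this illustration, and the consuming pattern is my reading of your regex:

```perl
use warnings;
use strict;

# Consuming version: the match swallows everything up to and
# including the repetition, so nothing in between gets examined.
sub count_consuming {
    my ($data) = @_;
    my %seqs;
    while ( $data =~ m{ (?<repeat>\w{3,}) \w* \g{repeat} }xg ) {
        $seqs{ $+{repeat} }++;
    }
    return \%seqs;
}

# Lookahead version: only the captured run is consumed, so the next
# search resumes right after it.
sub count_lookahead {
    my ($data) = @_;
    my %seqs;
    while ( $data =~ m{ (?<repeat>\w{3,}) (?= \w* \g{repeat} ) }xg ) {
        $seqs{ $+{repeat} }++;
    }
    return \%seqs;
}

my $str = 'ABCXDEFYDEFZABC';
for my $sub ( \&count_consuming, \&count_lookahead ) {
    my $seqs = $sub->($str);
    print join( ', ', map {"$_ => $seqs->{$_}"} sort keys %$seqs ), "\n";
}
# prints:
#   ABC => 1
#   ABC => 1, DEF => 1
```

The consuming version never sees DEF because the first match already ate the whole string.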

Anyway, here's a starting point for what I think you might want. Note how I'm simply using a hash to count occurrences.

use warnings;
use strict;
use Test::More;

sub count_reps {
    my $data = shift;
    my %seqs;
    while ( $data =~ m{ (?<repeat>\w{3,}) (?= \w* \g{repeat} ) }xg ) {
        $seqs{ $+{repeat} }++;
    }
    return \%seqs;
}

is_deeply count_reps('AGCAGC'), { AGC => 1 };
is_deeply count_reps('AATGCAATCGCAGCAGCA'), { AAT => 1, GCA => 3 };
is_deeply count_reps('AGCTACCCAGCTAGGGAGCTA'), { AGCTA => 2 };

done_testing;

Minor edits for clarity.

Re^2: count backrefenence regex
by tmolosh (Initiate) on Oct 11, 2021 at 02:07 UTC

    Thanks for the response. Sorry about the formatting; obviously my amateur hackery is on full display. I throw things onto the command line as quick-and-dirty tests and prototyping, but I fully agree with your constructive critique.

    I will give your code a try next week; it does look like it is going in the right direction. This is something of a weekend hobby, so it will have to wait a bit.

    T
