Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^2: How to match more than 32766 times in regex?

by rsFalse (Chaplain)
on Dec 01, 2015 at 18:45 UTC ( [id://1149055]=note: print w/replies, xml ) Need Help??


in reply to Re: How to match more than 32766 times in regex?
in thread How to match more than 32766 times in regex?

I've read about limitation, but is it a way to compose regex without big time penalty? I tried smth like /$regex*$regex*$regex*/ if I wanted match up to 96000 times, but it takes a lot of time regex to finish.
  • Comment on Re^2: How to match more than 32766 times in regex?

Replies are listed 'Best First'.
Re^3: How to match more than 32766 times in regex?
by FreeBeerReekingMonk (Deacon) on Dec 01, 2015 at 19:15 UTC

    To make a regexp faster, search from start or end, using ^ or $ to bind it to that point. But can you explain what you want to do? I am sure there is a better way.


    As for code, look at the multiplier x 3 that concatenates the string 3 times. Then, we use qr to quote a regular expression, which we then use and capture the results in @R, which we then print. Hope this gets you ideas. (duplicating the expression to capture it 2 times)

    $ perl -e '$s="(\\d\\w)" x 3; $X="a1b2c3d4e5"; $m=qr/$s/; @R=$X=~$m; +print join(";",@R)."\n"' 1b;2c;3d

    another way could be divide and conquer. Paying a penalty by using $' (the rest of the string that has not matched yet) for the next iteration. another idea is using index

      caveat about the multiplier: It assumes you can match that amount, so if you have 10 patterns to find, but matching 3 at a time, you are unable to match the last one.

      I had a input line of 100k characters '0' or '1'. I tried to solve a problem and find length of alternating subsequence. My approach was
      () = $line =~ /(.)\1*/g
      When I got test-case '0' x 100k, I gain answer of 4, not 1. Because (I think) it found three matches of length 32678 and the rest shorter match.
      When I used
      () = $line =~ /(.)\1*\1*\1*\1*/g
      - it worked slower on test case '01' x 50k. But I can't say how slower, because it was only a part of program (maybe not hot point).

        Matching a pattern across a 32767-length boundary is probably not the best solution. It feels like looking looking for nails because you have a hammer. Here's a solution that avoids quantifiers. I'm not going to take the time to test the speed of this algorithm. If it's not fast enough, drop into Inline::C and prefer a non-regex means of detecting boundaries.

        use strict; use warnings; for (1 .. 25) { my $digits = join '', map {int rand 2} 1 .. 100_000; my ($pos, $len) = longest($digits); my $substr = substr($digits, $pos, $len); printf "%02d: At position (%6d), length (%6d) => $substr\n", $_, $ +pos, $len; } sub longest { my ($seq, $lastpos, $maxpos, $maxlen) = (shift, 0, 0, 0); while ($seq =~ m/(.)(?!\1)/g) { my $p = pos $seq; my $len = $p - $lastpos; $lastpos = $p; if ($len > $maxlen) { $maxlen = $len; $maxpos = $p-$len; } } return $maxpos, $maxlen; } __END__ 01: At position ( 54044), length ( 18) => 111111111111111111 02: At position ( 22821), length ( 15) => 000000000000000 03: At position ( 84563), length ( 18) => 111111111111111111 04: At position ( 97707), length ( 18) => 000000000000000000 05: At position ( 2567), length ( 16) => 1111111111111111 06: At position ( 31038), length ( 18) => 111111111111111111 07: At position ( 73339), length ( 16) => 0000000000000000 08: At position ( 26644), length ( 19) => 0000000000000000000 09: At position ( 9906), length ( 21) => 111111111111111111111 10: At position ( 50662), length ( 15) => 111111111111111 11: At position ( 86843), length ( 15) => 111111111111111 12: At position ( 15995), length ( 16) => 0000000000000000 13: At position ( 4399), length ( 15) => 000000000000000 14: At position ( 25401), length ( 19) => 0000000000000000000 15: At position ( 65784), length ( 21) => 000000000000000000000 16: At position ( 37043), length ( 14) => 00000000000000 17: At position ( 63608), length ( 18) => 111111111111111111 18: At position ( 69870), length ( 17) => 00000000000000000 19: At position ( 20108), length ( 18) => 000000000000000000 20: At position ( 33099), length ( 19) => 1111111111111111111 21: At position ( 40355), length ( 17) => 00000000000000000 22: At position ( 98429), length ( 18) => 000000000000000000 23: At position ( 43568), length ( 16) => 1111111111111111 24: At position ( 71050), length ( 16) => 0000000000000000 25: At position ( 45697), length ( 19) => 0000000000000000000

        This scales up to any string size and any greatest-common-substring size. In other words, '1' x (100000) is not a problem. By doing away with the use of the regex altogether, and by eliminating the use of substr, it could even be adapted to work with infinite streams of digits.


        Dave

        Ok, now I understand. So first off, read about the index command. it will give you positions. So I _think_ you want something like this. Granted, it only takes the same sequence to find duplicated sequences after it. And as long as you do not tell us if gaps are allowed, if you really have 1's and 0's instead of GATC's. Python still seems the better way to go, just read this: http://codereview.stackexchange.com/questions/12522/simple-dna-sequence-finder-w-mismatch-tolerance
        Meanwhile, this finds sequences with copies, without allowing gaps:

        use strict; use warnings; use Term::ANSIColor; use Data::Dumper; my $X = "100100100010010110110101100100000"; # or use File::Slurp my $s = "100"; # my pattern $s my $L = length($s); # length of pattern my @C; # store colors for later my $counter = 0; my $baseposition = 0; my $newindex = 0; my $subsequenceposition = 0; while(($newindex=index($X,$s,$baseposition))>=$baseposition){ # ok, found something, now checking subsequences print "From $baseposition, found '$s' at position $newindex\n"; push(@C, {pos=>$newindex,length=>$L,color=>'black on_yellow'}); $subsequenceposition = $newindex + $L; print "iterations will start from $subsequenceposition, seeking...\n +"; while(substr($X,$subsequenceposition,$L) eq $s){ $counter++; push(@C, {pos=>$subsequenceposition,length=>$L,color=>'black on_gr +een'}); print "Found reocurrance at $subsequenceposition ($counter reocurr +ances found so far)\n"; $subsequenceposition += $L; } print &colored("Found sequence at $newindex. With $counter reocurran +ces", 'blue on_white'). "\n"; # now after the last reocurrance, keep searching for our $s $baseposition = $subsequenceposition; $counter = 0; print "Searching for more starting at $baseposition\n"; } print "DONE\n"; # now print my sequence with colors for my $p (sort {$b->{pos} <=> $a->{pos} } @C){ substr($X, $p->{pos}+$p->{length}, 0) = color('reset'); substr($X, $p->{pos}, 0) = color($p->{color}); print $X . "\n"; }
        You know, THAT should've really been in the original post...
Re^3: How to match more than 32766 times in regex?
by BrowserUk (Patriarch) on Dec 01, 2015 at 19:07 UTC
    if I wanted match up to 96000 times,

    What's your application?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: How to match more than 32766 times in regex?
by Anonymous Monk on Dec 01, 2015 at 18:59 UTC
    "... if I wanted match up to 96000 times ..."

    You're fired.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1149055]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (2)
As of 2024-04-20 01:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found