Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^3: How to match more than 32766 times in regex?

by FreeBeerReekingMonk (Deacon)
on Dec 01, 2015 at 19:15 UTC ( [id://1149064]=note: print w/replies, xml ) Need Help??


in reply to Re^2: How to match more than 32766 times in regex?
in thread How to match more than 32766 times in regex?

To make a regexp faster, search from start or end, using ^ or $ to bind it to that point. But can you explain what you want to do? I am sure there is a better way.


As for code, look at the multiplier x 3 that concatenates the string 3 times. Then, we use qr to quote a regular expression, which we then use and capture the results in @R, which we then print. Hope this gets you ideas. (duplicating the expression to capture it 2 times)

$ perl -e '$s="(\\d\\w)" x 3; $X="a1b2c3d4e5"; $m=qr/$s/; @R=$X=~$m; +print join(";",@R)."\n"' 1b;2c;3d

another way could be divide and conquer. Paying a penalty by using $' (the rest of the string that has not matched yet) for the next iteration. another idea is using index

Replies are listed 'Best First'.
Re^4: How to match more than 32766 times in regex?
by FreeBeerReekingMonk (Deacon) on Dec 01, 2015 at 19:57 UTC

    caveat about the multiplier: It assumes you can match that amount, so if you have 10 patterns to find, but matching 3 at a time, you are unable to match the last one.

Re^4: How to match more than 32766 times in regex?
by rsFalse (Chaplain) on Dec 01, 2015 at 19:49 UTC
    I had a input line of 100k characters '0' or '1'. I tried to solve a problem and find length of alternating subsequence. My approach was
    () = $line =~ /(.)\1*/g
    When I got test-case '0' x 100k, I gain answer of 4, not 1. Because (I think) it found three matches of length 32678 and the rest shorter match.
    When I used
    () = $line =~ /(.)\1*\1*\1*\1*/g
    - it worked slower on test case '01' x 50k. But I can't say how slower, because it was only a part of program (maybe not hot point).

      Matching a pattern across a 32767-length boundary is probably not the best solution. It feels like looking looking for nails because you have a hammer. Here's a solution that avoids quantifiers. I'm not going to take the time to test the speed of this algorithm. If it's not fast enough, drop into Inline::C and prefer a non-regex means of detecting boundaries.

      use strict; use warnings; for (1 .. 25) { my $digits = join '', map {int rand 2} 1 .. 100_000; my ($pos, $len) = longest($digits); my $substr = substr($digits, $pos, $len); printf "%02d: At position (%6d), length (%6d) => $substr\n", $_, $ +pos, $len; } sub longest { my ($seq, $lastpos, $maxpos, $maxlen) = (shift, 0, 0, 0); while ($seq =~ m/(.)(?!\1)/g) { my $p = pos $seq; my $len = $p - $lastpos; $lastpos = $p; if ($len > $maxlen) { $maxlen = $len; $maxpos = $p-$len; } } return $maxpos, $maxlen; } __END__ 01: At position ( 54044), length ( 18) => 111111111111111111 02: At position ( 22821), length ( 15) => 000000000000000 03: At position ( 84563), length ( 18) => 111111111111111111 04: At position ( 97707), length ( 18) => 000000000000000000 05: At position ( 2567), length ( 16) => 1111111111111111 06: At position ( 31038), length ( 18) => 111111111111111111 07: At position ( 73339), length ( 16) => 0000000000000000 08: At position ( 26644), length ( 19) => 0000000000000000000 09: At position ( 9906), length ( 21) => 111111111111111111111 10: At position ( 50662), length ( 15) => 111111111111111 11: At position ( 86843), length ( 15) => 111111111111111 12: At position ( 15995), length ( 16) => 0000000000000000 13: At position ( 4399), length ( 15) => 000000000000000 14: At position ( 25401), length ( 19) => 0000000000000000000 15: At position ( 65784), length ( 21) => 000000000000000000000 16: At position ( 37043), length ( 14) => 00000000000000 17: At position ( 63608), length ( 18) => 111111111111111111 18: At position ( 69870), length ( 17) => 00000000000000000 19: At position ( 20108), length ( 18) => 000000000000000000 20: At position ( 33099), length ( 19) => 1111111111111111111 21: At position ( 40355), length ( 17) => 00000000000000000 22: At position ( 98429), length ( 18) => 000000000000000000 23: At position ( 43568), length ( 16) => 1111111111111111 24: At position ( 71050), length ( 16) => 0000000000000000 25: At position ( 45697), length ( 19) => 0000000000000000000

      This scales up to any string size and any greatest-common-substring size. In other words, '1' x (100000) is not a problem. By doing away with the use of the regex altogether, and by eliminating the use of substr, it could even be adapted to work with infinite streams of digits.


      Dave

        Hm, I don't know what it has to do with alternating substrings, but here's another version of your program :)
        use strict; use warnings; use List::Util 'reduce'; for (1 .. 25) { my $digits = join '', map {int rand 2} 1 .. 100_000; my $substr = longest($digits); printf "%02d: At position (%6d), length (%6d) => $substr\n", $_, index($digits, $substr), length($substr); } sub longest { return reduce { length($a) > length($b) ? $a : $b; } split /(?<=0)(?=1)|(?<=1)(?=0)/, $_[0]; }
        Works a bit faster, too.

      Ok, now I understand. So first off, read about the index command. it will give you positions. So I _think_ you want something like this. Granted, it only takes the same sequence to find duplicated sequences after it. And as long as you do not tell us if gaps are allowed, if you really have 1's and 0's instead of GATC's. Python still seems the better way to go, just read this: http://codereview.stackexchange.com/questions/12522/simple-dna-sequence-finder-w-mismatch-tolerance
      Meanwhile, this finds sequences with copies, without allowing gaps:

      use strict; use warnings; use Term::ANSIColor; use Data::Dumper; my $X = "100100100010010110110101100100000"; # or use File::Slurp my $s = "100"; # my pattern $s my $L = length($s); # length of pattern my @C; # store colors for later my $counter = 0; my $baseposition = 0; my $newindex = 0; my $subsequenceposition = 0; while(($newindex=index($X,$s,$baseposition))>=$baseposition){ # ok, found something, now checking subsequences print "From $baseposition, found '$s' at position $newindex\n"; push(@C, {pos=>$newindex,length=>$L,color=>'black on_yellow'}); $subsequenceposition = $newindex + $L; print "iterations will start from $subsequenceposition, seeking...\n +"; while(substr($X,$subsequenceposition,$L) eq $s){ $counter++; push(@C, {pos=>$subsequenceposition,length=>$L,color=>'black on_gr +een'}); print "Found reocurrance at $subsequenceposition ($counter reocurr +ances found so far)\n"; $subsequenceposition += $L; } print &colored("Found sequence at $newindex. With $counter reocurran +ces", 'blue on_white'). "\n"; # now after the last reocurrance, keep searching for our $s $baseposition = $subsequenceposition; $counter = 0; print "Searching for more starting at $baseposition\n"; } print "DONE\n"; # now print my sequence with colors for my $p (sort {$b->{pos} <=> $a->{pos} } @C){ substr($X, $p->{pos}+$p->{length}, 0) = color('reset'); substr($X, $p->{pos}, 0) = color($p->{color}); print $X . "\n"; }
      You know, THAT should've really been in the original post...

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1149064]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-28 18:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found