Re^2: How to match more than 32766 times in regex?

Replies are listed 'Best First'.
Re^3: How to match more than 32766 times in regex? by FreeBeerReekingMonk (Deacon) on Dec 01, 2015 at 19:15 UTC
To make a regexp faster, search from start or end, using ^ or $ to bind it to that point. But can you explain what you want to do? I am sure there is a better way. As for code, look at the multiplier x 3 that concatenates the string 3 times. Then, we use qr to quote a regular expression, which we then use and capture the results in @R, which we then print. Hope this gets you ideas. (duplicating the expression to capture it 2 times) `$ perl -e '$s="(\\d\\w)" x 3; $X="a1b2c3d4e5"; $m=qr/$s/; @R=$X=~$m; +print join(";",@R)."\n"' 1b;2c;3d` [download] another way could be divide and conquer. Paying a penalty by using $' (the rest of the string that has not matched yet) for the next iteration. another idea is using index	[reply] [d/l]
Re^4: How to match more than 32766 times in regex? by FreeBeerReekingMonk (Deacon) on Dec 01, 2015 at 19:57 UTC
caveat about the multiplier: It assumes you can match that amount, so if you have 10 patterns to find, but matching 3 at a time, you are unable to match the last one.	[reply]
Re^4: How to match more than 32766 times in regex? by rsFalse (Chaplain) on Dec 01, 2015 at 19:49 UTC
I had a input line of 100k characters '0' or '1'. I tried to solve a problem and find length of alternating subsequence. My approach was `() = $line =~ /(.)\1/g` [download] When I got test-case '0' x 100k, I gain answer of 4, not 1. Because (I think) it found three matches of length 32678 and the rest shorter match. When I used `() = $line =~ /(.)\1\1\1\1*/g` [download] - it worked slower on test case '01' x 50k. But I can't say how slower, because it was only a part of program (maybe not hot point).	[reply] [d/l] [select]
Re^5: How to match more than 32766 times in regex? by davido (Cardinal) on Dec 01, 2015 at 21:48 UTC
Matching a pattern across a 32767-length boundary is probably not the best solution. It feels like looking looking for nails because you have a hammer. Here's a solution that avoids quantifiers. I'm not going to take the time to test the speed of this algorithm. If it's not fast enough, drop into Inline::C and prefer a non-regex means of detecting boundaries. use strict; use warnings; for (1 .. 25) { my $digits = join '', map {int rand 2} 1 .. 100_000; my ($pos, $len) = longest($digits); my $substr = substr($digits, $pos, $len); printf "%02d: At position (%6d), length (%6d) => $substr\n", $_, $ +pos, $len; } sub longest { my ($seq, $lastpos, $maxpos, $maxlen) = (shift, 0, 0, 0); while ($seq =~ m/(.)(?!\1)/g) { my $p = pos $seq; my $len = $p - $lastpos; $lastpos = $p; if ($len > $maxlen) { $maxlen = $len; $maxpos = $p-$len; } } return $maxpos, $maxlen; } __END__ 01: At position ( 54044), length ( 18) => 111111111111111111 02: At position ( 22821), length ( 15) => 000000000000000 03: At position ( 84563), length ( 18) => 111111111111111111 04: At position ( 97707), length ( 18) => 000000000000000000 05: At position ( 2567), length ( 16) => 1111111111111111 06: At position ( 31038), length ( 18) => 111111111111111111 07: At position ( 73339), length ( 16) => 0000000000000000 08: At position ( 26644), length ( 19) => 0000000000000000000 09: At position ( 9906), length ( 21) => 111111111111111111111 10: At position ( 50662), length ( 15) => 111111111111111 11: At position ( 86843), length ( 15) => 111111111111111 12: At position ( 15995), length ( 16) => 0000000000000000 13: At position ( 4399), length ( 15) => 000000000000000 14: At position ( 25401), length ( 19) => 0000000000000000000 15: At position ( 65784), length ( 21) => 000000000000000000000 16: At position ( 37043), length ( 14) => 00000000000000 17: At position ( 63608), length ( 18) => 111111111111111111 18: At position ( 69870), length ( 17) => 00000000000000000 19: At position ( 20108), length ( 18) => 000000000000000000 20: At position ( 33099), length ( 19) => 1111111111111111111 21: At position ( 40355), length ( 17) => 00000000000000000 22: At position ( 98429), length ( 18) => 000000000000000000 23: At position ( 43568), length ( 16) => 1111111111111111 24: At position ( 71050), length ( 16) => 0000000000000000 25: At position ( 45697), length ( 19) => 0000000000000000000 [download] This scales up to any string size and any greatest-common-substring size. In other words, `'1' x (100000)` is not a problem. By doing away with the use of the regex altogether, and by eliminating the use of substr, it could even be adapted to work with infinite streams of digits. Dave	[reply] [d/l] [select]
Re^6: How to match more than 32766 times in regex? by Anonymous Monk on Dec 01, 2015 at 22:38 UTC
Re^5: How to match more than 32766 times in regex? by FreeBeerReekingMonk (Deacon) on Dec 01, 2015 at 21:04 UTC
Ok, now I understand. So first off, read about the index command. it will give you positions. So I _think_ you want something like this. Granted, it only takes the same sequence to find duplicated sequences after it. And as long as you do not tell us if gaps are allowed, if you really have 1's and 0's instead of GATC's. Python still seems the better way to go, just read this: http://codereview.stackexchange.com/questions/12522/simple-dna-sequence-finder-w-mismatch-tolerance Meanwhile, this finds sequences with copies, without allowing gaps: use strict; use warnings; use Term::ANSIColor; use Data::Dumper; my $X = "100100100010010110110101100100000"; # or use File::Slurp my $s = "100"; # my pattern $s my $L = length($s); # length of pattern my @C; # store colors for later my $counter = 0; my $baseposition = 0; my $newindex = 0; my $subsequenceposition = 0; while(($newindex=index($X,$s,$baseposition))>=$baseposition){ # ok, found something, now checking subsequences print "From $baseposition, found '$s' at position $newindex\n"; push(@C, {pos=>$newindex,length=>$L,color=>'black on_yellow'}); $subsequenceposition = $newindex + $L; print "iterations will start from $subsequenceposition, seeking...\n +"; while(substr($X,$subsequenceposition,$L) eq $s){ $counter++; push(@C, {pos=>$subsequenceposition,length=>$L,color=>'black on_gr +een'}); print "Found reocurrance at $subsequenceposition ($counter reocurr +ances found so far)\n"; $subsequenceposition += $L; } print &colored("Found sequence at $newindex. With $counter reocurran +ces", 'blue on_white'). "\n"; # now after the last reocurrance, keep searching for our $s $baseposition = $subsequenceposition; $counter = 0; print "Searching for more starting at $baseposition\n"; } print "DONE\n"; # now print my sequence with colors for my $p (sort {$b->{pos} <=> $a->{pos} } @C){ substr($X, $p->{pos}+$p->{length}, 0) = color('reset'); substr($X, $p->{pos}, 0) = color($p->{color}); print $X . "\n"; } [download]	[reply] [d/l]
Re^5: How to match more than 32766 times in regex? by Anonymous Monk on Dec 01, 2015 at 20:32 UTC
You know, THAT should've really been in the original post...	[reply]
Re^3: How to match more than 32766 times in regex? by BrowserUk (Patriarch) on Dec 01, 2015 at 19:07 UTC
if I wanted match up to 96000 times, What's your application? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^3: How to match more than 32766 times in regex? by Anonymous Monk on Dec 01, 2015 at 18:59 UTC
"... if I wanted match up to 96000 times ..." You're fired.	[reply]


Don't ask to ask, just ask
	PerlMonks