Re: Regex matching on grid alignment
by LanX (Saint) on Sep 08, 2013 at 22:42 UTC
|
DB<130> $a='ABCBBBCBADBCCACDDDAC'
=> "ABCBBBCBADBCCACDDDAC"
DB<131> grep { /(.)\1\1/ } unpack '(A5)4', $a
=> "DDDAC"
edit
unpack pattern shortened
Cheers Rolf
( addicted to the Perl Programming Language)
update
IMHO its also possible to include the same principle into one regex (matching groups of 5 characters and investigating each match with embedded Perl) but I doubt it's more readable.
| [reply] [d/l] |
Re: Regex matching on grid alignment
by rjt (Curate) on Sep 08, 2013 at 23:16 UTC
|
$rem = $W - 3;
qr/^ (?:.{$W})* .{0,$rem} (.)\1\1/x
The idea here is to minimize the branching the RE engine will have to do. The logic is pretty similar to what you might do if you had the string split into lines; just skip 0..$H1 rows, and we know we're at the beginning of a row, so from there we just match 0..$W-3 characters followed by a repeating sequence of 3 with your original regex.
Performance is the same (a few % better actually) as the plain /(.)\1\1/, and several times faster than anything I tried with unpack or split.
Edit: You can get another ~25% or so if your character set really is small like [ABCD] by unrolling (.)\1\1 into (?:AAA|BBB|CCC|DDD). If you're not just using this as a boolean test and still need the character in $1, use (AAA|BBB|CCC|DDD) instead and use substr($1,0,1) to grab the first character if you get a match. The idea here is to push the more expensive operations out of the hot loop that's called millions of times.
___________
1. $H-1, actually, but it makes no measurable difference to the efficiency, or correctness. Edit: Changed {0,$H} to *, to shave a few keystrokes. Thanks LanX!
use strict; use warnings; omitted for brevity.
| [reply] [d/l] [select] |
|
|
| [reply] [d/l] [select] |
|
|
Because {0,$H} is a disfigured emoticon in need of love just like anyone else. Emoticons with physical defects are people, too, you know...
(Other than that, no, no particular reason, although I'm sure it made perfect sense to me at the time.)
| [reply] [d/l] |
|
|
| [reply] |
|
|
Performance is A mystery :)
| [reply] |
Re: Regex matching on grid alignment
by kcott (Archbishop) on Sep 09, 2013 at 08:46 UTC
|
If you set $/ to a reference to the number of (virtual) grid columns,
then open a filehandle to a reference to your string,
you can then read each (virtual) grid row as a separate record and perform your matching on each distinct row-record.
$ perl -Mstrict -Mwarnings -Mautodie -le '
sub match_virtual_grid_rows {
local $/ = \shift;
open my $fh, "<", \shift;
grep { /(.)\1\1/ } <$fh>;
}
my $virtual_grid_columns = 5;
my $grid_string = "ABCBBBCBADBCCACDDDAC";
print for match_virtual_grid_rows($virtual_grid_columns, $grid_str
+ing);
'
DDDAC
A couple of additional tests:
$ perl -Mstrict -Mwarnings -Mautodie -le '
sub match_virtual_grid_rows {
local $/ = \shift;
open my $fh, "<", \shift;
grep { /(.)\1\1/ } <$fh>;
}
my $virtual_grid_columns = 5;
my $grid_string = "AAABBBCBADBCCACDDDAC";
print for match_virtual_grid_rows($virtual_grid_columns, $grid_str
+ing);
'
AAABB
DDDAC
$ perl -Mstrict -Mwarnings -Mautodie -le '
sub match_virtual_grid_rows {
local $/ = \shift;
open my $fh, "<", \shift;
grep { /(.)\1\1/ } <$fh>;
}
my $virtual_grid_columns = 5;
my $grid_string = "AAABBBCBADCCCACDDDAC";
print for match_virtual_grid_rows($virtual_grid_columns, $grid_str
+ing);
'
AAABB
CCCAC
DDDAC
| [reply] [d/l] [select] |
Re: Regex matching on grid alignment
by Anonymous Monk on Sep 08, 2013 at 21:41 UTC
|
x x x x
1 2 3 4
(.) \1 \1
(.) \1 \1
x x x x x x
1 2 3 4 5 6
(.) \1 \1 #r1
(.) \1 \1 #r2
(.) \1 \1 #r3
(.) \1 \1 #r4
6 length - 3 consecutive = 3 leading ...(.)\1\1
..(.)\1\1.
.(.)\1\1..
(.)\1\1...
(.)\1\1(.)\1\1
So
m{(.)\1\1(.)\1\1
|(.)\1\1...
|.(.)\1\1..
|..(.)\1\1.
|...(.)\1\1
}x | [reply] [d/l] [select] |
|
|
m{
(?|
(.)\1\1(.)\1\1
|(.)\1\1...
|.(.)\1\1..
|..(.)\1\1.
|...(.)\1\1
){4} ## 6 x 4
}x;
Generating (and testing) this regex left up to the OP (hopefully you can see the relation between the various numbers) | [reply] [d/l] |
|
|
:) perl before 5.10 (before branch reset available ) would need () counting like
(.)\1\1(.)\2\2
|(.)\3\3...
|.(.)\4\4..
|..(.)\5\5.
|...(.)\6\6
| [reply] |
Re: Regex matching on grid alignment
by LanX (Saint) on Sep 08, 2013 at 22:23 UTC
|
> All of my other patterns work efficiently with the unmodified string (several of them can even be combined into one expression),
I'd include an end of line separator like \0 to simplify things a lot.
> so if you do suggest an alternate storage format for the string, please also consider the cost of the conversion when thinking about efficiency, here.
Yep, but w/o further knowledge of your operations we can't consider anything about efficiency.
Please help us understand why your old pattern wouldn't be able to be adjusted to a new separator.
I doubt you'll accept this answer, but it had to be given.
Cheers Rolf
( addicted to the Perl Programming Language)
| [reply] |
|
|
Thank you LanX. I'll try to explain a little more. The "match 3 of the same character in a row" is the ONLY pattern I have that must not break row boundaries. The other patterns are more complex, but here are a few examples (from memroy, please forgive mistakes):
/(.).{$width}\1.{$width}\1/; # 3 in a line vertically
/^(.)(?!\1)(.)(?:\1\2)+\1?$/; # XoXoX
# oXoXo
# XoXoX
# oXoXo
Sorry, these are about the only two I had any hope of getting right from memory. :-) There are others that look for dynamic "walking" patterns of symbols, stairstep patterns, various symmetries, and Jimmy Hoffa (if our grant approval ever goes through). Several of them share parts with others, so they combine quite nicely into a single expression.
I thought of using a row separator as well, but patterns not much more complex than the first get a lot less readable and slower when you start to have to have a lot of +1's and -1's to account for the different width, and exceptions so that it doesn't match three \0's in a row at the end of the lines. Patterns like the second, frankly I think it would be easier (for both man and machine) to copy the string and tr/\0//d. That wouldn't be horrible, but definitely preferable to just have another regexp to optimize with others than to have two data representations. | [reply] [d/l] [select] |
Re: Regex matching on grid alignment
by AnomalousMonk (Archbishop) on Sep 09, 2013 at 06:44 UTC
|
>perl -wMstrict -le
"my $s = join '', qw(ABCbb bCBAd ddCAC DDDAC ABBBC ABCDA ABCCC CCCAB);
print qq{'$s'};
;;
my $row_width = 5;
my $contiguous = 3;
$contiguous <= $row_width or die 'nonsense';
$contiguous > 0 or die 'ridiculous';
;;
my $pre_max = $row_width - $contiguous;
my $post = $contiguous - 1;
;;
use re 'eval';
my $mod;
my @reps =
grep { $mod = ! $mod }
$s =~ m{
\G (?: .{$row_width}){0,}? .{0,$pre_max}? ((.)\2{$post})
(?(?{ pos($s) % $row_width }) .){$pre_max}
}xmsg;
;;
printf qq{'$_' } for @reps;
"
'ABCbbbCBAdddCACDDDACABBBCABCDAABCCCCCCAB'
'DDD' 'BBB' 'CCC' 'CCC'
Update: If I correctly understand how all this works, the regex sub-expression
(?(?{ pos($s) % $row_width }) .){$pre_max}
will continue to "loop" until the value of the $pre_max quantifier is exhausted even though a row boundary has been reached. A way to "break out" of this loop once a boundary is reached (again, IIUC) is
(?(?{ $+[1] % $row_width }) . | (*SKIP:AHEAD)){$pre_max} (*MARK:AHEAD)
($+[1] used in place of pos($s)). Again, no timings have been done to confirm this is actually beneficial. See Special Backtracking Control Verbs (Perl version 5.10+) in perlre and note that these verbs are marked "experimental". If you really want to walk on the wild side, try (*ACCEPT) which is marked highly experimental:
(?(?{ $+[1] % $row_width }) . | (*ACCEPT)){$pre_max}
and no (*MARK) is needed. Both of these variations have been tested.
| [reply] [d/l] [select] |
|
|
I think I like LanX's grep/unpack approach best
May I ask why? When I benchmarked it, rjt's regex was 5-10 times faster (depending on grid size) than the grep/unpack approach when used alone, and even faster when I combine it with our existing expression. Efficiency was a main requirement, and I personally don't find the grep/unpack any more readable or maintainable than the regex, but maybe that's just me.
(LanX, please don't feel like I'm picking on you. I really appreciate your comments and learned from them. It's just that for what I asked for help with, I thought rjt's solution was way better, but now I'm trying to understand why it might not be.)
| [reply] |
|
|
I think I like LanX's grep/unpack approach best
May I ask why?
Well, I did only say I thought I liked it best :)
My practice in dealing with certain SoPW queries, regex-related ones in particular, is to try to formulate an answer before looking at any of the responses, then post what I have if it seems it might bring something to the table.
In quickly reading through the replies already posted before posting my own, it seemed that rjt's was closest to mine in matching to row-groups, then to a repeat within a row. (Although rjt's solution finds the rightmost repeat in a string, and you would presumably work leftward from there to pick up the others; my understanding is that you want all repeats from the string. My approach finds the leftmost repeat first, then continues working to the right.) What my approach added, it seemed to me, was a way within the regex to "re-synchronize" to a row boundary before looking for the next repeat to capture.
As I say, I only briefly glanced at the various replies before posting my own, and did no benchmarking. I was beguiled by the simplicity of some of the other replies and by the fact that I've found such approaches to be quite efficient in some other cases. I failed to notice that rjt had already done some benchmarking and found his or her regex approach to be significantly faster.
And that's the point: benchmarking tells the tale. No matter if an alternate solution is simpler, more elegant or maintainable (whatever those words really mean), if you have a need for speed and a solution's not fast enough, it's unuseable. Period.
So good luck in your quest, and I hope your experience of the Monastery has been and will continue to be productive. And please consider registering: it just makes things a bunch simpler.
| [reply] |
Re: Regex matching on grid alignment
by hdb (Monsignor) on Sep 09, 2013 at 07:28 UTC
|
If keeping regexes simple is one of your objectives, then another strategy would be to match all and then discard mis-aligned matches.
use strict;
use warnings;
my $str = 'AAA451BBB512CCC123DDD234EEE345FFF45';
while( $str =~ /((.)\2\2)/g ) {
next unless $-[1] % 5 < 3;
print "Found $1\n";
}
| [reply] [d/l] |
Re: Regex matching on grid alignment
by boftx (Deacon) on Sep 09, 2013 at 05:42 UTC
|
| [reply] |
|
|
| [reply] |