Why doesn't this regex work? (Solved!)

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Why doesn't this regex work?
by choroba (Cardinal) on Aug 15, 2013 at 09:45 UTC

Have you tried use re 'debug'; ?

Update: It seems your example data are missing a digit at the end of each number. I tried it with different data:

2120 2140 2180 2197 2200 2203 2205 2234 2238 2259 2280 2299 2401

If I understood your specification correctly, the following should do the job, and it is a bit more readable:

    s/\b(\d*)(\d\d) (?=\1\d\d\b)/$1$2_/g;
    s/ /\n/g;
    s/_/ /g;
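For anyone who wants to run the three-pass idea end to end, here is a minimal sketch. The sub name three_pass is made up for illustration, and the sketch assumes the data never contain an underscore, since '_' serves as a temporary marker.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the thread): applies the three
# substitutions above to one line of space-separated numbers.
# Assumes the input contains no '_' characters and no trailing newline.
sub three_pass {
    my ($line) = @_;
    # Pass 1: mark a space with '_' when the next number shares the same
    # prefix (everything except the last two digits).
    $line =~ s/\b(\d*)(\d\d) (?=\1\d\d\b)/$1$2_/g;
    # Pass 2: every remaining space separates two groups.
    $line =~ s/ /\n/g;
    # Pass 3: turn the markers back into spaces.
    $line =~ s/_/ /g;
    return $line;
}

print three_pass('2120 2140 2180 2197 2200 2203 2205 2234 2238 2259 2280 2299 2401'), "\n";
```

With the sample data this should print three lines: the 21xx numbers, the 22xx numbers, and 2401 on its own.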

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re^2: Why doesn't this regex work?

by BrowserUk (Patriarch) on Aug 15, 2013 at 10:11 UTC

Have you tried use re 'debug';

Yes, but it didn't help. It shows that matches do occur, but it doesn't explain why the newlines are never inserted.

It seems your example data are missing a digit at the end of each number.

My mistake, made while trying to simplify the data. Two corrections are possible:

Remove one \d from each side of the regex:

while( <DATA> ) {
    s[\s(\d+)\d\K\s(?=(\d+)\d\s)]{
        print "$1:$2";
        $1 + 1 == $2 ? "\n" : ' '
    }ge;
    print;
}

Add a digit to the end of each number in the data:

1051 1061 1071 1081 1091 1101 1111 1121 1131
11151 11161 11171 11181 11191 11201 11211 11221 11231
123451 123461 123471 123481 123491 123501 123511 123531

as you know what the regex should do

The idea of the regex is to match each pair of numbers in a line and capture the first d-2 digits of each number.

Eg. Match the pair and capture the first two digits of each: (10)91 (11)01.

Then if $1+1 (10+1) == $2 (11) replace the space between them with a newline.

The \K prevents the number before the replaced space from being replaced, and the lookahead prevents the number after it from being replaced.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.


Re: Why doesn't this regex work? ( lookaround backtracking pos)
by Anonymous Monk on Aug 15, 2013 at 11:07 UTC

I think I've got it: there is one \s outside of the lookaround, so it gets consumed (it advances pos). The next attempt can't start before that position, so the leading \s can't match the same space again, and the regex ends up checking only every other number.

Look at pos: in your version (\s outside (?=)) it jumps 7, 15, but with \s inside (?=) it goes 7, 11, 15.

#! perl -slw
use strict;
my $data = '105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353
';
{
    my $other = @ARGV ? qr{(?=\s(\d+)\d\s)} : qr{\s(?=(\d+)\d\s)};
    open my($DATA),'<',\$data;
    my $what = "";
    while( readline $DATA ) {
        s{
            \s(\d+)\d
            \K
            $other
        }{
            warn "WHAT($1)($2)POS(@{[pos()]})\n";
            $1 + 1 == $2 ? ";\n" : '! '
        }gex;
        print;
    }
    print "\n$what";
}
__END__
$ perl junk
WHAT(10)(10)POS(7)
WHAT(10)(10)POS(15)
WHAT(11)(11)POS(23)
WHAT(11)(11)POS(31)
105 106! 107 108! 109 110! 111 112! 113

WHAT(111)(111)POS(9)
WHAT(111)(111)POS(19)
WHAT(112)(112)POS(29)
WHAT(112)(112)POS(39)
1115 1116! 1117 1118! 1119 1120! 1121 1122! 1123

WHAT(1234)(1234)POS(11)
WHAT(1234)(1234)POS(23)
WHAT(1235)(1235)POS(35)
12345 12346! 12347 12348! 12349 12350! 12351 12353




$ perl junk lookaround
WHAT(10)(10)POS(7)
WHAT(10)(10)POS(11)
WHAT(10)(10)POS(15)
WHAT(10)(11)POS(19)
WHAT(11)(11)POS(23)
WHAT(11)(11)POS(27)
WHAT(11)(11)POS(31)
105 106!  107!  108!  109;
 110!  111!  112!  113

WHAT(111)(111)POS(9)
WHAT(111)(111)POS(14)
WHAT(111)(111)POS(19)
WHAT(111)(112)POS(24)
WHAT(112)(112)POS(29)
WHAT(112)(112)POS(34)
WHAT(112)(112)POS(39)
1115 1116!  1117!  1118!  1119;
 1120!  1121!  1122!  1123

WHAT(1234)(1234)POS(11)
WHAT(1234)(1234)POS(17)
WHAT(1234)(1234)POS(23)
WHAT(1234)(1235)POS(29)
WHAT(1235)(1235)POS(35)
WHAT(1235)(1235)POS(41)
12345 12346!  12347!  12348!  12349;
 12350!  12351!  12353

rxrx gave me the idea to pos it


Re^2: Why doesn't this regex work? ( lookaround backtracking pos)

by BrowserUk (Patriarch) on Aug 15, 2013 at 11:17 UTC

there is one \s and it is outside of the lookaround ... so its checking every other number

BINGO! Thank you.

If I put the first space into a lookbehind: s[(?<=\s)(\d+)\d\K\s(?=(\d+)\d\s)]{

Or substitute the zero-length \b, which serves the same purpose: s[\b(\d+)\d\K\s(?=(\d+)\d\s)]{

The substitution works as I wanted it to.

But damn I could not see that for looking.
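For comparison, here is a minimal sketch that runs the consuming pattern and the \b variant side by side. The sub names are hypothetical; the two patterns are the ones quoted in this thread, and the input is assumed to end in a newline so the final lookahead has a \s to match.

```perl
use strict;
use warnings;

# Hypothetical helper names; only the patterns come from the thread.

# Leading \s is consumed, so after each match the next attempt starts
# past the space it would need: only every other pair gets compared.
sub break_consuming {
    my ($line) = @_;
    $line =~ s[\s(\d+)\d\K\s(?=(\d+)\d\s)]{ $1 + 1 == $2 ? "\n" : ' ' }ge;
    return $line;
}

# Zero-width \b instead of the leading \s: every adjacent pair is compared.
sub break_anchored {
    my ($line) = @_;
    $line =~ s[\b(\d+)\d\K\s(?=(\d+)\d\s)]{ $1 + 1 == $2 ? "\n" : ' ' }ge;
    return $line;
}

my $line = "105 106 107 108 109 110 111 112 113\n";
print break_consuming($line);   # line comes back unchanged
print break_anchored($line);    # newline between 109 and 110
```

The first call leaves the line as it was, because the pair 109/110 is never examined; the second breaks it after 109.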


Re^3: Why doesn't this regex work? ( lookaround dynamic postponed pattern (??{ code })

by Anonymous Monk on Aug 15, 2013 at 12:08 UTC

It took me a few dozen looks; I started to compose some funny (wrong) answers some 3-4 times.

The experimental (??{ code }) feature looks kinda neat:

perl -le " $_ = shift; s{(\d+)\d\s\K(?=(??{$1+1})\d\s)}{\n}g; print " "11 12 21 22 32 33 41 44"
11 12
21 22
32 33
41 44
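The same idea as a small script, for anyone who prefers that to a one-liner. This is only a sketch: the sub name is made up, the leading \b and the choice to replace the space (rather than insert after it) are my variations, and the input is assumed to end in whitespace so the lookahead can see a \s after the last number it needs.

```perl
use strict;
use warnings;

# Hypothetical helper: breaks the line wherever the next number's prefix
# (all digits but the last) is exactly one more than the current prefix.
# (??{ $1 + 1 }) builds that expected prefix at match time.
# Assumes the line ends in whitespace such as "\n".
sub break_on_increment {
    my ($line) = @_;
    $line =~ s{\b(\d+)\d\K\s(?=(??{ $1 + 1 })\d\s)}{\n}g;
    return $line;
}

print break_on_increment("105 106 107 108 109 110 111 112 113\n");
```

Here the space itself is replaced, so no trailing blank is left at the end of each output line.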


Re^4: Why doesn't this regex work? ( lookaround dynamic postponed pattern (??{ code })

by BrowserUk (Patriarch) on Aug 15, 2013 at 12:44 UTC

Re: Why doesn't this regex work?
by mtmcc (Hermit) on Aug 15, 2013 at 10:40 UTC

Maybe (probably) I'm misunderstanding the question, but in your sample data, the hundreds groups in each line seem to be the same, i.e. $1 + 1 will never equal $2 in this data, for any line.

I think the second point is about the spaces. When the regex matches, it consumes a space, so the first part of the regex can no longer match the next number.

For example, this inserts a newline before the 200s start:

#!/usr/bin/perl
use strict;
use warnings;
#use Regexp::Debugger;

while( <DATA> ) {
    s[\s(\d+)\d\d\K\s(?=(\d+)\d\d)]{
        $1 + 1 == $2 ? "\n" : ' '
    }ge;
    print;
}

__DATA__
105 106 107 108 109 110 211 212 213
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353

But when you change the data to:

105 106 107 108 109 210 211 212 213
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353

the first part of the regex can't match the 109, because the space before it was already consumed by the previous match.

Apologies if I'm way off!


Re: Why doesn't this regex work?
by Athanasius (Archbishop) on Aug 15, 2013 at 10:40 UTC

Using printf within the substitution:

#! perl
use strict;
use warnings;

while (<DATA>)
{
    chomp;
    s[\s(\d+)\d\K\s(?=(\d+)\d\s)]
    {
        printf("%s: (%s) (%s)\n", $_, $1, $2);
    }ge;
}

__DATA__
105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353

gives:

20:32 >perl 687_SoPW.pl
105 106 107 108 109 110 111 112 113: (10) (10)
105 106 107 108 109 110 111 112 113: (10) (10)
105 106 107 108 109 110 111 112 113: (11) (11)
1115 1116 1117 1118 1119 1120 1121 1122 1123: (111) (111)
1115 1116 1117 1118 1119 1120 1121 1122 1123: (111) (111)
1115 1116 1117 1118 1119 1120 1121 1122 1123: (112) (112)
12345 12346 12347 12348 12349 12350 12351 12353: (1234) (1234)
12345 12346 12347 12348 12349 12350 12351 12353: (1234) (1234)
12345 12346 12347 12348 12349 12350 12351 12353: (1235) (1235)

20:32 >

which seems to show that each comparison is consuming three terms from the data line, instead of one. I don’t understand why this is, but I’m posting in the hope that it provides someone with a useful clue as to what is going on. :-)
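One possible way to dig further is to list exactly which pairs a pattern compares. The sketch below is only an illustration: compared_pairs is a made-up helper, and the two patterns are simplified so that they capture whole numbers instead of prefixes (they consume input the same way as a leading \s versus a zero-width \b would).

```perl
use strict;
use warnings;

# Hypothetical diagnostic: returns "left:right" for every pair of numbers
# that a pattern with two capture groups matches in the line.
sub compared_pairs {
    my ($re, $line) = @_;
    my @pairs;
    while ($line =~ /$re/g) {
        push @pairs, "$1:$2";
    }
    return @pairs;
}

my $line = '105 106 107 108 109 110 111 112 113';   # chomped, as in the script above

# Simplified patterns (an assumption for illustration, not the original regex):
my $consuming = qr{\s(\d+)\s(?=(\d+)\s)};   # leading \s is consumed
my $anchored  = qr{\b(\d+)\s(?=(\d+)\s)};   # zero-width \b instead

print join(' ', compared_pairs($consuming, $line)), "\n";
print join(' ', compared_pairs($anchored,  $line)), "\n";
```

On the chomped line the first pattern reports 106:107 108:109 110:111, three pairs, which lines up with the three printf lines per row above. The second reports every adjacent pair from 105:106 through 111:112; the final pair is missing only because chomp removed the whitespace the lookahead needs after 113.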

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Why doesn't this regex work?
by boftx (Deacon) on Aug 15, 2013 at 10:47 UTC

Please understand that I am new to the Monastery. That said, why wouldn't something like this work if the data always has spaces between the numbers?

#!/usr/bin/perl

use strict;

while( <DATA> ) {
    my @numbers = split;
    my $half = int(@numbers/2);
    my @first = splice(@numbers,0,$half);
    print "@first\n";
    print "  @numbers\n";
}

__DATA__
105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353

The first snippet just cuts the line in half, but it could easily be adapted to print out lines of X number of values (with continuation rows indented.)

#!/usr/bin/perl

use strict;

my $maxitems = 3;

while( <DATA> ) {
    my @numbers = split;
    my $indent = '';
    while ( @numbers > 0) {
        my @rowdata = splice(@numbers,0,$maxitems);
        print "${indent}@rowdata\n";
        $indent = '  ';
    }
}

__DATA__
105 106 107 108 109 110 111 112 113 116
1115 1116 1117 1118 1119 1120 1121 1122 1123 1125
12345 12346 12347 12348 12349 12350 12351 12353 12355

UPDATE: I understand that this doesn't answer the question posed (which is interesting in itself) but I have to wonder if that is the right approach to the task at hand to begin with.


Re: Why doesn't this regex work?
by Laurent_R (Canon) on Aug 15, 2013 at 10:16 UTC

In your data, the third (and fourth and fifth) digits from the right are the same. And this is what you are capturing. You will not get any match where the captured numbers are different.

Second thing, I think you should probably make your spaces optional if you do not have spaces at the beginning or at the end of your lines (or perhaps use a word boundary anchor).

With these two points in mind:

  DB<8> $_ = "105 106 107 108 109 110 111 112 113 213 214";

  DB<9> s[\s?(\d+)\d\d\K\s(?=(\d+)\d\d\s?)]{$1 + 1 == $2 ? "\n" : ' '}ge;

  DB<10> x $_
0  '105 106 107 108 109 110 111 112 113
213 214'
  DB<11>

I also think that this:

$1 + 1 == $2 ? "\n" : ' '

is going to fail on sequence jumps, e.g. on data looking like this:

$_ = "105 106 107 108 109 110 111 112 113 313 314";
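If a break is wanted on any change of group rather than only on an increase of exactly one, comparing with != instead is one option. This is just a sketch under that assumption; break_on_change is a hypothetical name, and \b is used in place of the optional spaces.

```perl
use strict;
use warnings;

# Hypothetical variant: insert a newline whenever the prefix (all digits
# but the last two) differs between neighbouring numbers, so a jump from
# the 100s straight to the 300s also produces a break.
sub break_on_change {
    my ($line) = @_;
    $line =~ s[\b(\d+)\d\d\K\s(?=(\d+)\d\d\b)]{ $1 != $2 ? "\n" : ' ' }ge;
    return $line;
}

print break_on_change('105 106 107 108 109 110 111 112 113 313 314'), "\n";
```

With this data the break lands between 113 and 313.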


Re: Why doesn't this regex work? (Solved!)
by AnomalousMonk (Archbishop) on Aug 15, 2013 at 14:34 UTC

As your original question seems to have been answered, here, FWIW, is an approach that is neither multi-pass nor dependent on /e replacement code execution or embedded code blocks as most others seem to be. I was a bit confused by the data output example in the OP, but I have set this up to insert newlines on transitions from one 100s group to the next.

>perl -wMstrict -le
"my @lines = (
   '105 106 107 108 109 110 111 112 113 220 221 223',
   '100 101 198 199 200 201 298 299 300 301 398 399 400 401',
   '1115 1116 1117 1118 1119 1120 1121 1122 1123 1200 1201 1202',
   '1100 1101 1102 1198 1199 1200 1201 1202 1298 1299 1300 1301',
   '12345 12346 12347 12348 12349 12450 12451 12453 12466 12467',
   '12300 12301 12398 12399 12400 12401 12498 12499 12500 12501',
   );
 print qq{'$_'} for @lines;
 print '';
 ;;
 s{ (\d) \d\d \K  [^\n\S]+  (?! \d* \1 \d\d \b) }{\n}xmsg
   for @lines;
 print qq{'$_'} for @lines;
"
'105 106 107 108 109 110 111 112 113 220 221 223'
'100 101 198 199 200 201 298 299 300 301 398 399 400 401'
'1115 1116 1117 1118 1119 1120 1121 1122 1123 1200 1201 1202'
'1100 1101 1102 1198 1199 1200 1201 1202 1298 1299 1300 1301'
'12345 12346 12347 12348 12349 12450 12451 12453 12466 12467'
'12300 12301 12398 12399 12400 12401 12498 12499 12500 12501'

'105 106 107 108 109 110 111 112 113
220 221 223'
'100 101 198 199
200 201 298 299
300 301 398 399
400 401'
'1115 1116 1117 1118 1119 1120 1121 1122 1123
1200 1201 1202'
'1100 1101 1102 1198 1199
1200 1201 1202 1298 1299
1300 1301'
'12345 12346 12347 12348 12349
12450 12451 12453 12466 12467'
'12300 12301 12398 12399
12400 12401 12498 12499
12500 12501'


Re^2: Why doesn't this regex work? (?!bench)

by Anonymous Monk on Aug 16, 2013 at 02:08 UTC

Nice, doing the least amount of work is the fastest