Re: Why doesn't this regex work?
by choroba (Cardinal) on Aug 15, 2013 at 09:45 UTC
|
Have you tried use re 'debug'; ?
Its output is very long, but as you know what the regex should do, you might understand it.
Update:
It seems your example data miss a digit at the end of each number. I tried it with different data:
2120 2140 2180 2197 2200 2203 2205 2234 2238 2259 2280 2299 2401
If I understood your specification correctly, the following should do the job and is a bit more readable:
s/\b(\d*)(\d\d) (?=\1\d\d\b)/$1$2_/g;
s/ /\n/g;
s/_/ /g;
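For anyone who wants to run the three passes above as a script, here is a minimal sketch. The sub name group_by_prefix is made up for illustration; the three substitutions are the ones from the post, and the sketch assumes the input contains no underscores to begin with.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the three substitutions from the post.
sub group_by_prefix {
    my ($line) = @_;
    # Pass 1: mark the space between two numbers that share a prefix
    # (everything except the last two digits) with an underscore.
    $line =~ s/\b(\d*)(\d\d) (?=\1\d\d\b)/$1$2_/g;
    # Pass 2: every space that is left separates two groups.
    $line =~ s/ /\n/g;
    # Pass 3: turn the markers back into spaces.
    $line =~ s/_/ /g;
    return $line;
}

print group_by_prefix('2120 2140 2180 2197 2200 2203 2205 2234 2238 2259 2280 2299 2401'), "\n";
# Prints:
# 2120 2140 2180 2197
# 2200 2203 2205 2234 2238 2259 2280 2299
# 2401
```

The first pass never consumes anything to the left of a number (it uses \b, not \s), so every adjacent pair gets compared.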
|
|
Have you tried use re 'debug';
Yes. But it didn't help. It shows that matches do occur; but doesn't explain why the newlines are never inserted.
It seems your example data miss a digit at the end of each number.
My mistake trying to simplify the data. Two corrections are possible:
- Remove one \d from each side of the regex:
while( <DATA> ) {
s[\s(\d+)\d\K\s(?=(\d+)\d\s)]{
print "$1:$2";
$1 + 1 == $2 ? "\n" : ' '
}ge;
print;
}
- Add a digit to the end of each number in the data:
1051 1061 1071 1081 1091 1101 1111 1121 1131
11151 11161 11171 11181 11191 11201 11211 11221 11231
123451 123461 123471 123481 123491 123501 123511 123531
as you know what the regex should do
The idea of the regex is to match each pair of numbers in a line and capture the first d-2 digits of each number.
E.g. match the pair and capture the first two digits of each: (10)91 (11)01.
Then if $1+1 (10+1) == $2 (11), replace the space between them with a newline.
The \K prevents the number before the replaced space from being replaced, and the lookahead prevents the number after it from being replaced.
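A stripped-down illustration of those two points, with the numbers hard-coded so that only the mechanics show. mark_gap is a hypothetical name used for this sketch, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the space between 1091 and 1101.
sub mark_gap {
    my ($line) = @_;
    # "1091" has to match, but because it sits left of \K it is kept as-is.
    # "1101" is only checked by the lookahead and never becomes part of the match.
    # So the single whitespace character in between is all that gets replaced.
    $line =~ s/\b1091\K\s(?=1101\b)/\n/;
    return $line;
}

print mark_gap('1081 1091 1101 1111'), "\n";
# Prints:
# 1081 1091
# 1101 1111
```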
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Why doesn't this regex work? ( lookaround backtracking pos)
by Anonymous Monk on Aug 15, 2013 at 11:07 UTC
|
I think I've got it: there is one \s outside the lookaround, so it gets consumed (which advances pos), and the next match attempt can't go back before it, so it's only checking every other number.
Look at pos: in your version (\s outside (?=)) it jumps 7, 15, but with \s inside (?=) it goes 7, 11, 15.
#! perl -slw
use strict;
my $data = '105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353
';
{
my $other = @ARGV ? qr{(?=\s(\d+)\d\s)} : qr{\s(?=(\d+)\d\s)};
open my($DATA),'<',\$data;
my $what = "";
while( readline $DATA ) {
s{
\s(\d+)\d
\K
$other
}{
warn "WHAT($1)($2)POS(@{[pos()]})\n";
$1 + 1 == $2 ? ";\n" : '! '
}gex;
print;
}
print "\n$what";
}
__END__
$ perl junk
WHAT(10)(10)POS(7)
WHAT(10)(10)POS(15)
WHAT(11)(11)POS(23)
WHAT(11)(11)POS(31)
105 106! 107 108! 109 110! 111 112! 113
WHAT(111)(111)POS(9)
WHAT(111)(111)POS(19)
WHAT(112)(112)POS(29)
WHAT(112)(112)POS(39)
1115 1116! 1117 1118! 1119 1120! 1121 1122! 1123
WHAT(1234)(1234)POS(11)
WHAT(1234)(1234)POS(23)
WHAT(1235)(1235)POS(35)
12345 12346! 12347 12348! 12349 12350! 12351 12353
$ perl junk lookaround
WHAT(10)(10)POS(7)
WHAT(10)(10)POS(11)
WHAT(10)(10)POS(15)
WHAT(10)(11)POS(19)
WHAT(11)(11)POS(23)
WHAT(11)(11)POS(27)
WHAT(11)(11)POS(31)
105 106! 107! 108! 109;
110! 111! 112! 113
WHAT(111)(111)POS(9)
WHAT(111)(111)POS(14)
WHAT(111)(111)POS(19)
WHAT(111)(112)POS(24)
WHAT(112)(112)POS(29)
WHAT(112)(112)POS(34)
WHAT(112)(112)POS(39)
1115 1116! 1117! 1118! 1119;
1120! 1121! 1122! 1123
WHAT(1234)(1234)POS(11)
WHAT(1234)(1234)POS(17)
WHAT(1234)(1234)POS(23)
WHAT(1234)(1235)POS(29)
WHAT(1235)(1235)POS(35)
WHAT(1235)(1235)POS(41)
12345 12346! 12347! 12348! 12349;
12350! 12351! 12353
rxrx gave me the idea to pos it
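The same effect can be seen without the warn output by simply counting matches. This is a small sketch, not the code from the post: count_matches is a hypothetical helper, and the two patterns are the "\s outside" and "\s inside" variants used above, applied to the first data line only.

```perl
use strict;
use warnings;

# Hypothetical helper: how many times does $re match $line under /g?
sub count_matches {
    my ($re, $line) = @_;
    my $n = 0;
    $n++ while $line =~ /$re/g;
    return $n;
}

my $line = "105 106 107 108 109 110 111 112 113\n";

# Trailing \s is consumed: the next number loses its leading space.
my $outside = qr{\s(\d+)\d\K\s(?=(\d+)\d\s)};
# Trailing \s is only looked at: it is still there for the next attempt.
my $inside  = qr{\s(\d+)\d\K(?=\s(\d+)\d\s)};

printf "outside: %d, inside: %d\n",
    count_matches($outside, $line), count_matches($inside, $line);
# Prints: outside: 4, inside: 7
```

Four matches against seven lines up with the four and seven WHAT lines shown for the first data line above.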
|
|
there is one \s and it is outside of the lookaround ... so its checking every other number
BINGO! Thank you.
If I put the first space into a lookbehind: s[(?<=\s)(\d+)\d\K\s(?=(\d+)\d\s)]{
or substitute the zero-length \b, which serves the same purpose: s[\b(\d+)\d\K\s(?=(\d+)\d\s)]{
the substitution works as I wanted it to.
But damn, I could not see that for looking.
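As a runnable sketch of the \b variant on the three-digit sample line: the sub name split_on_rollover is invented for this example, and the line is assumed to end in a newline (as it does when read from DATA without chomp), since the lookahead needs whitespace after the last number.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the \b version of the substitution.
sub split_on_rollover {
    my ($line) = @_;
    # $1 and $2 are each number minus its last digit; a newline replaces
    # the space only where the second prefix is exactly one more than the first.
    $line =~ s[\b(\d+)\d\K\s(?=(\d+)\d\s)]{ $1 + 1 == $2 ? "\n" : ' ' }ge;
    return $line;
}

print split_on_rollover("105 106 107 108 109 110 111 112 113\n");
# Prints:
# 105 106 107 108 109
# 110 111 112 113
```

One difference worth noting: \b also lets the first number on a line act as the left-hand number of a pair, which (?<=\s) cannot do when the line has no leading whitespace.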
|
|
It took me a few dozen looks; I started to compose some funny (wrong) answers 3 or 4 times.
The experimental (??{ code }) feature looks kinda neat
perl -le " $_ = shift; s{(\d+)\d\s\K(?=(??{$1+1})\d\s)}{\n}g; print " "11 12 21 22 32 33 41 44"
11 12
21 22
32 33
41 44
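The one-liner above uses Windows-style shell quoting; here is roughly the same thing as a script, for shells where that quoting does not work. The sub name is hypothetical, and (??{ }) is still documented as experimental, so treat this as a sketch rather than a recommendation.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the (??{ }) substitution from the one-liner.
sub split_when_prefix_increments {
    my ($s) = @_;
    # (??{ $1 + 1 }) is evaluated during the match and its result is used
    # as a pattern, so the lookahead only succeeds when the next number
    # starts with the previous prefix plus one.
    $s =~ s{(\d+)\d\s\K(?=(??{ $1 + 1 })\d\s)}{\n}g;
    return $s;
}

print split_when_prefix_increments('11 12 21 22 32 33 41 44'), "\n";
```

Because \s sits before \K, the space stays on the end of each line and the newline is inserted after it.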
|
|
Re: Why doesn't this regex work?
by mtmcc (Hermit) on Aug 15, 2013 at 10:40 UTC
|
I've looked through it with Regexp::Debugger.
Maybe (probably) I'm misunderstanding the question, but in your sample data the hundreds groups within each line all seem to be the same, i.e. $1 + 1 never equals $2 anywhere in this data.
I think the second point is about the spaces. When the regex matches, it consumes the space after the first number of the pair, so the next number can no longer match the first part of the regex, which needs a leading space.
For example, this inserts a newline before the 200s start:
#!/usr/bin/perl
use strict;
use warnings;
#use Regexp::Debugger;
while( <DATA> ) {
s[\s(\d+)\d\d\K\s(?=(\d+)\d\d)]{
$1 + 1 == $2 ? "\n" : ' '
}ge;
print;
}
__DATA__
105 106 107 108 109 110 211 212 213
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353
But when you change the data to:
105 106 107 108 109 210 211 212 213
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353
the first part of the regex doesn't match the 109, because the space before it was already consumed by the previous match.
Apologies if I'm way off!
Re: Why doesn't this regex work?
by Athanasius (Archbishop) on Aug 15, 2013 at 10:40 UTC
|
#! perl
use strict;
use warnings;
while (<DATA>)
{
chomp;
s[\s(\d+)\d\K\s(?=(\d+)\d\s)]
{
printf("%s: (%s) (%s)\n", $_, $1, $2);
}ge;
}
__DATA__
105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353
gives:
20:32 >perl 687_SoPW.pl
105 106 107 108 109 110 111 112 113: (10) (10)
105 106 107 108 109 110 111 112 113: (10) (10)
105 106 107 108 109 110 111 112 113: (11) (11)
1115 1116 1117 1118 1119 1120 1121 1122 1123: (111) (111)
1115 1116 1117 1118 1119 1120 1121 1122 1123: (111) (111)
1115 1116 1117 1118 1119 1120 1121 1122 1123: (112) (112)
12345 12346 12347 12348 12349 12350 12351 12353: (1234) (1234)
12345 12346 12347 12348 12349 12350 12351 12353: (1234) (1234)
12345 12346 12347 12348 12349 12350 12351 12353: (1235) (1235)
20:32 >
which seems to show that each comparison is consuming three terms from the data line, instead of one. I don’t understand why this is, but I’m posting in the hope that it provides someone with a useful clue as to what is going on. :-)
Re: Why doesn't this regex work?
by boftx (Deacon) on Aug 15, 2013 at 10:47 UTC
|
Please understand that I am new to the Monastery. That said, why wouldn't something like this work if the data always has spaces between the numbers?
#!/usr/bin/perl
use strict;
while( <DATA> ) {
my @numbers = split;
my $half = int(@numbers/2);
my @first = splice(@numbers,0,$half);
print "@first\n";
print " @numbers\n";
}
__DATA__
105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353
The first snippet just cuts the line in half, but it could easily be adapted to print out lines of X number of values (with continuation rows indented.)
#!/usr/bin/perl
use strict;
my $maxitems = 3;
while( <DATA> ) {
my @numbers = split;
my $indent = '';
while ( @numbers > 0) {
my @rowdata = splice(@numbers,0,$maxitems);
print "${indent}@rowdata\n";
$indent = ' ';
}
}
__DATA__
105 106 107 108 109 110 111 112 113 116
1115 1116 1117 1118 1119 1120 1121 1122 1123 1125
12345 12346 12347 12348 12349 12350 12351 12353 12355
UPDATE: I understand that this doesn't answer the question posed (which is interesting in itself) but I have to wonder if that is the right approach to the task at hand to begin with.
Re: Why doesn't this regex work?
by Laurent_R (Canon) on Aug 15, 2013 at 10:16 UTC
|
In your data, the third (and fourth and fifth) digits from the right are the same. And this is what you are capturing. You will not get any match where the captured numbers are different.
Second thing, I think you should probably make your spaces optional if you do not have spaces at the beginning or at the end of your lines (or perhaps use a word boundary anchor).
With these two points in mind:
DB<8> $_ = "105 106 107 108 109 110 111 112 113 213 214";
DB<9> s[\s?(\d+)\d\d\K\s(?=(\d+)\d\d\s?)]{$1 + 1 == $2 ? "\n" : ' '}ge;
DB<10> x $_
0 '105 106 107 108 109 110 111 112 113
213 214'
DB<11>
I also think that this:
$1 + 1 == $2 ? "\n" : ' '
is going to fail on sequence jumps, e.g. on data looking like this:
$_ = "105 106 107 108 109 110 111 112 113 313 314";
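To make the sequence-jump point concrete, here is a sketch that runs the same substitution with two different tests. Both split_with and the "$1 != $2" alternative are my own assumptions for illustration, not something proposed in this thread; the pattern uses \b in place of the optional spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: $test decides whether a gap becomes a newline.
sub split_with {
    my ($test, $line) = @_;
    # $1 and $2 are each number minus its last two digits.
    $line =~ s[\b(\d+)\d\d\K\s(?=(\d+)\d\d\b)]{ $test->("$1", "$2") ? "\n" : ' ' }ge;
    return $line;
}

my $line = "105 106 107 108 109 110 111 112 113 313 314";

# The test from the post: only an increment of exactly one splits.
my $by_increment = split_with(sub { $_[0] + 1 == $_[1] }, $line);
# Assumed alternative: any change of prefix splits.
my $by_change    = split_with(sub { $_[0] != $_[1] }, $line);

print "increment test: ", ($by_increment eq $line ? "no split" : "split"), "\n";
print "change test:\n$by_change\n";
# Prints:
# increment test: no split
# change test:
# 105 106 107 108 109 110 111 112 113
# 313 314
```

Whether a jump from the 100s to the 300s should split at all depends on what the OP wants, so the second test is only one possible reading.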
Re: Why doesn't this regex work? (Solved!)
by AnomalousMonk (Archbishop) on Aug 15, 2013 at 14:34 UTC
|
As your original question seems to have been answered, here, FWIW, is an approach that is neither multi-pass nor dependent on /e replacement code execution or embedded code blocks as most others seem to be. I was a bit confused by the data output example in the OP, but I have set this up to insert newlines on transitions from one 100s group to the next.
>perl -wMstrict -le
"my @lines = (
'105 106 107 108 109 110 111 112 113 220 221 223',
'100 101 198 199 200 201 298 299 300 301 398 399 400 401',
'1115 1116 1117 1118 1119 1120 1121 1122 1123 1200 1201 1202',
'1100 1101 1102 1198 1199 1200 1201 1202 1298 1299 1300 1301',
'12345 12346 12347 12348 12349 12450 12451 12453 12466 12467',
'12300 12301 12398 12399 12400 12401 12498 12499 12500 12501',
);
print qq{'$_'} for @lines;
print '';
;;
s{ (\d) \d\d \K [^\n\S]+ (?! \d* \1 \d\d \b) }{\n}xmsg
for @lines;
print qq{'$_'} for @lines;
"
'105 106 107 108 109 110 111 112 113 220 221 223'
'100 101 198 199 200 201 298 299 300 301 398 399 400 401'
'1115 1116 1117 1118 1119 1120 1121 1122 1123 1200 1201 1202'
'1100 1101 1102 1198 1199 1200 1201 1202 1298 1299 1300 1301'
'12345 12346 12347 12348 12349 12450 12451 12453 12466 12467'
'12300 12301 12398 12399 12400 12401 12498 12499 12500 12501'
'105 106 107 108 109 110 111 112 113
220 221 223'
'100 101 198 199
200 201 298 299
300 301 398 399
400 401'
'1115 1116 1117 1118 1119 1120 1121 1122 1123
1200 1201 1202'
'1100 1101 1102 1198 1199
1200 1201 1202 1298 1299
1300 1301'
'12345 12346 12347 12348 12349
12450 12451 12453 12466 12467'
'12300 12301 12398 12399
12400 12401 12498 12499
12500 12501'
|
|
5.016003
         Rate anoNa anoNb  bukA  bukB  anoM
anoNa   461/s    --  -59%  -90%  -90%  -94%
anoNb  1126/s  144%    --  -76%  -76%  -85%
bukA   4662/s  910%  314%    --   -0%  -40%
bukB   4675/s  913%  315%    0%    --  -40%
anoM   7751/s 1580%  588%   66%   66%    --
FWIW the outputs aren't identical but they're close enough :)
|
|
Even the multi-pass solution I actually used to do the job probably only took 3 or 4 minutes, including the time it took to type the original and then retrieve, edit and re-run each of 5 passes.
Conversely, I must have spent an hour or two trying to figure out why my first attempt didn't work.