Re: How to capture backwards using regex?
by Corion (Patriarch) on Nov 10, 2024 at 11:34 UTC
|
You can set pos of the string to continue matching, and then capture the digits to the left and right of the continuation point (\G):
#!/usr/bin/perl
use 5.020;
use experimental 'signatures';
use Test::More;
sub digits( $S, $pos ) {
pos($S) = $pos;
return ($S =~ m/\b(\d*\G\d*)\b/c) ? $1 : undef;
}
is digits( 'Hello World 123456 !!! 789 ...', 14 ), '123456', "In middl
+e";
is digits( '12345 World 123456 !!! 789 ...', 14 ), '123456', "No digit
+s before";
is digits( '12345 World 123456 !!! 789 ...', 17 ), '123456', "At end";
is digits( '12345 World 123456 !!! 789 ...', 12 ), '123456', "At start
+";
done_testing;
| [reply] [d/l] [select] |
|
Wow, THANK YOU!!!
This looks like the simplest and most straight-forward solution. Thank you!
Now, I did notice that the \b word boundary modifier forces it work only if the number is preceded by a whitespace, and I don't want that behavior. And the "c" at the end causes a warning to appear. It says, "Use of /c modifier is meaningless without /g at C:\DESKTOP\test.pl line 11." The second \d* should be \d+ because for some reason if I use \d* it fails to capture the very last digit.
I wrote a modified version of this to see what's happening, and I think, this is precisely how I want it to work:
#!/usr/bin/perl
use strict;
use warnings;
my $S = " 45 ee33ee#555+'3210>{579}/8888\"~-099.2:";
for (my $i = 0; $i < length($S); $i++)
{
pos($S) = $i;
print("\n :", (($S =~ m/(\d*\G\d+)/) ? $1 : ''), ": \tptr=$i \ts
+ubstr=", substr($S, $i));
}
This program outputs the following:
:: ptr=0 substr= 45 ee33ee#555+'3210>{579}/8888"~-099.2
+:
:45: ptr=1 substr=45 ee33ee#555+'3210>{579}/8888"~-099.2:
:45: ptr=2 substr=5 ee33ee#555+'3210>{579}/8888"~-099.2:
:: ptr=3 substr= ee33ee#555+'3210>{579}/8888"~-099.2:
:: ptr=4 substr=ee33ee#555+'3210>{579}/8888"~-099.2:
:: ptr=5 substr=e33ee#555+'3210>{579}/8888"~-099.2:
:33: ptr=6 substr=33ee#555+'3210>{579}/8888"~-099.2:
:33: ptr=7 substr=3ee#555+'3210>{579}/8888"~-099.2:
:: ptr=8 substr=ee#555+'3210>{579}/8888"~-099.2:
:: ptr=9 substr=e#555+'3210>{579}/8888"~-099.2:
:: ptr=10 substr=#555+'3210>{579}/8888"~-099.2:
:555: ptr=11 substr=555+'3210>{579}/8888"~-099.2:
:555: ptr=12 substr=55+'3210>{579}/8888"~-099.2:
:555: ptr=13 substr=5+'3210>{579}/8888"~-099.2:
:: ptr=14 substr=+'3210>{579}/8888"~-099.2:
:: ptr=15 substr='3210>{579}/8888"~-099.2:
:3210: ptr=16 substr=3210>{579}/8888"~-099.2:
:3210: ptr=17 substr=210>{579}/8888"~-099.2:
:3210: ptr=18 substr=10>{579}/8888"~-099.2:
:3210: ptr=19 substr=0>{579}/8888"~-099.2:
:: ptr=20 substr=>{579}/8888"~-099.2:
:: ptr=21 substr={579}/8888"~-099.2:
:579: ptr=22 substr=579}/8888"~-099.2:
:579: ptr=23 substr=79}/8888"~-099.2:
:579: ptr=24 substr=9}/8888"~-099.2:
:: ptr=25 substr=}/8888"~-099.2:
:: ptr=26 substr=/8888"~-099.2:
:8888: ptr=27 substr=8888"~-099.2:
:8888: ptr=28 substr=888"~-099.2:
:8888: ptr=29 substr=88"~-099.2:
:8888: ptr=30 substr=8"~-099.2:
:: ptr=31 substr="~-099.2:
:: ptr=32 substr=~-099.2:
:: ptr=33 substr=-099.2:
:099: ptr=34 substr=099.2:
:099: ptr=35 substr=99.2:
:099: ptr=36 substr=9.2:
:: ptr=37 substr=.2:
:2: ptr=38 substr=2:
:: ptr=39 substr=:
| [reply] [d/l] [select] |
Re: How to capture backwards using regex?
by choroba (Cardinal) on Nov 10, 2024 at 16:56 UTC
|
Another possibility, passes all of LanX's tests and doesn't need 5.38 (still needs at least Perl 5 for look-ahead):
my ($after) = $str =~ /^.{$pos}(\d*)/;
my ($before) = length $after
? reverse($str) =~ /(?=(\d*))(?=.{$pos}$)/
: "";
my $match = reverse($before) . $after;
Update: I wasn't able to use Grimy's trick to solve the problem. Anyone? I tried hundreds of variations of the following:
$str =~ /^.{$pos}
( (?<= (?= \D | ^ | (?1) (\d\d) ) ) )
(\d+)
/x;
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
... use Grimy's trick to solve the problem.
Got some partial success.
One of the problems is to capture the content of the lookbehind.
As an assertion itself has zero length, capturing has to be done inside the assertion, but there is no part of the regex that contains the complete content.
So I had to collect the content character-wise and had to suppress non-digits somehow.
The final trouble is a failing "Nr at start" case.
I have no idea how to solve this one.
The rest of LanX's tests pass.
our $pre;
my $prefix;
my $match = $str =~ m{^.{$pos}
(?{local $pre = ''})
(
(?<=
(?= \D | ^ | (?1))
(.)
(?{my $m2 = $2; $pre .= $m2 if $m2 =~ m{\d}})
)
(?{$prefix = $pre})
)
(\d+)}x;
my $suffix = $3;
say $match ? $prefix . $suffix : '';
Note: This is not meant as a solution to the OP.
It's just an exercise.
Update:
Shifting the lookbehind by one position to the right solves the issue with "Nr at start".
Now passes all of LanX's tests.
my $match = $str =~ m{^.{$pos}
(?{local $pre = ''})
\d
(
(?<=
(?= \D | ^ | (?1))
(?: \D | (\d))
(?{$pre .= $2 if defined $2})
)
(?{$prefix = $pre})
)
(\d*)}x;
Update 2: added defined
Update 3: polished up
my ($start, $end);
my $res = $str =~ m{
^.{$pos}
\d
(
(?<=
(?= (?: \D | ^ ) (?{$start = pos}) | (?1))
.
)
)
(\d*)(?{$end = pos})
}x ? substr($str, $start, $end - $start) : undef ;
Greetings, 🐻
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
| [reply] [d/l] [select] |
|
> passes all of LanX's tests and doesn't need 5.38
Are you implying that 5.38 is needed for my code?
I didn't bother to check version dependencies...
Anyway I could think of other workarounds but the pos and \G solution is hard to beat.
| [reply] |
|
> Are you implying that 5.38 is needed for my code?
No, sorry, I misread what you wrote:
> Unfortunately with variable length this still seems experimental with 5.38 ...
In fact, 5.30 is enough to run it, see perlre:
Prior to Perl 5.30, it worked only for fixed-width lookbehind, but starting in that release, it can handle variable lengths from 1 to 255 characters as an experimental feature.
The (*pla:pattern) syntax is from 5.28, in earlier versions, the only possible syntax was (?<=).
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
Re: How to capture backwards using regex?
by tybalt89 (Monsignor) on Nov 11, 2024 at 22:18 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11162630
use warnings;
$SIG{__WARN__} = sub { die @_ };
use v5.12;
use Test::More;
say "Version: $]";
sub test {
my ($msg, $str, $exp) = @_;
my $pos = index($str, "*");
die "NO POINTER <$msg> in <$str>" if $pos == -1;
substr($str,$pos,1) = "";
pos($str) = $pos;
die "MULTIPLE POINTERS <$msg> in <$str>" if index($str,"*") >-1;
my ($match) = $str =~ /(?|(?=(\d+))\d*\G\d|(\z))/;
is ($match,$exp,$msg);
}
test "In the middle", 'World 12*3456 Hello 789 ' => "123456" ;
test "At the start" , 'World *123456 Hello 789 ' => "123456" ;
test "At the end" , 'World 12345*6 Hello 789 ' => "123456" ;
test "Before" , 'Wo*rld 123456 Hello 789 ' => "" ;
test "Right before" , 'World* 123456 Hello 789 ' => "" ;
test "After" , 'World 123456 *Hello 789 ' => "" ;
test "Right after" , 'World 123456* Hello 789 ' => "" ;
test "In next Nr" , 'World 123456 Hello 7*89 ' => "789" ;
test "Nr at start" , '*123456 Hello' => "123456" ;
test "Nr at end" , 'World 12345*6' => "123456" ;
test "Pos at end" , 'World 123456*' => "" ;
done_testing;
Outputs:
Version: 5.040000
ok 1 - In the middle
ok 2 - At the start
ok 3 - At the end
ok 4 - Before
ok 5 - Right before
ok 6 - After
ok 7 - Right after
ok 8 - In next Nr
ok 9 - Nr at start
ok 10 - Nr at end
ok 11 - Pos at end
1..11
| [reply] [d/l] [select] |
Re: How to capture backwards using regex?
by LanX (Saint) on Nov 10, 2024 at 11:33 UTC
|
It looks like you reinvented a poor man's pos()
Different approaches come to mind:
- The simple one is to decrement your "pointer" by the max length of a possible integer.
- Another, is to also capture the (.{$P}) prelude and to post-process any trailing digits inside
- A "pure" regex approach is to use a positive look-behind assertion, see perlretut#Looking-ahead-and-looking-behind
Update
I like Corion's solution best, in terms of pure regex solutions.
| [reply] [d/l] |
|
use v5.12;
use warnings;
use Test::More;
say "Version: $]";
sub test {
my ($msg, $str, $exp) = @_;
my $pos = index($str, "*");
die "NO POINTER <$msg> in <$str>" if $pos == -1;
substr($str,$pos,1) = "";
die "MULTIPLE POINTERS <$msg> in <$str>" if index($str,"*") >-1;
no warnings 'experimental'; # no better category available
+ ???
my $match = join "", ($str =~ m/^.{$pos}(*plb:(\d{0,254}))(\d+)/)
+;
is ($match,$exp,$msg);
}
test "In the middle", 'World 12*3456 Hello 789 ' => "123456" ;
test "At the start" , 'World *123456 Hello 789 ' => "123456" ;
test "At the end" , 'World 12345*6 Hello 789 ' => "123456" ;
test "Before" , 'Wo*rld 123456 Hello 789 ' => "" ;
test "Right before" , 'World* 123456 Hello 789 ' => "" ;
test "After" , 'World 123456 *Hello 789 ' => "" ;
test "Right after" , 'World 123456* Hello 789 ' => "" ;
test "In next Nr" , 'World 123456 Hello 7*89 ' => "789" ;
test "Nr at start" , '*123456 Hello' => "123456" ;
test "Nr at end" , 'World 12345*6' => "123456" ;
test "Pos at end" , 'World 123456*' => "" ;
done_testing;
Version: 5.038002
ok 1 - In the middle
ok 2 - At the start
ok 3 - At the end
ok 4 - Before
ok 5 - Right before
ok 6 - After
ok 7 - Right after
ok 8 - In next Nr
ok 9 - Nr at start
ok 10 - Nr at end
ok 11 - Pos at end
1..11
Remarks
Unfortunately with variable length this still seems experimental with 5.38 ...
- Variable length positive lookbehind with capturing is experimental in regex;
and I had to use a hard limit of 254 instead of *
- Lookbehind longer than 255 not implemented
Strangely enough I couldn't find a dedicated warnings category and had to disable all "experimental" warnings locally.
Updates
had to fix tests because of C&P errors leading to multiple pointers, fixed test() to catch those errors.
| [reply] [d/l] [select] |
|
Yeah, I've played around with positive look-behind regex, and it just didn't want to work. Of course, I'm using TinyPerl 5.8, so I'm a decade or two behind. lol But surprisingly, Corion's solution works flawlessly on TinyPerl 5.8, so that's great!
| [reply] |
|
Look behind used to only work with fixed length patterns. IIRC.
That's most likely also related to the "experimental" status of my code.
> But surprisingly, Corion's solution works flawlessly
Doesn't surprise me that's a very old feature, could be even Perl4.
The idea of Pos and related stuff
is very sed/awk-ish, and text processing was one of the main objectives of Perl4.
| [reply] |
Re: How to capture backwards using regex?
by Anonymous Monk on Nov 12, 2024 at 20:30 UTC
|
| [reply] |