harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a weird problem that I don't know how to solve. But it sounds like a very-very simple problem.

I have a very long string which has a list of big integers. And then I have a pointer which points to one of the digits of one of the integers. As long as the pointer points to the first digit of an integer, I can extract it using regex, but what if the pointer happens to point to the 3rd or 4th digit of the integer? Then I don't know how to capture the whole integer. Is there a way to capture backwards and forwards at the same time?


#!/usr/bin/perl use strict; use warnings; my $S = 'Hello World 123456 !!! 789 ...'; my $P = 14; # $P points to digit '3' # I want to capture 123456 but I can only capture 3456 with this regex +: my $INT = ($S =~ m/^.{$P}(\d+)/) ? $1 : ''; print "\n :$INT: \n";

Replies are listed 'Best First'.
Re: How to capture backwards using regex?
by Corion (Patriarch) on Nov 10, 2024 at 11:34 UTC

    You can set pos of the string to continue matching, and then capture the digits to the left and right of the continuation point (\G):

    #!/usr/bin/perl use 5.020; use experimental 'signatures'; use Test::More; sub digits( $S, $pos ) { pos($S) = $pos; return ($S =~ m/\b(\d*\G\d*)\b/c) ? $1 : undef; } is digits( 'Hello World 123456 !!! 789 ...', 14 ), '123456', "In middl +e"; is digits( '12345 World 123456 !!! 789 ...', 14 ), '123456', "No digit +s before"; is digits( '12345 World 123456 !!! 789 ...', 17 ), '123456', "At end"; is digits( '12345 World 123456 !!! 789 ...', 12 ), '123456', "At start +"; done_testing;
      Wow, THANK YOU!!!

      This looks like the simplest and most straight-forward solution. Thank you!

      Now, I did notice that the \b word boundary modifier forces it work only if the number is preceded by a whitespace, and I don't want that behavior. And the "c" at the end causes a warning to appear. It says, "Use of /c modifier is meaningless without /g at C:\DESKTOP\test.pl line 11." The second \d* should be \d+ because for some reason if I use \d* it fails to capture the very last digit.

      I wrote a modified version of this to see what's happening, and I think, this is precisely how I want it to work:


      #!/usr/bin/perl use strict; use warnings; my $S = " 45 ee33ee#555+'3210>{579}/8888\"~-099.2:"; for (my $i = 0; $i < length($S); $i++) { pos($S) = $i; print("\n :", (($S =~ m/(\d*\G\d+)/) ? $1 : ''), ": \tptr=$i \ts +ubstr=", substr($S, $i)); }

      This program outputs the following:

      :: ptr=0 substr= 45 ee33ee#555+'3210>{579}/8888"~-099.2 +: :45: ptr=1 substr=45 ee33ee#555+'3210>{579}/8888"~-099.2: :45: ptr=2 substr=5 ee33ee#555+'3210>{579}/8888"~-099.2: :: ptr=3 substr= ee33ee#555+'3210>{579}/8888"~-099.2: :: ptr=4 substr=ee33ee#555+'3210>{579}/8888"~-099.2: :: ptr=5 substr=e33ee#555+'3210>{579}/8888"~-099.2: :33: ptr=6 substr=33ee#555+'3210>{579}/8888"~-099.2: :33: ptr=7 substr=3ee#555+'3210>{579}/8888"~-099.2: :: ptr=8 substr=ee#555+'3210>{579}/8888"~-099.2: :: ptr=9 substr=e#555+'3210>{579}/8888"~-099.2: :: ptr=10 substr=#555+'3210>{579}/8888"~-099.2: :555: ptr=11 substr=555+'3210>{579}/8888"~-099.2: :555: ptr=12 substr=55+'3210>{579}/8888"~-099.2: :555: ptr=13 substr=5+'3210>{579}/8888"~-099.2: :: ptr=14 substr=+'3210>{579}/8888"~-099.2: :: ptr=15 substr='3210>{579}/8888"~-099.2: :3210: ptr=16 substr=3210>{579}/8888"~-099.2: :3210: ptr=17 substr=210>{579}/8888"~-099.2: :3210: ptr=18 substr=10>{579}/8888"~-099.2: :3210: ptr=19 substr=0>{579}/8888"~-099.2: :: ptr=20 substr=>{579}/8888"~-099.2: :: ptr=21 substr={579}/8888"~-099.2: :579: ptr=22 substr=579}/8888"~-099.2: :579: ptr=23 substr=79}/8888"~-099.2: :579: ptr=24 substr=9}/8888"~-099.2: :: ptr=25 substr=}/8888"~-099.2: :: ptr=26 substr=/8888"~-099.2: :8888: ptr=27 substr=8888"~-099.2: :8888: ptr=28 substr=888"~-099.2: :8888: ptr=29 substr=88"~-099.2: :8888: ptr=30 substr=8"~-099.2: :: ptr=31 substr="~-099.2: :: ptr=32 substr=~-099.2: :: ptr=33 substr=-099.2: :099: ptr=34 substr=099.2: :099: ptr=35 substr=99.2: :099: ptr=36 substr=9.2: :: ptr=37 substr=.2: :2: ptr=38 substr=2: :: ptr=39 substr=:
Re: How to capture backwards using regex?
by choroba (Cardinal) on Nov 10, 2024 at 16:56 UTC
    Another possibility, passes all of LanX's tests and doesn't need 5.38 (still needs at least Perl 5 for look-ahead):
    my ($after) = $str =~ /^.{$pos}(\d*)/; my ($before) = length $after ? reverse($str) =~ /(?=(\d*))(?=.{$pos}$)/ : ""; my $match = reverse($before) . $after;

    Update: I wasn't able to use Grimy's trick to solve the problem. Anyone? I tried hundreds of variations of the following:

    $str =~ /^.{$pos} ( (?<= (?= \D | ^ | (?1) (\d\d) ) ) ) (\d+) /x;

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      ... use Grimy's trick to solve the problem.

      Got some partial success. One of the problems is to capture the content of the lookbehind. As an assertion itself has zero length, capturing has to be done inside the assertion, but there is no part of the regex that contains the complete content. So I had to collect the content character-wise and had to suppress non-digits somehow. The final trouble is a failing "Nr at start" case. I have no idea how to solve this one. The rest of LanX's tests pass.

      our $pre; my $prefix; my $match = $str =~ m{^.{$pos} (?{local $pre = ''}) ( (?<= (?= \D | ^ | (?1)) (.) (?{my $m2 = $2; $pre .= $m2 if $m2 =~ m{\d}}) ) (?{$prefix = $pre}) ) (\d+)}x; my $suffix = $3; say $match ? $prefix . $suffix : '';

      Note: This is not meant as a solution to the OP. It's just an exercise.

      Update:

      Shifting the lookbehind by one position to the right solves the issue with "Nr at start". Now passes all of LanX's tests.

      my $match = $str =~ m{^.{$pos} (?{local $pre = ''}) \d ( (?<= (?= \D | ^ | (?1)) (?: \D | (\d)) (?{$pre .= $2 if defined $2}) ) (?{$prefix = $pre}) ) (\d*)}x;

      Update 2: added defined

      Update 3: polished up

      my ($start, $end); my $res = $str =~ m{ ^.{$pos} \d ( (?<= (?= (?: \D | ^ ) (?{$start = pos}) | (?1)) . ) ) (\d*)(?{$end = pos}) }x ? substr($str, $start, $end - $start) : undef ;

      Greetings,
      🐻

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
      > passes all of LanX's tests and doesn't need 5.38

      Are you implying that 5.38 is needed for my code?

      I didn't bother to check version dependencies...

      Anyway I could think of other workarounds but the pos and \G solution is hard to beat.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        > Are you implying that 5.38 is needed for my code?

        No, sorry, I misread what you wrote:

        > Unfortunately with variable length this still seems experimental with 5.38 ...

        In fact, 5.30 is enough to run it, see perlre:

        Prior to Perl 5.30, it worked only for fixed-width lookbehind, but starting in that release, it can handle variable lengths from 1 to 255 characters as an experimental feature.

        The (*pla:pattern) syntax is from 5.28, in earlier versions, the only possible syntax was (?<=).

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: How to capture backwards using regex?
by tybalt89 (Monsignor) on Nov 11, 2024 at 22:18 UTC

    TIMTOWTDI...

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11162630 use warnings; $SIG{__WARN__} = sub { die @_ }; use v5.12; use Test::More; say "Version: $]"; sub test { my ($msg, $str, $exp) = @_; my $pos = index($str, "*"); die "NO POINTER <$msg> in <$str>" if $pos == -1; substr($str,$pos,1) = ""; pos($str) = $pos; die "MULTIPLE POINTERS <$msg> in <$str>" if index($str,"*") >-1; my ($match) = $str =~ /(?|(?=(\d+))\d*\G\d|(\z))/; is ($match,$exp,$msg); } test "In the middle", 'World 12*3456 Hello 789 ' => "123456" ; test "At the start" , 'World *123456 Hello 789 ' => "123456" ; test "At the end" , 'World 12345*6 Hello 789 ' => "123456" ; test "Before" , 'Wo*rld 123456 Hello 789 ' => "" ; test "Right before" , 'World* 123456 Hello 789 ' => "" ; test "After" , 'World 123456 *Hello 789 ' => "" ; test "Right after" , 'World 123456* Hello 789 ' => "" ; test "In next Nr" , 'World 123456 Hello 7*89 ' => "789" ; test "Nr at start" , '*123456 Hello' => "123456" ; test "Nr at end" , 'World 12345*6' => "123456" ; test "Pos at end" , 'World 123456*' => "" ; done_testing;

    Outputs:

    Version: 5.040000 ok 1 - In the middle ok 2 - At the start ok 3 - At the end ok 4 - Before ok 5 - Right before ok 6 - After ok 7 - Right after ok 8 - In next Nr ok 9 - Nr at start ok 10 - Nr at end ok 11 - Pos at end 1..11
Re: How to capture backwards using regex?
by LanX (Saint) on Nov 10, 2024 at 11:33 UTC
    It looks like you reinvented a poor man's pos()

    Different approaches come to mind:

    • The simple one is to decrement your "pointer" by the max length of a possible integer.
    • Another, is to also capture the (.{$P}) prelude and to post-process any trailing digits inside
    • A "pure" regex approach is to use a positive look-behind assertion, see perlretut#Looking-ahead-and-looking-behind
    Update

    I like Corion's solution best, in terms of pure regex solutions.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      For completeness

      > A "pure" regex approach is to use a positive look-behind assertion

      use v5.12; use warnings; use Test::More; say "Version: $]"; sub test { my ($msg, $str, $exp) = @_; my $pos = index($str, "*"); die "NO POINTER <$msg> in <$str>" if $pos == -1; substr($str,$pos,1) = ""; die "MULTIPLE POINTERS <$msg> in <$str>" if index($str,"*") >-1; no warnings 'experimental'; # no better category available + ??? my $match = join "", ($str =~ m/^.{$pos}(*plb:(\d{0,254}))(\d+)/) +; is ($match,$exp,$msg); } test "In the middle", 'World 12*3456 Hello 789 ' => "123456" ; test "At the start" , 'World *123456 Hello 789 ' => "123456" ; test "At the end" , 'World 12345*6 Hello 789 ' => "123456" ; test "Before" , 'Wo*rld 123456 Hello 789 ' => "" ; test "Right before" , 'World* 123456 Hello 789 ' => "" ; test "After" , 'World 123456 *Hello 789 ' => "" ; test "Right after" , 'World 123456* Hello 789 ' => "" ; test "In next Nr" , 'World 123456 Hello 7*89 ' => "789" ; test "Nr at start" , '*123456 Hello' => "123456" ; test "Nr at end" , 'World 12345*6' => "123456" ; test "Pos at end" , 'World 123456*' => "" ; done_testing;

      Version: 5.038002 ok 1 - In the middle ok 2 - At the start ok 3 - At the end ok 4 - Before ok 5 - Right before ok 6 - After ok 7 - Right after ok 8 - In next Nr ok 9 - Nr at start ok 10 - Nr at end ok 11 - Pos at end 1..11

      Remarks

      Unfortunately with variable length this still seems experimental with 5.38 ...

      • Variable length positive lookbehind with capturing is experimental in regex;
      and I had to use a hard limit of 254 instead of *
      • Lookbehind longer than 255 not implemented
      Strangely enough I couldn't find a dedicated warnings category and had to disable all "experimental" warnings locally.

      Updates

      had to fix tests because of C&P errors leading to multiple pointers, fixed test() to catch those errors.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

      Yeah, I've played around with positive look-behind regex, and it just didn't want to work. Of course, I'm using TinyPerl 5.8, so I'm a decade or two behind. lol But surprisingly, Corion's solution works flawlessly on TinyPerl 5.8, so that's great!
        Look behind used to only work with fixed length patterns. IIRC.

        That's most likely also related to the "experimental" status of my code.

        > But surprisingly, Corion's solution works flawlessly

        Doesn't surprise me that's a very old feature, could be even Perl4.

        The idea of Pos and related stuff is very sed/awk-ish, and text processing was one of the main objectives of Perl4.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

Re: How to capture backwards using regex?
by Anonymous Monk on Nov 12, 2024 at 20:30 UTC