ecm has asked for the wisdom of the Perl Monks concerning the following question:

I have a fairly complex regexp consisting of 9 match groups, one of which has to match. I'm using copies of the exact same regexp in 4 different spots. All of them should match the same text. However, in one spot I want to access the match parameters as $1 to $9, while in the other spots only one parameter ($3) and $`, $&, and $' are used.

How do I abstract a regexp so I only need to define and update it in one spot, but still be able to access the capture match groups anyway?

I tried storing the regexp pattern in a string variable and then using that variable in the search, but I don't know how to correctly set the variable to a multi-line string (using a here document?) and use that for the pattern. Below is a dump of the working code, then the failing code and the error messages:

Working:
    while (not $second and $linking =~ /(\bINT\s?[0-9A-Fa-f]{2}[Hh]?
            (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)+
            (?:\"[^"]+\")?
            )
            |(\bINT\s?[0-9A-Fa-f]{2}[Hh]?
            (?:\"[^"]+\")?
            )
            |(\b(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]?
            (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)*
            (?:\"[^"]+\")?
            )
            |(\#[0-9A-Z][0-9]{4}\b)
            |(\bMEM\s?[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
            (?:\"[^"]+\")?
            )
            |(\bMEM\s?[0-9A-Fa-fXx]{1,8}[Hh]?
            (?:\"[^"]+\")?
            )
            |(\@[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
            (?:\"[^"]+\")?
            )
            |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?-[0-9A-Fa-fXx]{1,4}[Hh]?
            (?:\"[^"]+\")?
            )
            |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?
            (?:\"[^"]+\")?
            )
            /x) {
        my $intplus = $1;
        my $intonly = $2;
        my $regonly = $3;
        if (defined $regonly and not defined $int) {
            if (defined $link) {
                print_or_errorline("Entered reg only link without implicit INT");
                return;
            }
            $maskhighlight .= " " x (length($`) + length($&));
            $linking = $';
            next;
        }
        my $table = $4;
        my $mem_16_16 = $5;
        my $mem_32 = $6;
        my $call = $7;
        my $portrange = $8;
        my $portsingle = $9;
        $maskhighlight .= " " x length $`;

Failing:
    my $linkpattern = <<PATTERNEND;
    (\bINT\s?[0-9A-Fa-f]{2}[Hh]?
    (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)+
    (?:\"[^\"]+\")?
    )
    |(\bINT\s?[0-9A-Fa-f]{2}[Hh]?
    (?:\"[^\"]+\")?
    )
    |(\b(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]?
    (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)*
    (?:\"[^\"]+\")?
    )
    |(\#[0-9A-Z][0-9]{4}\b)
    |(\bMEM\s?[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:\"[^\"]+\")?
    )
    |(\bMEM\s?[0-9A-Fa-fXx]{1,8}[Hh]?
    (?:\"[^\"]+\")?
    )
    |(\@[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:\"[^\"]+\")?
    )
    |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?-[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:\"[^\"]+\")?
    )
    |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:\"[^\"]+\")?
    )
    PATTERNEND

    while (not $second and $linking =~ /$linkpattern/x) {
        my $intplus = $1;
        my $intonly = $2;
        my $regonly = $3;
        if (defined $regonly and not defined $int) {
            if (defined $link) {
                print_or_errorline("Entered reg only link without implicit INT");
                return;
            }
            $maskhighlight .= " " x (length($`) + length($&));
            $linking = $';
            next;
        }
        my $table = $4;
        my $mem_16_16 = $5;
        my $mem_32 = $6;
        my $call = $7;
        my $portrange = $8;
        my $portsingle = $9;
        $maskhighlight .= " " x length $`;

My program uses curses for a TUI, so I redirected stderr to a file to capture perl error messages:
    Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
    Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
    Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
    Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
    Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
    Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
    Unmatched ( in regex; marked by <-- HERE in m/ INTs?[0-9A-Fa-f]{2}[Hh]?
    (?:/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)+
    (?:"[^"]+")?
    )
    |INTs?[0-9A-Fa-f]{2}[Hh]?
    (?:"[^"]+")?
    )
    |(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]?
    (?:/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)*
    (?:"[^"]+")?
    )
    |(#[0-9A-Z][0-9]{4) <-- HERE
    |MEMs?[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:"[^"]+")?
    )
    |MEMs?[0-9A-Fa-fXx]{1,8}[Hh]?
    (?:"[^"]+")?
    )
    |(@[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:"[^"]+")?
    )
    |PORTs?[0-9A-Fa-fXx]{1,4}[Hh]?-[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:"[^"]+")?
    )
    |PORTs?[0-9A-Fa-fXx]{1,4}[Hh]?
    (?:"[^"]+")?
    )
    / at /home/[user]/proj/tractest/intlist.pl line 447, <$array_lstff[...]> line 197305.

Replies are listed 'Best First'.
Re: Reusing a complex regexp in multiple spots, escaping the regexp
by haukex (Archbishop) on Apr 12, 2026 at 20:05 UTC

    Use qr// to precompile a regex into a form that can be used instead of a regex on the right-hand side of =~ and also interpolated into other regexen. Use named capture groups and %+ instead of $1..$9. Use (DEFINE) to reduce repetition within a single regex.

    You didn't provide any sample input to really write a test program against, but here's a quick-and-dirty attempt to rewrite your regex that at least compiles. Quite a few more simplifications are probably possible; this is just to get you started.

    my $regex = qr{
        (?<intplus> (?&INT) (?&SOMETHING)+ (?&QUOT)? )
      | (?<intonly> (?&INT) (?&QUOT)? )
      | (?<regonly> \b(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]? (?&SOMETHING)* (?&QUOT)? )
      | (?<table> \#[0-9A-Z][0-9]{4}\b)
      | (?<mem_16_16> \bMEM\s?(?&HEX4):(?&HEX4) (?&QUOT)? )
      | (?<mem_32> \bMEM\s?[0-9A-Fa-fXx]{1,8}[Hh]? (?&QUOT)? )
      | (?<call> \@(?&HEX4):(?&HEX4) (?&QUOT)? )
      | (?<portrange> \bPORT\s?(?&HEX4)-(?&HEX4) (?&QUOT)? )
      | (?<portsingle> \bPORT\s?(?&HEX4) (?&QUOT)? )
      (?(DEFINE)
        (?<INT> \bINT\s?[0-9A-Fa-f]{2}[Hh]? )
        (?<HEX4> [0-9A-Fa-fXx]{1,4}[Hh]? )
        (?<QUOT> (?:\"[^"]+\") )
        (?<SOMETHING> \/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]? )
      )
    }x;

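    To show how a regex built this way might be consumed, here is a minimal sketch using a cut-down, two-branch pattern. The `$demo` regex, the `HEX2` group and the sample string are made up for illustration and are not taken from the original program. After each match, exactly one of the named alternatives is defined in %+:

```perl
use strict;
use warnings;

# Hypothetical cut-down pattern: two alternatives sharing one DEFINE'd piece.
my $demo = qr{
    (?<intonly> \bINT\s?(?&HEX2) )
  | (?<table>   \#[0-9A-Z][0-9]{4}\b )
  (?(DEFINE)
    (?<HEX2> [0-9A-Fa-f]{2}[Hh]? )
  )
}x;

my $text = 'see INT 21h and #01234 for details';   # made-up sample input
my @found;
while ($text =~ /$demo/g) {
    # Find which of the two named alternatives matched this time.
    my ($kind) = grep { defined $+{$_} } qw(intonly table);
    push @found, "$kind=$+{$kind}";
}
print join(', ', @found), "\n";
```

    The same `$demo` object can be used in every spot that needs the pattern, so the definition lives in one place only.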
      On a tangent, what is the benefit of (?(DEFINE) ...) constructs for repeated patterns here?

      Intuitively, I would have opted for interpolating nested $variables holding qr// snippets, especially since I can make them more readable with /x and can unit test them individually.

      But I'm curious to learn why you chose this way.

      Is it about the handling of capture groups?

      I tried to read the relevant docs, but they constantly mention recursion and I can't spot any here...

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

Re: Reusing a complex regexp in multiple spots, escaping the regexp
by ikegami (Patriarch) on Apr 12, 2026 at 20:11 UTC

    Use <<'PATTERNEND'. It doesn't allow/require escaping.

    my $pat = <<'PATTERNEND';
    \s
    PATTERNEND

    <<PATTERNEND, aka <<"PATTERNEND", uses the same quoting rules as double-quote literals, so you'd need to escape characters that are special in double-quote literals, which includes \.

    my $pat = <<PATTERNEND;
    \\s
    PATTERNEND

    my $pat = <<"PATTERNEND";
    \\s
    PATTERNEND
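    The difference can be made visible with a small experiment. This is a sketch with a one-line stand-in pattern, and the variable names are illustrative. It likely also explains the "Unmatched (" error in the question: in the double-quoted version `\#` collapses to a bare `#`, which starts a comment under /x and hides the closing parenthesis on that line.

```perl
use strict;
use warnings;

# Single-quoted heredoc: backslashes reach the regex engine untouched.
my $raw = <<'RAW';
(\#[0-9]{4}\b)
RAW

# Double-quoted heredoc: "\#" collapses to "#" and "\b" becomes a backspace
# character (0x08) before the regex engine ever sees the text.
my $cooked = <<"COOKED";
(\#[0-9]{4}\b)
COOKED

print "raw:    $raw";
(my $visible = $cooked) =~ s/\x08/<BS>/g;   # make the backspace visible
print "cooked: $visible";

# Compile both under /x; the eval traps the compile-time error of the second.
my $ok_raw    = eval { my $re = qr/$raw/x;    1 } ? 'compiles' : 'fails';
my $ok_cooked = eval { my $re = qr/$cooked/x; 1 } ? 'compiles' : 'fails';
print "raw $ok_raw, cooked $ok_cooked\n";
```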

    But qr// is better suited for regex patterns.

    my $re = qr/ \s /x;
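    As a rough sketch of how one qr// object could serve the different spots described in the question (the two-group `$link_re` and both helper subs are hypothetical stand-ins, not the original code), the numbered groups and $`, $&, $' remain available after matching against the compiled regex:

```perl
use strict;
use warnings;

# Defined once; hypothetical two-group stand-in for the real nine-group regex.
my $link_re = qr/
    (\bINT\s?[0-9A-Fa-f]{2}[Hh]?)      # group 1
  | (\bPORT\s?[0-9A-Fa-f]{1,4}[Hh]?)   # group 2
/x;

# Spot 1: look at the numbered groups.
sub classify {
    my ($text) = @_;
    return unless $text =~ $link_re;
    return defined $1 ? "int:$1" : "port:$2";
}

# Spot 2: only care about the match and the text around it.
sub split_around {
    my ($text) = @_;
    return unless $text =~ $link_re;
    my ($pre, $match, $post) = ($`, $&, $');   # copy before leaving the sub
    return ($pre, $match, $post);
}

print classify('write to PORT 3F8h now'), "\n";
print join('|', split_around('call INT 21h here')), "\n";
```

    Updating `$link_re` then changes the behaviour of every spot at once.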
Re: Reusing a complex regexp in multiple spots, escaping the regexp
by Fletch (Bishop) on Apr 12, 2026 at 23:38 UTC

    Not to directly address your problem, but given how much you're slinging at the regex engine, it almost feels like you're on the cusp of where you might want to look at Marpa::R2 or the like and use a parser and grammar rather than wrestling with 40-line regexen (unless you're Abigail, in which case the entire thing should be rewritten to run solely as a finite automaton inside the regex engine).

    Edit: Just to further explain, you've got nine different line types, I think you mentioned (I first misread that as eight). Especially if this is something longer-term or that might expand in scope, it's at the point where a proper parser feels right. If it's a one-off, never mind, of course.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      I think Marpa is overkill, one could just use interpolation of smaller regex snippets and named capture to make the whole regex far more readable.

      Disclaimer: the following "code" is not only untested, it's even AI-generated, and is meant as a conceptual demo.

      Personally, I would simplify it even further if I knew the full problem domain, and use better variable names and more /x and /i modifiers. (The handling of upper/lower case seems inconsistent.)

      On a side note: the various snippets are now much easier to test against the expected input, and these samples could be added to the documentation.

      I don't see how Marpa can add more to this.
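      As an illustration of the snippet-interpolation approach described in this reply (a minimal sketch written for this purpose, not the code originally posted; all variable names and sample strings are assumptions), small qr// pieces can be tested on their own and then combined under named captures. Because a qr// object interpolates as a group of its own, a quantifier such as `$quoted?` applies to the whole snippet; with plain strings an explicit `(?:...)` would be needed.

```perl
use strict;
use warnings;

# Small building blocks (all names here are illustrative).
my $hex4   = qr/ [0-9A-Fa-fXx]{1,4} [Hh]? /x;
my $quoted = qr/ "[^"]+" /x;

# Mid-level snippets built from the blocks above.
my $mem  = qr/ \bMEM  \s? $hex4 : $hex4 /x;
my $port = qr/ \bPORT \s? $hex4 (?: - $hex4 )? /x;

# The combined regex, defined in exactly one place.
my $link = qr/
    (?<mem>  $mem  $quoted? )
  | (?<port> $port $quoted? )
/x;

# Each snippet can be checked individually ...
print "hex4 ok\n" if '3F8h' =~ /\A$hex4\z/;

# ... and the combined regex reports which alternative matched.
for my $sample ('MEM 0040h:0017h', 'PORT 3F8h-3FFh"serial"') {
    if ($sample =~ $link) {
        my ($kind) = grep { defined $+{$_} } qw(mem port);
        print "$kind: $+{$kind}\n";
    }
}
```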

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

      Updates

    • Fixed AI nonsense (2 unrelated variables with the same name)
    • I'm pretty sure that this $quotes? won't work without grouping
        Thanks for your input! The use case is to find references to treat as hyperlinks in Ralf Brown's Interrupt List. The INT/MEM/PORT keywords intentionally are matched only in all-caps to have fewer false positives.