Reusing a complex regexp in multiple spots, escaping the regexp

ecm has asked for the wisdom of the Perl Monks concerning the following question:

I have a fairly complex regexp consisting of 9 match groups, one of which has to match. I'm using copies of the exact same regexp in 4 different spots, all of which should match the same text. However, in one spot I want to access the captures $1 through $9, while in the other spots only one capture ($3) and $`, $&, and $' are used.

How do I abstract a regexp so I only need to define and update it in one spot, but still be able to access the capture match groups anyway?

I tried storing the regexp pattern in a string variable and then using that variable in the search, but I don't know how to correctly set the variable to a multi-line string (using a here document?) and use that for the pattern. Below is a dump of the working code, followed by the failing code and the error messages:

Working:

            while (not $second and $linking =~
              /(\bINT\s?[0-9A-Fa-f]{2}[Hh]?
                (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)+
                (?:\"[^"]+\")?
               )
              |(\bINT\s?[0-9A-Fa-f]{2}[Hh]?
                (?:\"[^"]+\")?
               )
              |(\b(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]?
                (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)*
                (?:\"[^"]+\")?
               )
              |(\#[0-9A-Z][0-9]{4}\b)
              |(\bMEM\s?[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^"]+\")?
               )
              |(\bMEM\s?[0-9A-Fa-fXx]{1,8}[Hh]?
                (?:\"[^"]+\")?
               )
              |(\@[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^"]+\")?
               )
              |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?-[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^"]+\")?
               )
              |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^"]+\")?
               )
              /x) {
              my $intplus = $1;
              my $intonly = $2;
              my $regonly = $3;
              if (defined $regonly and not defined $int) {
                if (defined $link) {
                  print_or_errorline("Entered reg only link without implicit INT");
                  return;
                }
                $maskhighlight .= " " x (length($`) + length($&));
                $linking = $';
                next;
              }
              my $table = $4;
              my $mem_16_16 = $5;
              my $mem_32 = $6;
              my $call = $7;
              my $portrange = $8;
              my $portsingle = $9;
              $maskhighlight .= " " x length $`;

Failing:

            my $linkpattern = <<PATTERNEND;
              (\bINT\s?[0-9A-Fa-f]{2}[Hh]?
                (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)+
                (?:\"[^\"]+\")?
               )
              |(\bINT\s?[0-9A-Fa-f]{2}[Hh]?
                (?:\"[^\"]+\")?
               )
              |(\b(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]?
                (?:\/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)*
                (?:\"[^\"]+\")?
               )
              |(\#[0-9A-Z][0-9]{4}\b)
              |(\bMEM\s?[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^\"]+\")?
               )
              |(\bMEM\s?[0-9A-Fa-fXx]{1,8}[Hh]?
                (?:\"[^\"]+\")?
               )
              |(\@[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^\"]+\")?
               )
              |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?-[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^\"]+\")?
               )
              |(\bPORT\s?[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:\"[^\"]+\")?
               )
PATTERNEND
            while (not $second and $linking =~
              /$linkpattern/x) {
              my $intplus = $1;
              my $intonly = $2;
              my $regonly = $3;
              if (defined $regonly and not defined $int) {
                if (defined $link) {
                  print_or_errorline("Entered reg only link without implicit INT");
                  return;
                }
                $maskhighlight .= " " x (length($`) + length($&));
                $linking = $';
                next;
              }
              my $table = $4;
              my $mem_16_16 = $5;
              my $mem_32 = $6;
              my $call = $7;
              my $portrange = $8;
              my $portsingle = $9;
              $maskhighlight .= " " x length $`;

My program uses curses for a TUI, so I redirected stderr to a file to capture perl error messages:

Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
Unrecognized escape \s passed through at /home/[user]/proj/tractest/intlist.pl line 419.
Unmatched ( in regex; marked by <-- HERE in m/              INTs?[0-9A-Fa-f]{2}[Hh]?
                (?:/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)+
                (?:"[^"]+")?
               )
              |INTs?[0-9A-Fa-f]{2}[Hh]?
                (?:"[^"]+")?
               )
              |(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]?
                (?:/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]?)*
                (?:"[^"]+")?
               )
              |(#[0-9A-Z][0-9]{4)
               <-- HERE |MEMs?[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:"[^"]+")?
               )
              |MEMs?[0-9A-Fa-fXx]{1,8}[Hh]?
                (?:"[^"]+")?
               )
              |(@[0-9A-Fa-fXx]{1,4}[Hh]?:[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:"[^"]+")?
               )
              |PORTs?[0-9A-Fa-fXx]{1,4}[Hh]?-[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:"[^"]+")?
               )
              |PORTs?[0-9A-Fa-fXx]{1,4}[Hh]?
                (?:"[^"]+")?
               )
/ at /home/[user]/proj/tractest/intlist.pl line 447, <$array_lstff[...]> line 197305.

Re: Reusing a complex regexp in multiple spots, escaping the regexp by haukex (Archbishop) on Apr 12, 2026 at 20:05 UTC
Use `qr//` to precompile a regex into a form that can be used in place of a regex on the right-hand side of `=~` and also interpolated into other regexen. Use named capture groups and `%+` instead of `$1..$9`. Use `(DEFINE)` to reduce repetition within a single regex. You didn't provide any sample input to really write a test program against, but here's a quick-and-dirty attempt to rewrite your regex that at least compiles. Quite a few more simplifications are probably possible; this is just to get you started.

    my $regex = qr{
        (?<intplus> (?&INT) (?&SOMETHING)+ (?&QUOT)? )
      | (?<intonly> (?&INT) (?&QUOT)? )
      | (?<regonly> \b(?:E?A[XHL])=[0-9A-Fa-f]{2,}[Hh]? (?&SOMETHING)* (?&QUOT)? )
      | (?<table> \#[0-9A-Z][0-9]{4}\b)
      | (?<mem_16_16> \bMEM\s?(?&HEX4):(?&HEX4) (?&QUOT)? )
      | (?<mem_32> \bMEM\s?[0-9A-Fa-fXx]{1,8}[Hh]? (?&QUOT)? )
      | (?<call> \@(?&HEX4):(?&HEX4) (?&QUOT)? )
      | (?<portrange> \bPORT\s?(?&HEX4)-(?&HEX4) (?&QUOT)? )
      | (?<portsingle> \bPORT\s?(?&HEX4) (?&QUOT)? )
      (?(DEFINE)
        (?<INT> \bINT\s?[0-9A-Fa-f]{2}[Hh]? )
        (?<HEX4> [0-9A-Fa-fXx]{1,4}[Hh]? )
        (?<QUOT> (?:\"[^"]+\") )
        (?<SOMETHING> \/(?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=[0-9A-Fa-f]{2,}[Hh]? )
      )
    }x;
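
To show how one `qr//` object with named captures can serve several spots, here is a minimal sketch. It assumes a cut-down, hypothetical pattern with only two of the nine alternatives, and the names `$link_re`, `@found`, `$before` and `$after` are made up for illustration. The second spot uses `@-` and `@+` to recover the text around the match, which is one possible substitute for `` $` `` and `$'`.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for the full link pattern:
# defined once, reused below in two different places.
my $link_re = qr{
    (?<intonly> \bINT\s?[0-9A-Fa-f]{2}[Hh]? )
  | (?<table>   \#[0-9A-Z][0-9]{4}\b )
}x;

my $line = "See INT 21h and table #01234 for details";

# Spot 1: walk all matches and report which named group took part.
my @found;
while ($line =~ /$link_re/g) {
    my ($kind) = grep { defined $+{$_} } qw(intonly table);
    push @found, $kind . '=' . $+{$kind};
}
print join(', ', @found), "\n";

# Spot 2: same pattern, but only the text around the first match matters.
my ($before, $after);
if ($line =~ $link_re) {
    $before = substr($line, 0, $-[0]);   # text before the match
    $after  = substr($line, $+[0]);      # text after the match
    print "before=[$before] after=[$after]\n";
}
```

Because the pattern lives in one variable, changing an alternative only has to happen in one place, and the named groups keep their meaning even if alternatives are reordered.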
Re^2: Reusing a complex regexp in multiple spots, escaping the regexp by LanX (Saint) on Apr 12, 2026 at 22:40 UTC
On a tangent, what is the benefit of `(?(DEFINE) ...)` constructs for repeated patterns here? Intuitively, I would have opted for interpolating nested $variables holding `qr//` snippets, especially since I can make them more readable with `/x` and can unit test them individually. But I'm curious to learn why you chose this way. Is it about the handling of capture groups? I tried to read the relevant docs, but they constantly mention recursion and I can't spot any here...

Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery
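
For comparison, the interpolation approach described above might look roughly like the sketch below. The snippet names (`$hex4`, `$quot`, `$portrange`) and the sample strings are assumptions chosen for illustration; the sub-patterns are borrowed from the original question.

```perl
use strict;
use warnings;

# Each building block is its own qr// object (hypothetical names).
my $hex4 = qr/ [0-9A-Fa-fXx]{1,4} [Hh]? /x;
my $quot = qr/ "[^"]+" /x;

# Blocks interpolate into a larger pattern as self-contained groups.
my $portrange = qr/ \b PORT \s? $hex4 - $hex4 $quot? /x;

# Every block can be checked on its own ...
my $hex_ok  = "03F8h" =~ /\A $hex4 \z/x ? 1 : 0;
my $hex_bad = "12345" =~ /\A $hex4 \z/x ? 1 : 0;

# ... and the composed pattern against a complete reference.
my ($matched) = 'write to PORT 03F8h-03FFh"COM1" first' =~ /($portrange)/;

print "hex_ok=$hex_ok hex_bad=$hex_bad matched=$matched\n";
```

One visible difference from `(?(DEFINE) ...)` is that each snippet here is a separate Perl variable that exists outside the big regex, so it can be tested or reused in other patterns.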
Re: Reusing a complex regexp in multiple spots, escaping the regexp by ikegami (Patriarch) on Apr 12, 2026 at 20:11 UTC
Use `<<'PATTERNEND'`. It doesn't allow/require escaping.

    my $pat = <<'PATTERNEND';
    \s
    PATTERNEND

`<<PATTERNEND`, aka `<<"PATTERNEND"`, uses the same quoting rules as double-quote literals, so you'd need to escape characters that are special in double-quote literals, which includes `\`.

    my $pat = <<PATTERNEND;
    \\s
    PATTERNEND

    my $pat = <<"PATTERNEND";
    \\s
    PATTERNEND

But `qr//` is better suited for regex patterns.

    my $re = qr/ \s /x;
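
A small sketch of the difference, using a short made-up fragment rather than the full pattern: in the interpolating heredoc, `\b` turns into a backspace character, `\s` into a plain `s`, and `\#` into a bare `#`, which under `/x` starts a comment. The variable names are hypothetical, and the `no warnings 'misc'` line is only there to keep the demo quiet.

```perl
use strict;
use warnings;
no warnings 'misc';   # suppress "Unrecognized escape \s passed through" for this demo

# Interpolating heredoc: backslash escapes are processed as in "...".
my $interp = <<PATTERNEND;
\bINT\s?\#
PATTERNEND

# Single-quoted heredoc: the text is kept exactly as written.
my $literal = <<'PATTERNEND';
\bINT\s?\#
PATTERNEND

chomp($interp, $literal);

printf "interpolated: %d chars, literal: %d chars\n",
    length($interp), length($literal);

# Only the literal version still behaves like the intended regex.
my $text = 'INT #';
my $literal_hit = $text =~ /$literal/x ? 1 : 0;
my $interp_hit  = $text =~ /$interp/x  ? 1 : 0;
print "literal matches: $literal_hit, interpolated matches: $interp_hit\n";
```

The shorter length of the interpolated string reflects the backslashes that were consumed before the regex engine ever saw the pattern.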
Re: Reusing a complex regexp in multiple spots, escaping the regexp by Fletch (Bishop) on Apr 12, 2026 at 23:38 UTC
Not to directly address your problem, but given how much you're slinging at the regex engine, it almost feels like you're on the cusp of where you might want to look at maybe Marpa::R2 or the like and use a parser and grammar rather than wrestling with 40-line regexen (unless you're Abigail, in which case the entire thing should be rewritten to run solely as a finite automaton inside the regex engine).

Edit: Just to further explain, you've got ~~eight~~ nine (misread) different line types, I think you mentioned. Especially if this is something longer term or that might expand in scope, it's at the point where a proper parser feels right. If it's a one-off, never mind of course.

The cake is a lie.
The cake is a lie.
The cake is a lie.
Re^2: Reusing a complex regexp in multiple spots, escaping the regexp by LanX (Saint) on Apr 13, 2026 at 00:40 UTC
I think Marpa is overkill; one could just use interpolation of smaller regex snippets and named captures to make the whole regex far more readable.

Disclaimer: the following "code" is not only untested, it's even AI generated, and is meant as a conceptual demo. Personally I would simplify it even further if I knew the full problem domain, and use better variable names and more /x and /i modifiers. (The handling of upper/lower case seems inconsistent.)

On a side note: the various snippets are now much easier to test against the expected input, and these samples could be added to the documentation. I don't see how Marpa can add more to this.

Read more... (4 kB)

Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery

Updates
- Fixed AI nonsense (2 unrelated variables with the same name)
- I'm pretty sure that this `$quotes?` won't work without grouping
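
Whether something like `$quotes?` needs explicit grouping depends on what the variable holds. The collapsed code is not shown here, so the sketch below only illustrates the general behaviour with hypothetical names: a plain string is pasted in as raw text, so the `?` binds to its last character, while a `qr//` object is interpolated as a self-contained group.

```perl
use strict;
use warnings;

my $quot_str = '"[^"]+"';     # plain string: interpolated as raw text
my $quot_re  = qr/"[^"]+"/;   # qr// object: interpolated as a group

my $plain  = 'PORT 60';
my $quoted = 'PORT 60"kbd"';

# With the string, only the closing quote is optional, so input
# without any quoted part fails to match.
my $str_plain  = $plain  =~ /\A PORT\s\d+ $quot_str? \z/x ? 1 : 0;
my $str_quoted = $quoted =~ /\A PORT\s\d+ $quot_str? \z/x ? 1 : 0;

# With the qr// object, the whole quoted part is optional.
my $re_plain   = $plain  =~ /\A PORT\s\d+ $quot_re? \z/x ? 1 : 0;
my $re_quoted  = $quoted =~ /\A PORT\s\d+ $quot_re? \z/x ? 1 : 0;

print "string: plain=$str_plain quoted=$str_quoted\n";
print "qr//:   plain=$re_plain quoted=$re_quoted\n";
```

Wrapping the interpolation as `(?:$quot_str)?` would make the string variant behave like the `qr//` one.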
Re^3: Reusing a complex regexp in multiple spots, escaping the regexp by ecm (Initiate) on Apr 13, 2026 at 12:11 UTC
Thanks for your input! The use case is to find references to treat as hyperlinks in Ralf Brown's Interrupt List. The INT/MEM/PORT keywords are intentionally matched only in all-caps to have fewer false positives.
Re^4: Reusing a complex regexp in multiple spots, escaping the regexp by LanX (Saint) on Apr 13, 2026 at 13:30 UTC

Updates