in reply to Re: Reusing a complex regexp in multiple spots, escaping the regexp
in thread Reusing a complex regexp in multiple spots, escaping the regexp

I think Marpa is overkill; one could just interpolate smaller regex snippets and use named captures to make the whole regex far more readable.

Disclaimer: the following "code" is not only untested, it's also AI-generated, and is meant as a conceptual demo.

Personally, I would simplify it even further if I knew the full problem domain, with better variable names and more /x and /i modifiers. (The handling of upper/lower case seems inconsistent.)

On a side note: the various snippets are now much easier to test against the expected input, and these samples could be added to the documentation.
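A minimal sketch of what such a per-snippet test could look like. The sample table is hypothetical (it is not taken from the Interrupt List), and only the $int snippet and its two building blocks are copied from the demo:

```perl
use strict;
use warnings;

# Building blocks copied from the demo so this sketch is self-contained.
my $hex2  = qr/[0-9A-Fa-f]{2}/;
my $opt_h = qr/[Hh]?/;
my $int   = qr/\bINT\s?$hex2$opt_h/;

# Hypothetical samples: input => should the snippet match the whole string?
my @samples = (
    [ 'INT 21h' => 1 ],
    [ 'INT 2F'  => 1 ],
    [ 'INT21'   => 1 ],
    [ 'int 21h' => 0 ],    # the keyword is matched in upper case only
    [ 'INT 2'   => 0 ],    # two hex digits are required
);

my $failures = 0;
for my $sample (@samples) {
    my ( $input, $expected ) = @$sample;
    my $got = $input =~ /\A$int\z/ ? 1 : 0;
    $failures++ if $got != $expected;
    printf "%-8s %s\n", $input, $got == $expected ? 'ok' : 'NOT ok';
}
print "failures: $failures\n";
```

The same table could later be pasted into the documentation as a list of accepted and rejected inputs.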

I don't see how Marpa can add more to this.

    # -------------------------------------------------
    # 1. Basic building blocks
    # -------------------------------------------------
    my $hex2   = qr/[0-9A-Fa-f]{2}/;      # exactly 2 hex digits
    my $hex2p  = qr/[0-9A-Fa-f]{2,}/;     # 2 or more hex digits
    my $hexXp  = qr/[0-9A-Fa-fXx]{1,4}/;  # 1-4 hex/X digits
    my $hexXp8 = qr/[0-9A-Fa-fXx]{1,8}/;  # 1-8 hex/X digits
    my $opt_h  = qr/[Hh]?/;               # optional trailing H/h
    my $quotes = qr/"[^"]*"/;             # optional quoted literal

    # -------------------------------------------------
    # 2. Reusable sub-patterns
    # -------------------------------------------------
    my $int      = qr/\bINT\s?$hex2$opt_h/;                  # INT <byte>
    my $mem_addr = qr/\bMEM\s?$hexXp:$hexXp$opt_h/;          # MEM <addr>:<addr>
    my $mem_long = qr/\bMEM\s?$hexXp8$opt_h/;                # MEM <addr>
    my $port_rng = qr/\bPORT\s?$hexXp$opt_h\-$hexXp$opt_h/;  # PORT <range>
    my $port_one = qr/\bPORT\s?$hexXp$opt_h/;                # PORT <single>
    my $hash_ref = qr/\#[0-9A-Z][0-9]{4}\b/;                 # #<letter><4 digits>
    my $ea_xhl   = qr/\b(?:E?A[XHL])=$hex2p$opt_h/;          # EA/XHL assignment
    my $reglist  = qr/
        (?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=$hex2p$opt_h
    /x;                                   # a single register entry
    my $opt_reglist = qr/
        (?:\/$reglist)+
    /x;                                   # one or more "/<reg>" entries

    # -------------------------------------------------
    # 3. Full pattern with named captures
    # -------------------------------------------------
    my $linking_re = qr/
        # 1. INT with optional register list and optional quoted literal
        (?<int_full>    $int $opt_reglist? $quotes? )
      |
        # 2. INT with only optional quoted literal
        (?<int_simple>  $int $quotes? )
      |
        # 3. EA\/XHL with repeated register list and optional quoted literal
        (?<ea_xhl_full> $ea_xhl (?:\/$reglist)* $quotes? )
      |
        # 4. Hash reference (e.g. #A1234)
        (?<hash_ref>    $hash_ref )
      |
        # 5. MEM range (addr:addr) with optional quoted literal
        (?<mem_range>   $mem_addr $quotes? )
      |
        # 6. MEM single address with optional quoted literal
        (?<mem_simple>  $mem_long $quotes? )
      |
        # 7. @-reference (addr:addr) with optional quoted literal
        (?<at_ref>      \@$hexXp:$hexXp$opt_h $quotes? )
      |
        # 8. PORT range with optional quoted literal
        (?<port_range>  $port_rng $quotes? )
      |
        # 9. PORT single value with optional quoted literal
        (?<port_simple> $port_one $quotes? )
    /x;

    # -------------------------------------------------
    # 4. Usage example
    # -------------------------------------------------
    if ( $linking =~ $linking_re ) {
        my %cap = %+;    # hash of all named captures
        if ( $cap{int_full} ) {
            # handle INT with register list
        }
        elsif ( $cap{mem_range} ) {
            # handle MEM range
        }
        # …other branches as needed
    }

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Updates

  • Fixed AI nonsense (2 unrelated variables with the same name)
  • I'm pretty sure that this $quotes? won't work without grouping
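A quick check suggests the grouping concern applies to snippets held in plain strings rather than to qr// objects: a qr// object interpolates with its own non-capturing wrapper, so a trailing ? covers the whole snippet. The sketch below contrasts the two cases; the variable names $quotes_qr and $quotes_str and the sample inputs are made up for illustration:

```perl
use strict;
use warnings;

my $quotes_qr  = qr/"[^"]*"/;    # compiled snippet, as in the demo
my $quotes_str = '"[^"]*"';      # plain string, for contrast

# Shows the wrapper that a qr// object carries when interpolated.
print "$quotes_qr\n";

my $with_qr  = qr/\AINT 21$quotes_qr?\z/;     # ? applies to the whole snippet
my $with_str = qr/\AINT 21$quotes_str?\z/;    # ? applies to the last quote only

for my $input ( 'INT 21', 'INT 21"DOS"' ) {
    printf "%-12s qr: %-3s string: %s\n",
        $input,
        ( $input =~ $with_qr  ? 'yes' : 'no' ),
        ( $input =~ $with_str ? 'yes' : 'no' );
}
```

Writing (?:$quotes)? explicitly still has the advantage of working the same way for both kinds of snippet.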
    Re^3: Reusing a complex regexp in multiple spots, escaping the regexp
    by ecm (Novice) on Apr 13, 2026 at 12:11 UTC
      Thanks for your input! The use case is to find references to treat as hyperlinks in Ralf Brown's Interrupt List. The INT/MEM/PORT keywords intentionally are matched only in all-caps to have fewer false positives.
        I took a quick look into Ralf Brown's Interrupt List and the material is far too vast to be easily understood.

        Please consider providing an SSCCE with sample input if you need further help.

        Please be aware that the AI-generated code is far from being correct or the best approach. (I only tossed two prompts at it to generate that output.)

        Hints for improvement

        • the Xs in your hex character class look wrong; I suppose they are only allowed at the beginning, as in xFF
        • you can define your own named character classes
        • qr snippets can have individual modifiers like qr//xi; this can improve readability a lot
        • I'd enforce a clear naming convention for derived snippets with quantifiers, such as either a trailing _opt or a leading opt_ for ?
        • if you want to put your quantifiers after the snippet, it might be best to always put the snippet explicitly into a non-capturing group, like (?:$quote)?
        • you can also use named captures inside sub-terms, and only the matching ones will show; that's far better than relying on counts like $5
        • you should consider a catch-all clause like (?<unknown>.*) at the end of your branches, to catch errors or missing implementations (e.g. a weird number where you expected a hex)
        • you can embed Perl code into your snippets for debugging the current state with (?{...})
        • last but not least: instead of composing one giant regex, you can also match smaller parts one at a time and keep the intermediate logic in Perl, using the /c continue-modifier (together with /g and \G)
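Several of these hints can be combined in one small scanner. This is only a sketch under stated assumptions: the sample line, the capture names and the shape of the fallback branch are hypothetical, and only two keywords are handled.

```perl
use strict;
use warnings;

my $HEX = qr/[0-9A-F]/i;

# Hypothetical sample line; the capture names below are made up for this sketch.
my $line = 'see INT 21h and PORT 3F8h, also INT zz';

my @tokens;
pos($line) = 0;
while ( pos($line) < length($line) ) {
    # /g with \G continues at the current position; /c keeps that
    # position when a branch fails, so the next branch starts there too.
    if ( $line =~ /\G \b INT \s? (?<int> (?:$HEX){2} ) h? /gcx ) {
        push @tokens, 'int=' . $+{int};
    }
    elsif ( $line =~ /\G \b PORT \s? (?<port> (?:$HEX){1,4} ) h? /gcx ) {
        push @tokens, 'port=' . $+{port};
    }
    elsif ( $line =~ /\G \b (?<unknown> (?:INT|PORT) \s? \S* ) /gcx ) {
        # Catch-all: a known keyword followed by something unexpected.
        push @tokens, 'unknown=' . $+{unknown};
    }
    else {
        # No reference starts here: skip one character and try again.
        $line =~ /\G ./gcxs;
    }
}
print "$_\n" for @tokens;
```

The fallback here is deliberately narrower than (?<unknown>.*): it only fires after a known keyword, so lines without any reference still produce no tokens.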
        Hope this helps :)

        Cheers Rolf

          > I took a quick look into Ralf Brown's Interrupt List and the material is far too vast to be easily understood.

          Yes, it is not for the faint of heart.

          > Please consider providing an SSCCE with sample input if you need further help.

          The answers so far are already very satisfying.

          > the Xs in your hex character class look wrong; I suppose they are only allowed at the beginning, as in xFF

          No, some MEM and @ (CALL) references do use literal Xs in place of hex digits, for example "MEM xxxxh:xxx0h - Multiprocessor Specification - FLOATING POINTER STRUCTURE" or "CALL xxxxh:xxxxh - Alternate Multiplex Interrupt Specification TSRs".

          > you should consider a catch-all clause like (?<unknown>.*) at the end of your branches, to catch errors or missing implementations (e.g. a weird number where you expected a hex)

          I don't understand this. I don't want to match anything if there's no hyperlink reference in a given line. (Actually in one case I do want to know nothing was matched but it's not a problem, I just do last MATCHLINK before and if the code afterwards is run after having entered a single "link" item, I know nothing was matched.)

          Thanks!

    Re^3: Reusing a complex regexp in multiple spots, escaping the regexp
    by ecm (Novice) on Apr 13, 2026 at 19:56 UTC

      I used your suggestions as the base for my change. The occasional H was missing, and I added some /i flags to the numeric patterns. And renamed to keep things closer to what I had/want.

      I was surprised that I had to escape the slashes in a comment of the big regexp, else they were misdetected as ending the regexp.

        > I added some /i flags to the numeric patterns.

        Please note that you can also use qr//x to make the sub-terms more readable.

        > I was surprised that I had to escape the slashes

        Well, quote-like operators let you choose the delimiter freely, like qr~~ or qr{}.
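A small sketch of the delimiter point, with braces instead of slashes. The snippet names $reg and $int_with_regs and the sample input are assumptions made for illustration, not the code from the thread:

```perl
use strict;
use warnings;

my $hex2 = qr/[0-9A-F]{2}/i;
my $reg  = qr{ (?: E?[ABCD][XHL] | E?[SD]I ) = $hex2 h? }x;

# With braces as delimiters, "/" is an ordinary character,
# both in the pattern itself and in the /x comments.
my $int_with_regs = qr{
    \b INT \s? $hex2 h?     # INT <byte>
    (?: / $reg )*           # zero or more "/<reg>=<value>" entries
}x;

if ( 'INT 21h/AH=4Ch' =~ /($int_with_regs)/ ) {
    print "matched: $1\n";
}
```

Two caveats apply to any delimiter: the closing delimiter is searched for before /x comments are recognised, so braces inside a comment of a qr{} pattern must be balanced, and variables written inside such comments are still interpolated.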

        > I used your suggestions

        Well ... mainly AI suggestions ;-)

        Some were good, others not to my taste.

        TIMTOWTDI ...

        Here's my take:

            {
                my $HEX   = qr/ [0-9A-F] /xi;
                my $HEX_X = qr/ [0-9A-FX] /xi;
                my $H_opt = qr/ [h]? /xi;

                my $INT = qr/ \b INT \s? $HEX{2} [hH]? /x;    # is longer $H_opt better ???

                our $hyperlinkpattern = qr~ ... $INT ... ~x;
            }

        Please note how the helper variables are restricted to the scope and how $HEX{2} is NOT interpreted as a hash-lookup (surprised me!).
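Both observations can be checked with a short script. This is a sketch only: $INT_GROUP and $HEXDIGIT are hypothetical names added for comparison, and the explicit (?:...) group is shown as an alternative that does not depend on how Perl guesses the meaning of the braces.

```perl
use strict;
use warnings;

our $INT;    # only the final pattern is visible outside the block
{
    my $HEX = qr/ [0-9A-F] /xi;

    # Under "use strict" this would not compile if the braces were taken
    # as a hash subscript, because no hash named %HEX is declared.
    $INT = qr/ \b INT \s? $HEX{2} [hH]? /x;
}

# An explicit group expresses the same thing without relying on that guess.
my $HEXDIGIT  = qr/ [0-9A-F] /xi;
my $INT_GROUP = qr/ \b INT \s? (?:$HEXDIGIT){2} [hH]? /x;

for my $input ( 'INT 21h', 'INT 2' ) {
    printf "%-8s %-3s %s\n", $input,
        ( $input =~ $INT       ? 'yes' : 'no' ),
        ( $input =~ $INT_GROUP ? 'yes' : 'no' );
}
```

Referring to $HEX after the closing brace of the block would be a compile-time error under strict, which is what keeps the helpers private.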

        I hope my suggestions help you get more maintainable code :-)

        There are many more improvements that come to mind, but I'm prone to over-engineering and you're in a better position to decide what works best for you =)

        Cheers Rolf