in thread Reusing a complex regexp in multiple spots, escaping the regexp

I think Marpa is overkill. One could just interpolate smaller regex snippets and use named captures to make the whole regex far more readable.

Disclaimer: the following "code" is untested and AI-generated, and is meant only as a conceptual demo.

Personally, I would simplify it even further if I knew the full problem domain, with better variable names and more /x and /i modifiers. (The handling of upper and lower case seems inconsistent.)

On a side note: the individual snippets are now much easier to test against the expected input, and those samples could be added to the documentation.

I don't see how Marpa can add more to this.

# -------------------------------------------------
# 1. Basic building blocks
# -------------------------------------------------
my $hex2   = qr/[0-9A-Fa-f]{2}/;       # exactly 2 hex digits
my $hex2p  = qr/[0-9A-Fa-f]{2,}/;      # 2 or more hex digits
my $hexXp  = qr/[0-9A-Fa-fXx]{1,4}/;   # 1-4 hex/X digits
my $hexXp8 = qr/[0-9A-Fa-fXx]{1,8}/;   # 1-8 hex/X digits
my $opt_h  = qr/[Hh]?/;                # optional trailing H/h
my $quotes = qr/"[^"]*"/;              # quoted literal (made optional where used)

# -------------------------------------------------
# 2. Re-usable sub-patterns
# -------------------------------------------------
my $int      = qr/\bINT\s?$hex2$opt_h/;                  # INT <byte>
my $mem_addr = qr/\bMEM\s?$hexXp:$hexXp$opt_h/;          # MEM <addr>:<addr>
my $mem_long = qr/\bMEM\s?$hexXp8$opt_h/;                # MEM <addr>
my $port_rng = qr/\bPORT\s?$hexXp$opt_h\-$hexXp$opt_h/;  # PORT <range>
my $port_one = qr/\bPORT\s?$hexXp$opt_h/;                # PORT <single>
my $hash_ref = qr/\#[0-9A-Z][0-9]{4}\b/;                 # #<letter><4 digits>
my $ea_xhl   = qr/\b(?:E?A[XHL])=$hex2p$opt_h/;          # EA/XHL assignment
my $reglist  = qr/
    (?:E?[ABCD][XHL]|E?[SD]I|E?[SB]P|[DESC]S)=$hex2p$opt_h
/x;                                                      # a single register entry
my $opt_reglist = qr/ (?:\/$reglist)+ /x;                # one or more "/<reg>" entries

# -------------------------------------------------
# 3. Full pattern with named captures
# -------------------------------------------------
# Delimited with qr{...} instead of qr/.../ because the comments inside
# contain a "/", which would otherwise end the pattern early.
my $linking_re = qr{
    # 1. INT with optional register list and optional quoted literal
    (?<int_full>    $int $opt_reglist? $quotes? )
  |
    # 2. INT with only optional quoted literal
    (?<int_simple>  $int $quotes? )
  |
    # 3. EA/XHL with repeated register list and optional quoted literal
    (?<ea_xhl_full> $ea_xhl (?:\/$reglist)* $quotes? )
  |
    # 4. Hash reference (e.g. #A1234)
    (?<hash_ref>    $hash_ref )
  |
    # 5. MEM range (addr:addr) with optional quoted literal
    (?<mem_range>   $mem_addr $quotes? )
  |
    # 6. MEM single address with optional quoted literal
    (?<mem_simple>  $mem_long $quotes? )
  |
    # 7. @-reference (addr:addr) with optional quoted literal
    (?<at_ref>      \@$hexXp:$hexXp$opt_h $quotes? )
  |
    # 8. PORT range with optional quoted literal
    (?<port_range>  $port_rng $quotes? )
  |
    # 9. PORT single value with optional quoted literal
    (?<port_simple> $port_one $quotes? )
}x;

# -------------------------------------------------
# 4. Usage example
# -------------------------------------------------
if ( $linking =~ $linking_re ) {
    my %cap = %+;    # hash of all named captures
    if ( $cap{int_full} ) {
        # handle INT with register list
    }
    elsif ( $cap{mem_range} ) {
        # handle MEM range
    }
    # ...other branches as needed
}
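To illustrate the two side remarks above (per-snippet modifiers and testing snippets in isolation), here is a minimal sketch. It assumes Test::More and uses sample strings I made up; they are not taken from the real input. The hex snippets are rewritten with /xi as one possible simplification, while the INT keyword stays case-sensitive.

```perl
use strict;
use warnings;
use Test::More;

# Per-snippet modifiers: /i covers the hex digits and the H suffix only.
my $hex2  = qr/ [0-9a-f]{2} /xi;
my $opt_h = qr/ h? /xi;

# No /i here, so the INT keyword still has to be upper case.
my $int   = qr/ \b INT \s? $hex2 $opt_h /x;

# Made-up sample inputs (assumptions, not real data).
my @good = ( 'INT 21', 'INT 2Fh', 'INT2fH' );
my @bad  = ( 'int 21', 'INT 2',   'INT 2G' );

like(   $_, qr/\A$int\z/, "accepts '$_'" ) for @good;
unlike( $_, qr/\A$int\z/, "rejects '$_'" ) for @bad;

done_testing();
```

Each qr// object carries its own flags when it is interpolated, so the /i on the hex snippets does not leak into the enclosing pattern.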

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Updates

  • Fixed AI nonsense (2 unrelated variables with the same name)
  • I'm pretty sure that this $quotes? won't work without grouping
    Re^3: Reusing a complex regexp in multiple spots, escaping the regexp
    by ecm (Initiate) on Apr 13, 2026 at 12:11 UTC
      Thanks for your input! The use case is to find references to treat as hyperlinks in Ralf Brown's Interrupt List. The INT/MEM/PORT keywords are intentionally matched only in all caps to have fewer false positives.
        I took a quick look into Ralf Brown's Interrupt List and the material is far too vast to be easily understood.

        Please consider providing an SSCCE with sample input if you need further help.

        Please be aware that the AI-generated code is far from correct and far from the best approach. (I only tossed two prompts at it to generate that output.)

        Hints for improvement

        • the Xs in your hex character class look wrong; I suppose they are only allowed at the beginning, like xFF
        • you can define your own named character classes
        • qr snippets can have individual modifiers like qr//xi; this can improve readability a lot
        • I'd enforce a clear naming convention for derived snippets with quantifiers, like either a trailing _opt or a leading opt_ for ?
        • if you want to put your quantifiers after the snippet, it might be best to always explicitly put them into a non-capturing group, like (?:$quote)?
        • you can also use named captures inside sub-terms, and only the matching one will show; that's far better than relying on counts like $5
        • you should consider a catch-all clause like (?<unknown>.*) at the end of your branches, to catch errors or missing implementations (e.g. a weird number where you expected a hex)
        • you can embed Perl code into your snippets for debugging the current state with (?{...})
        • last but not least: instead of composing one giant regex, you can also partially match smaller parts and have intermediate logic in Perl with the /c modifier (used together with /g and \G)
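        Several of these hints can be combined in one small scanner. This is only a sketch under assumptions: scan_refs, $token_re and the sample text are hypothetical, and the token pattern is heavily simplified compared to the real one. It shows explicit grouping before a quantifier, named captures read via %+, a catch-all branch, and stepping through the text with \G and /gc.

```perl
use strict;
use warnings;

my $hex2  = qr/[0-9A-Fa-f]{2}/;
my $opt_h = qr/[Hh]?/;
my $quote = qr/"[^"]*"/;

# Hypothetical, simplified token pattern (not the full pattern from the thread).
my $token_re = qr{
    (?<int>     \b INT  \s? $hex2 $opt_h (?: \s* $quote )? )  # explicit group before "?"
  | (?<port>    \b PORT \s? $hex2 $opt_h )
  | (?<unknown> \b (?:INT|PORT) \b [^\n]* )                   # catch-all for odd input
}x;

# Walk through the text; /c keeps pos() where it was when a match fails.
sub scan_refs {
    my ($text) = @_;
    my @found;
    while (1) {
        if ( $text =~ /\G$token_re/gc ) {
            my ($kind) = keys %+;    # only the branch that matched is set
            push @found, [ $kind, $+{$kind} ];
        }
        elsif ( $text =~ /\G./gcs ) {
            next;                    # no token here, skip one character
        }
        else {
            last;                    # end of text
        }
    }
    return @found;
}

my $sample = qq{see INT 21h "DOS" and PORT 60h, also INT zz\n};
printf "%-8s %s\n", @$_ for scan_refs($sample);
```

        The Perl logic between matches is where special cases can go, instead of forcing everything into one pattern.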
        Hope this helps :)

        Cheers Rolf