in reply to Reusing a complex regexp in multiple spots, escaping the regexp

This doesn't directly address your problem, but given how much you're slinging at the regex engine, it feels like you're on the cusp of where you might want to look at Marpa::R2 or the like and use a parser and grammar rather than wrestle with 40-line regexen (unless you're Abigail, in which case the entire thing should be rewritten to run solely as a finite automaton inside the regex engine).

Edit: Just to explain further, you've got nine different line types, I think you mentioned (I misread that as eight at first). Especially if this is something longer-term or that might expand in scope, it's at the point where a proper parser feels right. If it's a one-off, never mind, of course.

The cake is a lie.
The cake is a lie.
The cake is a lie.

Re^2: Reusing a complex regexp in multiple spots, escaping the regexp
by LanX (Saint) on Apr 13, 2026 at 00:40 UTC
    I think Marpa is overkill. One could just interpolate smaller regex snippets and use named captures to make the whole regex far more readable.

    Disclaimer: the following "code" is not only untested, it's also AI-generated, and is meant as a conceptual demo.

    Personally, I would simplify it even further if I knew the full problem domain, and I would use better variable names and more /x and /i modifiers. (The handling of upper and lower case seems inconsistent.)

    On a side note: the various snippets are now much easier to test against the expected input, and those sample inputs could be added to the documentation.
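To illustrate the idea of small interpolated snippets with named captures, here is a minimal sketch. It is not the code from the original post: the snippet names ($hex_byte, $hex_word, $int_ref, $port_ref), the helper find_refs, and the sample strings are all made up for illustration, and the real patterns depend on the actual input format.

```perl
use strict;
use warnings;

# Hypothetical snippets; the real ones depend on the actual input.
my $hex_byte = qr/ [0-9A-F]{2} /xi;    # two hex digits, either case
my $hex_word = qr/ [0-9A-F]{4} /xi;    # four hex digits, either case

# The keywords stay case-sensitive: these snippets carry no /i,
# and the /i of the interpolated hex snippets stays local to them.
my $int_ref  = qr/ INT  \s+ (?<int>$hex_byte)  h? /x;
my $port_ref = qr/ PORT \s+ (?<port>$hex_word) h? /x;

my $reference = qr/ \b (?: $int_ref | $port_ref ) \b /x;

# Each snippet can be checked on its own against sample input.
my @cases = (
    [ $hex_byte, '2f',        1 ],
    [ $hex_byte, 'G1',        0 ],
    [ $int_ref,  'INT 21h',   1 ],
    [ $int_ref,  'int 21h',   0 ],
    [ $port_ref, 'PORT 03F8', 1 ],
);
for my $case (@cases) {
    my ( $re, $input, $expected ) = @$case;
    my $got = $input =~ /\A$re\z/ ? 1 : 0;
    print $got == $expected ? 'ok' : 'NOT ok', " - '$input'\n";
}

# Collect every reference in a line as "name=value" strings.
sub find_refs {
    my ($line) = @_;
    my @refs;
    while ( $line =~ /$reference/g ) {
        my ($kind) = keys %+;    # only the branch that matched is present
        push @refs, join '=', $kind, $+{$kind};
    }
    return @refs;
}

print join( ', ', find_refs('See INT 21h and PORT 03F8h for details.') ), "\n";
```

The table of cases doubles as documentation of what each snippet is supposed to accept and reject.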

    I don't see how Marpa can add more to this.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    Updates

  • Fixed AI nonsense (2 unrelated variables with the same name)
  • I'm pretty sure that this $quotes? won't work without grouping
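Whether a quantifier after an interpolated variable needs a group depends on what the variable holds, which the thread does not show. A minimal sketch, assuming a hypothetical pattern that is an alternation of two quote characters: a plain string is pasted in as text, so the ? binds only to the last atom, while a qr object interpolates with a non-capturing group of its own.

```perl
use strict;
use warnings;

my $as_string = q{"|'};     # plain string holding an alternation (hypothetical)
my $as_qr     = qr/"|'/;    # the same pattern as a compiled qr object

my @inputs = ( q{x}, q{"x}, q{"junk} );

for my $input (@inputs) {
    # Bare string: the pattern becomes \A"|'?x\z, i.e. (\A") or ('?x\z).
    my $bare    = $input =~ /\A$as_string?x\z/     ? 'match' : 'no match';
    # Explicit group: the ? applies to the whole alternation.
    my $grouped = $input =~ /\A(?:$as_string)?x\z/ ? 'match' : 'no match';
    # qr object: it brings its own group, so the ? applies to all of it.
    my $qr      = $input =~ /\A$as_qr?x\z/         ? 'match' : 'no match';
    printf "%-6s bare: %-8s grouped: %-8s qr: %s\n",
        $input, $bare, $grouped, $qr;
}
```

If the variable holds a single atom such as a character class ["'], the bare form behaves the same as the grouped one; the explicit (?:...) mainly protects against later changes to the snippet.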
      Thanks for your input! The use case is to find references to treat as hyperlinks in Ralf Brown's Interrupt List. The INT/MEM/PORT keywords are intentionally matched only in all caps, to reduce false positives.
        I took a quick look into Ralf Brown's Interrupt List and the material is far too vast to be easily understood.

        Please consider providing an SSCCE with sample input if you need further help.

        Please be aware that the AI-generated code is far from being correct or the best approach. (I only tossed two prompts at it to generate that output.)

        Hints for improvement

        • the Xs in your hex character class look wrong; I suppose an x is only allowed at the beginning, as in xFF
        • you can define your own named character classes
        • qr snippets can have individual modifiers, like qr//xi; this can improve readability a lot
        • I'd enforce a clear naming convention for derived snippets with quantifiers like either trailing _opt or leading opt_ for ?
        • if you want to put your quantifiers after the snippet it might be best to always explicitly put them into a non-capturing group, like (?:$quote)?
        • you can also use named captures inside sub-terms, and only the matching one will show, that's far better than relying on counts like $5
        • you should consider a catch-all clause like (?<unknown>.*) at the end of your branches, to catch errors or missing implementations (e.g. a weird number where you expected a hex)
        • you can embed Perl code into your snippets for debugging the current state with (?{...})
        • last but not least: instead of composing one giant regex, you can also match smaller parts step by step and have intermediate logic in Perl, using the /gc modifiers (/c keeps the current position after a failed match) together with \G
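Several of these hints can be combined in one small sketch: a snippet with its own modifiers, a trailing _opt name for the optional variant, named captures per branch, and a catch-all branch at the end. The names ($hex, $quote, $quote_opt, $line_re, classify) and the assumed line shapes are hypothetical, not taken from the original regex.

```perl
use strict;
use warnings;

# Hypothetical building blocks; trailing _opt marks the optional variant.
my $hex       = qr/ [0-9A-F]+ /xi;      # /i applies to this snippet only
my $quote     = qr/ ["'] /x;
my $quote_opt = qr/ (?:$quote)? /x;     # quantifier on an explicit group

my $line_re = qr/
    \A \s*
    (?:
        INT  \s+ $quote_opt (?<int>$hex)  h? $quote_opt
      | PORT \s+ $quote_opt (?<port>$hex) h? $quote_opt
      | (?<unknown> .* )
    )
    \s* \z
/x;

# Return the name of the branch that matched and the captured text.
sub classify {
    my ($line) = @_;
    return unless $line =~ $line_re;
    my ($kind) = keys %+;    # only the capture that matched is present
    return ( $kind, $+{$kind} );
}

for my $line ( 'INT 21h', q{PORT "3F8h"}, 'INT zz' ) {
    my ( $kind, $value ) = classify($line);
    print "$line => $kind: $value\n";
}
```

Anything that lands in the unknown branch is a line the other branches do not cover yet, which makes gaps visible instead of silently skipping them.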
        Hope this helps :)
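The last hint, matching in smaller steps with /gc and \G, could look roughly like the following. This is only a sketch under assumptions: the function name scan_refs, the $hex snippet, and the two keyword branches are made up for illustration.

```perl
use strict;
use warnings;

my $hex = qr/ [0-9A-F]+ /xi;    # hypothetical hex snippet

# Walk through the text from left to right. \G anchors each attempt at
# the current position, and /c keeps that position when a match fails.
sub scan_refs {
    my ($text) = @_;
    my @found;
    pos($text) = 0;
    while ( pos($text) < length($text) ) {
        if ( $text =~ / \G \b INT \s+ (?<num>$hex) h? \b /xgc ) {
            push @found, [ INT => $+{num} ];
        }
        elsif ( $text =~ / \G \b PORT \s+ (?<num>$hex) h? \b /xgc ) {
            push @found, [ PORT => $+{num} ];
        }
        else {
            # No keyword here: skip one character and try again.
            $text =~ / \G . /xsgc;
        }
    }
    return @found;
}

for my $ref ( scan_refs('See INT 21h and PORT 03F8h.') ) {
    print "$ref->[0] => $ref->[1]\n";
}
```

Because each branch is an ordinary Perl statement, any extra logic (lookups, validation, building the hyperlink) can sit between the matches instead of inside one large pattern.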

        Cheers Rolf