Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

It is unusual to need such a lengthy regular-expression substitution as this, and I have never before needed one. However, I now have such an expression that is approaching 500 lines of code--all in a single substitution (and virtually must be so). The task is to break Thai text, which is typically written without spaces between words, into its component syllables, and I have found a method which works well, starting with attempts to match syllables, which are centered around one or more consonants, with the most vowel characters first, followed by shorter and shorter syllables down to single-character vowels. In this process, many of the regular expression rules are repeated from one syllable definition to the next, and I wish to place those repeated code blocks into a variable which can then greatly reduce the code, make it more readable, organize it better, and make it much easier to maintain with amendments to those code blocks.

Here are a couple such pieces of regex that I have attempted to assign to a variable:

#NO FORWARD O-ANG WITHOUT A THAI MUTE/CONSONANT ENDING
my $nfw_oang = eval(qr%
    (?!\p{IsOang}(?!
        (?:\p{InThaiFinCons}){1,2}
        (?![\p{InThaiCompVowel}\p{InThaiPostVowel}\p{InThaiTone}])
        (?:\p{InThaiMute})
    ))%);

#INITIAL CONSONANT(S)
my $initialconsonant = eval(qr%
    (?:
        (?:
            (?:\p{InThaiDualC1})
            (?:\p{InThaiDualC2})
        )
        |
        (?:\p{InThaiCons})
    )%);

And here's how those might look nested into the full regular expression (shortened for demo only, and based on my Thai module):

my $space = q':ThIsWiLlBeAsPaCe:';

my $syllables = $text =~ s/
(
  #SORT SYLLABLES BY VOWEL LENGTH CATEGORY
  #MATCH LONGEST FIRST
  #------------------------------------------------
  (?: #Compound four-character vowels (three of them)
    (?:\p{IsSarae})                   #SARA-E PRE-VOWEL
    (?:\p{InThaiCons}){1,2}           #CONSONANT(S)
    (?:\p{InThaiTone})?               #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
    (?:[\p{IsSaraii}\p{IsSarauee}])   #ONE OF THESE COMP. VOWELS
    (?:\p{InThaiTone})?               #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
    (?:[\p{IsOang}\p{IsYoyak}]).      #ONE OF THESE
    (?:[\p{IsSaraa}\p{IsWowaen}]).    #THE SHORTENING VOWEL -or- WO-WAEN
  ) #NOTE: The wo-waen version not on standard vowel charts
  |
  #------------------------------------------------
  (?: #Compound three-character vowels (six of them)
    (?: #With pre-vowel & comp. vowel, no shortening post-vowel (2)
      (?:\p{IsSarae})                   #SARA-E PRE-VOWEL
      (?:\p{InThaiCons}){1,2}           #CONSONANT(S)
      (?:\p{InThaiTone})?               #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
      (?:[\p{IsSaraii}\p{IsSarauee}])   #ONE OF THESE COMP. VOWELS
      (?:\p{InThaiTone})?               #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
      (?:[\p{IsOang}\p{IsYoyak}])       #ONE OF THESE
      (?:\p{InThaiFinCons}){0,3}        #OPTIONAL SYLLABLE-ENDING CONSONANT(S)
      (?![\p{InThaiCompVowel}\p{InThaiPostVowel}\p{InThaiTone}]) #NOT ONE OF THESE!!!
      ${nfw_oang}                       # <---- VARIABLE HERE !!!
      (?:\p{InThaiMute})?               #OPTIONAL THAI MUTE CHARACTER (GARAN)
    )
  )
  # [ SNIP ]
  |
  #------------------------------------------------
  (?: #Single-character vowels (eighteen of them)
    (?: #Pre-consonant "I" vowels (2)
      (?:[\p{IsSaraaimaimuan}\p{IsSaraaimaimalai}]) #"I" PRE-VOWEL
      ${initialconsonant}                           # <---- VARIABLE HERE !!!
      (?![\p{InThaiCompVowel}\p{InThaiPostVowel}])  #NOT ONE OF THESE!!!
      (?:\p{InThaiTone})?                           #OPTIONAL TONE MARK
    )
  )
)
/$space.$1.$space/egx;

$text =~ s!(?:$space)+! !g;

I've tried it with and without the "eval", and I have removed the in-line comments for the variable assignments "just in case." Still, the match seems not to work as expected. Either it does not succeed (no matches at all), as it did before I had replaced the code blocks with the variables for them, or it matches more than it was expected to--depending on whether I have added the "eval" or not.

I had not expected this to be difficult. Now I'm puzzled as to what I might be doing wrong. Googling for answers did not enlighten me--the answers led me to believe it should be working as-is (but it isn't). I had the code working before attempting to replace the blocks with variables, so it seems this is the only variable (no pun intended) here.

Ideas are welcome, and thank you!

Blessings,

~Polyglot~

Re: Repeated code blocks in long and hairy regex
by Corion (Patriarch) on Nov 05, 2023 at 10:25 UTC

    When you add the eval, it evaluates the regular expression as a string. This will likely remove one layer of backslashes ("escapes").

    I wonder where you got the idea of wrapping eval around a regular expression definition from and why you think it would change the program outcome to one you expect?

      I wonder where you got the idea of wrapping eval around a regular expression definition from...

      I got the idea from an online O'Reilly book about "Mastering Perl" here: https://www.oreilly.com/library/view/mastering-perl/9780596527242/ch02.html

      Here is the code example that I had seen on that page:

      #!/usr/bin/perl
      # perl-grep2.pl

      my $pattern = shift @ARGV;

      my $regex = eval { qr/$pattern/ };
      die "Check your pattern! $@" if $@;

      while( <> ) {
          print if m/$regex/;
      }

      I didn't know why my code had had issues to begin with, so when I saw that code example, I thought perhaps I had not used the correct syntax for this usage, and tried it the way they had it. I didn't have anything to lose in trying, but the result was not an improvement.

      Blessings,

      ~Polyglot~

        Note how that code uses the block form of eval, eval { ... }, not the string form eval qr//.

        That eval is only there to "protect" you against errors in your regular expression. I'm not sure what the use is in that program as it immediately exits anyway.

        You are at least getting tripped up by not understanding the two forms of eval and how they differ. String-eval should rarely be used and regular expressions are not one of these rare cases.
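
The difference between the two forms of eval can be seen in a small sketch. The pattern 'ab+c' and all variable names here are made up for illustration. With the string form, the qr// object is first stringified (to something like "(?^:ab+c)") and that text is then compiled as Perl source. It is not valid Perl, so eval returns undef and puts the error in $@. An undef value interpolates into a larger regex as nothing at all.

```perl
use strict;
use warnings;

my $pattern = 'ab+c';

# Block eval: the code inside is compiled along with the rest of the program;
# eval only traps a runtime exception, such as an invalid pattern.
my $from_block = eval { qr/$pattern/ };

# String eval: the qr// object is stringified and the resulting text is
# compiled as Perl source code, which fails with a syntax error.
my $from_string       = eval( qr/$pattern/ );
my $string_eval_error = $@;

# A malformed pattern held in a variable is where the block form is useful.
my $broken    = '(unclosed';
my $bad       = eval { qr/$broken/ };
my $bad_error = $@;

printf "block eval:  %s\n", defined $from_block  ? "compiled regex" : "undef";
printf "string eval: %s\n", defined $from_string ? "defined"        : "undef";
printf "broken pattern trapped: %s\n", $bad_error ? "yes" : "no";
```

Checking $@ (or simply whether the variable is defined) right after the assignment is a cheap way to notice that a fragment never made it into the larger pattern.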

Re: Repeated code blocks in long and hairy regex
by ikegami (Patriarch) on Nov 06, 2023 at 14:51 UTC

    Those evals make no sense, and you're missing the /x. You simply want

    my $nfw_oang = qr%
        (?!\p{IsOang}(?!
            (?:\p{InThaiFinCons}){1,2}
            (?![\p{InThaiCompVowel}\p{InThaiPostVowel}\p{InThaiTone}])
            (?:\p{InThaiMute})
        ))
    %x;
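
Why the missing /x matters: a qr// object carries its own flags, and an outer /x does not reach into it when it is interpolated. The following is a minimal sketch with a made-up ASCII pattern standing in for the Thai fragments. Without /x, the blanks inside the fragment are literal characters that the text must contain.

```perl
use strict;
use warnings;

# Same pattern text, compiled without and with /x.
my $literal_ws = qr% (?: a | b ) %;     # blanks are part of the pattern
my $ignored_ws = qr% (?: a | b ) %x;    # blanks are layout only

# Interpolated into an outer /x pattern, each fragment keeps its own flags.
my $hit_without_x = 'xay' =~ / x $literal_ws y /x ? 1 : 0;
my $hit_with_x    = 'xay' =~ / x $ignored_ws y /x ? 1 : 0;

# The fragment without /x only matches text that really contains the blanks.
my $hit_spaced    = 'x  a  y' =~ / x $literal_ws y /x ? 1 : 0;

print "without /x on the fragment: $hit_without_x\n";
print "with /x on the fragment:    $hit_with_x\n";
print "without /x, spaced input:   $hit_spaced\n";
```

This would account for the "no matches at all" case: Thai text has no spaces for the fragment to match.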

    And given that (?![\p{X}\p{Y}\p{Z}])\p{W} can be written as (?[ \p{W} - [\p{X}\p{Y}\p{Z}] ]) or (?[ \p{W} - \p{X} - \p{Y} - \p{Z} ]), you could use

    my $nfw_oang = qr%
        (?!
            \p{IsOang}
            (?!
                (?:\p{InThaiFinCons}){1,2}
                (?[ \p{InThaiMute} - \p{InThaiCompVowel} - \p{InThaiPostVowel} - \p{InThaiTone} ])
            )
        )
    %x;
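
The equivalence between the lookahead form and the (?[ ... ]) set-subtraction form can be checked by brute force. This sketch uses built-in properties (\p{Alnum}, \p{Lu}, \p{Nd}) as stand-ins for the user-defined Thai ones, and compares the two forms over an arbitrary range of code points. The `no warnings` line assumes a Perl of at least 5.18, where the construct was still marked experimental.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # only needed on Perls before 5.36

# "An alphanumeric that is neither an uppercase letter nor a decimal digit",
# written once with a negative lookahead and once as a set difference.
my $lookahead_form = qr/ (?![\p{Lu}\p{Nd}]) \p{Alnum} /x;
my $set_form       = qr/ (?[ \p{Alnum} - \p{Lu} - \p{Nd} ]) /x;

# Collect every code point in the range on which the two forms disagree.
my @disagreements = grep {
    my $ch = chr;
    ($ch =~ $lookahead_form ? 1 : 0) != ($ch =~ $set_form ? 1 : 0)
} 0 .. 0x2FF;

print "code points where the two forms disagree: ", scalar(@disagreements), "\n";
```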

    That said, you might want to look into (?(DEFINE)...).

    /
        ... (?&NFW_OANG) ... (?&INITIAL_CONSONANT) ...

        (?(DEFINE)
            # NOT FOLLOWED BY O-ANG WITHOUT A THAI MUTE OR CONSONANT ENDING
            (?<NFW_OANG> (?! (?&OANG) ) )

            # O-ANG WITHOUT A THAI MUTE OR CONSONANT ENDING
            (?<OANG>
                \p{IsOang}
                (?!
                    (?:\p{InThaiFinCons}){1,2}
                    (?[ \p{InThaiMute} - \p{InThaiCompVowel} - \p{InThaiPostVowel} - \p{InThaiTone} ])
                )
            )

            # INITIAL CONSONANT(S)
            (?<INITIAL_CONSONANT>
                (?: \p{InThaiDualC1} \p{InThaiDualC2} | \p{InThaiCons} )
            )
        )
    /x
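
For a version that can be run as-is, here is a minimal sketch of the (?(DEFINE)...) technique with ASCII stand-ins. The rule names ONSET and VOWEL, the character classes, and the sample text are all invented for illustration and have nothing to do with real Thai syllable structure. The point is only that a named rule is written once and called with (?&NAME) wherever it is needed.

```perl
use strict;
use warnings;

# Hypothetical toy grammar: a "syllable" is an onset followed by one vowel.
my $syllable = qr/
    ( (?&ONSET) (?&VOWEL) )

    (?(DEFINE)
        # digraph first, then any single consonant
        (?<ONSET> (?: [st] h | [b-df-hj-np-tv-z] ) )
        (?<VOWEL> [aeiou] )
    )
/x;

my $text = 'thatashaka';

# Put a blank after every syllable, then trim the trailing one.
(my $spaced = $text) =~ s/$syllable/$1 /g;
$spaced =~ s/\s+\z//;

print "$spaced\n";
```

One thing to watch with this layout: the pattern delimiter is found before /x comments are recognised, so a comment inside the pattern must not contain the delimiter character.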
Re: Repeated code blocks in long and hairy regex
by NERDVANA (Priest) on Nov 06, 2023 at 19:00 UTC
    Either it does not succeed (no matches at all), as it did before I had replaced the code blocks with the variables for them, or it matches more than it was expected to--depending on whether I have added the "eval" or not

    This is one of the problems when you try to do everything in a single match. I mean, there are times when you do want a single match, but you need to consider what you want the regex engine to do when it encounters something that isn't matched. So, for instance, maybe you need some sort of catch-all that returns an "error token" before resuming trying to match the rest.

    In many cases, when parsing things, I desire the regex to give me as many parse tokens as it was able to create and then tell me where the error is, so that I can report the error to the user with some context information. The easiest way to accomplish that is with the /gc regex flag and then the \G marker at the start of the regex to pick up where you left off. The location in the string is tracked per-string, and you can inspect it with the pos keyword.

    my @tokens;
    while ($input =~ /\G( pattern... )/gcx) {
        push @tokens, $1;
    }
    if (pos $input < length $input) {
        say "Parse error at ....";
    }
    You can read all about it in perlre
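
A runnable sketch of the \G with /gc loop, assuming a toy token definition (runs of lowercase letters separated by whitespace) and a made-up input string. The names $stopped_at and $complete are illustrative only.

```perl
use strict;
use warnings;

my $input = 'abc def ?! ghi';
my @tokens;

# /g advances pos($input) after each match; /c leaves pos where it was when
# a match finally fails, instead of resetting it.
while ($input =~ /\G \s* ([a-z]+) /gcx) {
    push @tokens, $1;
}

# Skip any whitespace before the offending character, then see how far we got.
$input =~ /\G \s* /gcx;
my $stopped_at = pos($input) // 0;
my $complete   = $stopped_at == length $input;

print "tokens: @tokens\n";
print "parse error at offset $stopped_at\n" unless $complete;
```

Because \G pins every attempt to the current position, the engine cannot skip over unmatched text to find a later token, which is what makes the error offset meaningful.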