I have never before needed a regular-expression substitution as long as this one. The expression is approaching 500 lines of code, all in a single substitution (and it virtually must be so). The task is to break Thai text, which is typically written without spaces between words, into its component syllables. I have found a method that works well: syllables are centered on one or more consonants, so I try to match the syllables with the most vowel characters first, followed by shorter and shorter syllables, down to single-character vowels.

In this process, many of the regular-expression rules are repeated from one syllable definition to the next. I wish to place those repeated code blocks into variables, which would greatly reduce the code, make it more readable and better organized, and make it much easier to maintain when those blocks need amending.
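To illustrate the longest-first idea with reusable fragments, here is a minimal sketch using ASCII stand-ins. The names $cons and $vowel and the sample string are hypothetical and are not part of my Thai module; the real expression uses the Thai character properties instead.

```perl
use strict;
use warnings;

# Hypothetical ASCII stand-ins for the Thai character classes.
my $cons  = qr/[bcdfghjklmnpqrstvwxyz]/;
my $vowel = qr/[aeiou]/;

my $text = 'banakoo';

# Alternatives are ordered longest first:
# consonant + two vowels, then consonant + one vowel, then a bare vowel.
(my $split = $text) =~ s/( $cons $vowel $vowel | $cons $vowel | $vowel )/$1 /gx;
$split =~ s/\s+\z//;    # drop the trailing separator

print "$split\n";       # prints "ba na koo"
```

Because each fragment is defined once, a change to the consonant class would only need to be made in one place.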

Here are a couple of such pieces of regex that I have attempted to assign to variables:

#NO FORWARD O-ANG WITHOUT A THAI MUTE/CONSONANT ENDING
my $nfw_oang = eval(qr%    
    (?!\p{IsOang}(?!
        (?:\p{InThaiFinCons}){1,2}    
        (?![\p{InThaiCompVowel}\p{InThaiPostVowel}\p{InThaiTone}])
        (?:\p{InThaiMute})
    ))%);

#INITIAL CONSONANT(S)
my $initialconsonant = eval(qr%
    (?:
        (?:
          (?:\p{InThaiDualC1})
          (?:\p{InThaiDualC2})
        )
        |
        (?:\p{InThaiCons})
    )%);

And here's how those might look nested into the full regular expression (shortened for demo only, and based on my Thai module):

my $space = q':ThIsWiLlBeAsPaCe:';

my $syllables = $text =~ s/
(
    #SORT SYLLABLES BY VOWEL LENGTH CATEGORY
    #MATCH LONGEST FIRST    
       #------------------------------------------------
    (?:        
    #Compound four-character vowels (three of them)
    (?:\p{IsSarae})                 #SARA-E PRE-VOWEL
    (?:\p{InThaiCons}){1,2}         #CONSONANT(S)
    (?:\p{InThaiTone})?             #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
    (?:[\p{IsSaraii}\p{IsSarauee}]) #ONE OF THESE COMP. VOWELS
    (?:\p{InThaiTone})?             #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
    (?:[\p{IsOang}\p{IsYoyak}]).    #ONE OF THESE
    (?:[\p{IsSaraa}\p{IsWowaen}]).  #THE SHORTENING VOWEL -or- WO-WAEN
    )                                   #NOTE: The wo-waen version not on standard vowel charts
    |  #------------------------------------------------
    (?:
        #Compound three-character vowels (six of them)
        (?:
            #With pre-vowel & comp. vowel, no shortening post-vowel (2)
    (?:\p{IsSarae})                   #SARA-E PRE-VOWEL
    (?:\p{InThaiCons}){1,2}           #CONSONANT(S)
    (?:\p{InThaiTone})?               #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
    (?:[\p{IsSaraii}\p{IsSarauee}])   #ONE OF THESE COMP. VOWELS
    (?:\p{InThaiTone})?               #OPTIONAL TONE MARK (DEP. ON TYPING ORDER)
    (?:[\p{IsOang}\p{IsYoyak}])       #ONE OF THESE
    (?:\p{InThaiFinCons}){0,3}        #OPTIONAL SYLLABLE-ENDING CONSONANT(S)
    (?![\p{InThaiCompVowel}\p{InThaiPostVowel}\p{InThaiTone}])    #NOT ONE OF THESE!!!
        ${nfw_oang}    # <---- VARIABLE HERE !!!
    (?:\p{InThaiMute})?               #OPTIONAL THAI MUTE CHARACTER (GARAN)
    )
    )
    # [ SNIP ]
    |  #------------------------------------------------
    (?:
        #Single-character vowels (eighteen of them)
    (?:
        #Pre-consonant "I" vowels (2)
    (?:[\p{IsSaraaimaimuan}\p{IsSaraaimaimalai}]) #"I" PRE-VOWEL
    ${initialconsonant}  # <---- VARIABLE HERE !!!
    (?![\p{InThaiCompVowel}\p{InThaiPostVowel}])  #NOT ONE OF THESE!!!
    (?:\p{InThaiTone})?                           #OPTIONAL TONE MARK
    )

    )
 )    /$space.$1.$space/egx;

$text =~ s!(?:$space)+! !g;
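The sentinel approach in the last two statements can be sketched in isolation as follows. The sentinel string is the one used above; the input string and the two-alternative pattern are hypothetical stand-ins for the Thai text and the full expression. The sketch relies on the sentinel containing no regex metacharacters, so it can be interpolated into the second pattern without quoting.

```perl
use strict;
use warnings;

my $space = q':ThIsWiLlBeAsPaCe:';

# Hypothetical input and pattern, standing in for Thai text and the full regex.
my $text = 'abcabc';

# Wrap every match in sentinels; adjacent matches produce doubled sentinels.
$text =~ s/(abc|ab)/$space.$1.$space/eg;

# Collapse each run of sentinels into a single real space.
$text =~ s!(?:$space)+! !g;

print "[$text]\n";    # prints "[ abc abc ]"
```

Note that the result keeps one leading and one trailing space, since the first and last matches are also wrapped.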

I've tried it with and without the "eval", and I have removed the in-line comments from the variable assignments "just in case." Still, the match does not work as expected. Before I replaced the code blocks with variables, it worked. Now, depending on whether I have added the "eval" or not, it either does not succeed (no matches at all) or matches more than it was expected to.
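To make the two symptoms concrete, here is a minimal sketch with a hypothetical ASCII fragment (the names $no_x, $with_x and $evaled are mine, not from the Thai module). It rests on two assumptions from my reading of the documentation: a qr// object keeps its own flags when interpolated, so a fragment compiled without /x treats its layout whitespace as literal characters even inside an outer /x pattern; and eval on a qr// object tries to run the stringified pattern as Perl code. I have not confirmed that this is what happens in the full expression.

```perl
use strict;
use warnings;

# Hypothetical stand-in fragment: "a" or "b", written with layout whitespace.
my $no_x = qr%
    (?: a | b )
%;          # no /x here: the newlines and spaces are part of the pattern

my $with_x = qr%
    (?: a | b )
%x;         # /x here: the layout whitespace is ignored

# Both fragments are interpolated into an outer pattern that has /x.
my $plain_ok = 'xay' =~ / x ${no_x} y /x   ? 1 : 0;
my $x_ok     = 'xay' =~ / x ${with_x} y /x ? 1 : 0;

# eval() stringifies the qr// object and compiles that string as Perl code.
my $evaled = eval($with_x);

print "fragment without /x matches: $plain_ok\n";
print "fragment with /x matches:    $x_ok\n";
print "eval result is ", (defined $evaled ? 'defined' : 'undef'), "\n";
```

If these assumptions hold, a fragment built without /x would never match (the literal whitespace is not in the text), while an eval-wrapped fragment would come back undefined and interpolate as nothing, silently dropping its check and letting the expression match more than intended.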

I had not expected this to be difficult. Now I'm puzzled as to what I might be doing wrong. Googling for answers did not enlighten me--the answers led me to believe it should be working as-is (but it isn't). I had the code working before attempting to replace the blocks with variables, so it seems this is the only variable (no pun intended) here.

Ideas are welcome, and thank you!

Blessings,

~Polyglot~

In reply to Repeated code blocks in long and hairy regex by Polyglot
