When I needed to write a regex character class containing lots and lots of Unicode items, I found that the /x modifier, which allows whitespace and comments in the regex, doesn't help at all within the [...] of a character class. So I needed to find a handy way to break it into multiple lines and even add comments.
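To see the problem in miniature, here is a small sketch (the variable names $outside and $inside are only for illustration). Under /x, spaces between the pieces of a pattern are ignored, but spaces inside [...] still count as members of the class.

```perl
use strict;
use warnings;

# Outside a class, /x ignores the spaces: this is the same as /^ab$/.
my $outside = qr/^ a b $/x;

# Inside a class, /x leaves the spaces alone: the class contains
# "a", "b" and a literal space.
my $inside = qr/^[a b]$/x;

print "outside matches 'ab'\n" if "ab" =~ $outside;
print "inside matches ' '\n"   if " "  =~ $inside;
```

So spreading a long class over several lines under /x would quietly add whitespace to the class, and a # inside the brackets would become a member of the class instead of starting a comment.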

I used the trick of interpolating an arbitrary expression by wrapping it in an anonymous array reference and dereferencing it on the spot: @{[ ... ]}.

my $XML_BaseChar= qr/@{[
    "[" .                                                     # a character class...
    "\x{0041}-\x{005A}" .                                     # the first range
    "\x{0100}-\x{0131}\x{0134}-\x{013E}\x{0141}-\x{0148}" .   # the next range
    "\x{01FA}-\x{0217}\x{0250}-\x{02A8}" .                    # etc.
    "]"
]}/;  #here's how to close it up
Only the stuff inside the quotes appears inside the qr// construct.
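As a quick sanity check, here is a cut-down sketch of the same construct (the name $class and the sample characters are mine, chosen only for the demonstration) showing that the compiled regex behaves like an ordinary character class:

```perl
use strict;
use warnings;

# @{[ EXPR ]} builds an anonymous array holding the value of EXPR and
# dereferences it immediately, so that value is what gets interpolated.
my $class = qr/@{[
    "[" .
    "\x{0041}-\x{005A}" .   # A-Z
    "\x{0100}-\x{0131}" .   # part of the second range from the post
    "]"
]}/;

print "A matches\n"        if "A"        =~ $class;
print "U+0100 matches\n"   if "\x{0100}" =~ $class;
print "a does not match\n" unless "a"    =~ $class;
```

The comments and the concatenation dots live in ordinary Perl code, so none of them end up in the pattern.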

Re: Extend regex legibility within character classes
by knobunc (Pilgrim) on Jun 11, 2001 at 21:34 UTC

    Why not put the character list into a variable and interpolate that? The @{[]} trick can sometimes be a useful technique, but to my eye the following is more readable and more obvious to people who follow.

    my $valid_XML_BaseChars = join('',
        "\x{0041}-\x{005A}",  # Uppercase A-Z
        "\x{0100}-\x{0131}",  # Extended Latin A subset
                              # Skipping ligatures 0132, 0133
        "\x{0134}-\x{013E}",  # Continuing Ext. Latin A
                              # Skipping middle dots 013F, 0140
        "\x{0141}-\x{0148}",  # Finishing Ext. Latin A
        "\x{01FA}-\x{0217}",  # Extended Latin B subset
        "\x{0250}-\x{02A8}",  # IPA Extensions
    );
    my $XML_BaseChar= qr/[$valid_XML_BaseChars]/o;

    -ben

      That's what I started with, but wanted to get rid of the extra named variable. Using the @{[]} trick lets me write it in one statement.

      The extra variable would be less objectionable if I could hide the scope, but if the real one I'm declaring is also a "my", I can't put braces around the whole thing. It would take a third line, outside of the braces, first to declare it.
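      To make that concrete, here is a sketch of the scoped version (the name $chars is just for illustration, and the ranges are cut down to two):

```perl
use strict;
use warnings;

my $XML_BaseChar;                    # the extra line: declared outside the braces
{
    my $chars = join '',
        "\x{0041}-\x{005A}",         # the first range
        "\x{0100}-\x{0131}";         # the next range
    $XML_BaseChar = qr/[$chars]/;    # assigned inside, so it survives the block
}
# $chars is out of scope here; only $XML_BaseChar remains.

print "ok\n" if "Z" =~ $XML_BaseChar;
```

      It works, but the declaration and the assignment end up in separate statements, which is what I was trying to avoid.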

      —John

        my $XML_BaseChar= qr/$_/ for join "",
            "[",                                                     # a character class...
            "\x{0041}-\x{005A}",                                     # the first range
            "\x{0100}-\x{0131}\x{0134}-\x{013E}\x{0141}-\x{0148}",   # the next range
            "\x{01FA}-\x{0217}\x{0250}-\x{02A8}",                    # etc.
            "]";  #here's how to close it up
                - tye (but my friends call me "Tye")