BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Production [85] of the XML 1.0 spec's EBNF grammar defines BaseChar. The following regex is one possible way to verify such characters.

Can anyone see a better way of presenting this or laying it out? Any alternative method?

# Under /x, whitespace inside a bracketed character class is still
# significant, so the ranges are kept in a string and the whitespace is
# stripped before the class is compiled.
my $ranges = q{
    \x{0041}-\x{005A} \x{0061}-\x{007A} \x{00C0}-\x{00D6} \x{00D8}-\x{00F6}
    \x{00F8}-\x{00FF} \x{0100}-\x{0131} \x{0134}-\x{013E} \x{0141}-\x{0148}
    \x{014A}-\x{017E} \x{0180}-\x{01C3} \x{01CD}-\x{01F0} \x{01F4}-\x{01F5}
    \x{01FA}-\x{0217} \x{0250}-\x{02A8} \x{02BB}-\x{02C1} \x{0386}
    \x{0388}-\x{038A} \x{038C} \x{038E}-\x{03A1} \x{03A3}-\x{03CE}
    \x{03D0}-\x{03D6} \x{03DA} \x{03DC} \x{03DE}
    \x{03E0} \x{03E2}-\x{03F3} \x{0401}-\x{040C} \x{040E}-\x{044F}
    \x{0451}-\x{045C} \x{045E}-\x{0481} \x{0490}-\x{04C4} \x{04C7}-\x{04C8}
    \x{04CB}-\x{04CC} \x{04D0}-\x{04EB} \x{04EE}-\x{04F5} \x{04F8}-\x{04F9}
    \x{0531}-\x{0556} \x{0559} \x{0561}-\x{0586} \x{05D0}-\x{05EA}
    \x{05F0}-\x{05F2} \x{0621}-\x{063A} \x{0641}-\x{064A} \x{0671}-\x{06B7}
    \x{06BA}-\x{06BE} \x{06C0}-\x{06CE} \x{06D0}-\x{06D3} \x{06D5}
    \x{06E5}-\x{06E6} \x{0905}-\x{0939} \x{093D} \x{0958}-\x{0961}
    \x{0985}-\x{098C} \x{098F}-\x{0990} \x{0993}-\x{09A8} \x{09AA}-\x{09B0}
    \x{09B2} \x{09B6}-\x{09B9} \x{09DC}-\x{09DD} \x{09DF}-\x{09E1}
    \x{09F0}-\x{09F1} \x{0A05}-\x{0A0A} \x{0A0F}-\x{0A10} \x{0A13}-\x{0A28}
    \x{0A2A}-\x{0A30} \x{0A32}-\x{0A33} \x{0A35}-\x{0A36} \x{0A38}-\x{0A39}
    \x{0A59}-\x{0A5C} \x{0A5E} \x{0A72}-\x{0A74} \x{0A85}-\x{0A8B}
    \x{0A8D} \x{0A8F}-\x{0A91} \x{0A93}-\x{0AA8} \x{0AAA}-\x{0AB0}
    \x{0AB2}-\x{0AB3} \x{0AB5}-\x{0AB9} \x{0ABD} \x{0AE0}
    \x{0B05}-\x{0B0C} \x{0B0F}-\x{0B10} \x{0B13}-\x{0B28} \x{0B2A}-\x{0B30}
    \x{0B32}-\x{0B33} \x{0B36}-\x{0B39} \x{0B3D} \x{0B5C}-\x{0B5D}
    \x{0B5F}-\x{0B61} \x{0B85}-\x{0B8A} \x{0B8E}-\x{0B90} \x{0B92}-\x{0B95}
    \x{0B99}-\x{0B9A} \x{0B9C} \x{0B9E}-\x{0B9F} \x{0BA3}-\x{0BA4}
    \x{0BA8}-\x{0BAA} \x{0BAE}-\x{0BB5} \x{0BB7}-\x{0BB9} \x{0C05}-\x{0C0C}
    \x{0C0E}-\x{0C10} \x{0C12}-\x{0C28} \x{0C2A}-\x{0C33} \x{0C35}-\x{0C39}
    \x{0C60}-\x{0C61} \x{0C85}-\x{0C8C} \x{0C8E}-\x{0C90} \x{0C92}-\x{0CA8}
    \x{0CAA}-\x{0CB3} \x{0CB5}-\x{0CB9} \x{0CDE} \x{0CE0}-\x{0CE1}
    \x{0D05}-\x{0D0C} \x{0D0E}-\x{0D10} \x{0D12}-\x{0D28} \x{0D2A}-\x{0D39}
    \x{0D60}-\x{0D61} \x{0E01}-\x{0E2E} \x{0E30} \x{0E32}-\x{0E33}
    \x{0E40}-\x{0E45} \x{0E81}-\x{0E82} \x{0E84} \x{0E87}-\x{0E88}
    \x{0E8A} \x{0E8D} \x{0E94}-\x{0E97} \x{0E99}-\x{0E9F}
    \x{0EA1}-\x{0EA3} \x{0EA5} \x{0EA7} \x{0EAA}-\x{0EAB}
    \x{0EAD}-\x{0EAE} \x{0EB0} \x{0EB2}-\x{0EB3} \x{0EBD}
    \x{0EC0}-\x{0EC4} \x{0F40}-\x{0F47} \x{0F49}-\x{0F69} \x{10A0}-\x{10C5}
    \x{10D0}-\x{10F6} \x{1100} \x{1102}-\x{1103} \x{1105}-\x{1107}
    \x{1109} \x{110B}-\x{110C} \x{110E}-\x{1112} \x{113C}
    \x{113E} \x{1140} \x{114C} \x{114E}
    \x{1150} \x{1154}-\x{1155} \x{1159} \x{115F}-\x{1161}
    \x{1163} \x{1165} \x{1167} \x{1169}
    \x{116D}-\x{116E} \x{1172}-\x{1173} \x{1175} \x{119E}
    \x{11A8} \x{11AB} \x{11AE}-\x{11AF} \x{11B7}-\x{11B8}
    \x{11BA} \x{11BC}-\x{11C2} \x{11EB} \x{11F0}
    \x{11F9} \x{1E00}-\x{1E9B} \x{1EA0}-\x{1EF9} \x{1F00}-\x{1F15}
    \x{1F18}-\x{1F1D} \x{1F20}-\x{1F45} \x{1F48}-\x{1F4D} \x{1F50}-\x{1F57}
    \x{1F59} \x{1F5B} \x{1F5D} \x{1F5F}-\x{1F7D}
    \x{1F80}-\x{1FB4} \x{1FB6}-\x{1FBC} \x{1FBE} \x{1FC2}-\x{1FC4}
    \x{1FC6}-\x{1FCC} \x{1FD0}-\x{1FD3} \x{1FD6}-\x{1FDB} \x{1FE0}-\x{1FEC}
    \x{1FF2}-\x{1FF4} \x{1FF6}-\x{1FFC} \x{2126} \x{212A}-\x{212B}
    \x{212E} \x{2180}-\x{2182} \x{3041}-\x{3094} \x{30A1}-\x{30FA}
    \x{3105}-\x{312C} \x{AC00}-\x{D7A3}
};
$ranges =~ s/\s+//g;
my $re_BaseChar = qr/[$ranges]/;

Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.

Re: Verifying Unicode (The mother of all regex).
by diotalevi (Canon) on May 02, 2003 at 17:43 UTC

    Write a function to read W3C's document and just copy/paste their data into your program. You may want to make the code nicer to look at if you use this.

    # Convert the W3C production text into a compiled character class.
    sub w3cchars_to_qr {
        my $qr = '';
        # Split on '|' and trim surrounding whitespace from each item.
        for (map { s/^\s+//; s/\s+$//; $_ } split /\|/, shift()) {
            # Rewrite #xNNNN as \x{NNNN}.
            s/#x((?i:[\da-f]+))/\\x{$1}/g;
            # Strip the outer brackets from range items.
            if (/^\[([^]]+)\]$/) {
                $qr .= $1;
            }
            else {
                $qr .= $_;
            }
        }
        return qr/[$qr]/x;
    }

    # Runs at program exit, after $base has been assigned below.
    END { print w3cchars_to_qr( $base ); }

    $base = q{
        [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] |
        [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] |
        [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] |
        [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 |
        [#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] |
        [#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE |
        #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] |
        [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] |
        [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] |
        [#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] |
        [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] |
        [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 |
        [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D | [#x0958-#x0961] |
        [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] | [#x09AA-#x09B0] |
        #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1] |
        [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] | [#x0A13-#x0A28] |
        [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] | [#x0A38-#x0A39] |
        [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] | [#x0A85-#x0A8B] |
        #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] |
        [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 |
        [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30] |
        [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D | [#x0B5C-#x0B5D] |
        [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] | [#x0B92-#x0B95] |
        [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] |
        [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] |
        [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] |
        [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] |
        [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] |
        [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] |
        [#x0D60-#x0D61] | [#x0E01-#x0E2E] | #x0E30 | [#x0E32-#x0E33] |
        [#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88] |
        #x0E8A | #x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] |
        [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7 | [#x0EAA-#x0EAB] |
        [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3] | #x0EBD |
        [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] | [#x10A0-#x10C5] |
        [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] |
        #x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C |
        #x113E | #x1140 | #x114C | #x114E |
        #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] |
        #x1163 | #x1165 | #x1167 | #x1169 |
        [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E |
        #x11A8 | #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] |
        #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 |
        #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] |
        [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] |
        #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D] |
        [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4] |
        [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC] |
        [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B] |
        #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA] |
        [#x3105-#x312C] | [#xAC00-#xD7A3]
    };

      Thanks, but I already have regexes for all the rules, with a couple of editor macros doing most of the donkey work.

      My question was really about finding a better (by some definition of the term) way of validating such large sets of Unicode character ranges. I looked at negating the regex to see if that would reduce its size (it's not really complicated, just long), but the result is no better. I also tried to discern some pattern in the ranges, to see whether I might use bitwise boolean logic to accept or reject them, in the same way that you can use grep { ord($_) & 32 } @chars to exclude uppercase alpha, but I don't see anything obvious.
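With no usable bit pattern in the ranges, one alternative to a character class is a sorted table of ranges searched by binary search. The following is only a minimal sketch of that idea, not something proposed in the thread: the names `in_ranges` and `is_base_char` are hypothetical, and the table holds just a handful of the BaseChar ranges for illustration.

```perl
use strict;
use warnings;

# Illustrative subset of the BaseChar ranges (assumption: a real table
# would list every range), sorted by starting code point.
my @RANGES = (
    [0x0041, 0x005A], [0x0061, 0x007A], [0x00C0, 0x00D6],
    [0x00D8, 0x00F6], [0x00F8, 0x00FF], [0x0100, 0x0131],
    [0x0386, 0x0386], [0x0388, 0x038A], [0xAC00, 0xD7A3],
);

# Hypothetical helper: binary search for the range containing $cp.
sub in_ranges {
    my ($cp, $ranges) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($cp < $ranges->[$mid][0]) { $hi = $mid - 1 }
        elsif ($cp > $ranges->[$mid][1]) { $lo = $mid + 1 }
        else                             { return 1 }
    }
    return 0;
}

# True if the single character passed in falls inside one of the ranges.
sub is_base_char { return in_ranges(ord($_[0]), \@RANGES) }

print is_base_char('A')         ? "yes\n" : "no\n";   # inside 0041-005A
print is_base_char("\x{00D7}")  ? "yes\n" : "no\n";   # gap between 00D6 and 00D8
```

The table stays plain data, so it could be generated from the spec text rather than typed by hand, and each lookup takes a number of comparisons logarithmic in the number of ranges.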

      I thought that maybe someone else had tackled the problem and found a more elegant solution.
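One candidate for a tidier presentation is Perl's user-defined character property mechanism, which lets the ranges live in a subroutine and be referenced by name with `\p{...}`. The sketch below assumes a reasonably modern Perl; the property name `InXmlBaseCharSubset` and the helper `is_base_char` are hypothetical, and only a few of the BaseChar ranges are listed.

```perl
use strict;
use warnings;

# Hypothetical property name. Perl requires user-defined property subs
# to begin with "In" or "Is". Only a subset of the ranges is listed here.
sub InXmlBaseCharSubset {
    my @ranges = (
        [0x0041, 0x005A], [0x0061, 0x007A], [0x00C0, 0x00D6],
        [0x00D8, 0x00F6], [0x00F8, 0x00FF], [0x0386, 0x0386],
        [0x0388, 0x038A], [0xAC00, 0xD7A3],
    );
    # The sub returns one "START<TAB>END" pair of hex code points per line.
    return join '', map { sprintf "%04X\t%04X\n", @$_ } @ranges;
}

# True if the argument is exactly one character with the property.
sub is_base_char {
    return $_[0] =~ /\A\p{InXmlBaseCharSubset}\z/ ? 1 : 0;
}

print is_base_char('A')        ? "yes\n" : "no\n";
print is_base_char("\x{00D7}") ? "yes\n" : "no\n";
```

A named property also composes with other classes, for example `[\p{InXmlBaseCharSubset}_:]`, which may read better than nesting one very long bracketed class inside another.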

