Re: Verifying Unicode (The mother of all regex).

Write a function that reads the W3C notation, then just copy/paste their data from the document into your program. You may want to tidy the code up if you use this.

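# Turn a W3C character-class production (e.g. "[#x0041-#x005A] | #x0386")
# into a compiled Perl character class.
sub w3cchars_to_qr {
    my $qr = '';
    # Split on "|" and trim whitespace around each alternative.
    for (map { s/^\s+//; s/\s+$//; $_ } split /\|/, shift()) {
        # Rewrite W3C "#xHHHH" as Perl "\x{HHHH}".
        s/#x((?i:[\da-f]+))/\\x{$1}/g;
        # Ranges arrive as "[a-b]": drop the brackets, keep the range.
        if (/^\[([^]]+)\]$/) {
            $qr .= $1;
        } else {
            $qr .= $_;
        }
    }
    # Wrap everything in a single bracketed class.
    return qr/[$qr]/x;
}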

END{ print w3cchars_to_qr( $base ); }

$base = q{[#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] |
[#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] |
[#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] |
[#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 | [#x0388-#x038A] |
#x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC |
#x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] |
[#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] |
[#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] |
[#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] |
[#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] |
[#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] |
#x093D | [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] |
[#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1] |
[#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] | [#x0A13-#x0A28] |
[#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] | [#x0A38-#x0A39] |
[#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] | [#x0A85-#x0A8B] | #x0A8D |
[#x0A8F-#x0A91] | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] |
[#x0AB5-#x0AB9] | #x0ABD | #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] |
[#x0B13-#x0B28] | [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D |
[#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] |
[#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] |
[#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] |
[#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] |
[#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] |
[#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] |
[#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] | [#x0D60-#x0D61] |
[#x0E01-#x0E2E] | #x0E30 | [#x0E32-#x0E33] | [#x0E40-#x0E45] | [#x0E81-#x0E82] |
#x0E84 | [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] |
[#x0EA1-#x0EA3] | #x0EA5 | #x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 |
[#x0EB2-#x0EB3] | #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] |
[#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] |
#x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C |
#x114E | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 |
#x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 |
#x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB |
#x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] |
[#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] | #x1F59 |
#x1F5B | #x1F5D | [#x1F5F-#x1F7D] | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE |
[#x1FC2-#x1FC4] | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] |
[#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B] |
#x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA] | [#x3105-#x312C] |
[#xAC00-#xD7A3]};
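
If it helps, here is a rough sketch of how the compiled class might then be used to test individual characters. It restates the conversion so it runs on its own, uses only a shortened sample of the list, and the helper name is_base_char is made up for the example rather than taken from the code above.

```perl
use strict;
use warnings;

# Same conversion idea as above, restated so this snippet is self-contained.
sub w3cchars_to_qr {
    my @parts;
    for my $item (split /\|/, shift) {
        $item =~ s/^\s+|\s+$//g;                  # trim each alternative
        $item =~ s/#x([0-9A-Fa-f]+)/\\x{$1}/g;    # #x0041 -> \x{0041}
        $item =~ s/^\[(.+)\]$/$1/;                # drop the outer [ ]
        push @parts, $item;
    }
    my $class = join '', @parts;
    return qr/[$class]/;
}

# Shortened sample of the list (assumption: just enough for a demo).
my $base_re = w3cchars_to_qr(
    q{[#x0041-#x005A] | [#x0061-#x007A] | #x0386 | [#xAC00-#xD7A3]}
);

# Hypothetical helper: true if the argument is exactly one base character.
sub is_base_char { return $_[0] =~ /\A$base_re\z/ }

for my $ch ('A', '1', "\x{0386}") {
    printf "U+%04X %s\n", ord($ch), is_base_char($ch) ? 'base' : 'not base';
}
```

Anchoring with \A and \z keeps the check to a single character; drop the anchors to search inside a longer string instead.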

Replies are listed 'Best First'.
Re: Re: Verifying Unicode (The mother of all regex). by BrowserUk (Patriarch) on May 02, 2003 at 18:10 UTC
Thanks, but I already have regexes for all the rules, with a couple of editor macros doing most of the donkey work. My question was really about finding a better (by some definition of the term) way of validating such large sets of Unicode character ranges. I looked at negating the regex to see if that would reduce the ~~complexity~~ size (it's not really complicated), but the result is no better. I also tried to discern some pattern in the ranges, to see if I might use bitwise boolean logic to accept/reject them in the same way that you can `grep{ ord & 32 } @chars;` to exclude uppercase alpha, but I don't see anything obvious. I thought that maybe someone else had tackled the problem and found a more elegant solution.

Examine what is said, not who speaks. 1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong. 2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible. 3) Any sufficiently advanced technology is indistinguishable from magic. Arthur C. Clarke.
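
For readers unfamiliar with the bit trick mentioned above: in ASCII, an uppercase letter and its lowercase twin differ only in bit 5 (value 32), so a single mask separates them. A minimal sketch, assuming the input holds ASCII letters only (digits and most punctuation also have that bit set, so this is not a general lowercase test):

```perl
use strict;
use warnings;

my @chars = qw(a B c D e);

# Keep characters whose code point has bit 5 (value 32) set.
# Among ASCII letters, those are exactly the lowercase ones.
my @lower = grep { ord($_) & 32 } @chars;

print "@lower\n";    # prints "a c e"
```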
Re: Re: Re: Verifying Unicode (The mother of all regex). by diotalevi (Canon) on May 02, 2003 at 19:38 UTC
Ah - so you want a unified rule for detecting base characters that isn't a simple dictionary. I started a script to look for really common bits.

# Copy/paste the data right from the document
$\ = $, = "\n";
$base = q{[#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] |
[#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] |
[#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] |
[#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 | [#x0388-#x038A] |
#x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC |
#x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] |
[#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] |
[#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] |
[#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] |
[#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] |
[#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] |
#x093D | [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] |
[#x0