Ah - so you want a unified rule for detecting base characters that isn't a simple dictionary. I started a script to look for really common bits.
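Here is a minimal sketch of one way such a search could start. It assumes "common bits" means shared high-order bits of the code points, which is only one reading of the idea. The helper names (`parse_ranges`, `in_ranges`, `high_byte_counts`) and the short sample string are made up for illustration and are not part of the original script. It parses a range list in the `#xNNNN-#xNNNN | #xNNNN` notation and tallies how many code points share each high byte.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "#x0041-#x005A | #x0386" into [lo, hi] pairs.
# A lone "#xNNNN" becomes a one-element range.
sub parse_ranges {
    my ($spec) = @_;
    my @ranges;
    while ( $spec =~ /#x([0-9A-Fa-f]+)(?:-#x([0-9A-Fa-f]+))?/g ) {
        my $lo = hex $1;
        my $hi = defined $2 ? hex $2 : $lo;
        push @ranges, [ $lo, $hi ];
    }
    return @ranges;
}

# Hypothetical helper: true if $cp falls inside any parsed range.
sub in_ranges {
    my ( $cp, @ranges ) = @_;
    for my $r (@ranges) {
        return 1 if $cp >= $r->[0] && $cp <= $r->[1];
    }
    return 0;
}

# Assumption: "common bits" is read here as "shared high byte",
# so count the listed code points per value of bits 8 and up.
sub high_byte_counts {
    my (@ranges) = @_;
    my %count;
    for my $r (@ranges) {
        $count{ $_ >> 8 }++ for $r->[0] .. $r->[1];
    }
    return %count;
}

# Short illustrative sample, not the full list from the document.
my $sample = q{#x0041-#x005A | #x0061-#x007A | #x0386 | #x0388-#x038A};
my @ranges = parse_ranges($sample);
my %count  = high_byte_counts(@ranges);

printf "high byte 0x%02X: %d code points\n", $_, $count{$_}
    for sort { $a <=> $b } keys %count;
```

A high byte that accounts for most of the code points would be a candidate for a cheap first-pass test, with `in_ranges` as the exact fallback.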

<code># Copy/paste the data right from the document
$\ = $, = "\n";
$base = q{#x0041-#x005A | #x0061-#x007A | #x00C0-#x00D6 | #x00D8-#x00F6 | #x00F8-#x00FF | #x0100-#x0131 | #x0134-#x013E | #x0141-#x0148 | #x014A-#x017E | #x0180-#x01C3 | #x01CD-#x01F0 | #x01F4-#x01F5 | #x01FA-#x0217 | #x0250-#x02A8 | #x02BB-#x02C1 | #x0386 | #x0388-#x038A | #x038C | #x038E-#x03A1 | #x03A3-#x03CE | #x03D0-#x03D6 | #x03DA | #x03DC | #x03DE | #x03E0 | #x03E2-#x03F3 | #x0401-#x040C | #x040E-#x044F | #x0451-#x045C | #x045E-#x0481 | #x0490-#x04C4 | #x04C7-#x04C8 | #x04CB-#x04CC | #x04D0-#x04EB | #x04EE-#x04F5 | #x04F8-#x04F9 | #x0531-#x0556 | #x0559 | #x0561-#x0586 | #x05D0-#x05EA | #x05F0-#x05F2 | #x0621-#x063A | #x0641-#x064A | #x0671-#x06B7 | #x06BA-#x06BE | #x06C0-#x06CE | #x06D0-#x06D3 | #x06D5 | #x06E5-#x06E6 | #x0905-#x0939 | #x093D | #x0958-#x0961 | #x0985-#x098C | #x098F-#x0990 | #x0993-#x09A8 | [#x0

In reply to Re: Re: Re: Verifying Unicode (The mother of all regex). by diotalevi
in thread Verifying Unicode (The mother of all regex). by BrowserUk
