unichist++ and graff++

Is unichist up to date (Unicode 5.2.0)?

Jim

Re^3: regexing for non-standard characters...
by graff (Chancellor) on Apr 18, 2010 at 04:08 UTC
    Is unichist up to date (Unicode 5.2.0)?

    That depends on which Perl version you use to run it. Check the perldelta man page that ships with your version of Perl. The 5.10.0 that came with my Mac OS X 10.5 reports Unicode 5.0.0; I notice that 5.10.1 has Unicode 5.1.0.
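For anyone who would rather ask Perl directly than read perldelta, here is a minimal sketch. It assumes the core Unicode::UCD module is installed alongside your Perl (it normally is); the variable name is only illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Version of the Unicode Character Database bundled with the running perl.
my $unicode_version = Unicode::UCD::UnicodeVersion();

# $^V is the version of the perl interpreter itself.
printf "perl %vd uses Unicode %s\n", $^V, $unicode_version;
```

The number printed will differ from one Perl installation to the next, which is exactly the point made above.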

    I haven't checked http://unicode.org, but that would be the place to look if you need to know what changed between Unicode versions.

    (update:) Oh, wait... I remember that there's that section of the unichist code that "summarizes" the ranges of characters according to language/script "pages" -- I wouldn't expect Unicode updates to have any (significant) impact on that part of the script, but it's something I should check up on... Thanks for asking.

    (another update: the POD in unichist says that the list of code page "classes" was based on Unicode 5.0)

      I asked about Unicode 5.2.0 mostly because I had just read about the release of Perl 5.12.0. The release announcement states:

      Perl now conforms much more closely to the Unicode standard. Additionally, this release includes an upgrade to version 5.2 of the standard.

      What you call "charts" in unichist are actually Unicode blocks. Unicode characters also belong to Unicode scripts. Support for both --blocks and --scripts would be a nice enhancement to your Unicode character histogram utility.
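To make the block/script distinction concrete, here is a small sketch using `charblock` and `charscript` from the core Unicode::UCD module. The sample code points are arbitrary choices for illustration, and none of this is taken from unichist itself.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# A block is a contiguous range of code points; a script groups characters
# by writing system, so one script can span several blocks.
for my $cp (0x0041, 0x00E9, 0x03B1, 0x0416, 0x3042) {
    printf "U+%04X  block=%-20s script=%s\n",
        $cp, charblock($cp), charscript($cp);
}
```

For example, U+0041 and U+00E9 sit in different blocks but share the Latin script, which is why a --scripts summary could differ from a --blocks one.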

      I compared your blocks ("charts") with the current Unicode 5.2.0 blocks. It seems your utility is missing such essentials as Egyptian Hieroglyphs and Mahjong and Domino tiles. I'm surprised no one has complained to you yet about these glaring omissions. :-)

        It took me a while to get around to it, but I have updated unichist -- count/summarize characters in data so that it uses the version of "Blocks.txt" that comes with Perl, so the next time someone asks "what version of Unicode does the tool use", it will be correct to say "the same version used by Perl (whatever Perl version you happen to be using)".
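As a rough illustration of the idea (not the actual unichist code), the sketch below tallies characters per block using whatever Unicode data ships with the running Perl. The helper name `block_histogram` is hypothetical, and going through `Unicode::UCD::charblock` is an assumption about one convenient way to reach Perl's bundled block data; unichist may read "Blocks.txt" differently.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock);

# Hypothetical helper: count the characters of a string per Unicode block,
# relying on the block data bundled with the running perl.
sub block_histogram {
    my ($text) = @_;
    my %count;
    for my $char (split //, $text) {
        # Code points outside any block may come back undefined.
        my $block = charblock(ord $char) // 'No_Block';
        $count{$block}++;
    }
    return \%count;
}

my $hist = block_histogram("abc \x{3b1}\x{3b2}");
printf "%-20s %d\n", $_, $hist->{$_} for sort keys %$hist;
```

Because the block names come from the Perl installation, upgrading Perl picks up newer blocks without any change to the script.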

        I checked on the "Scripts.txt" file, but I didn't see a good enough reason for incorporating it in addition to "Blocks.txt" -- the latter is sufficient for what "unichist" was meant to do.