in reply to regexing for non-standard characters...

So how does one find out what this stupid thing is?

Try running this script on your data: unichist -- count/summarize characters in data. It lists every distinct code point and how many times each one occurs. It expects UTF-8 text by default, but if your data is in some other encoding, you can specify it with a command-line option ("--enc=..."). The output is always in terms of Unicode code points.
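
For anyone who just wants to see the idea, here is a minimal sketch of a code point histogram. To be clear, this is not unichist and not its code: it is a small Python illustration of the same counting approach, and the function names (codepoint_histogram, format_row) are made up for this example.

```python
import unicodedata
from collections import Counter


def codepoint_histogram(text):
    """Return (code point, count) pairs sorted by code point."""
    counts = Counter(ord(ch) for ch in text)
    return sorted(counts.items())


def format_row(codepoint, count):
    """Format one row as 'U+XXXX<TAB>count<TAB>character name'."""
    # Control characters and unassigned code points have no name,
    # so fall back to a placeholder instead of raising ValueError.
    name = unicodedata.name(chr(codepoint), "<no name>")
    return f"U+{codepoint:04X}\t{count}\t{name}"


# Hypothetical sample input; real data would be read from a file and
# decoded with whatever encoding it actually uses before counting.
sample = "na\u00efve caf\u00e9\n"
for cp, n in codepoint_histogram(sample):
    print(format_row(cp, n))
```

The character that is breaking your regex should stand out in a listing like this, because it shows up as a code point you did not expect, with its official Unicode name next to it.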

Re^2: regexing for non-standard characters...
by Jim (Curate) on Apr 17, 2010 at 19:45 UTC
    unichist++ and graff++

    Is unichist up to date (Unicode 5.2.0)?

    Jim

      Is unichist up to date (Unicode 5.2.0)?

      That would depend on which perl version you are using to run it. Check the perldelta man page that comes with your version of perl. The 5.10.0 that came with my Mac OS X 10.5 shows Unicode 5.0.0; I notice that 5.10.1 has Unicode 5.1.0.

      I haven't checked http://unicode.org, but that would be the place to look if you need to know what the Unicode version differences consist of.

      (update:) Oh, wait... I remember that there's that section of the unichist code that "summarizes" the ranges of characters according to language/script "pages" -- I wouldn't expect Unicode updates to have any (significant) impact on that part of the script, but it's something I should check up on... Thanks for asking.

      (another update: the POD in unichist says that the list of code page "classes" was based on Unicode 5.0)

        I asked about Unicode 5.2.0 mostly because I had just read about the release of Perl 5.12.0. The release announcement states:

        Perl now conforms much more closely to the Unicode standard. Additionally, this release includes an upgrade to version 5.2 of the standard.

        What you call "charts" in unichist are actually Unicode blocks. Unicode characters also belong to Unicode scripts. Support for both --blocks and --scripts would be a nice enhancement to your Unicode character histogram utility.

        I compared your blocks ("charts") with the current Unicode 5.2.0 blocks. It seems your utility is missing such essentials as Egyptian Hieroglyphs and Mahjong and Domino tiles. I'm surprised no one has complained to you yet about these glaring omissions. :-)