in reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
in thread regex for utf-8

That should be "non-ASCII". My question is narrowing down to the matching part. I guess I can find the end of a character, because the high bits of the first byte tell me how many bytes it has in total, but I don't know whether the "code point" includes the high bits or not. I need to find these characters, but also record what they are. My buddy did something similar in Java, because Java could read the file in character by character, and he looked for characters >128. But he just printed the whole line with the offending character, and I want to count the characters. I haven't looked at Java for about a year, but it may be worth swimming through public static void main to get to the solution. My deadline is coming up.

Modules: I was hoping to learn how to do this myself, but I am beginning to think this may be beyond me right now. I can't believe nobody else has written a quick little script to do just this. I'm not used to coming up against such a brick wall when I want to do something that seems pretty simple on the face of it. I looked at the Encode module; it may do this. I've never used a module before.

Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 05:52 UTC
    Perl reads UTF-8 natively. Regular expressions are for finding characters of interest. So, something like:

    require 5.006;    # written as 5.006 because a bare 5.6 is read as version 5.600
    use strict;
    use warnings;
    use utf8;

    my $count = 0;
    while (<>) {
        while (/[^\x{1}-\x{7f}]/g) {
            ++$count;
            print "Found on line $.: $_";
        }
    }
    print "Total: $count offending chars found.\n";
    That is, a pattern matches on anything that's NOT in the range of code points 1 through 0x7f, inclusive. The \x{1234} notation matches the UTF-8 encoding of code point 0x1234, all several bytes of it.

    Want to track which ones they are, not just count them all? Try something like ++$chars{$&}; inside the inner loop.
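A minimal sketch of that per-character tally, assuming a current Perl (5.8 or later), where the input has to be decoded explicitly before a regex sees whole characters rather than single bytes. The sub name `count_non_ascii` and the sample text are made up for illustration; a capture group stands in for `$&`, which does the same job here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes and returns a hash ref that maps
# each non-ASCII code point (as a number) to how many times it occurs.
sub count_non_ascii {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> characters
    my %chars;
    # Each match is one whole character outside U+0000..U+007F.
    while ($text =~ /([^\x{0}-\x{7f}])/g) {
        ++$chars{ ord $1 };
    }
    return \%chars;
}

# Sample data (an assumption): "café, naïve café" as raw UTF-8 bytes.
my $sample = "caf\xC3\xA9, na\xC3\xAFve caf\xC3\xA9\n";
my $seen   = count_non_ascii($sample);
for my $code (sort { $a <=> $b } keys %$seen) {
    printf "U+%04X seen %d time(s)\n", $code, $seen->{$code};
}
```

When reading from a file instead, opening it with the `:encoding(UTF-8)` layer would do the decoding step, and the loop body stays the same.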

    See perlvar for the meaning of $&, the utf8 page for the pragmatic module, and perlre for regular expressions. See the latter half of perlmod for "Perl Modules" (the beginning might just make your eyes glaze over as yet). See "Quote and quote-like operators" in perlop for \x and friends.

    Now, care to tell us precisely what each line means in that example (after fixing typos)? Take your time, we're always open.

    —John

      Warning: long post.

      The require and use directives at the top: "The use of the utf8 directive tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. This means that bytes in the source text that have their high-bit set will be treated as being part of a literal UTF-8 character..." (The previous was taken from Jarkko Hietaniemi's documentation "utf8".)

      The inner while loop regex matches any bytes not belonging to the character class defined by the range of hex values listed. I'm not clear why the range would not be 0-127 decimal:

      [\x{0}-\x{7f}]

      $count tallies the number of lines with a non-ASCII character, and the print statement prints the line number using $. .

      The notation of \x{nnnn} was probably the most important for me to understand; it matches the multibyte representation of that hex number, and I don't need to explicitly state the byte ranges (from unicode.pod:)

      "The special pattern \X matches any extended Unicode sequence--"a combining character sequence" in Standardese--where the first character is a base character and subsequent characters are mark characters that apply to the base character."

      This chart was helpful too: (also from unicode.pod)

      Code Points             1st Byte  2nd Byte  3rd Byte  4th Byte

        U+0000..U+007F        00..7F
        U+0080..U+07FF        C2..DF    80..BF
        U+0800..U+0FFF        E0        A0..BF    80..BF
        U+1000..U+CFFF        E1..EC    80..BF    80..BF
        U+D000..U+D7FF        ED        80..9F    80..BF
        U+D800..U+DFFF        ******* ill-formed *******
        U+E000..U+FFFF        EE..EF    80..BF    80..BF
        U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
        U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
        U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF

      Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF, the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF. The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings: it is technically possible to UTF-8-encode a single code point in different ways, but that is explicitly forbidden, and the shortest possible encoding should always be used. So that's what Perl does.

      Another way to look at it is via bits:

      Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte

                          0aaaaaaa   0aaaaaaa
                  00000bbbbbaaaaaa   110bbbbb  10aaaaaa
                  ccccbbbbbbaaaaaa   1110cccc  10bbbbbb  10aaaaaa
        00000dddccccccbbbbbbaaaaaa   11110ddd  10cccccc  10bbbbbb  10aaaaaa

      I found in the Perl Cookbook an example of matching multibyte characters that explicitly matches the enumerated possible multibyte patterns, using a lot of whitespace to make it clear. The corresponding pattern for my task of matching non-ascii characters would be:
      my $pattern = qr{
            [\xC2-\xDF] [\x80-\xBF]                  # two-byte characters
          | \xE0        [\xA0-\xBF] [\x80-\xBF]      # three-byte, lead byte E0
          | [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]      # remaining three-byte characters
          # I leave off the rest here
      }x;
      This is inserted into a regex:
      while (<FILE>) {
          # /x allows whitespace in the pattern; /o keeps Perl from recompiling it on every pass
          if (/$pattern/ox) {
              $chars{$&}++;    # $& is the part of the string that actually matches the pattern
          }
      }
      foreach (keys %chars) {
          print "unpack 'U*', $_ matched $chars{$_} times.\n";    # I am unsure on the "unpack"
      }
      My data is at work, so I will try this tomorrow. I'm not fully sure that my pattern composed of multibyte ranges is right.

      What is the utf-8 flag and do I need to check for it? Thank you, John

        No reason why I used 1 instead of 0, other than to avoid using \0 and spoiling the symmetry. Your way is right, too. If there are no zero chars in the string, it doesn't matter.

        use utf8; does more than allow UTF-8 characters to be in the source file (in strings and even for identifier names!). After all, there weren't any, so I didn't include it for that purpose. Dig a little deeper, or try leaving it out and see what happens (after knowing the original works OK (fixed typos, agrees with your data format, etc.)).

        The chart: I don't use a chart, but I can convert to/from UTF-8 on a whiteboard. Can you? Look at the reason for the numbers, not just at the numbers, in binary. Go back to the original source document on UTF-8 if it's not explained in the Perl docs. You can find it at unicode.org, or in the back of the book if you own a copy.
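The whiteboard conversion can be sketched in code, as a rough illustration of the bit layout rather than anything Perl needs you to write yourself. The sub name `decode_first_char` is hypothetical, and the sketch assumes well-formed input: it does not reject overlong forms or surrogates. It also shows that the marker bits (110, 1110, 11110 on the lead byte, 10 on each continuation byte) are stripped and are not part of the code point.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the first UTF-8 character of a byte string by
# hand. Returns (code point, number of bytes used). Assumes well-formed input.
sub decode_first_char {
    my ($bytes) = @_;
    my @b    = unpack 'C*', $bytes;    # plain byte values, 0..255
    my $lead = $b[0];
    my ($len, $cp);
    if    ($lead < 0x80)           { ($len, $cp) = (1, $lead) }           # 0aaaaaaa
    elsif (($lead & 0xE0) == 0xC0) { ($len, $cp) = (2, $lead & 0x1F) }    # 110bbbbb
    elsif (($lead & 0xF0) == 0xE0) { ($len, $cp) = (3, $lead & 0x0F) }    # 1110cccc
    elsif (($lead & 0xF8) == 0xF0) { ($len, $cp) = (4, $lead & 0x07) }    # 11110ddd
    else { die sprintf "0x%02X is not a valid lead byte\n", $lead }
    # Each continuation byte is 10xxxxxx: drop the 10 marker, keep six bits.
    for my $i (1 .. $len - 1) {
        $cp = ($cp << 6) | ($b[$i] & 0x3F);
    }
    return ($cp, $len);
}

# The euro sign is encoded as the three bytes E2 82 AC.
my ($cp, $len) = decode_first_char("\xE2\x82\xAC");
printf "U+%04X from %d bytes\n", $cp, $len;    # prints "U+20AC from 3 bytes"
```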

        Matching multibyte chars explicitly: I did that years ago and wished for better. Now, it's unnecessary. Why would you need to do that?

        ++$chars{$&} while (/[^\0-\x{7f}]/g);
        Your if statement will only ++ the total for the first offending character it finds in a line.
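A small sketch of that difference, using a made-up sample line that is assumed to be already decoded into characters. The variable names are illustrative only, and a capture group is used in place of `$&`.

```perl
use strict;
use warnings;

# Sample line (an assumption): three non-ASCII characters, two of them the same.
my $line = "\x{e9}t\x{e9} \x{20ac}";

# "if" tests the pattern once, so at most one character is tallied per line.
my %first_only;
if ($line =~ /([^\0-\x{7f}])/) {
    ++$first_only{$1};
}

# "while" with /g resumes after the previous match until none are left.
my %all;
++$all{$1} while $line =~ /([^\0-\x{7f}])/g;

my $n_first = 0;
$n_first += $_ for values %first_only;
my $n_all = 0;
$n_all += $_ for values %all;
print "if: $n_first, while //g: $n_all\n";    # prints "if: 1, while //g: 3"
```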

        Unsure of unpack: You're not calling it in your sample, so I don't know what you mean. Leftover from another test-run, I suppose. You might be thinking:

        foreach (unpack "U*", $_) {
            ++$chars{$_} if $_ > 127;    # unpack "U*" already yields code point numbers
        }
        # ... later
        while (my ($ch, $count) = each %chars) {
            printf "character U+%04x seen $count times.\n", $ch;
        }
        Keep it up!

        —John

Quick script?
by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 06:00 UTC
    >> I can't believe nobody else has written a quick little script to do just this

    I'm sure lots of people have, though maybe not that exact problem. Here is how to do what the Java program does on one line on the command-line prompt:

    perl -Mutf8 -ne"print if /[^\0-\x7f]/"
    (change the quotes to suit your shell and OS. "" on Windows, usually '' on Unix)

    So, no need to wade through public static void main... the Perl program's already finished by then.

    —John