in thread regex for utf-8

Warning: long post.

The require and use directives at the top: "The use of the utf8 directive tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. This means that bytes in the source text that have their high-bit set will be treated as being part of a literal UTF-8 character..." (Quoted from Jarkko Hietaniemi's documentation for "utf8".)

The regex in the inner while loop matches any character outside the character class defined by the listed range of hex values. I'm not clear why the range would not start at zero and cover 0-127 decimal:

[\x{0}-\x{7f}]

$count tallies the number of lines containing a non-ASCII character, and the print statement prints the line number using $. .
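
For reference, the kind of loop being described can be sketched as below. This is a minimal illustration, not the code from the earlier post: the sample lines, the in-memory filehandle, and the names $data, $bytes, $fh, and @hits are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data: three lines, two of which contain non-ASCII characters.
my $data = "plain ascii\ncaf\x{e9} au lait\nna\x{ef}ve r\x{e9}sum\x{e9}\n";

# Encode to UTF-8 bytes and read them back through a decoding layer,
# the way a UTF-8 file on disk would be read.
my $bytes = $data;
utf8::encode($bytes);
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "cannot open in-memory file: $!";

my $count = 0;    # number of lines holding at least one non-ASCII character
my @hits;         # line numbers of those lines
while (my $line = <$fh>) {
    if ($line =~ /[^\x{0}-\x{7f}]/) {
        ++$count;
        push @hits, $.;    # $. is the current input line number
    }
}
close $fh;

print "$count lines with non-ASCII characters: @hits\n";
```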

The \x{nnnn} notation was probably the most important thing for me to understand: it matches the character with that hex code point, however many bytes its UTF-8 encoding takes, so I don't need to state the byte ranges explicitly. Also from unicode.pod:

"The special pattern \X matches any extended Unicode sequence--"a combining character sequence" in Standardese--where the first character is a base character and subsequent characters are mark characters that apply to the base character."

This chart was helpful too: (also from unicode.pod)

    Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

    U+0000..U+007F         00..7F
    U+0080..U+07FF         C2..DF    80..BF
    U+0800..U+0FFF         E0        A0..BF    80..BF
    U+1000..U+CFFF         E1..EC    80..BF    80..BF
    U+D000..U+D7FF         ED        80..9F    80..BF
    U+D800..U+DFFF         ******* ill-formed *******
    U+E000..U+FFFF         EE..EF    80..BF    80..BF
    U+10000..U+3FFFF       F0        90..BF    80..BF    80..BF
    U+40000..U+FFFFF       F1..F3    80..BF    80..BF    80..BF
    U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF, the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF. The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings: it is technically possible to UTF-8-encode a single code point in different ways, but that is explicitly forbidden, and the shortest possible encoding should always be used. So that's what Perl does.

Another way to look at it is via bits:

    Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte

                      0aaaaaaa     0aaaaaaa
              00000bbbbbaaaaaa     110bbbbb  10aaaaaa
              ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
    00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

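
The bit layout in the second table can be checked by hand. Below is a minimal sketch, assuming a hypothetical helper named utf8_bytes and using the core Encode module only to cross-check the result; it does not reject surrogates, to keep it short.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build the UTF-8 bytes for one code point
# straight from the bit table. Surrogates are not rejected here.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                        # 0aaaaaaa
    return (0xC0 | ($cp >> 6),                         # 110bbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x800;       # 10aaaaaa
    return (0xE0 | ($cp >> 12),                        # 1110cccc
            0x80 | (($cp >> 6) & 0x3F),                # 10bbbbbb
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;     # 10aaaaaa
    return (0xF0 | ($cp >> 18),                        # 11110ddd
            0x80 | (($cp >> 12) & 0x3F),               # 10cccccc
            0x80 | (($cp >> 6) & 0x3F),                # 10bbbbbb
            0x80 | ($cp & 0x3F));                      # 10aaaaaa
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $by_hand = join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);
    my $by_perl = join ' ', map { sprintf '%02X', $_ }
                  unpack 'C*', encode('UTF-8', chr $cp);
    printf "U+%04X  %-12s %s\n", $cp, $by_hand,
        $by_hand eq $by_perl ? 'ok' : "MISMATCH ($by_perl)";
}
```

Each row of the bit table becomes one branch: shift the code point right to pick out a group of six bits, mask it with 0x3F, and OR in the marker bits for that byte.
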
I found in the Perl Cookbook an example of matching multibyte characters that explicitly matches the enumerated possible multibyte patterns, using a lot of whitespace to make it clear. The corresponding pattern for my task of matching non-ascii characters would be:
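
    my $pattern = qr{
        [\xc2-\xdf] [\x80-\xbf]                # 2 bytes: U+0080..U+07FF
      | \xe0        [\xa0-\xbf] [\x80-\xbf]    # 3 bytes: U+0800..U+0FFF
      | [\xe1-\xef] [\x80-\xbf] [\x80-\xbf]    # 3 bytes: U+1000..U+FFFF (looser than the chart for ED)
        # I leave off the rest here
    }x;
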
This is inserted into a regex:

    while (<FILE>) {
        # /x allows whitespace in the pattern; /o keeps Perl from recompiling it on every pass
        if (/$pattern/ox) {
            $chars{$&}++;    # $& is the part of the string that actually matched the pattern
        }
    }
    foreach (keys %chars) {
        print "unpack 'U*', $_ matched $chars{$_} times.\n";    # I am unsure on the "unpack"
    }

My data is at work, so I will try this tomorrow. I'm not fully sure that my pattern composed of multibyte ranges is right.

What is the utf-8 flag and do I need to check for it? Thank you, John

Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
by John M. Dlugosz (Monsignor) on Mar 03, 2003 at 07:53 UTC
    No reason why I used 1 instead of 0, other than to avoid using \0 and spoiling the symmetry. Your way is right, too. If there are no zero chars in the string, it doesn't matter.

    use utf8; does more than allow UTF-8 characters to be in the source file (in strings and even for identifier names!). After all, there weren't any, so I didn't include it for that purpose. Dig a little deeper, or try leaving it out and see what happens (after knowing the original works OK (fixed typos, agrees with your data format, etc.)).

    The chart: I don't use a chart, but I can convert to/from UTF-8 on a whiteboard. Can you? Look at the reason for the numbers, not just at the numbers, in binary. Go back to the original source document on UTF-8 if it's not explained in the Perl docs. You can find it at unicode.org, or in the back of the book if you own a copy.

    Matching multibyte chars explicitly: I did that years ago and wished for better. Now, it's unnecessary. Why would you need to do that?

    ++$chars{$&} while (/[^\0-\x{7f}]/g);
    Your if statement will only ++ the total for the first offending character it finds in a line.
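
    The difference can be seen on a made-up line with two non-ASCII characters. This is only a sketch: the sample string and the hash names %with_if and %with_while are assumptions for illustration, and a capture group ($1) is used in place of $&.

    ```perl
    use strict;
    use warnings;

    my $line = "caf\x{e9} na\x{ef}ve";    # contains U+00E9 and U+00EF

    my (%with_if, %with_while);

    # if: the match runs once, so only the first offender is tallied
    if ($line =~ /([^\0-\x{7f}])/) {
        ++$with_if{$1};
    }

    # while with /g: each pass resumes where the previous match ended
    while ($line =~ /([^\0-\x{7f}])/g) {
        ++$with_while{$1};
    }

    printf "if: %d distinct, while: %d distinct\n",
        scalar(keys %with_if), scalar(keys %with_while);
    ```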

    Unsure of unpack: You're not calling it in your sample, so I don't know what you mean. Leftover from another test-run, I suppose. You might be thinking:

    foreach (unpack "U*", $_) {
        ++$chars{$_} if $_ > 127;    # unpack "U*" yields code point numbers, so compare directly
    }
    # ... later
    while (my ($ch, $count) = each %chars) {
        printf "character U+%04x seen $count times.\n", $ch;
    }

    Keep it up!

    —John

      wow!

      This was trivial now that it's done:

      #require 5.6;
      use strict;
      use warnings;
      use utf8;

      my %chars;
      my %descids;
      while (<>) {
          while (/[^\x{1}-\x{7f}]/g) {
              ++$chars{$&};
          }
      }
      foreach my $char (keys %chars) {
          print "$char found $chars{$char} times\n";
      }
      print "found ". keys(%chars) . " distinct non-ascii chars\n";

      I had to comment out the version of perl: I have 5.6.1, and the compiler complained at that (that should be easy to fix). I next checked whether any characters lie outside ISO-8859-1 by changing the regex range to run up to \x{ff}, and got zero. That is the practical result of this whole exercise: this 58k tabbed text file will be easier to import into various DBMSs if the user knows that all of its characters lie in the Latin-1 range.

      The use utf8 directive is absolutely essential; the unicode hex notation is not allowed in the regex without it.

      The inner while loop (vs an if statement) around the regex is a little unclear to me; I guess the match with the /g modifier returns a list, and the if statement would only check the scalar return. Would something like this capture all the matches in a single line into a list?

      while (<>) {
          while (my @matches = /[^\x{1}-\x{7f}]/g) {
              $conid = /pattern-to-find-this-column/;
              $hash_of_lists{$conid} = [@matches];
              # linking this with inner hash of found characters is fuzzy but near...
              ++$chars{$&};
          }
      }

      My next task is to make some data structures; at the top level are concept_ids (one of the fields in this table). Each concept-id is associated with numerous description_ids (the primary key of this table). Each row of this table (each description_id) could have numerous non-ascii characters, each associated with a frequency.

      I intend to collect this all into a hash of lists of hashes of hashes.

      The inner hash is the non-ascii characters and their frequency. The list of hashes is the row of the table with its non-ascii characters; each row could have a number of distinct non-ascii characters in it. And the hash of lists is the unique concept_id associated with numerous description_ids. After I have that, I'll want the individual words with the characters also collected and reported somehow, but that will come last.
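
      One way the structure described above might be laid out, as a rough sketch only. The column order (description_id, concept_id, term), the sample rows, and the names %by_concept and %freq are all assumptions, since the real table layout is not shown in this thread.

      ```perl
      use strict;
      use warnings;

      # Hypothetical rows: description_id, concept_id, term (tab-separated).
      my @rows = (
          "101\t7\tcaf\x{e9}",
          "102\t7\tna\x{ef}ve r\x{e9}sum\x{e9}",
          "103\t8\tplain",
      );

      # concept_id => [ { description_id => ..., chars => { char => count } }, ... ]
      my %by_concept;
      for my $row (@rows) {
          my ($description_id, $concept_id, $term) = split /\t/, $row;

          my %freq;    # inner hash: non-ASCII character => frequency in this row
          ++$freq{$1} while $term =~ /([^\0-\x{7f}])/g;
          next unless %freq;    # skip rows with nothing to report

          push @{ $by_concept{$concept_id} },
              { description_id => $description_id, chars => \%freq };
      }

      for my $concept_id (sort keys %by_concept) {
          for my $entry (@{ $by_concept{$concept_id} }) {
              my $chars   = $entry->{chars};
              my $summary = join ', ',
                  map { sprintf 'U+%04X x%d', ord($_), $chars->{$_} } sort keys %$chars;
              print "concept $concept_id, description $entry->{description_id}: $summary\n";
          }
      }
      ```

      Printing code points rather than the characters themselves keeps the report readable on a terminal that is not set up for UTF-8.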

      This will take some thinking; I'm taking a company trip tomorrow and can work this out in the hotel. I might not be able to post for about a week, depending on internet access.

      Your help is much appreciated.

      John

        >> I had to comment out the version of perl: I have 5.6.1, and the compiler complained at that (that should be easy to fix).

        That's odd. I put it in to make it clear that you didn't need 5.8, and that older than 5.6 wouldn't work. It's a good idea in posts for that reason, even if you don't use it in your own scripts that you don't share with anyone or put on multiple machines.

        Look up the syntax of require and use in the perlfunc manpage, and see what the problem is. Use perl -v on the command line to see what version is actually running.
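
        As a hint at what goes wrong: a version written with a single dot is read as an ordinary decimal number and compared with $], where 5.6.1 appears as 5.006001. A minimal sketch of the forms that express the intended minimum, assuming nothing beyond core Perl:

        ```perl
        use 5.006;          # decimal form: 5.006 means version 5.6.0
        # require 5.6;      # read as 5.600, so perl 5.6.1 fails this check
        # require v5.6.0;   # v-string form, also accepted

        printf "running perl %vd (\$] = %s)\n", $^V, $];
        ```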

        >> The use utf8 directive is absolutely essential; the unicode hex notation is not allowed in the regex without it.

        Correct. That's one reason I used \x{7f} instead of \x7f which would mean the same thing: to make sure it was compiled with Unicode support turned on. Then taking out the use like I suggested would produce a meaningful error and tell you what else use utf8 affects.

        The /g modifier doesn't make the regex return a list of all matches. The context controls that, and it's being called in scalar context.

        Rather, the /g will return one at a time, and each time through the while loop will get the next one. See, the regex stuff is remembering the state in an object that's created when you use the matching syntax. This is important to understand, and you'll see it in a few other places in Perl, too.
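
        A minimal sketch of that one-at-a-time behaviour, using a made-up string. The saved position can be inspected with pos(), which Perl tracks for the string being matched; the names $text and @seen are only for illustration.

        ```perl
        use strict;
        use warnings;

        my $text = "a1b22c333";

        # Scalar context: each evaluation finds the next match
        # and moves pos($text) forward to the end of it.
        my @seen;
        while ($text =~ /(\d+)/g) {
            push @seen, "$1 ends at " . pos($text);
        }
        print "$_\n" for @seen;
        ```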

        The idea of calling context is important in Perl. I see you wrote print "found ". keys(%chars) . " distinct non-ascii chars\n";. What would happen if you used commas instead of the two dots? Why does it change its meaning?
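
        A small sketch of the two contexts, with a made-up one-entry hash. Here join stands in for print's argument list so the result can be held in a variable; both give keys() list context.

        ```perl
        use strict;
        use warnings;

        my %chars = (x => 3);    # one made-up entry keeps the result predictable

        # Concatenation puts keys() in scalar context: the number of keys.
        my $with_dots = "found " . keys(%chars) . " distinct";

        # A comma-separated list puts keys() in list context: the keys themselves.
        my $with_commas = join '', "found ", keys(%chars), " distinct";

        print "$with_dots\n$with_commas\n";
        ```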

        Your variation using my @matches= will indeed get all the matches at once. So why use a while loop? The rest of it doesn't make sense. Use it this way:

        $_ = "sdfdsf,65dfdf**3#ooijoi4asdfdsf.";
        my @matches = /[a-z]/g;
        print "I found: @matches\n";

        When you get back, try a new top-level thread, and be detailed since it will be picked up by new readers.

        Parsing each line in the file into fields? What format is it in? If it's comma-delimited text or something like that, there is a module which does that. If it's a unique character that doesn't appear within the field itself, just use the split function.
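
        For the tab-delimited case, a minimal sketch of the split approach. The sample row is hypothetical, since the real column order is not given in this thread.

        ```perl
        use strict;
        use warnings;

        # Hypothetical tab-delimited row whose last field is empty.
        my $line = "101\t7\tsome term\t\n";
        chomp $line;

        # The -1 limit keeps trailing empty fields, which split would otherwise drop.
        my @fields = split /\t/, $line, -1;

        print scalar(@fields), " fields: ", join('|', @fields), "\n";
        ```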

        —John