Warning: long post.

The require and use directives at the top: "The use of the utf8 directive tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. This means that bytes in the source text that have their high-bit set will be treated as being part of a literal UTF-8 character..." (The preceding quote is taken from Jarkko Hietaniemi's documentation "utf8".)

The inner while loop regex matches any byte not belonging to the character class defined by the range of hex values listed. I'm not clear why the range would not be 0-127 decimal:

[\x{0}-\x{7f}]

$count tallies the number of lines with a non-ASCII character, and the print statement prints the line number using $. .
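A minimal sketch of that loop, assuming the input is read as raw bytes. The sample data, the in-memory filehandle, and the @hits array are stand-ins of mine, not part of the original script:

```perl
use strict;
use warnings;

# Sample data stands in for the real file (an assumption); the bytes are raw UTF-8.
my $data = "plain ascii\ncaf\xC3\xA9\nmore ascii\nna\xC3\xAFve \xE2\x82\xAC\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my $count = 0;    # number of lines containing at least one non-ASCII byte
my @hits;         # line numbers of those lines (hypothetical helper array)
while (my $line = <$fh>) {
    if ($line =~ /[^\x00-\x7F]/) {
        $count++;
        push @hits, $.;    # $. is the line number of the last-read filehandle
        print "line $.: non-ASCII byte found\n";
    }
}
close $fh;
print "$count lines contain non-ASCII bytes\n";
```

The negated class covers the full 0-127 range, so anything it matches has its high bit set.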

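The \x{nnnn} notation was probably the most important thing for me to understand: it matches the character with that hex code point, however many bytes its encoding takes (provided the string holds decoded characters rather than raw bytes), so I don't need to state the byte ranges explicitly. unicode.pod also describes the related, but different, \X: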

"The special pattern \X matches any extended Unicode sequence--"a combining character sequence" in Standardese--where the first character is a base character and subsequent characters are mark characters that apply to the base character."

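To illustrate the difference between matching code points and matching bytes, here is a small sketch. It assumes the core Encode module is available; the sample string and variable names are mine:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xC3\xA9";              # raw UTF-8 bytes for "cafe" with e-acute
my $chars = decode('UTF-8', $bytes);    # decoded: a 4-character string

# On the decoded string, \x{e9} matches the single character U+00E9.
my $in_chars = $chars =~ /\x{e9}/ ? 1 : 0;

# On the raw bytes there is no 0xE9 byte, only 0xC3 0xA9, so the same regex fails.
my $in_bytes = $bytes =~ /\x{e9}/ ? 1 : 0;

# The non-ASCII class finds one character in the decoded string, two bytes in the raw one.
my @non_ascii_chars = $chars =~ /([^\x{00}-\x{7f}])/g;
my @non_ascii_bytes = $bytes =~ /([^\x{00}-\x{7f}])/g;

printf "decoded: match=%d, non-ASCII items=%d\n", $in_chars, scalar @non_ascii_chars;
printf "bytes:   match=%d, non-ASCII items=%d\n", $in_bytes, scalar @non_ascii_bytes;
```

Whether the data is decoded on input is therefore what decides whether \x{nnnn} or explicit byte ranges is the right tool.
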
This chart was helpful too: (also from unicode.pod)

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF


Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF, the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF. The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings: it is technically possible to UTF-8-encode a single code point in different ways, but that is explicitly forbidden, and the shortest possible encoding should always be used. So that's what Perl does.

Another way to look at it is via bits: 

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa
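The bit layout can be turned into a few shifts and masks. This is only a sketch: utf8_bytes is a hypothetical helper of mine, not a Perl built-in, and it does not reject surrogates (U+D800..U+DFFF) or values above U+10FFFF:

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point by hand, following the bit chart.
# Assumes a valid, non-surrogate code point no larger than 0x10FFFF.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                                   # 0aaaaaaa
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F))               # 110bbbbb 10aaaaaa
        if $cp < 0x800;
    return (0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F),       # 1110cccc 10bbbbbb 10aaaaaa
            0x80 | ($cp & 0x3F))
        if $cp < 0x10000;
    return (0xF0 | ($cp >> 18), 0x80 | (($cp >> 12) & 0x3F),      # 11110ddd 10cccccc ...
            0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X => %s\n", $cp,
        join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);
}
```

Because each length is chosen from the smallest range that fits, the helper always produces the shortest encoding, which is where the gaps in the byte chart come from.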

I found in the Perl Cookbook an example of matching multibyte characters that explicitly matches each of the possible multibyte patterns, using a lot of whitespace to make it clear. The corresponding pattern for my task of matching non-ASCII characters would be:

my $pattern = qr{
    [\xC2-\xDF] [\x80-\xBF]                # U+0080..U+07FF
  | \xE0        [\xA0-\xBF] [\x80-\xBF]    # U+0800..U+0FFF
  | [\xE1-\xEC] [\x80-\xBF] [\x80-\xBF]    # U+1000..U+CFFF; I leave off the rest here
}x;

This is inserted into a regex:

while (<FILE>) {
    # /x allows whitespace in the pattern; /o keeps Perl from recompiling it on every pass
    if (/$pattern/ox) {
        $chars{$&}++;    # $& is the part of the string that actually matched the pattern
    }
}

foreach my $seq (keys %chars) {
    # I am unsure about the unpack; 'C*' should list the byte values of each match
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $seq;
    print "$hex matched $chars{$seq} times.\n";
}
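For reference, a self-contained sketch of the same tally with the remaining rows of the byte chart filled in. The extra alternatives are my reading of the chart, and @lines is made-up sample data standing in for the file:

```perl
use strict;
use warnings;

# One alternative per multibyte row of the chart (assumed complete; check against the chart).
my $pattern = qr{
    [\xC2-\xDF] [\x80-\xBF]
  | \xE0        [\xA0-\xBF] [\x80-\xBF]
  | [\xE1-\xEC] [\x80-\xBF] [\x80-\xBF]
  | \xED        [\x80-\x9F] [\x80-\xBF]
  | [\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]
  | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
  | [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
  | \xF4        [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]
}x;

# Sample lines of raw UTF-8 bytes (hypothetical data).
my @lines = ("caf\xC3\xA9 and na\xC3\xAFve", "price: 5 \xE2\x82\xAC", "caf\xC3\xA9 again");

my %chars;
for my $line (@lines) {
    # /g in a loop counts every multibyte sequence on the line, not just the first.
    $chars{$1}++ while $line =~ /($pattern)/g;
}

for my $seq (sort keys %chars) {
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $seq;
    print "$hex matched $chars{$seq} times\n";
}
```

Using a capture group and /g instead of $& is a design choice here: it tallies repeated characters on one line and avoids the performance cost that $& carried in older Perls.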

My data is at work, so I will try this tomorrow. I'm not fully sure that my pattern composed of multibyte ranges is right.

What is the utf-8 flag and do I need to check for it? Thank you, John

In reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn
in thread regex for utf-8 by jjohhn
