No reason why I used 1 instead of 0, other than to avoid using \0 and spoiling the symmetry. Your way is right, too. If there are no zero chars in the string, it doesn't matter.

use utf8; does more than allow UTF-8 characters to be in the source file (in strings and even for identifier names!). After all, there weren't any, so I didn't include it for that purpose. Dig a little deeper, or try leaving it out and see what happens, once you know the original works OK (typos fixed, agrees with your data format, etc.).

The chart: I don't use a chart, but I can convert to/from UTF-8 on a whiteboard. Can you? Look at the numbers in binary, and at the reason for them, not just at the values. Go back to the original source document on UTF-8 if it's not explained in the Perl docs. You can find it at unicode.org, or in the back of the book if you own a copy.
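To make the whiteboard exercise concrete, here is a minimal sketch of the bit layout, assuming the standard 1-to-4-byte UTF-8 patterns. The helper name utf8_bytes is made up for illustration, and it does no validation (surrogates and out-of-range values are not rejected).

```perl
use strict;
use warnings;

# Hypothetical helper: turn one code point into its UTF-8 byte values by hand.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                        # 0xxxxxxx
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;       # 110xxxxx 10xxxxxx
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;     # 1110xxxx 10xxxxxx 10xxxxxx
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));                      # 11110xxx plus three 10xxxxxx bytes
}

# Print each sample code point next to its bytes in binary.
for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X => %s\n", $cp,
        join ' ', map { sprintf '%08b', $_ } utf8_bytes($cp);
}
```

Reading the binary output, the lead byte's run of 1 bits tells you how many bytes the character takes, and every continuation byte starts with 10 and carries six payload bits.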

Matching multibyte chars explicitly: I did that years ago and wished for better. Now, it's unnecessary. Why would you need to do that?

++$char{$&} while (/[^\0-\x{7f}]/g);

Your if statement will only ++ the total for the first offending character it finds in a line; the while with /g above keeps matching until the line is exhausted.
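A small sketch of the difference, using a made-up sample line (the variable names are only for illustration, not taken from your script):

```perl
use strict;
use warnings;

# Made-up sample: e-acute appears twice, i-diaeresis once (three non-ASCII chars).
my $line = "\x{e9}t\x{e9} na\x{ef}f";

my ($if_total, $while_total) = (0, 0);
for ($line) {
    # if tests the match once, so at most one hit per line is counted.
    ++$if_total    if    /[^\0-\x{7f}]/;
    # while with /g resumes after the previous match until none are left.
    ++$while_total while /[^\0-\x{7f}]/g;
}

print "if counted $if_total, while counted $while_total\n";
```

On this sample the if version reports 1 and the while version reports 3.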

Unsure of unpack: You're not calling it in your sample, so I don't know what you mean. Leftover from another test-run, I suppose. You might be thinking:

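# unpack "U*" returns code points (numbers), so compare them directly; no ord() needed.
foreach my $cp (unpack "U*", $_) {
   ++$chars{$cp}  if ($cp > 127);
   }

# ... later
while (my ($cp, $count)= each %chars) {
   printf "character U+%04x seen %d times.\n", $cp, $count;
   }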

Keep it up!

—John

In reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz
in thread regex for utf-8 by jjohhn
