use utf8; does more than allow UTF-8 characters in the source file (in strings and even in identifier names!). After all, there weren't any in the sample, so I didn't include it for that purpose. Dig a little deeper, or try leaving it out and see what happens, once you know the original works OK (typos fixed, agrees with your data format, etc.).
The chart: I don't use a chart, but I can convert to/from UTF-8 on a whiteboard. Can you? Look at the numbers in binary, and at the reason for them, not just at the numbers themselves. Go back to the original source document on UTF-8 if it's not explained in the Perl docs. You can find it at unicode.org, or in the back of the book if you own a copy.
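If it helps to see the whiteboard arithmetic spelled out, here is a rough sketch. The helper name utf8_bytes is made up for illustration, and it assumes a valid code point (no checks for surrogates or values above U+10FFFF). Printing the bytes in binary shows where the lead-byte prefix and the 10xxxxxx continuation bytes come from.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): encode one code point
# into its UTF-8 bytes by hand, the way you would on a whiteboard.
# Assumes a valid code point; no surrogate or range checking is done.
sub utf8_bytes {
    my ($cp) = @_;

    # 1 byte:  0xxxxxxx
    return ($cp) if $cp < 0x80;

    # 2 bytes: 110xxxxx 10xxxxxx
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;

    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;

    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Show each code point next to its bytes in binary.
for my $cp (0x41, 0xE9, 0x20AC) {
    printf "U+%04X -> %s\n", $cp,
        join ' ', map { sprintf '%08b', $_ } utf8_bytes($cp);
}
```

Each continuation byte carries six payload bits, which is why the boundaries fall at 0x80, 0x800 and 0x10000.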
Matching multibyte chars explicitly: I did that years ago and wished for better. Now, it's unnecessary. Why would you need to do that?
Your if statement will only ++ the total for the first offending character it finds in a line.
    ++$char{$&} while (/[^\0-\x{7f}]/g);
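To see the difference between the two forms side by side, here is a small sketch. The sample lines are made up, and it uses a capture group ($1) rather than $&, since $& slowed down every regex in the program on older Perls.

```perl
use strict;
use warnings;

# Made-up sample data: \x{e9} and \x{fc} are the only non-ASCII characters.
my @lines = ("caf\x{e9} m\x{fc}sli", "plain ascii", "\x{e9}t\x{e9}");

my (%first_only, %every);
for (@lines) {
    # "if" tests the line once, so at most one character per line is counted.
    ++$first_only{$1} if /([^\0-\x{7f}])/;

    # "while" with /g resumes after the previous match until none are left.
    ++$every{$1} while /([^\0-\x{7f}])/g;
}

# Sum the counts stored in a hash reference.
sub total {
    my ($h) = @_;
    my $n = 0;
    $n += $_ for values %$h;
    return $n;
}

printf "if: %d, while: %d\n", total(\%first_only), total(\%every);
# prints "if: 2, while: 4"
```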
Unsure of unpack: You're not calling it in your sample, so I don't know what you mean. Leftover from another test run, I suppose. You might be thinking:
    # unpack "U*" returns code point numbers, so compare them directly (no ord)
    foreach (unpack "U*", $_) {
        ++$chars{$_} if $_ > 127;
    }
    # ... later
    while (my ($ch, $count) = each %chars) {
        printf "character U+%04x seen %d times.\n", $ch, $count;
    }
Keep it up!
—John
In reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
by John M. Dlugosz
in thread regex for utf-8
by jjohhn