jjohhn has asked for the wisdom of the Perl Monks concerning the following question:

This is a clarified and restated question. I have a file in UTF-8 format and I want to find and tally all of the non-ASCII characters in it. The very first steps are paralyzing me. With the patient help of Thelonius, John M. Dlugosz, and others, I have come this far:
while (<FILE>) {
    if (/[\xc0-\xfd]/)   # found lead byte of a multibyte sequence
My data has a maximum character length of 3 bytes, but I might not have known that. How do I grab the entire character, so I can put it into a hash tallying the number of times it appears? The camel says "regular expressions match characters instead of bytes". The man pages on pack() and unpack() say something clearly important that I am unable to comprehend at my stage. Can somebody either hint at the next step in my code, or direct me to documentation that would help? Thank you.

Replies are listed 'Best First'.
Re: regex: searching for multi-byte characters
by blahblahblah (Priest) on Mar 01, 2003 at 05:04 UTC
    I don't understand multibyte characters that well either, but I've had to deal with them a little. So this may answer your question, or it may just show that I'm more confused than you are. If it's true that "regular expressions match characters instead of bytes", can you do something like this?
    while (<FILE>) {
        while ($_ =~ /\G(.)/g) {
            my $char = $1;
            # code here to check whether $char is one you want to tally...
        }
    }
    The code to check for what you want to tally might look like this:
    my $u = unpack('U', $char);       # code point of the character
    $tally{$char}++ if ($u > 127);    # ASCII ends at 127, so 128 and up is non-ASCII
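
Putting the two pieces above together, here is a minimal sketch of the whole tally. It rests on one assumption that is not stated in the thread: the regex only sees whole characters if the filehandle decodes UTF-8 on input, which the sketch does with an `:encoding(UTF-8)` layer on open. The helper name `tally_non_ascii` and the sample data are made up for illustration, and the character class `[^\x00-\x7F]` is swapped in for the unpack('U') test, since it selects the same characters in one step.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every non-ASCII
# character read from a filehandle that already decodes UTF-8.
sub tally_non_ascii {
    my ($fh) = @_;
    my %tally;
    while (my $line = <$fh>) {
        # [^\x00-\x7F] matches one whole character outside ASCII,
        # however many bytes it occupied in the file.
        $tally{$1}++ while $line =~ /([^\x00-\x7F])/g;
    }
    return \%tally;
}

# Made-up sample: an in-memory "file" of raw UTF-8 bytes containing
# e-acute (2 bytes) twice and the euro sign (3 bytes) once.
my $bytes = "caf\xC3\xA9 \xE2\x82\xAC5 caf\xC3\xA9\n";
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $tally = tally_non_ascii($fh);
close $fh;

# Encode on the way out so the characters print cleanly.
binmode STDOUT, ':encoding(UTF-8)';
printf "U+%04X %s %d\n", ord($_), $_, $tally->{$_} for sort keys %$tally;
```

For a real file, replace the in-memory open with something like `open my $fh, '<:encoding(UTF-8)', $path`. With the decoding layer in place there is no need to know the maximum byte length of a character in advance, because the loop never looks at individual bytes.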