in reply to Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
in thread regex for utf-8
use utf8; does more than allow UTF-8 characters to be in the source file (in strings and even for identifier names!). After all, there weren't any, so I didn't include it for that purpose. Dig a little deeper, or try leaving it out and see what happens. Do that only once you know the original works OK: typos fixed, agrees with your data format, etc.
The chart: I don't use a chart, but I can convert to/from UTF-8 on a whiteboard. Can you? Look at the reason for the numbers, not just at the numbers, in binary. Go back to the original source document on UTF-8 if it's not explained in the Perl docs. You can find it at unicode.org, or in the back of the book if you own a copy.
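To make "the reason for the numbers" concrete, here is a minimal sketch of the conversion done by hand with shifts and masks. The helper name utf8_bytes is made up for illustration, and it assumes a valid code point in the range U+0000..U+10FFFF (it does not reject surrogates). In real code you would let Perl's own encoding layers do this.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one code point into its UTF-8 byte values.
# The lead byte carries a length prefix (0, 110, 1110, 11110) and every
# continuation byte starts with 10 and carries six payload bits.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                      # 0xxxxxxx
    return (0xC0 | ($cp >> 6),                       # 110xxxxx
            0x80 | ($cp & 0x3F)) if $cp < 0x800;     # 10xxxxxx
    return (0xE0 | ($cp >> 12),                      # 1110xxxx
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),                      # 11110xxx
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Show the bit patterns for a 1-, 2- and 3-byte character.
foreach my $cp (0x41, 0xE9, 0x20AC) {
    printf "U+%04X -> %s\n", $cp,
        join ' ', map { sprintf '%08b', $_ } utf8_bytes($cp);
}
# U+0041 -> 01000001
# U+00E9 -> 11000011 10101001
# U+20AC -> 11100010 10000010 10101100
```

Reading the output in binary shows where the range boundaries 0x80, 0x800 and 0x10000 come from: they are simply the points where the payload bits no longer fit.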
Matching multibyte chars explicitly: I did that years ago and wished for better. Now, it's unnecessary. Why would you need to do that?
Your if statement will only increment the total for the first offending character it finds in a line. Loop over every match with /g instead:
    ++$char{$&} while (/[^\0-\x{7f}]/g);
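A small sketch of the difference between the two forms, on a made-up sample line. The variable names are hypothetical, and it uses a capture group and $1 where the one-liner uses $&; that swap is my choice here, the counting is the same.

```perl
use strict;
use warnings;

# Made-up sample line; the \x{...} escapes keep this source pure ASCII.
my $line = "na\x{EF}ve caf\x{E9} \x{2014} d\x{E9}j\x{E0} vu";

my (%first_only, %every);

# "if" tests the pattern once, so at most one character is counted.
++$first_only{$1} if $line =~ /([^\0-\x{7f}])/;

# "while" with /g resumes after the previous match until none are left.
++$every{$1} while $line =~ /([^\0-\x{7f}])/g;

printf "if form counted %d distinct character(s)\n", scalar keys %first_only;
printf "while form counted %d distinct character(s)\n", scalar keys %every;
printf "U+%04X seen %d time(s)\n", ord($_), $every{$_} for sort keys %every;
```

On this line the if form records only U+00EF, while the /g loop finds all five non-ASCII characters, with U+00E9 counted twice.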
Unsure of unpack: You're not calling it in your sample, so I don't know what you mean. Leftover from another test-run, I suppose. You might be thinking:
    # unpack "U*" returns code points (numbers), so compare them directly
    foreach my $cp (unpack "U*", $_) {
        ++$chars{$cp} if $cp > 127;
    }
    # ... later
    while (my ($ch, $count) = each %chars) {
        printf "character U+%04x seen %d times.\n", $ch, $count;
    }
Keep it up!
—John
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8
by jjohhn (Scribe) on Mar 03, 2003 at 17:57 UTC | |
by John M. Dlugosz (Monsignor) on Mar 03, 2003 at 23:30 UTC | |
by jjohhn (Scribe) on Mar 04, 2003 at 06:39 UTC |