Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8

Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 04:16 UTC
What's a "non-unicode character" in a file? Perl has modules for extensive manipulation in this area, and Perl reads UTF-8 natively.
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn (Scribe) on Mar 01, 2003 at 05:32 UTC
That should be "non-ASCII". My question is narrowing down to the matching part. I guess I can find the end of a character, because the high bits of the first byte tell me how many bytes it has in total, but I don't know whether the "code point" includes those high bits or not. I need to find these characters, and also record what they are.

My buddy did something similar in Java, because Java could read the file character by character, and he looked for characters >128. But he just printed the whole line containing the offending character, and I want to count the characters. I haven't looked at Java for about a year, but it may be worth swimming through public static void main to get to the solution. My deadline is coming up.

Modules: I was hoping to learn how to do this myself, but I am beginning to think this may be beyond me right now. I can't believe nobody else has written a quick little script to do just this. I'm not used to coming up against such a brick wall when I want to do something that seems pretty simple on the face of it. I looked at the Encode module; it may do this. I've never used a module before.
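On the question of whether the code point includes the high bits: it does not. In UTF-8 the leading bits of the first byte only announce the sequence length, every continuation byte has the form 10xxxxxx, and the code point is assembled from the remaining payload bits. A minimal sketch of that arithmetic follows, assuming the input is a string of raw bytes. The helper names `utf8_seq_length` and `utf8_code_point` are made up for illustration, and the sketch does not reject overlong encodings or surrogates the way a full decoder would.

```perl
use strict;
use warnings;

# Hypothetical helper: given the first byte of a sequence (as an integer),
# return how many bytes the whole character occupies, or 0 if this byte
# cannot start a character.
sub utf8_seq_length {
    my ($lead) = @_;
    return 1 if $lead < 0x80;              # 0xxxxxxx: plain ASCII
    return 2 if ($lead & 0xE0) == 0xC0;    # 110xxxxx
    return 3 if ($lead & 0xF0) == 0xE0;    # 1110xxxx
    return 4 if ($lead & 0xF8) == 0xF0;    # 11110xxx
    return 0;                              # continuation byte or invalid lead
}

# Hypothetical helper: decode the first character of a raw byte string and
# return its code point, or undef if the sequence is truncated or malformed.
sub utf8_code_point {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;
    return undef unless @b;
    my $len = utf8_seq_length($b[0]);
    return undef if $len == 0 || @b < $len;
    return $b[0] if $len == 1;

    # Drop the length-marker bits; keep only the payload of the lead byte.
    my $cp = $b[0] & (0xFF >> ($len + 1));
    for my $i (1 .. $len - 1) {
        return undef unless ($b[$i] & 0xC0) == 0x80;   # must be 10xxxxxx
        $cp = ($cp << 6) | ($b[$i] & 0x3F);            # append 6 payload bits
    }
    return $cp;
}

# The two bytes C3 A9 encode e-acute, code point U+00E9.
printf "U+%04X\n", scalar utf8_code_point("\xC3\xA9");
```

In practice the Encode module or an `:encoding(UTF-8)` input layer does this work for you; the sketch is only meant to show where the bits go.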
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 05:52 UTC
Perl reads UTF-8 natively. Regular expressions are for finding characters of interest. So, something like:
`require 5.006; use strict; use warnings; use utf8; my $count = 0; while (<>) { while (/[^\x{1}-\x{7f}]/g) { ++$count; print "Found on line $.: $_"; } } print "Total: $count offending chars found.\n";`
That is, the pattern matches anything that's NOT in the range of code points 1 through 0x7f, inclusive. The \x{1234} notation matches code point 0x1234 as a single character, all several bytes of its UTF-8 encoding. (On Perl 5.8 and later this holds only once the input has been decoded, for example with an `:encoding(UTF-8)` layer or the `-C` switch; `use utf8` covers only the script's own source text, and on undecoded input the pattern counts bytes rather than characters.)
Want to track which ones they are, not just count them all? Try something like `++$chars{$&};` inside the inner loop.
See perlvar for the meaning of `$&`, the utf8 page for the pragmatic module, and perlre for regular expressions. See the latter half of perlmod for "Perl Modules" (the beginning might just make your eyes glaze over as yet). See "Quote and quote-like operators" in perlop for \x and friends.
Now, care to tell us precisely what each line means in that example (after fixing typos)? Take your time, we're always open.
--John
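The "track which ones they are" idea can be sketched as follows. This is one possible arrangement under stated assumptions, not necessarily how the poster would write it: the sub name `tally_non_ascii` and the sample text are made up for illustration, a capture group stands in for `$&`, and the input is decoded explicitly with the `:encoding(UTF-8)` PerlIO layer (Perl 5.8 or later), so each match is one character rather than one byte.

```perl
use strict;
use warnings;

# Hypothetical helper: read decoded text from a filehandle and tally every
# character outside the ASCII range, keyed by the character itself.
sub tally_non_ascii {
    my ($fh) = @_;
    my %chars;
    while (my $line = <$fh>) {
        # Each match is one non-ASCII character; $1 holds it.
        while ($line =~ /([^\x00-\x7f])/g) {
            ++$chars{$1};
        }
    }
    return \%chars;
}

# Demo on an in-memory "file" holding raw UTF-8 bytes: e-acute appears
# twice and i-diaeresis once. A real script would open a file by name.
my $bytes = "caf\xC3\xA9 na\xC3\xAFve \xC3\xA9\n";
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!";
my $tally = tally_non_ascii($fh);
close $fh;

# Report by code point so the output itself stays pure ASCII.
for my $char (sort keys %$tally) {
    printf "U+%04X seen %d time(s)\n", ord($char), $tally->{$char};
}
```

Keying the hash by the character and printing `ord` of each key keeps the report readable on a terminal that cannot display the characters themselves.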
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by jjohhn (Scribe) on Mar 02, 2003 at 19:48 UTC
Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: regex for utf-8 by John M. Dlugosz (Monsignor) on Mar 03, 2003 at 07:53 UTC
Quick script? by John M. Dlugosz (Monsignor) on Mar 01, 2003 at 06:00 UTC
>> I can't believe nobody else has written a quick little script to do just this
I'm sure lots of people have, though maybe not for that exact problem. Here is how to do what the Java program does, in one line at the command prompt: `perl -Mutf8 -ne"print if /[^\0-\x7f]/"` (Change the quotes to suit your shell and OS: "" on Windows, usually '' on Unix.) So, no need to wade through public static void main... the Perl program's already finished by then.
--John