A day ago my boss said that some text files could not be
integrated with a search engine because the characters with a cedilla
on them were causing it to barf.
When I looked at this file in Emacs, I saw this:
\347
That is an octal escape for a single byte (decimal 231). I did a bit of reading on how Emacs sees a buffer as
a sequence of 8-bit bytes unless you turn on various interpretation
modes, and then went on about my business.
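For what it's worth, the \347 notation is just the byte's value written in octal. A minimal sketch of the conversion, using only core Perl functions (the variable name is made up for illustration):

```perl
use strict;
use warnings;

# Emacs shows an undecoded high byte as a backslash plus three octal digits.
my $byte_value = oct('347');             # octal 347 as a number

printf "decimal: %d\n",   $byte_value;   # prints "decimal: 231"
printf "hex:     0x%X\n", $byte_value;   # prints "hex:     0xE7"
```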
Next I went to AsciiTable.com
and saw the offending character. Somehow the C with a cedilla was
being displayed by the Windows command window as a Greek "tau", which
has extended ASCII value 231. And sure enough, when I ran my file through this
program:
use strict;

my $file = shift or die 'must supply filename';
open my $fh, '<', $file or die "couldn't open $file: $!";
$\ = "\n";    # append a newline to every print

while (<$fh>) {
    # print $.,$/;
    my $char;    # 1-based position of the current character in the line
    while (/\G(.)/g) {
        ++$char;
        my $c = $1;
        # flag anything outside the POSIX printable class
        if ($c =~ /[[:^print:]]/) {
            print "plain_text test failed on row $. with char # $char: <$c>\n"
                . "Unicode Value: " . unpack('C', $c);
            print "context: " . substr($_, 0, $char+5);
        }
    }
}
It said that the bad character was 231.
But it also flagged a lot of
other things in the file... possibly because one "wide" character
was being interpreted as several 8-bit chars. Then I read up on the
utf8 pragma in the Perl standard docs and put
use utf8 at the top of the program. And then,
presto-chango, only the cedilla was detected.
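The "one wide character seen as several 8-bit chars" effect can be illustrated in isolation. This is only a sketch that assumes UTF-8 encoded data; the post does not say what encoding the file actually used, so it shows the general byte-versus-character difference rather than a diagnosis of that file. It uses the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "c with cedilla" is code point U+00E7 (decimal 231).
my $char  = chr(0xE7);
my $bytes = encode('UTF-8', $char);    # the same character as UTF-8 bytes

printf "characters: %d, UTF-8 bytes: %d\n", length($char), length($bytes);
printf "byte values: %s\n", join ' ', map { ord($_) } split //, $bytes;

# Decoding turns the two bytes back into a single character.
my $decoded = decode('UTF-8', $bytes);
printf "decoded code point: %d\n", ord($decoded);
```

When reading from a file, the same decoding can be requested with an I/O layer, for example open my $fh, '<:encoding(UTF-8)', $file. As far as I know, on current Perl releases use utf8 by itself only declares the encoding of the script's own source, so the behaviour described here may depend on the Perl version in use.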
What happened was that Perl saw my file as a sequence of Unicode
characters instead of as a sequence of 8-bit bytes. So how does one
decide which Unicode characters are appropriate for one's application?
One takes a look at the Unicode
Code Charts. After looking at these, I was certain of which
Unicode values I wanted to accept, but I was only fairly sure that the
Perl POSIX [:print:] character class was equivalent. So I took
the low road (or is that the high road?) and wrote this to determine
whether to accept a string of text:
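use utf8;

# Returns 1 if every character of $column is printable ASCII (32..126),
# and an empty value otherwise. The sub name is illustrative; the original
# post showed only the body.
sub accept_text {
    my ($column) = @_;
    my $U;
    while ($column =~ /\G(.)/g) {
        # WRONG! Thanks John M. Dlugosz: $U = unpack('C', $1);
        $U = unpack('U', $1); # Now that's the ticket
        $U < 127 and $U > 31 or return;
    }
    return 1;
}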
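The same acceptance rule (every code point between 32 and 126) could also be written as a single anchored regex. This is a sketch, not the code from the post, and the name is_plain_ascii is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character is printable ASCII (32..126),
# the same range the character-by-character loop accepts.
sub is_plain_ascii {
    my ($text) = @_;
    return $text =~ /\A[\x20-\x7E]*\z/ ? 1 : 0;
}

print is_plain_ascii('plain text')  ? "ok\n" : "rejected\n";   # prints "ok"
print is_plain_ascii("gar\x{E7}on") ? "ok\n" : "rejected\n";   # prints "rejected"
```

One difference to be aware of: because . does not match a newline by default, the loop stops at the first newline and accepts the string, whereas this regex rejects any string containing one.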