in reply to Search & replace of UTF-8 characters ?
$line is still encoded. A character won't match the UTF-8 encoding of that character unless it's an ASCII character.
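To see the mismatch in isolation, here is a minimal sketch using the core Encode module. The sample text and variable names are made up for illustration. It uses U+03B4 because a character above U+00FF can never equal a single byte.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# U+03B4 GREEK SMALL LETTER DELTA, as a single character.
my $char = "\x{03b4}";

# What a handle with no decoding layer hands back: the UTF-8 bytes (CE B4).
my $raw_line = "a " . encode('UTF-8', $char) . " b\n";

# The character cannot match the undecoded bytes.
my $raw_hit = $raw_line =~ /\Q$char\E/ ? 1 : 0;

# After decoding, the line holds characters, so the match succeeds.
my $decoded_line = decode('UTF-8', $raw_line);
my $decoded_hit  = $decoded_line =~ /\Q$char\E/ ? 1 : 0;

print "raw: $raw_hit, decoded: $decoded_hit\n";    # raw: 0, decoded: 1
```

An :encoding(UTF-8) layer on the input handle does the decode step for you on every read.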
You had the right idea with -C / use open / binmode. The catch is that you're reading from ARGV (the handle behind <>), not from STDIN, and those work poorly, if at all, with ARGV.
The solution: Don't use ARGV.
my %fixes = (
    "\x{00a9}" => '\\textcopyright',
    "\x{2010}" => '-',
    "\x{fffd}" => '\\,',
    "\x{03b4}" => '$\\delta$',
    "\x{00c5}" => '\\AA{}',
);

# One alternation that matches any key of %fixes.
my ($re) = map qr/$_/, join '|', map quotemeta, keys(%fixes);

# Like <>, read from STDIN when no files are named.
@ARGV = '-' if !@ARGV;

for my $ARGV (@ARGV) {
    my $fh;
    if ($ARGV eq '-') {
        open($fh, '<&:encoding(UTF-8)', *STDIN)
            or die("Can't dup STDIN: $!\n");
    } else {
        open($fh, '<:encoding(UTF-8)', $ARGV)
            or die("Can't open \"$ARGV\": $!\n");
    }

    for (;;) {
        last if eof($fh);
        defined( my $line = <$fh> )
            or die("Can't read from \"$ARGV\": $!\n");

        $line =~ s/($re)/$fixes{$1}/g;
        print $line;
    }
}
Yeah, it sucks. Especially since ARGV normally does that error checking for you.
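If you want to check the substitution step apart from the file handling, something like this should do. It is only a sketch: the two-entry table and the sample string are cut down for illustration, and the string is already decoded.

```perl
use strict;
use warnings;

# Cut-down version of the table, for illustration only.
my %fixes = (
    "\x{00a9}" => '\\textcopyright',
    "\x{03b4}" => '$\\delta$',
);

# Build one alternation from the keys; quotemeta keeps any key from
# being read as regex syntax.
my $alternation = join '|', map quotemeta, keys %fixes;
my $re = qr/$alternation/;

# Already-decoded text: one string element per character.
my $line = "\x{00a9} 2010, \x{03b4} = 0.5";

# Capture whichever key matched and look up its replacement.
$line =~ s/($re)/$fixes{$1}/g;

print "$line\n";    # \textcopyright 2010, $\delta$ = 0.5
```

The capture is what lets a single pass handle every key: $1 holds the character that matched, and the hash supplies its replacement.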
Re^2: Search & replace of UTF-8 characters ?
by mpeever (Friar) on Feb 25, 2010 at 17:03 UTC
by ikegami (Patriarch) on Feb 25, 2010 at 17:24 UTC