Search & replace of UTF-8 characters ?

levien has asked for the wisdom of the Perl Monks concerning the following question:

Re: Search & replace of UTF-8 characters ? by ikegami (Patriarch) on Feb 25, 2010 at 16:42 UTC
`$line` is still encoded. A character won't match the UTF-8 encoding of that character unless it's an ASCII character.

You had the right idea with `-C` / `use open` / `binmode`. The catch is that you're not reading from STDIN, you're reading from ARGV, and those don't work well, if at all, with ARGV.

The solution: don't use `ARGV`.

```perl
# Map each problem character to its LaTeX replacement.
my %fixes = (
    "\x{00a9}" => '\\textcopyright',
    "\x{2010}" => '-',
    "\x{fffd}" => '\\,',
    "\x{03b4}" => '$\\delta$',
    "\x{00c5}" => '\\AA{}',
);

# One alternation that matches any of the keys.
my ($re) = map qr/$_/, join '|', map quotemeta, keys(%fixes);

# Default to STDIN when no files are named, as ARGV would.
@ARGV = '-' if !@ARGV;

for my $ARGV (@ARGV) {
    my $fh;
    if ($ARGV eq '-') {
        open($fh, '<&:encoding(UTF-8)', *STDIN)
            or die("Can't dup STDIN: $!\n");
    } else {
        open($fh, '<:encoding(UTF-8)', $ARGV)
            or die("Can't open \"$ARGV\": $!\n");
    }

    for (;;) {
        last if eof($fh);
        defined( my $line = <$fh> )
            or die("Can't read from \"$ARGV\": $!\n");

        $line =~ s/($re)/$fixes{$1}/g;
        print $line;
    }
}
```

Yeah, it sucks. Especially since ARGV normally does that error checking for you.
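
To make the point about encoded input concrete, here is a minimal sketch, not part of the original reply. It assumes the input bytes are the UTF-8 encoding of U+03B4 followed by ASCII text, and it uses the `Encode` module instead of an I/O layer so the example needs no files. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as an undecoded file handle would return them:
# the UTF-8 encoding of U+03B4 (0xCE 0xB4) followed by ASCII text.
my $bytes = "\xCE\xB4 = 0.5";

# The pattern is written in terms of characters, as in %fixes.
my $re = qr/\x{03b4}/;

# Against the raw bytes the pattern finds nothing: the string holds
# the two characters U+00CE and U+00B4, not U+03B4.
my $raw_matches = ($bytes =~ $re) ? 1 : 0;

# After decoding, the string holds the single character U+03B4.
my $chars           = decode('UTF-8', $bytes);
my $decoded_matches = ($chars =~ $re) ? 1 : 0;

# Substitute on the decoded copy, then encode again for output.
(my $fixed = $chars) =~ s/$re/\$\\delta\$/g;

print "raw bytes match:     $raw_matches\n";
print "decoded chars match: $decoded_matches\n";
print encode('UTF-8', $fixed), "\n";
```

U+03B4 is used here rather than U+00A9 because the second byte of the UTF-8 encoding of U+00A9 is itself 0xA9, so a character pattern for U+00A9 would match that stray byte and corrupt the output instead of simply failing.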
Re^2: Search & replace of UTF-8 characters ? by mpeever (Friar) on Feb 25, 2010 at 17:03 UTC
I'm asking, not arguing... Wouldn't it have worked if the script was called via a command-line pipe? So if it was called as

`./levians_program.pl < source.bib > source_corrected.bib`

that ought to work, right?
Re^3: Search & replace of UTF-8 characters ? by ikegami (Patriarch) on Feb 25, 2010 at 17:24 UTC
If you added some means of decoding to STDIN and encoding STDOUT, yes.

```
>perl -CSD -we"print chr 0x2660" | perl -CSD -we"printf qq{%X\n}, ord +<STDIN>"
2660
```

`-C` even works if you read STDIN through ARGV:

```
>perl -CSD -we"print chr 0x2660" | perl -CSD -we"printf qq{%X\n}, ord +<>"
2660
```

I don't have time to check the other tools right now.

Update: Hey! `-C` DOES work with ARGV. I knew `binmode` and `use open` had problems with ARGV, so I took the OP's word for it when he said `-C` didn't work with it either.

```
>perl -CSD -we"print chr 0x2660" > foo

>perl -CSD -we"printf qq{%X\n}, ord <>" foo
2660
```
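
For the redirected invocation asked about above, "some means of decoding to STDIN and encoding STDOUT" could also be done inside the script with `binmode`. The following is a rough sketch under that assumption. `fix_stream` is a hypothetical helper name, only two of the fixes are carried over, and the demonstration runs on in-memory handles so that it does not wait on real input.

```perl
use strict;
use warnings;

# Character-level fixes, a subset of the table used earlier in the thread.
my %fixes = (
    "\x{00a9}" => '\\textcopyright',
    "\x{2010}" => '-',
);
my $alternation = join '|', map { quotemeta } keys %fixes;
my $re          = qr/($alternation)/;

# Hypothetical helper: decode the input handle, substitute, and
# encode whatever is written to the output handle.
sub fix_stream {
    my ($in, $out) = @_;
    binmode($in,  ':encoding(UTF-8)') or die "Can't set input layer: $!\n";
    binmode($out, ':encoding(UTF-8)') or die "Can't set output layer: $!\n";
    while (my $line = <$in>) {
        $line =~ s/$re/$fixes{$1}/g;
        print {$out} $line;
    }
    return;
}

# Demonstration on in-memory handles holding raw UTF-8 bytes.
my $input  = "\xC2\xA9 2010 Ren\xC3\xA9, pages 1\xE2\x80\x9010\n";
my $output = '';
open(my $in,  '<', \$input)  or die "Can't open in-memory input: $!\n";
open(my $out, '>', \$output) or die "Can't open in-memory output: $!\n";
fix_stream($in, $out);
close($out) or die "Can't close in-memory output: $!\n";

print $output;
```

In a real script the call would be `fix_stream(\*STDIN, \*STDOUT)`. Characters that are not in `%fixes`, such as the accented letter in the sample line, pass through and are written back out as UTF-8.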
Re: Search & replace of UTF-8 characters ? by 7stud (Deacon) on Feb 25, 2010 at 17:19 UTC
"`$line` is still encoded. A character won't match the UTF-8 encoding of that character unless it's an ASCII character."

While that may be an accurate statement, trying to decipher what it means is not easy. Here is how I would put it: a Unicode character is not the same as a Unicode character encoded in UTF-8. There are many encodings, and UTF-8 is only one of them. However, there is only one Unicode character for the copyright symbol. Simply put, if you want to match UTF-8 characters in a string, then you need to use UTF-8 characters in your substitution, not Unicode characters. Here is a code example:

```perl
use strict;
use warnings;
use 5.010;
use Encode;

my $unicode_str = "\x{00a9}";
my $utf8_str = encode('utf-8', $unicode_str);
say $utf8_str;  #copyright symbol

my $line = "$utf8_str hello world";
$line =~ s/$utf8_str/\\textcopyright/;
say $line;  #\textcopyright hello world

#Or you can just start with the UTF-8 character
#for the copyright symbol:
$line = "\xC2\xA9 hello world";
say $line;  #copyright symbol followed by 'hello world'
$line =~ s/\xC2\xA9/\\textcopyright/;
say $line;  #\textcopyright hello world
```

In my opinion, the easiest way to understand the whole Unicode thing is this: a Unicode escape sequence is an integer. An 'encoding' converts a Unicode integer into a character. An encoding is just a list that looks like this:

```
1 => chinese character for the new year
2 => japanese character for fish
3 => happy face
...
...
60,000 => mongolian character for beef
...
```

So an encoding takes Unicode integers and translates them into characters. Different encodings translate the Unicode integers into different characters. UTF-8 is just one encoding, which is very popular.
Re^2: Search & replace of UTF-8 characters ? by ikegami (Patriarch) on Feb 25, 2010 at 18:46 UTC
"While that may be an accurate statement, trying to decipher what it means is not easy"

I didn't want to spend much time confirming something the OP appeared to already know, but thanks for elaborating.

Update: Although I think your elaboration is flawed.

"a unicode escape sequence is an integer. An 'encoding' converts a unicode integer into a character. An encoding is just a list that looks like this:"

Determining which character a value represents is unrelated to encoding/decoding. Decoding from UTF-8:

```
...
01       => 01   START OF HEADING
...
30       => 30   DIGIT ZERO
...
E2 99 A0 => 2660 BLACK SPADE SUIT
...
```

Encoding is the reverse operation.

There is no difference between 2660 and black spade suit. Black spade suit is just the meaning assigned to 2660. Decoding is definitely not the process of going from 2660 to black spade suit as you claim.
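
The `E2 99 A0 => 2660` row of the table above can be checked directly. This is a small sketch using the `Encode` module, with variable names chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One character: code point U+2660.
my $char = chr 0x2660;

# Encoding turns the character into a sequence of bytes.
my $bytes = encode('UTF-8', $char);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Decoding is the reverse operation: bytes back to the code point.
my $back = decode('UTF-8', $bytes);

printf "encoded: %s (%d bytes)\n", $hex, length($bytes);
printf "decoded: %X (%d character)\n", ord($back), length($back);
```

Neither step involves the name "BLACK SPADE SUIT". The encoded form is three bytes and the decoded form is the single number 2660 (hex).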
Re^3: Search & replace of UTF-8 characters ? by 7stud (Deacon) on Feb 26, 2010 at 01:40 UTC
"Although I think your elaboration is flawed."

It's definitely not accurate. At the same time, anyone can understand my model, and they should be able to use it to successfully distinguish between Unicode code points and encodings like UTF-8, and convert between them. Or they can read a tutorial on Unicode and be completely confused, and not be able to write any code at all.

"Decoding is definitely not the process of going from 2660 to black spades suit as you claim."

Encoding = convert Unicode integer to UTF-8 character for output

Decoding = convert UTF-8 character to Unicode integer for input

That simple model will allow any Unicode beginner to write a lot of code before having to adjust their mental model. For what it's worth, I've never read a single Unicode tutorial that will actually allow you to write code.
Re^4: Search & replace of UTF-8 characters ? by ikegami (Patriarch) on Feb 26, 2010 at 06:03 UTC
Re: Search & replace of UTF-8 characters ? by levien (Initiate) on Feb 26, 2010 at 00:16 UTC
Thank you for the answer and the explanations! It works as it should now, and I learned a thing or two about Perl and UTF-8. :-) Indeed I hadn't realised that the "easy" way of doing a search & replace does not use STDIN but ARGV, and also that it does not consider the input files as being UTF-8 encoded by default...
Re^2: Search & replace of UTF-8 characters ? by ikegami (Patriarch) on Feb 26, 2010 at 06:20 UTC
"and also that it does not consider the input files as being UTF-8 encoded by default..."

Perl has no idea what's in the file. It cannot assume the file's content is text encoded with UTF-8. In fact, it cannot assume the file's content is text at all. Unless you tell Perl otherwise, it gives you the file's contents: bytes.
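
The difference is easy to see by reading the same file twice. This is a minimal sketch that assumes no default layers are in effect (no `-C`, `PERL_UNICODE`, or `use open`) and uses a scratch file created with `File::Temp`:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the three UTF-8 bytes of U+2660 to a scratch file, as raw bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp) or die "Can't binmode \"$path\": $!\n";
print {$tmp} "\xE2\x99\xA0";
close($tmp) or die "Can't close \"$path\": $!\n";

# No layer requested: Perl hands back the file's bytes unchanged.
open(my $raw, '<', $path) or die "Can't open \"$path\": $!\n";
my $as_bytes = <$raw>;
close($raw);

# Decoding layer requested: the same bytes arrive as one character.
open(my $txt, '<:encoding(UTF-8)', $path) or die "Can't open \"$path\": $!\n";
my $as_chars = <$txt>;
close($txt);

printf "no layer:        length %d\n", length($as_bytes);
printf "encoding(UTF-8): length %d, code point %X\n",
    length($as_chars), ord($as_chars);
```

The first read yields a string of length 3 and the second a string of length 1 holding U+2660.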
Re^3: Search & replace of UTF-8 characters ? by Anonymous Monk on Feb 26, 2010 at 07:26 UTC
perl does assume it is text; that is why you have to `binmode`