in reply to converting text file encodings
This should only come up when converting from unicode to any non-unicode character set (unless your input is corrupt/invalid -- see below). If/when you need to convert from one non-unicode set to another, you'll want to convert the input to unicode first, then convert from unicode to the desired output encoding.
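The two-step route described above can be sketched with the core Encode module. The sample bytes and the choice of iso-8859-1 and cp437 are assumptions made purely for illustration; substitute whatever encodings your data really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: the bytes for "cafe" with an e-acute, in iso-8859-1.
my $latin1_bytes = "caf\xE9";

# Step 1: decode the input bytes into a Perl (unicode) character string.
my $chars = decode('iso-8859-1', $latin1_bytes);

# Step 2: encode that character string into the desired output encoding.
my $cp437_bytes = encode('cp437', $chars);

# Show the resulting bytes in hex, separated by dots.
printf "%vX\n", $cp437_bytes;
```

Going through unicode means each encoding only needs a mapping to and from unicode, not to every other character set.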
When converting from unicode to non-unicode, any unicode character that does not exist in the output encoding will be replaced by a question-mark character ("?"). So if you really want these things to be converted to "X" instead, you need to account for any "?" characters that were already present in the input, and change only the cases of "?" that were created by the conversion.
One possible way to do that would be to divide the input into chunks using split /\?/ (with a limit of -1, so that trailing empty chunks are not dropped), convert each chunk of characters, change any newly created "?" within those chunks to "X", then put the chunks back together again with join '?', ...
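Here is a minimal sketch of that split/convert/join idea. The helper name encode_with_x is hypothetical, and the sketch assumes an ASCII-compatible, single-byte output encoding (such as the iso-8859 family) whose substitution character is "?".

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $chars into $encoding, turning only the "?"
# characters created by the conversion into "X".
sub encode_with_x {
    my ($encoding, $chars) = @_;

    # The -1 limit keeps trailing empty fields, so a "?" at the very end
    # of the input survives the split/join round trip.
    my @chunks = split /\?/, $chars, -1;

    for my $chunk (@chunks) {
        # No original "?" is left inside a chunk, so any "?" in the
        # encoded bytes must be a substitution made by encode().
        $chunk = encode($encoding, $chunk);
        $chunk =~ tr/?/X/;
    }

    # Put the original question marks back between the chunks.
    return join '?', @chunks;
}

# U+263A has no iso-8859-1 equivalent; the two "?" are genuine input.
my $text  = "What is \x{263A}? caf\x{E9}?";
my $bytes = encode_with_x('iso-8859-1', $text);
```

The original question marks never pass through the tr/?/X/ step, so they come out unchanged. The Encode documentation also describes passing a code reference as the CHECK argument to supply your own fallback for unmappable characters; if your installed version supports that, it may be simpler than splitting, but check perldoc Encode before relying on it.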
The only time a conversion into unicode would involve an "unsupported character" is if your input doesn't really use the encoding you think it does (e.g. you think it's iso-8859-4 and try to decode it as such, but it really isn't), or when there's corruption in the data (e.g. part of a multi-byte character is missing, or one or more bytes have been altered or added, making the data invalid for the character set that it's supposed to be using).
In these cases, the unicode result will contain the "replacement character" ("\x{fffd}") for each "uninterpretable" input byte. If your intended output happens to be unicode, it will be best to leave the replacement characters as-is -- maybe just check for them and issue a warning when they occur -- because they are an unambiguous indicator of problems found in the input. When you want to do a conversion of such a unicode string to some non-unicode encoding, you just need to do s/\x{fffd}/X/g first (because converting a unicode "\x{fffd}" character to any non-unicode character set will always produce a "?" character).
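As a rough sketch of that last case: the sample input below is an assumption (iso-8859-1 bytes wrongly treated as UTF-8), and it relies on decode's default behavior of substituting "\x{fffd}" instead of dying.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative bad input: \xE9 is e-acute in iso-8859-1, but as UTF-8 it
# is a lead byte with no continuation bytes after it, so it is invalid.
my $bad_bytes = "caf\xE9 au lait";

# With the default CHECK, decode() marks uninterpretable input with U+FFFD.
my $chars = decode('UTF-8', $bad_bytes);

# Count the replacement characters and warn if any were produced.
my $bad_count = () = $chars =~ /\x{FFFD}/g;
warn "input contained $bad_count uninterpretable byte(s)\n" if $bad_count;

# For non-unicode output, turn the markers into "X" before encoding;
# otherwise encode() would turn each one into "?".
(my $marked = $chars) =~ s/\x{FFFD}/X/g;
my $latin1_bytes = encode('iso-8859-1', $marked);
```

Working on a copy ($marked) leaves the replacement characters intact in $chars, so the same string can still be written out as unicode with its problem markers preserved.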
Re^2: converting text file encodings
by andal (Hermit) on May 06, 2011 at 08:37 UTC

Re^2: converting text file encodings
by John M. Dlugosz (Monsignor) on May 06, 2011 at 07:29 UTC