Re: converting text file encodings

I need to replace unsupported characters with letter 'X' while converting the encodings.

This should only come up when converting from unicode to any non-unicode character set (unless your input is corrupt/invalid -- see below). If/when you need to convert from one non-unicode set to another, you'll want to convert the input to unicode first, then convert from unicode to the desired output encoding.
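As a minimal sketch of that two-step conversion, assuming the standard Encode module (the helper name recode and the sample encodings are only illustrations, not anything from the original question):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert octets from one non-unicode encoding to
# another by going through a unicode (character) string in between.
sub recode {
    my ($octets, $from, $to) = @_;
    my $text = decode($from, $octets);   # octets -> unicode characters
    return encode($to, $text);           # unicode characters -> octets
}

# "\xE9" is e-acute in iso-8859-1; cp437 stores the same letter as "\x82".
my $out = recode("caf\xE9", 'iso-8859-1', 'cp437');
printf "%vX\n", $out;
```

Any character that exists in the source encoding but not in the target one would hit the "unsupported character" case during the second step.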

When converting from unicode to non-unicode, any unicode character that does not exist in the output encoding will be replaced by a question-mark character ("?"), so if you really want these things to be converted to "X" instead, you'll need to check for the presence of "?" characters in the input, and only change the cases of "?" that were created by the conversion.

One possible way to do that would be to divide the input into chunks using split /\?/ (with a limit of -1, so that trailing empty chunks are not dropped), convert each chunk of characters, change any newly created "?" within those chunks to "X", then put the chunks back together again with join '?', ....
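The split/join idea could look roughly like this. It is only a sketch: the function name encode_with_x is made up, and it assumes the target encoding is ASCII-compatible, so that a "?" in the encoded output is the single byte 0x3F.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text, turning unmappable characters into
# "X" while leaving question marks that were already in the text alone.
sub encode_with_x {
    my ($text, $encoding) = @_;

    # Limit -1 keeps trailing empty fields, so a trailing "?" survives.
    my @chunks = split /\?/, $text, -1;

    for my $chunk (@chunks) {
        my $octets = encode($encoding, $chunk);
        # No chunk contained a "?" before encoding, so every "?" here
        # was created by the conversion.
        $octets =~ tr/?/X/;
        $chunk = $octets;
    }
    return join '?', @chunks;
}

# U+4E2D does not exist in iso-8859-1; the original "?" is kept.
print encode_with_x("Why \x{4e2d}?", 'iso-8859-1'), "\n";
```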

The only time a conversion into unicode would involve an "unsupported character" is if your input doesn't really use the encoding you think it does (e.g. you think it's iso-8859-4 and try to decode it as such, but it really isn't), or when there's corruption in the data (e.g. part of a multi-byte character is missing, or one or more bytes have been altered or added, making the data invalid for the character set that it's supposed to be using).

In these cases, the unicode result will contain the "replacement character" ("\x{fffd}") for each "uninterpretable" input byte. If your intended output happens to be unicode, it will be best to leave the replacement characters as-is -- maybe just check for them and issue a warning when they occur -- because they are an unambiguous indicator of problems found in the input. When you want to do a conversion of such a unicode string to some non-unicode encoding, you just need to do s/\x{fffd}/X/g first (because converting a unicode "\x{fffd}" character to any non-unicode character set will always produce a "?" character).
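Here is a rough sketch of that last case, assuming the input is supposed to be UTF-8 and the output is iso-8859-1 (both chosen only for illustration; the helper name convert_marking_damage is made up too):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode possibly damaged octets, count the
# replacement characters, and turn them into "X" before encoding.
sub convert_marking_damage {
    my ($octets, $from, $to) = @_;

    # With the default CHECK value, uninterpretable input becomes U+FFFD.
    my $text = decode($from, $octets);

    # Count the replacement characters so the caller can issue a warning.
    my $damaged = () = $text =~ /\x{FFFD}/g;

    # Do this before encoding; afterwards they would be plain "?".
    $text =~ s/\x{FFFD}/X/g;

    return (encode($to, $text), $damaged);
}

# "\xFF" can never occur in valid UTF-8.
my ($out, $damaged) = convert_marking_damage("ab\xFFcd", 'UTF-8', 'iso-8859-1');
warn "input had $damaged uninterpretable byte(s)\n" if $damaged;
print "$out\n";
```

Note that this only marks damage found while decoding; characters that are valid unicode but missing from the output encoding would still come out as "?".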


Replies are listed 'Best First'.
Re^2: converting text file encodings by andal (Hermit) on May 06, 2011 at 08:37 UTC
Well, the function Encode::encode takes a third argument that lets you define how "bad" characters are handled. I believe that with its help one can replace them with something other than a question mark (?). For example:

```perl
use strict;
use Encode;

# my raw data
my $data = "\xe0\xe1\x02\n";

# interpret it as text in encoding CP1251
my $txt = Encode::decode("cp1251", $data);

# convert text to octets in encoding Latin1.
# Replace bad ones with X
my $nd = Encode::encode("latin1", $txt, sub { return "X" });

# just to see the result in the terminal which uses UTF-8 encoding
Encode::from_to($nd, "latin1", "UTF-8");
print $nd, "\n";
```
Re^2: converting text file encodings by John M. Dlugosz (Monsignor) on May 06, 2011 at 07:29 UTC
What, the encoder function doesn't have a way to configure the error behavior? Sounds so very Microsoft. (I've had to avoid using the built-in Win32 functions for reasons related to that.)