sdperl has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading an XML file encoded in UTF-8. I want to strip out the XML and convert the data to ISO-8859-2, writing an octal escape code for any character outside the printable 7-bit ASCII range. So far my code seems to work 99% of the time, but some characters, such as a dash '-', come out as a question mark '?'. Does this mean that it could not encode from a unicode - to the standard ascii - ?

Here is a code snippet:

use Encode;

sub convertData {
    my $tempString = encode("iso-8859-2", shift);
    my $newString  = "";
    for ( my $l = length($tempString), my $i = 0; $i < $l; $i++ ) {
        my $chr   = substr( $tempString, $i, 1 );
        my $ascii = ord($chr);
        if ( ( $ascii < 32 ) or ( $ascii > 126 ) ) {
            $newString .= sprintf( "\\%03o", $ascii );
        }
        else {
            $newString .= $chr;
        }
    }
    return $newString;
}

Also, I am fairly new to Perl and encoding, so let me know if I am doing anything that is unsafe or not quite correct.

Replies are listed 'Best First'.
Re: convert utf8 to iso-8859-2 to octal code
by ikegami (Patriarch) on Oct 24, 2011 at 21:54 UTC

    Does this mean that it could not encode from a unicode - to the standard ascii -

    It means the character does not exist in iso-8859-2. "-" does exist, so it's not what you actually had. Maybe it was "–" or "—" or one of the other various dashes.
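One way to find out which dash the data really contains is to decode it and print the code point and Unicode name of every non-ASCII character. This is only a sketch: the sample bytes in $bytes are made up for illustration, and it assumes the input really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Made-up sample: raw UTF-8 bytes containing a plain hyphen,
# an en dash (E2 80 93) and an em dash (E2 80 94).
my $bytes = "a-b \xE2\x80\x93 c \xE2\x80\x94 d";

# Turn the bytes into a Perl character string first.
my $text = decode("UTF-8", $bytes);

# Collect one line per character outside 7-bit ASCII.
my @report;
for my $ch (split //, $text) {
    next if ord($ch) < 128;
    push @report, sprintf "U+%04X %s", ord($ch), charnames::viacode(ord $ch);
}

print "$_\n" for @report;
# U+2013 EN DASH
# U+2014 EM DASH
```

The plain "-" (U+002D) is skipped because it is ordinary ASCII; only the look-alike dashes show up in the report.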

    You can have encode do something other than substitute with "?" by using its third parameter.
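A minimal sketch of what that third parameter (called CHECK in the Encode documentation) can do, assuming the text has already been decoded into a Perl character string. The sample string and variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "a\x{2013}b";    # U+2013 EN DASH does not exist in iso-8859-2

# No third parameter: the unmappable character becomes "?".
my $default = encode("iso-8859-2", $text);

# FB_PERLQQ: the unmappable character becomes a literal \x{HHHH} escape.
my $perlqq = encode("iso-8859-2", $text, Encode::FB_PERLQQ());

# Code reference: called with the code point, its return value is substituted.
my $tagged = encode("iso-8859-2", $text, sub { sprintf "<U+%04X>", shift });

# FB_CROAK: die instead of substituting. LEAVE_SRC keeps $text unmodified.
my $ok = eval {
    encode("iso-8859-2", $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};

print "$default\n";    # a?b
print "$perlqq\n";     # a\x{2013}b
print "$tagged\n";     # a<U+2013>b
print "encode croaked: $@" unless $ok;
```

Croaking is probably the safest choice while you are still finding out what is in the data, since nothing gets silently replaced.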

    I think you would like to pass your text through Text::Unidecode before encoding it.

Re: convert utf8 to iso-8859-2 to octal code
by graff (Chancellor) on Oct 25, 2011 at 02:18 UTC
    ikegami's reply covers the problem as stated in the OP, but whenever you need to convert unicode text to some non-unicode encoding, it's useful to be able to look at it with a general-purpose diagnostic tool, to check for unicode characters that don't exist in the target encoding. Look at this node for one approach: unichist -- count/summarize characters in data (and you might also want to look at tlu -- TransLiterate Unicode).
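For a quick check without any extra tools, something like the following might do. It is not unichist or tlu, just a rough sketch built on Encode's code-reference fallback; the helper name unmappable_histogram and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: count the characters in $text that have no
# representation in $encoding. Returns a hash ref of code point => count.
sub unmappable_histogram {
    my ($text, $encoding) = @_;
    my %count;
    # The fallback is called once per unencodable character with its
    # code point; the encoded result itself is thrown away.
    encode($encoding, $text, sub { $count{ $_[0] }++; return "?" });
    return \%count;
}

# Made-up sample: e-acute and c-caron exist in iso-8859-2, the dashes do not.
my $sample = "caf\x{e9} \x{2013} \x{10d}aj \x{2014} \x{2013}";
my $hist   = unmappable_histogram($sample, "iso-8859-2");

for my $cp (sort { $a <=> $b } keys %$hist) {
    printf "U+%04X  %d\n", $cp, $hist->{$cp};
}
```

Running this over the decoded text before the real conversion shows exactly which characters would otherwise turn into "?".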