convert utf8 to iso-8859-2 to octal code

sdperl has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading an XML file in UTF-8. I want to strip out the XML and convert the data to ISO-8859-2, writing any character outside 7-bit ASCII as an octal code. So far it looks like my code works 99% of the time. But some characters, such as a dash '-', are converted into a question mark '?'. Does this mean that it could not encode a Unicode '-' as the standard ASCII '-'?

Here is a code snippet:

use strict;
use warnings;
use Encode;

# Encode a character string as ISO-8859-2, then replace every byte outside
# printable ASCII (32..126) with a backslash followed by its 3-digit octal code.
sub convertData {
    my $tempString = encode("iso-8859-2", shift);
    my $newString  = "";

    for my $i ( 0 .. length($tempString) - 1 ) {
        my $chr   = substr( $tempString, $i, 1 );
        my $ascii = ord($chr);

        if ( $ascii < 32 or $ascii > 126 ) {
            $newString .= sprintf( "\\%03o", $ascii );
        }
        else {
            $newString .= $chr;
        }
    }

    return $newString;
}

Also, I am fairly new to Perl and encoding, so let me know if I am doing something that is unsafe or not quite correct.


Replies are listed 'Best First'.
Re: convert utf8 to iso-8859-2 to octal code by ikegami (Patriarch) on Oct 24, 2011 at 21:54 UTC
"Does this mean that it could not encode from a unicode - to the standard ascii -"

It means the character does not exist in iso-8859-2. The ASCII "-" does exist there, so that is not what you actually had. Maybe it was "–" or "—" or one of the various other dashes.

You can have `encode` do something other than substitute "?" by using its third parameter. I think you would like to pass your text through Text::Unidecode before encoding it.
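To illustrate the third parameter mentioned above, here is a minimal sketch of a few fallback behaviors that Encode offers. The sample string and the `%fallback` table are made up for illustration; only `encode`, `Encode::FB_PERLQQ`, `Encode::FB_CROAK`, and the code-reference form of the check argument come from the Encode module itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "a", EN DASH (U+2013), "b"; the en dash has no ISO-8859-2 code point.
my $text = "a\x{2013}b";

# Default behavior: the unmappable character becomes "?".
my $default = encode("iso-8859-2", $text);

# FB_PERLQQ: the unmappable character becomes a \x{...} escape.
my $perlqq = encode("iso-8859-2", $text, Encode::FB_PERLQQ);

# Code reference: called with the code point of each unmappable character.
# %fallback is an illustrative table, not part of Encode.
my %fallback = ( 0x2013 => "-", 0x2014 => "--" );
my $custom = encode("iso-8859-2", $text, sub { $fallback{ $_[0] } // "?" });

# FB_CROAK: die instead of substituting. It may modify its input,
# so a copy is passed here.
my $copy = $text;
my $ok = eval { encode("iso-8859-2", $copy, Encode::FB_CROAK); 1 };

print "default: $default\n";
print "perlqq:  $perlqq\n";
print "custom:  $custom\n";
print "croak:   ", ( $ok ? "encoded cleanly" : "unmappable character found" ), "\n";
```

The code-reference form is probably the closest fit for the original question, since it lets you map each problem dash to a plain "-" while leaving the Latin-2 letters untouched.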
Re: convert utf8 to iso-8859-2 to octal code by graff (Chancellor) on Oct 25, 2011 at 02:18 UTC
ikegami's reply covers the problem as stated in the OP, but whenever you need to convert Unicode text to some non-Unicode encoding, it's useful to be able to look at it with a general-purpose diagnostic tool, to check for Unicode characters that don't exist in the target encoding. Look at this node for one approach: unichist -- count/summarize characters in data (and you might also want to look at tlu -- TransLiterate Unicode).
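As a rough idea of what such a check can look like, here is a small sketch that counts the characters in a string that the target encoding cannot represent. It is not the unichist tool itself; `unmappable_chars` is a hypothetical helper name and the sample text is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return a hash ref of "U+XXXX" => count for every
# character in $text that $encoding cannot represent.
sub unmappable_chars {
    my ( $text, $encoding ) = @_;
    my %count;
    for my $ch ( split //, $text ) {
        my $copy = $ch;    # FB_CROAK may modify its argument
        next if eval { encode( $encoding, $copy, Encode::FB_CROAK ); 1 };
        $count{ sprintf "U+%04X", ord $ch }++;
    }
    return \%count;
}

# Sample: e-acute (present in ISO-8859-2), two en dashes, one em dash.
my $report = unmappable_chars( "caf\x{E9} \x{2013} \x{2014} \x{2013}", "iso-8859-2" );
printf "%s: %d\n", $_, $report->{$_} for sort keys %$report;
```

Running a check like this over the decoded XML text before calling `encode` shows exactly which code points would otherwise turn into "?".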