DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Mastering encoding/decoding problems is not an easy task, especially when it's not possible to alter the context in which the problem lies (script encoding, website encoding, database encoding, etc.).

I know that I as a developer need to get a grip on how everything works, and I also need to find a method for converting data into the data I need.
So, I have the following questions for you:

1. Is there a method for understanding the encoding issues in Perl, regarding the different "artifacts" that can affect it? Some kind of flowchart would be nice.
2. Let's say that the preferred "destination data", regardless of its original encoding, is supposed to be ISO-8859-1. How can I accomplish this? Automatically or semi-automatically?
Or, more compactly: I need a method for dealing with data in different encodings that is supposed to end up in a certain encoding (or - I need help learning Perl encoding/decoding). What are your tips?

Replies are listed 'Best First'.
Re: A definitive way to handle encoding/decoding problems?
by moritz (Cardinal) on Apr 11, 2012 at 15:08 UTC

    There are several resources that can help you. I have written Character Encodings in Perl, and there's also the Perl Programming Unicode/UTF-8 wikibook.

    The simplest advice boils down to decoding everything that comes into your program, and encoding everything that leaves your program.

    All in all, encoding problems are not really different from any other bugs that produce wrong output. The real problem is that programmers often don't have a clear mental image of which encoding each data source uses for its strings, and of what the modules they use do with those strings.
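The rule of thumb above ("decode everything coming in, encode everything going out") can be sketched in a few lines. The byte strings and the ISO-8859-1 target below are made-up examples, not from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode at the boundary in: bytes from outside become a character string.
my $bytes_in = "caf\xC3\xA9";               # UTF-8 bytes for "café"
my $text     = decode('UTF-8', $bytes_in);  # 4 characters: c a f é

# Work with $text as characters inside the program, then
# encode at the boundary out: characters become bytes in the target encoding.
my $bytes_out = encode('ISO-8859-1', $text);  # "caf\xE9", 4 bytes
```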

Re: A definitive way to handle encoding/decoding problems?
by zentara (Cardinal) on Apr 11, 2012 at 18:03 UTC
    See tchrist's OSCON Perl Unicode Slides and his HTML slideshow (use the arrow keys to advance/retreat the slideshow). The definitive way is in there.

    tchrist suggests all scripts dealing with Unicode start like this (from his slideshow). It's simple, isn't it? heh heh :-)

    #!/usr/bin/env perl

    use v5.14;
    use utf8;
    use strict;
    use autodie;
    use warnings;
    use warnings  qw< FATAL utf8 >;
    use open      qw< :std :encoding(UTF-8) >;
    use charnames qw< :full >;
    use feature   qw< unicode_strings >;

    # The first of these is almost always needed; the rest, not so much.
    use Unicode::Normalize qw< NFD NFC >;
    use Encode             qw< encode decode >;
    use Carp               qw< carp croak confess cluck >;

    use File::Basename qw< basename >;
    $0 = basename($0);   # shorter messages

    binmode(DATA, ":encoding(UTF-8)");

    # This works like perl -CA: note that it
    # assumes your terminal is set to use UTF-8
    if (grep /\P{ASCII}/ => @ARGV) {
        @ARGV = map { decode("UTF-8", $_) } @ARGV;
    }

    $| = 1;   # comment out for performance
    END { close STDOUT }

    # This avoids compile-time “bugs” in the pragma:
    # XXX: use warnings FATAL => "all";
    local $SIG{__DIE__} = sub {
        confess "Uncaught exception: @_" unless $^S;
    };
    local $SIG{__WARN__} = sub {
        if ($^S) { cluck   "Trapped warning: @_" }
        else     { confess "Deadly warning: @_" }
    };

    # I use this on normal CLI filters:
    if (@ARGV == 0 && -t STDIN && -t STDERR) {
        print STDERR "$0: reading input from tty, type ^D for EOF...\n";
    }

    while (<>) {
        chomp;
        $_ = NFD($_);
        ...
    } continue {
        say NFC($_);
    }

    __END__

Re: A definitive way to handle encoding/decoding problems?
by halfcountplus (Hermit) on Apr 11, 2012 at 15:08 UTC

    a) The encoding of the script is not relevant to the encoding of the data manipulated by the process, as long as you tell perl what the script encoding is when you use non-ASCII literals in it. See: encoding
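A minimal sketch of point (a). On modern perls the usual way to declare that the script's own source is UTF-8 is the utf8 pragma (the encoding pragma linked above has since been deprecated); the string literal here is just an example:

```perl
use strict;
use warnings;
use utf8;   # the source of *this script* is encoded as UTF-8

my $greeting = "héllo";   # non-ASCII literal, read as 5 characters
# Because of `use utf8`, length() counts characters, not bytes:
# length($greeting) == 5
```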

    b) Don't worry about (or rely on an assumption about) what internal encoding perl uses. http://perldoc.perl.org/perlunifaq.html#I-lost-track%3B-what-encoding-is-the-internal-format-really%3F

    c) If you know what the encoding of the data you are reading is, use decode() on it, or set binmode($fh, ':encoding(xxx)') on the filehandle. This converts it into the internal representation so you can work with it, and later encode() it into some other encoding.

    binmode Encode

    d) If you don't know what the encoding of read data is, that's problematic. There isn't a definitive way to sort them out.
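Point (c) can be sketched with I/O layers declared at open() time, so perl decodes on read and encodes on write automatically. The filenames and the encoding pair below are hypothetical:

```perl
use strict;
use warnings;

# Declare the encoding on each filehandle as an I/O layer.
open my $in,  '<:encoding(UTF-8)',      'input.txt'  or die "open input.txt: $!";
open my $out, '>:encoding(ISO-8859-1)', 'output.txt' or die "open output.txt: $!";

while (my $line = <$in>) {   # $line arrives already decoded to characters
    print {$out} $line;      # re-encoded to ISO-8859-1 on the way out
}

close $in  or die "close: $!";
close $out or die "close: $!";
```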

Re: A definitive way to handle encoding/decoding problems?
by Anonymous Monk on Apr 11, 2012 at 15:02 UTC
Re: A definitive way to handle encoding/decoding problems?
by JavaFan (Canon) on Apr 11, 2012 at 15:26 UTC
    Let's say that the preferred "destination data", regardless of its original encoding, is supposed to be ISO-8859-1
    Uhm, and how should that work? ISO-8859-1 has just 256 code points, including the control characters, and each of those code points has a specific meaning. There are many encodings that encode data sets with more than 256 elements. What you want is, in general, impossible.

    Of course, it's possible to use an encoding which uses nothing but characters from ISO-8859-1, or even printable ASCII. HTML numeric entities are one such encoding, as are Perl's \x{} escapes. You'd still need a mapping from numbers to meanings somewhere (perhaps you could borrow the Unicode mapping, but there are also encodings for scripts whose characters are not defined in Unicode -- Klingon, for instance).
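The HTML-numeric-entity workaround described above is in fact built into Encode as the FB_HTMLCREF fallback: characters that fit the target encoding are converted normally, and the rest become entities. A small sketch with made-up input:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{E9} \x{263A}";   # é fits in ISO-8859-1, the smiley does not

# FB_HTMLCREF: unmappable characters become HTML numeric entities.
my $bytes = encode('ISO-8859-1', $text, Encode::FB_HTMLCREF);
# é becomes the single byte 0xE9; the smiley becomes the entity "&#9786;"
```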

Re: A definitive way to handle encoding/decoding problems?
by i5513 (Pilgrim) on Apr 11, 2012 at 15:39 UTC
    Hello,
    2. Let's say that the preferred "destination data", regardless of its original encoding, is supposed to be ISO-8859-1. How can I accomplish this? Automatically or semi-automatically?
    Take a look at Encode::Guess
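A sketch of how Encode::Guess might be used here. Note that guessing is heuristic: guess() returns an encoding object on success but a plain error string when the input is ambiguous among the suspects, so both cases need handling. The byte string below is a made-up example:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;   # default suspects: ascii, utf8, BOM-marked UTF-16/32

my $bytes = "caf\xC3\xA9";   # origin unknown (these bytes happen to be UTF-8)
my $enc   = Encode::Guess->guess($bytes);

if (ref $enc) {              # success: an encoding object was returned
    my $text   = $enc->decode($bytes);
    my $latin1 = encode('ISO-8859-1', $text);
} else {                     # failure or ambiguity: an error string instead
    warn "could not guess the encoding: $enc\n";
}
```

Adding many or overlapping suspects (e.g. ISO-8859-1, in which every byte value is legal) tends to make the guess ambiguous, so Encode::Guess works best with a short, distinct suspect list.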