When working with the "Latin" encodings it is easy to confuse things. The very common character encodings latin1, latin9, and CP-1252 have a lot in common, but also a few significant differences (such as the € symbol and some French characters), as the following example shows. Also, all three share their lower 128 bytes with both ASCII and UTF-8, and since your files have contained only ASCII so far, there is further potential for confusion there. Finally, one shouldn't confuse the common name of the character encoding "Latin-1" (ISO/IEC 8859-1) with the Unicode Latin-1 Supplement block.
use warnings;
use strict;
use Encode qw/encode/;
use open qw/:std :utf8/;
use charnames ':full';

my @encodings = qw/ latin1 latin9 cp1252 UTF-8 /;
my @strs = (
    "\N{LATIN SMALL LETTER E WITH ACUTE}",
    "\N{LATIN SMALL LIGATURE OE}",
    "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}",
    "Y\N{COMBINING DIAERESIS}",    # Unicode only
    "e\N{COMBINING ACUTE ACCENT}", # Unicode only
);
for my $str (@strs) {
    print "--- \"$str\" ---\n";
    for my $encoding (@encodings) {
        my $bytes = eval { encode($encoding, "$str", Encode::FB_CROAK) };
        my $out = !$bytes ? "N/A"
            : join ' ', map { sprintf "%02X", ord } split //, $bytes;
        print "$encoding: $out\n";
    }
}
Output:
--- "é" ---
latin1: E9
latin9: E9
cp1252: E9
UTF-8: C3 A9
--- "œ" ---
latin1: N/A
latin9: BD
cp1252: 9C
UTF-8: C5 93
--- "Ÿ" ---
latin1: N/A
latin9: BE
cp1252: 9F
UTF-8: C5 B8
--- "Ÿ" ---
latin1: N/A
latin9: N/A
cp1252: N/A
UTF-8: 59 CC 88
--- "é" ---
latin1: N/A
latin9: N/A
cp1252: N/A
UTF-8: 65 CC 81
So my first piece of advice is to be certain what your files are encoded with. If you're using a text editor, keep an eye on which encoding it uses, since it's easy to open a latin1 file, choose "Save As", and have the editor default to a different character encoding like UTF-8 or CP-1252 (sometimes labeled just "ANSI"). In addition, because of the similarities in the character sets, the editor can easily misidentify which encoding the file had in the first place!
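On a Unix-like system you can at least sanity-check a file's encoding from the command line. This is a sketch: file(1) can only guess the charset from byte patterns (and on BSD/macOS the MIME flag is -I rather than -i), while iconv can verify that the bytes are valid in a given encoding, since it exits non-zero on invalid sequences.

```shell
# Create a small test file containing the single latin1 byte 0xE9 ("é").
printf 'caf\xe9\n' > /tmp/latin1_sample.txt

# file(1) guesses the charset from the byte patterns -- a guess, not a fact.
# (On Linux: file -i; on BSD/macOS the equivalent flag is -I.)
file -i /tmp/latin1_sample.txt

# iconv exits non-zero on invalid byte sequences, so it can check validity:
# any byte sequence is valid latin1, but 0xE9 followed by 0x0A is not valid UTF-8.
iconv -f ISO-8859-1 -t UTF-8 /tmp/latin1_sample.txt >/dev/null \
    && echo "bytes are valid latin1"
iconv -f UTF-8 -t UTF-8 /tmp/latin1_sample.txt >/dev/null 2>&1 \
    || echo "bytes are NOT valid UTF-8"
```

Note that a pure-ASCII file is valid in all of these encodings at once, which is exactly why the tools can't tell you what the author intended.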
IMO the critical part of working with different encodings is the decoding of bytes to characters and the encoding of characters to bytes - that is, when reading and writing files, when displaying things on the terminal, or, if you're working with a website and/or database with questionable Unicode support, at those interfaces. If you get that part right and get your character data properly decoded into a Perl string, then you've won a major battle. So despite what hippo wrote about Latin-1 being the default, personally I would still suggest you explicitly specify the encoding when opening the files, i.e. open my $fh, '<:encoding(latin1)', $filename or die $!;. (Note that even though I used the Encode module in the code above for demonstration purposes, if you're just reading files and they are properly encoded, you should never have to touch it, and if you do you might be doing something wrong.)
Once you've got your character data correctly into a Perl string, there is less to worry about - Perl mostly keeps its internal encoding transparent and does its best to let you treat the string as a sequence of Unicode characters (codepoints). Perl's Unicode handling, including in regexes, is very good. I would recommend using a recent version of Perl though, since there have been continuous improvements to its Unicode handling (example). (See also the Perl Unicode Tutorial.)
This is also the answer to your two questions: If you open the file with the right encoding, then no other changes should be necessary to your code. One exception might be if you have used explicit character ranges like [a-zA-Z0-9_] instead of \w - the latter should automatically work with Unicode.
If you want to write Unicode characters directly in the Perl source code, use utf8; and in your editor save the file with the UTF-8 encoding (just stay away from the functions provided in the utf8 module unless you really know what you are doing). Although personally, I tend to write my Perl source in ASCII and use the \x{....} and \N{...} sequences (for the latter see charnames). If you want to print Unicode strings to your terminal and it supports UTF-8, you can use use open qw/:std :utf8/;, although be careful with that pragma because it changes the default encoding for opening files (which is another reason for my above suggestion for always being explicit about specifying the encoding).
In reply to Re: Parsing a Latin-1 Charset Data File
by haukex
in thread Parsing a Latin-1 Charset Data File
by sumeetgrover