Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The character U+00E9 LATIN SMALL LETTER E WITH ACUTE (é) is encoded in Latin-1, Latin-9, and CP-1252 as the single byte \xE9 (\351), but when encoded with UTF-8, it's the two-byte sequence \xC3\xA9 (\303\251).
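To see those byte values for yourself, here is a minimal sketch using the core Encode module. The helper name show_bytes is just something I made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 LATIN SMALL LETTER E WITH ACUTE as a Perl character string
my $char = "\N{U+00E9}";

# Hypothetical helper: encode a string and render the resulting bytes
# in hex and octal escape notation.
sub show_bytes {
    my ($encoding, $string) = @_;
    my @bytes = split //, encode($encoding, $string);
    my $hex = join '', map { sprintf('\\x%02X', ord($_)) } @bytes;
    my $oct = join '', map { sprintf('\\%03o',  ord($_)) } @bytes;
    return "$hex ($oct)";
}

# Latin-1, Latin-9 and CP-1252 give one byte, UTF-8 gives two.
for my $encoding ('ISO-8859-1', 'ISO-8859-15', 'cp1252', 'UTF-8') {
    printf "%-12s %s\n", $encoding, show_bytes($encoding, $char);
}
```

The three single-byte encodings should all print \xE9 (\351), and UTF-8 should print \xC3\xA9 (\303\251).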

In other words, some of your files are encoded with one of the single-byte encodings and others with UTF-8, so you'll have to specify the correct encoding when opening each one, e.g. open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename: $!"; (see "open" Best Practices). That way the bytes are decoded as they are read, and your Perl strings always contain the correct characters (e.g. "\N{U+00E9}").
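As a rough sketch of what that looks like in practice, the following writes the same word to two temporary files, once as Latin-1 bytes and once as UTF-8 bytes, and reads each back with the matching layer. The helper slurp_decoded and the sample data are my own inventions for the demo, not anything from your code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file, decoding it with the given encoding.
sub slurp_decoded {
    my ($filename, $encoding) = @_;
    open my $fh, "<:raw:encoding($encoding)", $filename
        or die "$filename: $!";
    local $/;                 # slurp mode
    my $text = <$fh>;
    close $fh;
    return $text;
}

# The same word ("caf" followed by e-acute) as raw bytes in two encodings.
my %bytes_for = (
    'ISO-8859-1' => "caf\xE9",
    'UTF-8'      => "caf\xC3\xA9",
);

for my $encoding (sort keys %bytes_for) {
    my ($out, $filename) = tempfile(UNLINK => 1);
    binmode $out;             # write the bytes exactly as given
    print {$out} $bytes_for{$encoding};
    close $out or die "$filename: $!";

    my $text = slurp_decoded($filename, $encoding);
    printf "%-10s -> %d characters, last one is U+%04X\n",
        $encoding, length($text), ord(substr($text, -1));
}
```

Both files should come back as the same four-character string ending in U+00E9, even though the bytes on disk differ.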

If you don't know the encoding of the input files, you could use a module like Encode::Guess, or I've written a tool that tries to be a little smarter: enctool - it allows you to narrow down the guesses by specifying what characters are expected to appear in the input file using e.g. the --one-of='\xE9' option. Some files, like HTML and XML, will often include a definition of the character set in their source, and (except for the cases where that declaration is incorrect) the appropriate parser modules (e.g. XML::LibXML) should honor that encoding.
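If the only candidates are UTF-8 and one single-byte encoding, a very simple heuristic is to try a strict UTF-8 decode first and fall back to the single-byte encoding if that fails. To be clear, this is not how Encode::Guess or enctool work, just a minimal sketch of the idea; the function name decode_guessing and the CP-1252 default fallback are assumptions on my part.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (decoded text, name of the encoding used).
# Assumes the input is either valid UTF-8 or in the single-byte $fallback.
sub decode_guessing {
    my ($bytes, $fallback) = @_;
    $fallback = 'cp1252' unless defined $fallback;

    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC keeps $bytes unmodified so it can be reused below.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;

    return (decode($fallback, $bytes), $fallback);
}

for my $bytes ("caf\xC3\xA9", "caf\xE9") {
    my ($text, $used) = decode_guessing($bytes);
    printf "%d bytes decoded as %-6s -> %d characters\n",
        length($bytes), $used, length($text);
}
```

Keep in mind that plain ASCII input is reported as UTF-8 here (which is harmless), and that a single-byte file whose high bytes happen to form valid UTF-8 sequences would be misdetected, so checking for expected characters is still a good idea.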

As an aside, if you're putting Unicode characters in your Perl source, you should save it as UTF-8 and add use utf8; at the top of the file. If you're writing Unicode characters to the console, add use open qw/:std :utf8/;. And of course always Use strict and warnings, and a recent version of Perl is strongly recommended when working with Unicode.
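Put together, the top of such a script might look like the sketch below. It assumes the file really is saved as UTF-8 and that your terminal expects UTF-8; the variable name and sample word are only for illustration.

```perl
use strict;
use warnings;
use utf8;                   # the source code itself is UTF-8
use open qw/:std :utf8/;    # STDIN/STDOUT/STDERR use UTF-8

my $word = "café";          # literal non-ASCII character in the source

# With "use utf8" this is 4 characters; without it, Perl would see
# the two UTF-8 bytes of the é separately and report 5.
print "$word has ", length($word), " characters\n";
```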

If you have further issues with encodings when reading files, please see the tips for posting questions in this node.

By the way, why are you looking for "é" characters in the first place? Maybe there's a more efficient way to do what you're doing with your regex, if you tell us what the task is.

In reply to Re: Two octal values for eacute? by haukex
in thread Two octal values for eacute? by pianomonious
