
So then here's your next problem:

if I try to insert it into a latin1 mySQL database

Any MySQL from at least the last decade supports UTF-8. Assuming you are connecting through DBI, the connection options should be at least:

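connect($dsn, $user, $pass, {
  RaiseError => 1,                       # die on database errors
  AutoCommit => 1,
  mysql_enable_utf8 => 1,                # DBD::mysql: exchange strings as UTF-8
  on_connect_call => 'set_strict_mode',  # DBIx::Class storage option; omit with plain DBI
})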

If this isn't an option and you really need MySQL to stay latin1, then you need to encode the incoming data as latin1 before you insert it.

use Encode qw(encode);
$_ = encode('iso-8859-1', $_, 1) for values %$add;

Note that this dies if the string contains any character outside latin1. That is what the third argument (1, i.e. Encode::FB_CROAK) asks for.
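If dying is not what you want, one option is to trap the failure and decide per value what to do. This is only a sketch: `to_latin1` is a hypothetical helper name, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode to latin1, returning undef instead of dying
# when the text contains a character that latin1 cannot represent.
sub to_latin1 {
    my ($text) = @_;
    my $copy = $text;    # encode() with a CHECK value may modify its input
    return eval { encode('iso-8859-1', $copy, Encode::FB_CROAK) };
}

my $ok  = to_latin1("M\x{E4}dchen");   # a-umlaut exists in latin1
my $bad = to_latin1("\x{20AC}100");    # EURO SIGN does not

print defined $ok  ? "first value encoded\n"  : "first value failed\n";
print defined $bad ? "second value encoded\n" : "second value is not representable in latin1\n";
```

Whether to skip, substitute, or reject such values depends on what the database is supposed to hold.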

Any time you mess with encodings, you need to keep track of which Perl scalars are holding "Unicode strings" and which are holding "byte sequences". It would be nice if Perl tracked this for you, but it does not. (Perl internally tracks whether a scalar contains wide characters, but that is not the same as tracking the logical intent of the string.) Setting an encoding on a filehandle means you are receiving "Unicode strings" from it. Before you pass those strings to an API that isn't Unicode-aware, you need to choose what bytes they should become. You can manually convert to the bytes of your choice using the Encode module, or upgrade your DBI connection to be Unicode-aware.
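To make the distinction concrete, here is a small sketch using only the core Encode module. The sample word is my own; the point is that the same text has different lengths depending on whether the scalar holds characters or encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a raw (undecoded) UTF-8 file.
my $utf8_bytes = "M\xC3\xA4dchen";

# Decoding turns the byte sequence into a string of characters.
my $text = decode('UTF-8', $utf8_bytes);

# Encoding turns the characters back into bytes, here in latin1.
my $latin1 = encode('iso-8859-1', $text);

printf "utf8 bytes: %d, characters: %d, latin1 bytes: %d\n",
    length($utf8_bytes), length($text), length($latin1);
# prints "utf8 bytes: 8, characters: 7, latin1 bytes: 7"
```

Nothing in Perl stops you from mixing `$utf8_bytes` and `$text` in one expression, which is why you have to keep track of it yourself.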

Edit: re-reading your initial post, I realize this could use more clarification. You were previously loading the UTF-8 file and converting it to latin1, then comparing it with a regex of bytes which were latin1 (because your source file was ASCII with the unofficial default of latin1 for upper bytes), which worked, then sending it to the database. On the new system the DB complained about \xE4 even though nothing in your script changed, right? I don't think Perl changed behavior, because it is highly backward compatible, but it might be that the database driver has new defaults and in fact already expects UTF-8, and a lone \xE4 is not a valid UTF-8 sequence.

Regardless of what actually changed and broke things, the advice we are giving you here is to do all your processing as Unicode. Use Unicode for the script itself (which makes the regex Unicode too), decode all incoming data before processing it, then encode all data before emitting it. Ideally, configure the MySQL driver to do that encoding for you so you don't have to worry about it.
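Roughly, that flow could look like the sketch below. It assumes the script is saved as UTF-8, and it uses in-memory filehandles with made-up data as stand-ins for your real input file and for whatever consumes the output, so treat the names and data as placeholders.

```perl
use strict;
use warnings;
use utf8;    # the script itself is UTF-8, so the literal ä below is one character

# Stand-in for the input file: raw UTF-8 bytes held in memory (assumed data).
my $input_bytes = "M\xC3\xA4dchen\nHaus\n";

# Decode on the way in: the :encoding layer hands us character strings.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open: $!";
my @matches = grep { /ä/ } <$in>;    # character regex against character data
close $in;
chomp @matches;

# Encode on the way out: here to latin1 bytes, again into memory.
my $out_bytes = '';
open my $out, '>:encoding(iso-8859-1)', \$out_bytes or die "open: $!";
print {$out} "$_\n" for @matches;
close $out;

printf "%d matching line(s), %d output byte(s)\n",
    scalar(@matches), length($out_bytes);
```

With a Unicode-aware database connection, the final encoding step is the part the driver would take over.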

In reply to Re^2: Perl encoding problem by NERDVANA
in thread Perl encoding problem by derion
