comment on

I was sort of expecting this to come as a follow-up to your previous article but didn't want to overcomplicate things :)

One of the issues with encoding is that it happens in so many places that quite often things look right while you actually have a cancellation of errors. Your example is no exception. So here are some points:

Perl's default text encoding is ISO-8859-1 which is a 1-byte encoding.
Contemporary terminals work with UTF-8 encoding. This also includes the terminal you are using in your debugging session.
The infamous UTF-8 flag and the is_utf8 function come with a warning:
CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is UTF-8 encoded and vice-versa.
In many cases the function tells you what you already know (that the data doesn't look like you expect them), in some cases it is just misleading.

So, what's happening here? You read data with LWP. Though you haven't given the details, I guess you are using the content method to retrieve the data. This method always gives bytes. But wait: LWP can use the charset attribute from the Content-Type header to decode text into characters, and indeed it will do so if you use the method decoded_content method instead.

The data is displayed correctly because you are feeding non-decoded bytes to a terminal which expects UTF-8-encoded bytes. Since your input was also UTF-8-encoded bytes, it looks fine. Your application is just a man in the middle which passes these data through.

Decoding the data is the correct way (which, as I wrote, LWP can do for you if you want). Perl then knows that the character in question is a 'ü'. Perl can handle this character in its default encoding, which is slightly infortunate, because it will do so and print one Byte for that character. This character hits a terminal which expects UTF-8 encoded bytes, doesn't understand the character and substitutes it with the Unicode replacement character.

Now when you write the data, you need to encode it to UTF-8. I suppose (but didn't test right now) that MIME::Lite::TT::HTML does the right thing and encodes for you if you provide the Charset attribute on the constructor. =FC is QP-encoding for an ISO-8859-1 'ü' and indeed wrong here. So if you did provide Charset => 'utf8', then shout up, I'll write some tests.

As for handling the debugger: Since you are working with an UTF-8 terminal, you might want to try the following:

binmode DB::OUT,':utf8'; binmode DB::IN,':utf8'

This makes the debugger handle its I/O as UTF-8 encoded.

....and, because I just read the reply by LanX, I recommend against Data::Peek. It will tell you only what you already know ("that's not right") but not give guidance how to fix.

In reply to Re: Lost in encodings by haj
in thread Lost in encodings by Skeeve

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.