I've had this sort of trouble with the HTML parsing modules too. For reasons that probably make sense for some applications, the parsing strategy seems to be based on pulling fixed-size chunks of bytes from whatever input it is given (even a scalar string, which seems odd). It then does some operations on those chunks where perl 5.8's flexibility ("byte semantics" vs. "character semantics") goes awry, and it ends up trying to do utf8 operations on a character whose final byte or two fell on the wrong side of a chunk boundary.
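To make that failure mode concrete, here is a minimal sketch using only the core Encode module. It is not the parser's actual code; the 4-byte chunk size and the sample string are assumptions picked so that the boundary lands in the middle of a two-byte character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "cafe" with an e-acute, as UTF-8 octets: the accented letter takes
# two bytes (0xC3 0xA9), so the whole string is 5 bytes long.
my $octets = encode('UTF-8', "caf\x{e9}");

# Simulate a fixed-size chunked read. The chunk size of 4 is an
# arbitrary choice that puts the boundary between 0xC3 and 0xA9.
my @chunks = unpack('(a4)*', $octets);

# Treating each chunk as utf8 on its own mangles the split character,
# because neither half is a valid sequence by itself.
my $per_chunk = join '', map { decode('UTF-8', $_) } @chunks;

# Decoding only after the chunks are back together recovers the text.
my $whole = decode('UTF-8', join('', @chunks));

printf "per-chunk decode %s the whole-string decode\n",
    $per_chunk eq $whole ? "matches" : "differs from";
```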

If you pass the parser a file handle as input, do not open that file handle with ":utf8" or any other PerlIO encoding layer that would cause the data to be converted to utf8 on input. If you pass it a scalar string, make sure that the string does not have the utf8 flag set. (See the Encode man page about the utf8 flag.)
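Something along these lines should get you raw octets either way. This is only a sketch: octets_for_parser() is a hypothetical helper of my own, not part of any HTML module, and it assumes that an unflagged string already holds UTF-8 octets. An in-memory handle stands in for a real file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return UTF-8 octets with the utf8 flag off.
# Assumes a string without the flag is already UTF-8 encoded bytes.
sub octets_for_parser {
    my ($html) = @_;
    return utf8::is_utf8($html) ? encode('UTF-8', $html) : $html;
}

# A character string; the wide character U+263A forces the utf8 flag on.
my $flagged = "caf\x{e9} \x{263A}";
my $octets  = octets_for_parser($flagged);

# For file input, open with :raw so that no decoding layer is applied.
# (The in-memory handle here is just a stand-in for a real file.)
open my $fh, '<:raw', \$octets or die "open failed: $!";
my $from_handle = do { local $/; <$fh> };
close $fh;

printf "flag on string: %s, flag on handle data: %s\n",
    (utf8::is_utf8($octets)      ? "on" : "off"),
    (utf8::is_utf8($from_handle) ? "on" : "off");

# Either form can then be handed to the parser, for example
# $tree->parse($octets) or $tree->parse_file($fh) with HTML::TreeBuilder.
```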

After the parsing is done, use Encode::decode() on the various pieces of text content if you need to treat them as utf8 characters rather than bytes (length, regex matching, case changes and so on).
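For the decoding step, a minimal sketch might look like the following. The $raw_text value is a made-up stand-in for a piece of text the parser handed back as octets (in real code it might come from something like $element->as_text), so treat the variable names as assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for text content returned by the parser as raw UTF-8 octets:
# "naive" with an i-diaeresis, where that letter is 0xC3 0xAF.
my $raw_text = "na\xC3\xAFve";

# Convert the octets into a perl character string.
my $text = decode('UTF-8', $raw_text);

# Byte semantics before decoding, character semantics after.
printf "%d bytes -> %d characters\n", length($raw_text), length($text);
```

Remember to encode again (or put an encoding layer on the output handle) when you print the decoded text.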

I presume this is a difficult design issue for the HTML parsing modules -- or maybe it's something that would be fairly easy to fix -- but it certainly is a problem.

In reply to Re: HTML::Tree problems with UTF-8 Content. by graff
in thread HTML::Tree problems with UTF-8 Content. by Cody Pendant
