The code you posted here doesn't make it clear where "guess_encoding" comes from. Are you using Encode::Guess? If so, are you actually using the code as posted? If so, maybe that's part of the problem, because you're supposed to pass a list of encodings that are "likely suspects" for the data you want to guess about. (Maybe you do that in the part of the code you're not showing us, e.g. as a list on the "use Encode::Guess" line?)

You should review the manual for Encode::Guess more carefully, and see if you are trying to get something out of it that it can't really provide. Anyway, I think this is how you should be assigning the decoded value to $normalized:

  my $encoding = guess_encoding( $response->content );
  my $normalized = decode( $encoding, $response->content );

Note that "decode" returns a Perl character string (held internally as utf8), so you don't need to "re-encode" it as utf8. (updated to fix grammar)

<update> Actually, looking at the current man page for Encode::Guess on CPAN, it looks like you should be doing this:

  my $decoder = guess_encoding( $response->content );
  die $decoder unless ( ref $decoder );
  my $normalized = $decoder->decode( $response->content );

In other words, the guessing method is supposed to return an object that supplies the (hopefully) appropriate decoding method, and you just pass your data to that method. </update>
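To make the "likely suspects" point concrete, here is a minimal sketch of the same pattern with a suspect list passed to guess_encoding. The sample octets and the choice of euc-jp as the only suspect are my assumptions for illustration; they are not taken from your code, and your real data may need a different list.

```perl
use strict;
use warnings;
use Encode::Guess;

# Assumed sample data: the three characters of "Nihongo" as EUC-JP octets.
my $octets = "\xC6\xFC\xCB\xDC\xB8\xEC";

# Pass the suspects explicitly; without them only ascii, utf8 and
# BOM-marked UTF-16/32 are tried.
my $decoder = guess_encoding( $octets, qw/euc-jp/ );

# On failure guess_encoding returns an error string, not an object.
die "Can't guess: $decoder" unless ref $decoder;

my $normalized = $decoder->decode($octets);
printf "guessed %s, got %d characters\n", $decoder->name, length $normalized;
```

If several suspects match the same bytes, the guess fails with an error string instead, so keeping the list short tends to help.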

If that doesn't help, I'm not sure what else to suggest, except to break the process down into steps. Store the $response->content to a file and inspect it manually. See what guess_encoding is returning for that content; maybe it's not guessing correctly. Pass the "FB_CROAK" flag as a third parameter in the "decode()" call (and wrap the call in an eval to catch it if it dies), to see whether there are any errors when converting the content to utf8, even when you know the "true" encoding of the source.

(What? You don't think there would be encoding errors in the original character data? Don't be so sure.)
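Here is a rough sketch of the FB_CROAK check described above. The helper name try_decode and the sample byte strings are made up for illustration; in your case the input would be $response->content and whatever encoding you believe is the true one.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: strict decode that reports failure instead of dying.
sub try_decode {
    my ( $encoding, $octets ) = @_;
    my $copy = $octets;    # decode() may modify its argument when a CHECK flag is set
    my $text = eval { decode( $encoding, $copy, FB_CROAK ) };
    return defined $text ? ( $text, undef ) : ( undef, $@ );
}

# Well-formed UTF-8: decodes cleanly.
my ( $ok_text, $ok_err ) = try_decode( 'UTF-8', "caf\xC3\xA9 au lait" );

# A stray Latin-1 byte in data claimed to be UTF-8: decode croaks, eval catches it.
my ( $bad_text, $bad_err ) = try_decode( 'UTF-8', "caf\xE9 au lait" );

print defined $ok_text  ? "first string decoded\n"      : "first string failed: $ok_err";
print defined $bad_text ? "second string decoded\n"     : "second string failed: $bad_err";
```

The error text in $@ names the offending byte, which is usually enough to tell whether the guess was wrong or the source data itself is damaged.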

In reply to Re: Matching UTF8 Regexps by graff
in thread Matching UTF8 Regexps by lestrrat
