Re: Matching UTF8 Regexps

The code you posted here doesn't make it clear where "guess_encoding" comes from. Are you using Encode::Guess? If so, are you actually using the code as posted? If so, maybe that's part of the problem, because you're supposed to pass a list of encodings that are "likely suspects" for the data you want to guess about. (Maybe you do that in the part of the code you're not showing us, e.g. as a list on the "use Encode::Guess" line?)

You should review the manual for Encode::Guess more carefully, and see if you are trying to get something out of it that it can't really provide. Anyway, I think this is how you should be assigning the decoded value to $normalized:

  my $encoding = guess_encoding( $response->content );
  my $normalized = decode( $encoding, $response->content );

Note that "decode" returns utf8 data -- you don't need to "re-encode" it as utf8. (updated to fix grammar)

<update> Actually, looking at the current man page for Encode::Guess on CPAN, it looks like you should be doing this:

  my $decoder = guess_encoding( $response->content );
  die $decoder unless ( ref $decoder );
  my $normalized = $decoder->decode( $response->content );

In other words, the guessing method is supposed to return an object that supplies the (hopefully) appropriate decoding method, and you just pass your data to that method. </update>
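
For what it's worth, here is a minimal sketch of that pattern with an explicit suspect list. The sample octets and the single 'euc-jp' suspect are assumptions chosen for illustration, not taken from the code under discussion; with real pages you would list whichever encodings you consider likely.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding

# Assumed sample data: three hiragana characters serialized as EUC-JP octets.
my $octets = encode( 'euc-jp', "\x{3042}\x{3044}\x{3046}" );

# Pass the likely suspects after the data; ascii and utf8 are checked by default.
my $decoder = guess_encoding( $octets, 'euc-jp' );

# On failure (no candidate, or more than one) the return value is a plain
# string describing the problem, not an object.
die "Cannot guess: $decoder" unless ref $decoder;

my $normalized = $decoder->decode($octets);
printf "guessed %s, %d characters\n", $decoder->name, length $normalized;
```

Listing several closely related suspects together (say euc-jp and shiftjis) can leave the guess ambiguous, in which case you get the error string rather than a decoder, so the "ref" check is worth keeping.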

If that doesn't help, I'm not sure what else to suggest. Maybe try breaking the process down into steps: store the $response->content to a file and inspect that manually; see what guess_encoding is returning for the chosen content (maybe it's not guessing correctly); and use the "FB_CROAK" flag as a third parameter in the "decode()" call, wrapping the call in an eval to catch it if it dies, to see if there are any errors when trying to convert the content to utf8, even when you know the "true" encoding of the source.
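
The FB_CROAK step could look roughly like this. It is only a sketch: "try_decode" is a hypothetical helper name, and the two sample byte strings are made up to show one clean input and one malformed one.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns ($text, undef) on success, (undef, $error) on failure.
sub try_decode {
    my ( $encoding, $octets ) = @_;
    my $copy = $octets;    # decode() may modify its input when a CHECK flag is given
    my $text = eval { decode( $encoding, $copy, Encode::FB_CROAK() ) };
    return ( undef, $@ ) unless defined $text;
    return ( $text, undef );
}

# Well-formed UTF-8: "caf" followed by the two-byte sequence for e-acute.
my ( $ok_text, $ok_err ) = try_decode( 'UTF-8', "caf\xC3\xA9 au lait" );

# Malformed as UTF-8: a lone Latin-1 byte 0xE9 in the middle of the string.
my ( $bad_text, $bad_err ) = try_decode( 'UTF-8', "caf\xE9 au lait" );

print defined $ok_err  ? "first input failed: $ok_err"   : "first input decoded cleanly\n";
print defined $bad_err ? "second input failed: $bad_err" : "second input decoded cleanly\n";
```

Running the fetched content through something like this with the encoding you believe is correct will tell you whether the data itself is clean before any regex gets involved.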

(What? You don't think there would be encoding errors in the original character data? Don't be so sure.)

Re^2: Matching UTF8 Regexps by lestrrat (Deacon) on Mar 08, 2005 at 04:06 UTC
Thanks, but the guessing method is not pertinent in this case, as I was testing it with a fixed set of charsets and the same content before posting. :(

I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference. I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not. It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding.

<ASIDE>I only use Encode::Guess as a final resort because more often than not it tends... to utterly fail to work. It's understandable, though, because there are many who decide to gratuitously mix euc-jp and shift-jis in the same page, for example. Ugh. Anyhow, guessing is done using about 20 heuristic steps at this point. If all else fails, we try to decode using Encode::Guess just to see if we can do it, but we don't really rely on it.</ASIDE>
Re^3: Matching UTF8 Regexps by graff (Chancellor) on Mar 08, 2005 at 04:31 UTC
I'm sorry, but you don't seem to be making any sense.

"I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference."

You have it backwards: the parameters given to encode() are a character set spec and a utf8 string; in the typical case, the character set spec is non-unicode, but in any case, the function returns a scalar that contains octets (raw bytes), not "characters" in the perl-internal/utf8 sense, and so the scalar value being returned has the utf8 flag off. It's the "decode" function that takes a character-encoding name and a scalar of octets, and returns a scalar string of utf8 characters, with the utf8 flag on.

"I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not."

Well, ~~maybe~~ the status of the utf8 flag on the regex ~~could be irrelevant, but~~ is important (see later reply), and of course if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.

"It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding."

The reason this makes no sense is probably due to your misunderstanding about the roles of encode() and decode() with respect to clearing/setting the utf8 flag. Assuming that you are, in fact, successfully converting (decoding) both the regex and the fetched web-page contents to utf8, then sure, things should work fine, and there should be no mystery about that. (updated to fix grammar)
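
To make the flag behavior concrete, here is a small sketch. The sample bytes and variable names are assumptions for illustration only; the point is just which direction sets the utf8 flag, and what happens when a character pattern meets raw octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: two kanji (U+65E5, U+672C) as six UTF-8 octets.
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC";

my $chars = decode( 'UTF-8', $octets );    # octets in, character string out
my $back  = encode( 'UTF-8', $chars );     # characters in, octets out

printf "decoded: flag=%d length=%d\n", ( utf8::is_utf8($chars) ? 1 : 0 ), length $chars;
printf "encoded: flag=%d length=%d\n", ( utf8::is_utf8($back)  ? 1 : 0 ), length $back;

# A pattern holding the second kanji as a decoded character.
my $pattern = decode( 'UTF-8', "\xE6\x9C\xAC" );

my $match_chars  = ( $chars  =~ /$pattern/ ) ? 1 : 0;    # characters vs characters
my $match_octets = ( $octets =~ /$pattern/ ) ? 1 : 0;    # characters vs raw octets
print "pattern vs decoded: $match_chars, pattern vs octets: $match_octets\n";
```

The decoded string is two characters long with the flag on, while the encoded one is six octets with the flag off, and the character pattern only finds its target in the decoded string.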
Re^4: Matching UTF8 Regexps by lestrrat (Deacon) on Mar 08, 2005 at 07:34 UTC
"Well, maybe the status of the utf8 flag on the regex could be irrelevant, but if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended."

I don't get it. I thought I said I normalize both to UTF8..? Okay, so I probably wasn't clear. If the following doesn't make clear why I'm confused, then I will just have to admit that I don't know what I'm doing at all...

  # assume all encode/decode works correctly
  my $regexp_raw = "....";
  my $regexp_utf8_decoded = decode($some_enc, $regexp_raw);
  my $regexp_utf8_encoded = encode('utf8', $regexp_utf8_decoded);

  my $some_content = "...";
  my $some_content_decoded = decode($some_enc, $some_content);
  my $some_content_encoded = encode('utf8', $some_content_decoded);

  $some_content_decoded =~ /$regexp_utf8_decoded/; # matches correctly
  $some_content_decoded =~ /$regexp_utf8_encoded/; # matches correctly
  $some_content_encoded =~ /$regexp_utf8_decoded/; # no
  $some_content_encoded =~ /$regexp_utf8_encoded/; # no

So why doesn't $some_content_encoded match? I thought all strings are normalized to UTF8...
Re^5: Matching UTF8 Regexps by graff (Chancellor) on Mar 08, 2005 at 14:31 UTC