I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference.
You have it backwards: the parameters given to encode() are a character set spec and a utf8 string; in the typical case, the character set spec is non-unicode, but in any case, the function returns a scalar that contains octets (raw bytes), not "characters" in the perl-internal/utf8 sense, and so the scalar value being returned has the utf8 flag off.
It's the "decode" function that takes a character-encoding name and a scalar of octets, and returns a scalar string of utf8 characters, with the utf8 flag on.
I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not.
Well, maybe the status of the utf8 flag on the regex could be irrelevant, but is important (see later reply), and of course if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.
It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding.
The reason this makes no sense is probably due to your misunderstanding about the roles of encode() and decode() with respect to clearing/setting the utf8 flag. Assuming that you are, in fact, successfully converting (decoding) both the regex and the fetched web-page contents to utf8, then sure, things should work fine, and there should be no mystery about that. (updated to fix grammar)
In reply to Re^3: Matching UTF8 Regexps
by graff
in thread Matching UTF8 Regexps
by lestrrat
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |