I'm sorry, but you don't seem to be making any sense.
I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference.

You have it backwards. The parameters given to encode() are a character-set spec and a utf8 string. In the typical case the character-set spec is non-unicode, but in any case the function returns a scalar that contains octets (raw bytes), not "characters" in the perl-internal/utf8 sense, so the scalar value being returned has the utf8 flag off.

It's the "decode" function that takes a character-encoding name and a scalar of octets, and returns a scalar string of utf8 characters, with the utf8 flag on.
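To make the direction of each function concrete, here is a minimal sketch using the core Encode module and utf8::is_utf8(). The sample string and the variable names are made up for illustration, and UTF-8 is assumed as the encoding on the octet side; any other encoding name would show the same flag behavior.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: a character string holding U+00E9 and U+3042.
# \x{} escapes keep the example independent of the source file's encoding.
my $chars = "caf\x{e9} \x{3042}";

# encode(): characters in, octets out. The returned scalar has the utf8 flag off.
my $octets = encode('UTF-8', $chars);

# decode(): octets in, characters out. For this non-ASCII data the
# returned scalar has the utf8 flag on.
my $decoded = decode('UTF-8', $octets);

printf "encode(): flag %s, length %d\n",
    (utf8::is_utf8($octets)  ? 'on' : 'off'), length $octets;   # off, 9 octets
printf "decode(): flag %s, length %d\n",
    (utf8::is_utf8($decoded) ? 'on' : 'off'), length $decoded;  # on, 6 characters
```

The length difference is the practical consequence: the same data counts as 9 octets after encode() but as 6 characters after decode().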

I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not.

Well, the status of the utf8 flag on the regex is important (see later reply), and of course if the regex has characters in one encoding and the string you apply it to is in some other encoding, there's no way it can work as intended.
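A small sketch of that mismatch, under assumptions of my own: $raw stands in for the fetched page content as UTF-8 octets, and the pattern is a single made-up character (U+3042). None of this is taken from the original poster's code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for fetched content: UTF-8 octets for two U+3042 characters.
my $raw = encode('UTF-8', "\x{3042}\x{3042}");

# A pattern written in terms of characters.
my $char_pattern = qr/\x{3042}+/;

# Characters vs. octets: the undecoded string holds only the bytes
# E3 81 82 (twice), so no single element is U+3042 and the match fails.
my $match_raw = ($raw =~ $char_pattern) ? 1 : 0;

# Characters vs. characters: decode first, then the same pattern matches.
my $text = decode('UTF-8', $raw);
my $match_decoded = ($text =~ $char_pattern) ? 1 : 0;

# Octets vs. octets is also consistent: encode the needle to match raw content.
my $needle_octets = encode('UTF-8', "\x{3042}");
my $match_octets = ($raw =~ /\Q$needle_octets\E/) ? 1 : 0;

print "char pattern vs raw octets:   $match_raw\n";      # 0
print "char pattern vs decoded text: $match_decoded\n";  # 1
print "octet pattern vs raw octets:  $match_octets\n";   # 1
```

Either consistent pairing works; decoding both sides to characters is usually the more convenient one, since character classes and quantifiers then apply to whole characters rather than to individual bytes.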

It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding.

This probably makes no sense to you because of a misunderstanding about the roles of encode() and decode() with respect to clearing/setting the utf8 flag. Assuming that you are, in fact, successfully converting (decoding) both the regex and the fetched web-page contents to utf8, then sure, things should work fine, and there should be no mystery about that. (updated to fix grammar)


In reply to Re^3: Matching UTF8 Regexps by graff
in thread Matching UTF8 Regexps by lestrrat
