O, wise ones, please show me the light.

I'm currently trying to fetch a bunch of pages from the web and then store their contents after normalizing the encoding to an arbitrary charset -- but I want to do this if and only if the contents of the page match a certain pattern.

Since I will mostly be working with Japanese text (which has at least 3 different widely used charsets), I'd like to first normalize the page contents to UTF8 and then match a regular expression against them. But I ran into problems where the matching wasn't getting exactly what I wanted.

First, this is what I was doing:

use Encode qw(encode decode);

my $regexp_source = "... regexp string ...";
my $regexp_utf8 = encode('utf8', decode('euc-jp', $regexp_source));
my $regexp = qr($regexp_utf8);

my $ua = LWP::UserAgent->new();
my $response = $ua->get('http://foo.bar.com');
my $encoding = guess_encoding($response);
my $normalized = encode('utf8', decode($encoding, $response->content));

if ($normalized =~ /$regexp/) {
    store_data($response);
}

The above doesn't work -- it seems to match unwanted URIs as well. However, after a few attempts I noticed the following:

What's the difference here? My instinct tells me that if both the regular expression and the operand need to be in Perl's native UTF8 format, that's fine. But why does it still work whether or not the regexp string is encoded to utf8, and why does it not work when the operand ($response->content) is encode()d to utf8?
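For reference, here is a minimal sketch of the characters-versus-bytes distinction the question turns on. It is not a diagnosis of the code above: the sample word, the hard-coded byte strings, and all variable names are assumptions made up for illustration, and no network fetch is involved. It only shows that decode() yields a character string, encode() yields a byte string, and a pattern behaves differently depending on which of the two it is matched against.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-ins for the real inputs: the word U+65E5 U+672C
# as EUC-JP bytes (pattern source) and as Shift_JIS bytes (page body).
my $pattern_bytes = "\xC6\xFC\xCB\xDC";
my $page_bytes    = "\x93\xFA\x96\x7B";

# decode() turns encoded bytes into Perl character strings.
my $pattern_chars = decode('euc-jp',   $pattern_bytes);
my $page_chars    = decode('shiftjis', $page_bytes);

# Characters against characters: the two decoded strings are identical.
my $chars_vs_chars = ($page_chars =~ /$pattern_chars/) ? 1 : 0;

# encode() turns characters back into bytes (3 UTF-8 bytes per character here).
my $page_utf8 = encode('UTF-8', $page_chars);

# Character pattern against UTF-8 bytes: no single byte equals U+65E5.
my $chars_vs_bytes = ($page_utf8 =~ /$pattern_chars/) ? 1 : 0;

# A character class encoded to bytes becomes a class of individual bytes,
# so it can match unrelated text that merely shares a byte.
# U+6587 encodes as E6 96 87 and shares the lead byte E6 with U+65E5.
my $other_chars = "\x{6587}";
my $other_utf8  = encode('UTF-8', $other_chars);
my $class_utf8  = encode('UTF-8', "[$pattern_chars]");
my $bytes_class = ($other_utf8  =~ /$class_utf8/)      ? 1 : 0;
my $chars_class = ($other_chars =~ /[$pattern_chars]/) ? 1 : 0;

print "chars vs chars:        $chars_vs_chars\n";
print "char pattern vs bytes: $chars_vs_bytes\n";
print "byte-level class:      $bytes_class\n";
print "char-level class:      $chars_class\n";
```

In this sketch the decoded strings match each other, the character pattern does not match the encoded bytes, and the byte-level character class matches a different character purely because of a shared byte. Whether that last effect is what produces the unwanted matches above depends on the actual regexp string, which is not shown.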


In reply to Matching UTF8 Regexps by lestrrat
