lestrrat has asked for the wisdom of the Perl Monks concerning the following question:
O, wise ones, please show me the light.
I'm currently trying to fetch a bunch of pages off the web and then store their contents after normalizing the encoding to an arbitrary charset -- but only if the contents of the page match a certain pattern.
Since I will mostly be working with Japanese text (which has at least 3 different widely used charsets), I'd like to first normalize the page contents to UTF8 and then match a regular expression against it. But I had problems where the matching wasn't getting exactly what I wanted.
First, this is what I was doing:
    use Encode qw(encode decode);
    use LWP::UserAgent;

    # The pattern source is written in EUC-JP; convert it to UTF-8 bytes.
    my $regexp_source = "... regexp string ...";
    my $regexp_utf8   = encode('utf8', decode('euc-jp', $regexp_source));
    my $regexp        = qr($regexp_utf8);

    my $ua       = LWP::UserAgent->new();
    my $response = $ua->get('http://foo.bar.com');

    # guess_encoding() and store_data() are not shown in the original post.
    my $encoding   = guess_encoding($response);
    my $normalized = encode('utf8', decode($encoding, $response->content));

    if ($normalized =~ /$regexp/) {
        store_data($response);
    }
The above doesn't work -- it seems to match unwanted URIs as well. However, after a few attempts I noticed the following:
What's the difference here? My instinct tells me that if both the regular expression and the operand need to be in Perl's native UTF8 format, that's fine. But why does it still work whether or not the regexp string is encoded to utf8, and why does it not work when the operand ($response->content) is encode()d to utf8?
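For readers following along, here is a minimal sketch of the distinction the question turns on: `decode()` yields a Perl character string, while `encode()` yields a byte string, and a pattern behaves differently against each. The sample text and the pattern are hypothetical, chosen only for illustration, and the sketch does not reproduce the exact over-matching described above. It only shows that the same data stops lining up with a character-based pattern once it has been re-encoded to UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;                       # literals in this file are character strings
use Encode qw(encode decode);

# Hypothetical sample: the same Japanese text as it might arrive in two charsets.
my $text       = "日本語のページ";
my $euc_bytes  = encode('euc-jp',   $text);
my $sjis_bytes = encode('shiftjis', $text);

# decode() turns bytes into a character string, whatever the source charset was.
my $from_euc  = decode('euc-jp',   $euc_bytes);
my $from_sjis = decode('shiftjis', $sjis_bytes);

# A pattern written as characters: '.' here means "one character".
my $regexp = qr/^日.語/;

# Matching character string against character pattern.
my $chars_match = ($from_euc =~ $regexp) ? 1 : 0;

# Re-encoding produces a byte string: each of these characters becomes three bytes.
my $utf8_bytes  = encode('UTF-8', $from_euc);
my $bytes_match = ($utf8_bytes =~ $regexp) ? 1 : 0;

printf "same text from both charsets: %s\n", ($from_euc eq $from_sjis ? "yes" : "no");
printf "length as characters: %d\n", length $from_euc;
printf "length as UTF-8 bytes: %d\n", length $utf8_bytes;
printf "character string matches: %d\n", $chars_match;
printf "byte string matches: %d\n", $bytes_match;
```

Under these assumptions, the usual arrangement is to decode input at the boundary, match on character strings, and call `encode()` only when writing the data back out.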
Re: Matching UTF8 Regexps
by japhy (Canon) on Mar 07, 2005 at 08:08 UTC
by lestrrat (Deacon) on Mar 07, 2005 at 09:23 UTC

Re: Matching UTF8 Regexps
by graff (Chancellor) on Mar 08, 2005 at 02:09 UTC
by lestrrat (Deacon) on Mar 08, 2005 at 04:06 UTC
by graff (Chancellor) on Mar 08, 2005 at 04:31 UTC
by lestrrat (Deacon) on Mar 08, 2005 at 07:34 UTC
by graff (Chancellor) on Mar 08, 2005 at 14:31 UTC