Matching UTF8 Regexps

lestrrat has asked for the wisdom of the Perl Monks concerning the following question:

O, wise ones, please show me the light.

I'm currently trying to fetch a bunch of pages off the web and then store the contents after normalizing their encoding to an arbitrary charset -- but I want to do this if and only if the contents of the page match a certain pattern.

Since I will mostly be working with Japanese text (for which there are at least 3 widely used charsets), I'd like to first normalize the page contents to UTF8 and then match a regular expression against them. But I ran into a problem: the matching was not getting exactly what I wanted.

First, this is what I was doing:

   use Encode qw(encode decode);
   use LWP::UserAgent;

   my $regexp_source = "... regexp string ...";
   my $regexp_utf8   = encode('utf8', decode('euc-jp', $regexp_source));
   my $regexp        = qr($regexp_utf8);

   my $ua = LWP::UserAgent->new();
   my $response = $ua->get('http://foo.bar.com');

   my $encoding = guess_encoding($response);
   my $normalized = encode('utf8', decode($encoding, $response->content));

   if ($normalized =~ /$regexp/) {
      store_data($response);
   }

The above doesn't work -- it seems to match unwanted URIs as well. However, after a few attempts I noticed the following:

removing only the encoding of the regexp string to utf8 (encode('utf8',...)) doesn't affect the result
removing the encode() call on the content of the page makes things work

What's the difference here? My instinct tells me that, if both the regular expression and the operand needed to be in Perl's native UTF8 format, that's fine, but why would it still work when the regexp string is encoded to utf8 or not, and why would it not work when the operand ($response->content) is encode()d to utf8?

Re: Matching UTF8 Regexps by japhy (Canon) on Mar 07, 2005 at 08:08 UTC
Your regex source variable is double-quoted. This leads me to believe you'll have problems with backslashes and other regex-specific symbols that are incorrectly interpreted in a double-quoted string. Try using `qr//` to contain your regex instead:

   my $regexp_source = qr/... regexp string .../;

Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
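
To illustrate the point about double-quoted strings, here is a minimal sketch. The pattern and the sample text are made up for the demonstration and are not taken from the original post. In a double-quoted string `\b` becomes a backspace character, while inside `qr//` it stays a word-boundary assertion.

```perl
use strict;
use warnings;

# In a double-quoted string, "\b" is a backspace character (0x08),
# so the word-boundary assertions are lost before the regex engine sees them.
my $from_string = "\bcat\b";
my $re_string   = qr/$from_string/;

# Inside qr//, \b reaches the regex engine as a word boundary.
my $re_qr = qr/\bcat\b/;

# Hypothetical sample text.
my $text = "a cat sat";

my $string_matches = $text =~ $re_string ? 1 : 0;
my $qr_matches     = $text =~ $re_qr     ? 1 : 0;

print "double-quoted source matches: $string_matches\n";
print "qr source matches: $qr_matches\n";
```

A single-quoted string would also keep the backslashes intact, if the pattern has to stay a plain string until it has been decoded.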
Re^2: Matching UTF8 Regexps by lestrrat (Deacon) on Mar 07, 2005 at 09:23 UTC
Thanks. Tried it, but no luck (same problem). Your suggestion has a valid point, but if that were the source of the problem, I don't see why encoding the content of the page makes a difference. (And in this particular case, I had no escapes in my regexp string -- which was just pure luck.) I'm guessing it has more to do with UTF8-funkiness than with regular expressions per se...
Re: Matching UTF8 Regexps by graff (Chancellor) on Mar 08, 2005 at 02:09 UTC
The code you posted here doesn't make it clear where "guess_encoding" comes from. Are you using Encode::Guess? If so, are you actually using the code as posted? If so, maybe that's part of the problem, because you're supposed to pass a list of encodings that are "likely suspects" for the data you want to guess about. (Maybe you do that in the part of the code you're not showing us, e.g. as a list on the "use Encode::Guess" line?) You should review the manual for Encode::Guess more carefully, and see if you are trying to get something out of it that it can't really provide.

Anyway, I think this is how you should be assigning the decoded value to $normalized:

   my $encoding = guess_encoding( $response->content );
   my $normalized = decode( $encoding, $response->content );

Note that "decode" returns utf8 data -- you don't need to "re-encode" it as utf8.

(updated to fix grammar)

<update>
Actually, looking at the current man page for Encode::Guess on CPAN, it looks like you should be doing this:

   my $decoder = guess_encoding( $response->content );
   die $decoder unless ( ref $decoder );
   my $normalized = $decoder->decode( $response->content );

In other words, the guessing method is supposed to return an object that supplies the (hopefully) appropriate decoding method, and you just pass your data to that method.
</update>

If that doesn't help, I'm not sure what else to suggest. Maybe try breaking the process down into steps:

store the $response->content to a file and inspect that manually;
see what guess_encoding is returning for the chosen content -- maybe it's not guessing correctly;
use the "FB_CROAK" flag as a third parameter in the "decode()" call (and wrap the call in an eval to catch it if it dies), to see if there are any errors when trying to convert the content to utf8, even when you know the "true" encoding of the source.

(What? You don't think there would be encoding errors in the original character data? Don't be so sure.)
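
The debugging steps above could be sketched roughly as follows. The helper names `decode_guessed` and `strict_decode` are hypothetical, invented for this illustration, and the suspect list and sample bytes are assumptions. Only `guess_encoding`, `decode`, `encode` and `FB_CROAK` come from Encode and Encode::Guess.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: guess among the given suspects, then decode.
# Returns the decoded text, or nothing if the guess failed or was ambiguous
# (in that case guess_encoding returns an error string, not an object).
sub decode_guessed {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    return unless ref $decoder;
    return $decoder->decode($octets);
}

# Hypothetical helper: decode with a known encoding, dying on malformed input.
# Returns (text, undef) on success or (undef, error message) on failure.
# $octets is a private copy here, because decode() with a CHECK value may
# modify the variable it is given.
sub strict_decode {
    my ($encoding, $octets) = @_;
    my $text = eval { decode($encoding, $octets, Encode::FB_CROAK()) };
    return defined $text ? ($text, undef) : (undef, $@);
}

# Assumed sample data: three kanji encoded as EUC-JP octets.
my $sample = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");

my $guessed = decode_guessed($sample, 'euc-jp');
print defined $guessed ? "guess succeeded\n" : "guess failed\n";

my ($text, $error) = strict_decode('euc-jp', $sample);
print defined $text ? "strict decode succeeded\n" : "strict decode failed: $error";
```

Passing the suspects per call keeps the guess local to one page. Listing them on the "use Encode::Guess" line instead sets them for the whole program.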
Re^2: Matching UTF8 Regexps by lestrrat (Deacon) on Mar 08, 2005 at 04:06 UTC
Thanks, but the guessing method is not pertinent in this case, as I was testing it with a fixed set of charsets and the same content before posting. :(

I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference. I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not. It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding.

<ASIDE>
I only use Encode::Guess as a last resort because more often than not it tends... to utterly fail to work. It's understandable, though, because there are many who decide to gratuitously mix euc-jp and shift-jis in the same page, for example. Ugh. Anyhow, guessing is done using about 20 steps of heuristics at this point. If all else fails, we try to decode using Encode::Guess just to see if we can do it, but we don't really rely on it.
</ASIDE>
Re^3: Matching UTF8 Regexps by graff (Chancellor) on Mar 08, 2005 at 04:31 UTC
I'm sorry, but you don't seem to be making any sense.

"I'm just stumped why encode(), which in this case sets the UTF8 flag on, makes any difference."

You have it backwards: the parameters given to encode() are a character set spec and a utf8 string; in the typical case, the character set spec is non-unicode, but in any case, the function returns a scalar that contains octets (raw bytes), not "characters" in the perl-internal/utf8 sense, and so the scalar value being returned has the utf8 flag off. It's the "decode" function that takes a character-encoding name and a scalar of octets, and returns a scalar string of utf8 characters, with the utf8 flag on.

"I can understand if it was necessary to make sure that both the regexp and the content had the UTF8 flag on (or off), but in this case it doesn't matter if the UTF8 flag on the regexp is set or not."

Well, ~~maybe~~ the status of the utf8 flag on the regex ~~could be irrelevant, but~~ is important (see later reply), and of course if the regex has characters in one encoding, and the string you apply it to is in some other encoding, there's no way it can work as intended.

"It matters that the content has the UTF8 flag off. This is the part that I don't get. I'm not failing to convert the content into the correct encoding."

The reason this makes no sense is probably due to your misunderstanding about the roles of encode() and decode() with respect to clearing/setting the utf8 flag. Assuming that you are, in fact, successfully converting (decoding) both the regex and the fetched web-page contents to utf8, then sure, things should work fine, and there should be no mystery about that.

(updated to fix grammar)
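
The roles of encode() and decode() described above can be checked with a short sketch. The sample strings and the choice of euc-jp and shiftjis as source encodings are assumptions made for illustration; they are not the data from the original question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample data: a pattern stored as EUC-JP octets and a "page"
# stored as Shift_JIS octets, both containing the same two kanji.
my $pattern_octets = encode('euc-jp',   "\x{65E5}\x{672C}");
my $page_octets    = encode('shiftjis', "abc \x{65E5}\x{672C}\x{8A9E} xyz");

# decode() turns octets into perl characters (utf8 flag on for this data).
my $pattern_chars = decode('euc-jp',   $pattern_octets);
my $page_chars    = decode('shiftjis', $page_octets);
my $decoded_flag  = utf8::is_utf8($page_chars) ? 1 : 0;

# Characters matched against characters: works whatever the source charsets were.
my $re            = qr/\Q$pattern_chars\E/;
my $chars_matched = $page_chars =~ $re ? 1 : 0;

# encode() turns characters back into octets (utf8 flag off).
my $page_utf8_octets = encode('utf8', $page_chars);
my $encoded_flag     = utf8::is_utf8($page_utf8_octets) ? 1 : 0;

# The same character-level regex applied to raw UTF-8 octets does not match,
# because each byte is treated as a separate character.
my $octets_matched = $page_utf8_octets =~ $re ? 1 : 0;

print "flag after decode: $decoded_flag, match on characters: $chars_matched\n";
print "flag after encode: $encoded_flag, match on octets: $octets_matched\n";
```

The sketch decodes both sides and matches before any encode() call, leaving encode() for the moment the data is written out.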
Re^4: Matching UTF8 Regexps by lestrrat (Deacon) on Mar 08, 2005 at 07:34 UTC
Re^5: Matching UTF8 Regexps by graff (Chancellor) on Mar 08, 2005 at 14:31 UTC

