Weird Perl 5.8.8 Regex Problems for Japanese UTF8

ruski86 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I've been banging my head against this for a week, and I'm pretty stuck. I have some regexes that run completely fine on my local machine, taking mere milliseconds. They work for both ASCII and Unicode and cause me no problems whatsoever (assuming correct conversion to Unicode when needed, etc.). This all happens locally on a Windows box using Cygwin and Perl 5.10.1.

Now, as soon as I take that exact same code and put it on a Linux server running Perl 5.8.8, everything still works in about the same length of time except for seven regular expressions (out of the hundreds I'm using), which for one reason or another hang and take a long time (seconds vs. milliseconds) on Japanese UTF-8. I don't see anything special in these seven regexes, and I don't understand why there would be a problem only with Japanese (other high Unicode, like Chinese or Thai, works fine), or why the problem only exists remotely on Perl 5.8.8 and not locally on Perl 5.10.1.

Upgrading the server to 5.10.1 is out of the question, so I need to find a workaround for these. A few of the problematic regexes are:

1. ^\s*SPECIFIC\sDECISIONS\s*$
2. ^\s*Important\s+\S+\s+Decisions\s*$
3. ^General\s+Decisions$
4. ^E-mail:\s*replies@x.com$

Has anyone ever experienced anything similar?

Re: Weird Perl 5.8.8 Regex Problems for Japanese UTF8 by graff (Chancellor) on Sep 07, 2010 at 22:11 UTC
As JavaFan indicated, your description isn't giving us quite enough to go on. How about posting a little bit of sample data and a minimal snippet of runnable code that demonstrates the problem, such that running that snippet with that data takes 0.00N sec locally and N sec on the remote Linux server? (You might need the Benchmark module to get measurable results.) As you try to do that, if you happen to notice that the self-contained test script (using the same test data in the same way) actually runs in about the same time on both systems, then that would suggest there might be a problem with where your Linux server is getting its Japanese data from (i.e. it might not be a problem with your Perl script or the Perl interpreter). (updated to fix grammar)
Re^2: Weird Perl 5.8.8 Regex Problems for Japanese UTF8 by ruski86 (Initiate) on Sep 08, 2010 at 14:10 UTC
It's really, really long code that I'm inheriting, so it's difficult to post a small self-contained portion, but those regexes I posted should cause the problem on their own when run against certain lines of Japanese. An example of one of these is: "インドネシア休場"
Re^3: Weird Perl 5.8.8 Regex Problems for Japanese UTF8 by JavaFan (Canon) on Sep 08, 2010 at 15:30 UTC
I cannot reproduce that. In fact, according to Benchmark, 5.8.8 is significantly faster than either 5.10.1 or 5.12.2.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark qw[cmpthese];

    my $RUNS = shift || 5;

    print $], "\n";

    sub match {
        my $subj = shift;
        $subj =~ /^\s*SPECIFIC\sDECISIONS\s*$/;
        $subj =~ /^\s*Important\s+\S+\s+Decisions\s*$/;
        $subj =~ /^General\s+Decisions$/;
        $subj =~ /^E-mail:\s*replies\@x.com$/;
    }

    for (1 .. $RUNS) {
        cmpthese(-3, {
            english  => sub {match "This is a string"},
            japanese => sub {match "インドネシア休場"},
        });
    }

    __END__
    5.008008
                 Rate japanese  english
    japanese 466031/s       --     -38%
    english  751158/s      61%       --
                 Rate japanese  english
    japanese 505266/s       --     -36%
    english  793800/s      57%       --
                 Rate japanese  english
    japanese 502207/s       --     -36%
    english  778988/s      55%       --
                 Rate japanese  english
    japanese 498022/s       --     -35%
    english  767420/s      54%       --
                 Rate japanese  english
    japanese 508507/s       --     -35%
    english  780966/s      54%       --

    5.010001
                 Rate japanese  english
    japanese 339534/s       --     -41%
    english  577872/s      70%       --
                 Rate japanese  english
    japanese 416203/s       --     -28%
    english  577831/s      39%       --
                 Rate japanese  english
    japanese 369283/s       --     -37%
    english  581775/s      58%       --
                 Rate japanese  english
    japanese 418431/s       --     -19%
    english  516905/s      24%       --
                 Rate japanese  english
    japanese 417789/s       --     -29%
    english  587599/s      41%       --

    5.012002
                 Rate japanese  english
    japanese 361165/s       --     -27%
    english  496107/s      37%       --
                 Rate japanese  english
    japanese 372533/s       --     -33%
    english  553731/s      49%       --
                 Rate japanese  english
    japanese 362232/s       --     -34%
    english  549086/s      52%       --
                 Rate japanese  english
    japanese 361565/s       --     -33%
    english  537911/s      49%       --
                 Rate japanese  english
    japanese 368166/s       --     -32%
    english  539974/s      47%       --

(For some reason, if you have `"インドネシア休場"` in a `<code>` block, PerlMonks escapes the Japanese characters and then shows the raw entities. Bug?)
Re: Non windows-1252 and PM (was Re^4: Weird Perl 5.8.8 Regex Problems for Japanese UTF8) by ikegami (Patriarch) on Sep 08, 2010 at 16:25 UTC
Re^3: Weird Perl 5.8.8 Regex Problems for Japanese UTF8 by graff (Chancellor) on Sep 08, 2010 at 22:18 UTC
You said: "It's really, really long code that I'm inheriting, so it's difficult to post a small self-contained portion, but those regexes I posted should cause the problem on their own when run against certain lines of Japanese."

One step in the diagnosis is to determine whether it really is the regexes themselves (and their handling of Japanese), or whether it is instead some other problem involving the Japanese data in the big app, regardless of the regexes (e.g. only the Japanese data are coming from a particular source that isn't playing nice with the app).

You have a local machine and a remote machine; you have a small sample of Japanese text and non-Japanese text; you have a set of regexes. If a single minimal test script, containing its own test data and applying those regexes, runs in roughly the same amount of time on both the local and remote machines (or at least behaves consistently when applying the regexes to the Japanese and non-Japanese strings), then the problem in the big app is probably not being caused by the regexes.

Here is what it boils down to: What sort of evidence did you have that led to your conclusion in the OP that the regexes were causing the problem? And what is the likelihood that your original evidence might also be due to some other cause, not involving the regexes?

If you try the test script that JavaFan supplied above, and it turns out to show a really big slow-down on the remote machine, then you really do have a problem using those regexes with Japanese text on that machine. That would be strange, but in that case, you can try some different variations on the regexes to see if there's another way to do the job without taking the hit. OTOH, in the more likely case that the regexes are performing as expected in that test, you'll need to examine what else is going on with the Japanese data in the big app on the remote machine (and how it differs from the local one)...
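For what it's worth, here is a minimal sketch of the kind of self-contained test described above. It is not the OP's code: the helper `time_matches`, the iteration count, and the choice to escape the `@` and `.` in the fourth pattern are all assumptions made for illustration. It times the thread's sample string both as a character string and as raw UTF-8 bytes, since the regex engine may treat the two differently; whether the big app really passes decoded character strings is also an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use Time::HiRes qw(gettimeofday tv_interval);

# Patterns from the original post (fourth one with @ and . escaped).
my @patterns = (
    qr/^\s*SPECIFIC\sDECISIONS\s*$/,
    qr/^\s*Important\s+\S+\s+Decisions\s*$/,
    qr/^General\s+Decisions$/,
    qr/^E-mail:\s*replies\@x\.com$/,
);

# The sample string from the thread, written with \x{...} escapes so the
# result does not depend on the encoding of this source file.
my $chars = "\x{30A4}\x{30F3}\x{30C9}\x{30CD}\x{30B7}\x{30A2}\x{4F11}\x{5834}";
my $bytes = encode('UTF-8', $chars);    # same text as undecoded octets

my %subjects = (
    'english'        => 'This is a string',
    'japanese chars' => $chars,
    'japanese bytes' => $bytes,
);

my $iterations = 100_000;    # arbitrary; raise it if the timings are too small

# Hypothetical helper: wall-clock seconds for $n match attempts.
sub time_matches {
    my ($subject, $re, $n) = @_;
    my $t0 = [gettimeofday];
    for (1 .. $n) {
        my $hit = $subject =~ $re;
    }
    return tv_interval($t0);
}

print "perl $]\n";
for my $label (sort keys %subjects) {
    for my $i (0 .. $#patterns) {
        printf "%-16s pattern %d: %.3f s\n",
            $label, $i + 1,
            time_matches($subjects{$label}, $patterns[$i], $iterations);
    }
}
```

Running the same file on both machines and comparing the per-pattern numbers should show whether the slow-down follows the regexes or the data path.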
Re: Weird Perl 5.8.8 Regex Problems for Japanese UTF8 by JavaFan (Canon) on Sep 07, 2010 at 20:58 UTC
Uhm, none of the regexes you show contain anything Japanese. And only the second one has any chance of matching anything that contains Japanese. (Well, technically, the fourth one may as well: the array @x may contain Japanese text, and '.' may match a single Japanese character. But you probably meant `/\@x\.com$/` there.)
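To illustrate the point about the fourth regex, here is a small sketch. The empty `@x` array is hypothetical and declared only so that the unescaped pattern compiles under `use strict`; the test strings are made up as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical empty array, declared only so the unescaped pattern compiles.
my @x = ();

# Unescaped: @x is interpolated (here to an empty string), '.' matches any character.
my $unescaped = qr/^E-mail:\s*replies@x.com$/;

# Escaped: matches the literal text "replies@x.com".
my $escaped = qr/^E-mail:\s*replies\@x\.com$/;

for my $line ('E-mail: replies@x.com', 'E-mail: repliesXcom') {
    printf "%-24s unescaped=%d escaped=%d\n",
        $line,
        ($line =~ $unescaped ? 1 : 0),
        ($line =~ $escaped   ? 1 : 0);
}
```

With an empty `@x`, the unescaped pattern effectively becomes `replies.com`, so it rejects the intended address and accepts `repliesXcom` instead.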
Re^2: Weird Perl 5.8.8 Regex Problems for Japanese UTF8 by ruski86 (Initiate) on Sep 08, 2010 at 14:04 UTC
Sorry, I mean that I'm using those regexes to match against a lot of different documents. Many of those documents happen to contain different languages. None of the characters in any of those languages cause any problems except the Japanese ones, which slow the matching down to a crawl (and it seems that only about 10% of the Japanese strings I've tried to match against cause the problem). Does that clarify things?
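Since only a fraction of the Japanese strings seem to be affected, one way to narrow things down might be to time every line against every pattern and report only the slow attempts. The following is a rough sketch under assumptions: the helper `slow_attempts`, the 0.01 s threshold, and the one-line-per-row UTF-8 sample file given on the command line are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: returns [line number, pattern number, seconds] for
# every match attempt that takes longer than $threshold seconds.
sub slow_attempts {
    my ($lines, $patterns, $threshold) = @_;
    my @slow;
    for my $n (0 .. $#$lines) {
        for my $p (0 .. $#$patterns) {
            my $t0      = [gettimeofday];
            my $hit     = $lines->[$n] =~ $patterns->[$p];
            my $elapsed = tv_interval($t0);
            push @slow, [$n + 1, $p + 1, $elapsed] if $elapsed > $threshold;
        }
    }
    return @slow;
}

# Two of the patterns from the original post.
my @patterns = (
    qr/^\s*SPECIFIC\sDECISIONS\s*$/,
    qr/^\s*Important\s+\S+\s+Decisions\s*$/,
);

# Optional argument: path to a UTF-8 sample file (an assumption).
my $path = shift @ARGV;
if (defined $path) {
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    printf "line %d, pattern %d: %.4f s\n", @$_
        for slow_attempts(\@lines, \@patterns, 0.01);
}
```

Any line this flags on the 5.8.8 server but not locally would make a good minimal test case to post.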