in reply to Weird Perl 5.8.8 Regex Problems for Japanese UTF8

As JavaFan indicated, your description isn't giving us quite enough to go on. How about posting a little bit of sample data and a minimal snippet of runnable code that demonstrates the problem, such that running that snippet with that data takes 0.00N sec locally and N sec on the remote linux server? (You might need the Benchmark module to get measurable results.)

As you try to do that, if you happen to notice that the self-contained test script (using the same test data in the same way) actually runs in about the same time on both systems, then that would suggest there might be a problem with where your linux server is getting its Japanese data from (i.e. it might not be a problem with your perl script or the perl interpreter).

(updated to fix grammar)

Re^2: Weird Perl 5.8.8 Regex Problems for Japanese UTF8
by ruski86 (Initiate) on Sep 08, 2010 at 14:10 UTC
    It's really, really long code that I'm inheriting, so it's difficult to post a small self-contained portion, but those regexes I posted should cause the problem on their own when run against certain lines of Japanese. An example of one of these is: "インドネシア休場"
      I cannot reproduce that. In fact, according to Benchmark, 5.8.8 is significantly faster than either 5.10.1 or 5.12.2.
      #!/usr/bin/perl

      use strict;
      use warnings;

      use Benchmark qw[cmpthese];

      my $RUNS = shift || 5;

      print $], "\n";

      sub match {
          my $subj = shift;
          $subj =~ /^\s*SPECIFIC\sDECISIONS\s*$/;
          $subj =~ /^\s*Important\s+\S+\s+Decisions\s*$/;
          $subj =~ /^General\s+Decisions$/;
          $subj =~ /^E-mail:\s*replies\@x.com$/;
      }

      for (1 .. $RUNS) {
          cmpthese(-3, {
              english  => sub {match "This is a string"},
              japanese => sub {match "インドネシア休場"},
          });
      }

      __END__
      5.008008
                   Rate japanese  english
      japanese 466031/s       --     -38%
      english  751158/s      61%       --
                   Rate japanese  english
      japanese 505266/s       --     -36%
      english  793800/s      57%       --
                   Rate japanese  english
      japanese 502207/s       --     -36%
      english  778988/s      55%       --
                   Rate japanese  english
      japanese 498022/s       --     -35%
      english  767420/s      54%       --
                   Rate japanese  english
      japanese 508507/s       --     -35%
      english  780966/s      54%       --

      5.010001
                   Rate japanese  english
      japanese 339534/s       --     -41%
      english  577872/s      70%       --
                   Rate japanese  english
      japanese 416203/s       --     -28%
      english  577831/s      39%       --
                   Rate japanese  english
      japanese 369283/s       --     -37%
      english  581775/s      58%       --
                   Rate japanese  english
      japanese 418431/s       --     -19%
      english  516905/s      24%       --
                   Rate japanese  english
      japanese 417789/s       --     -29%
      english  587599/s      41%       --

      5.012002
                   Rate japanese  english
      japanese 361165/s       --     -27%
      english  496107/s      37%       --
                   Rate japanese  english
      japanese 372533/s       --     -33%
      english  553731/s      49%       --
                   Rate japanese  english
      japanese 362232/s       --     -34%
      english  549086/s      52%       --
                   Rate japanese  english
      japanese 361565/s       --     -33%
      english  537911/s      49%       --
                   Rate japanese  english
      japanese 368166/s       --     -32%
      english  539974/s      47%       --
      (For some reason, if you have "インドネシア休場" in a <code> block, Perlmonks escapes the Japanese characters, and then shows the raw entities. Bug?)

        For some reason, if you have "インドネシア休場" in a <code> block, Perlmonks escapes the Japanese characters, and then shows the raw entities. Bug?

        It's not PerlMonks, it's your browser. PerlMonks uses windows-1252. You tried to submit a character that couldn't be submitted because it's not in PerlMonks' character set. On the off chance that the input will be treated as HTML, your browser submitted the HTML-encoding of the character. PerlMonks displays the content of code blocks as it received it, so PerlMonks displays the HTML the browser sent it.

      You said:
      It's really, really long code that I'm inheriting, so it's difficult to post a small self-contained portion, but those regexes I posted should cause the problem on their own when run against certain lines of Japanese.

      One step in the diagnosis is to determine whether it really is the regexes themselves (and their handling of Japanese), or whether it is instead some other problem involving the Japanese data in the big app, regardless of the regexes (e.g. only the Japanese data are coming from a particular source that isn't playing nice with the app).

      You have a local machine and a remote machine; you have a small sample of Japanese text and non-Japanese text; you have a set of regexes. If a single minimal test script, containing its own test data and applying those regexes, runs in roughly the same amount of time on both the local and remote machines (or at least behaves consistently when applying the regexes to the Japanese and non-Japanese strings), then the problem in the big app is probably not being caused by the regexes.
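
      As a rough sketch of such a self-contained test (not code from the big app), something like the following could be run unchanged on both machines. The regexes and sample strings are the ones from JavaFan's benchmark; the `time_matches` helper and the default iteration count are made up for illustration. It assumes the app sees its input as raw UTF-8 bytes, which is why there is no `use utf8`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Test data lives in the script itself, so no file, database, or network
# source is involved. Without "use utf8" the Japanese literal is a plain
# byte string; that is an assumption about how the big app sees its input.
my %samples = (
    english  => "This is a string",
    japanese => "インドネシア休場",
);

# The regexes from the benchmark earlier in this thread.
my @regexes = (
    qr/^\s*SPECIFIC\sDECISIONS\s*$/,
    qr/^\s*Important\s+\S+\s+Decisions\s*$/,
    qr/^General\s+Decisions$/,
    qr/^E-mail:\s*replies\@x.com$/,
);

# Apply every regex to $text $count times and return the elapsed
# wall-clock time in seconds. (Hypothetical helper.)
sub time_matches {
    my ($text, $count) = @_;
    my $start = time;
    for (1 .. $count) {
        for my $re (@regexes) {
            my $hit = $text =~ $re;
        }
    }
    return time - $start;
}

my $count = shift || 100_000;    # arbitrary default iteration count

printf "perl %s on %s\n", $], $^O;
for my $name (sort keys %samples) {
    printf "%-8s %.3f sec for %d iterations\n",
        $name, time_matches($samples{$name}, $count), $count;
}
```

      If the japanese and english timings stay in the same ballpark on both machines, that points away from the regexes and toward something else in the app.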

      Here is what it boils down to: What sort of evidence did you have that led to your conclusion in the OP that the regexes were causing the problem? And what is the likelihood that your original evidence might also be due to some other cause, not involving the regexes?

      If you try the test script that JavaFan supplied above, and it turns out to show a really big slow-down on the remote machine, then you really do have a problem using those regexes with Japanese text on that machine. That would be strange, but in that case, you can try some different variations on the regexes to see if there's another way to do the job without taking the hit.

      OTOH, in the more likely case that the regexes are performing as expected in that test, you'll need to examine what else is going on with the Japanese data in the big app on the remote machine (and how it differs from the local one)...
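
      One thing that might differ between the two machines is how the Japanese data looks to Perl when it reaches the regexes: decoded characters versus raw bytes, and whether the bytes are well-formed UTF-8 at all. Here is a minimal sketch for checking that, assuming you can get hold of a line of input at the point where it is matched. The `describe_line` helper and the sample strings are hypothetical, not part of the original app.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Summarize how Perl sees one line of input. (Hypothetical helper.)
sub describe_line {
    my ($line) = @_;
    my %info = (
        # 1 if Perl holds decoded characters, 0 if it holds raw bytes
        utf8_flag => utf8::is_utf8($line) ? 1 : 0,
        # characters if the flag is on, bytes otherwise
        length    => length $line,
    );
    # For raw bytes, check whether they form well-formed UTF-8.
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $line.
    if (!$info{utf8_flag}) {
        $info{valid_utf8} =
            eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
    }
    return \%info;
}

# Sample data: the same text as raw bytes (no "use utf8" in this script)
# and as decoded characters.
my $bytes = "インドネシア休場";
my $chars = decode('UTF-8', $bytes);

for my $sample ($bytes, $chars) {
    my $info = describe_line($sample);
    print join(", ", map { "$_=$info->{$_}" } sort keys %$info), "\n";
}
```

      Calling something like `describe_line` on real input lines in the big app, on both machines, would show whether the two environments are handing the regexes the same kind of string.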