in reply to serious regex performance degradation after upgrade to perl 5.8 from 5.6
I may be wrong, but on Perl 5.8 UTF-8 can make strings allocate up to 4 bytes for each character, and the regex engine has to handle that too when it scans the string.
From POD, perlunicode:
    UTF-8 is a variable-length (1 to 6 bytes, current character allocations require 4 bytes)...
And from bytes:
    As an example, when Perl sees $x = chr(400), it encodes the character in UTF-8 and stores it in $x. Then it is marked as character data, so, for instance, length $x returns 1. However, in the scope of the bytes pragma, $x is treated as a series of bytes - the bytes that make up the UTF8 encoding - and length $x returns 2:
So, this code:
    $x = chr(400);
    print 'Length: ', length $x, qq~\n~;
    {
        use bytes;
        print 'Length (bytes): ', length $x, qq~\n~;
    }
Has the output:
    Length: 1
    Length (bytes): 2
So, to see whether a string 4 times bigger is by itself enough to make the regex 4 times slower, run the same test again with a larger string and compare the result with the tests in this node.
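A rough sketch of such a comparison, using the core Benchmark module. The sample text, the pattern and the helper name count_matches are all made up for illustration, since the original poster's data and regex are not shown here. It times the same regex against a plain byte string, the same string upgraded to Perl's internal UTF-8 form, and a byte string four times as long.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data and pattern (not the original poster's).
my $bytes = join ' ', ('some log line with words and 12345 numbers') x 2000;

my $utf8 = $bytes;
utf8::upgrade($utf8);      # same characters, but stored internally as UTF-8

my $big = $bytes x 4;      # four times as long, still a byte string

my $re = qr/(\w+)\s+(\d+)/;

# Count every match of $re in a copy of the given string.
sub count_matches {
    my ($str) = @_;
    my $n = 0;
    $n++ while $str =~ /$re/g;
    return $n;
}

# Run each case for about one CPU second and print a comparison table.
cmpthese(-1, {
    bytes    => sub { count_matches($bytes) },
    utf8     => sub { count_matches($utf8) },
    bytes_x4 => sub { count_matches($big) },
});
```

If the utf8 case is much slower than bytes_x4, the string size alone is probably not the explanation.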
But note that the regex engine in Perl 5.8.x is much more complex than in Perl 5.6.x, because it has to handle the different encodings that Perl supports. Maybe you should look for a pragma that disables UTF-8 handling in regexes (I haven't found one) rather than try to recompile Perl.
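This is not a pragma, but one thing that might be worth checking is whether the strings being matched carry the internal UTF-8 flag at all, and whether they can be turned back into plain byte strings. A minimal sketch, assuming Perl 5.8.1 or later (for utf8::is_utf8); try_downgrade is a hypothetical helper name, and whether this helps with the slowdown in question is untested here.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string stored as plain bytes
# when that is possible, otherwise return the copy unchanged.
sub try_downgrade {
    my ($str) = @_;
    return $str unless utf8::is_utf8($str);
    # A true second argument makes utf8::downgrade return false instead of
    # dying when the string holds characters above 255.
    utf8::downgrade($str, 1);
    return $str;
}

my $latin = "caf\x{e9}";
utf8::upgrade($latin);                 # force the internal UTF-8 form

my $plain = try_downgrade($latin);     # fits in bytes, so the flag is dropped
my $wide  = try_downgrade(chr(400));   # cannot fit in bytes, flag stays

print 'latin flagged: ', (utf8::is_utf8($latin) ? 'yes' : 'no'), qq~\n~;
print 'plain flagged: ', (utf8::is_utf8($plain) ? 'yes' : 'no'), qq~\n~;
print 'wide flagged:  ', (utf8::is_utf8($wide)  ? 'yes' : 'no'), qq~\n~;
```

Strings that contain only characters up to 255 can be downgraded this way; anything wider, like chr(400), has to stay in UTF-8.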
Graciliano M. P.
"Creativity is the expression of the liberty".
Re: Re: serious regex performance degradation after upgrade to perl 5.8 from 5.6
by Anonymous Monk on Jan 20, 2004 at 23:44 UTC
by Callum (Chaplain) on Jan 21, 2004 at 13:36 UTC