I may be wrong, but on Perl 5.8 a UTF-8 string can take up to 4 bytes per character, and the regex engine has to handle that when it scans the string.
From POD, perlunicode:

    UTF-8 is a variable-length (1 to 6 bytes, current character allocations require 4 bytes)...

And from bytes:

    As an example, when Perl sees $x = chr(400), it encodes the character in UTF-8 and stores it in $x. Then it is marked as character data, so, for instance, length $x returns 1. However, in the scope of the bytes pragma, $x is treated as a series of bytes - the bytes that make up the UTF8 encoding - and length $x returns 2:

So, this code:

    $x = chr(400);
    print 'Length: ', length $x, qq~\n~;
    {
        use bytes;
        print 'Length (bytes): ', length $x, qq~\n~;
    }

Has the output:

    Length: 1
    Length (bytes): 2
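As a small extension of the example above, here is a sketch that prints the internal byte count for a few code points, to show that the size varies per character. The helper name `byte_length` and the chosen code points are only illustrative assumptions, not anything from perlunicode or the bytes documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes Perl stores internally for a string.
sub byte_length {
    my ($str) = @_;
    use bytes;    # lexically scoped, so it only affects the rest of this sub
    return length $str;
}

# Illustrative code points: ASCII, the chr(400) from the example,
# the euro sign, and one character outside the Basic Multilingual Plane.
for my $code (0x41, 400, 0x20AC, 0x1F600) {
    my $char = chr $code;    # built outside the scope of "use bytes"
    printf "U+%04X: %d character, %d byte(s)\n",
        $code, length($char), byte_length($char);
}
```

On a current Perl this should report 1, 2, 3 and 4 bytes respectively, each for a single character, so a string only grows by a factor of 4 if every character is one of the widest ones.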
So, to see whether a string that is merely 4 times bigger can make the regex 4 times slower, run the same test with a bigger string and compare the result with the tests in this node.
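A rough sketch of such a comparison, using the core Benchmark module. The test text and the pattern here are made up for illustration (the data from the original node is not reproduced), and `count_matches` is a hypothetical helper. The third case upgrades the same text to Perl's internal UTF-8 form, which keeps the size the same for ASCII data, so it separates the effect of the UTF-8 handling from the effect of the string size.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data, not the data from the original node.
my $base = join ' ', ('lorem ipsum dolor sit amet') x 2_000;
my $big  = join ' ', ($base) x 4;    # same content, four times longer
my $wide = $base;
utf8::upgrade($wide);                # same characters, UTF-8 flag turned on

my $re = qr/(\w+)\s+amet/;           # hypothetical pattern

# Count every match of the pattern in the given string.
sub count_matches {
    my ($str) = @_;
    my $n = 0;
    $n++ while $str =~ /$re/g;
    return $n;
}

# Run each case for at least one CPU second and print a comparison table.
cmpthese(-1, {
    bytes_1x => sub { count_matches($base) },
    bytes_4x => sub { count_matches($big) },
    utf8_1x  => sub { count_matches($wide) },
});
```

If `bytes_4x` is about 4 times slower than `bytes_1x` while `utf8_1x` is much slower than both, the string size alone probably does not explain the slowdown. The numbers will depend on the Perl version and the real data, so this is only a starting point.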
But note that the regex engine in Perl 5.8.x is much more complex than the one in Perl 5.6.x, because it has to handle the different encodings that Perl supports. Rather than trying to recompile Perl, you may want to look for a pragma that disables UTF-8 handling in regexes (I haven't found one).
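This is not the pragma mentioned above, only a sketch of how one might check whether the data being matched is stored as UTF-8 at all, using the built-in `utf8::is_utf8`, `utf8::upgrade` and `utf8::downgrade` functions. The `describe` helper and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether Perl's internal UTF-8 flag is on.
sub describe {
    my ($label, $str) = @_;
    return sprintf '%s: utf8 flag %s',
        $label, utf8::is_utf8($str) ? 'on' : 'off';
}

my $plain = 'plain ascii text';    # illustrative data
my $wide  = $plain;
utf8::upgrade($wide);              # same characters, stored as UTF-8

print describe('plain',    $plain), "\n";
print describe('upgraded', $wide),  "\n";

# Try to convert back to a byte string. The second argument (FAIL_OK)
# makes the call return false instead of dying if the string contains
# characters that do not fit in a single byte.
my $ok = utf8::downgrade($wide, 1);
print describe('downgraded', $wide), "\n" if $ok;
```

If the strings in the slow test report the flag as on, the UTF-8 handling is at least involved. Whether downgrading the data is acceptable depends on what the data really contains.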
Graciliano M. P.
"Creativity is the expression of the liberty".
In reply to Re: serious regex performance degradation after upgrade to perl 5.8 from 5.6
by gmpassos
in thread serious regex performance degradation after upgrade to perl 5.8 from 5.6
by dmandel