This is something of a continuation of the questions I've asked after bumping into unexpected regex performance issues; the last one was 11155604, I think. This one is also observed with the latest strawberry-perl-5.40.0.1-64bit-PDL, so perhaps it's something new.

I'm trying to improve one of the CPAN modules that deals with PDF. The string below simulates a classic cross-reference table, with the number of entries and the amount of preceding file data roughly the same as in one of the PDF files I use for tests.

Method (1) is similar to the original. I tried (4) first, with the vague idea of not creating useless copies of the data. However, that is when I noticed that, while my other changes (not relevant here) were steady speed/memory gains, everything unexpectedly got very slow. So I concocted the SSCCE below to ask whether this is perhaps a bug in Perl. Also, strangely, the results of (4) vary somewhat from run to run, sometimes being as "fast" as 1.33 s.

(I now think I'll probably continue with (3), after checking whether the global anchor (\G) position is maintained or used elsewhere by the module. The question of whether this is a bug in Perl remains, as an accidental by-product of an otherwise idle investigation.)
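As a rough illustration of continuing with (3) without disturbing a \G position that other code may rely on, here is a minimal sketch. The helper name `parse_xref_entries` and its arguments are hypothetical and not taken from the module; it assumes it is given a reference to the plain string (not to a substr lvalue) and simply saves and restores pos() around the parse.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): parse up to $count
# xref entries from $$sref starting at byte $offset, then put pos()
# back to whatever the caller had before.
sub parse_xref_entries {
    my ( $sref, $offset, $count ) = @_;

    my $saved = pos $$sref;          # may be undef
    pos( $$sref ) = $offset;

    my @entries;
    for ( 1 .. $count ) {
        # /c keeps pos() on failure; we restore it ourselves anyway
        $$sref =~ / \G (\d{10}) \x{20} (\d{5}) \x{20} (\w) \s\s /gcx
            or last;
        push @entries, [ $1 + 0, $2 + 0, $3 ];
    }

    pos( $$sref ) = $saved;          # restore the caller's position
    return \@entries;
}

# Small usage example: 5 bytes of filler, then 2 entries
my $data    = ( '*' x 5 ) . ( "0123456789 01234 n \n" x 2 );
my $entries = parse_xref_entries( \$data, 5, 2 );
printf "%d entries, first offset %d\n",
    scalar @$entries, $entries->[0][0];
```

Whether saving and restoring pos() is needed at all depends on how the rest of the module uses it, which I have not checked yet.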

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';

say $^V;

my $s = '*' x 5_000_000;
$s .= "0123456789 01234 n \n" x 40_000;

my $re = qr/ (\d{10}) \x{20} (\d{5}) \x{20} (\w) \s\s /x;
my ( $xref, $t );

                                # (1) peel off entry by entry
$xref = substr $s, 5_000_000;   # from shorter string
$t = time;
for ( 0 .. 39_999 ) {
    my $entry = substr $xref, $_ * 20, 20;
    die unless $entry =~ / \A $re /x;
    
    # do something useful with captures
    
}
say time - $t;

$xref = substr $s, 5_000_000;   # (2) global match (shorter string)
$t = time;
for ( 0 .. 39_999 ) {
    die unless $xref =~ / \G $re /gx;
}
say time - $t;

                                # (3) global match (original string),
pos( $s ) = 5_000_000;          # start from pos
$t = time;
for ( 0 .. 39_999 ) {
    die unless $s =~ / \G $re /gx;
}
say time - $t;

$xref = \substr $s, 5_000_000;  # (4) use reference to substr
$t = time;
for ( 0 .. 39_999 ) {
    die unless $$xref =~ / \G $re /gx;
}
say time - $t;


__END__

v5.40.0
0.0973920822143555
0.04703688621521
0.0475959777832031
3.08383107185364
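For anyone wanting to see how (4) differs from (2), a small sketch (string sizes shrunk, variable names made up for the illustration): the reference taken in (4) is an LVALUE reference, not a plain SCALAR one, and dereferencing it gives a live view of the original string. One plausible reading, which is my assumption and not something confirmed here, is that the regex engine has to re-fetch that magical value on every match.

```perl
use strict;
use warnings;
use feature 'say';

my $s    = ( '*' x 50 ) . ( "0123456789 01234 n \n" x 3 );
my $lref = \substr $s, 50;      # as in (4): reference to a substr lvalue
my $copy = substr $s, 50;       # as in (2): an ordinary copy

my $lref_type = ref $lref;      # type of the thing (4) dereferences
my $copy_type = ref \$copy;     # type of a reference to a plain scalar
say "(4) holds a $lref_type reference, (2) a $copy_type";

# The lvalue is a live view: a same-length change to $s shows through it,
# while the plain copy is unaffected.
substr( $s, 50, 1 ) = '9';
my $view_first = substr $$lref, 0, 1;
my $copy_first = substr $copy,  0, 1;
say "view starts with $view_first, copy starts with $copy_first";
```

If the per-match cost really comes from the lvalue magic, copying `$$xref` once into a plain scalar before the loop brings (4) back to the behaviour of (2), at the price of the copy I was trying to avoid.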

In reply to Regex match is very slow against deref'd reference to substr/lvalue, is it normal? by Anonymous Monk
