Re: fast greedy regex

Curiouser and curiouser. Changing the tests changes the relative performance of the methods, but it also slows all of them down.

I thought that using globals instead of lexicals might have been part of the difference, and it is, but only a small part.

#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];

our $TEST ||= 0;
our $N = $TEST ? 10 : $N || 1000;
our @data = map{ join' ', '2004-05-13', '14:02:00', ('blah') x (1+rand( 9 )) } 1 .. $N;

our (@greedy, @explicit, @unpack);

cmpthese( $TEST ? 1 : -1, {
    our_g   => '@greedy   = map {/(^\S*)\s(\S*)\s(.*$)/}        @data',
    our_e   => '@explicit = map {/(^\d{4}\-\d{2}\-\d{2})\s
                                   (\d{2}:\d{2}:\d{2})\s(.*$)/x} @data',
    our_u   => '@unpack   = map {unpack "A10 x A8 x A*" => $_}  @data',

    my_g    => 'my @greedy   = map {/(^\S*)\s(\S*)\s(.*$)/}        @data',
    my_e    => 'my @explicit = map {/(^\d{4}\-\d{2}\-\d{2})\s
                                   (\d{2}:\d{2}:\d{2})\s(.*$)/x} @data',
    my_u    => 'my @unpack   = map {unpack "A10 x A8 x A*" => $_}  @data',

    greedy => q[
        my( $date, $time, $text );
        
        m[(^\S*)\s(\S*)\s(.*$)]
            and ( $date, $time, $text ) = ( $1, $2, $3 )
#            and $TEST and print "greedy: $date|$time|$text"
            for @data;
    ],
    explicit => q[
        my( $date, $time, $text );
        
        m[(^\d{4}\-\d{2}\-\d{2})\s(\d{2}:\d{2}:\d{2})\s(.*$)]
            and ( $date, $time, $text ) = ( $1, $2, $3 )
#            and $TEST and print "explicit: $date|$time|$text"
            for @data;
    ],
    unpackA => q[
        use bytes;
        my( $date, $time, $text );

        ( $date, $time, $text ) = unpack 'A10 x A8 x A*', $_
#            and $TEST and print "unpackA: $date|$time|$text"
            for @data;
    ],
    substr => q[
        use bytes;
        my( $date, $time, $text );

        ( $date, $time, $text ) = 
            ( 
                substr( $_, 0, 10 ),
                substr( $_, 11, 8 ),
                substr( $_, 20 )
            )
#            and $TEST and print "substr: $date|$time|$text"
            for @data;
    ],
});
    
__END__
P:\test>362106
           Rate our_e our_g our_u  my_e my_g my_u unpackA substr explicit greedy
our_e    72.4/s    --   -2%  -15%  -28% -30% -43%    -55%   -73%     -77%   -79%
our_g    73.6/s    2%    --  -13%  -27% -29% -42%    -54%   -73%     -77%   -79%
our_u    85.0/s   17%   15%    --  -16% -18% -33%    -47%   -69%     -73%   -75%

my_e      101/s   39%   37%   19%    --  -3% -20%    -37%   -63%     -68%   -71%
my_g      104/s   43%   41%   22%    3%   -- -18%    -35%   -62%     -67%   -70%
my_u      126/s   74%   71%   48%   25%  21%   --    -21%   -53%     -60%   -64%

unpackA   160/s  121%  117%   88%   59%  54%  27%      --   -41%     -49%   -54%
substr    270/s  273%  267%  218%  168% 160% 114%     69%     --     -14%   -22%
explicit  314/s  334%  327%  270%  212% 203% 149%     96%    16%       --    -9%
greedy    346/s  378%  370%  307%  243% 234% 175%    116%    28%      10%     --

It would be interesting to see the benchmark run on 5.6.2 (pre-unicodification), which I don't have installed currently.
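
Before reading too much into the rates, it is worth confirming that the four extraction methods really do return the same fields. The following is only a minimal sketch: the sample line is a hypothetical one in the same layout as the benchmark's @data, and the variable names are mine, not part of the benchmark.

```perl
#! perl -w
use strict;

# Hypothetical sample line, same layout as the benchmark's @data.
my $line = '2004-05-13 14:02:00 blah blah blah';

# The same four extraction techniques the benchmark times.
my @greedy   = $line =~ /(^\S*)\s(\S*)\s(.*$)/;
my @explicit = $line =~ /(^\d{4}\-\d{2}\-\d{2})\s(\d{2}:\d{2}:\d{2})\s(.*$)/;
my @unpack   = unpack 'A10 x A8 x A*', $line;
my @substr   = (
    substr( $line, 0, 10 ),
    substr( $line, 11, 8 ),
    substr( $line, 20 ),
);

# Print each result as date|time|text so they can be compared by eye.
for my $result ( \@greedy, \@explicit, \@unpack, \@substr ) {
    print join( '|', @$result ), "\n";
}
```

One caveat: the A format in unpack strips trailing spaces, so on lines that end in spaces it would not return exactly what substr does. The generated test data has no trailing spaces, so that should not affect the comparison here.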

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail


          Rate our_e our_g  my_e explicit  my_g greedy our_u unpackA my_u substr
our_e    126/s    --   -9%  -20%     -25%  -28%   -33%  -37%    -49% -54%   -56%
our_g    139/s   10%    --  -11%     -17%  -21%   -26%  -30%    -44% -49%   -52%
my_e     158/s   25%   13%    --      -6%  -10%   -17%  -21%    -36% -42%   -45%
explicit 168/s   33%   20%    6%       --   -5%   -12%  -16%    -32% -38%   -42%
my_g     175/s   39%   26%   11%       5%    --    -7%  -12%    -29% -36%   -39%
greedy   189/s   50%   36%   20%      13%    8%     --   -5%    -24% -30%   -34%
our_u    199/s   58%   43%   26%      19%   13%     5%    --    -20% -27%   -31%
unpackA  248/s   96%   78%   57%      48%   41%    31%   25%      --  -9%   -14%
my_u     272/s  116%   95%   73%      63%   55%    44%   37%     10%   --    -5%
substr   288/s  128%  106%   83%      72%   64%    52%   45%     16%   6%     --

           Rate our_e our_g  my_e our_u my_g explicit unpackA greedy my_u substr
our_e    61.0/s    --   -9%  -25%  -29% -33%     -41%    -48%   -48% -52%   -67%
our_g    67.3/s   10%    --  -17%  -21% -27%     -35%    -42%   -43% -47%   -64%
my_e     80.7/s   32%   20%    --   -5% -12%     -22%    -31%   -32% -36%   -57%
our_u    85.3/s   40%   27%    6%    --  -7%     -18%    -27%   -28% -32%   -54%
my_g     91.6/s   50%   36%   13%    7%   --     -12%    -21%   -23% -28%   -51%
explicit  104/s   70%   54%   28%   22%  13%       --    -11%   -12% -18%   -44%
unpackA   116/s   91%   73%   44%   36%  27%      12%      --    -2%  -8%   -37%
greedy    118/s   94%   76%   47%   39%  29%      14%      2%     --  -6%   -36%
my_u      126/s  107%   88%   57%   48%  38%      22%      9%     7%   --   -32%
substr    186/s  205%  176%  130%  118% 103%      79%     60%    57%  47%     --

Boris


Thanks for running the benchmarks :)

That pretty much reflects what I've thought for a while, despite arguments to the contrary.

There is a substantial penalty to unicode support for many string operations, even when the strings involved do not, have not and could not contain unicode.

I wish it were possible to reliably, manually 'turn off' all unicode processing and conditional testing using a pragma (say no utf8; or use bytes;) and recover the 5.6.x performance.
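
For anyone unsure what use bytes; does today, here is a minimal sketch. It assumes perl 5.8.1 or later, where utf8::is_utf8 is available; the sample strings and variable names are hypothetical. It only shows the change in semantics (byte-wise instead of character-wise operations within a lexical scope) and says nothing about speed.

```perl
#! perl -w
use strict;

# Hypothetical sample line, same layout as the benchmark's @data.
my $line = '2004-05-13 14:02:00 blah blah';

# An ASCII-only literal in a source file without `use utf8` does not
# carry perl's internal UTF-8 flag.
print 'utf8 flag: ', ( utf8::is_utf8( $line ) ? 'on' : 'off' ), "\n";

# A string holding a code point above 255 has to be stored as UTF-8 internally.
my $wide = "caf\x{e9} \x{263A}";

# Character semantics: length counts characters.
my $chars = length $wide;

# Byte semantics: `use bytes` is lexically scoped to this block, and
# length counts the bytes of the internal representation instead.
my $bytes = do { use bytes; length $wide };

print "characters: $chars, bytes: $bytes\n";
```

The flag check on the first string is the kind of per-string state the conditional testing has to consult; use bytes; changes how operations interpret a string, not whether that state exists.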




	PerlMonks