I know it looks trivial once you see it, but I'm really astonished by your approach of using 1+index(...). It had not occurred to me to use index that way in an expression to check for presence: index returns -1 when the substring is absent, so adding 1 gives a false value in exactly that case. I'll add that to my set of idiosyncratic phrases, just like if( system(...) == 0 ) { for successful execution of subprocesses.
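A minimal sketch of why the idiom works. The helper name contains() is made up for illustration and is not part of the benchmark:

```perl
use strict;
use warnings;

my $s = 'the quick brown fox jumps over the lazy dog';

# index() returns the zero-based offset of the first occurrence,
# or -1 when the substring is absent. Adding 1 maps "absent" to 0
# (false) and every real offset, including offset 0, to a true value.
# contains() is a hypothetical helper, used only for this illustration.
sub contains {
    my ($haystack, $needle) = @_;
    return ( 1 + index( $haystack, $needle ) ) ? 1 : 0;
}

print "lazy: ", contains( $s, 'lazy' ), "\n";   # found in the middle of the string
print "the:  ", contains( $s, 'the' ),  "\n";   # found at offset 0, still true
print "cat:  ", contains( $s, 'cat' ),  "\n";   # absent, so 1 + -1 == 0
```

The offset-0 case is the reason a bare index(...) cannot be used as the condition: a match at the very start of the string would count as false.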

Update: I wondered how much the capturing parentheses cost, and it seems they cost roughly ~~a third~~ half of the performance attainable with the regex engine: the capturing variant (a) runs at about half the rate of the non-capturing variant (c). Maybe the two additional steps executed in the regex engine (OPEN1 and CLOSE1) are to blame for that, as they effectively double the number of steps the regex engine has to execute for a successful match.
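For completeness, the matched text can also be read without capturing parentheses and without $&, by using the @- and @+ offset arrays. This is only a sketch. It is not one of the benchmarked variants, so no timing claim is made for it:

```perl
use strict;
use warnings;

my $s = 'the quick brown fox jumps over the lazy dog';
my $found;

# @- and @+ hold the start and end offsets of the last successful match.
# Element 0 describes the whole match, so no capture group is needed.
if ( $s =~ m[lazy] ) {
    $found = substr( $s, $-[0], $+[0] - $-[0] );
}

print "found: $found\n";
```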

Not invoking the regex engine is still much faster, even though I had thought there once was an optimization that turned constant regular expressions without anchors or quantifiers into an index lookup...

# a:  if( $s =~ m[(lazy)] ){ $found=$1 }
Compiling REx "(lazy)"
Final program:
   1: OPEN1 (3)
   3:   EXACT <lazy> (5)
   5: CLOSE1 (7)
   7: END (0)
anchored "lazy" at 0 (checking anchored) minlen 4
Matching REx "(lazy)" against "the quick brown fox jumps over the lazy dog"
Intuit: trying to determine minimum start position...
  Found anchored substr "lazy" at offset 35...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 35
  35 < the > <lazy dog>      |  1:OPEN1(3)
  35 < the > <lazy dog>      |  3:EXACT <lazy>(5)
  39 <the lazy> < dog>       |  5:CLOSE1(7)
  39 <the lazy> < dog>       |  7:END(0)
Match successful!
Freeing REx: "(lazy)"
# b:  $found = 'lazy' if 1+index( $s, 'lazy' );
# c:  if( $s =~ m[lazy] ){ $found=$& }
Compiling REx "lazy"
Final program:
   1: EXACT <lazy> (3)
   3: END (0)
anchored "lazy" at 0 (checking anchored isall) minlen 4
Matching REx "lazy" against "the quick brown fox jumps over the lazy dog"
Intuit: trying to determine minimum start position...
  Found anchored substr "lazy" at offset 35...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 35
Freeing REx: "lazy"
       Rate    a    c    b
a 2038631/s   -- -50% -75%
c 4089154/s 101%   -- -49%
b 8013601/s 293%  96%   --

The program I used:

use strict;
use Benchmark 'cmpthese';
use vars '$s';
$s='the quick brown fox jumps over the lazy dog'; 
my $found;

my %benchmarks = (
    a => q[ if( $s =~ m[(lazy)] ){ $found=$1 } ],
    b => q[ $found = 'lazy' if 1+index( $s, 'lazy' ); ],
    c => q[ if( $s =~ m[lazy] ){ $found=$& } ],
);

{
    use re 'debug';
    for (sort keys %benchmarks) {
        print "# $_: $benchmarks{$_}\n";
        undef $found;
        my $code = eval qq{sub { $benchmarks{$_} } }
            or die "Couldn't compile benchmark $_: $@";
        $code->();
        $found eq 'lazy'
            or die "Unexpected results: [$found] vs. 'lazy'";
    };
};

cmpthese( -1, \%benchmarks);

In reply to Re^3: Get a known substring from a string by Corion
in thread Get a known substring from a string by jake7176
