string pattern match, limited to first 1000 characters?

ManFromNeptune has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: string pattern match, limited to first 1000 characters? by GrandFather (Saint) on Jun 23, 2007 at 04:21 UTC
`(substr $str, 0, 1000) =~ /html/i;`

DWIM is Perl's answer to Gödel
Re: string pattern match, limited to first 1000 characters? by jwkrahn (Abbot) on Jun 23, 2007 at 03:50 UTC
`$string =~ /\A.{0,996}?html/si;`
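The `996` in the pattern above is 1000 minus the length of `html`: the lazy `.{0,996}?` may skip at most 996 characters, so the whole match has to end within the first 1000. Below is a minimal sketch that derives that bound from the needle length and compares it with the `substr` form at the boundary. The helper names `in_prefix_regex` and `in_prefix_substr` and the `$limit` parameter are illustrative assumptions, not something posted in the thread.

```perl
use strict;
use warnings;

# Hypothetical helpers (names are not from the thread).
# Each returns 1 if $needle occurs, case-insensitively, entirely within
# the first $limit characters of $str, and 0 otherwise.
sub in_prefix_regex {
    my ($str, $needle, $limit) = @_;
    my $max_skip = $limit - length $needle;   # 1000 - 4 = 996 for 'html'
    return 0 if $max_skip < 0;                # needle cannot fit at all
    return $str =~ /\A.{0,$max_skip}?\Q$needle\E/si ? 1 : 0;
}

sub in_prefix_substr {
    my ($str, $needle, $limit) = @_;
    return substr($str, 0, $limit) =~ /\Q$needle\E/i ? 1 : 0;
}

# 'hTmL' ending exactly at character 1000, and one character later.
my $inside  = ('x' x 996) . 'hTmL' . ('x' x 50);
my $outside = ('x' x 997) . 'hTmL' . ('x' x 50);

for my $s ($inside, $outside) {
    printf "regex=%d substr=%d\n",
        in_prefix_regex($s, 'html', 1000),
        in_prefix_substr($s, 'html', 1000);
}
# prints:
# regex=1 substr=1
# regex=0 substr=0
```

Both forms agree at the boundary, so the choice between them comes down to readability and speed rather than correctness.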
Re: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 23, 2007 at 09:23 UTC
> Basically I'm looking for a better method than creating a smaller test string with substr(0, 1000).

That looks like micro-optimization. Taking jwkrahn's and GrandFather's propositions:

    #!/usr/bin/perl
    use Benchmark qw(cmpthese);
    $substr = join('', 'a'..'j');
    $str = $substr x 90 . 'hTmL' . $substr x 2000;
    print "length of search string: ", length $str, "\n";
    cmpthese(-3, {
        # \A anchors at the beginning of a string, so no ^. See below (demerphq)
        #' regex ' => sub { $str =~ /^\A.{0,996}?html/si; },
        ' regex '  => sub { $str =~ /\A.{0,996}?html/si; },
        ' substr'  => sub { (substr $str, 0, 1000) =~ /html/i; },
        #'!regex ' => sub { $str =~ /^\A.{0,996}?sgml/si; },
        '!regex '  => sub { $str =~ /\A.{0,996}?sgml/si; },
        '!substr'  => sub { (substr $str, 0, 1000) =~ /sgml/i; },
    });
    __END__
    length of search string: 20904
                 Rate   regex  substr !substr  !regex
     regex    30996/s      --    -65%    -72%    -79%
     substr   88073/s    184%      --    -21%    -42%
    !substr  111975/s    261%     27%      --    -26%
    !regex   150866/s    387%     71%     35%      --

What can we deduce from that? Not much. The efficiency of either method seems to depend on whether the searched pattern is contained in the string. The results may vary with the position of the pattern in the string.

More important, even the "slowest" of these searches performs at a rate of ~31000/second. How many searches do you have to do in what time? In what context? How does the rest of your code perform?

--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
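One way to probe the remark that results may vary with the position of the pattern is to benchmark the same two searches with `hTmL` placed at several offsets. This is only a sketch: the helper `make_str`, the offsets 0, 450 and 900, and the fixed iteration count are assumptions chosen for illustration, and no timings are claimed here because they depend on the machine and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $chunk = join '', 'a' .. 'j';

# Hypothetical helper: put 'hTmL' after $n copies of the 10-character chunk.
sub make_str {
    my ($n) = @_;
    return $chunk x $n . 'hTmL' . $chunk x 2000;
}

my %tests;
for my $n (0, 45, 90) {
    my $str = make_str($n);      # each closure below captures its own $str
    my $off = $n * 10;           # offset of 'hTmL' in the string
    $tests{"regex\@$off"}  = sub { $str =~ /\A.{0,996}?html/si };
    $tests{"substr\@$off"} = sub { substr($str, 0, 1000) =~ /html/i };
}

# A fixed iteration count keeps the run short; Benchmark may warn that
# there are too few iterations for the fastest entries.
cmpthese(20_000, \%tests);
```

For steadier numbers, a negative first argument such as `-3` (run each entry for at least three CPU seconds, as in the benchmark above) can replace the fixed count.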
Re^2: string pattern match, limited to first 1000 characters? by BrowserUk (Patriarch) on Jun 23, 2007 at 09:50 UTC
What do you make of this benchmark? :)

    #!/usr/bin/perl
    use Benchmark qw(cmpthese);
    $substr = join('', 'a'..'j');
    $str = $substr x 90 . 'hTmL' . $substr x 2000;
    print "length of search string: ", length $str, "\n";
    cmpthese(-3, {
        ' regex ' => sub {
            $str =~ /^\A.{0,996}?html/si or $str =~ /^\A.{0,996}?sgml/si;
        },
        ' substr' => sub {
            substr( $str, 0, 1000) =~ /html/i or substr( $str, 0, 1000) =~ /sgml/i;
        },
        'index' => sub {
            1+index( lc substr( $str, 0, 1000 ), 'html' )
                or 1+index( lc substr( $str, 0, 1000 ), 'sgml' )
        },
    });
    print substr $str, 900, 10;
    __END__
    C:\test>junk
    length of search string: 20904
                Rate  regex  substr   index
     regex   93505/s     --    -54%    -65%
     substr 204367/s   119%      --    -23%
    index   266809/s   185%     31%      --
    hTmLabcdef

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^3: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 23, 2007 at 10:27 UTC
The conclusions are obvious, aren't they?

- don't use a regexp when all you need is index
- don't use a regexp with a quantifier pattern if you can substr
- simple tools are fastest for simple tasks

Did I miss some?

Nice that you combined the positive and negative searches into one, so one can see the average of both.

--shmem
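The first conclusion can be sketched as a small helper built on `index` over a lowercased prefix, mirroring the `index` entry in BrowserUk's benchmark. The name `prefix_has` and its `$limit` parameter are hypothetical. Note the assumption that `lc` on both sides is an adequate stand-in for `/i`, which holds for ASCII needles such as `html`.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive literal search limited to the
# first $limit characters, using index instead of the regex engine.
sub prefix_has {
    my ($str, $needle, $limit) = @_;
    # index returns -1 when the needle is absent, so compare explicitly.
    return index(lc substr($str, 0, $limit), lc $needle) >= 0 ? 1 : 0;
}

# Same test string as in the benchmarks: 'hTmL' starts at offset 900.
my $str = ('abcdefghij' x 90) . 'hTmL' . ('abcdefghij' x 2000);

print prefix_has($str, 'html', 1000) ? "found\n" : "not found\n";   # found
print prefix_has($str, 'sgml', 1000) ? "found\n" : "not found\n";   # not found
```

Comparing against `-1` explicitly avoids the `1+index(...)` trick from the benchmark, at the cost of a few more characters.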
Re^3: string pattern match, limited to first 1000 characters? by roboticus (Chancellor) on Jun 23, 2007 at 12:34 UTC
BrowserUk:

Ummm ... you want to reverse those `or` clauses, otherwise the second clause in your `or` won't run.

    #!/usr/bin/perl
    use Benchmark qw(cmpthese);
    $substr = join('', 'a'..'j');
    $str = $substr x 90 . 'hTmL' . $substr x 2000;
    print "length of search string: ", length $str, "\n";
    cmpthese(-3, {
        'Fregex ' => sub {
            $str =~ /^\A.{0,996}?html/si or $str =~ /^\A.{0,996}?sgml/si;
        },
        'Fsubstr' => sub {
            substr( $str, 0, 1000) =~ /sgml/i or substr( $str, 0, 1000) =~ /html/i;
        },
        'Findex ' => sub {
            1+index( lc substr( $str, 0, 1000 ), 'sgml' )
                or 1+index( lc substr( $str, 0, 1000 ), 'html' )
        },
        'Rregex ' => sub {
            $str =~ /^\A.{0,996}?sgml/si or $str =~ /^\A.{0,996}?html/si;
        },
        'Rsubstr' => sub {
            substr( $str, 0, 1000) =~ /sgml/i or substr( $str, 0, 1000) =~ /html/i;
        },
        'Rindex' => sub {
            1+index( lc substr( $str, 0, 1000 ), 'sgml' )
                or 1+index( lc substr( $str, 0, 1000 ), 'html' )
        },
    });
    print substr $str, 900, 10;
    __END__
    root@swill ~/PerlMonks$ ./string_search2.pl
    length of search string: 20904
                Rate  Rregex  Fregex Rsubstr Fsubstr Findex   Rindex
    Rregex   48562/s      --    -35%    -54%    -54%    -55%    -55%
    Fregex   75225/s     55%      --    -28%    -29%    -30%    -30%
    Rsubstr 104700/s    116%     39%      --     -1%     -2%     -3%
    Fsubstr 105896/s    118%     41%      1%      --     -1%     -1%
    Findex  107056/s    120%     42%      2%      1%      --     -0%
    Rindex  107434/s    121%     43%      3%      1%      0%      --
    hTmLabcdef
    root@swill ~/PerlMonks$

Update: Had I used my brain, I'd've changed the `$str` definition to use 'sGmL' rather than edit the function definitions....

...roboticus

There are lies, damned lies, and benchmarks.
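The short-circuit behaviour behind this objection, and the alternative mentioned in the update (putting 'sGmL' into the test string), can be made visible by counting how often each clause runs. This is a sketch only: the helper `clause_runs` and its counters are hypothetical instrumentation, not code from the thread.

```perl
use strict;
use warnings;

my $chunk = join '', 'a' .. 'j';

# Hypothetical instrumentation: run "html-test or sgml-test" once and
# report how many times each clause was actually evaluated.
sub clause_runs {
    my ($str) = @_;
    my %runs = (html => 0, sgml => 0);
    my $probe = sub {
        my ($needle) = @_;
        $runs{$needle}++;
        return substr($str, 0, 1000) =~ /\Q$needle\E/i;
    };
    $probe->('html') or $probe->('sgml');   # 'or' stops at the first true clause
    return "html=$runs{html} sgml=$runs{sgml}";
}

# Test string from the thread: the first clause succeeds, the second never runs.
my $html_str = $chunk x 90 . 'hTmL' . $chunk x 2000;
print "hTmL string: ", clause_runs($html_str), "\n";   # hTmL string: html=1 sgml=0

# Variant from the update: with 'sGmL' the first clause fails, so both run.
my $sgml_str = $chunk x 90 . 'sGmL' . $chunk x 2000;
print "sGmL string: ", clause_runs($sgml_str), "\n";   # sGmL string: html=1 sgml=1
```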
Re^4: string pattern match, limited to first 1000 characters? by johngg (Canon) on Jun 23, 2007 at 17:40 UTC
Re^5: string pattern match, limited to first 1000 characters? by roboticus (Chancellor) on Jun 24, 2007 at 14:04 UTC
Re^2: string pattern match, limited to first 1000 characters? by demerphq (Chancellor) on Jun 23, 2007 at 13:09 UTC
I don't understand why you did `^\A`. Is there a reason?

---
$world=~s/war/peace/g
Re^3: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 23, 2007 at 15:23 UTC
Other than ignorance, no.

--shmem
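For context on why `^\A` is redundant: without the `/m` modifier, `^` and `\A` both match only at the very start of the string, and the patterns in this thread use `/s` and `/i` but not `/m`. The two anchors differ only under `/m`, as the small sketch below shows. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# 'html' appears only at the start of the second line.
my $text = "first line\nhtml on the second line\n";

# Without /m, ^ and \A both match only at the very start of the string.
my $caret   = $text =~ /^html/   ? 1 : 0;
my $big_a   = $text =~ /\Ahtml/  ? 1 : 0;

# With /m, ^ also matches after each embedded newline; \A still does not.
my $caret_m = $text =~ /^html/m  ? 1 : 0;
my $big_a_m = $text =~ /\Ahtml/m ? 1 : 0;

print "^=$caret \\A=$big_a ^/m=$caret_m \\A/m=$big_a_m\n";
# prints: ^=0 \A=0 ^/m=1 \A/m=0
```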
Re^4: string pattern match, limited to first 1000 characters? by ManFromNeptune (Scribe) on Jun 24, 2007 at 04:40 UTC
Re^5: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 24, 2007 at 10:04 UTC
Some notes below your chosen depth have not been shown here