Re: string pattern match, limited to first 1000 characters?

Basically I'm looking for a better method than creating a smaller test string with substr(0, 1000).

That looks like micro-optimization. Taking jwkrahn's and GrandFather's propositions:

#!/usr/bin/perl

use Benchmark qw(cmpthese);

$substr = join('','a'..'j');
$str = $substr x 90 . 'hTmL'. $substr x 2000;
print "length of search string: ",length $str, "\n";
cmpthese(-3, {
    # \A anchors at the beginning of the string, so no ^ is needed. See below (demerphq)
    #' regex ' => sub { $str =~ /^\A.{0,996}?html/si; },
    ' regex ' => sub { $str =~ /\A.{0,996}?html/si; },
    ' substr' => sub { (substr $str, 0, 1000) =~ /html/i; },
    #'!regex ' => sub { $str =~ /^\A.{0,996}?sgml/si; },
    '!regex ' => sub { $str =~ /\A.{0,996}?sgml/si; },
    '!substr' => sub { (substr $str, 0, 1000) =~ /sgml/i; },
});
__END__
length of search string: 20904
            Rate  regex   substr !substr !regex 
 regex   30996/s      --    -65%    -72%    -79%
 substr  88073/s    184%      --    -21%    -42%
!substr 111975/s    261%     27%      --    -26%
!regex  150866/s    387%     71%     35%      --

What can we deduce from that? Not much. The efficiency of either method seems to depend on whether the searched pattern is contained in the string. The results may also vary with the position of the pattern in the string.

More important, even the "slowest" of these searches performs at a rate of ~31000/second. How many searches do you have to do in what time? In what context? How does the rest of your code perform?
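To see how much the position of the pattern matters, the benchmark above could be repeated with the needle at different offsets. What follows is only a sketch under a few assumptions: the helper names (`make_haystack`, `regex_hit`, `substr_hit`, `bench_position`), the chosen offsets, and the `--bench` switch are made up for illustration and come from nobody's post. It first checks that both candidates agree at every offset, including the boundary cases 996 (the last offset where the needle still fits into the first 1000 characters) and 997 (the first where it does not).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $LIMIT  = 1000;                  # only this many leading characters count
my $filler = join '', 'a' .. 'j';   # same 10-character filler as in the post

# Build a 21000-character string with 'hTmL' at $offset (undef: no needle).
sub make_haystack {
    my ($offset) = @_;
    my $str = $filler x 2100;
    substr($str, $offset, 4) = 'hTmL' if defined $offset;
    return $str;
}

# The two candidates from the benchmark above, wrapped as predicates.
sub regex_hit  { return $_[0] =~ /\A.{0,996}?html/si ? 1 : 0 }
sub substr_hit { return substr($_[0], 0, $LIMIT) =~ /html/i ? 1 : 0 }

# Time both candidates for one needle position; prints a cmpthese table.
sub bench_position {
    my ($offset) = @_;
    my $str = make_haystack($offset);
    printf "needle at: %s\n", defined $offset ? $offset : 'absent';
    cmpthese(-1, {
        regex  => sub { regex_hit($str) },
        substr => sub { substr_hit($str) },
    });
}

# Start, middle, last offset that fits, first that does not, no needle.
my @offsets = (0, 500, 996, 997, undef);

# Sanity check first: both candidates must agree at every offset.
for my $offset (@offsets) {
    my $str = make_haystack($offset);
    die "candidates disagree\n" unless regex_hit($str) == substr_hit($str);
}

# Timings only on request (the --bench switch is made up for this sketch).
if (@ARGV && $ARGV[0] eq '--bench') {
    bench_position($_) for @offsets;
}
```

Run with `--bench`, it prints one table per offset. The extra sub call adds the same overhead to both candidates, so only the relative numbers are worth looking at.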

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                              /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}


Re^2: string pattern match, limited to first 1000 characters? by BrowserUk (Patriarch) on Jun 23, 2007 at 09:50 UTC
What do you make of this benchmark? :)

#!/usr/bin/perl

use Benchmark qw(cmpthese);

$substr = join('','a'..'j');
$str = $substr x 90 . 'hTmL'. $substr x 2000;
print "length of search string: ",length $str, "\n";
cmpthese(-3, {
    ' regex ' => sub {
        $str =~ /^\A.{0,996}?html/si or $str =~ /^\A.{0,996}?sgml/si;
    },
    ' substr' => sub {
        substr( $str, 0, 1000) =~ /html/i or substr( $str, 0, 1000) =~ /sgml/i;
    },
    'index' => sub {
        1+index( lc substr( $str, 0, 1000 ), 'html' )
            or 1+index( lc substr( $str, 0, 1000 ), 'sgml' )
    },
});

print substr $str, 900, 10;;
__END__
C:\test>junk
length of search string: 20904
            Rate  regex   substr   index
 regex    93505/s      --    -54%    -65%
 substr  204367/s    119%      --    -23%
index    266809/s    185%     31%      --
hTmLabcdef

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^3: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 23, 2007 at 10:27 UTC
The conclusions are obvious, aren't they?

- don't use a regexp when all you need is index
- don't use a regexp with a quantifier pattern if you can substr
- simple tools are fastest for simple tasks

Did I miss some?

Nice that you combined the positive and negative searches into one, so one can see the average of both.

--shmem
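As an illustration of the first point, here is a minimal sketch of an index-based check. The helper name `prefix_contains` and its argument order are made up for this example. Unlike the benchmark code, it takes the prefix and lower-cases it only once, however many needles are tested.

```perl
use strict;
use warnings;

# Return 1 if any of @needles occurs (case-insensitively) within the
# first $limit characters of $str, else 0.
# Assumption: the needles are passed in lower case.
sub prefix_contains {
    my ($str, $limit, @needles) = @_;
    my $head = lc substr($str, 0, $limit);   # copy and fold the prefix once
    for my $needle (@needles) {
        return 1 if index($head, $needle) >= 0;
    }
    return 0;
}

# Same test string as in the benchmarks: 'hTmL' sits at offset 900.
my $str = ('abcdefghij' x 90) . 'hTmL' . ('abcdefghij' x 2000);
print prefix_contains($str, 1000, 'html', 'sgml') ? "found\n" : "not found\n";
```

With the benchmark string this prints "found". Whether the single lc pays off for real data would have to be measured, of course.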
Re^3: string pattern match, limited to first 1000 characters? by roboticus (Chancellor) on Jun 23, 2007 at 12:34 UTC
BrowserUk:

Ummm ... you want to reverse those `or` clauses, otherwise the second clause in your `or` won't run.

#!/usr/bin/perl

use Benchmark qw(cmpthese);

$substr = join('','a'..'j');
$str = $substr x 90 . 'hTmL'. $substr x 2000;
print "length of search string: ",length $str, "\n";
cmpthese(-3, {
    'Fregex ' => sub {
        $str =~ /^\A.{0,996}?html/si or $str =~ /^\A.{0,996}?sgml/si;
    },
    'Fsubstr' => sub {
        substr( $str, 0, 1000) =~ /sgml/i or substr( $str, 0, 1000) =~ /html/i;
    },
    'Findex ' => sub {
        1+index( lc substr( $str, 0, 1000 ), 'sgml' )
            or 1+index( lc substr( $str, 0, 1000 ), 'html' )
    },
    'Rregex ' => sub {
        $str =~ /^\A.{0,996}?sgml/si or $str =~ /^\A.{0,996}?html/si;
    },
    'Rsubstr' => sub {
        substr( $str, 0, 1000) =~ /sgml/i or substr( $str, 0, 1000) =~ /html/i;
    },
    'Rindex' => sub {
        1+index( lc substr( $str, 0, 1000 ), 'sgml' )
            or 1+index( lc substr( $str, 0, 1000 ), 'html' )
    },
});

print substr $str, 900, 10;
__END__
root@swill ~/PerlMonks$ ./string_search2.pl
length of search string: 20904
            Rate  Rregex  Fregex Rsubstr Fsubstr Findex   Rindex
Rregex   48562/s      --    -35%    -54%    -54%    -55%    -55%
Fregex   75225/s     55%      --    -28%    -29%    -30%    -30%
Rsubstr 104700/s    116%     39%      --     -1%     -2%     -3%
Fsubstr 105896/s    118%     41%      1%      --     -1%     -1%
Findex  107056/s    120%     42%      2%      1%      --     -0%
Rindex  107434/s    121%     43%      3%      1%      0%      --
hTmLabcdef
root@swill ~/PerlMonks$

Update: Had I used my brain, I'd've changed the `$str` definition to use 'sGmL' rather than edit the function definitions....

...roboticus

There are lies, damned lies, and benchmarks.
Re^4: string pattern match, limited to first 1000 characters? by johngg (Canon) on Jun 23, 2007 at 17:40 UTC
It looks from your code that, apart from `Fregex` and `Rregex`, your subroutine pairs are identical as you haven't swapped the 'sgml' and 'html' around. When I run your code with them swapped around I get these results. (Rates are slow as the machine is a quite elderly SPARC.)

length of search string: 20904
           Rate  Rregex  Fregex Rsubstr  Rindex Fsubstr  Findex
Rregex  11708/s      --    -16%    -41%    -45%    -61%    -70%
Fregex  14005/s     20%      --    -29%    -35%    -54%    -64%
Rsubstr 19721/s     68%     41%      --     -8%    -35%    -50%
Rindex  21423/s     83%     53%      9%      --    -29%    -45%
Fsubstr 30362/s    159%    117%     54%     42%      --    -22%
Findex  39142/s    234%    179%     98%     83%     29%      --

Cheers,

JohnGG

Update: Fixed typo.
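The swapped code itself is not shown in the thread. A sketch of what it might have looked like follows, assuming that F ("forward") means the needle that is present is tested first and R ("reverse") means the absent one is tested first, which is what the rates above suggest. The redundant `^` is dropped, and the `--bench` switch is made up for this sketch.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $filler = join '', 'a' .. 'j';
my $str    = $filler x 90 . 'hTmL' . $filler x 2000;

# F*: the needle that is present ('html') is tested first, so the `or`
#     short-circuits after one search.
# R*: the absent needle ('sgml') is tested first, so both searches run.
my %candidates = (
    Fregex  => sub { $str =~ /\A.{0,996}?html/si or $str =~ /\A.{0,996}?sgml/si },
    Rregex  => sub { $str =~ /\A.{0,996}?sgml/si or $str =~ /\A.{0,996}?html/si },
    Fsubstr => sub { substr($str, 0, 1000) =~ /html/i or substr($str, 0, 1000) =~ /sgml/i },
    Rsubstr => sub { substr($str, 0, 1000) =~ /sgml/i or substr($str, 0, 1000) =~ /html/i },
    Findex  => sub { 1 + index(lc substr($str, 0, 1000), 'html')
                  or 1 + index(lc substr($str, 0, 1000), 'sgml') },
    Rindex  => sub { 1 + index(lc substr($str, 0, 1000), 'sgml')
                  or 1 + index(lc substr($str, 0, 1000), 'html') },
);

# Every candidate should report a hit, whichever order it searches in.
for my $name (sort keys %candidates) {
    die "$name missed the needle\n" unless $candidates{$name}->();
}

# Timings only on request (the --bench switch is made up for this sketch).
cmpthese(-3, \%candidates) if @ARGV && $ARGV[0] eq '--bench';
```

With the pairs really swapped, the F variants do one search per call and the R variants two, which would explain the gap between each pair in the table above.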
Re^5: string pattern match, limited to first 1000 characters? by roboticus (Chancellor) on Jun 24, 2007 at 14:04 UTC
Re^2: string pattern match, limited to first 1000 characters? by demerphq (Chancellor) on Jun 23, 2007 at 13:09 UTC
I don't understand why you did `^\A`. Is there a reason?

---
$world=~s/war/peace/g
Re^3: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 23, 2007 at 15:23 UTC
Other than ignorance, no.

--shmem
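For anyone wondering why the `^` adds nothing: without /m, `^` and `\A` both match only at the very start of the string, so `^\A` just says the same thing twice. The two differ only under /m, as this small sketch shows (the sample text and variable names are made up for illustration):

```perl
use strict;
use warnings;

my $text = "first line\nhtml on the second line\n";

# Without /m, ^ and \A are equivalent: both match only at offset 0.
my $caret_plain  = $text =~ /^html/   ? 1 : 0;
my $anchor_plain = $text =~ /\Ahtml/  ? 1 : 0;

# With /m, ^ also matches after every embedded newline; \A never does.
my $caret_multi  = $text =~ /^html/m  ? 1 : 0;
my $anchor_multi = $text =~ /\Ahtml/m ? 1 : 0;

print "plain: ^=$caret_plain \\A=$anchor_plain\n";
print "multi: ^=$caret_multi \\A=$anchor_multi\n";
```

This prints `plain: ^=0 \A=0` and `multi: ^=1 \A=0`. The benchmark patterns use /s and /i but not /m, so the `^` was harmless there, merely redundant.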
Re^4: string pattern match, limited to first 1000 characters? by ManFromNeptune (Scribe) on Jun 24, 2007 at 04:40 UTC
Ok, this is quite interesting. I've tried all of the suggested approaches, and based on a sample of actual text that I need to run on, the following is the overall best performing:

$str =~ /\A.{0,995}?<html/i

(I added the '<' to the text to make it more specific, and also removed the /s modifier -- in my situation, both of these tweaks boosted performance.)

But now the plot thickens ... what I just realized is that sometimes my content is gzipped (i.e. served by an Apache web server with mod_deflate, aka "Content-Encoding: gzip", or deflate, or compress). I have used some good CPAN modules that inflate this type of content, but now I'm faced with the same dilemma ... if I've got a 20KB gzipped file (100KB inflated), and I still only care about checking the first 1000 characters of the inflated content, is there a way to do a "partial inflate" so I don't have to incur the full overhead of a total-file inflation?

I know this seems like a long shot, but I figured I'd ask for ideas.

MFN
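One possible approach, offered as a sketch and not as a measured answer: IO::Uncompress::Gunzip has a read method that takes a length, so one can ask for the first 1000 bytes of inflated data and stop there. The helper name `inflated_head` and the sample data are made up for this example. It covers only the gzip case, and it counts bytes, so multi-byte encodings would still need decoding afterwards.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Return at most $limit bytes from the start of the inflated content of
# the gzip data in $$gz_ref, without asking for more than that.
sub inflated_head {
    my ($gz_ref, $limit) = @_;
    my $z = IO::Uncompress::Gunzip->new($gz_ref)
        or die "gunzip failed: $GunzipError\n";
    my $head = '';
    while (length($head) < $limit) {
        my $chunk;
        my $n = $z->read($chunk, $limit - length($head));
        die "gunzip failed: $GunzipError\n" if $n < 0;
        last if $n == 0;                    # end of the compressed stream
        $head .= $chunk;
    }
    $z->close;
    return $head;
}

# Stand-in for a fetched response body: about 100 KB of text, gzipped here.
my $plain = "<!DOCTYPE html>\n<HTML><head></head>\n" . ('lorem ipsum ' x 8500);
my $gzipped;
gzip \$plain => \$gzipped or die "gzip failed: $GzipError\n";

my $head = inflated_head(\$gzipped, 1000);
print "looks like HTML\n" if $head =~ /<html/i;
```

How much work this actually saves is an open question. The module consumes the compressed input in blocks (there is a documented BlockSize constructor option), so it may well inflate more than 1000 bytes internally before handing back the first 1000. Benchmarking against a full inflate on real data would be the way to find out.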
Re^5: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 24, 2007 at 10:04 UTC
Re^6: string pattern match, limited to first 1000 characters? by ManFromNeptune (Scribe) on Jun 25, 2007 at 17:32 UTC