in reply to string pattern match, limited to first 1000 characters?

Basically I'm looking for a better method than creating a smaller test string with substr(0, 1000).

That looks like micro-optimization. Taking jwkrahn's and GrandFather's propositions:

#!/usr/bin/perl
use Benchmark qw(cmpthese);

$substr = join('',a..j);
$str = $substr x 90 . 'hTmL'. $substr x 2000;
print "length of search string: ",length $str, "\n";
cmpthese(-3, {
    # \A anchors at the beginning of a string, so no ^. See below (demerphq)
    #' regex ' => sub { $str =~ /^\A.{0,996}?html/si; },
    ' regex ' => sub { $str =~ /\A.{0,996}?html/si; },
    ' substr' => sub { (substr $str, 0, 1000) =~ /html/i; },
    #'!regex ' => sub { $str =~ /^\A.{0,996}?sgml/si; },
    '!regex ' => sub { $str =~ /\A.{0,996}?sgml/si; },
    '!substr' => sub { (substr $str, 0, 1000) =~ /sgml/i; },
});
__END__
length of search string: 20904
            Rate   regex  substr !substr  !regex
 regex   30996/s      --    -65%    -72%    -79%
 substr  88073/s    184%      --    -21%    -42%
!substr 111975/s    261%     27%      --    -26%
!regex  150866/s    387%     71%     35%      --

What can we deduce from that? Not much. The efficiency of either method seems to depend on whether the searched pattern is contained in the string. The results may vary with the position of the pattern in the string.
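Before comparing speed, it may be worth confirming that the two variants agree on where the 1000-character window ends. The sketch below is not from the benchmark itself; the helper names `by_regex` and `by_substr` are made up for illustration. It places 'hTmL' at several offsets, including the boundary cases 996 (last position that still fits) and 997 (one character too far).

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two benchmarked expressions.
# .{0,996} plus the 4 characters of "html" covers at most 1000 characters.
sub by_regex  { my ($s) = @_; return $s =~ /\A.{0,996}?html/si ? 1 : 0 }
sub by_substr { my ($s) = @_; return substr($s, 0, 1000) =~ /html/i ? 1 : 0 }

# Put the needle at different offsets and compare both answers.
for my $offset (0, 500, 996, 997, 2000) {
    my $s = ('x' x $offset) . 'hTmL' . ('x' x 3000);
    printf "offset %4d: regex=%d substr=%d\n",
        $offset, by_regex($s), by_substr($s);
}
```

Moving the needle around like this is also a simple way to see how the timings shift with the match position, by feeding the same strings to cmpthese.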

More important, even the "slowest" of these searches performs at a rate of ~31000/second. How many searches do you have to do in what time? In what context? How does the rest of your code perform?

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                              /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Re^2: string pattern match, limited to first 1000 characters?
by BrowserUk (Patriarch) on Jun 23, 2007 at 09:50 UTC

    What do you make of this benchmark? :)

    #!/usr/bin/perl
    use Benchmark qw(cmpthese);

    $substr = join('','a'..'j');
    $str = $substr x 90 . 'hTmL'. $substr x 2000;
    print "length of search string: ",length $str, "\n";
    cmpthese(-3, {
        ' regex ' => sub {
               $str =~ /^\A.{0,996}?html/si
            or $str =~ /^\A.{0,996}?sgml/si;
        },
        ' substr' => sub {
               substr( $str, 0, 1000) =~ /html/i
            or substr( $str, 0, 1000) =~ /sgml/i;
        },
        'index' => sub {
               1+index( lc substr( $str, 0, 1000 ), 'html' )
            or 1+index( lc substr( $str, 0, 1000 ), 'sgml' )
        },
    });
    print substr $str, 900, 10;
    __END__
    C:\test>junk
    length of search string: 20904
               Rate  regex  substr  index
     regex   93505/s     --   -54%   -65%
     substr 204367/s   119%     --   -23%
    index   266809/s   185%    31%     --
    hTmLabcdef

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The conclusions are obvious, aren't they?
      • don't use a regexp when all you need is index
      • don't use a regexp with a quantifier pattern if you can substr
      • simple tools are fastest for simple tasks

      Did I miss some?

      Nice that you combined the positive and negative searches into one, so one can see the average of both.
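For a fixed, case-insensitive literal, the index approach from the benchmark can be wrapped in a small helper. This is only a sketch: `prefix_contains` is a made-up name, not something posted in this thread, and it assumes the needle is a plain string rather than a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): case-insensitive literal
# search restricted to the first $limit characters of $str.
sub prefix_contains {
    my ($str, $needle, $limit) = @_;
    return index(lc substr($str, 0, $limit), lc $needle) >= 0 ? 1 : 0;
}

# Same test string as in the benchmarks: 'hTmL' starts at offset 900.
my $str = ('abcdefghij' x 90) . 'hTmL' . ('abcdefghij' x 2000);
print prefix_contains($str, 'html', 1000) ? "found\n" : "not found\n";
print prefix_contains($str, 'sgml', 1000) ? "found\n" : "not found\n";
```

Unlike the `1+index(...)` idiom in the benchmark, the helper compares against 0 explicitly, which reads a little more clearly at the cost of a few characters.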

      --shmem


      BrowserUk:

      Ummm ... you want to reverse those or clauses, otherwise the second clause in your or won't run.

      #!/usr/bin/perl
      use Benchmark qw(cmpthese);

      $substr = join('','a'..'j');
      $str = $substr x 90 . 'hTmL'. $substr x 2000;
      print "length of search string: ",length $str, "\n";
      cmpthese(-3, {
          'Fregex ' => sub {
                 $str =~ /^\A.{0,996}?html/si
              or $str =~ /^\A.{0,996}?sgml/si;
          },
          'Fsubstr' => sub {
                 substr( $str, 0, 1000) =~ /sgml/i
              or substr( $str, 0, 1000) =~ /html/i;
          },
          'Findex ' => sub {
                 1+index( lc substr( $str, 0, 1000 ), 'sgml' )
              or 1+index( lc substr( $str, 0, 1000 ), 'html' )
          },
          'Rregex ' => sub {
                 $str =~ /^\A.{0,996}?sgml/si
              or $str =~ /^\A.{0,996}?html/si;
          },
          'Rsubstr' => sub {
                 substr( $str, 0, 1000) =~ /sgml/i
              or substr( $str, 0, 1000) =~ /html/i;
          },
          'Rindex' => sub {
                 1+index( lc substr( $str, 0, 1000 ), 'sgml' )
              or 1+index( lc substr( $str, 0, 1000 ), 'html' )
          },
      });
      print substr $str, 900, 10;
      __END__
      root@swill ~/PerlMonks$ ./string_search2.pl
      length of search string: 20904
                  Rate  Rregex  Fregex Rsubstr Fsubstr  Findex  Rindex
      Rregex   48562/s      --    -35%    -54%    -54%    -55%    -55%
      Fregex   75225/s     55%      --    -28%    -29%    -30%    -30%
      Rsubstr 104700/s    116%     39%      --     -1%     -2%     -3%
      Fsubstr 105896/s    118%     41%      1%      --     -1%     -1%
      Findex  107056/s    120%     42%      2%      1%      --     -0%
      Rindex  107434/s    121%     43%      3%      1%      0%      --
      hTmLabcdef
      root@swill ~/PerlMonks$

      Update: Had I used my brain, I'd've changed the $str definition to use 'sGmL' rather than edit the function definitions....

      ...roboticus

      There are lies, damned lies, and benchmarks.

        It looks from your code that, apart from Fregex and Rregex, your subroutine pairs are identical as you haven't swapped the 'sgml' and 'html' around. When I run your code with them swapped around I get these results. (Rates are slow as the machine is a quite elderly SPARC.)

        length of search string: 20904
                   Rate  Rregex  Fregex Rsubstr  Rindex Fsubstr  Findex
        Rregex  11708/s      --    -16%    -41%    -45%    -61%    -70%
        Fregex  14005/s     20%      --    -29%    -35%    -54%    -64%
        Rsubstr 19721/s     68%     41%      --     -8%    -35%    -50%
        Rindex  21423/s     83%     53%      9%      --    -29%    -45%
        Fsubstr 30362/s    159%    117%     54%     42%      --    -22%
        Findex  39142/s    234%    179%     98%     83%     29%      --

        Cheers,

        JohnGG

        Update: Fixed typo.

Re^2: string pattern match, limited to first 1000 characters?
by demerphq (Chancellor) on Jun 23, 2007 at 13:09 UTC

    I don't understand why you did ^\A. Is there a reason?

    ---
    $world=~s/war/peace/g

      Other than ignorance, no.

      --shmem

        Ok, this is quite interesting. I've tried all of the suggested approaches, and based on a sample of actual text that I need to run on, the following is the overall best performing:
        $str =~ /\A.{0,995}?<html/i
        (I added the '<' to the text to make it more specific, and also removed the /s qualifier -- in my situation, both of these tweaks boosted performance.)
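One side effect of dropping /s is worth checking against real data: without it, `.` does not match a newline, so the anchored pattern can only succeed when '<html' sits on the first line of the string. That may well be part of the speedup, since the regex gives up at the first newline. The sketch below illustrates the difference; the sample documents and the two sub names are invented for the example. The 995 in the pattern is simply 1000 minus the five characters of '<html'.

```perl
use strict;
use warnings;

# Hypothetical wrappers: the same pattern with and without the /s modifier.
sub head_match_no_s   { my ($doc) = @_; return $doc =~ /\A.{0,995}?<html/i  ? 1 : 0 }
sub head_match_with_s { my ($doc) = @_; return $doc =~ /\A.{0,995}?<html/si ? 1 : 0 }

# Invented sample input: the tag on the first line vs. after a newline.
my $same_line = '<!DOCTYPE html><HTML><head>';
my $next_line = "<!DOCTYPE html>\n<HTML><head>";

for my $doc ($same_line, $next_line) {
    printf "without /s: %d, with /s: %d\n",
        head_match_no_s($doc), head_match_with_s($doc);
}
```

If the real pages put a doctype or blank line ahead of the tag, the version without /s would report no match even though '<html' is well inside the first 1000 characters.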

        But now the plot thickens ... what I just realized is that sometimes my content is gzipped (i.e. served by an Apache web server with mod_deflate, aka "Content-Encoding: gzip", or deflate, or compress). I have used some good CPAN modules that inflate this type of content, but now I'm faced with the same dilemma ... if I've got a 20KB gzipped file (100KB inflated), and I still only care about checking the first 1000 characters of the inflated content, is there a way to do a "partial inflate" so I don't have to incur the full overhead of a total-file inflation? I know this seems like a long shot, but I figured I'd ask for ideas.
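One possible direction, offered as a sketch rather than a tested answer for this workload: IO::Uncompress::Gunzip reads a gzip stream incrementally, so asking it for only the first 1000 bytes should avoid inflating the whole body. The helper name `inflate_prefix` and the sample page are invented here, the BlockSize value is an arbitrary guess, and how much data the module actually inflates internally per read has not been measured. It handles gzip only, not the deflate or compress encodings.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return at most $want bytes from the start of
# the gzipped data referenced by $gz_ref, reading the stream in pieces.
sub inflate_prefix {
    my ($gz_ref, $want) = @_;
    # BlockSize => 1024 is an assumption: smaller input blocks should
    # mean less compressed data is consumed per read.
    my $z = IO::Uncompress::Gunzip->new($gz_ref, BlockSize => 1024)
        or die "cannot open gzip stream: $GunzipError\n";
    my $prefix = '';
    while (length($prefix) < $want) {
        my $chunk = '';
        my $got = $z->read($chunk, $want - length $prefix);
        die "inflate failed: $GunzipError\n" if $got < 0;
        last if $got == 0;    # end of stream reached early
        $prefix .= $chunk;
    }
    $z->close;
    return $prefix;
}

# Invented stand-in for a gzipped HTTP response body.
my $page = "<html><head></head><body>" . ('lorem ipsum ' x 10_000) . "</body></html>";
gzip(\$page => \my $gzipped) or die "gzip failed: $GzipError\n";

my $head = inflate_prefix(\$gzipped, 1000);
print "looks like HTML\n" if $head =~ /\A.{0,995}?<html/i;
```

Whether this beats a full inflate followed by substr would need the same kind of benchmark as above, run on real responses.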

        MFN