ManFromNeptune has asked for the wisdom of the Perl Monks concerning the following question:

Folks, I need to do a simple case-insensitive string search match for 'HTML' in some very large strings. A simple regex like $string =~ /html/i would do the trick, but I only care if the search text appears within the first 1000 characters of the string, and it seems very inefficient to have Perl regex test the entire thing. Is there a way to constrain the scope (depth) of a regex search? Basically I'm looking for a better method than creating a smaller test string with substr(0, 1000). Thanks for any suggestions, MFN

Replies are listed 'Best First'.
Re: string pattern match, limited to first 1000 characters?
by GrandFather (Saint) on Jun 23, 2007 at 04:21 UTC
    (substr $str, 0, 1000) =~ /html/i;
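For anyone who wants to try the one-liner above on data of different shapes, here is a minimal sketch. The helper name `matches_in_prefix` and the test strings are made up for illustration; only the `substr ... =~ /html/i` idea comes from the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: true if 'html' (any case) occurs entirely within
# the first $limit characters of $str.
sub matches_in_prefix {
    my ($str, $limit) = @_;
    # substr copies at most $limit characters; a shorter string comes back whole.
    return substr($str, 0, $limit) =~ /html/i ? 1 : 0;
}

my $early = ('x' x 500)  . '<HTML>' . ('x' x 100_000);   # tag inside the prefix
my $late  = ('x' x 5000) . '<HTML>' . ('x' x 100_000);   # tag beyond the prefix
my $short = '<html>';                                     # shorter than the limit

printf "early: %d\n", matches_in_prefix($early, 1000);   # early: 1
printf "late:  %d\n", matches_in_prefix($late,  1000);   # late:  0
printf "short: %d\n", matches_in_prefix($short, 1000);   # short: 1
```

The copy made by substr is bounded by the limit, so its cost does not grow with the length of the full string.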

    DWIM is Perl's answer to Gödel
Re: string pattern match, limited to first 1000 characters?
by jwkrahn (Abbot) on Jun 23, 2007 at 03:50 UTC
Re: string pattern match, limited to first 1000 characters?
by shmem (Chancellor) on Jun 23, 2007 at 09:23 UTC
    Basically I'm looking for a better method than creating a smaller test string with substr(0, 1000).

    That looks like micro-optimization. Taking jwkrahn's and GrandFather's propositions:

    #!/usr/bin/perl
    use Benchmark qw(cmpthese);

    $substr = join('', 'a'..'j');
    $str = $substr x 90 . 'hTmL' . $substr x 2000;
    print "length of search string: ", length $str, "\n";

    cmpthese(-3, {
        # \A anchors at the beginning of a string, so no ^. See below (demerphq)
        #' regex ' => sub { $str =~ /^\A.{0,996}?html/si; },
        ' regex ' => sub { $str =~ /\A.{0,996}?html/si; },
        ' substr' => sub { (substr $str, 0, 1000) =~ /html/i; },
        #'!regex ' => sub { $str =~ /^\A.{0,996}?sgml/si; },
        '!regex ' => sub { $str =~ /\A.{0,996}?sgml/si; },
        '!substr' => sub { (substr $str, 0, 1000) =~ /sgml/i; },
    });
    __END__
    length of search string: 20904
                Rate   regex   substr !substr  !regex
     regex   30996/s      --     -65%    -72%    -79%
     substr  88073/s    184%       --    -21%    -42%
    !substr 111975/s    261%      27%      --    -26%
    !regex  150866/s    387%      71%     35%      --

    What can we deduce from that? Not much. The efficiency of either method seems to depend on whether the searched pattern is contained in the string. The results may also vary with the position of the pattern in the string.
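Before reading much into the timings, it may be worth checking that the two forms accept the same inputs. The sketch below is an illustration only: `make_string` and the chosen offsets are assumptions, not part of the benchmark. It moves 'hTmL' across the 1000-character boundary. The 996 in the quantifier is 1000 minus the length of 'html', so the match has to end within the first 1000 characters.

```perl
use strict;
use warnings;

# Hypothetical helper: put 'hTmL' at a given offset inside filler text.
sub make_string {
    my ($offset) = @_;
    return ('a' x $offset) . 'hTmL' . ('a' x 2000);
}

for my $offset (0, 995, 996, 997, 1500) {
    my $str = make_string($offset);
    # Anchored regex: at most 996 arbitrary characters (/s lets . match newlines).
    my $by_regex  = $str =~ /\A.{0,996}?html/si     ? 1 : 0;
    # Plain regex on a copy of the first 1000 characters.
    my $by_substr = substr($str, 0, 1000) =~ /html/i ? 1 : 0;
    printf "offset %4d: regex=%d substr=%d\n", $offset, $by_regex, $by_substr;
}
```

Both columns should read 1 for offsets 0, 995 and 996, and 0 for 997 and 1500, where the pattern no longer fits inside the prefix.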

    More important, even the "slowest" of these searches performs at a rate of ~31000/second. How many searches do you have to do in what time? In what context? How does the rest of your code perform?

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

      What do you make of this benchmark? :)

      #!/usr/bin/perl
      use Benchmark qw(cmpthese);

      $substr = join('', 'a'..'j');
      $str = $substr x 90 . 'hTmL' . $substr x 2000;
      print "length of search string: ", length $str, "\n";

      cmpthese(-3, {
          ' regex ' => sub {
              $str =~ /^\A.{0,996}?html/si or $str =~ /^\A.{0,996}?sgml/si;
          },
          ' substr' => sub {
              substr( $str, 0, 1000) =~ /html/i or substr( $str, 0, 1000) =~ /sgml/i;
          },
          'index' => sub {
              1+index( lc substr( $str, 0, 1000 ), 'html' )
                  or 1+index( lc substr( $str, 0, 1000 ), 'sgml' )
          },
      });
      print substr $str, 900, 10;
      __END__
      C:\test>junk
      length of search string: 20904
                  Rate  regex   substr  index
       regex   93505/s     --     -54%   -65%
       substr 204367/s   119%       --   -23%
      index   266809/s   185%      31%     --
      hTmLabcdef

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        The conclusions are obvious, aren't they?
        • don't use a regexp when all you need is index
        • don't use a regexp with a quantifier pattern if you can substr
        • simple tools are fastest for simple tasks
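The first point can be sketched as a small function. The name `prefix_contains` and its parameters are hypothetical; the `index( lc substr(...) )` idiom is the one from the benchmark. Note that `lc` is applied to the short copy rather than to the whole string, and that for plain ASCII needles such as 'html' this gives the same answer as /i.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive literal search limited to a prefix.
sub prefix_contains {
    my ($str, $needle, $limit) = @_;
    # index returns -1 when the needle is absent, a position >= 0 otherwise.
    return index(lc(substr($str, 0, $limit)), lc($needle)) >= 0 ? 1 : 0;
}

my $str = ('abcdefghij' x 90) . 'hTmL' . ('abcdefghij' x 2000);

printf "html: %d\n", prefix_contains($str, 'html', 1000);   # html: 1
printf "sgml: %d\n", prefix_contains($str, 'sgml', 1000);   # sgml: 0
printf "html in first 900: %d\n", prefix_contains($str, 'html', 900);   # 0
```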

        Did I miss some?

        Nice that you combined the positive and negative searches into one, so one can see the average of both.

        --shmem


        BrowserUk:

        Ummm ... you want to reverse those or clauses, otherwise the second clause in your or won't run.

        #!/usr/bin/perl
        use Benchmark qw(cmpthese);

        $substr = join('', 'a'..'j');
        $str = $substr x 90 . 'hTmL' . $substr x 2000;
        print "length of search string: ", length $str, "\n";

        cmpthese(-3, {
            'Fregex ' => sub {
                $str =~ /^\A.{0,996}?html/si or $str =~ /^\A.{0,996}?sgml/si;
            },
            'Fsubstr' => sub {
                substr( $str, 0, 1000) =~ /sgml/i or substr( $str, 0, 1000) =~ /html/i;
            },
            'Findex ' => sub {
                1+index( lc substr( $str, 0, 1000 ), 'sgml' )
                    or 1+index( lc substr( $str, 0, 1000 ), 'html' )
            },
            'Rregex ' => sub {
                $str =~ /^\A.{0,996}?sgml/si or $str =~ /^\A.{0,996}?html/si;
            },
            'Rsubstr' => sub {
                substr( $str, 0, 1000) =~ /sgml/i or substr( $str, 0, 1000) =~ /html/i;
            },
            'Rindex' => sub {
                1+index( lc substr( $str, 0, 1000 ), 'sgml' )
                    or 1+index( lc substr( $str, 0, 1000 ), 'html' )
            },
        });
        print substr $str, 900, 10;
        __END__
        root@swill ~/PerlMonks$ ./string_search2.pl
        length of search string: 20904
                    Rate Rregex  Fregex  Rsubstr Fsubstr Findex  Rindex
        Rregex   48562/s     --    -35%    -54%    -54%    -55%    -55%
        Fregex   75225/s    55%      --    -28%    -29%    -30%    -30%
        Rsubstr 104700/s   116%     39%      --     -1%     -2%     -3%
        Fsubstr 105896/s   118%     41%      1%      --     -1%     -1%
        Findex  107056/s   120%     42%      2%      1%      --     -0%
        Rindex  107434/s   121%     43%      3%      1%      0%      --
        hTmLabcdef
        root@swill ~/PerlMonks$

        Update: Had I used my brain, I'd've changed the $str definition to use 'sGmL' rather than edit the function definitions....
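The short-circuit point can be seen in isolation with a counter. This is only a sketch: the `found` helper and the `%calls` hash are invented for the demonstration, and the test string is built the same way as in the benchmarks.

```perl
use strict;
use warnings;

my $str   = ('abcdefghij' x 90) . 'hTmL' . ('abcdefghij' x 2000);
my %calls = (html => 0, sgml => 0);

# Hypothetical instrumented search: counts how often each word is tried.
sub found {
    my ($word) = @_;
    $calls{$word}++;
    return substr($str, 0, 1000) =~ /\Q$word\E/i;
}

found('html') or found('sgml');   # html matches, so sgml is never tried
print "html first: html=$calls{html} sgml=$calls{sgml}\n";   # html=1 sgml=0

%calls = (html => 0, sgml => 0);
found('sgml') or found('html');   # sgml fails, so html is tried as well
print "sgml first: html=$calls{html} sgml=$calls{sgml}\n";   # html=1 sgml=1
```

With the matching pattern first, the benchmark times one successful search; with the failing pattern first, it times one failure plus one success.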

        ...roboticus

        There are lies, damned lies, and benchmarks.

      I don't understand why you did ^\A. Is there a reason?

      ---
      $world=~s/war/peace/g

        Other than ignorance, no.
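For readers wondering what the difference between the two anchors actually is, here is a small sketch (the sample text and variable names are made up). `\A` only ever matches at the very start of the string; `^` does the same by default but also matches after each newline under /m, so writing both adds nothing to `\A` alone.

```perl
use strict;
use warnings;

my $text = "first line\nhtml on the second line\n";

# \A: start of string only, with or without /m.
my $string_start       = $text =~ /\Ahtml/   ? 1 : 0;
my $string_start_multi = $text =~ /\Ahtml/m  ? 1 : 0;
# ^: like \A by default, but also matches after a newline under /m.
my $caret              = $text =~ /^html/    ? 1 : 0;
my $caret_multi        = $text =~ /^html/m   ? 1 : 0;
# ^\A together: still pinned to the start of the string.
my $both_multi         = $text =~ /^\Ahtml/m ? 1 : 0;

print "\\A=$string_start \\A/m=$string_start_multi ",
      "^=$caret ^/m=$caret_multi ^\\A/m=$both_multi\n";
# prints: \A=0 \A/m=0 ^=0 ^/m=1 ^\A/m=0
```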

        --shmem
