quickly counting substrings

Crayman has asked for the wisdom of the Perl Monks concerning the following question:

i needed to modify File::Tail so that i could use swatch
to monitor multiline alerts generated by snort. File::Tail
previously used tr/\n// to count a standard newline separator.
since tr will not allow variables or multicharacter strings,
i had to find another technique. below is a benchmark of
a variety of techniques. the subroutines ending in _M count
multiple separator lines, the others count single \n separated
lines. here are the benchmarks from my 200mhz Pentinum Pro.

--snip--

Benchmark: timing 300 iterations of c_strncmp, c_strncmp_M, c_strstr, 
+c_strstr_M, m_array,
m_array_M, m_for, m_for_M, m_while, m_while_M, tr...
 c_strncmp:  7 wallclock secs ( 7.44 usr +  0.00 sys =  7.44 CPU)
c_strncmp_M:  8 wallclock secs ( 7.31 usr +  0.00 sys =  7.31 CPU)
  c_strstr:  1 wallclock secs ( 1.36 usr +  0.00 sys =  1.36 CPU)
c_strstr_M:  1 wallclock secs ( 1.18 usr +  0.00 sys =  1.18 CPU)
   m_array: 14 wallclock secs (12.25 usr +  0.00 sys = 12.25 CPU)
 m_array_M:  5 wallclock secs ( 4.91 usr +  0.00 sys =  4.91 CPU)
     m_for: 12 wallclock secs (11.60 usr +  0.00 sys = 11.60 CPU)
   m_for_M:  5 wallclock secs ( 4.91 usr +  0.00 sys =  4.91 CPU)
   m_while:  8 wallclock secs ( 8.41 usr +  0.00 sys =  8.41 CPU)
 m_while_M:  4 wallclock secs ( 3.94 usr +  0.00 sys =  3.94 CPU)
        tr:  2 wallclock secs ( 1.31 usr +  0.00 sys =  1.31 CPU)

=== check counts ===
        _m_while_M  2000
        _c_strncmp  6000
            _m_for  6000
          _m_array  6000
         _c_strstr  6000
          _m_while  6000
      _c_strncmp_M  2000
               _tr  6000
          _m_for_M  2000
        _m_array_M  2000
       _c_strstr_M  2000
-- snip --

here is the benchmark code - notice the use of Inline.pm to
get some more speeeed by using C.  can anyone think of a 
faster way to count one substring in another???

-- snip --
[download]

#!/usr/bin/perl -w
use Benchmark('timethese');
use Inline C;

my $tststr="abcdefghijklmnopqrstuvwxyz01234567890\n";
my $dta=($tststr x 2000).( ($tststr."\n") x 1900).("\n" x 200);
my %cnt=();


my $_tr = sub {
   my $cnt=$dta=~tr/\n//;
   $cnt{_tr}=$cnt;
};

my $_m_while = sub {
   my $cnt=0;
   my $rcdsep="\n";
   while ($dta=~m/$rcdsep/g) {$cnt++};
   $cnt{_m_while}=$cnt;
};

my $_m_for = sub {
   my $cnt=0;
   my $rcdsep="\n";
   $cnt++ for($dta=~m/$rcdsep/g);
   $cnt{_m_for}=$cnt;
};

my $_m_array = sub {
   my $cnt=0;
   my $rcdsep="\n";
   $cnt=scalar(@{[$dta=~m/$rcdsep/g]});
   $cnt{_m_array}=$cnt;
};

my $_m_while_M = sub {
   my $cnt=0;
   my $rcdsep="\n\n";
   while ($dta=~m/$rcdsep/g) {$cnt++};
   $cnt{_m_while_M}=$cnt;
};

my $_m_for_M = sub {
   my $cnt=0;
   my $rcdsep="\n\n";
   $cnt++ for($dta=~m/$rcdsep/g);
   $cnt{_m_for_M}=$cnt;
};

my $_m_array_M = sub {
   my $cnt=0;
   my $rcdsep="\n\n";
   $cnt=scalar(@{[$dta=~m/$rcdsep/g]});
   $cnt{_m_array_M}=$cnt;
};

my $_c_strncmp = sub {
   my $rcdsep="\n";
   my $cnt=c_strncmp($dta,$rcdsep);
   $cnt{_c_strncmp}=$cnt;
};

my $_c_strncmp_M = sub {
   my $rcdsep="\n\n";
   my $cnt=c_strncmp($dta,$rcdsep);
   $cnt{_c_strncmp_M}=$cnt;
};

my $_c_strstr = sub {
   my $rcdsep="\n";
   my $cnt=c_strstr($dta,$rcdsep);
   $cnt{_c_strstr}=$cnt;
};

my $_c_strstr_M = sub {
   my $rcdsep="\n\n";
   my $cnt=c_strstr($dta,$rcdsep);
   $cnt{_c_strstr_M}=$cnt;
};

$_tr->(); # seed it??

timethese(300,
               { tr =>         $_tr,
                 m_while =>    $_m_while,   m_for =>       $_m_for,   
+    m_array =>   $_m_array,
                 m_while_M =>  $_m_while_M, m_for_M =>     $_m_for_M, 
+    m_array_M => $_m_array_M,
                 c_strncmp =>  $_c_strncmp, c_strncmp_M => $_c_strncmp
+_M,
                 c_strstr =>   $_c_strstr,  c_strstr_M =>  $_c_strstr_
+M
               }
         );

print "\n=== check counts ===\n";
printf("%18s %5d\n",$_,$cnt{$_}) for (keys %cnt);

__END__
__C__

#include <string.h>

int c_strncmp(char * buf, char * sep){
   int cnt = 0;
   int seplen = strlen(sep);
   for(; *buf ; buf++)
      if(!strncmp(buf,sep,seplen)){
         cnt++;
         buf+=seplen-1;
      }
  
   return cnt;
}

int c_strstr(char * buf, char * sep){
   int cnt = 0;
   int seplen = strlen(sep);
   for(; *buf; buf+=seplen, cnt++)
      buf=strstr(buf,sep);
   return cnt; 
}
-- snip --
[download]


thanks in advance for the input!! 

___cliff rayman___cliff_AT_rayman.com___

Edit 2001-03-16 by tye to add <READMORE>

Comment on quickly counting substrings Select or Download Code

Replies are listed 'Best First'.
Re: quickly counting substrings by jeroenes (Priest) on Mar 02, 2001 at 15:18 UTC
Very nice comparison. You might as well have a swing at index, which is probably fast. `sub s_index{ my $pos = 0; my $cnt = 0; while (index($str, $sep, $pos)){ $pos++; $cnt++; } $cnt{s_index}=$cnt; }` [download] Jeroen "We are not alone"(FZ) Update: Thanks to danger for pointing my typos out. I took his code, and benchmarked it: my $_s_index = sub { my $pos = 0; my $cnt = 0; local $[ = 1; # ook! my $pos = -1; ++$cnt while $pos = index($dta,"\n\n",$pos+2); $cnt{_s_index}=$cnt; }; my $_while_re = sub{ my $cnt = 0; my $recsep = "\n\n"; $cnt++ while $dta =~ /$recsep/g; $cnt{_while_re} = $cnt; }; =>resulted in: Benchmark: timing 300 iterations of m_array, m_array_M, m_for, m_for_M +, m_while, m_while_M, s_index, tr, while_re... m_array: 11 wallclock secs ( 5.13 usr + 0.04 sys = 5.17 CPU) @ 58 +.03/s (n=300) m_array_M: 4 wallclock secs ( 1.57 usr + 0.02 sys = 1.59 CPU) @ 18 +8.68/s (n=300) m_for: 9 wallclock secs ( 4.19 usr + 0.00 sys = 4.19 CPU) @ 71 +.60/s (n=300) m_for_M: 3 wallclock secs ( 1.41 usr + 0.01 sys = 1.42 CPU) @ 21 +1.27/s (n=300) m_while: 5 wallclock secs ( 2.46 usr + 0.00 sys = 2.46 CPU) @ 12 +1.95/s (n=300) m_while_M: 2 wallclock secs ( 0.94 usr + 0.00 sys = 0.94 CPU) @ 31 +9.15/s (n=300) s_index: 2 wallclock secs ( 0.88 usr + 0.00 sys = 0.88 CPU) @ 34 +0.91/s (n=300) tr: 1 wallclock secs ( 0.36 usr + 0.00 sys = 0.36 CPU) @ 83 +3.33/s (n=300) (warning: too few iterations for a reliable count) while_re: 2 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) @ 33 +3.33/s (n=300) === check counts === _while_re 2000 _m_for_M 2000 _m_array_M 2000 _tr 6000 _m_while_M 2000 _m_while 6000 _s_index 2000 _m_for 6000 _m_array 6000 [download] With tr still being the winner, and closely tied 2nd and 3rd place for index and danger's while, respectively. Than the light showed me that the separators should be all the same. That gave some shuffle: Benchmark: timing 300 iterations of m_array, m_array_M, m_for, m_for_M +, m_while, m_while_M, s_index, tr, while_re... m_array: 12 wallclock secs ( 5.04 usr + 0.04 sys = 5.08 CPU) @ 59 +.06/s (n=300) m_array_M: 11 wallclock secs ( 5.08 usr + 0.03 sys = 5.11 CPU) @ 58 +.71/s (n=300) m_for: 9 wallclock secs ( 4.19 usr + 0.00 sys = 4.19 CPU) @ 71 +.60/s (n=300) m_for_M: 8 wallclock secs ( 4.10 usr + 0.00 sys = 4.10 CPU) @ 73 +.17/s (n=300) m_while: 5 wallclock secs ( 2.26 usr + 0.02 sys = 2.28 CPU) @ 13 +1.58/s (n=300) m_while_M: 5 wallclock secs ( 2.28 usr + 0.02 sys = 2.30 CPU) @ 13 +0.43/s (n=300) s_index: 5 wallclock secs ( 2.13 usr + 0.00 sys = 2.13 CPU) @ 14 +0.85/s (n=300) tr: 0 wallclock secs ( 0.27 usr + 0.00 sys = 0.27 CPU) @ 11 +11.11/s (n=300) (warning: too few iterations for a reliable count) while_re: 5 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 14 +2.18/s (n=300) === check counts === _while_re 6000 _m_for_M 6000 _m_array_M 6000 _tr 6000 _m_while_M 6000 _m_while 6000 _s_index 6000 _m_for 6000 _m_array 6000 [download] Sadly enough the C-code doesn't compile on my machine. I don't have the time to get it working right now.	[reply] [d/l] [select]
Re: Re: quickly counting substrings by danger (Priest) on Mar 02, 2001 at 15:44 UTC
That routine is a little broken -- the return value of index() can be 0 on success, and you'll want to set $pos to the current position plus whatever the length of the separator is at each iteration. One version of using index() that fares somewhat comparably to the m_while_M routine is: `my $_index = sub { my $cnt = 0; local $[ = 1; # ook! my $pos = -1; ++$cnt while $pos = index($dta,"\n\n",$pos+2); $cnt{_index} = $cnt; };` [download] But an alternate version of the m_while_M routine (using while as a statement modifier rather than the block form) seems to be the best I can come up with at this hour: `my $_while_re = sub{ my $cnt = 0; my $recsep = "\n\n"; $cnt++ while $dta =~ /$recsep/g; $cnt{_while_re} = $cnt; };` [download]	[reply] [d/l] [select]
Re: quickly counting substrings by petral (Curate) on Mar 03, 2001 at 01:25 UTC
Very interesting. Just curious: do we no longer need to use /o when we're using an unchanging variable in a regexp? p	[reply]
Re: Re: quickly counting substrings by Crayman (Acolyte) on Mar 09, 2001 at 12:15 UTC
it's not unchanging in the actual code. $rcdsep is contained in an object. if we settle on using one of the perl routines, as opposed to C, we will probably bind $rcdsep using a closure and then /o would be applicable. ___cliff rayman___cliff_AT_rayman.com___	[reply]