Crayman has asked for the wisdom of the Perl Monks concerning the following question:

i needed to modify File::Tail so that i could use swatch
to monitor multiline alerts generated by snort. File::Tail
previously used tr/\n// to count a standard newline separator.
since tr will not allow variables or multicharacter strings,
i had to find another technique. below is a benchmark of
a variety of techniques. the subroutines ending in _M count
multiple separator lines, the others count single \n separated
lines. here are the benchmarks from my 200mhz Pentinum Pro.
--snip-- Benchmark: timing 300 iterations of c_strncmp, c_strncmp_M, c_strstr, +c_strstr_M, m_array, m_array_M, m_for, m_for_M, m_while, m_while_M, tr... c_strncmp: 7 wallclock secs ( 7.44 usr + 0.00 sys = 7.44 CPU) c_strncmp_M: 8 wallclock secs ( 7.31 usr + 0.00 sys = 7.31 CPU) c_strstr: 1 wallclock secs ( 1.36 usr + 0.00 sys = 1.36 CPU) c_strstr_M: 1 wallclock secs ( 1.18 usr + 0.00 sys = 1.18 CPU) m_array: 14 wallclock secs (12.25 usr + 0.00 sys = 12.25 CPU) m_array_M: 5 wallclock secs ( 4.91 usr + 0.00 sys = 4.91 CPU) m_for: 12 wallclock secs (11.60 usr + 0.00 sys = 11.60 CPU) m_for_M: 5 wallclock secs ( 4.91 usr + 0.00 sys = 4.91 CPU) m_while: 8 wallclock secs ( 8.41 usr + 0.00 sys = 8.41 CPU) m_while_M: 4 wallclock secs ( 3.94 usr + 0.00 sys = 3.94 CPU) tr: 2 wallclock secs ( 1.31 usr + 0.00 sys = 1.31 CPU) === check counts === _m_while_M 2000 _c_strncmp 6000 _m_for 6000 _m_array 6000 _c_strstr 6000 _m_while 6000 _c_strncmp_M 2000 _tr 6000 _m_for_M 2000 _m_array_M 2000 _c_strstr_M 2000 -- snip -- here is the benchmark code - notice the use of Inline.pm to get some more speeeed by using C. can anyone think of a faster way to count one substring in another??? -- snip --
#!/usr/bin/perl -w use Benchmark('timethese'); use Inline C; my $tststr="abcdefghijklmnopqrstuvwxyz01234567890\n"; my $dta=($tststr x 2000).( ($tststr."\n") x 1900).("\n" x 200); my %cnt=(); my $_tr = sub { my $cnt=$dta=~tr/\n//; $cnt{_tr}=$cnt; }; my $_m_while = sub { my $cnt=0; my $rcdsep="\n"; while ($dta=~m/$rcdsep/g) {$cnt++}; $cnt{_m_while}=$cnt; }; my $_m_for = sub { my $cnt=0; my $rcdsep="\n"; $cnt++ for($dta=~m/$rcdsep/g); $cnt{_m_for}=$cnt; }; my $_m_array = sub { my $cnt=0; my $rcdsep="\n"; $cnt=scalar(@{[$dta=~m/$rcdsep/g]}); $cnt{_m_array}=$cnt; }; my $_m_while_M = sub { my $cnt=0; my $rcdsep="\n\n"; while ($dta=~m/$rcdsep/g) {$cnt++}; $cnt{_m_while_M}=$cnt; }; my $_m_for_M = sub { my $cnt=0; my $rcdsep="\n\n"; $cnt++ for($dta=~m/$rcdsep/g); $cnt{_m_for_M}=$cnt; }; my $_m_array_M = sub { my $cnt=0; my $rcdsep="\n\n"; $cnt=scalar(@{[$dta=~m/$rcdsep/g]}); $cnt{_m_array_M}=$cnt; }; my $_c_strncmp = sub { my $rcdsep="\n"; my $cnt=c_strncmp($dta,$rcdsep); $cnt{_c_strncmp}=$cnt; }; my $_c_strncmp_M = sub { my $rcdsep="\n\n"; my $cnt=c_strncmp($dta,$rcdsep); $cnt{_c_strncmp_M}=$cnt; }; my $_c_strstr = sub { my $rcdsep="\n"; my $cnt=c_strstr($dta,$rcdsep); $cnt{_c_strstr}=$cnt; }; my $_c_strstr_M = sub { my $rcdsep="\n\n"; my $cnt=c_strstr($dta,$rcdsep); $cnt{_c_strstr_M}=$cnt; }; $_tr->(); # seed it?? timethese(300, { tr => $_tr, m_while => $_m_while, m_for => $_m_for, + m_array => $_m_array, m_while_M => $_m_while_M, m_for_M => $_m_for_M, + m_array_M => $_m_array_M, c_strncmp => $_c_strncmp, c_strncmp_M => $_c_strncmp +_M, c_strstr => $_c_strstr, c_strstr_M => $_c_strstr_ +M } ); print "\n=== check counts ===\n"; printf("%18s %5d\n",$_,$cnt{$_}) for (keys %cnt); __END__ __C__ #include <string.h> int c_strncmp(char * buf, char * sep){ int cnt = 0; int seplen = strlen(sep); for(; *buf ; buf++) if(!strncmp(buf,sep,seplen)){ cnt++; buf+=seplen-1; } return cnt; } int c_strstr(char * buf, char * sep){ int cnt = 0; int seplen = strlen(sep); for(; *buf; buf+=seplen, cnt++) buf=strstr(buf,sep); return cnt; } -- snip --

thanks in advance for the input!! 

___cliff rayman___cliff_AT_rayman.com___

Edit 2001-03-16 by tye to add <READMORE>

Replies are listed 'Best First'.
Re: quickly counting substrings
by jeroenes (Priest) on Mar 02, 2001 at 15:18 UTC
    Very nice comparison. You might as well have a swing at index, which is probably fast.
    sub s_index{ my $pos = 0; my $cnt = 0; while (index($str, $sep, $pos)){ $pos++; $cnt++; } $cnt{s_index}=$cnt; }

    Jeroen
    "We are not alone"(FZ)

    Update: Thanks to danger for pointing my typos out. I took his code, and benchmarked it:

    my $_s_index = sub { my $pos = 0; my $cnt = 0; local $[ = 1; # ook! my $pos = -1; ++$cnt while $pos = index($dta,"\n\n",$pos+2); $cnt{_s_index}=$cnt; }; my $_while_re = sub{ my $cnt = 0; my $recsep = "\n\n"; $cnt++ while $dta =~ /$recsep/g; $cnt{_while_re} = $cnt; }; =>resulted in: Benchmark: timing 300 iterations of m_array, m_array_M, m_for, m_for_M +, m_while, m_while_M, s_index, tr, while_re... m_array: 11 wallclock secs ( 5.13 usr + 0.04 sys = 5.17 CPU) @ 58 +.03/s (n=300) m_array_M: 4 wallclock secs ( 1.57 usr + 0.02 sys = 1.59 CPU) @ 18 +8.68/s (n=300) m_for: 9 wallclock secs ( 4.19 usr + 0.00 sys = 4.19 CPU) @ 71 +.60/s (n=300) m_for_M: 3 wallclock secs ( 1.41 usr + 0.01 sys = 1.42 CPU) @ 21 +1.27/s (n=300) m_while: 5 wallclock secs ( 2.46 usr + 0.00 sys = 2.46 CPU) @ 12 +1.95/s (n=300) m_while_M: 2 wallclock secs ( 0.94 usr + 0.00 sys = 0.94 CPU) @ 31 +9.15/s (n=300) s_index: 2 wallclock secs ( 0.88 usr + 0.00 sys = 0.88 CPU) @ 34 +0.91/s (n=300) tr: 1 wallclock secs ( 0.36 usr + 0.00 sys = 0.36 CPU) @ 83 +3.33/s (n=300) (warning: too few iterations for a reliable count) while_re: 2 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) @ 33 +3.33/s (n=300) === check counts === _while_re 2000 _m_for_M 2000 _m_array_M 2000 _tr 6000 _m_while_M 2000 _m_while 6000 _s_index 2000 _m_for 6000 _m_array 6000
    With tr still being the winner, and closely tied 2nd and 3rd place for index and danger's while, respectively.

    Than the light showed me that the separators should be all the same. That gave some shuffle:

    Benchmark: timing 300 iterations of m_array, m_array_M, m_for, m_for_M +, m_while, m_while_M, s_index, tr, while_re... m_array: 12 wallclock secs ( 5.04 usr + 0.04 sys = 5.08 CPU) @ 59 +.06/s (n=300) m_array_M: 11 wallclock secs ( 5.08 usr + 0.03 sys = 5.11 CPU) @ 58 +.71/s (n=300) m_for: 9 wallclock secs ( 4.19 usr + 0.00 sys = 4.19 CPU) @ 71 +.60/s (n=300) m_for_M: 8 wallclock secs ( 4.10 usr + 0.00 sys = 4.10 CPU) @ 73 +.17/s (n=300) m_while: 5 wallclock secs ( 2.26 usr + 0.02 sys = 2.28 CPU) @ 13 +1.58/s (n=300) m_while_M: 5 wallclock secs ( 2.28 usr + 0.02 sys = 2.30 CPU) @ 13 +0.43/s (n=300) s_index: 5 wallclock secs ( 2.13 usr + 0.00 sys = 2.13 CPU) @ 14 +0.85/s (n=300) tr: 0 wallclock secs ( 0.27 usr + 0.00 sys = 0.27 CPU) @ 11 +11.11/s (n=300) (warning: too few iterations for a reliable count) while_re: 5 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 14 +2.18/s (n=300) === check counts === _while_re 6000 _m_for_M 6000 _m_array_M 6000 _tr 6000 _m_while_M 6000 _m_while 6000 _s_index 6000 _m_for 6000 _m_array 6000
    Sadly enough the C-code doesn't compile on my machine. I don't have the time to get it working right now.

      That routine is a little broken -- the return value of index() can be 0 on success, and you'll want to set $pos to the current position plus whatever the length of the separator is at each iteration.

      One version of using index() that fares somewhat comparably to the m_while_M routine is:

      my $_index = sub { my $cnt = 0; local $[ = 1; # ook! my $pos = -1; ++$cnt while $pos = index($dta,"\n\n",$pos+2); $cnt{_index} = $cnt; };

      But an alternate version of the m_while_M routine (using while as a statement modifier rather than the block form) seems to be the best I can come up with at this hour:

      my $_while_re = sub{ my $cnt = 0; my $recsep = "\n\n"; $cnt++ while $dta =~ /$recsep/g; $cnt{_while_re} = $cnt; };
Re: quickly counting substrings
by petral (Curate) on Mar 03, 2001 at 01:25 UTC
    Very interesting. Just curious: do we no longer need to use /o when we're using an unchanging variable in a regexp?

    p
      it's not unchanging in the actual code. $rcdsep is
      contained in an object. if we settle on using one of
      the perl routines, as opposed to C, we will probably
      bind $rcdsep using a closure and then /o would be applicable.
      ___cliff rayman___cliff_AT_rayman.com___