
Using index would look something like the following:

sub using_index {
   our $seq; *seq = \$_[0];
   my @groups;

   my $pos = -1;
   my $start = -1;

   for (;;) {
      my $new_pos = index($seq, 'M', $pos+1);

      if ($new_pos < 0) {
         if ($start >= 0) {
            push(@groups, [ $start, $pos ]);
         }
         last;
      }

      if ($start < 0) {
         $start = $new_pos;
      }
      elsif ($new_pos - $pos > 1) {
         push(@groups, [ $start, $pos ]);
         $start = $new_pos;
      }

      $pos = $new_pos;
   }

   return @groups;
}

It would be simpler if there were a function that returned the position of the next character that isn't 'M'.
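
For illustration only, here is a rough sketch of what that could look like. index_not and using_index_not are hypothetical names of mine, not Perl builtins, and index_not is emulated with a regexp here, so this shows the shape of the simpler loop rather than a faster alternative.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl builtin): returns the position of the
# first character at or after $from that is not $char, or -1 if none.
# It works on a copy of the string, which is fine for a sketch but slow.
sub index_not {
   my ($str, $char, $from) = @_;
   pos($str) = $from;
   return $str =~ /[^\Q$char\E]/g ? $-[0] : -1;
}

# Same output format as using_index: a list of [ start, end ] pairs.
sub using_index_not {
   my ($seq) = @_;
   my @groups;

   my $pos = 0;
   while ((my $start = index($seq, 'M', $pos)) >= 0) {
      # The run of 'M's ends just before the next non-'M' character,
      # or at the end of the string if there is no such character.
      my $end = index_not($seq, 'M', $start);
      $end = length($seq) if $end < 0;

      push(@groups, [ $start, $end - 1 ]);
      $pos = $end;
   }

   return @groups;
}

printf("%d to %d\n", @$_) foreach using_index_not("IIMMMOOMI");
```

With such a helper, the $start/$pos bookkeeping from using_index goes away, since each pass through the loop handles one whole group.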

As you can guess, the index method is much slower than the regexp approach. On the input you provided, the regexp approach is about 170% faster than the index method (i.e. about 2.7 times its speed).

Benchmark code:

use strict;
use warnings;

use Benchmark qw( cmpthese );

sub using_index {
   our $seq; *seq = \$_[0];
   my @groups;

   my $pos = -1;
   my $start = -1;

   for (;;) {
      my $new_pos = index($seq, 'M', $pos+1);

      if ($new_pos < 0) {
         if ($start >= 0) {
            push(@groups, [ $start, $pos ]);
         }
         last;
      }

      if ($start < 0) {
         $start = $new_pos;
      }
      elsif ($new_pos - $pos > 1) {
         push(@groups, [ $start, $pos ]);
         $start = $new_pos;
      }

      $pos = $new_pos;
   }

   return @groups;
}

sub using_regexp {
   our $seq; *seq = \$_[0];
   my @groups;
   push(@groups, [ $-[0], $+[0]-1 ]) while $seq =~ /M+/g;
   return @groups;
}

{
   my $seq = "IIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIIIMMMMMM"
           . "MMMOOOOOOOOOOOOOOMMMMMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOOOOOOOOOMMMMMMMMMM"
           . "MIIIIIIMMMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMIIIMMMMMMMMMOOOOOOO"
           . "OOOOOOOOOOOOOOOOOOOOMMMMMMMIIIIMMMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOMMMMMM"
           . "MIIIMMMMMMMMMOOOOOOOOOOOOOOOOOOOOOOOOOMMMMMMMMMIIIMMMMMMMMMMMOOOOOOOO"
           . "OOOOOOOOOMMMMMMMMI";

   print("using_index\n");
   print("-----------\n");
   printf("%d to %d\n", @$_) foreach using_index($seq);

   print("\n");

   print("using_regexp\n");
   print("------------\n");
   printf("%d to %d\n", @$_) foreach using_regexp($seq);

   print("\n");

   cmpthese(-3, {
      using_index  => sub { my @groups = using_index  $seq; 1; },
      using_regexp => sub { my @groups = using_regexp $seq; 1; },
   });
}

Benchmark results:

               Rate  using_index using_regexp
using_index  2039/s           --         -63%
using_regexp 5467/s         168%           --
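
In case the percentages in that table aren't obvious, here is a small sketch of how they follow from the two rates. percent_faster and percent_slower are names I made up for illustration; they are not part of Benchmark.

```perl
use strict;
use warnings;

# Rates copied from the benchmark table (iterations per second).
my %rate = ( using_index => 2039, using_regexp => 5467 );

# How much faster $fast is than $slow, as a percentage of $slow.
sub percent_faster {
   my ($fast, $slow) = @_;
   return ($fast / $slow - 1) * 100;
}

# How much slower $slow is than $fast, as a percentage of $fast.
sub percent_slower {
   my ($slow, $fast) = @_;
   return (1 - $slow / $fast) * 100;
}

printf("speed ratio: %.2f\n", $rate{using_regexp} / $rate{using_index});
printf("regexp is %.0f%% faster\n",
   percent_faster($rate{using_regexp}, $rate{using_index}));
printf("index is %.0f%% slower\n",
   percent_slower($rate{using_index}, $rate{using_regexp}));
```

The 168% and the -63% describe the same gap, just measured relative to different rows.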

In reply to Re^2: 'grouping' substrings? by ikegami
in thread 'grouping' substrings? by Anonymous Monk