Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear Monks!
Assuming you have the following string:
dddddddddBBBBBBBBBBBBDDDDDDDDBBBBBBBBBBBBBBddddddddddddddddddddBBBBBBBBBBBBBBDDBBBBBBBBddddddddddddd

I want to find the start and end positions of each of the B substrings. I have started coding this a bit, but I am obviously doing something wrong:
    use strict;
    use warnings;

    my $seq = 'dddddddddBBBBBBBBBBBBDDDDDDDDBBBBBBBBBBBBBBddddddddddddddddddddBBBBBBBBBBBBBBDDBBBBBBBBddddddddddddd';

    while ($seq =~ /(B+)/g) {
        my $seg        = $1;
        my $seg_length = length($seg);
        my $seg_start  = index($seq, $seg);
        my $seg_end    = $seg_start + $seg_length;
        print $seg . "|" . $seg_start . "-" . $seg_end . "\n";
    }

Replies are listed 'Best First'.
Re: Find the boundaries of a substring in a string
by Corion (Patriarch) on Jun 27, 2023 at 08:43 UTC

    If you want the indices of a match, take a look at @+ and @- in perlvar.

      Can you help me understand how to use them in my example please?

        Win8 Strawberry 5.8.9.5 (32) Tue 06/27/2023 4:54:38
        C:\@Work\Perl\monks
        >perl
        use strict;
        use warnings;

        use Data::Dump qw(dd);

        my $seq = 'ddBddddBBBBBBBBBBBBDDDDDDDDBBBBBBBBBBBBBBddddddddddBBBBBBBBBBBBBBDDBBBBBBBBddddddddd';

        my $rx_sub_str = qr{ B+ }xms;

        my @endpoints;
        while ($seq =~ / $rx_sub_str /xmsg) {
            push @endpoints, [ $-[0], $+[0] ];
        }

        dd \@endpoints;
        ^Z
        [[2, 3], [7, 19], [27, 41], [51, 65], [67, 75]]

        Use of index might be faster than a regex approach for large data.
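
        As a small sketch of what the two arrays hold (the helper name b_runs and the short test string are made up for illustration): after a successful match, $-[0] is the offset where the match starts and $+[0] is the offset just past its last character, so the difference is the match length and can be fed straight to substr.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [start, end, text] for every run of B.
# start is $-[0]; end is $+[0], the offset one past the last matched character.
sub b_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /B+/g) {
        my ($start, $end) = ($-[0], $+[0]);
        # end - start is the run length, so substr recovers the matched text
        push @runs, [ $start, $end, substr($seq, $start, $end - $start) ];
    }
    return @runs;
}

for my $run (b_runs('ddBddddBBBdd')) {
    printf "%d-%d %s\n", @$run;    # prints "2-3 B" then "7-10 BBB"
}
```

        Because the end offset is exclusive, the last B of a run sits at end - 1 when counting from zero.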


        Give a man a fish:  <%-{-{-{-<

Re: Find the boundaries of a substring in a string
by hippo (Archbishop) on Jun 27, 2023 at 09:19 UTC

    TIMTOWTDI. Using your code as a starting point I would use pos to find the end point of the sequence. Here is a runnable example.

    use strict;
    use warnings;

    my $seq = 'dddddddddBBBBBBBBBBBBDDDDDDDDBBBBBBBBBBBBBBddddddddddddddddddddBBBBBBBBBBBBBBDDBBBBBBBBddddddddddddd';

    while ($seq =~ /(B+)/g) {
        my $seg        = $1;
        my $seg_length = length ($seg);
        my $seg_end    = pos ($seq);
        my $seg_start  = $seg_end - $seg_length;
        print $seg . "|" . $seg_start . "-" . $seg_end . "\n";
    }
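
    For context on why the original snippet goes wrong, here is a minimal sketch (the helper name compare_starts and the short test string are invented for illustration). index($seq, $seg) always searches from the beginning of the string, so any run that is no longer than an earlier run is reported at the earlier run's position, whereas pos tracks where the current match ended.

```perl
use strict;
use warnings;

# Hypothetical helper: start offset of each B run, computed two ways.
sub compare_starts {
    my ($seq) = @_;
    my @rows;
    while ($seq =~ /(B+)/g) {
        my $seg       = $1;
        my $via_index = index($seq, $seg);         # first occurrence anywhere in $seq
        my $via_pos   = pos($seq) - length($seg);  # start of this particular match
        push @rows, [ $seg, $via_index, $via_pos ];
    }
    return @rows;
}

printf "%-4s index=%d pos=%d\n", @$_ for compare_starts('aBBBaBBaBBB');
```

    With this test string the index column is 1 for all three runs, while the pos-based column gives 1, 5 and 8.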

    🦛

      Thank you both! I noticed in both snippets that the end boundaries are off by 1, or am I counting wrong?

        Are you counting from 1 or zero? If from 1 then it should probably be this (to give the first match at 10 to 21 inclusive):

        use strict;
        use warnings;

        my $seq = 'dddddddddBBBBBBBBBBBBDDDDDDDDBBBBBBBBBBBBBBddddddddddddddddddddBBBBBBBBBBBBBBDDBBBBBBBBddddddddddddd';

        while ($seq =~ /(B+)/g) {
            my $seg        = $1;
            my $seg_length = length ($seg);
            my $seg_end    = pos ($seq);
            my $seg_start  = $seg_end - $seg_length + 1;
            print $seg . "|" . $seg_start . "-" . $seg_end . "\n";
        }

        To show what answer you expect when posting, have a read of How to ask better questions using Test::More and sample data.
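
        To make the counting question concrete, here is a rough sketch (the helper name boundaries and its hash keys are hypothetical) that reports the same run in three conventions: zero-based with an exclusive end, which is what pos gives directly; zero-based inclusive; and one-based inclusive.

```perl
use strict;
use warnings;

# Hypothetical helper: boundaries of each B run in three counting conventions.
sub boundaries {
    my ($seq) = @_;
    my @out;
    while ($seq =~ /(B+)/g) {
        my $end   = pos($seq);            # zero-based, one past the last B
        my $start = $end - length($1);    # zero-based, first B
        push @out, {
            half_open => [ $start,     $end     ],  # zero-based, end exclusive
            zero_incl => [ $start,     $end - 1 ],  # zero-based, end inclusive
            one_incl  => [ $start + 1, $end     ],  # one-based, end inclusive
        };
    }
    return @out;
}

# The opening part of the sequence from the question: 9 d, 12 B, 8 D.
my ($first) = boundaries(('d' x 9) . ('B' x 12) . ('D' x 8));
printf "%-9s %d-%d\n", $_, @{ $first->{$_} } for qw(half_open zero_incl one_incl);
```

        For the first run this gives 9-21, 9-20 and 10-21 respectively, the last being the "10 to 21 inclusive" mentioned above.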


        🦛

Re: Find the boundaries of a substring in a string
by kcott (Archbishop) on Jun 27, 2023 at 09:44 UTC

    Take a look at pos.

    $ perl -e '
        my $str =    "BaBBaBBBaBBBB";
        # start pos:  0 2  5   9
        # end pos:    0  3   7    12
        my $fmt = "%d -> %2d\n";
        while ($str =~ /(B+)/g) {
            my $len = length $1;
            my $pos = pos $str;
            printf $fmt, $pos-$len, $pos-1;
        }
    '
    0 ->  0
    2 ->  3
    5 ->  7
    9 -> 12

    Edit: s/look as pos/look at pos/ Thanks LanX.

    — Ken

Re: Find the boundaries of a substring in a string
by harangzsolt33 (Deacon) on Jun 28, 2023 at 14:53 UTC
    My example will give you the start and end positions without using a regex. I did some testing, and it appears that this solution runs much slower than the regex solution. But it's just to show that there is more than one way to do it:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @STARTPOS = ();
    my @ENDPOS = ();
    my $str = "BaaaaBBBBBBaaaaaaaBBBBBBBBBBBBBBBBBBBBaaaBBBBBBBBBBBxB";
    print "$str\n\n";

    my $i = 0;
    while (($i = index($str, 'B', $i)) >= 0) {
        push(@STARTPOS, $i++);
        while (vec($str, $i++, 8) == 66) {}
        push(@ENDPOS, $i-1);
    }

    # Display results:
    for (my $i = 0; $i < @STARTPOS; $i++) {
        print "\nstart: ", $STARTPOS[$i], "\t-> end: ", $ENDPOS[$i];
    }

      So a worse way, coding perl like C, defeating the purpose, but with much worse performance. Awesome