Find and extract substring(s) within larger string.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:


Re: Find and extract substring(s) within larger string.
by Athanasius (Archbishop) on Oct 21, 2013 at 03:47 UTC

You need to evaluate the /g-modified regex in list context, which returns all the matches at once:

13:42 >perl -wMstrict -e "my $dna = 'xxxxxxpecbcbccrlxxxxxxpeeeerlxxxxxplxxxxxPeRLxxxx'; my @matches = $dna =~ /(p.*?l)/gi; print qq[$_\n] for @matches;"
pecbcbccrl
peeeerl
pl
PeRL


See “Global Matching” in Using regular expressions in Perl.
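
To make the context point concrete, here is a minimal sketch (same sample string as above; the array names are only illustrative) contrasting /g in list context, /g in scalar context, and a match without /g:

```perl
use strict;
use warnings;

my $dna = 'xxxxxxpecbcbccrlxxxxxxpeeeerlxxxxxplxxxxxPeRLxxxx';

# List context: /g returns every match at once.
my @all = $dna =~ /(p.*?l)/gi;

# Scalar context: /g returns one match per evaluation and remembers
# where it stopped, so it is normally driven by a while loop.
my @one_by_one;
while ( $dna =~ /(p.*?l)/gi ) {
    push @one_by_one, $1;
}

# Without /g, list context yields only the first match.
my @first_only = $dna =~ /(p.*?l)/i;

print "list:   @all\n";         # list:   pecbcbccrl peeeerl pl PeRL
print "scalar: @one_by_one\n";  # scalar: pecbcbccrl peeeerl pl PeRL
print "no /g:  @first_only\n";  # no /g:  pecbcbccrl
```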

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Find and extract substring(s) within larger string.

by Anonymous Monk on Oct 21, 2013 at 04:14 UTC

my $regex = qr/p[^x]+l/;

my @matches = $dna =~ /($regex)/;

Re^3: Find and extract substring(s) within larger string.

by Athanasius (Archbishop) on Oct 21, 2013 at 04:41 UTC

For placement of the regex switches, see How do I apply switches like /i or /g to a qr regexp?:

/imsx immediately follow the closing delimiter of the qr, but /gce must be supplied at the point of actual use immediately following m// or s///.
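
Applied to the snippet in the parent post, that rule might look like the sketch below. The regex is the parent's own, unchanged; only the placement of the switches is new.

```perl
use strict;
use warnings;

my $dna = 'xxxxxxpecbcbccrlxxxxxxpeeeerlxxxxxplxxxxxPeRLxxxx';

# /i is compiled into the pattern, so it goes on the qr// itself.
my $regex = qr/p[^x]+l/i;

# /g describes how the match is carried out, so it goes on the m//
# that actually uses the pattern.
my @matches = $dna =~ /($regex)/g;

# Note: with [^x]+ the bare 'pl' is not matched, because at least one
# non-'x' character is required between the 'p' and the 'l'.
print "$_\n" for @matches;   # pecbcbccrl, peeeerl, PeRL (one per line)
```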

Hope that helps,



Re^4: Find and extract substring(s) within larger string.

by Anonymous Monk on Oct 21, 2013 at 04:51 UTC

Re^3: Find and extract substring(s) within larger string.

by AnomalousMonk (Archbishop) on Oct 21, 2013 at 23:37 UTC

... speed up the regex:
my $regex = qr/p[^x]+l/;

Please be aware that use of the /i case-insensitivity regex modifier usually imposes a speed penalty, perhaps quite significant if you're really dealing with long-ish (e.g., DNA) strings. Using a character class avoids this:
my $regex = qr/[Pp][^x]*[Ll]/;
with no /i modifier needed anywhere. As always, Benchmark-ing tells the true tale with regard to performance in a real application; anything else, however well informed, is speculation.
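
A rough sketch of how such a comparison could be set up with the core Benchmark module follows. The test string (the sample repeated 1,000 times) and the labels are assumptions for illustration only; real data may behave quite differently, so no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: the sample string repeated to make it long-ish.
my $dna = 'xxpecbcbccrlxxxPeeeerlxxpLxxxPeRLxx' x 1_000;

my $with_i     = qr/p[^x]*l/i;
my $with_class = qr/[Pp][^x]*[Ll]/;

# Sanity check: both patterns must find the same matches on this data.
# (They differ on an uppercase 'X': under /i, [^x] excludes it as well.)
my @a = $dna =~ /$with_i/g;
my @b = $dna =~ /$with_class/g;
die "patterns disagree" unless "@a" eq "@b";

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese( -1, {
    slash_i    => sub { my @m = $dna =~ /$with_i/g },
    char_class => sub { my @m = $dna =~ /$with_class/g },
} );
```

cmpthese prints a table of rates and relative percentages; whichever row comes out ahead on your own strings is the one to trust.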

Also be aware that the [^x]+ term in the quoted regex requires at least one non-'x' character between the 'p' and the 'l', so it cannot match something like 'pl'; [^x]* allows zero such characters. The final code might look like the following. (Note that () capturing parentheses are not needed here, because a /g match in list context with no capture groups returns the whole of each match, and they may impose a speed penalty.)

>perl -wMstrict -le
"my $dna = 'xxpecbcbccrlxxxPeeeerlxxpLxxxPeRLxx';
 ;;
 my $perl = qr{ [Pp] [^x]* [Ll] }xms;
 ;;
 my @matches = $dna =~ m{ $perl }xmsg;
 printf qq{'$_' } for @matches;
"
'pecbcbccrl' 'Peeeerl' 'pL' 'PeRL'
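
For comparison, a small sketch of the + versus * difference on the same string (the array names are just for illustration):

```perl
use strict;
use warnings;

my $dna = 'xxpecbcbccrlxxxPeeeerlxxpLxxxPeRLxx';

# + needs at least one non-'x' between the ends, so 'pL' is skipped.
my @plus = $dna =~ /[Pp][^x]+[Ll]/g;

# * also accepts zero characters between the ends, so 'pL' is found.
my @star = $dna =~ /[Pp][^x]*[Ll]/g;

print "plus: @plus\n";   # plus: pecbcbccrl Peeeerl PeRL
print "star: @star\n";   # star: pecbcbccrl Peeeerl pL PeRL
```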

Re: Find and extract substring(s) within larger string.
by 2teez (Vicar) on Oct 21, 2013 at 06:33 UTC

Hi Anonymous Monk,
I am also considering capturing the beginning and ending index of each match (reading left to right).
You can do that with a while loop and the index function:

use warnings;
use strict;

my $dna = 'xxxxxxpecbcbccrlxxxxxxpeeeerlxxxxxplxxxxxPeRLxxxx';
my $re  = qr/p.*?l/i;

while ( $dna =~ m[($re)]g ) {
    my $beg = index( $dna, $1 );
    my $len = length($1);
    print join " " => $1, $beg, $beg + ($len -1), $/; # updated
}

pecbcbccrl 6 15 
peeeerl 22 28 
pl 34 35 
PeRL 41 44
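
One caveat with this approach: two-argument index searches from the start of the string, so if the same substring matches twice it reports the first occurrence both times. A sketch of one possible workaround, using a hypothetical string in which 'pl' occurs twice, is to give index a start position derived from pos:

```perl
use strict;
use warnings;

# Hypothetical input in which 'pl' occurs twice.
my $dna = 'xxplxxxxplxx';
my $re  = qr/p.*?l/i;

my ( @naive, @fixed );
while ( $dna =~ m[($re)]g ) {
    my $len = length $1;

    # Two-argument index always finds the first 'pl', at offset 2.
    push @naive, index( $dna, $1 );

    # pos($dna) is the offset just past the current match, so start
    # the search where this match begins.
    push @fixed, index( $dna, $1, pos($dna) - $len );
}

print "naive: @naive\n";   # naive: 2 2
print "fixed: @fixed\n";   # fixed: 2 8
```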

Update: end index corrected, thanks to hdb.

If you tell me, I'll forget.
If you show me, I'll remember.
If you involve me, I'll understand.
--- Author unknown to me


Re^2: Find and extract substring(s) within larger string.

by hdb (Monsignor) on Oct 21, 2013 at 07:29 UTC

Perl provides the @- and @+ arrays for this purpose (thanks to kcott for pointing out that arrays have the @ sigil...):

use warnings;
use strict;
#                    1         2         3         4  
#          0123456789012345678901234567890123456789012345678
my $dna = 'xxxxxxpecbcbccrlxxxxxxpeeeerlxxxxxplxxxxxPeRLxxxx';
my $re  = qr/p.*?l/i;

while ( $dna =~ m[($re)]g ) {
    print join " " => $1, $-[0], $+[0]-1, $/;
}

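A correction of -1 is required, otherwise the matches are too long by one as in your code above. The reason is that $+[0] holds the offset of the position just after the last matched character, not the offset of that character itself.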
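
As a cross-check, here is a minimal sketch that uses the two offsets with substr to recover each match, which shows that $+[0] - $-[0] is exactly the match length (the @spans array and the variable names are only illustrative):

```perl
use strict;
use warnings;

my $dna = 'xxxxxxpecbcbccrlxxxxxxpeeeerlxxxxxplxxxxxPeRLxxxx';
my $re  = qr/p.*?l/i;

my @spans;
while ( $dna =~ /$re/g ) {
    # $-[0] is where the whole match starts; $+[0] is one past its end.
    my ( $start, $past_end ) = ( $-[0], $+[0] );

    # No capture group is needed: the offsets alone recover the text.
    my $text = substr $dna, $start, $past_end - $start;

    push @spans, [ $text, $start, $past_end - 1 ];
}

print "@$_\n" for @spans;
# pecbcbccrl 6 15
# peeeerl 22 28
# pl 34 35
# PeRL 41 44
```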


Re^3: Find and extract substring(s) within larger string.

by 2teez (Vicar) on Oct 21, 2013 at 07:56 UTC

Hi hdb,
A correction of -1 is required, otherwise the matches are too long by one as in your code above.
You are right... my bad!

