bontchev has asked for the wisdom of the Perl Monks concerning the following question:

Hello enlightened ones,

I have a pretty basic problem, yet I am unable to find an elegant solution to it. :-(

I have a string which contains zero or more substrings, surrounded by delimiter pairs. I want to extract all the substrings.

For instance, if the delimiter pair is Foo and Bar and I have the string

WhateverFooBlahBarMoreFooStuffBarMore

I'd like to extract Blah and Stuff from it.

A simple

@words = ($line =~ /Foo(.*)Bar/g);

doesn't work: because .* is greedy, it captures BlahBarMoreFooStuff. I tried to override that with

@words = ($line =~ /(Foo(.*)Bar)+?/g);

but, for some reason, then I get FooBlahBarMoreFooStuffBar and BlahBarMoreFooStuff.

Any help would be appreciated...

Re: Extracting substrings
by kyle (Abbot) on Feb 17, 2008 at 12:30 UTC

    Use a non-greedy match: put "?" after the quantifier, so that .* becomes .*? and matches as little as possible.

    use Data::Dumper;

    my $string = 'WhateverFooBlahBarMoreFooStuffBarMore';
    my @words  = ( $string =~ /Foo(.*?)Bar/g );
    print Dumper \@words;

    __END__
    $VAR1 = [
              'Blah',
              'Stuff'
            ];
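The same idea can be wrapped up for arbitrary delimiter pairs. The following is only a minimal sketch: the helper name extract_between is hypothetical (it is not from the thread), and it assumes the delimiters are literal strings, which is why they are quoted with \Q...\E. The greedy pattern is run alongside it for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring found between $open and $close.
sub extract_between {
    my ($text, $open, $close) = @_;
    # \Q...\E quotes any regex metacharacters in the delimiters;
    # .*? is the non-greedy form; /s lets "." match newlines as well.
    return $text =~ /\Q$open\E(.*?)\Q$close\E/gs;
}

my $line = 'WhateverFooBlahBarMoreFooStuffBarMore';

my @words = extract_between($line, 'Foo', 'Bar');
print join(',', @words), "\n";     # prints "Blah,Stuff"

# For comparison, the greedy version from the question:
my @greedy = $line =~ /Foo(.*)Bar/g;
print join(',', @greedy), "\n";    # prints "BlahBarMoreFooStuff"
```

Quoting the delimiters matters as soon as they contain characters such as "(" or "|"; for plain words like Foo and Bar it makes no difference.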
Re: Extracting substrings
by ambrus (Abbot) on Feb 18, 2008 at 08:56 UTC
    /(Foo(.*)Bar)+?/g
    That would work with POSIX regular expression semantics, but Perl regexen have different greediness semantics. Luckily, Perl 5.10.0 has a replaceable regex engine, so if you just wait till the module re::engine::TRE is fixed, you'll be able to use that regex (though you'll need a /x flag too).

      It does work that way indeed:

      perl -wE 'use re::engine::TRE; print "<<$_>>\n" for "WhateverFooBlahBarMoreFooStuffBarMore" =~ /(?:Foo(.*)Bar)+?/gx'

      Output:

      <<Blah>>
      <<Stuff>>
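As for why Perl's built-in engine returns the two long strings for the pattern in the question, here is a short sketch of what seems to be going on: the lazy "+?" only governs how many times the outer group repeats, while the ".*" inside it stays greedy. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'WhateverFooBlahBarMoreFooStuffBarMore';

# "+?" asks for as few repetitions of the group as possible (one),
# but within that single repetition the greedy ".*" still stretches
# from the first Foo to the last Bar.
my @attempt = $line =~ /(Foo(.*)Bar)+?/g;
print "$_\n" for @attempt;
# FooBlahBarMoreFooStuffBar   (capture group 1)
# BlahBarMoreFooStuff         (capture group 2)

# Making the inner quantifier lazy and the outer group non-capturing
# gives the wanted list with the default engine.
my @fixed = $line =~ /(?:Foo(.*?)Bar)+?/g;
print "$_\n" for @fixed;
# Blah
# Stuff
```

With the inner ".*?" in place, the outer group and its "+?" no longer add anything, which brings the pattern back to the plain /Foo(.*?)Bar/g.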
Re: Extracting substrings
by poolpi (Hermit) on Feb 18, 2008 at 10:26 UTC

    With lookahead and lookbehind:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $string = 'WhateverFooBlahBarMoreFooStuffBarMore';
    my @words  = $string =~ / (?<=Foo) (.*?) (?=Bar) /gx;
    print Dumper \@words;

    Output:

    $VAR1 = [
              'Blah',
              'Stuff'
            ];
    hth,

    PooLpi

    'Ebry haffa hoe hab im tik a bush'. Jamaican proverb

      Meh, that's too little unneeded complication. Why not use the new (in perl 5.10.0) feature Keep (\K) too:

      print "<<$_>>\n" for "WhateverFooBlahBarMoreFooStuffBarMore" =~ /Foo\K.*?(?=Bar)/gx;

      Really, if you add a capture around the word anyway, why do you need the lookaheads at all?

      print "<<$_>>\n" for "WhateverFooBlahBarMoreFooStuffBarMore" =~ /Foo(.*?)Bar/gx;
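For the string in the question, all three patterns in this subthread return the same list. One case where the lookaround form can behave differently is when a closing delimiter also has to serve as the next opening delimiter, because a lookahead does not consume it. The sketch below illustrates this; the '|a|b|c|' input is an assumption made up for the example and is not from the thread.

```perl
use strict;
use warnings;

my $line = 'WhateverFooBlahBarMoreFooStuffBarMore';

my @capture    = $line =~ /Foo(.*?)Bar/g;            # plain capture
my @lookaround = $line =~ /(?<=Foo)(.*?)(?=Bar)/g;   # lookbehind + lookahead
my @keep       = $line =~ /Foo\K.*?(?=Bar)/g;        # \K needs perl 5.10 or later

print join(',', @capture),    "\n";   # prints "Blah,Stuff"
print join(',', @lookaround), "\n";   # prints "Blah,Stuff"
print join(',', @keep),       "\n";   # prints "Blah,Stuff"

# Made-up input where one "|" closes a field and opens the next.
my $row = '|a|b|c|';

# The capture form consumes the closing "|", so "b" is skipped.
my @consumed = $row =~ /\|(.*?)\|/g;
print join(',', @consumed), "\n";     # prints "a,c"

# The lookaround form only peeks at the delimiters, so every field is found.
my @peeked = $row =~ /(?<=\|)(.*?)(?=\|)/g;
print join(',', @peeked), "\n";       # prints "a,b,c"
```

With distinct delimiters such as Foo and Bar this never comes up, so the plain capture remains the simplest choice there.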