newbio has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

my $line='*Mary* had a little lamb';

my $bagofwords='had |a |Sam |Tom ';

Output: *Mary_had_a* little lamb

I want to combine words that consecutively follow certain tagged words in a sentence. For example, in the sentence above '*Mary* had a little lamb', I would like to match and combine words that consecutively (in a greedy way) follow tagged word *Mary* based on a match with a collection of words I have in $bagofwords variable.

I tried using the following expression but it does not seem to work. Is there equivalent of [^\*\]+ for full words, something like - (had |a |little |Sam |Tom )+ in regular expression?

$line=~s/\*([^\*]+)\*\s($bagofwords)+/*$1_$2*/g;

Thanks.

Replies are listed 'Best First'.
Re: greedy match of words
by almut (Canon) on Aug 06, 2009 at 20:44 UTC

    You're pretty close. Try it with this minor change:

    my $line='*Mary* had a little lamb'; my $bagofwords='had|a|Sam|Tom'; while ($line=~s/\*([^\*]+)\*\s($bagofwords)\s+/*$1_$2* /) { print "$line\n"; # debug } __END__ *Mary_had* a little lamb *Mary_had_a* little lamb
Re: greedy match of words
by GrandFather (Saint) on Aug 06, 2009 at 21:04 UTC

    I can't see a straight forward way to do it with a single regex, but the following may suit:

    use strict; use warnings; my $line='*Mary* had a little lamb'; my $bagofwords='had |a |Sam |Tom '; $line = join ' ', map {s/(\*.*\* | ^(?:$bagofwords)$)/*$1*/x; s/\*\*/* +/g; $_} $line =~ /(\S+)/g; $line =~ s/\* \*/_/g; print $line;

    Prints:

    *Mary_had_a* little lamb

    True laziness is hard work
Re: greedy match of words
by moritz (Cardinal) on Aug 06, 2009 at 20:39 UTC
    You're nearly there. Two things that you need to improve: you don't need escape the star in the character class, it's just [^*]; and ($bagofwords)+ will only capture the last match of that regex.

    Here's something that nearly works:

    use warnings; use strict; use 5.010; my $line = '*Mary* had a little lamb'; my $bagofwords = qr{had |a |Sam |Tom}; if ($line =~ s/\*([^*]+)\*\s*((?:$bagofwords)*)/*$1 $2* /) { say $line; }

    It outputs *Mary had a * little lamb. If you want to include the underscores, you might have to post-process $2 before interpolating it, possibly with the /e modifier.

    In Perl 6 a quantified regex just returns a list of match objects, so that's a bit easier here:

    use v6; token word { had | a | Sam | Tom } say '*Mary* had a little lamb'.subst: rx{ '*' (<-[*]>+) '*' \s+ [<word> \s+]+ }, { '*' ~ join('_', $0, @($<word>)) ~ '* ' };

    Output: *Mary_had_a* little lamb

    (Note that Rakudo doesn't yet implement s///, but the method form of substitutions work today)