cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks. I have some text extracted from blog posts that in some cases includes unwanted footer info that I want to lop off. Thing is, in some cases this is marked by a word that could theoretically occur legitimately earlier in the text. So for example let's say I have something like this:
Tags are very useful to included in these posts. Interested in growing a home based business go here. Tags: Vemma Builder · wallstrip · Vemma · Vegas // Mar 2nd 2007 at 3:30 am vemma killa Leave a Comment Name Mail Website
and I want to get rid of everything from the second "Tags" to the end. If I say $text =~ s/Tags.+?$//; then it matches all the text. Is there some way to specify that it should use the "least inclusive" match from the end in order to prevent this?

Many TIA...

Steve

Re: regexp: least inclusive match?
by Sidhekin (Priest) on Mar 02, 2007 at 23:46 UTC

    If the marker is a plain string (as in the example, without regex specials), substr and rindex ought to do the job:

    my $idx = rindex($text, "Tags");
    if ($idx > -1) {
        substr($text, $idx) = '';
    }

    Otherwise you could for instance take advantage of * being greedy from the start:

    $text =~ s/(.*)Tags.*/$1/s;
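If the marker lives in a variable, the same greedy trick might be wrapped up roughly as follows. This is only a sketch: the sub name strip_after_last, the shortened sample string, and the use of \Q...\E to quote the marker are assumptions added for illustration, not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove everything from the LAST occurrence of
# $marker to the end of $text. Text without the marker is returned as-is.
sub strip_after_last {
    my ($text, $marker) = @_;
    # The greedy (.*) first grabs the whole string, then backtracks only
    # as far as it must, so the marker is matched at its last occurrence.
    # \Q...\E quotes any regex metacharacters in the marker (an addition).
    $text =~ s/(.*)\Q$marker\E.*/$1/s;
    return $text;
}

print strip_after_last("Tags are useful. Go here. Tags: a b c", "Tags"), "\n";
```

Note that the whitespace just before the last marker is kept, so you may want to trim the result afterwards.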

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

      Sidhekin is right: with a minor change the regex will work.

      Your original fails because of principle 0 of regex matching:

      Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.
      So while your '.+?' dictates a preference for the smallest number of characters to match, in the end this preference is overruled: the match has to start at the first "Tags", and the $ anchor then forces '.+?' to stretch all the way to the end of the string.
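A small demonstration of that principle, for what it's worth. The sample string is a shortened stand-in I made up, and the variable names $lazy and $greedy are only for illustration.

```perl
use strict;
use warnings;

my $text = "Tags are useful. Go here. Tags: a b c";

# Non-greedy .+? still starts at the FIRST "Tags", because the engine
# tries the leftmost starting position first; the anchor at the end then
# forces .+? to stretch to the end of the string.
(my $lazy = $text) =~ s/Tags.+?$//;

# A greedy (.*) in front pushes the match of "Tags" to its last occurrence.
(my $greedy = $text) =~ s/(.*)Tags.*/$1/s;

print "lazy:   [$lazy]\n";      # everything is gone
print "greedy: [$greedy]\n";    # only the trailing "Tags: ..." part is gone
```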
Re: regexp: least inclusive match?
by johngg (Canon) on Mar 03, 2007 at 09:35 UTC
    Another way to do it would be to use a regular expression with a negative look-ahead assertion. Substitute from the marker word to the end of the string with nothing, as long as the marker word is not followed by another occurrence, so it will only match from the last marker word onwards.

    use strict;
    use warnings;

    my $blog;
    {
        local $/;
        $blog = <DATA>;
    }

    my $word = q{Tags};
    my $rxStrip = qr{(?xs)
        \b$word\b
        (?!.*\b$word\b)
        .*
    };
    $blog =~ s{$rxStrip}{};

    print $blog;

    __END__
    Tags are very useful to included in these posts. Interested in growing a home based business go here. Tags: Vemma Builder · wallstrip · Vemma · Vegas // Mar 2nd 2007 at 3:30 am vemma killa Leave a Comment Name Mail Website
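One thing the \b anchors buy you is that longer words merely containing the marker are not treated as markers. A rough sketch of that, where the sub name strip_from_last_word, the \Q...\E quoting, and the made-up sample string are my own assumptions rather than anything from the code above:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the look-ahead approach.
sub strip_from_last_word {
    my ($text, $word) = @_;
    my $rx = qr{
        \b\Q$word\E\b             # the marker as a whole word ...
        (?! .* \b\Q$word\E\b )    # ... with no later whole-word occurrence
        .*                        # then everything up to the end
    }xs;
    $text =~ s{$rx}{};
    return $text;
}

# "Tagsoup" contains "Tags" but is not a whole-word match, so the cut
# happens at "Tags:" rather than at "Tagsoup".
print strip_from_last_word("Tags matter. Tags: a b Tagsoup", "Tags"), "\n";
```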

    I hope this is of interest.

    Cheers,

    JohnGG