John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I'm recognizing a string that has some stuff, and the last thing of interest in the string should include everything to the end but not any trailing whitespace.

Consider as a concrete example the ability to match list items in Wiki-style markup. Here is what I ended up using:

my ($prefix, $content) = $line =~ /^\s*([*#]+:?)\s*(.*\S)?(?:(?<=\S)\s*)?$/;
It's especially complicated by the fact that the last thing could be missing completely. The interesting part here is the two uses of \S. The first makes sure the thing ends with a non-space. The second is a look-back assertion so the final trailing whitespace won't just be eaten by the main item (which can certainly include internal spaces). I considered doing something with a "cut" first, but didn't work it out.

Is there some better (or clearer?) trick to doing this? Perhaps different ideas using newer regex features?

Re: Regex: not the trailing whitespace
by ikegami (Patriarch) on Apr 23, 2011 at 05:36 UTC

    Just delete the trailing whitespace first.

    while (<>) { s/\s+\z//; ... }

    The second is a look-back assertion so the final trailing whitespace won't just be eaten by the main item

    Huh? /[*#]+:?/ cannot eat the trailing whitespace.

    So all you need is

    / ^ \s* ([*#]+:?) \s* (.*\S)? \s* $ /sx

    (Is some backtracking protection needed?)

      Huh? /[*#]+:?/ cannot eat the trailing whitespace.
      No, I mean the (.*) would. The leading string of bullets is the prefix, and the content of the line after that is the "main item".

        No, I mean the (.*) would.

        There is no /(.*)/ in your code. If you mean the /(.*\S)/, it can't eat the trailing whitespace either.
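
A minimal sketch of the point being argued, not code from the thread: because the capture group must end in \S, greedy backtracking leaves the trailing whitespace for the following \s*, so no look-behind is needed. The helper name parse_item and the sample lines are assumptions made up for illustration.

```perl
use strict;
use warnings;

# parse_item is a hypothetical helper wrapping the simplified pattern.
sub parse_item {
    my ($line) = @_;
    # (.*\S)? must end on a non-space, so trailing whitespace
    # is left for the final \s* and never lands in the capture.
    return $line =~ / ^ \s* ([*#]+:?) \s* (.*\S)? \s* $ /sx;
}

# Hypothetical sample lines; the trailing spaces and tab are deliberate.
my @lines = (
    "* first item   ",
    "  ## nested item\t ",
    "#: note with  internal  spaces",
    "*    ",
);

for my $line (@lines) {
    my ($prefix, $content) = parse_item($line);
    # $content is undef when nothing follows the prefix.
    printf "prefix=[%s] content=[%s]\n", $prefix, $content // '';
}
```

Note that with this pattern a bare prefix such as "*    " leaves $content undefined rather than empty, which the caller has to allow for.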

Re: Regex: not the trailing whitespace
by wind (Priest) on Apr 23, 2011 at 06:40 UTC
    Just rely on greedy and non-greedy matching to eat up all the spaces where appropriate. You don't need any special new features, or even any additional boundary conditions:
    my ($prefix, $content) = $line =~ / ^ \s* ([*#]+:?) \s* (.*?) \s* $ /x;
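
As a rough illustration of the pattern above (the helper name parse_item_lazy and the sample lines are hypothetical), the non-greedy (.*?) grows only as far as it must for the trailing \s* $ to match, so the capture stops at the last non-space character:

```perl
use strict;
use warnings;

# parse_item_lazy is a hypothetical helper wrapping the non-greedy pattern.
sub parse_item_lazy {
    my ($line) = @_;
    return $line =~ / ^ \s* ([*#]+:?) \s* (.*?) \s* $ /x;
}

# Hypothetical sample lines with trailing spaces and a trailing tab.
for my $line ("* first item   ", "#: a  b\t", "*   ") {
    my ($prefix, $content) = parse_item_lazy($line);
    # Here $content is an empty string, not undef, when nothing follows the prefix.
    printf "prefix=[%s] content=[%s]\n", $prefix, $content;
}
```

One practical difference from the (.*\S)? form: when the item is missing, this version captures an empty string instead of leaving the capture undefined.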
      I see. I don't remember exactly what I was thinking, but I suppose using a \S helps it optimize. OTOH, it might make it harder for the engine to understand. I should try some benchmarks with and without.
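
One way such a benchmark might be set up, sketched with the core Benchmark module. The variable names and the test line are assumptions chosen for illustration, and the timings will depend on the Perl version and on the shape of the input, so this shows only the setup and makes no claim about which pattern wins.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The two patterns from the replies above, precompiled.
my $re_greedy = qr/ ^ \s* ([*#]+:?) \s* (.*\S)? \s* $ /sx;
my $re_lazy   = qr/ ^ \s* ([*#]+:?) \s* (.*?)   \s* $ /x;

# Hypothetical test input: a long item with internal and trailing spaces.
my $line = "## " . join(' ', ('word') x 50) . (' ' x 20);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    greedy => sub { my @m = $line =~ $re_greedy },
    lazy   => sub { my @m = $line =~ $re_lazy },
});
```

It is probably worth repeating the run with short lines and with lines that have no content after the prefix, since the relative cost of backtracking changes with the input.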