in reply to Re: removing redundantwhitespace
in thread removing redundantwhitespace

It doesn't preserve line breaks (i.e., remove only those lines that contain nothing but whitespace), nor does it remove leading whitespace.

Below is a fairly simple, single-pass regex that handles all but leading spaces on the first line, so I've added a very simple regex before it:

s/^\s+//; s{ [^\S\n]* (?: (\n)\s* | [^\S\n]+ ) }{ $1 || ' ' }gex
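
To make the behavior of the pair of substitutions above concrete, here is a minimal sketch that wraps them in a sub and runs them on a small sample. The sub name and the sample string are made up for illustration; only the two substitutions come from the post.

```perl
use strict;
use warnings;

# Hypothetical helper name; the two substitutions are the ones quoted above.
sub squeeze_keep_newlines {
    my ($text) = @_;

    # Strip leading whitespace at the very start of the string.
    $text =~ s/^\s+//;

    # Any run of whitespace that contains a newline becomes one newline
    # (which also drops whitespace-only lines); any other run of
    # horizontal whitespace becomes a single space.
    $text =~ s{ [^\S\n]* (?: (\n)\s* | [^\S\n]+ ) }{ $1 || ' ' }gex;

    return $text;
}

my $sample = "  one   two \n\n \t three\tfour  \n";

# Prints "one two" and "three four", each followed by a newline.
print squeeze_keep_newlines($sample);
```

Note that a run of spaces at the very end of the string with no newline after it would still be reduced to one space rather than removed, since only the newline branch drops horizontal whitespace.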

OT, but the node title made me wonder if there was a reasonable single-pass regex for removing leading and trailing whitespace while collapsing internal whitespace. I can see a lot of approaches that will work, but most seem to get bogged down in unfortunate complexities. Ignoring warnings lets me do:

s{(?<=(\S))?\s+(?=(\S))?}{length($1.$2)?'':' '}gx

Requiring Perl 5.010 means I don't have to ignore warnings:

s{(?<=(\S))?\s+(?=(\S?))}{length(($1//'').$2)?'':' '}gx

Surely we can do better than that. Oh, again requiring 5.010, I can do this:

s{(^)?\s+(\z)?}{$1//$2//' '}gx

That's not too bad. (:
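
As a rough illustration of that last form, here is a small sketch with the /e modifier added so that the replacement is evaluated as code instead of being inserted literally. The sub name and sample input are hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # for the defined-or operator //

# Hypothetical helper name; the substitution is the last one above plus /e.
sub trim_collapse {
    my ($text) = @_;

    # $1 is defined (as '') only when the run starts at the beginning of
    # the string, $2 only when it runs to the very end. In either case the
    # run is deleted; otherwise it collapses to a single space.
    $text =~ s{(^)?\s+(\z)?}{$1//$2//' '}gex;

    return $text;
}

# Prints "[one two three]" on Perl 5.10 or later.
print "[", trim_collapse("  one   two\n three  \n"), "]\n";
```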

- tye        

Re^3: removing redundantwhitespace (too far)
by ikegami (Patriarch) on Sep 14, 2008 at 15:59 UTC

    The first one works except it leaves the trailing newline:

    one two three four five six

    The second one (with missing "e" added) fails:

    onetwothreefourfivesix

    The third one (with missing "e" added) fails:

    onetwothreefourfivesix

    The fourth one (with missing "e" added) produces:

    one two three four five six

    but I think that's what you were going for?

      "The first one works except it leaves the trailing newline:"

      It was supposed to leave a trailing newline.

      The other ones are, of course, meant to do something different. Besides omitting the /e, I had the test backward. But the real problem with that is that reversing the test means reversing "and" vs "or" which leaves this approach "bogged down in unfortunate complexities", as usual.

      Yes, the last one appears to do what was intended. I didn't have 5.010 handy at the time I wrote that.

      I'll continue to wonder if there is a reasonable (and, ideally, warning-free) single-pass regex for this that doesn't require 5.010 features.

      Note that the (even less compelling) translation to pre-5.010 would be:

      s{(^)?\s+(\z)?}{ defined $1 || defined $2 ? '' : ' ' }gex

      and that this even leaves a trailing space on 5.8.3 because of a quirk (aka "bug") there:

      $ perl5.8.3 -del
        DB<1> x "hi" =~ /i\z/
      0  1
        DB<2> x "hi" =~ /(i)(\z)?/
      0  'i'
      1  undef
      $ perl5.10 -del
        DB<1> x "hi" =~ /(i)(\z)?/
      0  'i'
      1  ''
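
      The same probe can be run as a script instead of in the debugger. This is only a sketch: the sub name and variable names are made up, and the substitution is the pre-5.010 form quoted above. On a perl with the quirk, the probe reports 0 and the sample keeps a trailing space.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper name; the substitution is the pre-5.010 form above.
      sub trim_collapse_pre510 {
          my ($text) = @_;
          $text =~ s{(^)?\s+(\z)?}{ defined $1 || defined $2 ? '' : ' ' }gex;
          return $text;
      }

      # Same probe as the debugger session: is the optional (\z) group
      # defined when it matches the empty string at the end of the input?
      my @captures = "hi" =~ /(i)(\z)?/;
      my $end_group_defined = defined $captures[1] ? 1 : 0;

      print "optional (\\z) group defined: $end_group_defined\n";
      print "[", trim_collapse_pre510("  one   two  "), "]\n";
      ```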

      Thanks for pointing those issues out.

      - tye