in reply to removing redundantwhitespace

I can't believe no one suggested tr/\r\n\t / /s, which handles all but the final piece of whitespace.
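As a minimal sketch of what that transliteration does (the sub name `squeeze_ws` and the sample string are made up for illustration), every CR, LF, tab, and space maps to a space, and /s squeezes each run down to one:

```perl
use strict;
use warnings;

# Hypothetical helper: map CR, LF, tab, and space to a space,
# then squeeze each run of translated characters into one (/s).
sub squeeze_ws {
    my ($text) = @_;
    (my $copy = $text) =~ tr/\r\n\t / /s;
    return $copy;
}

# Brackets make any remaining edge whitespace visible.
print '[', squeeze_ws("  one \t two\r\n\nthree  \n"), "]\n";
```

Runs at either end of the string are squeezed to a single space rather than removed, so a separate trim is still needed.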

Re^2: removing redundantwhitespace (too far)
by tye (Sage) on Sep 14, 2008 at 06:00 UTC

    It doesn't preserve line breaks (i.e., remove only the lines that contain nothing but whitespace). Nor does it remove leading whitespace.

    Below is a fairly simple, single-pass regex that handles all but leading spaces on the first line, so I've added a very simple regex before it:

    s/^\s+//; s{ [^\S\n]* (?: (\n)\s* | [^\S\n]+ ) }{ $1 || ' ' }gex
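A small sketch of how the pair of substitutions above might be exercised; the sub name `tidy_lines` and the sample text are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two substitutions quoted above.
sub tidy_lines {
    my ($text) = @_;
    $text =~ s/^\s+//;    # leading whitespace before the first line's text
    # Either: optional horizontal space, a newline, then any whitespace
    #         (blank lines and the next line's indent) -> one newline.
    # Or:     a run of horizontal whitespace           -> one space.
    $text =~ s{ [^\S\n]* (?: (\n)\s* | [^\S\n]+ ) }{ $1 || ' ' }gex;
    return $text;
}

print tidy_lines("  one   two  \n\n \n   three\tfour \n");
```

Here `[^\S\n]` means "whitespace other than a newline", which is what lets the line breaks survive while everything around them is collapsed.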

    OT, but the node title made me wonder if there was a reasonable single-pass regex for removing leading and trailing whitespace while collapsing internal whitespace. I can see a lot of approaches that will work, but most seem to get bogged down in unfortunate complexities. Ignoring warnings lets me do:

    s{(?<=(\S))?\s+(?=(\S))?}{length($1.$2)?'':' '}gx

    Requiring Perl 5.010 means I don't have to ignore warnings:

    s{(?<=(\S))?\s+(?=(\S?))}{length(($1//'').$2)?'':' '}gx

    Surely we can do better than that. Oh, again requiring 5.010, I can do this:

    s{(^)?\s+(\z)?}{$1//$2//' '}gx

    That's not too bad. (:

    - tye        

      The first one works except it leaves the trailing newline:

      one two three four five six

      The second one (with missing "e" added) fails:

      onetwothreefourfivesix

      The third one (with missing "e" added) fails:

      onetwothreefourfivesix

      The fourth one (with missing "e" added) produces:

      one two three four five six

      but I think that's what you were going for?

        "The first one works except it leaves the trailing newline:"

        It was supposed to leave a trailing newline.

        The other ones are, of course, meant to do something different. Besides omitting the /e, I had the test backward. The real problem, though, is that reversing the test also means exchanging "and" and "or", which leaves this approach "bogged down in unfortunate complexities", as usual.

        Yes, the last one appears to do what was intended. I didn't have 5.010 handy at the time I wrote that.
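For reference, a sketch of that last substitution with the missing /e added, wrapped in a sub whose name (`trim_collapse`) is made up here; it assumes Perl 5.010 or later for the `//` operator:

```perl
use strict;
use warnings;
use 5.010;    # defined-or (//) and say

# Hypothetical wrapper: trim both ends and collapse inner whitespace.
sub trim_collapse {
    my ($text) = @_;
    # $1 is defined (as '') only for a run at the start of the string,
    # $2 is defined (as '') only for a run at the very end;
    # otherwise both are undef and the run becomes a single space.
    $text =~ s{(^)?\s+(\z)?}{$1 // $2 // ' '}gex;
    return $text;
}

say '[', trim_collapse("  one \t two\n three  \n"), ']';
```

The trick is that an optional group that matched nothing at all is undef, while one that matched a zero-width assertion is the empty string, and `//` distinguishes the two.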

        I'll continue to wonder if there is a reasonable (and, ideally, warning-free) single-pass regex for this that doesn't require 5.010 features.

        Note that the (even less compelling) translation to pre-5.010 would be:

        s{(^)?\s+(\z)?}{ defined $1 || defined $2 ? '' : ' ' }gex

        and that this even leaves a trailing space on 5.8.3 because of a quirk (aka "bug") there:

        $ perl5.8.3 -del
          DB<1> x "hi" =~ /i\z/
        0  1
          DB<2> x "hi" =~ /(i)(\z)?/
        0  'i'
        1  undef
        $ perl5.10 -del
          DB<1> x "hi" =~ /(i)(\z)?/
        0  'i'
        1  ''
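A sketch of the pre-5.010 form in use, together with a check of the capture behaviour shown in the debugger session; the sub name `trim_collapse_compat` is hypothetical, and the results described assume a perl where `(\z)?` captures an empty string (5.10 or later, per the session above):

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pre-5.010 translation.
sub trim_collapse_compat {
    my ($text) = @_;
    # A run anchored at the start or the end is dropped; any other
    # run of whitespace collapses to a single space.
    $text =~ s{(^)?\s+(\z)?}{ defined $1 || defined $2 ? '' : ' ' }gex;
    return $text;
}

# Does the optional (\z) group come back defined on this perl?
my @caps = "hi" =~ /(i)(\z)?/;
print defined($caps[1]) ? "group 2 is defined\n" : "group 2 is undef\n";

print '[', trim_collapse_compat("  one  two  "), "]\n";
```

On a perl showing the 5.8.3 behaviour, the second group would come back undef and the trailing run would be replaced by a space instead of being removed.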

        Thanks for pointing those issues out.

        - tye