Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I have the following text -
one two three four five six

which I need to convert to -
one two three four five six

How might I implement a regular expression in Perl to do this?
Thanks
Andy

Replies are listed 'Best First'.
Re: removing redundantwhitespace
by johngg (Canon) on Sep 13, 2008 at 14:19 UTC
    You could combine a split, a map containing regex substitutions, and a join.

    use strict;
    use warnings;

    my $string = qq{one two \n\n three\t\t\n four five\n\n\n\t six \n\n};

    my $newString =
        join qq{\n},
        map {
            s{^\s+}{};
            s{\s+$}{};
            s{\s+}{ }g;
            $_;
        }
        split m{\n+}, $string;

    print qq{->$newString<-\n};

    The output.

    ->one two
    three
    four five
    six<-

    I hope this is useful.

    Cheers,

    JohnGG

    Update: Just realised that the third substitution in the map should be global. Corrected.

Re: removing redundantwhitespace
by Anonymous Monk on Sep 13, 2008 at 13:10 UTC
      Hi,
      I've looked at these but they don't help me when my string contains "\n".
      So my example again -
      one two   \n\n   three\t\t\n four    five\n\n\n\t six \n\n
      needs to become-
      one two\nthree\nfour five\nsix
        It seems to me that you want a run of whitespace that includes a newline to become a single newline, and a run of spaces to become a single space. It's not clear to me what you want to happen to a run of whitespace that includes a tab, but no newline; let's suppose that you want that to collapse to just a space, too. If your entire string (including newlines) is in the scalar $string, then
        $string =~ s/\s*\n\s*/\n/g;
        $string =~ s/[^\S\n]+/ /g;
        should do it. This will not destroy leading and trailing whitespace; for that, you should use ikegami's solution, or read again the links in Re: removing redundantwhitespace.
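        A minimal, self-contained sketch of the two-pass idea described above. It uses [^\S\n]+ (whitespace other than a newline) in the second pass, because a plain \s+ would also match the newlines kept by the first pass and turn them into spaces. The helper name collapse_ws and the final trimming step are additions for illustration and are not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch only).
sub collapse_ws {
    my ($string) = @_;

    # Pass 1: any run of whitespace containing a newline becomes one newline.
    $string =~ s/\s*\n\s*/\n/g;

    # Pass 2: remaining runs of spaces/tabs (no newline) become one space.
    $string =~ s/[^\S\n]+/ /g;

    # Extra step, not in the two-liner above: trim both ends of the string.
    $string =~ s/\A\s+//;
    $string =~ s/\s+\z//;

    return $string;
}

my $string = "one two   \n\n   three\t\t\n four    five\n\n\n\t six \n\n";
print '->', collapse_ws($string), "<-\n";
```

        With the sample string from the question this should give the four lines "one two", "three", "four five" and "six" with no leading or trailing whitespace.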
Re: removing redundantwhitespace
by GrandFather (Saint) on Sep 13, 2008 at 21:00 UTC
    $str =~ s/((^|\n| |\t)(?: |\t|(?<! |\t)\n)*)/$2/gs;

    fixes most cases but doesn't convert a tab to a space if it is the first character in a white space string of characters.


    Perl reduces RSI - it saves typing
      This doesn't collapse "a \nb" into "a\nb". It's difficult to tell from the desired output (because I can't see spaces at the ends of lines …), but it seems that the poster may have wanted that to happen.

        I thought I caught that case but, as you suggest, the white space on the end of the line is hard to see! The following simplifies the regex and fixes that case at the cost of complicating the substitution:

        $str =~ s/((^|\s)\s*)/length ($2) ? (-1 < index ($1, "\n") ? "\n" : ' ') : ''/ges;

        Oh, and it replaces tabs with spaces.


Re: removing redundantwhitespace
by chromatic (Archbishop) on Sep 14, 2008 at 04:47 UTC

    I can't believe no one suggested tr/\r\n\t / /s, which handles all but the final piece of whitespace.
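    A small sketch of what that tr does to the sample string from the question (the variable names are only for illustration). With the /s modifier, every run of the listed characters is squeezed to a single space, so newlines are flattened too and one trailing space remains.

```perl
use strict;
use warnings;

my $text = "one two   \n\n   three\t\t\n four    five\n\n\n\t six \n\n";

# Copy first, then translate the copy; /s squeezes each run to one space.
(my $squeezed = $text) =~ tr/\r\n\t / /s;

print "->$squeezed<-\n";   # ->one two three four five six <-
```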

      It doesn't handle the preservation of line breaks (i.e., only removing lines that contain nothing but whitespace). Nor does it remove leading whitespace.

      Below is a fairly simple, single-pass regex that handles all but leading spaces on the first line, so I've added a very simple regex before it:

      s/^\s+//; s{ [^\S\n]* (?: (\n)\s* | [^\S\n]+ ) }{ $1 || ' ' }gex

      OT, but the node title made me wonder if there was a reasonable single-pass regex for removing leading and trailing whitespace while collapsing internal whitespace. I can see a lot of approaches that will work, but most seem to get bogged down in unfortunate complexities. Ignoring warnings lets me do:

      s{(?<=(\S))?\s+(?=(\S))?}{length($1.$2)?'':' '}gx

      Requiring Perl 5.010 means I don't have to ignore warnings:

      s{(?<=(\S))?\s+(?=(\S?))}{length(($1//'').$2)?'':' '}gx

      Surely we can do better than that. Oh, again requiring 5.010, I can do this:

      s{(^)?\s+(\z)?}{$1//$2//' '}gx

      That's not too bad. (:

      - tye        

        The first one works except it leaves the trailing newline:

        one two three four five six

        The second one (with missing "e" added) fails:

        onetwothreefourfivesix

        The third one (with missing "e" added) fails:

        onetwothreefourfivesix

        The fourth one (with missing "e" added) produces:

        one two three four five six

        but I think that's what you were going for?

Re: removing redundantwhitespace
by ikegami (Patriarch) on Sep 13, 2008 at 13:07 UTC
      Ug, ignore the parent post. I missed the removal of blank lines and the removal of leading spaces. Here's the straightforward way, when reading from a file:
      while (<$fh>) {
          chomp;
          s/^\s+//;
          next if !length;
          s/\s+$//;
          s/\s+/ /g;
          ...
      }
        Thanks, but I'm not actually reading from a file. I have this text in a single variable returned from a driver...
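        One possible way to reuse the line-by-line loop when the text is already in a scalar is to open a read-only in-memory filehandle on a reference to that scalar (supported by Perl's open since 5.8). This is only a sketch under that assumption and was not proposed in the thread itself; the helper name tidy_lines is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: runs the per-line cleanup over a string.
sub tidy_lines {
    my ($text) = @_;

    # Open a filehandle that reads from the scalar instead of a file.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    my @lines;
    local $_;                 # keep the caller's $_ untouched
    while (<$fh>) {
        chomp;
        s/^\s+//;             # strip leading whitespace
        next if !length;      # skip lines that are now empty
        s/\s+$//;             # strip trailing whitespace
        s/\s+/ /g;            # collapse internal runs to one space
        push @lines, $_;
    }
    close $fh;

    return join "\n", @lines;
}

my $text = "one two   \n\n   three\t\t\n four    five\n\n\n\t six \n\n";
print tidy_lines($text), "\n";
```

        Collecting the cleaned lines and joining them with "\n" keeps one line break between non-empty lines, which matches the output requested in the question.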