olivier has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am fairly new to Perl, and I have been trying to use regular expressions in a Perl one-liner to replace \r\n (or \n on Unix) with a single space, thereby joining lines. It is, however, a bit more involved than that, because the joining should only happen under certain conditions. Here is the kind of text file that I am dealing with:

line de Bitte lesen Sie die Gebrauchsanweisung gut durch und bewahren Sie diese auf. en Please read these instructions carefully and keep for future reference. fr Lire et conserver ces instructions pour toute utilisation future du produit. line de 19 Zubehoerteile. en 19 Accessories. fr 19 Accessoires line de 10 Aktivitaeten. en 10 Activities. fr 10 Activite's.

I have a whole list of text files with text in German, French and English. All lines containing meaningful text are preceded by 4 spaces. However, I would like to have only a single line of text under each language wherever the text for that language spans multiple lines (a line being at most 76 characters long at present). In other words, the above would become:

line de Bitte lesen Sie die Gebrauchsanweisung gut durch und bewahren Sie diese auf. en Please read these instructions carefully and keep for future reference. fr Lire et conserver ces instructions pour toute utilisation future du produit. line de 19 Zubehoerteile. en 19 Accessories. fr 19 Accessoires line de 10 Aktivitaeten. en 10 Activities. fr 10 Activite's.

Language text that already spans only one line is left untouched, whereas language text that spans more than one line is joined. I have tried several regular expressions, all of them as Perl one-liners, as I thought it would be an extra challenge to do it all on one line, such as

perl -i.bak -pe 's/^\s\s\s\s(.*)$\n\s\s\s\s/$1/g' myfile.txt
perl -i.bak -pe 's/\n\s\s\s\s/ /gm' myfile.txt

However, none of my regular expressions works. It looks as if anything specified after the \n is ignored. Any ideas? Any help would be immensely appreciated. Thank you very much

Re: replace \n with space to join indented lines
by ikegami (Patriarch) on Dec 05, 2006 at 06:32 UTC

    -p processes the file one line at a time, while both of your regexps rely on matching two lines at a time. There are other problems with your regexps, but that's the biggest. It can be solved by working with more than one line at a time.
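A small sketch of why the line-at-a-time loop never matches: each line handed to the regex ends at its own newline, so there is nothing after the \n for \s\s\s\s to match. The sample text and variable names are made up for illustration, and the in-memory filehandle stands in for the file that -p would read.

```perl
use strict;
use warnings;

# Hypothetical sample: one paragraph wrapped over two indented lines.
my $text = "    first half\n    second half\n";

# Line-at-a-time, as -p does: every line ends at its own "\n",
# so no indent follows the newline within a single line.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $per_line_hits = 0;
while (my $line = <$fh>) {
    $per_line_hits++ if $line =~ /\n\s\s\s\s/;
}
close $fh;

# Whole text at once: the first newline is followed by the next line's indent.
my $slurped_hits = () = $text =~ /\n\s\s\s\s/g;

print "per-line matches: $per_line_hits\n";
print "slurped matches:  $slurped_hits\n";
```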

    my $file = do { local $/; <> };
    1 while $file =~ s/^([ ]{4}[^\n]+)\n[ ]{4}/$1/m;
    print($file);

    Or better yet:

    my $file = do { local $/; <> };
    1 while $file =~ s/^(([ ]+)(?![ ])[^\n]+)\n\2(?![ ])/$1/m;
    print($file);

    Note: The second will unwrap all paragraphs, no matter how far they are indented.

    Note: Both work with and without -i.

    Note: 1 while s///; is used in order to repeatedly match the same line. s///g; won't work when the paragraph is more than two lines long.
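The difference between the two forms can be seen on a three-line paragraph. This is only a sketch: the sample text and variable names are hypothetical, and the replacement here is "$1 " with a trailing space (an assumption taken from the original question's goal of joining with a single space), whereas the substitution above uses $1 alone.

```perl
use strict;
use warnings;

# Hypothetical three-line paragraph, each line indented by four spaces.
my $para = "    one\n    two\n    three\n";

# Single pass with /g: the first match consumes the indent of line two,
# so ^[ ]{4} cannot match at the start of that line again.
(my $with_g = $para) =~ s/^([ ]{4}[^\n]+)\n[ ]{4}/$1 /mg;

# Repeat a non-/g substitution until it stops matching;
# each pass restarts from the beginning of the string.
my $looped = $para;
1 while $looped =~ s/^([ ]{4}[^\n]+)\n[ ]{4}/$1 /m;

print "with /g:\n$with_g";
print "with 1 while:\n$looped";
```

The /g version leaves "three" on its own line, while the loop joins all three lines into one.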

      Thank you ikegami. The problem is that I do have files with paragraphs that contain more than 2 lines (sorry, I should have made that clearer in my original post). Is there a way in which your solution could be invoked as a one-liner? Thank you

        If you want a one-liner, put it all on one line!

        perl -i.bak -e "local $/; $_=<>; 1 while s/^(([ ]+)(?![ ])[^\n]+)\n\2(?![ ])/$1/m; print" myfile.txt
Re: replace \n with space to join indented lines
by jwkrahn (Abbot) on Dec 05, 2006 at 07:35 UTC
    If you want a one-liner then this may be what you want:
    perl -lpe'BEGIN { $/ = "\n    " } $\ = y/\n// ? $/ : $"'
      Thank you very much jwkrahn. Your code works a treat. However, I do not understand it. Would you mind explaining it step by step please? Thank you olivier
        perl -lpe'
        Set up a while loop that automatically prints the current line and, with the -l switch, chomps the input and appends the output record separator.
        BEGIN { $/ = "\n    " }
        Change the input record separator to a newline followed by four spaces.
        $\ = y/\n// ? $/ : $"'
        With the chomped line (after the input record separator has been removed), count the number of newlines. If there are no newlines then set the output record separator to the list separator (a single space), otherwise set the output record separator to the input record separator.
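The same steps can be written out as a longer script, which may be easier to follow. This is a sketch rather than the one-liner itself: the sample input, the in-memory filehandle and the variable names are hypothetical, and the label lines ("line", "de", "en") are assumed to be indented by fewer than four spaces.

```perl
use strict;
use warnings;

# Hypothetical input; label lines are assumed to have fewer than four
# leading spaces, text lines exactly four.
my $input = join "\n",
    'line',
    '  de',
    '    first part',
    '    second part',
    '  en',
    '    only line',
    '';

my $output = '';
open my $in, '<', \$input or die "cannot open in-memory file: $!";
{
    local $/ = "\n    ";    # a record ends at newline + four spaces
    while (my $record = <$in>) {
        chomp $record;      # strips a trailing "\n    " if present
        # A newline left inside the record means a label line was crossed,
        # so keep the line break; otherwise join with a single space.
        my $sep = ($record =~ tr/\n//) ? $/ : ' ';
        $output .= $record . $sep;
    }
}
close $in;

print $output;
```

Note that the final record still ends in a plain newline, so it also gets a newline plus four spaces appended after it.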

Re: replace \n with space to join indented lines
by Anonymous Monk on Dec 05, 2006 at 06:21 UTC
    Look up the /s modifier.
      Thanks for this. I did try the /s modifier, but it did not work either. My understanding is that it treats the whole lot as a single line, and that's probably why it was not working.
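For reference, /s only changes what "." matches: with it, "." also matches a newline. It does not change how much input the regex is given, so under -p each line is still seen on its own. A minimal sketch with a made-up string and variable names:

```perl
use strict;
use warnings;

# Hypothetical two-line string.
my $text = "ab\ncd";

# Without /s, "." does not match the newline between "b" and "c".
my $without_s = ($text =~ /b.c/) ? 1 : 0;

# With /s, "." matches the newline as well.
my $with_s = ($text =~ /b.c/s) ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```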