pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I have a long file from which I want to remove a single linebreak before all lines starting with the string "= = = =".

Sounds like a trivial task, or so I thought and wrote the following Perl script:

#!/usr/bin/perl -w

use strict;

my $lines = join "", <>;
$lines =~ s/\n\n=/=/gm;
print $lines;

To my disappointment the substitution did not work. I tried a whole bunch of variations on the regex but still could not get it to work. It appears as if the replacement works OK when I print the result to the console but not when I redirect the output into a file. Can that be?

Any ideas what is going wrong here?

Thanks in advance -

Pat

P. S.: Sorry I did not get the <code> tags to work either. Quite obviously, this is my first post.

Re: Why does my Perl regex substitution for linebreak fail?
by kyle (Abbot) on Mar 05, 2008 at 22:22 UTC
    my $lines = join "", <DATA>;
    $lines =~ s/\n\n=/=/gm;
    print $lines;
    __DATA__
    line 1

    ======
    line after break
    line

    Produces:

    line 1======
    line after break
    line

    That's what I'd expect, but maybe it's not what you wanted.

    If you want to remove a blank line before the marker, do s/\n\n=/\n=/gm. Then the output is:

    line 1
    ======
    line after break
    line

    You can do s/\n=/=/gm (which sounds like what you describe), but that will produce output like the first output when there's no blank line before the marker.
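To make the difference between the three substitutions concrete, here is a minimal sketch that runs each of them over the same sample. The helper name `apply_variant` and its labels are made up for illustration; the `m` modifier is left out because none of the patterns uses `^` or `$`.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one of the substitutions discussed above
# to a copy of the text and return the result.
sub apply_variant {
    my ($text, $variant) = @_;
    if    ($variant eq 'both')   { $text =~ s/\n\n=/=/g;   }  # removes both newlines
    elsif ($variant eq 'blank')  { $text =~ s/\n\n=/\n=/g; }  # removes only the blank line
    elsif ($variant eq 'single') { $text =~ s/\n=/=/g;     }  # removes one newline before the marker
    return $text;
}

# Sample with a blank line before the marker (an assumption about the data).
my $sample = "line 1\n\n======\nline after break\nline\n";
for my $variant (qw(both blank single)) {
    print "--- $variant ---\n", apply_variant($sample, $variant);
}
```

On input that has a blank line before the marker, 'blank' and 'single' give the same result; they differ only when the marker directly follows a text line.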

    As an aside, you can avoid reading in the whole file by setting the input record separator.

    $/ = '=';
    while (<DATA>) {
        s/\n\n=/\n=/m;
        print;
    }
    __DATA__
    line 1

    ======
    line after break
    line

    Produces...

    line 1
    ======
    line after break
    line

    See perlvar for info about $/ (aka $INPUT_RECORD_SEPARATOR if you use English).

      The m modifier is useless since ^, $, etc. aren't used. In fact, why aren't you using \z when you change the IRS? And why not use "\n\n=" as the IRS?
      local $/ = "\n\n=";
      while (<DATA>) {
          s/\n\n=\z/\n=/;
          print;
      }

      I agree. The original regex, as posed by pat_mc, removes both \n rather than just the single one that pat_mc said should be removed.

      I also presume that pat_mc (based upon the inquiry) is looking for regex solutions, but several of the other nodes in this thread have some good ideas for alternatives to the regex approach.

      ack Albuquerque, NM
      Thanks, kyle, for drawing my attention to the use of the IRS, an aspect of file handling in Perl I was unaware of so far.
Re: Why does my Perl regex substitution for linebreak fail?
by igelkott (Priest) on Mar 05, 2008 at 22:43 UTC
    ... OK when I print the result to the console but not when I redirect the output into a file ...

    Could your file actually have \r\n (windows-based) line-endings? Could get different terminal behavior if running cygwin with unix line-endings?

    If you have a unix-like system available, might try pushing a small bit of your processed and unprocessed file through "od". I sometimes use something like "tail -3 foo | od -bc" to keep from getting fooled by "friendly" systems.

      For mostly-printable files the output of "tail -3 foo | cat -A" is less cluttered.
      Thanks, igelkott, for addressing the console part of my post. Can you please explain to me in more basic terms what your suggestion is? I am fairly new to Linux and hence don't quite understand what the issue is you are pointing at. The file I intend to operate on, however, has been generated with the 'cat' command in the shell concatenating other files generated under Linux. Not sure, therefore, if the inter-operating-system issue applies here. Thanks again - Pat
        "tail -3 foo | od -bc" means to take the last three lines from "foo" and feed them to the "od" command with the "b" and "c" options.

        I'll presume that the first part is either clear or is reasonably easy to look up; "od" is the weird part. This is named for "octal dump" (option b) but I'm using it here to also get the character names (option c). In particular, to reveal the non-printing characters.
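The same kind of check can be sketched in Perl itself. This is only an illustration, not a replacement for "od": the helper name `show_invisible` and the sample string are made up, and only CR, LF and TAB are spelled out.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $text in which CR, LF and TAB
# are replaced by the visible two-character sequences \r, \n and \t.
sub show_invisible {
    my ($text) = @_;
    my %name = ("\r" => '\r', "\n" => '\n', "\t" => '\t');
    $text =~ s/([\r\n\t])/$name{$1}/g;
    return $text;
}

# Made-up sample with Windows-style (CRLF) line endings.
my $sample = "line 1\r\n\r\n====\r\n";
print show_invisible($sample), "\n";
```

If a `\r` shows up in front of the `\n` in your own data, the file has CRLF line endings.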

      igelkott -

      Your answer got right to the core of the issue. I searched for \r and got matches in exactly those lines which resisted the replacement. What exactly is this \r character, anyway? I have no idea how that \r entered my fully Linux-based and Linux-generated file.

      Any thoughts on this?

      Thanks again for shedding some light on this.

      Cheers -

      Pat
        Line-endings: \r and \n (CR and LF)
        •  \n -> unix
        •  \r -> mac (classic Mac OS, before OS X)
        • \r\n -> pc

        Exactly how PC line endings got into your file, I couldn't say, but I would guess that the data has passed through a Windows machine at some time. Some file transfer methods take care of line endings and others don't.
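With CRLF line endings, a blank line before the marker reads "\r\n\r\n=", so the two "\n" characters are separated by a "\r" and the pattern \n\n= cannot match, while \n= still can. A minimal sketch of one way around this, assuming it is acceptable to convert the whole text to Unix line endings first (the helper names and the sample are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF and lone CR line endings to LF.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Hypothetical helper: normalize line endings, then remove the blank
# line in front of every line that starts with "=".
sub remove_blank_before_marker {
    my ($text) = @_;
    $text = normalize_newlines($text);
    $text =~ s/\n\n=/\n=/g;
    return $text;
}

# Made-up sample with CRLF line endings and a blank line before the marker.
my $crlf_sample = "line 1\r\n\r\n====\r\nline 2\r\n";
print remove_blank_before_marker($crlf_sample);
```

Note that the output of this sketch has Unix line endings throughout; if the CRLF endings must be kept, the "\r" has to be allowed for in the pattern instead.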

Re: Why does my Perl regex substitution for linebreak fail?
by halfcountplus (Hermit) on Mar 05, 2008 at 23:18 UTC
    what you are literally asking for ('I have a long file from which I want to remove a single linebreak before all lines starting with the string "= = = =".') is this:

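    #!/usr/bin/perl
    use strict;

    my $l = '';                  # previous line, held back until we see what follows it
    while (<DATA>) {
        # If this line starts with the marker, drop the previous line's newline.
        if ($_ =~ /^====/) { chomp $l; }
        print $l;
        $l = $_;
    }
    print $l;                    # flush the last held-back line
    __DATA__
    one
    two
    ====three
    four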
Re: Why does my Perl regex substitution for linebreak fail?
by graff (Chancellor) on Mar 06, 2008 at 03:39 UTC
    You say you want to remove a single linebreak before all lines starting with the string "= = = =", but your snippet would remove two linebreaks ("\n\n" is replaced with nothing). Just curious about that.

    Anyway, I think others have already given good ideas. Here's another one, that doesn't require holding the entire file in memory at once (unless of course the file does not actually contain any instance of "\n===="):

    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "\n====";
    while (<>) {
        s/\n====$/====/;
        print;
    }
    Setting the INPUT_RECORD_SEPARATOR ($/, see perlvar) like that makes things very simple. If the file happens to have CRLF line termination, you may need to set $/ to "\r\n====" (and include "\r" in the s/// as well).

    (updated upon realizing that a CRLF file would just need a modified s///; the original $/ setting above would still work fine -- oops! I just noticed that ikegami already posted this idea, as I should have known he would!)
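A minimal sketch of the CRLF-tolerant variant described in the update, keeping $/ as "\n====" and only making the "\r" optional in the substitution. The helper name `join_marker` and the in-memory sample are made up for illustration; a real script would read from <> instead.

```perl
use strict;
use warnings;

# Hypothetical helper: join the marker onto the preceding line of one
# record, whether that line ends in LF or CRLF.
sub join_marker {
    my ($record) = @_;
    $record =~ s/\r?\n====\z/====/;
    return $record;
}

{
    local $/ = "\n====";    # each record ends just after a marker
    # Made-up sample with CRLF line endings, read through an in-memory filehandle.
    my $input = "one\r\ntwo\r\n====three\r\nfour\r\n";
    open my $fh, '<', \$input or die "cannot open in-memory sample: $!";
    while (my $record = <$fh>) {
        print join_marker($record);
    }
    close $fh;
}
```

Records that do not end in the marker, such as the last one in the file, pass through unchanged.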

      Yes, graff, you are right in observing that my regex contains two linebreaks - in contrast to what I actually intended to do. The curious thing is that the regex performs as expected when it should match one linebreak but not when it contains two linebreaks - in that case it appears to do NOTHING at all, although the file definitely does contain several consecutive linebreak-only lines. I am still puzzled and am starting to believe the issue is not due to the Perl side of things but rather an I/O or even a Linux problem. Any conjectures on this one? Thanks again - Pat
        I'm not sure I follow what you are describing there. The best thing to do is to present a minimal script and data set that still (even after what you've learned) produces results that you consider to be unexpected, and point out how it differs from what you would expect.