Why does my Perl regex substitution for linebreak fail?

pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Re: Why does my Perl regex substitution for linebreak fail? by kyle (Abbot) on Mar 05, 2008 at 22:22 UTC
`my $lines = join "", <DATA>; $lines =~ s/\n\n=/=/gm; print $lines; __DATA__ line 1 ====== line after break line` Produces: `line 1====== line after break line` That's what I'd expect, but maybe it's not what you wanted. If you want to remove a blank line before the marker, do `s/\n\n=/\n=/gm`. Then the output is: `line 1 ====== line after break line` You can do `s/\n=/=/gm` (which sounds like what you describe), but that will produce output like the first output when there's no blank line before the marker. As an aside, you can avoid reading in the whole file by setting the input record separator. `$/ = '='; while (<DATA>) { s/\n\n=/\n=/m; print; } __DATA__ line 1 ====== line after break line` Produces... `line 1 ====== line after break line` See perlvar for info about `$/` (aka `$INPUT_RECORD_SEPARATOR` if you `use English`).
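To see the three substitutions side by side, here is a minimal sketch. The sample string and the `apply` helper are made up for illustration, and the `/m` modifier is left off because none of the patterns use `^` or `$`.

```perl
use strict;
use warnings;

# Hypothetical sample: a blank line sits before the "======" marker.
my $sample = "line 1\n\n======\nline after break\nline\n";

# Apply one of the three substitutions to a copy of the input.
sub apply {
    my ($text, $mode) = @_;
    if    ($mode eq 'both')  { $text =~ s/\n\n=/=/g;   }  # drop both newlines
    elsif ($mode eq 'blank') { $text =~ s/\n\n=/\n=/g; }  # drop only the blank line
    elsif ($mode eq 'one')   { $text =~ s/\n=/=/g;     }  # drop a single newline
    return $text;
}

for my $mode (qw(both blank one)) {
    # Spell newlines out as \n so the differences are visible on one line.
    (my $shown = apply($sample, $mode)) =~ s/\n/\\n/g;
    print "$mode: $shown\n";
}
```

With a blank line before the marker, `blank` and `one` give the same result; they only differ when the marker directly follows a text line.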
Re^2: Why does my Perl regex substitution for linebreak fail? by ikegami (Patriarch) on Mar 06, 2008 at 02:55 UTC
The `m` modifier is useless since `^`, `$`, etc. aren't used. In fact, why aren't you using `\z` when you change the IRS (input record separator)? And why not use `"\n\n="` as the IRS? `local $/ = "\n\n="; while (<DATA>) { s/\n\n=\z/\n=/; print; }`
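For anyone who wants to try the record-separator approach without a data file, a sketch along these lines should work. It reads from an in-memory filehandle; the sub name `strip_blank_before_marker` and the sample text are assumptions for the example, not part of the original post.

```perl
use strict;
use warnings;

# Read $text record by record, using "\n\n=" as the record separator,
# and collapse the blank line that precedes each marker.
sub strip_blank_before_marker {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n\n=";              # each record ends at "\n\n=" (or at EOF)
    my $out = '';
    while (my $record = <$fh>) {
        $record =~ s/\n\n=\z/\n=/;   # \z anchors at the very end of the record
        $out .= $record;
    }
    close $fh;
    return $out;
}

print strip_blank_before_marker("line 1\n\n======\nline after break\nline\n");
```

Because the separator is the same string the substitution looks for, the match can only occur at the end of a record, which is why `\z` is enough and no `/g` is needed.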
Re^2: Why does my Perl regex substitution for linebreak fail? by ack (Deacon) on Mar 06, 2008 at 03:43 UTC
I agree. The original regex, as posed by pat_mc, removes both \n rather than just the single one that pat_mc said should be removed. I also presume that pat_mc (based upon the inquiry) is looking for regex solutions; but several of the other nodelets in this thread have some good ideas for alternatives to the regex approach. ack Albuquerque, NM
Re^2: Why does my Perl regex substitution for linebreak fail? by pat_mc (Pilgrim) on Mar 06, 2008 at 09:10 UTC
Thanks, kyle, for drawing my attention to the use of the IRS, an aspect of file handling in Perl I was unaware of so far.
Re: Why does my Perl regex substitution for linebreak fail? by igelkott (Priest) on Mar 05, 2008 at 22:43 UTC
"... OK when I print the result to the console but not when I redirect the output into a file ..." Could your file actually have \r\n (Windows-style) line endings? Could you be getting different terminal behavior from running Cygwin with Unix line endings? If you have a Unix-like system available, you might try pushing a small bit of your processed and unprocessed file through "od". I sometimes use something like "`tail -3 foo | od -bc`" to keep from getting fooled by "friendly" systems.
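Where `od` is not at hand, roughly the same check can be sketched in Perl itself. The helper name `show_invisibles` is hypothetical; it spells out carriage returns, newlines and tabs, and shows any other control byte as `\xNN`.

```perl
use strict;
use warnings;

# Return $text with control characters made visible.
sub show_invisibles {
    my ($text) = @_;
    my %name = ("\r" => '\r', "\n" => '\n', "\t" => '\t');
    $text =~ s{([\x00-\x1f\x7f])}
              { exists $name{$1} ? $name{$1} : sprintf('\\x%02x', ord $1) }ge;
    return $text;
}

# A stray carriage return before a newline shows up as \r\n.
print show_invisibles("line 1\r\n\r\n====\n"), "\n";
```

Running a few suspect lines of the file through this makes it obvious whether a `\r` sits in front of the `\n` that the regex is supposed to match.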
Re^2: Why does my Perl regex substitution for linebreak fail? by quester (Vicar) on Mar 06, 2008 at 07:34 UTC
For mostly-printable files the output of "tail -3 foo | cat -A" is less cluttered.
Re^2: Why does my Perl regex substitution for linebreak fail? by pat_mc (Pilgrim) on Mar 06, 2008 at 08:58 UTC
Thanks, igelkott, for addressing the console part of my post. Can you please explain to me in more basic terms what your suggestion is? I am fairly new to Linux and hence don't quite understand what the issue is you are pointing at. The file I intend to operate on, however, has been generated with the 'cat' command in the shell concatenating other files generated under Linux. Not sure, therefore, if the inter-operating-system issue applies here. Thanks again - Pat
Re^3: Why does my Perl regex substitution for linebreak fail? by igelkott (Priest) on Mar 06, 2008 at 18:10 UTC
"`tail -3 foo | od -bc`" means to take the last three lines from "foo" and feed them to the "od" command with the "b" and "c" options. I'll presume that the first part is either clear or is reasonably easy to look up; "od" is the weird part. It is named for "octal dump" (option b), but I'm using it here to also get the character names (option c), in particular to reveal the non-printing characters.
Re^4: Why does my Perl regex substitution for linebreak fail? by pat_mc (Pilgrim) on Mar 06, 2008 at 20:35 UTC
Re^2: Why does my Perl regex substitution for linebreak fail? by pat_mc (Pilgrim) on Mar 06, 2008 at 15:48 UTC
igelkott - Your answer got right to the core of the issue. I searched for \r and got matches in exactly those lines which resisted the replacement. What exactly is this \r character, anyway? I have no idea how that \r entered my fully Linux-based and Linux-generated file. Any thoughts on this? Thanks again for shedding some light on this. Cheers - Pat
Re^3: Why does my Perl regex substitution for linebreak fail? by igelkott (Priest) on Mar 06, 2008 at 18:37 UTC
Line-endings use \r and \n (CR and LF): \n -> unix; \r -> mac; \r\n -> pc. Exactly how pc line-endings got into your file, I couldn't say, but I would guess that the data has passed through a windows machine at some time. Some file transfer methods take care of line-endings and others don't.
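To find out which of these conventions a file actually uses, one option is simply to count them. The following is only a sketch; the sub name `count_line_endings` and its hash keys are made up for the example.

```perl
use strict;
use warnings;

# Count CRLF, bare LF and bare CR line endings in $text.
sub count_line_endings {
    my ($text) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # "\r\n" is tried first so a CRLF pair is counted once, not as CR plus LF.
    while ($text =~ /(\r\n|\r|\n)/g) {
        my $key = $1 eq "\r\n" ? 'crlf'
                : $1 eq "\n"   ? 'lf'
                :                'cr';
        $count{$key}++;
    }
    return \%count;
}

my $count = count_line_endings("one\r\ntwo\nthree\r\n");
print "$_: $count->{$_}\n" for qw(crlf lf cr);
```

A file that reports both `crlf` and `lf` counts above zero has mixed line endings, which would explain a substitution that works on some lines and not on others.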
Re^4: Why does my Perl regex substitution for linebreak fail? by pat_mc (Pilgrim) on Mar 06, 2008 at 20:33 UTC
Re^5: Why does my Perl regex substitution for linebreak fail? by igelkott (Priest) on Mar 06, 2008 at 21:21 UTC
Re: Why does my Perl regex substitution for linebreak fail? by halfcountplus (Hermit) on Mar 05, 2008 at 23:18 UTC
What you are literally asking for ('I have a long file from which I want to remove a single linebreak before all lines starting with the string "= = = =".') is this: `#!/usr/bin/perl use strict; my $l; while (<DATA>) { if ($_ =~ /^====/) {chomp $l;} print $l; $l=$_; } print $l; __DATA__ one two ====three four`
Re: Why does my Perl regex substitution for linebreak fail? by graff (Chancellor) on Mar 06, 2008 at 03:39 UTC
You say you want to remove a single linebreak before all lines starting with the string "= = = =", but your snippet would remove two linebreaks ("\n\n" is replaced with nothing). Just curious about that. Anyway, I think others have already given good ideas. Here's another one, that doesn't require holding the entire file in memory at once (unless of course the file does not actually contain any instance of "\n===="): `#!/usr/bin/perl use strict; use warnings; $/ = "\n===="; while (<>) { s/\n====$/====/; print; }` Setting the INPUT_RECORD_SEPARATOR ($/, see perlvar) like that makes things very simple. If the file happens to have CRLF line termination, you may need to ~~set $/ to "\r\n====" (and~~ include "\r" in the s/// ~~as well)~~. (updated upon realizing that a CRLF file would just need a modified s///; the original $/ setting above would still work fine -- oops! I just noticed that ikegami already posted this idea, as I should have known he would!)
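One way the CRLF-tolerant variant might look, as a sketch only: `$/` stays at "\n====" and the substitution accepts an optional "\r" before the "\n". The sub name `join_marker_lines` and the in-memory filehandle are assumptions made so the example can run without a file.

```perl
use strict;
use warnings;

# Join each "====" marker onto the preceding line, for LF or CRLF input.
sub join_marker_lines {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n====";                  # also matches the tail of "\r\n===="
    my $out = '';
    while (my $record = <$fh>) {
        $record =~ s/\r?\n====\z/====/;   # optional CR before the LF
        $out .= $record;
    }
    close $fh;
    return $out;
}

print join_marker_lines("one\r\ntwo\r\n====three\r\nfour\r\n");
```

Without the `\r?`, a CRLF file would keep a stray carriage return in front of the marker, which a terminal may render differently from a file viewer.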
Re^2: Why does my Perl regex substitution for linebreak fail? by pat_mc (Pilgrim) on Mar 06, 2008 at 09:15 UTC
Yes, graff, you are right in observing that my regex contains two linebreaks - in contrast to what I actually intended to do. The curious thing is that the regex performs as expected when it should match one linebreak but not when it contains two linebreaks - in that case it appears to do NOTHING at all, although the file definitely does contain several consecutive linebreak-only lines. I am still puzzled and am starting to believe the issue is not due to the Perl side of things but rather an I/O or even a Linux problem. Any conjectures on this one? Thanks again - Pat
Re^3: Why does my Perl regex substitution for linebreak fail? by graff (Chancellor) on Mar 07, 2008 at 02:03 UTC
I'm not sure I follow what you are describing there. The best thing to do is to present a minimal script and data set that still (even after what you've learned) produces results that you consider to be unexpected, and point out how it differs from what you would expect.