Special behavior for LF and CR in RegExs?

Argel has asked for the wisdom of the Perl Monks concerning the following question:

I'm curious as to why the following does not work. I gather it has something to do with some underlying magic of regular expressions (or perhaps in split)? It just seems odd that I have to do an m/$regex/m to split on LFs and CRs. I would have thought switching to using octal or hex codes would override some of that default behavior.

# Doesn't DWIM
$data =~ s/\012/\015/;
$data =~ s/\015+/\015/;
@records = split /\015/, $data;

My other question: is there a way to split without having to resort to the map+chomp afterwards (while leaving the rest of the data intact)?

# Works, but the map+chomp seems ugly
@records = map {chomp $_;  $_} split /^/xms, $data;

Note: I realize only the 'm' option is necessary. The 'xs' are as per Perl Best Practices.

Thank you oh great wise ones!!

-- Argel

Re: Special behavior for LF and CR in RegExs? by Aristotle (Chancellor) on Jan 05, 2006 at 00:55 UTC
The “doesn’t DWIM” snippet seems to be missing `/g` modifiers. Posting accident, or is that so in your code as well? Anyway, if `split /^/m` works, it seems that `split /\n/` also should. Does it not? You can minimise that code quite a bit, btw, by simply saying `chomp( @records = split /^/xms, $data );` Makeshifts last the longest.
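
A minimal sketch of the `chomp`-on-assignment idiom suggested above, for anyone who wants to try it in isolation. The helper name `split_lines` and the sample string are made up for illustration; the sketch assumes `$/` still has its default value, since that is what `chomp` removes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a string into lines and
# strip the trailing newline from every piece with a single chomp call.
sub split_lines {
    my ($data) = @_;
    # split /^/m breaks the string at the start of each line, so every
    # piece still carries its newline; chomp on the list then removes it.
    chomp( my @records = split /^/m, $data );
    return @records;
}

my @records = split_lines("alpha\nbeta\n\ngamma\n");
print join(':', @records), "\n";    # prints alpha:beta::gamma
```

Because the split happens between lines rather than on the newline itself, the empty line in the middle survives as an empty record and nothing else in the data is altered.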
Re^2: Special behavior for LF and CR in RegExs? by Argel (Prior) on Jan 05, 2006 at 01:19 UTC
Good catch on the missing 'g'!! You are right, that did work. I have seen splitting on a \n work and also seen it not work. I'm using a self-compiled perl 5.8.0 on Solaris 8, so perhaps there is a bug buried away in there? Looks like davidrw's $/ suggestion also works. Given the above \n problem I think I will use that instead. Thanks for all the help!! -- Argel
Re^3: Special behavior for LF and CR in RegExs? by Aristotle (Chancellor) on Jan 05, 2006 at 01:39 UTC
Well, `$/` is the input record separator; generally, in strings and patterns, `\n` is magically mapped to that behind the scenes – even if it consists of multiple characters on the platform in question, such as CR/LF on DOS. Basically, using `\n` will always work so long as the data you’re processing comes from the same platform that you’re running on. If not, you’ll need to convert end-of-line markers. There’s no way to avoid this. So outside specific scenarios, you should use `\n` or `$/` and let Perl handle the specifics. That will also yield the most portable scripts. Makeshifts last the longest.
Re^4: Special behavior for LF and CR in RegExs? (Ah! No!!) by tye (Sage) on Jan 05, 2006 at 05:31 UTC
Re^5: Special behavior for LF and CR in RegExs? by Aristotle (Chancellor) on Jan 05, 2006 at 12:50 UTC
Re: Special behavior for LF and CR in RegExs? by davidrw (Prior) on Jan 05, 2006 at 00:52 UTC
What about just this? `my @records = split($/, $data);` Update: Your original code will work if you add the `/g` modifier to the substitutions. `perl -le '$_="blah\r\nfoo\r\nstuff\r\n"; s/\012/\015/g; s/\015+/\015/g; print join ":", split(/\015/,$_)'`
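
The one-liner above can also be written out as a small script, which makes each step easier to see. This is only a sketch: the helper name `split_on_any_eol` is invented here, and the three statements inside it are the original code with `/g` added. The sample data spells out CR and LF in octal so it does not depend on what `\r` and `\n` mean on the running platform.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): the original three statements
# with the /g modifier added to both substitutions.
sub split_on_any_eol {
    my ($data) = @_;
    $data =~ s/\012/\015/g;     # turn every LF into a CR
    $data =~ s/\015+/\015/g;    # collapse each run of CRs into one CR
    return split /\015/, $data; # split on the single remaining separator
}

my @records = split_on_any_eol("blah\015\012foo\015\012stuff\015\012");
print join(':', @records), "\n";    # prints blah:foo:stuff
```

Note that collapsing runs of CRs also discards blank lines in the input, which may or may not be what is wanted.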