You (and Skeeve) seem to be confused about the difference between the use of "\r" and "\n" in file I/O and their use in regular expressions. To my knowledge, there has never been any magic, variation or ambiguity when using these in regular expressions -- things like s/\r/\n/g; and tr/\r//d; and split /[\r\n]+/ have always had clear and consistent meanings and effects, as documented near the top of the perlre man page.
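To make that concrete, here is a minimal sketch of those three operations applied to a single in-memory string. No file I/O layer is involved, so the results depend only on the characters actually in the string. The sample data and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: one CRLF line, one bare-CR line, one LF line.
my $data = "one\r\ntwo\rthree\n";

# s/\r/\n/g turns every carriage return into a line feed.
( my $cr_to_lf = $data ) =~ s/\r/\n/g;    # "one\n\ntwo\nthree\n"

# tr/\r//d deletes every carriage return.
( my $no_cr = $data ) =~ tr/\r//d;        # "one\ntwothree\n"

# split /[\r\n]+/ treats any run of CRs and LFs as one separator.
my @fields = split /[\r\n]+/, $data;      # ("one", "two", "three")

printf "%d fields: %s\n", scalar @fields, join '|', @fields;
```

Note that the tr/// version glues "two" and "three" together, because the bare CR was the only separator between them. That is the expression doing exactly what it says on data that was not what one might have assumed.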

But of course some hapless programmers, faced with data they don't fully understand, have been (and continue to be) confounded by these clear and consistent expressions when they use the wrong ones on a given set of data. In other words, it's not the expressions themselves that are problematic, it's the misunderstandings about what is in the data, which stem in part from not understanding the source(s) of the data and/or not using a suitable method to read it in (or write it out).
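One way to find out what is really in the data before picking an expression is simply to count the terminators. The following is only a sketch: count_line_endings is a hypothetical helper written for this post, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: count each kind of line terminator in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count = ( CRLF => 0, CR => 0, LF => 0 );

    # Alternation order matters: try CRLF before a lone CR or a lone LF.
    while ( $text =~ /(\r\n|\r|\n)/g ) {
        my $key = $1 eq "\r\n" ? 'CRLF'
                : $1 eq "\r"   ? 'CR'
                :                'LF';
        $count{$key}++;
    }
    return \%count;
}

my $counts = count_line_endings("a\r\nb\rc\nd\n");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

For a real file, the string would need to be read through a ':raw' layer (or after binmode), so that no I/O layer has translated the line endings before they are counted.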

The various wrinkles of file I/O magic, including what "chomp" does, were relatively simple in pre-5.8 perl (though still tricky enough to burn the unwary in numerous ways); the situation and methods of control in 5.8 and later versions are more varied and intricate, and it's a tribute to the designers of PerlIO that the older, simpler idioms still do what they always did.
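As a small illustration of the chomp part, here is a sketch that assumes $/ is still at its default of "\n" and that the strings reach the program untranslated (as they would on a Unix-like system, or through a ':raw' layer). The sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical lines, as they might look when no I/O layer has translated them.
my @raw = ( "alpha\r\n", "beta\n", "gamma\r" );

# chomp removes only a trailing copy of $/ (which is "\n" unless changed).
my @chomped = map { my $s = $_; chomp $s; $s } @raw;
# -> ("alpha\r", "beta", "gamma\r")

# An explicit substitution strips any trailing run of CRs and LFs.
my @stripped = map { my $s = $_; $s =~ s/[\r\n]+\z//; $s } @raw;
# -> ("alpha", "beta", "gamma")

for my $i ( 0 .. $#raw ) {
    ( my $visible = $chomped[$i] ) =~ s/\r/\\r/g;    # make leftover CRs visible
    print "chomp: $visible\tregex: $stripped[$i]\n";
}
```

The leftover "\r" after chomp is the classic way the unwary get burned: chomp did its documented job, but the data came from somewhere with different line endings.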

Update: As for this point:

BTW, I note that \R is defined in perlreref as (?>\v|\x0D\x0A). Shouldn't that be (?>\x0D\x0A|\v) ? And I wonder what the EBCDIC folk make of this !

I see that "\R" and "\v" were both introduced as of 5.10, and these are likely to help once people realize they exist and get the hang of using them. Maybe there's a problem with the description in that man page, but their actual behavior looks very handy:

$_ = "hi\x0bthere\x0d\x0aline 3?\x0a    line4\x0dno\x0away\r\n";
print;

@lines = split /\R/;
print "===\n", join( "\n===\n", @lines ), "\n===\n";

For me, that snippet produces:

hi
  there
line 3?
no  line4
way
===
hi
===
there
===
line 3?
===
    line4
===
no
===
way
===

Notice how vertical tab (\x0b) and the isolated CR (\x0d) are treated the same as CRLF and LF. Wow, this is going to make a lot of things easier.
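Two further checks, offered as a sketch rather than as a statement about how the engine is implemented. The first compares \R with a plain character class on data containing an empty line. The second tries both orderings of the atomic group from the quoted question against a lone CRLF. The test strings and variable names are made up.

```perl
use strict;
use warnings;
use 5.010;    # \R and \v need perl 5.10 or later; this also enables say

my $text = "a\r\n\r\nb";    # two CRLF pairs between "a" and "b"

my @by_R     = split /\R/,     $text;    # ("a", "", "b")
my @by_class = split /[\r\n]/, $text;    # ("a", "", "", "", "b")

say scalar(@by_R),     " fields with \\R";
say scalar(@by_class), " fields with [\\r\\n]";

# Which ordering of the atomic group can match a whole CRLF?
my $crlf       = "\r\n";
my $R_whole    = $crlf =~ /\A\R\z/              ? 1 : 0;    # 1
my $crlf_first = $crlf =~ /\A(?>\x0D\x0A|\v)\z/ ? 1 : 0;    # 1
my $v_first    = $crlf =~ /\A(?>\v|\x0D\x0A)\z/ ? 1 : 0;    # 0

say "\\R: $R_whole  CRLF-first: $crlf_first  \\v-first: $v_first";
```

With \v listed first, the atomic group commits to the CR alone and can never go back to try the two-character alternative, so the CRLF-first ordering is the one that agrees with what \R actually does here.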

Another update: And I wonder, just who are all these EBCDIC people I keep hearing about? Are they in the same museum with the VAX/VMS users?

In reply to Re^5: Parsing a text file by graff
in thread Parsing a text file by calmthestorm
