Re^3: Parsing a text file

Re^4: Parsing a text file by gone2015 (Deacon) on Jan 14, 2009 at 10:41 UTC
`[\r\n]` does appear to finesse the problem nicely.

Unfortunately, when this last came up, I looked at all the relevant documentation I could find, but I did not see any guarantee that `"\r"` will be `"\x0A"` if `"\n"` is `"\x0D"` (or vice versa) in not-EBCDIC land. Or even that `"\r"` and `"\n"` are in general guaranteed to be duals of each other.

As brother ikegami says, you know and I know that these days, with the exception of EBCDIC systems, `"\r\n"` is exactly `"\x0D\x0A"`. If an authoritative position were taken that as of (say) 5.8.0:

- `"\r\n" eq "\x0D\x0A"` except for EBCDIC.
- any system using line endings other than `"\n"` will support, and will by default use, a PerlIO layer that maps those line endings to/from `"\n"`

then we could consign worrying about this piece of magic to the bin. I don't know what the position is with MacPerl, but perlmacos suggests that the above could be back-dated to 5.8.0 including MacPerl.

FWIW, socket handling can (of course) be simplified by applying `binmode $sock, ':crlf'`, which is nice. Nevertheless, `chomp` is a snare and a delusion if you think it's handling Internet CRLF line endings (unless you're futzing about with `$/` at the same time). Wouldn't it be nice to have a `chompnl` equivalent to `s/\x0D?\x0A$//`? And, perhaps, `chomps` equivalent to `s/\s+$//`?

BTW, I note that `\R` is defined in perlreref as `(?>\v|\x0D\x0A)`. Shouldn't that be `(?>\x0D\x0A|\v)`? And I wonder what the EBCDIC folk make of this!
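To make the `chompnl` / `chomps` wish concrete, here is a minimal sketch. Neither function exists in core Perl; the names come from the post above, and the choice to return a trimmed copy (rather than modify the argument in place, as `chomp` does) is an assumption made to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl builtin): remove one trailing LF,
# together with a CR immediately before it, using the regex from the post.
sub chompnl {
    my ($line) = @_;
    $line =~ s/\x0D?\x0A$//;
    return $line;
}

# Hypothetical helper (not a Perl builtin): remove all trailing whitespace.
sub chomps {
    my ($line) = @_;
    $line =~ s/\s+$//;
    return $line;
}

# A line as it would arrive from an Internet protocol, ending in CRLF.
my $wire = "GET / HTTP/1.0\x0D\x0A";

# Plain chomp with the default $/ removes only the trailing "\n".
my $chomped = $wire;
chomp $chomped;

printf "chomp:   %d bytes left\n", length($chomped);
printf "chompnl: %d bytes left\n", length(chompnl($wire));
printf "chomps:  %d bytes left\n", length(chomps($wire));
```

On an ASCII platform with the default `$/`, the `chomp` result should be one byte longer than the other two, because the CR is left behind.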
Re^5: Parsing a text file by graff (Chancellor) on Jan 14, 2009 at 13:20 UTC
You (and Skeeve) seem to be confused about the difference between the use of "\r" and "\n" in file I/O and their use in regular expressions. To my knowledge, there has never been any magic, variation or ambiguity when using these in regular expressions -- things like `s/\r/\n/g;` and `tr/\r//d;` and `split /[\r\n]+/` have always had clear and consistent meanings and effects, as documented near the top of the perlre man page.

But of course some hapless programmers, faced with data they don't fully understand, have been (and continue to be) confounded by these clear and consistent expressions when they use the wrong ones on a given set of data. In other words, it's not the expressions themselves that are problematic, it's the misunderstandings about what is in the data, which stem in part from not understanding the source(s) of the data and/or not using a suitable method to read it in (or write it out).

The various wrinkles of file I/O magic, including what "chomp" does, were relatively simple in pre-5.8 perl (though still tricky enough to burn the unwary in numerous ways); the situation and methods of control in 5.8 and later versions are more varied and intricate, and it's a tribute to the designers of PerlIO that the older, simpler idioms still do what they always did.

Update: As for this point:

> BTW, I note that `\R` is defined in perlreref as `(?>\v|\x0D\x0A)`. Shouldn't that be `(?>\x0D\x0A|\v)`? And I wonder what the EBCDIC folk make of this!

I see that "\R" and "\v" were both introduced as of 5.10, and these are likely to help once people realize they exist and get the hang of using them. Maybe there's a problem with the description in that man page, but their actual behavior looks very handy:

```perl
$_ = "hi\x0bthere\x0d\x0aline 3?\x0a line4\x0dno\x0away\r\n";
print;
@lines = split /\R/;
print "===\n", join( "\n===\n", @lines ), "\n===\n";
```

For me, that snippet produces:

```
hi there
line 3?
no line4
way
===
hi
===
there
===
line 3?
===
 line4
===
no
===
way
===
```

Notice how vertical tab (\x0b) and the isolated CR (\x0d) are treated the same as CRLF and LF. Wow, this is going to make a lot of things easier.

Another update: And I wonder, just who are all these EBCDIC people I keep hearing about? Are they in the same museum with the VAX/VMS users?
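The question about alternation order can be checked directly. The sketch below (which assumes Perl 5.10 or later, and uses a hypothetical `count_fields` helper) splits a string containing a single CRLF using the pattern as printed in perlreref, the reordered pattern, and `\R` itself, then counts the resulting fields.

```perl
use strict;
use warnings;
use 5.010;    # \R and \v need Perl 5.10 or later

# Hypothetical helper: how many fields does split produce for this pattern?
sub count_fields {
    my ($pattern, $string) = @_;
    my @fields = split $pattern, $string;
    return scalar @fields;
}

# One CRLF between two letters.
my $text = "a\x0D\x0Ab";

my @cases = (
    [ 'as printed in perlreref', qr/(?>\v|\x0D\x0A)/ ],
    [ 'CRLF alternative first',  qr/(?>\x0D\x0A|\v)/ ],
    [ '\R itself',               qr/\R/ ],
);

for my $case (@cases) {
    my ($label, $pattern) = @$case;
    printf "%-24s %d field(s)\n", $label, count_fields($pattern, $text);
}
```

Taken literally, the pattern as printed lets `\v` consume the CR on its own, so the CRLF is seen as two separate line breaks with an empty field between them. The reordered pattern and `\R` both treat CRLF as one break, which suggests the concern is about the wording in the man page rather than about how `\R` behaves.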
Re^6: Parsing a text file by gone2015 (Deacon) on Jan 14, 2009 at 15:04 UTC
> You (and Skeeve) seem to be confused about the difference between the use of "\r" and "\n" in file I/O and their use in regular expressions.

It is a very confusing area, not made any easier by some very confused documentation. On this topic perlop says:

> All systems use the virtual `"\n"` to represent a line terminator, called a "newline". ... In general, use `"\n"` when you mean a "newline" for your system, but use the literal ASCII when you need an exact character. ... If you get in the habit of using `"\n"` for networking, you may be burned some day.

and perlport says:

> Perl uses `\n` to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, `\n` always means `\015`. In DOSish perls, `\n` usually means `\012`, but when accessing a file in "text" mode, STDIO translates it to (or from) `\015\012`, depending on whether you're reading or writing. ...

Note that this is 5.10.0 documentation.

Once upon a time, Perl really did have different values for `"\n"` in strings, and that was the way it adapted to different systems -- there was no translation of incoming characters. These days all that malarkey is better done at the PerlIO layer. But I believe that MacPerl, however historic, does it the old way.

That really is what the documentation means when it refers to the "logical" or "virtual" newline. Also, `"\n"` is referred to as "newline" or NL, not LF. (Because this is such an unusual approach one tends to read past these little words -- and that's not helped by the jumbling together of what `"\n"` means, what different operating systems do, and what Perl may or may not do in the middle.)

Now, you and I have never seen a system where `"\r\n" ne "\x0D\x0A"`, but the documentation continues to state that it's a possibility.
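For anyone who wants to see what their own perl does, here is a small sketch that reports the values of `"\r"` and `"\n"` and follows the perlop advice of spelling out the bytes when a protocol requires CRLF. The helper names `wire_crlf`, `escapes_match_ascii` and `build_request`, and the host `example.invalid`, are made up for illustration.

```perl
use strict;
use warnings;

# Explicit bytes for a network line ending, independent of what
# "\r" and "\n" happen to mean on this platform.
sub wire_crlf { return "\x0D\x0A" }

# Does this perl's "\r\n" coincide with the explicit bytes?
sub escapes_match_ascii { return "\r\n" eq wire_crlf() ? 1 : 0 }

# Hypothetical helper: build a tiny request, ending every line (and the
# header block) with explicit CRLF rather than "\n".
sub build_request {
    my ($host) = @_;
    return join wire_crlf(), "HEAD / HTTP/1.0", "Host: $host", "", "";
}

printf "ord(\"\\r\") = 0x%02X, ord(\"\\n\") = 0x%02X\n", ord("\r"), ord("\n");
print escapes_match_ascii()
    ? "\"\\r\\n\" eq \"\\x0D\\x0A\" on this perl\n"
    : "\"\\r\\n\" ne \"\\x0D\\x0A\" on this perl\n";
printf "request is %d bytes long\n", length(build_request("example.invalid"));
```

Building the request from `wire_crlf()` gives the same bytes whatever the answer to the second line is, which is the point of the perlop advice.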