Re: alter $/ - but why?
by derby (Abbot) on Aug 03, 2005 at 18:11 UTC
|
from perlport:
When dealing with binary files (or text files in binary mode) be sure to explicitly set $/ to the appropriate value for your file format before using chomp().
So if your script accepts files from many different OSes and newlines are not converted appropriately during the transfer, you're going to have to set the input record separator explicitly.
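A minimal sketch of why that matters for chomp(): it removes one trailing copy of whatever $/ currently holds, nothing more. The helper name chomp_with is made up for illustration, and the octal escapes are used so the bytes are the same on every platform.

```perl
use strict;
use warnings;

# Hypothetical helper: chomp a record using an explicit terminator instead of
# whatever $/ happens to hold at the time.
sub chomp_with {
    my ($record, $eol) = @_;
    local $/ = $eol;    # restored automatically when the sub returns
    chomp $record;      # removes one trailing copy of $/, if present
    return $record;
}

my $dos_line = "some data\015\012";    # nine characters plus CR LF

print length(chomp_with($dos_line, "\012")), "\n";       # 10: the CR is left behind
print length(chomp_with($dos_line, "\015\012")), "\n";   # 9: both bytes removed
```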
| [reply] |
Re: alter $/ - but why?
by radiantmatrix (Parson) on Aug 03, 2005 at 18:44 UTC
|
derby points out the answer to your question, but there's another couple of pieces.
- If you want to process the file one line at a time, you need to set $/ anyway, or you may not be reading one line at a time.
- If you need to preserve the original line-endings for a write-out operation at some point, you can just as easily modify the sub to set $\ as well.
Because of the first, the whole structure is kind of odd anyway, since with Mac line endings, you'd slurp the whole file to find out that you have those line endings.
Better to do:
open IN, '<', $filename or die ("Can't open $filename: $!");
sysseek IN, -5, 2;
my $last_five;
sysread IN, $last_five, 5;
## find out what the EOL chars are and set $/ to match
$/ = $1 if $last_five =~ m/(\r\n|\n|\r)\z/;
sysseek IN,0,0;
while (<IN>) {
chomp;
# now process stuff #
}
This is predicated on the text-files being well-formed (ending with an EOL before EOF), so you may need to handle the possibility of malformed files or whatnot.
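One way to cope with the malformed case is to make the detection step report failure instead of guessing. The following is only a sketch: detect_eol is a hypothetical name, it uses buffered seek/read on a lexical handle rather than sysseek/sysread, and it returns nothing for an empty file or a file without a final terminator so that the caller can decide what to fall back to.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "\015\012", "\012" or "\015" when the file ends
# with one of them, and undef (in scalar context) otherwise.
sub detect_eol {
    my ($filename) = @_;
    open my $in, '<', $filename or die "Can't open $filename: $!";
    binmode $in;                          # no CRLF translation on Windows
    my $size = -s $in;
    return unless $size;                  # empty file: nothing to inspect
    my $want = $size < 2 ? $size : 2;     # the longest terminator is two bytes
    seek $in, -$want, 2 or die "Can't seek in $filename: $!";
    read $in, my $tail, $want;
    close $in;
    # CRLF has to be tried first, or a lone LF would match inside it.
    return $1 if $tail =~ /(\015\012|\012|\015)\z/;
    return;                               # no terminator at end of file
}

# Usage: pass a file name on the command line.
if (@ARGV) {
    my $eol = detect_eol($ARGV[0]);
    $/ = $eol if defined $eol;            # otherwise keep the current $/
    open my $in, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
    binmode $in;
    while (my $line = <$in>) {
        chomp $line;
        # now process stuff
    }
    close $in;
}
```

Keeping the current $/ when nothing is detected is just one possible fallback; dying or scanning the start of the file would be equally defensible.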
<-radiant.matrix->
Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots" - Kaa's Law
| [reply] [d/l] |
If you are guaranteed to only read your OS's native formats, then you wouldn't need this routine at all. Therefore, I assumed the OP has this code in place because the script running on one OS is likely to read files created by several different OSes.
So, I stand by my statement: if you read a file with Mac line endings (say, on a Unix box), using the code in the top node would read the whole file, since $/ would be looking for Unix-style line endings, which don't contain "\r".
Your point about using the hex values for setting is a good one to remember, but the code as I wrote it automatically accounts for that. As for using the same line endings for output as have been determined for input, $\ = $/ is sufficient.
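A small sketch of that round trip, assuming the terminator has already been determined. The sub name copy_records is invented for the example, and in-memory filehandles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: re-emit records with the terminator they arrived with.
sub copy_records {
    my ($in, $out, $eol) = @_;
    local $/ = $eol;    # input record separator: where <$in> splits
    local $\ = $/;      # output record separator: appended by every print
    while (my $line = <$in>) {
        chomp $line;           # strip the terminator for processing
        print {$out} $line;    # print puts the same terminator back
    }
    return;
}

my $source = "alpha\015\012beta\015\012";
open my $in,  '<', \$source  or die "Can't read from string: $!";
open my $out, '>', \my $copy or die "Can't write to string: $!";
binmode $in;
binmode $out;
copy_records($in, $out, "\015\012");
close $out;

my $verdict = $copy eq $source ? 'round trip ok' : 'round trip differs';
print "$verdict\n";
```

Note that with $\ set, a final line that arrived without a terminator is written out with one.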
| [reply] |
Re: alter $/ - but why?
by betterworld (Curate) on Aug 03, 2005 at 18:36 UTC
|
To make your code even more portable, you should replace every "\n" by "\012" (except those "\n" that are printed).
If you don't, your code would not run properly on Mac systems.
| [reply] |
Well, yes and no... actually the tests for bare "\n" and "\r" would work, but the test for "\r\n" would fail (at least, in MacPerl on MacOS Classic). It is advisable to replace "\r\n" and /\r\n/ with "\015\012" and /\015\012/, respectively.
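To illustrate, here is a sketch that prints the byte values behind each spelling. The octal form is fixed everywhere, while what "\r\n" produces depends on how the local perl maps "\n" and "\r" (perlport describes MacPerl as mapping "\n" to CR). The helper strip_crlf is a made-up name.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";    # always CR (13) then LF (10)

printf "%vd\n", $CRLF;    # 13.10 on every platform
printf "%vd\n", "\r\n";   # platform dependent: 13.10 on Unix and Windows

# Hypothetical helper: remove one trailing CRLF, spelled portably.
sub strip_crlf {
    my ($text) = @_;
    $text =~ s/\015\012\z//;
    return $text;
}

print strip_crlf("record$CRLF"), "\n";
```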
| [reply] [d/l] [select] |
Re: alter $/ - but why?
by graff (Chancellor) on Aug 04, 2005 at 03:02 UTC
|
As indicated previously, "chomp" is equivalent to "s{$/$}{}" on a string, so if you're going to use it on files of unknown origin (line-endings varying from file to file), it would be a good idea to make sure that $/ is set appropriately for each file.
But that sub does have its drawbacks: apart from the fact that it will pull in the full content of a "\r-only" type of text file, there is also the possibility that a single file could contain a variety of patterns involving "\r" and "\n" -- e.g. someone on a unix box quickly edits a CRLF-type file, adding a couple of "\n-only" lines at the top, or the file contains stuff other than text, etc.
If the goal is simply to be able to handle all sorts of line-termination patterns (and you aren't worried about getting hit with a massive Mac "\r-only" file that'll chew up too much RAM), you could do without the sub and go right to a main processing loop like this:
$/ = "\xa";
while( <FILE> ) {
s/\xd?\xa$//; # does what chomp would do, handles CRLF and LF-only
for my $line ( split /\xd/, $_, -1 ) # handles CR-only cases
{
# now we're line-oriented no matter what the input style is...
}
}
OTOH, if the goal is to be scrupulous and careful about knowing what sorts of line termination are showing up in your data files, write a separate diagnostic for that, have it produce a suitably detailed report for each file (e.g. number of "(\r\n)+", number of "(\n)+", number of "(\r)+"), and then configure your data-processing script(s) to work from that report.
| [reply] [d/l] |
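The diagnostic graff describes could start out as something like the sketch below. It is an assumption-laden simplification: count_line_endings is a hypothetical name, it counts individual terminators rather than runs such as "(\r\n)+", and it slurps each file, which is fine for a quick report but not for very large inputs.

```perl
use strict;
use warnings;

# Hypothetical helper: count each kind of line terminator in a string.
sub count_line_endings {
    my ($data) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # Alternation order matters: CRLF must be tried before a lone LF or CR.
    while ($data =~ /(\015\012|\012|\015)/g) {
        my $kind = $1 eq "\015\012" ? 'crlf'
                 : $1 eq "\012"     ? 'lf'
                 :                    'cr';
        $count{$kind}++;
    }
    return \%count;
}

# One report line per file named on the command line.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Can't open $file: $!";
    binmode $in;                          # see the raw bytes
    my $data = do { local $/; <$in> };    # slurp the whole file
    close $in;
    my $c = count_line_endings($data);
    printf "%s: CRLF=%d LF=%d CR=%d\n", $file, @{$c}{qw(crlf lf cr)};
}
```

A file that reports more than one non-zero count is a mixed file and probably needs the split-based loop above rather than a single value of $/.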
Re: alter $/ - but why?
by samtregar (Abbot) on Aug 03, 2005 at 20:50 UTC
|
Have you tried it? I recommend you write three tests - one with source text with each line-ending style. Verify that it works correctly with the original code. Then make your change and see if it still works. If it does, you're done. If not you'll need to learn more about what chomp() does.
For bonus points, write your tests using Test::More!
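Something along these lines, as a sketch only: read_lines is a hypothetical stand-in for whatever routine you are actually testing, and the three fixtures are built in memory rather than read from disk.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Hypothetical stand-in for the code under test: split text into lines using
# the given terminator and strip it from each line.
sub read_lines {
    my ($text, $eol) = @_;
    open my $in, '<', \$text or die "Can't read from string: $!";
    binmode $in;
    local $/ = $eol;
    my @lines = <$in>;
    chomp @lines;         # still uses the localized $/
    return \@lines;
}

my %eol = (unix => "\012", dos => "\015\012", mac => "\015");

for my $style (sort keys %eol) {
    my $text = join '', map { $_ . $eol{$style} } qw(first second third);
    is_deeply(
        read_lines($text, $eol{$style}),
        [qw(first second third)],
        "$style line endings",
    );
}
```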
-sam
| [reply] |