While this is absolutely correct, it does raise the question: what should be considered "characters" in a text file? To my mind, all characters, including carriage return (\r) and line feed (\n), should be counted, as these do contribute to the size of the file. The difference in reported size encountered by the venerable Anonymous Monk is, as gjb has rightly alluded to, due to platform differences in the interpretation of these characters.
An alternative method of counting the number of characters in a file, including carriage returns and line feeds, which should return the same result regardless of platform, would be:
print length do { local $/; local @ARGV = ( $file ); <> }, "\n";
where the variable $file contains the name of the text file whose characters are to be counted.
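As a cross-check (my own sketch, not from the thread; the filename and sample data are illustrative), the count from slurping a file in raw mode should match the byte size the filesystem reports via -s, since no CR/LF translation occurs:

```perl
use strict;
use warnings;

# Hypothetical sample file: write a couple of DOS-style lines to count
my $file = 'count_demo.txt';
open my $out, '>:raw', $file or die "write $file: $!";
print $out "one\r\ntwo\r\n";
close $out;

# Slurp in raw (binary) mode so no CR/LF translation occurs on any platform
open my $fh, '<:raw', $file or die "open $file: $!";
my $data = do { local $/; <$fh> };
close $fh;

print length($data), "\n";   # 10: every CR and LF is counted
print -s $file, "\n";        # 10: matches the size in bytes on disk
unlink $file;
```

The :raw layer is what makes the result platform-independent; in the default text mode, a Windows perl would translate each \r\n to \n on read and report fewer characters.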
Update
With regard to the follow-up post from Anonymous Monk, I would concur with the direction suggested by gjb in this post: it sounds as if there *really is* a difference between the files being compared on the two different machines (presumably as a result of the file transfer via FTP), hence the differing character counts.
perl -le 'print+unpack("N",pack("B32","00000000000000000000000111111110"))'
Hi, thanks for your help. I've tried both these methods and the results are still not the same! Linux appears to be counting an extra character per line. If I strip out \n or chomp it makes no difference.
It sounds like the carriage return is still in the file on the Linux side. If you FTP'ed the file from Windows to Linux in binary mode, that would be the case. Try a little test case such as this:
#!/usr/bin/perl -wd
while (<>) {
    chomp;
    print $_, "\n";
}
While in the debugger, display $_ (x $_) after the chomp. Do you see something like this: "blah blah blah\cM"?
That control-M is the carriage return. Some editors (vi, for example) will also show the carriage return if configured to do so.
chomp removes any trailing string that corresponds to the current value of $/. In this case only the unix newline will be removed. You could be more destructive and remove all whitespace at the end of a line with a regex such as s/\s+$//. That would work on both platforms and you wouldn't have to worry. Or you could ensure your transfer process does the correct translation for you.
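The difference can be sketched like this (a standalone illustration; the sample string is made up, and chomp here relies on the default $/ of "\n"):

```perl
# chomp removes only the current $/ (a lone LF here), leaving the CR behind
my $dos_line = "blah blah blah\r\n";
chomp $dos_line;                   # now "blah blah blah\r"
print length($dos_line), "\n";     # 15: the trailing CR still counts

# the regex strips all trailing whitespace, CR included, on either platform
(my $stripped = $dos_line) =~ s/\s+$//;
print length($stripped), "\n";     # 14
```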
-derby
Sounds like you are doing a binary transfer to get your Windows text file onto Linux.
Windows uses two characters for end-of-line (\r\n), and Linux uses one (\n). If Linux reports an extra character per line, it is probably counting the ^M (\r) that Windows treats as part of the line delimiter.
Option 1) use ASCII or TEXT transfer
Option 2) don't count the "\r" characters at the end of each line
e.g.
$count = length($line);
$count-- if substr($line, -1) eq "\r";
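For a whole-file count, a variant of option 2 (my own sketch, not from the post) is to subtract the carriage returns in one pass with tr:

```perl
# Count characters in slurped text while ignoring carriage returns.
# $text stands in for a file slurped as in the earlier examples.
my $text  = "one\r\ntwo\r\n";
my $crs   = ($text =~ tr/\r//);    # tr in scalar context counts matches
my $count = length($text) - $crs;
print $count, "\n";                # 8: the same answer on either platform
```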
What about the file content and locale? The counts can differ if the text contains UTF-8 characters and is read once under use bytes; and another time under use utf8;.
Try adding the use bytes; pragma to the script and test it again, if you are sure the files are the same...
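To illustrate why the two counts can differ (a sketch using the core Encode module; the string is illustrative), the same text can be 4 characters but 5 bytes:

```perl
use Encode qw(encode);

my $s = "caf\x{e9}";                      # "café": four characters
print length($s), "\n";                   # 4 with character semantics
print length(encode('UTF-8', $s)), "\n";  # 5 bytes once UTF-8 encoded
```

A byte-oriented count (use bytes; or reading without a decoding layer) sees the second number, while a character-oriented count sees the first.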