While this is absolutely correct, it does raise the question: what should be considered "characters" in a text file? To my mind, all characters, including carriage return (\r) and line feed (\n), should be counted, as these do contribute to the size of the file. The difference in reported size encountered by the venerable Anonymous Monk is, as gjb has rightly alluded to, due to platform differences in the interpretation of these characters.
An alternative method of counting the number of characters in a file, including carriage returns and line feeds, which should return the same result regardless of platform, would be:
print length do { local $/; local @ARGV = ( $file ); <> }, "\n";
where the variable $file contains the name of the text file whose characters are to be counted.
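As a cross-check (my own sketch, not from the thread; the filename and sample data are illustrative), the count from slurping a file in raw mode should match the byte size the filesystem reports via -s, since no CR/LF translation occurs:

```perl
use strict;
use warnings;

# Hypothetical sample file: write a couple of DOS-style lines to count
my $file = 'count_demo.txt';
open my $out, '>:raw', $file or die "write $file: $!";
print $out "one\r\ntwo\r\n";
close $out;

# Slurp in raw (binary) mode so no CR/LF translation occurs on any platform
open my $fh, '<:raw', $file or die "open $file: $!";
my $data = do { local $/; <$fh> };
close $fh;

print length($data), "\n";   # 10: every CR and LF is counted
print -s $file, "\n";        # 10: matches the size in bytes on disk
unlink $file;
```

The :raw layer is what makes the result platform-independent; in the default text mode, a Windows perl would translate each \r\n to \n on read and report fewer characters.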
Update
With regard to the follow-up post from Anonymous Monk, I would concur with the direction suggested by gjb in this post: it sounds as if there *really is* a difference between the files being compared on the two different machines (presumably as a result of the file transfer via FTP), hence the differing character counts.
perl -le 'print+unpack("N",pack("B32","00000000000000000000000111111110"))'
Hi, thanks for your help. I've tried both these methods and the results are still not the same! Linux appears to be counting an extra character per line. If I strip out \n or chomp it makes no difference.
It sounds like the carriage return is still in the file on the Linux side. If you FTP'ed the file from Windows to Linux in binary mode, that would be the case. Try a little test case such as this:
#!/usr/bin/perl -wd
while (<>) {
    chomp;
    print $_, "\n";
}
While in the debugger, display $_ (x $_) after the chomp. Do you see something like this: "blah blah blah\cM"?
That control-M is the carriage return. Some editors (vi, for example) will also show the carriage return if configured to do so.
chomp removes any trailing string that corresponds to the current value of $/. In this case only the unix newline will be removed. You could be more destructive and remove all whitespace at the end of a line with a regex such as s/\s+$//. That would work on both platforms and you wouldn't have to worry. Or you could ensure your transfer process does the correct translation for you.
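The difference can be sketched like this (a standalone illustration; the sample string is made up, and chomp here relies on the default $/ of "\n"):

```perl
# chomp removes only the current $/ (a lone LF here), leaving the CR behind
my $dos_line = "blah blah blah\r\n";
chomp $dos_line;                   # now "blah blah blah\r"
print length($dos_line), "\n";     # 15: the trailing CR still counts

# the regex strips all trailing whitespace, CR included, on either platform
(my $stripped = $dos_line) =~ s/\s+$//;
print length($stripped), "\n";     # 14
```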
-derby
Sounds like you are doing a binary transfer to get your Windows text file onto Linux.
Windows uses two characters for end-of-line (\r\n), and Linux uses one (\n). If Linux reports an extra character per line, it is probably counting the ^M (\r) that Windows treats as part of the line delimiter.
Option 1) use ASCII or TEXT transfer
Option 2) don't count the "\r" characters at the end of each line
e.g.
$count = length($line);
$count-- if substr($line, -1) eq "\r";
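For a whole-file count, a variant of option 2 (my own sketch, not from the post) is to subtract the carriage returns in one pass with tr:

```perl
# Count characters in slurped text while ignoring carriage returns.
# $text stands in for a file slurped as in the earlier examples.
my $text  = "one\r\ntwo\r\n";
my $crs   = ($text =~ tr/\r//);    # tr in scalar context counts matches
my $count = length($text) - $crs;
print $count, "\n";                # 8: the same answer on either platform
```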
What about the file content and locale? The counts can differ if the text contains UTF-8 characters and is read once under use bytes; and another time under use utf8;.
Try adding the use bytes; pragma to the script and test it again, if you are sure the files are the same...
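To illustrate why the two counts can differ (a sketch using the core Encode module; the string is illustrative), the same text can be 4 characters but 5 bytes:

```perl
use Encode qw(encode);

my $s = "caf\x{e9}";                      # "café": four characters
print length($s), "\n";                   # 4 with character semantics
print length(encode('UTF-8', $s)), "\n";  # 5 bytes once UTF-8 encoded
```

A byte-oriented count (use bytes; or reading without a decoding layer) sees the second number, while a character-oriented count sees the first.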