PerlMonks  

Re: char count windows vs linux

by gjb (Vicar)
on Dec 17, 2002 at 12:43 UTC


in reply to char count windows vs linux

If you chomp the lines you read, the number of chars should be the same.

while (<FILE>) { chomp($_); $count += length($_) + 1; }

Hope this helps, -gjb-

Replies are listed 'Best First'.
Re: Re: char count windows vs linux
by rob_au (Abbot) on Dec 17, 2002 at 12:55 UTC
    While this is absolutely correct, it does raise the question - What should be considered "characters" in a text file? To my mind, all characters, including carriage return (\r) and line feed (\n), should be counted as these do contribute to the size of the file. The difference in reported size encountered by the venerable Anonymous Monk, as gjb has rightly alluded to, is due to platform differences in the interpretation of these characters.

    An alternative method of counting the number of characters in a file, including carriage returns and line feeds, which should return the same result irrespective of platform, would be:

    print length do { local $/; local @ARGV = ( $file ); <> }, "\n";

    Where the variable $file contains the text file name whose characters are to be counted.
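
    A minimal sketch of the same slurp idea with an explicit filehandle and binmode added, so that Perl on Windows does not translate CRLF to LF while reading. The sub name count_bytes is hypothetical and not from the post above; with binmode in place the result should match the size reported by -s.

```perl
use strict;
use warnings;

# Hypothetical helper: count every byte in a file, including \r and \n.
sub count_bytes {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    binmode $fh;    # read raw bytes; no CRLF translation on Windows
    local $/;       # slurp mode: read the whole file in one go
    my $content = <$fh>;
    close $fh;
    return defined $content ? length $content : 0;
}
```

    Called as count_bytes($file), this reads the whole file into memory, so it is only a reasonable choice for files of modest size.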

     

    Update

    With regard to the follow-up post from Anonymous Monk, I would concur with the direction suggested by gjb in this post: it sounds as if there *really is* a difference between the files being compared on the two different machines (presumably as a result of the file transfer via FTP), hence the differing character counts.

     

    perl -le 'print+unpack("N",pack("B32","00000000000000000000000111111110"))'

      Would something like md5sum work across systems to tell you if the files have been copied/FTP'ed correctly? I know that md5sum is available for both Linux and Windows.

      The md5sum program computes a 128-bit checksum (or fingerprint or message-digest) for a file. A consistent fingerprint means the files are the same.

      A.A.
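
      The same check can be sketched in Perl with the core Digest::MD5 module, which avoids depending on an md5sum binary being installed on both machines. The sub name file_md5 is hypothetical; run it on each machine and compare the two hex strings.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 digest of a file's raw bytes.
sub file_md5 {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    binmode $fh;    # hash the bytes as stored, without CRLF translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}
```

      Note that a file transferred in ASCII mode will legitimately produce a different digest from the Windows original, because the line endings were rewritten in transit; only a binary-mode copy is expected to match.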

      Hi, thanks for your help. I've tried both these methods and the results are still not the same! Linux appears to be counting an extra character per line. If I strip out \n or chomp it makes no difference.
        It sounds like the carriage return is still in the file on the Linux side. If you FTP'ed the file from Windows to Linux in binary mode, that would be the case. Try a small test case such as this:

        #!/usr/bin/perl -wd
        while (<>) { chomp; print $_, "\n"; }

        While in the debugger, display $_ (x $_) after the chomp. Do you see something like this: "blah blah blah\cM"? That control-M is the carriage return. Some editors may also show the carriage return (vi) if configured properly. chomp removes any trailing string that corresponds to the current value of $/. In this case only the unix newline will be removed. You could be more destructive and remove all whitespace at the end of a line with a regex such as s/\s+$//. That would work on both platforms and you wouldn't have to worry. Or you could ensure your transfer process does the correct translation for you.

        -derby
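
        A sketch of a count that ignores line terminators on either platform, along the lines suggested above, and that also reports how many lines carried a stray carriage return. The sub name count_without_eol is hypothetical. It strips only a trailing "\n" and "\r" rather than all trailing whitespace, which is a narrower choice than s/\s+$//.

```perl
use strict;
use warnings;

# Hypothetical helper: count characters excluding line terminators,
# whether they are "\n" (Unix) or "\r\n" (Windows).
# Returns (character count, number of lines that ended in "\r").
sub count_without_eol {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    binmode $fh;    # see the bytes exactly as stored
    my ($chars, $cr_lines) = (0, 0);
    while (my $line = <$fh>) {
        $line =~ s/\n\z//;                   # drop the line feed, if any
        $cr_lines++ if $line =~ s/\r\z//;    # drop and tally a trailing CR
        $chars += length $line;
    }
    close $fh;
    return ($chars, $cr_lines);
}
```

        A non-zero second value on the Linux side would suggest the file still has Windows line endings, which is consistent with a binary-mode transfer.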

        Sounds like you are doing a binary transfer to get your Win text file onto Linux.

        Win uses two chars for end-of-line, and Linux uses one. If Linux returns an extra char per line, it is probably counting the ^M (\r) that Windows treats as part of the line delimiter.

        Option 1) use ASCII or TEXT transfer

        Option 2) don't count the "\r" characters at the end of each line e.g.
        $count= length($line); $count-- if substr($line,-1) eq "\r";

        Are you sure the files are the same, i.e. if you transferred them by FTP, did you use ASCII mode? Something might have gone wrong somewhere at that stage.

        Just my 2 cents, -gjb-

Re: Re: char count windows vs linux
by ph0enix (Friar) on Dec 17, 2002 at 17:08 UTC

    What about file content and locale? There can be a difference if the text contains UTF characters and is read once with use bytes; and once with use utf8;

    Try adding the use bytes; pragma to the script and testing it again, if you are sure the files are the same...
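
    To make the byte-versus-character distinction concrete, here is a small sketch using the core Encode module. The helper names byte_count and char_count are made up for illustration, and decoding with Encode is an assumption of this sketch rather than something the post prescribes; the point is only that the same data can have two different lengths depending on whether it is treated as bytes or as decoded characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helpers operating on a string of raw UTF-8 octets.
sub byte_count {
    my ($octets) = @_;
    return length($octets);                     # one per stored byte
}

sub char_count {
    my ($octets) = @_;
    return length(decode('UTF-8', $octets));    # one per decoded character
}

my $octets = "caf\xC3\xA9";    # "cafe" with an accented e, as UTF-8 bytes
print byte_count($octets), " bytes, ", char_count($octets), " characters\n";
```

    For plain ASCII text the two counts are equal, so this would only explain a mismatch if the file contains multi-byte characters.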
