rkg has asked for the wisdom of the Perl Monks concerning the following question:

I am having some trouble reading a file using Activestate perl.

I have a ".csv" file that was emailed to me. It opens fine in Excel. It opens fine in Notepad, tho it seems tab delimited (despite the file name), and the eol chars show in notepad as empty boxes and the file is run onto one line.

Fine -- at this point, I am thinking it is a tab delim file and I need to adjust unix/win line endings.

When I read it in with perl, however, and print the 1st line, I get an extra space after each char:

my $fh = FileHandle->new($file) or die "cannot read open $file";
my $row = <$fh>;
print "|$row|\n";
yields something like this
| !D A I L Y R E P O R T |
where the "!" denotes a black box. When I look at the file in Notepad or Excel, the first line is "DAILY REPORT", not "D A I L Y R E P O R T"

Where are the extra spaces coming from?

Thanks

rkg

Re: simple file question, extra spaces, win32
by SavannahLion (Pilgrim) on Dec 29, 2003 at 18:51 UTC
    The file is probably encoded as Unicode. I get similar behavior when I read from Unicode files and display the file contents in the Command box.

    There's a Unicode module at CPAN you can examine, though I have no direct experience with it.
    The Camel book also mentions turning on Unicode and UTF-8 support, but I don't think I understand how it works in Perl since I haven't had much luck in getting it to work.

    ----
    Is it fair to stick a link to my site here?

    Thanks for your patience.

      In the perl 5.6 series, you had to be more explicit about your use of Unicode. In the 5.8 series, perl did a better job of detecting Unicode automagically, so a use utf8; should only be necessary in very specific circumstances.

      ----
      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      : () { :|:& };:

      Note: All code is untested, unless otherwise stated

        Except this kind of file is not in UTF-8... Instead it's just two bytes per character, most likely in Little Endian form. I think the official name of this encoding is UCS-2.

        At worst, this can be converted to UTF-8 by doing:

        $utf8 = pack 'U*', unpack 'v*', $unicode;
        Not fast, but it'll do the trick. If all you want is ISO-Latin-1, try
        $latin1 = pack 'C*', unpack 'v*', $unicode;

        p.s. Those aren't spaces; most of the extra bytes will be chr(0).
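
A minimal sketch of the pack/unpack conversion above, applied to the byte values rkg reports in this thread. It assumes the data is little-endian with a leading byte order mark and contains no characters outside the 16-bit range (surrogate pairs are not handled). The variable names are only illustrative.

```perl
use strict;
use warnings;

# Bytes as reported in this thread: FF FE, then "URL Report",
# two bytes per character, low byte first.
my $raw = pack 'C*', 255, 254, 85, 0, 82, 0, 76, 0, 32, 0,
                     82, 0, 101, 0, 112, 0, 111, 0, 114, 0, 116, 0;

# 'v*' reads the bytes as little-endian 16-bit values.
my @units = unpack 'v*', $raw;

# The first value is the byte order mark (0xFEFF); drop it if present.
shift @units if @units && $units[0] == 0xFEFF;

# 'U*' turns the code points into a Perl character string.
my $text = pack 'U*', @units;

print "|$text|\n";    # prints |URL Report|
```

Without the `shift`, the byte order mark would stay in the string as an invisible first character.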

Re: simple file question, extra spaces, win32
by hardburn (Abbot) on Dec 29, 2003 at 18:52 UTC

    I suspect the file has special characters that only MS software likes. Try something like this:

    my $row = <$fh>;
    local $, = ' ';
    print map { ord $_ } split //, $row;

    Which should print out the base-10 value of each character in the row with spaces in between. Then check the ASCII value of each character. You should be able to find the problematic character in there.


Re: simple file question, extra spaces, win32
by rkg (Hermit) on Dec 29, 2003 at 18:57 UTC
    I see
    | ■U R L R e p o r t |
    and those spaces are ascii zero, not spaces, it seems
    local $, = ' '; print map { ord $_ } split //, $row;
    generates
    255 254 85 0 82 0 76 0 32 0 82 0 101 0 112 0 111 0 114 0 116 0
    I will look into the Unicode module...

    Thanks for the tip about the moon -- I knew this whole problem involved aliens. At least it isn't the Black Helicopters -- those are very scary.

      Just so you know, the first two bytes (255 254, or FF FE in hex) are the Unicode byte order mark (BOM). It tells whatever is reading the file that it's a Unicode file and what order the bytes are in. Here FF FE means Little Endian. If the file started with FE FF instead, it would be in Big Endian order, and each 0 would come before its character instead of after it.

      Example: 254 255 0 85 0 82 0 76....
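
As a rough sketch of that check, the first two bytes can be tested directly. `bom_byte_order` is a hypothetical helper name, not something from a module, and it only looks for the two UTF-16 marks described above (a UTF-32 little-endian file also starts with FF FE, which this does not distinguish).

```perl
use strict;
use warnings;

# Hypothetical helper: guess the UTF-16 byte order from a leading BOM.
# Returns 'UTF-16LE', 'UTF-16BE', or nothing if no UTF-16 BOM is found.
sub bom_byte_order {
    my ($bytes) = @_;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return;
}

# The start of the data from this thread, and the big-endian example above.
my $little = bom_byte_order(pack 'C*', 255, 254, 85, 0);
my $big    = bom_byte_order(pack 'C*', 254, 255, 0, 85);

print "$little\n";    # prints UTF-16LE
print "$big\n";       # prints UTF-16BE
```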

      I was guessing as to what the problem was without looking at the actual file, but since you posted a tidbit from the file, here's a Unicode FAQ, that should help you to understand how Unicode works.


      What you want is not the Unicode module (is there one?) but the Encode module:
      use Encode;
      $utf16_row = v255.254.85.0.82.0.76.0.32.0.82.0.101.0.112.0.111.0.114.0.116.0; # fake up your data.
      $utf8_row = decode("utf16", $utf16_row);
      print $utf8_row;
      untested
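
Another option on perl 5.8 and later is to let an I/O layer do the decoding while the file is read, instead of calling decode on each row. This is only a sketch: it writes a small sample file first so it can run on its own, and it assumes the real file starts with a byte order mark, as the one in this thread does. The `:raw` in front keeps the Windows CRLF translation from touching the bytes before they are decoded.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny sample file: FF FE, "URL", then CR LF, two bytes per character.
my ($out, $path) = tempfile();
binmode $out;
print {$out} pack 'C*', 255, 254, 85, 0, 82, 0, 76, 0, 13, 0, 10, 0;
close $out or die "cannot close $path: $!";

# The encoding layer consumes the BOM and decodes each line as it is read.
open my $fh, '<:raw:encoding(UTF-16)', $path or die "cannot open $path: $!";
my $row = <$fh>;
close $fh;

# Remove the line ending (CR LF here) by hand.
$row =~ s/\r?\n\z//;

print "|$row|\n";    # prints |URL|
```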
Re: simple file question, extra spaces, win32
by JamesNC (Chaplain) on Dec 30, 2003 at 12:12 UTC
    I think everyone answered your question well. However, I wanted to add a comment here because Microsoft Office 2000 products do NOT always encode Unicode properly (fixed in Office XP, I am told). You can save a doc with special characters as Unicode in Word, close the doc, reopen it in Word, and it is messed up. How on earth it made it through QA is simply amazing to me. The moral of the story is that just because a file opens in Excel or Word and is Unicode doesn't mean it isn't messed up.

    So another solution is to make sure the file is properly encoded. Your best bet is to use Excel to "Save As" the document in UTF-8 if it opens cleanly there. I also found that opening it in Excel, pasting it into Notepad.exe, and saving it as UTF-8 works (... it just does). This is not a Perl problem, and once the document is in a proper Unicode encoding, you really won't know the difference. I am using AS 5.8 on WinXP and Win2K. I would also recommend reading perluniintro - great read.
Re: simple file question, extra spaces, win32
by PodMaster (Abbot) on Dec 29, 2003 at 18:53 UTC
    Where are the extra spaces coming from?
    From the moon :) Your file is probably encoded in some kind of unicode (only time i've seen this type of thing) or something, and when something something, it gets converted to space. binmode may or may not help. locales may or may not have something to do with it.

    Got a copy of the file available for download (if you want a definitive answer)?

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: simple file question, extra spaces, win32
by rkg (Hermit) on Dec 29, 2003 at 19:13 UTC
    ugly but seems to work
    my @rows = map {chomp; s/[\x7F-\xFF]//g; s/[\x00-\x1F]//g; $_} read_file($file);
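
One caveat with stripping bytes this way: it seems to work only while the data is plain ASCII. The sketch below uses a made-up sample ("Café" in the same two-byte encoding, not data from the original file) to compare the two substitutions with a real decode. The substitutions also delete the é, because its byte value (233) falls in the \x7F-\xFF range.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample: FF FE, then "Café", two bytes per character, low byte first.
my $raw = pack 'C*', 255, 254, 67, 0, 97, 0, 102, 0, 233, 0;

# The two substitutions from the post above, applied to a copy.
(my $stripped = $raw) =~ s/[\x7F-\xFF]//g;
$stripped =~ s/[\x00-\x1F]//g;

# Decoding keeps every character and consumes the BOM.
my $decoded = decode('UTF-16', $raw);

printf "stripped: %d chars, decoded: %d chars\n",
    length($stripped), length($decoded);
# prints "stripped: 3 chars, decoded: 4 chars"
```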