friedo has asked for the wisdom of the Perl Monks concerning the following question:

I'm sure this must have come up before, but I'm having trouble searching for it. The docs for Text::CSV_XS say,

A CSV string may be terminated by 0x0A (line feed) or by 0x0D,0x0A (carriage return, line feed).

Unfortunately, some Mac format files terminate their lines with a sole carriage return (\r) and this is breaking our automated file processor which uses Text::CSV_XS. We can run mac2unix (or similar) on the file and that works fine, but we need a way to detect and deal with Mac files in an automated manner. Some ideas are:

  1. Read the first few KB of the file and look for lone \r's. If there are some, run mac2unix on the file.
  2. Use one of Perl's new-fangled IO layers/filters to deal with it. I'm not familiar with the new IO stuff so I don't really know where to begin looking.
  3. Use some option in Text::CSV_XS which I accidentally overlooked. (Oops.)
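Idea 1 can be sketched in a few lines of core Perl. This is only a rough sketch, not a tested production check: the helper name has_lone_cr and the 4096-byte default are assumptions made up for illustration, and a lone \r inside a quoted field would also trigger it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the first $limit bytes of $path contain
# a carriage return that is not followed by a line feed.
sub has_lone_cr {
    my ($path, $limit) = @_;
    $limit ||= 4096;    # assumed default, adjust to taste
    open my $fh, '<:raw', $path or die "Can't open $path: $!";
    my $got = read $fh, my $chunk, $limit;
    close $fh;
    die "Can't read $path: $!" unless defined $got;
    # A CR at the very end of a full chunk may be half of a CRLF pair.
    chop $chunk if $got == $limit && $chunk =~ /\r\z/;
    return $chunk =~ /\r(?!\n)/ ? 1 : 0;
}

# Small demo on a temporary file with bare CR line endings.
my ($tmp, $name) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "foo,bar\rred,green\r";
close $tmp;
print has_lone_cr($name) ? "Mac-style\n" : "LF or CRLF\n";
```

A file flagged this way could then be passed through mac2unix before it reaches the CSV parser.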

Thanks for your help.

Replies are listed 'Best First'.
Re: Text::CSV_XS and line-endings
by Argel (Prior) on Mar 17, 2006 at 00:38 UTC
    It's been a while since I used Text::CSV_XS, but don't you create an IO::File instance and then pass that to your Text::CSV instance? If so, would the following from IO::Handle do what you want? They are about halfway down the page.
    IO::Handle->format_line_break_characters( [STR] ) $:
    IO::Handle->input_record_separator( [STR] ) $/
    Seems like it should honor $/ but perhaps you can force it like so?
    IO::Handle->input_record_separator( "\r" );

    Update1: Looks like this is what you want. However, Text::CSV_XS seems to somehow ignore line 7 (the input_record_separator call). According to the Text::CSV_XS documentation, IO::Handle->getline is what is called, so the above should in theory work. However, $csv->getline returns undef on my simulated Mac test file. Looks like the Decode routine in the .so may be the culprit?

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use IO::File;
    use Text::CSV_XS;
    IO::Handle->input_record_separator( "\r" );
    my $file = defined $ARGV[0] ? $ARGV[0] : 'normal.txt';
    # "or" instead of "||" so the die actually fires when the open fails
    my $io = IO::File->new( $file, "<" ) or die "horribly";
    my $csv = Text::CSV_XS->new;
    # my $test = $io->getline;
    # print Data::Dumper->Dump([$test],['io']);
    my $columns = $csv->getline($io);
    print Data::Dumper->Dump([$columns],['csv']);
    exit 0;
    Also cleaned up some errors in the original portion.

    Update2: Would something like the following work?

    cat mac.txt | perl -e '$/="\r"; while(<>){$_=~s/\015$/\n/; print $_;}'
Re: Text::CSV_XS and line-endings
by roboticus (Chancellor) on Mar 17, 2006 at 02:23 UTC
    friedo--

    This works for me:
    #!/usr/bin/perl -w
    use strict;
    use Text::CSV_XS;

    # Slurp up the whole file (undef $/ so <INF> returns everything)
    open(INF, "<test.mac") || die "Can't open test.mac!";
    my $file = do { local $/; <INF> };
    close(INF);

    # Convert CRs to LFs
    $file =~ s/\015/\012/g;

    # Parse CSV file line-by-line
    my $csv = Text::CSV_XS->new();
    for my $i (split /\012+/, $file) {
        my $status = $csv->parse($i);
        print "ST:", $status;
        for my $j ($csv->fields) {
            print " [", $j, "]";
        }
        print "\n";
    }
    --roboticus
      Since we don't know what OS the client was running, perhaps
      # Convert CRs and CRLFs to LFs
      $file =~ s/\015\012?/\012/g;
      is best. Are LFCRs a possible concern?
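To see what that substitution does to mixed input, here is a small core-Perl sketch. The helper name normalize_eol and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map CRLF and lone CR to LF, leaving existing LFs alone.
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\015\012?/\012/g;
    return $text;
}

# One CRLF-terminated row, one CR-terminated row, one LF-terminated row.
my $mixed = "a,b\015\012c,d\015e,f\012";
my $unix  = normalize_eol($mixed);
print join('|', split /\012/, $unix), "\n";
```

With this pattern an LFCR pair comes out as two LFs, so such a file would gain an empty line after each row.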
        Use Text::FixEOL to fix messed up line endings. It does the sane thing for even really messed up line endings in most cases. That is what it was written for.
        use Text::FixEOL;
        # Convert EOLs in the $file string to unix conventions
        my $fixer = Text::FixEOL->new;
        $file = $fixer->to_unix($file);
      That won't work for rows that contain newlines, which are common in CSVs since they don't have a way to escape them. Text::CSV_XS handles that with its binary option, but only if you let it read the lines itself, for obvious reasons.

      -sam
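A minimal sketch of that point, assuming an in-memory filehandle as a stand-in for a real file (the sample data is made up): with binary => 1, getline keeps a quoted field with an embedded newline together as one field of one row.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Made-up sample: the second field of the first row spans two lines.
my $data = qq{id,"first line\nsecond line",end\nnext,row,here\n};
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag;

# Let the module read the records itself instead of splitting on newlines.
my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;
}
close $fh;

printf "%d rows, %d fields in the first\n", scalar @rows, scalar @{ $rows[0] };
```

Splitting the same data on newlines before calling parse would cut the quoted field in half, which is the failure described above.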

Re: Text::CSV_XS and line-endings
by traveler (Parson) on Mar 16, 2006 at 22:58 UTC
    If some rows contain newlines, won't mac2unix mess those up? Have you tried contacting the author?
      I'm not sure what you mean by "mess them up". It would convert them to Unix line-endings, but as long as the data is text that shouldn't make a big difference.

      -sam

Re: Text::CSV_XS and line-endings
by GrandFather (Saint) on Mar 16, 2006 at 22:04 UTC

    If you have the option, rather than using getline ($io) you could pull the lines out yourself and hand them to parse ($line) (use fields () to get a getline-equivalent result list).


    DWIM is Perl's answer to Gödel
      That won't work with rows containing new-lines, which is very common in CSV data. Text::CSV_XS handles this when given the binary option.

      -sam

        Although in that case OP is pretty stuffed in any case. He would almost have to write a replacement for CSV_XS to deal with the problem - no fun at all!


        DWIM is Perl's answer to Gödel
Re: Text::CSV_XS and line-endings
by zer (Deacon) on Mar 16, 2006 at 21:40 UTC
    the variable $/ can be modified for the \r. this tells perl that \r is the line delimiter instead of \n.

    you can test to see which OS you are on with the $ENV{OS};

    i hope this helps

      Aside from the fact that $/ doesn't affect Text::CSV_XS, $ENV{OS} won't help here either. The issue is the OS of the person who created the file, not the system running the code!

      -sam

      Unfortunately, changing $/ doesn't work because Text::CSV_XS does its own low-level parsing. The OS the code is running on is known; the problem is that the CSV files can come from anywhere. Thanks for the suggestions though.

      In addition to the problems pointed out by others (need the OS on which a file was created not the OS on the current machine), you shouldn't rely on $ENV{OS}, even if it happens to be in your environment. It's not set automatically by perl, and doesn't seem to be set in Gentoo Linux, RedHat Linux, OS X, or Solaris. You should use $^O (equivalent to $Config{osname}).

      perl v5.8.7 i386-linux $Config{osname}=$^O=linux $ENV{OS}=
      perl v5.8.4 i686-linux-thread-multi $Config{osname}=$^O=linux $ENV{OS}=
      perl v5.8.7 i686-linux $Config{osname}=$^O=linux $ENV{OS}=
      perl 5.005_03 sun4-solaris $Config{osname}=$^O=solaris $ENV{OS}=
      perl v5.8.6 darwin-thread-multi-2level $Config{osname}=$^O=darwin $ENV{OS}=
Re: Text::CSV_XS and line-endings
by jZed (Prior) on Mar 17, 2006 at 17:17 UTC
    Sorry to get into this late; it always seems like the postings I most need to see happen when I am taking a break from PM. (Maintainer of Text::CSV_XS here.) The module handles Mac line endings fine: just specify eol="\015" either globally or per table. I should probably revise the docs.

    Update: I originally mentioned "csv_eol", but that's the syntax for DBD::CSV, not Text::CSV_XS. It is now shown correctly as "eol".

      Hi, jZed, thanks for your reply.

      I'm trying the simplest test case I could think of, but I can't get your suggestion to work. Here is what I'm trying.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Text::CSV_XS;
      use Data::Dumper;
      use IO::File;

      my $fh = IO::File->new;
      $fh->open( "<test.csv" ) or die $!;
      my $c = Text::CSV_XS->new( { binary => 1, csv_eol => "\015" } );
      my $d = $c->getline( $fh );
      print Dumper( $d );

      And test.csv contains:

      foo,bar,baz^Mred,green,blue^Mnarf,blatz,quux

      (Where ^M's are \r's)

      That script results in $d being undef, as reported by Data::Dumper. Running the script on the same data with \n's instead of \r's works fine.

        Ooops, sorry, I was giving you DBD::CSV instructions, not Text::CSV_XS instructions. Use "eol=>" instead of "csv_eol=>".
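For reference, a sketch of the test case with that one change applied. The in-memory filehandle is an assumption used here so the snippet runs without test.csv on disk; the three rows are the ones from the post, separated by bare \015 characters.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Stand-in for test.csv from the post: three rows separated by bare CRs.
my $data = "foo,bar,baz\015red,green,blue\015narf,blatz,quux";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

# eol (not csv_eol) tells the parser which record separator to expect.
my $c = Text::CSV_XS->new({ binary => 1, eol => "\015" })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag;

my @rows;
while (my $d = $c->getline($fh)) {
    push @rows, $d;
}
close $fh;

print join(',', @$_), "\n" for @rows;
```

Against a real file, the same constructor options should apply with the IO::File handle from the original script in place of the in-memory one.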