Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I get third party comma separated (csv) files. A csv file can has many non-ascii chars, which should not pose a problem.

My end goal is to filter out most of the csv data and then, with the resultant csv out file, create an Excel spreadsheet using the excellent csv2xls.pl program.

I am using Text::CSV for the I/O. In the code snippet below, the offending item seems to be "or next";

while ( my $row = $csv->getline( $in_fh ) ) {
$row->4 =~ m/Hardware/ or next;
push (@rows, ($row));
}

The difference between the csv2xls.pl program choking or not choking on the final output file depends only upon whether lines of data were filtered out. If I remove the "or next" I get back my original file, and it converts to Excel just fine, ableit loaded with far too much data.

This has baffled me for over a day. Anyone have any ideas?

  • Comment on CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408

Replies are listed 'Best First'.
Re: CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408
by roboticus (Chancellor) on May 21, 2010 at 20:46 UTC

    Considering that the error message is "Quoted field not terminated @ pos 408", I suspect that the data actually has embedded quotes. So the choking you're experiencing may just be the fact that some fields are much wider than expected. For example, consider the following CSV data:

    "The file has three records", 14.0, 7 "Escape a quote with a backslash, like so: \", but does it work?", 18. +7, 4 "Joe Smith, Esq.", 19.4, 6
    Here we have three records, each with a string, a float, and an integer. If your code to parse the CSV file doesn't expect a quote to be escaped with a backslash, it may end the string prematurely, and likely choke when it tries to use "but does it work?" as a float. So check your data for unusual quoting situations, and check whether CSV_XS handles it in the way you expect. (You may have to change some settings...)

    ...roboticus

      I write this in case anyone else runs into this. First, thank you all for working on this. The data I pasted to show the problem was actual data, but the strings in it were shortened for clarity. As for the final solution, I upgraded Text-CSV_XS from 0.67 to 0.73. It did not seem to help at first, but after I rebooted I cannot get it to fail. This has driven me half crazy.
      Again, thanks.
Re: CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408
by graff (Chancellor) on May 22, 2010 at 03:29 UTC
    You said:
    The difference between the csv2xls.pl program choking or not choking on the final output file depends only upon whether lines of data were filtered out. If I remove the "or next" I get back my original file, and it converts to Excel just fine, ableit loaded with far too much data.

    Well, I'm sorry, but you're going to have to prove that, by posting a runnable snippet of code with appropriate data that demonstrates your problem.

    I tried writing a little snippet of code with data to replicate what you describe, and I didn't see any problem:

    #!/usr/bin/perl use strict; use warnings; use Text::CSV_XS; # this is what reported an error, right? my $target = ( @ARGV ) ? shift : 'egik'; my $csv = Text::CSV_XS->new; my @keep; while ( my $row = $csv->getline( \*DATA )) { $row->[4] =~ /[$target]/ or next; push @keep, $row; } print join( " : ", @$_ )."\n" for ( @keep ); __DATA__ a,b,"c,c",d,e,f,g,h b,c,d,e,"f,f",g,h,i c,d,e,f,g,h,"i,i",j "d,d",e,f,g,h,i,j,k e,f,g,h,i,j,k,"l,l" f,"g,g",h,i,j,k,l,m g,h,i,"j,j",k,l,m,n h,i,j,k,l,"m,m",n,o
    I gather your system has both Text::CSV and Text::CSV_XS installed (and former uses the latter when it's available), but since you say CSV_XS reported the error, I used that explicitly. (You can set it back to Text::CSV if you want.)

    By default, the test script (let's call it "test_script") outputs four lines from the test data. If I put one or more matchable letters as a command line arg, it will output as many lines as match the given letters (e.g. test_script f will output 1 line, test_script e-l will output all 8 lines). No problems with quoted fields, even though each line has a quoted field somewhere in it.

    So see if you can post a similar snippet that proves the problem you are talking about.

      Text::CSV is not what is reporting the error. csv2xls.pl is.
      Here is a code snippet that duplicates the problem.

      By the way, the data originators paste lines and/or paragraphs from Word into a form that puts it into a proprietary data base within a program called DOORS. DOORS exported the db as a ',' separated csv file. The csv has under 9K rows with lots of embedded newlines, and non-ascii chars.

      use Text::CSV; use UTF8BOM; use utf8; use Encode; use strict; my $origfile = 'csv.csv'; my $infile = 'csv_in.csv'; `cp -f $origfile $infile`; my $outfile = 'csv_out.csv'; my $in_fh; #`cp csv.csv $infile`; UTF8BOM->remove_from_file($infile); my @rows; my $csv = Text::CSV->new ( { binary => 1 } ) or die "Cannot use CSV: ".Text::CSV->error_diag (); open my $in_fh, "<:encoding(utf8)", "$infile" or die (__LINE__.": \n") +; while ( my $row = $csv->getline( $in_fh ) ) { $row->[4] =~ m/Hardware/ or next; # Causes csv2xls.pl problems. push (@rows, ($row)); } $csv->eof or $csv->error_diag(); close $in_fh; $csv->eol ("\r\n"); my $csv_out; my $num_rows = @rows; print ("There are $num_rows rows\n"); open $csv_out, ">:encoding(utf8)", "$outfile" or die(__LINE__.": \n"); $csv->print ($csv_out, $_) for @rows; close $csv_out or die (__LINE__.": \n"); unlink("csv_out.xls"); `csv2xls.pl -u csv_out.csv`;
        Text::CSV is not what is reporting the error. csv2xls.pl is.

        The title of this thread is "CSV_XS ERROR:..." -- I take that to mean that the Text::CSV_XS module is what emitted the error, regardless of which perl script invoked that module.

        Here is a code snippet that duplicates the problem.

        But there is no way to "duplicate the problem" if there's no sample data available that actually causes the problem. Where's the data?

        I know you don't want to post your entire actual input file, and no one here wants you to do that. The problem for you is to locate the point in the file where the problem occurs, and show a sample that contains just that part, or something that is equivalent to it and also causes the same error report.

        It's sad that Text::CSV(_XS) doesn't make it easy to locate the source of the problem in your data. The error message you gave as the title of this thread says "@ pos 408". If I understand correctly, this will refer to the 408th comma-delimited field, counting from the beginning of the file. If you know how many fields there should be per "record", a little arithmetic will tell you how many records there are between the start of the file and the problem. (Note that I'm referring to "records", not "lines", given that some records include embedded line-breaks.)

        Something to try will be to remove initial records from the input data until the error goes away. There should be one record in particular such that you'll get the error (with a low "pos" number) when it is present at the beginning of the input, and you won't get the error when you remove it from the data.

        Still, the real puzzle, according to your initial description, is why there's an error when you try to do "next" to skip some records, but no error when you don't use "next". Assuming there's an error in the data, it should occur in both cases, and whether you use "next" or not should have nothing to do with finding the error.

        So either your initial description is wrong (e.g. there's something else different about the two cases, besides the presence/absence of "next"), or there's something really strange about the data, which you haven't shown.

Re: CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408
by Tux (Canon) on May 22, 2010 at 12:16 UTC

    None of you is showing how the new () is called (except graff, who passes no arguments).

    The OP explicitely said that it had a lot of non-ASCII characters in the data. If - and only if - that data is valid UTF-8, Text::CSV_XS converts the fields to UTF-8, and accepts it as if binary => 1 was passed to the constructor.

        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

    is with recent versions by far the safest way to instantiate $csv.

    Another problem might be line endings. Unless you specify exactly what you have, recent versions allow all of \n, \r, and \r\n fully automatic. If you however pass something else, and the line ending do not match, you're likely to end up with very long lines.

    What csv2xls.pl are you talking about? Text::CSV_XS comes with csv2xls in it's examples/ folder. Note that my version does not have a .pl extension.


    Enjoy, Have FUN! H.Merijn
      I downloaded your csv2xls and changed the last line of my snippet to
      `perl.exe ./csv2xls -u csv_out.csv`;
      Same error:
      There are 741 rows # CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408 2027EIQ - Quoted field not terminated408 at ./csv2xls line 132, <> lin +e 1.
      Note that, with the said offending line in, the csv_out gets cut down to about 1/10 the size.
      Thanks for your efforts, by the way. This has been dogging me for days.
        I had also changed to
        my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }) or die "Cannot use CSV: ".Text::CSV->error_diag ();
        Thanks
Re: CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408
by wsppan (Acolyte) on May 21, 2010 at 21:25 UTC
    Shouldn't that be: $row->[4] =~ m/Hardware/ or next;
      It is, but since the OP didn't use <code> tags, it was interpreted as a node reference.
Re: CSV_XS ERROR: 2027 - EIQ - Quoted field not terminated @ pos 408
by ww (Archbishop) on May 22, 2010 at 00:28 UTC
    Code tags in the future please, Anonymonk (OP): without them, your square brackets become a shortcut <a href="...">...</a>.

    For more information on the html variant used for markup here, see Markup in the Monastery ... and the very prominent notes on how to format a node beneath the text entry box in which you create a node.

    ...and yes, wsppan, re your observation about $row->[4] =~ m/Hardware/ or next;: that's how it should be... and would have been had Anonymonk read about how to post a node and heeded the instructions.