Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

A problem with Text::CSV

by lihao (Monk)
on Mar 26, 2008 at 20:08 UTC ( [id://676520]=perlquestion: print w/replies, xml ) Need Help??

lihao has asked for the wisdom of the Perl Monks concerning the following question:

hello, monks

I got a weird problem when using Text::CSV to process a list of 119880 records/lines saved in in.csv, the code I am using is as follows:

#!/usr/bin/perl use strict; use warnings; use Text::CSV; my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1, allow_loose_quotes => 1, }) or die Text::CSV->error_diag (); open my $csvfile, '<', 'in.csv' or die "cannot open in.csv for reading +: $!"; open my $fout, '>', 'out.txt' or die "cannot open out.txt for writting +: $!"; my $line_no = 0; while (my $record = $csv->getline($csvfile)) { $line_no++; my $value = "$record->[9], $record->[7], $record->[8] $record->[1 +0]"; printf {$fout} "[%06d] %s\n", $line_no, $value; } __END__

By running this code, I got only 119606 records in out.txt.. so how can I detect the 274 missing records. or are there any robust ways to grab records from a CSV file and process its fields within a while loop.

BTW. I used wc -l filename to count the number of lines contained in filename.

Thanks

lihao

Replies are listed 'Best First'.
Re: A problem with Text::CSV
by Tux (Canon) on Mar 26, 2008 at 20:23 UTC

    I think I can say with confidence that there are no missing records, but there are 274 records with embedded newlines, so they are split over multiple lines in the CSV file. With embedded newlines, wc -l will not reliably tell you how many records there are.


    Enjoy, Have FUN! H.Merijn

      ah, you are right, there are newlines contained in some CSV records, so the number of records 119606 should be OK, right?? I guess my second method is wrong. :)

      One last question: is there any way that I can check if there is any missing records by using $csv->getline($io) method? :-)

      Thanks again

      lihao

      Hi, thank you for the fast response:)

      I actually changed the while loop and found some of the records are from the trailing ^M, and after I removed them, I missed 137 records now, and I can find the problematic records with line_number anyway by now:-) I will check if there are embedded newlines within these fields..

      many thanks

      lihao

      while (my $record = <$csvfile>) { $line_no++; chomp $record; $record =~ s/\cM$//; if ($csv->parse($record)) { my @columns = $csv->fields(); my $value = "$columns[9], $columns[2]"; printf {$fout} "[good][%06d] %s\n", $line_no, $value; } else { printf {$fout} "[bad][%06d] %s\n", $line_no, $record; } }

        This is changing your script from good to bad. This is exactly what you should NOT do. The trailing ^M is part of the field and should not be removed.

        Use the comma-counting code from Narveson, and check if the lines that have a trailing ^M also have less comma's than the lines that seem to be correct. Note that even that is unreliable, as comma's can be part of a field when correctly quoted.

        Best way to find the problematic lines (if any) is to call the new () constructor with no arguments at all, and see where the parsing stops. Then use Text::CSV's diagnostics to see what caused the stop.


        Enjoy, Have FUN! H.Merijn
Re: A problem with Text::CSV
by Narveson (Chaplain) on Mar 26, 2008 at 20:43 UTC
    Are there any robust ways to grab records from a CSV file?

    Yes, use Perl's native getline method (what you're calling when you use angle brackets on a filehandle).

    To see if your discrepancy results from embedded newlines or simply from blank lines, you might try a histogram of commas seen per line.

    my %lines_with_this_many_commas; while (<$csvfile>) { my $comma_count = tr/,/,/; $lines_with_this_many_commas{$comma_count}++; } print "Perl's getline counted $. lines\n"; print "with the following incidence of commas:\n"; my $PRINTF = "%10s %30s\n"; printf $PRINTF, 'Commas', 'Lines with this many commas'; for my $n (sort {$a<=>$b} keys %lines_with_this_many_commas) { printf $PRINTF, $n, $lines_with_this_many_commas{$n}; }
Re: A problem with Text::CSV
by apl (Monsignor) on Mar 26, 2008 at 20:26 UTC
    Are there any blank lines in your CSV file?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://676520]
Approved by kyle
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2024-03-28 10:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found