Takuan Soho has asked for the wisdom of the Perl Monks concerning the following question:

Hello. For my bioinformatics research, I have to read fields from huge files and do some basic operations on those fields. As long as I work with tab-separated files, everything works fine. I can split the lines using something like @line = split /\t/, $_; and then access the fields and do a simple printing task like print "$line[2]\t$line[0]"; (of course, the real manipulations are a little more complex, but this is for the sake of testing the issue). When I started getting CSV files and tried to do the same thing using Text::CSV, everything jammed. Here is a simple program that illustrates what I did and records the time it took:

### Test program
use Text::CSV;

$input_file  = "ACGT.csv";
$output_file = "test.txt";

if (! open ENTREE, "<", $input_file) {
    print "Could not open handle ENTREE for $input_file. $!\n";
}
if (! open SORTIE, ">", $output_file) {
    print "Could not open handle SORTIE for $output_file. $!\n";
}

$start = time();
my $csv_reader = Text::CSV->new();
my @columns;
while (<ENTREE>) {
    $csv_reader->parse($_);
    @columns = $csv_reader->fields();
    print SORTIE "$columns[2]\t$columns[0]\n";
}
$end = time();
$duration = $end - $start;
print "Reading took $duration seconds.\n";

close ENTREE;
close SORTIE;

On a huge file (a couple hundred columns, around a million rows), it took 3858 seconds. More than an hour just to go through a file! Am I doing something inherently wrong?

As a comparison, I converted the CSV file into a tab-separated file using a small Python script based on the Python csv module, and that took less than 200 seconds. Then, back in the Perl script, I did the same manipulation on the converted file (using the split function this time) in 273 seconds.
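The split-based inner loop was essentially this (a reconstructed sketch of the step described above; only the parsing line differs from the test program):

while (<ENTREE>) {
    chomp;
    my @columns = split /\t/;    # plain split on the converted tab-separated file
    print SORTIE "$columns[2]\t$columns[0]\n";
}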

Do you know of a way to deal with CSV files in Perl that will let me reach the same kind of efficiency ... without having to use a Python script on the side?

Thank you very much!

Replies are listed 'Best First'.
Re: Efficiency issue when reading large CSV files
by zwon (Abbot) on Jun 25, 2009 at 18:11 UTC

    Do you have Text::CSV_XS installed?

    BTW, what have you done with <code> tags?

      More precisely,
      • Make sure you are using Text::CSV version 1 or greater

        # Use Text::CSV_XS if it's available or the
        # slower Pure Perl implementation otherwise.
        use Text::CSV 1.000;
        ...
        my $csv_reader = Text::CSV->new();
        ...

        and make sure you have Text::CSV_XS installed.

        >perl -e"use Text::CSV_XS"
      • Or just use Text::CSV_XS directly.

        use Text::CSV_XS;
        ...
        my $csv_reader = Text::CSV_XS->new();
        ...

      The XS version is 50x faster than the Pure Perl implementation, according to the author. Both of the above methods will use the XS version whenever it is installed.
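
      If that one-liner dies, Text::CSV_XS is missing; it can be installed from CPAN in the usual way, e.g.

        >cpan Text::CSV_XS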

      I don't have Text::CSV_XS installed, but I'll install it locally and see if it works better.

      Sorry for the formatting. This is the first time I post, and I am still unsure how everything works. The preview did not show any formatting between the tags, so I included other tags inside the code. I will edit the node.


Re: Efficiency issue when reading large CSV files
by Tux (Canon) on Jun 25, 2009 at 18:58 UTC
    In addition to all the good advice regarding Text::CSV_XS: enable the binary attribute and use a different mode of reading (getline ()) to gain both speed and safety (think embedded newlines):

    open my $entree, "<", $input_file  or die "$input_file: $!";
    open my $sortie, ">", $output_file or die "$output_file: $!";

    my $csv = Text::CSV->new ({ binary => 1 });
    while (my $row = $csv->getline ($entree)) {
        print $sortie "$row->[2]\t$row->[0]\n";
        }
    $csv->eof or $csv->error_diag ();

    To get even finer-grained timings, have a look at Time::HiRes.
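
    A minimal sketch of such timing (names illustrative):

    use Time::HiRes qw( gettimeofday tv_interval );

    my $t0 = [gettimeofday];
    # ... read and process the file ...
    printf "Reading took %.3f seconds.\n", tv_interval ($t0);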


    Enjoy, Have FUN! H.Merijn
Re: Efficiency issue when reading large CSV files
by Tux (Canon) on Jun 25, 2009 at 19:26 UTC

    50 times is the average. For the getline () interface we're currently talking about, it can run to more than 100 times! Here is a rudimentary comparison table from a while ago, comparing the different versions and access methods of the two CSV modules:

    Short version (higher is better):

                          Text::CSV_XS           Text::CSV_PP
                    ----------------------    ----------------
                    0.23  0.25  0.43  0.65    1.00  1.06  1.19
                    ====  ====  ====  ====    ====  ====  ====
    combine   1       70    67    98    96      15    15    14
    combine  10       48    47    96   100       6     6     5
    combine 100       40    40    96    99       5     5     4
    parse     1      100    86    88    89      12     6     5
    parse    10      100    98    93    91       8     3     3
    parse   100       97   100    95    97       7     2     2
    print    io       87    86    94    99      79     6     5
    getline  io       64    64    93   100       -     2     1
                    ----  ----  ----  ----    ----  ----  ----
    average           75    73    94    96      16     5     4



    Enjoy, Have FUN! H.Merijn
      Thank you very much, Tux!
      Indeed, with this new module and the getline () method, the job gets done in a very reasonable time! The speed-up I saw was on the order of 30x rather than 100x, but that's certainly more than enough to make me happy!

      But that raises another question (and this time, it's really a question about "perl wisdom" rather than "perl how-to"). Since both Text::CSV and Text::CSV_XS are object-oriented, both implement the same methods, and one is clearly faster than the other, why wasn't the code of the slower one simply replaced by the code of the faster one? In other words, why have two different modules that do the same thing?

        Because Text::CSV_XS was there - in its extended implementation - way before the current Text::CSV, which was a braindead pure-perl implementation. After some discussion between the author of the current implementation and me, we decided that Text::CSV would best be implemented as a wrapper around the two modules. Text::CSV_XS is, as the name already shows, an XS implementation, which needs an ANSI C compiler, and not everybody has one. That is why there is a pure-perl implementation: as a fallback for those who need the functionality but have no way to compile XS modules.
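
        A simplified sketch of that fallback idea (illustrative only, not the actual Text::CSV source):

        package My::CSV;    # hypothetical wrapper, for illustration

        # Prefer the fast XS backend; fall back to the pure-perl one.
        our @ISA;
        BEGIN {
            if (eval { require Text::CSV_XS; 1 }) {
                @ISA = ('Text::CSV_XS');
            }
            else {
                require Text::CSV_PP;
                @ISA = ('Text::CSV_PP');
            }
        }

        1;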


        Enjoy, Have FUN! H.Merijn
Re: Efficiency issue when reading large CSV files
by jrsimmon (Hermit) on Jun 25, 2009 at 18:34 UTC
    Is there a reason you can't simply split on the comma?
    @line = split(/,/, $_);
    I don't understand, from your post, where the need for an external library comes into the picture.
      Indeed, I would prefer not to use an external library. The problem is that the CSV format allows commas inside quoted fields (for example, a,"b,c",d has three fields, not four). It is unlikely that there will be a quoted comma in my input... but "unlikely" does not mean that there will never be such a thing.
      If I split on commas and there are quoted commas, the computation could be totally messed up, but in ways that would make the mess almost unnoticeable.
        I suspect you could still handle this without an external library. Complex regexes aren't my forte, but I know this forum, where they seem to be a pretty popular pastime...

        Something along the lines of:
        $_ =~ s/some expression meaning a quoted comma/some placeholder/g;
        @line = split(/,/, $_);
        # if the commas are needed
        foreach my $line (@line){
            $line =~ s/some placeholder/,/g;
        }
        or
        @line = split(/some expression that only matches commas not within quotes/, $_);
        You're going to pay a bit of a performance penalty for the second substitution or the more complex matching statement, but it's linear, not exponential, so not too bad. Given what you've seen with Text::CSV, I'd wager it will still be an improvement.
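
        For what it's worth, here is one concrete way to write that second split; it assumes fields contain no embedded newlines and no escaped quotes (a sketch, not production code):

        # Split only on commas followed by an even number of double
        # quotes, i.e. commas that are not inside a quoted field.
        my @line = split /,(?=(?:[^"]*"[^"]*")*[^"]*$)/, $_;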

        Perhaps I'm a purist, but I am loath to pull in another module when the task can be accomplished without it.