in reply to Efficiency issue when reading large CSV files

50 times is optimal on average. For the getline () we're currently talking about, it may run to more than 100 times! Here is a rudimentary comparison table from a while ago, comparing the different versions and access methods of the two CSV modules:

Short version (higher is better):

                Text::CSV_XS       Text::CSV_PP
             -------------------  --------------
             0.23 0.25 0.43 0.65  1.00 1.06 1.19
             ==== ==== ==== ====  ==== ==== ====
combine   1    70   67   98   96    15   15   14
combine  10    48   47   96  100     6    6    5
combine 100    40   40   96   99     5    5    4
parse     1   100   86   88   89    12    6    5
parse    10   100   98   93   91     8    3    3
parse   100    97  100   95   97     7    2    2
print    io    87   86   94   99    79    6    5
getline  io    64   64   93  100     -    2    1
             ---- ---- ---- ----  ---- ---- ----
average        75   73   94   96    16    5    4

Long version:

CSV_XS      0.23 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.34 0.35 0.36 0.37 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 0.63 0.64 0.65
            ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
combine   1   70   67   66   67   63   96   96   98   95  100   93   96   97   97   99   95   98   97   97   97   95   94   94   94   96   96   96   94   95   91   96   98   98   96   98   96
combine  10   48   47   47   47   47   98   96   97   98   97   94   93   96   94   96   96   96   98   96   96   97   93   94   93   98   99   99   99   99   97   98   99   99   99   99  100
combine 100   40   40   39   40   40   96   94   95   95   95   95   94   95   94   95   95   96   96   96   95   95   93   93   93   98   99   99  100   99   97  100   98   99   99   99   99
parse     1  100   86   86   84   77   89   91   91   90   87   87   89   90   89   89   89   88   89   83   89   88   87   87   87   88   89   88   88   86   87   88   88   87   86   87   89
parse    10  100   98   96   96   93   94   97   96   97   95   97   96   97   92   95   94   93   94   84   92   93   89   89   93   92   95   95   95   90   91   91   91   90   91   96   91
parse   100   97  100  100  100   97  100  100   97   97   97  100  100  100   95   95   97   95   95   85   95   95   95   95   97   97   97   95   97   95   97   97   97   95   97  100   97
print    io   87   86   87   86   86   95   96   96   96   96   90   93   94   95   94   95   94   94   95   94   95   91   92   93   96   97   97   97   96   97   99   97  100   99   98   99
getline  io   64   64   63   63   61   64   64   62   63   62   64   63   65   96   99   98   93   93   95   93   93   93   93   95   96   98   96   95   95   99   97   96   98  100   95  100
            ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
average       75   73   73   72   70   91   91   91   91   91   90   90   91   94   95   94   94   94   91   93   93   91   92   93   95   96   95   95   94   94   95   95   95   95   96   96

CSV_PP      1.00 1.02 1.05 1.06 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19
            ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
combine   1   15   15   15   15   16   15   16   15   15   16   14   14   14   14   14   14
combine  10    6    6    6    6    6    6    6    5    6    6    5    5    5    5    5    5
combine 100    5    4    5    5    5    5    5    4    4    4    4    4    4    4    4    4
parse     1   12   12   11    6    6    6    6    6    6    6    5    5    6    5    6    5
parse    10    8    8    7    3    3    3    3    3    3    3    3    3    3    3    3    3
parse   100    7    7    7    2    2    2    2    2    2    2    2    2    2    2    2    2
print    io   79   76    6    6    6    6    6    6    6    6    5    5    5    5    5    5
getline  io    -    -    4    2    2    2    2    1    1    1    1    1    1    1    1    1
            ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
average       16   16    7    5    5    5    5    5    5    5    4    4    5    4    5    4

Enjoy, Have FUN! H.Merijn
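
For readers who have not used it, the getline () access method measured in the "getline io" rows is typically called in a loop like the one below. This is only a minimal sketch: the helper name read_csv_rows and the sample data are made up for illustration, and it assumes a Text::CSV_XS recent enough to provide eof () and error_diag ().

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: read every record from an open filehandle with
# getline () and return a reference to a list of rows.
sub read_csv_rows {
    my ($fh) = @_;

    # binary => 1 allows embedded newlines and non-ASCII data in fields
    my $csv = Text::CSV_XS->new ({ binary => 1 })
        or die "Cannot create parser: " . Text::CSV_XS->error_diag ();

    my @rows;
    while (my $row = $csv->getline ($fh)) {
        push @rows, $row;    # each $row is a reference to a list of fields
        }

    # getline () returns undef on end of file and on parse errors alike,
    # so check which of the two ended the loop
    $csv->eof or die "Parse error: " . $csv->error_diag ();

    return \@rows;
    }

# Made-up sample data, read through an in-memory filehandle
my $data = qq{id,name\n1,"Soho, Takuan"\n2,Tux\n};
open my $fh, "<", \$data or die "open: $!";
my $rows = read_csv_rows ($fh);
close $fh;

printf "%d rows read, name in second row: %s\n", scalar @$rows, $rows->[1][1];
```

For a real file, replace the in-memory handle with open my $fh, "<", $filename. Collecting all rows in memory is done here only to keep the sketch short; for large files you would normally process each row inside the loop.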

Re^2: Efficiency issue when reading large CSV files
by Takuan Soho (Acolyte) on Jun 26, 2009 at 15:01 UTC
    Thank you very much Tux!
    Indeed, with this new module and the "getline()" method, the job can be done in a very reasonable time! The increase in speed that I experienced was on the order of 30x rather than 100x, but that's certainly more than enough to make me happy!

    But that raises another question (and this time, it's really a question about "perl wisdom" and not about "perl how-to"). Since both Text::CSV and Text::CSV_XS are object oriented, and since both implement the same methods, and since one is clearly faster than the other, why wasn't the code of the slower one simply replaced by the code of the faster one? In other words, why two different modules to do the same thing?

      Because Text::CSV_XS was there - in its extended implementation - way before Text::CSV, which was a braindead pure-perl implementation. After some discussion between the author of the current implementation and me, we decided that Text::CSV would best be implemented as a wrapper around the two modules. Text::CSV_XS is, as the name already shows, an XS implementation, which needs an ANSI C compiler, which not everybody has. That is why there is a pure-perl implementation, as a fallback for those who need the functionality but do not have the means to compile.


      Enjoy, Have FUN! H.Merijn
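
The fallback idea described above can be sketched in a few lines. This is an illustration of the principle only, not the actual code inside Text::CSV: the function name pick_csv_backend is hypothetical. In practice, loading Text::CSV makes this choice for you.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the first CSV backend that can be
# loaded, preferring the XS implementation over the pure-perl one.
# Returns nothing if neither is installed.
sub pick_csv_backend {
    for my $class (qw( Text::CSV_XS Text::CSV_PP )) {
        # Turn Text::CSV_XS into Text/CSV_XS.pm for a run-time require
        (my $file = "$class.pm") =~ s{::}{/}g;
        return $class if eval { require $file; 1 };
        }
    return;
    }

if (my $backend = pick_csv_backend ()) {
    # Both classes implement the same methods, so the calling code
    # does not need to know which one was picked
    my $csv = $backend->new ({ binary => 1 });
    print "Using $backend\n";
    }
else {
    print "Neither Text::CSV_XS nor Text::CSV_PP is installed\n";
    }
```

Because the two backends share an interface, code written against one should run unchanged on the other, only slower, which is what the benchmark tables earlier in the thread show.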