Re: Efficiency issue when reading large CSV files

50 times is optimal on average. For the getline () we're currently talking about, it may run into more than 100 times! Here is a rudimentary comparison table from a while ago, comparing the different versions and access methods of the two CSV modules:

Short version (higher is better):
                   Text::CSV_XS         Text::CSV_PP
             ----------------------  ----------------
             0.23  0.25  0.43  0.65  1.00  1.06  1.19 
             ====  ====  ====  ====  ====  ====  ==== 
combine   1    70    67    98    96    15    15    14 
combine  10    48    47    96   100     6     6     5 
combine 100    40    40    96    99     5     5     4 
parse     1   100    86    88    89    12     6     5 
parse    10   100    98    93    91     8     3     3 
parse   100    97   100    95    97     7     2     2 
print    io    87    86    94    99    79     6     5 
getline  io    64    64    93   100     -     2     1 
             ----  ----  ----  ----  ----  ----  ---- 
average        75    73    94    96    16     5     4

Long version:

CSV_XS      0.23 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.34 0.35 0.36 0.37 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 0.63 0.64 0.65
            ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
combine   1   70   67   66   67   63   96   96   98   95  100   93   96   97   97   99   95   98   97   97   97   95   94   94   94   96   96   96   94   95   91   96   98   98   96   98   96
combine  10   48   47   47   47   47   98   96   97   98   97   94   93   96   94   96   96   96   98   96   96   97   93   94   93   98   99   99   99   99   97   98   99   99   99   99  100
combine 100   40   40   39   40   40   96   94   95   95   95   95   94   95   94   95   95   96   96   96   95   95   93   93   93   98   99   99  100   99   97  100   98   99   99   99   99
parse     1  100   86   86   84   77   89   91   91   90   87   87   89   90   89   89   89   88   89   83   89   88   87   87   87   88   89   88   88   86   87   88   88   87   86   87   89
parse    10  100   98   96   96   93   94   97   96   97   95   97   96   97   92   95   94   93   94   84   92   93   89   89   93   92   95   95   95   90   91   91   91   90   91   96   91
parse   100   97  100  100  100   97  100  100   97   97   97  100  100  100   95   95   97   95   95   85   95   95   95   95   97   97   97   95   97   95   97   97   97   95   97  100   97
print    io   87   86   87   86   86   95   96   96   96   96   90   93   94   95   94   95   94   94   95   94   95   91   92   93   96   97   97   97   96   97   99   97  100   99   98   99
getline  io   64   64   63   63   61   64   64   62   63   62   64   63   65   96   99   98   93   93   95   93   93   93   93   95   96   98   96   95   95   99   97   96   98  100   95  100
            ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
average       75   73   73   72   70   91   91   91   91   91   90   90   91   94   95   94   94   94   91   93   93   91   92   93   95   96   95   95   94   94   95   95   95   95   96   96

CSV_PP      1.00 1.02 1.05 1.06 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19
            ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
combine   1   15   15   15   15   16   15   16   15   15   16   14   14   14   14   14   14
combine  10    6    6    6    6    6    6    6    5    6    6    5    5    5    5    5    5
combine 100    5    4    5    5    5    5    5    4    4    4    4    4    4    4    4    4
parse     1   12   12   11    6    6    6    6    6    6    6    5    5    6    5    6    5
parse    10    8    8    7    3    3    3    3    3    3    3    3    3    3    3    3    3
parse   100    7    7    7    2    2    2    2    2    2    2    2    2    2    2    2    2
print    io   79   76    6    6    6    6    6    6    6    6    5    5    5    5    5    5
getline  io    -    -    4    2    2    2    2    1    1    1    1    1    1    1    1    1
            ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
average       16   16    7    5    5    5    5    5    5    5    4    4    5    4    5    4
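
For readers unfamiliar with the four access methods named in the table rows, here is a minimal sketch of what each one looks like with Text::CSV_XS. The sample data and variable names are made up for illustration, and this is not the benchmark code that produced the numbers above.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({ binary => 1 });

# getline: read and parse one record at a time from a file handle.
# An in-memory handle stands in for a real (large) CSV file here.
my $data = qq{1,"Smith, J."\n2,Jones\n};
open my $fh, "<", \$data or die "open: $!";
my @rows;
while (my $row = $csv->getline ($fh)) {
    push @rows, $row;               # $row is an array ref of fields
}
close $fh;

# combine: turn a list of fields into one CSV string
$csv->combine (1, "Smith, J.") or die "combine failed";
my $line = $csv->string;

# parse: split one CSV string back into fields
$csv->parse ($line) or die "parse failed";
my @fields = $csv->fields;

# print: combine and write straight to a handle (eol set for output)
my $out = Text::CSV_XS->new ({ binary => 1, eol => "\n" });
$out->print (*STDOUT, \@fields) or die "print failed";
```

The getline and print rows measure the handle-based calls; combine and parse work on strings, so your own code has to do the line reading and writing around them.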

Enjoy, Have FUN! H.Merijn

Re^2: Efficiency issue when reading large CSV files by Takuan Soho (Acolyte) on Jun 26, 2009 at 15:01 UTC
Thank you very much Tux! Indeed, with this new module and the "getline()" method, the job can be done in a very reasonable time! The increase in speed that I experienced was on the order of 30x rather than 100x, but that's certainly more than enough to make me happy! But that raises another question (and this time, it's really a question about "perl wisdom" and not about "perl how-to"). Since both Text::CSV and Text::CSV_XS are object oriented, and since both implement the same methods, and since one is clearly faster than the other, why wasn't the code of the slower one simply replaced by the code of the faster one? In other words, why two different modules to do the same thing?
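
For anyone wanting to check the speedup on their own machine, a rough sketch using the core Benchmark module might look like the following. The generated sample data and the helper name count_rows are assumptions for illustration; real files with different field shapes will give different ratios.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Text::CSV_XS;
use Text::CSV_PP;

# Hypothetical sample data: 1000 records, one field needing quotes
my $data = join "", map { qq{$_,"field, with comma",plain\n} } 1 .. 1000;

# Parse all records with getline using the given backend class
sub count_rows {
    my $class = shift;
    my $csv = $class->new ({ binary => 1 });
    open my $fh, "<", \$data or die "open: $!";
    my $n = 0;
    $n++ while $csv->getline ($fh);
    close $fh;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison
cmpthese (-1, {
    XS => sub { count_rows ("Text::CSV_XS") },
    PP => sub { count_rows ("Text::CSV_PP") },
});
```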
Re^3: Efficiency issue when reading large CSV files by Tux (Canon) on Jun 26, 2009 at 17:27 UTC
Because Text::CSV_XS was there - in its extended implementation - way before Text::CSV, which was a braindead pure-perl implementation. After some discussion between the author of the current implementation and me, we decided that Text::CSV would best be implemented as a wrapper around the two modules. Text::CSV_XS is, as the name already shows, an XS implementation, which needs an ANSI C compiler, which not everybody has. That is why there is a pure-perl implementation, as a fallback for those who need the functionality but do not have a compiler available.

Enjoy, Have FUN! H.Merijn
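
As a small sketch of what the wrapper means in practice (assuming a Text::CSV release that provides the backend and is_xs class methods, and with sample data made up for illustration), the same calling code runs on whichever implementation got loaded:

```perl
use strict;
use warnings;
use Text::CSV;

# Text::CSV loads Text::CSV_XS when it is available and otherwise
# falls back to the pure-perl Text::CSV_PP
my $csv = Text::CSV->new ({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag;

my $backend = Text::CSV->backend;
printf "backend: %s (%s)\n", $backend,
    Text::CSV->is_xs ? "XS" : "pure perl";

# The calling code is the same either way
my $data = qq{a,b\n1,2\n};
open my $fh, "<", \$data or die "open: $!";
my $count = 0;
$count++ while $csv->getline ($fh);
close $fh;
print "read $count records\n";
```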