in reply to Re^2: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4
in thread Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4

Hi ... this is the code i have used to

#use 5.18.2; use warnings; use Benchmark qw( cmpthese ); use Text::CSV_PP; use Text::CSV_XS; my ($fh, $csv); sub fh { open $fh, "<", "Proj20101111.csv"; } sub cxs { $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "|" }) +} sub cpp { $csv = Text::CSV_PP->new ({ binary => 1, sep_char => "|" }) +} cmpthese (4, { perl => sub { fh; while (my @row = split /\|/ => <$fh>) {} }, c_xs => sub { fh; cxs; while (my $row = $csv->getline ($fh)) {} }, c_pp => sub { fh; cpp; while (my $row = $csv->getline ($fh)) {} }, cbxs => sub { fh; cxs; $csv->bind_columns (\my($a,$b,$c,$d,$e,$f)) +; while ( $csv->getline ($fh)) {} }, }); close($fh);

and my sample data would be like

RESOURCE_TYPE_DESCR|YEAR_MONTH|ACCOUNTING_PERIOD|TRANSACTION_PERIOD|CLIENT_ID|CLIENT|CLIENT_INDUSTRY|CLIENT_COUNTRY|GL_BU|CONTRACT_SEGMENT|CONTRACT_ID|CONTRACT|CONTRACT_REGION|CONTRACT_START_DATE|CONTRACT_END_DATE|CONTRACT_LINE_STATUS|CONTRACT_LINE_STATUS_DESCRIPTION|CA_STATUS|PRODUCT|CONTRACT_LINE_ID|CONTRACT_LINE|CONTRACT_LINE_START_DATE|CONTRACT_LINE_END_DATE|PC_BUSINESS_UNIT|PROJECT_TYPE_ID|PROJECT_TYPE|COST_TYPE|PROJECT_GL_BU|PROJECT_PRACTICE|PROJECT_DEPT_ID|PROJECT_DEPARTMENT|ENFORCE_TEAM|PROJECT_ID|PROJECT|INTEGRATION_TEMPLATE|PM_ID|PM|PM_NOTES_ID|PROJECT_STATUS|PROJECT_START_DATE|PROJECT_END_DATE|ACT_ID|ACTIVITY|DEFERRABLE__Y_N_|ACTIVITY_TYPE|AM_ID|AM|AM_NOTES_ID|ACTIVITY_PROCESSING_STATUS|ACTIVITY_START_DATE|ACTIVITY_END_DATE|GL_DEPT_ID|GL_DEPT|GL_DEPT_SEGMENT|GL_DEPT_PRACTICE|RESOURCE_GL_BU|RESOURCE_SEGMENT|RESOURCE_PRACTICE|RESOURCE_DEPT_ID|RESOURCE_DEPARTMENT|RESOURCE_LOCATION|RESOURCE_ID|RESOURCE|RESOURCE_NOTES_ID|HR_MANAGER|HR_MANAGER_NOTES_ID|JOB_CODE_ID|JOB_CODE|REVENUE_TYPE__RPID_|CURRENCY|HOURS_QUANTITY|UNIT_OF_MEASURE|BILL_RATE|R_HOURS|R_COST|O_HOURS|O_COST|W_HOURS|W_COST|I_HOURS|I_COST|S_HOURS|S_COST|TOTAL_HOURS|FTE|REVENUE_DISCOUNT|REVENUE_PREMIUM|INTERCOMPANY_REVENUE|DUMMY_1|REVENUE|XSEG|BILLABLE|DIRECT_COST|SALES_COST|CAPITALIZED_COST|WIP|WIP_FORWARD|WIP_WRITE_UP_OFF|INVOICED|BILL_PLAN__BPID_|INVOICE__|AR_WRITE_UP_OFF|AP_VENDOR_ID|AP_VENDOR|AP_VOUCHER_ID|AP_VOUCHER_LINE|ASSET_BU|ASSET_ID|ASSET|ASSET_PROFILE_ID|ASSET_DEPT|PE_ID|PE|AE_ID|AE|CNS_CL_ID|CNS_CL|CFA_ID|CFA|CFA_NOTES_ID|PRODUCT_ID|EFFECTIVE_DATE_OF_CURRENT_BILL_RATE|RATE_PLAN|RATE_PLAN_DESCR|REVENUE_HOLD|BILL_PLAN_HOLD|AM_DISTRIB_STATUS|BILL_RATE_USED_TO_PRICE_TRANSACTION|DISTRIBUTED_REVENUE|NON_DISTRIBUTED_REVENUE|NET_ACCRUED_REVENUE|NET_ACCRUED_COST ""|201407|"Jul-2014"|" "|"BLANK"|" "|" "|" "|"CAN01"|" "|" "|" "|" "|"09/09/9999 00:00:00"|"09/09/9999 00:00:00"|" "|" "|" "|"N/A"|0|" "|"09/09/9999 00:00:00"|"09/09/9999 00:00:00"|"CAN1P"|"C051"|"Clnt Ongoing-Contractual-GCB"|" "|"CAN01"|"Global Lrg Mkt Canada Practice"|"100893"|"Client Group 61 - Greenham"|"N"|"00516587"|"Plan Administration"|"Reg_N Amer (US, CAN, Ind-HIN)"|"20291"|"Mian,Aamir "|"A20291@HMTTT.com"|"O"|"11/01/2000 00:00:00"|"12/31/2018 00:00:00"|"003"|"Integrated In Scope: Ppt Fee"|"N"|"01025"|"59687"|"Bessette,Norm"|"N2BESSET@HMTTT.com"|"A"|"11/01/2000 00:00:00"|"12/31/2012 00:00:00"|"100893"|" "|" "|" "|" "|" "|" "|" "|" "|""|"9490402"|"B/W Print"|" "|" "|" "|" "|" "|"

those datas would be in GB size csv files .... :( help ! thanks !

  • Comment on Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4
  • Download Code

Replies are listed 'Best First'.
Re^4: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4
by Tux (Canon) on Feb 12, 2015 at 14:25 UTC

    unless that data example can be public, please remove it again: it has e-mail addresses. If it is world-viewable, please add readmore-tags.

    Now back to the testing example. Explaining the extreme diff in cbxs is easy: you bind for just 6 columns, but you feed it way more (132), so the geline will return false.

    Having created a 30001 line version of your data example, and binding with the correct number of columns, I get

    Rate c_pp c_xs cbxs perl c_pp 3.03e-02/s -- -97% -97% -98% c_xs 1.07/s 3434% -- -1% -34% cbxs 1.08/s 3453% 1% -- -33% perl 1.61/s 5229% 51% 50% --

    which is what I would expect. Now that I have seen your data, note that it actually resembles CSV: it only uses a pipe instead of a comma, but the fields between the pipes *are* quoted, so making a correct parser with split will (eventually) fail

    ""|201407|"Jul|2014"|" "|"BLANK"|" "|" "|" "|"CAN01"|" "|" "|" "|" "|" +09/09/9999 00:00:00"... ^

    See where I manually inserted a pipe? And what if this data allows embedded newlines between the quotes?

    As you are inserting the data between the pipes into XLS, your data will contain the quotes if you just split on pipes. I don't think that is what you want.

    These are the reasons why you need a CSV parser and not split.

    To parse this data quick and reliable, you could go with something like this:

    $ cat test.pl use strict; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "|", auto_diag + => 1 }); open my $fh, "<", "Proj20101111.csv" or die $!; my @hdr = map { lc } @{$csv->getline ($fh)}; my %rec; $csv->bind_columns (\@rec{@hdr}); my %count; while ($csv->getline ($fh)) { $count{$rec{client_id}}++; } printf "%-8s %7d\n", $_, $count{$_} for sort keys %count; $ perl test.pl 104167 1001 116571 4004 BLANK 24024

    HTH


    Enjoy, Have FUN! H.Merijn