Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10

Replies are listed 'Best First'.
Re^2: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 11, 2015 at 15:33 UTC
In your code above, why do you freeze the panes for every cell write? (Or at all?) I was wondering the same. I do see value in `$worksheet->set_column ($col, $col, $width);` after the conversion. Text::CSV and it's siblings may be faster at splitting lines. unlikely, unless the code has embedded specials. I tested with a biggish CSV file with 6 fields: $ cat data.pl use 5.18.2; use warnings; use Benchmark qw( cmpthese ); use Text::CSV_PP; use Text::CSV_XS; my ($fh, $csv); sub fh { open $fh, "<", "data.csv"; } sub cxs { $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\|" }) +} sub cpp { $csv = Text::CSV_PP->new ({ binary => 1, sep_char => "\|" }) +} cmpthese (4, { perl => sub { fh; while (my @row = split /\\|/ => <$fh>) {} }, c_xs => sub { fh; cxs; while (my $row = $csv->getline ($fh)) {} }, c_pp => sub { fh; cpp; while (my $row = $csv->getline ($fh)) {} }, cbxs => sub { fh; cxs; $csv->bind_columns (\my($a,$b,$c,$d,$e,$f)) +; while ( $csv->getline ($fh)) {} }, }); $ perl data.pl Use of uninitialized value in split at data.pl line 15, <$fh> line 260 +0000. Use of uninitialized value in split at data.pl line 15, <$fh> line 280 +0000. Use of uninitialized value in split at data.pl line 15, <$fh> line 300 +0000. Use of uninitialized value in split at data.pl line 15, <$fh> line 320 +0000. Rate c_pp c_xs cbxs perl c_pp 8.87e-02/s -- -95% -96% -98% c_xs 1.85/s 1988% -- -25% -65% cbxs 2.45/s 2667% 33% -- -53% perl 5.26/s 5834% 184% 114% -- [download] Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 10:29 UTC
Hi! Thanks for your reply ! I have tried this code with my csv file of the size 271MB contains 179K records !! and i got the output like this ! </p >perl new_cmp.pl (warning: too few iterations for a reliable count) Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +408638. Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +579912. Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +751186. Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +922460. s/iter c_pp perl + c_xs cbxs c_pp 45.6 -- -85% + -97% -100% perl 6.71 579% -- + -77% -100% c_xs 1.54 2853% 335% + -- -100% cbxs 2.50e-016 18241099999999997952% 2684799999999998464% 617799999999 +998208% -- [download] I dont understand why it is showing lines like 408K - 579K - 751K - 922K ! since my csv file contained the records of 179K and the columns are about #135! :( Please tell me how can i get through this and made it to be done ! Thanks !	[reply] [d/l]
Re^4: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 12, 2015 at 11:10 UTC
Hard to tell from what we see here, but I can only guess that your bench has one or more errors, as cbxs just cannot be that fast (compared to c_xs). So if you want us to give you a better answer than any wild guess possible, you'll need to post your code and your data. update: the high line numbers are because I forgot to close `$fh` at the end of each test. With the close added, here are the new compares: $ perl5.8.4 test.pl Rate c_pp c_xs cbxs perl c_pp 9.63e-02/s -- -94% -97% -98% c_xs 1.65/s 1616% -- -42% -57% cbxs 2.84/s 2845% 72% -- -27% perl 3.88/s 3932% 135% 37% -- $ perl5.20.1 test.pl Use of uninitialized value in split at test.pl line 13, <$fh> line 200 +000. Use of uninitialized value in split at test.pl line 13, <$fh> line 200 +000. Use of uninitialized value in split at test.pl line 13, <$fh> line 200 +000. Use of uninitialized value in split at test.pl line 13, <$fh> line 200 +000. Rate c_pp c_xs cbxs perl c_pp 8.65e-02/s -- -96% -97% -98% c_xs 2.07/s 2295% -- -27% -53% cbxs 2.84/s 3179% 37% -- -35% perl 4.40/s 4980% 112% 55% -- [download] Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 12:50 UTC
Hi ... this is the code i have used to #use 5.18.2; use warnings; use Benchmark qw( cmpthese ); use Text::CSV_PP; use Text::CSV_XS; my ($fh, $csv); sub fh { open $fh, "<", "Proj20101111.csv"; } sub cxs { $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\|" }) +} sub cpp { $csv = Text::CSV_PP->new ({ binary => 1, sep_char => "\|" }) +} cmpthese (4, { perl => sub { fh; while (my @row = split /\\|/ => <$fh>) {} }, c_xs => sub { fh; cxs; while (my $row = $csv->getline ($fh)) {} }, c_pp => sub { fh; cpp; while (my $row = $csv->getline ($fh)) {} }, cbxs => sub { fh; cxs; $csv->bind_columns (\my($a,$b,$c,$d,$e,$f)) +; while ( $csv->getline ($fh)) {} }, }); close($fh); [download] and my sample data would be like RESOURCE_TYPE_DESCR\|YEAR_MONTH\|ACCOUNTING_PERIOD\|TRANSACTION_PERIOD\|CLIENT_ID\|CLIENT\|CLIENT_INDUSTRY\|CLIENT_COUNTRY\|GL_BU\|CONTRACT_SEGMENT\|CONTRACT_ID\|CONTRACT\|CONTRACT_REGION\|CONTRACT_START_DATE\|CONTRACT_END_DATE\|CONTRACT_LINE_STATUS\|CONTRACT_LINE_STATUS_DESCRIPTION\|CA_STATUS\|PRODUCT\|CONTRACT_LINE_ID\|CONTRACT_LINE\|CONTRACT_LINE_START_DATE\|CONTRACT_LINE_END_DATE\|PC_BUSINESS_UNIT\|PROJECT_TYPE_ID\|PROJECT_TYPE\|COST_TYPE\|PROJECT_GL_BU\|PROJECT_PRACTICE\|PROJECT_DEPT_ID\|PROJECT_DEPARTMENT\|ENFORCE_TEAM\|PROJECT_ID\|PROJECT\|INTEGRATION_TEMPLATE\|PM_ID\|PM\|PM_NOTES_ID\|PROJECT_STATUS\|PROJECT_START_DATE\|PROJECT_END_DATE\|ACT_ID\|ACTIVITY\|DEFERRABLE__Y_N_\|ACTIVITY_TYPE\|AM_ID\|AM\|AM_NOTES_ID\|ACTIVITY_PROCESSING_STATUS\|ACTIVITY_START_DATE\|ACTIVITY_END_DATE\|GL_DEPT_ID\|GL_DEPT\|GL_DEPT_SEGMENT\|GL_DEPT_PRACTICE\|RESOURCE_GL_BU\|RESOURCE_SEGMENT\|RESOURCE_PRACTICE\|RESOURCE_DEPT_ID\|RESOURCE_DEPARTMENT\|RESOURCE_LOCATION\|RESOURCE_ID\|RESOURCE\|RESOURCE_NOTES_ID\|HR_MANAGER\|HR_MANAGER_NOTES_ID\|JOB_CODE_ID\|JOB_CODE\|REVENUE_TYPE__RPID_\|CURRENCY\|HOURS_QUANTITY\|UNIT_OF_MEASURE\|BILL_RATE\|R_HOURS\|R_COST\|O_HOURS\|O_COST\|W_HOURS\|W_COST\|I_HOURS\|I_COST\|S_HOURS\|S_COST\|TOTAL_HOURS\|FTE\|REVENUE_DISCOUNT\|REVENUE_PREMIUM\|INTERCOMPANY_REVENUE\|DUMMY_1\|REVENUE\|XSEG\|BILLABLE\|DIRECT_COST\|SALES_COST\|CAPITALIZED_COST\|WIP\|WIP_FORWARD\|WIP_WRITE_UP_OFF\|INVOICED\|BILL_PLAN__BPID_\|INVOICE__\|AR_WRITE_UP_OFF\|AP_VENDOR_ID\|AP_VENDOR\|AP_VOUCHER_ID\|AP_VOUCHER_LINE\|ASSET_BU\|ASSET_ID\|ASSET\|ASSET_PROFILE_ID\|ASSET_DEPT\|PE_ID\|PE\|AE_ID\|AE\|CNS_CL_ID\|CNS_CL\|CFA_ID\|CFA\|CFA_NOTES_ID\|PRODUCT_ID\|EFFECTIVE_DATE_OF_CURRENT_BILL_RATE\|RATE_PLAN\|RATE_PLAN_DESCR\|REVENUE_HOLD\|BILL_PLAN_HOLD\|AM_DISTRIB_STATUS\|BILL_RATE_USED_TO_PRICE_TRANSACTION\|DISTRIBUTED_REVENUE\|NON_DISTRIBUTED_REVENUE\|NET_ACCRUED_REVENUE\|NET_ACCRUED_COST ""\|201407\|"Jul-2014"\|" "\|"BLANK"\|" "\|" "\|" "\|"CAN01"\|" "\|" "\|" "\|" "\|"09/09/9999 00:00:00"\|"09/09/9999 00:00:00"\|" "\|" "\|" "\|"N/A"\|0\|" "\|"09/09/9999 00:00:00"\|"09/09/9999 00:00:00"\|"CAN1P"\|"C051"\|"Clnt Ongoing-Contractual-GCB"\|" "\|"CAN01"\|"Global Lrg Mkt Canada Practice"\|"100893"\|"Client Group 61 - Greenham"\|"N"\|"00516587"\|"Plan Administration"\|"Reg_N Amer (US, CAN, Ind-HIN)"\|"20291"\|"Mian,Aamir "\|"A20291@HMTTT.com"\|"O"\|"11/01/2000 00:00:00"\|"12/31/2018 00:00:00"\|"003"\|"Integrated In Scope: Ppt Fee"\|"N"\|"01025"\|"59687"\|"Bessette,Norm"\|"N2BESSET@HMTTT.com"\|"A"\|"11/01/2000 00:00:00"\|"12/31/2012 00:00:00"\|"100893"\|" "\|" "\|" "\|" "\|" "\|" "\|" "\|" "\|""\|"9490402"\|"B/W Print"\|" "\|" "\|" "\|" "\|" "\|" those datas would be in GB size csv files .... :( help ! thanks !	[reply] [d/l]
Re^4: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 12, 2015 at 14:25 UTC
unless that data example can be public, please remove it again: it has e-mail addresses. If it is world-viewable, please add readmore-tags. Now back to the testing example. Explaining the extreme diff in cbxs is easy: you bind for just 6 columns, but you feed it way more (132), so the geline will return false. Having created a 30001 line version of your data example, and binding with the correct number of columns, I get `Rate c_pp c_xs cbxs perl c_pp 3.03e-02/s -- -97% -97% -98% c_xs 1.07/s 3434% -- -1% -34% cbxs 1.08/s 3453% 1% -- -33% perl 1.61/s 5229% 51% 50% --` [download] which is what I would expect. Now that I have seen your data, note that it actually resembles CSV: it only uses a pipe instead of a comma, but the fields between the pipes are quoted, so making a correct parser with split will (eventually) fail `""\|201407\|"Jul\|2014"\|" "\|"BLANK"\|" "\|" "\|" "\|"CAN01"\|" "\|" "\|" "\|" "\|" +09/09/9999 00:00:00"... ^` [download] See where I manually inserted a pipe? And what if this data allows embedded newlines between the quotes? As you are inserting the data between the pipes into XLS, your data will contain the quotes if you just split on pipes. I don't think that is what you want. These are the reasons why you need a CSV parser and not split. To parse this data quick and reliable, you could go with something like this: `$ cat test.pl use strict; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\|", auto_diag + => 1 }); open my $fh, "<", "Proj20101111.csv" or die $!; my @hdr = map { lc } @{$csv->getline ($fh)}; my %rec; $csv->bind_columns (\@rec{@hdr}); my %count; while ($csv->getline ($fh)) { $count{$rec{client_id}}++; } printf "%-8s %7d\n", $_, $count{$_} for sort keys %count; $ perl test.pl 104167 1001 116571 4004 BLANK 24024` [download] HTH Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re^2: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 10:39 UTC
Hi.. Thanks for your reply! I did freeze only the header row ! just to make it look separate from other rows in the excel sheet (xlsx) ! Thanks !	[reply]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by QM (Parson) on Feb 12, 2015 at 12:14 UTC
I'm not sure if you understood my comment. You only need to freeze panes once, correct? Not on every cell insertion. So either at the beginning, or end, of the script only. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]