Need help in reading csv files >1GB for converting into xlsx files in Solaris 10

GT Learner has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by BrowserUk (Patriarch) on Feb 11, 2015 at 13:56 UTC
See if adding `$workbook->set_optimization();` helps. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply] [d/l]
Re^2: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 05:43 UTC
it didnt work :( I've tried it !! but getting an error like `"Undefined subroutine &main::set_optimization called at program line number".`	[reply] [d/l]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by BrowserUk (Patriarch) on Feb 12, 2015 at 07:00 UTC
Oh well.... With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Anonymous Monk on May 08, 2015 at 10:17 UTC
You need to upgrade Excel::Writer::XLSX to a version that supports `set_optimization()`.	[reply]
Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by QM (Parson) on Feb 11, 2015 at 15:09 UTC
Note that Excel is crap on large files. I'm not sure about the programmatic approach, but interactively, when pasting in large amounts of data, it sometimes helps to turn off auto-recalc. In your code above, why do you freeze the panes for every cell write? (Or at all?) Text::CSV and it's siblings may be faster at splitting lines. You may find a better solution using VB natively in Excel (due to the large input file size). Of course, if you want to do much more than what you've outlined, you'll want to stay in Perl. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^2: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 11, 2015 at 15:33 UTC
In your code above, why do you freeze the panes for every cell write? (Or at all?) I was wondering the same. I do see value in `$worksheet->set_column ($col, $col, $width);` after the conversion. Text::CSV and it's siblings may be faster at splitting lines. unlikely, unless the code has embedded specials. I tested with a biggish CSV file with 6 fields: $ cat data.pl use 5.18.2; use warnings; use Benchmark qw( cmpthese ); use Text::CSV_PP; use Text::CSV_XS; my ($fh, $csv); sub fh { open $fh, "<", "data.csv"; } sub cxs { $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\|" }) +} sub cpp { $csv = Text::CSV_PP->new ({ binary => 1, sep_char => "\|" }) +} cmpthese (4, { perl => sub { fh; while (my @row = split /\\|/ => <$fh>) {} }, c_xs => sub { fh; cxs; while (my $row = $csv->getline ($fh)) {} }, c_pp => sub { fh; cpp; while (my $row = $csv->getline ($fh)) {} }, cbxs => sub { fh; cxs; $csv->bind_columns (\my($a,$b,$c,$d,$e,$f)) +; while ( $csv->getline ($fh)) {} }, }); $ perl data.pl Use of uninitialized value in split at data.pl line 15, <$fh> line 260 +0000. Use of uninitialized value in split at data.pl line 15, <$fh> line 280 +0000. Use of uninitialized value in split at data.pl line 15, <$fh> line 300 +0000. Use of uninitialized value in split at data.pl line 15, <$fh> line 320 +0000. Rate c_pp c_xs cbxs perl c_pp 8.87e-02/s -- -95% -96% -98% c_xs 1.85/s 1988% -- -25% -65% cbxs 2.45/s 2667% 33% -- -53% perl 5.26/s 5834% 184% 114% -- [download] Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 10:29 UTC
Hi! Thanks for your reply ! I have tried this code with my csv file of the size 271MB contains 179K records !! and i got the output like this ! </p >perl new_cmp.pl (warning: too few iterations for a reliable count) Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +408638. Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +579912. Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +751186. Use of uninitialized value in split at new_cmp.pl line 14, <$fh> line +922460. s/iter c_pp perl + c_xs cbxs c_pp 45.6 -- -85% + -97% -100% perl 6.71 579% -- + -77% -100% c_xs 1.54 2853% 335% + -- -100% cbxs 2.50e-016 18241099999999997952% 2684799999999998464% 617799999999 +998208% -- [download] I dont understand why it is showing lines like 408K - 579K - 751K - 922K ! since my csv file contained the records of 179K and the columns are about #135! :( Please tell me how can i get through this and made it to be done ! Thanks !	[reply] [d/l]
Re^4: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 12, 2015 at 11:10 UTC
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 12:50 UTC
Hi ... this is the code i have used to #use 5.18.2; use warnings; use Benchmark qw( cmpthese ); use Text::CSV_PP; use Text::CSV_XS; my ($fh, $csv); sub fh { open $fh, "<", "Proj20101111.csv"; } sub cxs { $csv = Text::CSV_XS->new ({ binary => 1, sep_char => "\|" }) +} sub cpp { $csv = Text::CSV_PP->new ({ binary => 1, sep_char => "\|" }) +} cmpthese (4, { perl => sub { fh; while (my @row = split /\\|/ => <$fh>) {} }, c_xs => sub { fh; cxs; while (my $row = $csv->getline ($fh)) {} }, c_pp => sub { fh; cpp; while (my $row = $csv->getline ($fh)) {} }, cbxs => sub { fh; cxs; $csv->bind_columns (\my($a,$b,$c,$d,$e,$f)) +; while ( $csv->getline ($fh)) {} }, }); close($fh); [download] and my sample data would be like RESOURCE_TYPE_DESCR\|YEAR_MONTH\|ACCOUNTING_PERIOD\|TRANSACTION_PERIOD\|CLIENT_ID\|CLIENT\|CLIENT_INDUSTRY\|CLIENT_COUNTRY\|GL_BU\|CONTRACT_SEGMENT\|CONTRACT_ID\|CONTRACT\|CONTRACT_REGION\|CONTRACT_START_DATE\|CONTRACT_END_DATE\|CONTRACT_LINE_STATUS\|CONTRACT_LINE_STATUS_DESCRIPTION\|CA_STATUS\|PRODUCT\|CONTRACT_LINE_ID\|CONTRACT_LINE\|CONTRACT_LINE_START_DATE\|CONTRACT_LINE_END_DATE\|PC_BUSINESS_UNIT\|PROJECT_TYPE_ID\|PROJECT_TYPE\|COST_TYPE\|PROJECT_GL_BU\|PROJECT_PRACTICE\|PROJECT_DEPT_ID\|PROJECT_DEPARTMENT\|ENFORCE_TEAM\|PROJECT_ID\|PROJECT\|INTEGRATION_TEMPLATE\|PM_ID\|PM\|PM_NOTES_ID\|PROJECT_STATUS\|PROJECT_START_DATE\|PROJECT_END_DATE\|ACT_ID\|ACTIVITY\|DEFERRABLE__Y_N_\|ACTIVITY_TYPE\|AM_ID\|AM\|AM_NOTES_ID\|ACTIVITY_PROCESSING_STATUS\|ACTIVITY_START_DATE\|ACTIVITY_END_DATE\|GL_DEPT_ID\|GL_DEPT\|GL_DEPT_SEGMENT\|GL_DEPT_PRACTICE\|RESOURCE_GL_BU\|RESOURCE_SEGMENT\|RESOURCE_PRACTICE\|RESOURCE_DEPT_ID\|RESOURCE_DEPARTMENT\|RESOURCE_LOCATION\|RESOURCE_ID\|RESOURCE\|RESOURCE_NOTES_ID\|HR_MANAGER\|HR_MANAGER_NOTES_ID\|JOB_CODE_ID\|JOB_CODE\|REVENUE_TYPE__RPID_\|CURRENCY\|HOURS_QUANTITY\|UNIT_OF_MEASURE\|BILL_RATE\|R_HOURS\|R_COST\|O_HOURS\|O_COST\|W_HOURS\|W_COST\|I_HOURS\|I_COST\|S_HOURS\|S_COST\|TOTAL_HOURS\|FTE\|REVENUE_DISCOUNT\|REVENUE_PREMIUM\|INTERCOMPANY_REVENUE\|DUMMY_1\|REVENUE\|XSEG\|BILLABLE\|DIRECT_COST\|SALES_COST\|CAPITALIZED_COST\|WIP\|WIP_FORWARD\|WIP_WRITE_UP_OFF\|INVOICED\|BILL_PLAN__BPID_\|INVOICE__\|AR_WRITE_UP_OFF\|AP_VENDOR_ID\|AP_VENDOR\|AP_VOUCHER_ID\|AP_VOUCHER_LINE\|ASSET_BU\|ASSET_ID\|ASSET\|ASSET_PROFILE_ID\|ASSET_DEPT\|PE_ID\|PE\|AE_ID\|AE\|CNS_CL_ID\|CNS_CL\|CFA_ID\|CFA\|CFA_NOTES_ID\|PRODUCT_ID\|EFFECTIVE_DATE_OF_CURRENT_BILL_RATE\|RATE_PLAN\|RATE_PLAN_DESCR\|REVENUE_HOLD\|BILL_PLAN_HOLD\|AM_DISTRIB_STATUS\|BILL_RATE_USED_TO_PRICE_TRANSACTION\|DISTRIBUTED_REVENUE\|NON_DISTRIBUTED_REVENUE\|NET_ACCRUED_REVENUE\|NET_ACCRUED_COST ""\|201407\|"Jul-2014"\|" "\|"BLANK"\|" "\|" "\|" "\|"CAN01"\|" "\|" "\|" "\|" "\|"09/09/9999 00:00:00"\|"09/09/9999 00:00:00"\|" "\|" "\|" "\|"N/A"\|0\|" "\|"09/09/9999 00:00:00"\|"09/09/9999 00:00:00"\|"CAN1P"\|"C051"\|"Clnt Ongoing-Contractual-GCB"\|" "\|"CAN01"\|"Global Lrg Mkt Canada Practice"\|"100893"\|"Client Group 61 - Greenham"\|"N"\|"00516587"\|"Plan Administration"\|"Reg_N Amer (US, CAN, Ind-HIN)"\|"20291"\|"Mian,Aamir "\|"A20291@HMTTT.com"\|"O"\|"11/01/2000 00:00:00"\|"12/31/2018 00:00:00"\|"003"\|"Integrated In Scope: Ppt Fee"\|"N"\|"01025"\|"59687"\|"Bessette,Norm"\|"N2BESSET@HMTTT.com"\|"A"\|"11/01/2000 00:00:00"\|"12/31/2012 00:00:00"\|"100893"\|" "\|" "\|" "\|" "\|" "\|" "\|" "\|" "\|""\|"9490402"\|"B/W Print"\|" "\|" "\|" "\|" "\|" "\|" those datas would be in GB size csv files .... :( help ! thanks !	[reply] [d/l]
Re^4: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 12, 2015 at 14:25 UTC
Re^2: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GT Learner (Initiate) on Feb 12, 2015 at 10:39 UTC
Hi.. Thanks for your reply! I did freeze only the header row ! just to make it look separate from other rows in the excel sheet (xlsx) ! Thanks !	[reply]
Re^3: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by QM (Parson) on Feb 12, 2015 at 12:14 UTC
I'm not sure if you understood my comment. You only need to freeze panes once, correct? Not on every cell insertion. So either at the beginning, or end, of the script only. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by Tux (Canon) on Feb 11, 2015 at 14:42 UTC
Other things to note in taking this approach, none of which will help you with the problem at hand: This will fail if any of the fields have a pipe symbol, escaped or not This will fail if any of the fields has embedded newline You might hit problems when the data is encoded. UTF-8 and UTF-16 are not always converted automatically in XLS or XLSX, so you might need code like `from_to ($c, "utf-8", "ucs2"); $worksheet->write_unicode ($row, $col++, $c);`. YMMV Dates and numerics might be a PITA unless you tell XLS or XLSX what the format is. `20150203`, `02032015`, or even `2015-03-02`. I've seen XLS and XLSX being extremely stubborn in converting anything that XLS or XLS thinks to look like a US date to save as a US date, even if I tell it to be text. FWIW I NEVER want US dates. Either YYYYMMDD or DD-MM-YYYY is acceptable for me. Same for currency: Imagine you get an amount as `� 2.000,00` and read that into a system which requires `2,000.00`. Enjoy, Have FUN! H.Merijn	[reply] [d/l] [select]
Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by GotToBTru (Prior) on Feb 11, 2015 at 13:32 UTC
Divide and conquer: make smaller files out of larger ones. Dum Spiro Spero	[reply]
Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by CoVAX (Beadle) on Feb 11, 2015 at 18:53 UTC
Here are a few notes from some code I wrote in 2006 (using Win32:OLE and Excel) `# Set the locale to American English, which lets us do things like use # American date formats (e.g. "12-30-98" rather than "30-12-98"). This # will continue to work even when international users have to run this # program on their computers. use Win32::OLE::NLS qw(:DEFAULT :LANG :SUBLANG); $Win32::OLE::LCID = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAU +LT));` [download] `# The value of $Win32::OLE::Warn determines what happens when an OLE e +rror # occurs. If it's 0, the error is ignored. If it's 2, or if it's 1 & + the # script is running under -w, the Win32::OLE module invokes Carp::car +p(). # If $Win32::OLE::Warn is set to 3, Carp::croak() is invoked and the +pro- # gram dies immediately. $Win32::OLE::Warn = 2;` [download] Excel has a per-cell content length limit that is approximately 32,??? characters (it's not 32,767 or 32,768--it's less than this). Exceeding this limit results in silent truncation by Excel if I correctly recall. Searched for donuts and crumpit. Found content instead.	[reply] [d/l] [select]
Re: Need help in reading csv files >1GB for converting into xlsx files in Solaris 10 - Perl v-5.8.4 by CoVAX (Beadle) on Feb 11, 2015 at 19:14 UTC
Have a look at "Excel Hacks" by David & Raina Hawley (O'Reilly), in particular Hack #14 (First Edition) or Hack #15 (Second Edition) titled "Reduce Workbook Bloat". P.S. Use Google to search for 'filetype:pdf excel hacks' Searched for donuts and crumpit. Found content instead.	[reply]