This topic branches out from one of my other posts, Efficient way to sum columns in a file. Since this topic is slightly different from the earlier one, I am starting a new thread.
I tested two ways to cut columns from a delimited file: the first was UNIX cut and the other was a simple Perl script. Unfortunately, the Perl script performed poorly against the cut utility. I ran the tests a few times to make sure the results are statistically significant.
Here are the timed test results:
[sk]% time cut -d, -f"1-15" numbers.csv > out.csv
5.670u 0.340s 0:06.27 95.8%
[sk]% time perl -lanF, -e 'print join ",", @F[0..14];' numbers.csv > out.csv
31.950u 0.200s 0:32.26 99.6%
The above test was done with 500,000 rows and 25 columns. The cut operation was performed to get the first 15 columns. The link above has code to generate random data (thanks to Random Walk).
As you can see, my Perl script is not as good as UNIX cut. I have two questions here:
1. Can this script be improved so that its performance is comparable to the UNIX cut command? If the Perl script could finish in 10 seconds that would be great (a 50% drop in performance)! I am happy to accept that drop because it keeps the script clean and portable (typically I work on UNIX machines, so portability is not a huge requirement). A rough sketch of the kind of change I have in mind is shown after this list.
2. If that is not possible, would you typically consider piping output from cut when the script does not require all of the columns for processing? That is, if the script only needs 3 columns out of a possible 200, would you pipe the 3-column output from cut instead of splitting all 200 columns in Perl and keeping only the 3 that are required? (See the second sketch below for what I mean.)
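For question 1, here is a rough sketch of the kind of change I had in mind (only lightly tested, and the limit of 16 fields is just my assumption for this 15-of-25-column case): give split a field limit so Perl stops splitting once it has the 15 columns I actually want.

    #!/usr/bin/perl
    # Sketch only: keep the first 15 comma-separated columns.
    # The limit of 16 tells split to stop early; fields 0..14 are the
    # wanted columns and field 15 holds the unsplit remainder of the line.
    use strict;
    use warnings;

    while (<>) {
        chomp;
        my @f = split /,/, $_, 16;
        print join(',', @f[0..14]), "\n";
    }

I would run it as perl cutcols.pl numbers.csv > out.csv (cutcols.pl is just a placeholder name I am using here).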
I typically work with large files (~a few million rows by 500-800 columns).
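For question 2, with files that size, this is roughly what I mean by piping from cut (UNIX only; the column numbers 1, 3, 7 and the file name are only examples):

    #!/usr/bin/perl
    # Sketch only: let cut pull out the few needed columns and read its
    # output through a pipe instead of splitting every column in Perl.
    use strict;
    use warnings;

    open my $cut, '-|', 'cut', '-d,', '-f', '1,3,7', 'numbers.csv'
        or die "cannot start cut: $!";

    while (<$cut>) {
        chomp;
        my @cols = split /,/;    # only the 3 columns cut kept
        # ... real processing of @cols would go here ...
    }

    close $cut or warn "cut did not exit cleanly: $?";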
Thanks in advance for your thoughts!
cheers
SK