comment on

> Geez, as a middle point, 5,000 files times 4 MB each is 20,000 MB => 20 GB.

Yep. I get data sets greater that 25 GB all the time.

>You are writing that much data to the file system in the first place.

The bottleneck (of course) comes from writing the data (to an old-fashioned hard disk - things would obviously proceed much more quickly if I had 64 GB of RAM and could write everything to a RAM disk :) ), so I never considered the possibility that writing 7000 files slowed things down much.

> Your app is much bigger than I thought.

I wouldn't describe the app as large at all (obviously the sets of data weigh in heavily). :) Beyond the loops, reading in a one-line second file and a small (less than 1 MB) third file, testing for EOF on the third file, and setting a variable based on changing values in the third file, the remainder (and only significant part) of the script consists of a one line system call. The entire script doesn't exceed 1500 bytes, commented, and not obfuscated in any way.

> But a DB can handle that, even SQLite for 64 bit processor.

Again, the external programs that do the subsequent processing don't work with databases, but I will definitely keep the technique in mind.

> The processing that you do with this data is unclear as well as the size of the result set.

I have kept that rather vague (nothing illegal, but the possibility exists of a "terms of use" violation somewhere along the way). :) The final result ends up as only slightly smaller than the input files (with reformatting and some selective discarding).

That said, I have come up with a few ideas for the "upstream" script which would completely eliminate the need for this one. It would replace the naming system with a different one that would maintain sort order, and could combine some of the files at acquisition time (doing the processing in that script that this script does) that would reduce the number of files (but not the total size of the data) by about an order of magnitude.

Thanks again for your help and suggestions.

In reply to Re^6: looping efficiency by Anonymous Monk
in thread looping efficiency by Anonymous Monk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.