
There are a couple of problems with your current "flat file" solution (and with all flat-file alternatives). First, you are reading the file in line by line. Sure, you could slurp the whole thing, but even then the speed difference isn't huge, and you run into scalability problems as the file grows.

Second, you're splitting lines on various things. split takes a regexp-like pattern, and that means revving up the regexp engine, which, while well optimized, isn't as fast as non-regexp alternatives. The problem is that your current file format doesn't lend itself well to non-regexp (and non-split) alternatives.

If you have control over the datasource, there are a few things you can do for better speed.

One suggestion: instead of using "\n" delimited records and "\t" delimited key/value pairs, go with a more "regular" format. One possibility would be fixed-width fields. With that sort of solution, at least you can unpack each record, which is going to be faster than splitting on a regexp. If each "line" (or record) is of equal byte-length, and each key/value within each record is of fixed width, you can use seek and tell to jump around in the file, and unpack to grab keys/values from each record. It's pretty hard to beat that for speed, within Perl.
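Here's a minimal sketch of the fixed-width idea, just to make it concrete. The layout (a 10-byte key followed by a 20-byte value, space-padded, no record separator) and the sub names write_records and read_record are my own assumptions for illustration, not anything from your data.

```perl
use strict;
use warnings;

# Hypothetical layout: each record is a 10-byte key followed by a
# 20-byte value, space-padded, with no record separator.
my $TEMPLATE = 'A10 A20';
my $RECLEN   = 30;

# Write key/value pairs (array refs) as fixed-width records.
sub write_records {
    my ($path, @pairs) = @_;
    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} pack($TEMPLATE, @$_) for @pairs;
    close $out or die "Cannot close $path: $!";
    return scalar @pairs;
}

# Fetch record number $n (zero-based) by seeking straight to its offset.
# Returns (key, value), or an empty list if the record does not exist.
sub read_record {
    my ($path, $n) = @_;
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    seek $in, $n * $RECLEN, 0 or die "Cannot seek in $path: $!";
    my $got = read $in, my $buf, $RECLEN;
    close $in;
    return unless defined $got && $got == $RECLEN;
    return unpack $TEMPLATE, $buf;    # 'A' strips the trailing padding
}
```

Calling read_record($path, 2) reads only the 30 bytes of the third record, no matter how large the file is. The catch is that pack with 'A10' silently truncates anything longer than ten bytes, so the widths have to be chosen to fit your longest key and value.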

Another possibility is to abandon the flat file and go with a database. You mentioned that you wanted to keep your data in a single file, though. Ok, no problem. Use DBD::SQLite. It is a pretty fast database implementation that stores all of the data in one single file. There is database overhead to consider, but scalability is good, and you don't need to be as careful about maintaining equal-byte-length records with fixed-width fields.
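A rough sketch of what the SQLite route might look like, assuming DBI and DBD::SQLite are installed. The table name kv, its columns, and the sub names are made up for this example; your real schema would follow your data.

```perl
use strict;
use warnings;
use DBI;

# Open (or create) a single-file SQLite database. The table and column
# names are hypothetical, chosen only for illustration.
sub open_store {
    my ($file) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$file", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)');
    return $dbh;
}

# Bulk-load key/value pairs inside one transaction, which avoids
# paying for a commit on every row.
sub load_pairs {
    my ($dbh, %pairs) = @_;
    my $sth = $dbh->prepare('INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)');
    $dbh->begin_work;
    $sth->execute($_, $pairs{$_}) for sort keys %pairs;
    $dbh->commit;
    return scalar keys %pairs;
}

# Look up one value by key; returns undef if the key is absent.
sub lookup {
    my ($dbh, $key) = @_;
    my ($value) = $dbh->selectrow_array(
        'SELECT v FROM kv WHERE k = ?', undef, $key);
    return $value;
}
```

Wrapping the inserts in a single transaction is usually what matters most for load speed with SQLite. After that, lookups go through the primary-key index instead of a scan of the whole file.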

And yet another possibility is to use the Storable module to freeze and thaw your data structures. The module is written in XS (if I'm not mistaken) and is already optimized for speed. It's not as scalable a solution, since the whole structure has to be loaded into memory at once, but speed is pretty good.
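For the Storable approach, something along these lines should work. The sub names save_data, load_data and clone_via_freeze are just illustrative wrappers around the module's own store, retrieve, freeze and thaw.

```perl
use strict;
use warnings;
use Storable qw(store retrieve freeze thaw);

# Write a data structure (passed by reference) to disk in Storable's
# binary format.
sub save_data {
    my ($data, $path) = @_;
    store($data, $path) or die "Cannot store to $path: $!";
    return 1;
}

# Read it back; retrieve returns a reference to the rebuilt structure.
sub load_data {
    my ($path) = @_;
    my $data = retrieve($path) or die "Cannot retrieve from $path: $!";
    return $data;
}

# In-memory round trip: freeze serializes to a string, thaw reverses it.
sub clone_via_freeze {
    my ($data) = @_;
    return thaw(freeze($data));
}
```

The idea is to parse the tab-delimited file once, save_data the resulting hash, and have later runs call load_data instead of re-parsing. If the file has to move between machines with different architectures, Storable's nstore (network byte order) is the safer choice over store.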

Dave

In reply to Re: Re: Re: Need to process a tab delimted file *FAST* by davido
in thread Need to process a tab delimted file *FAST* by devnul
