comment on

If you're sure that indexing the start offset of every line will fit in memory, then go for it. You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.

For that case, and the general case of even larger files, consider a variant of the index-every-nth line idea: index based on disk block (or more likely, some multiple of the disk block). Say you use a block size of 8KB. Then keep the line number of the first complete line starting within each block. When seeking to a given line number, you do a binary search in your index to find the block number that contains the largest line number less than or equal to the line you're looking for. Then you read the block in, scanning linearly through the text for the line you want.

This approach deals with the problematic cases more gracefully -- if you have a huge number of newlines, you'll still only read the block containing the line you want. (Well, you might have to read the following block too, to get the whole line.) Or, if you have enormous single lines, you'll never have the problem of your index giving you a starting position way before the line you want, as might happen if you were indexing every 25th line.

Generally speaking, your worst-case performance is defined in terms of the time to process some number of bytes, not lines, so you'll be better off if your index thinks in terms of bytes.

All that being said, I would guess that this would be overkill for your particular application, and you'd be better off with an offset-per-line index. It's simpler and good enough. And if that gets too big, you can always store the index in a gdbm file.

In reply to Re: Displaying/buffering huge text files by sfink
in thread Displaying/buffering huge text files by spurperl

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.