Hello fellow monks

I'm having trouble deciding on a file access method. Below I describe the candidates that come to mind, along with the pros and cons I can think of. I'd like your input on picking the most appropriate one.

Requirements

binary file; no EOL, no Unicode
file size can range from a couple of kilobytes to ~250 Megabytes
converting from binary to parsed needs sequential access only; skipping forward is possible
converting from parsed to binary needs non-sequential access; it is easiest to build back-to-front, and backtracking is sometimes required to calculate the correct values (e.g. record size information)
there are some 'packed bytes' which contain multiple values (e.g. bit0-3 = value1, bit4-7 = value2)
modifying the file during a parse / write is guaranteed to lead to inconsistency
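
To illustrate the back-to-front requirement, here is a minimal sketch. The record layout (a 2-byte big-endian size in front of each payload, plus a 4-byte total size in front of everything) is an assumption made up for this example, not the real format.

```perl
use strict;
use warnings;

# Hypothetical layout, for illustration only:
# 4-byte total size, then records of (2-byte size, payload).
my @payloads = ('abc', 'xy');

my $binary = '';
for my $payload (reverse @payloads) {
    # The payload is already known here, so its size can be put in front of it.
    $binary = pack('n', length($payload)) . $payload . $binary;
}

# Everything after the header exists by now, so the total size is known too.
$binary = pack('N', length($binary)) . $binary;

printf "%s\n", unpack('H*', $binary);
```

Building from the end means each size field is computed from data that already exists, which is why no backtracking is needed in this simple case.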

Additional limitation: My parser is driven by a user-provided syntax description. I haven't found a way to describe a packed byte as a single value, so its parts are handled as separate values. Parsing and writing therefore require multiple passes over the same byte. The data stream MUST either be able to return to a previous position, or the current byte's content must be cached for a subsequent read.
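
The caching variant could look roughly like the sketch below. The helper make_byte_reader and its next_byte / bits interface are hypothetical names invented for this example; the in-memory handle is only there to keep the sketch self-contained.

```perl
use strict;
use warnings;

# Hypothetical reader: remembers the most recently read byte so that several
# syntax entries can each extract their own bit range from it.
sub make_byte_reader {
    my ($fh) = @_;
    my $cached;    # numeric value of the last byte read, undef at EOF
    return {
        next_byte => sub {
            my $got = read($fh, my $buf, 1);
            die "read failed: $!" unless defined $got;
            return $cached = $got ? ord($buf) : undef;
        },
        # Extract $width bits starting at bit $offset of the cached byte.
        bits => sub {
            my ($offset, $width) = @_;
            return ($cached >> $offset) & ((1 << $width) - 1);
        },
    };
}

my $sample = "\xA3\x7F";    # sample data, not from a real file
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

my $reader = make_byte_reader($fh);
$reader->{next_byte}->();
my $value1 = $reader->{bits}->(0, 4);    # bit0-3
my $value2 = $reader->{bits}->(4, 4);    # bit4-7
print "value1=$value1 value2=$value2\n";
```

With this approach the stream itself never has to move backwards; only the syntax-driven code needs to know that two values share one byte.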

Option a) file handle

open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";

PRO: works perfectly fine when converting from binary to parsed, 'natural' way to handle a file, file size does not need to have an impact on memory consumption

CON: converting from parsed to binary is better done in memory because it's easier to build the data from back-to-front; ~~must protect file against change during parse/write~~
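
For option a), returning to a previous position and skipping forward can both be done with tell and seek. A minimal sketch, assuming a small temporary file with made-up content in place of the real input:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR);
use File::Temp qw(tempfile);

# Sample file so the sketch is self-contained; real code would use its own $filename.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xA3\x10\x20";
close $out or die "close failed: $!";

open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";

my $pos = tell $fh;                                  # remember where the packed byte starts
read($fh, my $first, 1) == 1 or die "short read";
seek($fh, $pos, SEEK_SET) or die "seek failed: $!";  # second pass over the same byte
read($fh, my $second, 1) == 1 or die "short read";

seek($fh, 1, SEEK_CUR) or die "seek failed: $!";     # skip one byte forward
read($fh, my $third, 1) == 1 or die "short read";

close $fh;
```

Memory use stays independent of file size here, since only the bytes actually requested are read.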

Option b) load data into scalar

local $/; my $data = <$fh>;

PRO: ~~drastically reduce the time while parser is vulnerable to file change~~; does not matter where data is read/written.

CON: Must rely on substr to extract values from byte stream and/or manually track position
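
What the manual position tracking in option b) amounts to, as a minimal sketch. The record layout (2-byte big-endian length followed by the payload) is again an assumption for illustration only:

```perl
use strict;
use warnings;

# Hypothetical data: two records of (2-byte big-endian length, payload).
my $data = pack('n a*', 3, 'abc') . pack('n a*', 2, 'xy');

my $pos = 0;
my @payloads;
while ($pos < length($data)) {
    # Read the length field at the current position, then advance past it.
    my $len = unpack('n', substr($data, $pos, 2));
    $pos += 2;

    # Extract the payload and advance past it as well.
    push @payloads, substr($data, $pos, $len);
    $pos += $len;
}

print join(',', @payloads), "\n";
```

Re-reading a byte is trivial in this variant (just do not advance $pos), but every access has to keep the position up to date by hand.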

Option c) memory backed file handle

open my $fh, '<', \$data;

PRO: keep file access methods, ~~while also lessen vulnerability to file change~~

CON: file must be read completely into memory even if just a small number of records are parsed and most of them are simply skipped
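
With option c) the usual read and seek calls keep working on the scalar. A minimal sketch, using a few made-up bytes in place of the slurped file:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);

my $data = "\xA3\x10\x20";    # sample bytes standing in for the slurped file
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

read($fh, my $first, 1) == 1 or die "short read";
seek($fh, -1, SEEK_CUR) or die "seek failed: $!";    # step back onto the same byte
read($fh, my $again, 1) == 1 or die "short read";

seek($fh, 1, SEEK_CUR) or die "seek failed: $!";     # skip one byte forward
read($fh, my $third, 1) == 1 or die "short read";

close $fh;
```

The parsing code does not need to know whether the handle is backed by a file or by a scalar, so options a) and c) could share one implementation.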

Which option seems the most reasonable to you? Is there another option I am not aware of?

Update: crossed out the remarks regarding file access concurrency because they are distracting

In reply to Deciding for a file access method - requesting opinions by Monk::Thomas
