
This is a question about the processing of large data files. Suppose that I have a file containing on the order of 10**6 records or more. The first row of the file contains the tab-delimited names of fields extracted from a database. Every other row contains one record, with data for the fields named in the header row.

NAME    RANK    SOCIAL_SECURITY
George    Washington    000-00-0000
John    Adams    000-00-0001

What distinguishes one such data file from another is the header row: both the particular fields it names and the total number of fields. (The files may even be identically named, differentiated only by the timestamp of their arrival in my system. So the content of the header row is crucial.) Hence, the header row has to be treated differently from all other rows.

A typical approach to this is as follows: First, initialize a flag to a false value, begin to read the file line by line, process the header row to see which fields are present, set the flag to a true value, then process all remaining rows.

$header_seen = 0;
while (<>) {
    unless ($header_seen) {
        # process header to get field names
        $header_seen++;
    } else {
        # process each subsequent record
    }
}
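To make the two placeholder comments concrete, here is a minimal sketch of the flag approach. The sub name `process_with_flag` and the `$on_record` callback are hypothetical, chosen only for illustration, and I am assuming that each record can simply be split on tabs into a hash keyed by the header's field names.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): reads tab-delimited lines from
# $fh, treats the first line as the header, and calls $on_record with a
# hash reference for every later line. Returns the field names.
sub process_with_flag {
    my ($fh, $on_record) = @_;
    my $header_seen = 0;
    my @fields;
    while (my $line = <$fh>) {
        chomp $line;
        if (!$header_seen) {
            # Header row: remember the field names.
            @fields = split /\t/, $line;
            $header_seen = 1;
        }
        else {
            # Data row: the -1 limit keeps trailing empty fields.
            my %record;
            @record{@fields} = split /\t/, $line, -1;
            $on_record->(\%record);
        }
    }
    return \@fields;
}

# Usage: an in-memory filehandle stands in for the real data file.
my $sample = "NAME\tRANK\tSOCIAL_SECURITY\n"
           . "George\tWashington\t000-00-0000\n"
           . "John\tAdams\t000-00-0001\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $count = 0;
my $fields = process_with_flag($fh, sub { $count++ });
close $fh;
print "Fields: @$fields; records: $count\n";
```

Passing each record to a callback, rather than collecting all of them in an array, keeps memory use flat regardless of how many rows the file has.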

This requires that I check the status of $header_seen on each line of the file. I suppose that I could use Tie::File and process row 0 differently from all others ... but this is likely to be slower and less memory efficient.
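For comparison, the Tie::File variant I have in mind would look roughly like the sketch below. The sub name `process_with_tie` and the `$on_record` callback are hypothetical, and I am again assuming that records can be split on tabs. Tie::File ships with Perl and strips the record separator by default, so no `chomp` is needed.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper: ties the file to an array so that row 0 (the header)
# can be handled separately from rows 1 .. N. Returns the field names.
sub process_with_tie {
    my ($filename, $on_record) = @_;
    tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
        or die "Cannot tie $filename: $!";

    # Row 0 is the header; an empty file yields no field names.
    my @fields = defined $lines[0] ? split(/\t/, $lines[0]) : ();

    # No flag test inside the loop, but every access goes through the tie.
    for my $i (1 .. $#lines) {
        my %record;
        @record{@fields} = split /\t/, $lines[$i], -1;
        $on_record->(\%record);
    }

    untie @lines;
    return \@fields;
}

# Usage: pass the data file's name on the command line.
if (@ARGV) {
    my $count = 0;
    my $fields = process_with_tie($ARGV[0], sub { $count++ });
    print "Fields: @$fields; records: $count\n";
}
```

This removes the per-line flag test, but asking for `$#lines` makes Tie::File find the end of every record before the loop starts, which is part of why I expect it to cost more than the plain `while` loop.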

Is there any other approach to this problem?

Thank you very much.

Jim Keenan

In reply to Line-by-line processing of a file where the first line is different by jkeenan1
