jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:

This is a question about the processing of large data files. Suppose that I have a file on the order of 10**6 records or more. The first row of the file contains tab-delimited names of fields extracted from a database. All other rows contain one record each, with data for the fields named in the header row.

NAME	RANK	SOCIAL_SECURITY
George Washington	000-00-0000
John Adams	000-00-0001

What distinguishes one such data file from another is the header row: both which particular fields are present and how many there are in total. (The files may even be identically named, differentiated only by the timestamp of their arrival in my system, so the content of the header row is crucial.) Hence, the header row has to be treated differently from all other rows.

A typical approach to this is as follows: First, initialize a flag to a false value, begin to read the file line by line, process the header row to see which fields are present, set the flag to a true value, then process all remaining rows.

$header_seen = 0;
while (<>) {
    unless ($header_seen) {
        # process header to get field names
        $header_seen++;
    }
    else {
        # process each subsequent record
    }
}

This requires that I check the status of $header_seen on each line of the file. I suppose that I could use Tie::File and process row 0 differently from all others ... but this is likely to be slower and less memory efficient.
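Something like this, where $filename is assumed to hold the path to the incoming file:

use Tie::File;

# Tie the file to an array so that row 0 (the header) can be handled
# separately from all other rows.  $filename is assumed to hold the
# path to the incoming data file.
tie my @records, 'Tie::File', $filename
    or die "Cannot tie $filename: $!";

my @fields = split /\t/, $records[0];    # header row
for my $i (1 .. $#records) {
    # process the record in $records[$i]
}

untie @records;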

Is there any other approach to this problem?

Thank you very much.

Jim Keenan

Replies are listed 'Best First'.
Re: Line-by-line processing of a file where the first line is different
by xdg (Monsignor) on Jul 11, 2006 at 11:24 UTC

    Why not just read one line first?

    my $header_row = <>;
    # process it
    while (<>) {
        # process the file
    }

    If you're not sure whether the first line is the header, loop until you find it, break out of that header-search loop, and then loop through the data with a separate loop.
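    A minimal sketch of that, where the tab test is just a stand-in for whatever actually identifies the header line:

    my @fields;
    while (<>) {
        chomp;
        if (/\t/) {                # stand-in test for "this is the header"
            @fields = split /\t/;
            last;                  # break out of the header-search loop
        }
    }
    while (<>) {
        # process the data rows using @fields
    }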

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      Elementary, my dear Watson!

      I know I learned that way back when ... but you can get in such a rut writing code in the same way over and over that you forget the basics!

      I will try it out! Thanks.

      Jim Keenan
Re: Line-by-line processing of a file where the first line is different
by Corion (Patriarch) on Jul 11, 2006 at 11:27 UTC

    If you can be reasonably sure that your header will only appear once in your input file, then the following approach can work:

    my $header = <>;
    while (<>) {
        # process each subsequent record
    }

    If you are really reading from <> instead of looping over the files yourself, you cannot be reasonably sure that a header line will not occur again in your input stream, since each file likely has its own header line. In that case, I think you can check $ARGV to see whether the current input file has changed, and discard the first line of each new file as its header (see the sketch below).

    The simpler approach above will of course only work if no header lines come in the middle of your input stream, which is likely to happen when multiple files are concatenated together...
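    A minimal sketch of that $ARGV check, assuming the first line of every input file is its header:

    my $current_file = '';
    while (<>) {
        if ($ARGV ne $current_file) {
            # first line of a new input file: treat it as that file's header
            $current_file = $ARGV;
            # process header
            next;
        }
        # process each data record
    }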

      The files being processed are delivered by an overall framework. I will have to see whether that framework can guarantee delivery of only one file at a time. But, in any event, we shift the filename off @ARGV and read it through a filehandle. So I think we can meet the conditions you describe. Thanks for clarifying this.
      Jim Keenan
Re: Line-by-line processing of a file where the first line is different
by Ieronim (Friar) on Jul 11, 2006 at 13:05 UTC
    You can open each file separately in a simple foreach loop:
    foreach my $file (@ARGV) {
        open my $fh, $file
            or warn("Cannot open file $file: $!\n"), next;
        my $header = <$fh>;
        # process header
        while (<$fh>) {
            # process file
        }
    }
    This is generally a more reliable and controllable process than using <>. Reading the first line with something like my $header = <$fh>; was already described in the comments above.
Re: Line-by-line processing of a file where the first line is different
by GrandFather (Saint) on Jul 11, 2006 at 22:20 UTC

    If you are comfortable with SQL, you may find that DBD::CSV does a useful job for you.
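    A minimal sketch of that idea, assuming tab-separated files with a .tsv extension in the current directory (the table name records and the selected columns are illustrative):

    use DBI;

    # Treat tab-separated files in the current directory as SQL tables;
    # DBD::CSV takes the column names from each file's header row.
    my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
        f_dir        => '.',
        f_ext        => '.tsv',
        csv_sep_char => "\t",
        RaiseError   => 1,
    });

    # "records" maps to ./records.tsv (an assumed file name)
    my $sth = $dbh->prepare('SELECT NAME, SOCIAL_SECURITY FROM records');
    $sth->execute;
    while (my $row = $sth->fetchrow_hashref) {
        # process each record by field name, e.g. $row->{NAME}
    }
    $dbh->disconnect;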


    DWIM is Perl's answer to Gödel