nosbod has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

OK, I have a series of flat files in which each line holds the information relating to one individual. A method is called (via $obj->get_next) and the next line (individual) is read and dealt with.

I have a series of modules, each to deal with a slightly different format of this input file. My problem is that in one of the input files an individual's data exists as a column rather than a row.

So, what is the best way of reading one individual at a time?
Each time $obj->get_next is called, do I read the whole file, pulling out the correct column position from each row? Or do I read the whole file once (the first time $obj->get_next is called) and store it in memory?

(I know that I could read the whole file in the first time it is seen, write it back out in a preferred format, and then read from that file, but I am trying to avoid writing out to a file.) Is there a nicer way of doing this?
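
For what it's worth, here is a minimal sketch of the second option (read and pivot the file once, on the first get_next call, and cache the result). It assumes a whitespace-delimited file; the package, constructor, and attribute names are invented for illustration.

    package ColumnFile;    # hypothetical name
    use strict;
    use warnings;

    sub new {
        my ( $class, $file ) = @_;
        return bless { file => $file, rows => undef }, $class;
    }

    sub get_next {
        my ($self) = @_;

        # First call: slurp the file and pivot columns into rows.
        if ( !defined $self->{rows} ) {
            open my $fh, '<', $self->{file} or die "open $self->{file}: $!";
            my @cols;
            while ( my $line = <$fh> ) {
                push @cols, [ split ' ', $line ];
            }
            close $fh;

            # Column $n taken across every line becomes individual $n.
            $self->{rows} = [
                map {
                    my $n = $_;
                    [ map { $_->[$n] } @cols ];
                } 0 .. $#{ $cols[0] }
            ];
        }

        # One individual per call; undef once they are exhausted.
        return shift @{ $self->{rows} };
    }

    1;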

thanks in advance
Rich

Replies are listed 'Best First'.
Re: reading columns from a flat file
by BrowserUk (Patriarch) on Mar 19, 2004 at 10:50 UTC

    You could write a short Perl script (or even a one-liner) that reads the inverted input file and re-inverts it to stdout. You can then use this to convert the file and read its output directly using a piped open (see perlopentut).

    The following example processes a simple whitespace-delimited file and inverts it. You'll need to adapt flip.pl to your formats. Also, it assumes that every line has the same number of columns as the first.

    P:\test>type test.dat
    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5
    1 2 3 4 5

    P:\test>type flip.pl
    #! perl -slw
    use strict;

    my @cols;
    push @cols, [ split ] while $_ = <>;

    for my $n ( 0 .. $#{ $cols[ 0 ] } ) {
        print join ' ', map{ $_->[ $n ] } @cols;
    }

    P:\test>perl -le " open F, qq[ flip test.dat |]; print join '|', split while $_ = <F>; "
    1|1|1|1|1|1|1|1
    2|2|2|2|2|2|2|2
    3|3|3|3|3|3|3|3
    4|4|4|4|4|4|4|4
    5|5|5|5|5|5|5|5
    Note: The one-liner uses Win32 quoting.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Yes, if I'm reading this correctly then this is performing the pivot, which is going to have to be done either:

      on the whole input file each time a new individual is requested,

      or once, the first time the $obj->get_next method is called, with the result stored in an object variable for subsequent calls for the next ID.

      The question is which, though? The size could be large, and I guess this will be the decider. I was wondering whether there might be another way.

        The idea was that by moving the inversion into a separate process, your main program can just read the 'correct' format from a file handle (as shown) and the memory consumption wouldn't be a burden on your main process. Nor would you need to change the main program, except to use the special form of open for the errant file.
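
        As a rough illustration of that special form of open (not BrowserUk's exact code; the perl invocation, the flip.pl and test.dat names, and the whitespace-delimited format are carried over from the example above as assumptions):

        use strict;
        use warnings;

        # The child process does the pivot; this process just sees ordinary rows.
        open my $fh, '-|', 'perl flip.pl test.dat'
            or die "Cannot start flip.pl: $!";

        while ( my $line = <$fh> ) {
            my @fields = split ' ', $line;    # one individual per line, as usual
            # ... hand @fields to the existing per-row handling ...
        }

        close $fh or warn "flip.pl exited with status $?";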

        How big is big?

        Other than re-reading and re-splitting every line once for every column in the file, there isn't another way.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: reading columns from a flat file
by pelagic (Priest) on Mar 19, 2004 at 10:28 UTC
    The "nicest" solution is strongly dependant of the number of individuals you expect. If you deal with a couple of hundred thousand individuals you should be careful with memory and/or runtime.
    So: how many data do you expect?
    pelagic

    -------------------------------------
    I can resist anything but temptation.
      Well, yes, this is the issue. It is totally dependent on the user of the module.

      It could well be a couple of hundred thousand IDs, or it could be 5.

Re: reading columns from a flat file
by Somni (Friar) on Mar 20, 2004 at 10:24 UTC

    As far as I can see there's really no middle ground between reading the entire file into memory and scanning the file every time a new row is requested.

    If you read the file into memory you could destroy the data structure as rows are fetched, so you slowly recover your memory.

    If you're scanning down the file you could remember the seek positions of the end of each column for each line in the file. Seeking instead of reading and discarding lines may be a bit faster, but then, the extra complexity may simply not be worth it.

    I think your best bet is a heuristic; if the file is larger than some threshold, use the slow scan; if it's smaller, read it into memory.
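
    A rough sketch of that heuristic, assuming a whitespace-delimited file; the threshold value, the next_col/file attributes, and the helper name are made up for illustration:

    use strict;
    use warnings;

    my $THRESHOLD = 10 * 1024 * 1024;    # e.g. 10 MB; tune to taste

    sub get_next {
        my ($self) = @_;

        if ( -s $self->{file} <= $THRESHOLD ) {
            # Small file: pivot once into memory, then shift rows off so the
            # structure (and the memory) shrinks as individuals are consumed.
            $self->{rows} ||= _pivot_whole_file( $self->{file} );
            return shift @{ $self->{rows} };
        }

        # Large file: rescan the whole file, pulling out only one column per
        # call.  Slower (one full pass per individual) but bounded memory.
        my $col = $self->{next_col} || 0;
        $self->{next_col} = $col + 1;
        open my $fh, '<', $self->{file} or die "open $self->{file}: $!";
        my @fields;
        while ( my $line = <$fh> ) {
            my @row = split ' ', $line;
            push @fields, $row[$col] if $col <= $#row;
        }
        close $fh;
        return @fields ? \@fields : undef;    # undef once the columns run out
    }

    sub _pivot_whole_file {
        my ($file) = @_;
        open my $fh, '<', $file or die "open $file: $!";
        my @cols;
        while ( my $line = <$fh> ) {
            push @cols, [ split ' ', $line ];
        }
        close $fh;
        return [ map { my $n = $_; [ map { $_->[$n] } @cols ] } 0 .. $#{ $cols[0] } ];
    }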