bigtiny has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I've been searching around and having trouble finding the info I need to understand this. So, if your answer to this is to point me to the relevant doc, that's fine!

I'm writing a script that will use a large external data file. In some cases, I'll just need to grab a record out of the file -- no problemo. However sometimes I may need to read the whole file. We're talking over a million records here.

I know from past experience that trying to suck something like this up into a big data structure can be hazardous. I'm wondering if reading it line by line with while (<INFILE>) is any better?
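(For reference, the line-at-a-time idiom looks like this — a minimal sketch with an invented sample file; only one record is ever held in memory, so memory use stays flat no matter how big the file is:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a small sample file (stand-in for the real million-record file).
my $file = 'records.txt';
open my $out, '>', $file or die "open: $!";
print $out "record $_\n" for 1 .. 5;
close $out;

# Reading line by line keeps only the current record in memory,
# unlike slurping the whole file into an array or hash.
my $count = 0;
open my $in, '<', $file or die "open: $!";
while ( my $record = <$in> ) {
    chomp $record;
    $count++;    # process $record here
}
close $in;
print "read $count records\n";    # prints: read 5 records
```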

Are there other techniques I should use to traverse a large file like this and which might offer methods to move forward, back, go to beginning, etc.?

Any advice or pointers would be appreciated

Replies are listed 'Best First'.
Re: help reading from large file needed
by CountZero (Bishop) on Oct 12, 2010 at 20:51 UTC
    That large a file and the necessity to be able to get at individual records is just crying for a database solution. SQLite springs to mind in that respect.
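    (A minimal sketch of that idea, assuming the DBI and DBD::SQLite modules are installed; the database file, table, and column names here are invented for illustration:)

```perl
use strict;
use warnings;
use DBI;

unlink 'records.db';    # start fresh for this demo

# One-time load: import the records into an indexed SQLite table.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
    { RaiseError => 1, AutoCommit => 0 } );

$dbh->do('CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)');

my $ins = $dbh->prepare('INSERT INTO records (id, payload) VALUES (?, ?)');
$ins->execute( $_, "record $_" ) for 1 .. 5;   # real code would loop over the data file
$dbh->commit;

# Random access by key -- no need to scan the file at all.
my ($payload) = $dbh->selectrow_array(
    'SELECT payload FROM records WHERE id = ?', undef, 3 );
print "$payload\n";    # prints: record 3
$dbh->disconnect;
```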

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: help reading from large file needed
by BrowserUk (Patriarch) on Oct 12, 2010 at 21:29 UTC
    Are there other techniques I should use to traverse a large file like this and which might offer methods to move forward, back, go to beginning, etc.?

    See seek. It works best if the records are fixed length. If they are not then creating an index that maps record number to file position is very simple and makes for quite fast access.

    I have a file that is a 3.6GB and contains 40e6 records. I index it like this:

perl -e"BEGIN{binmode STDOUT}" -ne"print pack'Q',tell STDIN" <syssort >syssort.idx

    Which takes just a couple of minutes to run. I can then randomly access the records in that file using:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    our $N //= 1000;

    open IDX, '+<:raw', 'syssort.idx' or die $!;
    open DAT, '+<:raw', 'syssort'     or die $!;

    my $start = time;
    for ( 1 .. $N ) {
        my $recnum = int rand 40e6;
        seek IDX, $recnum * 8, 0;
        my $idx;
        read IDX, $idx, 8;
        my $pos = unpack 'Q', $idx;
        seek DAT, $pos, 0;
        chomp( my $record = <DAT> );
    #    printf "Record %d: '%s'\n", $recnum, $record;
    }
    my $elapsed = time - $start;
    printf "$N random records read in %.3f seconds (%6f/s)\n",
        $elapsed, $elapsed / $N;
    __END__
    c:\test>syssort-idx -N=1e4
    1e4 random records read in 2.223 seconds (0.000222/s)

    c:\test>syssort-idx -N=1e5
    1e5 random records read in 21.332 seconds (0.000213/s)

    c:\test>syssort-idx -N=1e3
    1e3 random records read in 0.218 seconds (0.000218/s)

    c:\test>syssort-idx -N=1e3
    1e3 random records read in 0.226 seconds (0.000226/s)

    At 0.2 milliseconds per record, it is fast enough for most purposes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: help reading from large file needed
by kcott (Archbishop) on Oct 12, 2010 at 21:11 UTC

    The seek function allows you to move forward and backward (and there's a complementary tell function which lets you know where you are in the file). However, unless you know exactly where you want to go, that's not necessarily going to be of much use. If you have fixed-length records and know exactly which record you're after, that may be a solution.
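    (A small sketch of the fixed-length case — the file name and 16-byte record length are invented for illustration. Because every record is exactly RECLEN bytes, record N starts at byte N * RECLEN and seek can jump straight there:)

```perl
use strict;
use warnings;

use constant RECLEN => 16;    # illustrative fixed record length, newline included

# Build a sample file of ten 16-byte records.
my $file = 'fixed.dat';
open my $out, '>:raw', $file or die "open: $!";
printf $out "%-15s\n", "record $_" for 0 .. 9;    # pad each record to 16 bytes
close $out;

# Fetch record N directly: no index file needed for fixed-length records.
sub fetch_record {
    my ( $fh, $recnum ) = @_;
    seek $fh, $recnum * RECLEN, 0 or die "seek: $!";
    read $fh, my $rec, RECLEN;
    $rec =~ s/\s+$//;    # strip padding and newline
    return $rec;
}

open my $in, '<:raw', $file or die "open: $!";
print fetch_record( $in, 7 ), "\n";    # prints: record 7
close $in;
```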

    As already stated, a DB solution may be better.

    -- Ken

Re: help reading from large file needed
by roboticus (Chancellor) on Oct 12, 2010 at 21:22 UTC

    bigtiny:

    I like the Count's suggestion to use a database, but if you want to stick with variable-length records, you might also consider a variation on kcott's suggestion: if you have a key field, scan the file once, recording each key and the file position of its record in a hash. You can then look a key up in the hash and seek straight to its record.
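    (A rough sketch of that approach, assuming tab-separated "key<TAB>data" records — the file name and format are invented for illustration. The hash costs one full pass and some memory for the keys, but every lookup after that is a single seek:)

```perl
use strict;
use warnings;

# Sample data file: one "key<TAB>payload" record per line.
my $file = 'keyed.txt';
open my $out, '>', $file or die "open: $!";
print $out "k$_\tpayload $_\n" for 1 .. 5;
close $out;

# One pass to build the in-memory index: key => byte offset of its record.
my %pos;
open my $in, '<', $file or die "open: $!";
{
    my $offset = tell $in;               # position BEFORE reading each line
    while ( my $line = <$in> ) {
        my ($key) = split /\t/, $line, 2;
        $pos{$key} = $offset;
        $offset = tell $in;
    }
}

# Later: jump straight to any record by key.
sub fetch {
    my ($key) = @_;
    return unless exists $pos{$key};
    seek $in, $pos{$key}, 0 or die "seek: $!";
    my $line = <$in>;
    chomp $line;
    return $line;
}

print fetch('k4'), "\n";    # the k4 record, fetched without scanning
```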

    ...roboticus

Re: help reading from large file needed
by dasgar (Priest) on Oct 12, 2010 at 22:59 UTC

    Not advocating that this is any better than previous suggestions, but you can check out Tie::File, which ties a file to an array where each element corresponds to a line in the file.
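    (A minimal Tie::File sketch, with an invented sample file. Tie::File is in the Perl core and fetches lines lazily rather than loading the file, though it is not always fast on very large files:)

```perl
use strict;
use warnings;
use Tie::File;

# Build a small sample file.
my $file = 'lines.txt';
open my $out, '>', $file or die "open: $!";
print $out "line $_\n" for 0 .. 9;
close $out;

# Tie the file to an array: element N is line N, fetched on demand,
# so the whole file is never held in memory at once.
tie my @lines, 'Tie::File', $file or die "tie: $!";

print scalar @lines, " lines\n";    # prints: 10 lines
print "$lines[7]\n";                # prints: line 7
print "$lines[-1]\n";               # prints: line 9

untie @lines;
```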

Re: help reading from large file needed
by sundialsvc4 (Abbot) on Oct 13, 2010 at 01:36 UTC

    I'll echo CountZero's comment about SQLite. That tool truly is a "game changer" when it comes to doing the things for which we require "flat files." If you have not yet looked at it very closely, you owe it to yourself (and to your projects) to do so ASAP.