in reply to Help performing "random" access on very very large file

I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file... uhm, well, I have to acknowledge being misinformed here... thanks ikegami and blue_cowdawg.

I'd indeed go with 4): BerkeleyDB (DB_BTREE), with the line number as key and the byte offset as value. That would allow you to do delta seeks back and forth. Splitting the one big file into a reasonable number of smaller chunks (depending on the number of file descriptors available) might also help, as rpanman already noted.
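Here's a rough sketch of that idea using DB_File, the Perl tie interface to Berkeley DB; the file names, the get_line() helper and the numeric compare setting are assumptions for illustration, not anything from this thread:

    use strict;
    use warnings;
    use Fcntl qw(O_RDWR O_CREAT SEEK_SET);
    use DB_File;

    # Hypothetical file names for this sketch.
    my $data_file  = 'huge.txt';
    my $index_file = 'huge.idx';

    # Compare keys numerically so line numbers stay in numeric order
    # in the B-tree (the default comparison is lexical).
    $DB_BTREE->{'compare'} = sub { $_[0] <=> $_[1] };

    # key = line number, value = byte offset of the start of that line
    tie my %offset_of, 'DB_File', $index_file, O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot tie $index_file: $!";

    open my $fh, '<', $data_file or die "Cannot open $data_file: $!";

    # One sequential pass over the big file to build the index.
    my $line_no = 0;
    my $pos     = 0;
    while (<$fh>) {
        $offset_of{ ++$line_no } = $pos;
        $pos = tell $fh;
    }

    # Afterwards, any line is one B-tree lookup plus one seek away.
    sub get_line {
        my ($n) = @_;
        defined( my $off = $offset_of{$n} ) or return;
        seek $fh, $off, SEEK_SET or die "seek failed: $!";
        return scalar <$fh>;
    }

    my $line = get_line(1_000_000);
    print $line if defined $line;

Building the index costs one sequential pass over the 500GB file, but after that any "give me line n, then n+100, then n-3" access pattern is just lookups and seeks, and the index lives on disk instead of in RAM.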

update: obvious correction

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                              /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Re^2: Help performing "random" access on very very large file
by blue_cowdawg (Monsignor) on Jul 16, 2007 at 15:07 UTC
        I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file.

    Uhhhhmmmmm... no it don't! From the Tie::File man page:

        The file is not loaded into memory, so this will work even for gigantic files.
    I even remember looking at the source code for this module once to validate that statement. I used to think that the module loaded the whole file into memory but it doesn't.

    However, there is a lot of overhead associated with Tie::File for certain operations. For instance, I once had some code that looked something like:

    # considering @ry was the tied array:
    my $last_line = $#ry + 1;
    and found that if I visited that line multiple times it slowed my code way down. Just FWIW...
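
    In case it helps, here's a small self-contained sketch of that pattern (the file name is a placeholder); the idea is simply to evaluate $#ry once and keep the result in an ordinary scalar instead of re-evaluating it every time the count is needed:

        use strict;
        use warnings;
        use Fcntl 'O_RDONLY';
        use Tie::File;

        # 'somefile.txt' is a placeholder name for this sketch.
        tie my @ry, 'Tie::File', 'somefile.txt', mode => O_RDONLY
            or die "Cannot tie somefile.txt: $!";

        # Evaluate $#ry once, up front, and reuse the plain scalar.
        my $last_line = $#ry + 1;
        print "line $_ of $last_line: $ry[$_ - 1]\n" for 1 .. 3;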


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re^2: Help performing "random" access on very very large file
by ikegami (Patriarch) on Jul 16, 2007 at 15:05 UTC

    I guess you don't have enough memory for using Tie::File on a 500GB file, since it slurps the entire file.

    If you're only reading from the file (as seems to be the case here), Tie::File doesn't slurp the file. (I don't know how it handles writes.) It does keep a cache of lines in memory, but the size of that cache is configurable.

    What it does do is keep in memory the byte offset of every line up to the last one accessed. scalar(@tied) and $#tied count as accessing the last line. So if you do something like rand(@tied), Tie::File will read through the entire file to build an in-memory index of every line in it, and for a 500GB file that index could easily run to many GBs.
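
    To make that concrete, a minimal read-only sketch (the file name and the 20MB cache size are just assumptions for illustration):

        use strict;
        use warnings;
        use Fcntl 'O_RDONLY';
        use Tie::File;

        # 'memory' caps the cache of line *contents*, in bytes.
        tie my @lines, 'Tie::File', 'huge.txt',
            mode   => O_RDONLY,
            memory => 20_000_000
            or die "Cannot tie huge.txt: $!";

        # Reasonably cheap: only the offsets of lines 0..999 get recorded.
        print $lines[999], "\n";

        # Expensive on a 500GB file: $#lines counts as touching the last
        # line, so Tie::File must scan the whole file and remember an
        # offset for every single line.
        my $count = $#lines + 1;

    Note that the memory option only bounds the cache of line contents; the per-line offset index is what grows with every line touched, which is exactly why counting the lines of a 500GB file through the tie is the painful part.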