perchance has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
My problem is as follows:
I'm manipulating large tab-delimited files and was looking for a way to make my code more efficient. What I wanted to know is: is there a module, or some built-in functionality, that would allow me to read a large chunk of data from a file (say several megs) and then read line by line from memory? I know I can implement this myself, but I wanted to know if there is something already written that may be more efficient or more careful than my own code. I tried finding answers on CPAN and here, but nothing came up.

Any ideas?

--- Find the River


Re: reading (caching?) large files
by davorg (Chancellor) on Jun 05, 2001 at 13:21 UTC

    If you set $/ to a reference to an integer, then the next time you read from a filehandle with <FILE> it will read that number of bytes from the file (or whatever is left of it at the end).

    {
        # Always localise changes to Perl's special variables in a block
        local $/ = \1024;

        while (<FILE>) {
            # $_ contains the next 1024 bytes from FILE
        }
    }
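
    One caveat to keep in mind: a fixed-size record read will usually stop in the middle of a line, so if you still want to work line by line you need to carry the partial line over into the next chunk. A minimal sketch of that, assuming a tab-delimited file called data.tab (the name and chunk size are placeholders):

        my $chunk_size = 1024 * 1024;     # read roughly 1 MB at a time
        open my $fh, '<', 'data.tab' or die "Can't open data.tab: $!";

        my $leftover = '';
        {
            local $/ = \$chunk_size;
            while (my $chunk = <$fh>) {
                $chunk    = $leftover . $chunk;
                my @lines = split /\n/, $chunk, -1;
                $leftover = pop @lines;            # possibly incomplete last line
                for my $line (@lines) {
                    my @fields = split /\t/, $line;
                    # ... work with @fields here ...
                }
            }
        }
        # anything left after the final read is the last line of the file
        if (length $leftover) {
            my @fields = split /\t/, $leftover;
            # ... handle the final line ...
        }
        close $fh;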
    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

Re: reading (caching?) large files
by jeroenes (Priest) on Jun 05, 2001 at 13:34 UTC
    Be aware that your OS already implements a cache. The cache that you are making may interfere with the OS's.

    I work with fairly large files myself (among them tab-delimited ones), and I first try to read them into memory all at once. Sometimes that isn't possible, and then I try to minimise the data with a row-read-write approach. As long as you use the proper functions, the OS will take care of the caching.

    In cases where I need a large amount of data available for lookup/search/sort, I use BerkeleyDB. Very nice performance.
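
    For what it's worth, a minimal sketch of what the tied interface to BerkeleyDB looks like (using the BerkeleyDB module from CPAN; DB_File offers much the same thing). The file name and keys here are made up for the example:

        use strict;
        use warnings;
        use BerkeleyDB;

        # Tie a hash to an on-disk Berkeley DB file: reads and writes go
        # through the database, so the data never has to fit in memory.
        tie my %lookup, 'BerkeleyDB::Hash',
            -Filename => 'lookup.db',
            -Flags    => DB_CREATE
            or die "Cannot open lookup.db: $! $BerkeleyDB::Error\n";

        $lookup{'row42'} = "some\tsaved\trecord";      # store
        print $lookup{'row42'}, "\n"
            if exists $lookup{'row42'};                # fetch it back

        untie %lookup;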

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)

      Sorry if I'm not following, but:

      1. The filesystem caches into its swap space whenever it reads something too large into memory. Do you mean it also reads ahead when Perl opens a handle, or after it has read a certain amount, so that it saves time?

      2. No time to use anything like BerkeleyDB right now, but I'll remember it for the future; it sounds useful.

      3. What exactly do you mean by row-read-write? Regular line by line? How is that helpful?

      10x again,
      me

      --- Find the River


        I was too brief, apparently.

        1. The filesystem caches pages, not files. So while perl is reading line by line, it often reads from the same page, and each time that page comes straight from the cache. That works quite efficiently for sequential reads.
        2. The DB is quite easy to use; it has a tied interface (i.e. you can treat it just like a hash).
        3. Indeed, regular line by line. That way you can shrink the data as you go, to reduce memory usage. For example, you can remove double spaces, drop unneeded data, write numbers as bytes, etc., without having to store everything in memory.
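
        A small sketch of what such a reducing pass might look like; the file names and the choice of columns are invented for the example:

            use strict;
            use warnings;

            open my $in,  '<', 'input.tab'   or die "Can't read input.tab: $!";
            open my $out, '>', 'reduced.tab' or die "Can't write reduced.tab: $!";

            while (my $line = <$in>) {
                chomp $line;
                my @fields = split /\t/, $line;

                @fields = @fields[0 .. 2];    # keep only the columns we need
                tr/ //s for @fields;          # squeeze runs of spaces in each field

                print {$out} join("\t", @fields), "\n";
            }

            close $in;
            close $out or die "Error writing reduced.tab: $!";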

        Jeroen
        "We are not alone"(FZ)

Re: reading (caching?) large files
by Vynce (Friar) on Jun 05, 2001 at 13:14 UTC

    does read help? i don't know of any module to do it for you, but that and

    ($line, $buf) = split "\n", $buf, 2;
    seem likely to do the trick.

    i actually think that kind of optimization is already done for you, but i may be mistaken.
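
    In case it's useful, here is a fuller sketch of that approach: read() appends to a buffer, and the limit-2 split peels complete lines off the front, carrying any partial line into the next read. The file name and chunk size are just placeholders:

        use strict;
        use warnings;

        my $chunk_size = 1024 * 1024;                # read ~1 MB per call
        open my $fh, '<', 'data.tab' or die "Can't open data.tab: $!";

        my $buf = '';
        while (read $fh, $buf, $chunk_size, length $buf) {    # append to the buffer
            while ($buf =~ /\n/) {
                my $line;
                ($line, $buf) = split /\n/, $buf, 2;           # peel one line off the front
                # ... process $line here ...
            }
        }
        # whatever remains in $buf is a final line with no trailing newline
        close $fh;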