suaveant has asked for the wisdom of the Perl Monks concerning the following question:

I have a project where I am streamlining a report generation system to put out lots of reports in a short amount of time. I prepare the input for each customer, combined into one big file sorted by key, and when the data is ready I process that against the data file. The problem is that I need to write the output to many files, and may have thousands of them to deal with at a time.

Currently I buffer up about 5000 lines of output in a hash, then write them out in append mode to however many files I have accumulated. I doubt this is the best way to do it. Originally I had thought to keep all the filehandles open in an array, but I believe I ran into OS limits that way.
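In rough outline, what I'm doing now looks something like this (just a sketch; the tab-split key and the reports/ paths stand in for the real report logic):

    use strict;
    use warnings;

    my %pending;           # output file => buffered lines for that file
    my $buffered = 0;

    while ( my $line = <STDIN> ) {
        my ($key) = split /\t/, $line;              # placeholder: key decides the output file
        push @{ $pending{"reports/$key.txt"} }, $line;
        flush_pending() if ++$buffered >= 5000;     # dump the buffer every ~5000 lines
    }
    flush_pending();

    sub flush_pending {
        for my $file ( keys %pending ) {
            open my $out, '>>', $file or die "$file: $!";   # append to whatever is already there
            print $out @{ $pending{$file} };
            close $out or die "$file: $!";
        }
        %pending  = ();
        $buffered = 0;
    }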

Is there a better way to do something like this quickly?

We've recently switched from Solaris to quad dual-core Xeons running Linux with 16 GB of RAM... it is OK if I use up to probably 75% of the system (maybe even a bit higher) for the biggest jobs.

                - Ant
                - Some of my best work - (1 2 3)

Replies are listed 'Best First'.
Re: Writing to many (>1000) files at once
by CountZero (Bishop) on Aug 14, 2006 at 22:10 UTC
    A possible alternative would be to use a database: rather than sending all output directly to its intended file, "save" all items to a database, keyed on the intended output file. You should probably also save a sequence number so the lines come out in the right order.

    Once you have processed all the data, start outputting from the database, ordering by "intended output file" and sequence number. At each change of "intended output file", close the previous file and open a new one.

    When all is done, "clean" the database so it is ready for the next run.
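    A rough sketch of that approach, assuming DBD::SQLite purely as an example backend and reading the prepared data on STDIN; the table, column, and file names are invented:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=spool.db', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );

        $dbh->do('CREATE TABLE IF NOT EXISTS spool (outfile TEXT, seq INTEGER, line TEXT)');

        # generation phase: spool every output line, tagged with its target file and a sequence number
        my $ins = $dbh->prepare('INSERT INTO spool (outfile, seq, line) VALUES (?, ?, ?)');
        my $seq = 0;
        while ( my $line = <STDIN> ) {
            my ($key) = split /\t/, $line;                   # invented: key names the output file
            $ins->execute( "reports/$key.txt", $seq++, $line );
        }
        $dbh->commit;

        # output phase: read everything back ordered by file, switching files as the name changes
        my $sel = $dbh->prepare('SELECT outfile, line FROM spool ORDER BY outfile, seq');
        $sel->execute;
        my ( $current, $out );
        while ( my ( $file, $line ) = $sel->fetchrow_array ) {
            if ( !defined $current or $file ne $current ) {
                close $out if $out;
                open $out, '>', $file or die "$file: $!";
                $current = $file;
            }
            print $out $line;
        }
        close $out if $out;

        $dbh->do('DELETE FROM spool');                       # "clean" the database for the next run
        $dbh->commit;
        $dbh->disconnect;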

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Writing to many (>1000) files at once
by Fletch (Bishop) on Aug 14, 2006 at 23:25 UTC

    There's the little-known but core FileCache module, which will handle closing and reopening descriptors for you to keep you under the per-process handle limit.
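    Something along the lines of the module's own synopsis (the tab-delimited key and the reports/ paths are just assumptions about the data):

        use strict;
        use warnings;
        use FileCache maxopen => 200;    # keep at most ~200 descriptors open at a time
        no strict 'refs';                # cacheout's handles are named after the path string

        while ( my $line = <STDIN> ) {
            my ($key) = split /\t/, $line;    # assumption: the first field picks the output file
            my $path  = "reports/$key.txt";
            cacheout $path;                   # opened '>' on first use, '>>' when reopened later
            print $path $line;
        }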

    And yes, some versions of Solaris had a pitifully low file descriptor limit if you were using the C library's stdio functions (or anything sitting on top of them, rather than system calls such as read(2) and friends). I think someone basically declared the descriptor field as a char rather than an int, so even if you upped the ulimits you'd get bitten at around 252 descriptors. That said, I think this was aeons ago in the Solaris 2.4-2.6 era and was fixed somewhere around 2.7-2.8; Solaris 9 didn't have the problem at all, as far as I recall.

    Update: clarified bit about what had problems.

Re: Writing to many (>1000) files at once
by BrowserUk (Patriarch) on Aug 15, 2006 at 01:56 UTC

    There's probably stuff I'm missing here, but your numbers don't add up.

    5000 lines * 132 chars * 1000 files ≈ 630 MB. Your 16 GB should give you headroom for something like 20,000 reports of 5000 lines; or 1000 files of 100,000 lines; or some other combination in between. That assumes there is no scope for 'sharing' lines between files whilst in memory.

    Perhaps the problem is that you are storing each line as a separate keyed value in a hash and that structure is consuming extra memory?

    If your perl is built to use PerlIO then you should, I think, be able to open as many RAM files as you want. As the lines in a RAM file are effectively concatenated into a single scalar, they carry less overhead than storing them in an array or hash.

    You could then spew (the opposite of slurp) them out to the filesystem one file at a time at the end. That should be more filesystem-cache friendly, and less tiresome to code, than juggling thousands of files through 250 filehandles.

    It would make the generation phase very fast. Even the writing should be comparatively quick, as you would only be asking the filesystem to allocate each file's final space once, rather than constantly reallocating the sizes of many files in rotation.
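    A minimal sketch of that, assuming a PerlIO-enabled perl; the key field and the reports/ paths are made up:

        use strict;
        use warnings;

        my ( %content, %fh );    # per-report scalar buffer and its in-memory filehandle

        while ( my $line = <STDIN> ) {
            my ($key) = split /\t/, $line;               # made up: first field names the report
            unless ( $fh{$key} ) {
                open $fh{$key}, '>', \$content{$key}     # PerlIO "RAM file" backed by a scalar
                    or die "in-memory open for $key: $!";
            }
            print { $fh{$key} } $line;
        }

        # spew each completed report out to the filesystem in one shot
        for my $key ( keys %content ) {
            close $fh{$key};
            open my $out, '>', "reports/$key.txt" or die "reports/$key.txt: $!";
            print $out $content{$key};
            close $out or die "reports/$key.txt: $!";
        }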


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I agree: if you can do it in memory, then do it in memory. Also, the standard Perl I/O mechanisms are surprisingly slow for this kind of job. I would recommend File::Slurp (it supports both slurping and spewing). I used it once in a program that had to modify roughly 1000 files. Compared to standard slurping like:
      { local $/ = undef; my $wholefile = <$FH>; }
      it was 15 times faster (it now takes roughly 3 minutes, while with standard Perl mechanisms like the one above it took 45 minutes). It might even be that both the input and output data fit into main memory - if they do, and if that memory consumption isn't a problem, then simply do it; it's much faster.
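      For reference, a minimal slurp/spew sketch with File::Slurp (the filenames are invented):

          use strict;
          use warnings;
          use File::Slurp qw(read_file write_file);

          my $whole = read_file('input.dat');            # slurp the whole file in one call
          # ... transform $whole as needed ...
          write_file( 'reports/ACME.txt', \$whole );     # spew it back out in one call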
Re: Writing to many (>1000) files at once
by GrandFather (Saint) on Aug 14, 2006 at 21:53 UTC

    How big are the individual files? Are the files unique? Do the files all reside on the same media?

    Multiple processors probably aren't going to help a huge amount, because getting out to the disk drives is likely the bottleneck. There's not much point having more file handles open than the system can physically write to simultaneously.


    DWIM is Perl's answer to Gödel
      It is financial data... each file is unique and even lines that have the same key may pull different items from the record, though the output is all based on the same thing.

      Files can be anything from a few bytes to a couple megs, really... all depends on how many securities they ask for.

                      - Ant
                      - Some of my best work - (1 2 3)

        It doesn't sound like trying to "write" more than one file at a time gains anything from an I/O efficiency point of view, then. You may get a gain in code organisation, but there are probably other ways to achieve that. Can you sketch the code structure and the structure of the files on disk? (Not file contents, just how the directories hang together and such.)

        How do the files get out to your customers? Is that the potential bottleneck?


        DWIM is Perl's answer to Gödel
Re: Writing to many (>1000) files at once
by CountZero (Bishop) on Aug 15, 2006 at 08:14 UTC
    Looking at it from another side: is there a way you can sort your customer file on the "customer" field? Assuming that each customer gets his/her own output file, you could then go through the customer file sequentially and put together the output files, so that you only ever have one file open for output at a time. Depending on how your files are organized, it might mean that you have to go through the data file too many times, and that this becomes the real bottleneck.
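    In code, that one-file-open-at-a-time pattern might look something like this (assuming the records are already sorted on a leading customer field; the field layout and paths are invented):

        use strict;
        use warnings;

        my ( $current, $out );
        while ( my $line = <STDIN> ) {                     # input pre-sorted by customer
            my ($customer) = split /\t/, $line;            # assumption: first field is the customer
            if ( !defined $current or $customer ne $current ) {
                close $out if $out;
                open $out, '>', "reports/$customer.txt"
                    or die "reports/$customer.txt: $!";
                $current = $customer;
            }
            print $out $line;
        }
        close $out if $out;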

    I still think a database solution would work best.

    BTW: what is the source of all this data? If it is already in a database, can't you run some queries to get at the data directly?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Writing to many (>1000) files at once
by ysth (Canon) on Aug 14, 2006 at 23:01 UTC
    I think Solaris has (had?) a much lower limit on open filehandles than Linux does. Try it again, keeping the filehandles in an array.

      Solaris does have lower limits by default. Look at rlim_fd_max and the ulimit command.

      For Linux, look at file-max and the limit command.

Re: Writing to many (>1000) files at once
by zentara (Cardinal) on Aug 15, 2006 at 12:40 UTC

    Perhaps Tie::File could be of use here?

      Tie::File still requires a filehandle, and he is limited to 250.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Writing to many (>1000) files at once
by furry_marmot (Pilgrim) on Aug 16, 2006 at 13:41 UTC
    Can you describe your constraints a little more? Is the issue to get the data out as fast as possible and to all targets at once? Or is it more about updating a set of data as a whole?

    Anyway, turning the problem around, I was wondering if it would work to keep all your report files in or under a directory, say /reports. Then duplicate it in, say /reports_updating, where you can take your time opening and writing/appending files. When ready, swap directory names. Instant update as far as external processes are concerned.
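    A sketch of the swap step, using relative directory names for the example and File::Path's rmtree for the cleanup:

        use strict;
        use warnings;
        use File::Path qw(rmtree);

        # build the new generation of reports under reports_updating while
        # reports stays live, then swap with two quick renames
        rename 'reports',          'reports_old' or die "rename reports: $!";
        rename 'reports_updating', 'reports'     or die "rename reports_updating: $!";
        rmtree('reports_old');                   # discard the previous generation

    There is a brief moment between the two renames when reports does not exist, but that window is tiny.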

    --marmot

      Yes, trying to get the data out quickly... actually, I think what we have will be fast enough; I was just curious whether there was some better way to handle the multiple files than what I was doing. As it is, I generate the files I need in about a minute.

      The directory thing's not really a problem since the program triggers events that a monitor handles to tell the next step to start.

                      - Ant
                      - Some of my best work - (1 2 3)