Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I have an application that splits a large data file into smaller files with a few records each and then saves the files to disk.

For the most part it works well, but in one instance I am parsing a file with 5000+ records into smaller files of 12 records each. It is taking about 20 seconds to save all 400+ files. My development machine is a little slow, so it will be faster in production, but I still want it to be as fast as possible.

I'm using the basic approach:

foreach my $file (@file) {
    my $filename = $file->{'filename'};
    my $content  = $file->{'content'};
    open(FILE, ">$filename") or die "Can't open $filename $!";
    print FILE $content;
    close(FILE);
}


My question is: is there some magical module out there, defying my understanding of how the OS works, that can save a bunch of files with only one I/O call? Something that stores them all in memory until they are all ready to be written, and then writes them out at once, with one open/print/close call?

If not, is there *anything* at all I could do to make it faster, aside from putting more than 12 records in each file?

Replies are listed 'Best First'.
Re: Quickest way to write multiple files
by saskaqueer (Friar) on Jun 08, 2004 at 08:34 UTC

    Copied from chatterbox: saskaqueer's mind boggles over the newest SoPW. Imagine a "magical module out there" that could "save a bunch of files with only one I/O call". That'd be amazing to see ;)

    That aside, I'd say 20 seconds isn't too bad for creating and manipulating 417 files. If there is room to adjust your code for speed, it is most likely in the code you use to split a chunk of data into 12 records per file. If the bottleneck here is indeed the file I/O for outputting the new files, then that's about the best you're going to get.

    How are you reading in this 5000+ record file? Are you reading it in all at once into memory, or are you looping through it 12 records at a time to split it up? If you're pulling everything into memory at once, that might be your problem there.

      The parsing of the initial datafile is taking an average of 2 seconds. Of course I'll try to shave 1.5 off that, but it's not much. The records are stored in a hashref, then split into separate hashrefs and output as XML, 12 at a time, with XML::Simple.

      The actual splitting of the records is only taking around 0.1 seconds, so there's not much I can do there.

      Thanks anyway. If I happen to stumble on a way to make the magic module I'll be sure to post it here.
Re: Quickest way to write multiple files
by graff (Chancellor) on Jun 08, 2004 at 09:45 UTC
    I would encourage you to question your motives in this task. Why do you think you need to have just 12 records per output file? And what's wrong, really, with having just a single file with 5000+ records in it?

    If you're bothered now by the extra time it takes to create 400+ new files, there's a good chance you'll be bothered again whenever you need to do a global scan of those 400+ files later on (e.g. to search for some particular string).

    In your later reply, you say that reading/parsing the one big file seems to take very little time, and most of the time during your split routine is spent handling the open/write/close on all the little files. This is a normal outcome, which you will also observe when reading data back from all those little files. So, what benefit do you get from the little files that will offset the price you pay in extra run time?

    If you're trying to improve access time to any chosen record in the set by reducing the size of the file that must be read to fetch that record, there are better ways to do this that do not involve writing tons of little files.

    For instance, create an index table of byte offsets for each record within the one big file; if each record is uniquely identified by some sort of "id" field in the xml structure, store that with the record's byte offset. Then to read a record back, just use the "seek()" function to go directly to that record, read to the end of that record, and parse it. That's a simple technique, and it would be hard to come up with a faster access method than that.
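    A bare-bones sketch of that offset-index idea (untested; the filename big.xml is just a stand-in, and it assumes each record begins on its own line as <record id="..."> and ends with a </record> line, so adjust the patterns to your actual XML):

    use strict;
    use warnings;

    # Build an index of byte offsets, one entry per record in the big file.
    my %offset;
    open my $in, '<', 'big.xml' or die "Can't open big.xml $!";
    while (1) {
        my $pos  = tell $in;                 # offset of the line about to be read
        my $line = <$in>;
        last unless defined $line;
        $offset{$1} = $pos if $line =~ /<record id="([^"]+)">/;
    }

    # Later, fetch a single record directly without rereading the whole file.
    sub fetch_record {
        my ($id) = @_;
        seek $in, $offset{$id}, 0 or die "Can't seek $!";
        my $xml = '';
        while (my $line = <$in>) {
            $xml .= $line;
            last if $line =~ m{</record>};   # stop at the end of this record
        }
        return $xml;                         # hand this to XML::Simple, etc.
    }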

Re: Quickest way to write multiple files
by ysth (Canon) on Jun 08, 2004 at 09:34 UTC
    Update: if BrowserUK is correct, I can be safely ignored.

    Opening and closing a file for each record is going to get you lousy performance. If the records are sorted by filename, try (untested):

    my $openfile = '';
    foreach my $file (@file) {
        my $filename = $file->{filename};
        my $content  = $file->{content};
        if ($filename ne $openfile) {
            $openfile = $filename;
            open(FILE, "> $filename") or die "Can't open $filename $!";
        }
        print FILE $content;
    }
    close(FILE) if $openfile;
    Otherwise, sort by filename (untested):
    for my $file (sort {$a->{filename} cmp $b->{filename}} @files) { ... }
    You may get slightly better results saving all the records for a file in an array and then printing them together just before opening the next file (or the close at the end).
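    A rough sketch of that buffering variant (untested; it assumes the same @file structure as the original post):

    my $openname = '';
    my @pending;                          # content buffered for the current file
    my $flush = sub {
        return unless $openname;
        open my $fh, '>', $openname or die "Can't open $openname $!";
        print {$fh} @pending;             # a single print per file
        close $fh;
        @pending = ();
    };
    for my $file (sort { $a->{filename} cmp $b->{filename} } @file) {
        if ($file->{filename} ne $openname) {
            $flush->();
            $openname = $file->{filename};
        }
        push @pending, $file->{content};
    }
    $flush->();                           # don't forget the last file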

    If you had a smaller number of files, I'd recommend a hash of filehandles instead:

    my %fh;
    foreach my $file (@files) {
        my $filename ...
        my $content ...
        if (!$fh{$filename}) {
            open $fh{$filename}, "> $filename" or die ...
        }
        print {$fh{$filename}} $content;
    }
    foreach my $fh (keys %fh) {
        close($fh{$fh});
    }

      As far as I can tell, his snippet was only opening and closing each file once, not once per line, and he appears to be writing the entire contents in one hit.

      With only 12 lines per small file, unless the lines are quite long, accumulating the data and writing it in one hit is unlikely to help, as it's quite possible that stdio would have buffered more than the entire size of each file anyway.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
Re: Quickest way to write multiple files
by BrowserUk (Patriarch) on Jun 08, 2004 at 09:29 UTC

    You might be able to speed up the writing by outputting to a virtual disk rather than a hard drive, but if the files need to persist, then whatever gains (if any) you made would be lost when you copy from the vdrive to the disk.

    If the small files are only transitory, though, it might help.
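    A sketch of that idea (untested; it assumes a Linux box with a RAM-backed tmpfs such as /dev/shm, and the directory name is just an example):

    # Write the transitory files under a RAM-backed directory instead of disk.
    my $vdir = '/dev/shm/splitout';       # adjust the path for your system
    mkdir $vdir unless -d $vdir;

    foreach my $file (@file) {
        my $path = "$vdir/$file->{filename}";
        open my $fh, '>', $path or die "Can't open $path $!";
        print {$fh} $file->{content};
        close $fh;
    }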


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: Quickest way to write multiple files
by bart (Canon) on Jun 09, 2004 at 06:45 UTC
    In general, opening a file can slow down considerably if there are already many files in that directory. Your best shot would be, IMO, to limit that number. In a similar manner, it's probably best to use bare filenames in the current directory, or relative file paths into a subdirectory of the current directory, instead of absolute file paths, so the directory doesn't have to be looked up over and over again for each file.
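    Something along those lines (untested; the subdirectory name here is just an example):

    use Cwd qw(getcwd);

    # chdir into the output directory once, then open bare filenames so the
    # directory path isn't resolved again for every single file.
    my $outdir = 'split_output';
    my $olddir = getcwd();
    mkdir $outdir unless -d $outdir;
    chdir $outdir or die "Can't chdir to $outdir $!";

    foreach my $file (@file) {
        open my $fh, '>', $file->{filename}
            or die "Can't open $file->{filename} $!";
        print {$fh} $file->{content};
        close $fh;
    }

    chdir $olddir or die "Can't chdir back to $olddir $!";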

    HTH.