Performance Trap - Opening/Closing Files Inside a Loop


Syntactic Confectionery Delight
	PerlMonks

Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor)

on Dec 09, 2004 at 23:57 UTC ( [id://413719]=perlquestion: print w/replies, xml )

Need Help??

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
There is a long story behind this that involves a Java programmer asking for some help with Perl. I won't get into the particulars other than to say the question asked was:

What's the easiest way to loop through a comma delimited file and append the line minus 1 column into a new file that is the same name as the excluded column?

The file in question was about 20 lines long. I gave my disclaimer about normally using a module to handle CSV, but the following code should work:

#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
    my @field = split /,/;
    my $file  = splice @field, 2, 1;
    open (OUTPUT, '>>', $file) or die $!;
    print OUTPUT join ',', @field;
}
__DATA__
1,2,foo,3
4,5,bar,6
7,8,foo,9
[download]

I asked the Java programmer the next day how it worked and I was informed that it was too slow and that a Java program was being written instead. Scratching my head, I asked if the same file I was shown before was the one actually being used. It wasn't - multiple files millions of lines long each. Opening and closing file(s) that many times is bound to be slow. I offered the following modification^* of the code provided the column being excluded was fairly repetetive in the file:

#!/usr/bin/perl
use strict;
use warnings;
my %fh;
while ( <DATA> ) {
    my @field = split /,/;
    my $file  = splice @field, 2, 1;
    if ( ! $fh{$file} ) {
        open ($fh{$file}, '>>', $file) or die $!;
    }
    print { $fh{$file} } join ',', @field;
}
__DATA__
1,2,foo,3
4,5,bar,6
7,8,foo,9
[download]

I explained that the reason for the disclaimer was that that the hash only bought performance if a file had more than 1 line getting appended to it. Additionally, if there are too many unique files, memory and/or open file descriptors may cause a problem. I was then told that the Java code was nearly done but thanks anyway. *shrug* - exit stage right.

I think I am missing how Java is going to be that much faster. I assume Java is still going to open and close the file each time through the loop unless there is a similar trick. Given that I don't really know Java I could be out in left field here.

Leaving Java aside, is there more run-time efficient way than my second suggestion in Perl? I haven't given it a lot of thought because the Java developer is just being silly. It is a run one and done script so it would already be finished if the first version (wrapped in a tiny shell script) had been allowed to run. On the other hand, this is the sort of thing that I like to be aware of in the future. (Prior Planning Prevents Poor Performance)^**

Cheers - L~R

* The actual code used ARGV
** I learned this in the military, but there were a couple extra explicitive Ps

Comment on Performance Trap - Opening/Closing Files Inside a Loop Select or Download Code

Replies are listed 'Best First'.

Re: Performance Trap - Opening/Closing Files Inside a Loop
by tachyon (Chancellor) on Dec 10, 2004 at 04:44 UTC

If you have the memory something this will probably be faster. You can save on the if/else for every line as well as only doing the minimum in the loop (like not splicing and joining when we don't really need to). Even a saving of a few microseconds X millions of lines is substantial. Multiple calls to print are significantly slower than a single call as the OS can buffer/write more efficiently.

#!/usr/bin/perl

my ($field,%fh);

while ( <DATA> ) {
    @field = split /,/;
    $fh{$field[2]} .= "$field[0],$field[1],$field[3]";
}

for my $file( keys %fh ) {
    open F, ">$file" or die $!;
    print F $fh{$file};
    close F;
}
[download]

cheers

tachyon

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by tmoertel (Chaplain) on Dec 10, 2004 at 07:04 UTC

L~R: Assuming that you have the RAM, can you compare tachyon's code's run time to the other implementations? My guess is that tachyon's code will fare well. (If you don't have the RAM, just tweak the code so that it will process, say, 100_000 or so lines per pass and clear out %fh between passes. Also, you'll need to open files in append mode.)

Cheers,
Tom

Tom Moertel : Blog / Talks / CPAN / LectroTest / PXSL / Coffee / Movie Rating Decoder

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by tachyon (Chancellor) on Dec 10, 2004 at 08:26 UTC

I agree reducing the number of seeks you need is vital. Given an average 3 msec seek time you can only have 333 seeks per second. This is of course glacial. Ignoring buffering the original code effectively needed 2 seeks (or more) per line, the improved version required at least 1 seek. In the example I presented the number of seeks required is a function of the number of files we need to create, not the number of lines in the input file. This will be a significant improvement provided that the number of unique files is less than the number of input lines.

cheers

tachyon

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 15:38 UTC

Cheers - L~R

Re^4: Performance Trap - Opening/Closing Files Inside a Loop

by Rhandom (Curate) on Dec 10, 2004 at 16:23 UTC

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by Animator (Hermit) on Dec 10, 2004 at 11:57 UTC

for my $file (keys %fh)

while (my ($file, $data) = each %fh)

[reply]
[d/l]
[select]

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by apotheon (Deacon) on Dec 10, 2004 at 12:10 UTC

Doesn't placing a variable declaration in the conditional statement for a conditional loop occasionally lead to strange errors?

print substr("Just another Perl hacker", 0, -2);
- apotheon
CopyWrite Chad Perrin

Re^4: Performance Trap - Opening/Closing Files Inside a Loop

by Animator (Hermit) on Dec 10, 2004 at 12:38 UTC

Re^4: Performance Trap - Opening/Closing Files Inside a Loop

by Ven'Tatsu (Deacon) on Dec 10, 2004 at 16:47 UTC

Re^4: Performance Trap - Opening/Closing Files Inside a Loop

by diotalevi (Canon) on Dec 10, 2004 at 16:42 UTC

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by tachyon (Chancellor) on Dec 11, 2004 at 00:27 UTC

While you are correct that accessing key value pairs with each is a little faster this is unlikely to influence runtime in any measurable way as the output bottleneck lies with the OS and disk IO.

cheers

tachyon

Re: Performance Trap - Opening/Closing Files Inside a Loop
by sgifford (Prior) on Dec 10, 2004 at 00:14 UTC

Unless the Java programmer knows something you don't (like there are only 100 possibilities for filenames), I don't think their performance will beat an implementation that caches filehandles, as you described in your second example. I don't even think a C or Assembly program would be much faster; IMHO the time will be completely dominated by I/O and system calls.

One trick to be able to keep many filehandles open without worrying about consuming more resources than you intend to is the FileCache module.

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 00:27 UTC

like there are only 100 possibilities for filenames

And what they are so they can be opened before entering the loop. This isn't the case.

Cheers - L~R

Re: Performance Trap - Opening/Closing Files Inside a Loop
by runrig (Abbot) on Dec 10, 2004 at 01:15 UTC

Re: Performance Trap - Opening/Closing Files Inside a Loop
by tmoertel (Chaplain) on Dec 10, 2004 at 06:51 UTC

Leaving Java aside, is there a more run-time efficient way than my second suggestion in Perl?

Assuming sufficiently small input size, we can load the entire input into RAM and build an optimal write plan before attempting further I/O. The plan's goal would be to minimize disk seek time, which is likely the dominant run-time factor under our control. An optimal strategy would probably be to open one file at a time, write all of its lines, close it, and then move on to the next file. If input size is larger than RAM, the speediest approach would then be to divide the input into RAM-sized partitions and process each according to its own optimal write plan.

Caching the output filehandles (as in your second implementation) probably will not be competitive. Even if you can hold all of the output files open simultaneously, a write pattern that jumps among files seemingly at random will probably kill you with seek time. Your OS will do its best to combine writes and reduce head movement with elevator (and better) algorithms, but you'll still pay a heavy price. You'll do much better if you can keep the disk head writing instead of seeking.

If it turns out that the number of distinct files to be created is nearly the same as the number of input lines, no strategy is likely to improve performance significantly over the naive strategy of opening and closing files as you walk line by line through the input.

One more thing. If the input that Mr. Java tested your program against was millions of lines long, does that imply that your code may have been creating thousands of files? If so, you might want to determine whether the filesystem you were using has a hashed or tree-based directory implementation. If not, your run time may have been dominated by filesystem overhead. Many filesystems (e.g., ext2/3) bog down once you start getting more than a hundred or so entries in a directory.

Cheers,
Tom

Tom Moertel : Blog / Talks / CPAN / LectroTest / PXSL / Coffee / Movie Rating Decoder

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by EverLast (Scribe) on Dec 10, 2004 at 12:16 UTC

... Many filesystems (e.g., ext2/3) bog down once you start getting more than a hundred or so entries in a directory.

Actually, I have found VFAT to pale in comparison to ext3 and even ext2. ReiserFS should be even better i've heard. YMMV, of course - RAM/processor(s) etc.

Update:

A well-known approach to this 'many files' problem is to create a n-level directory structure based on filenames. File abc goes into a/b/abc, def goes into d/e/def etc. (for n=2). Filenames are then typically randomly generated - and if they're not you can use some transformation to create input for the directory. Reportedly, ReiserFS does this internally.

---Lars

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by dragonchild (Archbishop) on Dec 10, 2004 at 13:52 UTC

http://www.cpan.org/authors/id/R/RK/RKINYON/

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by tachyon (Chancellor) on Dec 10, 2004 at 08:30 UTC

We found that ext3 with the 2.4.x Linux Kernel was reasonably happy with 10,000 files in a directory but obviously unhappy with 1 million. By reasonably happy I mean other bottlenecks dominated affairs. I would be interested if anyone has done a study on the relation of file numbers per dir vs access time for different file systems that it a little more precise.

cheers

tachyon

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by meredith (Friar) on Dec 10, 2004 at 21:31 UTC

Out of curiousity, did you test with the 'dir_index' feature flag set? It allows the filesystem to use hashed b-trees for lookups in large directories.

mhoward - at - hattmoward.org

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by iburrell (Chaplain) on Dec 10, 2004 at 18:03 UTC

The big advantage of caching filehandles is that the open files can hold output in the buffers until they are full. If they are continually being closed and reopened, then each line is being written individually.

What is need is some way to keep a limited number of filehandles opened to keep from hitting the limit. A LRU cache would be perfect. It see a couple of modules that implement this. Or reimplementing it would be pretty easy.

Re: Performance Trap - Opening/Closing Files Inside a Loop
by graff (Chancellor) on Dec 10, 2004 at 03:01 UTC

is there more run-time efficient way than my second suggestion in Perl?

Given the original statement of the problem, with or without the (rather disingenuous) extension to the original problem, I would have suggested that it would help matters noticeably if the input were sorted with respect to the column containing the file name.

The sorting would be really easy to do, either prior to passing the data to perl, or within the perl script (though there might be memory issues doing it in the script, if we're talking about millions of lines instead of dozens). I hope the esteemed java programmer knows about the unix "sort" command (and the fact that it's ported to windows)...

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 03:36 UTC

Cheers - L~R

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by graff (Chancellor) on Dec 10, 2004 at 04:05 UTC

Even so, now we're just talking about a two-stage sort:

## let's suppose the file names are in column 3 of "table.txt":

perl -pe 's/^/$.,/' table.txt | sort -t , -k 4,4 -k 1,1n | cut -f2- -d
+, | splitter.pl
[download]

(update: if the original table has file names in column 3, and a perl script prepends a line number to each line, then the primary sort column has to be 4, not 3.)

Re^4: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 15:43 UTC

Re: Performance Trap - Opening/Closing Files Inside a Loop
by kvale (Monsignor) on Dec 10, 2004 at 00:05 UTC

-Mark

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 00:24 UTC

Cheers - L~R

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by kvale (Monsignor) on Dec 10, 2004 at 02:07 UTC

Once you have established a hashmap from column values to filehandles, then you can print to the desired filehandle. I expect a single hash lookup to be much faster than a pair of system calls for opening and closing files; in addition to the OS bookkeeping and disk IO overhead for opening and closing, each file buffer is flushed (and, depending on the OS and filesystem, the disk is written to) for every line written.

Another completely different method is to append the lines to different strings, one for each column value. Then write them all the strings out to files after the loop.

-Mark

Re^4: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 03:32 UTC

Re: Performance Trap - Opening/Closing Files Inside a Loop
by edoc (Chaplain) on Dec 10, 2004 at 03:21 UTC

woah! way to get distracted! I just realised I've spent way too much time messin with this.. back to work..

Read more... (3 kB)

cheers,

J

Re: Performance Trap - Opening/Closing Files Inside a Loop
by TedPride (Priest) on Dec 10, 2004 at 10:03 UTC

Allowing 10 MB of memory and 100 files, for instance, you could print whenever the buffer for an individual file reached 100K or so.

Re: Performance Trap - Opening/Closing Files Inside a Loop
by xorl (Deacon) on Dec 10, 2004 at 15:08 UTC

Now this is a perfect example of the classic setup. The Java guy did not give you enough information before you began the project. Now the boss is going to be told that perl is old and too slow. You'll be seeing a memo soon saying that all perl programs will need to be converted to Java and that all perl programmers will need to either learn Java or be fired. You better strike now. Have a meeting with the boss. Explain the problem have a working demo of your revised code that shows how much faster it is now that you have enough information. Good Luck.

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by Limbic~Region (Chancellor) on Dec 10, 2004 at 15:53 UTC

Cheers - L~R

Re^3: Performance Trap - Opening/Closing Files Inside a Loop

by xorl (Deacon) on Dec 10, 2004 at 17:57 UTC

The more jobs I have, the more I realize with very few exceptions there is always some kind of office politics motivating most people's actions at work (and sometimes even outside of work). This has proven to be the case in the small companies (3-10 employees) to the large companies (4,000-10,000 employees) and all the sizes in between.

I would still wonder why they'd ask you (the guy with the non-programming title) to solve a "simple" task with an "inferior" language. It could just be the contractor wants to show in his own mind that your company needs him, or it could be something more complex. I'd dig a little deeper and watch out for any rumors.

Again Good Luck.

On the bright side, your question has helped me fix a problem I was having. Thanks!

Re: Performance Trap - Opening/Closing Files Inside a Loop
by runrig (Abbot) on Dec 10, 2004 at 18:23 UTC

#!/usr/bin/awk -f
{ print $1, $3>>$2 }
[download]

Re: Performance Trap - Opening/Closing Files Inside a Loop
by DrHyde (Prior) on Dec 10, 2004 at 12:06 UTC

I bet you're going to run out of file handles.

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by Happy-the-monk (Canon) on Dec 10, 2004 at 15:02 UTC

See runrig's comment.

Cheers, Sören

Re^2: Performance Trap - Opening/Closing Files Inside a Loop

by tachyon (Chancellor) on Dec 11, 2004 at 00:45 UTC

I bet you're going to run out of file handles.

I'll bet you I won't :-)

$ ulimit -HSn 8192
[download]

A file descriptor is really just an entry in a C array which is maintained by the OS. Although by default the size of that array is typically 512, 1024, .... or some other function of base 2 depending on OS etc you can make it virtually anything you want within reason. See this for some more details.

cheers

tachyon

Re: Performance Trap - Opening/Closing Files Inside a Loop
by mattr (Curate) on Dec 11, 2004 at 18:38 UTC

I was agreeing with the idea of doing 10MB segments with an in-memory hash. But what I didn't quite understand is why they are using the filesystem as the database, sure it is possible but hardly seems useful. Add to that the restriction on filenames and even potential security problems if something gets injected there..

I was thinking one of the tons of db-like options available might be useful, either as the end product or as the intermediate stage. If the java guys can't read a tied hash, mldbm or whatever you could use an sql db, anyway these things ought to be good at dealing with memory and disk write optimization. You can always dump the db to separate files if that's what you want.

Anyway, the point is that you are intentionally not being told what the project is supposed to do, so watch your back! I would personally ask why on earth they are writing thousands of files to the disk, that is so 70s. Don't the java guys know how to use Oracle or whatever they have in the same room? :) And they waste your time too, talk about inefficient use of resources!

Anyway it would be really funny if the answer is just to use the sql LOAD DATA INFILE command on a database you already have to solve the problem. You may be interested in the mysqlimport utility which is an interface to that command.

Back to Seekers of Perl Wisdom

Log In^?

Domain Nodelet^?

www.com | www.net | www.org

Node Status^?

node history
Node Type: perlquestion [id://413719]
Approved by sgifford
Front-paged by tachyon
help

Chatterbox^?

How do I use this? • Last hour • Other CB clients

Other Users^?

Others musing on the Monastery: (7)

As of 2024-04-18 02:49 GMT

Sections^?

Information^?

Find Nodes^?

Leftovers^?

Today I Learned

Voting Booth^?

No recent polls found