Fisch has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I'm trying to analyse some data and want to write to several filehandles simultaneously. But after 1021 open files my script crashes at the open call, unable to open the next filehandle.
I have already set the maximum number of open files to 65536 in /etc/security/limits.conf, and I am using a "fresh" shell, i.e. ulimit -a shows that number.
Is there a limitation in Perl itself, or is this a Linux issue?
Thanks for any hint!

Re: maximum number of open files
by liverpole (Monsignor) on Dec 04, 2006 at 13:14 UTC
    Hi Fisch,

    This is a Linux issue.  You're hitting a limit of 2**10 = 1024 file descriptors, minus the 3 already taken by STDIN, STDOUT and STDERR, which leaves exactly the 1021 you observed.

    First off, do you really need to be opening that many files at once?  Maybe it's just a matter of rethinking your algorithm.

    Secondly, if you do need to increase the number, do something like the following (as described by this site):

        # echo 5000 > /proc/sys/fs/file-max

    Then edit /etc/sysctl.conf, and append the line:

        fs.file-max = 5000

    Then make the new value take effect with:

        # sysctl -p

    or, alternatively, logout and back in again.
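
    If you want to see from Perl itself exactly where the opens start failing, here is a minimal sketch (the /tmp/fh_limit_test_* filenames are just made up for the test):

        #!/usr/bin/perl
        # Keep opening files until the OS refuses, then report how many we got
        # and what the error was (typically "Too many open files").
        use strict;
        use warnings;

        my @handles;
        my $n = 0;
        while (1) {
            my $fh;
            unless (open $fh, '>', "/tmp/fh_limit_test_$n") {
                print "managed $n handles before open failed: $!\n";
                last;
            }
            push @handles, $fh;
            $n++;
        }

        close $_ for @handles;
        unlink glob '/tmp/fh_limit_test_*';

    With a per-process limit of 1024 descriptors and the three standard handles already open, the count printed should come out at 1021, matching what you are seeing.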

    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: maximum number of open files
by cdarke (Prior) on Dec 04, 2006 at 17:41 UTC
    I had a similar problem once (a long time ago) and I cheated. It was an exceptional condition - most of the time I was not getting close to the limits.
    I abstracted the IO part of the application (I was lucky to have a layered design, which made it easy) and moved it into a separate module. When I got close to the limit I created another process which just did the IO. I communicated with this "IO Server" using named pipes, but any appropriate IPC could be used. I could keep creating "IO Server" processes as needed.
    There are performance implications, but they were not noticeable. However, it was a lot of work; it is far better to avoid needing huge numbers of open files in the first place.
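
    A minimal sketch of that "IO Server" idea, assuming a one-line "filename<TAB>data" protocol and a FIFO at /tmp/io_server.fifo (both invented here for illustration; the real design may have looked quite different):

        #!/usr/bin/perl
        # io_server.pl -- hold a share of the open output handles on behalf of
        # the main program; reads "filename\tdata" lines from a named pipe and
        # appends the data to the named file.
        use strict;
        use warnings;
        use POSIX qw(mkfifo);

        my $fifo = '/tmp/io_server.fifo';
        unless (-p $fifo) {
            mkfifo($fifo, 0600) or die "mkfifo $fifo: $!";
        }

        # blocks until a writer connects; the loop ends when the last writer closes
        open my $in, '<', $fifo or die "open $fifo: $!";

        my %out;    # this server process's cache of open output handles
        while (my $line = <$in>) {
            chomp $line;
            my ($file, $data) = split /\t/, $line, 2;
            $out{$file} ||= do {
                open my $fh, '>>', $file or die "open $file: $!";
                $fh;
            };
            print { $out{$file} } "$data\n";
        }
        close $_ for values %out;

    The main script then needs only one handle per server instead of one per output file:

        open my $srv, '>', '/tmp/io_server.fifo' or die "open fifo: $!";
        print $srv "bucket_042.txt\tsome record\n";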
Re: maximum number of open files
by graff (Chancellor) on Dec 05, 2006 at 03:24 UTC
    Is the reason for all those thousands of simultaneous file handles to sort the content of some single large input into thousands of distinct output buckets? If so, is the input an existing file on disk, or is it a "live" streaming source that needs to be sorted on a continuous basis?

    For splitting up the contents of a large, complicated input file, I'd do one pass to build an index for the records to be sorted: the output of this pass is a stream of lines containing "bucket_name start_offset byte_length" for each distinct input record. Then I would sort the index by bucket_name, and use a second-pass script that does a "seek(...); read(...)" on the big file for each line in the sorted index. Because of the sorting, all the records intended for a given bucket are clustered together, so I only need to have one output file open at a time. On the whole, this is likely to work a lot faster than any alternative, because there will be less file I/O and system-call overhead.
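
    A rough sketch of those two passes, assuming line-oriented records in a file called big_input.dat and using a stand-in bucket_for() rule (first whitespace-separated field) in place of whatever actually determines the bucket:

        #!/usr/bin/perl
        # pass1_index.pl -- emit "bucket_name start_offset byte_length" per record
        use strict;
        use warnings;

        open my $in, '<', 'big_input.dat' or die "big_input.dat: $!";
        my $offset = tell $in;
        while (my $line = <$in>) {
            print join(' ', bucket_for($line), $offset, length $line), "\n";
            $offset = tell $in;    # start of the next record
        }

        sub bucket_for { (split ' ', $_[0])[0] }    # stand-in for the real rule

    Sort the index on the bucket name (e.g. "sort -k1,1 -s index.txt", the -s keeping records in their original order within each bucket) and feed it to the second pass:

        #!/usr/bin/perl
        # pass2_split.pl -- read the sorted index on STDIN, pull each record out
        # of the big file with seek/read, and switch output files only when the
        # bucket name changes, so only one output handle is open at a time
        use strict;
        use warnings;

        open my $big, '<', 'big_input.dat' or die "big_input.dat: $!";
        my ($current, $out) = ('');
        while (<STDIN>) {
            my ($bucket, $offset, $len) = split;
            if ($bucket ne $current) {
                close $out if $out;
                open $out, '>>', "$bucket.out" or die "$bucket.out: $!";
                $current = $bucket;
            }
            seek $big, $offset, 0 or die "seek: $!";
            read $big, my $rec, $len;
            print {$out} $rec;
        }
        close $out if $out;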

    If dealing with a continuous input stream, where two passes over the data might not be practical (and the number/names of potential output buckets might not be known in advance), I'd probably switch to storing stuff in a database, instead of in lots of different files -- a mysql/oracle/whatever flat table with fields "bucket_name" and "record_value" might suffice, if you build an index on the bucket_name field to speed up retrieval based on that field.
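
    In that case, a sketch along those lines, assuming DBI with a MySQL backend, placeholder connection details, and tab-separated "bucket<TAB>value" input:

        #!/usr/bin/perl
        # load a continuous stream into one indexed table instead of many files
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=buckets', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS records (
                bucket_name  VARCHAR(64),
                record_value TEXT,
                INDEX (bucket_name)
            )
        });

        my $ins = $dbh->prepare(
            'INSERT INTO records (bucket_name, record_value) VALUES (?, ?)');

        while (my $line = <STDIN>) {
            chomp $line;
            my ($bucket, $value) = split /\t/, $line, 2;
            $ins->execute($bucket, $value);
        }
        $dbh->commit;    # for a truly continuous stream, commit periodically instead

    Retrieval per bucket is then a single indexed query: SELECT record_value FROM records WHERE bucket_name = ?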

    Either way, I'd avoid having thousands of file handles open at the same time. There must be some good reason why every OS has a standard/default limit on the number of open file handles per process, and circumventing that limit by orders of magnitude would, I expect, lead to trouble.