r1n0 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,
I am requesting a recommendation for a thread-safe persistent file queue. I have a script that uses multiple threads to read from and write to a queue. I have been using the Thread::Queue module to create an in-memory queue, but I would prefer to set up a file-based queue so that the queue persists if anything crashes or the script needs to be shut down.

This queue is pretty simple. It is a list of jobs/tasks to be performed across various threads. The threads need to be able to remove and add jobs to the queue as needed.

I have had no problems with Thread::Queue, but I am looking for a module to replace it because I need the queue to be persistent.

As always, thank you for your help.

Re: Persistent File Queue recommendation requested
by Corion (Patriarch) on Jan 27, 2010 at 13:19 UTC

    Have a look at IPC::DirQueue, or use files in the file system as jobs and directories as their respective state yourself. On sane (that is, non-NFS) filesystems, rename is atomic and hence you can acquire a job by renaming it to "state1/$filename.$$", and once it is completed, rename it to "state2/$filename" for others to see. Restarting a job is as easy as renaming the corresponding file, and finding abandoned jobs means checking whether the PID corresponding to the file still exists.

    If you want to expand this scheme across machines (with a shared directory holding the data, see my caution against NFS above), note that the PID is not unique anymore, so you will need to add the machine name as well as the PID to the file.
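
    The rename scheme described above could be sketched roughly like this. It is only a minimal illustration, not how IPC::DirQueue works internally: the sub names (claim_job, finish_job, owner_alive) and the three-directory layout are assumptions made up for the example, and it relies on rename being atomic on a local filesystem.

```perl
use strict;
use warnings;
use File::Spec;
use Sys::Hostname qw(hostname);

# Try to claim one job file from $new_dir by renaming it into $work_dir.
# Returns (claimed path, original job name), or an empty list if no job
# could be claimed. The directory names are hypothetical.
sub claim_job {
    my ($new_dir, $work_dir) = @_;

    opendir(my $dh, $new_dir) or die "Cannot open $new_dir: $!";
    my @jobs = sort grep { -f File::Spec->catfile($new_dir, $_) } readdir $dh;
    closedir $dh;

    # Tag the claimed file with machine name and PID so that abandoned
    # jobs can later be traced back to their owner.
    my $owner = hostname() . ".$$";

    for my $name (@jobs) {
        my $from = File::Spec->catfile($new_dir,  $name);
        my $to   = File::Spec->catfile($work_dir, "$name.$owner");

        # Only one claimant can win the rename; for the others the
        # source file is already gone and rename returns false.
        return ($to, $name) if rename $from, $to;
    }
    return;
}

# Publish a finished job by renaming it into $done_dir under its original name.
sub finish_job {
    my ($claimed, $name, $done_dir) = @_;
    my $to = File::Spec->catfile($done_dir, $name);
    rename $claimed, $to or die "Cannot move $claimed to $to: $!";
    return $to;
}

# Check whether a PID on this machine still exists. Signal 0 delivers nothing;
# note it also reports false for processes we may not signal.
sub owner_alive {
    my ($pid) = @_;
    return kill(0, $pid) ? 1 : 0;
}
```

    A worker would loop on claim_job("new", "work"), process the file, then call finish_job. Since all threads of one process share the same PID, a threaded script would probably want to append something like threads->tid to the owner tag as well.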

    I haven't used IPC::DirQueue myself, but I have worked with similar schemes, and they have the benefit that you have lots of user tools already available. They don't scale well over 1000 jobs due to file system limitations and the rescanning, but if you have higher requirements, a dedicated (single point of failure) job queue server like SMTP might be a better option.

      Thank you for the input. You state that these schemes don't scale well to thousands of jobs. Unfortunately, that is my situation: the enqueue thread receives hundreds of thousands of jobs a day. So this probably isn't going to work for my needs. I appreciate the feedback and will probably play with IPC::DirQueue for other things. Thank you.

        Depending on the nature of the jobs, you can follow the usual approach and use the first two letters or some other (evenly distributed!) criterion to distribute the jobs among subdirectories. This makes scanning for fresh jobs harder, though. Alternatively, move jobs that are "in processing" into a separate directory which is not scanned. That will reduce the load that idle jobs produce while scanning for work to do.
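
        One way to get an evenly distributed criterion, sketched under the assumption that job names are plain byte strings: hash the name and use the first two hex digits of the digest as the subdirectory. The sub name shard_dir and the choice of MD5 are just for illustration; any well-mixed key would do.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

# Map a job name to one of 256 subdirectories of $base_dir, using the
# first two hex digits of the name's MD5 digest as the bucket.
# The same job name always lands in the same bucket.
sub shard_dir {
    my ($base_dir, $job_name) = @_;
    my $bucket = substr(md5_hex($job_name), 0, 2);
    return File::Spec->catdir($base_dir, $bucket);
}
```

        A writer would then create the job file under shard_dir("queue", $name), and a scanner would walk the 256 buckets in turn instead of reading one huge directory.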

        If you have to have a high throughput and can't batch your 100k requests into jobs of (say) 100 items or so, I'd look at premade solutions or maybe just at dedicating a database machine which serves as the central job directory.

Re: Persistent File Queue recommendation requested
by JavaFan (Canon) on Jan 27, 2010 at 13:10 UTC
    You could use the standard technique that's used to prevent multiple processes modifying the same file:
    • open file for read-write.
    • get an exclusive lock.
    • modify the file.
    • close file.
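
    The steps above could look roughly like this for a queue kept as one job per line in a plain file. This is a minimal sketch under that assumption; the sub names enqueue and dequeue are made up for the example, and enqueue opens the file in append mode rather than read-write as a simplification.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET O_RDWR O_CREAT);

# Append one job (a single line of text) to the queue file.
sub enqueue {
    my ($file, $job) = @_;
    open(my $fh, '>>', $file) or die "Cannot open $file: $!";
    flock($fh, LOCK_EX)       or die "Cannot lock $file: $!";
    print {$fh} "$job\n"      or die "Cannot write $file: $!";
    # Closing flushes the buffer and releases the lock.
    close($fh)                or die "Cannot close $file: $!";
    return;
}

# Remove and return the oldest job, or undef if the queue is empty.
sub dequeue {
    my ($file) = @_;
    # Open read-write, creating the file if it does not exist yet.
    sysopen(my $fh, $file, O_RDWR | O_CREAT) or die "Cannot open $file: $!";
    flock($fh, LOCK_EX)                      or die "Cannot lock $file: $!";

    my @jobs = <$fh>;
    my $job  = shift @jobs;
    if (defined $job) {
        # Rewrite the file without the first line while still holding the lock.
        seek($fh, 0, SEEK_SET) or die "Cannot seek $file: $!";
        truncate($fh, 0)       or die "Cannot truncate $file: $!";
        print {$fh} @jobs;
        chomp $job;
    }
    close($fh) or die "Cannot close $file: $!";
    return $job;
}
```

    Note that dequeue rewrites the whole file on every call, so this is only reasonable for small queues. Whether flock also excludes threads inside a single process depends on the platform, so a threaded script would probably still want its own lock around these calls.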
      I tested a tied array (tied to a file) with the approach you describe. I locked a shared variable before adding anything to the array, and locked the same variable before each read. The problem I encountered (and I hope I am wrong) is that the Tie::File module seemed to leak memory, and before I knew it my script was using a lot of memory. The script starts at ~20MB, and after using Tie::File and performing your steps above for about 30 minutes it grows to over 500MB of RSS. Does anyone know if Tie::File is leaky?
        I've never heard of it being leaky. What makes you think the leak is in Tie::File, and not in your usage? Since Tie::File uses a cache, you may end up with each thread having a separate cache, each consuming memory (and if each thread has its own view of the file, the threads won't communicate with each other correctly either).

        How are you using Tie::File so that the threads can divide the work correctly?