warthurton has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl-based mail filter that runs on one account (which gets a ton of spam). I would like to keep the # of concurrent runs of this filter down.

I do not want to do this in qmail itself, since that would restrict the # of concurrent deliveries for the rest of the site.

What I'm thinking about is some way to check how many copies of the script are running; if the count is > x, wait until it is <= x and then continue processing.

Some possibilities are checking the process list (but multiple copies could still end up starting at the same time), or writing out a file at the start of processing and deleting it at the end (but what if the removal fails? Then maybe check a timestamp).

People do this with pid files quite often, but I'm not sure of the best way to process them.

Has anyone ever had to restrict the # of concurrent copies of a program? What have you done? What has worked? What hasn't? Did you set up queues?

Thanks for any ideas.

W

Re: Maximum # of concurrent runs
by halley (Prior) on Aug 19, 2003 at 16:19 UTC
    Have you read up on Parallel::ForkManager?
    use Parallel::ForkManager;

    $pm = new Parallel::ForkManager($MAX_PROCESSES);

    foreach $data (@all_data) {
        # Forks and returns the pid for the child:
        my $pid = $pm->start and next;

        ... do some work with $data in the child process ...

        $pm->finish; # Terminates the child process
    }
    Looks like your solution would iterate over the spammy email data, launching new forks each time. Don't exec() or you'll skip the important $pm->finish() call.

    Update: To manage a family of processes cleanly and simply, you must be the parent. You must have some central authority process which detects new emails and launches new filter processes, and that central process must do the management.

    It's not going to be as clean or supportable to try to detect how many siblings or cousins are already running before performing any work. You'd get into a big hairy ball of semaphores before you got anything working.

    --
    [ e d @ h a l l e y . c c ]

      Unfortunately the process isn't forking.

      A new process is created for each mail message as it is received.

        Well, whatever controls that is what has to be managed. There's nothing you can do once you've forked except continue to process, correct? So you have to control when the forks happen.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Maximum # of concurrent runs
by dragonchild (Archbishop) on Aug 19, 2003 at 16:17 UTC
    Have one script that is continually running. It would maintain a count of how many worker processes are currently running, and use fork to launch each new one.
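
    A minimal sketch of that dispatcher idea, in the spirit of this reply rather than working code from it: the next_message() and filter_message() subs below are placeholders (here they just read filenames from @ARGV and warn), and the limit of 6 is arbitrary.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX ":sys_wait_h";

    my $MAX_CHILDREN = 6;   # cap on concurrent filter runs
    my %children;           # pid => 1 for each live child

    # Placeholder: stand-in for however the dispatcher learns about new mail.
    sub next_message { return shift @ARGV }

    # Placeholder: stand-in for the real filter logic.
    sub filter_message { warn "filtering $_[0]\n" }

    while (defined(my $msg = next_message())) {
        # Reap any children that have already finished, without blocking.
        while ((my $done = waitpid(-1, WNOHANG)) > 0) { delete $children{$done} }

        # If we are at the cap, block until one child exits.
        while (keys %children >= $MAX_CHILDREN) {
            my $done = waitpid(-1, 0);
            delete $children{$done} if $done > 0;
        }

        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid == 0) {            # child: run the filter and exit
            filter_message($msg);
            exit 0;
        }
        $children{$pid} = 1;        # parent: remember the child
    }

    waitpid($_, 0) for keys %children;   # wait for the stragglers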

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Maximum # of concurrent runs
by MidLifeXis (Monsignor) on Aug 19, 2003 at 17:32 UTC

    How about spooling the mail, and then processing it after it is spooled? Or do you do rejects inline?

    If doing the rejects inline, then I would read up on semaphores.

    If you are able to spool it, then you can manage this at your leisure.
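
    If you can spool it, a rough sketch of the batch side might look like the code below, run from cron every 5 or 10 minutes. The spool path and the filter_one_message() stub are assumptions, not anything from this thread; the delivery side would simply drop each raw message into that directory.

    #!/usr/bin/perl
    # Batch pass over the spool; meant to be run from cron every 5-10 minutes.
    use strict;
    use warnings;

    my $spool = '/var/spool/spamfilter/new';   # assumed drop directory

    # Placeholder for the real filtering work.
    sub filter_one_message { warn "would filter $_[0]\n" }

    opendir my $dh, $spool or die "Cannot open $spool: $!";
    for my $file (sort grep { !/^\./ } readdir $dh) {
        my $path = "$spool/$file";
        next unless -f $path;
        filter_one_message($path);
        unlink $path or warn "Could not remove $path: $!";
    }
    closedir $dh;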

      I think that I will make spooling the bulk of the mail, and then processing it every 5 or 10 minutes, a job for the NEAR future.

      Thanks for the suggestion. In the meantime I will put a locking mechanism in as a stopgap.

      Thanks

Re: Maximum # of concurrent runs
by esh (Pilgrim) on Aug 19, 2003 at 16:46 UTC

    I've used LockFile::Simple for similar problems where I can only have one copy of a program running at a time. This module handles the waiting, retrying, cleaning up of stale lock files, and more.

    The problem gets a bit more complicated if you want to have a maximum of N copies running at a time.
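
    For the single-copy case, a minimal sketch of how LockFile::Simple might be used (the lock path and retry settings below are just examples):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LockFile::Simple;

    # -stale treats locks left by dead processes as stale;
    # -max and -delay control how long lock() retries before giving up.
    my $lockmgr = LockFile::Simple->make(-stale => 1, -max => 20, -delay => 3);

    my $lock = $lockmgr->lock('/tmp/myfilter')   # default format appends ".lock"
        or die "Another copy is running; giving up after retries\n";

    # ... do the real filtering work here ...

    $lock->release;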

    -- Eric Hammond

Re: Maximum # of concurrent runs
by BrowserUk (Patriarch) on Aug 19, 2003 at 17:08 UTC

    If your OS supports it, you should take a look at IPC::Semaphore.
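
    A rough sketch of what that could look like with a SysV semaphore initialised to the maximum number of copies. The ftok() key, the permissions, and the limit of 6 are assumptions, and the usual create/initialise race is glossed over.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IPC::SysV qw(IPC_CREAT IPC_EXCL SEM_UNDO ftok);
    use IPC::Semaphore;

    my $MAX_COPIES = 6;
    my $key = ftok('/etc/passwd', 42);   # any stable path/project pair will do

    # Create the one-semaphore set and initialise it to the cap, or attach
    # to it if another copy has already created it.
    my $sem = IPC::Semaphore->new($key, 1, 0600 | IPC_CREAT | IPC_EXCL);
    if ($sem) {
        $sem->setval(0, $MAX_COPIES);
    } else {
        $sem = IPC::Semaphore->new($key, 1, 0600)
            or die "Cannot attach to semaphore: $!";
    }

    # Decrement: blocks until a slot is free. SEM_UNDO makes the kernel give
    # the slot back if this process exits without releasing it.
    $sem->op(0, -1, SEM_UNDO);

    # ... filter the message here ...

    $sem->op(0, 1, SEM_UNDO);   # release the slot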


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Maximum # of concurrent runs
by esh (Pilgrim) on Aug 19, 2003 at 21:40 UTC

    I believe I understand your situation. You have an email filter which gets run by qmail for every incoming email. You believe that you cannot control when the emails come in or when qmail fires off your filter.

    You want your filter to hold off running if there are a lot of other copies of itself already running and processing other emails.

    I agree that the most elegant solution would be to file the emails in a separate folder and run a batch process regularly which processes the entire folder. However, this delays delivery of your email to your real mailbox, since it depends on a polling mechanism.

    Here is my module which uses LockFile::Simple to only allow one copy of a program to run at a time. This is similar to merlyn's highlander column which I just learned about. However, since I use this code regularly, mine is a bit more reusable. We obviously both watched the same movie, though.

    package My::ThereCanBeOnlyOne;
    use strict;
    use LockFile::Simple;

    my $Lock;

    sub import {
        my $self = shift;
        my $name = shift || 'therecanbeonlyone';
        my %args = @_;

        # Using /tmp allows an internal denial of service attack.
        my $lockfile = "/tmp/$name.pid";

        my $locker = LockFile::Simple->make(-autoclean => 0, %args);
        $Lock = $locker->lock($lockfile, '%f') and return 1;

        # Lock attempt failed: report who holds it and bail out.
        open(LOCKFILE, "< $lockfile")
            or die __PACKAGE__.": Unable to open $lockfile: $!";
        my $other_pid = <LOCKFILE>;
        close(LOCKFILE);
        chomp($other_pid);
        die __PACKAGE__.": $name: $other_pid still running. $$ exiting\n";
    }

    END { $Lock->release() if $Lock; }

    1;

    At the top of your filter program you would write:

    use My::ThereCanBeOnlyOne 'myprogram';
    Replace 'myprogram' with the name of your program or resource you want to lock on. It is arbitrary.

    This would use all of the LockFile::Simple defaults as far as retries, timeouts, expiration, etc. You can override any LockFile::Simple parameters by simply including them at the end of the use statement, like so:

    use My::ThereCanBeOnlyOne 'myprogram', -hold => 0, -stale => 1, -max => 1;

    Now comes the interesting idea. You want to limit the number of email filters to N. If you're willing to live with N as the absolute maximum, and with it becoming more and more likely that a copy of the program will have to wait for other copies to finish as the number of running copies approaches N (say 6), then you could use:

    use My::ThereCanBeOnlyOne 'myprogram'.int(rand(6));
    This picks a random lock file slot from 0 to 5. If another running program already has that slot, we keep waiting until it finishes. The probability of a program having to wait increases as more copies are already running, but this does guarantee a hard limit of 6 (or N) simultaneous copies.

    This might not be reasonable for some applications, but for an email filter, I thought it might be appropriate.

    -- Eric Hammond

      Thanks. This makes a lot of sense. After reading some of the other replies I do think that in the future I will move to a temporary queue and then process it in a non-realtime manner, but for now I'm going to implement a scheme very similar to what you have proposed.

      Thanks again.

Re: Maximum # of concurrent runs
by davido (Cardinal) on Aug 19, 2003 at 18:25 UTC
    I think that a simple solution could be to have a process handle file. Do the following:

    Create a text file with six lines, each containing the word HANDLE.

    Next, when your script executes, it should:

    * Lock the handle file.
    * Open the handle file.
    * Read in the handle file.
    * If there are no "HANDLE"s left, exit.
    * If there are "HANDLE"s left, pop one off the bottom of the file.
    * Write out the file.
    * Close it.
    * Unlock it.

    Then do whatever work you intended to do within the script. Upon completion of the work, do the following:

    * Lock and open the HANDLE file.
    * Push your HANDLE back into the end of the file.
    * Close and unlock the file.
    * exit.

    Oh, and if your script finds the HANDLE file already locked, just wait a second and try again.

    It's a pretty simple method. If you want to increase or decrease the number of simultaneous processes, you just alter the number of handles in the file. Alternatively, the file could just as easily contain a counter instead of a series of "handles": each process decrements the counter, runs, then increments the counter, and if the counter ever hits zero, no more processes can run. Same basic concept.
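
    A hedged sketch of that counter variant, guarding the file with flock (the counter path and the limit of 6 are assumptions). Note that the caveat in the reply below still applies: a process killed outright never gives its slot back.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl qw(:flock O_RDWR O_CREAT);

    my $counter_file = '/var/run/spamfilter.count';   # assumed location
    my $MAX_COPIES   = 6;

    # Atomically adjust the slot counter; returns false if no slot was free.
    sub change_count {
        my ($delta) = @_;
        sysopen my $fh, $counter_file, O_RDWR | O_CREAT
            or die "Cannot open $counter_file: $!";
        flock $fh, LOCK_EX or die "Cannot lock $counter_file: $!";
        my $count = <$fh>;
        chomp $count if defined $count;
        $count = $MAX_COPIES unless defined $count && $count =~ /^\d+$/;
        return 0 if $delta < 0 && $count == 0;   # no slots left; lock drops on close
        $count += $delta;
        seek $fh, 0, 0;
        truncate $fh, 0;
        print {$fh} "$count\n";
        close $fh;                               # releases the flock
        return 1;
    }

    # Grab a slot, waiting a second between attempts, as described above.
    sleep 1 until change_count(-1);

    # ... do the filtering work here ...

    change_count(+1);   # give the slot back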

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      This mechanism assumes that people don't do nasty things like "kill -9" the process (which people shouldn't do anyway, but that cargo cult solution gets repeated all the time).

      A better solution would be something that gets reset automatically by the operating system even if the process stops dead in its tracks. I give an "only-one" solution in my "highlander" column, which could be extended to "only six" with a bit of cleverness. In fact, I have that bit of cleverness scheduled for a future column idea. {grin}
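
      Purely as a guess at what such a scheme might look like (and not necessarily the column's approach): try a non-blocking flock on each of N slot files and proceed once one is obtained. The slot directory below is an assumption.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Fcntl qw(:flock);

      my $MAX_COPIES = 6;
      my $slot_dir   = '/var/run/spamfilter';   # assumed directory for slot files

      # Try each slot with a non-blocking flock; the kernel drops the lock when
      # the process exits, even on kill -9, so there is nothing to clean up.
      my $slot_fh;
      SLOT: while (1) {
          for my $n (1 .. $MAX_COPIES) {
              open $slot_fh, '>>', "$slot_dir/slot.$n"
                  or die "Cannot open $slot_dir/slot.$n: $!";
              last SLOT if flock $slot_fh, LOCK_EX | LOCK_NB;
              close $slot_fh;
          }
          sleep 1;   # all slots busy; wait and try again
      }

      # ... filter the message while holding the slot ...

      close $slot_fh;   # the lock goes away with the filehandle (or the process)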

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

Re: Maximum # of concurrent runs
by TomDLux (Vicar) on Aug 19, 2003 at 18:06 UTC

    My vote is with MidLifeXis.

    Doing things item by item is great when items are rare.

    When that becomes too expensive, batch processing is far more efficient. So long as processing is done before the items are needed, who cares when it is done?

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Re: Maximum # of concurrent runs
by johndageek (Hermit) on Aug 19, 2003 at 21:14 UTC
    A question: will any one of the processes clean up the entire mail queue, or does each process clean up a single defined message?

    An option is to fire off a process that will append the mail id to a queue or database table as email is received, and then have a single removal process running that works on email based on the queue entries, deleting queue entries upon completion.
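
    A minimal sketch of the flat-file flavour of that idea; the queue path, the enqueue/drain helpers, and the filter_message() stub are all assumptions. The per-message hook only appends, and the single removal process drains the queue.

    #!/usr/bin/perl
    # Hypothetical sketch: flock-guarded flat file standing in for a queue table.
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    my $queue = '/var/spool/spamfilter/queue';   # assumed queue file

    # Stub for the real filtering work.
    sub filter_message { warn "would filter $_[0]\n" }

    # Called once per delivery: append the message id and get out of the way.
    sub enqueue {
        my ($msgid) = @_;
        open my $fh, '>>', $queue or die "Cannot open $queue: $!";
        flock $fh, LOCK_EX or die "Cannot lock $queue: $!";
        print {$fh} "$msgid\n";
        close $fh;
    }

    # The single removal process: take a snapshot of the queue, filter each
    # entry, repeat. (Simplification: entries are removed when taken, not
    # upon completion as suggested above.)
    sub drain {
        while (1) {
            my @ids;
            if (open my $fh, '+<', $queue) {
                flock $fh, LOCK_EX or die "Cannot lock $queue: $!";
                @ids = <$fh>;
                seek $fh, 0, 0;
                truncate $fh, 0;
                close $fh;
                chomp @ids;
            }
            filter_message($_) for @ids;
            sleep 5 unless @ids;   # nothing to do; nap briefly
        }
    }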

    Enjoy!
    John