http://qs1969.pair.com?node_id=685436

Ryszard has asked for the wisdom of the Perl Monks concerning the following question:

I have a (legitimate) need to send up to around 1.5 million emails twice a month (notification of a billing cycle). I ultimately want the solution to scale to about twice (and a bit) what it is initially spec'd at, to allow for future requirements.

I will be getting a flat file list of email addresses, and the idea is that i will connect to an SMTP server for delivery.

In order to increase scale, i figure parsing and chopping up the list and putting it in a database would allow multiple hosts to access the list and process their part all at the same time.

The issue i have is a requirement to throttle the requests on a per second basis (so as to not choke the mail system).

My thoughts on the matter are to use Net::SMTP and threads, and some kind of counter to track the requests sent, and stop sending if the limit has been reached for that period.

If someone has experiences on this, i'd love to hear it

update

it seems as tho' i have actually jumped the gun on this. after chatting with the mail admins (i'm an apps guy), it appears the most scalable solution is to dump stuff straight into the mail queue, and (in our instance) tune exim to work it out (as moritz has mentioned)

Still, as a point of interest, how could one throttle a socket to N actions/second? i guess a counter, and in the case of using threads, perhaps a shared variable? are perl threads safe enuff for this?

Re: Applying the brakes
by moritz (Cardinal) on May 08, 2008 at 11:20 UTC
    I don't have very much experience with mail servers, but I'll try nonetheless: Could you let the mail server do the throttling?

    I guess that's a common problem, and some mail servers already have a solution for that.

    Perhaps you can query the local queue size of your mail server and sleep if there are more than $limit mails in the queue.
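
    A minimal sketch of the "sleep while the queue is too long" idea, assuming the mail server is exim (as the update in the question says) and that `exim -bpc`, which prints the number of queued messages, can be run by the sending user. The names `exim_queue_size` and `wait_for_queue` are made up for this illustration, and the 5 second pause is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper: number of messages currently in the local exim queue.
# Assumes exim is on PATH and the caller may run "exim -bpc".
sub exim_queue_size {
    my $count = `exim -bpc`;
    die "could not read queue size from exim -bpc\n"
        unless defined $count && $count =~ /^\s*(\d+)/;
    return $1;
}

# Block until the queue holds at most $limit messages.
# $get_size and $pause are optional code refs so the loop can be
# exercised without a mail server; they default to exim and sleep 5.
sub wait_for_queue {
    my ($limit, $get_size, $pause) = @_;
    $get_size ||= \&exim_queue_size;
    $pause    ||= sub { sleep 5 };

    my $waits = 0;
    while ($get_size->() > $limit) {
        $pause->();
        $waits++;
    }
    return $waits;    # how many times we had to back off
}
```

    Calling `wait_for_queue(5000)` before handing over each chunk of the list would then pace the injection by the server's own backlog rather than by a fixed rate.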

    Especially when it comes to concurrency I try very hard to find some piece of software that already solves the hard part of the problem, in this case I'd hope for a decent mail server.

Re: Applying the brakes
by jettero (Monsignor) on May 08, 2008 at 11:34 UTC
    I think moritz is correct about getting the mailserver to do this throttling for you. But I do have limited experience with problems much smaller than yours. I send out mails to users whose mailboxes are full. I use two strategies to prevent overloading my already overloaded mailserver.

    First, it refuses connections and new RCPTs when it's already sweating. Second, I try not to get it to that point. (This send_mail function is from my own Net::SMTP::OneLiner.)

    # This part is a no-brainer, and your mail software
    # likely already supports something like it:
    RETRY: {
        eval {
            send_mail('postmaster@mei.net', "$dir_entry\@mei.net",
                      "Usage Notice (full mailbox)", $msg);
        };
        if ($@) {
            warn "sleeping for 2 seconds then retrying due to send_mail() WARNING: $@";
            sleep 2;
            redo RETRY;
        }
    }

    The second thing is to simply wait until the load average is low enough:

    sub sleep_until_low_load {
        my $limit = shift;

        while (1) {
            # The first field of /proc/loadavg is the 1-minute load average (Linux only).
            open my $proc, '<', '/proc/loadavg' or die $!;
            my $line = <$proc>;
            close $proc;

            my ($load) = $line =~ m/^\s*([\d.]+)\s+/;
            return unless defined $load && $load > $limit;

            print "\tsleeping for 5 seconds since $load > $limit\n";
            sleep 5;
        }
    }

    -Paul

Re: Applying the brakes
by salva (Canon) on May 08, 2008 at 12:10 UTC
    As it seems that your mail server is exim, you can send the messages to it through a pipe using "batch SMTP":
    open my $exim, "| exim -bS" or die;
    for (...) {
        print $exim <<'MSG';
    MAIL FROM:<foo@bar.com>
    RCPT TO:<doz@bar.com>
    DATA
    Subject: Your Invoice
    To: You
    From: Me

    You have to pay 1,000,000 dollars
    .
    MSG
    }
    If you want to parallelize, fork some processes and divide the task between them:
    my $workers = 4;
    for my $ix (0 .. $workers - 1) {
        unless (fork) {
            # Child: handle every line whose number is congruent to $ix modulo $workers.
            open my $fh, '<', $address_list_file_name or die;
            open my $exim, "| exim -bS" or die;
            while (<$fh>) {
                next unless (($. % $workers) == $ix);
                send_invoice($exim, $_);
            }
            exit 0;
        }
    }
Re: Applying the brakes
by tachyon-II (Chaplain) on May 08, 2008 at 13:19 UTC

    You may find this comparison of MTAs interesting. Yes it is dated but Exim was slow as a dog back then. Here is another comparison which is a bit less dated and still shows Exim at the back of the field.

    Now the hardware used was relatively old and slow but this is not a hardware bottleneck task. It is concurrency, network connectivity, and DNS lookups that are bottlenecks. To send an email your server has to look up the address in DNS (you want a big local cache), then make the connection to the remote server (may take seconds), then send HELO,MAIL,RCPT,DATA,EOM - all of which can be slowed down by the remote server. The entire transaction can easily exceed several seconds, thus you require multiple concurrent processes (50+) in your MTA to get any sort of decent throughput.

    You will note that performance in terms of emails per second is a feeble 5-15 depending on MTA. At 20/second throughput you are looking at about 21 hours runtime for 1.5 million emails.

    I suggest you look at a dedicated mail server that also runs its own DNS with a huge cache to do this task. Exim looks like one of the less efficient options judging from the performance benchmarks. If you dump 1.5 million emails onto your Exim queue and it is really only running a throughput of 5/sec you will effectively freeze outgoing email for 80 hours as any new message will go to the back of the queue. Even if it is doing 50/sec there will be nothing else going out for 8 hours - unless of course there is some way to flag it as low priority. Not only will you have an outgoing email problem, you will also have an incoming email issue as Exim is also receiving incoming connections.
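
    The runtime estimates above are simply messages divided by throughput. A small sketch of that arithmetic, with `drain_hours` as a made-up helper name and the three rates taken from the figures quoted in this reply:

```perl
use strict;
use warnings;

# Hours needed to drain $messages at a steady delivery rate of $per_second.
sub drain_hours {
    my ($messages, $per_second) = @_;
    die "rate must be positive\n" unless $per_second > 0;
    return $messages / $per_second / 3600;
}

# 1.5 million messages at 5, 20 and 50 deliveries per second.
printf "%2d/sec: %.1f hours\n", $_, drain_hours(1_500_000, $_) for (5, 20, 50);
```

    This prints 83.3, 20.8 and 8.3 hours respectively, which is where the rounded figures in the text come from. It assumes a constant rate and ignores retries and deferred mail.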

    If you did want to split up the flatfile of email addresses you don't need a database. Just use split(1).

      good info, thanks for this.

        Just had another thought. While dumping the file (probably in paced chunks - waiting for the queue to shrink back towards zero) may be sensible, you need to permute your infile somehow to ensure you don't have:

        bob@domain
        sue@domain
        ...
        foo@domain
        bar@other_domain

        If you dump a whole series of emails to the same mail server in a row it will choke and possibly ban/throttle you. One simple approach would be to apply a sort and let the variation in username vaguely randomise the domains, or you could shuffle them in an array using a Fisher-Yates shuffle.

        Provided you don't have high frequencies of gmail, hotmail, yahoo accounts a simple sort ought to work OK, otherwise you may need some clever code to make sure that these common domains don't occur in a row.

        I would probably take the easy road and try a simple sort first and check how many times a given domain occurs in your proposed concurrency frame (probably 50-100). Domains occurring more than 2-3 times within a frame may be a problem as your MTA will be asking for that many concurrent connections.
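
        One way the frame check could look, as a rough sketch only. It assumes the addresses are already in an array in sending order, and it simplifies the idea to non-overlapping frames rather than a sliding window. `crowded_frames` is a name invented for this example.

```perl
use strict;
use warnings;

# Report domains that appear more than $max times within any block of
# $frame consecutive addresses. Frames do not overlap (a simplification).
# Returns a list of [frame_start_index, domain, count] entries.
sub crowded_frames {
    my ($addresses, $frame, $max) = @_;
    my @report;

    for (my $start = 0; $start < @$addresses; $start += $frame) {
        my $end = $start + $frame - 1;
        $end = $#$addresses if $end > $#$addresses;

        my %seen;
        for my $addr (@{$addresses}[$start .. $end]) {
            # Skip anything that does not look like user@domain.
            my ($domain) = $addr =~ /\@([^\@\s]+)\s*$/ or next;
            $seen{lc $domain}++;
        }

        for my $domain (sort keys %seen) {
            push @report, [$start, $domain, $seen{$domain}]
                if $seen{$domain} > $max;
        }
    }
    return @report;
}
```

        Running it over the sorted list with a frame of 50-100 and a maximum of 2-3 would show whether the simple sort is good enough or whether the common domains need explicit spacing.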

        Update

        Could not resist. Here is a "don't hit the same domain if we have sent it an email within the last n-wide frame" algorithm to run your address list through. NB: code updated to remove a bug where a domain pulled off the FIFO in the else branch went unchecked against the current working domain. If it matches, it needs to go back on the FIFO; if not, it is good to go (untested).
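
        The poster's code did not survive in this copy of the thread. What follows is not that code but an independent sketch of the same idea under stated assumptions: addresses arrive in an array, a domain may not repeat within the last `$width` addresses emitted, and blocked addresses wait on a FIFO until their domain is clear. `space_domains` is a hypothetical name.

```perl
use strict;
use warnings;

# Reorder addresses so that, where possible, no domain repeats within
# the last $width addresses emitted. Blocked addresses wait on a FIFO.
sub space_domains {
    my ($addresses, $width) = @_;
    my (@out, @recent, @deferred);
    my %in_window;    # domain => occurrences among the last $width emitted

    my $emit = sub {
        my ($addr, $domain) = @_;
        push @out,    $addr;
        push @recent, $domain;
        $in_window{$domain}++;
        if (@recent > $width) {
            my $old = shift @recent;
            delete $in_window{$old} unless --$in_window{$old};
        }
    };

    # Emit the oldest deferred address whose domain is clear, if any.
    my $release = sub {
        for my $i (0 .. $#deferred) {
            my ($addr, $domain) = @{ $deferred[$i] };
            next if $in_window{$domain};
            splice @deferred, $i, 1;
            $emit->($addr, $domain);
            return 1;
        }
        return 0;
    };

    for my $addr (@$addresses) {
        my ($domain) = $addr =~ /\@([^\@\s]+)\s*$/;
        $domain = lc(defined $domain ? $domain : '');

        1 while $release->();    # older deferred addresses go first
        if ($in_window{$domain}) { push @deferred, [$addr, $domain] }
        else                     { $emit->($addr, $domain) }
    }

    # Drain the FIFO. If every remaining domain is still blocked,
    # emit the oldest one anyway: a collision is unavoidable here.
    while (@deferred) {
        next if $release->();
        $emit->(@{ shift @deferred });
    }
    return @out;
}
```

        Every input address comes out exactly once. A list dominated by one domain will still produce runs at the end, since there is nothing left to interleave with.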

Re: Applying the brakes
by CountZero (Bishop) on May 08, 2008 at 13:32 UTC
    I will not repeat what all others have already told you, but I confess I am a bit puzzled about your requirement for forking or threading the preparation of these e-mails (I assume in order to increase the throughput) and at the same time having a real concern for not swamping the mail server.

    Even if your mail-server can handle a few hundred messages a second it will still take many hours to process your batch of messages and it seems unlikely to me that preparing these messages by your script will take that long. Do you have any idea how long it takes to prepare your messages? As long as it is faster than the sending of the messages there is no need to complicate matters to start additional threads or use forking.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      The idea behind forking or threading is to offload the queue i'm processing to the mail server asap. what i want is a process that runs in as short a time as possible.

      for example, if someone has to kick off the notification process over the weekend, it would really suck to have to come in/log in and check it every few hours to make sure it's still going.

      ..and yes, i'm aware of writing automated monitors and all that, but at the end of the day, nothing beats (at least initially in the first few runs) eyes-on to ensure things are going smoothly.

Re: Applying the brakes
by jethro (Monsignor) on May 08, 2008 at 12:15 UTC
    On your update question:
    If all you want is to limit mails/second and you know how many hosts are doing the sending, you might as well throttle each host to 1/hosts of the rate one host would be sending.
    So 2 hosts would be sending at 1/2 of the rate one host sends. No communication necessary for that
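
    A sketch of such a per-host throttle, assuming only the core Time::HiRes module. `make_throttle` is an invented name, and the figures of 50 mails/second site-wide and 2 hosts are placeholders, not numbers from this thread.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Returns a closure that blocks just long enough to keep successive
# calls at or below $per_second. $now and $nap are optional code refs
# (clock and sleeper) so the pacing can be tested without real delays.
sub make_throttle {
    my ($per_second, $now, $nap) = @_;
    die "rate must be positive\n" unless $per_second > 0;
    $now ||= \&time;
    $nap ||= \&sleep;

    my $interval = 1 / $per_second;
    my $next_ok  = $now->();

    return sub {
        my $t = $now->();
        if ($t < $next_ok) {
            $nap->($next_ok - $t);
            $t = $next_ok;
        }
        $next_ok = $t + $interval;
        return;
    };
}

my $total_rate = 50;    # assumed site-wide limit, mails per second
my $hosts      = 2;     # assumed number of sending hosts
my $throttle   = make_throttle($total_rate / $hosts);

# In the sending loop, call $throttle->() before each message, e.g.
#   for my $rcpt (@recipients) { $throttle->(); send_one($rcpt) }
# where send_one() stands in for whatever actually sends the mail.
```

    Each host runs the same code with the same `$hosts` value, so no communication between hosts is needed.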

    if you really really want communication, then the database all the hosts are already accessing gives a convenient means to coordinate.
    For example it could give out contingents of email addresses to each host that are put in by one coordinator process at the specified rate.

    Perl threads could be used similarly, one coordinator thread and many sub-threads. Counter question: Safe from what?

Re: Applying the brakes
by samtregar (Abbot) on May 08, 2008 at 17:02 UTC
    It's pretty hard to kill most mail-servers just by giving them lots of mail to send. Most reasonable systems (qmail, Ironport, Ecelerity, etc) have solid queuing mechanisms that will deal with any backlog. Just make sure they have enough free disk space and you should be able to push out your messages as fast as you can format them. In general that means using a custom injection format rather than straight SMTP (qmail-inject for qmail, ESMTP for Ironport, Ecelerity::Injector for Ecelerity, for example).

    Of course, that assumes the sending server isn't running MS Exchange, or something equally pathetic. If that's the case, by all means throttle it!

    Arguably harder than the actual sending is bounce processing. In order to not be blocked by ISPs you have to process bounce logs from your SMTP server and unsubscribe bad addresses. You also have to parse bounces that arrive via email to your Return-Path. Both involve tricky heuristics and very little standardization.

    -sam

      Oh man, you sinner!

      You talk of reasonable systems, and in the same breath you mention qmail! And you do not speak of Postfix.

      In my mind, the main problem is not going to be the formatting of the messages, nor their queuing on the outbound exchanger.

      To pump out a million or more messages, statistically speaking, all your outbound sockets will quickly be tied up on remote hosts that are out to lunch, overloaded and generally taking too much time to do nothing. This will kill your outbound delivery.

      What you (the OP) need to do is to set up two machines (or clusters). The first one attempts to deliver the mail at top speed. It performs one try, and one try only, with a very short timeout, 5-10 seconds max. If it fails, it hands the message off to a fallback relay that is configured to be much more lenient and patient at speaking to slow and/or broken exchangers.

      This way your primary outbound exchanger is always available to send out messages, and doesn't get its deferred outbound queue clogged up with messages that might take minutes to deal with. Instead the fallback relay deals with the problem servers.

      It helps if you NAT stuff out through the same IP address, so that greylisting servers don't consider the transaction as a new tuple. If you can't, then you'll need a third level of fallback relay. The second level makes a couple of attempts to deliver mail on the assumption that the remote server is spitting out transient errors due to greylisting, and only after it gives up does it finally divert the message to a third-level fallback relay.

      And naturally, this is simple to configure with Postfix :)

      • another intruder with the mooring in the heart of the Perl

        Yeah, I probably shouldn't have included qmail in that list. I've never used it to do 1m+ delivery runs. So far I've only seen Ecelerity and Ironport do that. I included qmail only because I'm pretty sure you couldn't kill it by sending it a ton of mail to send out.

        Neither Ecelerity nor Ironport needs two machines to deal with slow receivers. You can set up priority levels on different queues if you want to make sure one large mailing can't stop later mails from going out, but the defaults seem to do a pretty good job of that for both, in my experience.

        -sam

Re: Applying the brakes
by dwm042 (Priest) on May 08, 2008 at 17:49 UTC
    I know very little about the mechanics of sending large amounts of mail, but as a former email admin for a small telecommunications company, I will say that your biggest issue will be to avoid being blacklisted. The problem with blacklisting is if you get on a list, it can be very difficult to get off (SORBS is a holy pain).

    Our company provided outbound relays for our customers. Most of our customers were small offices, and hardly anyone sent more than a few dozen mails in an hour. In my case if an IP started using our relays to send much more than 5-10,000 emails an hour, and if the owners of the IP had not contacted us in advance to obtain an exception, they would get shut off automatically.

    Take home: how you send your mail is at least as important as sending the mail. The care you take to answer all the obvious issues that concern blacklisters will affect your ability to deliver. Blacklisters can be very specific about TTLs on your mail server DNS records, about using a static address, about having a reverse address.

    Finally, it isn't just you that could get you blacklisted. If the SORBS people decide to shut down the class C you are on, you're going to have to wait until the owner of the class C gets it fixed.
Re: Applying the brakes
by sundialsvc4 (Abbot) on May 08, 2008 at 16:04 UTC

    It sounds to me like your admins are right. You see, even a single thread can “outrun the mail-server.” Therefore, there's no advantage to having more than one thread or process doing that, plus it's considerably simpler to manage.

    If the mail-server is such that it can simply be given an enormous queue of stuff to chomp-through, then just give it the enormous queue all at once and let it chomp-through as it will. If there's an issue of choking-out other mail that needs to go out the same tube, any sort of “simple throttle” should do ... in one process ('cuz you just don't need more).