bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I have a process that looks through a growing list of data every couple of minutes. The process starts, runs, and stops each time rather than running continuously.

When new "events" are seen, an email is sent. The issue is that under most normal circumstances the number of events is in the 1-5 range per hour, but there are times when this can grow to several thousand per hour. I would like to send only one email per hour, no matter how many "events" there are.

I need to threshold this but am not sure which techniques would work.

Has anyone done any work on this kind of thing who could provide a nudge in the right direction?

Re: Email Thresholding
by mr_mischief (Monsignor) on Apr 02, 2015 at 16:58 UTC

    There are a couple of simple ways to go about this. There are plenty of complicated ways. Sometimes you can get 80% or 90% of what you want with a lot less effort if you just barely tweak your spec.

    If you're okay with two emails in some cases rather than one, it's easy to get to a point where that edge case is simply allowed to happen. Instead of worrying about "the past hour", work with "this current clock hour": "one per hour" then means per clock hour rather than a sliding sixty-minute window. Append events to an events log as they happen, and open a new log every hour, perhaps with a filename format of yyyy-mm-dd-hh.log. If the log already exists before you've opened it to write out this event, your mail for that hour should already have been sent, so don't send mail in that case. Feel free to go ahead and append the event, though, as this gives you a great diagnostic tool.

    The oldest logs can be cleaned up weekly or monthly for space concerns, although don't underestimate the power of gzip on text files. This method might get you an email at 00:59 and another at 01:02, but you won't get a third until 02:00.
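
    A minimal sketch of the per-clock-hour idea (the file naming, $event, and send_alert_email() below are made up for illustration):

        use strict;
        use warnings;
        use POSIX qw(strftime);

        sub send_alert_email { warn "would send: @_\n" }      # stand-in for your real mailer

        my $event = 'description of the event';               # whatever this run detected
        my $log   = strftime('%Y-%m-%d-%H', localtime) . '.log';

        # If this hour's log already exists, the hour's email has already gone out.
        my $already_mailed = -e $log;

        open my $fh, '>>', $log or die "Can't append to $log: $!";
        print {$fh} scalar(localtime), " $event\n";           # always record the event
        close $fh;

        send_alert_email($event) unless $already_mailed;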

    Another option would be to append to an ongoing log and use something like logrotate to manage the size. A stat of the file for mtime could let you know if it's been written to in the past hour.

    If you're already using a database, putting this information in a table with timestamps may be appropriate.

    I'm curious about your prior assumptions. This sounds like you're sending an email per event when running 30 times per hour as it is. Why not collect all those events into a single email per run as a starting point? That gets you to a maximum of 30 emails per hour right off the bat, and gets all the alert information for the past two minutes into the initial contact.

    You could gather all the information for this two-minute window into one text file, and combine that with the per-clock-hour stuff from above, appending or attaching all those per-run files for the hour to the one email.

    This is really more of a problem space discussion than a Perl discussion. If you have some code you need help tweaking to any of the recommendations in the thread, show us what you have so far and I'm sure many of us would be happy to help.

      Thanks for your thoughts. I do have some Perl code that I can include once I can get to it.

      Agreed that this was intended as more of a problem-space question, but I am after a Perl implementation since my code is in Perl.

      As to the specifics, my code runs every 3 minutes and checks the last 5 minutes worth of logs from a DB. Every matching event is then logged to a different table in the DB. Once all events are gathered, the match table is then run through line by line to generate the emails.

      I know, rather lazy, as I did not think this would blow up and spam me or my engineers. But now that it has, here we are.

      With the input gathered so far, I have a couple of thoughts:

      1. run through the match table at the beginning of the script and store the last matches in a hash for easy lookup later
      2. aggregate the match table query for the loop to get a count of each match in the period, rather than pulling every row
      3. for each match in the loop, check the hash to see how long it has been since the last detection; if it has been more than an hour, send the email (rough sketch below)
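
      Something along these lines is what I have in mind; a rough sketch only, using DBI with SQLite-flavoured SQL, made-up table/column names (matches(match_key, event_time) and email_log(match_key, sent_at) with epoch seconds), and send_alert_email() standing in for my real mailer:

          use strict;
          use warnings;
          use DBI;

          sub send_alert_email { warn "would send: @_\n" }    # stand-in for the real mailer

          my $dbh = DBI->connect('dbi:SQLite:dbname=events.db', '', '', { RaiseError => 1 });

          # 1. last time an email went out for each match type
          my %last_sent = map { @$_ } @{ $dbh->selectall_arrayref(
              'SELECT match_key, MAX(sent_at) FROM email_log GROUP BY match_key') };

          # 2. aggregate counts per match type for the current window
          my $counts = $dbh->selectall_arrayref(
              q{SELECT match_key, COUNT(*) FROM matches
                WHERE event_time >= datetime('now','-5 minutes') GROUP BY match_key});

          # 3. email only the types not emailed about in the past hour, then record the send
          for my $row (@$counts) {
              my ($key, $n) = @$row;
              next if $last_sent{$key} && $last_sent{$key} >= time() - 3600;
              send_alert_email($key, $n);
              $dbh->do('INSERT INTO email_log (match_key, sent_at) VALUES (?, ?)',
                       undef, $key, time());
          }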

      Any additional thoughts?

        I would query the database with the time constraint of the last 60 minutes. If you're not timestamping your entries with a native DB timestamp, start doing that.

        I would consider how many varieties of alert I could have, and if that's three or four, I'd limit each type to one per hour rather than one overall.

        For auditability you're going to want a record of the emails being sent anyway. Have a table where you record the email being sent. Select any sent for your class of alert (or for all if you go that route) from the last hour, by timestamp. If there are none, aggregate all the events from the last hour which you selected above, send an email, and insert your row into the email_sent table.
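
        Roughly, something like this (table/column names invented, SQLite-style date functions, and a stand-in mailer, just for illustration):

            use strict;
            use warnings;
            use DBI;

            sub send_alert_email { warn "would send: @_\n" }   # stand-in for the real mailer

            my $dbh   = DBI->connect('dbi:SQLite:dbname=alerts.db', '', '', { RaiseError => 1 });
            my $class = 'disk_full';    # whichever alert type is being processed

            # Was anything for this class sent in the past hour?
            my ($already) = $dbh->selectrow_array(
                q{SELECT COUNT(*) FROM email_sent
                  WHERE alert_class = ? AND sent_at >= datetime('now','-1 hour')},
                undef, $class);

            unless ($already) {
                # Aggregate the hour's events, send one summary, and record the send.
                my $events = $dbh->selectall_arrayref(
                    q{SELECT event_time, detail FROM events
                      WHERE alert_class = ? AND event_time >= datetime('now','-1 hour')},
                    undef, $class);
                send_alert_email($class, $events);
                $dbh->do(q{INSERT INTO email_sent (alert_class, sent_at)
                           VALUES (?, datetime('now'))}, undef, $class);
            }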

        The more we discuss this, the more it sounds like Nagios, Mon, Argus, Big Brother, Tripwire, or some other monitoring/IDS solution. You might be able to make a plugin to one of those or at least look to them for how to solve these issues.

Re: Email Thresholding
by GotToBTru (Prior) on Apr 02, 2015 at 15:04 UTC

    Store the last time an email was sent as the create/modification date of a file (or as data within the file). Check the time elapsed every time you want to send another email.

    if ($condition_requires_email) {
        if (-e 'last.email' && -M 'last.email' < 1.0/24.0) {
            print "Don't need to send email yet\n";
        } else {
            &send_email();
            system('touch', 'last.email');
        }
    }

    Update: corrected logic error pointed out by jhourcle.

    Dum Spiro Spero

      Although the log-checking process would work, this one is likely the fastest solution for most situations.

      ... although I'd move the $condition_requires_email to an outer if block that wraps the whole thing ... otherwise, your logic would send e-mail when (!$condition_requires_email)

      I don't know what type of NRT data bfdi533 is dealing with, but with the space weather data that I deal with, getting the alerts out without a potential 1 hr delay is quite important. (and contractually required, according to the mission's PDMP (Project Data Management Plan)).

Re: Email Thresholding
by bitingduck (Deacon) on Apr 02, 2015 at 15:00 UTC

    If it starts and stops every few minutes (e.g. as a cron job) rather than running in the background, the easiest way to keep track is probably to write a file somewhere that contains whatever data you care to keep track of. If you just want to know when you last sent an email, you might be able to use the last-modified date of the file (and just "touch" it each time you send an email). If you want to store more information, you can store the time and any diagnostic information (e.g. the last event time). Be sure to pay attention to what the running environment is: background jobs often run in a different environment than that of the user who created them, and you need to specify full pathnames for files.

Re: Email Thresholding
by choroba (Cardinal) on Apr 02, 2015 at 15:15 UTC
    Would it be possible to run the process once per hour, instead of every couple of minutes?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      The OP doesn't make it clear, but I could imagine something where most of the time you get no events, but when one does occur you want the email sent right away (hence the check every few minutes); sometimes, though, things go really haywire and the script turns itself into a spammer when the first event would have been sufficient to get the on-call person to fix things (which presumably takes less than an hour in most cases...). If you're the person on the receiving end of the emails, then it probably becomes important to learn enough Perl to fix the spam problem. Pure speculation, however.

        That is exactly right. There are times when there are no emails and the first event of the hour is the only email needed. But the times when it gets very noisy, the email "spam" is a problem. Last night as an example we received 23000+ emails, hence the need for thresholding.

      That is a good thought but this is a near-real time detection system and has to be run every couple of minutes. It is just the email that has to be throttled to every hour for bursty periods.

        I concur that the main element of your solution is as noted above:

        • On each invocation, unconditionally append to a file all the stuff you will want to send on the next E-Mail update.
        • At the end of each invocation, determine if E-Mail needs to be sent (timer expired, special event, Mayan Calendar turns a new page stone, etc.).
        • Empty out the E-mail holder file upon sending (a rough sketch follows).
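
        A minimal sketch of that pattern (the file names, gather_events(), and the mailer below are invented for illustration):

            use strict;
            use warnings;

            sub gather_events        { return () }                   # stand-in: this run's findings
            sub send_email_from_file { warn "would send $_[0]\n" }   # stand-in for the mailer

            my $holder = '/var/tmp/pending_alerts.txt';    # accumulates events between sends
            my $marker = '/var/tmp/last_email_sent';       # its mtime records the last send

            # 1. Always append this run's findings to the holder file.
            my @new_events = gather_events();
            open my $out, '>>', $holder or die "Can't append to $holder: $!";
            print {$out} scalar(localtime), " $_\n" for @new_events;
            close $out;

            # 2. Send only if something is pending and no email went out in the past hour.
            if (-s $holder && (!-e $marker || -M $marker > 1/24)) {
                send_email_from_file($holder);
                open my $trunc, '>', $holder or die $!;    # 3. empty the holder after sending
                close $trunc;
                open my $touch, '>', $marker or die $!;    # refresh the marker's mtime
                close $touch;
            }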

        This may present a serious design problem. Let's say Event A happens, so your script runs and sends out an email about it, then sets a flag somewhere to say "Don't send any more emails for one hour." Five minutes later, Event B happens. Now people won't get notified about Event B until at least 55 minutes later when the next email is allowed.

        That may not be acceptable in a "near-real time detection system"; and if it is acceptable, then it should be acceptable to run the script hourly. Either way, you're going to have notifications up to 1 hour old.

        Aaron B.
        Available for small or large Perl jobs and *nix system administration; see my home node.

        What is it doing other than sending email? If you want one email per hour but something else done more frequently: 1) run the notifier on a cron job once per hour, and 2) separately run the other task more frequently (possibly changing the interval on the fly if load ramps up); see the crontab sketch below.
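
        For example, a crontab along these lines (script names and paths are placeholders):

            # detection/logging every 3 minutes, the digest email once per hour
            */3 * * * *  /usr/local/bin/check_events.pl
            0   * * * *  /usr/local/bin/send_event_digest.pl
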
Re: Email Thresholding
by flexvault (Monsignor) on Apr 03, 2015 at 14:37 UTC

    bfdi533,

      "...my code runs every 3 minutes and checks the last 5 minutes worth of logs from a DB. Every matching event is then logged to a different table in the DB..."

    I'm guessing that you read the log files to get at the time stamps each time your script is called. I suggest you search on "Perl 'pop before smtp'" to get some examples and ideas on how to query logs for specific information; specifically, look at the 'read', 'tell', and 'seek' functions in Perl (a rough sketch follows). Also, think about what happens if the error is "disk is full" and you can't log to the DB, or there's not enough room in "/tmp" to build your email, or...
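
    A rough sketch of that incremental-read idea (the log path, state file, and process_line() are invented; a rotated or truncated log would need an extra size check that is omitted here):

        use strict;
        use warnings;

        sub process_line { }                       # stand-in for your matching logic

        my $log   = '/var/log/app/events.log';
        my $state = '/var/tmp/events.offset';      # remembers how far we read last time

        my $offset = 0;
        if (open my $in, '<', $state) {
            chomp($offset = <$in> // 0);
            close $in;
        }

        open my $fh, '<', $log or die "Can't read $log: $!";
        seek $fh, $offset, 0;                      # skip everything already processed
        process_line($_) while <$fh>;
        my $new_offset = tell $fh;                 # where to resume next run
        close $fh;

        open my $out, '>', $state or die "Can't write $state: $!";
        print {$out} "$new_offset\n";
        close $out;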

    This is not a new or unique problem, so build on the shoulders of those that went before you.

    Good Luck...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Email Thresholding
by jhourcle (Prior) on Apr 05, 2015 at 06:05 UTC

    I've done work on NRT data, but I'm not the one who's responsible for generating alerts on the matter.

    A few things that you might want to consider:

    • I don't know how the entries get into your database, but rather than query against a growing table, you may be able to set up a trigger that logs to a second table; every 3 minutes you can grab everything from that table and clear it in the same transaction (see the sketch after this list).
    • Depending on the database, the trigger itself might be able to generate e-mail or send some other signal for the e-mail to go out. (With Postgres that would be PL/Perl, though you have to enable the untrusted variant, plperlu, before a trigger function can reach outside the database.) But if you can get it to export anything to an external file, you can have a daemon sit and poll for that file so you don't beat on the database.
    • The astronomy community has the Gamma-ray Coordinates Network to distribute event information in (near?) real time.
    • In Earth science, there's the Volcano Notification Service. I suspect that there would be other warning systems for critical weather events that are more push-like. (all that I know about is the NWS, which is a pull)
    • Distributing NRT solar data just came up yesterday on the code4solar mailing list (because I brought it up), but that is data for re-analysis, so it lags by a good 15 minutes. The event detection info gets pumped into the Heliophysics Event Knowledgebase.
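
    On the Perl side, the grab-and-clear step might look roughly like this (the staging table new_events and its columns are invented, and the trigger filling it is assumed to already exist):

        use strict;
        use warnings;
        use DBI;

        sub process_events { }                       # stand-in for your matching/alerting code

        my $dbh = DBI->connect('dbi:Pg:dbname=events', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 0 });

        # Fetch the staged rows and delete exactly those rows in the same transaction,
        # so nothing inserted mid-run is lost.
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, event_time, detail FROM new_events ORDER BY id');
        if (@$rows) {
            my $max_id = $rows->[-1][0];
            $dbh->do('DELETE FROM new_events WHERE id <= ?', undef, $max_id);
        }
        $dbh->commit;

        process_events($rows);
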
Re: Email Thresholding
by sundialsvc4 (Abbot) on Apr 06, 2015 at 12:49 UTC

    Maybe you could use a database table (SQLite or otherwise ...) to accumulate a sort of “to-do list” for this daemon. This table would contain email_type, datetime_when_last_sent, and datetime_when_last_requested. Every few minutes, the process scans to look for emails that need to be sent and, using REPLACE INTO logic, inserts or updates them into the table with datetime_when_last_requested = NOW(). This is “the most-recent time when we determined that ‘this email needs to be sent soon.’” There is only one row for each email_type.

    Separately, the daemon queries this table to find emails that have been requested in the last hour but have not been sent during that same interval. It sends them (once ...) and updates datetime_when_last_sent = NOW(). The two activities are otherwise unrelated, although you will likely want to do both of them back-to-back.
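
    A rough sketch of that scheme (the column names follow the text above; everything else is invented, and since a literal REPLACE INTO in SQLite would wipe out datetime_when_last_sent, this uses SQLite's upsert, which needs SQLite 3.24+):

        use strict;
        use warnings;
        use DBI;

        sub send_alert_email { warn "would send: @_\n" }   # stand-in for the real mailer

        my $dbh = DBI->connect('dbi:SQLite:dbname=todo.db', '', '', { RaiseError => 1 });
        $dbh->do(q{CREATE TABLE IF NOT EXISTS email_todo (
                       email_type                   TEXT PRIMARY KEY,
                       datetime_when_last_sent      TEXT,
                       datetime_when_last_requested TEXT)});

        # Scanner side: note that this email type needs to go out soon (one row per type).
        sub request_email {
            my ($dbh, $type) = @_;
            $dbh->do(q{INSERT INTO email_todo (email_type, datetime_when_last_requested)
                       VALUES (?, datetime('now'))
                       ON CONFLICT(email_type) DO UPDATE
                       SET datetime_when_last_requested = excluded.datetime_when_last_requested},
                     undef, $type);
        }

        # Sender side: anything requested in the past hour but not sent in that same interval.
        sub send_pending {
            my ($dbh) = @_;
            my $pending = $dbh->selectall_arrayref(
                q{SELECT email_type FROM email_todo
                  WHERE datetime_when_last_requested >= datetime('now','-1 hour')
                    AND (datetime_when_last_sent IS NULL
                         OR datetime_when_last_sent < datetime('now','-1 hour'))});
            for my $row (@$pending) {
                send_alert_email($row->[0]);
                $dbh->do(q{UPDATE email_todo SET datetime_when_last_sent = datetime('now')
                           WHERE email_type = ?}, undef, $row->[0]);
            }
        }

        request_email($dbh, 'disk_full');   # example: the scanner noticed a problem
        send_pending($dbh);                 # example: the hourly pass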