in reply to Summarizing Mail Activity (Long Tail Problem)

I'd probably solve it like this:


Where do you want *them* to go today?
  • Comment on Re: Summarizing Mail Activity (Long Tail Problem)

Replies are listed 'Best First'.
Re^2: Summarizing Mail Activity (Long Tail Problem) - lookup table
by varian (Chaplain) on Mar 24, 2007 at 07:31 UTC
    As an extension to the excellent approach mentioned by thezip and bobf I would recommend a slightly adapted aggregate table setup:

    1: table VOLUMES with columns (date,id,size), index on (date)
    2: table EMAILS with columns (id,emailaddress), index on (id)

    The lookup table will reduce space requirements because you don't repeat the long email addresses for each day in the VOLUMES table.
    Indexes take up space as well, from your note it appears that queries will be done by date/period only so the VOLUMES table index only needs indexed on the date to give you the id's for that day.

    You mentioned the need to keep track of total email volume on a given day, this can be entered in the VOLUMES table as an extra entry with id=0, add corresponding entry in the EMAILS table with emailaddress='all'.

      A slight modification to varian's adaptation would be to allow multiple email addresses to map to a single ID. That way, if/when people's email addresses change (marriage, company is bought out, ...) you can still tie their email history together. If ya want....

      --roboticus

Re^2: Summarizing Mail Activity (Long Tail Problem)
by bobf (Monsignor) on Mar 24, 2007 at 04:12 UTC

    If an entire day is processed at once and only the top 10 counts are required, then your approach could be optimized further if you were to

    • store only the top 10 results from each day in the aggregates table.

    An obvious limitation to this restriction is that if the number of results per day increased (to the top 20, for example), then the processing script would have to be re-run. The OP didn't indicate if that was a possibility. This risk could be mitigated by storing data for the top 10%, the top 50, etc, and it would still represent a significant savings in storage space compared to storing all of the summary data.

    Update: I may have misunderstood the requirements, as thezip mentions, below.

      Yes, but I understood a requirement to be able to display *all* historic data for "Yesterday's Top 10", regardless of if they were Top 10 for any previous day.

      Where do you want *them* to go today?