in reply to Summarizing Mail Activity (Long Tail Problem)

First off, how big is the database? Our DBs tend to be a few gigs in size. If necessary to cull from a production database I would think of a query that would not miss, yet still reduce the storage needed to represent the information I want (or at least fit somewhere where I have the space), and take a reasonable time to run, in that order of importance. In the case of jobs that would take forever, I would either try to run them overnight, over a weekend, or sometimes on special dedicated machines if it needed to be done ASAP.

So anyway, I think that if you had a database that had three fields: date, sender address, and number of sends, that would be a good next step for you. There would be no records with a zero in the last field, if you know what I mean. So that would have as the number of rows the number of days in eight years--2920, let's call it 3000, times the average number of unique senders per day--much less than 1.5 million, I hope!

Since you said you created an even bigger table once, I presume you have the space. And this table is only temporary--you can create said table with only one pass through the production database and subsequent passes can then be made on this table. So this table only has to exist for long enough for the next step or two.

The other thing to do is to work backwards from your desired final result. To find out what percentage of a day's worth of mail for which a sender was responsible, you need to know the total volume that day. That number could be represented with a special code in the first-step database described above (with a keyword like 'total' in place of the email address--ordering in the db and/or a check for absence of @ might serve as markers)or in another table. We're talking about 3000 records, 2 observations if it's a separate table, three if it's part of the first.

The final database will have at least 10 rows per day or around 30,000 rows and at least two fields (date and percentage--I presume actual addresses don't show up on the graph?).

So if the intermediate dataset above is sorted by ascending date and then by descending number of emails, all it takes is one more pass through the intermediate database, and depending on how smart the database software is, calculate the percentage for only the top ten of each one (based on the previous sort order). It may be worth it to pull the db in one step, sort it in another, then create the final in a third.

So I guess as this point I have to note that it may have been worth it to extract daily totals in the previous step, but I don't know how much longer that would have taken; it might be worth it to either add 3000 records to the interim database, with a sender code of "daily summary" or something--just anything that will make it easy to distinguish. Or it may be worth it to create a separate summary table with two columns, date and total number of sends. I don't know.

Sounds similar in scale to problems I have solved, only in our company we are lucky to have the data in SAS format and have lots of SAS licenses.


I like computer programming because it's like Legos for the mind.
  • Comment on Re: Summarizing Mail Activity (Long Tail Problem)