
Summarizing Mail Activity (Long Tail Problem)

by eskwayrd (Acolyte)
on Mar 24, 2007 at 02:16 UTC

eskwayrd has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I'm not new to perl, but I'm trying to solve a problem with perl. :-)

Can anyone suggest a strategy for showing a 'top 10' graph over time, where the items on the top 10 list vary dramatically?

I've been asked to plot the top 10 senders of messages through our mail server by date; we've got a database of message headers going back almost 8 years and comprising more than 30 million records. I need to determine the top 10 senders for the most recent full day (i.e., 'yesterday'), and plot each sender's mail volume as a percentage of each day's total volume across the entire time period.

The brute-force approach takes more than an hour to complete, and I've been asked not to hog the database server so much.
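For concreteness, the brute-force version is essentially a per-day ranking query run over the whole table, something along these lines (the schema, DSN, and MySQL syntax here are only illustrative, not the actual setup):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Illustrative only: assumes a table message_headers(sender, sent_date).
    my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
        { RaiseError => 1 } );

    # Yesterday's top 10 senders.
    my $top10 = $dbh->selectall_arrayref( q{
        SELECT sender, COUNT(*) AS messages
        FROM   message_headers
        WHERE  sent_date = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
        GROUP  BY sender
        ORDER  BY messages DESC
        LIMIT  10
    } );

    printf "%-40s %8d\n", @$_ for @$top10;

Repeating that ranking, and then the per-sender daily percentages, across 30 million rows is presumably what eats the hour.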

I've thought of (and tried) making an aggregation table that I could update daily, but the list of top 10 users is fairly chaotic and changes significantly from day to day (especially on weekends). We have almost 1.5 million distinct sender addresses, so there's a substantial long tail, and the majority of sender addresses will never appear in a top 10 list. I've already tried aggregating counts for all senders for all dates, but that table ended up significantly larger than the original, in part because I'd recorded 0s for addresses that hadn't sent messages that day.

I'm stumped trying to figure out a strategy that will keep an aggregate table down to a reasonable size, while minimizing the amount of querying required to create records for new addresses with enough mail volume to make the top 10 list for a given date.

Thoughts or suggestions greatly appreciated.

Replies are listed 'Best First'.
Re: Summarizing Mail Activity (Long Tail Problem)
by thezip (Vicar) on Mar 24, 2007 at 03:48 UTC

    I'd probably solve it like this:

    • Export all of the headers into a headersfile on an offline machine

    • Run a script on the offline machine that processes only one day's headers at a time. When a day has been completed, it should be removed from the headersfile and the aggregate counts for each email address should be inserted into the database Aggregates table.

    • The Aggregates table will have three columns:

        email_address, date, rawcount

    • email_address and date comprise the composite key for the Aggregates table

    • Since a record is not generated if a user hasn't sent email that day, superfluous records won't be inserted into the Aggregates table

    • This script can be run on the offline machine in a controlled and limited way until all dates have been processed

    • From then on, only the current day needs to be processed/aggregated

    • You should then be able to efficiently query reports from the Aggregates table (a sketch of the processing script follows)
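    A minimal sketch of such a script, assuming the headersfile has been reduced to sorted "YYYY-MM-DD address" lines (the file layout, DSN, and MySQL syntax are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
            { RaiseError => 1, AutoCommit => 0 } );
        my $ins = $dbh->prepare(
            q{INSERT INTO Aggregates (email_address, date, rawcount)
              VALUES (?, ?, ?)} );

        # Process one day's headers at a time; assumes input sorted by date.
        my ( $day, %count );
        while ( my $line = <> ) {
            my ( $date, $sender ) = split ' ', $line;
            if ( defined $day and $date ne $day ) {
                flush_day( $day, \%count );
                %count = ();
            }
            $day = $date;
            $count{$sender}++;
        }
        flush_day( $day, \%count ) if defined $day;

        sub flush_day {
            my ( $date, $counts ) = @_;
            $ins->execute( $_, $date, $counts->{$_} ) for keys %$counts;
            $dbh->commit;    # one transaction per day keeps the load modest
        }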

    Where do you want *them* to go today?
      As an extension to the excellent approach mentioned by thezip and bobf, I would recommend a slightly adapted aggregate table setup:

      1: table VOLUMES with columns (date,id,size), index on (date)
      2: table EMAILS with columns (id,emailaddress), index on (id)

      The lookup table will reduce space requirements because you don't repeat the long email addresses for each day in the VOLUMES table.
      Indexes take up space as well; from your note it appears that queries will be done by date/period only, so the VOLUMES table only needs an index on the date to give you the ids for that day.

      You mentioned the need to keep track of total email volume on a given day; this can be entered in the VOLUMES table as an extra entry with id=0, with a corresponding entry in the EMAILS table where emailaddress='all'.
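      In DDL form, with the matching top-10 query (MySQL syntax, and everything beyond the columns listed above, are assumptions):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
              { RaiseError => 1 } );

          $dbh->do($_) for (
              q{CREATE TABLE EMAILS  (id INT PRIMARY KEY,
                                      emailaddress VARCHAR(255))},
              q{CREATE TABLE VOLUMES (date DATE, id INT, size INT,
                                      INDEX (date))},
          );

          # Top 10 senders for one day; id=0 is the 'all' total row, so skip it.
          my $top10 = $dbh->selectall_arrayref( q{
              SELECT e.emailaddress, v.size
              FROM   VOLUMES v JOIN EMAILS e ON e.id = v.id
              WHERE  v.date = ? AND v.id <> 0
              ORDER  BY v.size DESC
              LIMIT  10
          }, undef, '2007-03-23' );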

        A slight modification to varian's adaptation would be to allow multiple email addresses to map to a single ID. That way, if/when people's email addresses change (marriage, company is bought out, ...) you can still tie their email history together. If ya want....

        --roboticus

      If an entire day is processed at once and only the top 10 counts are required, then your approach could be optimized further if you were to

      • store only the top 10 results from each day in the aggregates table.

      An obvious limitation of this restriction is that if the number of results per day increased (to the top 20, for example), the processing script would have to be re-run. The OP didn't indicate whether that was a possibility. The risk could be mitigated by storing data for the top 10%, the top 50, etc., which would still represent a significant saving in storage space compared to storing all of the summary data.
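      As a sketch, the prune step could copy just the day's winners into a slim table (the table and column names follow thezip's reply above; MySQL syntax is assumed):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
              { RaiseError => 1 } );

          # Keep only the top 10 rows for the day just aggregated.
          my $day = '2007-03-23';
          $dbh->do( q{
              INSERT INTO TopAggregates (email_address, date, rawcount)
              SELECT email_address, date, rawcount
              FROM   Aggregates
              WHERE  date = ?
              ORDER  BY rawcount DESC
              LIMIT  10
          }, undef, $day );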

      Update: I may have misunderstood the requirements, as thezip mentions, below.

        Yes, but I understood a requirement to be able to display *all* historic data for "Yesterday's Top 10", regardless of if they were Top 10 for any previous day.

        Where do you want *them* to go today?
Re: Summarizing Mail Activity (Long Tail Problem)
by Anno (Deacon) on Mar 24, 2007 at 13:27 UTC
    One data structure that can efficiently keep the top n of a large set of cases is a heap, specifically a minimum heap. You insert cases into the heap until it reaches size n; after each subsequent insertion, you delete the minimum. After all cases have been processed, the heap contains the top n, which can be extracted lowest to highest. The heap never grows larger than n+1 elements.

    There are a few modules on CPAN that implement heaps. If you need persistence (I'm not sure whether your problem requires that), Storable will probably be able to add it.
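    For illustration, a pure-Perl sketch of the technique, with an array kept in min-heap order (a CPAN heap module would replace the two helpers):

        use strict;
        use warnings;

        # Keep the top $n [count, sender] pairs with a minimum heap:
        # push each case, and once the heap exceeds $n, drop the minimum.
        sub top_n {
            my ( $n, %count ) = @_;
            my @heap;    # array-based binary min-heap, ordered on count
            for my $sender ( keys %count ) {
                heap_push( \@heap, [ $count{$sender}, $sender ] );
                heap_pop( \@heap ) if @heap > $n;
            }
            # Extract lowest to highest; unshift yields a top-down list.
            my @out;
            unshift @out, heap_pop( \@heap ) while @heap;
            return @out;
        }

        sub heap_push {    # sift the new element up toward the root
            my ( $h, $item ) = @_;
            push @$h, $item;
            my $i = $#$h;
            while ( $i and $h->[$i][0] < $h->[ ( $i - 1 ) >> 1 ][0] ) {
                @$h[ $i, ( $i - 1 ) >> 1 ] = @$h[ ( $i - 1 ) >> 1, $i ];
                $i = ( $i - 1 ) >> 1;
            }
        }

        sub heap_pop {     # remove and return the minimum element
            my ($h) = @_;
            my $min  = $h->[0];
            my $last = pop @$h;
            if (@$h) {
                $h->[0] = $last;
                my $i = 0;
                while (1) {
                    my ( $l, $r ) = ( 2 * $i + 1, 2 * $i + 2 );
                    my $s = $i;
                    $s = $l if $l < @$h and $h->[$l][0] < $h->[$s][0];
                    $s = $r if $r < @$h and $h->[$r][0] < $h->[$s][0];
                    last if $s == $i;
                    @$h[ $i, $s ] = @$h[ $s, $i ];
                    $i = $s;
                }
            }
            return $min;
        }

        # Example: top 3 of a day's counts.
        my %count = ( 'a@x.com' => 5, 'b@x.com' => 17,
                      'c@x.com' => 2, 'd@x.com' => 9 );
        printf "%s %d\n", $_->[1], $_->[0] for top_n( 3, %count );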

    Anno

Re: Summarizing Mail Activity (Long Tail Problem)
by eric256 (Parson) on Mar 24, 2007 at 14:51 UTC

    What database are you using? It seems like just some SQL tweaking and indexing may be in order. If you are using something like Oracle, you could define a materialized view that updates as records are committed to the main database. That would automatically calculate the aggregates as data is entered.

    If you defined the view as something like select sender_id, to_char(send_date, 'dd-mm-yyyy') send_date, count(1) messages from email_table group by sender_id, to_char(send_date, 'dd-mm-yyyy'), it would aggregate the data by day, and depending on how you specify the refresh it should be pretty easy on the database, especially if you index on the formatted date.
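    A sketch of that in Oracle DDL, issued via DBI (the view name and refresh options are guesses; fast refresh on commit also needs a materialized view log on the base table):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:Oracle:maildb', 'user', 'secret',
            { RaiseError => 1 } );

        # Required for REFRESH FAST ON COMMIT of an aggregate view.
        $dbh->do(q{
            CREATE MATERIALIZED VIEW LOG ON email_table
            WITH SEQUENCE, ROWID (sender_id, send_date)
            INCLUDING NEW VALUES
        });

        # TRUNC(send_date) truncates to midnight; the to_char() form
        # above would group the same way.
        $dbh->do(q{
            CREATE MATERIALIZED VIEW daily_sender_counts
            BUILD IMMEDIATE
            REFRESH FAST ON COMMIT
            AS SELECT sender_id,
                      TRUNC(send_date) AS send_day,
                      COUNT(*)         AS messages
               FROM   email_table
               GROUP  BY sender_id, TRUNC(send_date)
        });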


    ___________
    Eric Hodges
Re: Summarizing Mail Activity (Long Tail Problem)
by OfficeLinebacker (Chaplain) on Mar 25, 2007 at 02:43 UTC

    First off, how big is the database? Our DBs tend to be a few gigs in size. If I had to cull from a production database, I would look for a query that misses nothing, still reduces the storage needed to represent the information I want (or at least fits where I have the space), and takes a reasonable time to run, in that order of importance. Jobs that would take forever I would run overnight, over a weekend, or sometimes on special dedicated machines if they needed to be done ASAP.

    So anyway, I think a table with three fields--date, sender address, and number of sends--would be a good next step for you. There would be no records with a zero in the last field, if you know what I mean. Its row count would be the number of days in eight years--2920, let's call it 3000--times the average number of unique senders per day--much less than 1.5 million, I hope!

    Since you said you created an even bigger table once, I presume you have the space. And this table is only temporary--you can create it with a single pass through the production database, and subsequent passes can then be made against this table. So it only has to exist long enough for the next step or two.

    The other thing to do is to work backwards from your desired final result. To find out what percentage of a day's mail a sender was responsible for, you need to know the total volume for that day. That number could be represented with a special code in the first-step table described above (with a keyword like 'total' in place of the email address; ordering in the db and/or a check for the absence of @ might serve as markers) or in a separate table. We're talking about 3000 records, with two fields if it's a separate table, three if it's part of the first.

    The final table will have at least 10 rows per day, or around 30,000 rows, and at least two fields (date and percentage--I presume actual addresses don't show up on the graph?).

    So if the intermediate dataset is sorted by ascending date and then by descending number of emails, all it takes is one more pass through it to calculate the percentages, and, depending on how smart the database software is, to do so for only the top ten of each day (based on the previous sort order). It may be worth pulling the data in one step, sorting it in another, and creating the final table in a third.

    I should note at this point that it may have been worth extracting the daily totals in the previous step, though I don't know how much longer that would have taken. It might be worth either adding 3000 records to the interim table with a sender code of "daily summary" or something--just anything that makes them easy to distinguish--or creating a separate summary table with two columns, date and total number of sends. I don't know.
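    A sketch of that final pass, assuming the intermediate data has been dumped as sorted "date sender sends" lines with the daily total stored under a sender code like 'total' (all of those layout choices are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Sorted by date ascending then sends descending, the 'total'
        # row is the day's maximum, so it arrives first for each date.
        my ( %total, %seen );
        while ( my $line = <> ) {
            my ( $date, $sender, $sends ) = split ' ', $line;
            if ( $sender eq 'total' ) {
                $total{$date} = $sends;
                next;
            }
            next if ++$seen{$date} > 10;    # top ten per day only
            printf "%s %-30s %6.2f%%\n", $date, $sender,
                100 * $sends / $total{$date};
        }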

    Sounds similar in scale to problems I have solved, only in our company we are lucky to have the data in SAS format and have lots of SAS licenses.


    I like computer programming because it's like Legos for the mind.
Re: Summarizing Mail Activity (Long Tail Problem)
by mojotoad (Monsignor) on Mar 25, 2007 at 23:26 UTC
    You might want to consider using a round robin database, which is great for aggregating 'buckets' of data at varying degrees of resolution; it would be up to you to specify the minimal bucket size and how long it takes buckets to coalesce into larger ones.

    RRDtool is a well-known example. It provides Perl bindings, as well as ones for Ruby and Python.
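    Creating and feeding a daily-resolution RRD from Perl looks roughly like this (the step, DS, and RRA choices are illustrative only):

        use strict;
        use warnings;
        use RRDs;

        # One GAUGE data source, one daily average kept for ~10 years.
        RRDs::create( 'mailvolume.rrd',
            '--step', 86400,
            'DS:messages:GAUGE:172800:0:U',
            'RRA:AVERAGE:0.5:1:3650',
        );
        die 'RRDs create: ' . RRDs::error if RRDs::error;

        # Record a day's message count (timestamp:value).
        RRDs::update( 'mailvolume.rrd', time() . ':12345' );
        die 'RRDs update: ' . RRDs::error if RRDs::error;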

    Cheers,
    Matt

    P.S. The RRDtool site provides a nice tutorial for beginners if you haven't messed with RRDs before.

Re: Summarizing Mail Activity (Long Tail Problem)
by eskwayrd (Acolyte) on Mar 29, 2007 at 22:18 UTC
    Thanks all for the input.

    Obviously, I was doing something stupid with my original aggregation attempt, and the suggestions for creating an optimized schema are spot-on.

    Unfortunately, while interesting, I cannot use a Heap or an RRD; both would ultimately remove data that might need to be included in the graph at some point. Those would probably be ideal solutions if my requirements were different.
