
Summarizing Mail Activity (Long Tail Problem)

by eskwayrd (Acolyte)
on Mar 24, 2007 at 02:16 UTC

eskwayrd has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I'm not new to perl, but I'm trying to solve a problem with perl. :-)

Can anyone suggest a strategy for showing a 'top 10' graph over time, where the items on the top 10 list vary dramatically?

I've been asked to plot the top 10 senders of messages through our mail server by date; we've got a database of message headers going back almost 8 years and comprising more than 30 million records. I need to determine the top 10 senders for the most recent full day (i.e., 'yesterday'), and plot each sender's mail volume as a percentage of each day's total volume across the entire time period.

The brute-force approach takes more than an hour to complete, and I've been asked not to hog the database server so much.
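For concreteness, the brute-force version is essentially a per-day ranking query run over the whole table, something along these lines (the schema, DSN, and MySQL syntax here are only illustrative, not the actual setup):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Illustrative only: assumes a table message_headers(sender, sent_date).
    my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
        { RaiseError => 1 } );

    # Yesterday's top 10 senders.
    my $top10 = $dbh->selectall_arrayref( q{
        SELECT sender, COUNT(*) AS messages
        FROM   message_headers
        WHERE  sent_date = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
        GROUP  BY sender
        ORDER  BY messages DESC
        LIMIT  10
    } );

    printf "%-40s %8d\n", @$_ for @$top10;

Repeating that ranking, and then the per-sender daily percentages, across 30 million rows is presumably what eats the hour.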

I've thought of (and tried) making an aggregation table that I could update daily, but the list of top 10 users is fairly chaotic and changes significantly from day to day (especially on weekends). We have almost 1.5 million distinct sender addresses, so there's a substantial long tail, and the majority of sender addresses will never appear in a top 10 list. I've already tried aggregating counts for all senders for all dates, but that table ended up significantly larger than the original, in part because I'd recorded 0s for addresses that hadn't sent messages that day.

I'm stumped trying to figure out a strategy that will keep an aggregate table down to a reasonable size, while minimizing the amount of querying required to create records for new addresses with enough mail volume to make the top 10 list for a given date.

Thoughts or suggestions greatly appreciated.

Replies are listed 'Best First'.
Re: Summarizing Mail Activity (Long Tail Problem)
by thezip (Vicar) on Mar 24, 2007 at 03:48 UTC

    I'd probably solve it like this:

    • Export all of the headers into a headersfile on an offline machine

    • Run a script on the offline machine that processes only one day's headers at a time. When a day has been completed, it should be removed from the headersfile and the aggregate counts for each email address should be inserted into the database Aggregates table.

    • The Aggregates table will have three columns:

        email_address, date, rawcount

    • email_address and date comprise the composite key for the Aggregates table

    • Since a record is not generated if a user hasn't sent email that day, superfluous records won't be inserted into the Aggregates table

    • This script can be run on the offline machine in a controlled and limited way until all dates have been processed

    • From then on, only the current day needs to be processed/aggregated

    • You should then be able to efficiently query reports from the Aggregates table (a sketch of the processing script follows)
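    A minimal sketch of such a script, assuming the headersfile has been reduced to sorted "YYYY-MM-DD address" lines (the file layout, DSN, and MySQL syntax are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
            { RaiseError => 1, AutoCommit => 0 } );
        my $ins = $dbh->prepare(
            q{INSERT INTO Aggregates (email_address, date, rawcount)
              VALUES (?, ?, ?)} );

        # Process one day's headers at a time; assumes input sorted by date.
        my ( $day, %count );
        while ( my $line = <> ) {
            my ( $date, $sender ) = split ' ', $line;
            if ( defined $day and $date ne $day ) {
                flush_day( $day, \%count );
                %count = ();
            }
            $day = $date;
            $count{$sender}++;
        }
        flush_day( $day, \%count ) if defined $day;

        sub flush_day {
            my ( $date, $counts ) = @_;
            $ins->execute( $_, $date, $counts->{$_} ) for keys %$counts;
            $dbh->commit;    # one transaction per day keeps the load modest
        }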

    Where do you want *them* to go today?
      As an extension to the excellent approach mentioned by thezip and bobf, I would recommend a slightly adapted aggregate table setup:

      1: table VOLUMES with columns (date,id,size), index on (date)
      2: table EMAILS with columns (id,emailaddress), index on (id)

      The lookup table will reduce space requirements because you don't repeat the long email addresses for each day in the VOLUMES table.
      Indexes take up space as well; from your note it appears that queries will be done by date/period only, so the VOLUMES table only needs an index on the date to give you the ids for that day.

      You mentioned the need to keep track of total email volume on a given day; this can be entered in the VOLUMES table as an extra entry with id=0, with a corresponding entry in the EMAILS table where emailaddress='all'.
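      In DDL form, with the matching top-10 query (MySQL syntax, and everything beyond the columns listed above, are assumptions):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
              { RaiseError => 1 } );

          $dbh->do($_) for (
              q{CREATE TABLE EMAILS  (id INT PRIMARY KEY,
                                      emailaddress VARCHAR(255))},
              q{CREATE TABLE VOLUMES (date DATE, id INT, size INT,
                                      INDEX (date))},
          );

          # Top 10 senders for one day; id=0 is the 'all' total row, so skip it.
          my $top10 = $dbh->selectall_arrayref( q{
              SELECT e.emailaddress, v.size
              FROM   VOLUMES v JOIN EMAILS e ON e.id = v.id
              WHERE  v.date = ? AND v.id <> 0
              ORDER  BY v.size DESC
              LIMIT  10
          }, undef, '2007-03-23' );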

        A slight modification to varian's adaptation would be to allow multiple email addresses to map to a single ID. That way, if/when people's email addresses change (marriage, company is bought out, ...) you can still tie their email history together. If ya want....

        --roboticus

      If an entire day is processed at once and only the top 10 counts are required, then your approach could be optimized further if you were to

      • store only the top 10 results from each day in the aggregates table.

      An obvious limitation of this restriction is that if the number of results per day increased (to the top 20, for example), the processing script would have to be re-run. The OP didn't indicate whether that was a possibility. The risk could be mitigated by storing data for the top 10%, the top 50, etc., which would still represent a significant saving in storage space compared to storing all of the summary data.
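      As a sketch, the prune step could copy just the day's winners into a slim table (the table and column names follow thezip's reply above; MySQL syntax is assumed):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:mysql:maildb', 'user', 'secret',
              { RaiseError => 1 } );

          # Keep only the top 10 rows for the day just aggregated.
          my $day = '2007-03-23';
          $dbh->do( q{
              INSERT INTO TopAggregates (email_address, date, rawcount)
              SELECT email_address, date, rawcount
              FROM   Aggregates
              WHERE  date = ?
              ORDER  BY rawcount DESC
              LIMIT  10
          }, undef, $day );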

      Update: I may have misunderstood the requirements, as thezip mentions, below.

        Yes, but I understood a requirement to be able to display *all* historic data for "Yesterday's Top 10", regardless of if they were Top 10 for any previous day.

        Where do you want *them* to go today?
Re: Summarizing Mail Activity (Long Tail Problem)
by Anno (Deacon) on Mar 24, 2007 at 13:27 UTC
    One data structure that can efficiently keep the top n of a large set of cases is a heap, specifically a minimum heap. You insert cases into the heap until it reaches size n; after each subsequent insertion, you delete the minimum. After all cases have been processed, the heap contains the top n, which can be extracted lowest to highest. The heap never grows larger than n+1 elements.

    There are a few modules on CPAN that implement heaps. If you need persistence (I'm not sure whether your problem requires that), Storable will probably be able to add it.
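    For illustration, a pure-Perl sketch of the technique, with an array kept in min-heap order (a CPAN heap module would replace the two helpers):

        use strict;
        use warnings;

        # Keep the top $n [count, sender] pairs with a minimum heap:
        # push each case, and once the heap exceeds $n, drop the minimum.
        sub top_n {
            my ( $n, %count ) = @_;
            my @heap;    # array-based binary min-heap, ordered on count
            for my $sender ( keys %count ) {
                heap_push( \@heap, [ $count{$sender}, $sender ] );
                heap_pop( \@heap ) if @heap > $n;
            }
            # Extract lowest to highest; unshift yields a top-down list.
            my @out;
            unshift @out, heap_pop( \@heap ) while @heap;
            return @out;
        }

        sub heap_push {    # sift the new element up toward the root
            my ( $h, $item ) = @_;
            push @$h, $item;
            my $i = $#$h;
            while ( $i and $h->[$i][0] < $h->[ ( $i - 1 ) >> 1 ][0] ) {
                @$h[ $i, ( $i - 1 ) >> 1 ] = @$h[ ( $i - 1 ) >> 1, $i ];
                $i = ( $i - 1 ) >> 1;
            }
        }

        sub heap_pop {     # remove and return the minimum element
            my ($h) = @_;
            my $min  = $h->[0];
            my $last = pop @$h;
            if (@$h) {
                $h->[0] = $last;
                my $i = 0;
                while (1) {
                    my ( $l, $r ) = ( 2 * $i + 1, 2 * $i + 2 );
                    my $s = $i;
                    $s = $l if $l < @$h and $h->[$l][0] < $h->[$s][0];
                    $s = $r if $r < @$h and $h->[$r][0] < $h->[$s][0];
                    last if $s == $i;
                    @$h[ $i, $s ] = @$h[ $s, $i ];
                    $i = $s;
                }
            }
            return $min;
        }

        # Example: top 3 of a day's counts.
        my %count = ( 'a@x.com' => 5, 'b@x.com' => 17,
                      'c@x.com' => 2, 'd@x.com' => 9 );
        printf "%s %d\n", $_->[1], $_->[0] for top_n( 3, %count );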

    Anno

Re: Summarizing Mail Activity (Long Tail Problem)
by eric256 (Parson) on Mar 24, 2007 at 14:51 UTC

    What database are you using? It seems like just some SQL tweaking and indexing may be in order. If you are using something like Oracle, you could define a materialized view that updates as records are committed to the main database. That would automatically calculate the aggregates as data is entered.

    If you defined the view as something like select sender_id, to_char(send_date, 'dd-mm-yyyy') send_date, count(1) messages from email_table group by sender_id, to_char(send_date, 'dd-mm-yyyy'), it would aggregate the data by day, and depending on how you specify the refresh it should be pretty easy on the database, especially if you index on the formatted date.
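    A sketch of that in Oracle DDL, issued via DBI (the view name and refresh options are guesses; fast refresh on commit also needs a materialized view log on the base table):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:Oracle:maildb', 'user', 'secret',
            { RaiseError => 1 } );

        # Required for REFRESH FAST ON COMMIT of an aggregate view.
        $dbh->do(q{
            CREATE MATERIALIZED VIEW LOG ON email_table
            WITH SEQUENCE, ROWID (sender_id, send_date)
            INCLUDING NEW VALUES
        });

        # TRUNC(send_date) truncates to midnight; the to_char() form
        # above would group the same way.
        $dbh->do(q{
            CREATE MATERIALIZED VIEW daily_sender_counts
            BUILD IMMEDIATE
            REFRESH FAST ON COMMIT
            AS SELECT sender_id,
                      TRUNC(send_date) AS send_day,
                      COUNT(*)         AS messages
               FROM   email_table
               GROUP  BY sender_id, TRUNC(send_date)
        });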


    ___________
    Eric Hodges
Re: Summarizing Mail Activity (Long Tail Problem)
by OfficeLinebacker (Chaplain) on Mar 25, 2007 at 02:43 UTC

    First off, how big is the database? Our DBs tend to be a few gigs in size. If I had to cull from a production database, I would look for a query that misses nothing, still reduces the storage needed to represent the information I want (or at least fits where I have the space), and takes a reasonable time to run, in that order of importance. Jobs that would take forever I would run overnight, over a weekend, or sometimes on special dedicated machines if they needed to be done ASAP.

    So anyway, I think a table with three fields--date, sender address, and number of sends--would be a good next step for you. There would be no records with a zero in the last field, if you know what I mean. Its row count would be the number of days in eight years--2920, let's call it 3000--times the average number of unique senders per day--much less than 1.5 million, I hope!

    Since you said you created an even bigger table once, I presume you have the space. And this table is only temporary--you can create it with a single pass through the production database, and subsequent passes can then be made against this table. So it only has to exist long enough for the next step or two.

    The other thing to do is to work backwards from your desired final result. To find out what percentage of a day's mail a sender was responsible for, you need to know the total volume for that day. That number could be represented with a special code in the first-step table described above (with a keyword like 'total' in place of the email address; ordering in the db and/or a check for the absence of @ might serve as markers) or in a separate table. We're talking about 3000 records, with two fields if it's a separate table, three if it's part of the first.

    The final table will have at least 10 rows per day, or around 30,000 rows, and at least two fields (date and percentage--I presume actual addresses don't show up on the graph?).

    So if the intermediate dataset is sorted by ascending date and then by descending number of emails, all it takes is one more pass through it to calculate the percentages, and, depending on how smart the database software is, to do so for only the top ten of each day (based on the previous sort order). It may be worth pulling the data in one step, sorting it in another, and creating the final table in a third.

    I should note at this point that it may have been worth extracting the daily totals in the previous step, though I don't know how much longer that would have taken. It might be worth either adding 3000 records to the interim table with a sender code of "daily summary" or something--just anything that makes them easy to distinguish--or creating a separate summary table with two columns, date and total number of sends. I don't know.
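    A sketch of that final pass, assuming the intermediate data has been dumped as sorted "date sender sends" lines with the daily total stored under a sender code like 'total' (all of those layout choices are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Sorted by date ascending then sends descending, the 'total'
        # row is the day's maximum, so it arrives first for each date.
        my ( %total, %seen );
        while ( my $line = <> ) {
            my ( $date, $sender, $sends ) = split ' ', $line;
            if ( $sender eq 'total' ) {
                $total{$date} = $sends;
                next;
            }
            next if ++$seen{$date} > 10;    # top ten per day only
            printf "%s %-30s %6.2f%%\n", $date, $sender,
                100 * $sends / $total{$date};
        }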

    Sounds similar in scale to problems I have solved, only in our company we are lucky to have the data in SAS format and have lots of SAS licenses.


    I like computer programming because it's like Legos for the mind.
Re: Summarizing Mail Activity (Long Tail Problem)
by mojotoad (Monsignor) on Mar 25, 2007 at 23:26 UTC
    You might want to consider using a round robin database, which is great for aggregating 'buckets' of data at varying degrees of resolution; it would be up to you to specify the minimal bucket size and how long it takes buckets to coalesce into larger ones.

    RRDtool is a well-known example. It provides Perl bindings, as well as ones for Ruby and Python.
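    Creating and feeding a daily-resolution RRD from Perl looks roughly like this (the step, DS, and RRA choices are illustrative only):

        use strict;
        use warnings;
        use RRDs;

        # One GAUGE data source, one daily average kept for ~10 years.
        RRDs::create( 'mailvolume.rrd',
            '--step', 86400,
            'DS:messages:GAUGE:172800:0:U',
            'RRA:AVERAGE:0.5:1:3650',
        );
        die 'RRDs create: ' . RRDs::error if RRDs::error;

        # Record a day's message count (timestamp:value).
        RRDs::update( 'mailvolume.rrd', time() . ':12345' );
        die 'RRDs update: ' . RRDs::error if RRDs::error;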

    Cheers,
    Matt

    P.S. The RRDtool site provides a nice tutorial for beginners if you haven't messed with RRDs before.

Re: Summarizing Mail Activity (Long Tail Problem)
by eskwayrd (Acolyte) on Mar 29, 2007 at 22:18 UTC
    Thanks all for the input.

    Obviously, I was doing something stupid with my original aggregation attempt, and the suggestions for creating an optimized schema are spot-on.

    Unfortunately, while interesting, I cannot use a Heap or an RRD; both would ultimately remove data that might need to be included in the graph at some point. Those would probably be ideal solutions if my requirements were different.
