Delivering activity feed items in a moderately scalable way Delivering activity feed items in a moderately scalable way database database

Delivering activity feed items in a moderately scalable way


I'm leaning towards just having the master activity table. If you go with that, this is what I would consider implementing:

  1. You can create several activity tables and do a UNION ALL when fetching the data from the database. For example, roll them over monthly - activity_2010_02, etc. Just going by your example - 200K users x 100 friends x 3 activities = 60 million rows. Not a concern performance-wise for PostgreSQL, but you might consider this purely for convenience now and eventually for effortless future expansion.

  2. This has the disadvantage that you're querying for users who may never be active, and as your friend list grows, this query can get slower and slower.

Are you going to display the entire activity feed, going back to the beginning of times? You haven't provided much detail in the original question but I'd hazard a guess that you'd be showing the last 10/20/100 items sorted by time stamp. A couple of indexes and the LIMIT clause should be enough to provide an instant response (as I've just tested on a table with about 20 million rows). It can be slower on a busy server, but that is something that should be worked out with hardware and caching solutions, Postgres is not going to be the bottleneck there.

Even if you do provide activity feeds going back to the dawn of time, paginate the output! The LIMIT clause will save you there. If the basic query with a LIMIT on it is not enough, or if your users have a long tail of friends that are no longer active, you could consider limiting the lookup to the last day/week/month first and then provide the list of friend ids:

select * from activity   where ts <= 123456789     and source_user in (1, 2, 44, 2423, ... my friend list)

If you've got a table spanning months or years back, the search for the friends ids will only be performed within the rows selected by the first WHERE clause.

That's just if I choose between the two solutions you are considering now. I would also look at things like:

  1. Reconsidering your denormalisation of the table. Is storing pre-generated HTML output really the best way? Will you be better off performance-wise by having a lookup table of activities instead and generating templated output on the fly? Pre-generated HTML can seem better at the outset, but consider things like disk storage, APIs, future layout changes and storing HTML may not be that attractive after all. The lookup table could contain your possible activities - added a friend, changed status, etc., and the activity log would reference that and the friend's id if another user is involved in the activity.

  2. Doing pre-generate HTML, but not storing it in the database. Save the stuff on disk as pre-generated pages. This is not a silver bullet, however, and largely depends on the ratio of write-to-reads on your site. I.e. a typical discussion thread on a public forum could have a dozen messages, but could be viewed hundreds of times - a good candidate for caching. Whereas if your application is more tuned to immediate status updates and you'd have to regenerate the HTML page and save it again on disk after every couple of views, then there's little value in this approach.

Hope this helps.