Strategy for caching of remote service; what should I be considering?

Would http://instagram.com/developer/realtime/ be any use? It appears that Instagram is willing to POST to your server when there are new (and maybe updated?) images for you to check out. Would that do the trick?
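
If it would, a very rough callback sketch might look like the following (Python with Flask). It assumes a PubSubHubbub-style verification handshake (echoing `hub.challenge`) and a JSON POST body listing changed objects; the `object_id` field name and the `mark_stale` helper are illustrative, so check the real-time API docs for the actual payload:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/instagram/callback", methods=["GET", "POST"])
def instagram_callback():
    if request.method == "GET":
        # Subscription verification: echo the challenge back to the caller.
        return request.args.get("hub.challenge", "")

    # A POST means one or more subscribed objects changed. Rather than
    # fetching immediately, just mark them stale so your updater can
    # refresh them within the API rate budget.
    for update in request.get_json(force=True):
        mark_stale(update.get("object_id"))  # "object_id" is an assumed field name
    return "", 200

def mark_stale(object_id):
    """Hypothetical helper: flag the cached record so the next pass refreshes it."""
    pass

if __name__ == "__main__":
    app.run(port=8080)
```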

Otherwise, I think your problem sounds much like the problem any search engine has—have you seen Wikipedia on crawler selection criteria? You're dealing with many of the problems faced by web crawlers: what to crawl, how often to crawl it, and how to avoid making too many requests to an individual site. You might also look at open-source crawlers (on the same page) for code and algorithms you might be able to study.

Anyway, to throw out some thoughts on standards for crawling:

  • Update most often the items that have changed frequently on past checks. So, if an item hasn't changed in the last five checks, you could assume it won't change as often and update it less.
  • Create a score for each image, and update the ones with the highest scores, or the lowest (depending on what kind of score you're using); a rough scoring-plus-backoff sketch follows this list. This is similar to the idea LilyPond uses to typeset music. Some ways to create input for such a score:
    • A statistical model of the chance of an image being updated and needing to be recached.
    • An importance score for each image, using things like the recency of the image, or the currency of its event.
  • Update things that are being viewed frequently right now.
  • Update things that have accumulated many views overall.
  • Does time affect the probability that an image will be updated? You mentioned that newer images are more important, but what about the probability of changes to older ones? If older images rarely change, slow down the frequency of checks for them.
  • Allocate part of your requests to slowly updating everything, and split the rest across several different selection strategies running simultaneously; a budget-splitting sketch follows this list. So, for example, have the following (the numbers are for show/example only; I just pulled them out of a hat):
    • 5,000 requests per hour churning through the complete contents of the database (provided they've not been updated since the last time that crawler came through)
    • 2,500 requests processing new images (which you mentioned are more important)
    • 2,500 requests processing images of current events
    • 2,500 requests processing images that are in the top 15,000 most viewed (as long as there has been a change in the last 5 checks of that image, otherwise, check it on a decreasing schedule)
    • 2,500 requests processing images that have been viewed at least some minimum number of times
    • Total: 15,000 requests per hour.
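
To make the scoring and backoff ideas above concrete, here is a rough sketch in Python. The weights, field names, and the halving backoff are all assumptions you would tune against your own usage data, not part of any established algorithm:

```python
import heapq
import time

# Rough sketch: combine a freshness/importance score with exponential
# backoff for items whose recent checks found no changes.

def score(image, now=None):
    """Higher score = refresh sooner."""
    now = now or time.time()
    age_hours = (now - image["created_at"]) / 3600.0
    recency = 1.0 / (1.0 + age_hours)                     # newer images matter more
    popularity = image["views_last_hour"]                 # being viewed right now
    staleness = (now - image["last_checked_at"]) / 60.0   # minutes since last check
    backoff = 0.5 ** image["unchanged_checks"]            # halve priority per unchanged check
    return (10_000 * recency + popularity + staleness) * backoff

def pick_batch(images, budget):
    """Pick the `budget` highest-scoring images to refresh this cycle."""
    return heapq.nlargest(budget, images, key=score)

if __name__ == "__main__":
    now = time.time()
    images = [
        {"id": 1, "created_at": now - 600,    "views_last_hour": 50,
         "last_checked_at": now - 300,   "unchanged_checks": 0},
        {"id": 2, "created_at": now - 86_400, "views_last_hour": 5,
         "last_checked_at": now - 3_600, "unchanged_checks": 4},
        {"id": 3, "created_at": now - 3_600,  "views_last_hour": 200,
         "last_checked_at": now - 1_800, "unchanged_checks": 1},
    ]
    for img in pick_batch(images, budget=2):
        print(img["id"], round(score(img, now), 1))
```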

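And a sketch of splitting an hourly request budget across those buckets, mirroring the made-up numbers above. The `iter_*` generators are hypothetical stand-ins; each would yield image IDs in its own priority order:

```python
# Rough sketch of splitting an hourly API budget across several crawl
# strategies. Quotas and bucket names mirror the example numbers above.

def iter_full_sweep():      # everything not refreshed since the last full pass
    return iter([])
def iter_new_images():      # newest images first
    return iter([])
def iter_current_events():  # images attached to ongoing events
    return iter([])
def iter_top_viewed():      # top ~15,000 most-viewed images
    return iter([])
def iter_min_views():       # images over some minimum view threshold
    return iter([])

HOURLY_BUDGET = 15_000

BUCKETS = [
    ("full_sweep",     5_000, iter_full_sweep),
    ("new_images",     2_500, iter_new_images),
    ("current_events", 2_500, iter_current_events),
    ("most_viewed",    2_500, iter_top_viewed),
    ("min_views",      2_500, iter_min_views),
]
assert sum(quota for _, quota, _ in BUCKETS) == HOURLY_BUDGET

def run_hourly_cycle(refresh):
    """Spend each bucket's quota; `refresh(image_id)` makes one API call."""
    for name, quota, source in BUCKETS:
        for spent, image_id in enumerate(source()):
            if spent >= quota:
                break
            refresh(image_id)
```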

How many (unique) photos / events are viewed on your site per hour? Photos that are not viewed probably don't need to be updated often. Do you see any patterns in views for old events / photos? Old events might not be as popular, so perhaps they don't have to be checked as often.

andyg0808 has good detailed information; however, it is important to know the patterns of your data usage before applying it in practice.

At some point you will find that 20,000 API requests per hour will not be enough to update frequently viewed photos, which might lead you to different questions as well.