Social-networking: Hadoop, HBase, Spark over MongoDB or Postgres?


I think you are heading in the right direction by looking for a software stack/architecture that can:

  • handle different types of load: batch, real time computing etc.
  • scale in size and speed along with business growth
  • be an actively developed software stack that is well maintained and supported
  • have common library support for domain specific computing such as machine learning, etc.

On those criteria, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature for handling large-scale data in batch. It provides reliable, scalable storage (HDFS) and computation (MapReduce/YARN). Adding Spark lets you keep HDFS as the storage layer while gaining fast in-memory processing, including near-real-time (streaming) workloads.
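To make the batch model a bit more concrete, here is a minimal single-process sketch of the map / shuffle / reduce pattern that MapReduce implements (Spark's `reduceByKey` follows the same idea). It does not use Hadoop or Spark at all; the edge list `FOLLOW_EDGES` and every function name are made up for illustration, and a real job would read its input from HDFS and run across many machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: (follower, followee) pairs, as they might be stored
# in files on HDFS. Here they are just an in-memory list.
FOLLOW_EDGES = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "bob"),
]


def map_phase(edge):
    """Emit one (followee, 1) pair per follow edge."""
    _follower, followee = edge
    yield (followee, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the ones emitted for a single user."""
    return key, sum(values)


def count_followers(edges):
    """Run the three phases in order and return {user: follower_count}."""
    mapped = chain.from_iterable(map_phase(edge) for edge in edges)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


follower_counts = count_followers(FOLLOW_EDGES)
```

In a cluster the map and reduce functions stay about this small; what the framework adds is the distributed shuffle, fault tolerance, and data locality.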

In terms of development, both systems are natively supported in Java/Scala. Library support and performance-tuning advice are abundant here on Stack Overflow and elsewhere. There are at least a few machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.

For deployment, AWS and other cloud providers offer hosted Hadoop/Spark solutions, so that is not an issue either.


I think you should separate data storage from data processing. In particular, "Spark or MongoDB?" is not a good question to ask; better ones are "Spark or Hadoop or Storm?" and "MongoDB or Postgres or HDFS?"

In any case, I would refrain from having the database do processing.
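One way to picture that separation is a thin storage interface with the processing code written against it. The sketch below is only an assumption about how this could look: `PostStore`, `InMemoryPostStore`, and `top_authors` are hypothetical names, and the in-memory store stands in for whatever backend (MongoDB, Postgres, HDFS) you end up choosing.

```python
from collections import Counter
from typing import Dict, Iterable, List, Protocol, Tuple


class PostStore(Protocol):
    """Storage side: anything that can hand back posts as dicts."""

    def read_posts(self) -> Iterable[Dict[str, str]]:
        ...


class InMemoryPostStore:
    """Stand-in backend for the example; a real one would query a database."""

    def __init__(self, posts: Iterable[Dict[str, str]]) -> None:
        self._posts = list(posts)

    def read_posts(self) -> Iterable[Dict[str, str]]:
        return iter(self._posts)


def top_authors(store: PostStore, n: int) -> List[Tuple[str, int]]:
    """Processing side: runs outside the database, on whatever the store yields."""
    counts = Counter(post["author"] for post in store.read_posts())
    return counts.most_common(n)


store = InMemoryPostStore(
    [
        {"author": "alice", "text": "hello"},
        {"author": "bob", "text": "hi"},
        {"author": "alice", "text": "again"},
        {"author": "alice", "text": "third"},
        {"author": "bob", "text": "bye"},
        {"author": "carol", "text": "hey"},
    ]
)
ranking = top_authors(store, 2)
```

Because `top_authors` only depends on `read_posts`, the storage choice and the processing choice can change independently, which is the point of asking the two questions separately.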


I have to admit that I'm a little biased, but if you want to learn something new, have serious spare time, are willing to read a lot, and have the resources (in terms of infrastructure), go for HBase*; you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of atomic counters in real time.

*Alongside Hadoop, Hive, Spark...
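For readers unsure what "atomic counter" means here: an increment either happens completely or not at all, so concurrent writers never lose updates. The sketch below imitates that behaviour in a single Python process with a lock. It does not talk to HBase; `CounterTable` and the row key `"post:42:likes"` are invented for illustration, and HBase provides the same guarantee on the server side, per cell, across a cluster.

```python
import threading
from collections import defaultdict


class CounterTable:
    """In-process stand-in for a table of atomic counters (not HBase itself)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, row_key, amount=1):
        """Atomically add `amount` to the counter and return the new value."""
        with self._lock:
            self._counts[row_key] += amount
            return self._counts[row_key]

    def get(self, row_key):
        """Read the current value; counters that were never touched read as 0."""
        with self._lock:
            return self._counts.get(row_key, 0)


def simulate_likes(table, row_key, writers=8, likes_per_writer=1000):
    """Have several threads increment the same counter concurrently."""

    def worker():
        for _ in range(likes_per_writer):
            table.increment(row_key)

    threads = [threading.Thread(target=worker) for _ in range(writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return table.get(row_key)


table = CounterTable()
total_likes = simulate_likes(table, "post:42:likes")
```

With atomic increments, things like like counts, view counts, or per-user activity tallies can be updated on every event instead of being recomputed in a later batch job.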