Hadoop and Django, is it possible? Hadoop and Django, is it possible? hadoop hadoop

Hadoop and Django, is it possible?


What is Hadoop?

Imagine the following challange: you have a lot of data, and with a lot I mean at least Terabytes. You want to transform this data or extract some informations and process it into a format which is indexed, compressed or "digested" in a way so you can work with it.

Hadoop is able to parallelize such a processing job and, here comes the best part, takes care of things like redundant storage of the files, distribution of the task over different machines on the cluster etc (Yes, you need a cluster, otherwise Hadoop is not able to compensate the performance loss of the framework).

If you take a first look at the Hadoop ecosystem you will find 3 big terms: HDFS(Hadoop Filesystem), Hadoop itself(with MapReduce) and HBase(the "database" sometimes column store, which does not fits exactly)

HDFS is the Filesystem used by both Hadoop and HBase. It is a extra layer on top of the regular filesystem on your hosts. HDFS slices the uploaded Files in chunks (usually 64MB) and keeps them available in the cluster and takes care of their replication.

When Hadoop gets a task to execute, it gets the path of the input files on the HDFS, the desired output path, a Mapper and a Reducer Class. The Mapper and Reducer is usually a Java class passed in a JAR file.(But with Hadoop Streaming you can use any comandline tool you want). The mapper is called to process every entry (usually by line, e.g.: "return 1 if the line contains a bad F* word") of the input files, the output gets passed to the reducer, which merges the single outputs into a desired other format (e.g: addition of numbers). This is a easy way to get a "bad word" counter.

The cool thing: the computation of the mapping is done on the node: you process the chunks linearly and you move just the semi-digested (usually smaller) data over the network to the reducers.

And if one of the nodes dies: there is another one with the same data.

HBase takes advantage of the distributed storage of the files and stores its tables, splitted up in chunks on the cluster. HBase gives, contrary to Hadoop, random access to the data.

As you see HBase and Hadoop are quite different to RDMBS. Also HBase is lacking of a lot of concepts of RDBMS. Modeling data with triggers, preparedstatements, foreign keys etc. is not the thing HBase was thought to do (I'm not 100% sure about this, so correct me ;-) )

Can Django integrated with Hadoop?

For Java it's easy: Hadoop is written in Java and all the API's are there, ready to use.

For Python/Django I don't know (yet), but I'm sure you can do something with Hadoop streaming/Jython as a last resort.I've found the following: Hadoopy and Python in Mappers and Reducers.


Django can connect with most RDMS, so you can use it with a Hadoop based solution.

Keep in mind, Hadoop is many things, so specifically, you want something with low latency such as HBase, don't try to use it with Hive or Impala.

Python has a thrift based binding, happybase, that let you query Hbase.