Relationship between Hadoop and databases


I want to know the relationship between HDFS and databases.

There is no relation as such between the two. If you still want to find some similarity, the only thing common between them is the provision to store data. But this is analogous to any FS and DB combination, MySQL and ext3, for example. You say that you are storing data in MySQL, but eventually your data is getting stored on top of your FS. Folks usually use NoSQL databases, like HBase, on top of their Hadoop cluster to exploit the parallelism and distributed behavior provided by HDFS.

Is it always necessary that, to use HDFS, the data be in some NoSQL format?

There is actually no such thing as a NoSQL format. You can use HDFS for any kind of data: text, binary, XML, etc.
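As a quick illustration, here is a minimal sketch of writing raw bytes into HDFS with the Java FileSystem API. The namenode URI and the path are placeholders, not anything from your setup:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode" and the path below are placeholders for your own cluster.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             FSDataOutputStream out = fs.create(new Path("/data/raw/sample.bin"))) {
            out.write(new byte[] {0x01, 0x02, 0x03}); // HDFS just stores bytes
        }
    }
}
```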

Is there a specific database that always comes attached when using HDFS?

No. The only thing which comes coupled with HDFS is the MapReduce framework. You can obviously make a DB work with HDFS. Folks often use NoSQL DBs on top of HDFS. There are several choices, like Cassandra, HBase, etc. It's totally up to you to decide which one to use.
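For example, a hedged sketch of writing one cell through the HBase Java client; the table name, column family, and values here are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes("hello"));
            table.put(put); // the cell ultimately lands in HFiles stored on HDFS
        }
    }
}
```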

Can I use a relational database as the native database for Hadoop?

There is no out-of-the-box (OOTB) feature which allows this. Moreover, it doesn't make much sense to use RDBMSs with Hadoop. Hadoop was developed for the cases where an RDBMS is not the suitable option, like handling PBs of data, handling unstructured data, etc. Having said that, you must not think of Hadoop as a replacement for RDBMSs. Both have entirely different goals.

EDIT:

Normally folks use NoSQL DBs (like HBase, Cassandra) with Hadoop. Using these DBs with Hadoop is merely a matter of configuration; you don't need any connecting program in order to achieve this. Apart from the point made by @Doctor Dan, there are a few other reasons behind choosing NoSQL DBs in place of SQL DBs. One thing is size: these NoSQL DBs provide great horizontal scalability, which enables you to store PBs of data easily. You could scale traditional systems too, but only vertically. Another reason is the complexity of the data. The places where these DBs are being used mostly handle highly unstructured data, which is not very easy to deal with using traditional systems. For example, sensor data, log data, etc.

Basically, I did not understand why SQOOP exists. Why can't we directly use SQL data on Hadoop?

Although Hadoop is very good at handling your BigData needs, it is not the solution to all your needs. In particular, it is not suitable for real-time needs. Suppose you are an online transaction company with a very large dataset. You find that you could process this data very easily using Hadoop. But the problem is that you can't serve the real-time needs of your customers with Hadoop. This is where SQOOP comes into the picture. It is an import/export tool that allows you to move data between a SQL DB and Hadoop. You could move your BigData into your Hadoop cluster, process it there, and then push the results back into your SQL DB using SQOOP to serve the real-time needs of your customers.
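As a rough sketch of that round trip, assuming Sqoop 1.x on the classpath (the JDBC URL, credentials, tables, and HDFS paths below are all placeholders; the same arguments work with the sqoop command-line tool):

```java
import org.apache.sqoop.Sqoop;

public class SqoopRoundTripSketch {
    public static void main(String[] args) {
        // Pull the source table into HDFS (placeholder connection details).
        int importRc = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db.example.com/shop",
            "--username", "etl",
            "--table", "orders",
            "--target-dir", "/user/etl/orders"
        });

        // ... process /user/etl/orders with MapReduce/Hive/Pig here ...

        // Push the computed results back to the SQL DB for real-time serving.
        int exportRc = Sqoop.runTool(new String[] {
            "export",
            "--connect", "jdbc:mysql://db.example.com/shop",
            "--username", "etl",
            "--table", "order_stats",
            "--export-dir", "/user/etl/order_stats"
        });
        System.exit(importRc == 0 && exportRc == 0 ? 0 : 1);
    }
}
```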

HTH


The advantage of Hadoop is its ability to store data with replication, so you cannot have Hadoop "work off of", say, SQL Server, nor would it make much sense. There are HBase, Hive, and Pig environments (and others) that can be set up to work with Hadoop, and they look and feel like regular SQL languages. Check out Hortonworks' Sandbox if you want something to play with; as they say, from 0 to Big Data in 15 minutes. Hope this helps.


What you really want to achieve is not clear from your question.

There is only an indirect relationship between HDFS and a database. HDFS is a file system, not a database. Hadoop is a combination of a parallel processing framework (MapReduce) and the file system HDFS. The parallel processing framework grabs chunks of data from the HDFS file system using something called an InputFormat. Some databases, like Oracle NoSQL Database (ONDB), Cassandra, Riak, and others, have the ability to return an InputFormat containing their data, so they can participate as a source for MapReduce processing, just like data from HDFS.
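Hadoop itself ships one relational flavor of this idea, DBInputFormat. A minimal sketch of wiring a hypothetical "orders" table in as a MapReduce source, with all JDBC details as placeholders:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbAsSourceSketch {
    // Record type mapping one row of the hypothetical "orders" table.
    public static class OrderRecord implements Writable, DBWritable {
        long id;
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong("id"); }
        public void write(PreparedStatement st) throws SQLException { st.setLong(1, id); }
        public void readFields(DataInput in) throws IOException { id = in.readLong(); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder JDBC details; point these at your own database.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://db.example.com/shop", "etl", "secret");
        Job job = Job.getInstance(conf, "db-as-mapreduce-source");
        job.setJarByClass(DbAsSourceSketch.class);
        job.setInputFormatClass(DBInputFormat.class);
        // Rows of "orders" become map inputs, just like splits of an HDFS file.
        DBInputFormat.setInput(job, OrderRecord.class, "orders", null, "id", "id");
        // ... set the mapper, output format, and output path as in any other job ...
    }
}
```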

So again, what do you want to do?

Hadoop and HDFS are in general useful when you have a large amount of data that has not yet been aggregated and/or structured into some model needed for higher-level processing. On occasion (though questionably forced more often than really necessary), Hadoop can be used to do higher-level processing that would normally be done in another processing/storage technology that leverages a decent model. Think of Google Instant: search index creation used to run on MapReduce, then Google developed a model and now uses a better approach. You could not do Google Instant on MapReduce alone.