Azure Data Lake vs. Azure HDInsight

azure azure-data-lake azure-hdinsight

The easiest way to think of a data lake is as a large container, like a real lake fed by rivers: you never know where the rivers are coming from (or what "type" of river each one is). Azure Data Lake was introduced to make it easy for developers, data scientists, and analysts to store data of any size. It removes the complexity of ingesting and storing all your data, and makes it faster to get up and running with big data. Data Lake can store many different types of data (structured data, unstructured data, log files, real-time data, images, etc.) and lets you blend and correlate those types together. This matters as we move from traditional tools to modern ones (such as Hadoop, Cassandra, and NoSQL databases). Azure Data Lake includes three services:

Azure Data Lake Store, a no-limits data lake that powers big data analytics
Azure Data Lake Analytics, a massively parallel on-demand job service
Azure HDInsight, a fully managed cloud Hadoop and Spark offering

Azure Data Lake Store is like a cloud-based file service or file system that is pretty much unlimited in size. You can run services on top of the data in that store: you could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake Analytics service, which complements Azure Data Lake Store. That service lets you run jobs that query the data you have stored in Azure Data Lake Store and generate output results.


In a nutshell:

HDInsight is a managed Hadoop service (it provides the compute).
Azure Data Lake (ADL) is a managed storage service (it provides large amounts of storage).

(Instead of ADL, you can alternatively use Blob storage with HDInsight, but blobs have some limitations; for example, streaming files to storage through the HDInsight cluster is not supported.)

Here is the definition from Azure documentation (below):

Azure uses a "decomposed hardware method".

You can think of HDInsight as a Hadoop cluster and Azure Data Lake (ADL) as its HDFS, except that the two are detached from each other.

If you want to relate this to AWS, HDInsight is equivalent to EMR, and ADL is equivalent to EMRFS or S3.

If you terminate the cluster, the ADL storage stays, along with the files stored in it. You can access the storage directly from another service or tool (like Azure Databricks), or you can create another HDInsight cluster on top of the same data.

HDInsight accesses ADL using adl:// URIs. HDInsight never stores the file blocks on its own nodes (as a conventional Hadoop cluster does); instead, it holds mappings to the storage service.
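To make the adl:// addressing concrete, here is a minimal Python sketch that builds and splits such URIs. It assumes the usual Data Lake Store form adl://<account>.azuredatalakestore.net/<path>; the helper names (adl_uri, split_adl_uri) and the account name mydatalake are made up for illustration and are not part of any Azure SDK.

```python
from typing import Tuple
from urllib.parse import urlparse

# Assumed host suffix for Data Lake Store accounts.
ADL_HOST_SUFFIX = "azuredatalakestore.net"


def adl_uri(account: str, path: str) -> str:
    """Build an adl:// URI for a file or folder in a Data Lake Store account."""
    return f"adl://{account}.{ADL_HOST_SUFFIX}/{path.lstrip('/')}"


def split_adl_uri(uri: str) -> Tuple[str, str]:
    """Split an adl:// URI into (account name, path inside the store)."""
    parsed = urlparse(uri)
    if parsed.scheme != "adl":
        raise ValueError(f"not an adl:// URI: {uri}")
    # The account name is the first label of the host name.
    account = parsed.netloc.split(".", 1)[0]
    return account, parsed.path


# "mydatalake" is a hypothetical account name.
uri = adl_uri("mydatalake", "/clusters/logs/app.log")
print(uri)
print(split_adl_uri(uri))
```

The point of the sketch is that the URI names the storage account, not any cluster node, which is why the same path stays valid no matter which cluster reads it.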


Azure Data Lake Store is just that: a data store. HDInsight can also store data in the cluster that you spin up. However, when you stop that cluster, the data goes away with it.

It is common for customers to use either Azure Data Lake Store or Azure Storage to provide permanent storage, separate from the cluster (compute) used to process the data.
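The separation of storage and compute can be illustrated with a toy Python model. This is only a sketch of the idea, not Azure code: the DataLakeStore and Cluster classes and the file paths are hypothetical, and plain dictionaries stand in for the real storage.

```python
class DataLakeStore:
    """Toy stand-in for permanent storage that lives outside any cluster."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class Cluster:
    """Toy stand-in for a compute cluster attached to an external store."""

    def __init__(self, store):
        self.store = store      # external storage, shared across clusters
        self.local_files = {}   # data kept only on the cluster itself
        self.running = True

    def terminate(self):
        # Stopping the cluster discards everything kept on the cluster.
        self.local_files.clear()
        self.running = False


store = DataLakeStore()
first = Cluster(store)
first.local_files["/tmp/scratch.dat"] = b"intermediate"
first.store.write("/output/result.csv", b"id,value\n1,42\n")
first.terminate()

# A new cluster attached to the same store still sees the output.
second = Cluster(store)
print(second.store.read("/output/result.csv"))
print(first.local_files)
```

Anything written to the cluster's own storage is gone after terminate(), while the file written to the store can be read by the second cluster.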

Guy

CodeHunter
