Use Data Lake or Blob on HDInsight cluster on Azure


As per this document, an Azure Storage account can hold up to 4.75 TB, though individual blobs (or files from an HDInsight perspective) can only go up to 195 GB. Azure Data Lake Store can grow dynamically to hold trillions of files, with individual files greater than a petabyte. For more information, see Understanding blobs and Data Lake Store.

Also, check Benefits of Azure Storage and Use Data Lake Store for more details and comparisons.

Hope this helps.


In addition to Ashok's answer: ADLS is currently available in only a few regions, whereas Azure Storage is available in many more. So if you need your HDInsight cluster in a specific region, first check that ADLS is offered there, because your storage should be in the same region as the cluster.

Another benefit of ADLS over Azure Storage is its POSIX-based security model at the file/folder level that uses AAD security principals instead of Shared Access Keys.

The reason why you may not want to use ADLS for non-analytics data is primarily cost. Because of some of the additional capabilities, it is currently a bit more expensive.


In addition to the other answers, it's not possible to use the Spark Data Factory activity on HDInsight clusters that use Data Lake as the primary storage. This limitation applies to both ADFv1 and ADFv2, as seen here: https://docs.microsoft.com/en-us/azure/data-factory/v1/data-factory-spark and https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-spark
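
To pull the points from the answers above together, here is a minimal Python sketch of the decision as a checklist. It is only an illustration: the `Workload` fields and the `choose_primary_storage` function are hypothetical names, the 195 GB figure is simply the per-file number quoted in the first answer (not a verified current limit), and defaulting to Blob when nothing forces a choice reflects the cost remark above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Per-file figure quoted in the first answer; an assumption, not a verified current limit.
BLOB_MAX_FILE_GB = 195


@dataclass
class Workload:
    """Hypothetical description of what the cluster needs from its primary storage."""
    largest_file_gb: float         # size of the biggest single file you expect
    adls_in_region: bool           # is ADLS offered in the region the cluster must run in?
    uses_adf_spark_activity: bool  # do you need the Data Factory Spark activity?
    needs_file_level_acls: bool    # do you need POSIX-style ACLs with AAD principals?


def choose_primary_storage(w: Workload) -> Tuple[str, List[str]]:
    """Return ('blob' | 'adls' | 'conflict', reasons) based on the constraints in this thread."""
    blob_reasons: List[str] = []
    adls_reasons: List[str] = []

    if not w.adls_in_region:
        blob_reasons.append("ADLS is not available in the required region")
    if w.uses_adf_spark_activity:
        blob_reasons.append("ADF Spark activity does not support ADLS as primary storage")
    if w.largest_file_gb > BLOB_MAX_FILE_GB:
        adls_reasons.append("files exceed the quoted per-blob size")
    if w.needs_file_level_acls:
        adls_reasons.append("file/folder-level ACLs with AAD principals are required")

    if blob_reasons and adls_reasons:
        # Requirements pull in both directions; a human has to relax one of them.
        return "conflict", blob_reasons + adls_reasons
    if adls_reasons:
        return "adls", adls_reasons
    # Nothing forces ADLS, so fall back to the cheaper option mentioned above.
    return "blob", blob_reasons or ["no constraint requires ADLS; Blob is cheaper"]


if __name__ == "__main__":
    choice, reasons = choose_primary_storage(
        Workload(largest_file_gb=500, adls_in_region=True,
                 uses_adf_spark_activity=False, needs_file_level_acls=False)
    )
    print(choice, reasons)
```

A "conflict" result means the requirements contradict each other, for example very large files together with the Data Factory Spark activity; in that case one of the requirements has to give.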