Apache Drill vs Spark [closed]
Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/

Drill is fundamentally different in both the user's experience and the architecture. For example:

  • Drill is a schema-free query engine. For instance, you can point it at a directory of JSON or Parquet log files (on your local box, an NFS share, S3, HDFS, MapR-FS, etc.) and run a query; see the sketch after this list. You don't have to load data, create and manage schemas, or pre-process the data.
  • Drill uses a JSON document model internally, which allows it to query data of any structure. A lot of modern data is complex, meaning a record can contain nested structures and arrays, and field names may actually encode values such as timestamps or web page URLs. Drill allows normal BI tools to operate seamlessly on such data without requiring the data to be flattened in advance.
  • Drill works with a variety of non-relational datastores, including Hadoop, NoSQL databases (MongoDB, HBase) and cloud storage. Additional datastores will be added.
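
A minimal sketch of that "point and query" workflow, using Drill's REST API from Python (it assumes a Drill instance listening on localhost:8047 and a hypothetical /tmp/logs/ directory of JSON files; adjust both to your environment):

    import requests

    DRILL_URL = "http://localhost:8047/query.json"

    def drill_query(sql):
        """POST an ANSI SQL string to Drill's REST API and return the result rows."""
        resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
        resp.raise_for_status()
        return resp.json()["rows"]

    # No ETL, no schema definition: point Drill at the raw files and query them.
    for row in drill_query("SELECT * FROM dfs.`/tmp/logs/` LIMIT 10"):
        print(row)

The same helper works whether the files live on the local filesystem, S3, or HDFS; only the storage plugin prefix (dfs, s3, etc.) changes.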

Drill 1.0 was just released (May 19, 2015). You can easily download it onto your laptop and play with it without any infrastructure (Hadoop, NoSQL, etc.).


Drill lets you query different kinds of datasets with ANSI SQL. This makes it great for ad hoc data exploration and for connecting BI tools to datasets via ODBC. You can even use Drill to JOIN across different kinds of datasets. For example, you could join records in a MySQL table with rows in a JSON file, a CSV file, OpenTSDB, or MapR-DB... the list goes on. Drill can connect to many different types of data.
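
As a hedged illustration of such a cross-source join, reusing the drill_query helper from the sketch above (the mysql storage plugin name and the shop.users / events.json tables and columns are made-up examples):

    # Join a MySQL table against a directory of JSON files in one ANSI SQL
    # statement; Drill handles each source through its storage plugin.
    sql = """
    SELECT u.name, e.event_type, e.ts
    FROM mysql.shop.users u
    JOIN dfs.`/data/events.json` e
      ON u.id = e.user_id
    """
    for row in drill_query(sql):
        print(row)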

When I reach for Spark, it's typically for RDDs (resilient distributed datasets). RDDs make it easy to process a lot of data quickly. Spark also has a bunch of libraries for ML and streaming. Drill doesn't process data at all; it just gets you access to that data. You could use Drill to pull data into Spark, TensorFlow, PySpark, Tableau, etc.
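
A sketch of that hand-off, again assuming the drill_query helper from the first example is in scope (the path and column names are placeholders):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("drill-to-spark").getOrCreate()

    # Use Drill purely as the access layer: fetch a result set over REST,
    # then hand it to Spark as a DataFrame for distributed processing.
    # (Fine for samples; for large results Drill's JDBC driver is a better fit.)
    rows = drill_query("SELECT user_id, amount FROM dfs.`/data/orders.json`")
    df = spark.createDataFrame([Row(**r) for r in rows])
    df.groupBy("user_id").count().show()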


Apache Spark SQL:

  • You need to write code (Scala, Java, or Python) to access and process the data.
  • SQL queries can be executed against DataFrames (see the sketch after this list).
  • Execution can be done in a distributed fashion (cluster).
  • Almost every data store has a Spark driver or connector.
  • Used for massively parallel computing and data analytics.
  • Supports stream processing.
  • Has a bigger support community.
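
A minimal PySpark sketch of the first two points (the file path and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Load data into a DataFrame; Spark infers the schema for JSON.
    df = spark.read.json("/data/events.json")

    # Register the DataFrame as a temporary view so SQL can reference it.
    df.createOrReplaceTempView("events")

    # SQL executed against the DataFrame; on a cluster this runs distributed.
    spark.sql("""
        SELECT event_type, COUNT(*) AS n
        FROM events
        GROUP BY event_type
        ORDER BY n DESC
    """).show()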

Apache Drill:

  • No need to write code; Drill explores the data source and builds its own data catalog (see the sketch after this list).
  • Easier to use: just SQL.
  • Execution can be done in a distributed fashion (cluster).
  • It can read data from many sources, such as MongoDB, Parquet files, MySQL, and any JDBC database.
  • Used for ad-hoc data exploration.
  • It does not support stream processing.
  • It has a smaller support community.
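
To make the "no code, just SQL" point concrete, here is a hedged sketch of browsing the catalog Drill builds on its own (again assuming the drill_query helper from the first example):

    # Every enabled storage plugin shows up as a schema in Drill's catalog,
    # with no table definitions required up front.
    print(drill_query("SHOW SCHEMAS"))

    # Standard INFORMATION_SCHEMA queries work as well; TABLES needs
    # backticks because it is a reserved word in Drill.
    print(drill_query(
        "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.`TABLES` LIMIT 10"
    ))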