Difference between Pig and Hive? Why have both? [closed]

hadoop

Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when you would use an SQL-like language such as Hive rather than Pig. He makes a very convincing case for the usefulness of a procedural language like Pig (vs. declarative SQL) and its utility to dataflow designers.


Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used at Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned AFAIK, so the difference between the two projects is very clear.
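
To make the "embedded within SQL clauses" point concrete, here is a minimal sketch of Hive's TRANSFORM clause, which pipes rows through a user script; the table, columns, and script name are all hypothetical:

    -- Ship a user script to the cluster, then pipe rows through it.
    ADD FILE parse_logs.py;

    SELECT TRANSFORM (ip, request)
           USING 'python parse_logs.py'  -- any executable the user chooses
           AS (ip STRING, page STRING)
    FROM raw_logs;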

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like MicroStrategy. Hive has an ODBC/JDBC driver (a work in progress) that should allow this to happen in the near future. It is also beginning to add support for indexes, which should enable the drill-down queries common in such environments.
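
For a sense of what that index work enables, here is a hedged sketch of Hive's index DDL; the table and column are hypothetical, and the feature was still in progress when this was written:

    -- Build a compact index on a hypothetical sales table.
    CREATE INDEX sales_date_idx
    ON TABLE sales (sale_date)
    AS 'COMPACT'
    WITH DEFERRED REBUILD;

    -- Populate the index in a separate step.
    ALTER INDEX sales_date_idx ON sales REBUILD;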

Finally (this is not directly pertinent to the question), Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (a key-value store like those found in the guts of most RDBMSs), and the HadoopDB project has used Hive to query a federated RDBMS tier.
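
As an illustration of querying another store, this is a minimal sketch of mapping an HBase table into Hive via its HBase storage handler; the table and column names are invented:

    -- Expose a hypothetical HBase "users" table to HiveQL queries.
    CREATE EXTERNAL TABLE hbase_users (key STRING, name STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name")
    TBLPROPERTIES ("hbase.table.name" = "users");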


I found this the most helpful (though it's a year old): http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

It specifically talks about Pig vs Hive and when and where they are employed at Yahoo. I found this very insightful. Some interesting notes:

On incremental changes/updates to data sets:

Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach. This will take only a few minutes. Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case.
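
Here is a hedged Pig Latin sketch of that incremental pattern (all paths and schemas are hypothetical): join only the new day's records, then union them with the previous full result.

    -- Yesterday's full join result and today's new records.
    prev  = LOAD 'joined/2009-06-01' AS (user:chararray, url:chararray, region:chararray);
    delta = LOAD 'clicks/2009-06-02' AS (user:chararray, url:chararray);
    users = LOAD 'users' AS (user:chararray, region:chararray);

    -- Join only the incremental data, not the whole history.
    joined   = JOIN delta BY user, users BY user;
    new_rows = FOREACH joined GENERATE delta::user AS user, delta::url AS url, users::region AS region;

    -- Combine with the previous full result and store it for tomorrow's run.
    full = UNION prev, new_rows;
    STORE full INTO 'joined/2009-06-02';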

On using other tools via streaming:

Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.
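
A minimal Pig Latin sketch of that streaming workflow, assuming a hypothetical classify.py script the researcher has already debugged locally:

    -- Ship the user's script with the job and pipe records through it.
    DEFINE classify `python classify.py` SHIP('classify.py');

    docs   = LOAD 'documents' AS (id:chararray, text:chararray);
    tagged = STREAM docs THROUGH classify AS (id:chararray, label:chararray);
    STORE tagged INTO 'labeled';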

On using Hive for data warehousing:

In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.

The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as ODBC.
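
The kind of analyst query this targets is plain aggregate SQL; the table and columns below are invented for illustration:

    -- A typical analyst-style aggregate expressed in HiveQL.
    SELECT region, COUNT(*) AS visits
    FROM page_views
    WHERE view_date = '2009-06-02'
    GROUP BY region
    ORDER BY visits DESC;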