
Pentaho and Hadoop


An ETL is a tool to Extract data, Transform it (join, enrich, filter, ...) and Load the result into another data store. Good ETL tools are visual, data-store agnostic and easy to automate.

Hadoop is a data store distributed across a cluster of machines, plus software to process that distributed data. Data transformation is specialized into a few elementary operations, such as (but not only) Map-Reduce, which can be optimized for these usually massive amounts of data.
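
To give a concrete sense of what a Map-Reduce operation looks like, here is the classic word-count job written against the standard Hadoop Java API (a minimal sketch; the input and output paths are passed as arguments and assumed to be HDFS directories):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts emitted for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The point is that each mapper runs locally on the node holding its slice of the data, which is why this style of transformation scales to very large data sets.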

Pentaho Data Integration (PDI) has connectors to Hadoop systems which are easy to set up and tune. So the best strategy is to set up a Hadoop cluster as the data store and manipulate it through PDI.


Pentaho PDI is a tool for creating, managing, running and monitoring ETL workflows. It can work with Hadoop, RDBMSs, queues, files, etc. Hadoop is a platform for distributed computation (the Map-Reduce framework, HDFS, etc.). Many tools can run on Hadoop, or can connect to Hadoop to use its data and run processes.
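
To illustrate what "running an ETL workflow" means outside the Spoon designer, a transformation built in PDI can also be launched programmatically through the Kettle Java API. This is a rough sketch; the .ktr path is a placeholder and error handling is kept minimal:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle engine (loads step and plugin definitions)
            KettleEnvironment.init();

            // Load a transformation designed in Spoon (placeholder path)
            TransMeta transMeta = new TransMeta("/path/to/my_transformation.ktr");

            // Execute it and wait until all steps have finished
            Trans trans = new Trans(transMeta);
            trans.execute(null);        // null = no extra command-line arguments
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }

In practice you would more often schedule the same .ktr/.kjb files with the pan/kitchen command-line tools, but the idea is the same: the workflow you design visually is just an artifact that any scheduler or program can run.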

Pentaho PDI can connect to Hadoop using its own connectors and read/write data. You can start a Hadoop job from PDI; it can also process data by itself inside a transformation flow and store or send results to HDFS, an RDBMS, a queue, email, etc. Of course you can invent your own tool for ETL workflows or simply use bash+Hive, etc., but PDI allows ETL processing in a unified way that does not depend on the data sources and targets. Pentaho also has great visualization.
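
To make the "store results to HDFS" part concrete, this is roughly what happens under the hood when a step writes a file to HDFS, expressed with the plain Hadoop FileSystem API (a sketch; the NameNode URL and target path are placeholders, and in PDI this is handled for you by the Hadoop cluster connection and a step such as Hadoop File Output):

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Placeholder NameNode address; in PDI this comes from the cluster connection you configure
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path out = new Path("/data/out/result.csv");  // placeholder target path
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(out, true), StandardCharsets.UTF_8))) {
                writer.write("id;amount");
                writer.newLine();
                writer.write("1;42.0");
                writer.newLine();
            }
            fs.close();
        }
    }

The value of PDI is that you never have to write this kind of plumbing yourself: the same transformation can target HDFS, a database table or a plain file just by swapping the output step.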