
How to save data from hadoop to database using python


I write all my Hadoop MR jobs in Python, but you do not need Python to move the data. Use Sqoop: http://sqoop.apache.org/

Sqoop is an open-source tool for transferring data between relational databases and Hadoop: it can import tables into HDFS for further processing and export HDFS files back into a database. It's also very simple to use. All you need to do is:

  1. Download and configure Sqoop.
  2. Create your MySQL table schema.
  3. Specify the HDFS file name, the result table name, and the column separator.

Read this for more info: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

The advantage of using Sqoop is that we can move our HDFS data into many kinds of stores (MySQL, Derby, Hive, etc.), and back again, with a single command.

For your use case, adapt the following as needed:

mapper.py

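    #!/usr/bin/env python
    import sys

    # Emit "campaignID:adGroupID:" for every input line that contains a <row ...> element.
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
            # words[1] and words[2] are the first two attributes;
            # take the text between the double quotes of each.
            words = line.split(' ')
            campaignID = words[1].split('"')[1]
            adGroupID = words[2].split('"')[1]
            print("%s:%s:" % (campaignID, adGroupID))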
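
To make the mapper's parsing easier to check locally, here is a minimal sketch of the same logic as a function. The sample line is hypothetical: I am assuming the input looks like `<row campaignID="..." adGroupID="..." />`, with the attribute names guessed from the variable names in the mapper.

```python
def parse_row(line):
    """Return 'campaignID:adGroupID:' for a <row ...> line, or None otherwise."""
    line = line.strip()
    if "<row" not in line:
        return None
    # Split on spaces: words[0] is '<row', words[1] and words[2] are attributes.
    words = line.split(' ')
    campaign_id = words[1].split('"')[1]
    ad_group_id = words[2].split('"')[1]
    return "%s:%s:" % (campaign_id, ad_group_id)


# Hypothetical sample input, not taken from the real data set.
sample = '<row campaignID="123" adGroupID="456" />'
print(parse_row(sample))  # 123:456:
```

Note that splitting on spaces only works while attribute values contain no spaces themselves; if yours can, a real XML parser is probably the safer choice.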

streaming command

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
        -file /path/to/mapper.py -mapper /path/to/mapper.py \
        -file /path/to/reducer.py -reducer /path/to/reducer.py \
        -input /user/input -output /user/output

mysql

    create database test;
    use test;
    create table testtable (
        a varchar(100),
        b varchar(100)
    );

sqoop

    ./sqoop export --connect jdbc:mysql://localhost/test --username root \
        --table testtable --export-dir /user/output \
        --input-fields-terminated-by ':'
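
If you would rather trigger the export from a Python script (for example right after the streaming job finishes), a rough sketch is to build the same command as an argument list and hand it to `subprocess`. The helper names `build_sqoop_export` and `run_export` are my own, and the default `./sqoop` path assumes you run it from Sqoop's bin directory as in the command above; the flags are the ones already used in that command.

```python
import subprocess


def build_sqoop_export(connect, username, table, export_dir,
                       sep=":", sqoop_bin="./sqoop"):
    """Build the argument list for a `sqoop export` call (hypothetical helper)."""
    return [
        sqoop_bin, "export",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", sep,
    ]


def run_export(cmd):
    """Run the command; raises CalledProcessError if sqoop exits non-zero."""
    subprocess.check_call(cmd)


# Values taken from the MySQL and HDFS examples above.
cmd = build_sqoop_export("jdbc:mysql://localhost/test", "root",
                         "testtable", "/user/output")
```

Calling `run_export(cmd)` then behaves like typing the command in a shell, minus the quoting problems, since each argument is passed separately.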

Notes:

  1. Please change the mapper to suit your needs.
  2. I have used ':' as the column separator in both the mapper and the sqoop command. Change it as needed.
  3. Sqoop tutorials: I personally followed Hadoop: The Definitive Guide (O'Reilly) as well as http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html.


The best place to write your data to the database would be from an OutputFormat. Writing at the reducer level can be done, but it is not the best approach.
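
For reference, reducer-level writing from a Python streaming job could look roughly like the sketch below. It is only an illustration under assumptions: `write_rows` is a made-up helper, the lines are assumed to be in an `a:b:` layout with ':' as separator, and `sqlite3` stands in for MySQL so the snippet runs anywhere. With a MySQL driver you would open the connection differently and, as far as I know, use `%s` instead of `?` as the placeholder.

```python
import sqlite3


def write_rows(conn, lines, sep=":", batch_size=500):
    """Insert 'a:b:' style lines into testtable in batches; return rows written."""
    cur = conn.cursor()
    sql = "INSERT INTO testtable (a, b) VALUES (?, ?)"  # '?' is sqlite3's placeholder
    batch, written = [], 0
    for line in lines:
        fields = line.strip().split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        batch.append((fields[0], fields[1]))
        if len(batch) >= batch_size:
            cur.executemany(sql, batch)
            written += len(batch)
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch.
        cur.executemany(sql, batch)
        written += len(batch)
    conn.commit()
    return written


# Stand-in database; a real job would connect to MySQL here instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (a varchar(100), b varchar(100))")
```

In an actual reducer you would pass `sys.stdin` as `lines`. Keep in mind that every reducer task opens its own connection, and a retried task can insert the same rows twice, which is part of why this is not the preferred route.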

If you had written the mapper and reducer in Java, you could have made use of DBOutputFormat.

So you can write a custom OutputFormat that takes the reducer's (key, value) output and sinks it to MySQL.

Read this tutorial on the Yahoo Developer Network on how to write a custom OutputFormat.