
How to load Twitter data from HDFS using Pig?


You need to register the jar below in Pig; it contains the class you are trying to access.

elephant-bird-pig-4.1.jar

EDITED to show the full steps:

REGISTER '/home/hdfs/json-simple-1.1.jar';
REGISTER '/home/hdfs/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/home/hdfs/elephant-bird-pig-4.1.jar';
load_tweets = LOAD '/user/hdfs/twittes.txt' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
dump load_tweets;

I ran the steps above on my local cluster and they work fine, so you need to register these jars before running your load.
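
To check the load without printing every tweet, a sketch like the following may help. It assumes the load_tweets relation defined above; sample_tweets and the limit of 5 are introduced here only for illustration.

```pig
-- Show the schema Pig has assigned to the loaded relation.
DESCRIBE load_tweets;

-- Keep only a few records so that DUMP does not print the whole file.
sample_tweets = LIMIT load_tweets 5;
DUMP sample_tweets;
```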


You need to register three jar files, as shown in the blog. Each jar has its own purpose.

elephant-bird-hadoop-compat-4.1.jar: utilities for dealing with Hadoop incompatibilities between 1.x and 2.x.

elephant-bird-pig-4.1.jar: the JSON loader for Pig; it loads each JSON record into Pig.

json-simple-1.1.1.jar: one of the JSON parsers available for Java.

After registering the jars, you can load the tweets with the following Pig script.

load_tweets = LOAD '/user/flume/tweets/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

After loading the tweets, you can view them by dumping the relation:

dump load_tweets;
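
Since each tweet is loaded as a single map, individual fields can be pulled out with Pig's map lookup operator (#). The following is a minimal sketch, assuming the load_tweets relation from above and assuming the tweet JSON contains top-level 'id' and 'text' keys; tweet_fields and first_tweets are names made up for this example.

```pig
-- Assumes load_tweets was loaded with JsonLoader('-nestedLoad') AS myMap.
-- 'id' and 'text' are assumed keys; adjust them to the keys present in your JSON.
tweet_fields = FOREACH load_tweets GENERATE myMap#'id' AS id, myMap#'text' AS text;

-- Print a handful of rows rather than the full data set.
first_tweets = LIMIT tweet_fields 10;
DUMP first_tweets;
```

A lookup on a key that is missing from a record yields null, so a misspelled key shows up as empty values rather than as an error.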