
parsing json input in hadoop java


It seems you forgot to embed the JSON library in your Hadoop job jar. You can have a look here to see how to build your job with the library: http://tikalk.com/build-your-first-hadoop-project-maven


There are several ways to use external JARs with your MapReduce code:

  1. Include the referenced JAR in the lib subdirectory of the submittable JAR: the job will unpack the JAR from this lib subdirectory into the jobcache on the respective TaskTracker nodes and point your tasks to this directory so the JAR is available to your code. If the JARs are small, change often, and are job-specific, this is the preferred method. This is what @clement suggested in his answer.

  2. Install the JAR on the cluster nodes. The easiest way is to place the JAR in the $HADOOP_HOME/lib directory, since everything in that directory is picked up when a Hadoop daemon starts. Note that a daemon restart is needed for this to take effect.

  3. The TaskTrackers will be using the external JAR, so you can provide it by modifying the HADOOP_TASKTRACKER_OPTS option in the hadoop-env.sh configuration file and making it point to the jar. The jar must be present at the same path on every node where a TaskTracker runs.

  4. Include the JAR in the “-libjars” command-line option of the hadoop jar … command. The jar will be placed in the distributed cache and made available to all of the job’s task attempts. Your MapReduce code must use GenericOptionsParser. For more details, read this blog post.
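As a sketch of option 4 (the job class, jar names, and paths below are hypothetical; GenericOptionsParser, typically wired in via ToolRunner, consumes the -libjars flag before your own argument handling sees it):

```
# Hypothetical names throughout; -libjars ships the jar through the
# distributed cache to every task attempt of the job.
hadoop jar myjob.jar com.example.MyJob \
    -libjars /local/path/json.jar \
    /input/path /output/path
```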

Comparison:

  • #1 is a legacy method and is discouraged because unpacking the JAR into the jobcache on every node carries a significant performance cost.

  • #2 and #3 are fine for private clusters, but poor practice in general, as you cannot expect end users to make changes to the cluster.

  • #4 is the recommended option.

Read the main post from Cloudera.