Get input file name in streaming hadoop program

python input streaming hadoop filesplitting

According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
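The renaming rule described above can be sketched as a small helper. This is only an illustration: the function names streaming_env_name and get_job_property are hypothetical, and the regular expression is an assumption based on the quoted description (every non-alphanumeric character becomes an underscore), not Hadoop's own code.

```python
import os
import re


def streaming_env_name(property_name):
    """Approximate the variable name Streaming derives from a property name.

    Assumption: every character other than an ASCII letter or digit
    is replaced by an underscore, as the quoted passage describes.
    """
    return re.sub(r"[^0-9A-Za-z]", "_", property_name)


def get_job_property(property_name, default=None):
    """Look up a job configuration property in the environment.

    Returns `default` when the variable is not set, for example when
    the script is run outside of a Streaming job.
    """
    return os.environ.get(streaming_env_name(property_name), default)
```

Using os.environ.get instead of indexing avoids a KeyError when the script is tested locally, where these variables are normally absent.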

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra
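Inside the script, a variable passed with -cmdenv could be read like any other environment variable. A minimal sketch, assuming the MAGIC_PARAMETER name from the example above; the function name magic_parameter is hypothetical:

```python
import os


def magic_parameter(default=None):
    """Return the value passed via -cmdenv MAGIC_PARAMETER=...

    Falls back to `default` when the variable was not set on the
    Streaming command line.
    """
    return os.environ.get("MAGIC_PARAMETER", default)
```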


You can get the map input file name by reading the mapreduce_map_input_file environment variable (new) or the map_input_file environment variable (deprecated).

Note:
Both environment variable names are case-sensitive; all letters are lower-case.
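A mapper might try the newer name first and fall back to the deprecated one. The following is a sketch under that assumption; the helper names current_input_file and tag_lines are hypothetical, and the value is returned exactly as found in the environment.

```python
import os


def current_input_file(environ=os.environ):
    """Return the map input file name, or None if neither variable is set.

    The newer lower-case name is checked before the deprecated one.
    """
    for name in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(name)
        if value:
            return value
    return None


def tag_lines(lines, path):
    """Yield each input line prefixed with the file name and a tab."""
    for line in lines:
        stripped = line.rstrip("\n")
        yield f"{path}\t{stripped}"
```

In a mapper script this could be driven with something like `for out in tag_lines(sys.stdin, current_input_file() or "unknown"): print(out)`, after importing sys.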


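In Hadoop 2.x, the newer variable is mapreduce_map_input_file (from the mapreduce.map.input.file property). It is sometimes written as MAPREDUCE_MAP_INPUT_FILE, but the name is case-sensitive, as noted above.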
