Hadoop writing to a new file from mapper

Are you sure you are using a single mapper? Hadoop creates a number of mappers very close to the number of input splits (more details).

The concept of the input split is important as well: it means very big data files are split into several chunks, each chunk assigned to a mapper. Thus, unless you are totally sure only one mapper is being used, you won't be able to control which part of the file you are working on, and you will not be able to maintain any kind of global index.
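As a minimal sketch (the class name and output types here are just placeholders, not anything from your job), each mapper can at least inspect which file and byte range its own split covers through the standard FileSplit API, but it still only sees that one chunk:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Placeholder mapper: logs which file and byte range this task was assigned.
    public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            FileSplit split = (FileSplit) context.getInputSplit();
            // This mapper only processes this file/offset range, nothing more.
            System.out.println("File: " + split.getPath()
                    + ", start offset: " + split.getStart()
                    + ", length: " + split.getLength());
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key is the byte offset of the line within its own file,
            // not a global index across all splits or files.
            context.write(new Text(value), new LongWritable(1));
        }
    }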

That being said, using a single mapper in MapReduce is much the same as not using MapReduce at all :) Maybe the mistake is mine and I'm wrong to assume you have only one file to be analyzed; is that the case?

If you have several big data files the scenario changes, and it could make sense to dedicate a single mapper to each file; to do that you will have to create your own InputFormat (for instance by extending FileInputFormat or TextInputFormat) and override its isSplitable method so it always returns false.
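A rough sketch of that (the class name is made up; isSplitable itself is the standard FileInputFormat hook):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Every input file becomes exactly one split, so each file is
    // processed end-to-end by a single mapper.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split, regardless of file size or block size
        }
    }

You would then register it on the job with job.setInputFormatClass(NonSplittableTextInputFormat.class).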