
Hadoop inode to path


This is a bit late, but I am looking into this now and stumbled across your question.

First of all, a bit of context.

(I am working with Hadoop 2.6)

The NameNode is responsible for maintaining the INodes, which are the in-memory representation of the (virtual) filesystem structure, while the blocks themselves are maintained by the DataNodes. I believe there are several reasons for the NameNode not to keep the rest of the information, such as links to the DataNodes where the data is stored, within each INode:

  • It would require more memory to represent all that information (memory is the resource that actually limits the number of files that can be written to an HDFS cluster, since the whole structure is kept in RAM for faster access).
  • It would put more load on the NameNode, for example when a file is moved from one node to another, or when a new node is added and the file needs to be replicated to it. Each time that happened, the NameNode would need to update its state.
  • Flexibility: the INode is an abstraction, so adding such a link would bind it to a specific technology and communication protocol.

Now coming back to your questions:

  1. The fsimage file already contains the mapping to the HDFS path. If you look more carefully at the XML, each INode, regardless of its type, has an ID (in your case it is 37749299). Further down in the file you can find the <INodeDirectorySection> section, which holds the mapping between parents and children, and it is this ID field that is used to express the relation. Using the <name> element, you can easily reconstruct the structure you see, for example, in the HDFS explorer.
  2. Furthermore, there is the <blocks> section, which contains the block ID (in your case it is 1108336288). If you look into the Hadoop sources, you can find the method idToBlockDir in DatanodeUtil, which gives a hint about how the files are organized on disk and how the block ID is mapped to a directory.
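To illustrate the first point, here is a minimal sketch of how a path could be rebuilt once the IDs, names and parent/child pairs have been read out of the XML. The class InodePathResolver and its methods are hypothetical names of mine, not part of Hadoop, and the XML parsing itself is left out because the exact element layout may differ between versions. The inode IDs 16385 and 16386 and the names "user" and "data.txt" in main are made up for the example; only 37749299 comes from the question.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Hadoop class: rebuilds a path from inode data.
public class InodePathResolver {
    // inode ID -> name, as read from the inode entries (root has an empty name)
    private final Map<Long, String> names = new HashMap<>();
    // child inode ID -> parent inode ID, as read from <INodeDirectorySection>
    private final Map<Long, Long> parents = new HashMap<>();

    public void addInode(long id, String name) {
        names.put(id, name);
    }

    public void addChild(long parentId, long childId) {
        parents.put(childId, parentId);
    }

    // Walks the parent links up to the root and joins the names with '/'.
    // Assumes the parent/child data contains no cycles.
    public String pathOf(long inodeId) {
        Deque<String> parts = new ArrayDeque<>();
        Long current = inodeId;
        while (current != null) {
            String name = names.get(current);
            if (name == null) {
                throw new IllegalArgumentException("Unknown inode " + current);
            }
            if (!name.isEmpty()) {
                parts.addFirst(name);
            }
            current = parents.get(current);
        }
        return "/" + String.join("/", parts);
    }

    public static void main(String[] args) {
        InodePathResolver resolver = new InodePathResolver();
        resolver.addInode(16385L, "");            // illustrative root entry
        resolver.addInode(16386L, "user");        // made-up directory
        resolver.addInode(37749299L, "data.txt"); // made-up file name
        resolver.addChild(16385L, 16386L);
        resolver.addChild(16386L, 37749299L);
        System.out.println(resolver.pathOf(37749299L)); // prints /user/data.txt
    }
}
```

Storing child-to-parent links rather than parent-to-children keeps the lookup a simple upward walk, which is all that is needed to go from one inode ID to its path.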

In idToBlockDir, the block ID is shifted right twice (by 16 and by 8 bits), and each result is masked with 0xff to give a value between 0 and 255:

int d1 = (int)((blockId >> 16) & 0xff);
int d2 = (int)((blockId >> 8) & 0xff);

The final directory is then built from the two values:

String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP + DataStorage.BLOCK_SUBDIR_PREFIX + d2;

The block itself is stored in that directory, in a file named using the blk_<block_id> format.
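Putting these pieces together, here is a small standalone sketch of the same arithmetic that does not depend on the Hadoop classes. BlockPathSketch and relativeBlockPath are my own hypothetical names; the "subdir" prefix is what I assume DataStorage.BLOCK_SUBDIR_PREFIX holds, and the resulting path should be relative to the directory where the DataNode keeps its finalized blocks. This follows the 2.6 code quoted above; other releases may use a different mask.

```java
// Hypothetical standalone sketch, not a Hadoop class.
public class BlockPathSketch {
    // Assumed value of DataStorage.BLOCK_SUBDIR_PREFIX
    private static final String SUBDIR_PREFIX = "subdir";
    // Block files are named blk_<block_id>
    private static final String BLOCK_FILE_PREFIX = "blk_";

    // Returns the block file location relative to the finalized-blocks directory.
    public static String relativeBlockPath(long blockId) {
        int d1 = (int) ((blockId >> 16) & 0xff); // bits 16..23 of the block ID
        int d2 = (int) ((blockId >> 8) & 0xff);  // bits 8..15 of the block ID
        return SUBDIR_PREFIX + d1 + "/" + SUBDIR_PREFIX + d2
                + "/" + BLOCK_FILE_PREFIX + blockId;
    }

    public static void main(String[] args) {
        // Block ID from the question: 1108336288 == 0x420FDEA0
        System.out.println(relativeBlockPath(1108336288L));
        // prints subdir15/subdir222/blk_1108336288
    }
}
```

For the block ID from the question, 0x0F gives 15 and 0xDE gives 222, so the file should be found under subdir15/subdir222.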

I am not a Hadoop expert, so if someone who understands this better could correct any flaws in my logic, please do so. Hope this helps.