Threads in Hadoop Threads in Hadoop hadoop hadoop

Threads in Hadoop


1) I have 2 CPU's, As per my understanding my system will be able to serve 2 threads at a time, what is the use in setting the datanode / namenode handler to higher value like '10'?

Most of the time, these threads will be blocked(asleep) waiting IO operation. Assume on average 1 thread is asleep for 99.9% of the time, then it only consumes 0.1% cpu. You can easily run 1000 threads at the same time. In production, the threads configuration should be based on the cluster setup (pysical cores per node, disk throughput, network throughput, workload, etc.) If you are not sure, just use the default values.

2) What is the difference between handler count and maximum transfer thread both are used for processing?

dfs.datanode.handler.count is the handler threads for ClientDatanodeProtocol, which is used for client/DN RPC communicates information about block recovery meta info. The message size is small and the transfer is fast, the handler will be idle for most of the time, so we don't need much handlers. We can easily reuse the idle one. So the default value is 10 which is quite smaller than transfer.threads.

dfs.datanode.max.transfer.threads is the number of DataXceiver threads, which is used for transfering blocks via the DTP (data transfer protocol). The block data is big and the transfer takes some time. 1 thread will be served for one block reading. only until the whole block is transferred, the thread can be reused. If there's many clients request block at the same time, we need more threads. For each write connection, there will be 2 threads. So this number should be larger for write bound applications.

Actually DataXceiver threads will be blocked waiting for reading from disk or waiting for sending data through interface. So it doesn't consume much cpu except data checksums computation.