
HDFS performance for small files


HDFS is really not designed for many small files.

For each new file you read, the client has to talk to the namenode, which gives it the location(s) of the block(s) of the file, and then the client streams the data from the datanode.
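Just to make that concrete, here is roughly what a client-side read looks like with the standard FileSystem API (the path comes from the command line, and the cluster config is whatever sits on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // fs.defaultFS points at the namenode

        // open() is the namenode round trip: the returned stream already knows
        // which datanodes hold each block of the file.
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            // The actual bytes are then streamed from the datanodes
            // (or read locally if a replica happens to be on this machine).
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```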

Now, in the best case, the client does this once, and then finds that it is the machine with the data on it, and can read it directly from disk. This will be fast: comparable to direct disk reads.

If it's not the machine that has the data on it, then it must stream the data over the network. Then you are bound by network I/O speeds, which shouldn't be terrible, but still a bit slower than a direct disk read.

However, you're hitting an even worse case, where the overhead of talking to the namenode becomes significant. With only 1 KB files, you are getting to the point where you're exchanging just as much metadata as actual data. The client has to make two separate network exchanges to get the data from each file. Add to this that the namenode is probably getting hammered by all of these different threads, so it might become a bottleneck.
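To put some hedged, purely illustrative numbers on that (the latency and throughput figures below are assumptions, not measurements), the per-file coordination cost easily dwarfs the cost of actually moving 1 KB:

```java
public class SmallFileOverhead {
    public static void main(String[] args) {
        long fileSizeBytes = 1_024;          // 1 KB payload per file
        int rpcRoundTrips = 2;               // one exchange with the namenode, one with a datanode
        double rpcLatencyMs = 0.5;           // assumed per-round-trip latency on a quiet LAN

        double overheadMs = rpcRoundTrips * rpcLatencyMs;          // ~1 ms of coordination per file
        double streamRateBytesPerMs = 100_000;                     // ~100 MB/s sequential read, for comparison
        double transferMs = fileSizeBytes / streamRateBytesPerMs;  // ~0.01 ms of actual data movement

        System.out.printf("coordination : %.2f ms%n", overheadMs);
        System.out.printf("data transfer: %.4f ms%n", transferMs);
        System.out.printf("overhead is ~%.0fx the transfer time%n", overheadMs / transferMs);
    }
}
```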

So to answer your question, yes, if you use HDFS for something it's not designed to be used for, it's going to be slow. Merge your small files, and use MapReduce to get data locality, and you'll have much better performance. In fact, because you'll be able to take better advantage of sequential disk reads, I wouldn't be surprised if reading from one big HDFS file was even faster than reading many small local files.


Just to add to what Joe has said, another difference between HDFS and other filesystems is that it keeps disk I/O as low as possible by storing data in large blocks (normally 64 MB or 128 MB), whereas a traditional filesystem uses block sizes on the order of KBs. That is why it is always said that HDFS is good at processing a few large files rather than a large number of small files. The reason is that, although there have been significant advances in components like CPU and RAM in recent times, disk I/O is an area where we are still not nearly as far along. That was the intention behind having such huge blocks (unlike a traditional FS): to keep disk usage as low as possible.
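For what it's worth, the block size is just a configuration knob (dfs.blocksize), and it can also be overridden per file at create time; a minimal sketch, with the 128 MB value and the path purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml:
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Per-file override via the create() overload that takes a block size:
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/big-blocks.dat"),   // hypothetical path
                true,                              // overwrite
                4096,                              // io buffer size
                (short) 3,                         // replication factor
                128L * 1024 * 1024)) {             // block size in bytes
            out.writeUTF("payload goes here");
        }
    }
}
```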

Moreover, if the block size is too small, we end up with a greater number of blocks, which means more metadata. This may again degrade performance, because more of that information needs to be loaded into the namenode's memory. Each block, which is treated as an object in HDFS, has about 200 bytes of metadata associated with it. If you have many small blocks, that just inflates the metadata and you might end up with RAM issues on the namenode.
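A quick back-of-envelope, using the per-object figure quoted above (the file count and objects-per-file are assumptions for illustration), shows how fast this adds up:

```java
public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 100_000_000L;    // 100 million small files
        long objectsPerFile = 2;      // roughly one file entry plus one block entry each
        long bytesPerObject = 200;    // metadata per namenode object, as quoted above

        long heapBytes = files * objectsPerFile * bytesPerObject;
        System.out.printf("~%.1f GB of namenode heap just for metadata%n",
                heapBytes / 1e9);     // ~40 GB for this example
    }
}
```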

There is a very good post on Cloudera's blog that talks about this same issue. You can visit it here.


Let's try to understand the limits and see when we hit them:
a) We need the namenode to give us information about where the files sit. I would assume it can handle on the order of thousands of such requests per second. More information is here: https://issues.apache.org/jira/browse/HADOOP-2149. Assuming 10,000 requests per second, that works out to roughly 10 MB per second for 1 KB files (somehow you got more...); see the back-of-envelope sketch after this list.
b) The overhead of HDFS itself. This overhead is mostly in latency, not throughput. HDFS can be tuned to serve many files in parallel; HBase does exactly this, and we can take settings from the HBase tuning guides. The real question here is how many datanodes you need.
c) Your LAN. You move data over the network, so you might hit the 1 Gb Ethernet throughput limit. (I think that is what you got.)
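A minimal sketch of that back-of-envelope (the request rate is just the assumption from (a); the Ethernet figure is 1 Gbit/s converted to bytes):

```java
public class SmallFileLimits {
    public static void main(String[] args) {
        // (a) namenode metadata rate, assumed ~10,000 lookups per second
        double namenodeRequestsPerSec = 10_000;
        double fileSizeBytes = 1_024;                     // 1 KB files
        double metadataBoundMBps = namenodeRequestsPerSec * fileSizeBytes / 1e6;

        // (c) 1 Gb Ethernet ceiling
        double lanBoundMBps = 1_000_000_000.0 / 8 / 1e6;  // = 125 MB/s

        System.out.printf("namenode-bound throughput for 1 KB files: ~%.0f MB/s%n", metadataBoundMBps);
        System.out.printf("1 Gb Ethernet ceiling:                   ~%.0f MB/s%n", lanBoundMBps);
        // With 1 KB files you hit the namenode limit long before the wire limit.
    }
}
```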

I also have to agree with Joe: HDFS is not built for this scenario, and you should either use another technology (like HBase, if you like the Hadoop stack) or pack the files together (optionally compressed), for example into SequenceFiles.
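A minimal sketch of the SequenceFile idea, assuming filename -> raw bytes records and a made-up output path; you pack the small files once, then read the single big file sequentially:

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("/data/packed.seq");   // hypothetical output path on HDFS

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            // args are the local small files to pack; key = name, value = raw contents
            for (String name : args) {
                byte[] data = Files.readAllBytes(new File(name).toPath());
                writer.append(new Text(name), new BytesWritable(data));
            }
        }
    }
}
```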

Regarding reading bigger files from HDFS: run the DFSIO benchmark and that will give you your number.
At the same time, an SSD on a single host can perfectly well be a solution too.