
Configuring Hadoop logging to avoid too many log files


I had this same problem. Set the environment variable "HADOOP_ROOT_LOGGER=WARN,console" before starting Hadoop.

export HADOOP_ROOT_LOGGER="WARN,console"
hadoop jar start.jar
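If you want the quieter logger to stick rather than exporting it in every shell, one option is to put the export in hadoop-env.sh; a minimal sketch, assuming the file lives at conf/hadoop-env.sh (or etc/hadoop/hadoop-env.sh on newer releases):

# Sketch: make the quieter root logger the default for Hadoop commands.
# File location is version-dependent (conf/hadoop-env.sh or etc/hadoop/hadoop-env.sh).
export HADOOP_ROOT_LOGGER="WARN,console"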


Unfortunately, there isn't a configurable way to prevent that. Every task for a job gets its own directory under history/userlogs, which holds that task's stdout, stderr, and syslog output files. The retain-hours setting (mapred.userlog.retain.hours) helps keep those from accumulating indefinitely, but you'd have to write a decent log rotation tool to auto-tar them.

We had this problem too when we were writing to an NFS mount, because all of the nodes shared the same history/userlogs directory. That meant a single job with 30,000 tasks was enough to overwhelm the filesystem. Logging locally is really the way to go once your cluster actually starts processing a lot of data.

If you are already logging locally and still manage to run 30,000+ tasks on one machine in less than a week, then you are probably creating too many small input files, which makes each job spawn far more mappers than it needs (by default, every file gets at least one mapper).
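As for the "auto-tar" remark above, a rough sketch of a cron-driven cleanup, assuming a hypothetical local userlogs path and a 7-day retention window:

#!/bin/sh
# Sketch only: archive task log directories older than 7 days, then remove the originals.
# USERLOGS and ARCHIVE are placeholder paths; adjust to your own log layout.
USERLOGS=/var/log/hadoop/userlogs
ARCHIVE=/var/log/hadoop/userlogs-archive
mkdir -p "$ARCHIVE"
find "$USERLOGS" -mindepth 1 -maxdepth 1 -type d -mtime +7 | while read dir; do
  name=$(basename "$dir")
  tar czf "$ARCHIVE/$name.tar.gz" -C "$USERLOGS" "$name" && rm -rf "$dir"
done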


Configuring Hadoop to use log4j and setting

log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10

as described on this wiki page doesn't work?
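For reference, a complete rolling-file appender stanza would look roughly like this (FILE_AP1 is just the appender name from the snippet above; the file path and layout are placeholders):

# Sketch of a rolling appender in log4j.properties; names and paths are illustrative.
log4j.rootLogger=INFO, FILE_AP1
log4j.appender.FILE_AP1=org.apache.log4j.RollingFileAppender
log4j.appender.FILE_AP1.File=${hadoop.log.dir}/hadoop.log
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
log4j.appender.FILE_AP1.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_AP1.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n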

Looking at the LogLevel source code, it seems Hadoop uses Commons Logging, which will try log4j by default and fall back to the JDK logger if log4j is not on the classpath.

Btw, it's possible to change log levels at runtime; take a look at the commands manual.
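For example, using the daemonlog command described there (the host, port, and class name below are placeholders for your own TaskTracker):

# Check and change a running daemon's log level without restarting it.
hadoop daemonlog -getlevel tasktracker-host:50060 org.apache.hadoop.mapred.TaskTracker
hadoop daemonlog -setlevel tasktracker-host:50060 org.apache.hadoop.mapred.TaskTracker WARN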