Delete files older than x days on hadoop


Here is what I use in Bash; you may try it:

For example, to grep all files modified in the first eight months of 2016 (change the grep regex pattern to suit your needs):

hadoop fs -ls -R <location> | grep '2016-0[1-8]' | awk '{print $8}'

Delete files:

hadoop fs -rm -r `hadoop fs -ls -R <location> | grep '2016-0[1-8]' | awk '{print $8}'`
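
If you would rather parameterize the age in days than hardcode a year and month in the regex, here is a minimal sketch of the same ls/awk approach with a computed cutoff date. It assumes GNU date and the standard hadoop fs -ls column layout (modification date in column 6, path in column 8); the echo makes it a dry run:

days=30
cutoff=$(date -d "-${days} days" +%Y-%m-%d)
# Dates in YYYY-MM-DD form compare correctly as plain strings,
# and NF >= 8 skips the "Found N items" header line.
hadoop fs -ls -R <location> | awk -v cutoff="$cutoff" 'NF >= 8 && $6 < cutoff {print $8}' | while read -r path; do
    echo hadoop fs -rm -r "$path"   # drop the echo to actually delete
done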


I figured it out. I know some people don't recommend using ls for this kind of problem, but I am using grep -o to create a new line (so I know what strings to expect), and since I know what the file name pattern is, this works perfectly.

#!/bin/bash
# Split the ls output on newlines only, in case lines contain spaces.
IFS=$'\n'
source_path='/user/'
current_date=$(date +%Y-%m-%d)
# Keep everything from the modification date onward (" 2016-... /user/...").
files_ls=$(hdfs dfs -ls "$source_path" | grep -o " 2[0-9]\{3\}-.*")
for line in $files_ls; do
    last_mod=$(echo "$line" | grep -o "[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}")
    # The leading space is kept on purpose; it separates -skipTrash
    # from the path in the echoed command below.
    file_path=$(echo "$line" | grep -o " /user/.*\.log")
    # Age of the file in whole days.
    time_diff="$(( ($(date --date="$current_date" +%s) - $(date --date="$last_mod" +%s) )/(60*60*24) ))"
    if [ "$time_diff" -ge "8" ]; then
        echo "hdfs dfs -rm -skipTrash$file_path"
    fi
done
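
Note that the script only echoes the hdfs dfs -rm -skipTrash commands so they can be reviewed before anything is deleted. Piping its output to a shell executes them (the script name below is just illustrative):

./delete_old_logs.sh          # dry run: prints the delete commands
./delete_old_logs.sh | bash   # actually run them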