How to make HDFS work in Docker Swarm
The whole mess stems from the interaction between Docker Swarm's overlay networks and how the HDFS namenode keeps track of its datanodes. The namenode records each datanode's IP/hostname based on the datanode's overlay-network address. When an HDFS client asks to read or write, the namenode reports back those overlay-network IPs/hostnames, and the client then contacts the datanodes directly at those addresses. Since the overlay network is not reachable from external clients, any read/write operation fails.
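To see why those reported addresses are dead ends, note that Swarm overlay subnets are carved out of private (RFC 1918) ranges, which are not routable from outside the swarm. A minimal sketch of that check (the sample addresses `10.0.9.3` and `203.0.113.7` are illustrative, not from my cluster):

```shell
#!/bin/sh
# Classify an IPv4 address as RFC 1918 private or publicly routable.
# Swarm overlay subnets default to ranges like 10.0.X.0/24, so the
# datanode addresses the namenode hands back land in the "private" bucket.
is_private() {
  case "$1" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo private ;;
    *) echo public ;;
  esac
}

is_private 10.0.9.3     # overlay-style datanode address
is_private 203.0.113.7  # externally routable address
```

An external client can open a connection to the namenode just fine; it is the second hop, directly to a `private` address like the first one above, that times out.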
The final solution I settled on (after a lot of struggling to get the overlay network to work) was to have the HDFS services use the host network instead. Here's a snippet from the compose file:
```yaml
version: '3.7'

x-deploy_default: &deploy_default
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  restart_policy:
    condition: any
    delay: 5s

services:
  hdfs_namenode:
    deploy:
      <<: *deploy_default
    networks:
      hostnet: {}
    volumes:
      - hdfs_namenode:/hadoop-3.2.0/var/name_node
    command: namenode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0

  hdfs_datanode:
    deploy:
      mode: global
    networks:
      hostnet: {}
    volumes:
      - hdfs_datanode:/hadoop-3.2.0/var/data_node
    command: datanode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0

volumes:
  hdfs_namenode:
  hdfs_datanode:

networks:
  hostnet:
    external: true
    name: host
```
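For completeness, deploying the stack might look like the sketch below. The file name `docker-compose.yml`, the stack name `hdfs`, and the hostname `manager-1.example.com` are assumptions; `PRIMARY_HOST` must be an address that both datanodes and external clients can use to reach the namenode, which with host networking is simply the manager node's real hostname.

```shell
# PRIMARY_HOST is substituted into the compose file from the shell
# environment; set it to the manager's externally reachable hostname.
export PRIMARY_HOST=manager-1.example.com  # assumption: your manager's hostname

# Deploy the stack (file and stack names are assumptions).
docker stack deploy -c docker-compose.yml hdfs

# Confirm the namenode (1 replica) and datanodes (global, one per node) are up.
docker service ls
```

With host networking, the datanodes register with the namenode using host addresses rather than overlay IPs, so the addresses it reports back to clients are the same ones external clients can actually reach.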