How to make HDFS work in docker swarm


The whole mess stems from the interaction between Docker Swarm's overlay networks and the way the HDFS namenode keeps track of its datanodes. The namenode records each datanode's IP/hostname based on the datanode's overlay network address. When an HDFS client requests a read or write, the namenode reports back the IPs/hostnames of the relevant datanodes, and the client then connects to those datanodes directly. Because the reported addresses belong to the overlay network, which external clients cannot reach, any read/write operation fails.
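One way to confirm this failure mode from a client machine is to check whether the datanode address reported by the namenode accepts TCP connections at all. The following is a minimal sketch and not part of the original setup: `can_reach` is a hypothetical helper, and the address and port in the usage comment are assumptions (9866 is the default datanode transfer port in Hadoop 3.x; your configuration may differ).

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable addresses.
        return False


# Hypothetical usage from an external client, with a made-up overlay address:
#   can_reach("10.0.1.5", 9866)
# A False result here, while the same check succeeds from inside the swarm,
# would be consistent with the overlay-address problem described above.
```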

The solution I finally settled on (after a lot of struggling to get the overlay network to work) was to have the HDFS services use the host network instead. Here's a snippet from the compose file:

version: '3.7'

x-deploy_default: &deploy_default
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  restart_policy:
    condition: any
    delay: 5s

services:
  hdfs_namenode:
    deploy:
      <<: *deploy_default
    networks:
      hostnet: {}
    volumes:
      - hdfs_namenode:/hadoop-3.2.0/var/name_node
    command:
      namenode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0
  hdfs_datanode:
    deploy:
      mode: global
    networks:
      hostnet: {}
    volumes:
      - hdfs_datanode:/hadoop-3.2.0/var/data_node
    command:
      datanode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0

volumes:
  hdfs_namenode:
  hdfs_datanode:

networks:
  hostnet:
    external: true
    name: host
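The compose file references `${PRIMARY_HOST}`, so that variable has to be set in the environment when the stack is deployed. Below is a rough sketch of one way to do that from Python. The helper names (`build_deploy`, `deploy`), the compose file name `docker-compose.yml`, and the stack name `hdfs` are all assumptions for illustration, not something the original setup specifies.

```python
import os
import subprocess


def build_deploy(primary_host, compose_file="docker-compose.yml", stack="hdfs"):
    """Return (argv, env) for `docker stack deploy` with PRIMARY_HOST set.

    compose_file and stack are hypothetical defaults; adjust to your setup.
    """
    env = dict(os.environ)              # copy so the caller's environment is untouched
    env["PRIMARY_HOST"] = primary_host  # substituted into ${PRIMARY_HOST} in the compose file
    argv = ["docker", "stack", "deploy", "-c", compose_file, stack]
    return argv, env


def deploy(primary_host, **kwargs):
    """Run the deploy command; raises CalledProcessError if docker exits non-zero."""
    argv, env = build_deploy(primary_host, **kwargs)
    subprocess.run(argv, env=env, check=True)
```

Calling `deploy("node1.example.com")` on a manager node would then deploy the stack with the namenode address pointing at that (made-up) hostname.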