CoreOS & HDFS - Running a distributed file system in Linux Containers/Docker


I'll try to offer two possibilities. I haven't tried either of these myself, so they are mostly suggestions, but they could get you down the right path.

The first option, if you want to run HDFS and it requires device access on the host, would be to run the HDFS daemons in a privileged container that has access to the required host devices (the disks directly). See https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration for information on the --privileged and --device flags.
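Just to illustrate the flags, something like the following could be a starting point. The image name and command are hypothetical (any image with the Hadoop binaries would do); only the Docker flags themselves are real.

```
# --device passes a single host device through to the container...
docker run -d --name datanode1 \
  --device=/dev/sdb:/dev/sdb \
  my-hdfs-image hdfs datanode

# ...or --privileged grants the container access to all host devices.
docker run -d --name datanode1 --privileged \
  my-hdfs-image hdfs datanode
```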

In theory, you could pass the devices to the container that handles disk access, and then use something like --link so the containers can talk to each other. The NameNode would store its metadata on the host using a volume (passed with -v). Though, from the little reading I have done about the NameNode, it seems there isn't a good solution for high availability yet, and it remains a single point of failure.
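Sticking with the same hypothetical image, the wiring might look roughly like this (the paths are made up; the hdfs namenode / hdfs datanode subcommands are the standard Hadoop 2.x daemon entry points):

```
# NameNode: keep its metadata on the host via a volume.
docker run -d --name namenode \
  -v /data/hdfs/name:/hadoop/dfs/name \
  my-hdfs-image hdfs namenode

# DataNode: linked to the NameNode so it can reach it by name,
# with a host directory (on the disk it manages) for block storage.
docker run -d --name datanode1 \
  --link namenode:namenode \
  -v /data/hdfs/data1:/hadoop/dfs/data \
  my-hdfs-image hdfs datanode
```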

The second option to explore, if you are looking for a clustered file system and not HDFS in particular, would be to check out the recent Ceph FS support added to the kernel in CoreOS 471.1.0: https://coreos.com/releases/#471.1.0. You might then be able to use the same privileged-container approach to access host disks and build a Ceph FS cluster. Then you might have a 'data only' container with the Ceph tools installed that mounts a directory on the Ceph FS cluster and exposes it as a volume for other containers to use.
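I haven't tried this, but as a rough sketch of a simpler variant: instead of mounting inside a data-only container (mounts made inside one container generally aren't visible to other containers), you could do the CephFS mount on the CoreOS host itself with the kernel client that release ships, and hand the mounted directory to containers as an ordinary volume. The monitor address and keyring path below are placeholders.

```
# On the CoreOS host (kernel CephFS client, available as of 471.1.0):
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret   # placeholder monitor + key

# Any container can then use the cluster filesystem as a plain volume:
docker run --rm -v /mnt/cephfs:/data busybox ls /data
```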

Again, both of these are only ideas, and I haven't used HDFS or Ceph personally (though I am keeping an eye on Ceph and would like to try something like this soon as a proof of concept).