hadoop No FileSystem for scheme: file
This is a typical case of the maven-assembly
plugin breaking things.
Why this happened to us
Different JARs (hadoop-commons
for LocalFileSystem
, hadoop-hdfs
for DistributedFileSystem
) each contain a different file called org.apache.hadoop.fs.FileSystem
in their META-INFO/services
directory. This file lists the canonical classnames of the filesystem implementations they want to declare (This is called a Service Provider Interface implemented via java.util.ServiceLoader
, see org.apache.hadoop.FileSystem#loadFileSystems
).
When we use maven-assembly-plugin
, it merges all our JARs into one, and all META-INFO/services/org.apache.hadoop.fs.FileSystem
overwrite each-other. Only one of these files remains (the last one that was added). In this case, the FileSystem
list from hadoop-commons
overwrites the list from hadoop-hdfs
, so DistributedFileSystem
was no longer declared.
How we fixed it
After loading the Hadoop configuration, but just before doing anything FileSystem
-related, we call this:
hadoopConfig.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName() ); hadoopConfig.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName() );
Update: the correct fix
It has been brought to my attention by krookedking
that there is a configuration-based way to make the maven-assembly
use a merged version of all the FileSystem
services declarations, check out his answer below.
For those using the shade plugin, following on david_p's advice, you can merge the services in the shaded jar by adding the ServicesResourceTransformer to the plugin config:
<plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>2.3</version> <executions> <execution> <phase>package</phase> <goals> <goal>shade</goal> </goals> <configuration> <transformers> <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/> </transformers> </configuration> </execution> </executions> </plugin>
This will merge all the org.apache.hadoop.fs.FileSystem services in one file
For the record, this is still happening in hadoop 2.4.0. So frustrating...
I was able to follow the instructions in this link: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
I added the following to my core-site.xml and it worked:
<property> <name>fs.file.impl</name> <value>org.apache.hadoop.fs.LocalFileSystem</value> <description>The FileSystem for file: uris.</description></property><property> <name>fs.hdfs.impl</name> <value>org.apache.hadoop.hdfs.DistributedFileSystem</value> <description>The FileSystem for hdfs: uris.</description></property>