Java Spark disable Hadoop discovery
So the final "trick" I've used is a mix of sandev's and Vipul's answers.
Create a 'fake' winutils in your project root:

    mkdir <java_project_root>/bin
    touch <java_project_root>/bin/winutils.exe

(On Windows, where cmd has no touch command, "type NUL > bin\winutils.exe" creates the same empty file.)
Then, in your Spark configuration, provide the 'fake' HADOOP_HOME:

    public SparkConf sparkConfiguration() {
        SparkConf cfg = new SparkConf();
        // Point hadoop.home.dir at the project root, which now holds bin/winutils.exe
        File hadoopStubHomeDir = new File(".");
        System.setProperty("hadoop.home.dir", hadoopStubHomeDir.getAbsolutePath());
        cfg.setAppName("ScalaPython")
           .setMaster("local")
           .set("spark.executor.instances", "2");
        return cfg;
    }
Keep in mind this is still a 'trick' that placates Hadoop discovery; it doesn't actually turn it off.
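The same trick can be scripted from Java itself, e.g. in a test fixture, so nobody has to remember the shell commands. This is a minimal sketch; WinutilsStub and install are hypothetical names, and it assumes the JVM's working directory is the project root:

    import java.io.File;
    import java.io.IOException;

    // Hypothetical helper: creates the empty winutils.exe stub from Java
    // instead of mkdir/touch, then points hadoop.home.dir at it.
    public class WinutilsStub {
        public static void install() throws IOException {
            File home = new File(System.getProperty("user.dir")); // project root when run from the IDE
            File bin = new File(home, "bin");
            if (!bin.isDirectory() && !bin.mkdirs()) {
                throw new IOException("Could not create " + bin);
            }
            // An empty file is enough to satisfy the discovery check.
            new File(bin, "winutils.exe").createNewFile();
            System.setProperty("hadoop.home.dir", home.getAbsolutePath());
        }
    }

Call WinutilsStub.install() once, before sparkConfiguration() runs.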
Spark just needs winutils. Create a folder, for example C:\hadoop\bin, and put winutils.exe in it, then define the environment variable HADOOP_HOME = C:\hadoop and append C:\hadoop\bin to the PATH variable. Then you can use Spark's functionality.
It's not that Spark wants Hadoop to be installed; it just wants that particular file.
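If you want a clearer error than Hadoop's own stack trace when that file is missing, you can check for it up front. A minimal sketch, assuming HADOOP_HOME is set as described above; HadoopHomeCheck and verify are hypothetical names:

    import java.io.File;

    // Hypothetical guard: fails fast with a readable message if the
    // HADOOP_HOME\bin\winutils.exe layout described above is missing.
    public class HadoopHomeCheck {
        public static void verify() {
            String home = System.getenv("HADOOP_HOME");
            if (home == null || home.isEmpty()) {
                throw new IllegalStateException("HADOOP_HOME environment variable is not set");
            }
            File winutils = new File(home, "bin" + File.separator + "winutils.exe");
            if (!winutils.isFile()) {
                throw new IllegalStateException("winutils.exe not found at " + winutils);
            }
        }
    }

Run verify() before creating the SparkContext so the failure points at the real cause.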
First, you have to run the code with spark-submit. Are you doing that? Please stick to that as a first approach, since it runs into fewer library-related issues. Once that works, you can add this to your pom file to be able to run it directly from the IDE (I use IntelliJ, but it should work in Eclipse as well):
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.5</version>
    </dependency>
Second, if it still doesn't work:
Download the winutils file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe.
Create a new directory named bin inside some_other_directory and put winutils.exe in it.

In your code, add this line before creating the context (a fuller sketch follows below):

    System.setProperty("hadoop.home.dir", "full path to some_other_directory");
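Putting those steps together, a minimal runnable sketch looks roughly like this. The path is the same placeholder as above (it must be the directory that contains bin\winutils.exe, not the bin directory itself), and WinutilsExample is a hypothetical class name:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WinutilsExample {
        public static void main(String[] args) {
            // Placeholder path: the directory that CONTAINS bin\winutils.exe.
            // Set it before any Spark/Hadoop class touches the filesystem.
            System.setProperty("hadoop.home.dir", "full path to some_other_directory");

            SparkConf conf = new SparkConf()
                    .setAppName("WinutilsExample")
                    .setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Quick smoke test that the context actually works.
            long count = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
            System.out.println("count = " + count);

            sc.stop();
        }
    }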
Pro tip: switch to Scala. It's not strictly necessary, but that's where Spark feels most at home, and it wouldn't take you more than a day or two to get basic programs running just right.