How can I read in a binary file from hdfs into a Spark dataframe?
So, for anyone who, like me, is just starting out with Spark and stumbles upon binary files, here is how I solved it:
import numpy as np
# import zlib  # only needed if the files are also compressed
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Binary layout of one record: big-endian ints and doubles
dt = np.dtype([('idx_metric', '>i4'), ('idx_resource', '>i4'),
               ('date', '>i4'), ('value', '>f8'), ('pollID', '>i2')])

schema = StructType([StructField('idx_metric', IntegerType(), False),
                     StructField('idx_resource', IntegerType(), False),
                     StructField('date', IntegerType(), False),
                     StructField('value', DoubleType(), False),
                     StructField('pollID', IntegerType(), False)])

filenameRdd = sc.binaryFiles('hdfs://nameservice1:8020/user/*.binary')

def read_array(rdd):
    # output = zlib.decompress(bytes(rdd[1]), 15 + 32)  # in case the file is also zipped
    array = np.frombuffer(bytes(rdd[1])[20:], dtype=dt)  # remove the 20-byte header
    array = array.newbyteorder().byteswap()  # convert big-endian data to native byte order
    return array.tolist()

unzipped = filenameRdd.flatMap(read_array)
bin_df = sqlContext.createDataFrame(unzipped, schema)
And now you can do whatever fancy stuff you want in Spark with your dataframe.
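For example, a quick sanity check on the result (a minimal sketch using the bin_df built above):

bin_df.printSchema()
bin_df.show(5)  # inspect the first few decoded records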
Edit: Please review the use of sc.binaryFiles as mentioned here: https://stackoverflow.com/a/28753276/5088142
Try using:
hdfs://machine_host_name:8020/user/bin_file1.bin
You can find the host name in the fs.defaultFS property of core-site.xml.
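For example, a minimal sketch assuming fs.defaultFS is hdfs://machine_host_name:8020:

# read a single binary file using the full HDFS URI
rdd = sc.binaryFiles('hdfs://machine_host_name:8020/user/bin_file1.bin')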
Since Spark 3.0, Spark supports a binary file data source, which reads binary files and converts each file into a single record containing the raw content and metadata of the file.
https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
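A minimal PySpark sketch of that data source (the path and glob pattern are placeholders):

# read every *.binary file under the directory into a DataFrame with
# columns: path, modificationTime, length, content (the raw bytes)
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.binary") \
    .load("hdfs://nameservice1:8020/user/")
df.select("path", "length").show()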