How can I read in a binary file from hdfs into a Spark dataframe?



So, for anyone who is just starting out with Spark like me and stumbles upon binary files, here is how I solved it:

import numpy as np
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

dt = np.dtype([('idx_metric', '>i4'), ('idx_resource', '>i4'), ('date', '>i4'),
               ('value', '>f8'), ('pollID', '>i2')])

schema = StructType([StructField('idx_metric', IntegerType(), False),
                     StructField('idx_resource', IntegerType(), False),
                     StructField('date', IntegerType(), False),
                     StructField('value', DoubleType(), False),
                     StructField('pollID', IntegerType(), False)])

filenameRdd = sc.binaryFiles('hdfs://nameservice1:8020/user/*.binary')

def read_array(rdd):
    # output = zlib.decompress(bytes(rdd[1]), 15 + 32)  # in case the files are also gzipped
    array = np.frombuffer(bytes(rdd[1])[20:], dtype=dt)  # skip the 20-byte header
    array = array.newbyteorder().byteswap()              # the data is big endian
    return array.tolist()

unzipped = filenameRdd.flatMap(read_array)
bin_df = sqlContext.createDataFrame(unzipped, schema)

And now you can do whatever fancy stuff you want in Spark with your dataframe.
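For example, a quick sanity check on the result (a minimal sketch; the aggregation is only illustrative, using the column names from the schema above):

bin_df.show(5)                                # peek at the first few rows
bin_df.groupBy('idx_metric').count().show()   # row count per metric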


Edit: Please review the use of sc.binaryFiles as mentioned here: https://stackoverflow.com/a/28753276/5088142


Try using:

hdfs://machine_host_name:8020/user/bin_file1.bin

You can find the host name in the fs.defaultFS property in core-site.xml.
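If you can't easily open core-site.xml, you can also read the property from a running SparkContext. A minimal sketch (note this goes through the internal _jsc handle, so it may differ across Spark versions):

fs_default = sc._jsc.hadoopConfiguration().get('fs.defaultFS')
print(fs_default)  # e.g. hdfs://machine_host_name:8020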


Since Spark 3.0, Spark supports a binary file data source, which reads binary files and converts each file into a single record containing the raw content and metadata of the file.

https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
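A minimal sketch of that data source, assuming a SparkSession named spark and reusing the HDFS path from the answer above:

df = spark.read.format('binaryFile') \
    .option('pathGlobFilter', '*.binary') \
    .load('hdfs://nameservice1:8020/user/')
df.select('path', 'modificationTime', 'length').show(truncate=False)
# the raw bytes of each file are in the 'content' column

Note that this gives you one row per file with the whole file as a byte array, so you would still need to parse the binary layout (header, dtype, endianness) yourself, as in the NumPy-based answer above.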