How can I read in a binary file from hdfs into a Spark dataframe?
So, for anyone who, like me, is just starting out with Spark and stumbles upon binary files, here is how I solved it:
import numpy as np
# import zlib  # only needed if the files are also compressed
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Binary layout of one record: big-endian ints and doubles
dt = np.dtype([('idx_metric', '>i4'), ('idx_resource', '>i4'),
               ('date', '>i4'), ('value', '>f8'), ('pollID', '>i2')])

schema = StructType([StructField('idx_metric', IntegerType(), False),
                     StructField('idx_resource', IntegerType(), False),
                     StructField('date', IntegerType(), False),
                     StructField('value', DoubleType(), False),
                     StructField('pollID', IntegerType(), False)])

filenameRdd = sc.binaryFiles('hdfs://nameservice1:8020/user/*.binary')

def read_array(rdd):
    # output = zlib.decompress(bytes(rdd[1]), 15 + 32)  # in case the file is also zipped
    array = np.frombuffer(bytes(rdd[1])[20:], dtype=dt)  # remove the 20-byte header
    array = array.newbyteorder().byteswap()  # convert big-endian data to native byte order
    return array.tolist()

unzipped = filenameRdd.flatMap(read_array)
bin_df = sqlContext.createDataFrame(unzipped, schema)
And now you can do whatever fancy stuff you want in Spark with your dataframe.
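For example, a quick sanity check on the result (a minimal sketch using the bin_df built above):

bin_df.printSchema()
bin_df.show(5)  # inspect the first few decoded records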
Edit: Please review the use of sc.binaryFiles as mentioned here: https://stackoverflow.com/a/28753276/5088142
Try using:
hdfs://machine_host_name:8020/user/bin_file1.bin
You can find the host name in the fs.defaultFS property of core-site.xml.
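For example, a minimal sketch assuming fs.defaultFS is hdfs://machine_host_name:8020:

# read a single binary file using the full HDFS URI
rdd = sc.binaryFiles('hdfs://machine_host_name:8020/user/bin_file1.bin')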
Since Spark 3.0, Spark supports a binary file data source, which reads binary files and converts each file into a single record containing the raw content and metadata of the file.
https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
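A minimal PySpark sketch of that data source (the path and glob pattern are placeholders):

# read every *.binary file under the directory into a DataFrame with
# columns: path, modificationTime, length, content (the raw bytes)
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.binary") \
    .load("hdfs://nameservice1:8020/user/")
df.select("path", "length").show()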