Read directory of avro files from HDFS into dataframe-like object in python
You could use the built-in json module together with the hdfs (HdfsCLI) Python package.
To get started:
```python
import json

from hdfs import InsecureClient

HDFS_HOSTNAME = 'master1.hadoop.com'
HDFSCLI_PORT = 50070  # default WebHDFS port
HDFSCLI_CONNECTION_STRING = f'http://{HDFS_HOSTNAME}:{HDFSCLI_PORT}'

hdfs_client = InsecureClient(HDFSCLI_CONNECTION_STRING)

# Note: .avsc schema files are plain JSON, so json.load works on them;
# binary .avro data files would need an avro decoder instead.
avro_file = '/path/to/avro/file.avsc'
with hdfs_client.read(avro_file) as reader:
    content = json.load(reader)
```
Then you need to implement the loop (maybe with hdfs_client.walk) and transform to pandas.
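That loop might be sketched as follows. This is only an illustration: `directory_to_dataframe` and `read_records` are hypothetical names (not part of the hdfs package), and `read_records` stands in for whatever decoder you use to turn one avro file into a list of dicts:

```python
import pandas as pd

def directory_to_dataframe(hdfs_client, root, read_records):
    """Walk `root` on HDFS and concatenate all .avro files into one DataFrame.

    `read_records` is a placeholder callable: given the client and an HDFS
    path, it should return that file's records as a list of dicts.
    """
    frames = []
    # Client.walk yields (path, dirnames, filenames) tuples, like os.walk.
    for dirpath, _dirnames, filenames in hdfs_client.walk(root):
        for name in filenames:
            if name.endswith('.avro'):
                path = f'{dirpath}/{name}'
                frames.append(pd.DataFrame(read_records(hdfs_client, path)))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's local row numbers.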
As mentioned above, cyavro looks like an excellent solution to my question. It can very quickly read an individual avro file, or read a directory of avro files and concatenate them, into a pandas dataframe.
While it appears to support reading over some other protocols such as http:// or s3://, it does not appear to natively support reading from hdfs:// at this time. A viable alternative, though, might be to mount HDFS onto the local filesystem, which would give cyavro ordinary filesystem access.
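One way to do such a mount is Hadoop's FUSE module. A rough sketch, assuming the hadoop-fuse-dfs helper is installed and the namenode listens on the default RPC port 8020 (both assumptions; package names and ports vary by distribution):

```shell
# Hypothetical mount of HDFS onto the local filesystem via FUSE.
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://master1.hadoop.com:8020 /mnt/hdfs
# cyavro (or any local reader) can now see avro files under /mnt/hdfs/...
```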