
Read directory of avro files from HDFS into dataframe-like object in python


You could use the json module together with the hdfscli Python package.

To get started:

import json

from hdfs import InsecureClient

HDFS_HOSTNAME = 'master1.hadoop.com'
HDFSCLI_PORT = 50070
HDFSCLI_CONNECTION_STRING = f'http://{HDFS_HOSTNAME}:{HDFSCLI_PORT}'

hdfs_client = InsecureClient(HDFSCLI_CONNECTION_STRING)

# An .avsc schema file is plain JSON, so it can be parsed with json.load
avro_file = '/path/to/avro/file.avsc'
with hdfs_client.read(avro_file) as reader:
    content = json.load(reader)

From there you need to loop over the directory (perhaps with hdfs_client.walk) and transform the records into a pandas dataframe, along the lines sketched below.
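A minimal sketch of that loop, assuming the avro extension that ships with hdfscli (installed with pip install hdfs[avro], which pulls in fastavro) and a hypothetical directory /path/to/avro/dir:

import pandas as pd
from hdfs import InsecureClient
from hdfs.ext.avro import AvroReader

hdfs_client = InsecureClient('http://master1.hadoop.com:50070')

records = []
# walk() yields (path, subdirectories, files) tuples, like os.walk
for root, _dirs, files in hdfs_client.walk('/path/to/avro/dir'):
    for name in files:
        if name.endswith('.avro'):
            # AvroReader iterates over the records of one avro file as dicts
            with AvroReader(hdfs_client, f'{root}/{name}') as reader:
                records.extend(reader)

df = pd.DataFrame.from_records(records)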


As mentioned above, cyavro looks like an excellent solution to my question. It provides very fast reading of individual avro files, or of a whole directory of avro files concatenated together, into a pandas dataframe.

While it appears to support reading over some other protocols such as http:// or s3://, it does not currently appear to support reading from hdfs:// natively. A viable workaround is to mount HDFS onto the local filesystem, which gives cyavro ordinary file access, as sketched below.
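For illustration, a minimal sketch under those assumptions: HDFS mounted at the hypothetical path /mnt/hdfs, and cyavro's read_avro_file_as_dataframe (as shown in its README) doing the per-file reads:

import glob

import cyavro
import pandas as pd

# Assumes HDFS has been mounted at /mnt/hdfs (hypothetical mount point),
# e.g. via fuse_dfs or an NFS gateway, so the avro files look like local files
paths = sorted(glob.glob('/mnt/hdfs/path/to/avro/dir/*.avro'))

# Read each file into a dataframe and concatenate them together
df = pd.concat(
    (cyavro.read_avro_file_as_dataframe(p) for p in paths),
    ignore_index=True,
)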