Read directory of avro files from HDFS into dataframe-like object in python
You could use the built-in json module together with the hdfs (HdfsCLI) Python package.
To get started:
```python
import json

from hdfs import InsecureClient

HDFS_HOSTNAME = 'master1.hadoop.com'
HDFSCLI_PORT = 50070  # default WebHDFS port
HDFSCLI_CONNECTION_STRING = f'http://{HDFS_HOSTNAME}:{HDFSCLI_PORT}'

hdfs_client = InsecureClient(HDFSCLI_CONNECTION_STRING)

# Note: .avsc schema files are plain JSON, so json.load works on them;
# binary .avro data files would need an avro decoder instead.
avro_file = '/path/to/avro/file.avsc'
with hdfs_client.read(avro_file) as reader:
    content = json.load(reader)
```
Then you need to implement the loop (maybe with hdfs_client.walk) and transform to pandas.
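That loop might be sketched as follows. This is only an illustration: `directory_to_dataframe` and `read_records` are hypothetical names (not part of the hdfs package), and `read_records` stands in for whatever decoder you use to turn one avro file into a list of dicts:

```python
import pandas as pd

def directory_to_dataframe(hdfs_client, root, read_records):
    """Walk `root` on HDFS and concatenate all .avro files into one DataFrame.

    `read_records` is a placeholder callable: given the client and an HDFS
    path, it should return that file's records as a list of dicts.
    """
    frames = []
    # Client.walk yields (path, dirnames, filenames) tuples, like os.walk.
    for dirpath, _dirnames, filenames in hdfs_client.walk(root):
        for name in filenames:
            if name.endswith('.avro'):
                path = f'{dirpath}/{name}'
                frames.append(pd.DataFrame(read_records(hdfs_client, path)))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's local row numbers.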
As mentioned above, cyavro looks like an excellent solution to my question. It can very quickly read an individual avro file, or read a directory of avro files and concatenate them, into a pandas dataframe.
While it appears to support reading over some other protocols such as http:// or s3://, it does not appear to natively support reading from hdfs:// at this time. A viable alternative, though, might be to mount HDFS onto the local filesystem, which would give cyavro ordinary filesystem access.
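One way to do such a mount is Hadoop's FUSE module. A rough sketch, assuming the hadoop-fuse-dfs helper is installed and the namenode listens on the default RPC port 8020 (both assumptions; package names and ports vary by distribution):

```shell
# Hypothetical mount of HDFS onto the local filesystem via FUSE.
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://master1.hadoop.com:8020 /mnt/hdfs
# cyavro (or any local reader) can now see avro files under /mnt/hdfs/...
```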