
Reading and writing out-of-core files sequentially, multi-threaded, with Python


Multi-threaded sequential writes are error-prone. Most systems instead prefer formats like Parquet, which let each chunk of data be written to a different file.
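A minimal sketch of that chunk-per-file approach (the write_chunk helper, the chunks/ directory, and the part-*.parquet naming are my own illustration, not from the question):

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("chunks", exist_ok=True)

def write_chunk(df_chunk: pd.DataFrame, chunk_id: int) -> None:
    # Each chunk lands in its own file, so writers never contend for a shared handle.
    table = pa.Table.from_pandas(df_chunk)
    pq.write_table(table, f"chunks/part-{chunk_id:05d}.parquet")

# Later, the directory of part files can be read back as a single table:
# pq.ParquetDataset("chunks").read().to_pandas()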

If you want genuinely parallel writes to a single sequential file, you'll have to do some sort of locking yourself, and you're largely on your own as far as larger all-in-one systems go.
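A rough sketch of what that locking could look like, assuming worker threads each produce a pyarrow RecordBatch (the write_batch helper below is my own illustration; only RecordBatchFileWriter comes from pyarrow):

import threading
import pyarrow as pa

write_lock = threading.Lock()
writer = None  # created lazily once the first batch's schema is known

def write_batch(batch: pa.RecordBatch) -> None:
    global writer
    # Only one thread at a time may touch the writer, keeping the file sequential.
    with write_lock:
        if writer is None:
            writer = pa.RecordBatchFileWriter("filename.arrow", batch.schema)
        writer.write(batch)

Worker threads can still build their batches in parallel; only the short write call is serialized behind the lock.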


I finally found a working solution with pyarrow.

Incremental writing:

import pandas as pd
import pyarrow as pa

result = []
writer = None
for _, row in df.iterrows():
    result.append(process_row(row))
    if len(result) >= 10000:
        batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
        # The writer needs a schema, so it is created lazily from the first batch.
        if writer is None:
            writer = pa.RecordBatchFileWriter('filename.arrow', batch.schema)
        writer.write(batch)
        result = []

# Flush whatever is left over after the loop.
batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
writer.write(batch)
writer.close()

Read all into one dataframe:

pa.RecordBatchFileReader("filename.arrow").read_pandas()

Incremental reading:

rb = pa.RecordBatchFileReader("filename.arrow")
for i in range(rb.num_record_batches):
    b = rb.get_batch(i)
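If each batch should end up in pandas, converting one at a time keeps only a single chunk in memory (my extension of the loop above, not part of the original answer):

rb = pa.RecordBatchFileReader("filename.arrow")
for i in range(rb.num_record_batches):
    chunk_df = rb.get_batch(i).to_pandas()
    # work on chunk_df here; it can be dropped before the next batch is loaded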