Reading and writing out-of-core files sequentially from multiple threads in Python
Multi-threaded sequential writes are error-prone. Most systems sidestep the problem with formats like Parquet, writing each chunk of data to a separate file.
If you want truly parallel sequential writes to a single file, you'll need some form of locking, and you're largely on your own as far as larger all-in-one systems go.
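To make the locking point concrete, here is a minimal sketch (not from any particular library; the names `write_chunk` and the output path are illustrative) of several threads appending to one shared file, serialized by a `threading.Lock` so their writes never interleave:

```python
import os
import tempfile
import threading

lock = threading.Lock()

def write_chunk(f, chunk_id):
    # The lock serializes access: only one thread writes at a time,
    # so lines from different threads never interleave.
    with lock:
        f.write(f"chunk-{chunk_id}\n")

path = os.path.join(tempfile.gettempdir(), "locked_writes.txt")
with open(path, "w") as f:
    threads = [threading.Thread(target=write_chunk, args=(f, i))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Without the lock, concurrent writes to the same file handle can interleave or clobber each other; with it, every line arrives intact, though the order of chunks is still nondeterministic.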
I finally found a working solution with pyarrow.
Incremental writing:
import pyarrow as pa
import pandas as pd

result = []
writer = None
for _, row in df.iterrows():
    result.append(process_row(row))
    if len(result) >= 10000:
        batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
        if writer is None:
            # The writer needs the schema of the first batch.
            writer = pa.RecordBatchFileWriter('filename.arrow', batch.schema)
        writer.write(batch)
        result = []

# Flush any leftover rows that didn't fill a full batch.
if result:
    batch = pa.RecordBatch.from_pandas(pd.DataFrame(result))
    writer.write(batch)
writer.close()
Read all into one dataframe:
pa.RecordBatchFileReader("filename.arrow").read_pandas()
Incremental reading:
rb = pa.RecordBatchFileReader("filename.arrow")
for i in range(rb.num_record_batches):
    b = rb.get_batch(i)  # each batch can be converted with b.to_pandas()