Splitting files in Unix

I assume you're using split -b, which is more CPU-efficient than splitting by lines, but it still reads the whole input file and writes it out to each output file. If the serial nature of this part of split's execution is your bottleneck, you can use dd to extract the chunks of the file in parallel. You will need a distinct dd command for each parallel process. Here's one example command line (assuming the_input_file is large, this extracts a chunk from the middle):

dd skip=400 count=1 if=the_input_file bs=512 of=_output

To make this work you will need to choose appropriate values for count and bs (the ones above are very small). Each worker will also need a different value of skip so that the chunks don't overlap; note that skip is counted in blocks of bs bytes. This is efficient because dd implements skip with a seek rather than by reading through the skipped data.
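For example, here is a minimal sketch of running several such extractions in parallel from a shell. The file name, chunk size, and output names are placeholders for whatever your workers actually need:

# Extract four non-overlapping 64 MiB chunks of the_input_file in parallel.
# With bs=1M each chunk is 64 blocks, so worker i skips i*64 blocks.
for i in 0 1 2 3; do
dd if=the_input_file bs=1M skip=$((i * 64)) count=64 of=chunk_$i 2>/dev/null &
done
wait   # all four dd processes have finished; chunk_0..chunk_3 are ready

Because each dd seeks straight to its own offset, the four extractions proceed independently of one another.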

Of course, this is still not as efficient as implementing your data consumer process in such a way that it can read a specified chunk of the input file directly, in parallel with other similar consumer processes. But I assume if you could do that you would not have asked this question.


Given that split is a standard OS utility, my inclination would be to assume it is already reasonably well optimized.

You can see this question (or do a man -k split or man split) to find related commands that you might be able to use instead of split.

If you are thinking of implementing your own solution in, say, C, then I would suggest you run some benchmarks for your specific system/environment and some sample data, and use the results to decide which tool to use.
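As a rough sketch of the kind of comparison meant here (big_sample_file, the 64 MiB chunk size, and the four-way parallelism are placeholder values; adjust them so both commands process the same amount of data):

time split -b 64M big_sample_file split_
time sh -c 'for i in 0 1 2 3; do dd if=big_sample_file bs=1M skip=$((i*64)) count=64 of=dd_$i 2>/dev/null & done; wait'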

Note: if you aren't going to be doing this regularly, it may not be worth your while to think about it much; just go ahead and use a tool that does what you need (in this case, split).