
Fastest way to sort huge (50-100 GB) files when you have enough memory


What about quicksort? Did you try it? std::sort is usually implemented with quicksort (more precisely introsort, which falls back to heapsort when quicksort would perform badly), so you could try that. Quicksort is usually the fastest option: its worst-case complexity is O(n^2), but in typical cases it beats all other sorting algorithms.

The space complexity of quicksort is not too bad either: it needs about log2(N) stack frames, which is around 30 for 1 billion items.

However, it is an unstable sorting algorithm (the order of "equal" items is not preserved), so it depends on whether you are OK with that.

By the way, Unix sort seems to be implemented with merge sort, which usually isn't the fastest option for an in-RAM sort.
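
As a rough, minimal sketch of the std::sort suggestion (assuming line-based records and that the whole file really fits in RAM, as the question states; the command-line handling and writing to stdout are illustrative choices, not part of this answer):

// Load a file's lines into memory and sort them with std::sort
// (typically introsort). This is a sketch, not a tuned implementation.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <file>\n";
        return 1;
    }

    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "cannot open " << argv[1] << "\n";
        return 1;
    }

    // Read every line into memory; this assumes the machine really has
    // enough RAM to hold the whole file plus the vector overhead.
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }

    // std::sort is unstable; use std::stable_sort instead if the relative
    // order of equal lines must be preserved (at the cost of extra memory).
    std::sort(lines.begin(), lines.end());

    for (const auto& l : lines) {
        std::cout << l << '\n';
    }
    return 0;
}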


I know this is old, but I figured I'd chime in with what I just worked out, in the hope that it may help someone else in the future.

GNU sort, as you may already know, is pretty fast. Couple that with many CPU cores and a lot of RAM, and if you pass the right flags to GNU sort, it becomes extremely fast.

* Pay close attention to the --buffer-size flag: the buffer size is the main reason for this speed-up. I've used the --parallel flag before, and by itself it wasn't as fast.

sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.csv $file

I used a for loop to handle all the files in the folder, sorting huge CSV files by the second key with a comma delimiter and keeping only unique values, with the following results:

for file in $(ls -p | grep -v -E "[0-4/]"); do
    time sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file
done

real    0m36.041s
user    1m53.291s
sys     0m32.007s

real    0m52.449s
user    1m52.812s
sys     0m38.202s

real    0m50.133s
user    1m41.124s
sys     0m38.595s

real    0m41.948s
user    1m41.080s
sys     0m35.949s

real    0m47.387s
user    1m39.998s
sys     0m34.076s

The input files are 5.5 GB with ~75,000,000 rows each. The maximum memory usage I saw while a sort was taking place was a little less than 20 GB, so if it scales proportionally, a 50 GB file should need a little less than 200 GB of memory. In total this sorted 27.55 GB of data in under 9 minutes!