Fastest way to sort huge (50-100 GB) files when you have enough memory
Use a parallel sorting algorithm for data this large.
Related question: Which parallel sorting algorithm has the best average-case performance?
What about quicksort? Did you try it? std::sort is usually implemented with quicksort (more precisely introsort, which switches to heapsort when quicksort's recursion degenerates), so you can try that. Quicksort is usually the fastest option: although its worst-case complexity is O(n^2), in typical cases it beats the other sorting algorithms.
The space complexity of quicksort is modest: it needs O(log2 N) stack space, which is only about 30 stack frames for 1 billion items.
However, it is an unstable sorting algorithm (the order of "equal" items is not preserved), so it depends on whether you are OK with that.
Btw. Unix sort seems to be implemented with merge sort, which usually isn't the fastest option for an in-RAM sort.
I know this is old but I figure I'd chime in with what I just figured out in hopes that it may help someone else in the future.
GNU sort, as you may already know, is pretty fast. Couple that with many CPU cores and a lot of RAM, pass a few special flags to GNU sort, and it becomes extremely fast.
* Pay close attention to the --buffer-size flag; the buffer size is the main reason for this speed-up. I've used the --parallel flag before and it wasn't as fast by itself.
sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.csv $file
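To see what those flags actually do, here is the same invocation scaled down to a toy input (a sketch assuming GNU coreutils sort — --parallel and --buffer-size are GNU extensions — with made-up file names under /tmp):

```shell
# Toy input: comma-delimited rows; field 2 is the sort key.
printf 'a,3\nb,1\nc,2\nb,1\n' > /tmp/demo.csv

# Same flags as above, scaled down: sort by the 2nd comma field (-t, -k2),
# drop rows whose key repeats (-u), write the result in place (-o).
sort --parallel=2 --buffer-size=1M -u -t, -k2 \
     -o /tmp/demo.sorted.csv /tmp/demo.csv

cat /tmp/demo.sorted.csv
# b,1
# c,2
# a,3
```

One subtlety: -k2 keys on everything from field 2 to the end of the line; use -k2,2 if you want the key restricted to that single field.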
I used a for loop to handle all the files in the folder, sorting huge CSV files by the second key, with a comma delimiter, keeping only unique values, with the following results:
for file in $(ls -p | grep -v -E "[0-4/]"); do time sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file; done

real 0m36.041s
user 1m53.291s
sys 0m32.007s

real 0m52.449s
user 1m52.812s
sys 0m38.202s

real 0m50.133s
user 1m41.124s
sys 0m38.595s

real 0m41.948s
user 1m41.080s
sys 0m35.949s

real 0m47.387s
user 1m39.998s
sys 0m34.076s
The input files are 5.5 GB with ~75,000,000 rows each. The max memory usage I saw while a sort was taking place was a little less than 20 GB, so if it scales proportionally, a 50 GB file should take a little less than 200 GB of memory. Sorted 27.55 GB of data in under 9 minutes!
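For the curious, the proportional estimate above is just linear extrapolation (assuming peak memory grows linearly with input size, which won't be exact in practice):

```shell
# Observed: a 5.5 GB input peaked at ~20 GB of RAM.
# Extrapolate linearly to a 50 GB input:
awk 'BEGIN { printf "%.0f GB\n", 20 / 5.5 * 50 }'
# 182 GB
```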