Re-order lines in textfile for better compression ratio Re-order lines in textfile for better compression ratio bash bash

Re-order lines in textfile for better compression ratio


Here's a fairly naïve algorithm:

  • Choose an initial line at random and write to the compression stream.
  • While remaining lines > 0:
    • Save the state of the compression stream
    • For each remaining line in the text file:
      • write the line to the compression stream and record the resulting compressed length
      • roll back to the saved state of the compression stream
    • Write the line that resulted in the lowest compressed length to the compression stream
    • Free the saved state

This is a greedy algorithm and won't be globally optimal, but it should be pretty good at matching together lines that compressed well when followed one after the other. It's O(n2) but you said compression speed wasn't an issue. The main advantage is that it's empirical: it doesn't rely on assumptions about which line order will compress well but actually measures it.

If you use zlib, it provides a function deflateCopy that duplicates the state of the compression stream, although it's apparently pretty expensive.

Edit: if you approach this problem as outputting all lines in a sequence while trying to minimize the total edit distance between all pairs of lines in the sequence then this problem reduces to the Travelling Salesman Problem, with the edit distance as your "distance" and all your lines as the nodes you have to visit. So you could look into the various approaches to that problem and apply them to this. Even then, the optimal TSP solution in terms of edit distance isn't necessarily going to be the file that compresses the smallest/