
Suggestions needed for optimizing O(n^2) algorithm


I am not sure about the properties of your comparator and the data set, but assuming that your comparator defines an equivalence relation on your rows, here goes nothing:

  1. Create a map from the first file's rows, using your comparator function as the key comparator of the map. The map values are sequences/lists of rows, i.e. all rows that are 'same' get successively added to the same map entry. Takes O(n log n) time.
  2. Walk through the other file's rows and check whether each row matches a key in the map. If it does, then by the equivalence relation implied by your comparator you know this row is the 'same' as all the rows in that map entry's value. Takes O(n log n + C), where C depends on how many matches you have to output. (A sketch of both steps follows below.)
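Here's a minimal sketch of both steps in Java, using a TreeMap whose key comparator stands in for your comparison function. One caveat: a sorted map needs a total order, not just an equivalence relation, so your comparator must also rank non-equal rows consistently. The case-insensitive comparison and the hard-coded rows are just placeholders for your data:

    import java.util.*;

    public class TreeMapMatcher {
        public static void main(String[] args) {
            // Placeholder comparator: rows are 'same' if they match ignoring
            // case. Substitute your own; it must impose a total order that is
            // consistent with your equivalence relation.
            Comparator<String> cmp = String.CASE_INSENSITIVE_ORDER;

            List<String> fileA = List.of("apple", "Banana", "APPLE");
            List<String> fileB = List.of("banana", "cherry", "apple");

            // Step 1: bucket the first file's rows into a sorted map keyed by
            // the comparator. Rows that compare 'same' land in the same entry.
            TreeMap<String, List<String>> groups = new TreeMap<>(cmp);
            for (String row : fileA) {
                groups.computeIfAbsent(row, k -> new ArrayList<>()).add(row);
            }

            // Step 2: probe with each row of the second file. A hit means the
            // row is 'same' as every row stored under that key.
            for (String row : fileB) {
                List<String> matches = groups.get(row);
                if (matches != null) {
                    System.out.println(row + " matches " + matches);
                }
            }
        }
    }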

Note that in the worst case, according to your problem description, you cannot do any better than O(n^2), simply because there may be O(n^2) matching pairs of records that you have to output!


Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of hash/line-number (or file-pointer position) combinations. Then sort the list of hashes and identify those that appear more than once.
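As a rough sketch of that idea in Java (the file name and the use of String.hashCode are placeholders; any reasonable row hash works):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class HashDuplicates {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Path.of("input.txt"));

            // Record a (hash, line number) pair for every row.
            long[][] pairs = new long[lines.size()][2];
            for (int i = 0; i < lines.size(); i++) {
                pairs[i][0] = lines.get(i).hashCode();
                pairs[i][1] = i;
            }

            // Sort by hash so equal hashes become adjacent.
            Arrays.sort(pairs, Comparator.comparingLong(p -> p[0]));

            // Hashes that appear more than once are duplicate candidates;
            // confirm them with the real comparison to rule out collisions.
            for (int i = 1; i < pairs.length; i++) {
                if (pairs[i][0] == pairs[i - 1][0]) {
                    System.out.printf("lines %d and %d may be duplicates%n",
                            pairs[i - 1][1], pairs[i][1]);
                }
            }
        }
    }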


We'd need to know more about your comparison function. Is it transitive? (That is, does A==B and B==C imply A==C?) Is it symmetric? (Does A==B imply B==A?)

If your comparison function is transitive and symmetric, and it's common for many records to be equal, then you could bin your records into groups by comparing each one to a single "representative sample" of the group. That could approach O(N) in the best case; see the sketch below.
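A sketch of that binning idea, again in Java; the case-insensitive test is a stand-in for your comparison function:

    import java.util.*;
    import java.util.function.BiPredicate;

    public class RepresentativeBinning {
        // Stand-in comparison; substitute yours. It must be transitive and
        // symmetric, or rows can end up in the wrong bin.
        static final BiPredicate<String, String> SAME =
                (a, b) -> a.equalsIgnoreCase(b);

        public static void main(String[] args) {
            List<String> rows = List.of("apple", "APPLE", "banana", "Apple");

            // Each bin keeps its first row as the representative; new rows are
            // compared only against representatives, so with few distinct
            // groups the total work approaches O(N).
            List<List<String>> bins = new ArrayList<>();
            for (String row : rows) {
                List<String> home = null;
                for (List<String> bin : bins) {
                    if (SAME.test(row, bin.get(0))) {
                        home = bin;
                        break;
                    }
                }
                if (home == null) {
                    home = new ArrayList<>();
                    bins.add(home);
                }
                home.add(row);
            }
            System.out.println(bins); // [[apple, APPLE, Apple], [banana]]
        }
    }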

Note that hashing the records requires hash(A) == hash(B) whenever compare(A, B) == true; if compare(A, B) can be true even when bytes(A) != bytes(B), it may be tricky to design a hashing algorithm with that property.
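One common way around that is to hash a normalized form of the record rather than its raw bytes, so that everything your comparator considers equal normalizes to the same value. For instance, if the comparison were case-insensitive (purely an assumption here), lower-casing before hashing gives the required property:

    import java.util.Locale;

    public class CompatibleHash {
        // Hashing the normalized form guarantees that compare(A, B) == true
        // implies hash(A) == hash(B), even when bytes(A) != bytes(B).
        static int rowHash(String row) {
            return row.toLowerCase(Locale.ROOT).hashCode();
        }

        public static void main(String[] args) {
            System.out.println(rowHash("Apple") == rowHash("APPLE")); // true
        }
    }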