
Suggestions needed for optimizing O(n^2) algorithm


I am not sure about the properties of your comparator and the data set, but assuming that your comparator defines an equivalence relation on your rows, here goes nothing:

  1. Create a map from the first file's rows, using your comparator function as the key comparator of the map. The map values are sequences/lists of rows, i.e. all rows that are 'same' get successively added to the same map entry. Takes O(n log n) time.
  2. Walk through the other file's rows and check whether each row matches a key in the map. If it does, then by the equivalence relation implied by your comparator you know this row is the 'same' as all the rows in that map entry's value. Takes O(n log n + C), where C depends on how many matches you have to output. (A sketch of both steps follows below.)
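Here's a minimal sketch of both steps in Java, using a TreeMap whose key comparator stands in for your comparison function. One caveat: a sorted map needs a total order, not just an equivalence relation, so your comparator must also rank non-equal rows consistently. The case-insensitive comparison and the hard-coded rows are just placeholders for your data:

    import java.util.*;

    public class TreeMapMatcher {
        public static void main(String[] args) {
            // Placeholder comparator: rows are 'same' if they match ignoring
            // case. Substitute your own; it must impose a total order that is
            // consistent with your equivalence relation.
            Comparator<String> cmp = String.CASE_INSENSITIVE_ORDER;

            List<String> fileA = List.of("apple", "Banana", "APPLE");
            List<String> fileB = List.of("banana", "cherry", "apple");

            // Step 1: bucket the first file's rows into a sorted map keyed by
            // the comparator. Rows that compare 'same' land in the same entry.
            TreeMap<String, List<String>> groups = new TreeMap<>(cmp);
            for (String row : fileA) {
                groups.computeIfAbsent(row, k -> new ArrayList<>()).add(row);
            }

            // Step 2: probe with each row of the second file. A hit means the
            // row is 'same' as every row stored under that key.
            for (String row : fileB) {
                List<String> matches = groups.get(row);
                if (matches != null) {
                    System.out.println(row + " matches " + matches);
                }
            }
        }
    }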

Note that in the worst case, according to your problem description, you cannot do any better than O(n^2), simply because there may be O(n^2) matching pairs of records that you have to output!


Assuming the files aren't ridiculously large, I'd go through the file in its entirety, compute a hash for each row, and keep track of hash/line-number (or file-pointer position) combinations. Then sort the list of hashes and identify those that appear more than once.
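As a rough sketch of that idea in Java (the file name and the use of String.hashCode are placeholders; any reasonable row hash works):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class HashDuplicates {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Path.of("input.txt"));

            // Record a (hash, line number) pair for every row.
            long[][] pairs = new long[lines.size()][2];
            for (int i = 0; i < lines.size(); i++) {
                pairs[i][0] = lines.get(i).hashCode();
                pairs[i][1] = i;
            }

            // Sort by hash so equal hashes become adjacent.
            Arrays.sort(pairs, Comparator.comparingLong(p -> p[0]));

            // Hashes that appear more than once are duplicate candidates;
            // confirm them with the real comparison to rule out collisions.
            for (int i = 1; i < pairs.length; i++) {
                if (pairs[i][0] == pairs[i - 1][0]) {
                    System.out.printf("lines %d and %d may be duplicates%n",
                            pairs[i - 1][1], pairs[i][1]);
                }
            }
        }
    }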


We'd need to know more about your comparison function. Is it transitive? (That is, does A==B and B==C imply A==C?) Is it symmetric? (Does A==B imply B==A?)

If your comparison function is transitive and symmetric, and it's common for many records to be equal, then you could bin your records into groups by comparing each one to a single "representative sample" of the group. That could approach O(N) in the best case; see the sketch below.
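A sketch of that binning idea, again in Java; the case-insensitive test is a stand-in for your comparison function:

    import java.util.*;
    import java.util.function.BiPredicate;

    public class RepresentativeBinning {
        // Stand-in comparison; substitute yours. It must be transitive and
        // symmetric, or rows can end up in the wrong bin.
        static final BiPredicate<String, String> SAME =
                (a, b) -> a.equalsIgnoreCase(b);

        public static void main(String[] args) {
            List<String> rows = List.of("apple", "APPLE", "banana", "Apple");

            // Each bin keeps its first row as the representative; new rows are
            // compared only against representatives, so with few distinct
            // groups the total work approaches O(N).
            List<List<String>> bins = new ArrayList<>();
            for (String row : rows) {
                List<String> home = null;
                for (List<String> bin : bins) {
                    if (SAME.test(row, bin.get(0))) {
                        home = bin;
                        break;
                    }
                }
                if (home == null) {
                    home = new ArrayList<>();
                    bins.add(home);
                }
                home.add(row);
            }
            System.out.println(bins); // [[apple, APPLE, Apple], [banana]]
        }
    }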

Note that hashing the records requires hash(A) == hash(B) whenever compare(A, B) == true; if compare(A, B) can be true even when bytes(A) != bytes(B), it may be tricky to design a hashing algorithm with that property.
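One common way around that is to hash a normalized form of the record rather than its raw bytes, so that everything your comparator considers equal normalizes to the same value. For instance, if the comparison were case-insensitive (purely an assumption here), lower-casing before hashing gives the required property:

    import java.util.Locale;

    public class CompatibleHash {
        // Hashing the normalized form guarantees that compare(A, B) == true
        // implies hash(A) == hash(B), even when bytes(A) != bytes(B).
        static int rowHash(String row) {
            return row.toLowerCase(Locale.ROOT).hashCode();
        }

        public static void main(String[] args) {
            System.out.println(rowHash("Apple") == rowHash("APPLE")); // true
        }
    }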