vba: get unique values from array
This post contains 2 examples. I like the 2nd one:
Sub unique() Dim arr As New Collection, a Dim aFirstArray() As Variant Dim i As Long aFirstArray() = Array("Banana", "Apple", "Orange", "Tomato", "Apple", _ "Lemon", "Lime", "Lime", "Apple") On Error Resume Next For Each a In aFirstArray arr.Add a, a Next On Error Goto 0 ' added to original example by PEH For i = 1 To arr.Count Cells(i, 1) = arr(i) Next End Sub
There's no built-in functionality to remove duplicates from arrays. Raj's answer seems elegant, but I prefer to use dictionaries.
Dim d As ObjectSet d = CreateObject("Scripting.Dictionary")'Set d = New Scripting.DictionaryDim i As LongFor i = LBound(myArray) To UBound(myArray) d(myArray(i)) = 1Next iDim v As VariantFor Each v In d.Keys() 'd.Keys() is a Variant array of the unique values in myArray. 'v will iterate through each of them.Next v
EDIT: I changed the loop to use LBound
and UBound
as per Tomalak's suggested answer.EDIT: d.Keys()
is a Variant array, not a Collection.
Update (6/15/16)
I have created much more thorough benchmarks. First of all, as @ChaimG pointed out, early binding makes a big difference (I originally used @eksortso's code above verbatim which uses late binding). Secondly, my original benchmarks only included the time to create the unique object, however, it did not test the efficiency of using the object. My point in doing this is, it doesn't really matter if I can create an object really fast if the object I create is clunky and slows me down moving forward.
Old Remark: It turns out, that looping over a collection object is highly inefficient
It turns out that looping over a collection can be quite efficient if you know how to do it (I didn't). As @ChaimG (yet again), pointed out in the comments, using a For Each
construct is ridiculously superior to simply using a For
loop. To give you an idea, before changing the loop construct, the time for Collection2
for the Test Case Size = 10^6
was over 1400s (i.e. ~23 minutes). It is now a meager 0.195s (over 7000x faster).
For the Collection
method there are two times. The first (my original benchmark Collection1
) show the time to create the unique object. The second part (Collection2
) shows the time to loop over the object (which is very natural) to create a returnable array as the other functions do.
In the chart below, a yellow background indicates that it was the fastest for that test case, and red indicates the slowest ("Not Tested" algorithms are excluded). The total time for the Collection
method is the sum of Collection1
and Collection2
. Turquoise indicates that is was the fastest regardless of original order.
Below is the original algorithm I created (I have modified it slightly e.g. I no longer instantiate my own data type). It returns the unique values of an array with the original order in a very respectable time and it can be modified to take on any data type. Outside of the IndexMethod
, it is the fastest algorithm for very large arrays.
Here are the main ideas behind this algorithm:
- Index the array
- Sort by values
- Place identical values at the end of the array and subsequently "chop" them off.
- Finally, sort by index.
Below is an example:
Let myArray = (86, 100, 33, 19, 33, 703, 19, 100, 703, 19) 1. (86, 100, 33, 19, 33, 703, 19, 100, 703, 19) (1 , 2, 3, 4, 5, 6, 7, 8, 9, 10) <<-- Indexing 2. (19, 19, 19, 33, 33, 86, 100, 100, 703, 703) <<-- sort by values (4, 7, 10, 3, 5, 1, 2, 8, 6, 9) 3. (19, 33, 86, 100, 703) <<-- remove duplicates (4, 3, 1, 2, 6) 4. (86, 100, 33, 19, 703) ( 1, 2, 3, 4, 6) <<-- sort by index
Here is the code:
Function SortingUniqueTest(ByRef myArray() As Long, bOrigIndex As Boolean) As Variant Dim MyUniqueArr() As Long, i As Long, intInd As Integer Dim StrtTime As Double, Endtime As Double, HighB As Long, LowB As Long LowB = LBound(myArray): HighB = UBound(myArray) ReDim MyUniqueArr(1 To 2, LowB To HighB) intInd = 1 - LowB 'Guarantees the indices span 1 to Lim For i = LowB To HighB MyUniqueArr(1, i) = myArray(i) MyUniqueArr(2, i) = i + intInd Next i QSLong2D MyUniqueArr, 1, LBound(MyUniqueArr, 2), UBound(MyUniqueArr, 2), 2 Call UniqueArray2D(MyUniqueArr) If bOrigIndex Then QSLong2D MyUniqueArr, 2, LBound(MyUniqueArr, 2), UBound(MyUniqueArr, 2), 2 SortingUniqueTest = MyUniqueArr()End FunctionPublic Sub UniqueArray2D(ByRef myArray() As Long) Dim i As Long, j As Long, Count As Long, Count1 As Long, DuplicateArr() As Long Dim lngTemp As Long, HighB As Long, LowB As Long LowB = LBound(myArray, 2): Count = LowB: i = LowB: HighB = UBound(myArray, 2) Do While i < HighB j = i + 1 If myArray(1, i) = myArray(1, j) Then Do While myArray(1, i) = myArray(1, j) ReDim Preserve DuplicateArr(1 To Count) DuplicateArr(Count) = j Count = Count + 1 j = j + 1 If j > HighB Then Exit Do Loop QSLong2D myArray, 2, i, j - 1, 2 End If i = j Loop Count1 = HighB If Count > 1 Then For i = UBound(DuplicateArr) To LBound(DuplicateArr) Step -1 myArray(1, DuplicateArr(i)) = myArray(1, Count1) myArray(2, DuplicateArr(i)) = myArray(2, Count1) Count1 = Count1 - 1 ReDim Preserve myArray(1 To 2, LowB To Count1) Next i End IfEnd Sub
Here is the sorting algorithm I use (more about this algo here).
Sub QSLong2D(ByRef saArray() As Long, bytDim As Byte, lLow1 As Long, lHigh1 As Long, bytNum As Byte) Dim lLow2 As Long, lHigh2 As Long Dim sKey As Long, sSwap As Long, i As ByteOn Error GoTo ErrorExit If IsMissing(lLow1) Then lLow1 = LBound(saArray, bytDim) If IsMissing(lHigh1) Then lHigh1 = UBound(saArray, bytDim) lLow2 = lLow1 lHigh2 = lHigh1 sKey = saArray(bytDim, (lLow1 + lHigh1) \ 2) Do While lLow2 < lHigh2 Do While saArray(bytDim, lLow2) < sKey And lLow2 < lHigh1: lLow2 = lLow2 + 1: Loop Do While saArray(bytDim, lHigh2) > sKey And lHigh2 > lLow1: lHigh2 = lHigh2 - 1: Loop If lLow2 < lHigh2 Then For i = 1 To bytNum sSwap = saArray(i, lLow2) saArray(i, lLow2) = saArray(i, lHigh2) saArray(i, lHigh2) = sSwap Next i End If If lLow2 <= lHigh2 Then lLow2 = lLow2 + 1 lHigh2 = lHigh2 - 1 End If Loop If lHigh2 > lLow1 Then QSLong2D saArray(), bytDim, lLow1, lHigh2, bytNum If lLow2 < lHigh1 Then QSLong2D saArray(), bytDim, lLow2, lHigh1, bytNumErrorExit:End Sub
Below is a special algorithm that is blazing fast if your data contains integers. It makes use of indexing and the Boolean data type.
Function IndexSort(ByRef myArray() As Long, bOrigIndex As Boolean) As Variant'' Modified to take both positive and negative integers Dim arrVals() As Long, arrSort() As Long, arrBool() As Boolean Dim i As Long, HighB As Long, myMax As Long, myMin As Long, OffSet As Long Dim LowB As Long, myIndex As Long, count As Long, myRange As Long HighB = UBound(myArray) LowB = LBound(myArray) For i = LowB To HighB If myArray(i) > myMax Then myMax = myArray(i) If myArray(i) < myMin Then myMin = myArray(i) Next i OffSet = Abs(myMin) '' Number that will be added to every element '' to guarantee every index is non-negative If myMax > 0 Then myRange = myMax + OffSet '' E.g. if myMax = 10 & myMin = -2, then myRange = 12 Else myRange = OffSet End If If bOrigIndex Then ReDim arrSort(1 To 2, 1 To HighB) ReDim arrVals(1 To 2, 0 To myRange) ReDim arrBool(0 To myRange) For i = LowB To HighB myIndex = myArray(i) + OffSet arrBool(myIndex) = True arrVals(1, myIndex) = myArray(i) If arrVals(2, myIndex) = 0 Then arrVals(2, myIndex) = i Next i For i = 0 To myRange If arrBool(i) Then count = count + 1 arrSort(1, count) = arrVals(1, i) arrSort(2, count) = arrVals(2, i) End If Next i QSLong2D arrSort, 2, 1, count, 2 ReDim Preserve arrSort(1 To 2, 1 To count) Else ReDim arrSort(1 To HighB) ReDim arrVals(0 To myRange) ReDim arrBool(0 To myRange) For i = LowB To HighB myIndex = myArray(i) + OffSet arrBool(myIndex) = True arrVals(myIndex) = myArray(i) Next i For i = 0 To myRange If arrBool(i) Then count = count + 1 arrSort(count) = arrVals(i) End If Next i ReDim Preserve arrSort(1 To count) End If ReDim arrVals(0) ReDim arrBool(0) IndexSort = arrSortEnd Function
Here are the Collection (by @DocBrown) and Dictionary (by @eksortso) Functions.
Function CollectionTest(ByRef arrIn() As Long, Lim As Long) As Variant Dim arr As New Collection, a, i As Long, arrOut() As Variant, aFirstArray As Variant Dim StrtTime As Double, EndTime1 As Double, EndTime2 As Double, count As LongOn Error Resume Next ReDim arrOut(1 To UBound(arrIn)) ReDim aFirstArray(1 To UBound(arrIn)) StrtTime = Timer For i = 1 To UBound(arrIn): aFirstArray(i) = CStr(arrIn(i)): Next i '' Convert to string For Each a In aFirstArray ''' This part is actually creating the unique set arr.Add a, a Next EndTime1 = Timer - StrtTime StrtTime = Timer ''' This part is writing back to an array for return For Each a In arr: count = count + 1: arrOut(count) = a: Next a EndTime2 = Timer - StrtTime CollectionTest = Array(arrOut, EndTime1, EndTime2)End FunctionFunction DictionaryTest(ByRef myArray() As Long, Lim As Long) As Variant Dim StrtTime As Double, Endtime As Double Dim d As Scripting.Dictionary, i As Long '' Early Binding Set d = New Scripting.Dictionary For i = LBound(myArray) To UBound(myArray): d(myArray(i)) = 1: Next i DictionaryTest = d.Keys()End Function
Here is the Direct approach provided by @IsraelHoletz.
Function ArrayUnique(ByRef aArrayIn() As Long) As Variant Dim aArrayOut() As Variant, bFlag As Boolean, vIn As Variant, vOut As Variant Dim i As Long, j As Long, k As Long ReDim aArrayOut(LBound(aArrayIn) To UBound(aArrayIn)) i = LBound(aArrayIn) j = i For Each vIn In aArrayIn For k = j To i - 1 If vIn = aArrayOut(k) Then bFlag = True: Exit For Next If Not bFlag Then aArrayOut(i) = vIn: i = i + 1 bFlag = False Next If i <> UBound(aArrayIn) Then ReDim Preserve aArrayOut(LBound(aArrayIn) To i - 1) ArrayUnique = aArrayOutEnd FunctionFunction DirectTest(ByRef aArray() As Long, Lim As Long) As Variant Dim aReturn() As Variant Dim StrtTime As Long, Endtime As Long, i As Long aReturn = ArrayUnique(aArray) DirectTest = aReturnEnd Function
Here is the benchmark function that compares all of the functions. You should note that the last two cases are handled a little bit different because of memory issues. Also note, that I didn't test the Collection
method for the Test Case Size = 10,000,000
. For some reason, it was returning incorrect results and behaving unusual (I'm guessing the collection object has a limit on how many things you can put in it. I searched and I couldn't find any literature on this).
Function UltimateTest(Lim As Long, bTestDirect As Boolean, bTestDictionary, bytCase As Byte) As Variant Dim dictionTest, collectTest, sortingTest1, indexTest1, directT '' all variants Dim arrTest() As Long, i As Long, bEquality As Boolean, SizeUnique As Long Dim myArray() As Long, StrtTime As Double, EndTime1 As Variant Dim EndTime2 As Double, EndTime3 As Variant, EndTime4 As Double Dim EndTime5 As Double, EndTime6 As Double, sortingTest2, indexTest2 ReDim myArray(1 To Lim): Rnd (-2) '' If you want to test negative numbers, '' insert this to the left of CLng(Int(Lim... : (-1) ^ (Int(2 * Rnd())) * For i = LBound(myArray) To UBound(myArray): myArray(i) = CLng(Int(Lim * Rnd() + 1)): Next i arrTest = myArray If bytCase = 1 Then If bTestDictionary Then StrtTime = Timer: dictionTest = DictionaryTest(arrTest, Lim): EndTime1 = Timer - StrtTime Else EndTime1 = "Not Tested" End If arrTest = myArray collectTest = CollectionTest(arrTest, Lim) arrTest = myArray StrtTime = Timer: sortingTest1 = SortingUniqueTest(arrTest, True): EndTime2 = Timer - StrtTime SizeUnique = UBound(sortingTest1, 2) If bTestDirect Then arrTest = myArray: StrtTime = Timer: directT = DirectTest(arrTest, Lim): EndTime3 = Timer - StrtTime Else EndTime3 = "Not Tested" End If arrTest = myArray StrtTime = Timer: indexTest1 = IndexSort(arrTest, True): EndTime4 = Timer - StrtTime arrTest = myArray StrtTime = Timer: sortingTest2 = SortingUniqueTest(arrTest, False): EndTime5 = Timer - StrtTime arrTest = myArray StrtTime = Timer: indexTest2 = IndexSort(arrTest, False): EndTime6 = Timer - StrtTime bEquality = True For i = LBound(sortingTest1, 2) To UBound(sortingTest1, 2) If Not CLng(collectTest(0)(i)) = sortingTest1(1, i) Then bEquality = False Exit For End If Next i For i = LBound(dictionTest) To UBound(dictionTest) If Not dictionTest(i) = sortingTest1(1, i + 1) Then bEquality = False Exit For End If Next i For i = LBound(dictionTest) To UBound(dictionTest) If Not dictionTest(i) = indexTest1(1, i + 1) Then bEquality = False Exit For End If Next i If bTestDirect Then For i = LBound(dictionTest) To UBound(dictionTest) If Not dictionTest(i) = directT(i + 1) Then bEquality = False Exit For End If Next i End If UltimateTest = Array(bEquality, EndTime1, EndTime2, EndTime3, EndTime4, _ EndTime5, EndTime6, collectTest(1), collectTest(2), SizeUnique) ElseIf bytCase = 2 Then arrTest = myArray collectTest = CollectionTest(arrTest, Lim) UltimateTest = Array(collectTest(1), collectTest(2)) ElseIf bytCase = 3 Then arrTest = myArray StrtTime = Timer: sortingTest1 = SortingUniqueTest(arrTest, True): EndTime2 = Timer - StrtTime SizeUnique = UBound(sortingTest1, 2) UltimateTest = Array(EndTime2, SizeUnique) ElseIf bytCase = 4 Then arrTest = myArray StrtTime = Timer: indexTest1 = IndexSort(arrTest, True): EndTime4 = Timer - StrtTime UltimateTest = EndTime4 ElseIf bytCase = 5 Then arrTest = myArray StrtTime = Timer: sortingTest2 = SortingUniqueTest(arrTest, False): EndTime5 = Timer - StrtTime UltimateTest = EndTime5 ElseIf bytCase = 6 Then arrTest = myArray StrtTime = Timer: indexTest2 = IndexSort(arrTest, False): EndTime6 = Timer - StrtTime UltimateTest = EndTime6 End IfEnd Function
And finally, here is the sub that produces the table above.
Sub GetBenchmarks() Dim myVar, i As Long, TestCases As Variant, j As Long, temp TestCases = Array(1000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000, 2000000, 5000000, 10000000) For j = 0 To 11 If j < 6 Then myVar = UltimateTest(CLng(TestCases(j)), True, True, 1) ElseIf j < 10 Then myVar = UltimateTest(CLng(TestCases(j)), False, True, 1) ElseIf j < 11 Then myVar = Array("Not Tested", "Not Tested", 0.1, "Not Tested", 0.1, 0.1, 0.1, 0, 0, 0) temp = UltimateTest(CLng(TestCases(j)), False, False, 2) myVar(7) = temp(0): myVar(8) = temp(1) temp = UltimateTest(CLng(TestCases(j)), False, False, 3) myVar(2) = temp(0): myVar(9) = temp(1) myVar(4) = UltimateTest(CLng(TestCases(j)), False, False, 4) myVar(5) = UltimateTest(CLng(TestCases(j)), False, False, 5) myVar(6) = UltimateTest(CLng(TestCases(j)), False, False, 6) Else myVar = Array("Not Tested", "Not Tested", 0.1, "Not Tested", 0.1, 0.1, 0.1, "Not Tested", "Not Tested", 0) temp = UltimateTest(CLng(TestCases(j)), False, False, 3) myVar(2) = temp(0): myVar(9) = temp(1) myVar(4) = UltimateTest(CLng(TestCases(j)), False, False, 4) myVar(5) = UltimateTest(CLng(TestCases(j)), False, False, 5) myVar(6) = UltimateTest(CLng(TestCases(j)), False, False, 6) End If Cells(4 + j, 6) = TestCases(j) For i = 1 To 9: Cells(4 + j, 6 + i) = myVar(i - 1): Next i Cells(4 + j, 17) = myVar(9) Next jEnd Sub
Summary
From the table of results, we can see that the Dictionary
method works really well for cases less than about 500,000, however, after that, the IndexMethod
really starts to dominate. You will notice that when order doesn't matter and your data is made up of positive integers, there is no comparison to the IndexMethod
algorithm (it returns the unique values from an array containing 10 million elements in less than 1 sec!!! Incredible!). Below I have a breakdown of which algorithm is preferred in various cases.
Case 1
Your Data contains integers (i.e. whole numbers, both positive and negative): IndexMethod
Case 2
Your Data contains non-integers (i.e. variant, double, string, etc.) with less than 200000 elements: Dictionary Method
Case 3
Your Data contains non-integers (i.e. variant, double, string, etc.) with more than 200000 elements: Collection Method
If you had to choose one algorithm, in my opinion, the Collection
method is still the best as it only requires a few lines of code, it's super general, and it's fast enough.