vba: get unique values from array vba: get unique values from array vba vba

vba: get unique values from array


This post contains 2 examples. I like the 2nd one:

Sub unique()   Dim arr As New Collection, a   Dim aFirstArray() As Variant   Dim i As Long    aFirstArray() = Array("Banana", "Apple", "Orange", "Tomato", "Apple", _   "Lemon", "Lime", "Lime", "Apple")    On Error Resume Next   For Each a In aFirstArray      arr.Add a, a   Next  On Error Goto 0 ' added to original example by PEH   For i = 1 To arr.Count      Cells(i, 1) = arr(i)   Next  End Sub 


There's no built-in functionality to remove duplicates from arrays. Raj's answer seems elegant, but I prefer to use dictionaries.

Dim d As ObjectSet d = CreateObject("Scripting.Dictionary")'Set d = New Scripting.DictionaryDim i As LongFor i = LBound(myArray) To UBound(myArray)    d(myArray(i)) = 1Next iDim v As VariantFor Each v In d.Keys()    'd.Keys() is a Variant array of the unique values in myArray.    'v will iterate through each of them.Next v

EDIT: I changed the loop to use LBound and UBound as per Tomalak's suggested answer.EDIT: d.Keys() is a Variant array, not a Collection.


Update (6/15/16)

I have created much more thorough benchmarks. First of all, as @ChaimG pointed out, early binding makes a big difference (I originally used @eksortso's code above verbatim which uses late binding). Secondly, my original benchmarks only included the time to create the unique object, however, it did not test the efficiency of using the object. My point in doing this is, it doesn't really matter if I can create an object really fast if the object I create is clunky and slows me down moving forward.

Old Remark: It turns out, that looping over a collection object is highly inefficient

It turns out that looping over a collection can be quite efficient if you know how to do it (I didn't). As @ChaimG (yet again), pointed out in the comments, using a For Each construct is ridiculously superior to simply using a For loop. To give you an idea, before changing the loop construct, the time for Collection2 for the Test Case Size = 10^6 was over 1400s (i.e. ~23 minutes). It is now a meager 0.195s (over 7000x faster).

For the Collection method there are two times. The first (my original benchmark Collection1) show the time to create the unique object. The second part (Collection2) shows the time to loop over the object (which is very natural) to create a returnable array as the other functions do.

In the chart below, a yellow background indicates that it was the fastest for that test case, and red indicates the slowest ("Not Tested" algorithms are excluded). The total time for the Collection method is the sum of Collection1 and Collection2. Turquoise indicates that is was the fastest regardless of original order.

Benchmarks5

Below is the original algorithm I created (I have modified it slightly e.g. I no longer instantiate my own data type). It returns the unique values of an array with the original order in a very respectable time and it can be modified to take on any data type. Outside of the IndexMethod, it is the fastest algorithm for very large arrays.

Here are the main ideas behind this algorithm:

  1. Index the array
  2. Sort by values
  3. Place identical values at the end of the array and subsequently "chop" them off.
  4. Finally, sort by index.

Below is an example:

Let myArray = (86, 100, 33, 19, 33, 703, 19, 100, 703, 19)    1.  (86, 100, 33, 19, 33, 703, 19, 100, 703, 19)        (1 ,   2,  3,  4,  5,   6,  7,   8,   9, 10)   <<-- Indexing    2.  (19, 19, 19, 33, 33, 86, 100, 100, 703, 703)   <<-- sort by values             (4,   7, 10,  3,  5,  1,   2,   8,   6,   9)    3.  (19, 33,  86, 100, 703)   <<-- remove duplicates            (4,   3,   1,   2,   6)    4.  (86, 100,  33, 19, 703)           ( 1,   2,   3,  4,   6)   <<-- sort by index

Here is the code:

Function SortingUniqueTest(ByRef myArray() As Long, bOrigIndex As Boolean) As Variant    Dim MyUniqueArr() As Long, i As Long, intInd As Integer    Dim StrtTime As Double, Endtime As Double, HighB As Long, LowB As Long    LowB = LBound(myArray): HighB = UBound(myArray)    ReDim MyUniqueArr(1 To 2, LowB To HighB)    intInd = 1 - LowB  'Guarantees the indices span 1 to Lim    For i = LowB To HighB        MyUniqueArr(1, i) = myArray(i)        MyUniqueArr(2, i) = i + intInd    Next i    QSLong2D MyUniqueArr, 1, LBound(MyUniqueArr, 2), UBound(MyUniqueArr, 2), 2    Call UniqueArray2D(MyUniqueArr)    If bOrigIndex Then QSLong2D MyUniqueArr, 2, LBound(MyUniqueArr, 2), UBound(MyUniqueArr, 2), 2    SortingUniqueTest = MyUniqueArr()End FunctionPublic Sub UniqueArray2D(ByRef myArray() As Long)    Dim i As Long, j As Long, Count As Long, Count1 As Long, DuplicateArr() As Long    Dim lngTemp As Long, HighB As Long, LowB As Long    LowB = LBound(myArray, 2): Count = LowB: i = LowB: HighB = UBound(myArray, 2)    Do While i < HighB        j = i + 1        If myArray(1, i) = myArray(1, j) Then            Do While myArray(1, i) = myArray(1, j)                ReDim Preserve DuplicateArr(1 To Count)                DuplicateArr(Count) = j                Count = Count + 1                j = j + 1                If j > HighB Then Exit Do            Loop            QSLong2D myArray, 2, i, j - 1, 2        End If        i = j    Loop    Count1 = HighB    If Count > 1 Then        For i = UBound(DuplicateArr) To LBound(DuplicateArr) Step -1            myArray(1, DuplicateArr(i)) = myArray(1, Count1)            myArray(2, DuplicateArr(i)) = myArray(2, Count1)            Count1 = Count1 - 1            ReDim Preserve myArray(1 To 2, LowB To Count1)        Next i    End IfEnd Sub

Here is the sorting algorithm I use (more about this algo here).

Sub QSLong2D(ByRef saArray() As Long, bytDim As Byte, lLow1 As Long, lHigh1 As Long, bytNum As Byte)    Dim lLow2 As Long, lHigh2 As Long    Dim sKey As Long, sSwap As Long, i As ByteOn Error GoTo ErrorExit    If IsMissing(lLow1) Then lLow1 = LBound(saArray, bytDim)    If IsMissing(lHigh1) Then lHigh1 = UBound(saArray, bytDim)    lLow2 = lLow1    lHigh2 = lHigh1    sKey = saArray(bytDim, (lLow1 + lHigh1) \ 2)    Do While lLow2 < lHigh2        Do While saArray(bytDim, lLow2) < sKey And lLow2 < lHigh1: lLow2 = lLow2 + 1: Loop        Do While saArray(bytDim, lHigh2) > sKey And lHigh2 > lLow1: lHigh2 = lHigh2 - 1: Loop        If lLow2 < lHigh2 Then            For i = 1 To bytNum                sSwap = saArray(i, lLow2)                saArray(i, lLow2) = saArray(i, lHigh2)                saArray(i, lHigh2) = sSwap            Next i        End If        If lLow2 <= lHigh2 Then            lLow2 = lLow2 + 1            lHigh2 = lHigh2 - 1        End If    Loop    If lHigh2 > lLow1 Then QSLong2D saArray(), bytDim, lLow1, lHigh2, bytNum    If lLow2 < lHigh1 Then QSLong2D saArray(), bytDim, lLow2, lHigh1, bytNumErrorExit:End Sub

Below is a special algorithm that is blazing fast if your data contains integers. It makes use of indexing and the Boolean data type.

Function IndexSort(ByRef myArray() As Long, bOrigIndex As Boolean) As Variant'' Modified to take both positive and negative integers    Dim arrVals() As Long, arrSort() As Long, arrBool() As Boolean    Dim i As Long, HighB As Long, myMax As Long, myMin As Long, OffSet As Long    Dim LowB As Long, myIndex As Long, count As Long, myRange As Long    HighB = UBound(myArray)    LowB = LBound(myArray)    For i = LowB To HighB        If myArray(i) > myMax Then myMax = myArray(i)        If myArray(i) < myMin Then myMin = myArray(i)    Next i    OffSet = Abs(myMin)  '' Number that will be added to every element                         '' to guarantee every index is non-negative    If myMax > 0 Then        myRange = myMax + OffSet  '' E.g. if myMax = 10 & myMin = -2, then myRange = 12    Else        myRange = OffSet    End If    If bOrigIndex Then        ReDim arrSort(1 To 2, 1 To HighB)        ReDim arrVals(1 To 2, 0 To myRange)        ReDim arrBool(0 To myRange)        For i = LowB To HighB            myIndex = myArray(i) + OffSet            arrBool(myIndex) = True            arrVals(1, myIndex) = myArray(i)            If arrVals(2, myIndex) = 0 Then arrVals(2, myIndex) = i        Next i        For i = 0 To myRange            If arrBool(i) Then                count = count + 1                arrSort(1, count) = arrVals(1, i)                arrSort(2, count) = arrVals(2, i)            End If        Next i        QSLong2D arrSort, 2, 1, count, 2        ReDim Preserve arrSort(1 To 2, 1 To count)    Else        ReDim arrSort(1 To HighB)        ReDim arrVals(0 To myRange)        ReDim arrBool(0 To myRange)        For i = LowB To HighB            myIndex = myArray(i) + OffSet            arrBool(myIndex) = True            arrVals(myIndex) = myArray(i)        Next i        For i = 0 To myRange            If arrBool(i) Then                count = count + 1                arrSort(count) = arrVals(i)            End If        Next i        ReDim Preserve arrSort(1 To count)    End If    ReDim arrVals(0)    ReDim arrBool(0)    IndexSort = arrSortEnd Function

Here are the Collection (by @DocBrown) and Dictionary (by @eksortso) Functions.

Function CollectionTest(ByRef arrIn() As Long, Lim As Long) As Variant    Dim arr As New Collection, a, i As Long, arrOut() As Variant, aFirstArray As Variant    Dim StrtTime As Double, EndTime1 As Double, EndTime2 As Double, count As LongOn Error Resume Next    ReDim arrOut(1 To UBound(arrIn))    ReDim aFirstArray(1 To UBound(arrIn))    StrtTime = Timer    For i = 1 To UBound(arrIn): aFirstArray(i) = CStr(arrIn(i)): Next i '' Convert to string    For Each a In aFirstArray               ''' This part is actually creating the unique set        arr.Add a, a    Next    EndTime1 = Timer - StrtTime    StrtTime = Timer         ''' This part is writing back to an array for return    For Each a In arr: count = count + 1: arrOut(count) = a: Next a    EndTime2 = Timer - StrtTime    CollectionTest = Array(arrOut, EndTime1, EndTime2)End FunctionFunction DictionaryTest(ByRef myArray() As Long, Lim As Long) As Variant    Dim StrtTime As Double, Endtime As Double    Dim d As Scripting.Dictionary, i As Long  '' Early Binding    Set d = New Scripting.Dictionary    For i = LBound(myArray) To UBound(myArray): d(myArray(i)) = 1: Next i    DictionaryTest = d.Keys()End Function

Here is the Direct approach provided by @IsraelHoletz.

Function ArrayUnique(ByRef aArrayIn() As Long) As Variant    Dim aArrayOut() As Variant, bFlag As Boolean, vIn As Variant, vOut As Variant    Dim i As Long, j As Long, k As Long    ReDim aArrayOut(LBound(aArrayIn) To UBound(aArrayIn))    i = LBound(aArrayIn)    j = i    For Each vIn In aArrayIn        For k = j To i - 1            If vIn = aArrayOut(k) Then bFlag = True: Exit For        Next        If Not bFlag Then aArrayOut(i) = vIn: i = i + 1        bFlag = False    Next    If i <> UBound(aArrayIn) Then ReDim Preserve aArrayOut(LBound(aArrayIn) To i - 1)    ArrayUnique = aArrayOutEnd FunctionFunction DirectTest(ByRef aArray() As Long, Lim As Long) As Variant    Dim aReturn() As Variant    Dim StrtTime As Long, Endtime As Long, i As Long    aReturn = ArrayUnique(aArray)    DirectTest = aReturnEnd Function

Here is the benchmark function that compares all of the functions. You should note that the last two cases are handled a little bit different because of memory issues. Also note, that I didn't test the Collection method for the Test Case Size = 10,000,000. For some reason, it was returning incorrect results and behaving unusual (I'm guessing the collection object has a limit on how many things you can put in it. I searched and I couldn't find any literature on this).

Function UltimateTest(Lim As Long, bTestDirect As Boolean, bTestDictionary, bytCase As Byte) As Variant    Dim dictionTest, collectTest, sortingTest1, indexTest1, directT '' all variants    Dim arrTest() As Long, i As Long, bEquality As Boolean, SizeUnique As Long    Dim myArray() As Long, StrtTime As Double, EndTime1 As Variant    Dim EndTime2 As Double, EndTime3 As Variant, EndTime4 As Double    Dim EndTime5 As Double, EndTime6 As Double, sortingTest2, indexTest2    ReDim myArray(1 To Lim): Rnd (-2)   '' If you want to test negative numbers,     '' insert this to the left of CLng(Int(Lim... : (-1) ^ (Int(2 * Rnd())) *    For i = LBound(myArray) To UBound(myArray): myArray(i) = CLng(Int(Lim * Rnd() + 1)): Next i    arrTest = myArray    If bytCase = 1 Then        If bTestDictionary Then            StrtTime = Timer: dictionTest = DictionaryTest(arrTest, Lim): EndTime1 = Timer - StrtTime        Else            EndTime1 = "Not Tested"        End If        arrTest = myArray        collectTest = CollectionTest(arrTest, Lim)        arrTest = myArray        StrtTime = Timer: sortingTest1 = SortingUniqueTest(arrTest, True): EndTime2 = Timer - StrtTime        SizeUnique = UBound(sortingTest1, 2)        If bTestDirect Then            arrTest = myArray: StrtTime = Timer: directT = DirectTest(arrTest, Lim): EndTime3 = Timer - StrtTime        Else            EndTime3 = "Not Tested"        End If        arrTest = myArray        StrtTime = Timer: indexTest1 = IndexSort(arrTest, True): EndTime4 = Timer - StrtTime        arrTest = myArray        StrtTime = Timer: sortingTest2 = SortingUniqueTest(arrTest, False): EndTime5 = Timer - StrtTime        arrTest = myArray        StrtTime = Timer: indexTest2 = IndexSort(arrTest, False): EndTime6 = Timer - StrtTime        bEquality = True        For i = LBound(sortingTest1, 2) To UBound(sortingTest1, 2)            If Not CLng(collectTest(0)(i)) = sortingTest1(1, i) Then                bEquality = False                Exit For            End If        Next i        For i = LBound(dictionTest) To UBound(dictionTest)            If Not dictionTest(i) = sortingTest1(1, i + 1) Then                bEquality = False                Exit For            End If        Next i        For i = LBound(dictionTest) To UBound(dictionTest)            If Not dictionTest(i) = indexTest1(1, i + 1) Then                bEquality = False                Exit For            End If        Next i        If bTestDirect Then            For i = LBound(dictionTest) To UBound(dictionTest)                If Not dictionTest(i) = directT(i + 1) Then                    bEquality = False                    Exit For                End If            Next i        End If        UltimateTest = Array(bEquality, EndTime1, EndTime2, EndTime3, EndTime4, _                        EndTime5, EndTime6, collectTest(1), collectTest(2), SizeUnique)    ElseIf bytCase = 2 Then        arrTest = myArray        collectTest = CollectionTest(arrTest, Lim)        UltimateTest = Array(collectTest(1), collectTest(2))    ElseIf bytCase = 3 Then        arrTest = myArray        StrtTime = Timer: sortingTest1 = SortingUniqueTest(arrTest, True): EndTime2 = Timer - StrtTime        SizeUnique = UBound(sortingTest1, 2)        UltimateTest = Array(EndTime2, SizeUnique)    ElseIf bytCase = 4 Then        arrTest = myArray        StrtTime = Timer: indexTest1 = IndexSort(arrTest, True): EndTime4 = Timer - StrtTime        UltimateTest = EndTime4    ElseIf bytCase = 5 Then        arrTest = myArray        StrtTime = Timer: sortingTest2 = SortingUniqueTest(arrTest, False): EndTime5 = Timer - StrtTime        UltimateTest = EndTime5    ElseIf bytCase = 6 Then        arrTest = myArray        StrtTime = Timer: indexTest2 = IndexSort(arrTest, False): EndTime6 = Timer - StrtTime        UltimateTest = EndTime6    End IfEnd Function

And finally, here is the sub that produces the table above.

Sub GetBenchmarks()    Dim myVar, i As Long, TestCases As Variant, j As Long, temp    TestCases = Array(1000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000, 2000000, 5000000, 10000000)    For j = 0 To 11        If j < 6 Then            myVar = UltimateTest(CLng(TestCases(j)), True, True, 1)        ElseIf j < 10 Then            myVar = UltimateTest(CLng(TestCases(j)), False, True, 1)        ElseIf j < 11 Then            myVar = Array("Not Tested", "Not Tested", 0.1, "Not Tested", 0.1, 0.1, 0.1, 0, 0, 0)            temp = UltimateTest(CLng(TestCases(j)), False, False, 2)            myVar(7) = temp(0): myVar(8) = temp(1)            temp = UltimateTest(CLng(TestCases(j)), False, False, 3)            myVar(2) = temp(0): myVar(9) = temp(1)            myVar(4) = UltimateTest(CLng(TestCases(j)), False, False, 4)            myVar(5) = UltimateTest(CLng(TestCases(j)), False, False, 5)            myVar(6) = UltimateTest(CLng(TestCases(j)), False, False, 6)        Else            myVar = Array("Not Tested", "Not Tested", 0.1, "Not Tested", 0.1, 0.1, 0.1, "Not Tested", "Not Tested", 0)            temp = UltimateTest(CLng(TestCases(j)), False, False, 3)            myVar(2) = temp(0): myVar(9) = temp(1)            myVar(4) = UltimateTest(CLng(TestCases(j)), False, False, 4)            myVar(5) = UltimateTest(CLng(TestCases(j)), False, False, 5)            myVar(6) = UltimateTest(CLng(TestCases(j)), False, False, 6)        End If        Cells(4 + j, 6) = TestCases(j)        For i = 1 To 9: Cells(4 + j, 6 + i) = myVar(i - 1): Next i        Cells(4 + j, 17) = myVar(9)    Next jEnd Sub

Summary
From the table of results, we can see that the Dictionary method works really well for cases less than about 500,000, however, after that, the IndexMethod really starts to dominate. You will notice that when order doesn't matter and your data is made up of positive integers, there is no comparison to the IndexMethod algorithm (it returns the unique values from an array containing 10 million elements in less than 1 sec!!! Incredible!). Below I have a breakdown of which algorithm is preferred in various cases.

Case 1
Your Data contains integers (i.e. whole numbers, both positive and negative): IndexMethod

Case 2
Your Data contains non-integers (i.e. variant, double, string, etc.) with less than 200000 elements: Dictionary Method

Case 3
Your Data contains non-integers (i.e. variant, double, string, etc.) with more than 200000 elements: Collection Method

If you had to choose one algorithm, in my opinion, the Collection method is still the best as it only requires a few lines of code, it's super general, and it's fast enough.