Efficiently merge string arrays in .NET, keeping distinct values Efficiently merge string arrays in .NET, keeping distinct values arrays arrays

Efficiently merge string arrays in .NET, keeping distinct values


string[] result = list1.Union(list2).ToArray();

from msdn: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."


Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, using a HashSet behind the scenes for Distinct to keep track of the elements which have already been returned.

I'm not sure how you'd manage to make it more efficient than that in a general way :)

EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can achieve without more domain knowledge.

The effect is the equivalent of:

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second){    HashSet<T> returned = new HashSet<T>();    foreach (T element in first)    {        if (returned.Add(element))        {            yield return element;        }    }    foreach (T element in second)    {        if (returned.Add(element))        {            yield return element;        }    }}


.NET 3.5 introduced the HashSet class which could do this:

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

Not sure of performance, but it should beat the Linq example you gave.

EDIT:I stand corrected. The lazy implementation of Concat and Distinct have a key memory AND speed advantage. Concat/Distinct is about 10% faster, and saves multiple copies of data.

I confirmed through code:

Setting up arrays of 3000000 strings overlapping by 300000Starting Hashset...HashSet: 00:00:02.8237616Starting Concat/Distinct...Concat/Distinct: 00:00:02.5629681

is the output of:

        int num = 3000000;        int num10Pct = (int)(num / 10);        Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));        string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();        string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();        Console.WriteLine("Starting Hashset...");        Stopwatch sw = new Stopwatch();        sw.Start();        string[] merged = new HashSet<string>(list1).Union(list2).ToArray();        sw.Stop();        Console.WriteLine("HashSet: " + sw.Elapsed);        Console.WriteLine("Starting Concat/Distinct...");        sw.Reset();        sw.Start();        string[] merged2 = list1.Concat(list2).Distinct().ToArray();        sw.Stop();        Console.WriteLine("Concat/Distinct: " + sw.Elapsed);