Word frequency in a large text file


The best short answer I can give is to measure, measure, measure. Stopwatch is nice for getting a feeling for where time is spent, but eventually you'll end up sprinkling large swaths of your code with it, or you'll have to find a better tool for the purpose. I would suggest getting a dedicated profiler; there are many available for C# and .NET.
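As a rough illustration of the Stopwatch approach, here is a minimal sketch of a timing helper. The names TimingSketch and Measure are hypothetical and not part of the original code; only Stopwatch itself comes from the framework.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Wrapping a whole call, for example `TimingSketch.Measure(() => Parse(path))`, gives one number per run. It does not say where inside the call the time goes, which is what a profiler adds.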


I've managed to shave off about 43% of the total runtime in three steps.

First I measured your code and got this:

[Image: original code measurements]

This seems to indicate that there are two hotspots here that we can try to combat:

  1. String splitting (SplitInternal)
  2. Dictionary maintenance (FindEntry, Insert, get_Item)

The remaining time is spent reading the file, and I really doubt we can gain much by changing that part of the code. One other answer here mentions using specific buffer sizes; I tried this and could not measure any difference.

The first, string splitting, is fairly easy to fix but involves rewriting a very simple call to string.Split into a bit more code. I rewrote the loop that processes one line to this:

    while ((line = streamReader.ReadLine()) != null)
    {
        int lastPos = 0;
        for (int index = 0; index <= line.Length; index++)
        {
            if (index == line.Length || line[index] == ' ')
            {
                if (lastPos < index)
                {
                    string word = line.Substring(lastPos, index - lastPos);
                    // process word here
                }
                lastPos = index + 1;
            }
        }
    }

I then rewrote the processing of one word to this:

    int currentCount;
    wordCount.TryGetValue(word, out currentCount);
    wordCount[word] = currentCount + 1;

This relies on the fact that:

  1. TryGetValue is cheaper than checking if the word exists and then retrieving its current count
  2. If TryGetValue fails to get the value (key does not exist) then it will initialize the currentCount variable here to its default value, which is 0. This means that we don't really need to check if the word actually existed.
  3. We can add new words to the dictionary through the indexer (it will either overwrite the existing value or add a new key+value to the dictionary)

The final loop thus looks like this:

    while ((line = streamReader.ReadLine()) != null)
    {
        int lastPos = 0;
        for (int index = 0; index <= line.Length; index++)
        {
            if (index == line.Length || line[index] == ' ')
            {
                if (lastPos < index)
                {
                    string word = line.Substring(lastPos, index - lastPos);
                    int currentCount;
                    wordCount.TryGetValue(word, out currentCount);
                    wordCount[word] = currentCount + 1;
                }
                lastPos = index + 1;
            }
        }
    }

The new measurement shows this:

[Image: new measurement]

Details:

  1. We went from 6876ms to 5013ms
  2. We lost the time spent in SplitInternal, FindEntry and get_Item
  3. We gained time spent in TryGetValue and Substring

Here are the difference details:

[Image: difference]

As you can see, we lost more time than we gained, which resulted in a net improvement.

However, we can do better. We're doing two dictionary lookups here, each of which involves calculating the hash code of the word and comparing it to keys in the dictionary. The first lookup is part of TryGetValue and the second is part of wordCount[word] = ....

We can remove the second dictionary lookup by creating a smarter data structure inside the dictionary at the cost of more heap memory used.

We can use Xanatos' trick of storing the count inside an object so that we can remove that second dictionary lookup:

    public class WordCount
    {
        public int Count;
    }

    ...

    var wordCount = new Dictionary<string, WordCount>();

    ...

    string word = line.Substring(lastPos, index - lastPos);
    WordCount currentCount;
    if (!wordCount.TryGetValue(word, out currentCount))
        wordCount[word] = currentCount = new WordCount();
    currentCount.Count++;

This only retrieves the count object from the dictionary; adding one extra occurrence does not involve the dictionary. The method's return type also changes: the dictionary now holds this WordCount type instead of a plain int.

Net result: ~43% savings.

[Image: final results]

Final piece of code:

    public class WordCount
    {
        public int Count;
    }

    public static IDictionary<string, WordCount> Parse(string path)
    {
        var wordCount = new Dictionary<string, WordCount>();

        using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
        using (var streamReader = new StreamReader(fileStream, Encoding.Default, false, 65536))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                int lastPos = 0;
                for (int index = 0; index <= line.Length; index++)
                {
                    if (index == line.Length || line[index] == ' ')
                    {
                        if (lastPos < index)
                        {
                            string word = line.Substring(lastPos, index - lastPos);
                            WordCount currentCount;
                            if (!wordCount.TryGetValue(word, out currentCount))
                                wordCount[word] = currentCount = new WordCount();
                            currentCount.Count++;
                        }
                        lastPos = index + 1;
                    }
                }
            }
        }

        return wordCount;
    }


Your approach seems to be in line with how most people would tackle it. You're right to notice that multithreading did not offer any significant gains: the work is most likely I/O bound, and whatever hardware you have, you cannot read faster than it supports.

If you're really looking for speed improvements (I doubt you will get any), you could try implementing a producer-consumer pattern where one thread reads the file and other threads process the lines (maybe then parallelise the checking of words in a line). The trade-off is that you add a lot more complex code in exchange for what may be marginal improvements; only benchmarking can tell.

http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
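A minimal sketch of that producer-consumer layout, assuming a BlockingCollection<string> as the queue between the reading thread and the workers. The class name, the bounded capacity of 1024 and the per-worker dictionaries are illustrative choices, not something taken from the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    public static Dictionary<string, int> Count(IEnumerable<string> lines, int workerCount)
    {
        // Bounded queue so the reader cannot run arbitrarily far ahead of the workers.
        using (var queue = new BlockingCollection<string>(1024))
        {
            // Consumers: each worker fills its own dictionary, so counting needs no locks.
            Task<Dictionary<string, int>>[] workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Run(() =>
                {
                    var local = new Dictionary<string, int>();
                    foreach (string line in queue.GetConsumingEnumerable())
                    {
                        foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                        {
                            int current;
                            local.TryGetValue(word, out current);
                            local[word] = current + 1;
                        }
                    }
                    return local;
                }))
                .ToArray();

            // Producer: the calling thread feeds lines into the queue, then signals the end.
            foreach (string line in lines)
            {
                queue.Add(line);
            }
            queue.CompleteAdding();

            // Merge the per-worker results once every worker has drained the queue.
            var total = new Dictionary<string, int>();
            foreach (Task<Dictionary<string, int>> worker in workers)
            {
                foreach (KeyValuePair<string, int> pair in worker.Result)
                {
                    int current;
                    total.TryGetValue(pair.Key, out current);
                    total[pair.Key] = current + pair.Value;
                }
            }
            return total;
        }
    }
}
```

It could be called as `ProducerConsumerSketch.Count(File.ReadLines(path), Environment.ProcessorCount)`. Whether it beats the single-threaded loop depends on how much of the time is really spent reading, so it needs benchmarking like everything else here.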

edit: also have a look at ConcurrentDictionary
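For reference, a small sketch of how ConcurrentDictionary could hold plain int counts, using its AddOrUpdate method. The class name and the space-only splitting are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ConcurrentCountSketch
{
    public static ConcurrentDictionary<string, int> Count(IEnumerable<string> lines)
    {
        var counts = new ConcurrentDictionary<string, int>();

        Parallel.ForEach(lines, line =>
        {
            foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                // Stores 1 for a new word; otherwise replaces the old count with old + 1.
                // The update delegate may run more than once under contention,
                // but the stored value only changes atomically.
                counts.AddOrUpdate(word, 1, (key, old) => old + 1);
            }
        });

        return counts;
    }
}
```

This keeps the value type as int at the cost of some retry work when many threads hit the same word at once.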


I gained quite a lot (from 25 seconds to 20 seconds on a 200 MB file) simply by changing the dictionary update to:

    int cnt;
    if (wordCount.TryGetValue(word, out cnt))
    {
        wordCount[word] = cnt + 1;
    }
    else....

A variant based on ConcurrentDictionary<> and Parallel.ForEach (using the IEnumerable<> overload). Note that instead of using an int, I'm using an InterlockedInt that uses Interlocked.Increment to increment itself. Being a reference type, it works correctly with the ConcurrentDictionary<>.GetOrAdd...

    public class InterlockedInt
    {
        private int cnt;

        public int Cnt
        {
            get
            {
                return cnt;
            }
        }

        // Atomic increment, safe to call from several threads at once.
        public void Increment()
        {
            Interlocked.Increment(ref cnt);
        }
    }

    public static IDictionary<string, InterlockedInt> Parse(string path)
    {
        var wordCount = new ConcurrentDictionary<string, InterlockedInt>();

        Action<string> action = line2 =>
        {
            // separators is not defined in this snippet; it is assumed to be
            // the array of delimiter characters declared elsewhere.
            var words = line2.Split(separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                wordCount.GetOrAdd(word, x => new InterlockedInt()).Increment();
            }
        };

        IEnumerable<string> lines = File.ReadLines(path);
        Parallel.ForEach(lines, action);

        return wordCount;
    }

Note that Parallel.ForEach is less efficient than directly using one thread per physical core (you can see how in the edit history). While both solutions take less than 10 seconds of wall-clock time on my PC, the Parallel.ForEach version uses 55 seconds of CPU time against 33 seconds for the Thread solution.

There is another trick that is worth around 5-10%:

    public static IEnumerable<T[]> ToBlock<T>(IEnumerable<T> source, int num)
    {
        var array = new T[num];
        int cnt = 0;

        foreach (T row in source)
        {
            array[cnt] = row;
            cnt++;

            if (cnt == num)
            {
                yield return array;
                array = new T[num];
                cnt = 0;
            }
        }

        if (cnt != 0)
        {
            Array.Resize(ref array, cnt);
            yield return array;
        }
    }

You group the rows into packets (choose a size between 10 and 100) so that there is less inter-thread communication. The workers then have to do a foreach over the rows they receive.
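As a sketch of how the workers might consume those packets: ToBlock is repeated here only so that the example compiles on its own. The class name BlockedCountSketch, the use of AddOrUpdate with int counts and the space-only splitting are assumptions for illustration, not the original Thread-based solution.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BlockedCountSketch
{
    // Same batching helper as shown above.
    public static IEnumerable<T[]> ToBlock<T>(IEnumerable<T> source, int num)
    {
        var array = new T[num];
        int cnt = 0;
        foreach (T row in source)
        {
            array[cnt] = row;
            cnt++;
            if (cnt == num)
            {
                yield return array;
                array = new T[num];
                cnt = 0;
            }
        }
        if (cnt != 0)
        {
            Array.Resize(ref array, cnt);
            yield return array;
        }
    }

    public static ConcurrentDictionary<string, int> Count(IEnumerable<string> lines, int blockSize)
    {
        var counts = new ConcurrentDictionary<string, int>();

        // Each work item handed to a worker is a whole block of lines, not a single line.
        Parallel.ForEach(ToBlock(lines, blockSize), block =>
        {
            foreach (string line in block)
            {
                foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                {
                    counts.AddOrUpdate(word, 1, (key, old) => old + 1);
                }
            }
        });

        return counts;
    }
}
```

The block size is a tuning knob: larger blocks mean fewer hand-offs between the enumerating thread and the workers, but coarser load balancing.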