How should I detect which delimiter is used in a text file? How should I detect which delimiter is used in a text file? asp.net asp.net

How should I detect which delimiter is used in a text file?


In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):


[First, look] for text enclosed between two identical quotes(the probable quotechar) which are preceded and followedby the same character (the probable delimiter).For example:

         ,'some text',

The quote with the most wins, same with the delimiter.If there is no quotechar the delimiter can't be determinedthis way.

In that case, try the following:

The delimiter should occur the same number of times oneach row. However, due to malformed data, it may not. We don't wantan all or nothing approach, so we allow for small variations in thisnumber.

  1. build a table of the frequency ofeach character on every line.
  2. build a table of freqencies of thisfrequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6times in 1000 rows, 7 times in 2rows'
  3. use the mode of the meta-frequencyto determine the expectedfrequency for that character
  4. find out how often the characteractually meets that goal
  5. the character that best meets itsgoal is the delimiter

For performance reasons, the data is evaluated in chunks, so it cantry and evaluate the smallest portion of the data possible, evaluatingadditional chunks as necessary.


I'm not going to quote the source code here - it's in the Lib directory of every Python installation.

Remember that CSV can also use semicolons instead of commas as delimiters (e. g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)


You could show them the results in preview window - similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select a range of delimiters and have the preview update in real time.

Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).


I ran into a similar need and thought I would share what I came up with. I haven't run a lot of data through it yet, so there are possible edge cases. Also, keep in mind the goal of this function isn't 100% certainty of the delimiter, but best guess to be presented to user.

/// <summary>/// Analyze the given lines of text and try to determine the correct delimiter used. If multiple/// candidate delimiters are found, the highest frequency delimiter will be returned./// </summary>/// <example>/// string discoveredDelimiter = DetectDelimiter(dataLines, new char[] { '\t', '|', ',', ':', ';' });/// </example>/// <param name="lines">Lines to inspect</param>/// <param name="delimiters">Delimiters to search for</param>/// <returns>The most probable delimiter by usage, or null if none found.</returns>public string DetectDelimiter(IEnumerable<string> lines, IEnumerable<char> delimiters) {  Dictionary<char, int> delimFrequency = new Dictionary<char, int>();  // Setup our frequency tracker for given delimiters  delimiters.ToList().ForEach(curDelim =>     delimFrequency.Add(curDelim, 0)  );  // Get a total sum of all occurrences of each delimiter in the given lines  delimFrequency.ToList().ForEach(curDelim =>     delimFrequency[curDelim.Key] = lines.Sum(line => line.Count(p => p == curDelim.Key))  );  // Find delimiters that have a frequency evenly divisible by the number of lines  // (correct & consistent usage) and order them by largest frequency  var possibleDelimiters = delimFrequency                    .Where(f => f.Value > 0 && f.Value % lines.Count() == 0)                    .OrderByDescending(f => f.Value)                    .ToList();  // If more than one possible delimiter found, return the most used one  if (possibleDelimiters.Any()) {    return possibleDelimiters.First().Key.ToString();  }  else {    return null;  }   }