How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
From the documentation, you can use either a regex separator or delim_whitespace=True (note that delim_whitespace is deprecated as of pandas 2.2 in favor of sep=r"\s+"):
>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, sep=r"\s+|\t+|\s+\t+|\t+\s+")
would use any combination of any number of spaces and tabs as the separator. Because \s already matches tabs, the pattern can be shortened to r"\s+" with the same result.
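A quick way to see what a separator regex does is to try it with re.split on a sample line before handing it to read_csv. The sketch below does that for the alternation pattern and for plain \s+; the sample line is an assumption modelled on the whitespace.csv rows shown earlier, and re.split is only a stand-in for how a regex separator divides a line, not pandas' actual parser.

```python
import re

# Hypothetical sample line: a tab plus a space, a lone tab, then single spaces.
line = "a\t b\tc 1 2"

long_pattern = r"\s+|\t+|\s+\t+|\t+\s+"
short_pattern = r"\s+"

# \s matches spaces and tabs alike, so the first alternative
# already consumes every run of mixed whitespace.
long_fields = re.split(long_pattern, line)
short_fields = re.split(short_pattern, line)

print(long_fields)
print(short_fields)
```

Both calls split the line into the same five fields, which suggests the extra alternatives add nothing beyond \s+.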
Pandas has two CSV readers; only one is flexible regarding redundant leading whitespace:
pd.read_csv("whitespace.csv", skipinitialspace=True)
while the other is not (DataFrame.from_csv was deprecated in pandas 0.21 and removed in 1.0):
pd.DataFrame.from_csv("whitespace.csv")
Neither is flexible out of the box regarding trailing whitespace; see the answers with regular expressions. Avoid delim_whitespace, as it also accepts plain spaces (without , or \t) as separators.
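For comma-separated data with stray spaces on both sides of the comma, one possible approach is a regex separator that swallows the surrounding whitespace. This is a minimal sketch, assuming made-up sample data (the column names name and value are hypothetical) and no quoted fields, since a regex separator does not respect quoting. Regex separators other than \s+ need the Python parsing engine, so engine="python" is passed explicitly.

```python
import io
import pandas as pd

# Hypothetical sample: spaces before and after the commas.
raw = "name , value\nfoo ,1\nbar  ,  2\n"

# skipinitialspace only drops whitespace *after* the delimiter,
# so trailing spaces stay attached to the preceding field.
loose = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# A regex separator consumes whitespace on both sides of the comma.
strict = pd.read_csv(io.StringIO(raw), sep=r"\s*,\s*", engine="python")

print(list(loose.columns))
print(list(strict.columns))
print(strict)
```

With the regex separator, the column labels and the string values come back without the padding, and the numeric column is still parsed as integers.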