Pandas ParserError EOF character when reading multiple csv files to HDF5


I had a similar problem. The line listed with the 'EOF inside string' had a string that contained within it a single quote mark. When I added the option quoting=csv.QUOTE_NONE it fixed my problem.

For example:

    import csv
    import pandas as pd

    # csvfile is the path to the tab-separated file being read
    df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
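To see the effect of the option in isolation, here is a minimal sketch that uses the standard-library csv.reader as a stand-in for pandas (pandas has its own parser, but it accepts the same quoting constants). The sample data and the parse helper are made up for illustration. Note one difference: csv.reader does not raise on the unterminated quote by default, it silently merges the remaining lines into one field, whereas pandas raises the ParserError.

```python
import csv
from io import StringIO

# Hypothetical tab-separated sample: the field on line 2 opens a quote
# that is never closed.
SAMPLE = 'id\ttext\n1\t"unterminated\n2\tok\n'


def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(StringIO(text), delimiter="\t", quoting=quoting))


# Default quoting: the stray quote swallows everything up to the end of input.
rows_default = parse(SAMPLE, csv.QUOTE_MINIMAL)

# QUOTE_NONE: the quote mark is kept as an ordinary character.
rows_literal = parse(SAMPLE, csv.QUOTE_NONE)

print(len(rows_default), len(rows_literal))
print(rows_literal)
```

With QUOTE_NONE every physical line becomes its own row again, at the cost of leaving the quote character inside the field.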


I had the same problem, and it went away after I added these two parameters to my code.

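    read_csv(..., quoting=3, error_bad_lines=False)

Here quoting=3 is the integer value of csv.QUOTE_NONE. error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='skip' instead.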
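If the same script has to run on both old and new pandas installations, the keyword for skipping malformed rows can be picked at run time. The sketch below is only an illustration: bad_line_kwargs is a hypothetical helper, not part of pandas, and it assumes a version string whose first two components are plain integers.

```python
import csv

# quoting=3 is the integer behind csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3


def bad_line_kwargs(pandas_version):
    """Return the 'skip malformed rows' keyword for a pandas version string.

    Hypothetical helper: on_bad_lines replaced error_bad_lines in pandas 1.3.
    """
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    if (major, minor) >= (1, 3):
        return {"on_bad_lines": "skip"}
    return {"error_bad_lines": False}


print(bad_line_kwargs("2.1.4"))
print(bad_line_kwargs("1.2.5"))
```

It could then be used as read_csv(..., quoting=csv.QUOTE_NONE, **bad_line_kwargs(pd.__version__)).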


I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.

From the csv.py docstring:

    * quoting - controls when quotes should be generated by the writer.
        It can take on any of the following module constants:

        csv.QUOTE_MINIMAL means only when required, for example, when a
            field contains either the quotechar or the delimiter
        csv.QUOTE_ALL means that quotes are always placed around fields.
        csv.QUOTE_NONNUMERIC means that quotes are always placed around
            fields which do not parse as integers or floating point
            numbers.
        csv.QUOTE_NONE means that quotes are never placed around fields.
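The four constants in the docstring are easiest to compare on the writer side. The following sketch writes the same row once per mode; the row contents and the write_row helper are illustrative assumptions, not taken from the question.

```python
import csv
from io import StringIO


def write_row(row, **fmt):
    """Serialize one row with csv.writer and return it without the line terminator."""
    buf = StringIO()
    csv.writer(buf, **fmt).writerow(row)
    return buf.getvalue().rstrip("\r\n")


row = ["has,comma", 42, "text"]  # illustrative data

minimal = write_row(row, quoting=csv.QUOTE_MINIMAL)
quote_all = write_row(row, quoting=csv.QUOTE_ALL)
nonnumeric = write_row(row, quoting=csv.QUOTE_NONNUMERIC)
# QUOTE_NONE cannot quote the embedded comma, so the writer needs an
# escapechar to emit this row at all.
no_quotes = write_row(row, quoting=csv.QUOTE_NONE, escapechar="\\")

for label, line in [("MINIMAL", minimal), ("ALL", quote_all),
                    ("NONNUMERIC", nonnumeric), ("NONE", no_quotes)]:
    print(f"{label:<10} {line}")
```

Only the first field is quoted under QUOTE_MINIMAL because it is the only one containing the delimiter.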

csv.QUOTE_MINIMAL is the default value and " is the default quotechar. Although the docstring is phrased in terms of the writer, the same setting also determines whether the reader gives the quotechar special meaning. If a quotechar appears somewhere in your csv file, everything after it is parsed as one string until the next occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars is parsed as a single string: even if it spans many line breaks (which you would expect to be parsed as separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate, consider this example:

    In[4]: import pandas as pd
      ...: from io import StringIO
      ...: test_csv = '''a,b,c
      ...: "d,e,f
      ...: g,h,i
      ...: "m,n,o
      ...: p,q,r
      ...: s,t,u
      ...: '''
      ...:

    In[5]: test = StringIO(test_csv)

    In[6]: pd.read_csv(test)
    Out[6]:
                     a  b  c
    0  d,e,f\ng,h,i\nm  n  o
    1                p  q  r
    2                s  t  u

    In[7]: test_csv_2 = '''a,b,c
      ...: "d,e,f
      ...: g,h,i
      ...: "m,n,o
      ...: "p,q,r
      ...: s,t,u
      ...: '''
      ...: test_2 = StringIO(test_csv_2)
      ...:

    In[8]: pd.read_csv(test_2)
    Traceback (most recent call last):
    ...
    ...
    pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2

The first string contains 2 quotechars (an even number), so every quote is closed and the csv is parsed without an error, although probably not the way we expected. The second string contains 3 (an odd number): the last one is never closed before the EOF is reached, hence the error. The line 2 reported in the error message is misleading, though. We would expect 4, but because everything between the first and second quotechar is parsed as a single string, the parser sees our "p,q,r line as the second one.
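Because the reported line number counts parsed records rather than physical lines, it can help to locate the offending quote directly in the raw text. The sketch below is a rough diagnostic under a simplifying assumption: every quotechar toggles the open/closed state, which is cruder than what a real CSV parser does. unclosed_quote_line is a hypothetical helper name. It counts physical lines starting at 1 with the header, so the "p,q,r line of the second example comes out as 5 (4 if the header is counted as line 0).

```python
def unclosed_quote_line(text, quotechar='"'):
    """Return the 1-based physical line where an unclosed quote was opened, or None."""
    opened_at = None  # line number of the quote that is currently open, if any
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch == quotechar:
                # Toggle: an open quote gets closed, otherwise a new one opens here.
                opened_at = lineno if opened_at is None else None
    return opened_at


test_csv = 'a,b,c\n"d,e,f\ng,h,i\n"m,n,o\np,q,r\ns,t,u\n'
test_csv_2 = 'a,b,c\n"d,e,f\ng,h,i\n"m,n,o\n"p,q,r\ns,t,u\n'

print(unclosed_quote_line(test_csv))    # balanced quotes
print(unclosed_quote_line(test_csv_2))  # third quote is never closed
```

Running this on a file's contents before calling read_csv points at the physical line to inspect, which the parser's own message does not.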