Pandas ParserError EOF character when reading multiple csv files to HDF5

python csv python-3.x pandas hdf5

I had a similar problem. The line reported in the 'EOF inside string' error held a string that contained a single quote mark. Adding the option quoting=csv.QUOTE_NONE fixed my problem.

For example:

    import csv
    import pandas as pd

    # csvfile is the path to your tab-separated input file
    df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')


I had the same problem, and after adding these two parameters to my read_csv call, the problem was gone.

    read_csv(..., quoting=3, error_bad_lines=False)
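For readers on a newer pandas, here is a minimal sketch of the same two options. It assumes pandas 1.3 or later, where error_bad_lines=False was deprecated in favor of on_bad_lines="skip" (the old keyword was removed in pandas 2.0). The sample data and column names are made up for illustration; quoting=3 is simply the numeric value of csv.QUOTE_NONE.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical tab-separated sample: the second data row opens a quote
# that is never closed.
sample = 'id\ttext\n1\tplain\n2\t"unterminated\n3\tlast\n'

df = pd.read_csv(
    StringIO(sample),
    sep="\t",
    quoting=csv.QUOTE_NONE,  # same as quoting=3: quote marks are ordinary characters
    on_bad_lines="skip",     # pandas >= 1.3 replacement for error_bad_lines=False
)
print(df)
```

Note that with quoting disabled the quote mark stays in the data, so the text column keeps the literal value "unterminated with its leading quote. Skipping bad lines also drops rows silently, so it is worth checking the row count afterwards.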


I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.

From the csv.py docstring:

    * quoting - controls when quotes should be generated by the writer.
        It can take on any of the following module constants:

        csv.QUOTE_MINIMAL means only when required, for example, when a
            field contains either the quotechar or the delimiter
        csv.QUOTE_ALL means that quotes are always placed around fields.
        csv.QUOTE_NONNUMERIC means that quotes are always placed around
            fields which do not parse as integers or floating point
            numbers.
        csv.QUOTE_NONE means that quotes are never placed around fields.
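The writer-side behavior described in that docstring can be checked with a small sketch using only the standard library. The helper write_row and the sample row are hypothetical names introduced here for illustration.

```python
import csv
from io import StringIO


def write_row(row, quoting):
    """Write one row with the given quoting mode and return the resulting text."""
    buffer = StringIO()
    csv.writer(buffer, quoting=quoting, lineterminator="\n").writerow(row)
    return buffer.getvalue()


row = ["plain", "has,comma", 42]
print(write_row(row, csv.QUOTE_MINIMAL), end="")     # plain,"has,comma",42
print(write_row(row, csv.QUOTE_ALL), end="")         # "plain","has,comma","42"
print(write_row(row, csv.QUOTE_NONNUMERIC), end="")  # "plain","has,comma",42
print(csv.QUOTE_NONE)                                # 3
```

The last line shows why quoting=3 and quoting=csv.QUOTE_NONE are interchangeable in read_csv.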

csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If a quotechar appears somewhere in your csv file, everything after it is parsed as a string until the next occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars is parsed as a single string: even if it spans many line breaks (which you would expect to separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate, consider this example:

    In[4]: import pandas as pd
      ...: from io import StringIO
      ...: test_csv = '''a,b,c
      ...: "d,e,f
      ...: g,h,i
      ...: "m,n,o
      ...: p,q,r
      ...: s,t,u
      ...: '''
      ...:
    In[5]: test = StringIO(test_csv)
    In[6]: pd.read_csv(test)
    Out[6]:
                     a  b  c
    0  d,e,f\ng,h,i\nm  n  o
    1                p  q  r
    2                s  t  u
    In[7]: test_csv_2 = '''a,b,c
      ...: "d,e,f
      ...: g,h,i
      ...: "m,n,o
      ...: "p,q,r
      ...: s,t,u
      ...: '''
      ...: test_2 = StringIO(test_csv_2)
      ...:
    In[8]: pd.read_csv(test_2)
    Traceback (most recent call last):
    ...
    ...
    pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2

The first string has 2 (an even number of) quotechars, so each quotechar is closed and the csv is parsed without an error, although probably not the way we expected. The second string has 3 (an odd number of) quotechars. The last one is never closed before the EOF is reached, hence the error. The line 2 that we get in the error message is misleading, though. We would expect 4, but since everything between the first and second quotechar is parsed as a single string, our "p,q,r line is actually the second one.
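Because the reported line number can be off, it may help to scan the raw file for lines with an unbalanced quotechar. The following is a rough heuristic sketch, not part of pandas: the helper name lines_with_odd_quotes is made up, and it reports 1-based physical line numbers with the header counted as line 1, so the numbers differ from those in the parser error.

```python
from io import StringIO


def lines_with_odd_quotes(handle, quotechar='"'):
    """Return 1-based numbers of physical lines holding an odd number of quotechars."""
    return [
        number
        for number, line in enumerate(handle, start=1)
        if line.count(quotechar) % 2 == 1
    ]


# Same content as test_csv_2 in the session above.
test_csv_2 = 'a,b,c\n"d,e,f\ng,h,i\n"m,n,o\n"p,q,r\ns,t,u\n'
print(lines_with_odd_quotes(StringIO(test_csv_2)))  # [2, 4, 5]
```

For a file on disk, pass an open file object instead of the StringIO. A legitimately quoted field that spans several lines will also show up in the result, so treat the output as a list of candidates to inspect rather than a list of errors.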
