pandas read_csv and filter columns with usecols pandas read_csv and filter columns with usecols python python

pandas read_csv and filter columns with usecols


The solution lies in understanding these two keyword arguments:

  • names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.
  • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

So because you have a header row, passing header=0 is sufficient and additionally passing names appears to be confusing pd.read_csv.

Removing names from the second call gives the desired output:

import pandas as pdfrom StringIO import StringIOcsv = r"""dummy,date,loc,xbar,20090101,a,1bar,20090102,a,3bar,20090103,a,5bar,20090101,b,1bar,20090102,b,3bar,20090103,b,5"""df = pd.read_csv(StringIO(csv),        header=0,        index_col=["date", "loc"],         usecols=["date", "loc", "x"],        parse_dates=["date"])

Which gives us:

                xdate       loc2009-01-01 a    12009-01-02 a    32009-01-03 a    52009-01-01 b    12009-01-02 b    32009-01-03 b    5


This code achieves what you want --- also its weird and certainly buggy:

I observed that it works when:

a) you specify the index_col rel. to the number of columns you really use -- so its three columns in this example, not four (you drop dummy and start counting from then onwards)

b) same for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted the names to mirror this behaviour

import pandas as pdfrom StringIO import StringIOcsv = """dummy,date,loc,xbar,20090101,a,1bar,20090102,a,3bar,20090103,a,5bar,20090101,b,1bar,20090102,b,3bar,20090103,b,5"""df = pd.read_csv(StringIO(csv),        index_col=[0,1],        usecols=[1,2,3],         parse_dates=[0],        header=0,        names=["date", "loc", "", "x"])print df

which prints

                xdate       loc   2009-01-01 a    12009-01-02 a    32009-01-03 a    52009-01-01 b    12009-01-02 b    32009-01-03 b    5


If your csv file contains extra data, columns can be deleted from the DataFrame after import.

import pandas as pdfrom StringIO import StringIOcsv = r"""dummy,date,loc,xbar,20090101,a,1bar,20090102,a,3bar,20090103,a,5bar,20090101,b,1bar,20090102,b,3bar,20090103,b,5"""df = pd.read_csv(StringIO(csv),        index_col=["date", "loc"],         usecols=["dummy", "date", "loc", "x"],        parse_dates=["date"],        header=0,        names=["dummy", "date", "loc", "x"])del df['dummy']

Which gives us:

                xdate       loc2009-01-01 a    12009-01-02 a    32009-01-03 a    52009-01-01 b    12009-01-02 b    32009-01-03 b    5