Selecting multiple columns in a Pandas dataframe
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__
syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This will happen with the second way of indexing, so you can modify it with the .copy()
method to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the look out for this.
df1 = df.iloc[0, 0:2].copy() # To avoid the case where changing df1 also changes df
To use iloc
, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc
along with get_loc
function of columns
method of dataframe object to obtain column indices.
{df.columns.get_loc(c): c for idx, c in enumerate(df.columns)}
Now you can use this dictionary to access columns through names and using iloc
.
As of version 0.11.0, columns can be sliced in the manner you tried using the .loc
indexer:
df.loc[:, 'C':'E']
is equivalent to
df[['C', 'D', 'E']] # or df.loc[:, ['C', 'D', 'E']]
and returns columns C
through E
.
A demo on a randomly generated DataFrame:
import pandas as pdimport numpy as npnp.random.seed(5)df = pd.DataFrame(np.random.randint(100, size=(100, 6)), columns=list('ABCDEF'), index=['R{}'.format(i) for i in range(100)])df.head()Out: A B C D E FR0 99 78 61 16 73 8R1 62 27 30 80 7 76R2 15 53 80 27 44 77R3 75 65 47 30 84 86R4 18 9 41 62 1 82
To get the columns from C to E (note that unlike integer slicing, 'E' is included in the columns):
df.loc[:, 'C':'E']Out: C D ER0 61 16 73R1 30 80 7R2 80 27 44R3 47 30 84R4 41 62 1R5 5 58 0...
The same works for selecting rows based on labels. Get the rows 'R6' to 'R10' from those columns:
df.loc['R6':'R10', 'C':'E']Out: C D ER6 51 27 31R7 83 19 18R8 11 67 65R9 78 27 29R10 7 16 94
.loc
also accepts a Boolean array so you can select the columns whose corresponding entry in the array is True
. For example, df.columns.isin(list('BCD'))
returns array([False, True, True, True, False, False], dtype=bool)
- True if the column name is in the list ['B', 'C', 'D']
; False, otherwise.
df.loc[:, df.columns.isin(list('BCD'))]Out: B C DR0 78 61 16R1 27 30 80R2 53 80 27R3 65 47 30R4 9 41 62R5 78 5 58...
Assuming your column names (df.columns
) are ['index','a','b','c']
, then the data you want is in thethird and fourth columns. If you don't know their names when your script runs, you can do this
newdf = df[df.columns[2:4]] # Remember, Python is zero-offset! The "third" entry is at slot two.
As EMS points out in his answer, df.ix
slices columns a bit more concisely, but the .columns
slicing interface might be more natural, because it uses the vanilla one-dimensional Python list indexing/slicing syntax.
Warning: 'index'
is a bad name for a DataFrame
column. That same label is also used for the real df.index
attribute, an Index
array. So your column is returned by df['index']
and the real DataFrame index is returned by df.index
. An Index
is a special kind of Series
optimized for lookup of its elements' values. For df.index it's for looking up rows by their label. That df.columns
attribute is also a pd.Index
array, for looking up columns by their labels.