Pandas read_csv: parsing time field correctly
In Python, datetimes are generally represented as datetime.datetime objects. These are relatively inefficient, which is why Pandas uses its own Timestamp type, backed by a 64-bit integer (nanoseconds since the Unix epoch).
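To make the numeric backing concrete, here is a minimal sketch (the timestamp value itself is arbitrary):

```python
import datetime
import pandas as pd

ts = pd.Timestamp("2016-04-27 02:18:00")

# A Timestamp is backed by a single 64-bit integer:
# nanoseconds since the Unix epoch, exposed as .value.
print(ts.value)

# It converts back to a standard-library datetime when needed.
print(ts.to_pydatetime())
```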
To read the data, pass both column names as a nested list (note the double brackets around the parse_dates argument):
df = pd.read_csv(filename, parse_dates=[['TranDate', 'TranTime']])

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 3 columns):
TranDate_TranTime    18 non-null datetime64[ns]
TranID               18 non-null object
TranAmt              18 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)

>>> df.head()
    TranDate_TranTime   TranID    TranAmt
0 2016-04-27 02:18:00  A123456    9999.53
1 2016-04-26 02:48:00  B123457   26070.33
2 2016-04-25 03:18:00  C123458   13779.56
3 2016-04-24 03:18:00  A123459   18157.26
4 2016-04-23 04:18:00  B123460  215868.15
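Note that newer pandas releases have deprecated the nested-list form of parse_dates; a version-proof alternative is to read the two columns as strings and combine them yourself with pd.to_datetime. A self-contained sketch (the inline CSV mirrors the first two rows of the data above):

```python
import io
import pandas as pd

# Inline sample mirroring the CSV layout from the question.
csv_data = io.StringIO(
    "TranDate,TranTime,TranID,TranAmt\n"
    "2016-04-27,02:18:00,A123456,9999.53\n"
    "2016-04-26,02:48:00,B123457,26070.33\n"
)

df = pd.read_csv(csv_data)

# Concatenate the two string columns, then parse once into datetime64[ns].
df["TranDate_TranTime"] = pd.to_datetime(df["TranDate"] + " " + df["TranTime"])
df = df.drop(columns=["TranDate", "TranTime"])
print(df.dtypes)
```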
The date and time columns have been combined into a single column. Once you have this timestamp column, it is easy to access its attributes via the dt accessor, e.g.
>>> df.groupby(df.TranDate_TranTime.dt.hour).TranAmt.sum().head()
TranDate_TranTime
2     36069.86
3     31936.82
4    239563.40
5     64651.24
6     62010.65
Name: TranAmt, dtype: float64

>>> df.groupby(df.TranDate_TranTime.dt.day).TranAmt.sum().head()
TranDate_TranTime
19     68939.63
20    262215.42
21     15929.45
22     36902.23
23    242122.84
Name: TranAmt, dtype: float64
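The same pattern extends to grouping on several datetime attributes at once. A self-contained sketch with made-up amounts (column names follow the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "TranDate_TranTime": pd.to_datetime([
        "2016-04-27 02:18:00",
        "2016-04-26 02:48:00",
        "2016-04-27 03:18:00",
    ]),
    "TranAmt": [100.0, 200.0, 300.0],
})

# Group by day and hour together; .rename() gives the index levels clear names.
out = df.groupby([
    df.TranDate_TranTime.dt.day.rename("day"),
    df.TranDate_TranTime.dt.hour.rename("hour"),
]).TranAmt.sum()
print(out)
```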
Refer to the Pandas docs for more information regarding Pandas date functionality.
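Beyond .dt.hour and .dt.day, the accessor exposes many more attributes; a brief sampler (the dates are arbitrary):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2016-04-27 02:18:00", "2016-04-23 04:18:00"]))

print(s.dt.year)                # both 2016
print(s.dt.dayofweek)           # Monday=0 ... Sunday=6
print(s.dt.date)                # plain datetime.date objects (object dtype)
print(s.dt.strftime("%Y-%m"))   # formatted strings
```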
- No, aggregation is not affected, but you will lose the time part.
- No; in most cases you can still access the time part through the .dt accessor.
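On losing the time part: if you deliberately want to drop it, there are two common routes. dt.date leaves plain Python objects, while dt.normalize() keeps the efficient datetime64 dtype. A small sketch with arbitrary timestamps:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2016-04-27 02:18:00", "2016-04-27 06:18:00"]))

# dt.date: plain datetime.date objects (object dtype; time is gone).
print(s.dt.date)

# dt.normalize(): still datetime64[ns], with the time set to midnight.
print(s.dt.normalize())
```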
import pandas as pd

df = pd.read_csv('MyTest.csv', parse_dates=[['TranDate', 'TranTime']])
print(df)

     TranDate_TranTime   TranID    TranAmt
0  2016-04-27 02:18:00  A123456    9999.53
1  2016-04-26 02:48:00  B123457   26070.33
2  2016-04-25 03:18:00  C123458   13779.56
3  2016-04-24 03:18:00  A123459   18157.26
4  2016-04-23 04:18:00  B123460  215868.15
5  2016-04-22 04:18:00  C123461   23695.25
6  2016-04-21 05:18:00  A123462      57.00
7  2016-04-20 05:18:00  B123463   64594.24
8  2016-04-19 06:18:00  C123464   47890.91
9  2016-04-27 06:18:00  A123465   14119.74
10 2016-04-26 07:18:00  B123466    2649.60
11 2016-04-25 07:18:00  C123467   16757.38
12 2016-04-24 08:18:00  A123468    8864.78
13 2016-04-23 08:18:00  B123469   26254.69
14 2016-04-22 09:18:00  C123470   13206.98
15 2016-04-21 09:18:00  A123471   15872.45
16 2016-04-20 10:18:00  B123472  197621.18
17 2016-04-19 10:18:00  C123473   21048.72
Parse and manage the date/time as a single column wherever possible, using the nested-bracket form parse_dates=[['TranDate', 'TranTime']].
print(df.groupby(df.TranDate_TranTime.dt.hour).sum())

      TranAmt
2    36069.86
3    31936.82
4   239563.40
5    64651.24
6    62010.65
7    19406.98
8    35119.47
9    29079.43
10  218669.90

print(df.groupby(df.TranDate_TranTime.dt.minute).sum())

      TranAmt
18  710437.42
48   26070.33
This gives the aggregations you want, as above. You can also still group by after resampling, as below:
df2 = df.set_index('TranDate_TranTime').resample('60s').sum().dropna()
print(df2)

                       TranAmt
TranDate_TranTime
2016-04-19 06:18:00   47890.91
2016-04-19 10:18:00   21048.72
2016-04-20 05:18:00   64594.24
2016-04-20 10:18:00  197621.18
2016-04-21 05:18:00      57.00
2016-04-21 09:18:00   15872.45
2016-04-22 04:18:00   23695.25
2016-04-22 09:18:00   13206.98
2016-04-23 04:18:00  215868.15
2016-04-23 08:18:00   26254.69
2016-04-24 03:18:00   18157.26
2016-04-24 08:18:00    8864.78
2016-04-25 03:18:00   13779.56
2016-04-25 07:18:00   16757.38
2016-04-26 02:48:00   26070.33
2016-04-26 07:18:00    2649.60
2016-04-27 02:18:00    9999.53
2016-04-27 06:18:00   14119.74

print(df2.groupby(df2.index.day).sum())

      TranAmt
19   68939.63
20  262215.42
21   15929.45
22   36902.23
23  242122.84
24   27022.04
25   30536.94
26   28719.93
27   24119.27
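The resample-then-groupby pattern in miniature, self-contained (the amounts are made up, and 'D' buckets by calendar day rather than the 60-second buckets above):

```python
import pandas as pd

idx = pd.to_datetime([
    "2016-04-19 06:18:00",
    "2016-04-19 10:18:00",
    "2016-04-20 05:18:00",
])
df2 = pd.DataFrame({"TranAmt": [10.0, 20.0, 30.0]}, index=idx)

daily = df2.resample("D").sum()                 # one row per calendar day
by_day = daily.groupby(daily.index.day).sum()   # then group on the day number
print(by_day)
```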