When to apply(pd.to_numeric) and when to astype(np.float64) in python?
If your columns already have numeric dtypes (int8|16|32|64, float64, boolean), you can convert them to another "numeric" dtype using the pandas .astype() method.
Demo:
In [90]: df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'), dtype=np.int64)

In [91]: df
Out[91]:
         a        b        c
0  9059440  9590567  2076918
1  5861102  4566089  1947323
2  6636568   162770  2487991
3  6794572  5236903  5628779
4   470121  4044395  4546794

In [92]: df.dtypes
Out[92]:
a    int64
b    int64
c    int64
dtype: object

In [93]: df['a'] = df['a'].astype(float)

In [94]: df.dtypes
Out[94]:
a    float64
b      int64
c      int64
dtype: object
It won't work for object (string) columns containing values that can't be converted to numbers:
In [95]: df.loc[1, 'b'] = 'XXXXXX'

In [96]: df
Out[96]:
           a        b        c
0  9059440.0  9590567  2076918
1  5861102.0   XXXXXX  1947323
2  6636568.0   162770  2487991
3  6794572.0  5236903  5628779
4   470121.0  4044395  4546794

In [97]: df.dtypes
Out[97]:
a    float64
b     object
c      int64
dtype: object

In [98]: df['b'].astype(float)
...skipped...
ValueError: could not convert string to float: 'XXXXXX'
So here we want to use the pd.to_numeric() function with errors='coerce', which turns unparseable values into NaN:
In [99]: df['b'] = pd.to_numeric(df['b'], errors='coerce')

In [100]: df
Out[100]:
           a          b        c
0  9059440.0  9590567.0  2076918
1  5861102.0        NaN  1947323
2  6636568.0   162770.0  2487991
3  6794572.0  5236903.0  5628779
4   470121.0  4044395.0  4546794

In [101]: df.dtypes
Out[101]:
a    float64
b    float64
c      int64
dtype: object
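For anyone who wants to reproduce the comparison outside IPython, here is a minimal self-contained sketch. The column values are made up for illustration, and column b is cast to object before the string is assigned (an assumption on my part, to avoid dtype warnings in newer pandas versions); the rest mirrors the session above.

```python
import numpy as np
import pandas as pd

# Small frame with integer columns (illustrative values, not the ones above).
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, dtype=np.int64)

# Numeric -> numeric: astype is enough.
df["a"] = df["a"].astype(np.float64)

# Make column b hold mixed values; it is now an object column.
df["b"] = df["b"].astype(object)
df.loc[1, "b"] = "XXXXXX"

# astype(float) raises because 'XXXXXX' cannot be parsed as a number.
try:
    df["b"].astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors='coerce' replaces unparseable values with NaN.
b_numeric = pd.to_numeric(df["b"], errors="coerce")

print(df["a"].dtype, astype_failed, b_numeric.tolist())
```

In short: astype() is the simpler choice when the data is known to be clean, while pd.to_numeric(..., errors='coerce') is the safer choice when the column may contain junk values.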
I don't have a technical explanation for this, but I have noticed that pd.to_numeric() (with the default errors='raise') raises the following error when converting the string 'nan':
In [10]: df = pd.DataFrame({'value': 'nan'}, index=[0])

In [11]: pd.to_numeric(df.value)
Traceback (most recent call last):
  File "<ipython-input-11-98729d13e45c>", line 1, in <module>
    pd.to_numeric(df.value)
  File "C:\Users\joshua.lee\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
    coerce_numeric=coerce_numeric)
  File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "nan" at position 0
whereas astype(float) does not:
In [12]: df.value.astype(float)
Out[12]:
0   NaN
Name: value, dtype: float64
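Whether pd.to_numeric() accepts the string 'nan' may depend on the installed pandas version, so it is worth checking locally rather than relying on the traceback above. The sketch below is only a probe under that assumption: it reports what your version does and does not assume either outcome. The Series is built with dtype=object so it resembles the column in the example.

```python
import pandas as pd

# Object-dtype Series holding the literal string 'nan'.
s = pd.Series(["nan"], name="value", dtype=object)

# astype(float) succeeds, as in the session above.
as_float = s.astype(float)

# Probe to_numeric with the default errors='raise'.
try:
    result = pd.to_numeric(s)
    outcome = f"parsed as {result.iloc[0]}"
except ValueError as exc:
    outcome = f"raised ValueError: {exc}"

# errors='coerce' yields NaN whether 'nan' is parsed or rejected.
coerced = pd.to_numeric(s, errors="coerce")

print(pd.__version__, outcome)
```

If you need NaN regardless of version, passing errors='coerce' sidesteps the difference.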