Comparing a dataframe on string lengths for different columns
I think you need a list comprehension, because the string functions work only with a Series (a column):
print ([df[col].str.len().min() for col in ['a','b','c']])
Another solution with apply:
print ([df[col].apply(len).min() for col in ['a','b','c']])
Sample:
import pandas as pd

df = pd.DataFrame({'a':['h','gg','yyy'], 'b':['st','dsws','sw'], 'c':['fffff','','rr'], 'd':[1,3,5]})
print (df)
     a     b      c  d
0    h    st  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
Timings:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop

In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
Conclusion:
apply is slightly faster in this test (2.63 ms vs. 2.88 ms), but it does not work with None:
df = pd.DataFrame({'a':['h','gg','yyy'], 'b':[None,'dsws','sw'], 'c':['fffff','','rr'], 'd':[1,3,5]})
print (df)
     a     b      c  d
0    h  None  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].apply(len).min() for col in ['a','b','c']])
TypeError: object of type 'NoneType' has no len()

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
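The 2.0 for column b in the result above is a float because .str.len() returns NaN for a missing value instead of raising, and min() then skips that NaN. A minimal sketch of this behaviour on a single column (the variable names s and lengths are only illustrative):

```python
import pandas as pd

# One column with a missing value, same values as column 'b' in the sample.
s = pd.Series([None, 'dsws', 'sw'])

# .str.len() does not raise on None; the missing entry becomes NaN.
lengths = s.str.len()
print(lengths)

# min() skips NaN by default, so only the real strings are compared.
print(lengths.min())
```

If an integer is needed afterwards, the result can be cast explicitly, for example with int(lengths.min()), assuming the column has at least one non-missing string.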
EDIT (following a comment), for the minimum length per row:
#fails with None (the outputs shown here are from the sample without None)
print (df[['a','b','c']].applymap(len).min(axis=1))
0    1
1    0
2    2
dtype: int64

#works with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0    1
1    0
2    2
dtype: int64
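As a possible variation on the row-wise version, the lengths can be computed one column at a time with .str.len() and then reduced across columns with min(axis=1). This is only a sketch of an alternative, not something taken from the original answer; the names cols, lengths and row_min are illustrative. It uses the sample that contains None:

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

cols = ['a', 'b', 'c']

# Apply .str.len() to each column (each column is a Series),
# so a None entry turns into NaN instead of raising TypeError.
lengths = df[cols].apply(lambda col: col.str.len())

# Take the minimum across the columns for every row; NaN is skipped.
row_min = lengths.min(axis=1)
print(row_min)
```

Because one length is NaN, the result is floating point here; the values are 1, 0 and 2 for the three rows.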