Is there an efficient method of checking whether a column has mixed dtypes?
In pandas there's infer_dtype()
which might be helpful here.
Written in Cython (code link), it returns a string summarising the values in the passed object. It's used a lot in pandas' internals, so we might reasonably expect that it has been designed with efficiency in mind.
>>> from pandas.api.types import infer_dtype
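The answer doesn't show how `df` is constructed, so here is a hypothetical DataFrame (the column values are my own assumption, chosen so that the inferred dtypes match the outputs quoted below):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical DataFrame (not part of the original answer), constructed so
# that each column reproduces one of the inferred dtypes discussed below.
df = pd.DataFrame({
    'A': [1, 'a', 3],       # integers mixed with a string
    'B': [1.0, 2.5, 3.0],   # all floats
    'C': ['x', 'y', 'z'],   # all strings
})

print(infer_dtype(df.A))  # mixed-integer
print(infer_dtype(df.B))  # floating
print(infer_dtype(df.C))  # string
```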
Now, column A is a mix of integers and some other types:
>>> infer_dtype(df.A)
'mixed-integer'
Column B's values are all of floating type:
>>> infer_dtype(df.B)
'floating'
Column C contains strings:
>>> infer_dtype(df.C)
'string'
The general "catchall" type for mixed values is simply "mixed":
>>> infer_dtype(['a string', pd.Timedelta(10)])
'mixed'
A mix of floats and integers is 'mixed-integer-float':

>>> infer_dtype([3.141, 99])
'mixed-integer-float'
To make the function you describe in your question, one approach is to write a function that catches the relevant mixed cases:
def is_mixed(col):
    return infer_dtype(col) in ['mixed', 'mixed-integer']
Then you have:
>>> df.apply(is_mixed)
A     True
B    False
C    False
dtype: bool
Here is an approach that uses the fact that in Python 3 different types cannot be compared. The idea is to run max over the array, which, being a builtin, should be reasonably fast. It also short-circuits.
def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e:
        # we take this to imply mixed type
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg == "'>' not supported between instances of"
        assert and_ == "and"
        assert fst != snd
        return True
    except ValueError as e:
        # catch empty arrays
        assert str(e) == "max() arg is an empty sequence"
        return False
It doesn't catch mixed numeric types, though. Also, objects that just do not support comparison may trip this up.
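To illustrate those caveats concretely, here is the behaviour on a few edge cases (the `ismixed` function from above is repeated so the snippet is self-contained; the sample inputs are my own):

```python
# Repeating the ismixed function from the answer so this snippet runs on its own.
def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e:
        # we take this to imply mixed type
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg == "'>' not supported between instances of"
        assert and_ == "and"
        assert fst != snd
        return True
    except ValueError as e:
        # catch empty arrays
        assert str(e) == "max() arg is an empty sequence"
        return False

print(ismixed([1, 'a', 3]))   # True  -- int and str cannot be compared
print(ismixed([3.141, 99]))   # False -- mixed numerics go undetected
print(ismixed([]))            # False -- empty input
```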
But it's reasonably fast. If we strip away all pandas overhead:

from timeit import timeit

v = df.values

list(map(ismixed, v.T))
# [True, False, False]

timeit(lambda: list(map(ismixed, v.T)), number=1000)
# 0.008936170022934675
For comparison
timeit(lambda: list(map(infer_dtype, v.T)), number=1000)
# 0.02499613002873957
Not sure how you need the result, but you can map type over df.values.ravel() and build a dictionary linking each column's name to whether the set of types in that column's slice of the resulting list l has more than one element, such as:
l = list(map(type, df.values.ravel()))
print({df.columns[i]: len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])})

{'A': True, 'B': False, 'C': False}
Timing:
%timeit df.applymap(type).nunique() > 1
# 3.25 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
l = list(map(type, df.values.ravel()))
{df.columns[i]: len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])}
# 100 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
EDIT: for a larger dataframe, the improvement in speed is less pronounced, though:
dfl = pd.concat([df]*100000, ignore_index=True)

%timeit dfl.applymap(type).nunique() > 1
# 519 ms ± 61.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
l = list(map(type, dfl.values.ravel()))
{dfl.columns[i]: len(set(l[i::dfl.shape[1]])) > 1 for i in range(dfl.shape[1])}
# 254 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A slightly faster solution based on the same idea:
%timeit {col: len(set(map(type, dfl[col]))) > 1 for col in dfl.columns}
# 124 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
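For convenience, that per-column comprehension can be wrapped in a small helper. This is a sketch of my own (the `mixed_columns` name and the sample DataFrame are not from the answers above):

```python
import pandas as pd

def mixed_columns(df: pd.DataFrame) -> dict:
    # A column is flagged as mixed when its values span more than one
    # Python type; iterating a column yields the stored objects directly.
    return {col: len(set(map(type, df[col]))) > 1 for col in df.columns}

# Hypothetical example data matching the shape used earlier in the thread.
df = pd.DataFrame({
    'A': [1, 'a', 3],
    'B': [1.0, 2.5, 3.0],
    'C': ['x', 'y', 'z'],
})
print(mixed_columns(df))  # {'A': True, 'B': False, 'C': False}
```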