Is there an efficient method of checking whether a column has mixed dtypes?

In pandas there's infer_dtype(), which might be helpful here.

Written in Cython (code link), it returns a string summarising the values in the passed object. It's used a lot in pandas' internals, so we might reasonably expect that it has been designed with efficiency in mind.

>>> from pandas.api.types import infer_dtype
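The examples below assume a small frame along these lines (a hypothetical reconstruction; the values are chosen to match the outputs shown):

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 'a'],       # ints plus a string
...                    'B': [1.0, 2.0, 3.0],   # all floats
...                    'C': ['a', 'b', 'c']})  # all strings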

Now, column A is a mix of integers and some other types:

>>> infer_dtype(df.A)
'mixed-integer'

Column B's values are all of floating type:

>>> infer_dtype(df.B)
'floating'

Column C contains strings:

>>> infer_dtype(df.C)
'string'

The general "catchall" type for mixed values is simply "mixed":

>>> infer_dtype(['a string', pd.Timedelta(10)])
'mixed'

A mix of floats and integers is 'mixed-integer-float':

>>> infer_dtype([3.141, 99])
'mixed-integer-float'

To build the check you describe in your question, one approach is to write a function that catches the relevant mixed cases:

def is_mixed(col):
    return infer_dtype(col) in ['mixed', 'mixed-integer']

Then you have:

>>> df.apply(is_mixed)
A     True
B    False
C    False
dtype: bool
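Note that is_mixed deliberately leaves out 'mixed-integer-float'; if you'd rather count every mixed-* result, a broader variant (my sketch, not part of the original answer) is:

def is_mixed_any(col):
    # Treat 'mixed', 'mixed-integer' and 'mixed-integer-float' alike.
    return infer_dtype(col).startswith('mixed')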


Here is an approach that uses the fact that in Python 3 different types cannot be compared. The idea is to run max over the array; being a builtin, it should be reasonably fast, and it does short-circuit.

def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e:  # we take this to imply mixed type
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg == "'>' not supported between instances of"
        assert and_ == "and"
        assert fst != snd
        return True
    except ValueError as e:  # catch empty arrays
        assert str(e) == "max() arg is an empty sequence"
        return False

It doesn't catch mixed numeric types, though. Also, objects that just do not support comparison may trip this up.
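To illustrate both caveats (hypothetical inputs, using the ismixed above):

>>> ismixed([1, 2.5, 3])   # int/float mix compares fine, so it goes undetected
False
>>> ismixed([1+2j, 3+4j])  # complex lacks '>' even within a single type
Traceback (most recent call last):
  ...
AssertionError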

But it's reasonably fast. If we strip away all pandas overhead:

from timeit import timeit

v = df.values
list(map(ismixed, v.T))
# [True, False, False]

timeit(lambda: list(map(ismixed, v.T)), number=1000)
# 0.008936170022934675

For comparison:

timeit(lambda: list(map(infer_dtype, v.T)), number=1000)
# 0.02499613002873957


Not sure how you need the result, but you can map type over df.values.ravel() and build a dictionary linking each column name to whether the corresponding slice of the resulting list l holds more than one distinct type. Since ravel() flattens row by row, column i's values sit at every df.shape[1]-th position, so l[i::df.shape[1]] recovers that column:

l = list(map(type, df.values.ravel()))
print({df.columns[i]: len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])})
# {'A': True, 'B': False, 'C': False}

Timing:

%timeit df.applymap(type).nunique() > 1
# 3.25 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit l = list(map(type, df.values.ravel()))
{df.columns[i]: len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])}
# 100 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

EDIT: for a larger dataframe, the gain in time is less pronounced, though:

dfl = pd.concat([df]*100000, ignore_index=True)

%timeit dfl.applymap(type).nunique() > 1
# 519 ms ± 61.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
l = list(map(type, dfl.values.ravel()))
{dfl.columns[i]: len(set(l[i::dfl.shape[1]])) > 1 for i in range(dfl.shape[1])}
# 254 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A slightly faster solution along the same lines:

%timeit {col: len(set(map(type, dfl[col]))) > 1 for col in dfl.columns}
# 124 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
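Wrapped into a reusable helper (the name mixed_columns is mine; the logic is the same as above):

def mixed_columns(frame):
    # Map each column name to whether it holds more than one Python type.
    return {col: len(set(map(type, frame[col]))) > 1 for col in frame.columns}

# mixed_columns(df) -> {'A': True, 'B': False, 'C': False}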