What to do with missing values when plotting with seaborn?

python python-2.7 pandas data-analysis seaborn

This is a known issue with matplotlib/pylab histograms!

See e.g. https://github.com/matplotlib/matplotlib/issues/6483

where various workarounds are suggested, two favourites (for example from https://stackoverflow.com/a/19090183/1021819) being:

    import numpy as np

    nbins = 100
    A = data['alcconsumption']
    Anan = A[~np.isnan(A)]  # Remove the NaNs
    seaborn.distplot(Anan, hist=True, bins=nbins)

Alternatively, specify the bin edges explicitly (in this case still computing them from the NaN-free Anan):

    Amin = min(Anan)
    Amax = max(Anan)
    seaborn.distplot(A, hist=True, bins=np.linspace(Amin, Amax, nbins))
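A minimal sketch of both workarounds without any plotting, assuming a small made-up array in place of data['alcconsumption']. It uses numpy.histogram only to show the bin counts once the NaNs are filtered out; the variable names other than A and Anan are illustrative.

```python
import numpy as np

# Hypothetical stand-in for data['alcconsumption'] (values are made up).
A = np.array([0.5, 1.0, np.nan, 2.5, 4.0, np.nan, 3.0])

# Workaround 1: drop the NaNs before binning.
Anan = A[~np.isnan(A)]
counts, edges = np.histogram(Anan, bins=4)

# Workaround 2: build explicit, finite bin edges from the NaN-free min/max.
# Note that n edges define n - 1 bins, so 5 edges give 4 bins.
explicit_edges = np.linspace(np.nanmin(A), np.nanmax(A), 5)
counts_explicit, _ = np.histogram(Anan, bins=explicit_edges)

print(counts, edges)
print(counts_explicit, explicit_edges)
```

Both calls end up with the same finite edges here, which is the point of the workaround: the bin range is derived from real values only.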

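You can use the following line to select the non-NaN values for a distribution plot using seaborn. Note that dropna() returns the values themselves, whereas notnull() only returns a boolean mask, so plotting the result of notnull() would show the distribution of True/False rather than of the data:

    seaborn.distplot(data['alcconsumption'].dropna(), hist=True, bins=100)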
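Before passing a column to the plot, it can help to check what the selection actually returns. The sketch below uses a small made-up Series (not the original data) to contrast notnull(), which yields a boolean mask, with dropna() and boolean indexing, which yield the non-NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['alcconsumption'] (values are made up).
s = pd.Series([0.5, np.nan, 2.5, np.nan, 4.0])

mask = s.notnull()       # boolean Series: True where a value is present
values = s.dropna()      # the non-NaN values themselves
same_values = s[mask]    # equivalent selection via boolean indexing

print(mask.tolist())
print(values.tolist())
```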

I would definitely handle missing values before you plot your data. Whether or not to use dropna() depends entirely on the nature of your dataset. Is alcconsumption a single series or part of a dataframe? In the latter case, calling dropna() on the dataframe would remove the corresponding rows in the other columns as well. Are the missing values few or many? Are they spread around in your series, or do they tend to occur in groups? Is there perhaps reason to believe that there is a trend in your dataset?

If the missing values are few and scattered, you could easily use dropna(). In other cases I would choose to fill missing values with the previously observed value (1), or even fill them with interpolated values (2). But be careful! Replacing a lot of data with filled or interpolated observations could seriously distort your dataset and lead to very wrong conclusions.
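The questions above can be checked quickly in pandas. The following is a rough sketch on a made-up two-column frame (the name df_demo and its values are hypothetical): it counts the missing values and shows the difference between dropping NaNs from one Series and from the whole DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only column 'A' has gaps.
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan, 5.0],
    'B': [10, 20, 30, 40, 50],
})

n_missing = df_demo['A'].isna().sum()       # how many values are missing
frac_missing = df_demo['A'].isna().mean()   # share of missing values

a_only = df_demo['A'].dropna()              # drops NaNs from the Series only
rows_dropped = df_demo.dropna()             # drops whole rows, so 'B' shrinks too

print(n_missing, frac_missing)
print(len(a_only), len(rows_dropped), len(df_demo))
```

If the share of missing values is small, dropping them is usually harmless for a histogram; if it is large, any fill strategy will visibly shape the plotted distribution.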

Here are some examples that use your snippet...

    seaborn.distplot(data['alcconsumption'], hist=True, bins=100)
    plt.xlabel('AlcoholConsumption')
    plt.ylabel('Frequency(normalized 0->1)')

... on a synthetic dataset:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    def sample(rows, names):
        ''' Function to create data sample with random returns

        Parameters
        ==========
        rows : number of rows in the dataframe
        names: list of names to represent assets

        Example
        =======
        >>> sample(rows = 2, names = ['A', 'B'])

                      A       B
        2017-01-01  0.0027  0.0075
        2017-01-02 -0.0050 -0.0024
        '''
        listVars = names
        rng = pd.date_range('1/1/2017', periods=rows, freq='D')
        df_temp = pd.DataFrame(np.random.randint(-100, 100, size=(rows, len(listVars))), columns=listVars)
        df_temp = df_temp.set_index(rng)
        return df_temp

    df = sample(rows = 15, names = ['A', 'B'])
    df['A'][8:12] = np.nan
    df

Output:

                   A   B
    2017-01-01 -63.0  10
    2017-01-02  49.0  79
    2017-01-03 -55.0  59
    2017-01-04  89.0  34
    2017-01-05 -13.0 -80
    2017-01-06  36.0  90
    2017-01-07 -41.0  86
    2017-01-08  10.0 -81
    2017-01-09   NaN -61
    2017-01-10   NaN -80
    2017-01-11   NaN -39
    2017-01-12   NaN  24
    2017-01-13 -73.0 -25
    2017-01-14 -40.0  86
    2017-01-15  97.0  60

(1) Using forward fill with pandas.DataFrame.fillna(method='ffill')

ffill will "fill values forward", meaning it will replace each NaN with the last observed value from the rows above.

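    # Store the filled column under a new name so that df stays intact for (2).
    # In recent pandas versions, df['A'].ffill() is the preferred spelling.
    A_ffill = df['A'].fillna(axis=0, method='ffill')
    sns.distplot(A_ffill, hist=True, bins=5)
    plt.xlabel('AlcoholConsumption')
    plt.ylabel('Frequency(normalized 0->1)')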
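To see what forward filling does to individual values, here is a small sketch on a made-up Series (not the dataset above). It uses Series.ffill(), which performs the same forward fill as fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a gap of two values.
idx = pd.date_range('2017-01-01', periods=5, freq='D')
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

filled = s.ffill()   # each NaN takes the last observed value before it

print(filled.tolist())   # [10.0, 10.0, 10.0, 40.0, 50.0]
```

The repeated 10.0 values all land in the same histogram bin, which is why a long gap filled this way can inflate one bar of the distribution.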

(2) Using interpolation with pandas.DataFrame.interpolate()

Interpolate values according to different methods. Time interpolation works on daily and higher resolution data to interpolate given length of interval.

    df['A'] = df['A'].interpolate(method = 'time')
    sns.distplot(df['A'], hist=True, bins=5)
    plt.xlabel('AlcoholConsumption')
    plt.ylabel('Frequency(normalized 0->1)')
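As a small illustration of time interpolation, the sketch below applies interpolate(method='time') to a made-up daily Series (not the dataset above). The values are rounded before printing only to hide floating-point noise.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a gap of two values.
idx = pd.date_range('2017-01-01', periods=5, freq='D')
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

# Gaps are filled linearly with respect to the time elapsed between
# the neighbouring observations; this requires a DatetimeIndex.
interpolated = s.interpolate(method='time')

print(interpolated.round(6).tolist())   # [10.0, 20.0, 30.0, 40.0, 50.0]
```

Unlike forward fill, the filled values are spread between the neighbouring observations, so they tend to fall into different histogram bins.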

As you can see, the different methods render two very different results. I hope this will be useful to you. If not then let me know and I'll have a look at it again.
