interactive conditional histogram bucket slicing data visualization


To get the interaction effect you're looking for, you must bin all the columns you care about together, so that they share the same bin edges.

The cleanest way I can think of to do this is to stack the columns into a single Series and then use pd.cut.

Considering your sample df

df_ = pd.cut(df[['A', 'B']].stack(), 5, labels=list(range(5))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'
pd.concat([df, df_], axis=1)
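To see why stacking matters, here is a minimal sketch on hypothetical toy data (the values are made up for illustration; only the column names A and B come from the answer). It compares binning each column on its own with binning the stacked Series, using labels=False so that pd.cut returns integer bucket codes:

```python
import pandas as pd

# Hypothetical toy data, not the asker's df.
df = pd.DataFrame({"A": [1, 10, 19], "B": [51, 75, 99]})

# Per-column binning: each column gets its own edges, so bucket 0 of A
# and bucket 0 of B cover different value ranges.
separate = df.apply(lambda s: pd.cut(s, 5, labels=False))

# Joint binning: stack into one Series so every value is cut with the
# same edges, then unstack back into columns.
shared = pd.cut(df.stack(), 5, labels=False).unstack()

print(separate)
print(shared)
```

With separate binning both columns spread over buckets 0 to 4, even though A holds much smaller values than B. With shared edges all of A lands in the lowest bucket, which is what makes the buckets comparable across columns.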


Let's build a better example and look at a visualization using seaborn:

df = pd.DataFrame(dict(A=(np.random.randn(10000) * 100 + 20).astype(int),
                       B=(np.random.randn(10000) * 100 - 20).astype(int)))

import seaborn as sns

df.index = df.index.to_series().astype(str).radd('city')

df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'

sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")


Or how about some data with some correlation:

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 100000)
df = pd.DataFrame(data, columns=["A", "B"])

df.index = df.index.to_series().astype(str).radd('city')

df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'

sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")
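If you want the numbers behind a plot like this, one option (my suggestion, not part of the answer above) is to tabulate the bucket pairs with pd.crosstab. The sketch below assumes a smaller sample, a fixed seed, and 5 buckets purely to keep the output readable; each row of the table is then the histogram of B conditional on one bucket of A:

```python
import numpy as np
import pandas as pd

# Assumed for illustration: seeded generator, 1000 points, 5 buckets.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 1], [(1, .5), (.5, 1)], 1000)
df = pd.DataFrame(data, columns=["A", "B"])

# Stack, cut with shared edges, unstack, and name the bucket columns.
bkt = pd.cut(df.stack(), 5, labels=False).unstack().add_suffix("bkt")

# Joint counts: rows are buckets of A, columns are buckets of B.
counts = pd.crosstab(bkt["Abkt"], bkt["Bbkt"])

# Conditional histogram of B for the rows whose A falls in bucket 2.
b_given_a2 = counts.loc[2]

print(counts)
print(b_given_a2)
```

Selecting a different row of counts is the non-interactive equivalent of picking a bucket in one histogram and watching the other one update.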


Interactive bokeh

Without getting too complicated

from bokeh.io import show, output_notebook, output_file
from bokeh.plotting import figure
from bokeh.layouts import row, column
from bokeh.models import ColumnDataSource, Select, CustomJS

output_notebook()

# generate random data
flips = np.random.choice((1, -1), (5, 5))
flips = np.tril(flips, -1) + np.triu(flips, 1) + np.eye(flips.shape[0])

half = np.ones((5, 5)) / 2
cov = (half + np.diag(np.diag(half))) * flips
mean = np.zeros(5)

data = np.random.multivariate_normal(mean, cov, 10000)
df = pd.DataFrame(data, columns=list('ABCDE'))

df.index = df.index.to_series().astype(str).radd('city')

# Stack and cut to get dependent relationships
b = 20
df_ = pd.cut(df.stack(), b, labels=list(range(b))).unstack()

# assign default columns x and y.  These will be the columns I set bokeh to read
df_[['x', 'y']] = df_.loc[:, ['A', 'B']]

source = ColumnDataSource(data=df_)

tools = 'box_select,pan,box_zoom,wheel_zoom,reset,resize,save'

p = figure(plot_width=600, plot_height=300)
p.circle('x', 'y', source=source, fill_color='olive', line_color='black', alpha=.5)

def gcb(like, n):
    code = """
    var data = source.get('data');
    var f = cb_obj.get('value');
    data['{0}{1}'] = data[f];
    source.trigger('change');
    """
    return CustomJS(args=dict(source=source), code=code.format(like, n))

xcb = CustomJS(
    args=dict(source=source),
    code="""
    var data = source.get('data');
    var colm = cb_obj.get('value');
    data['x'] = data[colm];
    source.trigger('change');
    """
)

ycb = CustomJS(
    args=dict(source=source),
    code="""
    var data = source.get('data');
    var colm = cb_obj.get('value');
    data['y'] = data[colm];
    source.trigger('change');
    """
)

options = list('ABCDE')
x_select = Select(options=options, callback=xcb, value='A')
y_select = Select(options=options, callback=ycb, value='B')

show(column(p, row(x_select, y_select)))


Here is a new solution using bokeh and HoloViews. It should address the interactive part of the question a little better.

I try to remember that simple is beautiful when it comes to dataviz.

I used the Faker library to generate random city names and make the following graphs more realistic.

I'll leave all my code here, even though the most important part is the choice of libraries.

import pandas as pd
import numpy as np
from faker import Faker

def generate_random_dataset(city_number,
                            list_identifier,
                            labels,
                            bins,
                            city_location='en_US'):
    fake = Faker(locale=city_location)
    # one row per city, one uniform(0, 100) column per identifier
    df = pd.DataFrame(data=np.random.uniform(0, 100, [city_number, len(list_identifier)]),
                      index=[fake.city() for _ in range(city_number)],
                      columns=list_identifier)
    # add a bucket column ('Abkt', 'Bbkt', ...) for each identifier
    for name in list_identifier:
        df[name + 'bkt'] = pd.Series(pd.cut(df[name], bins, labels=labels))
    return df

list_identifier = list('ABC')
labels = ['Low', 'Medium', 'Average', 'Good', 'Great']
bins = np.array([-1, 20, 40, 60, 80, 101])
df = generate_random_dataset(30, list_identifier, labels, bins)
df.head()

This displays the first five rows of df.
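As a quick check of how the explicit edges map values to labels, here is a small sketch using the same bins and labels on a few hand-picked values (the values themselves are hypothetical). By default pd.cut uses right-closed intervals, so a value equal to an edge falls into the lower bucket, and the first edge of -1 is what lets 0 land in 'Low':

```python
import pandas as pd

labels = ['Low', 'Medium', 'Average', 'Good', 'Great']
bins = [-1, 20, 40, 60, 80, 101]

# Hypothetical sample values chosen to sit on or near the edges.
values = pd.Series([0, 20, 20.5, 79.9, 100])

# Intervals are (-1, 20], (20, 40], (40, 60], (60, 80], (80, 101].
buckets = pd.cut(values, bins, labels=labels)

print(pd.DataFrame({'value': values, 'bucket': buckets}))
```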

Sometimes, when your dataset is small, showing a simple chart with colors can be enough.

from bokeh.charts import Bar, output_file, show
from bokeh.layouts import column

bar = []
for name in list_identifier:
    bar.append(Bar(df, label='index', values=name, stack=name+'bkt',
               title="percentage of " + name, legend='top_left', plot_width=1024))

output_file('cities.html')
show(column(bar))

This creates a new HTML page (cities.html) containing the graphs. Note that all graphs generated with bokeh are interactive.

[Figures: graphA, graphB]

Bokeh can't plot hexbins out of the box, but HoloViews can, which lets you draw interactive plots within an IPython notebook.

The syntax is quite straightforward: you just need a matrix with two columns, then call the hist method:

import holoviews as hv
hv.notebook_extension('bokeh')

df = generate_random_dataset(1000, list_identifier, list(range(5)), 5)
points = hv.Points(np.column_stack((df.A, df.B)))
points.hist(num_bins=5, dimension=['x', 'y'])

[Figure: distribution of A and B]

To compare with @piRSquared's solution, I borrowed a bit of code (thank you, by the way :) to show the data with some correlation:

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 100000)
df = pd.DataFrame(data, columns=["A", "B"])
df.index = df.index.to_series().astype(str).radd('city')

df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'

points = hv.Points(np.column_stack((df_.Abkt, df_.Bbkt)))
points.hist(num_bins=5, dimension=['x', 'y'])

[Figure: mean covariance]

Please consider visiting the HoloViews tutorial.


As a newbie with insufficient rep, I can't comment, so I'm putting this here as an "answer," though it shouldn't be treated as one; these are just some incomplete suggestions in the same vein as the comments.

Along with the others, I like seaborn though I'm not sure those plots are interactive in the way you are seeking. While I haven't used bokeh, my understanding is that it provides more in the way of interactivity, but regardless of the package, as you move beyond 3 and 4 variables, you can only cram so much into one (family of) charts.

As for working in your table directly, the aforementioned df.hist() (by lanery) is a good start. Once you have those bins, you can then play with the immensely powerful df.groupby() function. I've been using pandas for two years now, and that function STILL blows my mind. While not interactive, it will definitely help you slice and dice your data as you see fit.
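For what it's worth, here is a rough sketch of what I mean by combining bins with groupby. The data, bin edges, and column names are made up for illustration; the idea is just to bucket one column and then summarize another column within each bucket:

```python
import pandas as pd

# Hypothetical data: bucket A, then summarize B per bucket.
df = pd.DataFrame({"A": [5, 15, 35, 55, 95],
                   "B": [10, 20, 30, 40, 50]})

# Assumed bin edges; labels=False gives integer bucket codes 0..4.
edges = [0, 20, 40, 60, 80, 100]
df["Abkt"] = pd.cut(df["A"], edges, labels=False)

# Count and mean of B within each bucket of A.
summary = df.groupby("Abkt")["B"].agg(["count", "mean"])

print(summary)
```

Buckets with no rows (bucket 3 here) simply don't appear in the result.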