interactive conditional histogram bucket slicing data visualization
In order to get the interaction effect you're looking for, you must bin all the columns you care about, together.
The cleanest way I can think of doing this is to stack
into a single series
then use pd.cut
Considering your sample df
df_ = pd.cut(df[['A', 'B']].stack(), 5, labels=list(range(5))).unstack()df_.columns = df_.columns.to_series() + 'bkt'pd.concat([df, df_], axis=1)
Let's build a better example and look at a visualization using seaborn
df = pd.DataFrame(dict(A=(np.random.randn(10000) * 100 + 20).astype(int), B=(np.random.randn(10000) * 100 - 20).astype(int)))import seaborn as snsdf.index = df.index.to_series().astype(str).radd('city')df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()df_.columns = df_.columns.to_series() + 'bkt'sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")
Or how about some data with some correlation
mean, cov = [0, 1], [(1, .5), (.5, 1)]data = np.random.multivariate_normal(mean, cov, 100000)df = pd.DataFrame(data, columns=["A", "B"])df.index = df.index.to_series().astype(str).radd('city')df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()df_.columns = df_.columns.to_series() + 'bkt'sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")
Interactive bokeh
Without getting too complicated
from bokeh.io import show, output_notebook, output_filefrom bokeh.plotting import figurefrom bokeh.layouts import row, columnfrom bokeh.models import ColumnDataSource, Select, CustomJSoutput_notebook()# generate random dataflips = np.random.choice((1, -1), (5, 5))flips = np.tril(flips, -1) + np.triu(flips, 1) + np.eye(flips.shape[0])half = np.ones((5, 5)) / 2cov = (half + np.diag(np.diag(half))) * flipsmean = np.zeros(5)data = np.random.multivariate_normal(mean, cov, 10000)df = pd.DataFrame(data, columns=list('ABCDE'))df.index = df.index.to_series().astype(str).radd('city')# Stack and cut to get dependent relationshipsb = 20df_ = pd.cut(df.stack(), b, labels=list(range(b))).unstack()# assign default columns x and y. These will be the columns I set bokeh to readdf_[['x', 'y']] = df_.loc[:, ['A', 'B']]source = ColumnDataSource(data=df_)tools = 'box_select,pan,box_zoom,wheel_zoom,reset,resize,save'p = figure(plot_width=600, plot_height=300)p.circle('x', 'y', source=source, fill_color='olive', line_color='black', alpha=.5)def gcb(like, n): code = """ var data = source.get('data'); var f = cb_obj.get('value'); data['{0}{1}'] = data[f]; source.trigger('change'); """ return CustomJS(args=dict(source=source), code=code.format(like, n))xcb = CustomJS( args=dict(source=source), code=""" var data = source.get('data'); var colm = cb_obj.get('value'); data['x'] = data[colm]; source.trigger('change'); """)ycb = CustomJS( args=dict(source=source), code=""" var data = source.get('data'); var colm = cb_obj.get('value'); data['y'] = data[colm]; source.trigger('change'); """)options = list('ABCDE')x_select = Select(options=options, callback=xcb, value='A')y_select = Select(options=options, callback=ycb, value='B')show(column(p, row(x_select, y_select)))
Here is a new solution using bokeh
and HoloViews
. It should respond a little more to the interactive part.
I try to remember that simple is beautiful when it comes to dataviz.
I used faker
library in order to generate random city names to make the following graphs more realistic.
I will let all my codes here even if the most important part is the choice of the libraries.
import pandas as pdimport numpy as npfrom faker import Fakerdef generate_random_dataset(city_number, list_identifier, labels, bins, city_location='en_US'): fake = Faker(locale=city_location) df = pd.DataFrame(data=np.random.uniform(0, 100, len(list_identifier)]), index=[fake.city() for _ in range(city_number)], columns=list_identifier) for name in list_identifier: df[name + 'bkt'] = pd.Series(pd.cut(df[name], bins, labels=labels)) return dflist_identifier=list('ABC')labels = ['Low', 'Medium', 'Average', 'Good', 'Great']bins = np.array([-1, 20, 40, 60, 80, 101])df = generate_random_dataset(30, list_identifier, labels, bins)df.head()
Sometimes, when your dataset is small, exposing a simple chart with colors could be enough.
from bokeh.charts import Bar, output_file, showfrom bokeh.layouts import columnbar = []for name in list_identifier: bar.append(Bar(df, label='index', values=name, stack=name+'bkt', title="percentage of " + name, legend='top_left', plot_width=1024))output_file('cities.html')show(column(bar))
Will create a new html page (cities) containing the graphs. Note that all the graphs generated with bokeh
are interactive.
bokeh
can't initially plot hexbin. However, HoloViews
can. Thus, it allows to draw interactive plots whitin ipython notebook
.
The syntax is quite straightforward, you just need a Matrix with two columns and call the hist method:
import holoviews as hvhv.notebook_extension('bokeh')df = generate_random_dataset(1000, list_identifier, list(range(5)), 5)points = hv.Points(np.column_stack((df.A, df.B)))points.hist(num_bins=5, dimension=['x', 'y'])
To compare with @piRSquared solution, I stole a bit of code (thank you btw :) to show the data with some correlation:
mean, cov = [0, 1], [(1, .5), (.5, 1)]data = np.random.multivariate_normal(mean, cov, 100000)df = pd.DataFrame(data, columns=["A", "B"])df.index = df.index.to_series().astype(str).radd('city')df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()df_.columns = df_.columns.to_series() + 'bkt'points = hv.Points(np.column_stack((df_.Abkt, df_.Bbkt)))points.hist(num_bins=5, dimension=['x', 'y'])
Please consider visit HoloViews
tutorial.
As a newbie with insufficient rep, I can't comment, so I'm putting this here as an "answer," though it shouldn't be treated as one; these are just some incomplete suggestions in the same vein as the comments.
Along with the others, I like seaborn
though I'm not sure those plots are interactive in the way you are seeking. While I haven't used bokeh
, my understanding is that it provides more in the way of interactivity, but regardless of the package, as you move beyond 3 and 4 variables, you can only cram so much into one (family of) charts.
As for in your table directly, the aforementioned df.hist()
(by lanery) is a good start. Once you have those bins, you can then play with the immensely powerful df.groupby()
function. I've been using pandas for two years now, and that function STILL blows my mind. While not interactive, it will definitely help you slice and dice your data as you see fit.