How to extract sklearn decision tree rules to pandas boolean conditions?

python pandas machine-learning scikit-learn decision-tree

First, use the approach from the scikit-learn documentation on decision tree structure to read the arrays that describe the fitted tree:

    n_nodes = clf.tree_.node_count
    children_left = clf.tree_.children_left
    children_right = clf.tree_.children_right
    feature = clf.tree_.feature
    threshold = clf.tree_.threshold
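To make the layout of these arrays concrete, here is a minimal sketch with hand-written lists that mimic them for a small depth-2 tree. The values are made up for illustration and are not taken from the question's data. It assumes scikit-learn's convention that a leaf has -1 in both children arrays and -2 as its feature index.

```python
# Hypothetical stand-ins for clf.tree_.children_left, children_right, etc.
# Node ids are positions in the lists; node 0 is the root.
TREE_LEAF = -1  # marker used in the children arrays for "no child"

children_left = [1, 2, -1, -1, 5, -1, -1]
children_right = [4, 3, -1, -1, 6, -1, -1]
feature = [0, 1, -2, -2, 1, -2, -2]            # column index tested at each node
threshold = [127.5, 26.45, -2.0, -2.0, 28.15, -2.0, -2.0]

n_nodes = len(children_left)

# A node is a leaf when it has no left child (it then has no right child either).
leaves = [n for n in range(n_nodes) if children_left[n] == TREE_LEAF]
internal = [n for n in range(n_nodes) if children_left[n] != TREE_LEAF]

print("leaves:", leaves)
print("internal nodes:", internal)
```

Each internal node `n` sends a row to `children_left[n]` when the value in column `feature[n]` is `<= threshold[n]`, and to `children_right[n]` otherwise.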

We then define two functions. The first, find_path, recursively finds the path from the tree's root to a given node (in our case, each leaf). The second, get_rule, writes out the conditions along that path as a pandas boolean expression:

    def find_path(node_numb, path, x):
        path.append(node_numb)
        if node_numb == x:
            return True
        left = False
        right = False
        if children_left[node_numb] != -1:
            left = find_path(children_left[node_numb], path, x)
        if children_right[node_numb] != -1:
            right = find_path(children_right[node_numb], path, x)
        if left or right:
            return True
        path.remove(node_numb)
        return False


    def get_rule(path, column_names):
        mask = ''
        for index, node in enumerate(path):
            # We check if we are not in the leaf
            if index != len(path) - 1:
                # Do we go under or over the threshold?
                if children_left[node] == path[index + 1]:
                    mask += "(df['{}']<= {}) \t ".format(column_names[feature[node]], threshold[node])
                else:
                    mask += "(df['{}']> {}) \t ".format(column_names[feature[node]], threshold[node])
        # We insert the & at the right places
        mask = mask.replace("\t", "&", mask.count("\t") - 1)
        mask = mask.replace("\t", "")
        return mask
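As an alternative to the recursive search, the path to a node can also be found by recording each node's parent once and then walking upward from the target. The sketch below is not part of the original answer; `build_parents` and `path_to` are hypothetical helper names, and the arrays are made-up stand-ins for the `clf.tree_` arrays.

```python
# Hypothetical stand-ins for clf.tree_.children_left / children_right.
children_left = [1, 2, -1, -1, 5, -1, -1]
children_right = [4, 3, -1, -1, 6, -1, -1]


def build_parents(children_left, children_right):
    """Map each non-root node id to the id of its parent."""
    parents = {}
    for node, (left, right) in enumerate(zip(children_left, children_right)):
        if left != -1:  # internal node: it has both a left and a right child
            parents[left] = node
            parents[right] = node
    return parents


def path_to(node, parents):
    """Return the node ids from the root down to `node`, inclusive."""
    path = [node]
    while path[-1] in parents:      # the root is the only node without a parent
        path.append(parents[path[-1]])
    return path[::-1]               # reverse so the path starts at the root


parents = build_parents(children_left, children_right)
print(path_to(5, parents))
print(path_to(3, parents))
```

The returned list is already in root-to-node order, so it can be passed to a function like get_rule without any further sorting.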

Finally, we use these two functions to store the path to each leaf, and then the rule that leads to each leaf:

    import numpy as np

    # Leaves
    leave_id = clf.apply(X_test)

    paths = {}
    for leaf in np.unique(leave_id):
        path_leaf = []
        find_path(0, path_leaf, leaf)
        paths[leaf] = np.unique(np.sort(path_leaf))

    rules = {}
    for key in paths:
        rules[key] = get_rule(paths[key], pima.columns)

With the data you gave, the output is:

    rules = {
        3: "(df['insulin']<= 127.5) & (df['bp']<= 26.450000762939453) & (df['bp']<= 9.100000381469727)  ",
        4: "(df['insulin']<= 127.5) & (df['bp']<= 26.450000762939453) & (df['bp']> 9.100000381469727)  ",
        6: "(df['insulin']<= 127.5) & (df['bp']> 26.450000762939453) & (df['skin']<= 27.5)  ",
        7: "(df['insulin']<= 127.5) & (df['bp']> 26.450000762939453) & (df['skin']> 27.5)  ",
        10: "(df['insulin']> 127.5) & (df['bp']<= 28.149999618530273) & (df['insulin']<= 145.5)  ",
        11: "(df['insulin']> 127.5) & (df['bp']<= 28.149999618530273) & (df['insulin']> 145.5)  ",
        13: "(df['insulin']> 127.5) & (df['bp']> 28.149999618530273) & (df['insulin']<= 158.5)  ",
        14: "(df['insulin']> 127.5) & (df['bp']> 28.149999618530273) & (df['insulin']> 158.5)  ",
    }

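Since the rules are strings, you can't index with them directly using df[rules[3]]; you have to evaluate them first, like so: df[eval(rules[3])]. Note that the rule strings refer to the DataFrame by the name df, so it must be available under that name when eval runs.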
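If you would rather avoid eval, the same conditions can be combined directly into a boolean Series. The following is a minimal sketch under stated assumptions, not the original answer's method: `path_mask` is a hypothetical helper, and the tree arrays, column names, and DataFrame are made up for illustration.

```python
import operator
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the clf.tree_ arrays and the feature columns.
children_left = [1, 2, -1, -1, 5, -1, -1]
children_right = [4, 3, -1, -1, 6, -1, -1]
feature = [0, 1, -2, -2, 1, -2, -2]
threshold = [127.5, 26.45, -2.0, -2.0, 28.15, -2.0, -2.0]
column_names = ["insulin", "bp"]

df = pd.DataFrame({
    "insulin": [100.0, 120.0, 130.0, 200.0],
    "bp": [20.0, 30.0, 25.0, 40.0],
})


def path_mask(df, path):
    """Boolean Series selecting the rows that follow `path` (root-to-node ids)."""
    conditions = []
    for node, child in zip(path[:-1], path[1:]):
        column = df[column_names[feature[node]]]
        if child == children_left[node]:
            conditions.append(column <= threshold[node])   # went left
        else:
            conditions.append(column > threshold[node])    # went right
    if not conditions:
        # A path containing only the root places no restriction on the rows.
        return pd.Series(True, index=df.index)
    return reduce(operator.and_, conditions)


mask = path_mask(df, [0, 4, 5])   # insulin > 127.5 and bp <= 28.15
print(df[mask])
```

One caveat: scikit-learn converts the features to 32-bit floats internally, so a value that sits almost exactly on a threshold could, in principle, be routed differently by the fitted tree than by a comparison on the original 64-bit column.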


scikit-learn now provides export_text, which prints the rules of a fitted tree as plain text:

    from sklearn.tree import export_text

    r = export_text(loan_tree, feature_names=(list(X_train.columns)))
    print(r)

A complete example from the scikit-learn documentation:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.tree import export_text

    iris = load_iris()
    X = iris['data']
    y = iris['target']
    decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
    decision_tree = decision_tree.fit(X, y)
    r = export_text(decision_tree, feature_names=iris['feature_names'])
    print(r)


I figured out a further solution to this problem (a second part to the one posted by vlemaistre) which allows the user to run through any node and subset the data based on the pandas boolean condition.

    node_id = 3

    def datatree_path_summarystats(node_id):
        for k, v in paths.items():
            if node_id in v:
                d = k, v
        ruleskey = d[0]
        numberofsteps = sum(map(lambda x: x < node_id, d[1]))
        for k, v in rules.items():
            if k == ruleskey:
                b = k, v
        stringsubset = b[1]
        datasubset = "&".join(stringsubset.split('&')[:numberofsteps])
        return datasubset

    datasubset = datatree_path_summarystats(node_id)
    df[eval(datasubset)]

This function looks up a path that contains the node id you are interested in, counts how many nodes on that path come before it, and keeps only that many conditions from the corresponding leaf's rule. The result is the boolean condition that subsets the dataframe to the rows reaching that specific node.
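The truncation step can be sketched in isolation as follows. This is an illustrative rewrite rather than the code above: `rule_for_node` is a hypothetical name, and the `paths` and `rules` dictionaries are shortened, made-up examples in the same format. It assumes each path is ordered from root to leaf and that no column name contains the `&` character.

```python
# Made-up examples in the same format as the paths and rules dictionaries.
paths = {
    3: [0, 1, 2, 3],
    4: [0, 1, 2, 4],
}
rules = {
    3: "(df['insulin']<= 127.5) & (df['bp']<= 26.45) & (df['bp']<= 9.1)  ",
    4: "(df['insulin']<= 127.5) & (df['bp']<= 26.45) & (df['bp']> 9.1)  ",
}


def rule_for_node(node_id, paths, rules):
    """Return the condition string for the rows that reach `node_id`."""
    for leaf, path in paths.items():
        path = list(path)
        if node_id in path:
            # The number of conditions equals the number of nodes above node_id.
            steps = path.index(node_id)
            conditions = rules[leaf].split("&")[:steps]
            return "&".join(conditions).strip()
    raise KeyError("node {} is not on any stored path".format(node_id))


print(rule_for_node(2, paths, rules))
print(rule_for_node(4, paths, rules))
```

For the root node this returns an empty string, since every row reaches the root; that case needs to be handled separately before passing the result to eval.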
