Understanding the Forest: A Visualization Tool to Support Decision Tree Analysis

Decision Trees (DTs) are one of the most widely used supervised Machine Learning algorithms. The algorithm constructs binary tree data structures that partition the data into smaller segments according to different rules. Hence, training a DT can be seen as the process of finding the optimal rules to separate and classify the items of a dataset. Since DTs rely on a decision process similar to rule-based decisions, they are easily interpretable. However, DTs can be difficult to analyse when dealing with large datasets and/or with multiple trees, i.e., ensembles. To ease the analysis and validation of these models, we developed a visual tool which includes a set of visualizations that give both an overview and details of a set of trees. Our tool aims to provide different perspectives over the same data and further insights into how decisions are being made. In this article, we overview our design process and present the different visualization models and their iterative validation. We also present a use case in the telecommunications domain. In particular, we use the visual tool to help understand how a model based on DTs decides which is the best channel (i.e., phone call, e-mail, or SMS) to contact a client.

 

Understanding the Forest

 

Our tool’s main challenge is to support the analysis of Random Forests (RFs), especially the impact of the different features on the classification result. We aim to enable the user to analyse the RF model by providing a summary view of the results of all DTs and a visualization of each DT’s structure and classification distribution. Our tool is web-based and uses the D3.js library to implement the visualizations. It is divided into two main areas: the upper part and the lower part. The upper part is a fixed dashboard containing a bar chart of the feature importance values of the RF model and a scatterplot of all DTs, positioned on the x-axis according to their impurity and on the y-axis according to their maximum feature importance value. To differentiate the features, we use different colours. The lower part presents the “Classification Grid”, which the user can switch to the “Pyramid Matrix” view.
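As an illustration of the data behind this scatterplot, the following sketch (in TypeScript) derives one point per DT; the TreeSummary record and its fields are hypothetical and not the tool’s actual API.

```typescript
// Illustrative data preparation for the DT scatterplot; TreeSummary and its
// fields are hypothetical, not the tool's actual data model.
interface TreeSummary {
  id: number;
  impurity: number;             // e.g. the tree's average impurity
  featureImportances: number[]; // one importance value per feature
}

interface ScatterPoint {
  treeId: number;
  x: number;       // impurity
  y: number;       // maximum feature importance
  feature: number; // index of the most important feature, used for colouring
}

function toScatterPoints(trees: TreeSummary[]): ScatterPoint[] {
  return trees.map((t) => {
    const y = Math.max(...t.featureImportances);
    return { treeId: t.id, x: t.impurity, y, feature: t.featureImportances.indexOf(y) };
  });
}
```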

 

Classification Grid

 

The “Classification Grid” visualises the class distribution per feature (rows) and tree depth level (columns) to ease the understanding of how each feature can influence the classification individually. Although the rule (i.e., the complete path from root node to leaf node) is what defines the final classification, we aimed to give another perspective on the features’ influence. Our goal is to see the class distribution along depth and the differences between depth levels. On the right side of the “Classification Grid”, we visualise a histogram of the importance values by feature. In this histogram, the higher the bar, the higher the number of DTs with the corresponding importance value. Due to the small size of this graph, bars representing a reduced number of values (i.e., with smaller heights) would be difficult to notice. To overcome this, we draw a thin grey line below each bar, visually highlighting values with few occurrences.
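A minimal sketch of how such a grid could be aggregated, assuming each DT node is available as a record with its split feature, depth, and per-class sample counts; the NodeRecord structure is illustrative, not the tool’s internal representation.

```typescript
// Hypothetical node record: split feature, depth, and per-class sample counts.
interface NodeRecord {
  feature: number;
  depth: number;
  classCounts: number[];
}

// Aggregate class counts into a grid indexed as grid[feature][depth][class].
function buildClassificationGrid(
  nodes: NodeRecord[],
  nFeatures: number,
  maxDepth: number,
  nClasses: number
): number[][][] {
  const grid = Array.from({ length: nFeatures }, () =>
    Array.from({ length: maxDepth + 1 }, () => new Array(nClasses).fill(0))
  );
  for (const node of nodes) {
    for (let c = 0; c < nClasses; c++) {
      grid[node.feature][node.depth][c] += node.classCounts[c];
    }
  }
  return grid;
}
```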

 

Figure 1

Screenshot of the initial state of “Understanding the Forest”. In this stage, the user can visualise a Feature Importance bar chart, a scatterplot with all DTs, and, below, the “Classification Grid”.

 

Pyramid Matrix

 

To represent the influence of the different features, we created the “Pyramid Matrix”. As the node adjacency matrix is symmetric, we divide it in half along its diagonal and rotate it 45 degrees, creating a pyramid shape. To visualise the data, exploit humans’ pattern-recognition ability, and analyse how the different pairs of features influence the classification, we defined two approaches: a Heat Map and a Pie Chart.
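The cells kept after this cut correspond to the unordered feature pairs. A hedged sketch of this step, assuming the symmetric matrix stores one class distribution per feature pair (an assumed data layout):

```typescript
// Keep one half of a symmetric feature-pair matrix (i <= j): these are the
// cells drawn in the pyramid after the 45-degree rotation.
interface PyramidCell {
  featureA: number;
  featureB: number;
  classDistribution: number[]; // assumed per-pair class distribution
}

function pyramidCells(pairMatrix: number[][][]): PyramidCell[] {
  const cells: PyramidCell[] = [];
  for (let i = 0; i < pairMatrix.length; i++) {
    for (let j = i; j < pairMatrix.length; j++) {
      cells.push({ featureA: i, featureB: j, classDistribution: pairMatrix[i][j] });
    }
  }
  return cells;
}
```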

 

Pie Chart Grid

We opted for a pie chart because the number of classes is commonly small. The pie charts are placed in the corresponding cells of the Pyramid Matrix. The aim is to ease the identification of the features that influence the classification the most.

 

Figure 2

The Pie Chart Grid for the Pyramid Matrix. In this visualization, we are able to analyse the distribution of classes.

 

Heat Map Grid

In this approach, we aimed to give more details on the ranges of values that may influence the classification. In short, we created a grid in which each cell represents a range of values. We calculate the class relevance in each cell and paint each cell according to the most relevant class. We add a gradient to represent cells in which the classification is more or less evenly distributed. This means that, for example, a cell in which there is only one class will be painted in a darker shade of that class’s colour than a cell with evenly distributed classifications. With this, ranges with a higher certainty of being attributed to one class will stand out in relation to the others.
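One possible way to compute the colour of each cell, under the assumption that each cell stores per-class sample counts; the rescaling formula is an illustrative choice, not necessarily the one used in the tool.

```typescript
// Pick each cell's dominant class and an intensity that grows with how
// concentrated the class distribution is (1 = single class, 0 = uniform).
interface CellColour {
  dominantClass: number;
  intensity: number; // 0..1, used to darken the class colour
}

function colourCell(classCounts: number[]): CellColour | null {
  const total = classCounts.reduce((a, b) => a + b, 0);
  if (total === 0) return null; // empty cell stays unpainted
  const maxCount = Math.max(...classCounts);
  const share = maxCount / total;          // share of the dominant class
  const uniform = 1 / classCounts.length;  // share under a uniform distribution
  const intensity =
    classCounts.length > 1 ? (share - uniform) / (1 - uniform) : 1;
  return { dominantClass: classCounts.indexOf(maxCount), intensity };
}
```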

 

Figure 3

The Heat Map Grid for the Pyramid Matrix. In this visualization, we are able to analyse the distribution of classes and, in more detail, the ranges of values. Each colour represents one class. The darker the colour, the more relevant the class in that cell.

 

Tree Visualization

 

To visualise the DTs, we used a node-link diagram to represent single tree structures. In our visualization, nodes are divided into decision nodes and leaves. Both types of node represent their impurity through a horizontal bar on a grey scale: the lower the impurity value, the lighter the bar, and vice versa. Below this bar, we write the feature number, the split sign, and the split threshold. We opted to write the number instead of the feature name to prevent overlapping text. The samples in leaf nodes are visualised with a pie chart representing the class distribution.
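The sketch below illustrates these two encodings; the [0, 1] impurity range, the grey endpoints, and the “f3 <= 0.42” label format are assumptions for illustration rather than the tool’s exact values.

```typescript
// Map a node's impurity to a grey level (lighter = purer) and format the split
// label shown below the bar. Range and label format are illustrative.
function impurityToGrey(impurity: number, maxImpurity = 1): string {
  const t = Math.min(Math.max(impurity / maxImpurity, 0), 1);
  const level = Math.round(230 - t * 180); // 230 (light grey) down to 50 (dark grey)
  return `rgb(${level}, ${level}, ${level})`;
}

function splitLabel(featureIndex: number, sign: "<=" | ">", threshold: number): string {
  return `f${featureIndex} ${sign} ${threshold.toFixed(2)}`;
}
```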

 
Regarding the decision nodes, we aimed to represent both the class distribution (i.e., which class has the higher number of samples) and the number of samples at each tree depth level. We tried two bar chart approaches. The first is a typical bar chart. It appears above the impurity bar and has as many bars as classes, representing the number of samples through height. This method is simple to analyse; however, with wide ranges of values (some nodes with many samples and others with few), the smaller values are difficult to see. To overcome this issue, we propose the application of a horizon bar chart. Its representation is similar to that of horizon graphs, but applied to bars. With this strategy, we aim to highlight higher values through colour while improving the readability of smaller values in relation to a simple bar chart.
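A sketch of how a sample count could be wrapped into horizon bands, where bandSize is an assumed parameter rather than a value from the paper.

```typescript
// Wrap a sample count into horizon bands: each band spans the same vertical
// extent and is drawn on top of the previous one in a darker shade; the top
// band's height encodes the remainder.
interface HorizonBand {
  band: number;   // 0 = lightest shade, increasing = darker
  height: number; // fraction of the band extent covered (0..1)
}

function horizonBands(value: number, bandSize: number): HorizonBand[] {
  const bands: HorizonBand[] = [];
  let remaining = value;
  for (let band = 0; remaining > 0; band++) {
    bands.push({ band, height: Math.min(remaining / bandSize, 1) });
    remaining -= bandSize;
  }
  return bands;
}

// Example: horizonBands(230, 100) yields two full bands plus a third at 30%,
// so large counts stand out by colour while small counts keep a readable height.
```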

 
In our tree visualization, the links between nodes also encode information. Their colour represents the split feature used in the parent node, and their thickness represents the number of samples flowing through each link. Hence, splits that divide the samples into subsets of different sizes are highlighted.
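For the link thickness, a simple mapping from sample count to stroke width could look like the following; the linear scale and the 1–8 pixel range are illustrative choices, not taken from the tool.

```typescript
// Map the number of samples flowing through a link to a stroke width.
function linkWidth(samples: number, maxSamples: number, minWidth = 1, maxWidth = 8): number {
  if (maxSamples <= 0) return minWidth;
  return minWidth + (samples / maxSamples) * (maxWidth - minWidth);
}
```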

 

Figure 4

The resulting tree visualization of a DT previously selected by the user.

 

Discussion

 

We conducted a user study and observed that, of the three approaches, the sized stacked bar was the one with the most positive comments. However, this representation does not work well when the number of occurrences is low. Although the “Occurrences Bar” had lower accuracy, the participants said it was important to represent the number of splits in each feature. Hence, we should combine the stacked bar and split threshold methods, or enable the user to choose among the different methods. The feedback given at the end of the user test, together with the participants’ different backgrounds, provides some hints about the usefulness and efficiency of our tool. It can be generally understood, if properly contextualised, and can provide a data scientist with the necessary tools to improve RF models. It can also be used by marketing operators to interpret the RF results and use this knowledge to improve their marketing campaigns.

 
In terms of generalisation, we argue that our tool can be used with any other tree-based ensemble model. Also, our tool can be used both for the analysis and interpretation of RF models and for their verification and improvement. Another aspect to be considered is the scalability of our tool. In the “Classification Grid”, the higher the number of tree depth levels, the wider the grid. This may be an issue, as not all depths may remain visible. To overcome this, we can reduce the length of the bars and add a horizontal scroll. In relation to the colours used to distinguish each feature, these may not be as distinguishable when many features are used in the RF model. One possible solution is to colour only the most important features and assign a single colour to the ones with lower importance values. Regarding the “Pyramid Matrix” approach, the addition of more features would create an overly complex shape. One possible solution would be to decrease the grid density (i.e., reduce the number of cells) so that the squares for each feature pair could be smaller.

 

Acknowledgements

 
This work is funded by the project POWER (grant number POCI-01-0247-FEDER-070365), co-financed by the European Regional Development Fund (FEDER), through Portugal 2020 (PT2020), and by the Competitiveness and Internationalization Operational Programme (COMPETE 2020). This work is also funded by national funds through the FCT – Foundation for Science and Technology, I.P., within the scope of the project CISUC-UID/CEC/00326/2020, and by the European Social Fund, through the Regional Operational Program Centro 2020.

 
Publication

  • C. Maçãs, J. R. Campos, and N. Lourenço, “Understanding the Forest: A Visualization Tool to Support Decision Tree Analysis,” in 27th International Conference Information Visualisation (IV), 2023, pp. 223-229.

Authors

Catarina Maçãs

João R. Campos

Nuno Lourenço