Interactive Network Visualization of Gene Expression Time-Series Data

In biology, data visualization is used to better understand structures and processes ranging from phylogenetic trees to multiple layers of molecular networks. The latter are especially challenging to visualize due to the large number of heterogeneous elements and complex relationships, often with no perceptible structure.

 
We propose a tool that uses interactive visualization models to represent the dynamic behaviors of molecular networks, employing various methods to explore and organize the data, including clustering, force-directed layouts, and a timeline for navigating through time-series data. To further analyze temporal attributes, the timeline can be distorted through a force-directed layout that spatially positions time points according to their similarity. Additionally, gene expression data can be annotated through an integrated biological database. The developed visualization models were validated with a time-series RNA-Seq gene expression dataset of HIV-1 infection.

 

Figure 1

Screenshot of the application with time-series network data clustered into 12 clusters, with one cluster selected.

 
 

Framework

 

The canvas of the developed tool is divided into a user interface and a data visualization, which together allow the user to load, sort, and visualize external datasets. The data visualization uses a dynamic network model and an interactive timeline to represent and analyze relational data over time. While these visualization models are generic and therefore adaptable to any dataset that complies with a specific structure, the tool also integrates data-specific methods for the analysis of biological data and time series. Gene Ontology databases have been integrated into the tool to identify, classify, and sort protein datasets using their respective annotations.

 
Through the interface, the user can load dataset files, control the number of visual elements, and dynamically sort the data by its various attributes. The user can also interact with the visualization using the mouse, for example by selecting elements, panning, and zooming. If the user selects nodes or clusters in the network, information on their attributes is shown in the interface, including Gene Ontology annotation data, if available.

 
 

Network Visualization

 

The tool’s main data visualization is a dynamic network model that uses force-directed layouts to visually sort its nodes. When attribute data is loaded, its values are mapped to the size of the respective nodes. Additionally, when representing time-series data, the brightness of each node’s color is mapped to the variation of its values over time. As such, nodes increase and decrease in brightness along with their attributes over time.
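
The mapping described above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names, the radius range, and the choice to scale size against the whole dataset and brightness against each node's own range over time are all assumptions made for illustration.

```python
def normalize(value, lo, hi):
    """Scale value into [0, 1]; a flat range maps to 0.5."""
    if hi == lo:
        return 0.5
    return (value - lo) / (hi - lo)


def node_visuals(expression, t, min_radius=4.0, max_radius=20.0):
    """Map each node's value at time index t to (radius, brightness).

    expression: dict of node name -> list of values, one per time point.
    Assumption: radius is scaled against the range of the whole dataset,
    brightness against the node's own range over time.
    """
    all_values = [v for series in expression.values() for v in series]
    g_lo, g_hi = min(all_values), max(all_values)
    visuals = {}
    for node, series in expression.items():
        size = normalize(series[t], g_lo, g_hi)
        radius = min_radius + size * (max_radius - min_radius)
        brightness = normalize(series[t], min(series), max(series))
        visuals[node] = (radius, brightness)
    return visuals
```

Calling `node_visuals(expression, t)` again for each new time index would make nodes grow, shrink, brighten, and dim as the timeline advances.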

 
Edges are hidden by default to avoid the visual clutter caused by the large number of relationships in complex networks, and to emphasize the use of position to portray similarity between elements that may not share direct relationships. When edges are shown, they are drawn using an edge bundling algorithm that transforms them into organic, fluid structures with perceptible directions.

 
When a network is first loaded, the nodes are sorted by the Yifan Hu force-directed layout, which iteratively recalculates the nodes’ positions according to their relationships. Afterwards, a generic force-directed layout is applied throughout the application’s runtime to dynamically adapt the visualization to the user’s selections. The layout uses attraction and repulsion to control the relative position of elements according to their relationships. This is used in the distribution of clusters, in the distribution of nodes within each cluster, and in the creation of a time curve.
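
As a rough illustration of how attraction and repulsion can drive such a layout, the following Python sketch performs one iteration of a generic force-directed update. It is neither the Yifan Hu algorithm nor the tool's own code: the force laws (inverse-square repulsion between all pairs, linear attraction along edges) and the parameter values are assumptions chosen for simplicity.

```python
import math


def force_step(positions, edges, repulsion=1.0, attraction=0.1, step=0.05):
    """Return new positions after one force-directed iteration.

    positions: dict of node -> (x, y); edges: iterable of (node, node).
    """
    forces = {n: [0.0, 0.0] for n in positions}
    nodes = list(positions)

    # Every pair of nodes repels, more strongly when close together.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = positions[a][0] - positions[b][0]
            dy = positions[a][1] - positions[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = repulsion / (dist * dist)
            forces[a][0] += f * dx / dist
            forces[a][1] += f * dy / dist
            forces[b][0] -= f * dx / dist
            forces[b][1] -= f * dy / dist

    # Connected nodes attract in proportion to their distance.
    for a, b in edges:
        dx = positions[b][0] - positions[a][0]
        dy = positions[b][1] - positions[a][1]
        forces[a][0] += attraction * dx
        forces[a][1] += attraction * dy
        forces[b][0] -= attraction * dx
        forces[b][1] -= attraction * dy

    return {
        n: (positions[n][0] + step * forces[n][0],
            positions[n][1] + step * forces[n][1])
        for n in positions
    }
```

Repeating `force_step` until the positions stop changing gives a layout in which related elements sit close together. Under the same assumptions, nodes could be pulled toward their clusters by adding edges to a hypothetical cluster-center node.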

 

Figure 2

Screenshots of the network clustered by position with a preset number of clusters, where the nodes are initially positioned using the Yifan Hu layout (top) and are then attracted to their respective clusters (bottom).

 
 

Clustering

 

Nodes can be clustered into groups based on either their current positions in the network or their similarity to other nodes in terms of their attributes, such as gene expression variation between time points, which identifies clusters of proteins that share similar activation patterns. Because a hierarchical clustering algorithm is used, the number of clusters can be changed without additional recalculations, and the visualization adapts to these changes dynamically. Each node is assigned a hue value based on its position in the similarity matrix calculated during clustering, meaning that similar nodes are also chromatically closer.
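
The reason a hierarchical approach allows the number of clusters to change without recalculation can be sketched in Python: the merge order is computed once, and any cluster count is then obtained by replaying a prefix of that order. The linkage criterion (average), the Euclidean distance, and the use of differences between consecutive time points as the variation profile are assumptions; the text does not state which ones the tool uses.

```python
import math


def variation(series):
    """Differences between consecutive time points (assumed profile)."""
    return [b - a for a, b in zip(series, series[1:])]


def build_merges(profiles):
    """Run average-linkage agglomerative clustering once.

    profiles: dict of node name -> feature vector.
    Returns (names, merges), where merges lists (id_a, id_b, new_id) in order.
    """
    names = sorted(profiles)
    clusters = {i: [name] for i, name in enumerate(names)}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                total = sum(math.dist(profiles[x], profiles[y])
                            for x in clusters[a] for y in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return names, merges


def cut(names, merges, k):
    """Replay the first n - k merges to obtain exactly k clusters."""
    if not 1 <= k <= len(names):
        raise ValueError("k must be between 1 and the number of nodes")
    clusters = {i: [name] for i, name in enumerate(names)}
    for a, b, new_id in merges[: len(names) - k]:
        clusters[new_id] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(members) for members in clusters.values())
```

After a single call to `build_merges`, calling `cut` with a different `k` only replays stored merges, which is what makes an interactive cluster-count control cheap.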

 
 

Timeline and Time Curve

 

The interactive timeline consists of a slider that can be dragged to switch between points in time, which updates the visual properties of the nodes according to their attributes at each point. The “Time Flow” button starts an automatic, cyclical movement of the slider, allowing the user to interact with the visualization while watching the values change over time. When a node is hovered over or selected, a line graph of that node’s attribute values over time is shown on top of the timeline. If a cluster is hovered over, the graph displays the average of the attribute values of the nodes in that cluster at each time point.
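
The two computations mentioned above, the per-cluster average shown in the line graph and the cyclical advance of the slider, are simple enough to sketch in Python. The function names and the data layout (a dict of node name to list of values) are hypothetical.

```python
def cluster_average(expression, members):
    """Average the members' values at each time point."""
    series = [expression[m] for m in members]
    return [sum(values) / len(values) for values in zip(*series)]


def next_time_index(current, n_time_points):
    """Advance the slider by one step, wrapping back to the start."""
    return (current + 1) % n_time_points
```

For example, a cluster holding two nodes with values `[1, 2, 3]` and `[3, 4, 5]` would be plotted as `[2.0, 3.0, 4.0]`.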

 
Pressing the “Time Curve” button applies a force-based layout to the timeline, with forces acting between time points according to their similarity. This distorts the timeline into a curve, where color shows time progression and the relative distance between time points represents their similarity. The time curve can reveal behaviors such as regressions, cycles, or significant changes.
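
One way to picture the time curve is as a set of springs between time points whose rest lengths equal the dissimilarity of those points. The Python sketch below follows that idea under stated assumptions: dissimilarity is the Euclidean distance between the expression vectors of two time points, and the layout is a plain spring relaxation starting from a nearly straight timeline. The tool's actual similarity measure and force model are not specified in the text.

```python
import math


def timepoint_distances(expression):
    """Pairwise Euclidean distances between time points.

    expression: dict of node name -> list of values, one per time point.
    """
    columns = list(zip(*expression.values()))  # one vector per time point
    n = len(columns)
    return [[math.dist(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]


def time_curve(dist, iterations=500, step=0.05):
    """Relax springs whose rest lengths are the given distances."""
    n = len(dist)
    # Start from a straight timeline with a slight vertical offset,
    # so that points are able to leave the line.
    pos = [[float(i), 0.1 * math.sin(i)] for i in range(n)]
    for _ in range(iterations):
        moves = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = math.hypot(dx, dy) or 1e-9
                # Positive when too far apart (pull), negative when too close (push).
                f = (d - dist[i][j]) / d
                moves[i][0] += f * dx
                moves[i][1] += f * dy
                moves[j][0] -= f * dx
                moves[j][1] -= f * dy
        for i in range(n):
            pos[i][0] += step * moves[i][0]
            pos[i][1] += step * moves[i][1]
    return [tuple(p) for p in pos]
```

In such a layout, time points with similar expression values are drawn together even when they are far apart in time, which is how a cycle would show up as the curve returning to a region it has already visited.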

 

Figure 3

Screenshots of the timeline transformed into a time curve (top) and of the network visualization (bottom) clustered based on temporal variance, showing similarities and differences between time points.

 
 

Results

 

To demonstrate the developed application and visualization models, we used a time-series RNA-Seq gene expression dataset of HIV-1 infection, which measured expression over 24 hours at 2-hour intervals, together with the respective protein-protein interaction (PPI) network. While the Yifan Hu layout and position-based clustering allowed for an analysis of the topological properties of the network, this only served to highlight its superficial structure, based on direct relationships. Clustering by temporal variance produced groups of proteins with similar expression profiles, which could be observed in the visualization, as proteins within the same cluster increased and decreased at the same points in time.

 
Furthermore, by bending the timeline into a time curve, as shown in Figure 3, it was possible to further explore patterns of behavior over time. The resulting time curve shows four non-sequential time points (4, 10, 16 and 24 hours) placed close together, and selecting each of these time points shows the same clusters of nodes at peaks of gene expression. While a more in-depth analysis is required, this suggests a possible cyclical behavior that may be interpreted as waves of expression changes, which have previously been observed in HIV-1 infection.

 
The loaded datasets could be sorted through clustering and force-directed layouts, while the provided methods facilitated the analysis of temporal variation and were capable of graphically representing known behaviors of HIV-1 infection over time.

Authors

António Cruz

Joel P. Arrais

Penousal Machado