Deep Learning For Expressive Music Generation

In the last decade, Deep Learning algorithms have grown in popularity in several fields, such as computer vision, speech recognition and natural language processing. Deep Learning models, however, are not limited to scientific domains: they have recently been applied to content generation in diverse art forms, both to generate novel content and as co-creative tools. Artificial music generation is one of the fields where Deep Learning architectures have been applied [1]. They have mostly been used to create new compositions, with promising results when compared to human compositions. Despite this, most of these artificial pieces lack expression when compared to music performed by humans [2]. We propose a system capable of artificially generating expressive music compositions. Our main goal is to improve the quality of the generated compositions by exploring perceptually relevant musical elements such as note velocity and duration. To assess this hypothesis, we perform user tests. Results suggest that expressive elements such as duration and velocity are key to the expression of a music composition, making pieces that include them preferable to non-expressive ones.



The model is composed of three independent LSTM-based networks, each assigned to learn the sequential relationships between elements of one kind: the Notes (pitch) Network, the Velocities (note attack) Network and the Durations (note duration) Network. Each network learns the sequence patterns of its elements from a given musical composition.
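To make "learning sequence patterns" concrete, the sketch below shows one common way of turning a single element sequence into training examples for a next-step predictor: a sliding window of past values paired with the value that follows. The text above does not say how the training examples were built, so the window length, the function name `make_training_pairs` and the sample pitches are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def make_training_pairs(
    values: Sequence[int], window: int
) -> List[Tuple[List[int], int]]:
    """Slice one element sequence into (context, next value) pairs.

    Assumption: a fixed-length sliding window; the original model may
    prepare its training data differently.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        (list(values[i:i + window]), values[i + window])
        for i in range(len(values) - window)
    ]


# Hypothetical pitch sequence (MIDI note numbers), not taken from the data set.
pitches = [60, 62, 64, 65, 67]
pairs = make_training_pairs(pitches, window=3)
# pairs == [([60, 62, 64], 65), ([62, 64, 65], 67)]
```

The same helper could be applied unchanged to a velocity or duration sequence, which matches the idea of three networks that each see only one kind of element.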


The model receives MIDI files as input. Each file is parsed into three vectors of musical elements (notes, velocities and durations). Each network is trained on its own assigned element to learn its relational patterns. After training, each network generates its own vector, and the three vectors are combined into the artificially composed piece (i.e. a vector with all notes and their respective velocities and durations). This final vector is then converted to a MIDI file, resulting in the final audible piece. For this experiment, we trained our model on Johann Bach compositions. This gave us a large data set (MIDI files) and the advantage of having several live performances of Bach pieces that we could use for comparison in user testing.
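The split-and-recombine step described above can be sketched in plain Python. This is only an illustration of the data flow: MIDI parsing and writing are left out, and the `NoteEvent` type, the function names, the duration unit and the choice to truncate to the shortest vector are assumptions rather than details of the original system.

```python
from typing import List, NamedTuple, Sequence, Tuple


class NoteEvent(NamedTuple):
    pitch: int       # MIDI note number (0-127)
    velocity: int    # MIDI velocity (0-127)
    duration: float  # note length; the unit (beats) is an assumption


def split_elements(
    events: Sequence[NoteEvent],
) -> Tuple[List[int], List[int], List[float]]:
    """Separate parsed note events into one vector per element."""
    notes = [e.pitch for e in events]
    velocities = [e.velocity for e in events]
    durations = [e.duration for e in events]
    return notes, velocities, durations


def combine_elements(
    notes: Sequence[int],
    velocities: Sequence[int],
    durations: Sequence[float],
) -> List[NoteEvent]:
    """Merge three generated vectors back into note events.

    Assumption: if the vectors differ in length, the result is cut
    to the shortest one.
    """
    length = min(len(notes), len(velocities), len(durations))
    return [
        NoteEvent(notes[i], velocities[i], durations[i])
        for i in range(length)
    ]


# Hypothetical two-note excerpt.
events = [NoteEvent(60, 90, 1.0), NoteEvent(64, 70, 0.5)]
notes, velocities, durations = split_elements(events)
rebuilt = combine_elements(notes, velocities, durations)
```

In this sketch, splitting and recombining an unmodified excerpt returns the original events, so any difference in the final piece comes only from what the three networks generate.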


Figure 1

Overview of the Model.


The following graphs of the expressive pieces show notable variance when compared with the monotonous non-expressive ones. The expressive velocity graph shows moments of different intensity across the piece, much like a narrative: it gets softer after the first notes and ends in a ‘striking’ way. The durations graph shows more abrupt changes, which make it look denser, with occasional longer notes.


Figure 2

Velocity (left image) and duration (right image) variation.


In order to get a better understanding of the actual impact that the introduction of expressive elements has on the pieces, we conducted user-testing sessions.



Having generated several artificial expressive compositions, we used these tests to examine how they would be rated when compared to pieces composed and performed by humans. These comparisons also let us see whether the studied expressive elements have any influence, in human compositions as well as in artificially generated ones.

The tests took the form of a questionnaire. Each questionnaire was composed of 20 excerpts for participants to rate: 10 generated by the model and 10 taken from live performances of Bach. We removed the expressiveness from some excerpts (human and artificial) to evaluate the difference in score. The participants had no information about the experiment and were merely asked to rate each excerpt.
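One plausible way to remove expressiveness from an excerpt is to replace every velocity, every duration, or both with a constant value. The text does not describe how the flattening was actually done, so the function `strip_expression`, its parameters and the default constants below are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

# (pitch, velocity, duration); the layout is an assumption for this sketch.
Note = Tuple[int, int, float]


def strip_expression(
    notes: Sequence[Note],
    keep_velocity: bool = False,
    keep_duration: bool = False,
    flat_velocity: int = 64,
    flat_duration: float = 0.5,
) -> List[Note]:
    """Return a copy of the excerpt with selected elements flattened.

    Pitches are never changed. Velocities and durations are replaced by
    a constant unless the matching keep_* flag is set.
    """
    return [
        (
            pitch,
            velocity if keep_velocity else flat_velocity,
            duration if keep_duration else flat_duration,
        )
        for pitch, velocity, duration in notes
    ]


# Hypothetical excerpt.
excerpt = [(60, 90, 1.0), (64, 40, 0.25)]
no_expression = strip_expression(excerpt)
velocity_only = strip_expression(excerpt, keep_velocity=True)
```

Keeping one flag on and the other off gives the partially expressive variants, which is how excerpts with only velocity or only duration could be produced under these assumptions.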



In general, we observed that the participants favoured the excerpts containing partial or full expressiveness. The scores given to each parameter suggest that different combinations have a determining influence on the excerpts, and that adding expressive elements (alone or together) tends to increase participants' preference.


Comparing the artificial non-expressive compositions with the expressive ones, we observe that in some cases – especially when the generated piece holds a sequence of stacking notes – expressiveness plays a major role in making an artificial composition sound more natural, closer to a human one.


Each acronym in the table (e.g. HHH) lists the tested elements in the following order: Notes, Velocities, Durations.


Figure 3

Results of the tests performed with 30 participants.
H: human composition/performance
C: computer composition/performance
O: element absence
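The three-letter acronyms can be read mechanically: one letter per element, in the order Notes, Velocities, Durations, using the H/C/O legend above. The sketch below decodes such a code into a small dictionary. The function name `decode_condition` and the output labels are assumptions, and apart from HHH the sample code is hypothetical rather than taken from the results.

```python
from typing import Dict

# Legend from the figure caption.
SOURCES = {"H": "human", "C": "computer", "O": "absent"}
# Element order used by the acronyms.
ELEMENTS = ("notes", "velocities", "durations")


def decode_condition(code: str) -> Dict[str, str]:
    """Map a code such as 'HHH' to the source of each element."""
    if len(code) != len(ELEMENTS):
        raise ValueError(f"expected {len(ELEMENTS)} letters, got {code!r}")
    try:
        return {
            element: SOURCES[letter]
            for element, letter in zip(ELEMENTS, code)
        }
    except KeyError as exc:
        raise ValueError(f"unknown letter in {code!r}") from exc


fully_human = decode_condition("HHH")
# Hypothetical code: computer notes and velocities, no expressive durations.
example = decode_condition("CCO")
```

Here `fully_human` maps all three elements to "human", which corresponds to an unmodified human composition and performance.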




[1] – Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. 2017. Deep learning techniques for music generation – a survey. arXiv preprint arXiv:1709.01620 (2017).

[2] – Filippo Carnovalini and Antonio Rodà. 2019. A multilayered approach to automatic music generation and expressive performance. In 2019 International Workshop on Multilayer Music Representation and Processing (MMRP). IEEE, 41–48.



  • J. M. Simões, P. Machado, and A. C. Rodrigues, “Deep Learning for Expressive Music Generation,” in ARTECH 2019: 9th International Conference on Digital and Interactive Arts, Braga, Portugal, October 23-25, 2019, pp. 14:1–14:9.


José Maria Simões

Penousal Machado

Ana Rodrigues