Raw Music from Free Movements

Summary

Raw Music from Free Movements (RAMFEM) is a deep learning architecture that translates pose sequences into audio waveforms. The architecture combines a sequence-to-sequence model that generates audio encodings with an adversarial autoencoder that generates raw audio from these encodings. RAMFEM is an attempt to design a digital music instrument by starting from the creative decisions a dancer makes when translating music into movement and then reversing these decisions in order to generate music from movement. An important aspect of RAMFEM’s capability to learn from and recreate existing relationships between movement and music is that it operates in the raw audio domain. Because of this, RAMFEM can be applied to any recordings of movement and music, capture their correlations, and subsequently recreate the acoustic characteristics of the music through embodied gestures.

This project has been realised in collaboration with Kivanç Tatar, at that time an independent musician and researcher based in Vancouver, Canada. A detailed description of the project has been published.

Machine Learning Model

The current architecture of RAMFEM consists of three components: an adversarial autoencoder (AAE), a sequence-to-sequence transducer (Seq2Seq), and an audio concatenation mechanism. The source code, trained models, and audio and motion capture data required for testing and training are available online.

The AAE in RAMFEM encodes short audio waveforms into latent vectors and decodes latent vectors back into waveforms. The Seq2Seq takes a sequence of poses as input and translates it into a sequence of audio encodings. These encodings are passed to an audio decoder, which transforms them into waveforms. The audio concatenation mechanism takes the resulting sequence of waveforms, applies a Hanning window as an amplitude envelope to each of them, and then concatenates them with a 50% overlap to create the final audio sequence.
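The audio concatenation step described above can be illustrated with a minimal overlap-add sketch in plain Python. This is not the project's own code: the function names hann_window and overlap_add are hypothetical, and the sketch assumes the periodic form of the Hann window, a choice made here because overlapped copies of it at 50% sum to exactly one. The source only states that a Hanning window and a 50% overlap are used.

```python
import math


def hann_window(length):
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / length).

    Assumption: the periodic form is used so that two copies shifted by
    half the window length sum to exactly 1.
    """
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


def overlap_add(windows):
    """Blend equal-length audio windows using a Hann envelope and 50% overlap.

    `windows` is a list of lists of float samples, for example the
    waveforms produced by an audio decoder (hypothetical input format).
    """
    if not windows:
        return []
    length = len(windows[0])
    if length % 2 != 0 or any(len(w) != length for w in windows):
        raise ValueError("all windows must share the same even length")

    hop = length // 2  # 50% overlap: each window starts half a window later
    envelope = hann_window(length)
    output = [0.0] * (hop * (len(windows) - 1) + length)

    for index, window in enumerate(windows):
        start = index * hop
        for n, sample in enumerate(window):
            # Apply the amplitude envelope, then sum into the output buffer.
            output[start + n] += envelope[n] * sample
    return output


if __name__ == "__main__":
    # Four constant windows of 8 samples each yield 20 output samples.
    blended = overlap_add([[1.0] * 8 for _ in range(4)])
    print(len(blended))
```

With this window choice, the overlapping envelopes sum to one wherever two windows overlap, so the blend introduces no amplitude modulation there. Only the first and last half window fade in and out.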

RAMFEM Processing Pipeline. RAMFEM takes as input a short sequence of dance poses and produces as output a sequence of audio windows which are blended together using an amplitude envelope.

RAMFEM Model Architecture. The model consists of several neural networks that form part of the sequence-to-sequence transducer (left side) and the adversarial autoencoder (right side).

Dataset

Two different datasets were employed for training, referred to as the improvisation dataset and the sonification dataset. The improvisation dataset consists of pose sequences and audio that were recorded while a dancer was freely improvising to a given piece of music. The dancer is an expert specialising in contemporary dance and improvisation.

Results

Audio generated by the model trained on the sonification dataset when it is presented with the original movement sequence used for sonification.

Audio generated by the model trained on the improvisation dataset when it is presented with the original movement sequence used for sonification.

Audio generated by the model trained on the sonification dataset when it is presented with a different movement sequence.

Audio generated by the model trained on the improvisation dataset when it is presented with a different movement sequence.