Summary

The following tutorial introduces the use of an adversarial autoencoder model for generating synthetic sequences of dance poses. The model can be trained on motion capture data. It uses a combination of conventional artificial neural network (ANN) and long short-term memory (LSTM) layers to create pose sequences. Apart from explaining how to create and train the corresponding models, this article also provides some examples of how the latent space of pose sequence encodings can be navigated to discover potentially interesting synthetic pose sequences.

This tutorial forms part of a series of tutorials on using PyTorch to create and train generative deep learning models. The code for these tutorials is available here.

After 200 epochs of training, the pose sequences reconstructed by the autoencoder look like this when rendered as a skeleton animation.

Create Models

The code parts for importing pose sequences from a motion capture recording and creating Datasets and Dataloaders for training and testing are identical to those in a previous article on “Pose Sequence Generation with a GAN”. These code parts are skipped here.
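
For readers who do not have the previous article at hand, the following is a minimal sketch of what these skipped steps might look like. The stride between excerpts, the batch size, the train/test split, and the exact dataset handling are assumptions made for this sketch and may differ from the previous article; pose_sequence stands for the frame-by-frame array of pose features obtained from the motion capture recording, and sequence_length is the excerpt length defined further below.

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

sequence_offset = 2   # assumed stride between consecutive excerpts
batch_size = 16       # assumed batch size

# slice the recording into fixed-length, overlapping pose sequence excerpts
pose_sequence_excerpts = np.array(
    [pose_sequence[frame:frame + sequence_length]
     for frame in range(0, pose_sequence.shape[0] - sequence_length, sequence_offset)],
    dtype=np.float32)

# split the excerpts into train and test sets and wrap them into DataLoaders
excerpt_count = pose_sequence_excerpts.shape[0]
test_count = excerpt_count // 10
train_count = excerpt_count - test_count

full_dataset = TensorDataset(torch.from_numpy(pose_sequence_excerpts))
train_dataset, test_dataset = random_split(full_dataset, [train_count, test_count])

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)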

An adversarial autoencoder (AAE) consists of three models: an Encoder that compresses input data into a latent vector, a Decoder that recreates the input data from the latent vector, and a Discriminator that distinguishes between random variables drawn from a true normal distribution and latent vectors produced by the Encoder. A brief introduction to autoencoders is available here.

In this example, the models operate on sequences of dance poses obtained from motion capture recordings. The Encoder and Decoder models employ both conventional artificial neural networks (ANN) and recurrent neural networks consisting of long short-term memory (LSTM) layers. The Discriminator employs ANN layers only.

The code for creating the Discriminator model is identical to that in the previous article on “Pose Sequence Generation with a GAN”. This code and its explanation are also skipped here.
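
For readers who skip the previous article, the following minimal sketch illustrates the role of the Discriminator: a small ANN that maps a latent vector to a single probability indicating whether the vector was drawn from a true normal distribution or produced by the Encoder. The layer sizes chosen here are assumptions, and latent_dim and device are defined in the next section; the actual model in the previous article may differ.

class Discriminator(nn.Module):
    def __init__(self, latent_dim, dense_layer_sizes):
        super(Discriminator, self).__init__()
        
        self.latent_dim = latent_dim
        self.dense_layer_sizes = dense_layer_sizes
        
        # dense layers mapping a latent vector to a single real/fake probability
        dense_layers = []
        dense_layers.append(("disc_dense_0", nn.Linear(self.latent_dim, self.dense_layer_sizes[0])))
        dense_layers.append(("disc_relu_0", nn.ReLU()))
        
        for layer_index in range(1, len(self.dense_layer_sizes)):
            dense_layers.append(("disc_dense_{}".format(layer_index), nn.Linear(self.dense_layer_sizes[layer_index - 1], self.dense_layer_sizes[layer_index])))
            dense_layers.append(("disc_relu_{}".format(layer_index), nn.ReLU()))
        
        dense_layers.append(("disc_dense_{}".format(len(self.dense_layer_sizes)), nn.Linear(self.dense_layer_sizes[-1], 1)))
        dense_layers.append(("disc_sigmoid", nn.Sigmoid()))
        
        self.dense_layers = nn.Sequential(OrderedDict(dense_layers))
        
    def forward(self, x):
        return self.dense_layers(x)

disc_dense_layer_sizes = [32, 32]  # assumed layer sizes

discriminator = Discriminator(latent_dim, disc_dense_layer_sizes).to(device)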

Create Encoder Model

For compressing a sequence of poses into a latent vector, the Encoder model passes the input sequence, represented as a two-dimensional tensor of time steps and pose features, through a stack of LSTM layers. From the output of the last LSTM layer, only the last time step is used. This output is then passed through several ANN layers, which successively reduce the dimension of the feature vector down to the size of the latent dimension. Each ANN layer is followed by a ReLU activation function.

The class definition of the encoder model is as follows:

class Encoder(nn.Module):
    def __init__(self, sequence_length, pose_dim, latent_dim, rnn_layer_count, rnn_layer_size, dense_layer_sizes):
        super(Encoder, self).__init__()
        
        self.sequence_length = sequence_length
        self.pose_dim = pose_dim
        self.latent_dim = latent_dim
        self.rnn_layer_count = rnn_layer_count
        self.rnn_layer_size = rnn_layer_size 
        self.dense_layer_sizes = dense_layer_sizes
    
        # create recurrent layers
        rnn_layers = []
        rnn_layers.append(("encoder_rnn_0", nn.LSTM(self.pose_dim, self.rnn_layer_size, self.rnn_layer_count, batch_first=True)))
        
        self.rnn_layers = nn.Sequential(OrderedDict(rnn_layers))
        
        # create dense layers
        
        dense_layers = []
        
        dense_layers.append(("encoder_dense_0", nn.Linear(self.rnn_layer_size, self.dense_layer_sizes[0])))
        dense_layers.append(("encoder_dense_relu_0", nn.ReLU()))
        
        dense_layer_count = len(self.dense_layer_sizes)
        for layer_index in range(1, dense_layer_count):
            dense_layers.append(("encoder_dense_{}".format(layer_index), nn.Linear(self.dense_layer_sizes[layer_index-1], self.dense_layer_sizes[layer_index])))
            dense_layers.append( ( "encoder_dense_relu_{}".format( layer_index ), nn.ReLU() ) )

        dense_layers.append(("encoder_dense_{}".format(len( self.dense_layer_sizes )), nn.Linear( self.dense_layer_sizes[-1], self.latent_dim ) ) )
        dense_layers.append( ( "encoder_dense_relu_{}".format( len( self.dense_layer_sizes)), nn.ReLU()))
        
        self.dense_layers = nn.Sequential(OrderedDict(dense_layers))
        
    def forward(self, x):
        
        #print("x 1 ", x.shape)
        
        x, (_, _) = self.rnn_layers(x)
        
        #print("x 2 ", x.shape)
        
        x = x[:, -1, :] # only last time step 
        
        #print("x 3 ", x.shape)
        
        yhat = self.dense_layers(x)
        
        #print("yhat ", yhat.shape)
 
        return yhat

The constructor of the Encoder model class takes the following arguments: the sequence length, the pose dimension, the latent dimension, the number of LSTM layers, the number of units in each LSTM layer, and a list of unit counts for the ANN layers. The unit count of the final ANN layer, which equals the latent dimension, is omitted from this list since that layer is added automatically. The Encoder model class can be instantiated as follows:

latent_dim = 64
sequence_length = 128
ae_rnn_layer_count = 2
ae_rnn_layer_size = 512
ae_dense_layer_sizes = [ 512 ]

encoder = Encoder(sequence_length, pose_dim, latent_dim, ae_rnn_layer_count, ae_rnn_layer_size, ae_dense_layer_sizes).to(device)

The shapes of the input and output tensors for this model are as follows:

  • input tensor: batch_size x sequence_length x pose_dim
  • output tensor: batch_size x latent_dim
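
These shapes can be verified by passing a random dummy batch through the freshly created model; the batch size of 16 used here is arbitrary:

dummy_poses = torch.randn((16, sequence_length, pose_dim)).to(device)

with torch.no_grad():
    dummy_encoding = encoder(dummy_poses)

print(dummy_encoding.shape) # batch_size x latent_dim, here torch.Size([16, 64])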

Create Decoder Model

The Decoder model mirrors the task and network structure of the Encoder model. For decompressing a one-dimensional latent vector, the Decoder passes the latent vector through several ANN layers. These layers successively increase the dimensionality of the vector. Each ANN layer is followed by a ReLU activation function. The output tensor of the last ANN layer is repeated a number of times equal to the length of a pose sequence. After that, the tensor is passed through several LSTM layers. The output of the last LSTM layer is flattened and passed through a final ANN layer, which is not followed by an activation function. The output of this final layer is reshaped to match the shape of a pose sequence in the dataset.

The class definition of the Decoder model is as follows:

class Decoder(nn.Module):
    def __init__(self, sequence_length, pose_dim, latent_dim, rnn_layer_count, rnn_layer_size, dense_layer_sizes):
        super(Decoder, self).__init__()
        
        self.sequence_length = sequence_length
        self.pose_dim = pose_dim
        self.latent_dim = latent_dim
        self.rnn_layer_size = rnn_layer_size
        self.rnn_layer_count = rnn_layer_count
        self.dense_layer_sizes = dense_layer_sizes

        # create dense layers
        dense_layers = []
        
        dense_layers.append(("decoder_dense_0", nn.Linear(latent_dim, self.dense_layer_sizes[0])))
        dense_layers.append(("decoder_relu_0", nn.ReLU()))

        dense_layer_count = len(self.dense_layer_sizes)
        for layer_index in range(1, dense_layer_count):
            dense_layers.append(("decoder_dense_{}".format(layer_index), nn.Linear(self.dense_layer_sizes[layer_index-1], self.dense_layer_sizes[layer_index])))
            dense_layers.append( ( "decoder_dense_relu_{}".format( layer_index ), nn.ReLU() ) )
 
        self.dense_layers = nn.Sequential(OrderedDict(dense_layers))
        
        # create rnn layers
        rnn_layers = []

        rnn_layers.append(("decoder_rnn_0", nn.LSTM(self.dense_layer_sizes[-1], self.rnn_layer_size, self.rnn_layer_count, batch_first=True)))
        
        self.rnn_layers = nn.Sequential(OrderedDict(rnn_layers))
        
        # final output dense layer
        final_layers = []
        
        final_layers.append(("decoder_dense_{}".format(dense_layer_count), nn.Linear(self.rnn_layer_size, self.pose_dim)))
        
        self.final_layers = nn.Sequential(OrderedDict(final_layers))
        
    def forward(self, x):
        #print("x 1 ", x.size())
        
        # dense layers
        x = self.dense_layers(x)
        #print("x 2 ", x.size())
        
        # repeat vector
        x = torch.unsqueeze(x, dim=1)
        x = x.repeat(1, self.sequence_length, 1)
        #print("x 3 ", x.size())
        
        # rnn layers
        x, (_, _) = self.rnn_layers(x)
        #print("x 4 ", x.size())
        
        # final time distributed dense layer
        x_reshaped = x.contiguous().view(-1, self.rnn_layer_size)  # (batch_size * sequence, input_size)
        #print("x 5 ", x_reshaped.size())
        
        yhat = self.final_layers(x_reshaped)
        #print("yhat 1 ", yhat.size())
        
        yhat = yhat.contiguous().view(-1, self.sequence_length, self.pose_dim)
        #print("yhat 2 ", yhat.size())

        return yhat

The constructor of the Decoder model class takes several arguments: the sequence length, the pose dimension, the latent dimension of the pose sequence encoding, the number of LSTM layers, the number of units in each LSTM layer, and a list of unit counts for the ANN layers. The unit count of the final ANN layer, which equals the pose dimension and is applied to every time step of the LSTM output, is omitted from this list since that layer is added automatically. The unit counts for the ANN layers are obtained by reversing the list that was used for creating the Encoder model. The Decoder model class can be instantiated as follows:

ae_dense_layer_sizes_reversed = ae_dense_layer_sizes.copy()
ae_dense_layer_sizes_reversed.reverse()

decoder = Decoder(sequence_length, pose_dim, latent_dim, ae_rnn_layer_count, ae_rnn_layer_size, ae_dense_layer_sizes_reversed).to(device)

The shapes of the input and output tensors for this model are as follows:

  • input tensor: batch_size x latent_dim
  • output tensor: batch_size x sequence_length x pose_dim
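
As before, the shapes can be checked with a dummy batch. Chaining the Encoder and Decoder should reproduce the shape of the input pose sequences:

dummy_poses = torch.randn((16, sequence_length, pose_dim)).to(device)

with torch.no_grad():
    dummy_reconstruction = decoder(encoder(dummy_poses))

print(dummy_reconstruction.shape) # batch_size x sequence_length x pose_dim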

Optimisers and Loss Functions

The optimisers, loss function, train- and test-step functions, and the final train function are all largely identical to those described in the previous article on “Pose Generation with an Adversarial Autoencoder”. These functions and their explanation are skipped here.
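
Without reproducing that code, the general recipe of an adversarial autoencoder is: one optimiser and a reconstruction loss for the Encoder/Decoder pair, and a second optimiser and a binary cross entropy loss for the adversarial game between Encoder and Discriminator. The sketch below illustrates this general recipe with assumed learning rates and loss choices; the exact setup may differ from the previous article, which should be consulted for the full train and test step functions.

ae_learning_rate = 1e-4     # assumed value
disc_learning_rate = 1e-4   # assumed value

# one optimiser for the autoencoder part (Encoder and Decoder), one for the Discriminator
ae_optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=ae_learning_rate)
disc_optimizer = torch.optim.Adam(discriminator.parameters(), lr=disc_learning_rate)

# reconstruction loss between original and reconstructed pose sequences
reconstruction_loss = nn.MSELoss()

# adversarial loss: the Discriminator learns to tell samples from a standard normal
# distribution ("real") apart from Encoder outputs ("fake"), while the Encoder learns
# to make its latent vectors pass as "real"
adversarial_loss = nn.BCELoss()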

Generate and Visualise Pose Sequences

Once the autoencoder has been trained, it can be used to experiment with the reconstruction of pose sequences. To visualise the reconstructed (or original) pose sequences, an instance of the PoseRenderer class is used. This class forms part of the “common” module, which is explained here. The PoseRenderer class is instantiated as follows:

skel_edge_list = utils.get_skeleton_edge_list(skeleton)
poseRenderer = PoseRenderer(skel_edge_list)

Several convenience functions are provided for reconstructing and visualising sequences of poses.

A function entitled “create_ref_sequence_anim” generates an animation of an original pose sequence. This function takes as arguments the index of a pose sequence excerpt from the original motion capture recording and the file name under which the animation is saved. The function is defined as follows:

def create_ref_sequence_anim(seq_index, file_name):
    sequence_excerpt = pose_sequence_excerpts[seq_index]
    sequence_excerpt = np.reshape(sequence_excerpt, (sequence_length, joint_count, joint_dim))
    
    sequence_excerpt = torch.tensor(np.expand_dims(sequence_excerpt, axis=0)).to(device)
    zero_trajectory = torch.tensor(np.zeros((1, sequence_length, 3), dtype=np.float32)).to(device)
    
    skel_sequence = skeleton.forward_kinematics(sequence_excerpt, zero_trajectory)
    
    skel_sequence = skel_sequence.detach().cpu().numpy()
    skel_sequence = np.squeeze(skel_sequence)    
    
    view_min, view_max = utils.get_equal_mix_max_positions(skel_sequence)
    skel_images = poseRenderer.create_pose_images(skel_sequence, view_min, view_max, view_ele, view_azi, view_line_width, view_size, view_size)
    skel_images[0].save(file_name, save_all=True, append_images=skel_images[1:], optimize=False, duration=33.0, loop=0)

A function entitled “create_rec_sequence_anim” generates an animation of a reconstructed pose sequence. This function takes the same arguments: the index of a pose sequence excerpt and the file name under which the animation is saved. The function is defined as follows:

def create_rec_sequence_anim(seq_index, file_name):
    sequence_excerpt = pose_sequence_excerpts[seq_index]
    sequence_excerpt = np.expand_dims(sequence_excerpt, axis=0)
    
    sequence_excerpt = torch.from_numpy(sequence_excerpt).to(device)
    
    with torch.no_grad():
        sequence_enc = encoder(sequence_excerpt)
        pred_sequence = decoder(sequence_enc)
        
    pred_sequence = torch.squeeze(pred_sequence)
    pred_sequence = pred_sequence.view((-1, 4))
    pred_sequence = nn.functional.normalize(pred_sequence, p=2, dim=1)
    pred_sequence = pred_sequence.view((1, sequence_length, joint_count, joint_dim))

    zero_trajectory = torch.tensor(np.zeros((1, sequence_length, 3), dtype=np.float32))
    zero_trajectory = zero_trajectory.to(device)

    skel_sequence = skeleton.forward_kinematics(pred_sequence, zero_trajectory)

    skel_sequence = skel_sequence.detach().cpu().numpy()
    skel_sequence = np.squeeze(skel_sequence)    

    view_min, view_max = utils.get_equal_mix_max_positions(skel_sequence)
    skel_images = poseRenderer.create_pose_images(skel_sequence, view_min, view_max, view_ele, view_azi, view_line_width, view_size, view_size)
    skel_images[0].save(file_name, save_all=True, append_images=skel_images[1:], optimize=False, duration=33.0, loop=0)

Another convenience function, named “encode_sequences”, can be used to obtain a list of encodings of pose sequences. This function takes as argument a list of frame indices. Each frame index serves as the starting frame of an individual pose sequence within the original motion capture recording. The function is defined as follows:

def encode_sequences(frame_indices):
    
    encoder.eval()
    
    latent_vectors = []
    
    seq_excerpt_count = len(frame_indices)

    for excerpt_index in range(seq_excerpt_count):
        excerpt_start_frame = frame_indices[excerpt_index]
        excerpt_end_frame = excerpt_start_frame + sequence_length
        excerpt = pose_sequence[excerpt_start_frame:excerpt_end_frame]
        excerpt = np.expand_dims(excerpt, axis=0)
        excerpt = torch.from_numpy(excerpt).to(device)
        
        with torch.no_grad():
            latent_vector = encoder(excerpt)
            
        latent_vector = torch.squeeze(latent_vector)
        latent_vector = latent_vector.detach().cpu().numpy()

        latent_vectors.append(latent_vector)
        
    encoder.train()
        
    return latent_vectors

The counterpart of the previously described function is a convenience function named “decode_sequence_encodings”. This function decodes pose sequence encodings into pose sequences, concatenates the individual pose sequences into a single sequence, and then visualises and exports this sequence as an animation. The function takes as arguments a list of pose sequence encodings and the file name under which the animation is saved. The function is defined as follows:

def decode_sequence_encodings(sequence_encodings, file_name):
    
    decoder.eval()
    
    rec_sequences = []
    
    for seq_encoding in sequence_encodings:
        seq_encoding = np.expand_dims(seq_encoding, axis=0)
        seq_encoding = torch.from_numpy(seq_encoding).to(device)

        with torch.no_grad():
            rec_seq = decoder(seq_encoding)
            
        rec_seq = torch.squeeze(rec_seq)
        rec_seq = rec_seq.view((-1, 4))
        rec_seq = nn.functional.normalize(rec_seq, p=2, dim=1)
        rec_seq = rec_seq.view((-1, joint_count, joint_dim))

        rec_sequences.append(rec_seq)
    
    rec_sequences = torch.cat(rec_sequences, dim=0)
    rec_sequences = torch.unsqueeze(rec_sequences, dim=0)
    
    print("rec_sequences s ", rec_sequences.shape)

    zero_trajectory = torch.tensor(np.zeros((1, len(sequence_encodings) * sequence_length, 3), dtype=np.float32))
    zero_trajectory = zero_trajectory.to(device)
    
    skel_sequence = skeleton.forward_kinematics(rec_sequences, zero_trajectory)
    
    skel_sequence = skel_sequence.detach().cpu().numpy()
    skel_sequence = np.squeeze(skel_sequence)
    
    view_min, view_max = utils.get_equal_mix_max_positions(skel_sequence)
    skel_images = poseRenderer.create_pose_images(skel_sequence, view_min, view_max, view_ele, view_azi, view_line_width, view_size, view_size)
    skel_images[0].save(file_name, save_all=True, append_images=skel_images[1:], optimize=False, duration=33.0, loop=0)

    decoder.train()

In the following, some examples of using the convenience functions to experiment with pose sequences and pose sequence encodings are presented.

Create an Animation for a Single Original Pose Sequence

A single original pose sequence can be obtained and saved as animation as follows:

seq_index = 100

create_ref_sequence_anim(seq_index, "results/anims/orig_sequence_{}.gif".format(seq_index))

Create an Animation for a Single Reconstructed Pose Sequence

A single pose sequence can be reconstructed and saved as animation as follows:

seq_index = 100

create_rec_sequence_anim(seq_index, "results/anims/rec_sequence_{}.gif".format(seq_index))

Create an Animation from Several Reconstructed Pose Sequences

A list of pose sequences can be reconstructed and saved as animation as follows:

start_seq_index = 1000
end_seq_index = 2000

seq_indices = [ seq_index for seq_index in range(start_seq_index, end_seq_index, sequence_length)]

seq_encodings = encode_sequences(seq_indices)
decode_sequence_encodings(seq_encodings, "results/anims/rec_sequences_{}-{}.gif".format(start_seq_index, end_seq_index))

Create an Animation from a Random Walk in Latent Space

In a more interesting example, a single pose sequence is encoded and the encoding is used as the starting point for a random walk within latent space. The random walk generates a list of increasingly randomised pose sequence encodings, which are then decoded into pose sequences, concatenated, and saved as an animation.

start_seq_index = 100
sequence_count = 500

seq_indices = [start_seq_index]

seq_encodings = encode_sequences(seq_indices)

for index in range(0, sequence_count - 1):
    random_step = np.random.random((latent_dim)).astype(np.float32) * 2.0
    seq_encodings.append(seq_encodings[index] + random_step)

decode_sequence_encodings(seq_encodings, "results/anims/rec_sequences_randwalk_{}_{}.gif".format(start_seq_index, sequence_count))

Create an Animation by Following a Trajectory in Latent Space with an Offset

In this example, a list of pose sequences that follow each other in the original motion capture recording is encoded into a list of latent vectors. These latent vectors represent a trajectory in latent space. This trajectory is followed at an offset by adding an offset vector, whose magnitude varies sinusoidally, to each pose sequence encoding. The resulting encodings are then decoded into pose sequences, concatenated, and saved as an animation.

start_seq_index = 100
end_seq_index = 500
    
seq_indices = [ seq_index for seq_index in range(start_seq_index, end_seq_index, sequence_length)]

seq_encodings = encode_sequences(seq_indices)

offset_seq_encodings = []

for index in range(len(seq_encodings)):
    sin_value = np.sin(index / (len(seq_encodings) - 1) * np.pi * 4.0)
    offset = np.ones(shape=(latent_dim), dtype=np.float32) * sin_value * 4.0
    offset_seq_encoding = seq_encodings[index] + offset
    offset_seq_encodings.append(offset_seq_encoding)
    
decode_sequence_encodings(offset_seq_encodings, "results/anims/rec_sequences_offset_{}-{}.gif".format(start_seq_index, end_seq_index))

Create an Animation by Interpolating Between Pose Sequence Encodings

In this example, two lists of pose sequences are encoded, and new encodings are created by interpolating between corresponding pairs of encodings with a gradually increasing mix factor. Each interpolated encoding is decoded into a pose sequence. All these pose sequences are concatenated and saved as an animation.

start_seq1_index = 1000
end_seq1_index = 2000

start_seq2_index = 2000
end_seq2_index = 3000

seq1_indices = [ seq_index for seq_index in range(start_seq1_index, end_seq1_index, sequence_length)]
seq2_indices = [ seq_index for seq_index in range(start_seq2_index, end_seq2_index, sequence_length)]

seq1_encodings = encode_sequences(seq1_indices)
seq2_encodings = encode_sequences(seq2_indices)

mixed_seq_encodings = []

for index in range(len(seq1_indices)):
    mix_factor = index / (len(seq1_indices) - 1)
    mixed_seq_encoding = seq1_encodings[index] * (1.0 - mix_factor) + seq2_encodings[index] * mix_factor
    mixed_seq_encodings.append(mixed_seq_encoding)

decode_sequence_encodings(mixed_seq_encodings, "results/anims/rec_sequences_mix_{}-{}_{}-{}.gif".format(start_seq1_index, end_seq1_index, start_seq2_index, end_seq2_index))