Summary

The following tutorial introduces a simple generative adversarial network (GAN) for generating sequences of synthetic dance poses. The model is trained on motion capture data and uses a combination of a conventional artificial neural network (ANN) and a long short-term memory (LSTM) network to create pose sequences.

This tutorial forms part of a series of tutorials on using PyTorch to create and train generative deep learning models. The code for these tutorials is available here.

After 220 epochs of training, the model generates pose sequences that look like this when rendered as skeleton animations.

Synthetic Pose Sequence. This sequence has been produced by the GAN introduced in this article after being trained for 220 epochs on motion capture data.

Create Dataset

Large parts of the code are identical to those in a previous article about Pose generation with a GAN. These parts are skipped here.

To create a dataset from motion capture data that can be used for training, several steps are undertaken:

  • remove sequence excerpts in which poses are invalid or otherwise unsuitable for training,
  • split the pose sequences into short overlapping excerpts,
  • declare and define a Dataset class to hold the data,
  • split the data into a training and a test set,
  • instantiate DataLoaders from the training and test set.

The removal of unsuitable sequence excerpts and the splitting into shorter overlapping sequences are carried out in a single pass:

# frame range(s) of the motion capture recording (pose_sequence, loaded as in the previous article) that contain valid poses
mocap_valid_frame_ranges = [ [ 860, 9500 ] ]
sequence_length = 128
sequence_offset = 2

# gather pose sequence excerpts
pose_sequence_excerpts = []

for valid_frame_range in mocap_valid_frame_ranges:
    frame_range_start = valid_frame_range[0]
    frame_range_end = valid_frame_range[1]
    
    for seq_excerpt_start in np.arange(frame_range_start, frame_range_end - sequence_length, sequence_offset):
        #print("valid: start ", frame_range_start, " end ", frame_range_end, " exc: start ", seq_excerpt_start, " end ", (seq_excerpt_start + sequence_length) )
        pose_sequence_excerpt = pose_sequence[seq_excerpt_start:seq_excerpt_start + sequence_length]
        pose_sequence_excerpts.append(pose_sequence_excerpt)
        
pose_sequence_excerpts = np.array(pose_sequence_excerpts)

A custom dataset class for pose sequences is created by subclassing the Dataset class.

class SequenceDataset(Dataset):
    def __init__(self, sequence_excerpts):
        self.sequence_excerpts = sequence_excerpts
    
    def __len__(self):
        return self.sequence_excerpts.shape[0]
    
    def __getitem__(self, idx):
        return self.sequence_excerpts[idx, ...]

This custom dataset class is instantiated as follows:

full_dataset = SequenceDataset(pose_sequence_excerpts)
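
A quick way to verify the dataset is to inspect its length and the shape of a single item; this check is not part of the original code, and the expected shape follows from the excerpt creation above.

# inspect the dataset size and the shape of one excerpt
print(len(full_dataset))       # number of sequence excerpts
print(full_dataset[0].shape)   # expected: (sequence_length, pose_dim)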

As in the previous articles, this dataset is split into two datasets, one for training and one for testing. Subsequently, DataLoaders are created from these two datasets.
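
A minimal sketch of this step is shown below; the split ratio, the batch size, and the loader variable names are illustrative example values rather than the settings used in the repository.

# illustrative train/test split and DataLoader creation; ratio and batch size are example values
from torch.utils.data import DataLoader, random_split

train_size = int(len(full_dataset) * 0.8)
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = random_split(full_dataset, [train_size, test_size])

batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)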

Create Models

As explained in the previous articles, a Generator model and a Critique model need to be implemented. The Generator model takes as input a tensor containing noise and produces as output a tensor representing synthetic pose sequences. The Critique model takes as input a tensor representing pose sequences and produces as output a tensor that classifies the pose sequences as either real or fake. Both models employ a combination of a conventional artificial neural network (ANN) and a recurrent neural network consisting of long short-term memory (LSTM) units.

Create Critique Model

For converting input pose sequences into output classes, the Critique model passes the input pose sequences, each represented as a two-dimensional feature matrix (sequence length × pose dimension), through several LSTM layers. From the output of the last LSTM layer, only the last time step is used. This output is then passed through several ANN layers, which successively reduce the dimension of the feature vector down to 1. Each ANN layer with the exception of the last one is followed by a ReLU activation function. The last ANN layer is followed by a Sigmoid activation function, so that the output can be interpreted as the probability of the pose sequence being real.

The class definition of the Critique model is as follows:

class Critique(nn.Module):
    def __init__(self, sequence_length, pose_dim, rnn_layer_count, rnn_layer_size, dense_layer_sizes):
        super(Critique, self).__init__()
        
        self.sequence_length = sequence_length
        self.pose_dim = pose_dim
        self.rnn_layer_count = rnn_layer_count
        self.rnn_layer_size = rnn_layer_size 
        self.dense_layer_sizes = dense_layer_sizes
    
        # create recurrent layers
        rnn_layers = []
        rnn_layers.append(("critique_rnn_0", nn.LSTM(self.pose_dim, self.rnn_layer_size, self.rnn_layer_count, batch_first=True)))
        
        self.rnn_layers = nn.Sequential(OrderedDict(rnn_layers))
        
        # create dense layers
        
        dense_layers = []
        
        dense_layers.append(("critique_dense_0", nn.Linear(self.rnn_layer_size, self.dense_layer_sizes[0])))
        dense_layers.append(("critique_dense_relu_0", nn.ReLU()))
        
        dense_layer_count = len(self.dense_layer_sizes)
        for layer_index in range(1, dense_layer_count):
            dense_layers.append(("critique_dense_{}".format(layer_index), nn.Linear(self.dense_layer_sizes[layer_index-1], self.dense_layer_sizes[layer_index])))
            dense_layers.append(("critique_dense_relu_{}".format(layer_index), nn.ReLU()))

        dense_layers.append(("critique_dense_{}".format(len(self.dense_layer_sizes)), nn.Linear(self.dense_layer_sizes[-1], 1)))
        dense_layers.append(("critique_dense_sigmoid_{}".format(len(self.dense_layer_sizes)), nn.Sigmoid()))
        
        self.dense_layers = nn.Sequential(OrderedDict(dense_layers))
        
    def forward(self, x):
        
        #print("x 1 ", x.shape)
        
        x, (_, _) = self.rnn_layers(x)
        
        #print("x 2 ", x.shape)
        
        x = x[:, -1, :] # only last time step 
        
        #print("x 3 ", x.shape)
        
        yhat = self.dense_layers(x)
        
        #print("yhat ", yhat.shape)
 
        return yhat

The constructor of the Critique model class takes several arguments: the length of a pose sequence, the dimension of a pose, the number of LSTM layers, the number of units in each LSTM layer, and a list of unit counts for the ANN layers (the final layer, which has a unit count of 1, is added automatically and is therefore not included in this list). The model class can be instantiated as follows:

crit_rnn_layer_count = 2
crit_rnn_layer_size = 512
crit_dense_layer_sizes = [ 512 ]

critique = Critique(sequence_length, pose_dim, crit_rnn_layer_count, crit_rnn_layer_size, crit_dense_layer_sizes).to(device)

The shapes of the input and output tensors for this model are as follows:

  • input tensor: batch_size x sequence_length x pose_dim
  • output tensor: batch_size x 1
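
As a quick sanity check (not part of the original code), a random tensor with these dimensions can be passed through the model; the batch size below is an arbitrary example value.

# optional shape check with an arbitrary example batch size
dummy_sequences = torch.randn((16, sequence_length, pose_dim)).to(device)
with torch.no_grad():
    print(critique(dummy_sequences).shape) # torch.Size([16, 1])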

Create Generator Model

For generating synthetic pose sequences from a one-dimensional vector of random values, the Generator model passes the noise vector through several ANN layers, which successively increase the dimension of the noise vector. Each ANN layer is followed by a ReLU activation function. The output tensor of the last ANN layer is repeated a number of times equal to the length of a pose sequence. After that, the resulting tensor is passed through several LSTM layers. The output of the last LSTM layer is passed, time step by time step, through a final ANN layer that is not followed by an activation function. The output of this final layer is then reshaped to match the shape of a pose sequence in the dataset.

The class definition of the Generator model is as follows:

class Generator(nn.Module):
    def __init__(self, sequence_length, pose_dim, latent_dim, rnn_layer_count, rnn_layer_size, dense_layer_sizes):
        super(Generator, self).__init__()
        
        self.sequence_length = sequence_length
        self.pose_dim = pose_dim
        self.latent_dim = latent_dim
        self.rnn_layer_size = rnn_layer_size
        self.rnn_layer_count = rnn_layer_count
        self.dense_layer_sizes = dense_layer_sizes

        # create dense layers
        dense_layers = []
        
        dense_layers.append(("decoder_dense_0", nn.Linear(latent_dim, self.dense_layer_sizes[0])))
        dense_layers.append(("decoder_relu_0", nn.ReLU()))

        dense_layer_count = len(self.dense_layer_sizes)
        for layer_index in range(1, dense_layer_count):
            dense_layers.append(("decoder_dense_{}".format(layer_index), nn.Linear(self.dense_layer_sizes[layer_index-1], self.dense_layer_sizes[layer_index])))
            dense_layers.append( ( "decoder_dense_relu_{}".format( layer_index ), nn.ReLU() ) )
 
        self.dense_layers = nn.Sequential(OrderedDict(dense_layers))
        
        # create rnn layers
        rnn_layers = []

        rnn_layers.append(("decoder_rnn_0", nn.LSTM(self.dense_layer_sizes[-1], self.rnn_layer_size, self.rnn_layer_count, batch_first=True)))
        
        self.rnn_layers = nn.Sequential(OrderedDict(rnn_layers))
        
        # final output dense layer
        final_layers = []
        
        final_layers.append(("decoder_dense_{}".format(dense_layer_count), nn.Linear(self.rnn_layer_size, self.pose_dim)))
        
        self.final_layers = nn.Sequential(OrderedDict(final_layers))
        
    def forward(self, x):
        #print("x 1 ", x.size())
        
        # dense layers
        x = self.dense_layers(x)
        #print("x 2 ", x.size())
        
        # repeat vector
        x = torch.unsqueeze(x, dim=1)
        x = x.repeat(1, self.sequence_length, 1)
        #print("x 3 ", x.size())
        
        # rnn layers
        x, (_, _) = self.rnn_layers(x)
        #print("x 4 ", x.size())
        
        # final time distributed dense layer
        x_reshaped = x.contiguous().view(-1, self.rnn_layer_size)  # (batch_size * sequence, input_size)
        #print("x 5 ", x_reshaped.size())
        
        yhat = self.final_layers(x_reshaped)
        #print("yhat 1 ", yhat.size())
        
        yhat = yhat.contiguous().view(-1, self.sequence_length, self.pose_dim)
        #print("yhat 2 ", yhat.size())

        return yhat

The constructor of the Generator model class takes several arguments: the sequence length, the pose dimension, the latent dimension of the noise vector, the number of LSTM layers, the number of units in each LSTM layer, and a list of unit counts for the ANN layers (the final time-distributed layer, whose unit count equals the pose dimension, is added automatically and is therefore not included in this list). The model class can be instantiated as follows (the RNN and dense layer settings shown here are example values):

latent_dim = 64

# RNN and dense layer settings (example values mirroring the Critique settings; not specified in this excerpt)
gen_rnn_layer_count = 2
gen_rnn_layer_size = 512
gen_dense_layer_sizes = [ 512 ]

generator = Generator(sequence_length, pose_dim, latent_dim, gen_rnn_layer_count, gen_rnn_layer_size, gen_dense_layer_sizes).to(device)

The shapes of the input and output tensors for this model are as follows:

  • input tensor: batch_size x latent_dim
  • output tensor: batch_size x sequence_length x pose_dim
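
Again, a quick sanity check (not part of the original code) can be performed by passing a random noise tensor through the model; the batch size below is an arbitrary example value.

# optional shape check with an arbitrary example batch size
dummy_noise = torch.randn((16, latent_dim)).to(device)
with torch.no_grad():
    print(generator(dummy_noise).shape) # torch.Size([16, sequence_length, pose_dim])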

Generate and Visualise Poses

The code for the optimisers, loss functions, train and test step functions, and the training loop is largely identical to that in the previous article and is skipped here.
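
As a brief reminder, the sketch below shows one possible setup; it assumes a standard (non-Wasserstein) GAN objective that matches the Sigmoid output of the Critique, and the optimiser choice and learning rates are example values only.

# one possible loss and optimiser setup (example values; the actual code follows the previous article)
gan_loss = nn.BCELoss()
crit_optimizer = torch.optim.Adam(critique.parameters(), lr=1e-4)
gen_optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)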

Two convenience functions are defined for rendering pose sequences as skeleton animations. For the rendering, the PoseRenderer class is used. The length of the rendered pose sequences is identical to the sequence length on which the models have been trained. The resulting animations are exported in “.gif” format.

The first function, named “create_ref_sequence_anim”, creates animations from excerpts of the original motion capture data. This function takes as arguments the index of a pose sequence excerpt and the name of the file to which the animation is exported.

def create_ref_sequence_anim(seq_index, file_name):
    sequence_excerpt = pose_sequence_excerpts[seq_index]
    sequence_excerpt = np.reshape(sequence_excerpt, (sequence_length, joint_count, joint_dim))
    
    sequence_excerpt = torch.tensor(np.expand_dims(sequence_excerpt, axis=0))
    zero_trajectory = torch.tensor(np.zeros((1, sequence_length, 3), dtype=np.float32))
    
    skel_sequence = skeleton.forward_kinematics(sequence_excerpt, zero_trajectory)
    skel_sequence = np.squeeze(skel_sequence.numpy())
    view_min, view_max = utils.get_equal_mix_max_positions(skel_sequence)
    skel_images = poseRenderer.create_pose_images(skel_sequence, view_min, view_max, view_ele, view_azi, view_line_width, view_size, view_size)
    skel_images[0].save(file_name, save_all=True, append_images=skel_images[1:], optimize=False, duration=33.0, loop=0)

The second function, named “create_gen_sequence_anim”, creates animations from synthetic pose sequences that are generated by the Generator. This function takes only the name of the export file as its argument.

def create_gen_sequence_anim(file_name):
    generator.eval()
    
    random_encoding = torch.randn((1, latent_dim)).to(device)
    
    with torch.no_grad():
        gen_sequence = generator(random_encoding)
        
    gen_sequence = torch.squeeze(gen_sequence)
    gen_sequence = gen_sequence.view((-1, 4))
    gen_sequence = nn.functional.normalize(gen_sequence, p=2, dim=1)
    gen_sequence = gen_sequence.view((1, sequence_length, joint_count, joint_dim))

    zero_trajectory = torch.tensor(np.zeros((1, sequence_length, 3), dtype=np.float32))
    zero_trajectory = zero_trajectory.to(device)

    skel_sequence = skeleton.forward_kinematics(gen_sequence, zero_trajectory)

    skel_sequence = skel_sequence.detach().cpu().numpy()
    skel_sequence = np.squeeze(skel_sequence)    

    view_min, view_max = utils.get_equal_mix_max_positions(skel_sequence)
    skel_images = poseRenderer.create_pose_images(skel_sequence, view_min, view_max, view_ele, view_azi, view_line_width, view_size, view_size)
    skel_images[0].save(file_name, save_all=True, append_images=skel_images[1:], optimize=False, duration=33.0, loop=0)
    
    generator.train()

These two functions can be called as follows:

pose_index = 100

create_ref_sequence_anim(pose_index, "orig_pose_sequence.gif")
create_gen_sequence_anim("gen_pose_sequence.gif")