Video classification with a 3D convolutional neural network

This tutorial demonstrates training a 3D convolutional neural network (CNN) for video classification using the UCF101 action recognition dataset. A 3D CNN uses a three-dimensional filter to perform convolutions. The kernel slides along three dimensions (time, height, and width), whereas in a 2D CNN it slides along two. The model is based on the work published in A Closer Look at Spatiotemporal Convolutions for Action Recognition by D. Tran et al. (2017). In this tutorial, you will:

  • Build an input pipeline
  • Build a 3D convolutional neural network model with residual connections using the Keras functional API
  • Train the model
  • Evaluate and test the model

This video classification tutorial is the second of four tutorials in a series of TensorFlow video tutorials.

Setup

Begin by installing and importing some necessary libraries, including: remotezip to inspect the contents of a ZIP file, tqdm to use a progress bar, OpenCV to process video files, einops for performing more complex tensor operations, and tensorflow_docs for embedding data in a Jupyter notebook.

pip install remotezip tqdm opencv-python einops
# Install TensorFlow 2.10
pip install tensorflow==2.10.0
import tqdm
import random
import pathlib
import itertools
import collections

import cv2
import einops
import numpy as np
import remotezip as rz
import seaborn as sns
import matplotlib.pyplot as plt

import tensorflow as tf
import keras
from keras import layers

Load and preprocess video data

The hidden cell below defines helper functions to download a slice of data from the UCF-101 dataset, and load it into a tf.data.Dataset. You can learn more about the specific preprocessing steps in the Loading video data tutorial, which walks you through this code in more detail.

The FrameGenerator class at the end of the hidden block is the most important utility here. It creates an iterable object that can feed data into the TensorFlow data pipeline. Specifically, this class contains a Python generator that loads the video frames along with their encoded label. The generator (__call__) function yields the frame array produced by frames_from_video_file and the integer-encoded label associated with the set of frames, which is the format that the SparseCategoricalCrossentropy loss used in this tutorial expects.
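
The generator pattern described above can be sketched without any video files. The class below is a hypothetical stand-in, not the tutorial's FrameGenerator: the name ToyFrameGenerator, its constructor arguments, and the random frames are all assumptions made for illustration, and labels are plain integer class IDs. It only shows the contract: calling the object returns a generator that yields one (frames, label) pair per video.

```python
import numpy as np

class ToyFrameGenerator:
  """Hypothetical stand-in for a frame generator; yields random frames."""

  def __init__(self, class_names, videos_per_class, n_frames, height, width):
    self.class_names = sorted(class_names)
    # Map each class name to an integer ID, e.g. {'Archery': 0, ...}.
    self.class_ids_for_name = {
        name: i for i, name in enumerate(self.class_names)}
    self.videos_per_class = videos_per_class
    self.frame_shape = (n_frames, height, width, 3)

  def __call__(self):
    rng = np.random.default_rng(0)
    for name in self.class_names:
      for _ in range(self.videos_per_class):
        # A real generator would decode frames from a video file here.
        frames = rng.random(self.frame_shape, dtype=np.float32)
        yield frames, self.class_ids_for_name[name]

fg = ToyFrameGenerator(['Archery', 'Basketball'], videos_per_class=2,
                       n_frames=10, height=8, width=8)
samples = list(fg())
print(len(samples), samples[0][0].shape, samples[-1][1])
```

A callable like this can be passed to tf.data.Dataset.from_generator together with an output_signature that describes the frame tensor and the label.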

URL = 'https://storage.googleapis.com/thumos14_files/UCF101_videos.zip'
download_dir = pathlib.Path('./UCF101_subset/')
subset_paths = download_ufc_101_subset(URL, 
                        num_classes = 10, 
                        splits = {"train": 30, "val": 10, "test": 10},
                        download_dir = download_dir)
train :
100%|██████████| 300/300 [00:28<00:00, 10.61it/s]
val :
100%|██████████| 100/100 [00:09<00:00, 10.75it/s]
test :
100%|██████████| 100/100 [00:08<00:00, 12.31it/s]

Create the training, validation, and test sets (train_ds, val_ds, and test_ds).

Create the model

The following 3D convolutional neural network model is based on the paper A Closer Look at Spatiotemporal Convolutions for Action Recognition by D. Tran et al. (2017). The paper compares several versions of 3D ResNets. Instead of operating on a single image with dimensions (height, width), like standard ResNets, these operate on a video volume (time, height, width). The most obvious approach to this problem would be to replace each 2D convolution (layers.Conv2D) with a 3D convolution (layers.Conv3D).

This tutorial uses a (2 + 1)D convolution with residual connections. The (2 + 1)D convolution allows for the decomposition of the spatial and temporal dimensions, therefore creating two separate steps. An advantage of this approach is that factorizing the convolutions into spatial and temporal dimensions saves parameters.

For each output location a 3D convolution combines all the vectors from a 3D patch of the volume to create one vector in the output volume.

Figure: 3D convolutions
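
To make the patch-to-vector picture concrete, the NumPy sketch below computes a single output location of a 3D convolution by hand. The channel counts and the random values are assumptions chosen for illustration, and the bias term is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out = 4, 4

# One 3 x 3 x 3 patch of the input volume: (time, height, width, channels).
patch = rng.normal(size=(3, 3, 3, c_in))
# Kernel of a 3D convolution: (time, height, width, in_channels, out_channels).
kernel = rng.normal(size=(3, 3, 3, c_in, c_out))

# Every input vector in the patch contributes to every output channel.
out_vector = np.einsum('thwi,thwio->o', patch, kernel)

print(out_vector.shape)  # one vector with c_out entries
print(kernel.size)       # 27 * c_in * c_out weights
```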

This operation takes time * height * width * channels inputs and produces channels outputs (assuming the number of input and output channels is the same). So a 3D convolution layer with a kernel size of (3 x 3 x 3) would need a weight matrix with 27 * channels ** 2 entries. The reference paper found that a more effective and efficient approach was to factorize the convolution. Instead of a single 3D convolution to process the time and space dimensions, they proposed a "(2+1)D" convolution which processes the space and time dimensions separately. The figure below shows the factored spatial and temporal convolutions of a (2 + 1)D convolution.

Figure: (2+1)D convolutions

The main advantage of this approach is that it reduces the number of parameters. In the (2 + 1)D convolution the spatial convolution takes in a patch of shape (1, height, width), while the temporal convolution takes in a patch of shape (time, 1, 1). For example, a (2 + 1)D convolution with kernel size (3 x 3 x 3) would need weight matrices of size (9 * channels**2) + (3 * channels**2), less than half as many as the full 3D convolution. This tutorial implements (2 + 1)D ResNet18, where each convolution in the ResNet is replaced by a (2+1)D convolution.
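
The parameter arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that ignores bias terms and assumes the intermediate tensor between the spatial and temporal steps has as many channels as the output; the function names are hypothetical helpers, not part of any library.

```python
def conv3d_weights(kernel_size, in_channels, out_channels):
  """Weights in a full 3D convolution (biases ignored)."""
  kt, kh, kw = kernel_size
  return kt * kh * kw * in_channels * out_channels

def conv2plus1d_weights(kernel_size, in_channels, out_channels):
  """Weights in a (1, kh, kw) spatial conv plus a (kt, 1, 1) temporal conv."""
  kt, kh, kw = kernel_size
  spatial = 1 * kh * kw * in_channels * out_channels
  temporal = kt * 1 * 1 * out_channels * out_channels
  return spatial + temporal

channels = 16
full = conv3d_weights((3, 3, 3), channels, channels)
factored = conv2plus1d_weights((3, 3, 3), channels, channels)
print(full, factored, factored / full)
```

With equal input and output channels the ratio is 12 / 27, whatever the channel count.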

# Define the dimensions of one frame in the set of frames created
HEIGHT = 224
WIDTH = 224
class Conv2Plus1D(keras.layers.Layer):
  def __init__(self, filters, kernel_size, padding):
    """
      A sequence of convolutional layers that first apply the convolution operation over the
      spatial dimensions, and then the temporal dimension. 
    """
    super().__init__()
    self.seq = keras.Sequential([  
        # Spatial decomposition
        layers.Conv3D(filters=filters,
                      kernel_size=(1, kernel_size[1], kernel_size[2]),
                      padding=padding),
        # Temporal decomposition
        layers.Conv3D(filters=filters, 
                      kernel_size=(kernel_size[0], 1, 1),
                      padding=padding)
        ])

  def call(self, x):
    return self.seq(x)

A ResNet model is made from a sequence of residual blocks. A residual block has two branches. The main branch performs the calculation, but is difficult for gradients to flow through. The residual branch bypasses the main calculation and mostly just adds the input to the output of the main branch. Gradients flow easily through this branch. Therefore, there is always an easy path from the loss function to the main branch of any residual block. This helps avoid the vanishing gradient problem.
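
A toy scalar calculation, not taken from the reference paper, illustrates why the skip connection helps. Assume every main branch is a linear map f(x) = w * x with a small weight w (the names w and depth are assumptions for illustration). Without skip connections the gradient through a stack of such layers is w ** depth, while with skip connections each block contributes a factor of 1 + w.

```python
def plain_gradient(w, depth):
  """d(output)/d(input) for depth stacked layers y = w * x."""
  grad = 1.0
  for _ in range(depth):
    grad *= w
  return grad

def residual_gradient(w, depth):
  """d(output)/d(input) for depth stacked blocks y = x + w * x."""
  grad = 1.0
  for _ in range(depth):
    grad *= 1.0 + w
  return grad

w, depth = 0.1, 20
print(plain_gradient(w, depth))     # shrinks towards zero as depth grows
print(residual_gradient(w, depth))  # stays at or above 1 for w >= 0
```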

Create the main branch of the residual block with the following class. In contrast to the standard ResNet structure, this uses the custom Conv2Plus1D layer instead of layers.Conv2D.

class ResidualMain(keras.layers.Layer):
  """
    Residual block of the model with convolution, layer normalization, and the
    activation function, ReLU.
  """
  def __init__(self, filters, kernel_size):
    super().__init__()
    self.seq = keras.Sequential([
        Conv2Plus1D(filters=filters,
                    kernel_size=kernel_size,
                    padding='same'),
        layers.LayerNormalization(),
        layers.ReLU(),
        Conv2Plus1D(filters=filters, 
                    kernel_size=kernel_size,
                    padding='same'),
        layers.LayerNormalization()
    ])

  def call(self, x):
    return self.seq(x)

To add the residual branch to the main branch, the two must have the same shape. The Project layer below deals with cases where the number of channels is changed on the main branch. In particular, a densely connected layer followed by layer normalization is applied to the residual branch.

class Project(keras.layers.Layer):
  """
    Project certain dimensions of the tensor as the data is passed through different 
    sized filters and downsampled. 
  """
  def __init__(self, units):
    super().__init__()
    self.seq = keras.Sequential([
        layers.Dense(units),
        layers.LayerNormalization()
    ])

  def call(self, x):
    return self.seq(x)
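
The effect of the dense projection on shapes can be sketched in NumPy. The tensor sizes and random weights below are assumptions for illustration, and the layer normalization step is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input to the block: (batch, time, height, width, channels) with 16 channels.
res = rng.normal(size=(2, 4, 8, 8, 16))
# Stand-in for the main branch output, which now has 32 channels.
out = rng.normal(size=(2, 4, 8, 8, 32))

# A dense layer acts on the last axis only: one (16, 32) matrix and a bias
# shared across every batch, time, and spatial position.
weights = rng.normal(size=(16, 32))
bias = np.zeros(32)
projected = res @ weights + bias

# Shapes now match, so the two branches can be added elementwise.
summed = projected + out
print(res.shape, projected.shape, summed.shape)
```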

Use add_residual_block to introduce a skip connection between the layers of the model.

def add_residual_block(input, filters, kernel_size):
  """
    Add residual blocks to the model. If the last dimension of the input data
    does not match the number of filters, project it so that the last dimension matches.
  """
  out = ResidualMain(filters, 
                     kernel_size)(input)

  res = input
  # Using the Keras functional APIs, project the last dimension of the tensor to
  # match the new filter size
  if out.shape[-1] != input.shape[-1]:
    res = Project(out.shape[-1])(res)

  return layers.add([res, out])

Resizing the video is necessary to perform downsampling of the data. In particular, downsampling the video frames allows the model to examine specific parts of frames to detect patterns that may be specific to a certain action. Through downsampling, non-essential information can be discarded. Moreover, resizing the video reduces dimensionality and therefore speeds up processing through the model.

class ResizeVideo(keras.layers.Layer):
  def __init__(self, height, width):
    super().__init__()
    self.height = height
    self.width = width
    self.resizing_layer = layers.Resizing(self.height, self.width)

  def call(self, video):
    """
      Use the einops library to resize the tensor.  

      Args:
        video: Tensor representation of the video, in the form of a set of frames.

      Return:
        The video resized to the new height and width.
    """
    # b stands for batch size, t stands for time, h stands for height, 
    # w stands for width, and c stands for the number of channels.
    old_shape = einops.parse_shape(video, 'b t h w c')
    images = einops.rearrange(video, 'b t h w c -> (b t) h w c')
    images = self.resizing_layer(images)
    videos = einops.rearrange(
        images, '(b t) h w c -> b t h w c',
        t = old_shape['t'])
    return videos
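
The two rearrange patterns used in ResizeVideo fold the time axis into the batch axis and unfold it again, so that an image-level operation can be applied to every frame. The NumPy sketch below mirrors that idea with plain reshapes. The strided slicing is only a stand-in for the resizing layer, an assumption made so the sketch runs without TensorFlow.

```python
import numpy as np

b, t, h, w, c = 2, 3, 4, 4, 1
video = np.arange(b * t * h * w * c, dtype=np.float32).reshape(b, t, h, w, c)

# 'b t h w c -> (b t) h w c': fold time into the batch axis.
images = video.reshape(b * t, h, w, c)

# Stand-in for the per-image resize (an assumption for this sketch):
# keep every second pixel, which halves the height and width.
images = images[:, ::2, ::2, :]

# '(b t) h w c -> b t h w c': unfold the batch axis back into (b, t).
videos = images.reshape(b, t, h // 2, w // 2, c)

print(video.shape, '->', videos.shape)
```

Because the batch index varies slowest in the folded axis, each frame returns to its original video and time step after unfolding.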

Use the Keras functional API to build the residual network.

input_shape = (None, 10, HEIGHT, WIDTH, 3)
input = layers.Input(shape=(input_shape[1:]))
x = input

x = Conv2Plus1D(filters=16, kernel_size=(3, 7, 7), padding='same')(x)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = ResizeVideo(HEIGHT // 2, WIDTH // 2)(x)

# Block 1
x = add_residual_block(x, 16, (3, 3, 3))
x = ResizeVideo(HEIGHT // 4, WIDTH // 4)(x)

# Block 2
x = add_residual_block(x, 32, (3, 3, 3))
x = ResizeVideo(HEIGHT // 8, WIDTH // 8)(x)

# Block 3
x = add_residual_block(x, 64, (3, 3, 3))
x = ResizeVideo(HEIGHT // 16, WIDTH // 16)(x)

# Block 4
x = add_residual_block(x, 128, (3, 3, 3))

x = layers.GlobalAveragePooling3D()(x)
x = layers.Flatten()(x)
x = layers.Dense(10)(x)

model = keras.Model(input, x)
frames, label = next(iter(train_ds))
model.build(frames)
# Visualize the model
keras.utils.plot_model(model, expand_nested=True, dpi=60, show_shapes=True)
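
The shapes in the model definition above can be followed by hand. The helper below is a hypothetical plain-Python trace, not a Keras utility. It assumes 'same' padding throughout, uses the filter counts and resize targets from the model definition, and omits the batch dimension.

```python
def trace_shapes(n_frames=10, height=224, width=224):
  """Follow (time, height, width, channels) through the network."""
  shapes = []
  t, h, w = n_frames, height, width

  # Stem: 'same' padding keeps time/height/width, 16 filters set the channels.
  shapes.append(('stem', (t, h, w, 16)))
  h, w = height // 2, width // 2
  shapes.append(('resize', (t, h, w, 16)))

  # Each residual block keeps the spatial size and sets the channel count;
  # all but the last block are followed by a resize.
  for i, filters in enumerate([16, 32, 64, 128], start=1):
    shapes.append((f'block{i}', (t, h, w, filters)))
    if i < 4:
      h, w = height // 2 ** (i + 1), width // 2 ** (i + 1)
      shapes.append(('resize', (t, h, w, filters)))

  shapes.append(('global_average_pooling', (128,)))
  shapes.append(('dense', (10,)))
  return shapes

for name, shape in trace_shapes():
  print(f'{name:24s}{shape}')
```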

Train the model

For this tutorial, choose the tf.keras.optimizers.Adam optimizer and the tf.keras.losses.SparseCategoricalCrossentropy loss function. Use the metrics argument to view the accuracy of the model performance at every step.

model.compile(loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              optimizer = keras.optimizers.Adam(learning_rate = 0.0001), 
              metrics = ['accuracy'])
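
With from_logits=True the loss is computed from raw model outputs and an integer label. A minimal plain-Python sketch of the per-example value is shown below. The function name is a hypothetical helper, and Keras additionally averages this value over the batch.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
  """Loss for one example: -log(softmax(logits)[label])."""
  # Subtract the maximum logit for numerical stability.
  m = max(logits)
  log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
  return log_sum_exp - logits[label]

# A 10-class model that outputs equal logits is at chance level.
uniform = sparse_categorical_crossentropy_from_logits([0.0] * 10, label=3)
print(uniform)  # equals log(10), about 2.30

# A confident, correct prediction gives a loss close to zero.
confident = sparse_categorical_crossentropy_from_logits(
    [10.0 if i == 3 else 0.0 for i in range(10)], label=3)
print(confident)
```

The chance-level value of about 2.30 is a useful reference for a 10-class problem: a loss near it means the model is doing no better than guessing.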

Train the model for 50 epochs with the Keras Model.fit method.

history = model.fit(x = train_ds,
                    epochs = 50, 
                    validation_data = val_ds)
Epoch 1/50
38/38 [==============================] - 230s 6s/step - loss: 2.4452 - accuracy: 0.1233 - val_loss: 2.4661 - val_accuracy: 0.1400
Epoch 2/50
38/38 [==============================] - 223s 6s/step - loss: 2.1898 - accuracy: 0.2067 - val_loss: 2.5864 - val_accuracy: 0.1500
Epoch 3/50
38/38 [==============================] - 222s 6s/step - loss: 2.0602 - accuracy: 0.2667 - val_loss: 2.7133 - val_accuracy: 0.1200
Epoch 4/50
38/38 [==============================] - 221s 6s/step - loss: 1.8716 - accuracy: 0.3633 - val_loss: 2.4647 - val_accuracy: 0.1800
Epoch 5/50
38/38 [==============================] - 220s 6s/step - loss: 1.7901 - accuracy: 0.3667 - val_loss: 2.7002 - val_accuracy: 0.1500
Epoch 6/50
38/38 [==============================] - 221s 6s/step - loss: 1.7632 - accuracy: 0.3867 - val_loss: 2.6759 - val_accuracy: 0.1600
Epoch 7/50
38/38 [==============================] - 223s 6s/step - loss: 1.7130 - accuracy: 0.3833 - val_loss: 2.3038 - val_accuracy: 0.2200
Epoch 8/50
38/38 [==============================] - 222s 6s/step - loss: 1.6025 - accuracy: 0.4000 - val_loss: 2.6929 - val_accuracy: 0.1700
Epoch 9/50
38/38 [==============================] - 220s 6s/step - loss: 1.5444 - accuracy: 0.4767 - val_loss: 2.6629 - val_accuracy: 0.1800
Epoch 10/50
38/38 [==============================] - 220s 6s/step - loss: 1.4557 - accuracy: 0.4767 - val_loss: 2.2244 - val_accuracy: 0.2300
Epoch 11/50
38/38 [==============================] - 220s 6s/step - loss: 1.3617 - accuracy: 0.5200 - val_loss: 2.1875 - val_accuracy: 0.3200
Epoch 12/50
38/38 [==============================] - 221s 6s/step - loss: 1.3553 - accuracy: 0.5333 - val_loss: 2.1145 - val_accuracy: 0.2700
Epoch 13/50
38/38 [==============================] - 220s 6s/step - loss: 1.3947 - accuracy: 0.5000 - val_loss: 1.8937 - val_accuracy: 0.3700
Epoch 14/50
38/38 [==============================] - 221s 6s/step - loss: 1.3361 - accuracy: 0.5267 - val_loss: 1.6443 - val_accuracy: 0.4600
Epoch 15/50
38/38 [==============================] - 219s 6s/step - loss: 1.2380 - accuracy: 0.5533 - val_loss: 1.4356 - val_accuracy: 0.5400
Epoch 16/50
38/38 [==============================] - 221s 6s/step - loss: 1.1703 - accuracy: 0.5767 - val_loss: 1.8729 - val_accuracy: 0.4000
Epoch 17/50
38/38 [==============================] - 221s 6s/step - loss: 1.1605 - accuracy: 0.6033 - val_loss: 1.7361 - val_accuracy: 0.4600
Epoch 18/50
38/38 [==============================] - 220s 6s/step - loss: 1.0992 - accuracy: 0.6067 - val_loss: 1.2881 - val_accuracy: 0.5700
Epoch 19/50
38/38 [==============================] - 220s 6s/step - loss: 1.1121 - accuracy: 0.6133 - val_loss: 1.4157 - val_accuracy: 0.5200
Epoch 20/50
38/38 [==============================] - 218s 6s/step - loss: 1.0310 - accuracy: 0.6300 - val_loss: 1.2908 - val_accuracy: 0.5800
Epoch 21/50
38/38 [==============================] - 219s 6s/step - loss: 1.0134 - accuracy: 0.6467 - val_loss: 1.3474 - val_accuracy: 0.5800
Epoch 22/50
38/38 [==============================] - 219s 6s/step - loss: 0.9677 - accuracy: 0.6300 - val_loss: 1.3569 - val_accuracy: 0.5300
Epoch 23/50
38/38 [==============================] - 219s 6s/step - loss: 0.9292 - accuracy: 0.6433 - val_loss: 1.1130 - val_accuracy: 0.5900
Epoch 24/50
38/38 [==============================] - 219s 6s/step - loss: 0.9134 - accuracy: 0.6833 - val_loss: 1.3144 - val_accuracy: 0.5200
Epoch 25/50
38/38 [==============================] - 218s 6s/step - loss: 0.8948 - accuracy: 0.7000 - val_loss: 1.1649 - val_accuracy: 0.5900
Epoch 26/50
38/38 [==============================] - 219s 6s/step - loss: 0.8968 - accuracy: 0.6500 - val_loss: 1.1370 - val_accuracy: 0.6400
Epoch 27/50
38/38 [==============================] - 219s 6s/step - loss: 0.9460 - accuracy: 0.6533 - val_loss: 2.2827 - val_accuracy: 0.3800
Epoch 28/50
38/38 [==============================] - 218s 6s/step - loss: 1.0633 - accuracy: 0.6333 - val_loss: 1.2745 - val_accuracy: 0.5400
Epoch 29/50
38/38 [==============================] - 219s 6s/step - loss: 0.9378 - accuracy: 0.6733 - val_loss: 1.2241 - val_accuracy: 0.6500
Epoch 30/50
38/38 [==============================] - 219s 6s/step - loss: 0.8682 - accuracy: 0.7200 - val_loss: 1.1828 - val_accuracy: 0.6500
Epoch 31/50
38/38 [==============================] - 218s 6s/step - loss: 0.8379 - accuracy: 0.6833 - val_loss: 1.1417 - val_accuracy: 0.6000
Epoch 32/50
38/38 [==============================] - 218s 6s/step - loss: 0.7856 - accuracy: 0.6900 - val_loss: 1.2292 - val_accuracy: 0.5600
Epoch 33/50
38/38 [==============================] - 219s 6s/step - loss: 0.8056 - accuracy: 0.7233 - val_loss: 1.0834 - val_accuracy: 0.6200
Epoch 34/50
38/38 [==============================] - 220s 6s/step - loss: 0.8262 - accuracy: 0.6867 - val_loss: 1.1120 - val_accuracy: 0.6000
Epoch 35/50
38/38 [==============================] - 218s 6s/step - loss: 0.7472 - accuracy: 0.7367 - val_loss: 0.9757 - val_accuracy: 0.6700
Epoch 36/50
38/38 [==============================] - 219s 6s/step - loss: 0.6969 - accuracy: 0.7500 - val_loss: 0.9642 - val_accuracy: 0.6400
Epoch 37/50
38/38 [==============================] - 219s 6s/step - loss: 0.7518 - accuracy: 0.7467 - val_loss: 1.1454 - val_accuracy: 0.5100
Epoch 38/50
38/38 [==============================] - 220s 6s/step - loss: 0.7360 - accuracy: 0.7267 - val_loss: 0.9619 - val_accuracy: 0.6800
Epoch 39/50
38/38 [==============================] - 220s 6s/step - loss: 0.6887 - accuracy: 0.7600 - val_loss: 1.1292 - val_accuracy: 0.6100
Epoch 40/50
38/38 [==============================] - 220s 6s/step - loss: 0.7217 - accuracy: 0.7567 - val_loss: 1.2201 - val_accuracy: 0.6100
Epoch 41/50
38/38 [==============================] - 219s 6s/step - loss: 0.7505 - accuracy: 0.7200 - val_loss: 0.9450 - val_accuracy: 0.6800
Epoch 42/50
38/38 [==============================] - 218s 6s/step - loss: 0.6737 - accuracy: 0.7433 - val_loss: 0.9566 - val_accuracy: 0.6500
Epoch 43/50
38/38 [==============================] - 219s 6s/step - loss: 0.6232 - accuracy: 0.7867 - val_loss: 0.9072 - val_accuracy: 0.7100
Epoch 44/50
38/38 [==============================] - 220s 6s/step - loss: 0.5908 - accuracy: 0.8100 - val_loss: 0.9052 - val_accuracy: 0.7200
Epoch 45/50
38/38 [==============================] - 219s 6s/step - loss: 0.5901 - accuracy: 0.7767 - val_loss: 0.8087 - val_accuracy: 0.7100
Epoch 46/50
38/38 [==============================] - 218s 6s/step - loss: 0.6202 - accuracy: 0.7833 - val_loss: 1.0201 - val_accuracy: 0.7000
Epoch 47/50
38/38 [==============================] - 217s 6s/step - loss: 0.6777 - accuracy: 0.7567 - val_loss: 1.5742 - val_accuracy: 0.4800
Epoch 48/50
38/38 [==============================] - 217s 6s/step - loss: 0.8462 - accuracy: 0.6767 - val_loss: 1.6540 - val_accuracy: 0.4400
Epoch 49/50
38/38 [==============================] - 219s 6s/step - loss: 0.7168 - accuracy: 0.7333 - val_loss: 1.2454 - val_accuracy: 0.6000
Epoch 50/50
38/38 [==============================] - 216s 6s/step - loss: 0.6592 - accuracy: 0.7433 - val_loss: 0.9307 - val_accuracy: 0.6700

Visualize the results

Create plots of the loss and accuracy on the training and validation sets:

def plot_history(history):
  """
    Plotting training and validation learning curves.

    Args:
      history: model history with all the metric measures
  """
  fig, (ax1, ax2) = plt.subplots(2)

  fig.set_size_inches(18.5, 10.5)

  # Plot loss
  ax1.set_title('Loss')
  ax1.plot(history.history['loss'], label = 'train')
  ax1.plot(history.history['val_loss'], label = 'validation')
  ax1.set_ylabel('Loss')

  # Determine upper bound of y-axis
  max_loss = max(history.history['loss'] + history.history['val_loss'])

  ax1.set_ylim([0, np.ceil(max_loss)])
  ax1.set_xlabel('Epoch')
  ax1.legend(['Train', 'Validation']) 

  # Plot accuracy
  ax2.set_title('Accuracy')
  ax2.plot(history.history['accuracy'],  label = 'train')
  ax2.plot(history.history['val_accuracy'], label = 'validation')
  ax2.set_ylabel('Accuracy')
  ax2.set_ylim([0, 1])
  ax2.set_xlabel('Epoch')
  ax2.legend(['Train', 'Validation'])

  plt.show()

plot_history(history)

Evaluate the model

Use Keras Model.evaluate to get the loss and accuracy on the test dataset.

model.evaluate(test_ds, return_dict=True)
13/13 [==============================] - 15s 1s/step - loss: 0.8879 - accuracy: 0.7000
{'loss': 0.8878847360610962, 'accuracy': 0.699999988079071}

To visualize model performance further, use a confusion matrix. The confusion matrix allows you to assess the performance of the classification model beyond accuracy. In order to build the confusion matrix for this multi-class classification problem, get the actual values in the test set and the predicted values.
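
As a small illustration of what the confusion matrix contains, the sketch below builds one in plain Python from hypothetical logits for a 3-class problem. Rows index the actual class and columns index the predicted class, the same layout tf.math.confusion_matrix uses. The helper names and the data are assumptions for illustration.

```python
def confusion_matrix(actual, predicted, num_classes):
  """Rows index the actual class, columns index the predicted class."""
  cm = [[0] * num_classes for _ in range(num_classes)]
  for a, p in zip(actual, predicted):
    cm[a][p] += 1
  return cm

def argmax(values):
  """Index of the largest value, used to turn logits into a class ID."""
  return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for five examples of a 3-class problem.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.1],
          [0.1, 0.2, 3.0], [0.3, 0.2, 0.1]]
actual = [0, 1, 1, 2, 2]
predicted = [argmax(row) for row in logits]

for row in confusion_matrix(actual, predicted, num_classes=3):
  print(row)
```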

def get_actual_predicted_labels(dataset): 
  """
    Create a list of actual ground truth values and the predictions from the model.

    Args:
      dataset: An iterable data structure, such as a TensorFlow Dataset, with features and labels.

    Return:
      Ground truth and predicted values for a particular dataset.
  """
  actual = [labels for _, labels in dataset.unbatch()]
  predicted = model.predict(dataset)

  actual = tf.stack(actual, axis=0)
  predicted = tf.concat(predicted, axis=0)
  predicted = tf.argmax(predicted, axis=1)

  return actual, predicted
def plot_confusion_matrix(actual, predicted, labels, ds_type):
  cm = tf.math.confusion_matrix(actual, predicted)
  ax = sns.heatmap(cm, annot=True, fmt='g')
  sns.set(rc={'figure.figsize':(12, 12)})
  sns.set(font_scale=1.4)
  ax.set_title('Confusion matrix of action recognition for ' + ds_type)
  ax.set_xlabel('Predicted Action')
  ax.set_ylabel('Actual Action')
  plt.xticks(rotation=90)
  plt.yticks(rotation=0)
  ax.xaxis.set_ticklabels(labels)
  ax.yaxis.set_ticklabels(labels)
fg = FrameGenerator(subset_paths['train'], n_frames, training=True)
labels = list(fg.class_ids_for_name.keys())
actual, predicted = get_actual_predicted_labels(train_ds)
plot_confusion_matrix(actual, predicted, labels, 'training')
38/38 [==============================] - 46s 1s/step

actual, predicted = get_actual_predicted_labels(test_ds)
plot_confusion_matrix(actual, predicted, labels, 'test')
13/13 [==============================] - 15s 1s/step

The precision and recall values for each class can also be calculated using a confusion matrix.

def calculate_classification_metrics(y_actual, y_pred, labels):
  """
    Calculate the precision and recall of a classification model using the ground truth and
    predicted values. 

    Args:
      y_actual: Ground truth labels.
      y_pred: Predicted labels.
      labels: List of classification labels.

    Return:
      Precision and recall measures.
  """
  cm = tf.math.confusion_matrix(y_actual, y_pred)
  tp = np.diag(cm) # Diagonal represents true positives
  precision = dict()
  recall = dict()
  for i in range(len(labels)):
    col = cm[:, i]
    fp = np.sum(col) - tp[i] # Sum of column minus true positive is false positive

    row = cm[i, :]
    fn = np.sum(row) - tp[i] # Sum of row minus true positive is false negative

    precision[labels[i]] = tp[i] / (tp[i] + fp) # Precision 

    recall[labels[i]] = tp[i] / (tp[i] + fn) # Recall

  return precision, recall
precision, recall = calculate_classification_metrics(actual, predicted, labels) # Test dataset
precision
{'ApplyEyeMakeup': 0.6666666666666666,
 'ApplyLipstick': 0.5714285714285714,
 'Archery': 0.6,
 'BabyCrawling': 0.5,
 'BalanceBeam': 0.5714285714285714,
 'BandMarching': 1.0,
 'BaseballPitch': 1.0,
 'Basketball': 0.5,
 'BasketballDunk': 0.8181818181818182,
 'BenchPress': 0.8333333333333334}
recall
{'ApplyEyeMakeup': 0.6,
 'ApplyLipstick': 0.4,
 'Archery': 0.9,
 'BabyCrawling': 0.8,
 'BalanceBeam': 0.4,
 'BandMarching': 0.8,
 'BaseballPitch': 0.9,
 'Basketball': 0.3,
 'BasketballDunk': 0.9,
 'BenchPress': 1.0}
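
The column and row arithmetic can be verified on a small hand-made matrix. The sketch below uses a hypothetical 3-class confusion matrix with 10 examples per class. The guard against division by zero is an addition for this sketch and is not part of the function above.

```python
def precision_recall(cm):
  """Per-class precision and recall from a confusion matrix (rows = actual)."""
  n = len(cm)
  precision, recall = [], []
  for i in range(n):
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(n)) - tp  # rest of column i
    fn = sum(cm[i]) - tp                       # rest of row i
    # Guard against classes that were never predicted or never present.
    precision.append(tp / (tp + fp) if tp + fp else 0.0)
    recall.append(tp / (tp + fn) if tp + fn else 0.0)
  return precision, recall

# Hypothetical 3-class confusion matrix with 10 examples per class.
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 2, 8]]
toy_precision, toy_recall = precision_recall(cm)
print(toy_precision)
print(toy_recall)
```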

Next steps

To learn more about working with video data in TensorFlow, see the other tutorials in this series.