(MTL S01E11) Metal Performance Shaders
In this episode, we’ll explore Metal’s library for image processing, matrix and vector operations, and neural network layers: Metal Performance Shaders (MPS). MPS integrates seamlessly into the Metal pipeline and is highly optimized for performance. However, its documentation often lacks the depth needed to cover all the nuances of using the library effectively.
Overview
First of all, the official documentation is an invaluable resource. The library includes a vast collection of ready-to-use algorithms, making it impossible to cover everything here. Instead, this episode will give you a sense of what’s available and what to expect.
There are three main sections in Metal Performance Shaders: `image filters`, `neural networks`, and `matrices and vectors`. Personally, I’ve mostly worked with image filters and matrices, as neural networks are more straightforward to implement using MPSGraph (which we’ll cover in the next episode).
Wrappers
MPS introduces its own set of types, which are essentially wrappers around native Metal objects. Certain MPS operations require these specific types instead of Metal's standard ones:
- MPSImage: Represents a collection of one or more `MTLTexture` objects with identical dimensions, allowing you to simulate an image with more than four channels.
- MPSMatrix: Combines an `MTLBuffer` with metadata, including data alignment, the type of elements, and the number of rows and columns, to represent a matrix efficiently.
- MPSVector: A one-dimensional equivalent of `MPSMatrix`, designed for vector operations while retaining similar metadata features.
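To make the wrapper idea concrete, here is a minimal sketch of wrapping an `MTLBuffer` in an `MPSMatrix`. The `makeMatrix` helper is illustrative, not part of MPS; note that MPS itself can report the properly aligned row stride for a given column count.

```swift
import Metal
import MetalPerformanceShaders

// Hypothetical helper: wraps a freshly allocated MTLBuffer in an MPSMatrix.
func makeMatrix(device: MTLDevice, rows: Int, columns: Int) -> MPSMatrix? {
    // MPS reports the optimal row stride (rows may be padded for alignment).
    let rowBytes = MPSMatrixDescriptor.rowBytes(forColumns: columns, dataType: .float32)
    guard let buffer = device.makeBuffer(length: rows * rowBytes,
                                         options: .storageModeShared) else { return nil }
    let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns,
                                         rowBytes: rowBytes, dataType: .float32)
    return MPSMatrix(buffer: buffer, descriptor: descriptor)
}
```

Because `rowBytes` may exceed `columns * MemoryLayout<Float>.stride`, always address the underlying buffer through the descriptor's stride rather than assuming tightly packed rows.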
Image Filters
- Morphological: Area-based operations like min, max, erode, and dilate.
- Convolution: Filters such as median, box, tent, Gaussian blur, Sobel, and pyramid operations (Gaussian and Laplacian).
- Histogram: Includes computation and equalization.
- Threshold: Binary, to-zero, truncate, and their inverse variants.
- Integral: Computes integral images.
- Manipulations: Operations like bilinear and Lanczos scaling, color space and format conversion, and transpose.
- Statistics: Computes values like mean, variance, min, and max.
- Reduction: Aggregates values like min, max, sum, or mean by row or column.
- Arithmetic: Basic operations such as addition, subtraction, multiplication, and division.
- Euclidean Distance Transform: Computes the distance from non-zero pixels to the nearest zero pixel.
- Guided Filter: Performs edge-preserving smoothing using a guidance image.
If some of these terms are unfamiliar, don’t hesitate to look them up or explore basic image processing techniques. It’s a fascinating and rewarding area to learn about!
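All of these filters follow the same usage pattern: create the kernel once, then encode it into a command buffer with a source and destination texture. A sketch with Gaussian blur (the `commandQueue`, `source`, and `destination` objects are assumed to exist already, and the sigma value is arbitrary):

```swift
import Metal
import MetalPerformanceShaders

// A typical MPS image filter invocation, sketched with MPSImageGaussianBlur.
func blur(commandQueue: MTLCommandQueue, source: MTLTexture, destination: MTLTexture) {
    guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
    let gaussian = MPSImageGaussianBlur(device: commandQueue.device, sigma: 2.5)
    gaussian.edgeMode = .clamp // how pixels outside the image bounds are treated
    gaussian.encode(commandBuffer: commandBuffer,
                    sourceTexture: source,
                    destinationTexture: destination)
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```

In real code you would typically keep the kernel instance around and skip `waitUntilCompleted()` in favor of letting the command buffer complete asynchronously.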
Neural Networks
MPS provides support for fully connected, convolutional, and recurrent neural networks. Each section includes layers for arithmetic, pooling, convolution, fully connected operations, neuron activation, softmax, normalization, upsampling, resampling, dropout, loss computation, filtering, and layer concatenation. While this functionality is robust and sufficient for most machine learning tasks, using MPSGraph is often simpler and more intuitive when applying Metal to machine learning workflows.
Matrices and Vectors
Here, we have tools for fundamental operations like addition and multiplication, as well as advanced mathematical capabilities, including LU decomposition, triangular matrix solvers, and Cholesky decomposition.
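As a taste of this section, here is a hedged sketch of C = A × B using `MPSMatrixMultiplication`; the matrices themselves are assumed to have been built as shown earlier, with dimensions that agree (`a.columns == b.rows`).

```swift
import Metal
import MetalPerformanceShaders

// A minimal sketch of C = alpha * A * B + beta * C on the GPU.
func multiply(device: MTLDevice, commandQueue: MTLCommandQueue,
              a: MPSMatrix, b: MPSMatrix, c: MPSMatrix) {
    guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
    let gemm = MPSMatrixMultiplication(device: device,
                                       transposeLeft: false, transposeRight: false,
                                       resultRows: c.rows, resultColumns: c.columns,
                                       interiorColumns: a.columns,
                                       alpha: 1.0, beta: 0.0)
    gemm.encode(commandBuffer: commandBuffer,
                leftMatrix: a, rightMatrix: b, resultMatrix: c)
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```

The `transposeLeft`/`transposeRight` flags let you multiply by a transpose without materializing it, and `alpha`/`beta` give you the usual GEMM accumulation form.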
Nuances
- Input/Output Texture Requirements: Some operations, such as pyramid filters, require specific input/output textures. For instance, pyramid operations need floating-point textures with allocated mipmaps.
- Intermediate Textures and Buffers: While certain image filters can perform operations in-place, they may still create intermediate textures or buffers. Be mindful of this when optimizing memory usage.
- Default Parameters: Default settings for some MPS functions may not work for every task. For example, building a Laplacian pyramid requires first creating a Gaussian pyramid with custom (non-default) kernels.
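For the first nuance above, satisfying a pyramid filter's requirements comes down to how the texture is allocated. A sketch of creating a floating-point texture with a full mipmap chain (the helper name is illustrative):

```swift
import Metal

// Sketch: a single-channel float texture with a full mip chain,
// which pyramid filters expect as their destination.
func makePyramidTexture(device: MTLDevice, width: Int, height: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .r32Float,  // floating-point, as pyramid operations require
        width: width, height: height,
        mipmapped: true)         // allocates every mip level down to 1x1
    descriptor.usage = [.shaderRead, .shaderWrite]
    return device.makeTexture(descriptor: descriptor)
}
```

Passing `mipmapped: true` makes Metal compute `mipmapLevelCount` for you; forgetting it is one of the easiest ways to hit an MPS exception with pyramid kernels.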
Example: Euclidean Distance Transform
Why choose the Euclidean Distance Transform as an example? It’s simple and perhaps a bit unexciting, but it’s perfect for illustrating what’s happening under the hood and showcasing a typical use case of MPS. Its straightforward nature allows us to focus on understanding the mechanics and benefits of using Metal Performance Shaders without getting lost in complex computations.
Implementation
Let’s create a straightforward method to compute a Euclidean distance field for a texture:
```swift
import Metal
import MetalPerformanceShaders

func computeDistanceField(for source: MTLTexture, in commandBuffer: MTLCommandBuffer) -> MTLTexture {
    let device = source.device
    let result = createTexture(device: device, // (1)
                               width: source.width,
                               height: source.height,
                               format: source.pixelFormat)
    let mpsDistanceTransform = MPSImageEuclideanDistanceTransform(device: device) // (2)
    mpsDistanceTransform.encode( // (3)
        commandBuffer: commandBuffer,
        sourceTexture: source,
        destinationTexture: result)
    return result
}

// Helper: allocates a shader-writable texture matching the given size and format.
func createTexture(device: MTLDevice, width: Int, height: Int, format: MTLPixelFormat) -> MTLTexture {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: format, width: width, height: height, mipmapped: false)
    descriptor.usage = [.shaderRead, .shaderWrite]
    return device.makeTexture(descriptor: descriptor)!
}
```
- Output Texture Requirements: The output image must have the same dimensions and format as the input. Both input and output textures must be single-channel, and the output must use a floating-point format. If these requirements aren’t met, MPS will raise an exception, which is helpful for debugging. However, if an exception isn’t triggered, you may end up with nonsensical results, making diagnosis more challenging.
- Initializing MPS: Creating an MPS kernel instance is straightforward. While some operations require additional configuration, many kernels can be initialized once and reused across multiple command buffers, improving efficiency.
- Encoding the Operation: Use a command buffer to encode the operation, specifying the input and output textures. Some MPS operations support in-place execution, allowing the input texture to serve as the output, which can be useful for memory optimization.
Under the Hood
Now, let’s examine what happens on the GPU side, hidden behind the Metal API. To do this, we can capture our command buffer in Xcode:
Though we don’t have access to the kernel source code, we can still understand the principles behind its implementation and reproduce it on our side. Let me break this down at a very high level.
The process consists of three passes, which we can deduce not only from the kernel names but also from the intermediate textures and buffers:
- Vertical Pass (Top to Bottom): Computes distances moving downward through the texture.
- Reverse Vertical Pass (Bottom to Top): Refines distances moving upward, but only if the newly computed distance is smaller than the existing one.
- Horizontal Pass: Completes the process by calculating the final Euclidean distances based on the precomputed vertical distances.
During the vertical passes, MPS creates two additional textures (as indicated by texture names) and two intermediate buffers. These allocations can lead to temporary peaks in memory usage, so don’t be alarmed if you observe increased consumption during this stage.
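The three passes above can be sketched as a CPU reference. This is purely for intuition and is not MPS's actual kernel: the horizontal pass here is a naive O(w²) scan per row, whereas the GPU version is far more clever. Input pixels equal to zero are treated as "on the shape"; the output is the Euclidean distance to the nearest zero pixel.

```swift
// CPU reference sketch of the three-pass Euclidean distance transform.
func distanceField(_ grid: [[Float]]) -> [[Float]] {
    let h = grid.count, w = grid[0].count
    let inf = Float.greatestFiniteMagnitude

    // Pass 1 (top to bottom): per column, distance to the nearest zero above.
    var vertical = [[Float]](repeating: [Float](repeating: inf, count: w), count: h)
    for x in 0..<w {
        for y in 0..<h {
            if grid[y][x] == 0 {
                vertical[y][x] = 0
            } else if y > 0, vertical[y - 1][x] < inf {
                vertical[y][x] = vertical[y - 1][x] + 1
            }
        }
        // Pass 2 (bottom to top): keep the distance only if it is smaller.
        for y in stride(from: h - 2, through: 0, by: -1) {
            if vertical[y + 1][x] < inf {
                vertical[y][x] = min(vertical[y][x], vertical[y + 1][x] + 1)
            }
        }
    }

    // Pass 3 (horizontal): combine vertical distances into full Euclidean ones.
    var result = vertical
    for y in 0..<h {
        for x in 0..<w {
            var best = inf
            for x2 in 0..<w where vertical[y][x2] < inf {
                let dx = Float(x - x2)
                best = min(best, dx * dx + vertical[y][x2] * vertical[y][x2])
            }
            result[y][x] = best < inf ? best.squareRoot() : inf
        }
    }
    return result
}
```

The split works because the nearest zero to any pixel lies directly above or below some column position; once each column knows its vertical distances, a horizontal combine over dx² + dy² recovers the true Euclidean distance.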
Conclusion
- MPS is a powerful tool that can save you significant time in implementation and provide excellent performance for computationally intensive tasks.
- Documentation for MPS, however, is often sparse and insufficient, leaving developers to rely on experimentation, exceptions, and scattered discussions on forums to uncover its nuances.
- Many of MPS’s nuances are discovered through trial and error or by understanding the underlying principles of its implementation.
- Intermediate resource creation is a common occurrence in MPS operations, which can lead to increased memory consumption. Be mindful of this when optimizing for memory-limited environments.
- You can examine the GPU pipeline using tools like Xcode’s GPU capture, but keep in mind that the pipeline can vary depending on the parameters used. Gaussian blur, for example, adjusts its implementation based on configuration.