(MTL S01E07) Rendering pt.1: What’s under the hood
As Metal is closely tied to the GPU, one of its obvious purposes is rendering. But what exactly is rendering? Let’s explore this concept. In this episode, we’ll focus on general principles rather than diving into the specific details of Metal.
DISCLAIMER: Some details here aren’t directly covered in the Metal documentation. They are based on my own experiments and comparisons with OpenGL’s rendering pipeline, which, in my experience, shares general similarities. If you have any feedback or insights, feel free to share!
What is rendering
Let’s begin with some key terminology:
Rendering, or image synthesis, is the process of generating either a photorealistic or non-photorealistic image from a 2D or 3D model through a computer program. It can also refer to calculating visual effects in video editing software to produce the final video output.
There are two main approaches to rendering:
- Ray tracing calculates pixel values by simulating the paths of light rays as they interact with objects in a scene, producing effects like reflections and shadows.
- Rasterization projects 3D objects onto a 2D screen plane and fills the corresponding pixels based on the geometry and textures of the objects.
Rasterization is significantly faster than ray tracing, although it doesn’t achieve the same level of quality. Historically, rasterization has been used in most rendering engines. Metal primarily uses rasterization, but it also supports ray tracing to add more detail where needed.
I might write an article about rasterization on the CPU for a deeper dive into its mechanics, but that’s beyond the scope of this episode. As for ray tracing in Metal, we’ll explore that in future episodes, as it’s a bit more complex for beginners.
Rendering pipeline
Let’s start with a general overview of the rendering pipeline. While we could reference Apple’s official diagram, it’s a bit too high-level for our purposes:
I decided to add a bit more detail for better understanding, while still skipping most of the low-level specifics. I’ll focus on the key points that are useful in most tasks. I’m not aware of a complete diagram of the Metal rendering pipeline, but you can refer to the OpenGL pipeline; I’m confident the overall concepts are quite similar.
Draw dispatch
This is the first step of the rendering pipeline, where we perform the following tasks:
- Set up the render encoder (attachments, scissor rect, viewport volume, etc.).
- Select a pipeline state for drawing (shaders, blending settings).
- Pass parameters to the shaders.
- Dispatch the draw call itself (primitive type, vertex count, and the number of instances).
Yes, we can render multiple instances with the same model and parameters, varying their behavior inside the shaders (e.g., via the instance ID).
A more detailed explanation will follow in the 8th (next) episode.
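For now, here’s a minimal sketch of these steps in Swift; `commandBuffer`, `renderPassDescriptor`, `pipelineState`, and `vertexBuffer` are hypothetical objects created elsewhere:

```swift
// A minimal draw dispatch; all referenced objects are assumed to exist.
let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor)!

// 1. Encoder setup: viewport volume, pipeline state, etc.
encoder.setViewport(MTLViewport(originX: 0, originY: 0,
                                width: 800, height: 600,
                                znear: 0, zfar: 1))
encoder.setRenderPipelineState(pipelineState)

// 2. Pass parameters to the shaders.
encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)

// 3. The draw call itself: primitive type, vertex count, instance count.
encoder.drawPrimitives(type: .triangle,
                       vertexStart: 0,
                       vertexCount: 42,
                       instanceCount: 3)

encoder.endEncoding()
```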
Vertex processing
Once we’ve dispatched our draw call, Metal maps the buffers we passed into the vertex shader’s structures. You can assign vertex properties from different buffers to corresponding attributes in the input (`[[stage_in]]`) structure, or simply pass the buffers directly and handle them using the vertex index (`[[vertex_id]]`). More details on this will be covered in the 9th episode.
The vertex shader is then invoked for every vertex of every instance, calculating the position in the viewport volume and other necessary parameters. This means:
- If you dispatch a draw call for 3 instances, each with 42 vertices, the vertex shader will be called 42 * 3 = 126 times.
- You can also pass nothing to the shader’s input and construct vertices directly within the shader. This is sometimes useful for drawing simple geometry like quads.
The output of the vertex shader is a structure that must contain a field marked with the `[[position]]` attribute, representing the vertex’s position in the viewport. Additionally, you can include any other fields needed for the fragment shader, such as normals or texture coordinates. There’s no need to manually interpolate these values between vertices — Metal takes care of that for you.
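To make this concrete, here’s an illustrative vertex shader in the Metal Shading Language. The structs and attribute indices are my own example and assume a matching vertex descriptor in the pipeline state (we’ll reuse `VertexOut` in the sketches below):

```metal
#include <metal_stdlib>
using namespace metal;

// Per-vertex input, mapped from the buffers via the vertex descriptor.
struct VertexIn {
    float3 position [[attribute(0)]];
    float2 uv       [[attribute(1)]];
};

// Per-vertex output: [[position]] is mandatory; the other fields are
// interpolated automatically on their way to the fragment shader.
struct VertexOut {
    float4 position [[position]];
    float2 uv;
};

vertex VertexOut vertex_main(VertexIn in            [[stage_in]],
                             uint     vertexIndex   [[vertex_id]],
                             uint     instanceIndex [[instance_id]]) {
    VertexOut out;
    // A trivial pass-through; a real shader would apply
    // model/view/projection transforms here, possibly using
    // vertexIndex or instanceIndex to vary the result.
    out.position = float4(in.position, 1.0);
    out.uv = in.uv;
    return out;
}
```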
Primitive processing
After positioning the vertices in the viewport, Metal performs several steps, which are controlled implicitly through the encoder settings, draw call parameters, and pipeline state.
First, Metal assembles primitives based on the type specified in the draw call. Metal supports points, lines, and triangles; lines and triangles can also be assembled from strips (line strips and triangle strips). Each primitive type has its own specific use cases and nuances.
Next, these primitives are clipped to fit within the specified viewport volume. Depending on the primitive type, this process works as follows:
- Points — Points that fall outside the viewport are simply discarded.
- Lines — If part of a line is outside the viewport, new vertex coordinates are computed at the boundary, and vertices outside the volume are dropped.
- Triangles — New triangles are generated with vertices placed at the boundary, replacing the parts outside the viewport.
During clipping, the outputs for newly created vertices are computed using linear interpolation.
For triangles, a face culling step can also be applied, which discards triangles based on their orientation (determined by the clockwise or counterclockwise order of their vertices). This is typically used to hide triangles that are naturally invisible due to their orientation relative to the viewer.
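Both the winding order and the culling mode are set on the encoder; a minimal sketch:

```swift
// Treat counterclockwise-wound triangles as front-facing
// and discard the ones facing away from the viewer.
encoder.setFrontFacing(.counterClockwise)
encoder.setCullMode(.back)
```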
Rasterization
The next step is rasterization of the primitives. Since the primitives are already projected, Metal treats them as 2D shapes with an additional depth component. At this stage, it determines whether each pixel lies inside or outside the primitive and interpolates vertex outputs:
The result of rasterization is a fragment, which contains data for each pixel (though, in some cases, multiple fragments per pixel can be generated for multisampling purposes). This data includes:
- Interpolated outputs from the vertex shader.
- The position of the fragment (2D coordinates + depth value).
- Stencil value.
- Additional parameters (e.g., for multisampling or other effects).
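On the shader side, this per-fragment data arrives through built-in attributes. Here’s a small illustrative sketch, reusing `VertexOut` from the vertex-processing example (note that requesting `[[sample_id]]` forces the shader to run per sample):

```metal
fragment float4 fragment_debug(VertexOut in [[stage_in]],  // interpolated vertex outputs;
                                                           // in.position now holds window
                                                           // coordinates plus depth
                               uint sampleIndex [[sample_id]]) { // sample index (multisampling)
    // Visualize the interpolated depth as grayscale.
    float depth = in.position.z;
    return float4(depth, depth, depth, 1.0);
}
```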
Fragment processing
The fragments from the previous step undergo several processing stages, with the order varying depending on the configuration:
Scissor Test (if enabled) discards fragments outside a specified rectangular area in the window (note: not in the viewport!). This is typically the first step.
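For example (the rectangle is arbitrary):

```swift
// Discard fragments outside a 200x100 rectangle in window coordinates.
encoder.setScissorRect(MTLScissorRect(x: 10, y: 10, width: 200, height: 100))
```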
Fragment Shader is applied to every input fragment. It can receive the interpolated vertex outputs (`[[stage_in]]`), along with any additional parameters you need to pass in. Its result is a set of new values for the attachments (there may be several):
- If there are no outputs for depth and stencil attachments, the values from the fragment are passed through as-is.
- If there is no color output, the result for the color attachment is undefined (or determined by the store action set for the attachment).
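As an illustration, here’s a sketch of a fragment shader that writes both a color and an explicit depth value. It reuses `VertexOut` from the vertex-processing sketch; the texture/sampler bindings and the `[[depth(any)]]` output are just one possible configuration:

```metal
// Output struct for the attachments: [[color(0)]] targets the first
// color attachment, [[depth(any)]] overrides the rasterized depth.
struct FragmentOut {
    float4 color [[color(0)]];
    float  depth [[depth(any)]];
};

fragment FragmentOut fragment_main(VertexOut in [[stage_in]],
                                   texture2d<float> colorTexture [[texture(0)]],
                                   sampler colorSampler [[sampler(0)]]) {
    FragmentOut out;
    out.color = colorTexture.sample(colorSampler, in.uv);
    out.depth = in.position.z; // same as the rasterized depth in this sketch
    return out;
}
```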
Depth Test (if enabled) discards fragments based on their depth value, comparing them so that fragments that are farther (or nearer, depending on the settings) may be discarded. In certain cases (e.g., when the fragment shader has no depth output), this step runs before the fragment shader to save work, a technique known as early depth testing. Otherwise, it’s executed afterward.
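A minimal sketch of a depth-test configuration, using the hypothetical `device` and `encoder` from before:

```swift
// Keep fragments that are nearer than the stored depth and update the buffer.
let depthStencilDescriptor = MTLDepthStencilDescriptor()
depthStencilDescriptor.depthCompareFunction = .less
depthStencilDescriptor.isDepthWriteEnabled = true
let depthState = device.makeDepthStencilState(descriptor: depthStencilDescriptor)!
encoder.setDepthStencilState(depthState)
```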
Stencil Test (if enabled) discards fragments based on their stencil value by comparing the masked fragment’s stencil value with the stencil value in the related attachment. Metal uses two masks: one for reading and one for writing, allowing more flexible stencil buffer updates. This is a slightly tricky thing, so I may write another article with examples on stencil testing later.
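The stencil side is configured on the same descriptor as the depth test; here’s a sketch extending the previous one, with arbitrary mask values and operations:

```swift
// Pass where (reference & readMask) == (stored & readMask);
// increment the stored value (clamped) wherever the test passes.
let stencil = MTLStencilDescriptor()
stencil.stencilCompareFunction = .equal
stencil.readMask = 0xFF
stencil.writeMask = 0xFF
stencil.depthStencilPassOperation = .incrementClamp
depthStencilDescriptor.frontFaceStencil = stencil
depthStencilDescriptor.backFaceStencil = stencil
encoder.setStencilReferenceValue(0)
```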
Pixel processing
This is where the final processing of pixel values occurs:
- Blending (if set in the current pipeline state) mixes the new pixel value with the existing one for the same pixel. The available blending operations are limited, so if you need more complex blending modes (e.g., Photoshop-style blending), you’ll need to implement them programmatically in the fragment shader, though this is slower than using the built-in pipeline blending (see the sketch at the end of this section).
- Pixel format conversion adapts the output value to the required pixel format, performing operations like clamping (which you can skip in the fragment shader, but not in a compute kernel), denormalization, etc.
- Multisampling merges multiple fragment values into a single pixel.
After these operations, the final result is written to the attachment texture.
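To illustrate the blending setup mentioned above, here’s a sketch of classic alpha blending configured on a hypothetical `pipelineDescriptor` before the pipeline state is built:

```swift
// Classic "source over" alpha blending:
// result = src * srcAlpha + dst * (1 - srcAlpha)
let attachment = pipelineDescriptor.colorAttachments[0]!
attachment.isBlendingEnabled = true
attachment.rgbBlendOperation = .add
attachment.alphaBlendOperation = .add
attachment.sourceRGBBlendFactor = .sourceAlpha
attachment.sourceAlphaBlendFactor = .sourceAlpha
attachment.destinationRGBBlendFactor = .oneMinusSourceAlpha
attachment.destinationAlphaBlendFactor = .oneMinusSourceAlpha
```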
Conclusion
- Rendering is the process of programmatically creating an image. There are two main approaches: ray tracing and rasterization.
- In Metal, the rendering pipeline can be broadly divided into three main stages: vertex processing, rasterization, and fragment processing.
- For additional reading, I recommend this resource and this one. While they focus on OpenGL, they offer valuable insights into rendering concepts that are applicable more generally.