The perspective projection matrix in Vulkan

(Edited on: )
Tags: vulkan graphics maths
If you have any questions or remarks, you can leave a comment here.


The perspective projection matrix is crucial in computer graphics to display 3D points on a screen. In most of the computer graphics/OpenGL/Vulkan tutorials online there is only a brief mention of the glm::perspective function and its parameters and quick “hacks” to make it work on Vulkan (Hello negative viewport and correction matrix). In these notes I will try to explain the maths behind the perspective projection and give a matrix that works with Vulkan. Don’t worry. You can still follow if you don’t use Vulkan and adapt the formulas with your settings.

Figure 1: The frustum and the clip volume

The projection matrix transforms vectors from the camera (or eye) space to the clip space. The clip space is a homogeneous space used to remove (or clip) primitives outside the viewport. After clipping, the hardware performs a “perspective division” to transform clip coordinates into normalized device coordinates by dividing each component by the 4th component w.

Classic perspective with a near and far plane

Projecting onto the near plane

The first step consists of projecting points in eye space on the near plane of the frustum. Here, the eye space is a right-handed coordinate system, and the camera is facing toward the -Z axis. Let’s start by finding the values of \(x\) and \(y\). The depth will be derived later.

Figure 2: The frustum volume, the point (xe, ye, ze) gets projected in the near plane at coordinates (xp, yp, zp). The blue line is the Z axis.

The frustum is a pyramid with the camera as its apex: it forms similar triangles on the XZ and YZ planes.

Using the Intercept theorem (also known as Thales’s theorem), it’s easy to find the value of \(x_p\) and \(y_p\).

Figure 3: Side view (YZ plane) of the frustum, the top view (XZ plane) is similar

The intercept theorem tells us that the ratio between a point projected on the near plane and the coordinate in eye space is the same: \[ \frac{x_p}{x_e} = \frac{y_p}{y_e} = \frac{z_p}{z_e} = \frac{-n}{z_e} \]

\(z_p\) will always be \(-n\) because we are projecting points on the near plane.

In the end, we have:

\begin{align} & x_p = \frac{-n x_e}{z_e} = \frac{1}{-z_e} n x_e\\ & y_p = \frac{-n y_e}{z_e} = \frac{1}{-z_e} n y_e\\ & z_p = -n = \frac{1}{-z_e} n z_e \end{align}

One thing to note here is that \(x_p\) and \(y_p\) are inversely proportional to \(z_e\). The result of a multiplication between a matrix and a vector is a linear combination of its components. It’s not possible to divide by a component. To solve this, the hardware performs what’s called a “perspective divide”. Every components of the clip coordinate gets divided by its 4th component (\(w_c\)), so we will set it to \(-z_e\) to divide \(x_p\), \(y_p\) and \(z_p\).

We just derived \(w_c\) so we can fill one row in our projection matrix.

\[ w_c = -1 \times z_e \]

\begin{equation} \begin{pmatrix} . & . & . & .\\ . & . & . & .\\ . & . & . & .\\ 0 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix}= \begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} \end{equation}

Going from near plane to near clip plane

Now that we have expressed \(x_p\) and \(y_p\) in terms of \(x_e\) and \(y_e\), let’s try to express their normalized device coordinates \(x_n\) and \(y_n\). The normalized device coordinates are the clip coordinates divided by their 4th component:

\begin{align} \begin{pmatrix} x_n \\ y_n \\ z_n \\ w_n \end{pmatrix}= \begin{pmatrix} \frac{x_c}{w_c} \\ \frac{y_c}{w_c} \\ \frac{z_c}{w_c} \\ \frac{w_c}{w_c} \end{pmatrix} \end{align}

The near plane of our frustum is defined by 4 corners \((l, t)\), \((r, t)\), \((r, b)\) and \((l, b)\). We want to match these with \((-1, -1)\), \((1, -1)\), \((1, 1)\) and \((-1, 1)\) respectively.


Note: If you have a different clip space, you will have to adjust the corners of the near clip plane! Vulkan differs from other graphics APIs; it uses a downward Y axis and the same clip depth as DirectX (0 to 1).

If you use a legacy OpenGL clip space, your clip space corners will be different: Y is upward (\(t = 1\), \(b = -1\))


Figure 4: The mapping of the corners of the frustum to the corners of the clip volume

The mapping of the near frustum plane to the near clip plane is a linear function of the form \(f(x) = \alpha x + \beta\). We can use the formula to find the slope of the function, and then, by replacing the known values in the function, we can find the constant term:

Starting with the \(x\) coordinate:

\begin{align} & f(l) = -1 \quad \text{and} \quad f( r) = 1\\ \Rightarrow \quad & \alpha = \frac{1 - (-1)}{r - l} = \frac{2}{r-l} \\ \\ & f( r) = 1\\ \Rightarrow \quad & f( r) = 1 = \frac{2}{r - l} r + \beta \\ \Leftrightarrow \quad &\beta = 1 - \frac{2r}{r-l} = -\frac{r+l}{r-l}\\ \\ & f(x_p) = x_n\\ \Leftrightarrow \quad & x_n = \frac{2}{r-l} x_p - \frac{r+l}{r-l} \end{align}

The same goes for finding \(y\):

\begin{align} & f(t) = -1 \quad \text{and} \quad f(b) = 1 \\ \Rightarrow \quad & \alpha = \frac{1 - (-1)}{b - t} = \frac{2}{b-t} \\ \\ & f(b) = 1\\ \Rightarrow \quad & f(b) = 1 = \frac{2}{b - t} b + \beta \\ \Leftrightarrow \quad &\beta = 1 - \frac{2b}{b-t} = -\frac{b+t}{b-t}\\ \\ & f(y_p) = y_n\\ \Leftrightarrow \quad & y_n = \frac{2}{b-t} y_p - \frac{b+t}{b-t} \end{align}

Now we just have to replace \(x_p\) and \(y_p\) by the expressions we found earlier in terms of \(x_e\) and \(y_e\).

\begin{align} x_n &= \frac{2}{r-l} x_p - \frac{r+l}{r-l}\\ &= \frac{2}{r-l} \left(\frac{1}{-z_e} n x_e\right) - \frac{r+l}{r-l}\\ &= \frac{1}{-z_e} \left( \frac{2n}{r-l} x_e + \frac{r+l}{r-l} z_e \right)\\ \\ y_n &= \frac{2}{b-t} y_p - \frac{b+t}{b-t}\\ &= \frac{2}{b-t} \left(\frac{1}{-z_e} n y_e\right) - \frac{b+t}{b-t}\\ &= \frac{1}{-z_e} \left( \frac{2n}{b-t} y_e + \frac{b+t}{b-t} z_e \right) \end{align}

Remember that \(x_n = x_c / w_c\) and \(y_n = y_c / w_c\). By factoring by \(\frac{1}{-z_e}\), we can read the coefficients for \(x_c\) and \(y_c\).

We now have:

\begin{equation} \begin{pmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0\\ 0 & \frac{2n}{b-t} & \frac{b+t}{b-t} & 0\\ . & . & . & .\\ 0 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix}= \begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} \end{equation}

Deriving the depth projection

Unfortunately, we cannot use the same method to find the coefficients for z because z will always be on the near plane after projecting a point onto the near plane. We know that the \(z\) coordinate does not depend on \(x\) and \(y\), so let’s fill the remaining row with 0 for x and y, and A and B for the coefficients we need to find.

\begin{equation} \begin{pmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0\\ 0 & \frac{2n}{b-t} & \frac{b+t}{b-t} & 0\\ 0 & 0 & A & B\\ 0 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix}= \begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} \end{equation}

By definition \(z_n\) is :

\[ z_n = \frac{z_c}{w_c} = \frac{A \times z_e + B \times w_e}{-z_e} \]

\(w_e\) is always going to be 1.

\[ z_n = \frac{z_c}{w_c} = \frac{A \times z_e + B}{-z_e} \]

To find A and B, we will replace \(z_n\) and \(z_e\) with known values: the near plane maps to \(1\) in NDC and the far plane maps to \(0\).


Note: The setup shown here uses \(1\) for the near plane and \(0\) for the far plane. It is called “Reverse Depth” and results in a better distribution of the floating points values than using \(-1\) and \(1\) or \(0\) and \(1\), so I highly recommend to use it.

The default DirectX convention is to use \(0\) for the near plane and \(1\) for the far plane; legacy OpenGL uses \(-1\) for the near and \(1\) for the far (this is the worst for floating point precision).


\begin{align} & z_n = 1 \Rightarrow z_e = -n\\ & z_n = 0 \Rightarrow z_e = -f \end{align}

\begin{align} & \left\{ \begin{array}{lr} \frac{A \times \left(-n\right) + B}{-(-n)} = 1\\ \frac{A \times \left(-f\right) + B}{-(-f)} = 0 \end{array} \right.\\ \\ \Leftrightarrow \quad & \left\{ \begin{array}{lr} A \times (-n) + B = n\\ A \times (-f) + B = 0 \end{array} \right.\\ \\ \Leftrightarrow \quad & \left\{ \begin{array}{lr} A \times (-n) + Af = n\\ B = A f \end{array} \right.\\ \\ \Leftrightarrow \quad & \left\{ \begin{array}{lr} A = \frac{n}{f-n}\\ B = \frac{nf}{f-n} \end{array} \right. \end{align}

Here is our final expression for \(z_n\):

\[ z_n = \frac{1}{-z_e} \left({\frac{n}{f-n} \times z_e + \frac{nf}{f-n}}\right) \]

Our matrix is now complete!

\begin{equation} \begin{pmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0\\ 0 & \frac{2n}{b-t} & \frac{b+t}{b-t} & 0\\ 0 & 0 & \frac{n}{f-n} & \frac{nf}{f-n}\\ 0 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix}= \begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} \end{equation}

Simplifying the matrix for the usual case

We often work with symmetric frustums: \(l = -r\) and \(b = -t\).

The matrix becomes a bit simpler:

\begin{align} & l = -r \Rightarrow l + r = 0 \quad \text{and} \quad r - l = 2r = width\\ & b = -t \Rightarrow b + t = 0 \quad \text{and} \quad b - t = -2t = -height \end{align}

\begin{equation} \begin{pmatrix} \frac{2n}{width} & 0 & 0 & 0\\ 0 & -\frac{2n}{height} & 0 & 0\\ 0 & 0 & \frac{n}{f-n} & \frac{nf}{f-n}\\ 0 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix}= \begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} \end{equation}

It’s also easier to reason on the field of view and aspect ratios rather than on the width and the height of the near plane, so let’s replace them:

Figure 5: Diagram of the vertical field of view

Let’s try to express \(\frac{2n}{width}\) and \(-\frac{2n}{height}\) in terms of field of view and aspect ratio:

\begin{align} & \tan\left(\frac{fov_y}{2}\right) = \frac{\frac{height}{2}}{n} \\ \Leftrightarrow \quad & 2n \tan\left(\frac{fov_y}{2}\right) = height \\ \Leftrightarrow \quad & \frac{2n}{height} = \frac{1}{\tan\left(\frac{fov_y}{2}\right)} \end{align}

\begin{align} \frac{2n}{width} & = \frac{2n}{width} \times \frac{height}{height} \\ & = \frac{2n}{height} \times \frac{height}{width} \\ & = \frac{2n}{height} \left(\frac{width}{height}\right)^{-1} \\ & = \frac{1}{\tan\left(\frac{fov_y}{2}\right)} \left(\frac{width}{height}\right)^{-1} \end{align}

And finally, when replacing in the matrix:

\begin{align} & \text{focal length} = \frac{1}{\tan\left(\frac{fov_y}{2}\right)}\\ \\ & \text{aspect ratio} = \frac{width}{height}\\ \\ & \begin{pmatrix} \frac{\text{focal length}}{\text{aspect ratio}} & 0 & 0 & 0\\ 0 & -{\text{focal length}} & 0 & 0\\ 0 & 0 & \frac{n}{f-n} & \frac{nf}{f-n}\\ 0 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix}= \begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} \end{align}

Implementation

Here is a possible implementation in C++. Note that the matrix is trivial to invert using your favorite method (I like to use row-reduction for this). We can compute it at the same time as the perspective projection matrix.

float4x4 perspective(float vertical_fov, float aspect_ratio, float n, float f, float4x4 *inverse)
{
    float fov_rad = vertical_fov * 2.0f * M_PI / 360.0f;
    float focal_length = 1.0f / std::tan(fov_rad / 2.0f);

    float x  =  focal_length / aspect_ratio;
    float y  = -focal_length;
    float A  = n / (f - n);
    float B  = f * A;

    float4x4 projection({
        x,    0.0f,  0.0f, 0.0f,
        0.0f,    y,  0.0f, 0.0f,
        0.0f, 0.0f,     A,    B,
        0.0f, 0.0f, -1.0f, 0.0f,
    });

    if (inverse)
    {
        *inverse = float4x4({
            1/x,  0.0f, 0.0f,  0.0f,
            0.0f,  1/y, 0.0f,  0.0f,
            0.0f, 0.0f, 0.0f, -1.0f,
            0.0f, 0.0f,  1/B,   A/B,
        });
    }

    return projection;
}

A lot of people come here to find a function to copy-paste in their projects when things are not working for them. If it is not working for you, I am going to give you some advice.

This particular implementation uses all the formulas explained in previous sections. As mentioned in the different notes, the matrix will change depending on your setup. You could also be using row vectors and left multiplication for vectors and matrices. In this case, the correct matrix is the transpose of the one in this post. If you are indeed using row vectors, I advise you to take a deep breath and make sure you understand the differences between using row vectors and column vectors.

If you are using a different setup than I do (Vulkan, Reverse depth), you need to derive the formulas yourself and follow what I have done with your values for near plane, far plane, and clip space.

Infinite perspective

Having to set scene-specific values for both the near and far plane can be a pain the ass. If you want to display large open-world scenes, you will almost always use an absurdly high value for the far plane anyway.

With the increased precision of using a reverse depth buffer, it is possible to set an infinitely distant far plane.

\begin{align} & \lim_{f \to +\infty} A = \frac{n}{f-n} = 0 \\ & \lim_{f \to +\infty} B = n \times \frac{f}{f-n} = n \end{align}

We have to replace A with \(0\) and B with \(n\)!

External references

Reverse depth
https://developer.nvidia.com/content/depth-precision-visualized