Nested match expressions on enums
I recently encountered a situation in Rust where I had two enums, and I wanted to do something different for each combination of values. In my case, each enum had three variants (some of which carried additional data) for a total of nine possibilities.
I could think of two ways to write that function. One way is to match on the first enum, and then in each branch match on the second. The other way is to make a tuple out of the two enums and then match on that. I was curious if there would be a performance difference.
A simple example
Let’s start with a simple case, where each enum has two variants and neither of them carry any additional data:
pub enum Vertical {
    Up,
    Down,
}

pub enum Horizontal {
    Left,
    Right,
}
Next, we’ll write the simplest possible function that returns a different result for each possible input. There are two ways to do this. In the first version, match_nested, we match on the Vertical enum first, and then in each branch we match on the Horizontal enum. In the second version, match_tuple, we use a tuple to match on both enums at once.
pub fn match_nested(v: Vertical, h: Horizontal) -> usize {
    match v {
        Vertical::Up => match h {
            Horizontal::Left => 0,
            Horizontal::Right => 1,
        },
        Vertical::Down => match h {
            Horizontal::Left => 2,
            Horizontal::Right => 3,
        },
    }
}
pub fn match_tuple(v: Vertical, h: Horizontal) -> usize {
    match (v, h) {
        (Vertical::Up, Horizontal::Left) => 0,
        (Vertical::Up, Horizontal::Right) => 1,
        (Vertical::Down, Horizontal::Left) => 2,
        (Vertical::Down, Horizontal::Right) => 3,
    }
}
I wrote a simple benchmark for these two functions. Here are the results for calling each function 1000 times:
test simple::tests::match_nested ... bench: 1,599 ns/iter (+/- 701)
test simple::tests::match_tuple ... bench: 1,562 ns/iter (+/- 135)
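The harness itself isn’t shown above; the output format looks like nightly libtest’s #[bench]. Here is a rough, self-contained sketch of the same kind of measurement using only the standard library. The input set, iteration count, and use of std::hint::black_box are my assumptions, not details from the original benchmark:

```rust
use std::hint::black_box;
use std::time::Instant;

#[derive(Clone, Copy)]
pub enum Vertical { Up, Down }

#[derive(Clone, Copy)]
pub enum Horizontal { Left, Right }

pub fn match_tuple(v: Vertical, h: Horizontal) -> usize {
    match (v, h) {
        (Vertical::Up, Horizontal::Left) => 0,
        (Vertical::Up, Horizontal::Right) => 1,
        (Vertical::Down, Horizontal::Left) => 2,
        (Vertical::Down, Horizontal::Right) => 3,
    }
}

fn main() {
    let inputs = [
        (Vertical::Up, Horizontal::Left),
        (Vertical::Up, Horizontal::Right),
        (Vertical::Down, Horizontal::Left),
        (Vertical::Down, Horizontal::Right),
    ];
    let start = Instant::now();
    let mut sum = 0usize;
    for _ in 0..1000 {
        for &(v, h) in &inputs {
            // black_box keeps the optimizer from deleting the calls entirely.
            sum += black_box(match_tuple(black_box(v), black_box(h)));
        }
    }
    let elapsed = start.elapsed();
    // Each pass over the four inputs contributes 0 + 1 + 2 + 3 = 6.
    assert_eq!(sum, 6 * 1000);
    println!("4000 calls in {:?}", elapsed);
}
```

A proper harness (nightly #[bench] or a crate like criterion) adds warm-up runs and statistical analysis; a raw Instant loop like this only gives a ballpark figure.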
The results are essentially identical, and there turned out to be a very good reason for that: when compiled with optimizations, the two functions produce the exact same assembly. Now, I’m not that great at reading assembly, but I can tell from the following that there are no branches. Also, I may be wrong, but it seems like it’s using the enum discriminants themselves as part of the calculation.
example::match_nested:
    push rbp
    mov rbp, rsp
    xor ecx, ecx
    test sil, sil
    setne cl
    lea rax, [rcx + 2]
    test dil, dil
    cmove rax, rcx
    pop rbp
    ret

example::match_tuple:
    push rbp
    mov rbp, rsp
    xor ecx, ecx
    test sil, sil
    setne cl
    lea rax, [rcx + 2]
    test dil, dil
    cmove rax, rcx
    pop rbp
    ret
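One way to read that listing: with fieldless enums the discriminants here are Up = 0, Down = 1 and Left = 0, Right = 1, so the return value is just twice the first discriminant plus the second, which is what the setne/lea/cmove sequence computes. A hypothetical hand-written equivalent (match_arith is my name, not something the compiler emits):

```rust
#[derive(Clone, Copy)]
pub enum Vertical { Up, Down }

#[derive(Clone, Copy)]
pub enum Horizontal { Left, Right }

// Branch-free arithmetic on the discriminants: 2 * v + h.
pub fn match_arith(v: Vertical, h: Horizontal) -> usize {
    (v as usize) * 2 + (h as usize)
}

fn main() {
    assert_eq!(match_arith(Vertical::Up, Horizontal::Left), 0);
    assert_eq!(match_arith(Vertical::Up, Horizontal::Right), 1);
    assert_eq!(match_arith(Vertical::Down, Horizontal::Left), 2);
    assert_eq!(match_arith(Vertical::Down, Horizontal::Right), 3);
    println!("all four combinations check out");
}
```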
Obviously these functions are too simple to measure what we want. A match expression implies that there should be some branching, yet all of the branching has been optimized away. Let’s try something else.
A slightly more complicated example
Let’s try with some more complicated enums. In this example, we’ll use an overly contrived way of representing movements on a grid:
pub enum Vertical {
    Up(usize),
    Zero,
    Down(usize),
}

pub enum Horizontal {
    Left(usize),
    Zero,
    Right(usize),
}
Next, we’ll define two versions of a function to calculate the Manhattan distance of those movements. Similar to before, the first version, distance_nested, matches on the Vertical enum and then in each branch matches on the Horizontal enum. In the second version, distance_tuple, we tie the two variables together and write a single match expression that handles all the cases.
pub fn distance_nested(v: Vertical, h: Horizontal) -> usize {
    match v {
        Vertical::Up(dv) => match h {
            Horizontal::Left(dh) => dv + dh,
            Horizontal::Zero => dv,
            Horizontal::Right(dh) => dv + dh,
        },
        Vertical::Zero => match h {
            Horizontal::Left(dh) => dh,
            Horizontal::Zero => 0,
            Horizontal::Right(dh) => dh,
        },
        Vertical::Down(dv) => match h {
            Horizontal::Left(dh) => dv + dh,
            Horizontal::Zero => dv,
            Horizontal::Right(dh) => dv + dh,
        },
    }
}
pub fn distance_tuple(v: Vertical, h: Horizontal) -> usize {
    match (v, h) {
        (Vertical::Up(dv), Horizontal::Left(dh)) => dv + dh,
        (Vertical::Up(dv), Horizontal::Zero) => dv,
        (Vertical::Up(dv), Horizontal::Right(dh)) => dv + dh,
        (Vertical::Zero, Horizontal::Left(dh)) => dh,
        (Vertical::Zero, Horizontal::Zero) => 0,
        (Vertical::Zero, Horizontal::Right(dh)) => dh,
        (Vertical::Down(dv), Horizontal::Left(dh)) => dv + dh,
        (Vertical::Down(dv), Horizontal::Zero) => dv,
        (Vertical::Down(dv), Horizontal::Right(dh)) => dv + dh,
    }
}
I think it’s rather subjective whether one version is more pleasant to read than the other. I personally prefer the tuple version, especially if the logic inside each branch is less trivial.
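As a quick sanity check before benchmarking (mine, not from the original post), the two versions should agree on every input; here they are exercised on a few sample movements:

```rust
#[derive(Clone, Copy)]
pub enum Vertical { Up(usize), Zero, Down(usize) }

#[derive(Clone, Copy)]
pub enum Horizontal { Left(usize), Zero, Right(usize) }

pub fn distance_nested(v: Vertical, h: Horizontal) -> usize {
    match v {
        Vertical::Up(dv) => match h {
            Horizontal::Left(dh) => dv + dh,
            Horizontal::Zero => dv,
            Horizontal::Right(dh) => dv + dh,
        },
        Vertical::Zero => match h {
            Horizontal::Left(dh) => dh,
            Horizontal::Zero => 0,
            Horizontal::Right(dh) => dh,
        },
        Vertical::Down(dv) => match h {
            Horizontal::Left(dh) => dv + dh,
            Horizontal::Zero => dv,
            Horizontal::Right(dh) => dv + dh,
        },
    }
}

pub fn distance_tuple(v: Vertical, h: Horizontal) -> usize {
    match (v, h) {
        (Vertical::Up(dv), Horizontal::Left(dh)) => dv + dh,
        (Vertical::Up(dv), Horizontal::Zero) => dv,
        (Vertical::Up(dv), Horizontal::Right(dh)) => dv + dh,
        (Vertical::Zero, Horizontal::Left(dh)) => dh,
        (Vertical::Zero, Horizontal::Zero) => 0,
        (Vertical::Zero, Horizontal::Right(dh)) => dh,
        (Vertical::Down(dv), Horizontal::Left(dh)) => dv + dh,
        (Vertical::Down(dv), Horizontal::Zero) => dv,
        (Vertical::Down(dv), Horizontal::Right(dh)) => dv + dh,
    }
}

fn main() {
    // (vertical, horizontal, expected Manhattan distance)
    let cases = [
        (Vertical::Up(3), Horizontal::Left(4), 7),
        (Vertical::Zero, Horizontal::Right(5), 5),
        (Vertical::Down(2), Horizontal::Zero, 2),
        (Vertical::Zero, Horizontal::Zero, 0),
    ];
    for &(v, h, want) in &cases {
        assert_eq!(distance_nested(v, h), want);
        assert_eq!(distance_tuple(v, h), want);
    }
    println!("both versions agree on all sample cases");
}
```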
When I benchmarked these (again, by calling them 1000 times), I got some different results. I ran it a number of times, and sometimes the standard deviations were quite a bit larger, but the tuple version was always faster than the nested version.
test complex::tests::distance_nested ... bench: 2,732 ns/iter (+/- 124)
test complex::tests::distance_tuple ... bench: 2,347 ns/iter (+/- 196)
I’m hesitant to trust my benchmarks. There seemed to be quite a bit of variability between runs, which makes me wonder whether I was really measuring what I intended to. I know that benchmarking is a difficult art, and I haven’t done a lot of it.
Let’s look at the assembly though. At this point I’m pushing the limits of my knowledge of assembly, but I can infer a few things.
The first is that the two functions definitely produce different assembly.
The second thing I can tell is that the nested function has a lot more branching in it. There are a total of six jump instructions, compared to two in the tuple version. Furthermore, there are several execution paths through the nested version that take two jumps, whereas in the tuple version no execution path takes more than one jump.
This is interesting to me, because based on the source code I would say that there are two branch points in each function. However, when matching on a tuple it seems that the compiler was able to reduce the amount of branching.
On the other hand, if I trace through it by hand, it seems to me that all the paths through the code wind up executing roughly the same number of instructions. Just because there’s more code doesn’t mean that the function will take longer to execute. In fact, it’s possible for the opposite to be true.
example::distance_nested:
    push rbp
    mov rbp, rsp
    mov al, byte ptr [rdi]
    cmp al, 2
    je .LBB0_6
    cmp al, 1
    jne .LBB0_4
    mov al, byte ptr [rsi]
    cmp al, 1
    je .LBB0_3
    cmp al, 2
    mov rax, qword ptr [rsi + 8]
    pop rbp
    ret
.LBB0_6:
    mov rax, qword ptr [rdi + 8]
    mov cl, byte ptr [rsi]
    cmp cl, 2
    je .LBB0_10
    cmp cl, 1
    je .LBB0_8
.LBB0_10:
    add rax, qword ptr [rsi + 8]
    pop rbp
    ret
.LBB0_4:
    mov rax, qword ptr [rdi + 8]
    mov cl, byte ptr [rsi]
    cmp cl, 1
    je .LBB0_8
    cmp cl, 2
    add rax, qword ptr [rsi + 8]
    pop rbp
    ret
.LBB0_8:
    pop rbp
    ret
.LBB0_3:
    xor eax, eax
    pop rbp
    ret

example::distance_tuple:
    push rbp
    mov rbp, rsp
    mov rcx, qword ptr [rdi]
    mov rdx, qword ptr [rsi]
    mov rsi, qword ptr [rsi + 8]
    cmp rcx, 1
    je .LBB1_4
    mov rax, qword ptr [rdi + 8]
    cmp rcx, 2
    cmp rdx, 1
    je .LBB1_3
    cmp rdx, 2
    add rax, rsi
.LBB1_3:
    pop rbp
    ret
.LBB1_4:
    xor eax, eax
    cmp rdx, 1
    cmove rsi, rax
    mov rax, rsi
    pop rbp
    ret
Conclusion
We’ve reached the point where I don’t know enough about assembly to really make any meaningful observations.
What have I learned from this? I certainly can’t conclude that matching on a tuple is faster than nesting match expressions. I can’t even conclude that it’s faster in this particular case, because I didn’t benchmark a large domain of inputs.
I think I can conclude that, at least for simple cases like this, the two methods for matching on a pair of enums are not sufficiently different for me to let performance influence my decision. Going forward, I will prefer whichever one I think results in clearer code.