Advent of Code 2020, part 2

| 1257 words | 6 minutes

#cpp

During the last couple of weeks, I returned to my Advent of Code 2020 solutions with the intent to clean up and improve them. Most of the improvements were achieved by careful study of other solutions, mainly from Voltara and Tim Visée. Overall, the resulting changes are as follows:

All solutions now run (on Ryzen 5 2600X) in 650-700 ms, which is well below 1 second.

Day 01

In the initial solution, I used break and if-statements to break out of the multiple loops. The outer if-clauses were essentially "has the answer been found already?" checks and completely unnecessary. Replacing this flow with a separate method that simply returns reduced the number of if-clauses, and thus the execution time went from roughly 200 μs to closer to 100 μs.

The second improvement was based on understanding the data limits and applying this knowledge to the solution. When iterating over possible values for {A, B, C}, the innermost loop (for C) can be skipped completely when the sum A + B already overshoots the target value. This bound can be tightened further by adding the smallest value in the input data to the sum, pruning unnecessary iterations even more. The execution time drops to the 75-80 μs range. A sketch of both ideas is shown below.
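
As an illustration, here is a minimal sketch of both ideas combined, assuming the input is sorted in ascending order. The names and the early-return structure are mine; the actual solution may differ in details.

#include <cstdint>
#include <vector>

// Find A + B + C == 2020 and return A * B * C, or 0 if no triple exists.
// Assumes `data` is sorted ascending, so data.front() is the smallest value.
uint64_t find_triple(const std::vector<uint32_t>& data)
{
    const uint32_t target = 2020;
    const uint32_t smallest = data.front();

    for (size_t a = 0; a < data.size(); ++a)
    {
        for (size_t b = a + 1; b < data.size(); ++b)
        {
            // Prune: even the smallest possible C cannot fit anymore,
            // and data[b] only grows from here, so stop this loop
            if (data[a] + data[b] + smallest > target)
                break;

            for (size_t c = b + 1; c < data.size(); ++c)
                if (data[a] + data[b] + data[c] == target)
                    return uint64_t(data[a]) * data[b] * data[c];
        }
    }
    return 0;
}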

Day 03

I easily forget this, but sizeof(bool) is not 1 bit – it is 1 byte. It seems that at some point in history, it could've been even more. Thus, using std::array<bool, 32> to represent each line in the input data takes 8 times more memory than the solution requires. Interestingly, std::vector<bool> is a specialized container with storage size optimizations, while std::array<bool> is not. Alternatively, a simple uint32_t can be used to hold 32 bits of static data.
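
To make the size difference concrete, here is a small illustration (my own example, not part of the solution):

#include <array>
#include <cstdint>
#include <iostream>

int main()
{
    std::array<bool, 32> flags{};
    std::cout << sizeof(flags) << '\n';    // 32: one byte per bool
    std::cout << sizeof(uint32_t) << '\n'; // 4: one bit per flag
    // std::vector<bool> packs its *storage* similarly to the uint32_t,
    // but sizeof(vector) itself only reflects the control structure.
}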

In any case, the data representation change doesn’t affect the performance in any noticeable way, as the solution is not constrained by memory to begin with. What does affect the performance is parsing the 32-char input lines into these 32 bits. Below you can see my original solution (and in its original form - here), which processes each line one char at a time. Alternatively, SIMD extensions can be used to process more characters at the same time – see convert_avx2(), which processes the whole input line in a single operation.

While writing these notes down, I noticed a curious detail. The original method used line.size() in the for-loop, which is illustrated as the extra len argument in convert_basic() below. However, as the input lines are 32 characters in length, this information might as well be explicitly defined in the solution. By doing that, the compiler can vectorize this method automatically (see convert_basic_static() in Compiler Explorer; a sketch of it also follows the code below). Compared to convert_avx2(), the compiler uses xmmN registers extensively and generates more instructions to support that, whereas convert_avx2() uses a single ymm0 directly. Nevertheless, the performance is comparable – 20-21 μs for convert_basic_static() and 18 μs for convert_avx2() – and much better than in the original solution.

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Helper: set bit n in the output word
static inline void set_bit(uint32_t& value, size_t n)
{
    value |= 1u << n;
}

uint32_t convert_basic(const char* input, size_t len)
{
    uint32_t output = 0;
    for (size_t c = 0; c < len; ++c)
        if (input[c] == '#')
            set_bit(output, c);

    return output;
}

// Ref: https://github.com/Voltara/advent2020-fast/blob/main/src/day03.cpp
// Modified and annotated
uint32_t convert_avx2(const char* input)
{
    const auto* ptr = reinterpret_cast<const __m256i *>(input);

    // Load unaligned signed integer from memory address
    __m256i v = _mm256_loadu_si256(ptr);

    // Broadcast 8-bit integer (char) to all elements of dst
    __m256i mask = _mm256_set1_epi8('#');

    // Compare packed 8-bit integers for equality (not bits directly!)
    v = _mm256_cmpeq_epi8(v, mask);

    // Reduce 256-bit vector to a 32-bit mask by taking the most
    // significant bit of each 8-bit element
    return _mm256_movemask_epi8(v);
}
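
For reference, the static-length variant mentioned above could look roughly like this (a reconstruction based on the description, not the exact code from Compiler Explorer):

// With the length fixed at compile time, the compiler is free to
// vectorize this loop on its own
uint32_t convert_basic_static(const char* input)
{
    uint32_t output = 0;
    for (size_t c = 0; c < 32; ++c)
        if (input[c] == '#')
            set_bit(output, c);

    return output;
}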

Day 09

Part 2 asks us to find a sub-array in a vector of numbers, identified by specific criteria. For this, my initial solution used a simple for-loop within a for-loop, going through each possible pair of values (see the find_continuous_set() method). This function turned out to be the most costly operation in the solution, taking approx. 240 μs, or 80% of the whole duration.

Next, as suggested by Tim Visée, I implemented the dynamic sliding window approach. Based on the latest measurements, find_continuous_set() now takes only 0-1 μs! As the author points out, the number of loop iterations is greatly reduced, as the data vector is traversed only once. In addition, the data accesses are as linear as possible, which ensures that loaded cache lines are fully utilized by the CPU. Even when sliding the tail pointer forward, the data used for calculations is already in a nearby cache.
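
A minimal sketch of the idea (my own reconstruction; the real find_continuous_set() differs in details): grow the window at the head while the sum is too small, shrink it from the tail while the sum is too large.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns [begin, end) indices of a contiguous run of at least two
// values summing to `target`, or {0, 0} if none exists.
std::pair<size_t, size_t> find_continuous_set(
    const std::vector<uint64_t>& data, uint64_t target)
{
    uint64_t sum = 0;
    size_t tail = 0;
    for (size_t head = 0; head < data.size(); ++head)
    {
        sum += data[head];                 // grow window at the head
        while (sum > target && tail < head)
            sum -= data[tail++];           // shrink window from the tail
        if (sum == target && head > tail)
            return {tail, head + 1};
    }
    return {0, 0};
}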

Days 15 and 23

Both days 15 and 23 have solutions that are constrained by memory access. As opposed to what I learned in the initial phase, carefully specifying the data sizes has a noticeable effect on the runtime in these cases. Specifically, on both days the program processes numbers that fit into uint32_t. Previously, I was using size_t for these numbers, mostly as a "just in case" measure. Properly constraining the values to uint32_t reduced the execution time of Day 15 from 600 ms to 450 ms, and of Day 23 from 340 ms to 170 ms.
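
The effect is easy to see from the footprint alone. Day 15 part 2 runs for 30 million turns, so the history table needs one slot per possible number (the names here are illustrative):

#include <cstdint>
#include <vector>

// On a typical 64-bit platform, size_t is 8 bytes and uint32_t is 4:
std::vector<size_t>   history_wide(30'000'000);   // ~240 MB of history
std::vector<uint32_t> history_narrow(30'000'000); // ~120 MB – twice as many
                                                  // entries per cache line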

More interestingly, the memory access for Day 15 could be optimized even further (as found in Voltara's solution)! For that day, the solution stores a history of previously spoken numbers and the turns they were spoken on. There are two pieces of knowledge that this history provides: (1) has the number been spoken before, and if so, (2) during which turn. Extracting (1) into its own bitset makes the memory access faster (due to the smaller size) and in many cases allows skipping the access to (2) altogether. Implementing this in my solution reduced the runtime from 450 ms to 325 ms.
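
A sketch of the idea, in the spirit of Voltara's approach (the names and exact structure are my own): the dense bitset answers "spoken before?" cheaply, and the large last_turn table is only read when the answer is yes.

#include <cstdint>
#include <vector>

uint32_t play(const std::vector<uint32_t>& starting, uint32_t turns)
{
    std::vector<uint64_t> seen((turns + 63) / 64, 0); // 1 bit per number
    std::vector<uint32_t> last_turn(turns, 0);        // turn last spoken on

    // Seed the history with all but the last starting number
    for (uint32_t t = 0; t + 1 < starting.size(); ++t)
    {
        const uint32_t n = starting[t];
        seen[n / 64] |= 1ull << (n % 64);
        last_turn[n] = t + 1;
    }

    uint32_t current = starting.back();
    for (uint32_t t = starting.size(); t < turns; ++t)
    {
        uint64_t& word = seen[current / 64];
        const uint64_t bit = 1ull << (current % 64);
        uint32_t next = 0;
        if (word & bit)
            next = t - last_turn[current]; // cold load only when needed
        else
            word |= bit;                   // first time spoken
        last_turn[current] = t;            // the store happens either way
        current = next;
    }
    return current;
}

For the puzzle's example input {0, 3, 6}, play(..., 2020) should return 436, matching the problem statement.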

Learnings

Overall, I've learned that it is extremely beneficial to understand the data you are working with. As Abrash writes, "know your data" and "[a major] speedup [was achieved] by storing and processing the data in a manner appropriate for the typical nature of the data itself" [ref]. Similarly, in Day 1, the typical nature of the data allowed some calculations to be skipped completely. And in Day 15, processing benefited from dividing the data into two separate arrays for faster access to the important parts.

Next, the compiler is your friend and can optimize the solutions to a great extent, if allowed to do so. The limitation here is that it can only do so based on the information available at compile time. Explicitly defining the constraints in the code (such as the 32-char processing window in Day 3) produces better assembly for the specific purpose, as opposed to letting the compiler generate code that handles all possible values of std::vector<T>::size().

Finally, not all problems/tasks are the same. In the first post, I found that carefully specifying the data size had no effect on the performance in this project, based on the experience of specific days. Meanwhile, other days came along (Days 15 and 23) that actually benefited from such optimizations. In the future, it would be important to pay attention to what exactly one should optimize (e.g., memory accesses as in Day 09, memory throughput as in Days 15/23, or the instructions-per-memory-access ratio as in Day 03).