Advent of Code 2020

Advent of Code has caught my attention a couple years before, but I did not pick it up at that point. The 2020 event was the first one I attended and, after completing it, I’m happy with the time spent on it. I did not approach it as a competition: my initial commit dates 8 Dec 2020 and the 50* commit – 6 Mar 2021. Some of these days I didn’t touch the project at all. In the end, I was able to write solutions in C++ to all puzzles, adding up to 2 seconds of execution time on my hardware.

Day 01: OK [  230 μs]       Day 11: OK [ 28567 μs]       Day 21: OK [    260 μs]
Day 02: OK [  563 μs]       Day 12: OK [    25 μs]       Day 22: OK [1045576 μs]
Day 03: OK [   53 μs]       Day 13: OK [     1 μs]       Day 23: OK [ 278303 μs]
Day 04: OK [  425 μs]       Day 14: OK [  5355 μs]       Day 24: OK [   4587 μs]
Day 05: OK [   13 μs]       Day 15: OK [589841 μs]       Day 25: OK [  81279 μs]
Day 06: OK [  106 μs]       Day 16: OK [   260 μs]
Day 07: OK [ 2510 μs]       Day 17: OK [ 15035 μs]
Day 08: OK [  460 μs]       Day 18: OK [   268 μs]
Day 09: OK [  351 μs]       Day 19: OK [    66 μs]
Day 10: OK [   11 μs]       Day 20: OK [  2517 μs]

Total: 2056662 μs

Initially, I expected to find the solutions to the puzzles relatively easily. Projects such as Voltara/advent2018-fast set my mind on optimizing the solutions more than solving them. Quickly, I found out that the learning curve is not as easy as expected. For the solutions to run less than 15 sec, one is expected to use many data structures and algorithms.

Overall, I’ve enjoyed the process of creating the solutions to the puzzles. There were puzzles I was stuck on and couldn’t do without reading and studying the solutions of others (e.g., Day 13, Day 19). Comparing the execution times against Voltara/advent2020-fast, compiled and executed on the same hardware, yields following results:

  • 17 of my solutions are 1-5 times slower,
  • 5 solutions are 20-80 times slower,
  • and 3 solutions are slower by 230, 9500, 14000 times.

I am happy with my results, but there’s definitely possibilities to improve the solutions even more. Below, I would like highlight couple of days that surprised me the most.

Day 5

One of the exciting parts of completing a puzzle is to compare own solutions against others. There are opportunities to learn from other people in how approached the problem, ranging from the general method to how they write the solution down. The solutions are often composed from inter-changeable parts, allowing experimentation.

One of such experiments is presented below. Part of the solution was to parse seat numbers like FBFBBFFRLR. I, as many others, chose to convert this to an integer – like 0101100101. Below are two common variations of this function: convert_line1 combines the loops and if-clauses into one, while convert_line2 keeps them separate. The surprising thing here is the effect on the runtime. With convert_line1 the whole solution executes in 77-78 μs, while with convert_line2 – in only 14-16 μs (-80%).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
uint16_t convert_line1(const std::string_view& line) // e.g. "BFFFBBFRRR"
{
    uint16_t result = 0;

    for (uint8_t i = 0; i < 10; ++i)
        if (line[i] == 'B' || line[i] == 'R')
            set_bit(result, (9 - i));

    return result;
}

uint16_t convert_line2(const std::string_view& line)
{
    uint16_t result = 0;

    for (uint8_t i = 0; i < 7; ++i)
        if (line[i] == 'B')
            set_bit(result, (9 - i));

    for (uint8_t i = 0; i < 3; ++i)
        if (line[7 + i] == 'R')
            set_bit(result, (2 - i));

    return result;
}

In a sense, the difference is understandable, because in convert_line1 some conditions are evaluated which are definitely false: line[i] == 'R' for the first 7 chars and line[i] == 'B' for the last 3 chars. Assuming uniform distribution of input values, there will be 3.5 + 3 unnecessary comparisons done for every input. Thus, switching to convert_line2 reduces the number of comparisons by 40% (from 16.5 to 10). But, this reduction doesn’t in itself explain the runtime difference.

Statistics recorded with perf stat -r 1000 show surprisingly consistent number of branches. The difference in the number of branches between convert_line1 and convert_line2 is ~9000. Dividing it by 6.5 gives 1384 lines, yet there are only 908 lines in my input. Instead of guessing, we can also calculate the number – there are 3213 F’s, so number of skippable if-clauses is 5937 (3213 + 908*3), which is not that close to 9000 either. Not sure this all makes sense, but reducing if-clauses clearly has greater effect than single operation cost.

Day 23

The contrast between the solutions for both parts for Day 23 was probably the most memorable for me. The puzzle gives a list of 10 numbers and asks to re-arrange them based on specific rules. Well, I saw small input and I saw parts moving around, so the solution I came up with for the part 1 is to copy things around. With this approach, each part must be copied in correct order: first the current cup, then all cups up to the destination cup, then the destination cup itself, then 3 pickup cups, and finally the cups after the destination cup. As you might imagine, doing this with 10 numbers for 100 loops is very fast. One might even think that this would an optimal solution!

The part 2 grows the input size to 1,000,000 and asks to loop around 10,000,000 times. The approach of copying things around between two buffers blows up to infinite runtime, assumingly, as I didn’t care to wait. The important part to understand here was that instead of considering the cups, one must consider the links between them! I stepped through the moves in a spreadsheet and found that only three links are relevant in each loop. Considering only them reduces the number of copies from 1,000,000 to 3! The effect of this is that the runtime of the solution stays around comfortable 300 ms.

Spreadsheet evaluation of initial moves

Focusing on the value swapping, I used a buffer to save the values and write them back in required order (see swap1 below). Later, I found out about std::exchange and std::swap, so I experimented with variations swap2 and swap3. In the end, the effects between them are seem to be quite small and the output assembly is similar (see https://godbolt.org/z/sxo97h). Using std::swap results in two extra movs, as it does one extra read and write that we know is unnecessary. swap1 and swap3 produce similar output with exception of register usage: the std::exchange version re-uses eax register and must read/write it twice, while swap1 uses three separate registers.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <array>

// swap (a, b, c) into (c, a, b)
void swap1(std::array<int, 10>& data, int a, int b, int c)
{
    std::array<int, 3> prev = {data[a], data[b], data[c]};
    data[a] = prev[2];
    data[b] = prev[0];
    data[c] = prev[1];
}

void swap2(std::array<int, 10>& data, int a, int b, int c)
{
    std::swap(data[a], data[b]);
    std::swap(data[a], data[c]);
}

void swap3(std::array<int, 10>& data, int a, int b, int c)
{
    auto x = std::exchange(data[a], data[c]);
    auto y = std::exchange(data[b], x);
    data[c] = y;
}

Optimizations

The background I have for the optimization is the course I attended in the university – Programming Parallel Computers. The course is about creating a simple program and optimizing it in different stages to see the runtime drop from 99 seconds to 0.7 seconds. As the course page describes, “achieving good performance on a modern CPU is much more than simply using multiple threads”.

To collect benchmark statistics, I’ve used perf, but there are also other tools. Using perf record to see the parts that take most time is a great first step. From the numbers I’ve seen usually, the trace frequency has been ~60 kHz for me, yielding resolution of the events ~16 μs. perf stat outputs a set of statistics, some of which are above my level of understanding, but great answers on SO help with that.

Overall, I’ve learned multiple things when attempting to optimize the solutions.

  • Using multiple threads (e.g. with openmp) is often unnecessary, as the solutions are often in the microsecond range. If the solution for any given day takes too long to execute, there are other things to look for and threads are rarely the answer. From some simple measurements, it seems that multi-threading is more relevant in the millisecond range.

  • Profiling with resolution of 16 μs may yield too crude results. For some solutions such resolution yields less than 5 events per execution. And some of these solutions can be even faster, by a factor of 2-5, at least comparing to advent2020-fast. As such, perf seems to be relevant only up to a certain point.

  • Limiting integers to specific sizes was less of a thing than I expected. That is, using int versus int8_t/int16_t/etc. During Day 4, the data struct in my solution could be packed from 56 bytes to 40 bytes (-28%), which sounds great on paper. Yet, the amount of data in these puzzles is so small, it lowers the memory usage by only 4640 bytes, without much effect on runtime. So, the memory size is not an issue within the scope of these puzzles.

  • In general, the compiler does so many things that without measurements it is not trivial to say how/if a change is going to affect the performance. For example, in solution to Day 11 Part 1, I’ve extracted the inner loop calculating occupied seats into its own method. For some reason, simply moving the code into different place alone reduced the runtime from 7700 μs to 5900 μs (-23%).

Reflections

Finally, I would like to reflect on this experience from point of view of learning. I feel it is important that the puzzles do not limit the user in any way, allowing to use many different data structures and algorithms. This makes the bruteforce solutions (often) as valid and applicable as math-y solutions. In addition, Eric Wastl mentioned during FOSDEM talk (around 20:25) that the two-part structure of the puzzles intentionally attempts to simulate real world environments with code maintenance and end-user wanting new things out of the same code.

Consequently, one of the major areas I’ve personally felt noticeable improvement was in using C++ language itself. Specifically:

  • During initial days I was slightly lost with parsing the data. I’ve used a regex library for that (e.g. Day 2). Later I created and started using a set of common parsing helpers and state machines for more complex input formats (such as Days 4, 7, 14, 16).

  • Also, it was easy to experiment with some C++ features that were new for me, such as <string_view>, <bitset>, constexpr and static_asserts. For a while I’ve read from different sources that std::string_view should be preferred over std::string in certain situations. But, I didn’t had the chance to actually try it on something non-trivial. AOC puzzles here take the sweet spot between reference examples, which tell “how” but skip on “why” and “where”, and pre-existing codebases, where it may not be wise to add a new feature caveats of which are not well understood.

  • Lastly, as the programming is often a community effort, it is a great exercise to read and understand the solutions of others, e.g. from the reddit community. When doing that, I’ve found out that sometimes I write very verbose solutions. It was very surprising to find how well other people were able to express similar concepts in their solutions with much more clarity and less cruft.