Performance Optimization Techniques in Rust: A Case Study
Introduction
In the realm of high-performance computing and systems programming, the ability to optimize code for maximum efficiency is a crucial skill. This case study demonstrates a series of sophisticated optimization techniques applied to a complex numerical problem implemented in Rust. We'll explore how careful analysis, algorithmic improvements, and Rust's powerful features can lead to significant performance gains.
Problem Statement
We are presented with a sequence of integers and the task of flagging certain elements that risk making the sequence crumble in on itself. The rules are as follows:
- The first 100 numbers are inherently stable.
- Each subsequent number is considered stable if it is the sum of two distinct numbers from the preceding 100 numbers.
For this example we're going to assume the rule is about 5 numbers, instead of 100. Given the following sequence:
40, 62, 55, 65, 95, 102, 117, 150, 182, 127
the last number, "127", will be flagged because it is the only one that cannot be produced by summing two of the previous 5 numbers.
Our task is to efficiently process a large sequence, flagging any numbers that could cause it to "crumble".
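Before diving into the optimized versions, the rule itself can be sanity-checked on the worked example above. This is a minimal sketch, not code from the post; `is_stable` is a hypothetical helper name:

```rust
// Check whether `target` is the sum of two distinct elements of `window`.
// `is_stable` is a hypothetical helper used only to illustrate the rule.
fn is_stable(window: &[i64], target: i64) -> bool {
    for i in 0..window.len() {
        for j in (i + 1)..window.len() {
            if window[i] + window[j] == target {
                return true;
            }
        }
    }
    false
}

fn main() {
    let seq = [40, 62, 55, 65, 95, 102, 117, 150, 182, 127];
    let window_size = 5; // the post's reduced example size
    for (window, &target) in seq.windows(window_size).zip(&seq[window_size..]) {
        // 127 is reported as the only unstable number; the rest are stable
        println!("{} stable: {}", target, is_stable(window, target));
    }
}
```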
Initial Implementation
We begin with a straightforward implementation to establish a performance baseline:
use std::iter::zip;
use itertools::Itertools; // provides combinations()

fn crumble0(list: &[i64]) -> i32 {
    let mut n_crumbled = 0;
    let window_size = 100;
    // list of numbers we need to check, skip the first 100
    let target_nums = &list[window_size..];
    // iterator to represent a sliding window of size 100
    let windows = list.windows(window_size);
    for (&num, window) in zip(target_nums, windows) {
        let mut found = false;
        for pair in window.iter().combinations(2) {
            if pair[0] + pair[1] == num {
                found = true;
                break;
            }
        }
        if !found {
            n_crumbled += 1;
            println!("{} is unsafe", num);
        }
    }
    n_crumbled
}
After the initial setup, this naive solution proceeds as follows:
- Generate a sliding window `windows` over the entire sequence
- Pair up each target number with the list of previous 100 numbers (the sliding window)
- For each pair of numbers in the window, check if their sum equals the target number
- If none of the pairs sums to the target, mark that number as unsafe
Throughout this post we'll be measuring the performance of each optimization using hyperfine, a command-line benchmarking tool. This allows us to get precise execution-time measurements with statistical variance. Here's an example of hyperfine output:
> hyperfine ./target/release/crumble0
Benchmark 1: ./target/release/crumble0
  Time (mean ± σ):     156.3 ms ±   3.2 ms    [User: 156.1 ms, System: 0.1 ms]
  Range (min … max):   149.8 ms … 163.4 ms    20 runs
Optimization Strategy 1: Search Space Reduction
Our first implementation is pretty inefficient if you think about it. For each target number, in the worst case, we're trying every possible pair in a window of 100 numbers. If we work through the math, that's (100 choose 2) = 4,950 pairs checked per target, with thousands of target numbers to process.
Here's an idea: for a specific window, some numbers are far too big to ever be part of a valid sum, so let's cut them out first and save some CPU time when generating pairs. Obviously, any number larger than our target can be filtered out.
We can filter even more elements if we take into consideration the smallest number in the window. For example, if our target is 80 and the smallest number in our window is 20, we can throw away everything bigger than 60 (as there is no chance anything bigger than 60 can contribute to a valid sum). This way, we are minimizing our window before running `combinations(2)`.
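The bound from the paragraph above can be demonstrated in isolation. This is a tiny illustrative sketch; `filter_window` is a hypothetical helper, not part of the post's code:

```rust
// With target 80 and smallest window element 20, anything above
// 80 - 20 = 60 can never be one half of a valid pair.
fn filter_window(window: &[i64], target: i64) -> Vec<i64> {
    let smallest = *window.iter().min().unwrap();
    window
        .iter()
        .copied()
        .filter(|&x| x <= target - smallest)
        .collect()
}

fn main() {
    let kept = filter_window(&[20, 35, 70, 45, 61], 80);
    println!("{:?}", kept); // [20, 35, 45] — 70 and 61 exceed the bound of 60
}
```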
Let's see how we can implement this:
fn crumble1(list: &[i64]) -> i32 {
    // ... (setup code omitted for brevity)
    for (&num, window) in zip(target_nums, windows) {
        // ...
        let smallest = *window.iter().min().unwrap();
        let window_filtered = window.iter().filter(|&&x| x <= (num - smallest));
        for pair in window_filtered.combinations(2) {
            // ...
        }
Using hyperfine again shows this version executing in about 83.7 ms.
This optimization reduced execution time by ~47% compared to our initial implementation, demonstrating the impact of intelligently reducing the search space before performing hot loops.
Optimization Strategy 2: Exploiting Data Patterns
The next improvement comes from analyzing the data patterns in our sequence. During development I had some print logging statements in place, and I noticed a pattern: frequently, the largest and smallest numbers in the filtered window summed to the target number.
So, we could replace most cases of the expensive combinations iteration with a simple early check of min + max for each window.
By checking this special case first, we can avoid the expensive iterations in many cases. Here's the gist of what would be added:
fn crumble2(list: &[i64]) -> i32 {
    // ... (setup code omitted for brevity)
    for (&num, window) in zip(target_nums, windows) {
        // ...
        let smallest = *window.iter().min().unwrap();
        let window_filtered = window.iter().filter(|&&x| x <= (num - smallest));
        let largest = window_filtered.clone().max().copied().unwrap_or_default();
        if (smallest + largest == num) && (smallest != largest) {
            continue;
        }
        // ... (rest of the function)
    }
    n_crumbled
}
This indeed turned out to be a huge improvement: benchmarking shows that it executes in about 7 ms.
While this is a heuristic based on the specific dataset and not a general algorithm, it demonstrates the importance of understanding the characteristics of our data for effective optimization.
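To make the heuristic concrete, here is a self-contained variant of the crumble2 logic with a parameterized window size, so it can be exercised on the post's 5-number example. It avoids the itertools dependency and fills in the elided slow path with a plain pair scan; treat it as a sketch of the same idea, not the post's exact code:

```rust
// Count unstable numbers: fast path checks min + max of the filtered
// window; slow path falls back to trying every pair.
fn crumble2_sketch(list: &[i64], window_size: usize) -> i32 {
    let mut n_crumbled = 0;
    for (window, &num) in list.windows(window_size).zip(&list[window_size..]) {
        let smallest = *window.iter().min().unwrap();
        // Drop numbers too large to participate in a valid sum.
        let filtered: Vec<i64> = window
            .iter()
            .copied()
            .filter(|&x| x <= num - smallest)
            .collect();
        let largest = filtered.iter().copied().max().unwrap_or_default();
        // Fast path: min + max frequently hits the target in this dataset.
        if smallest + largest == num && smallest != largest {
            continue;
        }
        // Slow path: try every pair in the filtered window.
        let found = filtered
            .iter()
            .enumerate()
            .any(|(i, &a)| filtered[i + 1..].iter().any(|&b| a + b == num));
        if !found {
            n_crumbled += 1;
        }
    }
    n_crumbled
}

fn main() {
    let seq = [40, 62, 55, 65, 95, 102, 117, 150, 182, 127];
    println!("{}", crumble2_sketch(&seq, 5)); // 1 — only 127 is unstable
}
```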
Optimization Strategy 3: Parallelization
For our final optimization, we leverage Rust's excellent concurrency support using the Rayon library:
fn crumble3(list: &[i64]) {
    // ... (setup code omitted for brevity)
    let windows = list.par_windows(window_size);
    windows.zip(target_nums).for_each(|(window, &num)| {
        // ... (filter logic similar to crumble2)
    });
}
This implementation utilizes Rayon's parallel iterators to distribute the workload across multiple CPU cores. The `par_windows` method creates a parallel iterator over the sliding windows, and `for_each` applies our stability check to each window concurrently.
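Rayon is the idiomatic choice here, but the same idea can be illustrated with only the standard library: split the targets into chunks and scan each chunk on its own scoped thread. This hand-rolled sketch is an assumption-laden illustration of what the parallel split looks like, not the post's Rayon code, and it uses the naive pair scan rather than the crumble2 heuristic:

```rust
use std::thread;

// Count unstable numbers by dividing the targets into `n_threads` chunks
// and checking each chunk on a separate scoped thread.
fn crumble_parallel(list: &[i64], window_size: usize, n_threads: usize) -> usize {
    let targets = &list[window_size..];
    let chunk = ((targets.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = (0..targets.len())
            .step_by(chunk)
            .map(|start| {
                s.spawn(move || {
                    (start..(start + chunk).min(targets.len()))
                        .filter(|&i| {
                            let window = &list[i..i + window_size];
                            let num = targets[i];
                            // Unstable if no pair in the window sums to num.
                            !window.iter().enumerate().any(|(j, &a)| {
                                window[j + 1..].iter().any(|&b| a + b == num)
                            })
                        })
                        .count()
                })
            })
            .collect();
        // Sum the per-thread counts of unstable numbers.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

`thread::scope` (stable since Rust 1.63) lets the threads borrow `list` directly, avoiding any cloning; Rayon's work-stealing scheduler does the equivalent chunking for us with far less ceremony.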
Performance Analysis
Benchmarks were conducted on a 4-core ThinkPad T460s. The summary below demonstrates the impact of our optimizations:
- Naive approach (crumble0): 156.3 ms ± 3.2 ms
- Search space reduction (crumble1): 83.7 ms ± 3.0 ms
- Data pattern exploitation (crumble2): 7.0 ms ± 0.9 ms
- Parallelization (crumble3): 5.4 ms ± 1.4 ms
These results show a remarkable 29x speedup from our initial implementation to the final optimized version.
Conclusion
This case study demonstrates a systematic approach to performance optimizations:
- Algorithmic Improvements: By reducing the search space and exploiting data patterns, we achieved significant speedups without sacrificing correctness.
- Data-Driven Optimization: Understanding the characteristics of our dataset led to a heuristic that dramatically improved performance.
- Concurrency: Leveraging Rust's powerful concurrency features allowed us to fully utilize modern multi-core processors.
- Iterative Refinement: Each optimization built upon the previous one, showcasing the importance of incremental improvements.
- Empirical Validation: Rigorous benchmarking at each stage ensured that our optimizations were effective.
This project illustrates the power of combining deep language knowledge, algorithmic thinking, and data analysis to achieve exceptional performance in systems programming tasks. It showcases the ability to write high-performance Rust code that can scale efficiently on modern hardware architectures.