[Software] Optimization is one of the most intellectually rewarding parts of software development, but it can also be one of the most convoluted and insidious. Many software engineers dive deep into optimization without fully understanding where to focus their efforts or how the work fits into a higher-level picture. As a result, we often waste time on marginal gains while missing much bigger opportunities for improvement.
Many of us (myself included) have fallen into the trap of diving into optimization without proper preparation. It’s terribly tempting, isn’t it? But let me share with you some mental models that have transformed my approach to optimization, with the exception of one approach, which I describe in a separate post on software tuning through property-based thinking.
Measure First, Optimize Second
It’s tempting to jump into optimization as soon as we notice that something isn’t running as quickly as we’d like. Yet optimization without measurement is a classic pitfall. Without measurement data, we’re just guessing which parts of our program are the most expensive.
The key insight is that measurement is the foundation of all good optimization. It’s rather like being a detective - we must gather evidence before making accusations about which parts of our code are the culprits of poor performance.
Two common categories of measurement tools available to us are:
- Profiling tools can be our best friends here, or backstabbing foes masquerading as friends. They’re like X-ray machines for our code, potentially showing us where the time is spent, but we must know how to read the X-rays in the context of the patient:
  - Sampling profilers give us a wonderfully quick overview by periodically peeking at our program’s execution (like taking snapshots of a running race, except the camera equipment itself can get in the way of what we’re measuring)
  - Causal profilers are better at showing us the real performance impact of potential optimizations by actually slowing down parts of our code to measure their effect on our program’s total execution time.
- Benchmarking tools like JMH on the micro side or Google’s PageSpeed at the system level can be brilliant accomplices on our optimization journey, or dangerous sirens leading us astray. Benchmark toolkits are our laboratory equipment, helping us measure the speed of our code under controlled conditions, but we must be careful about what we’re measuring. Microbenchmarks give us precise measurements of isolated code segments (rather like timing an athlete running a specific drill after warmups, except the practice conditions might not reflect the actual game), while macro-benchmarks show us more real-world performance by exercising our entire system under less contrived workloads (though setting up truly representative workloads is a fascinating challenge in itself). A minimal microbenchmark sketch follows below.
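To make the “laboratory equipment” idea concrete, here is a minimal microbenchmark sketch, assuming JMH is available on our classpath (the class, data set and iteration counts are purely illustrative): it compares a linear scan against a hash lookup, and JMH handles the warmup and measurement iterations so we time the JIT-compiled steady state rather than a cold start.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class LookupBenchmark {

    private int[] array;
    private Set<Integer> set;

    @Setup
    public void setup() {
        array = new int[10_000];
        set = new HashSet<>();
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
            set.add(i);
        }
    }

    // Worst case for the scan: the target sits at the end of the array.
    @Benchmark
    public boolean linearScan() {
        int target = 9_999;
        for (int value : array) {
            if (value == target) return true;
        }
        return false;
    }

    @Benchmark
    public boolean hashLookup() {
        return set.contains(9_999);
    }
    // Returning the result keeps the JIT from eliminating the work as dead code.
}
```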
A counter-concern is ensuring we do not sacrifice the compositionality or modularity of our software in ways that make it much harder to maintain (unless it is throwaway code and we have made that decision consciously). We still want to keep our units of work measurable going forward and not create larger and more brittle monolithic computations.
Measuring first, and knowing what we are measuring and why it matters to us, ensures that every optimization we make has a tangible, positive impact.
Takeaway: Always profile or create a baseline benchmark before optimizing anything. Focus our efforts on the biggest bottlenecks, not where we guess they might be. Also be aware that some benchmarks measure contrived scenarios and sample-based profilers can be misleading, so think deeply about what we are measuring and how it aligns with the real-world environment of our running software.
Review Our Data Structures & Algorithms
The choice of data structures and algorithms often has a far greater impact on performance than any low-level optimization we might attempt.
Think about searching. If we’re using a linear search on a sorted array, we might spend ages trying to optimize the comparison operation. But by simply switching to a binary search, we’d achieve a dramatic improvement that no amount of low-level optimization could match! The difference between O(n) and O(log n) is profound for non-trivial input sizes, isn’t it?
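As a rough sketch (the helper names are mine), the change is a single line of code but a different complexity class; it assumes the array is already sorted, which is exactly the kind of precondition worth documenting:

```java
import java.util.Arrays;

public class SearchExample {

    // O(n): inspects every element in the worst case.
    static int linearSearch(int[] sorted, int target) {
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == target) return i;
        }
        return -1;
    }

    // O(log n): halves the search space each step; requires sorted input.
    static int binarySearch(int[] sorted, int target) {
        return Arrays.binarySearch(sorted, target); // negative when absent
    }
}
```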
Here are some delightful examples:
- Arrays vs Linked Lists: Arrays give us wonderful cache locality and random access, but insertions and deletions can be costly. Linked lists are precisely the opposite. The choice between them isn’t about which is “better” - it’s about understanding our specific use case.
- Hash Tables: These are absolute gems for lookup operations, giving us O(1) average case performance. But they come with memory overhead and can have poor cache locality. When dealing with small datasets, a simple sorted array might actually perform better.
- Probabilistic Data Structures: Structures like Bloom filters and Count-Min sketches trade perfect accuracy for extraordinary space efficiency. A Bloom filter can tell us if an element is definitely not in a set, or probably in it, using just a tiny fraction of the memory needed for precise storage. And HyperLogLog counters can estimate cardinality with remarkable accuracy using logarithmic space - perfect for tracking unique visitors or distinct elements in massive streams of data! Sometimes accuracy isn’t the biggest concern.
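To give a flavour of how small these structures can be, here is a toy, hand-rolled Bloom filter sketch (a library implementation would be the sensible choice in practice): it answers either “definitely not present” or “probably present” using a single bit set.

```java
import java.util.BitSet;

// Toy Bloom filter: k bit positions derived from two cheap hashes.
// False positives are possible; false negatives are not.
public class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public TinyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    public void add(String item) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9; // second, cheap hash
        for (int i = 0; i < hashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, size));
        }
    }

    public boolean mightContain(String item) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, size))) {
                return false; // definitely not in the set
            }
        }
        return true; // probably in the set
    }
}
```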
What’s particularly exciting is how these choices compound. Consider a program that frequently needs to look up items by key and also maintain them in sorted order. We might choose a balanced binary search tree (like a Red-Black tree) instead of maintaining both a hash table and a sorted array. One data structure instead of two - isn’t that elegant?
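In Java, for example, a TreeMap is backed by a Red-Black tree, so a single structure gives us both keyed lookup and sorted iteration. A small sketch of the idea (illustrative data, not a benchmark):

```java
import java.util.Map;
import java.util.TreeMap;

public class SortedLookupExample {
    public static void main(String[] args) {
        // TreeMap keeps keys sorted and offers O(log n) lookup, insert and delete,
        // so we don't need a hash table *and* a separately maintained sorted array.
        TreeMap<String, Integer> scores = new TreeMap<>();
        scores.put("carol", 72);
        scores.put("alice", 91);
        scores.put("bob", 85);

        Integer bobsScore = scores.get("bob");                   // lookup by key
        Map.Entry<String, Integer> first = scores.firstEntry();  // smallest key: "alice"
        System.out.println(bobsScore + " " + first.getKey());

        // Iteration is already in sorted key order: alice, bob, carol.
        scores.forEach((name, score) -> System.out.println(name + " -> " + score));
    }
}
```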
Takeaway: The best optimization often comes not from clever micro-optimizations, but from choosing the right algorithms and data structures for our specific needs. It’s rather like choosing the right tool for a job - a hammer might be a fantastic tool, but it’s not much use when you need to turn a screw!
Reduce Work By Precomputing and Caching (but not too much)
A central idea in optimization is reducing the amount of work the system needs to do. This can be achieved through techniques like precomputation and caching.
- Precomputation involves performing calculations ahead of time so that they don’t need to be repeated during runtime. The classic example is computing constants, or partial results, at compile time.
- Caching stores the results of expensive computations so that when the same inputs appear again, the system can reuse the result rather than recalculating it.
Both techniques minimize redundant work.
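Here is a minimal sketch of both ideas side by side (the names and the fixed table size are my own choices): a table precomputed once up front, and a memoizing cache that remembers results for inputs it has already seen. A real cache would also need a size bound and an eviction policy.

```java
import java.util.HashMap;
import java.util.Map;

public class ReuseWork {

    // Precomputation: build the answers for a small, fixed domain once, up front.
    private static final double[] SQRT_TABLE = new double[1024];
    static {
        for (int i = 0; i < SQRT_TABLE.length; i++) {
            SQRT_TABLE[i] = Math.sqrt(i);
        }
    }

    static double sqrtOfSmallInt(int n) {
        return SQRT_TABLE[n]; // no recomputation at runtime
    }

    // Caching (memoization): remember results for inputs we have already seen.
    // Single-threaded sketch; a concurrent or bounded cache needs more care.
    private static final Map<Long, Long> FIB_CACHE = new HashMap<>();

    static long fib(long n) {
        if (n < 2) return n;
        Long cached = FIB_CACHE.get(n);
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        FIB_CACHE.put(n, result);
        return result;
    }
}
```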
One related low-level strategy is strength reduction, which replaces costly operations with cheaper ones. I include migrating computations to SIMD units or GPUs in this genre, which becomes more relevant as these capabilities are made available to wider runtimes and operating environments. Today our applications consume and produce more data than ever before, and we are building longer processing pipelines, especially in ML use cases.
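For the classic, scalar flavour of strength reduction, here is a deliberately tiny sketch: a multiplication inside a hot loop replaced by an addition carried across iterations. Modern JIT compilers frequently do this for us, so it is exactly the kind of change we should only keep if measurement shows it helps.

```java
public class StrengthReduction {

    // Before: computes row * width for every element (a multiply per access).
    static long sumBefore(int[] pixels, int width, int height) {
        long sum = 0;
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                sum += pixels[row * width + col];
            }
        }
        return sum;
    }

    // After: the multiply is "reduced" to an addition carried between rows.
    static long sumAfter(int[] pixels, int width, int height) {
        long sum = 0;
        int rowStart = 0;
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                sum += pixels[rowStart + col];
            }
            rowStart += width; // strength reduction: + instead of *
        }
        return sum;
    }
}
```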
Another related strategy is reviewing the data structures and algorithms we use for our particular use cases such that caching and precomputation become more viable options in more scenarios.
Takeaway: Look for opportunities to precompute and cache results (when relevant) and replace costly operations with cheaper alternatives wherever possible (assuming we have measured the speed up and the added complexity is worth the benefit).
Parallelism and Span
Parallelism is a well-known approach to improving performance. Modern hardware is designed to take advantage of multiple cores, and parallel execution can lead to massive speedups—if applied properly.
The key concept here is span, which is the length of the longest path through our computation graph. The shorter our span, the more parallelism we can achieve on capable hardware because more tasks can be executed at the same time. Reducing span is crucial for unlocking the full potential of our system’s concurrency capabilities.
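To ground the idea: summing n numbers sequentially has a span proportional to n, while a divide-and-conquer sum has a span proportional to log n, because the two halves can proceed at the same time. A sketch using Java’s fork/join framework (the threshold is arbitrary and worth tuning by measurement):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer sum: total work stays O(n), but the span (the longest
// chain of dependent steps) shrinks to O(log n), so idle cores can help.
public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this, just loop
    private final long[] values;
    private final int from, to;

    public ParallelSum(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += values[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(values, from, mid);
        ParallelSum right = new ParallelSum(values, mid, to);
        left.fork();                      // run the left half asynchronously
        long rightSum = right.compute();  // compute the right half here
        return left.join() + rightSum;    // the two halves overlap in time
    }

    public static long sum(long[] values) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(values, 0, values.length));
    }
}
```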
However, not every algorithm parallelizes well. Some are inherently serial and can only use speculative parallelism to achieve speedup. Speculative parallelism involves executing tasks in parallel, even when we’re unsure if all will be needed, and throwing away the results of unnecessary computations later.
This approach can work well for some algorithms but may introduce overhead in others. Understanding when and where to apply parallelism effectively is key to avoiding the trap of over-parallelizing or introducing unnecessary complexity.
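As a sketch of the speculative flavour (the strategy suppliers are hypothetical placeholders): we race two ways of computing the same answer and keep whichever finishes first, accepting that the loser’s effort is thrown away.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class SpeculativeSearch {

    // Race two strategies that compute the same result; keep the first to finish.
    // The loser's work is wasted by design - that is the price of the speedup.
    static <T> T raceToResult(Supplier<T> strategyA, Supplier<T> strategyB) {
        CompletableFuture<T> a = CompletableFuture.supplyAsync(strategyA);
        CompletableFuture<T> b = CompletableFuture.supplyAsync(strategyB);
        // applyToEither completes with whichever future finishes first.
        // Note: CompletableFuture does not interrupt the losing task for us;
        // it simply keeps running and its result is discarded.
        return a.applyToEither(b, result -> result).join();
    }
}
```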
Takeaway: Focus on reducing span to maximize parallelism. Be aware that some algorithms are serial by nature and require thoughtful approaches like speculative parallelism.
The Importance of Domain-Specific Optimizations
Compilers do a great job of optimizing general-purpose code, but they are inherently limited by their generality. This is where domain-specific knowledge can give us the edge. When we understand the problem we’re solving, we can often apply optimizations that a compiler can’t see.
For example, if we’re working in the domain of image processing, languages like Halide allow us to express optimizations specifically tailored to this type of work. Similarly, if we’re working with graph algorithms, a tool like GraphIt can help us optimize better than a general-purpose compiler ever could.
These tools allow us to leverage our domain expertise to write high-performance code that would be difficult, if not impossible, to optimize to the same level using standard techniques.
Takeaway: Use domain-specific knowledge and tools to optimize more effectively when possible. Sometimes a general-purpose compiler just won’t cut it.
General Case versus Pathological Cases
One common challenge in optimization is balancing the needs of the general case with the demands of pathological special cases.
For most software systems, optimizing for the majority of use cases leads to broad, system-wide improvements. However, there are often edge cases or rare scenarios that can have pathological runtimes—situations where performance degrades drastically due to specific inputs or conditions.
Addressing these pathological cases can sometimes lead to disproportionate complexity, introducing maintenance burdens or negatively impacting the general case. However, ignoring them can cause our system to fail in high-priority or high-stakes situations.
The trick is to balance the two. First, ensure that the general case is well-optimized—this is where most of our users will benefit. But don’t leave the special cases untouched. Find ways to isolate their impact and optimize them without over-complicating our general architecture.
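A classic illustration of isolating a pathological case (a textbook example, not drawn from any particular system): naive quicksort is excellent in the general case but degrades to O(n²) on already-sorted or adversarial input when the pivot choice is predictable. Randomizing the pivot defuses that case with a one-line change instead of a redesign.

```java
import java.util.concurrent.ThreadLocalRandom;

public final class RobustQuickSort {

    public static void sort(int[] a) {
        quicksort(a, 0, a.length - 1);
    }

    private static void quicksort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);
    }

    private static int partition(int[] a, int lo, int hi) {
        // Random pivot: the one change that isolates the pathological
        // sorted/adversarial-input case without touching the general algorithm.
        int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, r, hi);
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```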
Takeaway: Optimize the general case first, but don’t ignore special cases with pathological runtimes. Look for ways to isolate and improve them without adding unnecessary complexity.
Locality, Parallelism, Concurrency and Redundant Work
Optimization isn’t just about raw speed; it’s about making intelligent trade-offs. Three key concepts often pull in different directions:
Locality: Keeping related data close together in memory to reduce cache misses. Better locality leads to better cache performance, but it may limit parallelism (a small sketch follows below).
Parallelism: Running tasks at the same time on capable hardware. More parallelism can boost performance but might come at the cost of poor locality or increased overhead.
Redundant work: Sometimes repeating work can simplify the architecture and improve locality, but it may lead to wasted CPU cycles.
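Here is the small sketch of the locality side of these tensions mentioned above: two traversals of the same 2D array that do identical work but touch memory in very different orders. The actual speedup depends on array size and the cache hierarchy, so, as always, measure.

```java
public class TraversalOrder {

    // Row-major traversal: consecutive accesses touch adjacent memory,
    // so each cache line fetched is fully used (good locality).
    static long sumRowMajor(int[][] grid) {
        long sum = 0;
        for (int row = 0; row < grid.length; row++) {
            for (int col = 0; col < grid[row].length; col++) {
                sum += grid[row][col];
            }
        }
        return sum;
    }

    // Column-major traversal over the same data: each access jumps to a
    // different row's array, touching a new cache line almost every time.
    // Assumes a rectangular, non-empty grid.
    static long sumColumnMajor(int[][] grid) {
        long sum = 0;
        int cols = grid[0].length;
        for (int col = 0; col < cols; col++) {
            for (int row = 0; row < grid.length; row++) {
                sum += grid[row][col];
            }
        }
        return sum;
    }
}
```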
The challenge lies in finding the right balance for our particular application. Improving one aspect—say, by increasing parallelism—may hurt another, such as locality. Intelligent optimization requires recognizing these trade-offs and managing them carefully.
Takeaway: Balance locality, parallelism, and redundant work. Each has its trade-offs, and optimizing one often comes at the expense of another.
Strategic Optimization Requires Clear Thinking
Effective software optimization is not about clever tricks or premature tweaks. It’s about developing a clear understanding of our program’s bottlenecks, making informed trade-offs, and using domain knowledge where general-purpose tools fall short.
The most important skill we can develop as an engineer is the ability to think critically about where to focus our optimization efforts. Measure first, reduce unnecessary work, leverage parallelism intelligently, and always keep trade-offs in mind.
By building strong mental models, we’ll be able to optimize more effectively and avoid the traps of complexity and over-optimization.
If you made it to the end and seek more on this topic, check out my post on optimizing software using properties.
If you enjoyed this content, please consider sharing this link with a friend, following my GitHub, Twitter/X or LinkedIn accounts, or subscribing to my RSS feed.