Performance Todo List

No unnecessary computations:
- No unnecessary copying
- No unnecessary allocations
Use all computing power:
- Use all cores
- Use SIMD
Avoid waits and stalls:
- Lockless data structures
- Asynchronous APIs
- Job systems
Use hardware efficiently:
- Cache friendly data structures
- Well predictable branches
- Avoid false sharing
OS-level efficiency

(The items below are possible ways to improve the performance of a pipeline. Test them in a development environment and merge them only after thorough benchmarking and profiling.)

Build Pipeline Modifications

Enable compiler optimizations: doesn't require touching the C++ code
- Optimization #1: Longer compile times, smaller binaries, faster execution
Set target architecture:
- Tells the compiler to perform micro-optimizations for a specific CPU architecture
- Also tells the compiler which SIMD instruction sets to use (e.g., SSE, AVX)
Use fast math:
- GCC, LLVM: -ffast-math (included in -Ofast)
- MSVC: /fp:fast
- ICC: -fp-model=fast
- Faster computations by allowing reordering and approximations of floating-point operations
- Less precise results; not standard-compliant
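To see why reordering matters, here is a minimal sketch (the function names are illustrative, not from any library). It assumes strict IEEE 754 double arithmetic, that is, no fast-math flags and no extended-precision evaluation.

```cpp
#include <cassert>

// Sums three values left to right: (a + b) + c.
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

// Sums the same values right to left: a + (b + c).
double sum_right(double a, double b, double c) {
    return a + (b + c);
}

// With a = 1e16, b = -1e16, c = 1.0:
//   sum_left  cancels the two large terms first, so c survives.
//   sum_right adds c to -1e16 first, where it is lost to rounding.
// Fast-math modes allow the compiler to treat both groupings as
// interchangeable, so either result may come out.
```

Code that depends on a specific summation order (compensated summation, for example) is the kind most likely to change behavior under fast math.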
Disable exceptions and RTTI if not used:
- No exceptions
  - GCC, LLVM, ICC: -fno-exceptions
  - MSVC: /EHs-c- /D_HAS_EXCEPTIONS=0
- No RTTI:
  - GCC, LLVM, ICC: -fno-rtti
  - MSVC: /GR-
- Reduces binary size and slightly improves performance by removing exception-handling and RTTI overhead (the gains are usually minor)
- Breaks code that uses exceptions or RTTI (throw, try/catch, dynamic_cast, typeid) and is not standard-compliant
Enable Link Time Optimization (LTO):
- The compiler sees only one translation unit at a time; LTO defers optimization to link time, when the whole program is visible
- This enables more aggressive inlining, dead code elimination, and other cross-module optimizations
- GCC, LLVM: -flto
- MSVC: /GL
- ICC: -ipo
- Trade-off: longer build times and higher memory usage, mostly at link time
Use unity builds (an alternative to LTO):
- Combines multiple source files into a single translation unit, which reduces per-file overhead and lets the compiler optimize across the merged files
- Shared headers are parsed once instead of once per source file, which usually shortens full builds
- Harder to adopt: macros, static symbols, and unnamed-namespace names from separate files end up in one translation unit and can collide
- CMake: -DCMAKE_UNITY_BUILD=ON
Link statically:
- Static linking can improve performance by removing the runtime overhead of dynamic linking, such as symbol resolution and relocation
- However, static linking increases the size of the executable, as all library code is included directly in the binary
- Trade-off between performance and binary size
- GCC, LLVM: -static
- MSVC: /MT (for static runtime linking)
Use profile-guided optimization (PGO):
- Three steps: build an instrumented binary with performance counters embedded, run it on representative workloads so it writes a profile data file, then rebuild using that file to guide optimization
- With MSVC the profile is stored in a .pgd (profile-guided database) file, which the next build consumes
- Helps the compiler make informed decisions about inlining, loop unrolling, and branch prediction based on actual usage patterns
- Optimizes the code based on real-world usage patterns, rather than heuristics, leading to better performance
- Requires multiple compilation steps and profiling runs
- https://en.wikipedia.org/wiki/Profile-guided_optimization
- https://developer.android.com/games/agde/pgo-overview
Try different compilers
Try different standard library implementations:
- Different standard library implementations may have varying performance characteristics for certain operations
- For example, libc++ (LLVM) vs libstdc++ (GCC)
- Benchmark your specific use cases to see which implementation performs better
Preload with a replacement lib:
- Speeds up memory allocations by using a high-performance memory allocator like tcmalloc, jemalloc, or mimalloc instead of the default system allocator
- Linux, BSD: env LD_PRELOAD=/usr/lib/libSUPERmalloc.so ./your_application
- macOS: env DYLD_INSERT_LIBRARIES=/usr/lib/libSUPERmalloc.dylib ./your_application
- Windows: Requires DLL injection techniques to replace the default allocator with a custom one
Use binary post processing tools:
- Tools like LLVM BOLT and Google's Propeller can optimize the final binary by reordering functions and basic blocks based on runtime profiling information
- This can improve instruction cache utilization and reduce page faults, leading to better runtime performance

Effective use of C++

Constexpr all the things:
- Read more: what are constexpr, rules and best practices
- https://stackoverflow.com/questions/14116003/whats-the-difference-between-constexpr-and-const
- https://medium.com/@sofiasondh/c-const-vs-constexpr-the-comparison-183f9dd92deb
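As a minimal sketch of what constexpr buys (C++17 assumed; `factorial`, `make_factorial_table`, and `kFactorials` are hypothetical names), the table below is computed by the compiler, and the `static_assert` lines confirm that no run-time work is needed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: evaluated at compile time when the argument is a constant.
constexpr std::uint64_t factorial(unsigned n) {
    assert(n <= 20 && "21! no longer fits in 64 bits");
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Builds a lookup table during compilation.
constexpr std::array<std::uint64_t, 10> make_factorial_table() {
    std::array<std::uint64_t, 10> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = factorial(static_cast<unsigned>(i));
    }
    return table;
}

constexpr auto kFactorials = make_factorial_table();

// These checks run inside the compiler, not in the program.
static_assert(factorial(5) == 120);
static_assert(kFactorials[9] == 362880);
```

The same `factorial` can still be called with run-time arguments; constexpr only permits compile-time evaluation, it does not force it.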
Make variables const:
- Use const wherever a value does not change; there are multiple ways to mark something const (const objects, const references, pointers to const, constexpr)
- Make member functions const if they don't modify the object state
- Explicit object parameters (C++23)
- Global variables: copy globals to const locals (if copying is cheap) at the start of functions to help the compiler optimize access
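A small sketch of two of the points above, using hypothetical names (`g_scale`, `scale_all`, `Accumulator`). Because `out` points to `int`, a store through it could in principle modify the global, so without the local copy the compiler may have to reload the global on every iteration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mutable global.
int g_scale = 3;

void scale_all(int* out, const int* in, std::size_t n) {
    assert(out != nullptr && in != nullptr);
    const int scale = g_scale;  // read the global once; the copy cannot change
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale;
    }
}

class Accumulator {
public:
    void add(int value) { total_ += value; }
    int total() const { return total_; }  // const: does not modify object state

private:
    int total_ = 0;
};
```

Whether the local copy changes the generated code depends on the compiler and on what it can prove about aliasing, so it is worth checking the assembly or a benchmark.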
Noexcept all the things:
- void f(); -> could throw exceptions
- void f() noexcept; -> will not throw exceptions
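One place where noexcept has a measurable effect is container growth. The sketch below (C++17; all type and function names are hypothetical) counts copies while a `std::vector` reallocates. On mainstream standard library implementations, elements are moved only when the move constructor is noexcept; otherwise they are copied to preserve the strong exception guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Move constructor is NOT noexcept.
struct ThrowingMove {
    static inline int copies = 0;
    ThrowingMove() = default;
    ThrowingMove(const ThrowingMove&) { ++copies; }
    ThrowingMove(ThrowingMove&&) {}
};

// Identical, except the move constructor is noexcept.
struct NoexceptMove {
    static inline int copies = 0;
    NoexceptMove() = default;
    NoexceptMove(const NoexceptMove&) { ++copies; }
    NoexceptMove(NoexceptMove&&) noexcept {}
};

// Returns how many copy constructions one reallocation caused.
template <typename T>
int copies_during_growth() {
    T::copies = 0;
    std::vector<T> v;
    v.emplace_back();
    while (v.size() < v.capacity()) {
        v.emplace_back();  // fill the current buffer without reallocating
    }
    const std::size_t old_capacity = v.capacity();
    v.emplace_back();      // buffer is full, so this reallocates
    assert(v.capacity() > old_capacity);
    (void)old_capacity;    // silences unused-variable warnings under NDEBUG
    return T::copies;
}
```

For types that own heap memory, such as strings or nested vectors, the difference between copying and moving every element on reallocation can be large.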
Use static for internal linkage:
- static has several uses: static local variables, static data members, static member functions
- At namespace scope it gives variables and functions internal linkage:
- static int global_values;
- static void helper_function();
- inline is only a weak inlining hint; static gives internal linkage, so the compiler knows that no other translation unit can call the function
- Example: a long function that is called exactly once. With external linkage the compiler must keep a standalone copy and may decide the body is too large to inline. With static it can see the single call site, so inlining costs no extra code size; it will usually inline the body and drop the standalone copy.
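A minimal sketch of internal linkage, assuming everything lives in one source file (`mix`, `mix_twice`, and `bucket_for` are hypothetical names). Only `bucket_for` is visible to other translation units; the helpers can be inlined and removed as standalone functions.

```cpp
#include <cassert>
#include <cstdint>

// Internal linkage via `static`: the compiler sees every call site.
static std::uint32_t mix(std::uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

// An unnamed namespace gives the same internal linkage.
namespace {
std::uint32_t mix_twice(std::uint32_t x) {
    return mix(mix(x));
}
}  // namespace

// The only externally visible function in this sketch.
std::uint32_t bucket_for(std::uint32_t id, std::uint32_t bucket_count) {
    assert(bucket_count > 0);
    return mix_twice(id) % bucket_count;
}
```

Inlining remains the optimizer's decision; internal linkage removes a reason not to inline, it does not guarantee it.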
Use [[noreturn]]:
- Marks a function that never returns to its caller, for example functions that always throw or that terminate the program
- helps the compiler optimize control flow and eliminate dead code paths
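A short sketch with hypothetical names (`fail`, `checked_divide`). Because `fail` is marked `[[noreturn]]`, the compiler knows the division is never reached with a zero denominator and needs no fall-through path after the call.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Always throws, so control never returns to the caller.
[[noreturn]] void fail(const std::string& message) {
    assert(!message.empty());
    throw std::runtime_error(message);
}

int checked_divide(int numerator, int denominator) {
    if (denominator == 0) {
        fail("division by zero");
    }
    return numerator / denominator;
}
```

If a function marked `[[noreturn]]` does return, the behavior is undefined, so the attribute belongs only on functions that always throw or terminate.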
Use [[likely]] and [[unlikely]]:
- Hints to the compiler about the expected outcome of a condition
- Helps the compiler optimize branch prediction and instruction layout
- Example:
```
if (condition) [[likely]] {
    // code for the likely case
} else [[unlikely]] {
    // code for the unlikely case
}
```
- Note: Use these attributes judiciously, as incorrect hints can lead to worse performance
Use [[assume(condition)]]:
- assert(condition) is for the programmer; [[assume(condition)]] (C++23) is for the compiler
- if condition is false at run time, the behavior is undefined
- declare a code path as unreachable to help the compiler optimize (or use std::unreachable in C++23)
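The sketch below pairs a debug-time `assert` with `[[assume]]`. The `ASSUME` macro and `div32` are hypothetical; the macro falls back to a no-op on compilers that do not report support for the C++23 attribute.

```cpp
#include <cassert>

// Use [[assume]] when the compiler reports support; otherwise do nothing.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(assume)
#    define ASSUME(expr) [[assume(expr)]]
#  endif
#endif
#ifndef ASSUME
#  define ASSUME(expr) ((void)0)
#endif

int div32(int x) {
    assert(x >= 0);  // checked in debug builds, for the programmer
    ASSUME(x >= 0);  // promise to the optimizer; undefined behavior if violated
    return x / 32;   // for non-negative x this can become a plain shift
}
```

Keeping the `assert` next to the assumption means debug builds catch a violated promise before release builds turn it into undefined behavior.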
Make functions pure:
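- A pure function's return value is determined only by its inputs, and it has no observable side effects; GCC and Clang let you declare this with the pure attribute, or with the stricter const attribute for functions that also read no memory other than their arguments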
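As a hedged sketch, GCC and Clang accept the vendor attributes `[[gnu::pure]]` and `[[gnu::const]]`; other compilers are expected to ignore them, possibly with a warning. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// pure: no side effects, but may read memory through its pointer argument.
[[gnu::pure]] int sum(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// const: result depends only on the argument values, no memory reads.
[[gnu::const]] int square(int x) {
    return x * x;
}

int twice_sum(const int* data, std::size_t n) {
    assert(data != nullptr);
    // Both calls have identical arguments and nothing is written in between,
    // so the compiler is allowed to evaluate sum() only once.
    return sum(data, n) + sum(data, n);
}
```

The attributes are promises, not checks: marking a function pure when it has side effects leads to miscompiled code.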
Take parameters properly:
Avoid allocations in loops:
- Hoist declarations out of the loop so that allocated memory is reused between iterations; reset the object with .clear() when needed
- Reserve space in containers when the size is known beforehand; over-reserving is fine, since a vector may hold more capacity than it uses
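Both points can be sketched in one function (hypothetical name `row_sums_of_squares`). The scratch buffer is contrived, but it shows the pattern: declared once, cleared per iteration, capacity reused.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

std::vector<long long> row_sums_of_squares(const std::vector<std::vector<int>>& rows) {
    std::vector<long long> sums;
    sums.reserve(rows.size());       // size known up front: a single allocation

    std::vector<long long> scratch;  // hoisted: its buffer survives iterations
    for (const auto& row : rows) {
        scratch.clear();             // drops elements, keeps capacity
        for (int value : row) {
            scratch.push_back(static_cast<long long>(value) * value);
        }
        sums.push_back(std::accumulate(scratch.begin(), scratch.end(), 0LL));
    }
    assert(sums.size() == rows.size());
    return sums;
}
```

Declaring `scratch` inside the loop would allocate and free a buffer on every iteration; hoisted, it grows only when a row is longer than any seen before.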
Avoid copying exceptions:
- Catch exceptions by (const) reference; catching by value copies the exception and slices it when its dynamic type is derived from the caught type
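The sketch below shows the slicing effect with a hypothetical `ParseError` type. The explicit copy inside the second handler stands in for `catch (std::runtime_error e)`, which would produce the same sliced object.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <typeinfo>

struct ParseError : std::runtime_error {
    ParseError() : std::runtime_error("generic runtime error") {}
    const char* what() const noexcept override { return "parse error"; }
};

// Catching by reference keeps the dynamic type, so the override is called.
std::string what_via_reference() {
    std::string result;
    try {
        throw ParseError();
    } catch (const std::runtime_error& e) {
        result = e.what();
    }
    return result;
}

// Copying into the base type slices off the derived part.
std::string what_via_copy() {
    std::string result;
    try {
        throw ParseError();
    } catch (const std::runtime_error& e) {
        const std::runtime_error sliced = e;  // what catch-by-value would hold
        assert(typeid(sliced) == typeid(std::runtime_error));
        result = sliced.what();
    }
    return result;
}
```

Besides losing the override, the by-value form pays for an extra copy of the exception object.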
Avoid copies in range-for loops, lambda captures, and structured bindings:
- Use references in range-for loops to avoid unnecessary copies
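A sketch covering all three cases, with hypothetical function names (C++17 for structured bindings). In each function, dropping the `&` would copy a `std::string` or a whole map entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Range-for: `const std::string&` avoids copying each element.
std::size_t total_length(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& word : words) {
        total += word.size();
    }
    return total;
}

// Structured bindings: `const auto&` binds to the map entry in place.
const std::string& longest_value(const std::map<int, std::string>& table) {
    assert(!table.empty());
    const std::string* longest = &table.begin()->second;
    for (const auto& [key, value] : table) {
        (void)key;  // only the value is needed here
        if (value.size() > longest->size()) {
            longest = &value;
        }
    }
    return *longest;
}

// Lambda capture: `[&limit]` refers to the string, `[limit]` would copy it.
std::size_t count_longer_than(const std::vector<std::string>& words,
                              const std::string& limit) {
    const auto is_longer = [&limit](const std::string& word) {
        return word.size() > limit.size();
    };
    return static_cast<std::size_t>(
        std::count_if(words.begin(), words.end(), is_longer));
}
```

Capturing by reference is only safe while the referenced object outlives the lambda, which holds here because the lambda never leaves the function.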

Manual hardware oriented optimizations

Memory mental model:
- Memory as a tape of byte-sized cells? Not quite: several cache levels (L1, L2, L3) sit between the CPU and RAM
- Prefetching: memory moves in whole cache lines, so neighboring data arrives together (spatial locality)
- Caching: small memory close to the CPU, faster access times, temporal locality (Recently accessed data is likely to be accessed again soon)
Takeaways from the above:
- Access memory sequentially to take advantage of spatial locality
- Keep frequently accessed data in cache to benefit from temporal locality
- Minimize cache misses by optimizing data structures and access patterns
- std::array, std::vector, std::deque, std::flat_map, std::flat_set are cache friendly
- std::list, std::set, std::unordered_map, std::unordered_set, std::map are not cache friendly
- Traverse multi-dimensional arrays in storage order; C++ arrays are row-major, so the last index belongs in the innermost loop
- Reorder struct members to minimize padding and improve cache line utilization
- Some other ways to improve locality:
  - loop fusion
  - blocking (tiling)
  - AoS vs SoA
  - thread affinity (pin a thread to a core)
    - Linux: pthread_setaffinity_np
    - Windows: SetThreadAffinityMask
    - macOS: thread_policy_set with thread_affinity_policy_t
  - process and thread priority can also be set
  - NUMA architecture awareness
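Two of the takeaways above, sketched with hypothetical types: member reordering to reduce padding, and AoS versus SoA. The byte sizes in the comments are typical for 64-bit platforms, not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Small members on either side of a large one force padding (often 24 bytes).
struct Padded {
    std::uint8_t flag;
    std::uint64_t id;
    std::uint8_t kind;
};

// Largest member first lets the small ones share a slot (often 16 bytes).
struct Reordered {
    std::uint64_t id;
    std::uint8_t flag;
    std::uint8_t kind;
};
static_assert(sizeof(Reordered) <= sizeof(Padded));

// Array of structures: summing mass also drags x, y, z through the cache.
struct ParticleAoS {
    float x, y, z, mass;
};

float total_mass_aos(const std::vector<ParticleAoS>& particles) {
    float total = 0.0f;
    for (const ParticleAoS& p : particles) {
        total += p.mass;
    }
    return total;
}

// Structure of arrays: the mass values are contiguous in memory.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float total_mass_soa(const ParticlesSoA& particles) {
    assert(particles.x.size() == particles.mass.size());
    float total = 0.0f;
    for (float m : particles.mass) {
        total += m;
    }
    return total;
}
```

SoA tends to help when a loop touches only a few fields; when every field of an element is used together, AoS is often just as good.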
Branch predictor:
- avoid indirect calls
- make branches predictable
- use branchless operations where possible (cmov -> assembly level, ternary operator, bitwise operations)
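A sketch of branchy versus branchless formulations, with hypothetical function names. Optimizing compilers often perform this transformation themselves, so treat the branchless versions as something to verify in the generated assembly rather than a guaranteed win.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy: the `if` is hard to predict when the data is random.
std::size_t count_below_branchy(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        if (v < threshold) {
            ++count;
        }
    }
    return count;
}

// Branchless: the comparison result (0 or 1) is added directly.
std::size_t count_below_branchless(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        count += static_cast<std::size_t>(v < threshold);
    }
    return count;
}

// Ternaries on simple values typically compile to conditional moves.
int clamp_branchless(int x, int lo, int hi) {
    assert(lo <= hi);
    const int at_least_lo = (x < lo) ? lo : x;
    return (at_least_lo > hi) ? hi : at_least_lo;
}
```

When a branch is well predicted (sorted input, for example), the branchy version can be faster because it does no extra work on the skipped path.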
SIMD:
- Use compiler auto-vectorization
- Use SIMD intrinsics for critical code paths
- Use libraries that leverage SIMD (e.g., Eigen, Intel IPP)
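A minimal sketch of the first two options, assuming an x86 target: `add_scalar` is a plain loop that compilers can usually auto-vectorize, and `add_simd` uses SSE intrinsics when `__SSE2__` is defined, falling back to scalar code otherwise. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// Simple, dependency-free loop: a good auto-vectorization candidate.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Explicit SIMD: four floats per iteration, then a scalar tail.
void add_simd(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) {  // remaining elements (or all, without SSE2)
        out[i] = a[i] + b[i];
    }
}
```

For a loop this simple, the auto-vectorized version is usually as fast as the intrinsics; hand-written SIMD pays off mainly where the compiler cannot prove vectorization is safe or profitable.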