Ever since the advent of the three-way comparison “spaceship” operator <=> in C++20, I have been wondering when a sort algorithm would appear that truly leverages three-way comparison (not to be confused with three-way partitioning, as I learnt along the way).
Three-way comparison
Determines the total ordering of two values in a single operation. See https://wiki2.org/en/Three-way_comparison
Spoiler alert
The key takeaway: with the hint-enabled version of kj::sort_three_way, speed-ups of up to ~3.4x were observed (GCC/libstdc++ on fully sorted data), while expensive-comparison cases and/or choosing the right partitioning technique typically land in the 1.2x-1.7x range.
Two things to know
Speedup scales with how well you know your data and the complexity of the comparison function — and
kj::sort_three_way lets you tell it.
What and Where
#include <kj/sort_three_way.hpp>
std::vector<std::wstring> names = /* ... */;
// like std::ranges::sort, but three-way
kj::sort_three_way(names);
// with a hint: I know it's nearly sorted
kj::sort_three_way(kj::sort_with<kj::key_order_nearly_sorted>, names);
kj::sort_three_way can be found on https://codeberg.org/trueqbit/three-way-sort. Feedback welcome!
To me, specifying hints to the sort algorithm seems like an ideal addition to the Standard Library, so I hope to propose it to WG21 one day. Until then, I encourage you to try kj::sort_three_way!
Intuition
I intuitively assumed that using the result of a three-way comparison would reduce the number of comparisons and would be automatically faster just because the comparison function has to be invoked only once to determine equality.
What I was right about: it is faster when used together with three-way partitioning (DNF technique) and expensive comparison functions, especially with duplicate-heavy data sets.
However I also must admit that this was my naïve part as I never looked closely at std::ranges::sort or sorting algorithms in general. After all there are better things to do than reinvent the wheel.
The answer to the question is not so difficult, but not straightforward either. A happy-go-lucky post about fewer comparisons in the C++ community Slack on 25 January 2026, and a couple of replies later, got me thinking. I “nerd-sniped” Arthur O’Dwyer and was disappointed when he conveyed to me some weeks later that he was going to be anti-nerd-sniped. So I set out to investigate a bit more.
I made a mistake: I simply assumed that fewer comparisons means less work and more speed, and I initially had only performance tests with randomly generated test data.
In my defense, I was only comparing against the Microsoft STL implementation of std::ranges::sort, which uses three-way partitioning. Additionally I was only playing with a simplistic algorithm that I call the “pivot hiding” technique (it removes the pivot from the range to be sorted and re-inserts it later). But this is detrimental to performance except in a few cases.
Let me nerd-snipe you again as I was originally only looking at Microsoft STL’s implementation :)
Nerd-sniping
What a cool, picturesque expression! I heard it for the first time and learnt that it originates from xkcd strip 356.
Different techniques and Standard Library implementations
Even such a well-understood problem space as sorting is hard. It’s not only the theoretical science that matters, but also the kind of data set the algorithm is applied to and how this translates into a running computer program.
As usual the chosen algorithms are key. Yet Standard Library implementations don’t agree on a specific technique, and they certainly don’t agree on thresholds. My humble guess is that I am not the first to try to squeeze out the last drop of performance.
A few coinages I use below
- pivot band — the middle bucket (== pivot) produced by three-way / DNF partitioning.
- pivot sentinel — the binary-partitioning trick where “median to first” lets the pivot end up in the left sub-partition and act as a sentinel for binary partitioning and the final sort pass.
- pivot hiding — the simplistic technique introduced above: remove the pivot from the range and re-insert it later. Detrimental except in a few cases.
- unguarded loop — a loop written without range bounds checks, because a sentinel element or a structural invariant guarantees the loop stops before the end of range is ever hit.
Partitioning
Looking at the major Standard Library implementations, they all use the hybrid introspective “intro-sort” algorithm, but behind it there are 2 main approaches to partitioning, which in turn define how intro-sort behaves:
- Three-way partitioning
- Binary partitioning
Major Standard Library implementations
- Microsoft STL is exclusively using three-way partitioning.
- libstdc++ is exclusively using (unguarded) binary partitioning.
- libc++ is effectively using three-way partitioning, too. Additionally, its intro-sort is combined with other ideas, e.g. branch-less sorting for arithmetic types, high optimization for very small ranges, and other parts based on pattern-defeating quicksort (PDQ).
Three-way partitioning
Three-way partitioning is the approach known in the literature as the “Dutch National Flag (DNF) partitioning technique”. It detects duplicates and partitions a range into 3 buckets: < pivot | == pivot | > pivot.
I call this the “pivot band” technique as it creates a band in the middle consisting of elements equal to the pivot.
- It needs a way to determine equality, therefore with traditional less-than comparison it has to invoke the less-than comparison function 2 times. This is where true three-way comparison comes into play.
- To pick the pivot it uses Tukey’s ninther (median of medians) for bigger sub partitions, converging to the “median of three” for smaller sub partitions; an important performance property of Microsoft STL’s “median of three” function is that it is optimized for less nested branches and requires only 2 less-than comparisons in the best-case scenario.
- Each recursively established sub partition is sorted using insertion-sort.
With three-way partitioning, three-way comparison really reduces the number of comparisons by up to 25.6% with current real-world data fixtures, which is quite a huge percentage! Whether it is effectively faster depends on your input data, the complexity of the comparison function and the emitted compiler code. Conversely, binary partitioning always outperforms three-way partitioning when there are no duplicates (i.e. unique keys).
Binary partitioning
Binary partitioning simply creates 2 buckets around a pivot: < pivot | >= pivot.
I call this the “pivot sentinel” technique.
It comes with 2 interesting advantages:
- For computing the pivot it uses “median to first”, which then allows for a tight unguarded partitioning loop in its essential hot path.
- The left-most sub partition will eventually contain the pivot, which allows for a single final insertion-sort pass of the whole range, which is partially unguarded.
Now with binary partitioning, there’s not really much a three-way comparison result can be used for. So we are done with libstdc++, right? Not quite: there are situations where three-way partitioning outperforms it despite its higher complexity, which makes it worth challenging binary partitioning. An additional minor advantage, which I find useful nonetheless, is that three-way comparison allows us to write comparison statements that read a bit more naturally.
What’s inside kj::sort_three_way
A few more coinages
- trule — a tiny transport capsule carrying the comparison and projection functions separately so the compiler can keep them in registers. I remember this expression well from somewhere, but I no longer recall its author.
- inlining fence — a deliberate [[noinline]] at a call site to reset the compiler’s inlining budget on the other side.
Codegen
I ensured that the codegen is the same as for Microsoft STL’s std::ranges::sort, which is important especially in the hot paths. Since I am on Windows I investigated assembly emitted by MSVC.
Even ensuring that the compiler emits the best assembly just for the repeated computation of the depth limit makes a difference.
An interesting problem I encountered was codegen for ordinary insertion-sort for arithmetic types, which explains why sorting PDH counter data by PID (or parent PID) doesn’t quite reach the throughput of std::ranges::sort.
For string-view comparison, using <=> even gave me the same-or-better assembly. For arithmetic types, MSVC emits an auxiliary lea instruction in the inner insertion-sort loop, which is suboptimal.
Quoting the relevant comment:
NOTE: MSVC cannot fully match its own STL’s insertion sort codegen here (inlined). MSVC STL uses less<> which returns bool, allowing the compiler to track the hole pointer with a simple sub rax,18h in the inner loop. With three-way comparison returning strong_ordering, the compiler reuses the pivotBand stack slot for the value temporary and keeps rbx live as its base pointer throughout, forcing an auxiliary r10 with lea rax,[r10-18h] + mov r10,rax per iteration. This is likely a register allocator artifact of the heavier comparator return type and seems not to be fixable at the source level without abandoning three-way comparison.
Register pressure and the inlining budget
CPU register pressure from the surrounding context has an observable impact. The compiler’s inlining budget plays a vital role, and there are ways to control it:
- GCC inlines more aggressively than MSVC.
- Function calls in the hot path are expensive, and by default the compiler’s decisions are absolutely dependent on the surrounding context (call tree).
- “Fencing” and flattening create independent sort routines highly tailored to their input range.
Comparison and partitioning shape
Knowing the shape of the data set matters:
- Three-way partitioning handles duplicate-heavy data optimally as there are naturally many equal elements.
- Binary partitioning handles unique keys optimally as there are naturally no equal elements.
Note that “median of three” or “median to first” cannot leverage three-way comparison (so the three-way savings apply inside three-way partitioning, not in pivot selection).
Move cost and insertion sort
The literature seems to be silent about this, but the move-cost of a range’s value type is not to be ignored. Avoiding the creation of “holes” during insertion sort does make a difference, too. It is less of an issue for a single final insertion sort pass, but adds up when sorting each recursively established sub partition.
A very high insertion threshold of 128 elements for already sorted data is beneficial. Avoiding swaps is the real deal, and possible if you know the data set is sorted already.
Enregistering the comparator and projector (“trule”)
The trule technique passes down the comparison and projection functions, each separately so they can be enregistered (which std::reference_wrapper prevents due to its user-defined constructor).
- This is the approach the Microsoft STL has taken, as opposed to libstdc++, which combines them in a struct with a user-defined constructor (hence not an aggregate).
- Also, Microsoft STL’s trait that determines the ability to be enregistered is more permissive than libstdc++’s, and I think more accurate as it directly correlates to assembler code.
Proper tests
I really needed proper benchmarks and tests involving real-world data sets.
Benchmarking
Luckily, other programmers have carefully thought about reliable benchmarking: Catch2 takes care of all the details, e.g. establishing the baseline parameters for proper measurement such as mean, standard deviation and sampling.
Real-world data
For real-world data it turned out that I don’t need data sets with several million elements. Since I am a systems programmer what could be more enticing than sampling data from the operating system with just a few hundred items?
The good thing is that these kinds of data sets often have a peculiar shape (e.g. already pre-sorted or nearly sorted) or exhibit patterns (e.g. clusters), predictably so, simply due to how the operating system hands them over to the caller.
The current data fixtures are outputs from the Windows Service Control Manager (SCM) and the Performance Data Helper API (PDH counters).
For instance:
- Data set characteristics of Windows services:
- nearly sorted by name (case-insensitive)
- 3 partitions:
- System services, in correct order (sorted by SCM at boot)
- Per-user services, in correct order
- newly or dynamically created services appended, out of order
- Data set characteristics of PDH counter data for processes:
- unsorted at first call, fully sorted by instance name at succeeding calls
- no duplicates (each PID and instance name is unique)
- many duplicate parent PIDs
- highly clustered names (many instances share the same process name, but different PIDs, e.g. 91 svchost, 34 firefox, 22 conhost)
- within a cluster, entries often share the same parent PID, resulting in clusters of parent PIDs as well
Outcome - kj::sort_three_way
I now have intro-sort implementations that are equivalent to libstdc++’s and Microsoft STL’s implementations respectively.
kj::sort_three_way gives you the best of both techniques, binary and three-way partitioning.
This enables comparing against Standard Library implementations and comparing different techniques as well.
API
- A random-access range-based API that enforces the kj::three_way_sortable concept on the data set, analogous to std::ranges::sort enforcing std::sortable. three_way_sortable requires at least strict weak ordering.
- Not stable, matching std::ranges::sort — use std::ranges::stable_sort if you need stability.
Ordering constraint
The comparator must return std::weak_ordering or std::strong_ordering. std::partial_ordering — which allows incomparable elements like NaN — isn’t enough, and the three_way_sortable concept rejects it at the call site.
Sort hints
kj::sort_three_way lets you specify hints about the data set and the data type to be sorted. They drive the selection of sort strategy, ninther threshold, insertion threshold and whether to use the faster swap-insertion-sort.
- Exposition: kj::sort_hints{ .distribution = kj::key_distribution_unspecified, .order = kj::key_order_unspecified, .complex_comparison = ..., .high_move_cost = !std::is_trivially_move_constructible_v<value_type>, ._like = kj::_sort_like_stdlib }.
- Complex comparison is currently determined by whether the comparison functor “desugars” to std::compare_three_way, an internal type trait I found in libc++. Be aware that this check alone may not be accurate for custom comparators.
- In order to compare kj::sort_three_way’s implementation against the Standard Library’s implementation, the sort hints expose an (internal) enum to perform the sorting in the same way. This lets us establish a benchmark baseline.
Under the hood
- Avoiding the creation of “holes” during insertion sort for high move-cost types makes a difference. This resulted in an alternative swap-based insertion-sort function that performs more move operations for long displacements but eliminates the repeated move-construction of a temporary that the standard hole-based insertion-sort always does.
- Calculating the depth limit as a threshold to fall back to heap sort follows the path chosen by the Microsoft STL: instead of the usual 2⋅⌊log₂(N)⌋ they chose 1.5⋅⌊log₂(N)⌋ divisions, which allows for trying just a bit more partitioning before resorting to heap sort.
- Controls the inlining budget and decisions of the compiler, selectively at strategic points: [[noinline]] is used at the call site to create an “inlining fence”, effectively resetting the compiler’s inlining budget. You actually don’t want an unexpected expensive function call happening where the compiler should be inlining the insertion-sort. [[msvc::forceinline_calls]] / [[gnu::flatten]] / [[gnu::always_inline]] are used to avoid unexpected expensive helper function calls and instead create optimal independent routines highly tailored to their input range.
TODO
- indirect_value_impl meta-function for projected values.
- Tests with Clang and LLVM’s libc++. Though I drew some things from libc++ I only really tested with GCC/libstdc++ and MSVC/Microsoft STL.
- Tests with more data sets.
Benchmark results
Let’s look at the numbers.
I ran the same real-world fixtures against std::ranges::sort:
- on Windows 11 25H2 Build 26200.8246, compiled with MSVC 14.51 -O2
- on Windows Subsystem for Linux (WSL), compiled with GCC 15.2.0 -O3
The first column is the absolute baseline time; the others are speedup factors relative to it (higher is better).
The last column, marked “hinted”, is the main result to read.
Legend:
- std::ranges::sort — the Standard Library’s implementation, baseline at 1.00x.
- stdlib-like — kj::sort_three_way configured to mimic the Standard Library’s strategy and thresholds. Isolates the benefit of true three-way comparison before any hint-driven tuning kicks in.
- default — kj::sort_three_way with type traits inferred (comparison complexity, move cost) but no hint about the shape of the data set. This is what you get if you don’t know your data.
- hinted — kj::sort_three_way with sort_hints describing the shape of the data. This is the main criterion: if you know your data, this is what you get.
On GCC/libstdc++ no hinted row falls below 1.00x. On MSVC/Microsoft STL only one does — Pre-sorted by parent PID — and that’s the arithmetic-codegen case described earlier.
MSVC/Microsoft STL
BENCHMARK COMPARISON SUMMARY
-----------------------------------------------------------------------------------------------
Benchmark std::ranges::sort stdlib-like default hinted
-----------------------------------------------------------------------------------------------
Unsorted by instance name 15.667 us 1.14x 1.12x 1.10x
Fully sorted by instance name 6.016 us 1.00x 1.00x 1.89x
Reverse sorted by instance name 7.333 us 1.14x 1.16x 1.43x
Pre-sorted by parent PID 1.040 us 0.87x 0.91x 0.93x
Pre-sorted by process PID 2.190 us 0.98x 0.97x 1.21x
Service names nearly sorted (wmemcmp) 9.684 us 1.11x 1.03x 1.06x
Service names nearly sorted (std::wstring) 16.768 us 1.08x 1.66x 1.61x
Service names nearly sorted (ci comparison) 133.793 us 1.07x 1.44x 1.33x
-----------------------------------------------------------------------------------------------
```mermaid
---
config:
  xyChart:
    showDataLabel: true
---
xychart-beta horizontal
    title "MSVC/Microsoft STL — hinted speedup vs std::ranges::sort"
    x-axis ["Unsorted", "Fully sorted", "Reverse sorted", "Parent PID", "Process PID", "Services (wmemcmp)", "Services (wstring)", "Services (ci)"]
    y-axis "speedup (x)" 0 --> 3.5
    bar [1.10, 1.89, 1.43, 0.93, 1.21, 1.06, 1.61, 1.33]
    line [1, 1, 1, 1, 1, 1, 1, 1]
```
And the same benchmarks through the stdlib-like lens.
The speedup here isolates what true three-way comparison alone buys you.
```mermaid
---
config:
  xyChart:
    showDataLabel: true
---
xychart-beta horizontal
    title "MSVC/Microsoft STL — stdlib-like speedup vs std::ranges::sort"
    x-axis ["Unsorted", "Fully sorted", "Reverse sorted", "Parent PID", "Process PID", "Services (wmemcmp)", "Services (wstring)", "Services (ci)"]
    y-axis "speedup (x)" 0 --> 3.5
    bar [1.14, 1.00, 1.14, 0.87, 0.98, 1.11, 1.08, 1.07]
    line [1, 1, 1, 1, 1, 1, 1, 1]
```
GCC/libstdc++
BENCHMARK COMPARISON SUMMARY
-----------------------------------------------------------------------------------------------
Benchmark std::ranges::sort stdlib-like default hinted
-----------------------------------------------------------------------------------------------
Unsorted by instance name 7.356 us 0.98x 0.98x 1.00x
Fully sorted by instance name 7.353 us 1.05x 1.04x 3.40x
Reverse sorted by instance name 6.077 us 1.04x 1.05x 1.70x
Pre-sorted by parent PID 1.271 us 1.00x 0.99x 1.25x
Pre-sorted by process PID 1.467 us 0.99x 1.00x 1.01x
Service names nearly sorted (wmemcmp) 6.308 us 0.96x 0.96x 1.03x
Service names nearly sorted (std::wstring) 7.565 us 0.97x 1.04x 1.25x
Service names nearly sorted (ci comparison) 29.645 us 1.00x 1.00x 1.47x
-----------------------------------------------------------------------------------------------
```mermaid
---
config:
  xyChart:
    showDataLabel: true
---
xychart-beta horizontal
    title "GCC/libstdc++ — hinted speedup vs std::ranges::sort"
    x-axis ["Unsorted", "Fully sorted", "Reverse sorted", "Parent PID", "Process PID", "Services (wmemcmp)", "Services (wstring)", "Services (ci)"]
    y-axis "speedup (x)" 0 --> 3.5
    bar [1.00, 3.40, 1.70, 1.25, 1.01, 1.03, 1.25, 1.47]
    line [1, 1, 1, 1, 1, 1, 1, 1]
```
Disclosure
I use AI tooling responsibly and to help me out: to do the heavy lifting and the boring or hard-to-track things. My intent is to be authentic.
This blog post was written solely by myself. I used Claude AI for proof-reading, verifying completeness and suggesting ways to restructure the text so it is easier to understand: things that easily escape a human.
All library code was hand-picked, re-assembled, verified or written by myself.
Of course I didn’t come up with the sorting algorithms themselves, that would be foolish. Credits go to the people who implemented the Standard Library facilities.
I used Claude for conversational AI, to compare emitted assembler, reason about the implementation, codegen and trade-offs, and instructed it to create the benchmark tests.
