Recently, I came across an old (circa 2009) article by Igor Ostrovsky, titled “Gallery of Processor Cache Effects”. In the first part of the article, Igor performs some simple experiments to demonstrate how CPU caches can impact performance in non-obvious ways if the programmer is not familiar with how they work. Given that his article is 15 years old at the time of writing this post, I thought it might be interesting to try to reproduce his results on my much more modern machine. It might seem really simple, but I learned even more than promised as I stumbled through the steps myself.
All programs listed below are compiled and run with cargo's `--release` flag.
Quick summary
In Igor’s first experiment, he shows that for an array of `i32`s (each 4 bytes long), two loops iterating over:
- every element, and
- every 16th element

both finish their work in the same amount of time.
This is because of the effect of CPU caches – accessing one index of an array actually pulls a (typically) 64-byte contiguous chunk of memory containing that element into the CPU caches. This chunk of memory is referred to as a cache line, and caching it makes subsequent accesses to anything else in that chunk very fast.
The running time of the loop is heavily dominated by memory access, and since touching every element or every 16th element pulls in the same number of cache lines, both loops finish in the same time.
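To make the arithmetic concrete, here is a minimal sketch (assuming the common 64-byte cache line, which is not something the article states for my machine) of why a stride of 16 touches exactly as many cache lines as a stride of 1:

```rust
fn main() {
    // Assuming 64-byte cache lines, which is typical on modern x86-64 CPUs.
    let cache_line_bytes = 64;
    let elems_per_line = cache_line_bytes / std::mem::size_of::<i32>();

    // 16 i32s fit in one cache line, so striding by 16 still lands in every
    // single cache line of the array -- the memory traffic is identical.
    assert_eq!(elems_per_line, 16);
    println!("i32 elements per cache line: {elems_per_line}");
}
```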
I set out to repeat this experiment, but as you’ll come to read, I was classically nerd-sniped into learning more about how the operating system manages memory, and how that can sometimes be at odds with what we’re trying to achieve.
A first attempt
Alright, let’s get going.
The code in the first example is very simple, and reproducing it in Rust should be straightforward:
```rust
use std::time::Instant;

fn main() {
    let strides = [1, 16];
    let mut array = vec![0i32; 64 * 1024 * 1024];

    for stride in strides {
        let start = Instant::now();
        workload(&mut array, stride);
        let dur = start.elapsed().as_millis();
        println!("(stride={stride}) duration: {}ms", dur);
    }
}

fn workload(arr: &mut [i32], stride: usize) {
    for i in (0..arr.len()).step_by(stride) {
        arr[i] *= 3;
    }
}
```
Seems easy enough… Here's the output:

```
(stride=1) duration: 368ms
(stride=16) duration: 32ms
```
Immediately, we're met with an unexpected result. The `stride=16` loop is faster by a significant margin, directly contradicting Igor's assertion.
Before we get to the answer, let’s take a very brief detour to understand how operating systems manage memory for the programs we write.
Physical memory, virtual memory, and paging
The memory that is available to userspace programs is virtual - i.e., the addresses do not directly refer to physical memory locations. The operating system presents a contiguous chunk of this virtual memory to a program, and then maps any addresses accessed in that virtual space to addresses in physical memory.
The OS also does not deal with memory allocations at the granularity of single bytes. Memory is allocated and deallocated in pages, which are chunks of virtual memory of a predefined size (typically 4KiB). Individual pages of memory are tagged with permission bits that specify whether the memory can be written to, read from, or executed.
The operating system does not immediately create a mapping to physical memory for all of the virtual memory requested by a program. This happens gradually as the program accesses (reads or writes) previously un-accessed pages. The mechanism by which the operating system is notified to create a mapping is called a page fault. This adds some latency to userspace programs, since it needs to be handled by the kernel. There are other reasons a page fault may occur, but this is the most relevant one in the context of this post.
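The effect of demand paging is easy to observe from a program. Below is a sketch (not a rigorous benchmark) showing that asking for a large zero-initialized buffer returns almost immediately, while the first pass that actually touches each page is where the page faults – and the time – go. The 4096-byte step assumes the usual 4KiB page size.

```rust
use std::time::Instant;

fn main() {
    // The allocation itself is cheap: no physical pages are handed out yet.
    let t0 = Instant::now();
    let mut buf = vec![0u8; 256 * 1024 * 1024]; // 256 MiB, zero-initialized
    println!("allocation: {:?}", t0.elapsed());

    // The first write to each (assumed 4 KiB) page triggers a page fault,
    // and this is where the time actually goes.
    let t1 = Instant::now();
    for i in (0..buf.len()).step_by(4096) {
        buf[i] = 1;
    }
    println!("first touch: {:?}", t1.elapsed());

    // Prevent the writes from being optimized away in --release builds.
    std::hint::black_box(&buf);
}
```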
Apples-to-apples
Now that we have a better picture of how our program's memory is handled by the operating system, let's see if running `perf` sheds any light. Here is the trimmed output, showing just the number of page faults:
```
Performance counter stats for './target/release/cpucache':
...
131,145      page-faults:u      #  273.724 K/sec
...
```
That is a large number1, but given that we’re requesting a fairly big chunk of memory (256MiB) for our vector, maybe that’s not unacceptable? That being said, armed with our knowledge from the previous section, we are ready to take a guess.
Once a virtual memory page is backed by physical memory, there are no more page
faults to be had for accessing that chunk2.
This sheds some light on what might be happening: our first stride iteration (`stride=1`) is incurring page faults as it traverses the array, because that is the first time our program accesses that memory. The second iteration (`stride=16`) then may not be incurring this penalty.
In order to make the benchmark fair, we can ensure that the second iteration also page faults as it iterates through the array. The code fix is easy: ensure that a new vector is allocated per stride iteration.
```diff
 fn main() {
     let strides = [1, 16];
-    let mut array = vec![0i32; 64 * 1024 * 1024];

     for stride in strides {
+        let mut array = vec![0i32; 64 * 1024 * 1024];
         let start = Instant::now();
         workload(&mut array, stride);
         let dur = start.elapsed().as_millis();
```
Running `perf` again confirms our theory: we're seeing double the number of page faults.
```
Performance counter stats for './target/release/cpucache':
...
262,219      page-faults:u      #  298.211 K/sec
...
```
With this fix, we see that the results match Igor’s original observation that both loops complete in around the same amount of time!
```
(stride=1) duration: 377ms
(stride=16) duration: 359ms
```
Improving latency per iteration
While we were able to align the running time of both the `stride=1` and `stride=16` iterations above, something is still off.
Our iterations are taking ~350ms, which is significantly slower than the iterations in Igor’s experiments - his loops completed in ~80ms. I’m also running on much more modern (and presumably more powerful) hardware, so these timings are definitely off.
Perhaps we can avoid the significant overhead of incurring page faults in both stride iterations?
The zero page and Copy-On-Write
It turns out that when you request a zero-initialized chunk of memory, a lot of magic happens to make it as fast as possible and to minimize the impact on the rest of the system, allowing for memory overcommitment.
First, an operation like `vec![0i32; ...]` results in a call to `calloc()`. Because an allocation this large is served with fresh pages from the kernel, which are guaranteed to read as zeroes, `calloc()` can skip explicitly writing any zeroes itself.
Next, we need to examine the operation we are performing in `workload`:

`arr[i] *= 3`

To state the obvious, this is actually two operations and may be re-written as:

`arr[i] = arr[i] * 3`
The first access is a read of `arr[i]`, on the right-hand side of the expression. This triggers a page fault, and it is at this point that a physical page would usually be reserved. However, since the kernel knows this page has never been written (and so must read as all zeroes), it maps our virtual memory to a special physical page called the zero page.
The zero page is an all-zero page of memory that’s created statically during kernel
initialization, so there’s no extra allocation that needs to happen to service
our read operation. Multiple virtual pages can be mapped to the singular zero page.
The second access is writing the result of `arr[i] * 3` back to the array index. When this write is executed, another page fault occurs because the zero page is write-protected. To handle this, the kernel finally allocates physical memory for the page being accessed by duplicating the zero page, and the write is issued.
The general procedure of creating a duplicate page only at the time of writing data to memory is called Copy-On-Write (COW).
No wonder the loops are slow - each page fault needs to be handled by the kernel, and we are triggering two of them per 4KiB chunk of our array.
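This "two faults per page" claim is something we can check directly. Below is a minimal Linux-only sketch, assuming the usual 4KiB pages and parsing the minor-fault counter (field 10 of `/proc/self/stat`), that reads one element per page of a zeroed vector and then writes one element per page, printing the faults incurred by each phase:

```rust
use std::fs;

// Minor page faults incurred by this process so far, read from
// /proc/self/stat. The counter is the 8th field after the closing ')'
// of the process name.
fn minor_faults() -> u64 {
    let stat = fs::read_to_string("/proc/self/stat").unwrap();
    let after_name = &stat[stat.rfind(')').unwrap() + 1..];
    after_name.split_whitespace().nth(7).unwrap().parse().unwrap()
}

fn main() {
    let elems_per_page = 4096 / std::mem::size_of::<i32>(); // assumes 4 KiB pages
    let mut array = vec![0i32; 64 * 1024 * 1024]; // 256 MiB of zeroes

    let base = minor_faults();

    // Phase 1: read one element per page. Each page faults once and is
    // mapped to the shared zero page.
    let mut sum = 0i64;
    for i in (0..array.len()).step_by(elems_per_page) {
        sum += array[i] as i64;
    }
    let after_reads = minor_faults();

    // Phase 2: write one element per page. Each page faults again, and the
    // kernel now allocates a real physical page (copy-on-write).
    for i in (0..array.len()).step_by(elems_per_page) {
        array[i] = 3;
    }
    let after_writes = minor_faults();

    println!("faults from reads:  {}", after_reads - base);
    println!("faults from writes: {}", after_writes - after_reads);
    println!("(sum of reads: {sum})"); // keep the read loop from being optimized out
    std::hint::black_box(&array); // keep the write loop from being optimized out
}
```

On a machine matching these assumptions, each phase should report roughly one fault per 4KiB page of the array.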
Avoiding the COW penalty
In general, Copy-On-Write & demand paging is good for the system as a whole, since physical memory is only allocated just in time. This allows for greater utilization of the overall memory available in the system.
However, in our simple benchmark, the overhead of triggering page faults is quite noticeable. I’d like to show cleaner benchmark numbers.
It turns out that we can avoid this overhead by requesting a chunk of memory that’s initialized to some value other than zero. This means that the OS does not have the luxury of using the zero page as the initial backing physical memory.
Here, we set the initial value in the invocation of the `vec!` macro to `1i32`:
```diff
     let strides = [1, 16];

     for stride in strides {
-        let mut array = vec![0i32; 64 * 1024 * 1024];
+        let mut array = vec![1i32; 64 * 1024 * 1024];
         let start = Instant::now();
         workload(&mut array, stride);
         let dur = start.elapsed().as_millis();
```
This incurs the page fault penalty before `workload` even begins:
- First, a 256MiB chunk of virtual memory is reserved.
- Immediately, writes are issued to set the values to `1`. These writes trigger all of the necessary page faults to ensure there is backing physical memory.

There is no intermediate step here where the virtual memory is first backed by the zero page and then incurs a copy-on-write penalty. The writes in the `workload` function are essentially "free" in terms of page faults.
Running this version of the code produces this result:
```
(stride=1) duration: 29ms
(stride=16) duration: 26ms
```
Perfect! We’ve successfully reproduced the findings in Igor’s article.
Appendix
Alternative: Manually pre-faulting
Another approach we could have taken, if we wanted to keep the zero-initialization, is to manually trigger the page faults before starting the workload for each stride.
On most Linux systems, pages are 4KiB in size - as is the case on my machine:
```
$ getconf PAGESIZE
4096
```
This means that if we make a write into the vector at each 4KiB page boundary, we should trigger all the necessary page faults. If we do this before we perform the workloads for each stride, the workloads themselves should run quickly.
```diff
     let strides = [1, 16];

     for stride in strides {
-        let mut array = vec![1i32; 64 * 1024 * 1024];
+        let mut array = vec![0i32; 64 * 1024 * 1024];
+
+        // Page size is 4 * 1024 bytes. `i32` is 4 bytes wide.
+        // => every 1024th element in the vector is in a new page.
+        for i in (0..array.len()).step_by(4 * 1024 / size_of::<i32>()) {
+            array[i] = 0;
+        }
+
         let start = Instant::now();
         workload(&mut array, stride);
         let dur = start.elapsed().as_millis();
```
Running this version confirms the hypothesis:
```
(stride=1) duration: 26ms
(stride=16) duration: 23ms
```
(Interestingly, it appears that writing `0`s to an already fully-zeroed chunk of memory is optimized away neither by the compiler nor the operating system.)
Measuring page faults with perf
I ran `perf` to gather insights into the number of page faults in each case:
```
$ cargo build --release && perf stat -e page-faults,minor-faults,major-faults ./target/release/cpucache

# Initialized with zeroes
(stride=1) duration: 405ms
(stride=16) duration: 396ms

262,219      page-faults:u
262,219      minor-faults:u
      0      major-faults:u

# Initialized with ones
(stride=1) duration: 29ms
(stride=16) duration: 26ms

131,146      page-faults:u
131,146      minor-faults:u
      0      major-faults:u
```
Let’s try to arrive at these numbers ourselves from what we know about our vector.
The vector that we initialize has 64M elements, and each element is 4 bytes wide, so the total size allocated is 256MiB. With a page size of 4KiB, this equates to 256MiB / 4KiB = 65,536 (≈65.5K) pages.
Each run of our program has 2 stride iterations, and each one reserves its own vector: 65,536 × 2 ≈ 131K pages, combined.
This explains the second output in the case of initializing with ones, since we know we can expect one page fault per page.
Additionally, we also previously established that when initializing with zero, our program can expect two faults per page. Therefore we can also explain the first `perf` output.
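As a small sanity check of this arithmetic (again assuming 4KiB pages), the expected counts line up well with what `perf` reported; the small remainders are presumably faults from the rest of the program (loading the binary, the stack, and so on):

```rust
fn main() {
    let bytes_per_vec = 64 * 1024 * 1024 * std::mem::size_of::<i32>(); // 256 MiB
    let pages_per_vec = bytes_per_vec / 4096; // 65,536 pages, assuming 4 KiB pages
    let vecs_per_run = 2; // one vector per stride iteration

    // Ones: one fault per page   => 131,072 (perf reported 131,146)
    println!("expected faults, ones:   {}", pages_per_vec * vecs_per_run);
    // Zeroes: two faults per page => 262,144 (perf reported 262,219)
    println!("expected faults, zeroes: {}", 2 * pages_per_vec * vecs_per_run);
}
```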
References & Acknowledgements
- Gallery of Processor Cache Effects - Igor Ostrovsky
- This StackOverflow answer that helped me understand why there would be twice as many page faults in the zero-initialization case.
- Understanding the Linux Kernel by Daniel P. Bovet & Marco Cesati (in particular, the sections on the Page Fault Exception Handler & Demand Paging)
- I used allocscope to verify that `vec![0i32; ...]` results in a call to `calloc()`, and that `vec![1i32; ...]` instead results in calls to `malloc()`.
- I'd also like to acknowledge Ragnar Groot Koerkamp's research blog, which heavily inspired me both to try this and to write about my findings.
- Thanks to Sam Rose and Adam Chalmers for reviewing this article and providing feedback!