Undefined behavior and maintaining a list of raw pointers

Jan 20, 2023

For whatever reason1, I wanted to maintain a Vec of raw pointers to the values of entries in a HashMap<K, Mutex<MaybeUninit<T>>>. However, occasionally the program would crash with a segmentation fault, or some other error, when I'd try to write some data to the MaybeUninit value. The gist of what I was doing was something like this:

let mut map = HashMap::new();
let mut ptrs = Vec::new();

for idx in 0.100 {
    // Insert some thing into the map. MaybeUninit can have a value or not.
    map.insert(idx, Mutex::new(MaybeUninit::uninit()));
    // Fetch the thing and get a raw pointer to it. Store that pointer in a Vec.
    ptrs.push(map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>);
}

for (idx, ptr) in ptrs.iter().enumerate() {
    // Write some data to the field. The data itself doesn't matter.
    unsafe { &**ptr } // &Mutex<MaybeUninit<usize>>
        .lock() // Result<MutexGuard<MaybeUninit<usize>>>
        .unwrap() // MutexGuard<MaybeUninit<usize>>
        .write(idx as usize);
}

You might already see what the issue is with this snippet. I didn't.

Running this code produced a few different results:

  • Poisoning
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', src/main.rs:259:14
  • Use after free
program(18501,0x100afc580) malloc: Incorrect checksum for freed object 0x14d6069b8: probably modified after being freed.
Corrupt value: 0x14d606b70
program(18501,0x100afc580) malloc: *** set a breakpoint in malloc_error_break to debug
[1]    18501 abort      cargo run
  • Segmentation fault
[1]    18159 segmentation fault  cargo run

Unfortunately, it wasn't immediately obvious why my unsafe block was unsound. Fortunately, it is easy enough to validate that a pointer points to the correct location in memory. Modifying the code above, I came up with:

for idx in 0.100 {
    map.insert(idx, Mutex::new(MaybeUninit::uninit()));
    ptrs.push(map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>);
}

for (idx, ptr) in ptrs.iter().enumerate() {
    // Get a raw pointer to the correct location of the data in memory.
    let correct_ptr = map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>;
    // Compare it with the pointer stored earlier.
    assert!(
        std::ptr::eq(*ptr, correct_ptr),
        "left = {}, right = {}",
        *ptr as usize,
        correct_ptr as usize
    );
}

And the output:

thread 'main' panicked at 'left = 5349861848, right = 5351977768', src/main.rs:263:9

So I have a list of potentially dangling pointers. It took me a while to work out why my original pointers were invalid. And then I remembered that Vec has a with_capacity function so that you can minimize allocations by creating a vec with enough capacity to hold some amount of data. The Capacity and reallocation section in the docs state:

The capacity of a vector is the amount of space allocated for any future elements that will be added onto the vector. This is not to be confused with the length of a vector, which specifies the number of actual elements within the vector. If a vector’s length exceeds its capacity, its capacity will automatically be increased, but its elements will have to be reallocated.

And

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.

A few lines of code revealed how capacity increases. There's probably documentation about this somewhere, but I coudn't find it.

let mut list = Vec::new();
let mut capacity = list.capacity();
println!("{capacity}");
for i in 0..10000 {
    list.push(i);
    let new_capacity = list.capacity();

    if capacity < new_capacity {
        capacity = new_capacity;
        println!("{capacity}");
    }
}

The above yields:

0
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384

So capacity increases exponentially. There is an explanation in the source for why the capacity starts at 4 once allocated:

// Tiny Vecs are dumb. Skip to:
// - 8 if the element size is 1, because any heap allocators is likely
//   to round up a request of less than 8 bytes to at least 8 bytes.
// - 4 if elements are moderate-sized (<= 1 KiB).
// - 1 otherwise, to avoid wasting too much space for very short Vecs.

For a HashMap:

0
3
7
14
28
56
112
224
448
896
1792
3584
7168
14336

Maybe similar reasons?

In the first snippet above, I'm inserting 100 entries. So the map is reallocating 5 times after the first allocation. If I had a HashMap with the correct initial capacity, that should solve the problem:

let mut map = HashMap::with_capacity(100);
let mut ptrs = Vec::new();

for idx in 0.100 {
    map.insert(idx, Mutex::new(MaybeUninit::uninit()));
    ptrs.push(map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>);
}

for (idx, ptr) in ptrs.iter().enumerate() {
    // Get a raw pointer to the correct location of the data in memory.
    let correct_ptr = map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>;
    // Compare it with the pointer stored earlier.
    assert!(std::ptr::eq(*ptr, correct_ptr));
}

That doesn't panic. But this only helps if I already know how many items would be inserted. In the code I was working on, there was no way of knowing.

Box is "the simplest form of heap allocation in Rust." And is also the solution to my problem. Box is a pointer to a heap-allocated value. If I had a HashMap<K, Box<Mutex<MaybeUninit<T>>>, then I can get a raw pointer from the Box to the value, instead of a pointer to an item in the map. This would be less efficient because every insert is now an allocation, however.

This also doesn't panic:

let mut map = HashMap::new();
let mut ptrs = Vec::new();

for idx in range {
    // Box the value.
    map.insert(idx, Box::new(Mutex::new(MaybeUninit::uninit())));
    // Store a raw pointer to the location that the box points to. 
    ptrs.push(&**map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>);
}

for (idx, ptr) in ptrs.iter().enumerate() {
    let correct_ptr = &**map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>;
    assert!(std::ptr::eq(*ptr, correct_ptr));
}

In this snippet, instead of getting a pointer to the value in the map, I have to dereference the value in the map twice so that I can get the location of the boxed value.

To fix the original code, I only needed to change the type being pushed to the HashMap:

let mut map = HashMap::new();
let mut ptrs = Vec::new();

for idx in range {
    map.insert(idx, Box::new(Mutex::new(MaybeUninit::uninit())));
    ptrs.push(&**map.get(&idx).unwrap() as *const Mutex<MaybeUninit<usize>>);
}

for (idx, ptr) in ptrs.iter().enumerate() {
    unsafe { &**ptr } // &Mutex<MaybeUninit<usize>>
        .lock() // Result<MutexGuard<MaybeUninit<usize>>>
        .unwrap() // MutexGuard<MaybeUninit<usize>>
        .write(idx as usize);
}
  1. There was a real reason