Day 54 — Rust ownership and Python garbage collection

Today I read the chapter on ownership in the Rust book. In this post, I'll try to summarize what I learned for future me!

Ownership in Rust

Ownership is Rust's central feature and the chapter explains how it allows Rust to make memory safety guarantees without needing a garbage collector.

In some languages like Python, memory is managed by a garbage collector that runs alongside the program, looks for values that are no longer in use, and frees them. In other languages like C, memory is managed by the programmer with the malloc and free functions. In Rust, memory is managed through a set of ownership rules that the compiler checks at compile time. That looks like a fine balance between a garbage collector (which "sounds" external to the code actually being run) and manual memory management done by the programmer (which can be error-prone).

Data types

To explain ownership, the chapter goes into scalar and non-scalar data types and how they are stored in memory. The main difference between the two is that a scalar value has a fixed size that is known at compile time, while a non-scalar value owns data whose size can change while the program runs.

Scalar data types include all the integer and floating-point types, such as u32 and f64, the character type char, and the boolean type bool. Tuples made only out of those types, for example (i32, i32), are strictly speaking compound types in Rust, but they also have a fixed size, so I'm grouping them with the scalars here. Non-scalar data types include things that can grow and shrink, like Strings and vectors.

Stack and Heap

Scalar values are stored on the stack, while the contents of non-scalar values are stored on the heap, after the allocator finds a space that is big enough. It's easy to manage values on the stack, because we only ever access and copy at the top. The heap takes more work for both of those things: to access a value we first have to follow a pointer to its memory location, and to make a copy we have to ask the allocator for more space.

Ownership rules

Managing data on the heap is the reason why ownership exists in Rust. These are Rust's ownership rules:

  1. Each value in Rust has an owner.
  2. There can only be one owner at a time.
  3. When the owner goes out of scope, the value is dropped.

Based on these rules (enforced by the Rust compiler at compile time), Rust can automatically do the equivalent of malloc when a value is created and the equivalent of free when the variable that owns it goes out of scope. And that's why we don't need a garbage collector, and don't need to call malloc and free by hand!
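To see the third rule in action, here is a small sketch I put together. Guard and count_drops are names I made up for illustration. The only real Rust machinery is the Drop trait, whose drop method Rust calls when a value's owner goes out of scope.

```rust
use std::cell::Cell;

// Hypothetical type: bumps a shared counter whenever it is dropped.
struct Guard<'a> {
    drops: &'a Cell<u32>,
}

impl Drop for Guard<'_> {
    // Rust calls this automatically when the owning variable goes out of scope.
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

// Returns how many guards were dropped by the end of the inner scope.
fn count_drops() -> u32 {
    let drops = Cell::new(0);
    {
        let _a = Guard { drops: &drops };
        let _b = Guard { drops: &drops };
        // Both guards are still owned here, so nothing has been dropped yet.
        assert_eq!(drops.get(), 0);
    } // `_b` and `_a` go out of scope here, and Rust runs `drop` on each
    drops.get()
}

fn main() {
    println!("dropped {} guards", count_drops());
}
```

Nowhere in count_drops do we ask for the cleanup to happen. It follows from where the scope ends.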

Copy and Move

In Rust, when we assign an existing scalar variable to a new variable:

  let x = 5;
  let y = x;

The value is copied from the old variable to the new one, because that's cheap and easy to do on the stack, as discussed above. Both x and y stay valid.

But when we assign an existing non-scalar variable to a new variable:

  let x = String::from("Hello");
  let y = x;

Rust copies the pointer (which is on the stack, together with the String's length and capacity) to the new variable, but not the data it points to. That's because copying data on the heap could be an expensive operation if the data were large.

But wait! If that were the whole story, then by Rust's ownership rules both x and y would try to free the same data when they go out of scope (a double free error!).

To prevent that from happening, Rust moves the ownership of the data from x to y when we do let y = x, and x becomes invalid. If we try to use x later, the Rust compiler will report an error!
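Here is a minimal sketch of the difference between a move and an explicit copy. move_vs_clone is a made-up helper. It compares the heap addresses returned by as_ptr to check that a move reuses the same buffer while clone allocates a new one.

```rust
// Returns (buffer is the same after a move, buffer is the same after a clone).
fn move_vs_clone() -> (bool, bool) {
    let x = String::from("Hello");
    let before = x.as_ptr(); // address of the heap data owned by x

    let y = x; // move: only the pointer, length and capacity are copied
    // println!("{}", x); // would not compile: error[E0382]: borrow of moved value: `x`
    let same_after_move = before == y.as_ptr();

    let z = y.clone(); // explicit deep copy: new heap allocation, y stays valid
    let same_after_clone = y.as_ptr() == z.as_ptr();

    (same_after_move, same_after_clone)
}

fn main() {
    let (moved, cloned) = move_vs_clone();
    println!("same buffer after move: {}, after clone: {}", moved, cloned);
}
```

If we really want two independent Strings, clone is how we ask for the expensive copy on purpose.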

References and Borrows

Other ways to move ownership are passing a variable to a function and returning a variable from a function. Since passing and then returning ownership with every function call can be tedious, Rust lets us pass references to variables into functions instead. In this case, the variable is borrowed by the function.
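A small sketch of both styles, modeled on the kind of example the Rust book uses. The function names length_by_move and length_by_borrow are mine.

```rust
// Takes ownership of the String, so it has to hand it back with the result.
fn length_by_move(s: String) -> (String, usize) {
    let len = s.len();
    (s, len)
}

// Borrows the String instead; the caller keeps ownership the whole time.
fn length_by_borrow(s: &String) -> usize {
    s.len()
}

fn main() {
    let s1 = String::from("hello");

    // Ownership goes into the function and comes back out in the tuple.
    let (s1, len) = length_by_move(s1);
    println!("'{}' has length {}", s1, len);

    // With a reference there is nothing to hand back.
    let len = length_by_borrow(&s1);
    println!("'{}' still has length {}", s1, len);
}
```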

References allow us to refer to variables without taking ownership of them. They are immutable by default, so we're not allowed to modify something through a plain reference. References can also be mutable, but Rust doesn't let us have more than one mutable reference to the same value at a time. This prevents data races!
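As a quick sketch of a mutable borrow that does work, here is a made-up helper, append_world, that changes a String through a &mut reference:

```rust
// Hypothetical helper: borrows the String mutably and appends to it in place.
fn append_world(s: &mut String) {
    s.push_str(", world");
}

fn main() {
    let mut s = String::from("hello");
    append_world(&mut s); // the mutable borrow ends when the call returns
    println!("{}", s); // s is still owned by main, now with the new contents
}
```

Only one mutable reference exists at any point here, so the compiler is happy.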

The code below will fail because r1 and r2 are both mutable references to s:

  let mut s = String::from("hello");

  let r1 = &mut s;
  let r2 = &mut s;

  println!("{} {}", r1, r2);

We also cannot have a mutable reference while we have an immutable one, because users of an immutable reference don't expect the value to suddenly change underneath them!

And because of that, the code below will also fail:

  let mut s = String::from("hello");

  let r1 = &s;
  let r2 = &mut s;

  println!("{} {}", r1, r2);

A reference's scope starts from where it is introduced, and continues through the last time that reference is used.

So this is valid code:

  let mut s = String::from("hello");

  let r1 = &mut s;

  println!("{}", r1);

  let r2 = &mut s;

  println!("{}", r2);

We can have multiple immutable references to a variable because they don't change the value they refer to:

  let s = String::from("hello");

  let r1 = &s;
  let r2 = &s;

  println!("{} {}", r1, r2);

Garbage collection in Python

When I was learning about Python C extensions some moons ago, I came across Python's C-API and how it also works with references (reference counts!). Python frees a value when the number of references pointing to it drops to 0.
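Rust's standard library has a reference-counted pointer too, std::rc::Rc, which makes for a handy way to watch a count go up and down. This is only an analogy for what Python does, not a model of its internals, and the counts function is a made-up helper.

```rust
use std::rc::Rc;

// Returns the reference count at three points in time.
fn counts() -> (usize, usize, usize) {
    let a = Rc::new(String::from("hello"));
    let at_start = Rc::strong_count(&a);

    let b = Rc::clone(&a); // a second reference to the same String
    let with_two = Rc::strong_count(&a);

    drop(b); // letting go of one reference decrements the count
    let at_end = Rc::strong_count(&a);

    (at_start, with_two, at_end)
} // `a` goes out of scope here, the count reaches 0, and the String is freed

fn main() {
    println!("{:?}", counts());
}
```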

New references

When we create a new reference to a PyObject, we own that reference, and we must call Py_DECREF on it once we're done with it so that the object can be freed. If we fail to call Py_DECREF, we get a memory leak!

  PyObject *pA = PyLong_FromLong(a);  // New ref
  PyObject *pB = PyLong_FromLong(b);  // New ref
  PyObject *r = PyNumber_Subtract(pA, pB);  // New ref

  Py_DECREF(pA);  // You must decref
  Py_DECREF(pB);  // You must decref

  return r;  // Caller must decref

What if we could remove the need to call Py_DECREF on pA and pB by automatically calling an associated free function when both of those references go out of scope?
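Here is a toy sketch of what that could look like in Rust. Everything in it is hypothetical: FakeObject is just a counter standing in for a PyObject, OwnedRef is a wrapper I made up, and none of it talks to the real C-API. The point is only that the decrement lives in Drop, so it runs on its own at the end of the scope.

```rust
use std::cell::Cell;

// Toy stand-in for a PyObject: only a reference count, nothing else.
struct FakeObject {
    refcnt: Cell<isize>,
}

// Hypothetical owned reference: holding one means we owe exactly one decref.
struct OwnedRef<'a> {
    obj: &'a FakeObject,
}

impl<'a> OwnedRef<'a> {
    // Models a call that returns a new reference: the count goes up by one.
    fn new(obj: &'a FakeObject) -> Self {
        obj.refcnt.set(obj.refcnt.get() + 1);
        OwnedRef { obj }
    }
}

impl Drop for OwnedRef<'_> {
    // Plays the role of Py_DECREF, but runs automatically at end of scope.
    fn drop(&mut self) {
        self.obj.refcnt.set(self.obj.refcnt.get() - 1);
    }
}

// Returns the count while the reference is alive and after it has gone away.
fn refcount_inside_and_after() -> (isize, isize) {
    let obj = FakeObject { refcnt: Cell::new(0) };
    let inside;
    {
        let _p_a = OwnedRef::new(&obj);
        inside = obj.refcnt.get();
    } // `_p_a` is dropped here: the decref happens without us writing it
    (inside, obj.refcnt.get())
}

fn main() {
    println!("{:?}", refcount_inside_and_after());
}
```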

Moved references

When we move a reference to a PyObject into something (in this case, a tuple, through PyTuple_SetItem, which "steals" the reference), it is kinda owned by that tuple, because now it's the tuple's responsibility to call Py_DECREF on it. If we also call Py_DECREF ourselves after moving it to the tuple, the value might be freed before the tuple has had a chance to make use of it.

  PyObject *r = PyTuple_New(2);  // New ref

  PyObject *v1 = PyLong_FromLong(1L);  // New ref
  PyTuple_SetItem(r, 0, v1);

  // We shouldn't Py_DECREF(v1) because it belongs to r now

  PyObject *v2 = PyLong_FromLong(2L);  // New ref
  PyTuple_SetItem(r, 1, v2);

  return r;  // Callers must decref

What if a compiler could enforce ownership rules and prevent us from calling Py_DECREF on v1 after it has moved into r?
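In plain Rust, that is exactly what happens when a value is moved into a container. A minimal sketch, where Item is a made-up stand-in for an owned reference and a Vec plays the tuple:

```rust
// Hypothetical stand-in for an owned reference to some value.
#[derive(Debug, PartialEq)]
struct Item(i64);

fn build_pair() -> Vec<Item> {
    let mut r = Vec::with_capacity(2);

    let v1 = Item(1);
    r.push(v1); // v1 is moved into r, so r is now responsible for it
    // drop(v1); // would not compile: error[E0382]: use of moved value: `v1`

    let v2 = Item(2);
    r.push(v2);

    r // ownership of the whole Vec moves to the caller
}

fn main() {
    let r = build_pair();
    println!("{} {}", r[0].0, r[1].0);
}
```

The mistake of releasing v1 a second time never makes it past the compiler.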

Borrowed references

When we borrow a reference to something, we need to explicitly call Py_INCREF to register our interest, so that the garbage collector doesn't drop the value before we've had a chance to make use of it.

  PyObject *pFirst;
  pFirst = PyList_GetItem(pList, 0);

  Py_INCREF(pFirst);  // Register our interest

  PyObject_Print(pFirst, stdout, 0);

  Py_DECREF(pFirst);  // Let go

What if we didn't have to register our interest in the reference pFirst by calling Py_INCREF? The compiler could instead report an error if we borrow pFirst but then try to change pList while the borrow is still in use.

Could ownership rules be built into Python's C-API and enforced by a compiler, thus replacing reference counting and removing the need for a garbage collector?
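This is how the same situation plays out with a plain Rust vector. first_then_push is a made-up helper, and the Vec stands in for pList.

```rust
// Returns the first element we saw and the final length of the list.
fn first_then_push() -> (i64, usize) {
    let mut list = vec![1, 2, 3];

    let first = &list[0]; // borrow the first element, no refcount involved
    // list.push(4); // would not compile here: error[E0502]: cannot borrow
    //               // `list` as mutable because it is also borrowed as immutable
    let seen = *first; // last use of the borrow

    list.push(4); // fine now: the borrow ended after its last use
    (seen, list.len())
}

fn main() {
    println!("{:?}", first_then_push());
}
```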