Day 34 — Linkers go brrrrr

28 September 2020 · recurse-center

Continuing from the day before yesterday, I started looking into linkers and loaders. Till now, I haven't had experience writing multiple C files that work together. And up until yesterday, my only other experience with .so files has been knowing that if I put opencv's .so file in my Python site-packages, I'd somehow be able to run import cv2 in my Python REPL. Also, the terms "symbols", "ABIs", and ".so files" have always seemed like magic to me so I want to demystify all of that!

Today I went down the following Wikipedia rabbit hole and also read some awesome blog posts. I'll try to summarize some things I learned (and leave bread crumbs) for future me!

C standard library > Shared libraries > Program lifecycle phase > Static library > Static build > Soname > Object file > Symbol table > Linker > Direct binding > Dynamic linker > Prelink > Loader > Dynamic loading > Application binary interface > Executable

After going down the rabbit hole, I found that The Linux Programming Interface contains very lucid explanations about the fundamentals of shared libraries in Chapter 41. I could've (mostly) avoided the rabbit hole if I had just read this chapter in the first place!

Later in the day, I found this really awesome 20-part series on Linkers by Ian Lance Taylor. Also, yesterday Ori pointed me to How To Write Shared Libraries by Ulrich Drepper (who used to maintain glibc), which I need to read soon. There's also Program Library HOWTO, which looks like a very exhaustive resource that I should get to. The Beginner's Guide to Linkers also looks like a cool blog post that I should check out. At the end of the day, I found that Julia Evans has written some really awesome blog posts on linkers, which I need to read ASAP!

What are linkers?

A linker is a program that combines object files (which contain the machine code a compiler generates from a high-level language like C) into executables and shared libraries. This means that you can write your code in multiple C files (which may use some pre-existing libraries), compile them into object files, and have the linker combine everything into an executable. It does that by identifying and resolving references to all the "symbols" in your files. A symbol is just a name for a variable or a function.

If we have three files mod1.c, mod2.c, and mod3.c, each containing a function that prints something:


  /* mod1.c: mod2.c and mod3.c look the same, with their own function names */
  #include <stdio.h>

  int mod1_func(void) {
      printf("mod1 says hello!\n");
      return 0;
  }

And another file called prog.c which expects to call the functions mod1_func, mod2_func, and mod3_func from the files above:


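  #include <stdio.h>

  /* Declarations only: the definitions live in mod1.c, mod2.c, and mod3.c,
     and the linker resolves these references. */
  int mod1_func(void);
  int mod2_func(void);
  int mod3_func(void);

  int main(void) {
      mod1_func();
      mod2_func();
      mod3_func();
      printf("Hello, world!\n");
      return 0;
  }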

We can compile the mod*.c files into a library, and link it to prog.c while creating an executable. There are two types of libraries: static and shared.

What is a static library?

A static library is just an archive of object files. It removes the need to recompile the C files into object files every time you want to build a new executable. Static libraries were the first type of library to appear on Unix systems.

We can create a static library for our mod*.c files above by first compiling them into object files (gcc -c compiles and assembles, but does not link):


  $ gcc -c mod1.c mod2.c mod3.c
  $ ls
  mod1.o mod2.o mod3.o

And then put them into an archive called libfunc.a using the ar command (predecessor to tar):


  $ ar r libfunc.a mod1.o mod2.o mod3.o
  $ ar tv libfunc.a
  rw-r--r-- 0/0   5928 Jan  1 05:30 1970 mod1.o
  rw-r--r-- 0/0   5928 Jan  1 05:30 1970 mod2.o
  rw-r--r-- 0/0   5928 Jan  1 05:30 1970 mod3.o

After that, we can compile prog.c to an object file, and link it to libfunc.a using gcc to create an executable:


  $ gcc -c prog.c
  $ gcc -o prog prog.o libfunc.a
  $ ./prog
  mod1 says hello!
  mod2 says hello!
  mod3 says hello!
  Hello, world!

When we link our program to a static library (static linking!), the linker copies the object files that the program needs from the library into the resulting executable. If we create multiple executables, each one will have its own copy of that library code, which is redundant.

This increases the disk space required to store the executables, and wastes memory when several of these executables run at the same time, since each one loads its own copy of the same library code. And if a change is made to a static library, all of the executables must be relinked in order to propagate the change. Shared libraries were designed to overcome all of these problems.

What is a shared library?

With a shared library, only a single copy of the object files is shared by all programs that require them. They're loaded into memory the first time a program (that requires them) is executed. If during that time, another program (that requires them) starts executing, it can just reuse the copy already loaded into memory.

They have a performance overhead though, as they must be compiled to "position-independent code", which (1) requires the use of an extra register on some architectures, and (2) requires "symbol relocation" to be performed at run time. This adds a little more time (compared to a statically linked equivalent) before the program finally runs.

Shared libraries follow the libfoo.so naming convention. To "install" a shared library on your system, you need to put it in one of the default directories that the dynamic linker looks into, or update the /etc/ld.so.conf config file (and run ldconfig afterwards) to tell the dynamic linker to look into a non-default directory. You can find more details in Chapter 41 of The Linux Programming Interface.

Let's create a shared library for our mod*.c files from above and call it libfunc.so:


  $ gcc -fPIC -Wall mod1.c mod2.c mod3.c -shared -o libfunc.so

The -fPIC option tells gcc to generate "position-independent code". This allows the shared library to be loaded at any memory address at run time. This is necessary because, unlike a statically linked program, where everything required to run the program can be laid out in a contiguous block of memory at link time, there is no way of knowing at link time where the shared library code will be located. That is only decided when it's finally loaded into memory at run time.

We can then create an executable for prog.c while specifying the shared library (libfunc.so) that must be loaded at runtime:


  $ gcc -Wall -o prog prog.c libfunc.so

But our executable fails to run because libfunc.so isn't in one of the default directories that the dynamic linker looks into:


  $ ./prog
  ./prog: error while loading shared libraries: libfunc.so: cannot open shared object file: No such file or directory

We can use the LD_LIBRARY_PATH variable from yesterday to tell the dynamic linker to look for the library in the current working directory:


  $ LD_LIBRARY_PATH=. ./prog
  mod1 says hello!
  mod2 says hello!
  mod3 says hello!
  Hello, world!

What is position-independent code?

Position-independent code can be loaded and executed at any memory address, in contrast to absolute code which must be loaded at a specific address to function correctly.[1] It uses two data structures to do that: Procedure Linkage Tables (PLTs) and Global Offset Tables (GOTs), which sort of act like caches.

The dynamic linker fills in the addresses of functions, which are called through the PLT, and the addresses of global variables in the GOT. GOT entries for variables are resolved when the library is loaded, while function addresses are resolved lazily, the first time the code actually calls the function in question. This laziness can be overridden by setting the LD_BIND_NOW environment variable when running the program.

I'm still trying to visualize both these data structures. These posts contain a lot more detail about each one.

Some useful tools

ldd: Can be used to display all the shared libraries that a program (or a shared library) requires to run.
objdump: Can be used to obtain various information, like disassembled binary machine code, symbol table, etc. from an executable file, compiled object, or shared library.
readelf: Similar to objdump, but specific to ELF files.
nm: Can be used to list all the symbols defined within an object file, library, or executable program.
LD_DEBUG: An environment variable that can be used to monitor the dynamic linker.


  $ LD_DEBUG=help /bin/ls
  Valid options for the LD_DEBUG environment variable are:

    libs        display library search paths
    reloc       display relocation processing
    files       display progress for input file
    symbols     display symbol table processing
    bindings    display information about symbol binding
    versions    display version dependencies
    scopes      display scope information
    all         all previous options combined
    statistics  display relocation statistics
    unused      determined unused DSOs
    help        display this help message and exit

  To direct the debugging output into a file instead of standard output
  a filename can be specified using the LD_DEBUG_OUTPUT environment variable.

I saw ELF mentioned a lot everywhere, so I'll look into that tomorrow! Today I also paired with Dan on a test && commit || revert tool that he's writing in Rust. We bumped into an "immutable reference" issue (I'm still new to Rust!) but eventually figured it out and fixed it.

Vinayak Mehta