Day 35 — What's inside an ELF executable? (symver edition)

I saw "ELF" being mentioned a lot while reading about linkers yesterday, so I decided to look into it. Turns out it's a file format for executables. Just hearing the word "executable" used to give me the chills because I couldn't look at what's inside one. (One time I ran cat on an executable to look at its contents, only to get an incomprehensible wall of magic text!)

But today I learned that ELF is a documented standard, and I can use readelf and objdump to look inside an ELF file! I also read chapter 42 of The Linux Programming Interface and found out about symbol versioning! In this post, I'll try to summarize some things I learned about both for future me!

What's an ELF?

ELF stands for Executable and Linkable Format and it's a file format for executables like I mentioned above.

Let's say we have a hello.c with the following C code:


  #include <stdio.h>

  int main()
  {
      puts("Hello, world!");
      return 0;
  }

When we compile it into an executable with gcc (without specifying the output name) we get an a.out.


  $ gcc hello.c
  $ ./a.out
  Hello, world!

If we check the file type of this a.out executable, it's ELF!


  $ file a.out
  a.out: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=<sha>, for GNU/Linux 3.2.0, not stripped

a.out was a file format on old Unix systems. It remains the default output file name for executables created by certain modern compilers, but its file format is ELF and not the older a.out! There are a lot of executable file formats!

It was ELF all along!

An ELF executable contains (1) a file header, (2) program headers, (3) section headers, and (4) the data that the entries in (2) and (3) point to. This image has a nice walkthrough and visual representation of each component in an ELF executable. We can use readelf to view all of these different components (headers and data).

File header and program headers

To view the file header (which contains some metadata about the executable), we can use readelf with the --file-header command-line option! Similarly, we can use --program-headers for the program headers.


  $ readelf --file-header a.out
  ELF Header:
    Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
    Class:                             ELF64
    Data:                              2's complement, little endian
    Version:                           1 (current)
    OS/ABI:                            UNIX - System V
    ...

The first field is a magic number! I found the same magic number in every ELF executable I created today. I'm not sure why it's called magic though.


  $ python
  >>> print(bytearray.fromhex('7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00').decode('ascii'))
  ELF

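The magic number is printed in hexadecimal. Its first four bytes are 0x7f followed by "ELF" in ASCII. The remaining bytes are not printable text (they encode things like the class, byte order, and version that readelf shows on the following lines), which is why only "ELF" is visible.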
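
To make that concrete for future me, here's a minimal Python sketch that splits those 16 bytes into the fields readelf prints right below the Magic line. The function name decode_ident and the two lookup tables are my own naming, and the tables only cover the most common values, so treat it as an illustration rather than a full parser.

```python
# The 16 "Magic" bytes copied from the readelf output above.
E_IDENT = bytes.fromhex("7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00")

# Only the common values are listed; anything else is reported as "unknown".
ELF_CLASS = {1: "ELF32", 2: "ELF64"}
ELF_DATA = {1: "little endian", 2: "big endian"}


def decode_ident(ident: bytes) -> dict:
    """Split the identification bytes into the fields readelf prints first."""
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: bad magic number")
    return {
        "magic": ident[:4],                          # 0x7f followed by "ELF"
        "class": ELF_CLASS.get(ident[4], "unknown"),  # byte 4: 32-bit or 64-bit
        "data": ELF_DATA.get(ident[5], "unknown"),    # byte 5: byte order
        "version": ident[6],                         # byte 6: ELF version
        "osabi": ident[7],  # byte 7: 0 is what readelf calls "UNIX - System V"
    }


fields = decode_ident(E_IDENT)
print(fields["class"], "/", fields["data"])
```

The check on the first four bytes is the same idea tools rely on to recognize an ELF file before reading anything else.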

Section headers

To view the section headers, we can use readelf with the --section-headers command-line option!


  $ readelf --section-headers a.out
  There are 31 section headers, starting at offset 0x3970:

  Section Headers:
    [Nr] Name              Type             Address           Offset
         Size              EntSize          Flags  Link  Info  Align
    ...
    [13] .plt              PROGBITS         0000000000001020  00001020
         0000000000000020  0000000000000010  AX       0     0     16
    ...
    [16] .text             PROGBITS         0000000000001060  00001060
         0000000000000185  0000000000000000  AX       0     0     16
    ...
    [18] .rodata           PROGBITS         0000000000002000  00002000
         0000000000000012  0000000000000000   A       0     0     4
    ...
    [24] .got              PROGBITS         0000000000003fb8  00002fb8
         0000000000000048  0000000000000008  WA       0     0     8
    [25] .data             PROGBITS         0000000000004000  00003000
         0000000000000010  0000000000000000  WA       0     0     8
    [26] .bss              NOBITS           0000000000004010  00003010
         0000000000000008  0000000000000000  WA       0     0     1
    ...

These are the sections I could wrap my head around, apart from the procedure linkage table (.plt) and global offset table (.got) sections that I mentioned yesterday!
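
As far as I understand it, readelf finds the section header table by reading three fields from the file header: e_shoff (where the table starts), e_shentsize (the size of one entry), and e_shnum (the number of entries). Here's a rough Python sketch of that lookup. To keep it self-contained, it builds a synthetic 64-byte header instead of opening a.out: the offset 0x3970 and the count 31 come from the readelf output above, while the entry point, the program header count, and the string table index are placeholder values I made up.

```python
import struct

# Layout of the 64-byte ELF64 file header, little endian.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
FIELDS = ("e_ident", "e_type", "e_machine", "e_version", "e_entry",
          "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
          "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx")


def parse_ehdr(raw: bytes) -> dict:
    """Unpack the first 64 bytes of an ELF64 file into named fields."""
    return dict(zip(FIELDS, EHDR.unpack(raw[:EHDR.size])))


def table_extent(offset: int, entsize: int, count: int) -> range:
    """Byte range in the file that a header table occupies."""
    return range(offset, offset + entsize * count)


# Synthetic header. Only e_shoff (0x3970) and e_shnum (31) are taken from
# the readelf output in this post; e_entry, e_phnum (13), and e_shstrndx (30)
# are made-up placeholders.
ident = bytes.fromhex("7f454c46020101000000000000000000")
raw = EHDR.pack(ident, 3, 62, 1, 0x1060, 64, 0x3970, 0, 64, 56, 13, 64, 31, 30)

hdr = parse_ehdr(raw)
sections = table_extent(hdr["e_shoff"], hdr["e_shentsize"], hdr["e_shnum"])
print(f"{hdr['e_shnum']} section headers, starting at offset {hdr['e_shoff']:#x}")
print(f"section header table ends at offset {sections.stop:#x}")
```

To try it on a real 64-bit little-endian file, something like parse_ehdr(open("a.out", "rb").read(64)) should work in place of the synthetic bytes.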

More about all these sections here.

Symbol table

After struggling to understand what symbols are (a fancy word for a function or a variable), it was a relief to see a full symbol table stored in the executable with readelf --symbols!

We can see the puts (belonging to glibc 2.2.5) and main functions! (scroll right in the code block)


  $ readelf --symbols a.out
  ...
  Symbol table '.symtab' contains 65 entries:
     Num:    Value          Size Type    Bind   Vis      Ndx Name
      ...
      49: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND puts@@GLIBC_2.2.5
      ...
      61: 0000000000001149    27 FUNC    GLOBAL DEFAULT   16 main
      ...

We can also use objdump to look at the symbol table:


  $ objdump --syms a.out

  a.out:     file format elf64-x86-64

  SYMBOL TABLE:
  ...
  0000000000000000       F  *UND*     0000000000000000   puts@@GLIBC_2.2.5
  ...
  0000000000001149  g    F .text      000000000000001b   main
  ...

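Paraphrasing the man page for objdump, here's what each column in this symbol table means: the first column is the symbol's value (its address), the second is a set of flag characters, the third is the section the symbol belongs to (*UND* if it's undefined in this file), the fourth is the symbol's size (or its alignment, for common symbols), and the last is the symbol's name.

In the example above, we can see that puts and main have the F flag because they are functions. The flag characters are divided into 7 groups as follows: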

Group 1

  Flag    Description
  "l"     Symbol is local
  "g"     Symbol is global
  "u"     Symbol is a unique global
  Space   Neither global nor local
  "!"     Both global and local

Group 2

  Flag    Description
  "w"     A weak symbol
  Space   A strong symbol

Group 3

  Flag    Description
  "C"     Symbol is constructor
  Space   An ordinary symbol

Group 4

  Flag    Description
  "W"     Symbol is a warning
  Space   A normal symbol

Group 5

  Flag    Description
  "I"     An indirect reference to another symbol
  "i"     A function to be evaluated during reloc processing
  Space   A normal symbol

Group 6

  Flag    Description
  "d"     A debugging symbol
  "D"     A dynamic symbol
  Space   A normal symbol

Group 7

  Flag    Description
  "F"     Name of a function
  "f"     Name of a file
  "O"     Name of an object
  Space   A normal symbol
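
Since each group corresponds to one character position in the flag field, the seven groups above can be turned into a tiny decoder. This is only a sketch: decode_flags is a name I made up, and it assumes the flag field has already been isolated as exactly seven characters (one per group, with a space meaning "nothing set in this group").

```python
# One (flag -> meaning) table per character position of the flag field.
FLAG_GROUPS = [
    {"l": "local", "g": "global", "u": "unique global",
     "!": "both global and local"},
    {"w": "weak"},
    {"C": "constructor"},
    {"W": "warning"},
    {"I": "indirect reference", "i": "evaluated during reloc processing"},
    {"d": "debugging", "D": "dynamic"},
    {"F": "function", "f": "file", "O": "object"},
]


def decode_flags(flags: str) -> list:
    """Return the meanings of the non-space characters in a flag field."""
    if len(flags) != len(FLAG_GROUPS):
        raise ValueError("expected exactly 7 flag characters")
    meanings = []
    for char, group in zip(flags, FLAG_GROUPS):
        if char != " ":
            meanings.append(group.get(char, f"unknown flag {char!r}"))
    return meanings


print(decode_flags("g     F"))  # a global function, like main
print(decode_flags("      F"))  # a function with no other flags, like puts
```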

Symbol Versioning

Each symbol can be versioned while creating a shared library. This means that we can define multiple versions of the same function, and each executable will keep using the version of the function that was "current" when it was linked against the shared library.

Let's say we have the following code which is just a hello function that prints something:


  #include <stdio.h>

  void hello(void)
  {
      puts("Hello, v1!");
  }

We can define a version script:


  $ cat hello_v1.map
  VER_1 {
      global: hello;
      local: *; # Hide all other symbols
  };

In the script, global: hello; ensures that only the hello function is "exported". Everything else remains hidden because of local: *;. We also add the VER_1 tag to our "export configuration". You can also check out glibc's version script.
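
Here's how I picture the global/local filtering, as a toy Python model. It is not how the linker is implemented, and the real matching rules are more involved; the function exported_symbols and the helper symbol format_greeting are hypothetical names I invented for the example.

```python
from fnmatch import fnmatchcase


def exported_symbols(defined, global_names, local_patterns):
    """Return the symbol names that stay visible outside the library."""
    visible = []
    for name in defined:
        if name in global_names:
            visible.append(name)   # listed under global: exported
        elif any(fnmatchcase(name, pat) for pat in local_patterns):
            continue               # matched by a local pattern: hidden
        else:
            visible.append(name)   # not mentioned at all: left as it was
    return visible


# hello is the function from hello.c; format_greeting is a made-up helper
# that stands in for "everything else" in the library.
defined = ["hello", "format_greeting"]
print(exported_symbols(defined, {"hello"}, ["*"]))
```

The explicit name is checked before the wildcard, which is why hello survives even though "*" would also match it.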

We can use this version script to create a shared library called libhello.so:


  $ gcc -c -fPIC -Wall hello.c
  $ gcc -shared -o libhello.so hello.o -Wl,--version-script,hello_v1.map

Now let's say we have a prog.c where we use this hello function. We can create an executable p1 and link it against libhello.so:


  $ cat prog.c
  void hello(void);

  int main()
  {
      hello();
      return 0;
  }
  $ gcc -o p1 prog.c libhello.so
  $ LD_LIBRARY_PATH=. ./p1
  Hello, v1!

When we look at p1's symbol table, we can see that it's using the hello function with the VER_1 tag:


  $ objdump --syms p1 | grep hello
  0000000000000000       F *UND*    0000000000000000              hello@@VER_1

But what if we want to modify the definition of this hello function, while ensuring that p1 continues to function by using the old version?

We can rename the old function to hello_old and define a new function called hello_new! To do this, we need to use the .symver assembler directive to tie both these functions to different version tags.


  #include <stdio.h>

  __asm__(".symver hello_old,hello@VER_1");
  __asm__(".symver hello_new,hello@@VER_2");

  void hello_old(void)
  {
      puts("Hello, v1!");
  }

  void hello_new(void)
  {
      puts("Hello, v2!");
  }

  void world(void)
  {
      puts("World, v2!");
  }

Our new version script looks like this:


  $ cat hello_v2.map
  VER_1 {
      global: hello;
      local: *; # Hide all other symbols
  };

  VER_2 {
      global: world;
  } VER_1;

In the .symver directives, hello_new is bound with @@ instead of @, which makes VER_2 the default version of hello, so any new executable linked against our shared library uses the new definition. A symbol can have only one default version. In the version script, the } VER_1; on the last line indicates that VER_2 has a dependency on VER_1, which means that VER_2 "inherits" the export configuration from VER_1, while also exporting a new world function.
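
To convince myself that I understood the @ versus @@ distinction, I wrote a toy Python model of it. It only imitates the bookkeeping (which version gets recorded at link time, and which definition is picked at run time); ToyLibrary, parse_symver, and the method names are all invented for this sketch, and I'm modeling the first release as if hello had been declared with hello@@VER_1.

```python
def parse_symver(spec: str):
    """Split 'name@VER' or 'name@@VER' into (name, version, is_default)."""
    name, _, rest = spec.partition("@")
    is_default = rest.startswith("@")
    return name, rest.lstrip("@"), is_default


class ToyLibrary:
    """Toy stand-in for a shared library that exports versioned symbols."""

    def __init__(self):
        self.definitions = {}  # (name, version) -> implementation
        self.defaults = {}     # name -> default version

    def symver(self, implementation, spec):
        name, version, is_default = parse_symver(spec)
        self.definitions[(name, version)] = implementation
        if is_default:
            self.defaults[name] = version

    def link(self, name):
        """Link time: record the default version inside the executable."""
        return name, self.defaults[name]

    def call(self, reference):
        """Run time: look up exactly the (name, version) recorded earlier."""
        return self.definitions[reference]()


old = ToyLibrary()  # first release: only VER_1 exists
old.symver(lambda: "Hello, v1!", "hello@@VER_1")
p1_ref = old.link("hello")  # p1 is linked against the first release

new = ToyLibrary()  # second release: both versions are defined
new.symver(lambda: "Hello, v1!", "hello@VER_1")
new.symver(lambda: "Hello, v2!", "hello@@VER_2")
p2_ref = new.link("hello")  # p2 is linked against the second release

print(new.call(p1_ref))  # p1 running against the new library
print(new.call(p2_ref))  # p2 running against the new library
```

The point of the model is that the version is chosen once, when the executable is linked, and the new library still carries the old definition so that the recorded reference keeps resolving.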

Now when we build the new version of our library, we can use our new version script:


  $ gcc -c -fPIC -Wall hello.c
  $ gcc -shared -o libhello.so hello.o -Wl,--version-script,hello_v2.map

And when we create a new executable for prog.c, it uses the new definition of hello, while p1 continues to use the old one!


  $ gcc -o p2 prog.c libhello.so
  $ LD_LIBRARY_PATH=. ./p2
  Hello, v2!
  $ LD_LIBRARY_PATH=. ./p1
  Hello, v1!

Below are the symbol table entries for hello in p1 and p2. We can see that they reference different versions of hello, tagged with VER_1 and VER_2!


  $ objdump --syms p1 | grep hello
  0000000000000000       F *UND*    0000000000000000              hello@@VER_1
  $ objdump --syms p2 | grep hello
  0000000000000000       F *UND*    0000000000000000              hello@@VER_2