Day 31 — Spying on ghostscript

I recently (at the start of my batch) found out about strace from this zine by Julia Evans! Today I used it to look at what ghostscript does under the hood (system calls!) to do a PDF to PNG conversion. I also read through the strace manual page, which has an example comparing the error shown when a program tries to open a non-existent file to a porch light being on when nobody is home!

  $ man strace
  For example, retrying the "ls -l" example with a non-existent file produces the following line:

      lstat("/foo/bar", 0xb004) = -1 ENOENT (No such file or directory)

  In this case the porch light is on but nobody is home.

For this exercise I cloned and built ghostscript from its development branch instead of using the one already installed on my system. The man page example above turned out to be useful, because I saw a lot of ENOENT lines like that in the very large output when I ran:

  $ strace -f ./gs -sDEVICE=png16m -o foo.png -r300 foo.pdf

This large output was overwhelming so I narrowed down the scope to openat system calls, to just look at all the files ghostscript was opening.

  $ strace -f -e openat ./gs -sDEVICE=png16m -o foo.png -r300 foo.pdf
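
Even the filtered output is long, so one option is to post-process it. Here's a minimal sketch, assuming the strace output was saved as text and that the openat lines look like the ones in this post. The names OPENAT_RE, parse_openat, missing_paths and SAMPLE are made up for illustration, and the regex ignores edge cases such as escaped quotes in paths or lines split across processes by -f.

```python
import re

# Matches one openat line as printed by strace, for example:
#   openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5
#   openat(AT_FDCWD, "/x", O_RDONLY) = -1 ENOENT (No such file or directory)
OPENAT_RE = re.compile(
    r'openat\([^,]+, "(?P<path>[^"]*)", (?P<flags>[^)]*)\)'
    r'\s+= (?P<ret>-?\d+)(?: (?P<errno>[A-Z0-9]+))?'
)


def parse_openat(line):
    """Return a dict describing an openat line, or None if it isn't one."""
    match = OPENAT_RE.search(line)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "flags": match.group("flags"),
        "ret": int(match.group("ret")),   # file descriptor, or -1 on failure
        "errno": match.group("errno"),    # e.g. "ENOENT", or None on success
    }


def missing_paths(lines):
    """Collect the paths of every openat call that failed with ENOENT."""
    parsed = (parse_openat(line) for line in lines)
    return [p["path"] for p in parsed if p and p["errno"] == "ENOENT"]


# Sample lines copied from the strace output shown in this post.
SAMPLE = """\
openat(AT_FDCWD, "/usr/local/share/ghostscript/9.54.0/Resource/Init/Halftone/Default", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5
openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
"""

for path in missing_paths(SAMPLE.splitlines()):
    print("missing:", path)
```

Using search instead of match means the pattern should also tolerate a prefix before the call, such as the pid that strace adds when following child processes.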

Even though this ghostscript build is located in my home folder at dev/ghostpdl/bin, it looked for some resources in /usr/local/share/ghostscript/9.54.0, which it couldn't find because I don't have that latest version installed!

There's a Resource folder at the root of the build directory but it didn't even look there. I guess these are fonts and some other resources that ghostscript requires. I thought these were hard requirements, but this assumption was broken when it still produced a nice PNG file! Maybe I'm missing something.

  openat(AT_FDCWD, "/usr/local/share/ghostscript/9.54.0/Resource/Init/Halftone/Default", O_RDONLY) = -1 ENOENT (No such file or directory)

It also opened libpng, which makes sense for a PDF to PNG conversion!

  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/", O_RDONLY|O_CLOEXEC) = 3

It then read the input PDF file:

  openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5

This is where a lot of reads and lseeks happened. I guess at this point ghostscript starts reading things from the PDF and starts converting them to some kind of a PNG data structure in C. This was followed by another openat where it opened the output PNG file:

  openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6

This was followed by a lot of pwrite64 calls. I guess this is where it starts writing things into the PNG file.
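
To get a feel for what those calls do, here's a small sketch that repeats the same open, lseek, read, pwrite pattern from Python using the os module, whose functions are thin wrappers around the system calls. It only illustrates the syscalls and says nothing about what ghostscript does internally. The function name copy_with_syscalls and the file names are made up, and os.pwrite is only available on Unix.

```python
import os
import tempfile


def copy_with_syscalls(src, dst, chunk_size=4096):
    """Copy src to dst with low-level os calls; return the bytes written."""
    in_fd = os.open(src, os.O_RDONLY)
    out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        # Seek to the end to learn the size, then rewind to the start.
        size = os.lseek(in_fd, 0, os.SEEK_END)
        os.lseek(in_fd, 0, os.SEEK_SET)

        offset = 0
        while offset < size:
            chunk = os.read(in_fd, chunk_size)
            if not chunk:
                break
            # pwrite writes at an explicit offset and may write less than
            # asked, so keep going until the whole chunk is on disk.
            written = 0
            while written < len(chunk):
                written += os.pwrite(out_fd, chunk[written:], offset + written)
            offset += written
        return offset
    finally:
        os.close(in_fd)
        os.close(out_fd)


# Small demo on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")
    dst = os.path.join(tmp, "out.bin")
    with open(src, "wb") as f:
        f.write(b"hello strace " * 1000)
    copied = copy_with_syscalls(src, dst)
    print("copied", copied, "bytes")
```

Running a script like this under strace with the same -e filter idea should show a similar sequence of calls to the one ghostscript produced, just on a much smaller scale.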

Now I need to look at its large codebase and pinpoint the code which does things between opening the PDF file and writing the PNG file. Are there any strace-like tools to trace C code execution paths?

I also created this list of other tools that can do a PDF to PNG conversion:

MuPDF is by the same folks as ghostscript, and it has a third-party Python wrapper called PyMuPDF. I looked at its PyPI page and saw that it has wheels for all platforms and archs! I guess I could replace ghostscript with PyMuPDF, or have both as image conversion backends users can choose from, because probing around and looking through large C/C++ codebases is time-consuming (I don't know enough of either of those languages).

I was interested to learn how PyMuPDF wraps MuPDF and generates all those wheels. It uses SWIG for the wrapping bit which looked super interesting, so I went ahead and did their tutorial!

You just need to write an example.c (with your C code), then write an example.i (where you declare the functions you want to export from your C code), and finally run SWIG on the second file. It produces a Python wrapper module and an example_wrap.c, which has a lot of Python C-API and PyObjects in it. It includes Python.h too.

  $ swig -python example.i

You then compile your C file and SWIG's C file with gcc, while also passing in the path to Python's header files using -I.

  $ gcc -c example.c example_wrap.c -I/usr/include/python3.8

This generates two object files (example.o and example_wrap.o), which can be linked into a shared object (.so) file using something called ld:

  $ ld -shared example.o example_wrap.o -o
  ld: example_wrap.o: relocation R_X86_64_PC32 against undefined symbol `PyExc_MemoryError' can not be used when making a shared object; recompile with -fPIC
  ld: final link failed: bad value

That raised an error, and I had to recompile both C files with -fPIC for it to work. Looks like -fPIC (PIC = Position Independent Code) "generates machine code that is not dependent on being located at a specific address in order to work".
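
The recompile step could look like the sketch below, which only builds and prints the gcc command instead of running it. Looking up the -I path with sysconfig rather than hard-coding /usr/include/python3.8 is my own assumption about a more portable approach, not something from the SWIG tutorial, and compile_command is a hypothetical helper name.

```python
import sysconfig


def compile_command(name="example"):
    """Build the gcc command that compiles both C files with -fPIC."""
    # Directory containing Python.h for the interpreter running this script.
    include_dir = sysconfig.get_paths()["include"]
    return [
        "gcc",
        "-fPIC",              # position independent code, needed for a shared object
        "-c",                 # compile only, producing .o files without linking
        f"{name}.c",
        f"{name}_wrap.c",
        f"-I{include_dir}",
    ]


print(" ".join(compile_command()))
```

The printed command can be pasted into a shell, or the list can be handed to subprocess.run once gcc and the Python headers are installed.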

Once you have the .so file, you can import the example module in your Python code and call a factorial function written in C!

  >>> import example
  >>> dir(example)
  ['_SwigNonDynamicMeta', '__builtin__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_example',   '_swig_add_metaclass', '_swig_python_version_info', '_swig_repr', '_swig_setattr_nondynamic_class_variable', '_swig_setattr_nondynamic_instance_variable', 'cvar',   'fact', 'get_time', 'my_mod']
  >>> example.fact(5)

What are these dark arts? I have so many questions about what is happening here! I've only ever compiled one C file into an a.out using gcc. What does it mean to compile multiple C files? How does that work? How do C modules work? What is position independent code? What is ld? And how does Python import a .so file?!
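
I can at least poke at the last question. As far as I understand, Python's import system treats files with certain suffixes as compiled extension modules, loads them as shared libraries, and calls an init function named after the module (PyInit_ followed by the module name). The sketch below just lists those suffixes for the running interpreter; is_extension_filename is a made-up helper and the file names are only examples.

```python
import importlib.machinery

# File suffixes that this interpreter recognises as compiled extension
# modules. On Linux the list typically includes a plain ".so" entry.
SUFFIXES = importlib.machinery.EXTENSION_SUFFIXES


def is_extension_filename(filename):
    """Return True if the import system would treat filename as an extension."""
    return any(filename.endswith(suffix) for suffix in SUFFIXES)


print("extension suffixes:", SUFFIXES)
for candidate in ["example.c", "example_wrap.o", "_example" + SUFFIXES[-1]]:
    print(candidate, "->", is_extension_filename(candidate))
```

This also hints at why dir(example) lists an _example entry: the Python module that SWIG generates imports the compiled extension under that name.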

I guess it makes sense to just use existing tools (from the list above) to do a PDF to PNG conversion instead of probing around those large codebases to try and reinvent the wheel. I just need to learn how to package them nicely for different platforms and archs. I know PyMuPDF, opencv-python, and cryptography do this so I already have templates to look into!