Day 31 — Spying on ghostscript
22 September 2020 · recurse-center · strace
I recently (at the start of my batch) found out about strace from this zine by Julia Evans! Today I used it to look at things ghostscript does under the hood (system calls!) to do a PDF to PNG conversion. I also read through the manual page for strace, which had this one example where they compare the errors shown when a program tries to open non-existent files to a porch light being on when nobody is really home!
$ man strace
...
For example, retrying the "ls -l" example with a non-existent file produces the following line:
lstat("/foo/bar", 0xb004) = -1 ENOENT (No such file or directory)
In this case the porch light is on but nobody is home.
...
I cloned and built ghostscript on its development branch for this exercise, and didn't use the one already installed on my system. The above example was useful as I saw a lot of those lines in the very large output when I ran:
$ strace -f ./gs -sDEVICE=png16m -o foo.png -r300 foo.pdf
This large output was overwhelming, so I narrowed down the scope to openat system calls, to just look at all the files ghostscript was opening.
$ strace -f -e openat ./gs -sDEVICE=png16m -o foo.png -r300 foo.pdf
Even though this ghostscript build is located in my home folder at dev/ghostpdl/bin, it looked for some resources in /usr/local/share/ghostscript/9.54.0, which it couldn't find because I don't have that latest version installed! There's a Resource folder at the root of the build directory but it didn't even look there. I guess these are fonts and some other resources that ghostscript requires. I thought these were hard requirements, but this assumption was broken when it still produced a nice PNG file! Maybe I'm missing something.
openat(AT_FDCWD, "/usr/local/share/ghostscript/9.54.0/Resource/Init/Halftone/Default", O_RDONLY) = -1 ENOENT (No such file or directory)
It also opened libpng, which makes sense for a PDF to PNG conversion!
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libpng16.so.16", O_RDONLY|O_CLOEXEC) = 3
It then read the input PDF file:
openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5
This is where a lot of reads and lseeks happened. I guess at this point ghostscript starts reading things from the PDF and starts converting them to some kind of a PNG data structure in C. This was followed by another openat where it opened the output PNG file:
openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
Which was followed by a lot of pwrite64s. I guess this is where it starts writing things into the PNG file.
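The openat lines quoted above can be sorted into "opened" and "missing" without reading them one by one. Here is a minimal Python sketch of that idea. The regular expression and the split_opens name are illustrative assumptions, tuned only to the four lines quoted in this post; real strace output comes in more shapes than this handles.

```python
import re

# Matches lines such as:
#   openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5
#   openat(AT_FDCWD, "/x", O_RDONLY) = -1 ENOENT (No such file or directory)
# With `strace -f`, lines can carry a pid prefix, so we search inside the
# line instead of matching from its start.
OPENAT_RE = re.compile(
    r'openat\([^,]+, "(?P<path>[^"]*)", [^,)]+(?:, [0-7]+)?\)'
    r' = (?P<ret>-?\d+)(?: (?P<errno>[A-Z0-9]+))?'
)


def split_opens(lines):
    """Return (opened, missing) from an iterable of strace output lines.

    opened  is a list of paths whose openat call returned a file descriptor.
    missing is a list of (path, errno_name) for calls that returned -1.
    """
    opened, missing = [], []
    for line in lines:
        match = OPENAT_RE.search(line)
        if match is None:
            continue  # not an openat line, or a shape this sketch ignores
        if int(match.group("ret")) < 0:
            missing.append((match.group("path"), match.group("errno")))
        else:
            opened.append(match.group("path"))
    return opened, missing


# The four openat lines quoted in this post.
SAMPLE = [
    'openat(AT_FDCWD, "/usr/local/share/ghostscript/9.54.0/Resource/Init/Halftone/Default", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libpng16.so.16", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5',
    'openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6',
]

opened, missing = split_opens(SAMPLE)
print("opened:", opened)
print("missing:", missing)
```

strace writes its trace to stderr, so to feed a real run into split_opens you would first save the trace to a file with strace's -o option and then pass the open file object in place of SAMPLE.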
Now I need to look at its large codebase and pinpoint the code which does things between opening the PDF file and writing the PNG file. Are there any strace-like tools to trace C code execution paths?
I also created this list of other tools that can do a PDF to PNG conversion:
- MuPDF
- XpdfReader
- pdftoppm from poppler-utils, a fork of XpdfReader
- PDFBox, written in Java
- pdf.js, written in JavaScript
MuPDF is by the same ghostscript folks, and it has a third-party Python wrapper called PyMuPDF. I looked at its PyPI page and saw that it has wheels for all platforms and archs! I guess I could replace ghostscript with PyMuPDF, or have both as image conversion backends users can choose from, because probing around and looking through large C/C++ codebases is time-consuming (I don't know enough of either language).
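To make the "both as backends" idea concrete, here is a rough sketch of what a switchable conversion function could look like. Everything in it (the BACKENDS dict, the convert function, the helper names) is hypothetical. The ghostscript path reuses the command line from earlier in this post. The PyMuPDF path assumes a reasonably recent PyMuPDF where fitz.open, page.get_pixmap and pix.save are available, and it only renders the first page.

```python
import subprocess


def ghostscript_command(pdf_path, png_path, dpi=300, gs="gs"):
    """Build the argument list for the gs invocation used in this post."""
    return [gs, "-sDEVICE=png16m", "-o", png_path, f"-r{dpi}", pdf_path]


def convert_with_ghostscript(pdf_path, png_path, dpi=300):
    # check=True raises CalledProcessError if gs exits with an error.
    subprocess.run(ghostscript_command(pdf_path, png_path, dpi), check=True)


def convert_with_pymupdf(pdf_path, png_path, dpi=300):
    # Imported lazily so ghostscript-only users don't need PyMuPDF installed.
    import fitz  # PyMuPDF's import name

    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(0)  # assumption: first page only
        zoom = dpi / 72  # PDF user space is 72 units per inch
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(png_path)
    finally:
        doc.close()


# Hypothetical registry mapping a backend name to its conversion function.
BACKENDS = {
    "ghostscript": convert_with_ghostscript,
    "pymupdf": convert_with_pymupdf,
}


def convert(pdf_path, png_path, backend="ghostscript", dpi=300):
    """Convert a PDF to a PNG with the chosen backend."""
    try:
        func = BACKENDS[backend]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown backend {backend!r}, expected one of: {known}") from None
    func(pdf_path, png_path, dpi)
```

A call like convert("foo.pdf", "foo.png", backend="pymupdf") would then pick the backend by name, and neither dependency is touched until its function actually runs.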
I was interested to learn how PyMuPDF wraps MuPDF and generates all those wheels. It uses SWIG for the wrapping bit which looked super interesting, so I went ahead and did their tutorial!
You just need to write an example.c (with your C code), then write an example.i (where you declare functions you want to export from your C code), and finally run SWIG on the second file. It produces an example.py and an example_wrap.c which has a lot of Python C-API and PyObjects in it. It includes Python.h too.
$ swig -python example.i
example.py
example_wrap.c
You then compile your C file and SWIG's C file with gcc, while also passing in the path to Python's header files using -I.
$ gcc -c example.c example_wrap.c -I/usr/include/python3.8
example.o
example_wrap.o
This generates two object files, which can be linked into an _example.so file using something called ld:
$ ld -shared example.o example_wrap.o -o _example.so
ld: example_wrap.o: relocation R_X86_64_PC32 against undefined symbol `PyExc_MemoryError' can not be used when making a shared object; recompile with -fPIC
ld: final link failed: bad value
That raised an error, and I had to recompile both C files with -fPIC for it to work. Looks like -fPIC (PIC = Position Independent Code) "generates machine code that is not dependent on being located at a specific address in order to work".
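Putting the three steps together with the -fPIC fix, the whole build could be scripted roughly like this. It is only a sketch: build_commands and run_build are hypothetical helper names, the commands are the ones from this post with -fPIC added to the gcc step, and the include directory defaults to whatever the running Python reports instead of a hard-coded /usr/include/python3.8.

```python
import shutil
import subprocess
import sysconfig


def build_commands(module="example", include_dir=None):
    """Return the swig, gcc and ld command lines for a SWIG module."""
    if include_dir is None:
        # Directory containing Python.h for the interpreter running this script.
        include_dir = sysconfig.get_paths()["include"]
    return [
        # 1. Generate <module>.py and <module>_wrap.c from the interface file.
        ["swig", "-python", f"{module}.i"],
        # 2. Compile both C files to object files as position independent code.
        ["gcc", "-fPIC", "-c", f"{module}.c", f"{module}_wrap.c", f"-I{include_dir}"],
        # 3. Link the object files into the shared object that Python imports.
        ["ld", "-shared", f"{module}.o", f"{module}_wrap.o", "-o", f"_{module}.so"],
    ]


def run_build(module="example", include_dir=None):
    """Run each build step in order, stopping at the first failure."""
    for command in build_commands(module, include_dir):
        tool = command[0]
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
        print("$", " ".join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_build()
```

Running it from the directory that holds example.c and example.i should leave an _example.so next to the generated example.py, assuming swig, gcc and ld are all installed.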
Once you have the _example.so, you can import the example module in your Python code and call a factorial function written in C!
>>> import example
>>> dir(example)
['_SwigNonDynamicMeta', '__builtin__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_example', '_swig_add_metaclass', '_swig_python_version_info', '_swig_repr', '_swig_setattr_nondynamic_class_variable', '_swig_setattr_nondynamic_instance_variable', 'cvar', 'fact', 'get_time', 'my_mod']
>>> example.fact(5)
120
What are these dark arts? I have so many questions about what is happening here! I've only ever compiled one C file into an a.out using gcc. What does it mean to compile multiple C files? How does that work? How do C modules work? What is position independent code? What is ld? And how does Python import a .so file?!
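A small peek at the last question: Python's import system keeps a list of filename suffixes that it treats as compiled extension modules, and a file matching one of them is handed to a loader that opens the shared library and calls its PyInit_<name> function (PyInit__example for _example). The sketch below only inspects that machinery with the standard library. The helper names extension_filenames and find_extension are made up for illustration, and the exact suffixes depend on your platform and Python build.

```python
import importlib.machinery
import importlib.util


def extension_filenames(module_name):
    """Filenames the import system would accept as this extension module."""
    return [
        module_name + suffix
        for suffix in importlib.machinery.EXTENSION_SUFFIXES
    ]


def find_extension(module_name):
    """Return the path of a compiled extension module, or None.

    None means the module was not found, or it is not loaded from a
    shared library (for example pure Python or built into the interpreter).
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    if not isinstance(spec.loader, importlib.machinery.ExtensionFileLoader):
        return None
    return spec.origin


if __name__ == "__main__":
    # On Linux this list usually ends with plain "_example.so".
    print(extension_filenames("_example"))
    # Prints a path only if _example.so is importable from here.
    print(find_extension("_example"))
```

So when the SWIG-generated example.py imports _example, the _example.so built earlier is one of the filenames Python goes looking for on sys.path.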
I guess it makes sense to just use existing tools (from the list above) to do a PDF to PNG conversion instead of probing around those large codebases to try and reinvent the wheel. I just need to learn how to package them nicely for different platforms and archs. I know PyMuPDF, opencv-python, and cryptography do this so I already have templates to look into!