Day 31 — Spying on ghostscript
22 September 2020 · recurse-center · strace
I recently (at the start of my batch) found out about
strace from this zine by Julia Evans! Today I used it to look at things
ghostscript does under the hood (system calls!) to do a PDF to PNG conversion. I also read through the manual page for
strace which had this one example where they compare the errors shown when a program tries to open non-existent files, to a porch light being on when nobody is really home!
$ man strace
...
For example, retrying the "ls -l" example with a non-existent file produces the following line:

    lstat("/foo/bar", 0xb004) = -1 ENOENT (No such file or directory)

In this case the porch light is on but nobody is home.
...
I cloned and built
ghostscript on its development branch for this exercise, and didn't use the one already installed on my system. The above example was useful as I saw a lot of those lines in the very large output when I ran:
$ strace -f ./gs -sDEVICE=png16m -o foo.png -r300 foo.pdf
This large output was overwhelming so I narrowed down the scope to
openat system calls, to just look at all the files
ghostscript was opening.
$ strace -f -e openat ./gs -sDEVICE=png16m -o foo.png -r300 foo.pdf
Even though this
ghostscript build is located in my home folder at
dev/ghostpdl/bin, it looked for some resources in
/usr/local/share/ghostscript/9.54.0, which it couldn't find because I don't have that latest version installed! There is a
Resource folder at the root of the build directory, but it didn't even look there. I guess these are fonts and some other resources that
ghostscript requires. I thought these were hard requirements, but this assumption was broken when it still produced a nice PNG file! Maybe I'm missing something.
openat(AT_FDCWD, "/usr/local/share/ghostscript/9.54.0/Resource/Init/Halftone/Default", O_RDONLY) = -1 ENOENT (No such file or directory)
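Scrolling through hundreds of these lines by eye gets tiring, so here is a minimal sketch of how the `openat` lines could be sorted into successful and failed opens. It assumes the trace has been captured as text lines in the format shown in this post (with the optional `[pid N]` prefix that `strace -f` adds); the regular expression and the `split_opens` helper are my own illustration, not part of strace.

```python
import re

# Matches lines such as:
#   openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5
#   [pid 1234] openat(AT_FDCWD, "/x", O_RDONLY) = -1 ENOENT (No such file or directory)
OPENAT_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+)?openat\([^,]+,\s*"(?P<path>[^"]*)".*\)'
    r'\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<errno>[A-Z0-9]+))?'
)


def split_opens(trace_lines):
    """Return (opened_paths, failed) where failed is a list of (path, errno)."""
    opened, failed = [], []
    for line in trace_lines:
        match = OPENAT_RE.match(line.strip())
        if match is None:
            continue  # not a completed openat line
        if int(match.group("ret")) < 0:
            failed.append((match.group("path"), match.group("errno")))
        else:
            opened.append(match.group("path"))
    return opened, failed


# The four openat lines quoted in this post.
SAMPLE = [
    'openat(AT_FDCWD, "/usr/local/share/ghostscript/9.54.0/Resource/Init/Halftone/Default", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libpng16.so.16", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5',
    'openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6',
]

opened, failed = split_opens(SAMPLE)
print("opened:", opened)
print("failed:", failed)
```

Listing only the failed paths makes it easier to see which directories a build searches and which it never tries.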
It also opened
libpng, which makes sense to do a PDF to PNG conversion!
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libpng16.so.16", O_RDONLY|O_CLOEXEC) = 3
It then read the input PDF file:
openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5
This is where a lot of
lseeks happened. I guess at this point
ghostscript starts reading things from the PDF and starts converting them to some kind of a PNG data structure in C. This was followed by another
openat where it opened the output PNG file:
openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6
Which was followed by a lot of
pwrite64s. I guess this is where it starts writing things into the PNG file.
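To put rough numbers on "a lot of", the calls in a saved trace can be tallied by name. This is only a sketch: the `count_syscalls` helper is hypothetical, and the `lseek` and `pwrite64` sample lines are made-up stand-ins in strace's usual format, not lines copied from my real trace. (strace also has a `-c` option that prints a per-syscall summary itself.)

```python
import re
from collections import Counter

# A syscall line starts with an optional "[pid N]" prefix, then name and "(".
SYSCALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+)?(?P<name>[a-z_][a-z0-9_]*)\(')


def count_syscalls(trace_lines):
    """Count how many times each system call name appears in the trace."""
    counts = Counter()
    for line in trace_lines:
        match = SYSCALL_RE.match(line.strip())
        if match:
            counts[match.group("name")] += 1
    return counts


SAMPLE = [
    'openat(AT_FDCWD, "foo.pdf", O_RDONLY) = 5',
    'lseek(5, 0, SEEK_SET) = 0',           # hypothetical line
    'lseek(5, 4096, SEEK_SET) = 4096',     # hypothetical line
    'openat(AT_FDCWD, "foo.png", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 6',
    'pwrite64(6, "..."..., 4096, 0) = 4096',  # hypothetical line
    '+++ exited with 0 +++',               # not a syscall, ignored
]

counts = count_syscalls(SAMPLE)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```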
Now I need to look at its large codebase and pinpoint the code which does things between opening the PDF file and writing the PNG file. Are there any
strace-like tools to trace C code execution paths?
I also created this list of other tools that can do a PDF to PNG conversion:
- pdftoppm from poppler-utils, a fork of
- PDFBox, written in Java
MuPDF is by the same
ghostscript folks, and it has a third-party Python wrapper called
PyMuPDF. I looked at its PyPI page and saw that it has wheels for all platforms and archs! I guess I could replace
PyMuPDF instead, or have both as image conversion backends users can choose from, because probing around and looking through large C/C++ codebases is time-consuming (I don't know enough of either language).
I was interested to learn how PyMuPDF wraps
MuPDF and generates all those wheels. It uses
SWIG for the wrapping bit which looked super interesting, so I went ahead and did their tutorial!
You just need to write an
example.c (with your C code), then write an
example.i (where you declare functions you want to export from your C code), and finally run
SWIG on the second file. It produces an
example.py and an
example_wrap.c which has a lot of Python C-API and
PyObjects in it. It includes
$ swig -python example.i
example.py example_wrap.c
You then compile your C file and SWIG's C file with
gcc, while also passing in the path to Python's header files using -I:
$ gcc -c example.c example_wrap.c -I/usr/include/python3.8
example.o example_wrap.o
This generates two output files, which can be linked into an
_example.so file using something called ld:
$ ld -shared example.o example_wrap.o -o _example.so
ld: example_wrap.o: relocation R_X86_64_PC32 against undefined symbol `PyExc_MemoryError' can not be used when making a shared object; recompile with -fPIC
ld: final link failed: bad value
That raised an error, and I had to recompile both C files with
-fPIC for it to work. Looks like
-fPIC (PIC = Position Independent Code) "generates machine code that is not dependent on being located at a specific address in order to work".
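Putting the three steps together with the -fPIC fix, the whole build can be written down as a list of commands. The sketch below only builds and prints the command lines, it does not run anything. The `swig_build_commands` helper is a name I made up; the flags are the ones used in this post, and `sysconfig.get_paths()["include"]` is the standard-library way to find the Python header directory when it is not passed in explicitly.

```python
import sysconfig


def swig_build_commands(module="example", python_include=None):
    """Return the swig, gcc and ld command lines for a SWIG Python module."""
    if python_include is None:
        # Header directory of the Python running this script.
        python_include = sysconfig.get_paths()["include"]
    return [
        # 1. Generate <module>.py and <module>_wrap.c from the interface file.
        ["swig", "-python", f"{module}.i"],
        # 2. Compile both C files as position independent code.
        ["gcc", "-fPIC", "-c", f"{module}.c", f"{module}_wrap.c",
         f"-I{python_include}"],
        # 3. Link the object files into the shared object Python will load.
        ["ld", "-shared", f"{module}.o", f"{module}_wrap.o",
         "-o", f"_{module}.so"],
    ]


commands = swig_build_commands(python_include="/usr/include/python3.8")
for cmd in commands:
    print(" ".join(cmd))
```

Each inner list could be handed to `subprocess.run(cmd, check=True)` to actually execute the step, assuming swig, gcc and ld are installed.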
Once you have the
_example.so, you can import the
example module in your Python code and call a factorial function written in C!
>>> import example
>>> dir(example)
['_SwigNonDynamicMeta', '__builtin__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_example', '_swig_add_metaclass', '_swig_python_version_info', '_swig_repr', '_swig_setattr_nondynamic_class_variable', '_swig_setattr_nondynamic_instance_variable', 'cvar', 'fact', 'get_time', 'my_mod']
>>> example.fact(5)
120
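The `_example` entry in that `dir()` output is the compiled module that the generated example.py imports. One small piece of how Python finds it: the interpreter keeps a list of filename suffixes it accepts for compiled extension modules, exposed as `importlib.machinery.EXTENSION_SUFFIXES`. The sketch below just lists the filenames Python would consider for a module called `_example`; the `candidate_filenames` helper is my own, and the exact suffixes depend on the platform and Python build.

```python
import importlib.machinery


def candidate_filenames(module_name):
    """Filenames the import system would accept for a compiled module."""
    return [module_name + suffix
            for suffix in importlib.machinery.EXTENSION_SUFFIXES]


names = candidate_filenames("_example")
for name in names:
    print(name)
```

On a typical Linux CPython build the plain `.so` suffix is one of the entries, which would explain why a file named _example.so is picked up.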
What are these dark arts? I have so many questions about what is happening here! I've only ever compiled one C file into an
gcc. What does it mean to compile multiple C files? How does that work? How do C modules work? What is position independent code? What is
ld? And how does Python import a
I guess it makes sense to just use existing tools (from the list above) to do a PDF to PNG conversion instead of probing around those large codebases to try and reinvent the wheel. I just need to learn how to package them nicely for different platforms and archs. I know
cryptography do this so I already have templates to look into!