Day 42 — I have an answer to the ultimate question of how to convert a PDF to a PNG in Python!
07 October 2020 · recurse-center
Yesterday I paired with Ilia to walk through the pdftoppm code, and we identified some code that wasn't required to wrap pdftoppm. I also wrapped a "hello world" ncurses game with pybind11!
Today I started looking into pdftoppm again and commented out some code (like command-line argument parsing and JPEG support) to just support PDF to PNG conversion, and made a pdftopng.cc. And then I tried to wrap it using pybind11.
BOOM! I got a seg fault though!
>>> import pdftopng
>>> pdftopng.convert(pdf_path="foo.pdf", png_path="foo")
Segmentation fault (core dumped)
I started looking for ways to debug this seg fault, and found this nice article by Python Speed. It recommends using catchsegv while running your Python C extension code to get a register dump, a memory map(?), and a full backtrace!
$ catchsegv python convert.py
Fatal Python error: Segmentation fault
Current thread 0x00007f02a832b740 (most recent call first):
File "convert.py", line 3 in <module>
Segmentation fault (core dumped)
*** Segmentation fault
Register dump:
...
Backtrace:
/lib/x86_64-linux-gnu/libpthread.so.0(raise+0xcb)[0x7f02a86db24b]
/home/vinayak/dev/poppler/build/example.cpython-38-x86_64-linux-gnu.so(_Z7convertPcS_+0xdd)[0x7f02a7b8016d]
...
Memory map:
...
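The "Fatal Python error" and "Current thread" lines at the top of that output look like what Python's built-in faulthandler module prints when it is enabled. Here is a minimal sketch of getting that Python-level traceback on its own, without catchsegv. It assumes Linux, and it uses ctypes.string_at(0) as a stand-in for the crashing extension call; run_with_faulthandler and CRASHING_SNIPPET are hypothetical names, not part of pdftopng.

```python
import subprocess
import sys

# Stand-in for a crashing C extension (an assumption for this sketch):
# reading address 0 through ctypes triggers a real segmentation fault.
CRASHING_SNIPPET = "import ctypes; ctypes.string_at(0)"


def run_with_faulthandler(code):
    """Run `code` in a child interpreter with faulthandler enabled.

    The crash happens in the child, so the calling interpreter survives
    and can inspect what the child wrote to stderr.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_with_faulthandler(CRASHING_SNIPPET)
    print("exit status:", result.returncode)
    print(result.stderr)
```

This only shows which Python line was executing when the crash happened. It says nothing about where inside the C++ code things went wrong, which is why a native debugger is still needed.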
The catchsegv backtrace wasn't helpful at all. At this point, I remembered how we can add debug symbols to the output of gcc/g++ so that gdb can work with it. I also found this awesome DebuggingWithGdb page on the Python wiki! After adding debug symbols with the -g option I fired up gdb:
$ gdb python
...
(gdb) run convert.py
Starting program: /home/vinayak/.virtualenvs/poppler-dev/bin/python convert.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7420642 in GlobalParams::getOverprintPreview (this=0x0) at /home/vinayak/dev/poppler/poppler/GlobalParams.h:131
131 bool getOverprintPreview() { return overprintPreview; }
(gdb) bt
#0 0x00007ffff7420642 in GlobalParams::getOverprintPreview (this=0x0) at /home/vinayak/dev/poppler/poppler/GlobalParams.h:131
#1 0x00007ffff741e3ee in convert () at pdftopng.cc:561
...
(gdb)
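Typing run and bt by hand gets old when the crash has to be reproduced a few times. A small sketch of scripting the same session, assuming gdb is installed and on PATH; gdb_backtrace_command is a hypothetical helper name. It only builds the command line, using gdb's -batch, -ex and --args options, so it can be printed and pasted into a shell or handed to subprocess.run.

```python
import shlex
import sys


def gdb_backtrace_command(script, python=sys.executable):
    """Build a gdb invocation that runs `script` and prints a backtrace."""
    return [
        "gdb",
        "-batch",        # exit once the commands below have finished
        "-ex", "run",    # same as typing `run` at the (gdb) prompt
        "-ex", "bt",     # same as typing `bt` after the crash
        "--args", python, script,  # the program to debug and its arguments
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for the script from this post.
    print(shlex.join(gdb_backtrace_command("convert.py")))
```

The backtrace is only as readable as the one above if the extension was compiled with -g.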
The gdb backtrace was much more helpful: frame #0 shows getOverprintPreview being called with this=0x0, a null pointer. I found that in my frenzy to remove "unneeded" code, I had commented out an important variable called globalParams, which was being used further down! Adding it back fixed the seg fault! This code works now:
>>> import pdftopng
>>> pdftopng.convert(pdf_path="foo.pdf", png_path="foo")
Now I just need to clean up the API (maybe also add support for other pdftoppm options? And add other poppler CLI tools too?), build multi-platform wheels, and publish this on PyPI! I'll then be able to remove the ghostscript dependency in camelot, and users won't have to install anything separately for the library to work!