Day 42 — I have an answer to the ultimate question of how to convert a PDF to a PNG in Python!

Yesterday I paired with Ilia to walk through the pdftoppm code, and we identified some code that wasn't required to wrap pdftoppm. I also wrapped a "hello world" ncurses game with pybind11!

Today I started looking into pdftoppm again and commented out some code (like command-line argument parsing and JPEG support) so that it only supports PDF to PNG conversion, and saved the result as pdftopng.cc. Then I tried to wrap it using pybind11.

BOOM! I got a seg fault though!


  >>> import pdftopng
  >>> pdftopng.convert(pdf_path="foo.pdf", png_path="foo")
  Segmentation fault (core dumped)

I started looking for ways to debug this seg fault, and found this nice article by Python Speed. It recommends using catchsegv while running your Python C extension code to get a register dump, a memory map(?), and a full backtrace!


  $ catchsegv python convert.py
  Fatal Python error: Segmentation fault

  Current thread 0x00007f02a832b740 (most recent call first):
    File "convert.py", line 3 in <module>
  Segmentation fault (core dumped)
  *** Segmentation fault
  Register dump:
  ...

  Backtrace:
  /lib/x86_64-linux-gnu/libpthread.so.0(raise+0xcb)[0x7f02a86db24b]
  /home/vinayak/dev/poppler/build/example.cpython-38-x86_64-linux-gnu.so(_Z7convertPcS_+0xdd)[0x7f02a7b8016d]
  ...

  Memory map:
  ...
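
The "Fatal Python error: Segmentation fault" line and the Python-level traceback at the top of that output are in the format printed by Python's built-in faulthandler module, which can also be switched on without catchsegv. Here is a minimal sketch, assuming a Linux-like system. The helper name run_with_faulthandler is made up for illustration, and the ctypes call is only a stand-in crash, not the pdftopng one.

```python
import subprocess
import sys


def run_with_faulthandler(code):
    """Run a Python snippet in a child process with faulthandler enabled.

    Hypothetical helper: returns (returncode, stderr), so a crash in the
    child does not take down the calling process.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    # Stand-in crash (assumption): read from address 0 through ctypes.
    returncode, stderr = run_with_faulthandler(
        "import ctypes; ctypes.string_at(0)"
    )
    print(returncode)
    print(stderr)
```

The same handler can be enabled from inside a script with `import faulthandler; faulthandler.enable()`, or by setting the PYTHONFAULTHANDLER environment variable. It only shows the Python side of the stack, so it points at the line that called into the extension, not at the C++ line that crashed.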

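This backtrace wasn't helpful at all: it only showed the mangled name of my convert function plus an offset, with no source file or line number. At this point, I remembered that we can add debug symbols to the output of gcc/g++ so that gdb can work with it. I also found this awesome DebuggingWithGdb page on the Python wiki! After rebuilding with debug symbols using the -g option, I fired up gdb: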


  $ gdb python
  ...
  (gdb) run convert.py
  Starting program: /home/vinayak/.virtualenvs/poppler-dev/bin/python convert.py
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

  Program received signal SIGSEGV, Segmentation fault.
  0x00007ffff7420642 in GlobalParams::getOverprintPreview (this=0x0) at /home/vinayak/dev/poppler/poppler/GlobalParams.h:131
  131       bool getOverprintPreview() { return overprintPreview; }
  (gdb) bt
  #0  0x00007ffff7420642 in GlobalParams::getOverprintPreview (this=0x0) at /home/vinayak/dev/poppler/poppler/GlobalParams.h:131
  #1  0x00007ffff741e3ee in convert () at pdftopng.cc:561
  ...
  (gdb)
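
The interactive session above can also be run non-interactively, which is handy when the crash needs to be reproduced several times. This is a rough sketch, assuming gdb is on the PATH. The helper name gdb_backtrace_command is hypothetical and it only builds the argument list; -batch, -ex and --args are standard gdb options.

```python
import shlex
import sys


def gdb_backtrace_command(script, python=sys.executable):
    """Build a gdb command line that runs `script` and prints a backtrace.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    return [
        "gdb",
        "-batch",        # exit once the commands below have run
        "-ex", "run",    # start the program under gdb
        "-ex", "bt",     # print the backtrace once the program stops
        "--args", python, script,
    ]


if __name__ == "__main__":
    # Print the command as a shell line; pass the list to subprocess.run
    # to actually launch gdb.
    print(shlex.join(gdb_backtrace_command("convert.py")))
```

If the program exits without crashing, the bt command has no stack to print, so this is only useful while the seg fault is still reproducible.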

This backtrace was much more helpful. The this=0x0 in frame #0 shows getOverprintPreview being called on a null GlobalParams pointer, and I found that in my frenzy to remove "unneeded" code, I had commented out an important variable called globalParams which was being used down below! Adding it back fixed the seg fault! This code works now:


  >>> import pdftopng
  >>> pdftopng.convert(pdf_path="foo.pdf", png_path="foo")
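
For comparison, this is roughly the pdftoppm command line that the stripped-down code came from, built from Python. It is a sketch under assumptions: the helper name pdftoppm_png_command is made up, and whether pdftopng.convert names its output files the same way as the CLI is not confirmed here. The -png and -r options and the "PDF-file PPM-root" argument order are pdftoppm's own.

```python
import shlex


def pdftoppm_png_command(pdf_path, png_root, resolution=None):
    """Build a pdftoppm command line for PDF to PNG conversion.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    cmd = ["pdftoppm", "-png"]
    if resolution is not None:
        # -r sets the output resolution in DPI.
        cmd += ["-r", str(resolution)]
    # pdftoppm takes the input PDF followed by the output file root.
    cmd += [pdf_path, png_root]
    return cmd


if __name__ == "__main__":
    # Print the command as a shell line; pass the list to subprocess.run
    # to actually run pdftoppm.
    print(shlex.join(pdftoppm_png_command("foo.pdf", "foo", resolution=300)))
```

On the command line, pdftoppm appends a page number and extension to the root, giving names like foo-1.png, which is why png_path is "foo" and not "foo.png".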

Now I just need to clean up the API (maybe also add support for other pdftoppm options? And add other poppler CLI tools too?), build multi-platform wheels, and publish this on PyPI! I'll then be able to remove the ghostscript dependency in camelot, and users won't have to install anything separately for the library to work!