Day 46 — Oh no! A bug :(

Today I read through some open issues on camelot and found a bug that shows up when you install it from conda-forge. I'd assumed that installing ghostscript from conda-forge installs all of its dependencies. It does, but it looks like all of those dependencies are statically linked into one gs executable, so no separate libgs shared library gets installed.

This would've been fine while camelot still ran gs in a subprocess call, but the code was changed to use libgs some time ago. The bug should've been caught when that change was merged, but right now the only test in the conda-forge recipe checks whether camelot can be imported. That didn't catch the bug, as the error only happens when camelot.read_pdf() is called :(
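A recipe test that goes one step past the import might have flagged this. Here's a rough sketch of such a check. It is not what the conda-forge recipe or camelot actually does: the helper name shared_library_is_loadable is made up for illustration, and it only verifies that a library called gs can be found and opened with ctypes on Linux or macOS, which I'm assuming is a reasonable proxy for what read_pdf() needs.

```python
import ctypes
from ctypes.util import find_library


def shared_library_is_loadable(name):
    """Return True if a shared library called `name` can be found and opened."""
    # find_library returns something like "libgs.so.9" on Linux, or None.
    path = find_library(name)
    if path is None:
        return False
    try:
        ctypes.CDLL(path)
    except OSError:
        # The name resolved, but the dynamic loader could not open the file.
        return False
    return True
```

In a recipe test this could be as small as asserting that shared_library_is_loadable("gs") is true.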

The fix is to install libgs using the system package manager (apt / brew), or by downloading the setup for Windows from the ghostscript website. Hopefully, a fix for this won't be needed after the default pdf to image conversion backend is switched to pdftopng. But till then I need to update the docs with a note. Huge thanks to Jim Hall for reporting this, and for pointing me in the right direction!

These are the steps I'd used to reproduce the bug initially:


  $ sudo apt remove --auto-remove ghostscript

That removed a ton of packages including ubuntu-gnome-desktop (?!) which I'm supposed to be using! My system still works fine though, need to figure this out later.

After that I created a new conda environment, installed ghostscript from conda-forge, and camelot from PyPI:


  $ conda create --name gs-env python=3.8
  $ conda activate gs-env
  $ conda install -c conda-forge ghostscript
  $ which gs
  /home/vinayak/anaconda3/envs/gs-env/bin/gs
  $ pip install camelot-py[cv]

And then I ran the test Jim described in the issue:


  >>> import camelot
  >>> tables = camelot.read_pdf('foo.pdf')
  >>>

It worked fine and fed into my confirmation bias, until Jim pointed out that I should check which libgs was being used, and whether removing ghostscript had removed libgs too!


  >>> from ctypes.util import find_library
  >>> find_library("gs")
  'libgs.so.9'
  >>>

Indeed! libgs was still present on my system!


  $ whereis libgs.so.9
  libgs.so: /usr/lib/x86_64-linux-gnu/libgs.so.9
  $ apt search libgs
  libgs9/focal-updates,focal-security,now 9.50~dfsg-5ubuntu4.2 amd64 [installed]
    interpreter for the PostScript language and for PDF - Library

I didn't remove it though as it was going to take away evince and a lot of other useful packages.
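On Linux, find_library only returns a file name like 'libgs.so.9', not the directory it was picked up from. One way to see which file a Python process really ends up with is to load the library and then look at /proc/self/maps. The sketch below assumes Linux, and assumes that the mapped file's path contains the name find_library returned. Both function names are made up for this post.

```python
import ctypes
from ctypes.util import find_library


def mapped_files(maps_text, fragment):
    """Return sorted unique file paths from /proc/<pid>/maps text containing `fragment`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) == 6 and fragment in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def which_library_is_loaded(name):
    """Load library `name` with ctypes and list the file(s) backing it (Linux only)."""
    soname = find_library(name)
    if soname is None:
        return []
    ctypes.CDLL(soname)
    with open("/proc/self/maps") as maps:
        return mapped_files(maps.read(), soname)
```

Calling which_library_is_loaded("gs") inside the conda environment would then show whether the system copy under /usr/lib is the one being loaded.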

I compared the shared library dependencies of the Ubuntu and conda-forge ghostscript builds and found a stark contrast!


  $ ldd /usr/bin/gs
      linux-vdso.so.1 (0x00007ffc065d0000)
      libgs.so.9 => /usr/lib/x86_64-linux-gnu/libgs.so.9 (0x00007efd6bada000)
      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efd6b8e8000)
      libtiff.so.5 => /usr/lib/x86_64-linux-gnu/libtiff.so.5 (0x00007efd6b867000)
      libcups.so.2 => /usr/lib/x86_64-linux-gnu/libcups.so.2 (0x00007efd6b7cc000)
      libijs-0.35.so => /usr/lib/x86_64-linux-gnu/libijs-0.35.so (0x00007efd6b7c4000)
      libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007efd6b78c000)
      libjbig2dec.so.0 => /usr/lib/x86_64-linux-gnu/libjbig2dec.so.0 (0x00007efd6b76d000)
      libjpeg.so.8 => /usr/lib/x86_64-linux-gnu/libjpeg.so.8 (0x00007efd6b6e8000)
      libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007efd6b6cc000)
      liblcms2.so.2 => /usr/lib/x86_64-linux-gnu/liblcms2.so.2 (0x00007efd6b671000)
      libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007efd6b522000)
      libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007efd6b51c000)
      libidn.so.11 => /lib/x86_64-linux-gnu/libidn.so.11 (0x00007efd6b4e5000)
      libpaper.so.1 => /usr/lib/x86_64-linux-gnu/libpaper.so.1 (0x00007efd6b4df000)
      libfontconfig.so.1 => /usr/lib/x86_64-linux-gnu/libfontconfig.so.1 (0x00007efd6b498000)
      libfreetype.so.6 => /usr/lib/x86_64-linux-gnu/libfreetype.so.6 (0x00007efd6b3d9000)
      libopenjp2.so.7 => /usr/lib/x86_64-linux-gnu/libopenjp2.so.7 (0x00007efd6b383000)
      libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007efd6b360000)
      /lib64/ld-linux-x86-64.so.2 (0x00007efd6ca7c000)
      libwebp.so.6 => /usr/lib/x86_64-linux-gnu/libwebp.so.6 (0x00007efd6b0f5000)
      libzstd.so.1 => /usr/lib/x86_64-linux-gnu/libzstd.so.1 (0x00007efd6b04c000)
      liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007efd6b023000)
      libjbig.so.0 => /usr/lib/x86_64-linux-gnu/libjbig.so.0 (0x00007efd6ae15000)
      libgssapi_krb5.so.2 => /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007efd6adc8000)
      libavahi-common.so.3 => /usr/lib/x86_64-linux-gnu/libavahi-common.so.3 (0x00007efd6adba000)
      libavahi-client.so.3 => /usr/lib/x86_64-linux-gnu/libavahi-client.so.3 (0x00007efd6ada5000)
      libgnutls.so.30 => /usr/lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007efd6abcf000)
      libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007efd6aba1000)
      libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x00007efd6ab98000)
      libkrb5.so.3 => /usr/lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007efd6aabb000)
      libk5crypto.so.3 => /usr/lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007efd6aa88000)
      libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007efd6aa81000)
      libkrb5support.so.0 => /usr/lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007efd6aa72000)
      libdbus-1.so.3 => /lib/x86_64-linux-gnu/libdbus-1.so.3 (0x00007efd6aa21000)
      libp11-kit.so.0 => /usr/lib/x86_64-linux-gnu/libp11-kit.so.0 (0x00007efd6a8eb000)
      libidn2.so.0 => /usr/lib/x86_64-linux-gnu/libidn2.so.0 (0x00007efd6a8ca000)
      libunistring.so.2 => /usr/lib/x86_64-linux-gnu/libunistring.so.2 (0x00007efd6a746000)
      libtasn1.so.6 => /usr/lib/x86_64-linux-gnu/libtasn1.so.6 (0x00007efd6a730000)
      libnettle.so.7 => /usr/lib/x86_64-linux-gnu/libnettle.so.7 (0x00007efd6a6f6000)
      libhogweed.so.5 => /usr/lib/x86_64-linux-gnu/libhogweed.so.5 (0x00007efd6a6be000)
      libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007efd6a63a000)
      libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007efd6a633000)
      libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007efd6a615000)
      libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x00007efd6a568000)
      libffi.so.7 => /usr/lib/x86_64-linux-gnu/libffi.so.7 (0x00007efd6a55c000)
      librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007efd6a551000)
      liblz4.so.1 => /usr/lib/x86_64-linux-gnu/liblz4.so.1 (0x00007efd6a530000)
      libgcrypt.so.20 => /usr/lib/x86_64-linux-gnu/libgcrypt.so.20 (0x00007efd6a410000)
      libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007efd6a3ed000)

Ubuntu's ghostscript has dozens of shared library dependencies, including libgs.so.9, but conda-forge's ghostscript has almost none:


  $ ldd /home/vinayak/anaconda3/envs/gs-3.8/bin/gs
      linux-vdso.so.1 (0x00007ffc8dba3000)
      libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f33bd3e3000)
      libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f33bd3c0000)
      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f33bd1ce000)
      /lib64/ld-linux-x86-64.so.2 (0x00007f33becb9000)

So it looks like conda-forge's gs is a single executable with ghostscript's own dependencies statically linked in, and no separate libgs shared library.
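Eyeballing two ldd listings works, but the same comparison can be scripted. This is a small sketch that parses ldd output and checks whether a binary links dynamically against a given library. The function names parse_ldd and links_dynamically_against are my own, and it assumes the glibc ldd output format shown above.

```python
import subprocess


def parse_ldd(output):
    """Map each library name in `ldd` output to its resolved path (None if there is none)."""
    libs = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the trailing load address, e.g. " (0x00007f33bd3e3000)".
        entry = line.rsplit(" (0x", 1)[0]
        if "=>" in entry:
            name, _, path = entry.partition("=>")
            path = path.strip()
            libs[name.strip()] = None if path in ("", "not found") else path
        else:
            # The vdso and the dynamic loader are listed without "=>".
            libs[entry.strip()] = None
    return libs


def links_dynamically_against(binary, prefix):
    """Return True if `ldd binary` lists a library whose name starts with `prefix`."""
    # Only run ldd on binaries you trust.
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return any(name.startswith(prefix) for name in parse_ldd(result.stdout))
```

Something like links_dynamically_against("/usr/bin/gs", "libgs.so") would cover the check above. I'd use "libgs.so" rather than "libgs" as the prefix, because the Ubuntu listing also contains libgssapi_krb5.so.2.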

There's also a stark difference in the sizes of the two executables:


  $ du -sh /usr/bin/gs
  16K /usr/bin/gs
  $ du -sh /home/vinayak/anaconda3/envs/gs-3.8/bin/gs
  25M /home/vinayak/anaconda3/envs/gs-3.8/bin/gs

To reproduce the bug in a clean environment, I launched a docker container with the latest ubuntu image, and installed all the requirements:


  $ docker run -it ubuntu /bin/bash
  $ apt update && apt install curl git
  $ curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
  $ bash Anaconda3-2019.03-Linux-x86_64.sh
  $ eval "$(/root/anaconda3/bin/conda shell.bash hook)"
  (base) $ conda create --name gs-env python=3.8
  (base) $ conda activate gs-env
  (gs-env) $ conda install -c conda-forge camelot-py

Installing camelot from conda-forge installs ghostscript. But I couldn't find libgs!


  (gs-env) $ python3
  >>> from ctypes.util import find_library
  >>> find_library("gs")
  >>>
  (gs-env) $ which gs
  /root/anaconda3/envs/gs-env/bin/gs
  (gs-env) $ whereis libgs
  libgs:

After that I tried to run the code snippet Jim had posted:


  (gs-env) $ git clone https://github.com/camelot-dev/camelot
  (gs-env) $ cd camelot/tests/files
  (gs-env) ./camelot/tests/files $ python3
  >>> import camelot
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/__init__.py", line 6, in <module>
      from .io import read_pdf
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/io.py", line 5, in <module>
      from .handlers import PDFHandler
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/handlers.py", line 9, in <module>
      from .parsers import Stream, Lattice
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/parsers/__init__.py", line 4, in <module>
      from .lattice import Lattice
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/parsers/lattice.py", line 26, in <module>
      from ..image_processing import (
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/image_processing.py", line 3, in <module>
      import cv2
  ImportError: libGL.so.1: cannot open shared object file: No such file or directory
  >>>

But I ran into another bug! opencv depends on libGL.so, which isn't present on this base ubuntu image, so I had to install libgl1-mesa-glx to fix the opencv import error.


  (gs-env) ./camelot/tests/files $ python3
  >>> import camelot
  >>> tables = camelot.read_pdf('foo.pdf')
  Traceback (most recent call last):
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/ext/ghostscript/_gsprint.py", line 260, in <module>
      libgs = cdll.LoadLibrary("libgs.so")
    File "/root/anaconda3/envs/gs-env/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
      return self._dlltype(name)
    File "/root/anaconda3/envs/gs-env/lib/python3.8/ctypes/__init__.py", line 373, in __init__
      self._handle = _dlopen(self._name, mode)
  OSError: libgs.so: cannot open shared object file: No such file or directory

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/io.py", line 113, in read_pdf
      tables = p.parse(
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/handlers.py", line 171, in parse
      t = parser.extract_tables(
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/parsers/lattice.py", line 402, in extract_tables
      self._generate_image()
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/parsers/lattice.py", line 211, in _generate_image
      from ..ext.ghostscript import Ghostscript
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/ext/ghostscript/__init__.py", line 24, in <module>
      from . import _gsprint as gs
    File "/root/anaconda3/envs/gs-env/lib/python3.8/site-packages/camelot/ext/ghostscript/_gsprint.py", line 267, in <module>
      raise RuntimeError("Please make sure that Ghostscript is installed")
  RuntimeError: Please make sure that Ghostscript is installed
  >>>

Finally, the bug that I was looking for! Installing libgs9 fixed it, but this is not ideal. I need to come up with a Windows wheel for pdftopng so that I can finally replace ghostscript as the default pdf to image conversion backend in camelot. Is there a way to somehow launch "Windows containers" to debug things?