Releasing Camelot v0.10.0

I'm happy to announce that Camelot v0.10.0 is out!

tl;dr

Background

Camelot uses ghostscript to convert a PDF page into a PNG so that it can find lines and identify tables. This works out nicely to get tables out of PDFs if you're able to install everything correctly.

But the dependency on ghostscript has made Camelot difficult to install and use in some cases, because ghostscript isn't a pure Python package that can be installed using pip.

Users have to use their system package manager (pacman, apt, brew, etc.) or download it from the official site and install it, and then hope that the ghostscript executable ends up with the right name on the right path.
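If you want to check whether a ghostscript executable is visible on your PATH, here's a rough sketch using only the standard library. The list of executable names is an assumption based on common installs (gs on Linux and macOS, gswin64c / gswin32c on Windows), and find_ghostscript is a hypothetical helper, not something Camelot ships or uses to locate ghostscript:

```python
import shutil

# Executable names ghostscript is commonly installed under (assumed list).
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the path of the first matching executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    print(find_ghostscript() or "ghostscript not found on PATH")
```

A None result only tells you that none of these names resolved on PATH; ghostscript may still be installed somewhere else.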

Last year I started my RC batch to remove this dependency on ghostscript, and came up with pdftopng. It's a Python wrapper on top of pdftoppm (from poppler) that can be installed using pip, thanks to a build process that produces wheels for all major operating systems!

This is how you can use pdftopng to convert a PDF with a single page to a PNG:


  from pdftopng import pdftopng
  pdftopng.convert(pdf_path="foo.pdf", png_path="foo.png")

What's new

In addition to ghostscript, Camelot now has a poppler image conversion backend (via pdftopng), and you can choose either of these for the internal PDF to PNG conversion. The thing to note is that the poppler backend doesn't require any non-PyPI dependency, while ghostscript still does.

Here's how you can specify the image conversion backend you want to use:


  tables = camelot.read_pdf(filename, backend="ghostscript")  # default
  tables = camelot.read_pdf(filename, backend="poppler")

If neither of the above backends works for you, you can supply your own backend by creating a class that implements a convert method, which reads a single-page PDF from pdf_path, converts it into an image, and then writes it to png_path:


  class ConversionBackend(object):
      def convert(self, pdf_path, png_path):
          # read pdf page from pdf_path
          # convert pdf page to image
          # write image to png_path
          pass

  tables = camelot.read_pdf(filename, backend=ConversionBackend())
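To make the skeleton above a bit more concrete, here's a minimal sketch of a custom backend that shells out to the pdftoppm command line tool. It assumes pdftoppm is installed and on your PATH, and that png_path ends in ".png". PdftoppmBackend, build_command and the resolution parameter are hypothetical names for illustration; they aren't part of Camelot:

```python
import os
import subprocess


class PdftoppmBackend:
    """Hypothetical custom backend that calls the pdftoppm CLI."""

    def __init__(self, resolution=300):
        self.resolution = resolution

    def build_command(self, pdf_path, png_path):
        # pdftoppm appends ".png" to the output root itself,
        # so strip the extension from png_path first.
        output_root, _ = os.path.splitext(png_path)
        return [
            "pdftoppm",
            "-png",         # write PNG output
            "-singlefile",  # one page only, no page-number suffix
            "-r", str(self.resolution),
            pdf_path,
            output_root,
        ]

    def convert(self, pdf_path, png_path):
        # check=True raises CalledProcessError if pdftoppm fails.
        subprocess.run(self.build_command(pdf_path, png_path), check=True)
```

You would then pass an instance in the same way, for example backend=PdftoppmBackend(resolution=150). Keeping the command construction in its own method makes it easy to inspect without actually running the tool.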

What's next

The default image conversion backend will be changed from ghostscript to poppler in v0.12.0 (after September 2021). Until then, please try out the poppler backend to check that it's easy to install and that it doesn't break any of your Docker builds or ETL workflows.

After installing Camelot with pip install "camelot-py[base]", you can use the following snippet to try out the poppler backend:


  import camelot
  tables = camelot.read_pdf("https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf", backend="poppler")
  tables[0]
  # <Table shape=(7, 7)>
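If you're trying the poppler backend inside an existing workflow and want a safety net, one option is to fall back to ghostscript when the first attempt raises. This is only a sketch: read_with_fallback is a hypothetical helper, and it takes the reader function as an argument so that nothing beyond the backend keyword shown above is assumed about Camelot:

```python
def read_with_fallback(read_pdf, source, backends=("poppler", "ghostscript")):
    """Try each backend in order; return (backend_name, tables) for the first success."""
    errors = {}
    for backend in backends:
        try:
            return backend, read_pdf(source, backend=backend)
        except Exception as exc:  # broad on purpose: this is a last-resort fallback
            errors[backend] = exc
    raise RuntimeError(f"all backends failed: {errors}")


# Usage, assuming Camelot is installed:
#   import camelot
#   used, tables = read_with_fallback(camelot.read_pdf, "foo.pdf")
```

Logging which backend was actually used should help you spot where poppler doesn't work before the default changes.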

If you face any issues, please report them here!