Releasing Camelot v0.10.0
12 July 2021 · python · camelot
I'm happy to announce that Camelot v0.10.0 is out!
tl;dr
- You can now choose between two image conversion backends, or supply your own.
- You can now install Camelot with pip install "camelot-py[base]" instead of pip install "camelot-py[cv]".
Background
Camelot uses ghostscript to convert a PDF page into a PNG so that it can find lines and identify tables. This works out nicely to get tables out of PDFs if you're able to install everything correctly.
But the dependency on ghostscript has made it a bit difficult to install and use Camelot in some cases, because ghostscript isn't a pure Python package that can be installed using pip.
Users have to use their system package manager (pacman, apt, brew, etc.) or go to the official site to download and install it, and then hope that the ghostscript executable gets installed with the right name in the right path.
Last year I started my RC batch to remove this dependency on ghostscript, and came up with pdftopng. It's a Python wrapper on top of pdftoppm (from poppler) which can be installed using pip, thanks to a build process which builds wheels for all major operating systems!
This is how you can use pdftopng to convert a PDF with a single page to a PNG:
from pdftopng import pdftopng
pdftopng.convert(pdf_path="foo.pdf", png_path="foo.png")
What's new
In addition to ghostscript, Camelot now has the poppler image conversion backend (via pdftopng), and you can choose either of these for the internal PDF to PNG conversion. The thing to note is that poppler doesn't require any non-PyPI dependency, while ghostscript still does.
Here's how you can specify the image conversion backend you want to use:
tables = camelot.read_pdf(filename, backend="ghostscript") # default
tables = camelot.read_pdf(filename, backend="poppler")
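If you aren't sure which backend is usable on a given machine, one option is to try them in order. The helper below is only a sketch: read_with_fallback is a hypothetical name, not part of Camelot, and it takes the read function as an argument (normally camelot.read_pdf) so that it doesn't need to import Camelot itself.

```python
def read_with_fallback(read_pdf, filename, backends=("poppler", "ghostscript"), **kwargs):
    """Try each backend in order and return (backend_name, tables) for the first that works.

    Hypothetical helper, not part of Camelot. `read_pdf` is expected to accept
    a `backend` keyword argument, as camelot.read_pdf does.
    """
    errors = {}
    for backend in backends:
        try:
            return backend, read_pdf(filename, backend=backend, **kwargs)
        except Exception as exc:
            # Broad on purpose: a missing dependency can surface as different
            # exception types depending on the backend.
            errors[backend] = exc
    details = "; ".join(f"{name}: {exc!r}" for name, exc in errors.items())
    raise RuntimeError(f"no image conversion backend worked ({details})")
```

With Camelot installed, it could be called as backend, tables = read_with_fallback(camelot.read_pdf, "foo.pdf"). Catching every exception also hides unrelated errors (a bad file path, for example), so the collected errors are included in the final message.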
If neither of the above backends works for you, you can supply your own backend by creating a class that implements a convert method, which reads a single-page PDF from pdf_path, converts it into an image, and then writes it to png_path:
class ConversionBackend:
    def convert(self, pdf_path, png_path):
        # read pdf page from pdf_path
        # convert pdf page to image
        # write image to png_path
        pass
tables = camelot.read_pdf(filename, backend=ConversionBackend())
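To make the skeleton above a bit more concrete, here is a minimal sketch of a custom backend that shells out to the pdftoppm command-line tool. PdftoppmBackend, build_command, and the resolution default of 300 are my own illustrative choices, not something Camelot ships, and the sketch assumes pdftoppm is already on your PATH.

```python
import os
import shutil
import subprocess


class PdftoppmBackend:
    """Hypothetical custom backend that calls the pdftoppm CLI (from poppler)."""

    def __init__(self, resolution=300, executable="pdftoppm"):
        self.resolution = resolution  # DPI passed to pdftoppm via -r (assumed default)
        self.executable = executable

    def build_command(self, pdf_path, png_path):
        # pdftoppm appends ".png" itself, so pass the output path without its extension.
        out_root, _ = os.path.splitext(png_path)
        return [
            self.executable,
            "-png",         # write PNG output
            "-singlefile",  # only the first page, no page-number suffix
            "-r", str(self.resolution),
            pdf_path,
            out_root,
        ]

    def convert(self, pdf_path, png_path):
        if shutil.which(self.executable) is None:
            raise OSError(f"{self.executable} not found on PATH")
        subprocess.run(self.build_command(pdf_path, png_path), check=True)
        # If png_path did not end in ".png", move the produced file into place.
        produced = os.path.splitext(png_path)[0] + ".png"
        if produced != png_path:
            os.replace(produced, png_path)
```

An instance can then be passed in the same way as the skeleton class: camelot.read_pdf(filename, backend=PdftoppmBackend()). Keeping the command construction in its own method makes it easy to check the arguments without actually running pdftoppm.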
What's next
The default image conversion backend will be changed from ghostscript to poppler in v0.12.0 (after September 2021). Please try it out before then to check that it's easy to install and that it doesn't break any of your docker builds or ETL workflows.
After installing Camelot with pip install "camelot-py[base]", you can use the following snippet to try out the poppler backend:
import camelot
tables = camelot.read_pdf("https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf", backend="poppler")
tables[0]
# <Table shape=(7, 7)>
If you face any issues, please report them here!