Releasing Camelot v0.10.0
12 July 2021 · python · camelot
I'm happy to announce that Camelot v0.10.0 is out!
- You can now choose between two image conversion backends, or supply your own.
- The install command is now pip install "camelot-py[base]" instead of pip install "camelot-py[cv]".

Camelot uses ghostscript to convert a PDF page into a PNG so that it can find lines and identify tables. This works well for getting tables out of PDFs, provided you're able to install everything correctly.
But the dependency on ghostscript has made it a bit difficult to install and use Camelot in some cases, because ghostscript isn't a pure Python package that can be installed using pip. Users have to use their system package manager (brew, etc.) or go to the official site to download and install it, and then hope that the ghostscript executable gets installed with the right name in the right path.
Last year I started my RC batch with the goal of removing this dependency on ghostscript, and came up with pdftopng. It's a Python wrapper on top of poppler that can be installed using pip, thanks to a build process which builds wheels for all major operating systems!
This is how you can use pdftopng to convert a PDF with a single page to a PNG:

    from pdftopng import pdftopng

    pdftopng.convert(pdf_path="foo.pdf", png_path="foo.png")
In addition to ghostscript, Camelot now has the poppler image conversion backend (via pdftopng), and you can choose either of these for the internal PDF to PNG conversion. The thing to note is that poppler doesn't require any non-PyPI dependency, while ghostscript still does.
Here's how you can specify the image conversion backend you want to use:

    import camelot

    tables = camelot.read_pdf(filename, backend="ghostscript")  # default
    tables = camelot.read_pdf(filename, backend="poppler")
If none of the above backends work for you, you can supply your own backend by creating a class that implements a convert method, which reads a single-page PDF from pdf_path, converts it into an image, and then writes it to png_path:

    class ConversionBackend(object):
        def convert(self, pdf_path, png_path):
            # read pdf page from pdf_path
            # convert pdf page to image
            # write image to png_path
            pass

    tables = camelot.read_pdf(filename, backend=ConversionBackend())
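To make the convert interface a little more concrete, here is a minimal sketch of a custom backend that shells out to poppler's pdftoppm command-line tool. The class name PdftoppmBackend, the build_command helper, and the resolution parameter are made up for illustration and are not part of Camelot. The sketch also assumes pdftoppm is already on your PATH, which reintroduces a system dependency, so treat it as an illustration of the interface rather than a recommended setup.

```python
import os
import subprocess


class PdftoppmBackend:
    """Hypothetical custom backend that calls poppler's pdftoppm CLI."""

    def __init__(self, resolution=300):
        # Resolution in DPI; 300 is an arbitrary choice for this sketch.
        self.resolution = resolution

    def build_command(self, pdf_path, png_path):
        # pdftoppm appends ".png" to the output root itself,
        # so the extension is stripped from png_path here.
        root, _ = os.path.splitext(png_path)
        return [
            "pdftoppm",
            "-png",         # write PNG instead of PPM
            "-singlefile",  # first page only, no page-number suffix
            "-r", str(self.resolution),
            pdf_path,
            root,
        ]

    def convert(self, pdf_path, png_path):
        # Raises CalledProcessError if pdftoppm exits with a non-zero status,
        # and FileNotFoundError if the executable is not on PATH.
        subprocess.run(self.build_command(pdf_path, png_path), check=True)
        produced = os.path.splitext(png_path)[0] + ".png"
        if produced != png_path:
            # png_path did not end in ".png"; move the file to the requested name.
            os.replace(produced, png_path)
```

An instance of this class could then be passed in the same way as above: camelot.read_pdf(filename, backend=PdftoppmBackend()).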
The default image conversion backend will be changed from ghostscript to poppler in v0.12.0 (after September 2021). You should try it out to see if it's easy to install, and to check that it doesn't break any of your docker builds or ETL workflows.
After installing Camelot with pip install "camelot-py[base]", you can use the following snippet to try out the poppler backend:

    import camelot

    tables = camelot.read_pdf("https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf", backend="poppler")
    tables[0]  # <Table shape=(7, 7)>
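If you want to check both backends against your own files before the default changes, a small smoke test along these lines may help. The helper name try_backends is hypothetical. It takes the reader function as an argument, so the helper itself doesn't import Camelot, and it assumes the returned table list supports len().

```python
def try_backends(read_pdf, filename, backends=("ghostscript", "poppler")):
    """Run read_pdf once per backend and record whether it succeeded."""
    results = {}
    for backend in backends:
        try:
            tables = read_pdf(filename, backend=backend)
        except Exception as exc:  # broad on purpose: this is only a smoke test
            results[backend] = ("error", f"{type(exc).__name__}: {exc}")
        else:
            results[backend] = ("ok", len(tables))
    return results


# Intended usage (requires Camelot to be installed):
#   import camelot
#   print(try_backends(camelot.read_pdf, "foo.pdf"))
```

Each entry in the result maps a backend name to either ("ok", number_of_tables) or ("error", message), which is easy to print in a docker build or assert on in an ETL test.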
If you face any issues, please report them here!