Announcing Excalibur, a Web Interface to Extract Tabular Data from PDFs
21 October 2018 · excalibur · pdf · tableLast week, Camelot trended at #1 on Hacker News, Github and #5 on Product Hunt. Thank you for the love! There's still a lot to do to make it more awesome. You can follow the roadmap on its Github wiki. You can also check out my previous blog post on Camelot.
In this post, I'll talk about developing Excalibur, a web interface to extract data tables from PDFs, powered by Camelot. I'll keep it short and sweet!
v0.1.0 Development
Last week's event motivated me to develop a web interface on top of Camelot. Because why not? Also, web interfaces are easier to use than writing code, unless you're using requests.
I had developed a web interface for Camelot (inspired from Tabula) using Flask, Celery, Bootstrap and jQuery, in my previous job at SocialCops (along with a bunch of other interfaces), so this wasn't something new for me.
I have always loved how modular Apache Airflow's code base is, so I started from scratch by using it as a template. Since I am an amateur at frontend (I can hack something together only using Bootstrap and jQuery, though one day I'll get to these shiny new frameworks that the cool kids seem to be excited about!), my friend Nikhil Sikka, who's a fantastic frontend developer, helped make the web interface look pretty. We took inspiration from Tabula and Docparser and were able to get Excalibur up in a couple of days, coding from home and a local Starbucks!
Initially, I had Celery as a hard requirement but I decided to remove it from v0.1.0 to keep Excalibur lightweight. I plan to add it as an optional requirement in v0.2.0 because who doesn't want parallel and distributed workload execution?
You can check out the documentation at Read the Docs and follow the development on GitHub.
Getting up and running with Excalibur
Excalibur is written in Python 3. After installation with pip, you can start the webserver using:
$ excalibur webserver
That's it! Now you can go to http://localhost:5000 and extract data tables from your PDFs using the interface! Check out the usage section for step-by-step instructions.
You can also download executables for Windows and Linux from the releases page!
Why Excalibur?
Updated: 26 November 2018
- Extracting tables from PDFs is hard. A simple copy-and-paste from a PDF into an Excel doesn't preserve table structure. Excalibur makes PDF table extraction very easy, by automatically detecting tables in PDFs and letting you save them into CSVs and Excel files.
- Excalibur uses Camelot under the hood, which gives you additional settings to tweak table extraction and get the best results. You can see how it performs better than other open-source tools and libraries in this comparison.
- You can save table extraction settings (like table areas) for a PDF once, and apply them on new PDFs to extract tables with similar structures.
- You get complete control over your data. All file storage and processing happens on your own local or remote machine.
- Excalibur can be configured with MySQL and Celery for parallel and distributed workloads. By default, sqlite and multiprocessing are used for sequential workloads.
Why another PDF table extraction tool?
Borrowed from my previous blog post on Camelot.
There are both open (Tabula, pdf-table-extract) and closed-source (Smallpdf, Docparser) tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.
Excalibur uses Camelot under the hood, which was created to offer users complete control over table extraction. If you can’t get the desired output with default settings, you can tweak them and get the job done!
Usage
You can check out concepts and usage documentation to get the most out of Excalibur. Excalibur builds on top of Camelot's table parsing flavors, Lattice and Stream. On Excalibur's web interface, you can configure the same keyword arguments as you would inside your code using Camelot's Python API.
To get data out of PDFs where Camelot doesn't extract the table structure correctly, you can even draw table areas and add column separators on the interface itself!
Long road ahead
I am working on Camelot and Excalibur full time. You can help support its development becoming a backer or a sponsor on OpenCollective! I'm in the process of making both roadmaps more detailed, by defining features/fixes and adding milestones. If you would like to fund the development (or know someone who would) and want to discuss this further, do send me an email at vmehta94 at gmail dot com!
Current roadmaps:
You can check out the Contributor's Guide for guidelines around reporting issues or proposing enhancements.
Thanks for reading, keep looking up!