Extracting text from PDF or “I’d rather shoot myself right now”

Updated on August 11, 2017

tl;dr: Use PDFBox for general text extraction tasks and Tabula for tables.

If you ever find yourself needing to get information out of a PDF document, reconsider first: is there really no other source available?

No? Okay, prepare for pain! Or listen to my advice, as I have already endured the pain:

Don’t try to find a Python tool that does the job. At the moment, there don’t seem to be any good ones out there.

Instead, use Apache PDFBox (https://pdfbox.apache.org). It gave me the best out-of-the-box results of all the tools I tried, and I tested a lot of them.
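If you still want to script the extraction, one option is to call PDFBox's command-line tool rather than a PDF library. What follows is only a rough sketch under a few assumptions: it assumes the PDFBox 2.x standalone "app" jar and its ExtractText command (newer major versions use a different command syntax), Java on the PATH, and a made-up jar file name. The helper names pdfbox_command and extract_text are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: a downloaded PDFBox 2.x standalone jar; the file name is a placeholder.
PDFBOX_JAR = Path("pdfbox-app-2.0.x.jar")


def pdfbox_command(pdf_path, txt_path, jar=PDFBOX_JAR, sort=False):
    """Build the argument list for PDFBox's ExtractText command-line tool."""
    cmd = ["java", "-jar", str(jar), "ExtractText", "-encoding", "UTF-8"]
    if sort:
        # -sort asks PDFBox to order the text by position before writing it out.
        cmd.append("-sort")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path, **options):
    """Run PDFBox (needs Java and the jar) and return the extracted text."""
    subprocess.run(pdfbox_command(pdf_path, txt_path, **options), check=True)
    return Path(txt_path).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical input file, for illustration only.
    print(extract_text("report.pdf", "report.txt")[:500])
```

Building the command separately from running it makes it easy to print and check the exact call before pointing it at a whole folder of documents.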

Update: If you want to extract tables from a PDF, there really is no way around Tabula. It has a very good browser-based GUI and a CLI. If you are extracting data programmatically from loads of similar-looking PDFs, you might have to play around with the “--columns” option instead of relying on auto-detection to get good results.
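For a batch of similar-looking PDFs, the idea of fixed column boundaries could be sketched roughly like this. It assumes the tabula-java command-line jar (the file name here is a placeholder) and Java on the PATH. The column positions, folder name, and helper names tabula_command and extract_all are made up for illustration; the real x coordinates have to be worked out for your own layout.

```python
import subprocess
from pathlib import Path

# Assumption: a downloaded tabula-java jar; the file name is a placeholder.
TABULA_JAR = Path("tabula-java.jar")


def tabula_command(pdf_path, csv_path, columns, pages="all", jar=TABULA_JAR):
    """Build a tabula-java call that uses fixed column boundaries."""
    # --columns takes the x coordinates of the column boundaries, comma separated.
    boundaries = ",".join(str(x) for x in columns)
    return [
        "java", "-jar", str(jar),
        "--pages", pages,
        "--columns", boundaries,
        "--format", "CSV",
        "--outfile", str(csv_path),
        str(pdf_path),
    ]


def extract_all(pdf_dir, columns):
    """Write one CSV next to each PDF in pdf_dir, reusing the same boundaries."""
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        subprocess.run(tabula_command(pdf, pdf.with_suffix(".csv"), columns), check=True)


if __name__ == "__main__":
    # Made-up boundaries and folder name, for illustration only.
    extract_all("invoices", columns=[72, 150.5, 300])
```

Because the layout is the same across files, the boundaries only need to be found once, for example by trial and error on a single document, and can then be reused for the rest.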

 

 

jonas

 
