How to extract tabular data from PDF?

DesmondChoo
6 min readJul 11, 2021

--

Do you often need to extract tabular data from PDF? If so, you will be familiar to both Tabula and Camelot python package which allow user to do just that. However, no matter how useful these packages are, there are some limitations that reside within each package. What if we could develop a script that could utilize both packages…

Introduction

The article will be based off the python script that utilizes both Tabula and Camelot to extract potential tables from text PDF. The script will first use Camelot to extract and if Camelot’s algorithm does not detect the table, it will use Tabula. Once all the tabular data are extracted and processed, it will then be outputted into a single Excel with each tavular data housed in the tab based on the page of the PDF.

Pros and Cons of Camelot and Tabula

Before we dive into the code, if you want to learn more about Camelot and Tabula packages, and what are some of the pros and cons for these packages, you can refer to this blog which succintly explain these packages and more.

Idea of the script

For this script, my test data is based off of DBS Q2020 Pillar 3 PDF. I opined that the performance for both Camelot and Tabula were similar for the extraction of tabular data that have clear defined demarcation for a table. However, if the tabular data has merged cell or if there isn’t any clearly defined lines, I observed that Camelot has a slight edge over Tabula.

By using the syntax below, we can compare the perfomance of both packages (this is just one example and by no means rigorous).

## Camelot
table_c = camelot.read_pdf(file, pages=”20")
df = pd.DataFrame()
for j in range(len(table_c)):
df_c = table_c[j].df
df = df.append(remove_spaces(split_newline(df_c)))
df.to_excel(“camelot.xlsx”)
## Tabula
table_t = tabula.read_pdf(file,pages=20,multiple_tables=True,stream=True, guess=True)
df = table_t[0]
df.to_excel("tabula.xlsx")
Comparison between Camelot and Tabula

As shown in the picture above, while both managed to extract the important figures from the PDF, the Camelot version is slightly better and more usable.

Of course not all rainbows and sunshine for Camelot. While the extracted data is slightly better, there are also certain tabular data that Camelot will not extract due to the data not having enough signs or indications of it being a tabular data, even though Tabula can. Another limitation that I’ve faced after working with Camelot is that even after extraction, there are certain elements within the extracted data that will not be parsed correctly (this wil be explained in the next section).

Example of a tabular data that is missed out by Camelot

Thus, the best way forward is to combine both packages together to make sure that the maximum amount of tabular data can be extracted with the highest level of useability after extraction. So based on that condition, we will try to use Camelot first to scan each page of the PDF, and if no tabular data is being detected, then wewill use Tabula as a fail safe to ensure that it is really the case.

Enhancing the extraction of Camelot

As briefly explained in the previous section, there were some instances that the tabular data extracted by Camelot was not correctly parsed (see image below), with certain rows of tabular data all concatenated with delimiter “\n” in the first column.

Issues with Camelot’s extracted tabular data

We can easily fixed this issue by creating a function to make sure that if there is “\n” in the extracted data, we will do split them up and make sure the right element falls into the right cell.

## Function to clean camelot table_c
def split_newline(df_c):
for k in range(len(df_c)):
if “\n” in df_c[0].iloc[k]:
text = df_c[0].iloc[k].split(“\n”)
for z in range(len(text)):
df_c.at[k,z] = text[z]
return df_c
table_c = camelot.read_pdf(file, pages="25")
df = pd.DataFrame()
for j in range(len(table_c)):
df_c = table_c[j].df
df = df.append(remove_spaces(split_newline(df_c)))
df.to_excel("camelot.xlsx")

By using both functions above, we are able to change the output to the one below.

Enhanced Camelot’s extracted tabular data with function.

Putting it all together

This script set out to make sure that we are able to extract most, if not all, the tabular data residing in a PDF, thus we use Camelot, with Tabula acting as a fail safe. We also discovered some issues with the parsing of Camelot’s extracted tabular data, so we enhanced it with a simple python function.

While the major contributing factor to maximum useability is the accuracy of the extracted tabular data, it is also important to separate the extracted tabular data into different tabs with the name of the tab based on the PDF, so that the user know where to find that particular data. Thus, the final step is to do that.

We can do this step by using xlsxwriter and a simple for-loop in python. With this, we can finally bring all the elements together into a single code.

import tabula
import camelot
from pdfrw import PdfReader
import pandas as pd
import xlsxwriter
import re
## Importing PDF
file = "Data\DBS_P3Q42020.pdf"
## Get number of pages
num_pages = len(PdfReader(file).pages)
## Function to clean camelot table_c
def split_newline(df_c):
for k in range(len(df_c)):
if "\n" in df_c[0].iloc[k]:
text = df_c[0].iloc[k].split("\n")
for z in range(len(text)):
df_c.at[k,z] = text[z]
return df_c
def remove_spaces(df_c):
for i in range(len(df_c)):
for col in df_c.columns[2:]:
text = df_c[col].iloc[i]
if text.count(' ')>1:
text = ''.join(e for e in text if e.isalnum())
df_c.at[i,col] = text
return df_c
with pd.ExcelWriter(‘test_dbs.xlsx’, engine=’xlsxwriter’) as writer:
for i in range(1,num_pages+1):
try:
table_t = tabula.read_pdf(file,pages=i,multiple_tables=True,stream=True)
table_c = camelot.read_pdf(file, pages=str(i))
if len(table_c) > 0:df = pd.DataFrame()
for j in range(len(table_c)):
df_c = table_c[j].df
## Run through the first column to assess if there is ‘\n’
df = df.append(remove_spaces(split_newline(df_c)))
print(“Page “+ str(i) + “ using Camelot”)
df.to_excel(writer,sheet_name = “Page “+str(i), index=False, header=False)
elif len(table_t) > 0:
df = table_t[0]
print(“Page “+ str(i) + “ using Tabula”)
df.to_excel(writer,sheet_name = “Page “+str(i), index=False, header=False)
except AttributeError:
pass

Final Thought

I do see some use cases whereby it could potential save some working-hours for users, especially financial analysts, who frequently need to copy and paste data from PDF into excel and then churn the relevant visualization in order to do analysis like industry/peer comparison.

While this script is far from perfect, it is a quick way to extract tabular data nevertheless. If you do use this script, you will realised that there are still some tables that both Camelot and Tabula cannot extract accurately. In this case, one way to resolve it might be to convert the PDF to text and manually transform it. I might explore doing that in the future so stay tuned!

Last but not least, a big thanks for reading my article! :)

Other Projects

--

--

DesmondChoo
DesmondChoo

Written by DesmondChoo

Someone who loves to experiment with new technique in the realm of Data Analytics and Data Science

Responses (3)