PDF Manipulation Application Part 2 — Removing unwanted pages using PDFminer
This article continues an earlier one that walked through a Streamlit application that can add watermarks, remove metadata, and even concatenate different Doc/Docx/PDF files. You can view the article here.
Introduction
In the earlier article, I explained why I used Streamlit to create this interactive web application and what motivated me to build it. If you are interested in that, or in my thought process behind the application, you can view it over here.
In this article, I am going to enhance that application a little further by adding two more functionalities: (1) removing blank pages and (2) removing pages that contain certain words or phrases.
PDFminer is a Python package that enables me to carry out both functionalities.
Why PDFminer?
There are quite a number of Python packages that allow the user to extract text from PDFs, and one of the better-known ones is PDFminer. I have also chosen it because I used it in my previous line of work, so I am slightly more comfortable with it.
So what is PDFminer? PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. [1]
In this article, I will just touch on PDFminer and not PDFminer.six.
Getting started
The difficult part of these functionalities is extracting the text on each page. To do that, we first need to import the relevant modules from PDFminer.
import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
Let’s devise a loop to extract the text of each page in the PDF and check if the text contains any of the unwanted words (remove_word).
to_del = []
remove_word = "Delete"

# pdf_path is assumed to be an open binary file object here (e.g. an uploaded file)
for i, page in enumerate(PDFPage.get_pages(pdf_path, caching=True, check_extractable=True)):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()
    # close open handles
    converter.close()
    fake_file_handle.close()

    # remember the (0-based) index of every page that contains the word
    if remove_word in text:
        to_del.append(i)
Once that is done, we will incorporate the loop into a function that can take either uploaded PDF files or file paths. The function also takes the word or phrase that the user wants to remove, so that it can return the pages that contain it.
def extract_nullpage_from_pdf(pdf_path, remove_word):
    to_del = []
    try:
        # Works when pdf_path is an uploaded (file-like) object
        for i, page in enumerate(PDFPage.get_pages(pdf_path, caching=True, check_extractable=True)):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            # close open handles
            converter.close()
            fake_file_handle.close()

            if remove_word in text:
                to_del.append(i)

    # If it is a path instead
    except AttributeError:
        with open(pdf_path, 'rb') as fp:
            for i, page in enumerate(PDFPage.get_pages(fp, caching=True, check_extractable=True)):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(resource_manager, fake_file_handle)
                page_interpreter = PDFPageInterpreter(resource_manager, converter)
                page_interpreter.process_page(page)
                text = fake_file_handle.getvalue()
                # close open handles
                converter.close()
                fake_file_handle.close()

                if remove_word in text:
                    to_del.append(i)

    return to_del
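The function above checks for a single word or phrase, and the match is case-sensitive. As a minimal sketch of how the matching step could be extended to several phrases, the helper below works on a plain list of page texts. The name find_pages_with_words, the ignore_case flag and the sample texts are hypothetical and are not part of the application.

```python
def find_pages_with_words(page_texts, remove_words, ignore_case=True):
    """Return 0-based indices of pages whose text contains any unwanted word."""
    # Drop empty entries: "" is a substring of every string and would
    # otherwise mark every page for deletion.
    words = [w for w in remove_words if w]
    if ignore_case:
        words = [w.lower() for w in words]

    to_del = []
    for i, text in enumerate(page_texts):
        haystack = text.lower() if ignore_case else text
        if any(w in haystack for w in words):
            to_del.append(i)
    return to_del


# Hypothetical page texts standing in for the output of the PDFminer loop.
sample_pages = ["Keep this page\f", "Please DELETE this page\f", "Draft only\f"]
print(find_pages_with_words(sample_pages, ["delete", "draft"]))  # [1, 2]
```

The empty-string guard matters because the text input can be left blank, and an empty phrase would match every page.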
Last but not least, let’s create another function that follows the same structure but searches for blank pages instead.
def extract_null_from_pdf(pdf_path):
    blank_pg = []
    try:
        # Works when pdf_path is an uploaded (file-like) object
        for i, page in enumerate(PDFPage.get_pages(pdf_path, caching=True, check_extractable=True)):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            # close open handles
            converter.close()
            fake_file_handle.close()

            # a blank page yields nothing but a space and a form feed
            if text == " \f":
                blank_pg.append(i)

    # If it is a path instead
    except AttributeError:
        with open(pdf_path, 'rb') as fp:
            for i, page in enumerate(PDFPage.get_pages(fp, caching=True, check_extractable=True)):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(resource_manager, fake_file_handle)
                page_interpreter = PDFPageInterpreter(resource_manager, converter)
                page_interpreter.process_page(page)
                text = fake_file_handle.getvalue()
                # close open handles
                converter.close()
                fake_file_handle.close()

                if text == " \f":
                    blank_pg.append(i)

    return blank_pg
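The comparison against " \f" assumes that a blank page always comes out as exactly one space followed by a form feed. A slightly more forgiving variant, sketched below, treats any whitespace-only text as blank. The helper names and sample texts are hypothetical, and this is only one possible definition of "blank".

```python
def is_blank_text(text):
    """Treat a page as blank when its extracted text is only whitespace."""
    # str.strip() with no arguments removes spaces, newlines and form feeds.
    return text.strip() == ""


def find_blank_pages(page_texts):
    """Return 0-based indices of the pages whose text is blank."""
    return [i for i, text in enumerate(page_texts) if is_blank_text(text)]


# Hypothetical page texts standing in for the output of the PDFminer loop.
sample_pages = ["Title page\f", " \f", "\n\n\f", "Body text\f"]
print(find_blank_pages(sample_pages))  # [1, 2]
```

Either way, a page that holds only images has no extractable text, so it would also be reported as blank.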
The reason we want two separate functions instead of one is that I want the user to be able to choose which manipulations to carry out on the PDF. Separating them keeps these functionalities independent and modular.
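Both functions repeat the same try/except to cope with an uploaded file versus a file path. One way this could be factored out, offered as a sketch rather than as the application's actual code, is a small context manager that always hands back a binary file object. The name open_pdf_source is hypothetical, and the sketch assumes that uploaded files expose read() and seek(), as io.BytesIO does.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_pdf_source(source):
    """Yield a binary file object for either a path or an already-open file."""
    if hasattr(source, "read"):
        # File-like objects are used as they are; rewind first so that an
        # earlier read does not leave us at the end of the stream.
        source.seek(0)
        yield source
    else:
        # Anything else is treated as a filesystem path and closed afterwards.
        with open(source, "rb") as fp:
            yield fp


# An in-memory stand-in for an uploaded file that has already been read once.
uploaded = io.BytesIO(b"%PDF-1.4 demo")
uploaded.read()
with open_pdf_source(uploaded) as fp:
    print(fp.read(4))  # b'%PDF'
```

With such a helper, each extraction function would need only one loop, written inside a single with open_pdf_source(pdf_path) as fp: block.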
Incorporating these functions into the script from the earlier article
In the earlier article, we created a script that loops through every page of each PDF file and applies the selected transformations to them. With the above two functions in hand, it is quite simple to incorporate them into the script.
## Create a loop for all the paths
for i in range(len(filepath)):
    file = filepath[i]
    reader_input = PdfReader(file)
    status_text.text(f"Processing {file} now...")

    if "Remove pages that contain words/phrases" in config_select_manipulation:
        to_del = extract_nullpage_from_pdf(file, remove_word)
    else:
        to_del = []

    if "Remove blank pages" in config_select_manipulation:
        blank_pg = extract_null_from_pdf(file)
    else:
        blank_pg = []

    if "Add Watermark" in config_select_manipulation:
        ## go through the pages one after the next
        for current_page in range(len(reader_input.pages)):
            if current_page in to_del:
                pass
            elif current_page in blank_pg:
                pass
            else:
                "Rest of the codes..."
Since I separated both functions (remove words and remove blank pages), I can use if/else statements to let the user choose whether to use them. Is there a better or more elegant way of doing so? I would certainly think so, but I will stick with this approach as a quick way of prototyping my idea and solution. :)
Preview of the enhanced Streamlit application
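As one possible answer to that question, the two lists could be merged into a single set of pages to skip, so the page loop only iterates over the pages that survive. The sketch below is illustrative only; pages_to_keep is a hypothetical helper, not something from the application.

```python
def pages_to_keep(n_pages, *skip_lists):
    """Return the 0-based page indices that survive every selected filter."""
    skip = set()
    for pages in skip_lists:
        skip.update(pages)
    return [n for n in range(n_pages) if n not in skip]


# A 6-page document where pages 1 and 4 contain the word and page 2 is blank.
print(pages_to_keep(6, [1, 4], [2]))  # [0, 3, 5]
```

A filter that the user did not select simply contributes an empty list, so the same call works for any combination of options.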
Full Code
import streamlit as st
from pdfrw import PdfReader, PdfWriter, PageMerge, IndirectPdfDict
import pathlib
import os
from os import path
from glob import glob
from PIL import Image
import numpy as np
import comtypes.client
from pathlib import Path

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_nullpage_from_pdf(pdf_path, remove_word):
    to_del = []
    try:
        # Works when pdf_path is an uploaded (file-like) object
        for i, page in enumerate(PDFPage.get_pages(pdf_path, caching=True, check_extractable=True)):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            # close open handles
            converter.close()
            fake_file_handle.close()

            if remove_word in text:
                to_del.append(i)

    # If it is a path instead
    except AttributeError:
        with open(pdf_path, 'rb') as fp:
            for i, page in enumerate(PDFPage.get_pages(fp, caching=True, check_extractable=True)):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(resource_manager, fake_file_handle)
                page_interpreter = PDFPageInterpreter(resource_manager, converter)
                page_interpreter.process_page(page)
                text = fake_file_handle.getvalue()
                # close open handles
                converter.close()
                fake_file_handle.close()

                if remove_word in text:
                    to_del.append(i)

    return to_del


def extract_null_from_pdf(pdf_path):
    blank_pg = []
    try:
        # Works when pdf_path is an uploaded (file-like) object
        for i, page in enumerate(PDFPage.get_pages(pdf_path, caching=True, check_extractable=True)):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            # close open handles
            converter.close()
            fake_file_handle.close()

            if text == " \f":
                blank_pg.append(i)

    # If it is a path instead
    except AttributeError:
        with open(pdf_path, 'rb') as fp:
            for i, page in enumerate(PDFPage.get_pages(fp, caching=True, check_extractable=True)):
                resource_manager = PDFResourceManager()
                fake_file_handle = io.StringIO()
                converter = TextConverter(resource_manager, fake_file_handle)
                page_interpreter = PDFPageInterpreter(resource_manager, converter)
                page_interpreter.process_page(page)
                text = fake_file_handle.getvalue()
                # close open handles
                converter.close()
                fake_file_handle.close()

                if text == " \f":
                    blank_pg.append(i)

    return blank_pg


#################### Streamlit ####################
def load_image(img):
    im = Image.open(img)
    image = np.array(im)
    return image


st.image(load_image(os.path.join(os.getcwd(), "Title.png")))
st.write("##")

st.subheader("Choose Options")

config_select_options = st.selectbox("Select option:", ["Input files manually", "Input path"], 0)

if config_select_options == "Input files manually":
    uploaded_file_pdf = st.file_uploader("Upload PDF Files", type=['pdf'], accept_multiple_files=True)
    # uploaded_file_doc = st.file_uploader("Upload doc/docx Files", type=['docx', 'doc'], accept_multiple_files=True)
else:
    input_path = st.text_input("Please input the path of your folder")
    uploaded_file = []
    if len(input_path) > 0:
        st.write(f"PDFs in this path {input_path} will be uploaded")

output_path = st.text_input("Please input the output path to house your PDFs")
if len(output_path) > 0:
    st.write(f"Amended PDFs will be housed in this path {output_path}")

st.write("---")
st.subheader("Choose type of manipulation to PDF")

config_select_manipulation = st.multiselect(
    "Select one or more options:",
    ["Add Watermark", "Remove Metadata", "Concatenate PDFs", "Remove blank pages", "Remove pages that contain words/phrases"],
    ["Add Watermark", "Remove Metadata", "Concatenate PDFs"],
)
if "Add Watermark" in config_select_manipulation:
    uploaded_file_wmp = st.file_uploader("Upload watermark PDF for portrait", type=['pdf'])
    uploaded_file_wml = st.file_uploader("Upload watermark PDF for landscape", type=['pdf'])
remove_word = ""
if "Remove pages that contain words/phrases" in config_select_manipulation:
    remove_word = st.text_input("Input words / phrases contain in the page so that the page will be removed:", "To remove")

#################### Actual Code ####################

def main():
    ## Set up progress bar
    st.write("---")
    st.subheader("Status")

    progress_bar = st.progress(0)
    status_text = st.empty()

    status_text.text("In progress... Please Wait.")

    ## Checking the output directory
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    status_text.text("Output path checked okay. Proceeding to next step...")

    ## Define the reader and writer objects
    writer = PdfWriter()
    if "Add Watermark" in config_select_manipulation:
        watermark_input_P = PdfReader(uploaded_file_wmp)
        watermark_input_LS = PdfReader(uploaded_file_wml)
        watermark_P = watermark_input_P.pages[0]
        watermark_LS = watermark_input_LS.pages[0]
        status_text.text("Loaded Watermark PDF. Progressing to the next step...")
    progress_bar.progress(0.25)

    def find_ext(dr, ext):
        return glob(path.join(dr, "*.{}".format(ext)))

    wdFormatPDF = 17
    # PDFs generated from Word documents (assumed to be what filepath_pdf refers to)
    filepath_pdf = []

    if config_select_options == "Input path":
        filepath_doc = find_ext(input_path, "doc")
        filepath_docx = find_ext(input_path, "docx")
        filepath_all_doc = filepath_doc + filepath_docx

        # Convert every Word document in the folder to PDF first
        for file in filepath_all_doc:
            name = Path(file).name.split(".")[0]
            word = comtypes.client.CreateObject('Word.Application', dynamic=True)
            word.Visible = True
            doc = word.Documents.Open(file)
            converted = input_path + "\\" + name + ".pdf"
            doc.SaveAs(converted, wdFormatPDF)
            doc.Close()
            word.Quit()
            filepath_pdf.append(converted)
        filepath = find_ext(input_path, "pdf")
    else:
        filepath = uploaded_file_pdf

    ## Create a loop for all the paths
    for i in range(len(filepath)):
        file = filepath[i]
        reader_input = PdfReader(file)
        status_text.text(f"Processing {file} now...")

        if "Remove pages that contain words/phrases" in config_select_manipulation:
            to_del = extract_nullpage_from_pdf(file, remove_word)
        else:
            to_del = []

        if "Remove blank pages" in config_select_manipulation:
            blank_pg = extract_null_from_pdf(file)
        else:
            blank_pg = []

        if "Add Watermark" in config_select_manipulation:
            ## go through the pages one after the next
            for current_page in range(len(reader_input.pages)):
                if current_page in to_del:
                    pass
                elif current_page in blank_pg:
                    pass
                else:
                    #if reader_input.pages[current_page].contents is not None:
                    merger = PageMerge(reader_input.pages[current_page])

                    try:
                        mediabox = reader_input.pages[current_page].values()[1]['/Kids'][0]['/MediaBox']
                    except TypeError:
                        mediabox = reader_input.pages[0].values()[1]

                    # portrait pages get the portrait watermark, the rest the landscape one
                    if mediabox[2] < mediabox[3]:
                        merger.add(watermark_P).render()
                    else:
                        merger.add(watermark_LS).render()
                    writer.addpage(reader_input.pages[current_page])
            status_text.text(f"Watermark done for {file}...")
        else:
            # No watermark: copy across only the pages that are being kept
            for current_page, page in enumerate(reader_input.pages):
                if current_page not in to_del and current_page not in blank_pg:
                    writer.addpage(page)

        if "Remove Metadata" in config_select_manipulation:
            # Remove metadata
            writer.trailer.Info = IndirectPdfDict(
                Title='',
                Author='',
                Subject='',
                Creator='',
            )

        if "Concatenate PDFs" not in config_select_manipulation:
            writer.write(os.path.join(output_path, "Annex " + str(i + 1) + ".pdf"))
            writer = PdfWriter()

        status_text.text(f"{file} completed...")
        progress_bar.progress(0.25 + (0.75 / len(filepath)) * (i + 1))

    if "Concatenate PDFs" not in config_select_manipulation:
        status_text.text("All done!!!")
        st.balloons()
    else:
        # write the modified content to disk
        writer.write(os.path.join(output_path, "Annex.pdf"))

        if config_select_options != "Input files manually":
            for file in filepath_pdf:
                if os.path.exists(file):
                    os.remove(file)

        st.balloons()


st.write("---")
st.write("Once you have selected the required options above, you can click on the button below to start processing. ")

if st.button("Click here to start!"):
    main()
Final Thought
Other than these enhancements, I would love to dabble in hosting this application on Heroku as the next enhancement. If you are interested, stay tuned for my next article.
Lastly, a big thanks for reading my article! :)
References
[1] https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf
Other Projects
- PDF Manipulation Application Part 1 — Creating an application using Streamlit and PDFRW
- PDF Manipulation Application Part 3 — Deploying the application on Heroku and output result as zip file
- Price Comparison using Streamlit and Selenium
- Tabular data from PDF: Camelot vs Tabula? Why not use both together?