Getting your .pdfs into R

Doing quantitative text analysis often means working with documents in .pdf format, which may or may not be machine-readable. Assuming we are using RStudio, how do we read these files into our environment so that we can clean, process, and analyze them? One way is to use pdftools, a brilliant R package created by Jeroen Ooms for exactly this purpose.

For more on the pdftools library, including some additional functions not covered here, you can download the documentation from CRAN.

Installing pdftools

First, let's install and load in pdftools.

#install.packages("pdftools")
library(pdftools)

To use the built-in optical character recognition (OCR) functionality in pdftools, we'll also need a package called tesseract, so let's install and load that as well.

#install.packages("tesseract")
library(tesseract)

Reading your .pdf files into R

Now that we've got pdftools and tesseract up and running, let's check out some of the functions available for getting our .pdf data into our RStudio environment. For this tutorial, I'll be using the RCMP External Review Committee's 2018-2019 Annual Report. You can download a copy of this report from my GitHub.

Let's start by reading in our report by page, which we can do using the pdf_text() function. The result will be a character vector of length 48 (one element for each page in our document). Once we've read in the pages of our document, let's print one of them and inspect it.

#read in pdf document using pdf_text() function
rcmp_report <- pdf_text("~/Desktop/personal-website/Blog_Files/RCMP_External_Review_Committee_AR_2018.pdf")

#inspect results, printing results of page 30
rcmp_report[30]
## [1] "Corporate Management and Planning\nThe ERC continued to receive a wide scope of corporate services infrastructure, advice\nand transactional support from Public Safety and Emergency Preparedness Canada\nunder a memorandum of understanding. The small agency and administrative tribunal\ncommunities were also sources of advice and support, both through established\nnetworks and informally.\nA corporate service priority over the year was managing the accommodations re-fit\nproject for the ERC’s office space, which continued throughout the year and carried over\ninto 2019-20. The project is anticipated to be completed in June 2019.\nIncreasing the case review capacity of the ERC using program integrity funding\napproved in 2017 was a priority so that the ERC will be able to begin to reduce its large\nbacklogged caseload. The number of resourced staff positions at the ERC increased\nfrom eight at the beginning of 2018 to fifteen by the end of March, 2019. The\nposting of the appointment opportunities for both the ERC Chairperson and the Vice\nChairperson signalled a long-needed increase in the ERC’s capacity to write reports in\nresponse to the appeals referred by the RCMP Commissioner. The ERC looks to those\nappointments to support its goal of reducing the wait times to a reasonable period.\nThe ERC continued to work with the portfolio department and central agencies in\npursuit of a long term program resource level, which remains of particular importance\ngiven that the program integrity funding approved in 2017 for the\nERC will end on March 31, 2021.\n   24                    ANNUAL REPORT 2018-19\n"
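Because each page comes back as one long string with \n marking the line breaks, it is often handy to split a page into its individual lines before cleaning it. Here is a minimal sketch using only base R. The page object is a shortened stand-in for rcmp_report[30] (so the example runs without the .pdf), and page_lines is just an illustrative name.

```r
# Shortened stand-in for one element of the vector returned by pdf_text()
page <- "Corporate Management and Planning\nThe ERC continued to receive a wide scope of corporate services infrastructure, advice\n   24                    ANNUAL REPORT 2018-19\n"

# Split on newline characters; strsplit() returns a list, so take element 1
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Trim leading/trailing whitespace and drop any empty lines
page_lines <- trimws(page_lines)
page_lines <- page_lines[nzchar(page_lines)]

page_lines
```

If you only want to read a page comfortably in the console, cat(rcmp_report[30]) prints it with the line breaks rendered rather than shown as \n.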

What if we wanted to read in our .pdf file as a dataframe instead? To do that, all we need to do is swap out pdf_text() for pdf_data(). Rather than create a character string containing the text of each page, pdf_data() will return a list with a separate dataframe for each page in our document, with one row per word. In this case, we'll get 48 separate dataframes. As we'll see, the individual dataframes also come with some extra information about each word. The width, height, x, and y variables describe the size and location of the word on the page, while the space variable is TRUE or FALSE depending on whether the word is followed by a space or a linebreak (words followed by a linebreak rather than a space are labeled FALSE).

#read in pdf document using pdf_data() function
rcmp_report <- pdf_data("~/Desktop/personal-website/Blog_Files/RCMP_External_Review_Committee_AR_2018.pdf")

#inspect results, printing results of page 1 (title page)
rcmp_report[1]
## [[1]]
##   width height   x   y space   text
## 1   155     28  72 172  TRUE ANNUAL
## 2   137     28 238 172 FALSE REPORT
## 3    83     28  72 220  TRUE   2018
## 4    54     28 159 220 FALSE    -19
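One thing the x and y coordinates make possible is rebuilding the lines of a page from its individual words. The sketch below does this with base R on a hand-typed copy of the title-page dataframe printed above, so it runs without the .pdf. It assumes that words on the same line share exactly the same y value, which holds for this page but may not for every document.

```r
# Hand-typed copy of the title-page dataframe printed above
title_page <- data.frame(
  width  = c(155, 137, 83, 54),
  height = c(28, 28, 28, 28),
  x      = c(72, 238, 72, 159),
  y      = c(172, 172, 220, 220),
  space  = c(TRUE, FALSE, TRUE, FALSE),
  text   = c("ANNUAL", "REPORT", "2018", "-19"),
  stringsAsFactors = FALSE
)

# Sort words top-to-bottom, then left-to-right
title_page <- title_page[order(title_page$y, title_page$x), ]

# Paste together the words that share a y coordinate (i.e. sit on the same line)
page_lines <- as.vector(
  tapply(title_page$text, title_page$y, paste, collapse = " ")
)

page_lines
```

With a real document you would run the same steps on rcmp_report[[1]] (double brackets pull the dataframe itself out of the list).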

In addition to reading in our .pdf file, we may want to extract certain metadata about it as well. pdftools has a few handy functions that can be used to extract things like the number of pages, the fonts being used, the table of contents, whether there are any attachments, or the original date and time the document was created. To get these data we can use the pdf_info(), pdf_fonts(), pdf_attachments(), pdf_toc(), and pdf_pagesize() functions.

Let's try the pdf_info() function and pull the data on how many pages are in this particular report, the date and time it was created, and whether or not there are any attachments.

#get info from pdf using pdf_info
rcmp_report_info <- pdf_info("~/Desktop/personal-website/Blog_Files/RCMP_External_Review_Committee_AR_2018.pdf")

#print 2nd, 6th, and 10th items of info list (I inspected earlier and so happen to know these are the number of pages, date created, and attachments variables)
rcmp_report_info[c(2, 6, 10)]
## $pages
## [1] 48
## 
## $created
## [1] "2019-06-21 02:16:16 EDT"
## 
## $attachments
## [1] FALSE
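Picking items by position works, but it relies on remembering where each item sits in the list. Since the output above shows that the items are named, selecting them by name is an alternative that is easier to read back later. A small sketch, using a stand-in list that reproduces only the three items printed above (the real pdf_info() result has more items, and created is kept as plain text here purely for illustration):

```r
# Stand-in for the list returned by pdf_info(); only three items are
# reproduced, using the values printed above
rcmp_report_info <- list(
  pages = 48L,
  created = "2019-06-21 02:16:16 EDT",
  attachments = FALSE
)

# Select items by name instead of by position
rcmp_report_info[c("pages", "created", "attachments")]

# A single item can be pulled out with $
rcmp_report_info$pages
```

Running names() on the real result lists every item that is available.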

.pdf files that are not machine-readable

While it is obviously brilliant when our .pdfs are machine-readable like the one above, this is not always the case. In an earlier post, I wrote about how to convert your non-machine-readable .pdf files into machine-readable .txt format using some simple Python code written by my brilliant collaborator Kevin Dick. Well, it turns out there is an even easier way to do this using pdftools, which is especially useful if, like me, you're working primarily in RStudio. One of my favourite aspects of pdftools is that it has a built-in OCR capability using tesseract (one of the best open source OCR engines currently available).

To illustrate this function, I'll use an 11-page, non-machine-readable .pdf document obtained under Canada's Access to Information Act (a memorandum to Canada's Minister of National Defence). You can download a copy of this file here.

The functions for reading in our non-machine-readable .pdf file and converting it to text are pdf_ocr_text() and pdf_ocr_data(). As above, there is a text and a data variant of the function (the data variant includes an additional variable called 'confidence', which scores how confident the tesseract algorithm is in its .png to .txt conversion for each word).

#read in pdf document using pdf_ocr_text() function
atip_file <- pdf_ocr_text("~/Desktop/personal-website/Blog_Files/Memorandum_to_MND.pdf")
## Converting page 1 to Memorandum_to_MND_1.png... done!
## Converting page 2 to Memorandum_to_MND_2.png... done!
## Converting page 3 to Memorandum_to_MND_3.png... done!
## Converting page 4 to Memorandum_to_MND_4.png... done!
## Converting page 5 to Memorandum_to_MND_5.png... done!
## Converting page 6 to Memorandum_to_MND_6.png... done!
## Converting page 7 to Memorandum_to_MND_7.png... done!
## Converting page 8 to Memorandum_to_MND_8.png... done!
## Converting page 9 to Memorandum_to_MND_9.png... done!
## Converting page 10 to Memorandum_to_MND_10.png... done!
## Converting page 11 to Memorandum_to_MND_11.png... done!
#inspect results, printing results of page 1
atip_file[1]
## [1] "s.26\n| on\na og | So bc exhicas cod wee |\n«= Communications Security Centre de la sécurite yea\n“Establishment _ _des telecommunications oC oo LD\ne % q 3 7 017 CERRID # 33143613\nMEMORANDUM FOR THE MINISTER OF NATIONAL DEFENCE\nResponse to CSE Commissioner’s\nReview of the CSE Procedural Errors and CSE and Second Party Privacy Incidents.\n(For Approval)\nSummary\n| e The CSE Commissioner completed his annual Review of the CSE Procedural Errors | |\nand CSE and Second Party Privacy Incidents. |\nz\nBACKGROUND\ne You received a letter and report from the CSE Commissioner, dated January 6, 2017,\nproviding the results of his Review of the CSE Procedural Errors and CSE and Second\nParty Privacy Incidents.\ne The review examined the process that CSE uses to monitor compliance of its\noperations with legal responsibilities, ministerial requirements, operational policies and\nprocedures. The process involves compliance incidents and procedural errors of\nprivacy interest, and the associated mitigative and corrective actions.\ne The review examined three files including CSE Privacy Incident File (PIF), Second\nParty Incidents File (SPIF) and Minor Procedural Errors Record (MPER). The SPIF was\nintroduced in January 2016 to clarify the record keeping process in relation to incidents\nattributable to CSE from those attributable to Second Party partners.\n@\nEe U4 7-00029-00001\n"
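As the stray characters in the output above suggest, OCR is rarely perfect, and this is where the confidence variable from the data variant can help. The sketch below flags low-confidence words in a single page. It runs on a hypothetical stand-in dataframe: the column names word and confidence are assumed to match what tesseract's ocr_data() returns, and the confidence scores and the cut-off of 60 are invented purely for illustration.

```r
# Hypothetical stand-in for one page of pdf_ocr_data() output.
# Column names are assumed to match tesseract::ocr_data();
# the confidence scores are invented for illustration.
ocr_page <- data.frame(
  word = c("MEMORANDUM", "FOR", "THE", "MINISTER", "oC", "yea"),
  confidence = c(96.1, 95.4, 96.8, 93.2, 21.5, 34.0),
  stringsAsFactors = FALSE
)

# Arbitrary cut-off; tune it by inspecting your own documents
threshold <- 60

# Words to check by hand
low_conf <- ocr_page[ocr_page$confidence < threshold, ]

# Text rebuilt from the words that clear the cut-off
kept_text <- paste(ocr_page$word[ocr_page$confidence >= threshold], collapse = " ")

low_conf
kept_text
```

Dropping words outright can remove real content, so it is usually safer to review the flagged words first than to filter them blindly.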

Working with multiple .pdf files

To apply any of these functions to multiple .pdf files rather than just one, you can make use of list.files() and lapply(), both base R functions, like this:

#set wd to where files are located
setwd("~/Desktop/personal-website/Blog_Files")

#create character vector of all file names ending with .pdf in folder
pdf_files <- list.files("~/Desktop/personal-website/Blog_Files/", pattern = "\\.pdf$")

#use lapply() to apply pdf_text or other pdftools function iteratively across each of the files
results <- lapply(pdf_files, pdf_text)
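
A variation on this that avoids changing the working directory is to ask list.files() for full paths. The sketch below shows the idea on a throwaway folder of empty placeholder files (demo_dir and the file names are made up for illustration), so the pdftools step is left as a comment.

```r
# Create a throwaway folder with empty placeholder files (illustration only)
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("report_a.pdf", "report_b.pdf", "notes.txt")))

# full.names = TRUE returns paths that work from any working directory;
# "\\.pdf$" matches only file names that end in ".pdf"
pdf_files <- list.files(demo_dir, pattern = "\\.pdf$", full.names = TRUE)

basename(pdf_files)

# With real .pdf files, naming the results makes each element traceable
# back to its source file:
# results <- lapply(pdf_files, pdftools::pdf_text)
# names(results) <- basename(pdf_files)
```

Naming the list means you can later pull out one document's pages with results[["report_a.pdf"]] rather than by position.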
Alex Luscombe
PhD Candidate in Criminology