A Gentle Introduction to Tesseract OCR
It is a perennial problem in Canada that municipal, provincial, and federal government agencies disclose records under Access to Information (ATI)/Freedom of Information (FOI) law in non-machine-readable (image) formats by default. Lengthy reports, emails, and Excel files are often printed and scanned by access coordinators before they are released to the requester.
In some cases, coordinators may be willing to release the data in a "raw" format. However, this is not always the case, and inexperienced requesters may not realize that this is something they can ask for (indeed, they may not realize that the thousands of pages they have requested will arrive as images until it is too late).
The inability to machine-read these texts limits the analytic techniques that may be applied. It is also a barrier to access. Government agencies often "overproduce" when processing requests by including mounds of irrelevant text in a disclosure package. Manually sifting through thousands of pages of image-format documents disclosed under ATI/FOI in search of one or two lines or keywords is like looking for a needle in a haystack.
Fortunately, a number of free and open-source solutions to this problem exist. In the field of computer science, transforming scanned images into machine-readable text is widely considered to be a "solved" problem. One state-of-the-art solution is the Tesseract Optical Character Recognition (OCR) engine, widely regarded as one of the best OCR engines available.
The goal of this project is to show you how to use Tesseract OCR, which we have made easily accessible to you with some simple Python code. It is part of a larger series of projects the CAIJ team intends to launch to promote computer literacy and algorithmic tools among non-computer scientists.
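To give a sense of how little Python is involved, here is a minimal sketch. It is not the code from the CAIJ tutorial. It assumes the Tesseract engine is installed, along with the pytesseract wrapper and the Pillow imaging library. The names ocr_pages and ocr_fn are illustrative only.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def _tesseract_ocr(image_path: Path) -> str:
    """Run Tesseract on one image file and return the recognized text."""
    # Assumption: pytesseract and Pillow are installed. They are imported
    # lazily so the rest of this module loads even when they are missing.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang="eng")


def ocr_pages(
    image_paths: Iterable,
    ocr_fn: Optional[Callable[[Path], str]] = None,
) -> Dict[str, str]:
    """Map each scanned page (by file path) to its extracted text.

    ocr_fn is a hypothetical hook: pass your own function to swap in a
    different OCR engine; by default Tesseract is used.
    """
    ocr = ocr_fn or _tesseract_ocr
    return {str(p): ocr(Path(p)) for p in image_paths}
```

With a folder of scanned pages (the folder name "scans" is a placeholder), a call such as ocr_pages(sorted(Path("scans").glob("*.png"))) would return one block of text per page. Scanned PDFs would first need to be converted to images, a step this sketch leaves out.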
In launching this project, we hope to improve access to open-source tools that can eliminate many of the barriers to accessing information. The ability to convert a document into a format that can be searched for keywords and phrases, and possibly studied using natural language processing (NLP) methods alongside more traditional qualitative ones, promises to revolutionize social science research.
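Once a disclosure package exists as text, keyword searching takes only a few lines of standard-library Python. The sketch below assumes the text is held in a dictionary mapping a page name to that page's text. The function name find_keyword and the sample pages are made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def find_keyword(pages: Dict[str, str], keyword: str) -> List[Tuple[str, int, str]]:
    """Return (page name, line number, line text) for every line that
    contains the keyword, ignoring case."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for name, text in pages.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((name, line_no, line.strip()))
    return hits


# Made-up sample pages standing in for machine-readable disclosure text.
sample_pages = {
    "page_001": "Briefing note\nSubject: budget forecast",
    "page_002": "No responsive records.",
}

for page, line_no, line in find_keyword(sample_pages, "budget"):
    print(page, line_no, line)
```

Matching line by line keeps each hit tied to a page and line number, which makes it easier to locate the passage in the original scan.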
Full tutorial on GitHub:
Companion how-to video on YouTube:
- Access to Information and Optical Character Recognition (OCR): A Step-by-Step Guide to Tesseract. Part one of the CAIJ Computer Literacy Series
- Getting your .pdfs into R
- Neither confirm nor deny
- Freedom of Information Research and Cultural Studies: A Subterranean Affinity
- Policing Studies and the Use of Freedom of Information Requests