Extracting Text and Data from Files Using OCR

Nick Wolf and Vicky Steeves | March 1, 2018

What is OCR and What Can We Do with It?

OCR = Optical Character Recognition
A system that analyzes an image of writing glyph by glyph and turns it into a document of machine-readable characters
High-performing OCR depends on machine learning: you train the software to recognize images of characters, including unusual fonts, non-English-language texts, etc.
Many options, but for highest accuracy (especially on older documents) use ABBYY or Tesseract

Use it to turn documents into machine-readable texts
Significantly reduce transcribing and typing
Build personal digital archives/libraries for advanced text mining and other digital methods projects

Option One: ABBYY FineReader

Basic specs:

Mac and Windows versions. Currently, Windows interface much better.
30-day Trial License Available (great for one-off projects)
Discount license ($118.99 currently) for teachers/students/researchers
Two Mac versions on scanner computer stations in Digital Studio at Bobst; free-trial version in Data Services lab, hopefully full license coming soon.

ABBYY Tutorial Materials

Log in to your NYU E-mail/Home/Drive account
Download this folder and save it somewhere on the system where you can find it.
FYI: Best resolution for OCR is 300 dpi

Step One: Load Document

Generally, select "Image or PDF File to Other Formats"

Step Two: Initial Pass by ABBYY

Hands off! Let ABBYY try to recognize every page first

Step Three: The Windows FineReader Dashboard

When asked if you want to save, select No. We need to refine the output considerably first.

Text Area Types

Text Area (light green)
Picture Area (red)
Table Area (blue)

Step Four: Best Workflow Order of Operations

1. Work page by page to first DELETE all unwanted text/picture/table areas.
2. ADJUST the size of the remaining areas to capture everything you want on the page, preserving the original text/picture/table boxes.

Why? Order matters! The output text follows the order of the text/picture/table boxes in the first window. Any boxes you add get tacked on to the end of the page. So preserve the automatically created order of boxes first by deleting and editing what is already there.

3. ADD text areas as needed, then adjust their order.

4. Click "Read" on top menu from time to time to generate a new output text. This allows ABBYY to continue to use its embedded tools to guess at words

5. Use the bottom detail window to adjust the location of rows and columns in tables. Select your table area using the selection pointer, then click on a table row/column line to delete, add, or move separators.

6. Implement the pattern trainer by following the "Creating and Training a User Pattern" tutorial.

7. FINALLY you may want to walk through the right-hand text editor window and correct mistakes marked in blue. However, note that ABBYY marks in blue both things that are wrong and things that might be wrong. Don't waste time trying to eliminate all blue markup.

Step Five: Save and Export

Save the ABBYY project bundle. Go to FILE >> Save FineReader Document and save the project from time to time.
When ready to export, hit the "Save" icon in the top menu bar and select an output format. Note also the dropdown options under the "Document Layout" section. Opt in or out to keep things like pictures, page numbers, headers, footers, line breaks, hyphens.
Best options: RTF (if you want the exact layout on the page, especially line breaks) or TXT (if you just want text and line breaks).

Option Two (the free option): Tesseract

Find documentation on the Tesseract wiki: github.com/tesseract-ocr/tesseract/wiki
Install by first loading Homebrew on Mac (if you don't have it already). From Terminal, run the installer command listed at brew.sh, currently: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" (the older /usr/bin/ruby installer is deprecated)
Then type: brew install tesseract

For Windows, check install options here: github.com/tesseract-ocr/tesseract/wiki/Downloads
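
After installing, it can help to confirm that the tesseract command is on your path and to see which languages it has training data for. A minimal sketch, assuming a bash-compatible shell (Terminal on Mac, or something like Git Bash on Windows); check_tesseract is a made-up helper name, not part of Tesseract:

```shell
# check_tesseract: hypothetical helper name, not a Tesseract command.
# Prints the version line and installed languages if tesseract is on the PATH.
check_tesseract() {
  if command -v tesseract >/dev/null 2>&1; then
    tesseract --version 2>&1 | head -n 1   # first line holds the version number
    tesseract --list-langs 2>&1            # languages with installed training data
    return 0
  else
    echo "tesseract not found; install it first" >&2
    return 1
  fi
}

check_tesseract || true   # report the problem without aborting the shell
```

If the language you need is missing from the list, the OCR command will not be able to use it until that language's data is installed.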

Tesseract Overview

Works on image files rather than PDFs; uncompressed .tiff is the safest input. Use tools like ImageMagick and Ghostscript to make conversions.
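
A conversion from a scanned PDF to an uncompressed TIFF might look like the sketch below. It assumes ImageMagick's convert command (called magick in ImageMagick 7) is installed, along with Ghostscript for reading PDFs. The names tiff_name, pdf_to_tiff, and scan.pdf are placeholders for illustration. The 300 dpi setting matches the resolution recommended earlier for OCR.

```shell
# tiff_name: hypothetical helper that maps "scan.pdf" to "scan.tiff".
tiff_name() {
  printf '%s.tiff\n' "${1%.*}"
}

# pdf_to_tiff: hypothetical wrapper that converts one PDF to an uncompressed TIFF.
# Assumes ImageMagick ("convert") with Ghostscript available for reading PDFs.
pdf_to_tiff() {
  # -density goes before the input file so the PDF is rasterized at 300 dpi;
  # the remaining options drop transparency and metadata and disable compression.
  convert -density 300 "$1" -depth 8 -strip -background white -alpha off \
    -compress none "$(tiff_name "$1")"
}

# Example (scan.pdf is a placeholder file name):
#   pdf_to_tiff scan.pdf
```

A multi-page PDF converted this way yields one multi-page TIFF, so check the result before running a long batch.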

Running Tesseract

Runs from the command line, but don't be intimidated. The basic command is: tesseract input-image-location output-text-location (Tesseract adds the .txt extension to the output name itself)
To output an hOCR file (HTML with bounding boxes), use: tesseract input-image-location output-text-location hocr
Batch OCR (the output folder must already exist): for item in *.tif; do tesseract "$item" "output_folder_name/${item%.tif}"; done
Training must also be done through command line, and that is a little harder. See tutorial at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract.
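
The batch loop above can be expanded into a slightly more defensive sketch. It quotes file names so that spaces are safe, creates the output folder if needed, and strips the image extension so results are named page1.txt rather than page1.tif.txt. The folder name ocr_output and the helper names output_base and batch_ocr are illustrative assumptions, not part of Tesseract.

```shell
# output_base: hypothetical helper. Tesseract appends .txt itself, so we
# hand it the output path without the image extension.
# $1 = output folder, $2 = image file
output_base() {
  name="$(basename "$2")"
  printf '%s/%s\n' "$1" "${name%.*}"
}

# batch_ocr: hypothetical wrapper that OCRs every .tif in the current folder.
# $1 = output folder (defaults to ocr_output)
batch_ocr() {
  outdir="${1:-ocr_output}"
  mkdir -p "$outdir"
  for item in *.tif; do
    [ -e "$item" ] || continue   # skip the loop body if no .tif files matched
    tesseract "$item" "$(output_base "$outdir" "$item")"
  done
}

# Example, run from the folder that holds your .tif files:
#   batch_ocr ocr_output
```

Adding hocr at the end of the tesseract line inside the loop would produce hOCR output for each page instead of plain text.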

Happy OCRing. Questions?

Email us: vicky.steeves@nyu.edu & nicholas.wolf@nyu.edu

Learn more about RDM: guides.nyu.edu/data_management

Get this presentation: guides.nyu.edu/data_management/resources

Make an appointment: guides.nyu.edu/appointment