Nick Wolf and Vicky Steeves | March 1, 2018
Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Basic specs:
1. Generally, select "Image or PDF File to Other Formats".
2. Hands off! Let ABBYY try to recognize every page first.
3. When asked if you want to save, select "No". The output still needs considerable refinement.
Why? Order matters! The output text follows the order of the text, picture, and table boxes in the first window. Any boxes you add are appended to the end of the page. So preserve the automatically created order of boxes first, deleting and editing what is already there before adding anything new.
4. Click "Read" in the top menu from time to time to generate a new output text. This lets ABBYY keep using its built-in tools to guess at words.
5. Use the bottom detail window to adjust the location of rows and columns in tables. Select your table area using the selection pointer, then click a row or column line to delete, add, or move separators.
6. Use the pattern trainer by following the "Creating and Training a User Pattern" tutorial.
7. FINALLY, you may want to walk through the right-hand text editor window and correct the mistakes marked in blue. Note, however, that ABBYY marks in blue both things that are wrong and things that might be wrong, so don't waste time trying to eliminate all of the blue markup.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install tesseract
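Once the install finishes, a quick check along these lines can confirm that the tesseract command is on your PATH and show which languages it can read. This is only a sketch: the helper name have_tesseract is made up for illustration, while --version and --list-langs are standard Tesseract options.

```shell
# Hypothetical helper: succeeds if a "tesseract" command can be found.
have_tesseract() {
    command -v tesseract >/dev/null 2>&1
}

if have_tesseract; then
    # Print the installed version and the available language models.
    tesseract --version
    tesseract --list-langs
else
    echo "tesseract not found on PATH; try: brew install tesseract" >&2
fi
```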
tesseract input-image-location output-text-location
tesseract input-image-location output-text-location hocr
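One reason to ask for hOCR output is that it records a confidence score for each recognized word, which can point you to the words most worth checking by hand. The sketch below assumes the hOCR file has one ocrx_word span per line with an x_wconf value in its title attribute, which is how recent Tesseract versions typically write it; verify this against your own output. The function name low_confidence_words and the threshold of 60 are illustrative choices, not part of Tesseract.

```shell
# Hypothetical helper: list words whose confidence is below a threshold.
# $1 = hOCR file, $2 = confidence threshold (0-100)
low_confidence_words() {
    awk -v limit="$2" '
        /ocrx_word/ && match($0, /x_wconf [0-9]+/) {
            # The number follows the 8 characters of "x_wconf ".
            conf = substr($0, RSTART + 8, RLENGTH - 8)
            word = $0
            sub(/^.*x_wconf [0-9]+[^>]*>/, "", word)  # drop everything through the opening tag
            gsub(/<[^>]*>/, "", word)                 # drop the closing tag and any inline tags
            if (conf + 0 < limit + 0) print conf "\t" word
        }
    ' "$1"
}

# Usage (assumes output-text-location.hocr was produced by the command above):
# low_confidence_words output-text-location.hocr 60
```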
for item in *.tif; do tesseract "$item" "output_folder_name/$item"; done
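The one-line loop above can be expanded into a small function for larger batches. This is a sketch under a few assumptions: batch_ocr is a made-up name, the images all end in .tif, and you want one text file per image named after the image without its extension. It creates the output folder if it is missing, copes with spaces in file names, and reports any image that Tesseract fails on instead of stopping.

```shell
# Hypothetical helper: OCR every .tif in one folder into another folder.
# $1 = folder of input images, $2 = folder for output text
batch_ocr() (
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || exit 1
    for item in "$in_dir"/*.tif; do
        # If no .tif files match, the pattern stays literal; skip it.
        [ -e "$item" ] || continue
        base=$(basename "$item" .tif)
        # Tesseract adds the .txt extension to the output name itself.
        tesseract "$item" "$out_dir/$base" || echo "failed: $item" >&2
    done
)

# Usage:
# batch_ocr . output_folder_name
```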
Email us: vicky.steeves@nyu.edu & nicholas.wolf@nyu.edu
Learn more about RDM: guides.nyu.edu/data_management
Get this presentation: guides.nyu.edu/data_management/resources
Make an appointment: guides.nyu.edu/appointment