Working with free text
Overview
Teaching: 20 min
Exercises: 40 min
Questions
How do we work with complex files?
Objectives
Use shell tools to clean and transform free text
Working with free text
So far we have looked at how to use the Unix shell to manipulate, count, and mine tabulated data. Some library data, especially digitized documents, is much messier than tabular metadata. Nonetheless, many of the same techniques can be applied to non-tabulated data, such as free text. We need to think carefully about what it is we are counting and how we can get the best out of the Unix shell.
Thankfully there are plenty of folks out there doing this sort of work, and we can borrow what they do as an introduction to working with these more complex files. For this final exercise we’re going to leap forward a little in terms of difficulty, to a scenario where we won’t learn about everything that is happening in detail or discuss each command at length. We’re going to prepare and pull apart texts to demonstrate some of the potential applications of the Unix shell. And where commands we’ve learnt about are used, I’ve left some of the figuring out to you - so please refer to your notes if you get stuck!
Before going any further, type in the Zoom chat the name(s) of a couple of people in the class you want to work with and choose which type of text you’d like to work on together. We will eventually put you into a breakout room. You have three options:
- An example of hand transcribed text: Gulliver’s Travels (1735)
- An example of text captured by an optical character recognition process: General Report on the Physiography of Maryland. A dissertation, etc. (Reprinted from Report of Maryland State Weather Service.) [With maps and illustrations.] 1898 (from https://doi.org/10.21250/db12)
- An example of a webpage: Piper’s World (a GeoCities page from 1999 saved at archive.org)
Option 1: Hand transcribed text
Grabbing a text, cleaning it up
We’re going to work with the gulliver.txt
file we made in the previous lesson.
You should (still) be working in the shell-lesson
directory.
Let’s look at the file.
$ less -N gulliver.txt
1 <U+FEFF>The Project Gutenberg eBook, Gulliver's Travels, by Jonatha
1 n Swift
2
3
4 This eBook is for the use of anyone anywhere at no cost and with
5 almost no restrictions whatsoever. You may copy it, give it away o
5 r
6 re-use it under the terms of the Project Gutenberg License included
7 with this eBook or online at www.gutenberg.org
8
9
10
11
12
13 Title: Gulliver's Travels
14 into several remote nations of the world
15
16
17 Author: Jonathan Swift
18
19
20
21 Release Date: June 15, 2009 [eBook #829]
22
23 Language: English
24
25 Character set encoding: UTF-8
26
27
28 ***START OF THE PROJECT GUTENBERG EBOOK GULLIVER'S TRAVELS***
29
30
31 Transcribed from the 1892 George Bell and Sons edition by David Pri
31 ce,
32 email ccx074@pglaf.org
33
34
35
We’re going to start by using the sed command, a ‘stream editor’ that lets us transform text as it passes through.
$ sed '9352,9714d' gulliver.txt > gulliver-nofoot.txt
The sed command, in combination with the d (delete) action, will look at gulliver.txt and delete all lines in the range of rows specified - here, the footer material at the end of the file. The > redirect then sends this edited text to the new file specified.
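If you want to check what you are about to delete before running the command, sed can print a range of lines instead of deleting it. This is an optional sanity check, not part of the original recipe; the row numbers are the same ones used above.
$ sed -n '9352,9355p' gulliver.txt
$ sed -n '9710,9714p' gulliver.txt
The -n flag suppresses sed’s default output and the p action prints only the lines in the given range, so you can confirm that rows 9352 to 9714 really do contain the footer before removing them.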
$ sed '1,37d' gulliver-nofoot.txt > gulliver-noheadfoot.txt
This does the same as before, but for the header.
You now have a cleaner text. The next step is to prepare it even further for rigorous analysis.
We now use the tr command, used for translating or deleting characters. Type and run:
$ tr -d '[:punct:]\r' < gulliver-noheadfoot.txt > gulliver-noheadfootpunct.txt
This uses the translate command and a special syntax to remove all punctuation ([:punct:]) and carriage returns (\r). It also requires the use of both the output redirect > we have seen and the input redirect <, which we haven’t seen before: it feeds the contents of the file into the command.
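As an optional check (not part of the original recipe), you can ask grep how many lines in each file contain a punctuation character; the count for the cleaned file should be at or near zero.
$ grep -c '[[:punct:]]' gulliver-noheadfoot.txt
$ grep -c '[[:punct:]]' gulliver-noheadfootpunct.txt
grep -c counts matching lines rather than individual characters, which is enough to confirm the punctuation is gone. Depending on your locale, the ‘smart’ quotes discussed in the challenge below may still register in the second count.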
Finally, regularize the text by converting all the uppercase lettering to lowercase.
$ tr '[:upper:]' '[:lower:]' < gulliver-noheadfootpunct.txt > gulliver-clean.txt
Open gulliver-clean.txt in a text editor. Note how the text has been transformed, ready for analysis.
Pulling a text apart, counting word frequencies
We are now ready to pull the text apart.
$ tr ' ' '\n' < gulliver-clean.txt | sort | uniq -c | sort -nr > gulliver-final.txt
Here we’ve made extended use of the pipes we saw in Counting and mining with the shell. The first part of this script uses the translate command again, this time to translate every blank space into \n, which renders as a new line. At this stage every word in the file has its own line.
The second part uses the sort command to rearrange the text from its original order into an alphabetical configuration.
The third part uses uniq, another new command, in combination with the -c flag to collapse duplicate lines and prefix each remaining line with a count of how many times it appeared.
The fourth and final part sorts the text again, this time numerically and in reverse order (-nr), so the words with the highest counts from step three appear first.
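To see the most frequent words without opening the whole file, you can peek at the top of the output. This is an optional extra step, not part of the original recipe.
$ head -n 10 gulliver-final.txt
Because the final sort is numeric and reversed (-nr), the first lines show the most common entries. The list may be headed by a count of blank lines (produced where the text contained runs of spaces), followed by common function words like ‘the’ and ‘of’.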
Challenge
There are still some remaining punctuation marks in the text. They are called ‘smart’ or ‘curly’ quotes. Can you remove them using sed?
Hint: These quote marks are not among the 128 characters of the ASCII standard, so in the file they are encoded using a different standard, UTF-8. While this is no problem for sed, the window you are typing into may not understand UTF-8. If so, you will need to use a Bash script; we encountered these at the end of episode 4, ‘Automating the tedious with loops’. As a reminder, use the text editor of your choice to write a file that looks like this:
#!/bin/bash
# This script removes quote marks from gulliver-clean.txt and saves the result as gulliver-noquotes.txt
(replace this line with your solution)
Save the file as remove-quotes.sh and run it from the command line like this:
bash remove-quotes.sh
Solution
#!/bin/bash
# This script removes quote marks from gulliver-clean.txt and saves the result as gulliver-noquotes.txt
sed -Ee 's/[“”‘’]//g' gulliver-clean.txt > gulliver-noquotes.txt
If this doesn’t work for you, you might need to check whether your text editor can save files using the UTF-8 encoding.
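As a quick, optional check that the script worked - assuming your terminal can handle UTF-8 input, as discussed in the hint above - you can count the lines that still contain a curly quote; the result for the new file should be 0.
$ grep -c '[“”‘’]' gulliver-noquotes.txt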
We have now taken the text apart and produced a count for each word in it. This is data we can prod and poke and visualize, data that can form the basis of our investigations, and data that we can compare with other texts processed in the same way. And if we need to run a different set of transformations for a different analysis, we can return to gulliver-clean.txt to start that work.
And all this using a few commands on an otherwise unassuming but very powerful command line.
Option 2: Optical character recognized text
Grabbing a text, cleaning it up
We’re going to work with 201403160_01_text.json
.
Let’s look at the file.
$ less -N 201403160_01_text.json
1 [[1, ""], [2, ""], [3, ""], [4, ""], [5, ""], [6, ""], [7, "A GENERAL RE
1 PORT ON THE PHYSIOGRAPHY OF MARYLAND A DISSERTATION PRESENTED TO THE PRE
1 SIDENT AND FACULTY OF THE JOHNS HOPKINS UNIVERSITY FOR THE DEGREE OF DOC
1 TOR OF PHILOSOPHY BY CLEVELAND ABBE, Jr. BALTIMORE, MD. MAY, 1898."], [8
1 , ""], [9, ""], [10, "A MAP S H OW I N G THE PHYSIOGRAPHIC PROVINCES OF
1 MARYLAND AND Their Subdivisions Scale 1 : 2,000.000. 32 Miles-1 Inch"],
1 [11, "A GENERAL REPORT ON THE PHYSIOGRAPHY OF MARYLAND A DISSERTATION PR
1 ESENTED TO THE PRESIDENT AND FACULTY OF THE JOHNS HOPKINS UNIVERSITY FOR
1 THE DEGREE OF DOCTOR OF PHILOSOPHY BY CLEVELAND ABBE, Jr. BALTIMORE, MD
1 . MAY, 1898."], [12, "PRINTED BY tL%t jfricbcnrtxifti Compang BALTIMORE,
1 MD., U. S. A. REPRINTED FROM Report of Maryland State Weather Service,
1 Vol. 1, 1899, pp. 41-216."], [13, "A GENERAL REPORT ON THE PHYSIOGRAPHY
1 OF MARYLAND Physiographic Processes. INTRODUCTION. From the earliest tim
1 es men have observed more or less closely the various phenomena which na
1 ture presents, and have sought to find an explanation for them. Among th
1 e most interesting of these phe nomena have been those which bear on the
1 development of the sur face features of the earth or its topography. Im
1 pressed by the size and grandeur of the mountains, their jagged crests a
1 nd scarred sides, early students of geographical features were prone to
1 ascribe their origin to great convulsions of the earth's crust, earthqua
1 kes and vol canic eruptions. One generation after another comes and goes
1 , yet the mountains continue to rear their heads to the same heights, th
1 e rivers to run down the mountain sides in the same courses and follow t
We’re going to start by using the tr command, used for translating or deleting characters. Type and run:
$ tr -d '[:punct:]' < 201403160_01_text.json > 201403160_01_text-nopunct.txt
This uses the translate command and a special syntax to remove all punctuation ([:punct:]). It also requires the use of both the output redirect > we have seen and the input redirect <, which we haven’t seen before: it feeds the contents of the file into the command.
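As an optional check (not part of the original recipe), wc will show that the character count drops once the punctuation is deleted, while the line count stays the same, since tr only removed punctuation characters.
$ wc 201403160_01_text.json 201403160_01_text-nopunct.txt
wc prints the line, word and character counts for each file, followed by a combined total.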
Finally, regularize the text by converting all the uppercase lettering to lowercase.
$ tr '[:upper:]' '[:lower:]' < 201403160_01_text-nopunct.txt > 201403160_01_text-clean.txt
Open 201403160_01_text-clean.txt in a text editor. Note how the text has been transformed, ready for analysis.
Pulling a text apart, counting word frequencies
We are now ready to pull the text apart.
$ tr ' ' '\n' < 201403160_01_text-clean.txt | sort | uniq -c | sort -nr > 201403160_01_text-final.txt
Here we’ve made extended use of the pipes we saw in Counting and mining with the shell. The first part of this script uses the translate command again, this time to translate every blank space into \n, which renders as a new line. At this stage every word in the file has its own line.
The second part uses the sort command to rearrange the text from its original order into an alphabetical configuration.
The third part uses uniq, another new command, in combination with the -c flag to collapse duplicate lines and prefix each remaining line with a count of how many times it appeared.
The fourth and final part sorts the text again, this time numerically and in reverse order (-nr), so the words with the highest counts from step three appear first.
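Optionally, wc -l on the finished file tells you how many distinct ‘words’ the pipeline found, which is useful context for the note below.
$ wc -l 201403160_01_text-final.txt
Each line of the final file is one distinct word with its count, so the line count is the size of the vocabulary the OCR produced.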
Note: your final output will have one problem - not all the words counted are real words (see the words counted only 1 or 2 times). To better understand what has happened, search online to find out more about Optical Character Recognition (OCR) of texts.
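If the OCR noise gets in the way, one optional approach (not part of the original recipe) is to drop the rarest ‘words’, on the assumption that most of them are recognition errors. awk can filter the counted output by the value in its first column; the output filename here is just a suggestion.
$ awk '$1 > 2' 201403160_01_text-final.txt > 201403160_01_text-filtered.txt
This keeps only the lines whose count (the first field) is greater than 2, at the cost of also discarding genuinely rare words.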
Either way, we have now taken the text apart and produced a count for each word in it. This is data we can prod and poke and visualize, data that can form the basis of our investigations, and data that we can compare with other texts processed in the same way. And if we need to run a different set of transformations for a different analysis, we can return to 201403160_01_text-clean.txt to start that work.
And all this using a few commands on an otherwise unassuming but very powerful command line.
Option 3: A webpage
Grabbing a text, cleaning it up
We’re going to work with diary.html
.
Let’s look at the file.
$ less -N diary.html
1 <!-- This document was created with HomeSite v2.5 -->
2 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
3 <html>
4
5 <head>
6
7
8 <script type="text/javascript" src="/static/js/analytics.js"></script>
9 <script type="text/javascript">archive_analytics.values.server_name="www
9 b-app6.us.archive.org";archive_analytics.values.server_ms=105;</script>
10 <link type="text/css" rel="stylesheet" href="/static/css/banner-styles.c
10 ss"/>
11
12
13 <title>Piper's Diary</title>
14 <meta name="description"
15 content="Come visit our shih tzu, Piper. He has his very own photo gall
15 ary, monthly diary, newsletter, and dog site award. He also maintains d
15 og book reviews and quotations. Come check him out!">
16 <meta name="keywords"
17 content="shih tzu, dog, pet, quotations, award, diary, advice, book, rev
17 iew, piper">
18 <style TYPE="text/css" TITLE="Basic Fonts">
We’re going to start by using the sed command, a ‘stream editor’ that lets us transform text as it passes through.
$ sed '265,330d' diary.html > diary-nofoot.txt
The sed command, in combination with the d (delete) action, will look at diary.html and delete all lines in the range of rows specified - here, the footer material at the end of the page. The > redirect then sends this edited text to the new file specified.
$ sed '1,221d' diary-nofoot.txt > diary-noheadfoot.txt
This does the same as before, but for the header.
You now have a cleaner text. The next step is to prepare it even further for rigorous analysis.
First we will remove all the HTML tags. Type and run:
$ sed -e 's/<[^>]*>//g' diary-noheadfoot.txt > diary-notags.txt
Here we are using a regular expression (see the Library Carpentry regular expressions lesson) to find all HTML tags (anything within angle brackets) and delete them. This is a complex regular expression, so don’t worry too much about how it works! The script also uses the output redirect > we have seen before to send the result to a new file.
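If you want to see what the substitution does on a small scale before trusting it on the whole file, you can feed sed a single line of your own. This is an optional experiment, not part of the original recipe, and the sample sentence is made up.
$ echo '<p>Piper is a <b>shih tzu</b></p>' | sed -e 's/<[^>]*>//g'
Piper is a shih tzu
Everything between the angle brackets is removed, leaving only the text content.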
Next we’re going to use the tr command, used for translating or deleting characters. Type and run:
$ tr -d '[:punct:]\r' < diary-notags.txt > diary-notagspunct.txt
This uses the translate command and a special syntax to remove all punctuation ([:punct:]) and carriage returns (\r).
Finally, regularize the text by converting all the uppercase lettering to lowercase.
$ tr '[:upper:]' '[:lower:]' < diary-notagspunct.txt > diary-clean.txt
Open diary-clean.txt in a text editor. Note how the text has been transformed, ready for analysis.
Pulling a text apart, counting word frequencies
We are now ready to pull the text apart.
$ tr ' ' '\n' < diary-clean.txt | sort | uniq -c | sort -nr > diary-final.txt
Here we’ve made extended use of the pipes we saw in Counting and mining with the shell. The first part of this script uses the translate command again, this time to translate every blank space into \n, which renders as a new line. At this stage every word in the file has its own line.
The second part uses the sort command to rearrange the text from its original order into an alphabetical configuration.
The third part uses uniq, another new command, in combination with the -c flag to collapse duplicate lines and prefix each remaining line with a count of how many times it appeared.
The fourth and final part sorts the text again, this time numerically and in reverse order (-nr), so the words with the highest counts from step three appear first.
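As an optional follow-up, grep can pull out the count for a single word of interest from the finished list; ‘piper’ is used here purely as an example term.
$ grep -w 'piper' diary-final.txt
The -w flag matches whole words only, so ‘piper’ will not also match lines for words that merely contain it.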
We have now taken the text apart and produced a count for each word in it. This is data we can prod and poke and visualize, data that can form the basis of our investigations, and data that we can compare with other texts processed in the same way. And if we need to run a different set of transformations for a different analysis, we can return to diary-clean.txt to start that work.
And all this using a few commands on an otherwise unassuming but very powerful command line.
NYU Libraries Option
In addition to these, I have included a sample dataset with some tasks. You can download a CSV of usage data from Ebook Central (NYU only) at this link.
The goal is to answer any of the following questions and perform any of the following tasks without using Microsoft Excel; a rough sketch of one possible approach follows the list.
- Imagine that you want to create a few subsets of this data to give to relevant subject specialists. What is your approach? What commands would you use?
- Imagine that you need to figure out how big this file is, and that you need to generate a list of titles that have been checked out but don’t need the call numbers. How do you make a new file?
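One possible sketch, using only commands from this lesson plus cut for selecting columns. The filename ebook-central-usage.csv, the subject term ‘history’, and the column position are all assumptions for illustration; check the real header with head first and adjust accordingly.
$ head -n 1 ebook-central-usage.csv
$ wc -l ebook-central-usage.csv
$ grep -i 'history' ebook-central-usage.csv > history-subset.csv
$ cut -d ',' -f 2 ebook-central-usage.csv > titles-only.csv
Here wc -l reports how many rows the file has, grep pulls rows mentioning a subject term into a subset file for a specialist, and cut keeps only one comma-separated column (assumed here to be column 2, the title) while dropping the rest, including any call number column. Note that cut on its own does not cope with commas inside quoted fields, so treat this as a starting point rather than a finished answer.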
Where to go next
Deborah S. Ray and Eric J. Ray, Unix and Linux: visual quickstart guide, 4th edition (2009). Invaluable (and not expensive) as a reference guide - especially if you only use the command line sporadically!
The Command Line Crash Course ‘Learn UNIX the Hard Way’ – good for reminders of the basics.
Another Coursera course, Programming for Everybody (Python) is available and lasts 10 weeks, if you have 2-4 hours to spare per week. Python is popular in research programming as it is readable, relatively simple, and very powerful.
Bill Turkel and the Digital History community more broadly. The second lesson you did today was based on a lesson Bill has on his website, and Bill is also a general editor of the Programming Historian. The Programming Historian is an open, collaborative book aimed at providing programming lessons to historians. Although the lessons are hooked around problems historians have, they cover various programming languages and have a wide applicability; indeed, today’s course is based on two lessons I wrote for the Programming Historian with Ian Milligan, a historian at Waterloo, Canada. Bill also has a wonderful tutorial on ‘Named Entity Recognition with Command Line Tools in Linux’, which I thoroughly recommend if you want to automatically find, mark up, and count names, places, and organisations in text files…
Conclusion
In this session you have learned to navigate the Unix shell, to undertake some basic file counting, concatenation and deletion, to query across data for common strings, to save results and derived data, and to prepare textual data for rigorous computational analysis.
This only scratches the surface of what the Unix environment is capable of. It is hoped, however, that this session has provided a taster sufficient to prompt further investigation and productive play.
Keep in mind that the full potential these tools can offer may only emerge when you embed these skills into real projects. Nonetheless, being able to manipulate, count and mine thousands of files is extremely useful. Even a large collection of files which do not contain any alpha-numeric data elements, such as image files, can be easily sorted, selected and queried depending on the amount of description, or metadata, contained in each filename. If not a prerequisite of working with the Unix shell, then taking the time to structure your data in a consistent and predictable manner is certainly a significant step towards getting the most out of Unix commands. And if you can find a way of using the Unix shell regularly - perhaps only to copy or amend files - you’ll keep the basics fresh, meaning that next time you have cause to use the Unix shell for more complex commands, you shouldn’t need to learn it all over again.
References
James Baker and Ian Milligan, ‘Counting and mining research data with Unix’, The Programming Historian (2014)
Ian Milligan and James Baker, ‘Introduction to the Bash Command Line’, The Programming Historian (2014)
William J. Turkel, ‘Named Entity Recognition with Command Line Tools in Linux’ (30 June 2013). The section ‘NER Demo’ is adapted from this and shared under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported.
William J. Turkel, ‘Basic Text Analysis with Command Line Tools in Linux’ (15 June 2013). The sections ‘Grabbing a text, cleaning it up’ and ‘Pulling a text apart, counting word frequencies’ are adapted from this and shared under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported.
Key Points
Shell tools can be combined to powerful effect