Nicholas Wolf and Vicky Steeves, NYU Data Services
Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Overview
This session is an intermediate-level class that will examine ways to perform data cleaning, transformation, and management using Python. We will look at some helpful ways to load data and parse it into a container for ease of use in Python, to store it in helpful formats, and to perform some basic cleaning and transformations typical for mixed string-and-numeric formats. Finally, we'll try putting it all together using a dataset from the NYC Open Data portal.
Setup
Let's think about helpful project organizational structures from the start. We can borrow a helpful folder structure used by Ben Marwick in his rrtools R package. There are iterations of this idea in a lot of places, but the basic idea is to separate your starting data from your transformed/cleaned data, and all of this input data from your subsequent analysis, documentation, and outputs. We'll modify his tree directory slightly:
analysis/
├── paper/            # Working paper for publication
│   └── figures/      # Location of the figures produced by your analysis
├── data/
│   ├── data_scripts/ # Scripts used to transform and clean raw data
│   ├── raw_data/     # Data obtained by you or from elsewhere
│   └── cleaned_data/ # Migrated format versions of raw data, with syntax corrections and integrity checks
└── docs/             # Documentation files for methods, parameters, data dictionaries, etc.
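If you prefer to build this folder skeleton programmatically, here is a minimal sketch (not part of the session files) using the standard library; the folder names are taken directly from the tree above, and you would run it from the project root.
import os
# Create the project skeleton shown above, starting from the project root
for folder in ['analysis/paper/figures',
               'analysis/data/data_scripts',
               'analysis/data/raw_data',
               'analysis/data/cleaned_data',
               'analysis/docs']:
    os.makedirs(folder, exist_ok=True)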
Download the zipped folder package available at goo.gl/SJUrcJ and place it on your desktop. Uncompress the files. We'll be working with the Jupyter Notebook located at /analysis/data/data_scripts/python-cleaning-session-full.ipynb. Alternatively, you can clone the course materials using
git clone https://github.com/NYU-DataServices/python-cleaning.git
You can also find this notebook (in non-executable form) at https://nyu-dataservices.github.io/python-cleaning/
You are welcome to deploy this Jupyter Notebook in the Jupyter environment of your choice (departmental hub, local instance). You can also use it on our JupyterHub instance here: https://tutorials-1.rcnyu.org
1. CSV and Dataframes
Not surprisingly, a very common and helpful way to store data is in CSV format. Most applications have some way of working with delimited files, and while not 100% standardized, CSVs provide a stable means of sharing and storing data of all types.
Python offers a CSV module, but it can be a wordy way to input and output data into your workflow. Since analysis is also often eased by using a more complex data container such as a dataframe, let's rely on the Pandas module to do the heavy lifting of importing (and later exporting) CSVs.
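For comparison, here is a minimal sketch of what the same load looks like with the built-in csv module (assuming the same file and folder used in the pandas example below); even a simple read requires you to manage the file handle, the header row, and the row container yourself.
import csv
import os

directory = r'../raw_data'   # same folder used in the pandas example below

with open(os.path.join(directory, 'water-consumption-nyc.csv'), encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)                 # the header row comes back as a plain list of strings
    data_rows = [row for row in reader]   # and so does every data row

print(header)
print(data_rows[0])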
Here, we load up a CSV storing NYC water consumption data from the NYC Open Data portal.
import pandas as pd
import os
## Set a path to our /raw_data directory
## Mac
directory = r'../raw_data'
## Windows
#directory = r'..\raw_data'
# Load the CSV into a dataframe (the directory variable above handles the platform-specific path)
df = pd.read_csv(os.path.join(directory,
'water-consumption-nyc.csv'), header=0,
sep=',', parse_dates=False, encoding='utf-8')
df.head(10)
2. JSON
Python's json module is compact and easy to use. Once loaded, the JSON object is parsed into an ordinary Python dictionary. Let's load the same data on water consumption stored as a JSON file. You can preview the NYC JSON structure here: https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.json?accessType=DOWNLOAD
import json
with open(os.path.join(directory,
          'water-consumption-nyc.json'), encoding='utf-8') as jsonfile:
    water_dictionary = json.loads(jsonfile.read())
## We can take a look at the data portion of the file
## (rather than the metadata stored in the JSON)
for row in water_dictionary['data'][0:12]:
print(row)
Let's turn the list of lists in which NYC has stored the data into a dataframe. Because the data has been stored as a list of lists, this is a one-step process:
water_df = pd.DataFrame(water_dictionary['data'])
water_df.head(10)
We've lost our column headers! NYC has stored them in a separate location in the JSON. How can we recover them? Hint: take a look at the full JSON here: https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.json?accessType=DOWNLOAD
# Challenge 1
water_df = pd.DataFrame(water_dictionary['data'],
columns = ['sid','id','position',
'created_at','created_meta',
'updated_at','updated_meta',
'meta','year','New York City Population',
'NYC Consumption(Millions galls per day)',
'Per Capita(Gallons per person per day)'])
# Alternative, more automated way:
cols = [col_name['name'] for col_name in water_dictionary['meta']['view']['columns']]
water_df = pd.DataFrame(water_dictionary['data'], columns=cols)
water_df.head(10)
3. SQLite
Once you start working with data distributed across multiple tables, or working with data that is starting to exceed easy usability in a format like CSV, consider pairing Python with a simple SQLite database to push and pull data. The maintainers of SQLite note that it will operate on large data files (see https://www.sqlite.org/limits.html), up to 140 TB of database file size--larger than most systems can manipulate in memory, at any rate. So for most purposes, you can nicely store your data in such a database and access it when needed.
import sqlite3
# Create a connection object to our sqlite3 database file
conn = sqlite3.connect(os.path.join(directory,
'water-consumption.db'))
c = conn.cursor()
# Select data from the water table
rows = c.execute("SELECT * FROM water")
# The response is an iterable of tuples, each tuple containing one row's cell values
for row in rows:
print(row)
# We can modify our data
newrow = ('2017', '85312743', '1003', '158')
c.execute("INSERT INTO water VALUES (?,?,?,?)", newrow)
# Commit the changes
conn.commit()
# View the new row
rows = c.execute("SELECT * FROM water")
for row in rows:
print(row)
# Or...we can once again just read our SQLite table
# directly into a dataframe:
water_sql_df = pd.read_sql_query("SELECT * FROM water", conn)
# Close the connection
conn.close()
water_sql_df.head(10)
4. Numpy Array
The numpy module provides a quick way to save complex array-like objects (lists, lists of lists, dictionaries, lists of dictionaries, dictionaries of lists) to a binary .npy file. This can be a quick and easy way (among many others, including saving such objects in a .txt or .py file) to save and return to a data container without having to reconstruct it every time you resume your workflow. See also below for how to save the .npy file in the first place.
import numpy as np
# Loading an array from the .npy file
water_npy = np.load(os.path.join(directory,
'water_consumption.npy'), allow_pickle=True)
print(water_npy)
We've just discussed how to load and parse stored data. Let's look at the opposite side of the coin -- storing data from Python into a saved file.
1. JSON
Let's say we've transformed a data file by adding some new data, or cleaning it. We now have a cleaned data set that we can queue up for analysis.
# We have a cleaned data file, here represented by a small table
cleaned_table = {'data':[['1979', '7102100', '1512', '213'],
['2001', '8024964', '1184', '148'],
['2009', '8223444', '1204', '178']]}
with open('../cleaned_data/new_json.json',
'w', encoding='utf-8') as outfile:
json.dump(cleaned_table, outfile)
2. Numpy
np.save('../cleaned_data/new_numpy.npy', cleaned_table)
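Note that because cleaned_table is a dictionary rather than a plain numeric array, numpy pickles it when saving; reading it back therefore requires allow_pickle=True, just as in the np.load() call shown earlier.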
3. SQLite
This operation is straightforward given the concepts outlined above, with the added step that we need to create our database if it does not exist, as well as our table within the database.
# We first establish a new database and connect to it
conn_out = sqlite3.connect('../cleaned_data/cleaned_sql.db')
cursor_out = conn_out.cursor()
# Then we create a new table
cursor_out.execute('''CREATE TABLE IF NOT EXISTS newsql
(year text not null unique, nyc_population text, consumption text, per_capita text)
''')
for insert_row in cleaned_table['data']:
cursor_out.execute("INSERT INTO newsql VALUES (?,?,?,?)", insert_row)
conn_out.commit()
conn_out.close()
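As an aside, the row-by-row loop can be collapsed into a single executemany() call. Here is a small equivalent sketch (not part of the session files); it writes to a separate newsql_alt table purely for illustration, so it does not collide with the unique year constraint on the table created above.
import sqlite3

# Same write step as above, but inserting every row in one executemany() call
conn_alt = sqlite3.connect('../cleaned_data/cleaned_sql.db')
cursor_alt = conn_alt.cursor()
cursor_alt.execute('''CREATE TABLE IF NOT EXISTS newsql_alt
        (year text not null unique, nyc_population text, consumption text, per_capita text)
        ''')
cursor_alt.executemany("INSERT INTO newsql_alt VALUES (?,?,?,?)", cleaned_table['data'])
conn_alt.commit()
conn_alt.close()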
4. CSV with Pandas
Pandas also has a very quick way to write a dataframe out to a CSV.
cleaned_df = pd.DataFrame(cleaned_table['data'], columns = ['year','nyc_population','consumption', 'per_capita'])
cleaned_df.to_csv('../cleaned_data/new_df.csv', sep=',', encoding='utf-8')
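Note that to_csv() also writes the dataframe's integer index as an unnamed first column by default; pass index=False if you want only the four data columns in the output file.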
Finally, we can perform syntax cleaning operations in Python just as easily as in a program like Excel. In fact, if you are working with very large data (Excel tops out at 1,048,576 rows), you will simply not be able to use Excel or other spreadsheet programs to manipulate the data and must depend on other methods.
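One such method is pandas' chunked reading, which processes a large CSV a slice at a time instead of loading it all into memory. A brief illustrative sketch (the file name here is hypothetical, not part of the session materials):
# Hypothetical example: count the rows of a CSV too large to open in a spreadsheet,
# processing it 100,000 rows at a time. Replace the counting with real cleaning logic.
row_count = 0
for chunk in pd.read_csv('very_large_file.csv', chunksize=100000):
    row_count += len(chunk)
print(row_count)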
Here, let's take a look at a very simple data set (2 columns, 66 rows) consisting of the Significant Noncompliance List published by NYC Open Data: https://data.cityofnewyork.us/Environment/Significant-Noncompliance-List/xnje-s6zf
If you preview the data, you'll notice that a date span has been combined into a single row cell, that there are some inconsistencies in how months are listed, and that it implements a kind of "footnote" method, with starred/daggered establishment names referring to explanatory notes.
How would we perform the following data cleaning operations:
1) Split the "period" column into two columns, a start_date and an end_date column
2) Standardize all spellings of months
3) Create a new column conveying the information in the "footnote" marks
4) Remove commas, periods, etc. from the establishment names
5) Ensure there is no whitespace or newlines on the ends of entries
6) Output the cleaned data as a CSV in the /cleaned_data folder
Try this challenge using whatever data container you wish. You may want to consider a very simple one: we can turn the CSV into a list of lists by reading in the file, splitting each row on the ", combination (a closing double quote followed by a comma), and working with each value separately.
with open(os.path.join(directory, 'nyc-noncompliancelist.csv')) as dfile:
rows = dfile.readlines()
for row in rows[1:]: # Taking the rows as a list, but skipping the first header row
print(row.split('",')) # Splitting on the delimiter which we notice the escaped " before it.
# Challenge 1
water_df = pd.DataFrame(water_dictionary['data'], columns = ['sid','id','position',
'created_at','created_meta',
'updated_at','updated_meta',
'meta','year','New York City Population',
'NYC Consumption(Millions galls per day)',
'Per Capita(Gallons per person per day)'])
# Alternative, less compact approach
cols = []
for col_name in water_dictionary['meta']['view']['columns']:
cols.append(col_name['name'])
water_df = pd.DataFrame(water_dictionary['data'], columns=cols)
display(water_df)
# Challenge 2
with open(os.path.join(directory, 'nyc-noncompliancelist.csv')) as dfile:
rows = dfile.readlines()
new_data = []
for row in rows[1:]: # Taking the rows as a list, but skipping the first header row
rlist = row.split('",')
dates = rlist[0].replace('"','').split('-')
    start_date = dates[0].replace('Jul ', 'July ').strip()
end_date = dates[1].strip()
if rlist[1][0] == '*':
note = 'LateReport'
elif rlist[1][0] == '†':
note = 'NS-IndustrialUsers'
else:
note = 'NA'
est_name = rlist[1].replace('"','').replace('*','').replace('†','').replace('.','').replace(',','').strip('\n').strip()
new_data.append([start_date, end_date, note, est_name])
for r in new_data:
print(r)
Time permitting, let's put our full list of skills together and see if we can extract a simplified dataset from this Brooklyn Public Library listing of library branches and hours. Using the JSON formatted file available at https://www.bklynlibrary.org/locations/json, extract the name of each branch, its address, and a third column listing the total number of hours per week that the branch is open.
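If you want a starting point, here is a hedged sketch of the first step only -- fetching the feed and inspecting its structure. The key names inside the JSON (and therefore the extraction logic for name, address, and hours) are left for you to discover, since they depend on how the library structures its feed.
import json
from urllib.request import urlopen

# Fetch the Brooklyn Public Library locations feed and look at its top-level structure
with urlopen('https://www.bklynlibrary.org/locations/json') as response:
    bpl = json.loads(response.read().decode('utf-8'))

# Inspect before deciding how to pull out branch name, address, and weekly hours
print(type(bpl))
print(bpl.keys() if isinstance(bpl, dict) else len(bpl))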