Data Cleaning and Management Using Python

Nicholas Wolf and Vicky Steeves, NYU Data Services

Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Overview

This session is an intermediate-level class that will examine ways to perform data cleaning, transformation, and management using Python. We will look at ways to load data and parse it into a container for ease of use in Python, to store it in useful formats, and to perform some basic cleaning and transformations typical of mixed string-and-numeric data. Finally, we'll try putting it all together using a dataset from the NYC Open Data portal.

Setup

Let's think about project organization from the start. We can borrow a folder structure used by Ben Marwick in his rrtools R package. There are iterations of this idea in a lot of places, but the basic idea is to separate your starting data from your transformed/cleaned data, and all of this input data from your subsequent analysis, documentation, and outputs. We'll modify his directory tree slightly:

analysis/
|
├── paper/              # working paper for publication
|
├── figures/            # figures produced by your analysis
|
├── data/
|   ├── data_scripts/   # scripts used to transform and clean raw data
|   ├── raw_data/       # data obtained by you or from elsewhere
|   └── cleaned_data/   # migrated format versions of raw data, with syntax corrections and integrity checks
|
└── docs/               # documentation files for methods, parameters, data dictionaries, etc.

Download the zipped folder package available at goo.gl/SJUrcJ and place it on your desktop. Uncompress the files. We'll be working with the Jupyter Notebook located at /analysis/data/data_scripts/python-cleaning-session-full.ipynb. Alternatively, you can clone the course materials using

git clone https://github.com/NYU-DataServices/python-cleaning.git

You can also find this notebook (in non-executable form) at https://nyu-dataservices.github.io/python-cleaning/

You are welcome to deploy this Jupyter Notebook in the Jupyter environment of your choice (departmental hub, local instance). You can also use it on our JupyterHub instance here: https://tutorials-1.rcnyu.org

Loading and Parsing

1. CSV and Dataframes

Not surprisingly, a very common and helpful way to store data is in CSV format. Most applications have some way of working with delimited files, and while not 100% standardized, CSVs provide a stable means of sharing and storing data of all types.

Python offers a built-in csv module, but it can be verbose for getting data into and out of your workflow. Since analysis is also often eased by using a more complex data container such as a dataframe, let's rely on the Pandas module to do the heavy lifting of importing (and later exporting) CSVs.
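For a sense of the difference, here is a minimal sketch of reading the same file with the built-in csv module (the relative path matches the directory set up below): we get back plain lists of strings and must handle the header row and any type conversion ourselves.

import csv

with open('../raw_data/water-consumption-nyc.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)            # first row holds the column names
    rows = [row for row in reader]   # every cell comes back as a string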

Here, we load up a CSV storing NYC water consumption data from the NYC Open Data portal.

In [1]:
import pandas as pd
import os

## Set a working directory pointing at our raw_data folder
## Mac / Linux
directory = r'../raw_data'
## Windows
#directory = r'..\raw_data'

# Load a CSV into a dataframe; os.path.join builds the path with the correct separator for your OS
df = pd.read_csv(os.path.join(directory,
            'water-consumption-nyc.csv'), header=0, 
                 sep=',', parse_dates=False, encoding='utf-8')

df.head(10)
Out[1]:
Year New York City Population NYC Consumption(Million gallons per day) Per Capita(Gallons per person per day)
0 1979 7102100 1512 213
1 1980 7071639 1506 213
2 1981 7089241 1309 185
3 1982 7109105 1382 194
4 1983 7181224 1424 198
5 1984 7234514 1465 203
6 1985 7274054 1326 182
7 1986 7319246 1351 185
8 1987 7342476 1447 197
9 1988 7353719 1484 202

2. JSON

Python's json module is compact and easy to use. Once loaded, the JSON object is treated as a full-fledged Python dictionary. Let's load the same data on water consumption stored as a JSON file. You can preview the NYC JSON structure here: https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.json?accessType=DOWNLOAD

In [2]:
import json

# Use a context manager so the file is closed after reading
with open(os.path.join(directory,
        'water-consumption-nyc.json'), encoding='utf-8') as jfile:
    water_dictionary = json.load(jfile)

## We can take a look at the data portion of the file 
## (rather than the metadata stored in the JSON)

for row in water_dictionary['data'][0:12]:
    print(row)
[1, '4BE85703-1B5D-49E2-A418-75BEDD478BB5', 1, 1495480711, '994730', 1495480711, '994730', None, '1979', '7102100', '1512', '213']
[2, '971B06B7-1135-4887-B409-DAC19D845EBC', 2, 1495480711, '994730', 1495480711, '994730', None, '1980', '7071639', '1506', '213']
[3, '71AEF5AD-AA70-449C-899A-2514E97F893D', 3, 1495480711, '994730', 1495480711, '994730', None, '1981', '7089241', '1309', '185']
[4, '7914DA35-A410-44C2-A0D1-B11385B6B3F4', 4, 1495480711, '994730', 1495480711, '994730', None, '1982', '7109105', '1382', '194']
[5, 'A1347CE5-F309-45C0-B372-E797A32FB0E5', 5, 1495480711, '994730', 1495480711, '994730', None, '1983', '7181224', '1424', '198']
[6, 'DA07F92C-B10E-4D5B-A479-17522E0C5C8C', 6, 1495480711, '994730', 1495480711, '994730', None, '1984', '7234514', '1465', '203']
[7, '45CD9C9C-172C-4F0B-BC72-319E9B2EA6B0', 7, 1495480711, '994730', 1495480711, '994730', None, '1985', '7274054', '1326', '182']
[8, 'B76329F9-740B-41BC-900B-62AA4790DAD6', 8, 1495480711, '994730', 1495480711, '994730', None, '1986', '7319246', '1351', '185']
[9, '4B66D71F-642D-4770-9123-BAB33E1E8205', 9, 1495480711, '994730', 1495480711, '994730', None, '1987', '7342476', '1447', '197']
[10, '6EB7E8EA-D4D9-4DDF-8469-ADFC8FECB06F', 10, 1495480711, '994730', 1495480711, '994730', None, '1988', '7353719', '1484', '202']
[11, 'C2FAE03C-30CF-4861-9606-B96BB75C072A', 11, 1495480711, '994730', 1495480711, '994730', None, '1989', '7344175', '1402', '191']
[12, 'D5507E3A-F453-4FAA-847A-AE76CDB2558C', 12, 1495480711, '994730', 1495480711, '994730', None, '1990', '7335650', '1424', '194']

Let's turn the list of lists in which NYC has stored the data into a dataframe. Because the data has been stored as a list of lists, this is a one-step process:

In [3]:
water_df = pd.DataFrame(water_dictionary['data'])

water_df.head(10)
Out[3]:
0 1 2 3 4 5 6 7 8 9 10 11
0 1 4BE85703-1B5D-49E2-A418-75BEDD478BB5 1 1495480711 994730 1495480711 994730 None 1979 7102100 1512 213
1 2 971B06B7-1135-4887-B409-DAC19D845EBC 2 1495480711 994730 1495480711 994730 None 1980 7071639 1506 213
2 3 71AEF5AD-AA70-449C-899A-2514E97F893D 3 1495480711 994730 1495480711 994730 None 1981 7089241 1309 185
3 4 7914DA35-A410-44C2-A0D1-B11385B6B3F4 4 1495480711 994730 1495480711 994730 None 1982 7109105 1382 194
4 5 A1347CE5-F309-45C0-B372-E797A32FB0E5 5 1495480711 994730 1495480711 994730 None 1983 7181224 1424 198
5 6 DA07F92C-B10E-4D5B-A479-17522E0C5C8C 6 1495480711 994730 1495480711 994730 None 1984 7234514 1465 203
6 7 45CD9C9C-172C-4F0B-BC72-319E9B2EA6B0 7 1495480711 994730 1495480711 994730 None 1985 7274054 1326 182
7 8 B76329F9-740B-41BC-900B-62AA4790DAD6 8 1495480711 994730 1495480711 994730 None 1986 7319246 1351 185
8 9 4B66D71F-642D-4770-9123-BAB33E1E8205 9 1495480711 994730 1495480711 994730 None 1987 7342476 1447 197
9 10 6EB7E8EA-D4D9-4DDF-8469-ADFC8FECB06F 10 1495480711 994730 1495480711 994730 None 1988 7353719 1484 202

Challenge

We've lost our column headers! NYC has stored them in a separate location in the JSON. How can we recover them? Hint: take a look at the full JSON here: https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.json?accessType=DOWNLOAD

In [4]:
# Challenge 1

water_df = pd.DataFrame(water_dictionary['data'], 
        columns = ['sid','id','position',
                'created_at','created_meta',
                'updated_at','updated_meta',
                'meta','year','New York City Population',
                'NYC Consumption(Million gallons per day)',
                'Per Capita(Gallons per person per day)'])

# Alternative, more automated way:

cols = [col_name['name'] for col_name in water_dictionary['meta']['view']['columns']]

water_df = pd.DataFrame(water_dictionary['data'], columns=cols)

water_df.head(10)
Out[4]:
sid id position created_at created_meta updated_at updated_meta meta Year New York City Population NYC Consumption(Million gallons per day) Per Capita(Gallons per person per day)
0 1 4BE85703-1B5D-49E2-A418-75BEDD478BB5 1 1495480711 994730 1495480711 994730 None 1979 7102100 1512 213
1 2 971B06B7-1135-4887-B409-DAC19D845EBC 2 1495480711 994730 1495480711 994730 None 1980 7071639 1506 213
2 3 71AEF5AD-AA70-449C-899A-2514E97F893D 3 1495480711 994730 1495480711 994730 None 1981 7089241 1309 185
3 4 7914DA35-A410-44C2-A0D1-B11385B6B3F4 4 1495480711 994730 1495480711 994730 None 1982 7109105 1382 194
4 5 A1347CE5-F309-45C0-B372-E797A32FB0E5 5 1495480711 994730 1495480711 994730 None 1983 7181224 1424 198
5 6 DA07F92C-B10E-4D5B-A479-17522E0C5C8C 6 1495480711 994730 1495480711 994730 None 1984 7234514 1465 203
6 7 45CD9C9C-172C-4F0B-BC72-319E9B2EA6B0 7 1495480711 994730 1495480711 994730 None 1985 7274054 1326 182
7 8 B76329F9-740B-41BC-900B-62AA4790DAD6 8 1495480711 994730 1495480711 994730 None 1986 7319246 1351 185
8 9 4B66D71F-642D-4770-9123-BAB33E1E8205 9 1495480711 994730 1495480711 994730 None 1987 7342476 1447 197
9 10 6EB7E8EA-D4D9-4DDF-8469-ADFC8FECB06F 10 1495480711 994730 1495480711 994730 None 1988 7353719 1484 202

3. SQLite

Once you start working with data distributed across multiple tables, or with data that is starting to exceed easy usability in a format like CSV, consider pairing Python with a simple SQLite database to push and pull data. The maintainers of SQLite claim that it will operate on large data files (see https://www.sqlite.org/limits.html), up to 140 TB per database file--larger than most systems can manipulate in memory, at any rate. So for most purposes, you can nicely store your data in such a database and access it when needed.

In [5]:
import sqlite3

# Create a connection object to our sqlite3 database file

conn = sqlite3.connect(os.path.join(directory,
                                    'water-consumption.db'))

c = conn.cursor()

# Select data from the water table

rows = c.execute("SELECT * FROM water")

# The response is an iterable of tuples, each tuple holding one row's cell values

for row in rows:
    print(row)
('1979', '7102100', '1512', '213')
('1980', '7071639', '1506', '213')
('1981', '7089241', '1309', '185')
('1982', '7109105', '1382', '194')
('1983', '7181224', '1424', '198')
('1984', '7234514', '1465', '203')
('1985', '7274054', '1326', '182')
('1986', '7319246', '1351', '185')
('1987', '7342476', '1447', '197')
('1988', '7353719', '1484', '202')
('1989', '7344175', '1402', '191')
('1990', '7335650', '1424', '194')
('1991', '7374501', '1469', '199')
('1992', '7428944', '1369', '184')
('1993', '7506166', '1369', '182')
('1994', '7570458', '1358', '179')
('1995', '7633040', '1326', '174')
('1996', '7697812', '1298', '169')
('1997', '7773443', '1206', '155')
('1998', '7858259', '1220', '155')
('1999', '7947660', '1237', '156')
('2000', '8008278', '1240', '155')
('2001', '8024964', '1184', '148')
('2002', '8041649', '1136', '141')
('2003', '8058335', '1094', '136')
('2004', '8075020', '1100', '136')
('2005', '8091706', '1138', '141')
('2006', '8108391', '1069', '132')
('2007', '8125077', '1114', '137')
('2008', '8141762', '1098', '135')
('2009', '8158448', '1007', '123')
('2010', '8175133', '1039', '127')
('2011', '8175133', '1021', '125')
('2012', '8336697', '1009', '121')
('2013', '8405837', '1006', '120')
('2014', '8491079', '996', '117')
('2015', '8550405', '1009', '118')
('2016', '8537673', '1002', '117')
In [6]:
# We can modify our data

newrow = ('2017', '85312743', '1003', '158')

c.execute("INSERT INTO water VALUES (?,?,?,?)", newrow)

# Commit the changes

conn.commit()

# View the new row

rows = c.execute("SELECT * FROM water")

for row in rows:
    print(row)
('1979', '7102100', '1512', '213')
('1980', '7071639', '1506', '213')
('1981', '7089241', '1309', '185')
('1982', '7109105', '1382', '194')
('1983', '7181224', '1424', '198')
('1984', '7234514', '1465', '203')
('1985', '7274054', '1326', '182')
('1986', '7319246', '1351', '185')
('1987', '7342476', '1447', '197')
('1988', '7353719', '1484', '202')
('1989', '7344175', '1402', '191')
('1990', '7335650', '1424', '194')
('1991', '7374501', '1469', '199')
('1992', '7428944', '1369', '184')
('1993', '7506166', '1369', '182')
('1994', '7570458', '1358', '179')
('1995', '7633040', '1326', '174')
('1996', '7697812', '1298', '169')
('1997', '7773443', '1206', '155')
('1998', '7858259', '1220', '155')
('1999', '7947660', '1237', '156')
('2000', '8008278', '1240', '155')
('2001', '8024964', '1184', '148')
('2002', '8041649', '1136', '141')
('2003', '8058335', '1094', '136')
('2004', '8075020', '1100', '136')
('2005', '8091706', '1138', '141')
('2006', '8108391', '1069', '132')
('2007', '8125077', '1114', '137')
('2008', '8141762', '1098', '135')
('2009', '8158448', '1007', '123')
('2010', '8175133', '1039', '127')
('2011', '8175133', '1021', '125')
('2012', '8336697', '1009', '121')
('2013', '8405837', '1006', '120')
('2014', '8491079', '996', '117')
('2015', '8550405', '1009', '118')
('2016', '8537673', '1002', '117')
('2017', '85312743', '1003', '158')
In [7]:
# Or...we can once again just read our SQLite table 
# directly into a dataframe:

water_sql_df = pd.read_sql_query("SELECT * FROM water", conn)

# Close the connection

conn.close()

water_sql_df.head(10)
Out[7]:
Year New York City Population NYC Consumption(Million gallons per day) Per Capita(Gallons per person per day)
0 1979 7102100 1512 213
1 1980 7071639 1506 213
2 1981 7089241 1309 185
3 1982 7109105 1382 194
4 1983 7181224 1424 198
5 1984 7234514 1465 203
6 1985 7274054 1326 182
7 1986 7319246 1351 185
8 1987 7342476 1447 197
9 1988 7353719 1484 202
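
Incidentally, pandas can go the other direction as well: DataFrame.to_sql writes a dataframe into a database table. A minimal sketch (the table name water_copy is our own choice, not part of the original database):

# Reopen the connection and write the dataframe to a new table
conn = sqlite3.connect(os.path.join(directory, 'water-consumption.db'))
water_sql_df.to_sql('water_copy', conn, if_exists='replace', index=False)
conn.close()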

4. NumPy Array

The NumPy module provides a quick way to save complex array-like objects (lists, lists of lists, dictionaries, lists of dictionaries, dictionaries of lists) to a binary .npy file. This is a quick and easy way (among many others, including saving such objects as .txt or .py files) to return to a data container without having to reconstruct it every time you resume a workflow. See the Storing section below for how to save the .npy file in the first place.

In [9]:
import numpy as np

# Loading an array from the .npy file

water_npy = np.load(os.path.join(directory,
                 'water_consumption.npy'), allow_pickle=True)

print(water_npy)
[[1 '4BE85703-1B5D-49E2-A418-75BEDD478BB5' 1 1495480711 '994730'
  1495480711 '994730' None '1979' '7102100' '1512' '213']
 [2 '971B06B7-1135-4887-B409-DAC19D845EBC' 2 1495480711 '994730'
  1495480711 '994730' None '1980' '7071639' '1506' '213']
 [3 '71AEF5AD-AA70-449C-899A-2514E97F893D' 3 1495480711 '994730'
  1495480711 '994730' None '1981' '7089241' '1309' '185']
 [4 '7914DA35-A410-44C2-A0D1-B11385B6B3F4' 4 1495480711 '994730'
  1495480711 '994730' None '1982' '7109105' '1382' '194']
 [5 'A1347CE5-F309-45C0-B372-E797A32FB0E5' 5 1495480711 '994730'
  1495480711 '994730' None '1983' '7181224' '1424' '198']
 [6 'DA07F92C-B10E-4D5B-A479-17522E0C5C8C' 6 1495480711 '994730'
  1495480711 '994730' None '1984' '7234514' '1465' '203']
 [7 '45CD9C9C-172C-4F0B-BC72-319E9B2EA6B0' 7 1495480711 '994730'
  1495480711 '994730' None '1985' '7274054' '1326' '182']
 [8 'B76329F9-740B-41BC-900B-62AA4790DAD6' 8 1495480711 '994730'
  1495480711 '994730' None '1986' '7319246' '1351' '185']
 [9 '4B66D71F-642D-4770-9123-BAB33E1E8205' 9 1495480711 '994730'
  1495480711 '994730' None '1987' '7342476' '1447' '197']
 [10 '6EB7E8EA-D4D9-4DDF-8469-ADFC8FECB06F' 10 1495480711 '994730'
  1495480711 '994730' None '1988' '7353719' '1484' '202']
 [11 'C2FAE03C-30CF-4861-9606-B96BB75C072A' 11 1495480711 '994730'
  1495480711 '994730' None '1989' '7344175' '1402' '191']
 [12 'D5507E3A-F453-4FAA-847A-AE76CDB2558C' 12 1495480711 '994730'
  1495480711 '994730' None '1990' '7335650' '1424' '194']
 [13 '7276B270-1998-4290-B645-607A2E03FE17' 13 1495480711 '994730'
  1495480711 '994730' None '1991' '7374501' '1469' '199']
 [14 '02E24AC6-6D8D-496D-933B-F174C9950DD3' 14 1495480711 '994730'
  1495480711 '994730' None '1992' '7428944' '1369' '184']
 [15 'D47DD57A-E12E-4D4F-8711-D10E37F17975' 15 1495480711 '994730'
  1495480711 '994730' None '1993' '7506166' '1369' '182']
 [16 '88ADE46C-791F-464F-8B8F-5077FFFF0EF6' 16 1495480711 '994730'
  1495480711 '994730' None '1994' '7570458' '1358' '179']
 [17 'A9F0A00A-12E5-4708-8932-6B43C5B85DD4' 17 1495480711 '994730'
  1495480711 '994730' None '1995' '7633040' '1326' '174']
 [18 '09164B3F-2924-495B-A3E3-723DA4CE3DF8' 18 1495480711 '994730'
  1495480711 '994730' None '1996' '7697812' '1298' '169']
 [19 'A2440145-4DA3-4088-8615-AFC05A857D47' 19 1495480711 '994730'
  1495480711 '994730' None '1997' '7773443' '1206' '155']
 [20 '01B75A1E-E15F-43C9-8FC2-EFFEF25C00E9' 20 1495480711 '994730'
  1495480711 '994730' None '1998' '7858259' '1220' '155']
 [21 '9223E5A0-DED5-4F20-A6ED-03FAFC4E820B' 21 1495480711 '994730'
  1495480711 '994730' None '1999' '7947660' '1237' '156']
 [22 '68B1631D-C945-4399-BEF8-832093F772D2' 22 1495480711 '994730'
  1495480711 '994730' None '2000' '8008278' '1240' '155']
 [23 'CA110C17-74B7-4D2D-AA19-424AE4E7744A' 23 1495480711 '994730'
  1495480711 '994730' None '2001' '8024964' '1184' '148']
 [24 '4AEDDA2C-BD8C-4B95-9966-259D9B038C4F' 24 1495480711 '994730'
  1495480711 '994730' None '2002' '8041649' '1136' '141']
 [25 '87F3BE91-D193-45E3-A2D2-6C3B9FA6069C' 25 1495480711 '994730'
  1495480711 '994730' None '2003' '8058335' '1094' '136']
 [26 'C82A5F3B-2E43-40F5-A951-021C168E0C08' 26 1495480711 '994730'
  1495480711 '994730' None '2004' '8075020' '1100' '136']
 [27 '13D59A66-BA54-4ADB-9069-C38D47664340' 27 1495480711 '994730'
  1495480711 '994730' None '2005' '8091706' '1138' '141']
 [28 '64E75825-7892-4840-B57F-D0B577F83185' 28 1495480711 '994730'
  1495480711 '994730' None '2006' '8108391' '1069' '132']
 [29 '55CAFFC9-CDB2-4037-BA3F-3857278D4405' 29 1495480711 '994730'
  1495480711 '994730' None '2007' '8125077' '1114' '137']
 [30 '71ADBA93-9D04-40BD-8791-41518FA9B18E' 30 1495480711 '994730'
  1495480711 '994730' None '2008' '8141762' '1098' '135']
 [31 'F09DD97C-B560-48C4-ADAE-521D611B8389' 31 1495480711 '994730'
  1495480711 '994730' None '2009' '8158448' '1007' '123']
 [32 'BEE271B1-301E-403D-8FC7-9FD5FF4EA867' 32 1495480711 '994730'
  1495480711 '994730' None '2010' '8175133' '1039' '127']
 [33 '2F8E8569-C197-43E4-8C2A-17F455B738F0' 33 1495480711 '994730'
  1495480711 '994730' None '2011' '8175133' '1021' '125']
 [34 '41C90A3D-E8D7-4165-AEC5-0534FC6EC07E' 34 1495480711 '994730'
  1495480711 '994730' None '2012' '8336697' '1009' '121']
 [35 'EDF6197F-E302-4B92-95FB-78F3439CAFEB' 35 1495480711 '994730'
  1495480711 '994730' None '2013' '8405837' '1006' '120']
 [36 'E3F488C5-E2CF-48D2-A259-6001D7D18ABB' 36 1495480711 '994730'
  1495480711 '994730' None '2014' '8491079' '996' '117']
 [37 '15E6A978-36AA-4338-8353-F5AD0431ADEA' 37 1495480711 '994730'
  1495480711 '994730' None '2015' '8550405' '1009' '118']
 [38 '08880B90-4C92-477C-ABE5-C8DFC548C43E' 38 1495480711 '994730'
  1495480711 '994730' None '2016' '8537673' '1002' '117']]

Storing

We've just discussed how to load and parse stored data. Let's look at the opposite side of the coin -- storing data from Python into a saved file.

1. JSON

Let's say we've transformed a data file by adding some new data, or cleaning it. We now have a cleaned data set that we can queue up for analysis.

In [10]:
# We have a cleaned data file, here represented by a small table

cleaned_table = {'data':[['1979', '7102100', '1512', '213'],
                         ['2001', '8024964', '1184', '148'],
                         ['2009', '8223444', '1204', '178']]}

with open('../cleaned_data/new_json.json', 
          'w', encoding='utf-8') as outfile:
    json.dump(cleaned_table, outfile)

2. Numpy

In [11]:
np.save('../cleaned_data/new_numpy.npy', cleaned_table)

3. SQLite

This operation follows the concepts outlined above, with the added step that we need to create the database if it does not exist, along with a table within it.

In [15]:
# We first establish a new database and connect to it

conn_out = sqlite3.connect('../cleaned_data/cleaned_sql.db')
cursor_out = conn_out.cursor()

# Then we create a new table

cursor_out.execute('''CREATE TABLE IF NOT EXISTS newsql
                  (year text not null unique, nyc_population text, consumption text, per_capita text) 
               ''')

for insert_row in cleaned_table['data']:
    cursor_out.execute("INSERT INTO newsql VALUES (?,?,?,?)", insert_row)

conn_out.commit()
conn_out.close()
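
As an aside, sqlite3 also offers executemany(), which can replace the row-by-row insert loop above; a minimal sketch, assuming the same cleaned_table and an open cursor:

# Bulk insert: one call instead of a Python-level loop
cursor_out.executemany("INSERT INTO newsql VALUES (?,?,?,?)",
                       cleaned_table['data'])
conn_out.commit()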

4. CSV with Pandas

Pandas also has a very quick way to write a dataframe out to a CSV.

In [16]:
cleaned_df = pd.DataFrame(cleaned_table['data'], columns = ['year','nyc_population','consumption', 'per_capita'])

cleaned_df.to_csv('../cleaned_data/new_df.csv', sep=',', encoding='utf-8')
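
Note that by default to_csv also writes the dataframe's integer index as an unnamed first column; pass index=False if you want only your data columns:

cleaned_df.to_csv('../cleaned_data/new_df.csv', sep=',',
                  encoding='utf-8', index=False)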

Cleaning

Finally, we can perform syntax cleaning operations in Python just as easily as in a program like Excel. In fact, if you are working with very large data (more than 1,048,576 rows, the row limit of current Excel versions) you simply will not be able to use Excel or other spreadsheet programs to manipulate the data and must depend on other methods.

Here, let's take a look at a very simple data set (2 columns, 66 rows) consisting of the Significant Noncompliance List published by NYC Open Data: https://data.cityofnewyork.us/Environment/Significant-Noncompliance-List/xnje-s6zf

If you preview the data, you'll notice that a date span has been combined into a single row cell, that there are some inconsistencies in how months are listed, and that it implements a kind of "footnote" method, with starred/crossed establishment names referring to the following notes:

  • *These establishments are on the list for late reporting only.
  • †These establishments are Non-Significant Industrial Users.

How would we perform the following data cleaning operations:

1) Split the "period" column into two columns, a start_date and end_date column?
2) Standardize all spellings of months?
3) Create a new column conveying the information in the "footnote" marks?
4) Remove commas, periods, etc. from the establishment names?
5) Ensure there is no whitespace or newlines on the ends of entries?
6) Output the cleaned data as a CSV in the /cleaned_data folder?

Try this challenge using whatever data container you wish. You may want to consider a very simple one: we can turn the CSV into a list of lists by reading in the file, splitting each line on the '",' combination, and working with each value separately.

In [17]:
with open(os.path.join(directory, 'nyc-noncompliancelist.csv')) as dfile:
    rows = dfile.readlines()

for row in rows[1:]:            # Take the rows as a list, skipping the header row
    print(row.split('",'))      # Split on the '",' combination: the closing quote plus comma that ends the first field
['"July 1, 2011 - June 30, 2012', '"*A1 RADIATOR EXPRESS & AUTO REPAIR, INC."\n']
['"July 1, 2011 - June 30, 2012', '*ADELPHIA CONTAINER CORP.\n']
['"July 1, 2011 - June 30, 2012', '"ALSCO JEWELRY OF NEW YORK, INC."\n']
['"July 1, 2011 - June 30, 2012', '*AMERICO IZZO\n']
['"July 1, 2011 - June 30, 2012', '"APEXX OMNI-GRAPHICS, INC."\n']
['"July 1, 2011 - June 30, 2012', '"*ARCHITECTURAL COATINGS, INC."\n']
['"July 1, 2011 - June 30, 2012', '"ATLANTIS JEWELRY CONTRACTORS, INC."\n']
['"July 1, 2011 - June 30, 2012', '*B & M LINEN CORP.\n']
['"July 1, 2011 - June 30, 2012', '"BNNS CO., INC."\n']
['"July 1, 2011 - June 30, 2012', 'BRILLIANT JEWELERS/MJJ INC.\n']
['"July 1, 2011 - June 30, 2012', 'CARVIN FRENCH JEWELERS INC.\n']
['"July 1, 2011 - June 30, 2012', 'CONTROL ELECTROPOLISHING CORP.\n']
['"July 1, 2011 - June 30, 2012', '"*DIVERSIFIED HEAT TRANSFER, INC."\n']
['"July 1, 2011 - June 30, 2012', '*ELDORADO FINISHING INC.\n']
['"July 1, 2011 - June 30, 2012', '"FAM CREATIONS, INC."\n']
['"July 1, 2011 - June 30, 2012', 'G & J SETTING CORP.\n']
['"July 1, 2011 - June 30, 2012', 'HALMARK ARCHITECTURAL FINISHING CORP.\n']
['"July 1, 2011 - June 30, 2012', 'HAROLD HOWARD SCOTT\n']
['"July 1, 2011 - June 30, 2012', '"JAMES PITTA JEWELRY POLISHING, INC."\n']
['"July 1, 2011 - June 30, 2012', '"*JERRY\'S AUTO RADIATOR, INC."\n']
['"July 1, 2011 - June 30, 2012', 'JP JEWELRY CORPORATION\n']
['"July 1, 2011 - June 30, 2012', '"*LEO INGWER, INC."\n']
['"July 1, 2011 - June 30, 2012', '"*LES ATELIERS DE VAN CLEEF & ARPELS, INC."\n']
['"July 1, 2011 - June 30, 2012', '*M & A JEWELRY POLISHING CORP.\n']
['"July 1, 2011 - June 30, 2012', 'MAZZA/BARTHOLOMEW LTD.\n']
['"July 1, 2011 - June 30, 2012', '"*NEW RADIATOR, INC."\n']
['"July 1, 2011 - June 30, 2012', '*NEW YORK LAB INC.\n']
['"July 1, 2011 - June 30, 2012', 'NYC DEPARTMENT OF EDUCATION\n']
['"July 1, 2011 - June 30, 2012', '"PARALLEL PRODUCTS OF NEW ENGLAND, INC."\n']
['"July 1, 2011 - June 30, 2012', '"PERRIGO NEW YORK, INC."\n']
['"July 1, 2011 - June 30, 2012', 'PETROLEUM TANK CLEANERS LTD.\n']
['"July 1, 2011 - June 30, 2012', '"*RAM JEWELRY, INC."\n']
['"July 1, 2011 - June 30, 2012', '"REGAL EMBLEM CO., INC."\n']
['"July 1, 2011 - June 30, 2012', '"RIVIERA GOLD, INC."\n']
['"July 1, 2011 - June 30, 2012', '"*S & L AEROSPACE METALS, LLC"\n']
['"July 1, 2011 - June 30, 2012', 'SUMER GOLD LTD.\n']
['"July 1, 2011 - June 30, 2012', '"*T.S. MANUFACTURING, INC."\n']
['"July 1, 2011 - June 30, 2012', '*TRI-MART DESIGNS LTD.\n']
['"July 1, 2011 - June 30, 2012', '"WASTE MANAGEMENT OF NEW YORK, L.L.C."\n']
['"Jul 1, 2015 - June 30, 2016', '† A&L CESSPOOL SERVICE CORP.\n']
['"Jul 1, 2015 - June 30, 2016', '"* AZAD WATCH NYC, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"BK JEWELRY CONTRACTING, INC."\n']
['"Jul 1, 2015 - June 30, 2016', 'BNNS CO. INC.\n']
['"Jul 1, 2015 - June 30, 2016', 'CINER MANUFACTURING CO. INC.\n']
['"Jul 1, 2015 - June 30, 2016', '* CNV ARCHITECTURAL COATINGS INC.\n']
['"Jul 1, 2015 - June 30, 2016', '"† DELL’S MARASCHINO CHERRIES CO., INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"* DIANE VON BASEDOW, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"EMA JEWELRY, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"* F. BLANCATO, LLC"\n']
['"Jul 1, 2015 - June 30, 2016', 'GEMBAR SETTING INC.\n']
['"Jul 1, 2015 - June 30, 2016', 'HALMARK ARCHITECTURAL FINISHING CORP.\n']
['"Jul 1, 2015 - June 30, 2016', '"* INDUSTRIAL FINISHING PRODUCTS, INC"\n']
['"Jul 1, 2015 - June 30, 2016', 'KYUANG M. LEE D/B/A SOLOMON JEWELRY\n']
['"Jul 1, 2015 - June 30, 2016', '"LGSP, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '* NGJ 47 INC.\n']
['"Jul 1, 2015 - June 30, 2016', '† NYC DEPARTMENT OF EDUCATION\n']
['"Jul 1, 2015 - June 30, 2016', 'NY FINE JEWELRY DESIGN INC.\n']
['"Jul 1, 2015 - June 30, 2016', '"O.K.SONG JEWELRY, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"PERRIGO NEW YORK, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"PREMIER POLISHING & CONTRACTING USA, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"PROGRAMATIC PLATERS, INC."\n']
['"Jul 1, 2015 - June 30, 2016', '"RAM JEWELRY, INC."\n']
['"Jul 1, 2015 - June 30, 2016', 'SARKIS ELIGUR D/B/A PALA JEWELRY\n']
['"Jul 1, 2015 - June 30, 2016', '"TELD JEWELRY DESIGN, CORP."\n']
['"Jul 1, 2015 - June 30, 2016', '"* TIMELESS DESIGNS BY JACOB BRYAN, LTD"\n']
['"Jul 1, 2015 - June 30, 2016', '* U.S. DEPARTMENT OF VETERANS AFFAIRS\n']

Challenge Answers

In [18]:
# Challenge 1

water_df = pd.DataFrame(water_dictionary['data'], columns = ['sid','id','position',
                                                            'created_at','created_meta',
                                                            'updated_at','updated_meta',
                                                            'meta','year','New York City Population',
                                                            'NYC Consumption(Million gallons per day)',
                                                             'Per Capita(Gallons per person per day)'])

# Alternative, less compact approach

cols = []

for col_name in water_dictionary['meta']['view']['columns']:
    cols.append(col_name['name'])

water_df = pd.DataFrame(water_dictionary['data'], columns=cols)

display(water_df)
sid id position created_at created_meta updated_at updated_meta meta Year New York City Population NYC Consumption(Million gallons per day) Per Capita(Gallons per person per day)
0 1 4BE85703-1B5D-49E2-A418-75BEDD478BB5 1 1495480711 994730 1495480711 994730 None 1979 7102100 1512 213
1 2 971B06B7-1135-4887-B409-DAC19D845EBC 2 1495480711 994730 1495480711 994730 None 1980 7071639 1506 213
2 3 71AEF5AD-AA70-449C-899A-2514E97F893D 3 1495480711 994730 1495480711 994730 None 1981 7089241 1309 185
3 4 7914DA35-A410-44C2-A0D1-B11385B6B3F4 4 1495480711 994730 1495480711 994730 None 1982 7109105 1382 194
4 5 A1347CE5-F309-45C0-B372-E797A32FB0E5 5 1495480711 994730 1495480711 994730 None 1983 7181224 1424 198
5 6 DA07F92C-B10E-4D5B-A479-17522E0C5C8C 6 1495480711 994730 1495480711 994730 None 1984 7234514 1465 203
6 7 45CD9C9C-172C-4F0B-BC72-319E9B2EA6B0 7 1495480711 994730 1495480711 994730 None 1985 7274054 1326 182
7 8 B76329F9-740B-41BC-900B-62AA4790DAD6 8 1495480711 994730 1495480711 994730 None 1986 7319246 1351 185
8 9 4B66D71F-642D-4770-9123-BAB33E1E8205 9 1495480711 994730 1495480711 994730 None 1987 7342476 1447 197
9 10 6EB7E8EA-D4D9-4DDF-8469-ADFC8FECB06F 10 1495480711 994730 1495480711 994730 None 1988 7353719 1484 202
10 11 C2FAE03C-30CF-4861-9606-B96BB75C072A 11 1495480711 994730 1495480711 994730 None 1989 7344175 1402 191
11 12 D5507E3A-F453-4FAA-847A-AE76CDB2558C 12 1495480711 994730 1495480711 994730 None 1990 7335650 1424 194
12 13 7276B270-1998-4290-B645-607A2E03FE17 13 1495480711 994730 1495480711 994730 None 1991 7374501 1469 199
13 14 02E24AC6-6D8D-496D-933B-F174C9950DD3 14 1495480711 994730 1495480711 994730 None 1992 7428944 1369 184
14 15 D47DD57A-E12E-4D4F-8711-D10E37F17975 15 1495480711 994730 1495480711 994730 None 1993 7506166 1369 182
15 16 88ADE46C-791F-464F-8B8F-5077FFFF0EF6 16 1495480711 994730 1495480711 994730 None 1994 7570458 1358 179
16 17 A9F0A00A-12E5-4708-8932-6B43C5B85DD4 17 1495480711 994730 1495480711 994730 None 1995 7633040 1326 174
17 18 09164B3F-2924-495B-A3E3-723DA4CE3DF8 18 1495480711 994730 1495480711 994730 None 1996 7697812 1298 169
18 19 A2440145-4DA3-4088-8615-AFC05A857D47 19 1495480711 994730 1495480711 994730 None 1997 7773443 1206 155
19 20 01B75A1E-E15F-43C9-8FC2-EFFEF25C00E9 20 1495480711 994730 1495480711 994730 None 1998 7858259 1220 155
20 21 9223E5A0-DED5-4F20-A6ED-03FAFC4E820B 21 1495480711 994730 1495480711 994730 None 1999 7947660 1237 156
21 22 68B1631D-C945-4399-BEF8-832093F772D2 22 1495480711 994730 1495480711 994730 None 2000 8008278 1240 155
22 23 CA110C17-74B7-4D2D-AA19-424AE4E7744A 23 1495480711 994730 1495480711 994730 None 2001 8024964 1184 148
23 24 4AEDDA2C-BD8C-4B95-9966-259D9B038C4F 24 1495480711 994730 1495480711 994730 None 2002 8041649 1136 141
24 25 87F3BE91-D193-45E3-A2D2-6C3B9FA6069C 25 1495480711 994730 1495480711 994730 None 2003 8058335 1094 136
25 26 C82A5F3B-2E43-40F5-A951-021C168E0C08 26 1495480711 994730 1495480711 994730 None 2004 8075020 1100 136
26 27 13D59A66-BA54-4ADB-9069-C38D47664340 27 1495480711 994730 1495480711 994730 None 2005 8091706 1138 141
27 28 64E75825-7892-4840-B57F-D0B577F83185 28 1495480711 994730 1495480711 994730 None 2006 8108391 1069 132
28 29 55CAFFC9-CDB2-4037-BA3F-3857278D4405 29 1495480711 994730 1495480711 994730 None 2007 8125077 1114 137
29 30 71ADBA93-9D04-40BD-8791-41518FA9B18E 30 1495480711 994730 1495480711 994730 None 2008 8141762 1098 135
30 31 F09DD97C-B560-48C4-ADAE-521D611B8389 31 1495480711 994730 1495480711 994730 None 2009 8158448 1007 123
31 32 BEE271B1-301E-403D-8FC7-9FD5FF4EA867 32 1495480711 994730 1495480711 994730 None 2010 8175133 1039 127
32 33 2F8E8569-C197-43E4-8C2A-17F455B738F0 33 1495480711 994730 1495480711 994730 None 2011 8175133 1021 125
33 34 41C90A3D-E8D7-4165-AEC5-0534FC6EC07E 34 1495480711 994730 1495480711 994730 None 2012 8336697 1009 121
34 35 EDF6197F-E302-4B92-95FB-78F3439CAFEB 35 1495480711 994730 1495480711 994730 None 2013 8405837 1006 120
35 36 E3F488C5-E2CF-48D2-A259-6001D7D18ABB 36 1495480711 994730 1495480711 994730 None 2014 8491079 996 117
36 37 15E6A978-36AA-4338-8353-F5AD0431ADEA 37 1495480711 994730 1495480711 994730 None 2015 8550405 1009 118
37 38 08880B90-4C92-477C-ABE5-C8DFC548C43E 38 1495480711 994730 1495480711 994730 None 2016 8537673 1002 117
In [19]:
# Challenge 2

with open(os.path.join(directory, 'nyc-noncompliancelist.csv')) as dfile:
    rows = dfile.readlines()
    
new_data = []

for row in rows[1:]:   # Take the rows as a list, skipping the header row
    rlist = row.split('",')
    dates = rlist[0].replace('"','').split('-')
    start_date = dates[0].replace('Jul ', 'July ').strip()
    end_date = dates[1].strip()
    marker = rlist[1].replace('"','')[0]   # first character once any quotes are removed
    if marker == '*':
        note = 'LateReport'
    elif marker == '†':
        note = 'NS-IndustrialUsers'
    else:
        note = 'NA'
    est_name = rlist[1].replace('"','').replace('*','').replace('†','').replace('.','').replace(',','').strip('\n').strip()
    new_data.append([start_date, end_date, note, est_name])

for r in new_data:
    print(r)
['July 1, 2011', 'June 30, 2012', 'LateReport', 'A1 RADIATOR EXPRESS & AUTO REPAIR INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'ADELPHIA CONTAINER CORP']
['July 1, 2011', 'June 30, 2012', 'NA', 'ALSCO JEWELRY OF NEW YORK INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'AMERICO IZZO']
['July 1, 2011', 'June 30, 2012', 'NA', 'APEXX OMNI-GRAPHICS INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'ARCHITECTURAL COATINGS INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'ATLANTIS JEWELRY CONTRACTORS INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'B & M LINEN CORP']
['July 1, 2011', 'June 30, 2012', 'NA', 'BNNS CO INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'BRILLIANT JEWELERS/MJJ INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'CARVIN FRENCH JEWELERS INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'CONTROL ELECTROPOLISHING CORP']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'DIVERSIFIED HEAT TRANSFER INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'ELDORADO FINISHING INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'FAM CREATIONS INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'G & J SETTING CORP']
['July 1, 2011', 'June 30, 2012', 'NA', 'HALMARK ARCHITECTURAL FINISHING CORP']
['July 1, 2011', 'June 30, 2012', 'NA', 'HAROLD HOWARD SCOTT']
['July 1, 2011', 'June 30, 2012', 'NA', 'JAMES PITTA JEWELRY POLISHING INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', "JERRY'S AUTO RADIATOR INC"]
['July 1, 2011', 'June 30, 2012', 'NA', 'JP JEWELRY CORPORATION']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'LEO INGWER INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'LES ATELIERS DE VAN CLEEF & ARPELS INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'M & A JEWELRY POLISHING CORP']
['July 1, 2011', 'June 30, 2012', 'NA', 'MAZZA/BARTHOLOMEW LTD']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'NEW RADIATOR INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'NEW YORK LAB INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'NYC DEPARTMENT OF EDUCATION']
['July 1, 2011', 'June 30, 2012', 'NA', 'PARALLEL PRODUCTS OF NEW ENGLAND INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'PERRIGO NEW YORK INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'PETROLEUM TANK CLEANERS LTD']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'RAM JEWELRY INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'REGAL EMBLEM CO INC']
['July 1, 2011', 'June 30, 2012', 'NA', 'RIVIERA GOLD INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'S & L AEROSPACE METALS LLC']
['July 1, 2011', 'June 30, 2012', 'NA', 'SUMER GOLD LTD']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'TS MANUFACTURING INC']
['July 1, 2011', 'June 30, 2012', 'LateReport', 'TRI-MART DESIGNS LTD']
['July 1, 2011', 'June 30, 2012', 'NA', 'WASTE MANAGEMENT OF NEW YORK LLC']
['July 1, 2015', 'June 30, 2016', 'NS-IndustrialUsers', 'A&L CESSPOOL SERVICE CORP']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'AZAD WATCH NYC INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'BK JEWELRY CONTRACTING INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'BNNS CO INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'CINER MANUFACTURING CO INC']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'CNV ARCHITECTURAL COATINGS INC']
['July 1, 2015', 'June 30, 2016', 'NS-IndustrialUsers', 'DELL’S MARASCHINO CHERRIES CO INC']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'DIANE VON BASEDOW INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'EMA JEWELRY INC']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'F BLANCATO LLC']
['July 1, 2015', 'June 30, 2016', 'NA', 'GEMBAR SETTING INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'HALMARK ARCHITECTURAL FINISHING CORP']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'INDUSTRIAL FINISHING PRODUCTS INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'KYUANG M LEE D/B/A SOLOMON JEWELRY']
['July 1, 2015', 'June 30, 2016', 'NA', 'LGSP INC']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'NGJ 47 INC']
['July 1, 2015', 'June 30, 2016', 'NS-IndustrialUsers', 'NYC DEPARTMENT OF EDUCATION']
['July 1, 2015', 'June 30, 2016', 'NA', 'NY FINE JEWELRY DESIGN INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'OKSONG JEWELRY INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'PERRIGO NEW YORK INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'PREMIER POLISHING & CONTRACTING USA INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'PROGRAMATIC PLATERS INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'RAM JEWELRY INC']
['July 1, 2015', 'June 30, 2016', 'NA', 'SARKIS ELIGUR D/B/A PALA JEWELRY']
['July 1, 2015', 'June 30, 2016', 'NA', 'TELD JEWELRY DESIGN CORP']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'TIMELESS DESIGNS BY JACOB BRYAN LTD']
['July 1, 2015', 'June 30, 2016', 'LateReport', 'US DEPARTMENT OF VETERANS AFFAIRS']
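
Step 6 of the challenge asked us to write the cleaned rows into /cleaned_data. A minimal sketch using the csv module (the output filename is our own choice):

import csv

with open('../cleaned_data/noncompliance_cleaned.csv', 'w',
          encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['start_date', 'end_date', 'note', 'establishment'])
    writer.writerows(new_data)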

Final Challenge: BPL Branches Data

Time permitting, let's put our full list of skills together and see if we can extract a simplified dataset from this Brooklyn Public Library listing of library branches and hours. Using the JSON formatted file available at https://www.bklynlibrary.org/locations/json, extract the name of each branch, its address, and a third column listing the total number of hours per week that the branch is open.
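
No answer sheet for this one, but here is a hedged starting point. The key names below ('locations', 'title', 'address', 'hours') are placeholders to be checked against the live JSON, not confirmed field names; the pattern, though, is the same load-parse-transform cycle used throughout this session.

import json
import urllib.request

with urllib.request.urlopen('https://www.bklynlibrary.org/locations/json') as resp:
    branches = json.loads(resp.read().decode('utf-8'))

rows = []
for branch in branches['locations']:      # placeholder key: inspect the real JSON first
    name = branch['title']                # placeholder key
    address = branch['address']           # placeholder key
    # Summing weekly hours means parsing each day's opening/closing times
    # (e.g. '10:00am - 6:00pm') into a number of hours -- that part is the exercise.
    total_hours = None
    rows.append([name, address, total_hours])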