Nicholas Wolf and Vicky Steeves, NYU Data Services
Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Overview
This session is an intermediate-level class that will examine ways to perform data cleaning, transformation, and management using Python. We will look at some helpful ways to load data and parse it into a container for ease of use in Python, to store it in helpful formats, and to perform some basic cleaning and transformations typical for mixed string-and-numeric formats. Finally, we'll try putting it all together using a dataset from the NYC Open Data portal.
Setup
Let's think about helpful project organizational structures from the start. We can borrow a helpful folder structure used by Ben Marwick in his rrtools R package. There are iterations of this idea in a lot of places, but the basic idea is to separate your starting data from your transformed/cleaned data, and all of this input data from your subsequent analysis, documentation, and outputs. We'll modify his tree directory slightly:
analysis/
├── paper/            # Working paper for publication
│   └── figures/      # Location of the figures produced by your analysis
├── data/
│   ├── data_scripts/ # Scripts used to transform and clean raw data
│   ├── raw_data/     # Data obtained by you or from elsewhere
│   └── cleaned_data/ # Migrated format versions of raw data, with syntax corrections and integrity checks
└── docs/             # Documentation files for methods, parameters, data dictionaries, etc.
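If you prefer to build this folder skeleton programmatically, here is a minimal sketch (not part of the session files) using the standard library; the folder names are taken directly from the tree above, and you would run it from the project root.
import os
# Create the project skeleton shown above, starting from the project root
for folder in ['analysis/paper/figures',
               'analysis/data/data_scripts',
               'analysis/data/raw_data',
               'analysis/data/cleaned_data',
               'analysis/docs']:
    os.makedirs(folder, exist_ok=True)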
Download the zipped folder package available at goo.gl/SJUrcJ and place it on your desktop. Uncompress the files. We'll be working with the Jupyter Notebook located at /analysis/data/data_scripts/python-cleaning-session-full.ipynb. Alternatively, you can clone the course materials using
git clone https://github.com/NYU-DataServices/python-cleaning.git
You can also find this notebook (in non-executable form) at https://nyu-dataservices.github.io/python-cleaning/
You are welcome to deploy this Jupyter Notebook in the Jupyter environment of your choice (departmental hub, local instance). You can also use it on our JupyterHub instance here: https://tutorials-1.rcnyu.org
1. CSV and Dataframes
Not surprisingly, a very common and helpful way to store data is in CSV format. Most applications have some way of working with delimited files, and while not 100% standardized, CSVs provide a stable means of sharing and storing data of all types.
Python offers a CSV module, but it can be a wordy way to input and output data into your workflow. Since analysis is also often eased by using a more complex data container such as a dataframe, let's rely on the Pandas module to do the heavy lifting of importing (and later exporting) CSVs.
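For comparison, here is a minimal sketch of what the same load looks like with the built-in csv module (assuming the same file and folder used in the pandas example below); even a simple read requires you to manage the file handle, the header row, and the row container yourself.
import csv
import os

directory = r'../raw_data'   # same folder used in the pandas example below

with open(os.path.join(directory, 'water-consumption-nyc.csv'), encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile)
    header = next(reader)                 # the header row comes back as a plain list of strings
    data_rows = [row for row in reader]   # and so does every data row

print(header)
print(data_rows[0])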
Here, we load up a CSV storing NYC water consumption data from the NYC Open Data portal.
import pandas as pd
import os
## Set a path to our /raw_data directory
## Mac
directory = r'../raw_data'
## Windows
#directory = r'..\raw_data'
# Load the CSV into a dataframe (the directory variable above handles the platform-specific path)
df = pd.read_csv(os.path.join(directory,
'water-consumption-nyc.csv'), header=0,
sep=',', parse_dates=False, encoding='utf-8')
df.head(10)
2. JSON
Python's json module is compact and easy to use. Once loaded, the JSON object is parsed into an ordinary Python dictionary. Let's load the same data on water consumption stored as a JSON file. You can preview the NYC JSON structure here: https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.json?accessType=DOWNLOAD
import json
with open(os.path.join(directory,
          'water-consumption-nyc.json'), encoding='utf-8') as jsonfile:
    water_dictionary = json.loads(jsonfile.read())
## We can take a look at the data portion of the file
## (rather than the metadata stored in the JSON)
for row in water_dictionary['data'][0:12]:
print(row)
Let's turn the list of lists in which NYC has stored the data into a dataframe. Because the data has been stored as a list of lists, this is a one-step process:
water_df = pd.DataFrame(water_dictionary['data'])
water_df.head(10)
We've lost our column headers! NYC has stored them in a separate location in the JSON. How can we recover them? Hint: take a look at the full JSON here: https://data.cityofnewyork.us/api/views/ia2d-e54m/rows.json?accessType=DOWNLOAD
# Challenge 1
water_df = pd.DataFrame(water_dictionary['data'],
columns = ['sid','id','position',
'created_at','created_meta',
'updated_at','updated_meta',
'meta','year','New York City Population',
'NYC Consumption(Millions galls per day)',
'Per Capita(Gallons per person per day)'])
# Alternative, more automated way:
cols = [col_name['name'] for col_name in water_dictionary['meta']['view']['columns']]
water_df = pd.DataFrame(water_dictionary['data'], columns=cols)
water_df.head(10)
3. SQLite
Once you start working with data distributed across multiple tables, or working with data that is starting to exceed easy usability in a format like CSV, consider pairing Python with a simple SQLite database to push and pull data. The maintainers of SQLite note that it will operate on large data files (see https://www.sqlite.org/limits.html), up to 140 TB of database file size--larger than most systems can manipulate in memory, at any rate. So for most purposes, you can nicely store your data in such a database and access it when needed.
import sqlite3
# Create a connection object to our sqlite3 database file
conn = sqlite3.connect(os.path.join(directory,
'water-consumption.db'))
c = conn.cursor()
# Select data from the water table
rows = c.execute("SELECT * FROM water")
# The response is an iterable of tuples, each tuple containing one row's cell values
for row in rows:
print(row)
# We can modify our data
newrow = ('2017', '85312743', '1003', '158')
c.execute("INSERT INTO water VALUES (?,?,?,?)", newrow)
# Commit the changes
conn.commit()
# View the new row
rows = c.execute("SELECT * FROM water")
for row in rows:
print(row)
# Or...we can once again just read our SQLite table
# directly into a dataframe:
water_sql_df = pd.read_sql_query("SELECT * FROM water", conn)
# Close the connection
conn.close()
water_sql_df.head(10)
4. Numpy Array
The numpy module provides a quick way to save complex array-like objects (lists, lists of lists, dictionaries, lists of dictionaries, dictionaries of lists) to a binary .npy file. This can be a quick and easy way (among many others, including saving such objects in a .txt or .py file) to save and return to a data container without having to reconstruct it every time you resume your workflow. See also below for how to save the .npy file in the first place.
import numpy as np
# Loading an array from the .npy file
water_npy = np.load(os.path.join(directory,
'water_consumption.npy'), allow_pickle=True)
print(water_npy)
We've just discussed how to load and parse stored data. Let's look at the opposite side of the coin -- storing data from Python into a saved file.
1. JSON
Let's say we've transformed a data file by adding some new data, or cleaning it. We now have a cleaned data set that we can queue up for analysis.
# We have a cleaned data file, here represented by a small table
cleaned_table = {'data':[['1979', '7102100', '1512', '213'],
['2001', '8024964', '1184', '148'],
['2009', '8223444', '1204', '178']]}
with open('../cleaned_data/new_json.json',
'w', encoding='utf-8') as outfile:
json.dump(cleaned_table, outfile)
2. Numpy
np.save('../cleaned_data/new_numpy.npy', cleaned_table)
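Note that because cleaned_table is a dictionary rather than a plain numeric array, numpy pickles it when saving; reading it back therefore requires allow_pickle=True, just as in the np.load() call shown earlier.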
3. SQLite
This operation is straightforward given the concepts outlined above, with the added step that we need to create our database if it does not exist, as well as our table within the database.
# We first establish a new database and connect to it
conn_out = sqlite3.connect('../cleaned_data/cleaned_sql.db')
cursor_out = conn_out.cursor()
# Then we create a new table
cursor_out.execute('''CREATE TABLE IF NOT EXISTS newsql
(year text not null unique, nyc_population text, consumption text, per_capita text)
''')
for insert_row in cleaned_table['data']:
cursor_out.execute("INSERT INTO newsql VALUES (?,?,?,?)", insert_row)
conn_out.commit()
conn_out.close()
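As an aside, the row-by-row loop can be collapsed into a single executemany() call. Here is a small equivalent sketch (not part of the session files); it writes to a separate newsql_alt table purely for illustration, so it does not collide with the unique year constraint on the table created above.
import sqlite3

# Same write step as above, but inserting every row in one executemany() call
conn_alt = sqlite3.connect('../cleaned_data/cleaned_sql.db')
cursor_alt = conn_alt.cursor()
cursor_alt.execute('''CREATE TABLE IF NOT EXISTS newsql_alt
        (year text not null unique, nyc_population text, consumption text, per_capita text)
        ''')
cursor_alt.executemany("INSERT INTO newsql_alt VALUES (?,?,?,?)", cleaned_table['data'])
conn_alt.commit()
conn_alt.close()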
4. CSV with Pandas
Pandas also has a very quick way to write a dataframe out to a CSV.
cleaned_df = pd.DataFrame(cleaned_table['data'], columns = ['year','nyc_population','consumption', 'per_capita'])
cleaned_df.to_csv('../cleaned_data/new_df.csv', sep=',', encoding='utf-8')
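Note that to_csv() also writes the dataframe's integer index as an unnamed first column by default; pass index=False if you want only the four data columns in the output file.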
Finally, we can perform syntax cleaning operations in Python just as easily as in a program like Excel. In fact, if you are working with very large data (Excel tops out at 1,048,576 rows), you will simply not be able to use Excel or other spreadsheet programs to manipulate the data and must depend on other methods.
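One such method is pandas' chunked reading, which processes a large CSV a slice at a time instead of loading it all into memory. A brief illustrative sketch (the file name here is hypothetical, not part of the session materials):
# Hypothetical example: count the rows of a CSV too large to open in a spreadsheet,
# processing it 100,000 rows at a time. Replace the counting with real cleaning logic.
row_count = 0
for chunk in pd.read_csv('very_large_file.csv', chunksize=100000):
    row_count += len(chunk)
print(row_count)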
Here, let's take a look at a very simple data set (2 columns, 66 rows) consisting of the Significant Noncompliance List published by NYC Open Data: https://data.cityofnewyork.us/Environment/Significant-Noncompliance-List/xnje-s6zf
If you preview the data, you'll notice that a date span has been combined into a single row cell, that there are some inconsistencies in how months are listed, and that it implements a kind of "footnote" method, with starred/daggered establishment names referring to explanatory notes.
How would we perform the following data cleaning operations:
1) Split the "period" column into two columns, a start_date and an end_date column
2) Standardize all spellings of months
3) Create a new column conveying the information in the "footnote" marks
4) Remove commas, periods, etc. from the establishment names
5) Ensure there is no whitespace or newlines on the ends of entries
6) Output the cleaned data as a CSV in the /cleaned_data folder
Try this challenge using whatever data container you wish. You may want to consider a very simple one: we can turn the CSV into a list of lists by reading in the file, splitting each row on the ", combination (a closing double quote followed by a comma), and working with each value separately.
with open(os.path.join(directory, 'nyc-noncompliancelist.csv')) as dfile:
rows = dfile.readlines()
for row in rows[1:]: # Taking the rows as a list, but skipping the first header row
print(row.split('",')) # Splitting on the delimiter which we notice the escaped " before it.
# Challenge 1
water_df = pd.DataFrame(water_dictionary['data'], columns = ['sid','id','position',
'created_at','created_meta',
'updated_at','updated_meta',
'meta','year','New York City Population',
'NYC Consumption(Millions galls per day)',
'Per Capita(Gallons per person per day)'])
# Alternative, less compact approach
cols = []
for col_name in water_dictionary['meta']['view']['columns']:
cols.append(col_name['name'])
water_df = pd.DataFrame(water_dictionary['data'], columns=cols)
display(water_df)
# Challenge 2
with open(os.path.join(directory, 'nyc-noncompliancelist.csv')) as dfile:
rows = dfile.readlines()
new_data = []
for row in rows[1:]: # Taking the rows as a list, but skipping the first header row
rlist = row.split('",')
dates = rlist[0].replace('"','').split('-')
    start_date = dates[0].replace('Jul ', 'July ').strip()
end_date = dates[1].strip()
if rlist[1][0] == '*':
note = 'LateReport'
elif rlist[1][0] == '†':
note = 'NS-IndustrialUsers'
else:
note = 'NA'
est_name = rlist[1].replace('"','').replace('*','').replace('†','').replace('.','').replace(',','').strip('\n').strip()
new_data.append([start_date, end_date, note, est_name])
for r in new_data:
print(r)
Time permitting, let's put our full list of skills together and see if we can extract a simplified dataset from this Brooklyn Public Library listing of library branches and hours. Using the JSON formatted file available at https://www.bklynlibrary.org/locations/json, extract the name of each branch, its address, and a third column listing the total number of hours per week that the branch is open.
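If you want a starting point, here is a hedged sketch of the first step only -- fetching the feed and inspecting its structure. The key names inside the JSON (and therefore the extraction logic for name, address, and hours) are left for you to discover, since they depend on how the library structures its feed.
import json
from urllib.request import urlopen

# Fetch the Brooklyn Public Library locations feed and look at its top-level structure
with urlopen('https://www.bklynlibrary.org/locations/json') as response:
    bpl = json.loads(response.read().decode('utf-8'))

# Inspect before deciding how to pull out branch name, address, and weekly hours
print(type(bpl))
print(bpl.keys() if isinstance(bpl, dict) else len(bpl))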