Nicholas Wolf and Vicky Steeves, NYU Data Services
Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This session is an intermediate-to-advanced level class that offers some ideas for how to approach the following common data wrangling needs in research:
Project Environment
Download the notebook available at https://goo.gl/Pnm7Dx and open it in Jupyter Notebook. Alternatively, you can clone the course materials using
git clone https://github.com/NYU-DataServices/DataHarvesting-Python.git
State of New York Socrata API Account
To work with this session's data, you'll need to create an API account with the state of New York's data service. Visit data.ny.gov and click on signup to create an account.
If not using our JupyterHub cloud environment, you'll need to install sodapy
This is a wrapper for interfacing with the Socrata data platform's API.
Run
pip install sodapy
or
easy_install sodapy
Google Sheets API
To demonstrate how to work with oauth API interfaces, we'll do some sandbox work using Google Sheets API as a proof of concept. For this, you'll need a Google Drive account such as the one associated with your NYU email address.
If not using our JupyterHub cloud environment, you'll need a set of API libraries provided by Google:
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
It is not unusual to find open data served directly on the web, without requiring any tokens or authorizations. Although it is not often the practice to provide very large resources via such an entrypoint because of the burden it places on data servers, smaller datasets are frequently available. When it comes to such non-API generated data it is generally easier to work with services that serve the data directly in its native file format, rather than wrapping it in HTML. The latter requires you to parse first the HTML or interpret the HTTP response.
As an example, take the motor vehicle accident report data found here. One thousand records from this 1.65 million-record dataset can be accessed directly at https://data.ny.gov/resource/xe9x-a24f.json (FYI Firefox has a nice JSON viewer built in when you encounter .json hosted files on the web.)
Let's walk through making an HTTP request for that .json data and quickly transforming it into a useful container (a Pandas dataframe) to ready it for use. We'll make use of Python's JSON module, a compact and easy-to-use way of turning JSON into Python's native object types, lists and dictionaries. If you want to anticipate what JSON input will prompt which Python object type output, see this table here: https://docs.python.org/3/library/json.html#json-to-py-table. Note that a JSON array of key-value objects will yield a Python list of dictionaries.
import json
import requests
mv_data_json = requests.get('https://data.ny.gov/resource/xe9x-a24f.json')
mv_list_recs = json.loads(mv_data_json.text)
print(mv_list_recs[0])
Pandas
If you haven't yet given a Pandas dataframe a try as a way to manage large arrays of information, give it a go. It is possible to go overboard: not everything needs to be put in a dataframe, especially when a Python list of lists or a dictionary will do. Take a look at how fast we can access subsets of the motor vehicle data.
import pandas as pd
mv_df = pd.DataFrame(mv_list_recs)
mv_df.head(10)
Bonus: wget direct from web
If you have wget installed on your system, you can use the command line utility wget directly in a Notebook cell. This cell could then be run at the start of your notebook to always retrieve the latest version of a dataset.
To install wget, visit http://www.gnu.org/software/wget/
!wget https://data.ny.gov/resource/xe9x-a24f.json
wget_json = json.loads(open('xe9x-a24f.json').read())
wget_json[0]
For more robust ways of serving up data, i.e. by APIs, we generally need to register an "app," i.e. an application that will be accessing the data, receive at the very least a token (and often a client secret as well) to enable tracked downloads of data, ensure proper access limits, etc.
We can think of these access points as involving one of two levels of authentication: a simple authentication involving signed requests (where a long term token is passed along with the request but no secondary per-request or limited-time token is needed), and a more complex, two or three step authentication process.
It is very helpful if an API comes with a pre-built library to interface with that server so that you don't have to handle signing requests in HTTP, managing tokens, etc.
Fortunately, we have a nice workable pre-made library for working with this same NY State Socrata API portal, sodapy.
Setting Up Your App on Your Socrata Account
Once you have sodapy installed, you'll want to visit the API section of the website, which has its own record of the same dataset we visited above with some additional access information:
https://dev.socrata.com/foundry/data.ny.gov/xe9x-a24f
If you scroll down about halfway through this page, you'll see a large "Sign up for an app token!" button. Click on that to be taken to your API dashboard (alternative, you can login and navigae to https://data.ny.gov/profile/edit/developer_settings)
In the second main section, select "Create New App Token."
Fill in some basic details for your "app" and once generated, copy down the App Token given to you.
We can now proceed to try out the sodapy library.
from sodapy import Socrata
client = Socrata('data.ny.gov','APP-TOKEN',
username='EMAIL-USERNAME',
password='PASSWORD')
results = client.get('xe9x-a24f', limit=2000)
results_df = pd.DataFrame.from_records(results)
results_df.head(10)
In our example above, using sodapy, notice that we can set the number of results we obtain for each request. This is helpful, even if we want the full set of data. At 1.6 million records, this dataset is taxing (though not extraordinarly so) on our http connection to grab it in one go.
A common approach used by data providers to solving this problem is paging. In this setup, you are limited in the number of results you can obtain in each request (often set at 200 records) and must make multiple requests and compile the results.
Take, for example, the city of New York's 311 calls data (https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). At over 22.4 million records, it isn't realistic to pull the entire dataset in one go. You must page results by passing a "limit" and "offset" paramters in your request URL. Think of this as the total number of records (limit) and a way to request subsequent subsets of results (first page, second page, third page; or offset 0, 1, 2...). An example is available here: https://dev.socrata.com/docs/paging.html.
In other cases our data providers will want you to perform a a two or three-legged authorization -- in other words, you register an access token that you use to sign a request for a second token which you can then pass in the actual data request, PLUS some kind of password authorization that is passed via a web form.
To illustrate how this works on a nice localized environment, we'll practice this using our own Google Drive accounts and Google Sheets (which also happen to be a nice way to read/write out data).
We'll more or less follow Google's own tutorial pitched to Python approaches available here: https://developers.google.com/sheets/api/quickstart/python
Login on your browser and click on the "Enable the Google Sheets API" button in the tutorial linked above. Select Download Client Configuration and save the credentials.json file to the same directory as this notebook.
You can name the sheet or not, but it would be helpful to have some dummy data on the sheet to access. For example, consider copying and pasting some of the preview data from the MV data above: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Vehicle-Information-Three-Ye/xe9x-a24f
Run the script below. The firs time you run it, it will ask you to authorize the app, generating a pickle file that you will keep for subsequent access to your Drive.
import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']
# The ID and range of a sample spreadsheet.
SAMPLE_SPREADSHEET_ID = '1j24IhZiEQK2ss6y95cnrygAHJZp4tmtL5PWwWEa6CXo'
SAMPLE_RANGE_NAME = 'Sheet1!A1:AO'
creds = None
if os.path.exists('token.pickle'):
with open('token.pickle', 'rb') as token:
creds = pickle.load(token)
if not creds or not creds.valid:
if creds and creds.expired and creds.refresh_token:
creds.refresh(Request())
else:
flow = InstalledAppFlow.from_client_secrets_file(
'credentials.json', SCOPES)
creds = flow.run_local_server(port=0)
with open('token.pickle', 'wb') as token:
pickle.dump(creds, token)
service = build('sheets', 'v4', credentials=creds)
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
# The result is a list of lists, with each row
# represented as a list of cell values
for row in values:
print(row)
What common checks should we perform?
In the motor vehicle dataframe, for example, what if we suspect that case_vehicle_id is nonunique but should be? How can we check it?
results_df.case_vehicle_id.is_unique
Ideally, we either would not have NULLs (NaNs in Pandas) or else we would have very clear documentation on whether a NULL indicates that an observation was not attempted for that variable, whether it was attempted but unsuccessful, or attempted and the value not within the established parameters (e.g. because zero, or infinitely large, etc.)
# Return a dataframe containing any NULLs (NaNs) in any column
results_df[results_df.isnull().any(axis=1)]
For example, we might take a look at number of occupants per vehicle to establish whether there is unrealistic piling (for example, if a lot of cases involve numbers of occupants that are in increments of 5), a lot of even numbers (suggsting estimation), etc.
We might consider address our NULL issue here by dropping any row containing a NaN before looking at our frequency counts.
num_occs = results_df.dropna(axis=0, how='any',
subset=['number_of_occupants'])
num_occs.number_of_occupants.value_counts()
We can do the same to discover typical numbers of occupants in vehicles:
layer_df = results_df.dropna(axis=0, how='any',
subset=['action_prior_to_accident',
'number_of_occupants']).groupby(['action_prior_to_accident',
'number_of_occupants']).size()
display(layer_df)
Checking for syntax errors in columns that are string entries:
vehicle_body_types = results_df.dropna(axis=0, how='any',
subset=['vehicle_body_type'])
vehicle_body_types.vehicle_body_type.value_counts()