Python for Harvesting Data on the Web

Nicholas Wolf and Vicky Steeves, NYU Data Services

Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Overview

This session is an intermediate-to-advanced level class that offers some ideas for how to approach the following common data wrangling needs in research:

  • Obtain data and load it into a suitable data "container" for analysis, often via a web interface, especially an API
  • Parse the data retrieved via an API and turn it into a useful object for manipulation and analysis
  • Perform some basic data integrity checks of the downloaded data in anticipation of preparing it for analysis

Setup

Project Environment

Download the notebook available at https://goo.gl/Pnm7Dx and open it in Jupyter Notebook. Alternatively, you can clone the course materials using

git clone https://github.com/NYU-DataServices/DataHarvesting-Python.git

State of New York Socrata API Account

To work with this session's data, you'll need to create an API account with the state of New York's data service. Visit data.ny.gov and click on signup to create an account.

If not using our JupyterHub cloud environment, you'll need to install

sodapy
This is a wrapper for interfacing with the Socrata data platform's API.

Run

pip install sodapy

or

easy_install sodapy

Google Sheets API

To demonstrate how to work with oauth API interfaces, we'll do some sandbox work using Google Sheets API as a proof of concept. For this, you'll need a Google Drive account such as the one associated with your NYU email address.

If not using our JupyterHub cloud environment, you'll need a set of API libraries provided by Google:

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

1. Web Requests

It is not unusual to find open data served directly on the web, without requiring any tokens or authorizations. Although it is not often the practice to provide very large resources via such an entrypoint because of the burden it places on data servers, smaller datasets are frequently available. When it comes to such non-API generated data it is generally easier to work with services that serve the data directly in its native file format, rather than wrapping it in HTML. The latter requires you to parse first the HTML or interpret the HTTP response.

As an example, take the motor vehicle accident report data found here. One thousand records from this 1.65 million-record dataset can be accessed directly at https://data.ny.gov/resource/xe9x-a24f.json (FYI Firefox has a nice JSON viewer built in when you encounter .json hosted files on the web.)

Let's walk through making an HTTP request for that .json data and quickly transforming it into a useful container (a Pandas dataframe) to ready it for use. We'll make use of Python's JSON module, a compact and easy-to-use way of turning JSON into Python's native object types, lists and dictionaries. If you want to anticipate what JSON input will prompt which Python object type output, see this table here: https://docs.python.org/3/library/json.html#json-to-py-table. Note that a JSON array of key-value objects will yield a Python list of dictionaries.

In [21]:
import json
import requests

mv_data_json = requests.get('https://data.ny.gov/resource/xe9x-a24f.json')

mv_list_recs = json.loads(mv_data_json.text)

print(mv_list_recs[0])
{'year': '2016', 'case_vehicle_id': '13364180', 'vehicle_body_type': 'SUBURBAN', 'registration_class': 'PASSENGER OR SUBURBAN', 'action_prior_to_accident': 'Going Straight Ahead', 'type_axles_of_truck_or_bus': 'Not Entered', 'direction_of_travel': 'East', 'fuel_type': 'Gas', 'vehicle_year': '2015', 'state_of_registration': 'NY', 'number_of_occupants': '2', 'engine_cylinders': '4', 'vehicle_make': 'NISSA', 'contributing_factor_1': 'HUMAN', 'contributing_factor_1_description': 'Fell Asleep', 'contributing_factor_2': 'HUMAN', 'contributing_factor_2_description': 'Unknown', 'event_type': 'Not Applicable', 'partial_vin': 'KNMAT2MV1FP544486'}

Pandas

If you haven't yet given a Pandas dataframe a try as a way to manage large arrays of information, give it a go. It is possible to go overboard: not everything needs to be put in a dataframe, especially when a Python list of lists or a dictionary will do. Take a look at how fast we can access subsets of the motor vehicle data.

In [22]:
import pandas as pd

mv_df = pd.DataFrame(mv_list_recs)

mv_df.head(10)
Out[22]:
action_prior_to_accident case_vehicle_id contributing_factor_1 contributing_factor_1_description contributing_factor_2 contributing_factor_2_description direction_of_travel engine_cylinders event_type fuel_type number_of_occupants partial_vin registration_class state_of_registration type_axles_of_truck_or_bus vehicle_body_type vehicle_make vehicle_year year
0 Going Straight Ahead 13364180 HUMAN Fell Asleep HUMAN Unknown East 4 Not Applicable Gas 2 KNMAT2MV1FP544486 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN NISSA 2015 2016
1 Merging 13364181 HUMAN Alcohol Involvement HUMAN Not Applicable South 4 Not Entered Gas 1 5NPET46C87H227617 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HYUND 2007 2016
2 Going Straight Ahead 13364182 HUMAN Not Applicable HUMAN Not Applicable South 4 Not Entered Gas 1 1HGCP26379A133398 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HONDA 2009 2016
3 Going Straight Ahead 13364283 ENVMT Pavement Slippery HUMAN Not Applicable North 4 Not Applicable Gas 1 1B3HB28B37D582730 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN DODGE 2007 2016
4 Changing Lanes 13364291 HUMAN Unsafe Lane Changing HUMAN Not Applicable West 4 Not Applicable Gas 1 WMWZC3C57CWL81108 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN MINI 2012 2016
5 Going Straight Ahead 13364292 HUMAN Not Applicable HUMAN Not Applicable West 5 Not Applicable Gas 3 3VWDP7AJ0DM381569 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN VOLKS 2013 2016
6 Going Straight Ahead 13364304 ENVMT Pavement Slippery HUMAN Not Applicable East 6 Not Applicable Gas 2 3GCUKREH8EG196335 PASSENGER OR SUBURBAN NY Not Entered PICKUP TRUCK CHEVR 2014 2016
7 Going Straight Ahead 13364305 ENVMT Pavement Slippery HUMAN Not Applicable North 4 Not Applicable Gas 1 1HGCM56387A220086 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HONDA 2007 2016
8 Going Straight Ahead 13364306 ENVMT Pavement Slippery HUMAN Not Applicable North 6 Snow Embankment, Collision With Fixed Object Gas 1 3G5DA03E04S597761 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN BUICK 2004 2016
9 Going Straight Ahead 13364307 ENVMT Pavement Slippery ENVMT Obstruction/ Debris North 4 Unknown Gas 4 2CNFLCEC1B6362978 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN CHEVR 2011 2016

Bonus: wget direct from web

If you have wget installed on your system, you can use the command line utility wget directly in a Notebook cell. This cell could then be run at the start of your notebook to always retrieve the latest version of a dataset.

To install wget, visit http://www.gnu.org/software/wget/

In [7]:
!wget https://data.ny.gov/resource/xe9x-a24f.json
--2020-03-03 15:50:01--  https://data.ny.gov/resource/xe9x-a24f.json
Resolving data.ny.gov (data.ny.gov)... 52.206.68.26, 52.206.140.205, 52.206.140.199
Connecting to data.ny.gov (data.ny.gov)|52.206.68.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘xe9x-a24f.json.1’

xe9x-a24f.json.1        [   <=>              ] 606.07K  1.31MB/s    in 0.5s    

2020-03-03 15:50:03 (1.31 MB/s) - ‘xe9x-a24f.json.1’ saved [620616]

In [8]:
wget_json = json.loads(open('xe9x-a24f.json').read())
wget_json[0]
Out[8]:
{'year': '2016',
 'case_vehicle_id': '13364180',
 'vehicle_body_type': 'SUBURBAN',
 'registration_class': 'PASSENGER OR SUBURBAN',
 'action_prior_to_accident': 'Going Straight Ahead',
 'type_axles_of_truck_or_bus': 'Not Entered',
 'direction_of_travel': 'East',
 'fuel_type': 'Gas',
 'vehicle_year': '2015',
 'state_of_registration': 'NY',
 'number_of_occupants': '2',
 'engine_cylinders': '4',
 'vehicle_make': 'NISSA',
 'contributing_factor_1': 'HUMAN',
 'contributing_factor_1_description': 'Fell Asleep',
 'contributing_factor_2': 'HUMAN',
 'contributing_factor_2_description': 'Unknown',
 'event_type': 'Not Applicable',
 'partial_vin': 'KNMAT2MV1FP544486'}

2. Simple Web API Requests

For more robust ways of serving up data, i.e. by APIs, we generally need to register an "app," i.e. an application that will be accessing the data, receive at the very least a token (and often a client secret as well) to enable tracked downloads of data, ensure proper access limits, etc.

We can think of these access points as involving one of two levels of authentication: a simple authentication involving signed requests (where a long term token is passed along with the request but no secondary per-request or limited-time token is needed), and a more complex, two or three step authentication process.

Simple Authentication Example Using NYC State Data Socrata

It is very helpful if an API comes with a pre-built library to interface with that server so that you don't have to handle signing requests in HTTP, managing tokens, etc.

Fortunately, we have a nice workable pre-made library for working with this same NY State Socrata API portal, sodapy.

Setting Up Your App on Your Socrata Account

Once you have sodapy installed, you'll want to visit the API section of the website, which has its own record of the same dataset we visited above with some additional access information:

https://dev.socrata.com/foundry/data.ny.gov/xe9x-a24f

If you scroll down about halfway through this page, you'll see a large "Sign up for an app token!" button. Click on that to be taken to your API dashboard (alternative, you can login and navigae to https://data.ny.gov/profile/edit/developer_settings)

In the second main section, select "Create New App Token."

Fill in some basic details for your "app" and once generated, copy down the App Token given to you.

We can now proceed to try out the sodapy library.

In [9]:
from sodapy import Socrata

client = Socrata('data.ny.gov','APP-TOKEN',
                username='EMAIL-USERNAME',
                password='PASSWORD')

results = client.get('xe9x-a24f', limit=2000)
In [11]:
results_df = pd.DataFrame.from_records(results)
results_df.head(10)
Out[11]:
action_prior_to_accident case_vehicle_id contributing_factor_1 contributing_factor_1_description contributing_factor_2 contributing_factor_2_description direction_of_travel engine_cylinders event_type fuel_type number_of_occupants partial_vin registration_class state_of_registration type_axles_of_truck_or_bus vehicle_body_type vehicle_make vehicle_year year
0 Going Straight Ahead 13364180 HUMAN Fell Asleep HUMAN Unknown East 4 Not Applicable Gas 2 KNMAT2MV1FP544486 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN NISSA 2015 2016
1 Merging 13364181 HUMAN Alcohol Involvement HUMAN Not Applicable South 4 Not Entered Gas 1 5NPET46C87H227617 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HYUND 2007 2016
2 Going Straight Ahead 13364182 HUMAN Not Applicable HUMAN Not Applicable South 4 Not Entered Gas 1 1HGCP26379A133398 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HONDA 2009 2016
3 Going Straight Ahead 13364283 ENVMT Pavement Slippery HUMAN Not Applicable North 4 Not Applicable Gas 1 1B3HB28B37D582730 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN DODGE 2007 2016
4 Changing Lanes 13364291 HUMAN Unsafe Lane Changing HUMAN Not Applicable West 4 Not Applicable Gas 1 WMWZC3C57CWL81108 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN MINI 2012 2016
5 Going Straight Ahead 13364292 HUMAN Not Applicable HUMAN Not Applicable West 5 Not Applicable Gas 3 3VWDP7AJ0DM381569 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN VOLKS 2013 2016
6 Going Straight Ahead 13364304 ENVMT Pavement Slippery HUMAN Not Applicable East 6 Not Applicable Gas 2 3GCUKREH8EG196335 PASSENGER OR SUBURBAN NY Not Entered PICKUP TRUCK CHEVR 2014 2016
7 Going Straight Ahead 13364305 ENVMT Pavement Slippery HUMAN Not Applicable North 4 Not Applicable Gas 1 1HGCM56387A220086 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HONDA 2007 2016
8 Going Straight Ahead 13364306 ENVMT Pavement Slippery HUMAN Not Applicable North 6 Snow Embankment, Collision With Fixed Object Gas 1 3G5DA03E04S597761 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN BUICK 2004 2016
9 Going Straight Ahead 13364307 ENVMT Pavement Slippery ENVMT Obstruction/ Debris North 4 Unknown Gas 4 2CNFLCEC1B6362978 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN CHEVR 2011 2016

Paging

In our example above, using sodapy, notice that we can set the number of results we obtain for each request. This is helpful, even if we want the full set of data. At 1.6 million records, this dataset is taxing (though not extraordinarly so) on our http connection to grab it in one go.

A common approach used by data providers to solving this problem is paging. In this setup, you are limited in the number of results you can obtain in each request (often set at 200 records) and must make multiple requests and compile the results.

Take, for example, the city of New York's 311 calls data (https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). At over 22.4 million records, it isn't realistic to pull the entire dataset in one go. You must page results by passing a "limit" and "offset" paramters in your request URL. Think of this as the total number of records (limit) and a way to request subsequent subsets of results (first page, second page, third page; or offset 0, 1, 2...). An example is available here: https://dev.socrata.com/docs/paging.html.

3. Complex Web API Requests

In other cases our data providers will want you to perform a a two or three-legged authorization -- in other words, you register an access token that you use to sign a request for a second token which you can then pass in the actual data request, PLUS some kind of password authorization that is passed via a web form.

To illustrate how this works on a nice localized environment, we'll practice this using our own Google Drive accounts and Google Sheets (which also happen to be a nice way to read/write out data).

We'll more or less follow Google's own tutorial pitched to Python approaches available here: https://developers.google.com/sheets/api/quickstart/python

1. Enable the GSheets API on your account

Login on your browser and click on the "Enable the Google Sheets API" button in the tutorial linked above. Select Download Client Configuration and save the credentials.json file to the same directory as this notebook.

2. Create a Sample Sheet in Google Drive

You can name the sheet or not, but it would be helpful to have some dummy data on the sheet to access. For example, consider copying and pasting some of the preview data from the MV data above: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Vehicle-Information-Three-Ye/xe9x-a24f

3. Authenticate/login

Run the script below. The firs time you run it, it will ask you to authorize the app, generating a pickle file that you will keep for subsequent access to your Drive.

In [14]:
import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']

# The ID and range of a sample spreadsheet.
SAMPLE_SPREADSHEET_ID = '1j24IhZiEQK2ss6y95cnrygAHJZp4tmtL5PWwWEa6CXo'
SAMPLE_RANGE_NAME = 'Sheet1!A1:AO'


creds = None
if os.path.exists('token.pickle'):
    with open('token.pickle', 'rb') as token:
        creds = pickle.load(token)
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        creds = flow.run_local_server(port=0)
    with open('token.pickle', 'wb') as token:
        pickle.dump(creds, token)

service = build('sheets', 'v4', credentials=creds)
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                            range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
In [15]:
# The result is a list of lists, with each row 
# represented as a list of cell values

for row in values:
    print(row)
['2016', '13364180', 'SUBURBAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'East', 'Gas', '2015', 'NY', '2', '4', 'NISSA', 'HUMAN', 'Fell Asleep', 'HUMAN', 'Unknown', 'Not Applicable', 'KNMAT2MV1FP544486']
['2016', '13364181', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Merging', 'Not Entered', 'South', 'Gas', '2007', 'NY', '1', '4', 'HYUND', 'HUMAN', 'Alcohol Involvement', 'HUMAN', 'Not Applicable', 'Not Entered', '5NPET46C87H227617']
['2016', '13364182', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'South', 'Gas', '2009', 'NY', '1', '4', 'HONDA', 'HUMAN', 'Not Applicable', 'HUMAN', 'Not Applicable', 'Not Entered', '1HGCP26379A133398']
['2016', '13364283', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'North', 'Gas', '2007', 'NY', '1', '4', 'DODGE', 'ENVMT', 'Pavement Slippery', 'HUMAN', 'Not Applicable', 'Not Applicable', '1B3HB28B37D582730']
['2016', '13364291', 'SUBURBAN', 'PASSENGER OR SUBURBAN', 'Changing Lanes', 'Not Entered', 'West', 'Gas', '2012', 'NY', '1', '4', 'MINI', 'HUMAN', 'Unsafe Lane Changing', 'HUMAN', 'Not Applicable', 'Not Applicable', 'WMWZC3C57CWL81108']
['2016', '13364292', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'West', 'Gas', '2013', 'NY', '3', '5', 'VOLKS', 'HUMAN', 'Not Applicable', 'HUMAN', 'Not Applicable', 'Not Applicable', '3VWDP7AJ0DM381569']
['2016', '13364304', 'PICKUP TRUCK', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'East', 'Gas', '2014', 'NY', '2', '6', 'CHEVR', 'ENVMT', 'Pavement Slippery', 'HUMAN', 'Not Applicable', 'Not Applicable', '3GCUKREH8EG196335']

Data Integrity Checks

What common checks should we perform?

1. Check that Columns Serving as Unique Identifiers are Truly Unique

In the motor vehicle dataframe, for example, what if we suspect that case_vehicle_id is nonunique but should be? How can we check it?

In [27]:
results_df.case_vehicle_id.is_unique
Out[27]:
True

2. Check for NULLs

Ideally, we either would not have NULLs (NaNs in Pandas) or else we would have very clear documentation on whether a NULL indicates that an observation was not attempted for that variable, whether it was attempted but unsuccessful, or attempted and the value not within the established parameters (e.g. because zero, or infinitely large, etc.)

In [26]:
# Return a dataframe containing any NULLs (NaNs) in any column

results_df[results_df.isnull().any(axis=1)]
Out[26]:
action_prior_to_accident case_vehicle_id contributing_factor_1 contributing_factor_1_description contributing_factor_2 contributing_factor_2_description direction_of_travel engine_cylinders event_type fuel_type number_of_occupants partial_vin registration_class state_of_registration type_axles_of_truck_or_bus vehicle_body_type vehicle_make vehicle_year year
10 Entering Parked Position 13364312 HUMAN Driver Inattention/Distraction* HUMAN Not Applicable North 4 Not Applicable Gas NaN JHLRD18611C023039 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN HONDA 2001 2016
16 Going Straight Ahead 13364338 HUMAN Unsafe Speed ENVMT Pavement Slippery East NaN Not Applicable Not Entered 1 1FTWX3BR1AEA59372 Not Entered NY Not Entered PICKUP TRUCK FORD 2010 2016
19 Going Straight Ahead 13364349 HUMAN Driver Inattention/Distraction* HUMAN Not Applicable West NaN Not Applicable Not Entered 1 NaN Not Entered CT Not Entered SUBURBAN NaN NaN 2016
26 Going Straight Ahead 13364375 HUMAN Failure to Yield Right-of-Way HUMAN Not Applicable East NaN Not Applicable Not Entered 2 1C3CCCABXFN655448 Not Entered SC Not Entered 4 DOOR SEDAN CHRY 2015 2016
27 Merging 13364481 HUMAN Failure to Yield Right-of-Way HUMAN Unknown East NaN Not Applicable Not Entered 1 NaN Not Entered NY Not Entered 4 DOOR SEDAN FORD 2006 2016
42 Making Left Turn 13364641 HUMAN Failure to Yield Right-of-Way HUMAN Not Applicable East NaN Not Applicable Not Entered 1 NaN Not Entered ON Not Entered SUBURBAN KIA 2015 2016
45 Unknown 13364777 HUMAN Drugs (Illegal) HUMAN Not Applicable South NaN Fence, Collision With Fixed Object Not Entered 1 1N4AL11D86N340131 Not Entered NY Not Entered 4 DOOR SEDAN NISS 2006 2016
50 Parked 13364869 HUMAN Not Applicable HUMAN Not Applicable Unknown 6 Other Motor Vehicle, Collision With Gas NaN 1G4HP54KXYU122033 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN BUICK 2000 2016
51 Parked 13364870 HUMAN Not Applicable HUMAN Not Applicable Unknown NaN Not Applicable Not Entered NaN NaN Not Entered NY Not Entered 4 DOOR SEDAN SUBA 2002 2016
53 Going Straight Ahead 13364875 HUMAN Passing or Lane Usage Improper ENVMT Pavement Slippery Southwest NaN Not Applicable Not Entered 1 1N4AB7AP1EN855535 Not Entered NV Not Entered 4 DOOR SEDAN NISS 2014 2016
55 Going Straight Ahead 13364889 HUMAN Not Applicable HUMAN Not Applicable East NaN Not Applicable Not Entered 1 NaN Not Entered CT Not Entered SUBURBAN CHEV 2007 2016
60 Going Straight Ahead 13364935 HUMAN Unsafe Speed ENVMT Pavement Slippery West NaN Not Applicable Not Entered 2 NaN Not Entered OH Not Entered 4 DOOR SEDAN NaN NaN 2016
68 Going Straight Ahead 13365051 HUMAN Unsafe Speed ENVMT Pavement Slippery West NaN Not Applicable Not Entered 1 3HSDJAPR5FN526902 Not Entered OK Not Entered TRACTOR INTL 2015 2016
70 Going Straight Ahead 13365071 HUMAN Unsafe Speed ENVMT Pavement Slippery West NaN Not Applicable Not Entered 1 NaN Not Entered NC Not Entered 4 DOOR SEDAN NaN NaN 2016
74 Going Straight Ahead 13365083 HUMAN Not Applicable HUMAN Not Applicable East NaN Not Applicable Not Entered 1 4T1BK3EK2AU103974 Not Entered NJ Not Entered 4 DOOR SEDAN TOYT 2010 2016
85 Parked 13365157 HUMAN Not Applicable HUMAN Not Applicable South 8 Not Applicable Gas NaN 1C6RD6FP0CS242502 PASSENGER OR SUBURBAN NY Not Entered PICKUP TRUCK DODGE 2012 2016
90 Parked 13365167 HUMAN Not Applicable HUMAN Not Applicable East NaN Not Applicable Not Entered NaN 1GNFK03098R264302 Not Entered NY Not Entered SUBURBAN CHEV 2008 2016
97 Starting in Traffic 13365178 HUMAN Driver Inattention/Distraction* HUMAN Not Applicable South 4 NaN Gas 2 1N4AL21E79N505556 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN NISSA 2009 2016
98 Stopped in Traffic 13365179 HUMAN Not Applicable HUMAN Not Applicable South 6 NaN Gas 2 5NPEU46F57H261585 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HYUND 2007 2016
102 Going Straight Ahead 13365220 ENVMT Pavement Slippery HUMAN Unsafe Speed Southwest NaN Not Applicable Not Entered 1 19UUA5696A002755 Not Entered NY Not Entered 4 DOOR SEDAN ACUR 2002 2016
105 Going Straight Ahead 13365232 ENVMT Animal's Action HUMAN Not Applicable East 6 NaN Gas 2 5UXFE43549L275133 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN BMW 2009 2016
106 Going Straight Ahead 13365245 HUMAN Failure to Keep Right HUMAN Not Applicable South 4 Not Applicable Gas NaN WDBRF40J54F531217 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN ME/BE 2004 2016
107 Parked 13365246 HUMAN Not Applicable HUMAN Not Applicable North 4 Not Applicable Gas NaN KMHDU46D77U133540 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN HYUND 2007 2016
115 Going Straight Ahead 13365304 HUMAN Driver Inattention/Distraction* HUMAN Not Applicable North NaN Not Entered Not Entered 1 1FAHP2E87DG158783 Not Entered NY Not Entered 4 DOOR SEDAN FORD 2013 2016
116 Stopped in Traffic 13365305 HUMAN Not Applicable HUMAN Not Applicable North NaN Not Entered Not Entered 4 5XXGM4A72FG440668 Not Entered NY Not Entered 4 DOOR SEDAN KIA 2015 2016
123 Going Straight Ahead 13365345 HUMAN Not Applicable HUMAN Not Applicable South NaN Not Applicable Not Entered 2 NaN Not Entered NY Not Entered PICKUP TRUCK FORD 2011 2016
126 Going Straight Ahead 13365355 HUMAN Alcohol Involvement ENVMT Pavement Slippery East 8 NaN Gas 2 1LNLM81WXVY691562 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN LINCO 1997 2016
128 Making Left Turn 13365366 HUMAN Not Applicable HUMAN Not Applicable Southeast NaN Not Applicable Not Entered 2 WMWZC5C51FWP49170 Not Entered NJ Not Entered 4 DOOR SEDAN COOP 2015 2016
138 Going Straight Ahead 13365397 HUMAN Unsafe Speed HUMAN Not Applicable West NaN Not Applicable Not Entered 2 WBAPH73519E127649 Not Entered ON Not Entered 4 DOOR SEDAN BMW 2009 2016
148 Going Straight Ahead 13365474 HUMAN Alcohol Involvement HUMAN Following Too Closely West NaN Not Applicable Not Entered 2 WBAPK5G51BNN32089 Not Entered NJ Not Entered 4 DOOR SEDAN BMW 2011 2016
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1839 Changing Lanes 13376063 HUMAN Failure to Yield Right-of-Way HUMAN Not Applicable Southeast NaN Not Applicable Not Entered 1 WAUSF78K79N018658 Not Entered NY Not Entered 4 DOOR SEDAN AUDI 2009 2016
1847 Going Straight Ahead 13376077 HUMAN Not Applicable HUMAN Not Applicable North 8 NaN Gas 2 5N1AA08A74N709236 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN NISSA 2004 2016
1848 Stopped in Traffic 13376078 HUMAN Not Applicable HUMAN Not Applicable West 6 NaN Gas 1 5FNRL386X6B401297 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN HONDA 2006 2016
1849 Making Left Turn 13376079 HUMAN Failure to Yield Right-of-Way HUMAN Not Applicable Northeast 6 NaN Gas 1 2D4GP44LX5R148453 OMNIBUS - TAXI NY Not Entered SUBURBAN DODGE 2005 2016
1853 Going Straight Ahead 13376091 HUMAN Driver Inattention/Distraction* HUMAN Failure to Yield Right-of-Way East NaN Not Applicable Not Entered 1 NaN Not Entered NY Not Entered 4 DOOR SEDAN NISS 2016 2016
1855 Going Straight Ahead 13376095 ENVMT Animal's Action HUMAN Not Applicable North NaN Not Applicable Not Entered 2 JTDZN3EU6D3197458 Not Entered NY Not Entered 4 DOOR SEDAN TOYT 2013 2016
1856 Parked 13376102 HUMAN Not Applicable HUMAN Not Applicable Southwest 6 Not Applicable Gas NaN 4T1BF18BXXU313560 PASSENGER OR SUBURBAN NY Not Entered 4 DOOR SEDAN TOYOT 1999 2016
1857 Backing 13376103 HUMAN Backing Unsafely HUMAN Driver Inattention/Distraction* Northeast NaN Not Applicable Not Entered 1 NaN Not Entered NY Not Entered SUBURBAN HOND 2000 2016
1863 Going Straight Ahead 13376130 HUMAN Not Applicable HUMAN Not Applicable South NaN Not Applicable Not Entered 1 1FTNE2EW6ADA46555 Not Entered NY Not Entered VAN TRUCK FORD 2010 2016
1864 Going Straight Ahead 13376131 ENVMT Animal's Action HUMAN Not Applicable North 4 NaN Gas 2 3VWCT21C21M405085 PASSENGER OR SUBURBAN NY Not Entered 2 DOOR SEDAN VOLKS 2001 2016
1866 Stopped in Traffic 13376133 HUMAN Not Applicable HUMAN Not Applicable Northeast NaN Not Applicable Not Entered 1 5FNYF48489B001864 Not Entered NY Not Entered SUBURBAN HOND 2009 2016
1871 Going Straight Ahead 13376142 HUMAN Not Applicable HUMAN Not Applicable East NaN Not Applicable Not Entered 1 NaN Not Entered NY Not Entered PICKUP TRUCK FORD 2003 2016
1873 Slowing or Stopping 13376144 HUMAN Not Applicable HUMAN Not Applicable South NaN Not Applicable Not Entered 2 NaN Not Entered TN Not Entered SUBURBAN NISS 2006 2016
1882 Going Straight Ahead 13376170 HUMAN Unsafe Speed HUMAN Not Applicable South NaN Not Applicable Not Entered 1 1GNDT13S882144205 Not Entered NY Not Entered 4 DOOR SEDAN CHEV 2008 2016
1887 Stopped in Traffic 13376182 HUMAN Not Applicable HUMAN Not Applicable East NaN Not Applicable Not Entered 1 NaN Not Entered NY Not Entered SUBURBAN SUBA 2014 2016
1888 Stopped in Traffic 13376183 HUMAN Not Applicable HUMAN Not Applicable East NaN Other Motor Vehicle, Collision With Not Entered 1 NaN Not Entered NY Not Entered 4 DOOR SEDAN INFI 2015 2016
1889 Stopped in Traffic 13376184 HUMAN Not Applicable HUMAN Not Applicable East NaN Other Motor Vehicle, Collision With Not Entered 1 NaN Not Entered NJ Not Entered 4 DOOR SEDAN FORD 2011 2016
1890 Going Straight Ahead 13376185 HUMAN Driver Inattention/Distraction* HUMAN Following Too Closely East NaN Not Applicable Not Entered 1 NaN Not Entered NJ Not Entered SUBURBAN JEEP 2013 2016
1904 Making Left Turn 13376206 HUMAN Unknown HUMAN Failure to Yield Right-of-Way East NaN Not Applicable Not Entered 1 NaN Not Entered NY Not Entered SUBURBAN CHEV 1994 2016
1906 Going Straight Ahead 13376208 HUMAN Driver Inattention/Distraction* HUMAN Unsafe Speed South 4 NaN Gas 1 4A3AC44G21E137484 PASSENGER OR SUBURBAN NY Not Entered 2 DOOR SEDAN MITSU 2001 2016
1923 Stopped in Traffic 13376260 HUMAN Not Applicable HUMAN Not Applicable East NaN Not Applicable Not Entered 1 2G4WB52K3X1551808 Not Entered NY Not Entered 4 DOOR SEDAN BUIC 1999 2016
1925 Parked 13376262 HUMAN Not Applicable HUMAN Not Applicable West 4 Not Applicable Gas NaN 3CZRU6H35GM717105 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN HONDA 2016 2016
1929 Making Left Turn 13376266 HUMAN Not Applicable HUMAN Not Applicable South 6 NaN Diesel 1 3C7WRMBL6FG526168 COMMERCIAL NY Not Entered PICKUP TRUCK RAM 2015 2016
1930 Going Straight Ahead 13376267 ENVMT Pavement Slippery HUMAN Failure to Keep Right West NaN Not Applicable Not Entered 1 NaN Not Entered NC Not Entered SUBURBAN TOYT 2001 2016
1943 Going Straight Ahead 13376300 ENVMT Pavement Slippery ENVMT Pavement Slippery East NaN Not Applicable Not Entered 2 NaN Not Entered NJ Not Entered 4 DOOR SEDAN TOYT 2012 2016
1958 Going Straight Ahead 13376326 HUMAN Driver Inattention/Distraction* HUMAN Not Applicable West 6 NaN Gas 1 1GKDM19W0XB533780 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN GMC 1999 2016
1975 Making Left Turn 13376364 HUMAN Traffic Control Device Disregarded HUMAN Not Applicable Southeast NaN Not Applicable Not Entered 1 NaN Not Entered PA Not Entered 4 DOOR SEDAN CHEV 1995 2016
1983 Going Straight Ahead 13376377 HUMAN Pedestrian/Bicyclist Error/Confusion HUMAN Not Applicable West NaN Not Applicable Not Entered 1 NaN Not Entered NaN Not Entered BICYCLIST NaN NaN 2016
1984 Going Straight Ahead 13376379 HUMAN Not Applicable HUMAN Not Applicable South 4 NaN Gas 1 1FMCU9D74AKA63813 PASSENGER OR SUBURBAN NY Not Entered SUBURBAN FORD 2010 2016
1989 Going Straight Ahead 13376384 ENVMT View Obstructed/Limited HUMAN Driver Inattention/Distraction* West NaN Not Applicable Not Entered 1 1GCWGFCA3E1159187 Not Entered NY Not Entered VAN TRUCK CHEV 2014 2016

415 rows × 19 columns

3. Look for Outliers, Unrealistic Values, or Syntax Errors

For example, we might take a look at number of occupants per vehicle to establish whether there is unrealistic piling (for example, if a lot of cases involve numbers of occupants that are in increments of 5), a lot of even numbers (suggsting estimation), etc.

We might consider address our NULL issue here by dropping any row containing a NaN before looking at our frequency counts.

In [28]:
num_occs = results_df.dropna(axis=0, how='any', 
                             subset=['number_of_occupants'])

num_occs.number_of_occupants.value_counts()
Out[28]:
1     1332
2      361
3       98
0       42
4       40
5       11
6        7
9        1
30       1
12       1
7        1
Name: number_of_occupants, dtype: int64

We can do the same to discover typical numbers of occupants in vehicles:

In [30]:
layer_df = results_df.dropna(axis=0, how='any',
        subset=['action_prior_to_accident',
               'number_of_occupants']).groupby(['action_prior_to_accident',
                                              'number_of_occupants']).size()

display(layer_df)
action_prior_to_accident    number_of_occupants
Avoiding Object in Roadway  1                        5
                            2                        4
                            3                        1
                            5                        1
Backing                     1                       35
                            2                        5
                            3                        1
                            30                       1
Changing Lanes              1                       37
                            2                        4
                            3                        5
                            4                        4
Entering Parked Position    1                        4
                            3                        1
Going Straight Ahead        1                      805
                            2                      209
                            3                       51
                            4                       23
                            5                        9
                            6                        3
                            9                        1
Making Left Turn            1                      108
                            2                       36
                            3                        9
                            4                        2
Making Right Turn           1                       40
                            12                       1
                            2                       17
                            3                        3
Making U Turn               1                        3
                                                  ... 
Merging                     3                        1
                            4                        1
                            5                        1
Not Applicable              1                        1
Other                       0                        1
                            1                        3
                            2                        3
Overtaking/Passing          1                        6
                            3                        1
Parked                      0                       41
                            1                        7
                            2                        3
Slowing or Stopping         1                      122
                            2                       35
                            3                       10
                            4                        5
                            6                        1
Starting from Parking       1                        4
                            2                        3
                            3                        2
Starting in Traffic         1                       14
                            2                        3
                            3                        3
                            7                        1
Stopped in Traffic          1                      122
                            2                       37
                            3                       10
                            4                        5
                            6                        3
Unknown                     1                        3
Length: 62, dtype: int64

Checking for syntax errors in columns that are string entries:

In [31]:
vehicle_body_types = results_df.dropna(axis=0, how='any', 
                             subset=['vehicle_body_type'])

vehicle_body_types.vehicle_body_type.value_counts()
Out[31]:
4 DOOR SEDAN              940
SUBURBAN                  687
PICKUP TRUCK              153
2 DOOR SEDAN               97
VAN TRUCK                  26
PEDESTRIAN                 19
UNKNOWN VEHICLE            17
TRACTOR                    14
BUS (OMNIBUS)              11
CONVERTIBLE                 7
DUMP                        6
UTILITY                     4
BICYCLIST                   3
FLAT BED TRUCK              2
DELIVERY TRUCK              2
OTHER VEHICLE               1
AMBULANCE                   1
TOW TRUCK                   1
FIRE VEHICLE                1
TRUCK CRANE                 1
REFRIGERATOR TRUCK          1
SEDAN                       1
CEMENT MIXER                1
Not Entered                 1
CONSTRUCTION EQUIPMENT      1
POWER SHOVEL                1
UNKNOWN TRUCK               1
Name: vehicle_body_type, dtype: int64