Python for Harvesting Data on the Web¶

Nicholas Wolf and Vicky Steeves, NYU Data Services

Vicky's ORCID: 0000-0003-4298-168X | Nick's ORCID: 0000-0001-5512-6151

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Overview¶

This session is an intermediate-to-advanced level class that offers some ideas for how to approach the following common data wrangling needs in research:

Obtain data and load it into a suitable data "container" for analysis, often via a web interface, especially an API
Parse the data retrieved via an API and turn it into a useful object for manipulation and analysis
Perform some basic data integrity checks of the downloaded data in anticipation of preparing it for analysis

Setup¶

Project Environment

Download the notebook available at https://goo.gl/Pnm7Dx and open it in Jupyter Notebook. Alternatively, you can clone the course materials using

git clone https://github.com/NYU-DataServices/DataHarvesting-Python.git

State of New York Socrata API Account

To work with this session's data, you'll need to create an API account with the state of New York's data service. Visit data.ny.gov and click on signup to create an account.

If not using our JupyterHub cloud environment, you'll need to install

sodapy

This is a wrapper for interfacing with the Socrata data platform's API.

Run

pip install sodapy

or

easy_install sodapy

Google Sheets API

To demonstrate how to work with oauth API interfaces, we'll do some sandbox work using Google Sheets API as a proof of concept. For this, you'll need a Google Drive account such as the one associated with your NYU email address.

If not using our JupyterHub cloud environment, you'll need a set of API libraries provided by Google:

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

1. Web Requests¶

It is not unusual to find open data served directly on the web, without requiring any tokens or authorizations. Although it is not often the practice to provide very large resources via such an entrypoint because of the burden it places on data servers, smaller datasets are frequently available. When it comes to such non-API generated data it is generally easier to work with services that serve the data directly in its native file format, rather than wrapping it in HTML. The latter requires you to parse first the HTML or interpret the HTTP response.

As an example, take the motor vehicle accident report data found here. One thousand records from this 1.65 million-record dataset can be accessed directly at https://data.ny.gov/resource/xe9x-a24f.json (FYI Firefox has a nice JSON viewer built in when you encounter .json hosted files on the web.)

Let's walk through making an HTTP request for that .json data and quickly transforming it into a useful container (a Pandas dataframe) to ready it for use. We'll make use of Python's JSON module, a compact and easy-to-use way of turning JSON into Python's native object types, lists and dictionaries. If you want to anticipate what JSON input will prompt which Python object type output, see this table here: https://docs.python.org/3/library/json.html#json-to-py-table. Note that a JSON array of key-value objects will yield a Python list of dictionaries.

import json
import requests

mv_data_json = requests.get('https://data.ny.gov/resource/xe9x-a24f.json')

mv_list_recs = json.loads(mv_data_json.text)

print(mv_list_recs[0])

{'year': '2016', 'case_vehicle_id': '13364180', 'vehicle_body_type': 'SUBURBAN', 'registration_class': 'PASSENGER OR SUBURBAN', 'action_prior_to_accident': 'Going Straight Ahead', 'type_axles_of_truck_or_bus': 'Not Entered', 'direction_of_travel': 'East', 'fuel_type': 'Gas', 'vehicle_year': '2015', 'state_of_registration': 'NY', 'number_of_occupants': '2', 'engine_cylinders': '4', 'vehicle_make': 'NISSA', 'contributing_factor_1': 'HUMAN', 'contributing_factor_1_description': 'Fell Asleep', 'contributing_factor_2': 'HUMAN', 'contributing_factor_2_description': 'Unknown', 'event_type': 'Not Applicable', 'partial_vin': 'KNMAT2MV1FP544486'}

Pandas

If you haven't yet given a Pandas dataframe a try as a way to manage large arrays of information, give it a go. It is possible to go overboard: not everything needs to be put in a dataframe, especially when a Python list of lists or a dictionary will do. Take a look at how fast we can access subsets of the motor vehicle data.

import pandas as pd

mv_df = pd.DataFrame(mv_list_recs)

mv_df.head(10)

Bonus: wget direct from web

If you have wget installed on your system, you can use the command line utility wget directly in a Notebook cell. This cell could then be run at the start of your notebook to always retrieve the latest version of a dataset.

To install wget, visit http://www.gnu.org/software/wget/

!wget https://data.ny.gov/resource/xe9x-a24f.json

--2020-03-03 15:50:01--  https://data.ny.gov/resource/xe9x-a24f.json
Resolving data.ny.gov (data.ny.gov)... 52.206.68.26, 52.206.140.205, 52.206.140.199
Connecting to data.ny.gov (data.ny.gov)|52.206.68.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘xe9x-a24f.json.1’

xe9x-a24f.json.1        [   <=>              ] 606.07K  1.31MB/s    in 0.5s    

2020-03-03 15:50:03 (1.31 MB/s) - ‘xe9x-a24f.json.1’ saved [620616]

wget_json = json.loads(open('xe9x-a24f.json').read())
wget_json[0]

{'year': '2016',
 'case_vehicle_id': '13364180',
 'vehicle_body_type': 'SUBURBAN',
 'registration_class': 'PASSENGER OR SUBURBAN',
 'action_prior_to_accident': 'Going Straight Ahead',
 'type_axles_of_truck_or_bus': 'Not Entered',
 'direction_of_travel': 'East',
 'fuel_type': 'Gas',
 'vehicle_year': '2015',
 'state_of_registration': 'NY',
 'number_of_occupants': '2',
 'engine_cylinders': '4',
 'vehicle_make': 'NISSA',
 'contributing_factor_1': 'HUMAN',
 'contributing_factor_1_description': 'Fell Asleep',
 'contributing_factor_2': 'HUMAN',
 'contributing_factor_2_description': 'Unknown',
 'event_type': 'Not Applicable',
 'partial_vin': 'KNMAT2MV1FP544486'}

2. Simple Web API Requests¶

For more robust ways of serving up data, i.e. by APIs, we generally need to register an "app," i.e. an application that will be accessing the data, receive at the very least a token (and often a client secret as well) to enable tracked downloads of data, ensure proper access limits, etc.

We can think of these access points as involving one of two levels of authentication: a simple authentication involving signed requests (where a long term token is passed along with the request but no secondary per-request or limited-time token is needed), and a more complex, two or three step authentication process.

Simple Authentication Example Using NYC State Data Socrata¶

It is very helpful if an API comes with a pre-built library to interface with that server so that you don't have to handle signing requests in HTTP, managing tokens, etc.

Fortunately, we have a nice workable pre-made library for working with this same NY State Socrata API portal, sodapy.

Setting Up Your App on Your Socrata Account

Once you have sodapy installed, you'll want to visit the API section of the website, which has its own record of the same dataset we visited above with some additional access information:

https://dev.socrata.com/foundry/data.ny.gov/xe9x-a24f

If you scroll down about halfway through this page, you'll see a large "Sign up for an app token!" button. Click on that to be taken to your API dashboard (alternative, you can login and navigae to https://data.ny.gov/profile/edit/developer_settings)

In the second main section, select "Create New App Token."

Fill in some basic details for your "app" and once generated, copy down the App Token given to you.

We can now proceed to try out the sodapy library.

from sodapy import Socrata

client = Socrata('data.ny.gov','APP-TOKEN',
                username='EMAIL-USERNAME',
                password='PASSWORD')

results = client.get('xe9x-a24f', limit=2000)

results_df = pd.DataFrame.from_records(results)
results_df.head(10)

Paging¶

In our example above, using sodapy, notice that we can set the number of results we obtain for each request. This is helpful, even if we want the full set of data. At 1.6 million records, this dataset is taxing (though not extraordinarly so) on our http connection to grab it in one go.

A common approach used by data providers to solving this problem is paging. In this setup, you are limited in the number of results you can obtain in each request (often set at 200 records) and must make multiple requests and compile the results.

Take, for example, the city of New York's 311 calls data (https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). At over 22.4 million records, it isn't realistic to pull the entire dataset in one go. You must page results by passing a "limit" and "offset" paramters in your request URL. Think of this as the total number of records (limit) and a way to request subsequent subsets of results (first page, second page, third page; or offset 0, 1, 2...). An example is available here: https://dev.socrata.com/docs/paging.html.

3. Complex Web API Requests¶

In other cases our data providers will want you to perform a a two or three-legged authorization -- in other words, you register an access token that you use to sign a request for a second token which you can then pass in the actual data request, PLUS some kind of password authorization that is passed via a web form.

To illustrate how this works on a nice localized environment, we'll practice this using our own Google Drive accounts and Google Sheets (which also happen to be a nice way to read/write out data).

We'll more or less follow Google's own tutorial pitched to Python approaches available here: https://developers.google.com/sheets/api/quickstart/python

1. Enable the GSheets API on your account¶

Login on your browser and click on the "Enable the Google Sheets API" button in the tutorial linked above. Select Download Client Configuration and save the credentials.json file to the same directory as this notebook.

2. Create a Sample Sheet in Google Drive¶

You can name the sheet or not, but it would be helpful to have some dummy data on the sheet to access. For example, consider copying and pasting some of the preview data from the MV data above: https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Vehicle-Information-Three-Ye/xe9x-a24f

Run the script below. The firs time you run it, it will ask you to authorize the app, generating a pickle file that you will keep for subsequent access to your Drive.

import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# If modifying these scopes, delete the file token.pickle.
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']

# The ID and range of a sample spreadsheet.
SAMPLE_SPREADSHEET_ID = '1j24IhZiEQK2ss6y95cnrygAHJZp4tmtL5PWwWEa6CXo'
SAMPLE_RANGE_NAME = 'Sheet1!A1:AO'


creds = None
if os.path.exists('token.pickle'):
    with open('token.pickle', 'rb') as token:
        creds = pickle.load(token)
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        creds = flow.run_local_server(port=0)
    with open('token.pickle', 'wb') as token:
        pickle.dump(creds, token)

service = build('sheets', 'v4', credentials=creds)
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                            range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])

# The result is a list of lists, with each row 
# represented as a list of cell values

for row in values:
    print(row)

['2016', '13364180', 'SUBURBAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'East', 'Gas', '2015', 'NY', '2', '4', 'NISSA', 'HUMAN', 'Fell Asleep', 'HUMAN', 'Unknown', 'Not Applicable', 'KNMAT2MV1FP544486']
['2016', '13364181', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Merging', 'Not Entered', 'South', 'Gas', '2007', 'NY', '1', '4', 'HYUND', 'HUMAN', 'Alcohol Involvement', 'HUMAN', 'Not Applicable', 'Not Entered', '5NPET46C87H227617']
['2016', '13364182', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'South', 'Gas', '2009', 'NY', '1', '4', 'HONDA', 'HUMAN', 'Not Applicable', 'HUMAN', 'Not Applicable', 'Not Entered', '1HGCP26379A133398']
['2016', '13364283', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'North', 'Gas', '2007', 'NY', '1', '4', 'DODGE', 'ENVMT', 'Pavement Slippery', 'HUMAN', 'Not Applicable', 'Not Applicable', '1B3HB28B37D582730']
['2016', '13364291', 'SUBURBAN', 'PASSENGER OR SUBURBAN', 'Changing Lanes', 'Not Entered', 'West', 'Gas', '2012', 'NY', '1', '4', 'MINI', 'HUMAN', 'Unsafe Lane Changing', 'HUMAN', 'Not Applicable', 'Not Applicable', 'WMWZC3C57CWL81108']
['2016', '13364292', '4 DOOR SEDAN', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'West', 'Gas', '2013', 'NY', '3', '5', 'VOLKS', 'HUMAN', 'Not Applicable', 'HUMAN', 'Not Applicable', 'Not Applicable', '3VWDP7AJ0DM381569']
['2016', '13364304', 'PICKUP TRUCK', 'PASSENGER OR SUBURBAN', 'Going Straight Ahead', 'Not Entered', 'East', 'Gas', '2014', 'NY', '2', '6', 'CHEVR', 'ENVMT', 'Pavement Slippery', 'HUMAN', 'Not Applicable', 'Not Applicable', '3GCUKREH8EG196335']

Data Integrity Checks¶

What common checks should we perform?

1. Check that Columns Serving as Unique Identifiers are Truly Unique¶

In the motor vehicle dataframe, for example, what if we suspect that case_vehicle_id is nonunique but should be? How can we check it?

results_df.case_vehicle_id.is_unique

True

2. Check for NULLs¶

Ideally, we either would not have NULLs (NaNs in Pandas) or else we would have very clear documentation on whether a NULL indicates that an observation was not attempted for that variable, whether it was attempted but unsuccessful, or attempted and the value not within the established parameters (e.g. because zero, or infinitely large, etc.)

# Return a dataframe containing any NULLs (NaNs) in any column

results_df[results_df.isnull().any(axis=1)]

3. Look for Outliers, Unrealistic Values, or Syntax Errors¶

For example, we might take a look at number of occupants per vehicle to establish whether there is unrealistic piling (for example, if a lot of cases involve numbers of occupants that are in increments of 5), a lot of even numbers (suggsting estimation), etc.

We might consider address our NULL issue here by dropping any row containing a NaN before looking at our frequency counts.

num_occs = results_df.dropna(axis=0, how='any', 
                             subset=['number_of_occupants'])

num_occs.number_of_occupants.value_counts()

1     1332
2      361
3       98
0       42
4       40
5       11
6        7
9        1
30       1
12       1
7        1
Name: number_of_occupants, dtype: int64

We can do the same to discover typical numbers of occupants in vehicles:

layer_df = results_df.dropna(axis=0, how='any',
        subset=['action_prior_to_accident',
               'number_of_occupants']).groupby(['action_prior_to_accident',
                                              'number_of_occupants']).size()

display(layer_df)

action_prior_to_accident    number_of_occupants
Avoiding Object in Roadway  1                        5
                            2                        4
                            3                        1
                            5                        1
Backing                     1                       35
                            2                        5
                            3                        1
                            30                       1
Changing Lanes              1                       37
                            2                        4
                            3                        5
                            4                        4
Entering Parked Position    1                        4
                            3                        1
Going Straight Ahead        1                      805
                            2                      209
                            3                       51
                            4                       23
                            5                        9
                            6                        3
                            9                        1
Making Left Turn            1                      108
                            2                       36
                            3                        9
                            4                        2
Making Right Turn           1                       40
                            12                       1
                            2                       17
                            3                        3
Making U Turn               1                        3
                                                  ... 
Merging                     3                        1
                            4                        1
                            5                        1
Not Applicable              1                        1
Other                       0                        1
                            1                        3
                            2                        3
Overtaking/Passing          1                        6
                            3                        1
Parked                      0                       41
                            1                        7
                            2                        3
Slowing or Stopping         1                      122
                            2                       35
                            3                       10
                            4                        5
                            6                        1
Starting from Parking       1                        4
                            2                        3
                            3                        2
Starting in Traffic         1                       14
                            2                        3
                            3                        3
                            7                        1
Stopped in Traffic          1                      122
                            2                       37
                            3                       10
                            4                        5
                            6                        3
Unknown                     1                        3
Length: 62, dtype: int64

Checking for syntax errors in columns that are string entries:

vehicle_body_types = results_df.dropna(axis=0, how='any', 
                             subset=['vehicle_body_type'])

vehicle_body_types.vehicle_body_type.value_counts()

4 DOOR SEDAN              940
SUBURBAN                  687
PICKUP TRUCK              153
2 DOOR SEDAN               97
VAN TRUCK                  26
PEDESTRIAN                 19
UNKNOWN VEHICLE            17
TRACTOR                    14
BUS (OMNIBUS)              11
CONVERTIBLE                 7
DUMP                        6
UTILITY                     4
BICYCLIST                   3
FLAT BED TRUCK              2
DELIVERY TRUCK              2
OTHER VEHICLE               1
AMBULANCE                   1
TOW TRUCK                   1
FIRE VEHICLE                1
TRUCK CRANE                 1
REFRIGERATOR TRUCK          1
SEDAN                       1
CEMENT MIXER                1
Not Entered                 1
CONSTRUCTION EQUIPMENT      1
POWER SHOVEL                1
UNKNOWN TRUCK               1
Name: vehicle_body_type, dtype: int64

	action_prior_to_accident	case_vehicle_id	contributing_factor_1	contributing_factor_1_description	contributing_factor_2	contributing_factor_2_description	direction_of_travel	engine_cylinders	event_type	fuel_type	number_of_occupants	partial_vin	registration_class	state_of_registration	type_axles_of_truck_or_bus	vehicle_body_type	vehicle_make	vehicle_year	year
0	Going Straight Ahead	13364180	HUMAN	Fell Asleep	HUMAN	Unknown	East	4	Not Applicable	Gas	2	KNMAT2MV1FP544486	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	NISSA	2015	2016
1	Merging	13364181	HUMAN	Alcohol Involvement	HUMAN	Not Applicable	South	4	Not Entered	Gas	1	5NPET46C87H227617	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	HYUND	2007	2016
2	Going Straight Ahead	13364182	HUMAN	Not Applicable	HUMAN	Not Applicable	South	4	Not Entered	Gas	1	1HGCP26379A133398	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	HONDA	2009	2016
3	Going Straight Ahead	13364283	ENVMT	Pavement Slippery	HUMAN	Not Applicable	North	4	Not Applicable	Gas	1	1B3HB28B37D582730	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	DODGE	2007	2016
4	Changing Lanes	13364291	HUMAN	Unsafe Lane Changing	HUMAN	Not Applicable	West	4	Not Applicable	Gas	1	WMWZC3C57CWL81108	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	MINI	2012	2016
5	Going Straight Ahead	13364292	HUMAN	Not Applicable	HUMAN	Not Applicable	West	5	Not Applicable	Gas	3	3VWDP7AJ0DM381569	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	VOLKS	2013	2016
6	Going Straight Ahead	13364304	ENVMT	Pavement Slippery	HUMAN	Not Applicable	East	6	Not Applicable	Gas	2	3GCUKREH8EG196335	PASSENGER OR SUBURBAN	NY	Not Entered	PICKUP TRUCK	CHEVR	2014	2016
7	Going Straight Ahead	13364305	ENVMT	Pavement Slippery	HUMAN	Not Applicable	North	4	Not Applicable	Gas	1	1HGCM56387A220086	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	HONDA	2007	2016
8	Going Straight Ahead	13364306	ENVMT	Pavement Slippery	HUMAN	Not Applicable	North	6	Snow Embankment, Collision With Fixed Object	Gas	1	3G5DA03E04S597761	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	BUICK	2004	2016
9	Going Straight Ahead	13364307	ENVMT	Pavement Slippery	ENVMT	Obstruction/ Debris	North	4	Unknown	Gas	4	2CNFLCEC1B6362978	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	CHEVR	2011	2016

	action_prior_to_accident	case_vehicle_id	contributing_factor_1	contributing_factor_1_description	contributing_factor_2	contributing_factor_2_description	direction_of_travel	engine_cylinders	event_type	fuel_type	number_of_occupants	partial_vin	registration_class	state_of_registration	type_axles_of_truck_or_bus	vehicle_body_type	vehicle_make	vehicle_year	year
10	Entering Parked Position	13364312	HUMAN	Driver Inattention/Distraction*	HUMAN	Not Applicable	North	4	Not Applicable	Gas	NaN	JHLRD18611C023039	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	HONDA	2001	2016
16	Going Straight Ahead	13364338	HUMAN	Unsafe Speed	ENVMT	Pavement Slippery	East	NaN	Not Applicable	Not Entered	1	1FTWX3BR1AEA59372	Not Entered	NY	Not Entered	PICKUP TRUCK	FORD	2010	2016
19	Going Straight Ahead	13364349	HUMAN	Driver Inattention/Distraction*	HUMAN	Not Applicable	West	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	CT	Not Entered	SUBURBAN	NaN	NaN	2016
26	Going Straight Ahead	13364375	HUMAN	Failure to Yield Right-of-Way	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	2	1C3CCCABXFN655448	Not Entered	SC	Not Entered	4 DOOR SEDAN	CHRY	2015	2016
27	Merging	13364481	HUMAN	Failure to Yield Right-of-Way	HUMAN	Unknown	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NY	Not Entered	4 DOOR SEDAN	FORD	2006	2016
42	Making Left Turn	13364641	HUMAN	Failure to Yield Right-of-Way	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	ON	Not Entered	SUBURBAN	KIA	2015	2016
45	Unknown	13364777	HUMAN	Drugs (Illegal)	HUMAN	Not Applicable	South	NaN	Fence, Collision With Fixed Object	Not Entered	1	1N4AL11D86N340131	Not Entered	NY	Not Entered	4 DOOR SEDAN	NISS	2006	2016
50	Parked	13364869	HUMAN	Not Applicable	HUMAN	Not Applicable	Unknown	6	Other Motor Vehicle, Collision With	Gas	NaN	1G4HP54KXYU122033	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	BUICK	2000	2016
51	Parked	13364870	HUMAN	Not Applicable	HUMAN	Not Applicable	Unknown	NaN	Not Applicable	Not Entered	NaN	NaN	Not Entered	NY	Not Entered	4 DOOR SEDAN	SUBA	2002	2016
53	Going Straight Ahead	13364875	HUMAN	Passing or Lane Usage Improper	ENVMT	Pavement Slippery	Southwest	NaN	Not Applicable	Not Entered	1	1N4AB7AP1EN855535	Not Entered	NV	Not Entered	4 DOOR SEDAN	NISS	2014	2016
55	Going Straight Ahead	13364889	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	CT	Not Entered	SUBURBAN	CHEV	2007	2016
60	Going Straight Ahead	13364935	HUMAN	Unsafe Speed	ENVMT	Pavement Slippery	West	NaN	Not Applicable	Not Entered	2	NaN	Not Entered	OH	Not Entered	4 DOOR SEDAN	NaN	NaN	2016
68	Going Straight Ahead	13365051	HUMAN	Unsafe Speed	ENVMT	Pavement Slippery	West	NaN	Not Applicable	Not Entered	1	3HSDJAPR5FN526902	Not Entered	OK	Not Entered	TRACTOR	INTL	2015	2016
70	Going Straight Ahead	13365071	HUMAN	Unsafe Speed	ENVMT	Pavement Slippery	West	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NC	Not Entered	4 DOOR SEDAN	NaN	NaN	2016
74	Going Straight Ahead	13365083	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	1	4T1BK3EK2AU103974	Not Entered	NJ	Not Entered	4 DOOR SEDAN	TOYT	2010	2016
85	Parked	13365157	HUMAN	Not Applicable	HUMAN	Not Applicable	South	8	Not Applicable	Gas	NaN	1C6RD6FP0CS242502	PASSENGER OR SUBURBAN	NY	Not Entered	PICKUP TRUCK	DODGE	2012	2016
90	Parked	13365167	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	NaN	1GNFK03098R264302	Not Entered	NY	Not Entered	SUBURBAN	CHEV	2008	2016
97	Starting in Traffic	13365178	HUMAN	Driver Inattention/Distraction*	HUMAN	Not Applicable	South	4	NaN	Gas	2	1N4AL21E79N505556	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	NISSA	2009	2016
98	Stopped in Traffic	13365179	HUMAN	Not Applicable	HUMAN	Not Applicable	South	6	NaN	Gas	2	5NPEU46F57H261585	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	HYUND	2007	2016
102	Going Straight Ahead	13365220	ENVMT	Pavement Slippery	HUMAN	Unsafe Speed	Southwest	NaN	Not Applicable	Not Entered	1	19UUA5696A002755	Not Entered	NY	Not Entered	4 DOOR SEDAN	ACUR	2002	2016
105	Going Straight Ahead	13365232	ENVMT	Animal's Action	HUMAN	Not Applicable	East	6	NaN	Gas	2	5UXFE43549L275133	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	BMW	2009	2016
106	Going Straight Ahead	13365245	HUMAN	Failure to Keep Right	HUMAN	Not Applicable	South	4	Not Applicable	Gas	NaN	WDBRF40J54F531217	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	ME/BE	2004	2016
107	Parked	13365246	HUMAN	Not Applicable	HUMAN	Not Applicable	North	4	Not Applicable	Gas	NaN	KMHDU46D77U133540	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	HYUND	2007	2016
115	Going Straight Ahead	13365304	HUMAN	Driver Inattention/Distraction*	HUMAN	Not Applicable	North	NaN	Not Entered	Not Entered	1	1FAHP2E87DG158783	Not Entered	NY	Not Entered	4 DOOR SEDAN	FORD	2013	2016
116	Stopped in Traffic	13365305	HUMAN	Not Applicable	HUMAN	Not Applicable	North	NaN	Not Entered	Not Entered	4	5XXGM4A72FG440668	Not Entered	NY	Not Entered	4 DOOR SEDAN	KIA	2015	2016
123	Going Straight Ahead	13365345	HUMAN	Not Applicable	HUMAN	Not Applicable	South	NaN	Not Applicable	Not Entered	2	NaN	Not Entered	NY	Not Entered	PICKUP TRUCK	FORD	2011	2016
126	Going Straight Ahead	13365355	HUMAN	Alcohol Involvement	ENVMT	Pavement Slippery	East	8	NaN	Gas	2	1LNLM81WXVY691562	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	LINCO	1997	2016
128	Making Left Turn	13365366	HUMAN	Not Applicable	HUMAN	Not Applicable	Southeast	NaN	Not Applicable	Not Entered	2	WMWZC5C51FWP49170	Not Entered	NJ	Not Entered	4 DOOR SEDAN	COOP	2015	2016
138	Going Straight Ahead	13365397	HUMAN	Unsafe Speed	HUMAN	Not Applicable	West	NaN	Not Applicable	Not Entered	2	WBAPH73519E127649	Not Entered	ON	Not Entered	4 DOOR SEDAN	BMW	2009	2016
148	Going Straight Ahead	13365474	HUMAN	Alcohol Involvement	HUMAN	Following Too Closely	West	NaN	Not Applicable	Not Entered	2	WBAPK5G51BNN32089	Not Entered	NJ	Not Entered	4 DOOR SEDAN	BMW	2011	2016
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1839	Changing Lanes	13376063	HUMAN	Failure to Yield Right-of-Way	HUMAN	Not Applicable	Southeast	NaN	Not Applicable	Not Entered	1	WAUSF78K79N018658	Not Entered	NY	Not Entered	4 DOOR SEDAN	AUDI	2009	2016
1847	Going Straight Ahead	13376077	HUMAN	Not Applicable	HUMAN	Not Applicable	North	8	NaN	Gas	2	5N1AA08A74N709236	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	NISSA	2004	2016
1848	Stopped in Traffic	13376078	HUMAN	Not Applicable	HUMAN	Not Applicable	West	6	NaN	Gas	1	5FNRL386X6B401297	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	HONDA	2006	2016
1849	Making Left Turn	13376079	HUMAN	Failure to Yield Right-of-Way	HUMAN	Not Applicable	Northeast	6	NaN	Gas	1	2D4GP44LX5R148453	OMNIBUS - TAXI	NY	Not Entered	SUBURBAN	DODGE	2005	2016
1853	Going Straight Ahead	13376091	HUMAN	Driver Inattention/Distraction*	HUMAN	Failure to Yield Right-of-Way	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NY	Not Entered	4 DOOR SEDAN	NISS	2016	2016
1855	Going Straight Ahead	13376095	ENVMT	Animal's Action	HUMAN	Not Applicable	North	NaN	Not Applicable	Not Entered	2	JTDZN3EU6D3197458	Not Entered	NY	Not Entered	4 DOOR SEDAN	TOYT	2013	2016
1856	Parked	13376102	HUMAN	Not Applicable	HUMAN	Not Applicable	Southwest	6	Not Applicable	Gas	NaN	4T1BF18BXXU313560	PASSENGER OR SUBURBAN	NY	Not Entered	4 DOOR SEDAN	TOYOT	1999	2016
1857	Backing	13376103	HUMAN	Backing Unsafely	HUMAN	Driver Inattention/Distraction*	Northeast	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NY	Not Entered	SUBURBAN	HOND	2000	2016
1863	Going Straight Ahead	13376130	HUMAN	Not Applicable	HUMAN	Not Applicable	South	NaN	Not Applicable	Not Entered	1	1FTNE2EW6ADA46555	Not Entered	NY	Not Entered	VAN TRUCK	FORD	2010	2016
1864	Going Straight Ahead	13376131	ENVMT	Animal's Action	HUMAN	Not Applicable	North	4	NaN	Gas	2	3VWCT21C21M405085	PASSENGER OR SUBURBAN	NY	Not Entered	2 DOOR SEDAN	VOLKS	2001	2016
1866	Stopped in Traffic	13376133	HUMAN	Not Applicable	HUMAN	Not Applicable	Northeast	NaN	Not Applicable	Not Entered	1	5FNYF48489B001864	Not Entered	NY	Not Entered	SUBURBAN	HOND	2009	2016
1871	Going Straight Ahead	13376142	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NY	Not Entered	PICKUP TRUCK	FORD	2003	2016
1873	Slowing or Stopping	13376144	HUMAN	Not Applicable	HUMAN	Not Applicable	South	NaN	Not Applicable	Not Entered	2	NaN	Not Entered	TN	Not Entered	SUBURBAN	NISS	2006	2016
1882	Going Straight Ahead	13376170	HUMAN	Unsafe Speed	HUMAN	Not Applicable	South	NaN	Not Applicable	Not Entered	1	1GNDT13S882144205	Not Entered	NY	Not Entered	4 DOOR SEDAN	CHEV	2008	2016
1887	Stopped in Traffic	13376182	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NY	Not Entered	SUBURBAN	SUBA	2014	2016
1888	Stopped in Traffic	13376183	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Other Motor Vehicle, Collision With	Not Entered	1	NaN	Not Entered	NY	Not Entered	4 DOOR SEDAN	INFI	2015	2016
1889	Stopped in Traffic	13376184	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Other Motor Vehicle, Collision With	Not Entered	1	NaN	Not Entered	NJ	Not Entered	4 DOOR SEDAN	FORD	2011	2016
1890	Going Straight Ahead	13376185	HUMAN	Driver Inattention/Distraction*	HUMAN	Following Too Closely	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NJ	Not Entered	SUBURBAN	JEEP	2013	2016
1904	Making Left Turn	13376206	HUMAN	Unknown	HUMAN	Failure to Yield Right-of-Way	East	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NY	Not Entered	SUBURBAN	CHEV	1994	2016
1906	Going Straight Ahead	13376208	HUMAN	Driver Inattention/Distraction*	HUMAN	Unsafe Speed	South	4	NaN	Gas	1	4A3AC44G21E137484	PASSENGER OR SUBURBAN	NY	Not Entered	2 DOOR SEDAN	MITSU	2001	2016
1923	Stopped in Traffic	13376260	HUMAN	Not Applicable	HUMAN	Not Applicable	East	NaN	Not Applicable	Not Entered	1	2G4WB52K3X1551808	Not Entered	NY	Not Entered	4 DOOR SEDAN	BUIC	1999	2016
1925	Parked	13376262	HUMAN	Not Applicable	HUMAN	Not Applicable	West	4	Not Applicable	Gas	NaN	3CZRU6H35GM717105	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	HONDA	2016	2016
1929	Making Left Turn	13376266	HUMAN	Not Applicable	HUMAN	Not Applicable	South	6	NaN	Diesel	1	3C7WRMBL6FG526168	COMMERCIAL	NY	Not Entered	PICKUP TRUCK	RAM	2015	2016
1930	Going Straight Ahead	13376267	ENVMT	Pavement Slippery	HUMAN	Failure to Keep Right	West	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NC	Not Entered	SUBURBAN	TOYT	2001	2016
1943	Going Straight Ahead	13376300	ENVMT	Pavement Slippery	ENVMT	Pavement Slippery	East	NaN	Not Applicable	Not Entered	2	NaN	Not Entered	NJ	Not Entered	4 DOOR SEDAN	TOYT	2012	2016
1958	Going Straight Ahead	13376326	HUMAN	Driver Inattention/Distraction*	HUMAN	Not Applicable	West	6	NaN	Gas	1	1GKDM19W0XB533780	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	GMC	1999	2016
1975	Making Left Turn	13376364	HUMAN	Traffic Control Device Disregarded	HUMAN	Not Applicable	Southeast	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	PA	Not Entered	4 DOOR SEDAN	CHEV	1995	2016
1983	Going Straight Ahead	13376377	HUMAN	Pedestrian/Bicyclist Error/Confusion	HUMAN	Not Applicable	West	NaN	Not Applicable	Not Entered	1	NaN	Not Entered	NaN	Not Entered	BICYCLIST	NaN	NaN	2016
1984	Going Straight Ahead	13376379	HUMAN	Not Applicable	HUMAN	Not Applicable	South	4	NaN	Gas	1	1FMCU9D74AKA63813	PASSENGER OR SUBURBAN	NY	Not Entered	SUBURBAN	FORD	2010	2016
1989	Going Straight Ahead	13376384	ENVMT	View Obstructed/Limited	HUMAN	Driver Inattention/Distraction*	West	NaN	Not Applicable	Not Entered	1	1GCWGFCA3E1159187	Not Entered	NY	Not Entered	VAN TRUCK	CHEV	2014	2016