
Getting Started with Python Pandas¶

Nicholas Wolf
ORCID 0000-0001-5512-6151

This lesson is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Overview

This class is meant to be an overview of using Python Pandas for those who have never used it before, or who have made some progress but could use further guidance. Some background in working with tabular data is helpful, but not required. We'll focus on the following goals:

  • Understand the building blocks of a Pandas dataframe
  • Know how to make a dataframe and how to load it with data
  • Filter, select, and perform other common operations needed to focus on a subset of a dataframe
  • Update values
  • Perform table joins and merges
  • Export a dataframe to a saved file

Materials

You can always find this notebook (in non-executable form) at https://nyu-dataservices.github.io/startingpandas.

We'll be using this Jupyter Notebook as a basis for the lesson. You can access the code/notebooks for this class in the following ways:

If you do not have Python, Pandas, and Jupyter Notebooks installed:

  • Use the Jupyter Notebook available on our course JupyterHub instance here: https://tutorials-1.rcnyu.org. Navigate to the "shared" directory, then to RDM_StartingPandas, and then open the file "session-notebook-complete.ipynb". You will need to change the kernel to "RDM_Main." You will then be able to run and edit the code, but not save any changes. We will go over how to access this browser-based notebook in class.

If you have Jupyter Notebooks and Pandas installed on your laptop:

  • You can clone this repository and open the "session-notebook-complete.ipynb" file:

git clone https://github.com/NYU-DataServices/startingpandas.git

  • Or you can download the materials by visiting https://github.com/NYU-DataServices/startingpandas. Select the green code/clone button at the top, and select "Download Zip." Once you have the downloaded zip package, unzip it and open the "session-notebook-complete.ipynb" file.

1. Using Pandas...and NOT using Pandas¶

Pandas can be a powerful tool, especially for those using it who have a background in other statistical software and are looking for a means to work with tabular data. But it isn't the only (or in some cases even the best) means of dealing with data munging or data analysis in Python, particularly for large data.

For example, note the respective size of these two Python objects:

In [1]:
import pandas as pd

# Create a 900 x 900 table of integers and store it as a simple Python list of list-rows:

list_lists = [list(range(0,900)) for i in range(0,900)]

# Make a Pandas dataframe out of that same table

df_list_lists = pd.DataFrame(list_lists)

# Note the size difference in memory of these two objects. This is size in bytes

print(list_lists.__sizeof__())
print(df_list_lists.__sizeof__())
7816
6480128

The Python list of lists reports a much smaller size than the dataframe. (Note, though, that __sizeof__ on a list is shallow: it counts only the outer list and its references, not the 900 nested row lists or the integers inside them, so the true gap is smaller than these numbers suggest.)

Unsurprisingly, users experience periodic issues in reading large tables into a Pandas dataframe because of this overhead. A sense of these problems and common workarounds can be found on this Stack Overflow thread.
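
If you want a direct measure of a dataframe's memory footprint, Pandas also offers a .memory_usage() method; a quick sketch using the dataframe built above:

# Bytes used by each column plus the index; deep=True also counts the contents
# of object (e.g. string) columns, and .sum() gives the total
print(df_list_lists.memory_usage(deep=True).sum())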

On the other hand, our Pandas dataframe will start to outperform Python loops for modifying data as our table/matrix gets larger:

In [2]:
# Update the fourth column of our list of lists

def update_list(list_lists):
    new_list_lists = []
    for row in list_lists:
        new_list_lists.append(row[0:3] + [row[3] * 3] + row[4:])
    print(new_list_lists[0][0:5])
    

print(list_lists[0][0:5])

%timeit -n 1 -r 1 update_list(list_lists)
[0, 1, 2, 3, 4]
[0, 1, 2, 9, 4]
7.01 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [3]:
%timeit -n 1 -r 1 df_list_lists[3] = df_list_lists[3].apply(lambda x: x*3)

df_list_lists.head(5)
4.73 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Out[3]:
0 1 2 3 4 5 6 7 8 9 ... 890 891 892 893 894 895 896 897 898 899
0 0 1 2 9 4 5 6 7 8 9 ... 890 891 892 893 894 895 896 897 898 899
1 0 1 2 9 4 5 6 7 8 9 ... 890 891 892 893 894 895 896 897 898 899
2 0 1 2 9 4 5 6 7 8 9 ... 890 891 892 893 894 895 896 897 898 899
3 0 1 2 9 4 5 6 7 8 9 ... 890 891 892 893 894 895 896 897 898 899
4 0 1 2 9 4 5 6 7 8 9 ... 890 891 892 893 894 895 896 897 898 899

5 rows × 900 columns
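
As an aside, .apply with a lambda still calls a Python function once per element; the fully vectorized form below (not timed in the original comparison, but typically faster still) performs the same column update in one operation:

# Multiply the whole column at once with a vectorized operation, no per-element Python call
df_list_lists[3] = df_list_lists[3] * 3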

Don't forget: a Pandas dataframe is a special kind of two-dimensional array, and arrays excel at performing matrix-based transformations¶

In other words, if you simply need a container to "hold" your data, a lot of times a simple core Python structure is great. But if you need to do full-table transformations, quick statistics, advanced statistics, and table relational joins, then Pandas is a great option.

It is essential if you want to do the operations above AND your data mixes non-uniform data types.

Unlike arrays in numpy, the other commonly used matrix library, Pandas dataframes accommodate tables/matrices whose columns mix integers, strings, and other data types. (Pandas is in fact built on top of numpy.)
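
For instance, here is a small sketch of a dataframe mixing data types across columns (the values are made up for illustration, and pd is the pandas import from above):

mixed_df = pd.DataFrame({
    "student_name": ["student1", "student2"],   # strings
    "score": [88, 92],                          # integers
    "passed": [True, True],                     # booleans
})

print(mixed_df.dtypes)   # each column keeps its own data type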

2. Building a Dataframe: Series¶

To understand how a dataframe works in Pandas (or any other environment), we can think about the multiple ways we might assemble a two-dimensional table like this one:

[Image: A two-dimensional table illustrating how data might be organized]

We might conceive of this table as four rows, or observations, each an ordered set of elements whose positions align with the columns that tell us which variable each value belongs to.

But we might also think of a table as a set of uniform-length vertical columns, each representing the measurement of a single variable across the same observations, assembled by stacking them side by side:

[Image: A table is also built out of uniform-length columns stacked side by side]

In Python terms, we might think of rows and columns in a table as having some "dictionary-like" qualities, and some "list-like" qualities. For example, the elements of a row can be conceived of as values, each paired with a key corresponding to one of our column headers (or variables):

[Image: A table row as the equivalent of a Python key-value dictionary]

We can also think of the rows and columns as having an index order, like a Python list, so that we might slice and retrieve a column or row (or a value using both column and row) using its index:

[Image: A table as having index-ordered rows and columns]

And we can think of each column as being like a Python list: ordered, so that we can access individual values by index, and stackable next to other columns to form a table:

[Image: A table as several uniform-length lists placed next to each other]

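Before turning to the Pandas objects themselves, here is a quick plain-Python sketch of those two views (the values are illustrative):

# Row-oriented view: each row is a dictionary keyed by the column headers
rows = [
    {"student_name": "student1", "year": 1990},
    {"student_name": "student2", "year": 1991},
]

# Column-oriented view: each column is a uniform-length list
columns = {
    "student_name": ["student1", "student2"],
    "year": [1990, 1991],
}
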
The Pandas Series object¶

Recognizing these hybrid dictionary- and list-like qualities of the components of a two-dimensional array, the building block of a Pandas dataframe is the Series object.

In [4]:
# We might create a Series from a list:

list_series = pd.Series(["student1", "student2", "student3", "student4"])

list_series
Out[4]:
0    student1
1    student2
2    student3
3    student4
dtype: object

Note that our resulting Series has an index, and looks like a 4 x 1 (4 rows x 1 column) array. Let's add a name so that we understand what this column/vector of values refers to:

In [5]:
named_list_series = pd.Series(["student1", "student2", "student3", "student4"], name="student_name")

named_list_series
Out[5]:
0    student1
1    student2
2    student3
3    student4
Name: student_name, dtype: object

Our Series can be sliced by index location, much like a list:

In [6]:
named_list_series[0]
Out[6]:
'student1'
In [7]:
named_list_series[0:2]
Out[7]:
0    student1
1    student2
Name: student_name, dtype: object

Great! But in and of itself, a Series object isn't that helpful; putting several together gives us a dataframe. We can do this by instantiating a DataFrame object and passing it a dictionary of Series, i.e. one or more Series objects, each identified by a key that will serve as its column header:

In [8]:
year_series = pd.Series([1990, 1991, 1992, 1993])

pop_series = pd.Series([1.5, 1.6, 1.8, 2.0])

population_table = pd.DataFrame({"year":year_series, "pop":pop_series})

population_table
Out[8]:
year pop
0 1990 1.5
1 1991 1.6
2 1992 1.8
3 1993 2.0

Note that Pandas automatically builds for us a row index, highlighted in bold, on the lefthand side. If we had failed to provide column names, it would have used index numbers to label them.

That's all we need to know about the Pandas Series object to get started. Mostly, this is helpful so that we know that when we operate on a single column sliced from a dataframe, we are operating on a Series object.
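
A quick check of this, as a sketch using the population_table built above:

type(population_table["pop"])     # pandas.core.series.Series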

3. Loading a DataFrame¶

We have several options for how to make a dataframe and start working in Pandas:

  1. We can load a tabular data file and allow Pandas to parse it as a dataframe

  2. We can instantiate an empty dataframe and append rows or columns in the form of Series objects

  3. We can transform an existing Python structure (such as a list of lists or a list of dictionaries) into a dataframe

No matter which approach is taken, I recommend taking some time to set the relevant parameters (of pd.DataFrame or, when reading from a file, of pd.read_csv) so that your later work on the dataframe behaves as expected. This includes setting column names, column order, data types of variables, and (when reading from file) encoding.

Here are examples of all three:

Load from CSV/Excel/TSV, etc.¶

In [9]:
# Loading from a character-delimited file using explicit settings for delimiter, 
# header, data types, and using column names to order the resulting dataframe

df_from_csv = pd.read_csv("water-consumption-nyc-csv-example.csv", delimiter = ",",
                          header = 0, names = ["Year","NYC_Pop", "Consumption","PerCapita_Consumption"],
                          dtype = {"Year":int, "NYC_Pop":int, "Consumption":int, "PerCapita_Consumption":int})

# We can use .head() and .tail() to preview just a portion of a dataframe.
# Pass as an optional parameter the number of rows you wish to see

df_from_csv.head(5)
Out[9]:
Year NYC_Pop Consumption PerCapita_Consumption
0 1979 7102100 1512 213
1 1980 7071639 1506 213
2 1981 7089241 1309 185
3 1982 7109105 1382 194
4 1983 7181224 1424 198
In [10]:
# Rearranging column order

df_from_csv = df_from_csv[["Year","NYC_Pop", "PerCapita_Consumption", "Consumption"]]

df_from_csv.head(5)
Out[10]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1512
1 1980 7071639 213 1506
2 1981 7089241 185 1309
3 1982 7109105 194 1382
4 1983 7181224 198 1424

Populating an empty dataframe¶

In [11]:
# We create an empty dataframe, then add one or more rows, each as a dictionary with keys aligned with the columns of empty_df

empty_df = pd.DataFrame(columns = ["Year","NYC_Pop", "PerCapita_Consumption", "Consumption"])

newrows = pd.DataFrame([
            {"Year": 1984,
            "NYC_Pop": 8102100,
            "PerCapita_Consumption": 188,
            "Consumption": 1302
            },
            {"Year": 1985,
            "NYC_Pop": 8902100,
            "PerCapita_Consumption": 176,
            "Consumption": 1203
            }
        ], columns = ["Year","NYC_Pop", "PerCapita_Consumption", "Consumption"])

empty_df = pd.concat([empty_df,newrows], ignore_index=True)
    
empty_df.head(5)
Out[11]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1984 8102100 188 1302
1 1985 8902100 176 1203
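
Transforming an existing Python structure¶

The third approach is a direct conversion: pass a list of dictionaries (or a list of lists plus column names) to pd.DataFrame. A minimal sketch, using made-up rows:

rows_as_dicts = [
    {"Year": 1984, "NYC_Pop": 8102100, "PerCapita_Consumption": 188, "Consumption": 1302},
    {"Year": 1985, "NYC_Pop": 8902100, "PerCapita_Consumption": 176, "Consumption": 1203},
]

df_from_dicts = pd.DataFrame(rows_as_dicts)

# A list of lists works too, as long as we supply the column names ourselves
rows_as_lists = [[1984, 8102100, 188, 1302], [1985, 8902100, 176, 1203]]

df_from_lists = pd.DataFrame(rows_as_lists,
                             columns=["Year", "NYC_Pop", "PerCapita_Consumption", "Consumption"])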

4. Selecting/filtering rows and columns from a dataframe¶

One of the most common operations we need to do with dataframes is filter or select a subset of observations, or perhaps a set of columns, to use for analysis. Let's start by looking at how we can grab just one or more columns from a dataframe.

Subsetting one or more columns¶

In [12]:
df_from_csv["Year"]
Out[12]:
0     1979
1     1980
2     1981
3     1982
4     1983
5     1984
6     1985
7     1986
8     1987
9     1988
10    1989
11    1990
12    1991
13    1992
14    1993
15    1994
16    1995
17    1996
18    1997
19    1998
20    1999
21    2000
22    2001
23    2002
24    2003
25    2004
26    2005
27    2006
28    2007
29    2008
30    2009
31    2010
32    2011
33    2012
34    2013
35    2014
36    2015
37    2016
Name: Year, dtype: int64
In [13]:
# We can grab one or more columns either by column index location, or by column name

sub_df_from_csv = df_from_csv["Year"]

sub_df_from_csv.head(5)
Out[13]:
0    1979
1    1980
2    1981
3    1982
4    1983
Name: Year, dtype: int64
In [14]:
# Note that a single column subsetted from a dataframe is a Series

type(sub_df_from_csv)
Out[14]:
pandas.core.series.Series
In [15]:
# Extracting multiple columns. Note that we pass multiple column names as a list

sub2_df_from_csv = df_from_csv[["Year", "NYC_Pop"]]

sub2_df_from_csv.head(5)
Out[15]:
Year NYC_Pop
0 1979 7102100
1 1980 7071639
2 1981 7089241
3 1982 7109105
4 1983 7181224
In [16]:
# Pulling multiple columns gives us a dataframe!

type(sub2_df_from_csv)
Out[16]:
pandas.core.frame.DataFrame

Extracting rows using slice notation¶

In [17]:
sub3_df_from_csv = df_from_csv[2:4]

sub3_df_from_csv
Out[17]:
Year NYC_Pop PerCapita_Consumption Consumption
2 1981 7089241 185 1309
3 1982 7109105 194 1382

Challenge¶

What snippet of code would you use to slice out the 12th, 13th, and 14th rows and the columns "NYC_Pop" and "PerCapita_Consumption"?


In [18]:
# Add Answer

Filtering using index location: .iloc[]¶

The row and column extraction operations are only useful up to a point -- usually you want to select a subset of your dataframe using some kind of complex criteria. We have two main means of doing this: by index location of columns and rows, and by label name and/or some kind of Boolean test of the value in any given "cell." A third means combines the two. We start by looking at filtering/selecting by index location using .iloc[].

The concept here will be familiar to those comfortable with list slicing, or with how a matrix can be extracted in numpy. The pattern is:

dataframe.iloc[row_index_start : row_index_stop, column_index_start : column_index_stop]

In [19]:
# First two rows, third and fourth column only

df_from_csv.iloc[0:2, 2:4]
Out[19]:
PerCapita_Consumption Consumption
0 213 1512
1 213 1506
In [20]:
# All rows, second column only

df_from_csv.iloc[:, 1]
Out[20]:
0     7102100
1     7071639
2     7089241
3     7109105
4     7181224
5     7234514
6     7274054
7     7319246
8     7342476
9     7353719
10    7344175
11    7335650
12    7374501
13    7428944
14    7506166
15    7570458
16    7633040
17    7697812
18    7773443
19    7858259
20    7947660
21    8008278
22    8024964
23    8041649
24    8058335
25    8075020
26    8091706
27    8108391
28    8125077
29    8141762
30    8158448
31    8175133
32    8175133
33    8336697
34    8405837
35    8491079
36    8550405
37    8537673
Name: NYC_Pop, dtype: int64
In [21]:
# Third row, all columns

df_from_csv.iloc[2:3, :]
Out[21]:
Year NYC_Pop PerCapita_Consumption Consumption
2 1981 7089241 185 1309

Filtering using labels: .loc[]¶

We can also select using the same pattern (colon-separated spans, with a comma between the row and column selections), but using a label or Boolean filter to subset.

The pattern is:

dataframe.loc[row_label(s), column_label(s)]

or, for a Boolean:

dataframe.loc[ True/False test for row values in a column, column_label(s)]

In [22]:
# Selecting all rows, but only the Year and NYC_Pop columns:

df_from_csv.loc[:, "Year":"NYC_Pop"].head(5)
Out[22]:
Year NYC_Pop
0 1979 7102100
1 1980 7071639
2 1981 7089241
3 1982 7109105
4 1983 7181224
In [23]:
# We also have an option to pass two or more column labels, even if non-adjacent, to select columns we want:

df_from_csv.loc[:, ["Year", "Consumption"]].head(5)
Out[23]:
Year Consumption
0 1979 1512
1 1980 1506
2 1981 1309
3 1982 1382
4 1983 1424
In [24]:
# Understand that the first parameter, the "row" label, refers to whatever serves as the dataframe's index.
# In all of our examples so far, that has been the Pandas-built numerical index. Thus:

df_from_csv.loc[0, ["Year", "Consumption"]]
Out[24]:
Year           1979
Consumption    1512
Name: 0, dtype: int64
In [25]:
# However, keep in mind that we can set one of our columns to be an index, 
# enabling us to use the column values as label

df_newindex = df_from_csv.copy().set_index("Year")

df_newindex.head(5)
Out[25]:
NYC_Pop PerCapita_Consumption Consumption
Year
1979 7102100 213 1512
1980 7071639 213 1506
1981 7089241 185 1309
1982 7109105 194 1382
1983 7181224 198 1424
In [26]:
df_newindex.loc[1981, :].head(5)
Out[26]:
NYC_Pop                  7089241
PerCapita_Consumption        185
Consumption                 1309
Name: 1981, dtype: int64

Important! Boolean Tests in Filtering¶

Lastly, we have our Boolean means of filtering using .loc[]. Note that within the brackets we access a column and perform some kind of True/False Boolean test. We can do this using the dataframe.columnName syntax (as long as our column names do not contain spaces; if they do, you must use dataframe["columnName"] instead).

In [27]:
df_from_csv.loc[df_from_csv.Year < 1983]
Out[27]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1512
1 1980 7071639 213 1506
2 1981 7089241 185 1309
3 1982 7109105 194 1382
In [28]:
# Alternatively, same thing:

df_from_csv.loc[df_from_csv["Year"] < 1983]
Out[28]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1512
1 1980 7071639 213 1506
2 1981 7089241 185 1309
3 1982 7109105 194 1382

We can chain multiple Boolean tests together to get some pretty sophisticated filtering:

In [29]:
# Note that we use | (pipe) and & for bitwise "OR" and "AND", respectively. We MUST wrap each Boolean
# test in parentheses, because these operators bind more tightly than comparisons under Python's operator
# precedence and associativity rules (similar to BODMAS rules in mathematics). For more information:
# https://www.geeksforgeeks.org/precedence-and-associativity-of-operators-in-python/#

df_from_csv.loc[(df_from_csv.Year < 1983) | (df_from_csv.Year == 1989)]
Out[29]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1512
1 1980 7071639 213 1506
2 1981 7089241 185 1309
3 1982 7109105 194 1382
10 1989 7344175 191 1402
In [30]:
# Using more than one AND/OR and added parentheses to clarify

df_from_csv.loc[((df_from_csv.Year < 1983) | (df_from_csv.Year == 1989)) & (df_from_csv.Consumption > 1320)]
Out[30]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1512
1 1980 7071639 213 1506
3 1982 7109105 194 1382
10 1989 7344175 191 1402

As an aside, slightly hidden from us is the logic Pandas uses behind the scenes to make all of this happen.

In reality, each Boolean test inside the parentheses returns a Series of True/False values, one per row, based on the condition you have indicated. It is this Boolean mask that is then used to filter the original dataframe.

We can see this by evaluating the Boolean statement on its own:

In [31]:
df_from_csv.Year < 1983
Out[31]:
0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
Name: Year, dtype: bool

Quick note: deprecated versions¶

If you come across examples on Stack Overflow using .ix[] to filter/select data, disregard them: the .ix[] indexer is deprecated and has been removed from recent versions of Pandas. Use .loc[] or .iloc[] instead.
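
For example, where an old answer writes something like df.ix[0, "Year"], the modern equivalents are (a quick sketch using our dataframe):

df_from_csv.loc[0, "Year"]     # label-based lookup
df_from_csv.iloc[0, 0]         # purely positional lookup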

5. Modifying values in a dataframe¶

Once you are comfortable selecting subsections of a dataframe (entire rows, columns, sub-tables, or individual values), updating those values can take a few forms. We'll concentrate on two.

Direct setting of new value on an existing dataframe¶

Let's say we want to update the value in the Consumption column for the first row.

In [32]:
df_from_csv.iloc[0:1, :]
Out[32]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1512
In [33]:
df_from_csv.loc[0, "Consumption"] = 1548

df_from_csv.iloc[0:1, :]
Out[33]:
Year NYC_Pop PerCapita_Consumption Consumption
0 1979 7102100 213 1548
In [34]:
# We can also compute values quickly across all rows and store the results in a new column

df_from_csv["Consumption_per_10000"] = df_from_csv.Consumption / 10000

df_from_csv.head(5)
Out[34]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000
0 1979 7102100 213 1548 0.1548
1 1980 7071639 213 1506 0.1506
2 1981 7089241 185 1309 0.1309
3 1982 7109105 194 1382 0.1382
4 1983 7181224 198 1424 0.1424

A very helpful approach: dataframe.apply()¶

A great feature of Pandas is the .apply() method, which lets you take a transformative function, either one you write yourself or one from Python's core library, and propagate it across any portion of the dataframe you wish to change.

We can start with an example using a Python core function:

In [35]:
x = 1.6
round(x)
Out[35]:
2
In [36]:
df_from_csv["Consumption_per_10000"] = df_from_csv["Consumption_per_10000"].apply(round)

df_from_csv.head(5)
Out[36]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000
0 1979 7102100 213 1548 0
1 1980 7071639 213 1506 0
2 1981 7089241 185 1309 0
3 1982 7109105 194 1382 0
4 1983 7181224 198 1424 0
In [37]:
# Alternative using a function you write yourself.

def check_if_high(in_val):
    if in_val > 200:
        return "Y"
    else:
        return "N"
    
df_from_csv["Consumption_high"] = df_from_csv["PerCapita_Consumption"].apply(check_if_high)    

df_from_csv.head(5)
Out[37]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high
0 1979 7102100 213 1548 0 Y
1 1980 7071639 213 1506 0 Y
2 1981 7089241 185 1309 0 N
3 1982 7109105 194 1382 0 N
4 1983 7181224 198 1424 0 N
In [38]:
# And a final example, using a quick lambda function

df_from_csv["Total_sum"] = df_from_csv["Consumption"].apply(lambda x: x+1000)

df_from_csv.head(5)
Out[38]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high Total_sum
0 1979 7102100 213 1548 0 Y 2548
1 1980 7071639 213 1506 0 Y 2506
2 1981 7089241 185 1309 0 N 2309
3 1982 7109105 194 1382 0 N 2382
4 1983 7181224 198 1424 0 N 2424

6. Table joins and table concatenation¶

Helpfully, Pandas offers some SQL-like means of joining (i.e. in a SQL sense: matching records on common keys, yielding a wider table) and performing the equivalent of a SQL union (i.e. concatenating one table of the same structure and columns to another, yielding a longer table).

These are known as a "merge" and a "concat" in Pandas.

In [39]:
# Constructing a second table with a plan to use "Year" as our common key

nyc_pop_density = pd.DataFrame([[1979,8.9],[1980,8.1],[1981,7.4],[1982,7.2],
                                [1983,6.5],[1984,6.5],[1985,6.0],[1986,5.8],
                                [1987,5.5],[1988,5.4]], columns=["pop_year","density_per_5000"])

nyc_water_consumption = df_from_csv.loc[df_from_csv.Year < 1989]

nyc_pop_density
Out[39]:
pop_year density_per_5000
0 1979 8.9
1 1980 8.1
2 1981 7.4
3 1982 7.2
4 1983 6.5
5 1984 6.5
6 1985 6.0
7 1986 5.8
8 1987 5.5
9 1988 5.4
In [40]:
nyc_water_consumption
Out[40]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high Total_sum
0 1979 7102100 213 1548 0 Y 2548
1 1980 7071639 213 1506 0 Y 2506
2 1981 7089241 185 1309 0 N 2309
3 1982 7109105 194 1382 0 N 2382
4 1983 7181224 198 1424 0 N 2424
5 1984 7234514 203 1465 0 Y 2465
6 1985 7274054 182 1326 0 N 2326
7 1986 7319246 185 1351 0 N 2351
8 1987 7342476 197 1447 0 N 2447
9 1988 7353719 202 1484 0 Y 2484
In [41]:
# Performing a join ("merge") on the Year column;
# we'll do a left join, meaning we keep all records in the water consumption table (the left-hand table)
# and only matching records from the pop density table (the right-hand table). As it happens, the match is 1-to-1

merged_df = pd.merge(nyc_water_consumption, nyc_pop_density, how="left", left_on="Year", right_on="pop_year")

merged_df
Out[41]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high Total_sum pop_year density_per_5000
0 1979 7102100 213 1548 0 Y 2548 1979 8.9
1 1980 7071639 213 1506 0 Y 2506 1980 8.1
2 1981 7089241 185 1309 0 N 2309 1981 7.4
3 1982 7109105 194 1382 0 N 2382 1982 7.2
4 1983 7181224 198 1424 0 N 2424 1983 6.5
5 1984 7234514 203 1465 0 Y 2465 1984 6.5
6 1985 7274054 182 1326 0 N 2326 1985 6.0
7 1986 7319246 185 1351 0 N 2351 1986 5.8
8 1987 7342476 197 1447 0 N 2447 1987 5.5
9 1988 7353719 202 1484 0 Y 2484 1988 5.4
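
Since the join key now appears twice (Year and pop_year), it is common to drop the redundant column afterwards; a quick sketch:

merged_df = merged_df.drop(columns=["pop_year"])
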
In [42]:
# Let's say we have a second dataframe with the same type of data, using the same columns.
# We perform a concatenation in which the second dataframe's rows are appended to the original

additional_df = pd.DataFrame([[1989,7367819, 189, 1456, 0, "Y", 2456],[1990,7406718, 192, 1345, 0, "Y", 2345]],
                         columns=["Year","NYC_Pop","PerCapita_Consumption",
                                  "Consumption","Consumption_per_10000","Consumption_high","Total_sum"])

additional_df
Out[42]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high Total_sum
0 1989 7367819 189 1456 0 Y 2456
1 1990 7406718 192 1345 0 Y 2345
In [43]:
# By default, the resulting dataframe would preserve the row indices of the respective input dataframes.
# To build a fresh index instead, we pass an additional parameter: ignore_index=True

concat_df = pd.concat([nyc_water_consumption, additional_df], ignore_index=True)

concat_df
Out[43]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high Total_sum
0 1979 7102100 213 1548 0 Y 2548
1 1980 7071639 213 1506 0 Y 2506
2 1981 7089241 185 1309 0 N 2309
3 1982 7109105 194 1382 0 N 2382
4 1983 7181224 198 1424 0 N 2424
5 1984 7234514 203 1465 0 Y 2465
6 1985 7274054 182 1326 0 N 2326
7 1986 7319246 185 1351 0 N 2351
8 1987 7342476 197 1447 0 N 2447
9 1988 7353719 202 1484 0 Y 2484
10 1989 7367819 189 1456 0 Y 2456
11 1990 7406718 192 1345 0 Y 2345

7. Summary statistics and data quick views¶

There are a few key methods that we can call on a Pandas dataframe to help us get a sense of its shape and contents quickly:

In [44]:
# Check its size. This dataframe has 38 rows and 7 columns

df_from_csv.shape
Out[44]:
(38, 7)
In [45]:
# Check for nulls; this returns any rows that contain a null in any column

df_from_csv[df_from_csv.isnull().any(axis=1)]
Out[45]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Consumption_high Total_sum
In [46]:
# Check whether the values in a given column are all unique

df_from_csv.Year.is_unique
Out[46]:
True
In [47]:
# Summary statistics

df_from_csv.describe()
Out[47]:
Year NYC_Pop PerCapita_Consumption Consumption Consumption_per_10000 Total_sum
count 38.000000 3.800000e+01 38.000000 38.000000 38.0 38.000000
mean 1997.500000 7.764553e+06 161.552632 1241.631579 0.0 2241.631579
std 11.113055 4.610189e+05 31.404680 173.980493 0.0 173.980493
min 1979.000000 7.071639e+06 117.000000 996.000000 0.0 1996.000000
25% 1988.250000 7.342901e+06 135.250000 1095.000000 0.0 2095.000000
50% 1997.500000 7.815851e+06 155.500000 1238.500000 0.0 2238.500000
75% 2006.750000 8.120906e+06 189.500000 1378.750000 0.0 2378.750000
max 2016.000000 8.550405e+06 213.000000 1548.000000 0.0 2548.000000
In [48]:
# Single column value counts, max, min.

print(nyc_water_consumption.Consumption_high.value_counts())

print(nyc_water_consumption.NYC_Pop.max())

print(nyc_water_consumption.Consumption.min())
N    6
Y    4
Name: Consumption_high, dtype: int64
7353719
1309
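
One of the goals listed in the overview was exporting a dataframe to a saved file. The usual route is the .to_csv() method (a minimal sketch; the output filename here is arbitrary):

df_from_csv.to_csv("water-consumption-nyc-output.csv", index=False)

# index=False omits the numerical row index from the file;
# related writers such as .to_excel() and .to_json() cover other formats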

8. More help¶

  1. I love this plain explanation of filtering/selecting and refer back to it often: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

  2. Official Pandas documentation, recently renewed: https://pandas.pydata.org/pandas-docs/stable/index.html

  3. Pandas's own 10-minute quick start: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#min

  4. Software Carpentry's Pandas lesson (as part of general Python instruction): https://swcarpentry.github.io/python-novice-gapminder/08-data-frames/

  5. NYU Data Services Quantitative guide for merging/joining datasets: https://guides.nyu.edu/quant/merge

Challenge Answer¶

In [49]:
df_from_csv[11:14][["NYC_Pop","PerCapita_Consumption"]]
Out[49]:
NYC_Pop PerCapita_Consumption
11 7335650 194
12 7374501 199
13 7428944 184