Instructor Notes
General Notes
- It’s all right not to get through the whole lesson.
- This lesson is designed for people who have never programmed before, but any given class may include people with a wide range of prior experience. We have therefore included enough material to fill a full day if need be, but expect that many offerings will only get as far as the introduction to Pandas.
- Don’t tell people to Google things.
- One of the goals of this lesson is to help novices build a workable mental model of how programming works. Until they have that model, they will not know what to search for or how to recognize a helpful answer. Telling them to Google can also give the impression that we think their problem is trivial. (That said, if learners have done enough programming before to be past these issues, having them search for solutions online can help them solidify their understanding.) It’s also worth quoting Trevor King’s comment about online search: “If you find anything, other folks were confused enough to bother with a blog or Stack Overflow post, so it’s probably not trivial.”
Running and Quitting
Variables and Assignment
Data Types and Type Conversion
Built-in Functions and Help
Morning Coffee
Libraries
Reading Tabular Data into DataFrames
Pandas DataFrames
Instructor Note
Learners often struggle here, many may not work with financial data and concepts so they find the example concepts difficult to get their head around. The biggest problem though is the line generating the wealth_score, this step needs to be talked through throughly:
It uses implicit conversion between boolean and float values which has not been covered in the course so far.
-
The axis=1 argument needs to be explained clearly.
:::::::::::::::::::::::::::::::::::::::::::::::::
Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyse their data.
For instance, let’s say we want to have a clearer view on how the European countries split themselves according to their GDP.
- We may have a glance by splitting the countries in two groups during the years surveyed, those who presented a GDP higher than the European average and those with a lower GDP.
- We then estimate a wealthy score based on the historical (from 1962 to 2007) values, where we account how many times a country has participated in the groups of lower or higher GDP
PYTHON
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
print(wealth_score)
OUTPUT
country
Albania 0.000000
Austria 1.000000
Belgium 1.000000
Bosnia and Herzegovina 0.000000
Bulgaria 0.000000
Croatia 0.000000
Czech Republic 0.500000
Denmark 1.000000
Finland 1.000000
France 1.000000
Germany 1.000000
Greece 0.333333
Hungary 0.000000
Iceland 1.000000
Ireland 0.333333
Italy 0.500000
Montenegro 0.000000
Netherlands 1.000000
Norway 1.000000
Poland 0.000000
Portugal 0.000000
Romania 0.000000
Serbia 0.000000
Slovak Republic 0.000000
Slovenia 0.333333
Spain 0.333333
Sweden 1.000000
Switzerland 1.000000
Turkey 0.000000
United Kingdom 1.000000
dtype: float64
Finally, for each group in the wealth_score table, we
sum their (financial) contribution across the years surveyed using
chained methods:
OUTPUT
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \
0.000000 36916.854200 46110.918793 56850.065437 71324.848786
0.333333 16790.046878 20942.456800 25744.935321 33567.667670
0.500000 11807.544405 14505.000150 18380.449470 21421.846200
1.000000 104317.277560 127332.008735 149989.154201 178000.350040
gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \
0.000000 88569.346898 104459.358438 113553.768507 119649.599409
0.333333 45277.839976 53860.456750 59679.634020 64436.912960
0.500000 25377.727380 29056.145370 31914.712050 35517.678220
1.000000 215162.343140 241143.412730 263388.781960 296825.131210
gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
0.000000 92380.047256 103772.937598 118590.929863 149577.357928
0.333333 67918.093220 80876.051580 102086.795210 122803.729520
0.500000 36310.666080 40723.538700 45564.308390 51403.028210
1.000000 315238.235970 346930.926170 385109.939210 427850.333420
Selection of Individual Values
No, they do not produce the same output! The output of the first statement is:
OUTPUT
gdpPercap_1952 gdpPercap_1957
country
Albania 1601.056136 1942.284244
Austria 6137.076492 8842.598030
The second statement gives:
OUTPUT
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
country
Albania 1601.056136 1942.284244 2312.888958
Austria 6137.076492 8842.598030 10750.721110
Belgium 8343.105127 9714.960623 10991.206760
Clearly, the second statement produces an additional column and an
additional row compared to the first statement.
What conclusion can we draw? We see that a numerical slice, 0:2,
omits the final index (i.e. index 2) in the range provided,
while a named slice, ‘gdpPercap_1952’:‘gdpPercap_1962’,
includes the final element.
Reconstructing Data
Explain what each line in the following short program does: what is
in first, second, etc.?
Let’s go through this piece of code line by line.
This line loads the dataset containing the GDP data from all
countries into a dataframe called first. The
index_col='country' parameter selects which column to use
as the row labels in the dataframe.
This line makes a selection: only those rows of first
for which the ‘continent’ column matches ‘Americas’ are extracted.
Notice how the Boolean expression inside the brackets,
first['continent'] == 'Americas', is used to select only
those rows where the expression is true. Try printing this expression!
Can you print also its individual True/False elements? (hint: first
assign the expression to a variable)
As the syntax suggests, this line drops the row from
second where the label is ‘Puerto Rico’. The resulting
dataframe third has one row less than the original
dataframe second.
Again we apply the drop function, but in this case we are dropping
not a row but a whole column. To accomplish this, we need to specify
also the axis parameter (we want to drop the second column
which has index 1).
The final step is to write the data that we have been working on to a
csv file. Pandas makes this easy with the to_csv()
function. The only required argument to the function is the filename.
Note that the file will be written in the directory from which you
started the Jupyter or Python session.
For each column in data, idxmin will return
the index value corresponding to each column’s minimum;
idxmax will do accordingly the same for each column’s
maximum value.
You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.
Practice with Selection
Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:
- GDP per capita for all countries in 1982.
- GDP per capita for Denmark for all years.
- GDP per capita for all countries for years after 1985.
- GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
1:
2:
3:
Pandas is smart enough to recognize the number at the end of the
column label and does not give you an error, although no column named
gdpPercap_1985 actually exists. This is useful if new
columns are added to the CSV file later.
4:
Many Ways of Access
There are at least two ways of accessing a value or slice of a
DataFrame: by name or index. However, there are many others. For
example, a single column or row can be accessed either as a
DataFrame or a Series object.
Suggest different ways of doing the following operations on a DataFrame:
- Access a single column
- Access a single row
- Access an individual DataFrame element
- Access several columns
- Access several rows
- Access a subset of specific rows and columns
- Access a subset of row and column ranges
1. Access a single column:
PYTHON
# by name
data["col_name"] # as a Series
data[["col_name"]] # as a DataFrame
# by name using .loc
data.T.loc["col_name"] # as a Series
data.T.loc[["col_name"]].T # as a DataFrame
# Dot notation (Series)
data.col_name
# by index (iloc)
data.iloc[:, col_index] # as a Series
data.iloc[:, [col_index]] # as a DataFrame
# using a mask
data.T[data.T.index == "col_name"].T
2. Access a single row:
PYTHON
# by name using .loc
data.loc["row_name"] # as a Series
data.loc[["row_name"]] # as a DataFrame
# by name
data.T["row_name"] # as a Series
data.T[["row_name"]].T # as a DataFrame
# by index
data.iloc[row_index] # as a Series
data.iloc[[row_index]] # as a DataFrame
# using mask
data[data.index == "row_name"]
3. Access an individual DataFrame element:
PYTHON
# by column/row names
data["column_name"]["row_name"] # as a Series
data[["col_name"]].loc["row_name"] # as a Series
data[["col_name"]].loc[["row_name"]] # as a DataFrame
data.loc["row_name"]["col_name"] # as a value
data.loc[["row_name"]]["col_name"] # as a Series
data.loc[["row_name"]][["col_name"]] # as a DataFrame
data.loc["row_name", "col_name"] # as a value
data.loc[["row_name"], "col_name"] # as a Series. Preserves index. Column name is moved to `.name`.
data.loc["row_name", ["col_name"]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.loc[["row_name"], ["col_name"]] # as a DataFrame (preserves original index and column name)
# by column/row names: Dot notation
data.col_name.row_name
# by column/row indices
data.iloc[row_index, col_index] # as a value
data.iloc[[row_index], col_index] # as a Series. Preserves index. Column name is moved to `.name`
data.iloc[row_index, [col_index]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.iloc[[row_index], [col_index]] # as a DataFrame (preserves original index and column name)
# column name + row index
data["col_name"][row_index]
data.col_name[row_index]
data["col_name"].iloc[row_index]
# column index + row name
data.iloc[:, [col_index]].loc["row_name"] # as a Series
data.iloc[:, [col_index]].loc[["row_name"]] # as a DataFrame
# using masks
data[data.index == "row_name"].T[data.T.index == "col_name"].T
4. Access several columns:
PYTHON
# by name
data[["col1", "col2", "col3"]]
data.loc[:, ["col1", "col2", "col3"]]
# by index
data.iloc[:, [col1_index, col2_index, col3_index]]
5. Access several rows
PYTHON
# by name
data.loc[["row1", "row2", "row3"]]
# by index
data.iloc[[row1_index, row2_index, row3_index]]
6. Access a subset of specific rows and columns
PYTHON
# by names
data.loc[["row1", "row2", "row3"], ["col1", "col2", "col3"]]
# by indices
data.iloc[[row1_index, row2_index, row3_index], [col1_index, col2_index, col3_index]]
# column names + row indices
data[["col1", "col2", "col3"]].iloc[[row1_index, row2_index, row3_index]]
# column indices + row names
data.iloc[:, [col1_index, col2_index, col3_index]].loc[["row1", "row2", "row3"]]
7. Access a subset of row and column ranges
PYTHON
# by name
data.loc["row1":"row2", "col1":"col2"]
# by index
data.iloc[row1_index:row2_index, col1_index:col2_index]
# column names + row indices
data.loc[:, "col1_name":"col2_name"].iloc[row1_index:row2_index]
# column indices + row names
data.iloc[:, col1_index:col2_index].loc["row1":"row2"]
Exploring available methods using the
dir() function
Python includes a dir() function that can be used to
display all of the available methods (functions) that are built into a
data object. In Episode 4, we used some methods with a string. But we
can see many more are available by using dir():
This command returns:
PYTHON
['__add__',
...
'__subclasshook__',
'capitalize',
'casefold',
'center',
...
'upper',
'zfill']
You can use help() or Shift+Tab to
get more information about what these methods do.
Assume Pandas has been imported and the Gapminder GDP data for Europe
has been loaded as data. Then, use dir() to
find the function that prints out the median per-capita GDP across all
European countries for each year that information is available.
Interpretation
Poland’s borders have been stable since 1945, but changed several times in the years before then. How would you handle this if you were creating a table of GDP per capita for Poland for the entire twentieth century?
- Use
DataFrame.iloc[..., ...]to select values by integer location. - Use
:on its own to mean all columns or all rows. - Select multiple columns or rows using
DataFrame.locand a named slice. - Result of slicing can be used in further operations.
- Use comparisons to select data based on value.
- Select values or NaN using a Boolean mask.
The selection can be done by using the labels for both the row (“Serbia”) and the column (“gdpPercap_2007”):
The output is
OUTPUT
9786.534714
No, they do not produce the same output! The output of the first statement is:
OUTPUT
gdpPercap_1952 gdpPercap_1957
country
Albania 1601.056136 1942.284244
Austria 6137.076492 8842.598030
The second statement gives:
OUTPUT
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
country
Albania 1601.056136 1942.284244 2312.888958
Austria 6137.076492 8842.598030 10750.721110
Belgium 8343.105127 9714.960623 10991.206760
Clearly, the second statement produces an additional column and an
additional row compared to the first statement.
What conclusion can we draw? We see that a numerical slice, 0:2,
omits the final index (i.e. index 2) in the range provided,
while a named slice, ‘gdpPercap_1952’:‘gdpPercap_1962’,
includes the final element.
Let’s go through this piece of code line by line.
This line loads the dataset containing the GDP data from all
countries into a dataframe called first. The
index_col='country' parameter selects which column to use
as the row labels in the dataframe.
This line makes a selection: only those rows of first
for which the ‘continent’ column matches ‘Americas’ are extracted.
Notice how the Boolean expression inside the brackets,
first['continent'] == 'Americas', is used to select only
those rows where the expression is true. Try printing this expression!
Can you print also its individual True/False elements? (hint: first
assign the expression to a variable)
As the syntax suggests, this line drops the row from
second where the label is ‘Puerto Rico’. The resulting
dataframe third has one row less than the original
dataframe second.
Again we apply the drop function, but in this case we are dropping
not a row but a whole column. To accomplish this, we need to specify
also the axis parameter (we want to drop the second column
which has index 1).
The final step is to write the data that we have been working on to a
csv file. Pandas makes this easy with the to_csv()
function. The only required argument to the function is the filename.
Note that the file will be written in the directory from which you
started the Jupyter or Python session.
For each column in data, idxmin will return
the index value corresponding to each column’s minimum;
idxmax will do accordingly the same for each column’s
maximum value.
You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.
1:
2:
3:
Pandas is smart enough to recognize the number at the end of the
column label and does not give you an error, although no column named
gdpPercap_1985 actually exists. This is useful if new
columns are added to the CSV file later.
4:
1. Access a single column:
PYTHON
# by name
data["col_name"] # as a Series
data[["col_name"]] # as a DataFrame
# by name using .loc
data.T.loc["col_name"] # as a Series
data.T.loc[["col_name"]].T # as a DataFrame
# Dot notation (Series)
data.col_name
# by index (iloc)
data.iloc[:, col_index] # as a Series
data.iloc[:, [col_index]] # as a DataFrame
# using a mask
data.T[data.T.index == "col_name"].T
2. Access a single row:
PYTHON
# by name using .loc
data.loc["row_name"] # as a Series
data.loc[["row_name"]] # as a DataFrame
# by name
data.T["row_name"] # as a Series
data.T[["row_name"]].T # as a DataFrame
# by index
data.iloc[row_index] # as a Series
data.iloc[[row_index]] # as a DataFrame
# using mask
data[data.index == "row_name"]
3. Access an individual DataFrame element:
PYTHON
# by column/row names
data["column_name"]["row_name"] # as a Series
data[["col_name"]].loc["row_name"] # as a Series
data[["col_name"]].loc[["row_name"]] # as a DataFrame
data.loc["row_name"]["col_name"] # as a value
data.loc[["row_name"]]["col_name"] # as a Series
data.loc[["row_name"]][["col_name"]] # as a DataFrame
data.loc["row_name", "col_name"] # as a value
data.loc[["row_name"], "col_name"] # as a Series. Preserves index. Column name is moved to `.name`.
data.loc["row_name", ["col_name"]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.loc[["row_name"], ["col_name"]] # as a DataFrame (preserves original index and column name)
# by column/row names: Dot notation
data.col_name.row_name
# by column/row indices
data.iloc[row_index, col_index] # as a value
data.iloc[[row_index], col_index] # as a Series. Preserves index. Column name is moved to `.name`
data.iloc[row_index, [col_index]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.iloc[[row_index], [col_index]] # as a DataFrame (preserves original index and column name)
# column name + row index
data["col_name"][row_index]
data.col_name[row_index]
data["col_name"].iloc[row_index]
# column index + row name
data.iloc[:, [col_index]].loc["row_name"] # as a Series
data.iloc[:, [col_index]].loc[["row_name"]] # as a DataFrame
# using masks
data[data.index == "row_name"].T[data.T.index == "col_name"].T
4. Access several columns:
PYTHON
# by name
data[["col1", "col2", "col3"]]
data.loc[:, ["col1", "col2", "col3"]]
# by index
data.iloc[:, [col1_index, col2_index, col3_index]]
5. Access several rows
PYTHON
# by name
data.loc[["row1", "row2", "row3"]]
# by index
data.iloc[[row1_index, row2_index, row3_index]]
6. Access a subset of specific rows and columns
PYTHON
# by names
data.loc[["row1", "row2", "row3"], ["col1", "col2", "col3"]]
# by indices
data.iloc[[row1_index, row2_index, row3_index], [col1_index, col2_index, col3_index]]
# column names + row indices
data[["col1", "col2", "col3"]].iloc[[row1_index, row2_index, row3_index]]
# column indices + row names
data.iloc[:, [col1_index, col2_index, col3_index]].loc[["row1", "row2", "row3"]]
7. Access a subset of row and column ranges
PYTHON
# by name
data.loc["row1":"row2", "col1":"col2"]
# by index
data.iloc[row1_index:row2_index, col1_index:col2_index]
# column names + row indices
data.loc[:, "col1_name":"col2_name"].iloc[row1_index:row2_index]
# column indices + row names
data.iloc[:, col1_index:col2_index].loc["row1":"row2"]
Among many choices, dir() lists the
median() function as a possibility. Thus,