Pandas DataFrames
Last updated on 2024-05-02 | Edit this page
Overview
Questions
- How can I do statistical analysis of tabular data?
Objectives
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Select a subset of a dataframe by a single Boolean criterion.
Note about Pandas DataFrames/Series
A DataFrame is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.
Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.
Selecting values
To access a value at the position [i,j] of a DataFrame,
we have two options, depending on what is the meaning of i
in use. Remember that a DataFrame provides an index as a way to
identify the rows of the table; a row, then, has a position
inside the table as well as a label, which uniquely identifies
its entry in the DataFrame.
Use DataFrame.iloc[..., ...] to select values by their
(entry) position
- Can specify location by numerical index analogously to 2D version of character selection in strings.
PYTHON
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.iloc[0, 0])
OUTPUT
1601.056136
Use DataFrame.loc[..., ...] to select values by their
(entry) label.
- Can specify location by row and/or column name.
OUTPUT
1601.056136
Use : on its own to mean all columns or all rows.
- Just like Python’s usual slicing notation.
OUTPUT
gdpPercap_1952 1601.056136
gdpPercap_1957 1942.284244
gdpPercap_1962 2312.888958
gdpPercap_1967 2760.196931
gdpPercap_1972 3313.422188
gdpPercap_1977 3533.003910
gdpPercap_1982 3630.880722
gdpPercap_1987 3738.932735
gdpPercap_1992 2497.437901
gdpPercap_1997 3193.054604
gdpPercap_2002 4604.211737
gdpPercap_2007 5937.029526
Name: Albania, dtype: float64
- Would get the same result printing
data.loc["Albania"](without a second index).
OUTPUT
country
Albania 1601.056136
Austria 6137.076492
Belgium 8343.105127
⋮ ⋮ ⋮
Switzerland 14734.232750
Turkey 1969.100980
United Kingdom 9979.508487
Name: gdpPercap_1952, dtype: float64
- Would get the same result printing
data["gdpPercap_1952"] - Also get the same result printing
data.gdpPercap_1952(not recommended, because easily confused with.notation for methods)
Select multiple columns or rows using DataFrame.loc and
a named slice.
OUTPUT
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
In the above code, we discover that slicing using
loc is inclusive at both ends, which differs from
slicing using iloc, where slicing
indicates everything up to but not including the final index.
Result of slicing can be used in further operations.
- Usually don’t just print a slice.
- All the statistical operators that work on entire dataframes work the same way on slices.
- E.g., calculate max of a slice.
OUTPUT
gdpPercap_1962 13450.40151
gdpPercap_1967 16361.87647
gdpPercap_1972 18965.05551
dtype: float64
OUTPUT
gdpPercap_1962 4649.593785
gdpPercap_1967 5907.850937
gdpPercap_1972 7778.414017
dtype: float64
Use comparisons to select data based on value.
- Comparison is applied element by element.
- Returns a similarly-shaped dataframe of
TrueandFalse.
PYTHON
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)
OUTPUT
Subset of data:
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
Where are values large?
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy False True True
Montenegro False False False
Netherlands True True True
Norway True True True
Poland False False False
Select values or NaN using a Boolean mask.
- A frame full of Booleans is sometimes called a mask because of how it can be used.
OUTPUT
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy NaN 10022.40131 12269.27378
Montenegro NaN NaN NaN
Netherlands 12790.84956 15363.25136 18794.74567
Norway 13450.40151 16361.87647 18965.05551
Poland NaN NaN NaN
- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
OUTPUT
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
count 2.000000 3.000000 3.000000
mean 13120.625535 13915.843047 16676.358320
std 466.373656 3408.589070 3817.597015
min 12790.849560 10022.401310 12269.273780
25% 12955.737547 12692.826335 15532.009725
50% 13120.625535 15363.251360 18794.745670
75% 13285.513523 15862.563915 18879.900590
max 13450.401510 16361.876470 18965.055510