http://img.youtube.com/vi/PMKuZoQoYE0/0.jpg
The Pandas DataFrame/Series has several methods to handle Missing Data. When applied to a DataFrame/Series, these methods evaluate and modify the missing elements.
This is Part 13 of the DataFrame methods series:
- Part 1 focuses on the DataFrame methods
abs()
,all()
,any()
,clip()
,corr()
, andcorrwith()
. - Part 2 focuses on the DataFrame methods
count()
,cov()
,cummax()
,cummin()
,cumprod()
,cumsum()
. - Part 3 focuses on the DataFrame methods
describe()
,diff()
,eval()
,kurtosis()
. - Part 4 focuses on the DataFrame methods
mad()
,min()
,max()
,mean()
,median()
, andmode()
. - Part 5 focuses on the DataFrame methods
pct_change()
,quantile()
,rank()
,round()
,prod()
, andproduct()
. - Part 6 focuses on the DataFrame methods
add_prefix()
,add_suffix()
, andalign()
. - Part 7 focuses on the DataFrame methods
at_time()
,between_time()
,drop()
,drop_duplicates()
andduplicated()
. - Part 8 focuses on the DataFrame methods
equals()
,filter()
,first()
,last(), head()
, andtail()
- Part 9 focuses on the DataFrame methods
equals()
,filter()
,first()
,last()
,head()
, andtail()
- Part 10 focuses on the DataFrame methods
reset_index()
,sample()
,set_axis()
,set_index()
,take()
, andtruncate()
- Part 11 focuses on the DataFrame methods
backfill()
,bfill()
,fillna()
,dropna()
, andinterpolate()
- Part 12 focuses on the DataFrame methods
isna()
,isnull()
,notna()
,notnull()
,pad()
andreplace()
- Part 13 focuses on the DataFrame methods
drop_level()
,pivot()
,pivot_table()
,reorder_levels()
,sort_values()
andsort_index()
Getting Started
Remember to add the Required Starter Code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
Required Starter Code
import pandas as pd import numpy as np
Before any data manipulation can occur, two new libraries will require installation.
- The
pandas
library enables access to/from a DataFrame. - The
numpy
library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install numpy
Hit the <Enter>
key on the keyboard to start the installation process.
Feel free to check out the correct ways of installing those libraries here:
If the installations were successful, a message displays in the terminal indicating the same.
DataFrame drop_level()
The drop_level()
method removes the specified index or column from a DataFrame/Series. This method returns a DataFrame/Series with the said level/column removed.
The syntax for this method is as follows:
DataFrame.droplevel(level, axis=0)
Parameter | Description |
---|---|
level |
If the level is a string, this level must exist. If a list, the elements must exist and be a level name/position of the index. |
axis |
If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
For this example, we generate random stock prices and then drop (remove) level Stock-B from the DataFrame.
nums = np.random.uniform(low=0.5, high=13.3, size=(3,4)) df_stocks = pd.DataFrame(nums).set_index([0, 1]).rename_axis(['Stock-A', 'Stock-B']) print(df_stocks) result = df_stocks.droplevel('Stock-B') print(result)
- Line [1] generates random numbers for three (3) lists within the specified range. Each list contains four (4) elements (
size=3,4
). The output saves tonums
. - Line [2] creates a DataFrame, sets the index, and renames the axis. This output saves to
df_stocks
. - Line [3] outputs the DataFrame to the terminal.
- Line [4] drops (removes) Stock-B from the DataFrame and saves it to the
result
variable. - Line [5] outputs the result to the terminal.
Output:
df_stocks
2 | 3 | ||
Stock-A | Stock-B | ||
12.327710 | 10.862572 | 7.105198 | 8.295885 |
11.474872 | 1.563040 | 5.915501 | 6.102915 |
result
2 | 3 | |
Stock-A | ||
12.327710 | 7.105198 | 8.295885 |
11.474872 | 5.915501 | 6.102915 |
DataFrame pivot()
The pivot()
method reshapes a DataFrame/Series and produces/returns a pivot table based on column values.
The syntax for this method is as follows:
DataFrame.pivot(index=None, columns=None, values=None)
Parameter | Description |
---|---|
index |
This parameter can be a string, object, or a list of strings and is optional. This option makes up the new DataFrame/Series index. If None , the existing index is selected. |
columns |
This parameter can be a string, object, or a list of strings and is optional. Makes up the new DataFrame/Series column(s). |
values |
This parameter can be a string, object, or a list of the previous and is optional. |
For this example, we generate 3-day sample stock prices for Rivers Clothing. The column headings display the following characters.
- A (for Opening Price)
- B (for Midday Price)
- C (for Opening Price)
cdate_idx = ['01/15/2022', '01/16/2022', '01/17/2022'] * 3 group_lst = list('AAABBBCCC') vals_lst = np.random.uniform(low=0.5, high=13.3, size=(9)) df = pd.DataFrame({'dates': cdate_idx, 'group': group_lst, 'value': vals_lst}) print(df) result = df.pivot(index='dates', columns='group', values='value') print(result)
- Line [1] creates a list of dates and multiplies this by three (3). The output is three (3) entries for each date. This output saves to
cdate_idx
. - Line [2] creates a list of headings for the columns (see above for definitions). Three (3) of each character are required (9 characters). This output saves to
group_lst
. - Line [3] uses
np.random.uniform
to create a random list of nine (9) numbers between the set range. The output saves tovals_lst
. - Line [4] creates a DataFrame using all the variables created on lines [1-3]. The output saves to
df
. - Line [5] outputs the DataFrame to the terminal.
- Line [6] creates a pivot from the DataFrame and groups the data by dates. The output saves to
result
. - Line [7] outputs the result to the terminal.
Output:
df
dates | group | value | |
0 | 01/15/2022 | A | 9.627767 |
1 | 01/16/2022 | A | 11.528057 |
2 | 01/17/2022 | A | 13.296501 |
3 | 01/15/2022 | B | 2.933748 |
4 | 01/16/2022 | B | 2.236752 |
5 | 01/17/2022 | B | 7.652414 |
6 | 01/15/2022 | C | 11.813549 |
7 | 01/16/2022 | C | 11.015920 |
8 | 01/17/2022 | C | 0.527554 |
result
group | A | B | C |
dates | |||
01/15/2022 | 8.051752 | 9.571285 | 6.196394 |
01/16/2022 | 6.511448 | 8.158878 | 12.865944 |
01/17/2022 | 8.421245 | 1.746941 | 12.896975 |
DataFrame pivot_table()
The pivot_table()
method streamlines a DataFrame to contain only specific data (columns). For example, say we have a list of countries with associated details. We only want to display one or two columns. This method can accomplish this task.
The syntax for this method is as follows:
DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Parameter | Description |
---|---|
values |
This parameter is the column to aggregate and is optional. |
index |
If the parameter is an array, it must be the same length as the data. It may contain any other data types (but not a list). |
columns |
If an array, it must be the same length as the data. It may contain any other data types (but not a list). |
aggfunc |
This parameter can be a list of functions. These name(s) will display at the top of the relevant column names (see Example 2). |
fill_value |
This parameter is the value used to replace missing values in the table after the aggregation has occurred. |
margins |
If set to True , this parameter will add the row/column data to create subtotal(s) or total(s). False , by default. |
dropna |
This parameter will not include any columns where the value(s) are NaN . True by default. |
margins_name |
This parameter is the name of the row/column containing the totals if margins parameter is True . |
observed |
If True , display observed values. If False , display all observed values. |
sort |
By default, sort is True . The values automatically sort. If False , no sort is applied. |
For this example, a comma-delimited CSV file is read in. A pivot table is created based on selected parameters.
Code – Example 1:
df = pd.read_csv('countries.csv') df = df.head(5) print(df) result = pd.pivot_table(df, values='Population', columns='Capital') print(result)
- Line [1] reads in a CSV file and saves to a DataFrame (
df
). - Line [2] saves the first five (5) rows of the CSV file to
df
(over-writingdf
). - Line [3] outputs the DataFrame to the terminal.
- Line [4] creates a pivot table from the DataFrame based on the Population and Capital columns. The output saves to
result
. - Line [5] outputs the result to the terminal.
Output:
df
Country | Capital | Population | Area | |
0 | Germany | Berlin | 83783942 | 357021 |
1 | France | Paris | 67081000 | 551695 |
2 | Spain | Madrid | 47431256 | 498511 |
3 | Italy | Rome | 60317116 | 301338 |
4 | Poland | Warsaw | 38383000 | 312685 |
result
Capital | Berlin | Madrid | Paris | Rome | Warsaw |
Population | 83783942 | 47431256 | 67081000 | 60317116 | 38383000 |
For this example, a comma-delimited CSV file is read in. A pivot table is created based on selected parameters. Notice the max
function.
Code – Example 2
df = pd.read_csv('countries.csv') df = df.head(5) result = pd.pivot_table(df, values='Population', columns='Capital', aggfunc=[max]) print(result)
- Line [1] reads in a comma-separated CSV file and saves to a DataFrame (
df
). - Line [2] saves the first five (5) rows of the CSV file to
df
(over-writingdf
). - Line [3] creates a pivot table from the DataFrame based on the Population and Capital columns. The max population is a parameter of
aggfunc
. The output saves toresult
. - Line [4] outputs the result to the terminal.
Output:
result
max | |||||
Capital | Berlin | Madrid | Paris | Rome | Warsaw |
Population | 83783942 | 47431256 | 67081000 | 60317116 | 38383000 |
DataFrame reorder_levels()
The reorder_levels()
method re-arranges the index of a DataFrame/Series. This method can not contain any duplicate level(s) or drop level(s).
The syntax for this method is as follows:
DataFrame.reorder_levels(order, axis=0)
Parameter | Description |
---|---|
order |
This parameter is a list containing the new order levels. These levels can be a position or a label. |
axis |
If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
For this example, there are five (5) students. Each student has some associated data with it. Grades generate by using np.random.randint()
.
index = [(1001, 'Micah Smith', 14), (1001, 'Philip Jones', 15), (1002, 'Ben Grimes', 16), (1002, 'Alicia Heath', 17), (1002, 'Arch Nelson', 18)] m_index = pd.MultiIndex.from_tuples(index) grades_lst = np.random.randint(45,100,size=5) df = pd.DataFrame({"Grades": grades_lst}, index=m_index) print(df) result = df.reorder_levels([1,2,0]) print(result)
- Line [1] creates a List of tuples. Each tuple contains three (3) values. The output saves to
index
. - Line [2] creates a
MultiIndex
from the List of Tuples created on line [1] and saves tom_index
. - Line [3] generates five (5) random grades between the specified range and saves to
grades_lst
. - Line [4] creates a DataFrame from the variables on lines [1-3] and saves to
df
. - Line [5] outputs the DataFrame to the terminal.
- Line [6] re-orders the levels as specified. The output saves to
result
. - Line [7] outputs the result to the terminal.
Output:
df
Grades | |||
1001 | Micah Smith | 14 | 52 |
Philip Jones | 15 | 65 | |
1002 | Ben Grimes | 16 | 83 |
Alicia Heath | 17 | 99 | |
Arch Nelson | 18 | 78 |
result
Grades | |||
Micah Smith | 14 | 1001 | 52 |
Philip Jones | 15 | 1001 | 65 |
Ben Grimes | 16 | 1002 | 83 |
Alicia Heath | 17 | 1002 | 99 |
Arch Nelson | 18 | 1002 | 78 |
DataFrame sort_values()
The sort_values()
method sorts (re-arranges) the elements of a DataFrame.
The syntax for this method is as follows:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
Parameter | Description |
---|---|
by |
This parameter is a string or a list of strings. These comprise the index levels/columns to sort. Dependent on the selected axis. |
axis |
If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
ascending |
By default, True . Sort is conducted in ascending order. If False , descending order. |
inplace |
If False , create a copy of the object. If True , the original object updates. By default, False . |
kind |
Available options are quicksort , mergesort , heapsort , or stable . By default, quicksort . See numpy.sort for additional details. |
na_position |
Available options are first and last (default). If the option is first , all NaN values move to the beginning, last to the end. |
ignore_index |
If True , the axis numbering is 0, 1, 2, etc. By default, False . |
key |
This parameter applies the function to the values before a sort. The data must be in a Series format and applies to each column. |
For this example, a comma-delimited CSV file is read in. This DataFrame sorts on the Capital column in descending order.
df = pd.read_csv('countries.csv') result = df.sort_values(by=['Capital'], ascending=False) print(result)
- Line [1] reads in a comma-delimited CSV file and saves to
df
. - Line [2] sorts the DataFrame on the Capital column in descending order. The output saves to
result
. - Line [3] outputs the result to the terminal.
Output:
Country | Capital | Population | Area | |
6 | USA | Washington | 328239523 | 9833520 |
4 | Poland | Warsaw | 38383000 | 312685 |
3 | Italy | Rome | 60317116 | 301338 |
1 | France | Paris | 67081000 | 551695 |
5 | Russia | Moscow | 146748590 | 17098246 |
2 | Spain | Madrid | 47431256 | 498511 |
8 | India | Dheli | 1352642280 | 3287263 |
0 | Germany | Berlin | 83783942 | 357021 |
7 | India | Beijing | 1400050000 | 9596961 |
DataFrame sort_index()
The sort_index()
method sorts the DataFrame.
The syntax for this method is as follows:
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
Parameter | Description |
---|---|
axis |
If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
level |
This parameter is an integer, level name, or a list of integers/level name(s). If not empty, a sort is performed on values in the selected index level(s). |
ascending |
By default, True . Sort is conducted in ascending order. If False , descending order. |
inplace |
If False , create a copy of the object. If True , the original object updates. By default, False . |
kind |
Available options are quicksort , mergesort , heapsort , or stable . By default, quicksort . See numpy.sort for additional details. |
na_position |
Available options are first and last (default). If the option is first , all NaN values move to the beginning, last to the end. |
ignore_index |
If True , the axis numbering is 0, 1, 2, etc. By default, False . |
key |
This parameter applies the function to the values before a sort. The data must be in a Series format and applies to each column. |
For this example, a comma-delimited CSV file is read into a DataFrame. This DataFrame sorts on the index Country column.
df = pd.read_csv('countries.csv') df = df.set_index('Country') result = df.sort_index() print(result)
- Line [1] reads in a comma-delimited CSV file and saves to
df
. - Line [2] sets the index of the DataFrame to Country. The output saves to
df
(over-writing originaldf
). - Line [3] sorts the DataFrame (
df
) on the indexed column (Country) in ascending order (default). The output saves toresult
. - Line [4] outputs the result to the terminal.
Output:
Country | Population | Area | |
China | Beijing | 1400050000 | 9596961 |
France | Paris | 67081000 | 551695 |
Germany | Berlin | 83783942 | 357021 |
India | Dheli | 1352642280 | 3287263 |
Italy | Rome | 60317116 | 301338 |
Poland | Warsaw | 38383000 | 312685 |
Russia | Moscow | 146748590 | 17098246 |
Spain | Madrid | 47431256 | 498511 |
USA | Washington | 328239523 | 9833520 |
Finxter