https://www.youtube.com/embed/r9Gaauyf1Qk?feature=oembed
The Pandas DataFrame/Series has several methods to handle Missing Data. When applied to a DataFrame/Series, these methods evaluate and modify the missing elements.
This is Part 12 of the DataFrame methods series:
- Part 1 focuses on the DataFrame methods
abs()
,all()
,any()
,clip()
,corr()
, andcorrwith()
. - Part 2 focuses on the DataFrame methods
count()
,cov()
,cummax()
,cummin()
,cumprod()
,cumsum()
. - Part 3 focuses on the DataFrame methods
describe()
,diff()
,eval()
,kurtosis()
. - Part 4 focuses on the DataFrame methods
mad()
,min()
,max()
,mean()
,median()
, andmode()
. - Part 5 focuses on the DataFrame methods
pct_change()
,quantile()
,rank()
,round()
,prod()
, andproduct()
. - Part 6 focuses on the DataFrame methods
add_prefix()
,add_suffix()
, andalign()
. - Part 7 focuses on the DataFrame methods
at_time()
,between_time()
,drop()
,drop_duplicates()
andduplicated()
. - Part 8 focuses on the DataFrame methods
equals()
,filter()
,first()
,last(), head()
, andtail()
- Part 9 focuses on the DataFrame methods
equals()
,filter()
,first()
,last()
,head()
, andtail()
- Part 10 focuses on the DataFrame methods
reset_index()
,sample()
,set_axis()
,set_index()
,take()
, andtruncate()
- Part 11 focuses on the DataFrame methods
backfill()
,bfill()
,fillna()
,dropna()
, andinterpolate()
- Part 12 focuses on the DataFrame methods
isna()
,isnull()
,notna()
,notnull()
,pad()
andreplace()
Getting Started
Remember to add the Required Starter Code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
Required Starter Code
import pandas as pd import numpy as np
Before any data manipulation can occur, two new libraries will require installation.
- The
pandas
library enables access to/from a DataFrame. - The
numpy
library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install numpy
Hit the <Enter>
key on the keyboard to start the installation process.
Feel free to check out the correct ways of installing those libraries here:
If the installations were successful, a message displays in the terminal indicating the same.
DataFrame isna() & Dataframe isnull()
The DataFrame isna()
and isnull()
methods return Boolean (True
/False
) values in the same shape as the DataFrame/Series passed. If any empty values are of the following type, they will resolve to True
.
None
NaN
NaT
NA
All other values (valid data) will resolve to False
.
Note: Any empty strings or numpy.inf
are not considered empty unless use_inf_as_na
is set to True
.
The syntax for these methods is as follows:
DataFrame.isna() DataFrame.isnull()
Parameters:
These methods contain no parameters.
For this example, three (3) temperatures over three (3) days for Anchorage, Alaska save to a DataFrame. Unfortunately, some temperatures did not accurately record.
The code below returns a new DataFrame containing True
values in the same position as the missing temperatures and False
in the remainder.
Code – isna()
:
df_temps = pd.DataFrame({'Day-1': [np.nan, 11, 12], 'Day-2': [13, 14, pd.NaT], 'Day-3': [None, 15, 16]}, index=['Morning', 'Noon', 'Evening']) print(df_temps) result = df_temps.isna() print(result)
- Line [1] creates a dictionary of lists and saves it to
df_temps
. - Line [2] outputs the DataFrame to the terminal.
- Line [3] uses
isna()
to set the empty values (np.nan
,pd.NaT
,None
) toTrue
and the remainder (valid values) toFalse
. This output saves to theresult
variable. - Line [4] outputs the result to the terminal.
Output:
original df_temps
Day-1 | Day-2 | Day-3 | |
Morning | NaN | 13 | NaN |
Noon | 11.0 | 14 | 15.0 |
Evening | 12.0 | NaT | 16.0 |
result
Day-1 | Day-2 | Day-3 | |
Morning | True | False | True |
Noon | False | False | False |
Evening | False | True | False |
Code – isnull()
:
df_temps = pd.DataFrame({'Day-1': [np.nan, 11, 12], 'Day-2': [13, 14, pd.NaT], 'Day-3': [None, 15, 16]}, index=['Morning', 'Noon', 'Evening']) print(df_temps) result = df_temps.isnull() print(result)
- Line [1] creates a dictionary of lists and saves it to
df_temps
. - Line [2] outputs the DataFrame to the terminal.
- Line [3] uses
isnull()
to set the empty values (np.nan
,pd.NaT
,None
) toTrue
and the remainder (valid values) toFalse
. This output saves to theresult
variable. - Line [4] outputs the result to the terminal.
Output:
original df_temps
Day-1 | Day-2 | Day-3 | |
Morning | NaN | 13 | NaN |
Noon | 11.0 | 14 | 15.0 |
Evening | 12.0 | NaT | 16.0 |
result
Day-1 | Day-2 | Day-3 | |
Morning | True | False | True |
Noon | False | False | False |
Evening | False | True | False |
Note: The isnull()
method is an alias of the isna()
method. The output from both examples is identical.
DataFrame notna() & notnull()
The DataFrame notna()
and notnull()
methods return Boolean (True
/False
) values. These values returned are in the same shape as the DataFrame/Series passed. If any empty values are of the following type, they will resolve to False
.
None
NaN
NaT
NA
All other values that are not of the above type (valid data) will resolve to True
.
The syntax for these methods is as follows:
DataFrame.notna() DataFrame.notnull()
Parameters:
These methods contain no parameters.
For this example, three (3) temperatures over three (3) days for Anchorage, Alaska save to a DataFrame. Unfortunately, some temperatures did not accurately record.
The code below returns a new DataFrame containing True
values in the same position as the missing temperatures and False
in the remainder.
Code – notna()
:
df_temps = pd.DataFrame({'Day-1': [np.nan, 11, 12], 'Day-2': [13, 14, pd.NaT], 'Day-3': [None, 15, 16]}, index=['Morning', 'Noon', 'Evening']) print(df_temps) result = df_temps.notna() print(result)
- Line [1] creates a dictionary of lists and saves it to
df_temps
. - Line [2] outputs the DataFrame to the terminal.
- Line [3] uses
notna()
to set the empty values (np.nan
,pd.NaT
,None
) toFalse
and the remainder (valid values) toTrue
. This output saves to theresult
variable. - Line [4] outputs the result to the terminal.
Output:
original df_temps
Day-1 | Day-2 | Day-3 | |
Morning | NaN | 13 | NaN |
Noon | 11.0 | 14 | 15.0 |
Evening | 12.0 | NaT | 16.0 |
result
Day-1 | Day-2 | Day-3 | |
Morning | False | True | False |
Noon | True | True | True |
Evening | True | False | True |
Code – notnull()
:
df_temps = pd.DataFrame({'Day-1': [np.nan, 11, 12], 'Day-2': [13, 14, pd.NaT], 'Day-3': [None, 15, 16]}, index=['Morning', 'Noon', 'Evening']) print(df_temps) result = df_temps.notnull() print(result)
- Line [1] creates a dictionary of lists and saves it to
df_temps
. - Line [2] outputs the DataFrame to the terminal.
- Line [3] uses
notnull()
to set the empty values (np.nan
,pd.NaT
,None
) toFalse
and the remainder (valid values) toTrue
. This output saves to theresult
variable. - Line [4] outputs the result to the terminal.
Output:
original df_temps
Day-1 | Day-2 | Day-3 | |
Morning | NaN | 13 | NaN |
Noon | 11.0 | 14 | 15.0 |
Evening | 12.0 | NaT | 16.0 |
result
Day-1 | Day-2 | Day-3 | |
Morning | False | True | False |
Noon | True | True | True |
Evening | True | False | True |
Note: The notnull()
method is an alias of the notna()
method. The output from both examples is identical.
DataFrame pad()
The pad()
method is an alias for DataFrame/Series fillna()
with the parameter method set to 'ffill'
. Click here for details.
DataFrame replace()
The replace()
method substitutes values in a DataFrame/Series with a different value assigned. This operation is performed dynamically on the object passed.
Note: The .loc
/.iloc
methods are slightly different from replace()
as they require a specific location in order to change the said value(s).
The syntax for this method is as follows:
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Parameter | Description |
---|---|
to_replace |
Determines how to locate values to replace . The following parameters are: – Numeric, String, or Regex. – List of Strings, Regex, or Numeric. – Dictionary: a Dictionary, DataFrame Dictionary, or Nested Dictionary Each one must exactly match the to_replace parameter to cause any change. |
value |
The value to replace any values that match. |
inplace |
If set to True , the changes apply to the original DataFrame/Series. If False , the changes apply to a new DataFrame/Series. By default, False . |
limit |
The maximum number of elements to backward/forward fill. |
regex |
A regex expression to match. Matches resolve to the value parameter. |
method |
The available options for this method are pad , ffill , bfill , or None . Specify the replacement method to use. |
Possible Errors Raised:
Error | When Does It Occur? |
AssertionError |
If regex is not a Boolean (True /False ), or the to_replace parameter is None . |
TypeError |
If to_replace is not in a valid format, such as: – Not scalar, an array, a dictionary, or is None . – If to_replace is a dictionary and the value parameter is not a list. – If multiple Booleans or date objects and to_replace fails to match the value parameter. |
ValueError |
Any error returns if a list/ndarray and value are not the same length. |
The examples below show how versatile the replace()
method is. We recommend you spend some time reviewing the code and output.
In this example, we have five (5) grades for a student. Notice that one (1) grade is a failing grade. To rectify this, run the following code:
Code – Example 1
grades = pd.Series([55, 64, 52, 76, 49]) print(grades) result = grades.replace(49, 51) print(result)
- Line [1] creates a Series of Lists and saves it to
grades
. - Line [2] modifies the failing grade of 49 to a passing grade of 51. The output saves to
result
. - Line [3] outputs the
result
to the terminal.
Output:
O | 55 |
1 | 64 |
2 | 52 |
3 | 76 |
4 | 51 |
dtype: int64 |
This example shows a DataFrame of three (3) product lines for Rivers Clothing. They want the price of 11.35 changed to 12.95. Run the code below to change the pricing.
Code – Example 2:
df = pd.DataFrame({'Tops': [10.12, 12.23, 11.35], 'Tanks': [11.35, 13.45, 14.98], 'Sweats': [11.35, 21.85, 35.75]}) result = df.replace(11.35, 12.95) print(result)
- Line [1] creates a dictionary of lists and saves it to
df
. - Line [2] replaces the value 11.35 to 12.95 for each occurrence. The output saves to
result
. - Line [3] outputs the result to the terminal.
Output:
Tops | Tanks | Sweats | |
0 | 10.12 | 12.95 | 12.95 |
1 | 12.23 | 13.45 | 21.85 |
2 | 12.95 | 14.98 | 35.75 |
Code – Example 3:
This example shows a DataFrame with two (2) teams. Each team contains three (3) members. This code removes one (1) member from each team and replaces it with quit.
df = pd.DataFrame({'Team-1': ['Barb', 'Todd', 'Taylor'], 'Team-2': ['Arch', 'Bart', 'Alex']}) result = df.replace(to_replace=r'^Bar.$', value='quit', regex=True) print(result)
- Line [1] creates a Dictionary of Lists and saves it to
df
. - Line [2] replaces any values that start with
Bar
and contain one (1) additional character (.
). This match changed to the wordquit
. The output saves toresult
. - Line [3] outputs the result to the terminal.
Finxter