Pandas is the predominant library for manipulating datasets and dataframes, and it has been for a long time. With recent advances in artificial intelligence, a new open-source library called PandasAI has been developed that adds generative AI capabilities to Pandas.
PandasAI does not replace Pandas. Instead, it adds generative AI capabilities on top of it, so you can perform data analysis by chatting with PandasAI. It abstracts away what is happening in the background and returns the output of your query.
Installing PandasAI
PandasAI is available via PyPI (Python Package Index). Create a new virtual environment if you are using a local IDE. Then use the pip package manager to install it.
pip install pandasai
If you are using Google Colab, you may encounter a dependency conflict error during installation.
Do not downgrade the IPython version. Just restart your runtime and run the code block again. This will resolve the issue.
Understanding the Sample Dataset
The sample dataset you will manipulate with PandasAI is the California Housing Prices dataset from Kaggle. This dataset contains information about housing from the 1990 California census. It has ten columns that provide statistics about these houses. The data card to help you learn more about this dataset is available on Kaggle. Below are the first five rows of the dataset.
Each column represents a single statistic of a house.
Connecting PandasAI to the Large Language Model
To connect PandasAI to a large language model (LLM) such as OpenAI's, you need an API key. To obtain one, go to the OpenAI platform and log in to your account. Then select API on the options page that appears.
After that, click on your profile and select the View API keys option. On the page that appears, click the Create new secret key button. Lastly, name your API key.
OpenAI will generate your API key. Copy it, as you will need it when connecting PandasAI to OpenAI. Keep the key secret, because anyone with access to it can make calls to OpenAI on your behalf, and OpenAI will charge your account for those calls.
Now that you have the API key, create a new Python script and paste the code below. You will rarely need to change this code, because most of the time you will be building on top of it.
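import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
# Load the housing dataset (the path shown is a Google Colab path)
df = pd.read_csv("/content/housing.csv")
# Instantiate the OpenAI LLM and pass it to PandasAI
llm = OpenAI(api_token="your API token")
pandas_ai = PandasAI(llm)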
The above code imports both Pandas and PandasAI, then reads the dataset. Finally, it instantiates the OpenAI LLM and passes it to PandasAI.
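Because the key must stay secret, you may prefer not to paste it into the script at all. The following is a minimal sketch of one common alternative: reading the key from an environment variable. The helper name load_api_key and the variable name OPENAI_API_KEY are assumptions made for this illustration and are not part of PandasAI.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    The helper and the default variable name are illustrative
    assumptions, not something PandasAI provides.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Possible usage with the script above (commented out so that this
# block runs on its own, without PandasAI installed):
# llm = OpenAI(api_token=load_api_key())
```

With this approach the key never appears in the source file, so sharing the script or committing it to version control does not expose it.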
You are now set to converse with your data.
To query your data, pass your dataframe and your prompt to the instance of the PandasAI class. Start by printing the first five rows of your dataset.
pandas_ai(df, prompt='What are the first five rows of the dataset?')
The output of the above prompt is as follows:
This output is identical to the dataset overview shown earlier, so PandasAI answered this query correctly.
Then, check the number of columns present in your dataset.
pandas_ai(df, prompt='How many columns are in the dataset?')
It returns 10, which is the correct number of columns in the California Housing Prices dataset.
Next, check whether there are any missing values in the dataset.
pandas_ai(df, prompt='Are there any missing values in the dataset?')
PandasAI returns that the total_bedrooms column has 207 missing values, which is again correct.
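Since an LLM generates these answers, it can be worth cross-checking them with plain Pandas. The sketch below shows the equivalent calls on a small stand-in dataframe. The column names follow the Kaggle dataset, but the three rows are made up for illustration, so the numbers here are not the real dataset's numbers. With the real file, you would run the same calls on the df loaded earlier.

```python
import pandas as pd

# Hypothetical three-row stand-in for the housing data (values are made up).
sample_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, None, 190.0],  # None becomes a missing value (NaN)
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "ISLAND"],
})

first_rows = sample_df.head()                 # first five rows (fewer if the frame is shorter)
n_columns = sample_df.shape[1]                # number of columns
missing_per_column = sample_df.isna().sum()   # count of missing values per column

print(first_rows)
print(n_columns)
print(missing_per_column[missing_per_column > 0])
```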
There are many simple tasks that you can accomplish using PandasAI; you are not limited to the ones above.
PandasAI is not limited to simple tasks. You can also use it to run complex queries on the dataset. For example, to determine the number of houses in the housing dataset that are located on an island, have a value of more than 100,000 dollars, and have more than 10 bedrooms, you can use the prompt below.
pandas_ai(df, prompt="How many houses have a value greater than 100000,"
          " are on an island and have more than 10 total bedrooms?")
The correct answer is five, which is the same result that PandasAI returns.
Complex queries might take a data analyst some time to write and debug. The above prompt takes only two lines of natural language to accomplish the same task. You only need to know exactly what you want to accomplish, and PandasAI will take care of the rest.
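For comparison, here is a rough sketch of how the same question could be written by hand in Pandas using a boolean mask. It assumes the Kaggle column names median_house_value, total_bedrooms, and ocean_proximity, and it runs on four made-up rows, so the count it prints applies to this sample only. The code that PandasAI generates internally may look different.

```python
import pandas as pd

# Hypothetical rows; column names follow the Kaggle housing dataset.
sample_df = pd.DataFrame({
    "median_house_value": [452600.0, 90000.0, 287500.0, 450000.0],
    "total_bedrooms": [129.0, 8.0, 6.0, 512.0],
    "ocean_proximity": ["NEAR BAY", "ISLAND", "ISLAND", "ISLAND"],
})

# Combine the three conditions; each comparison yields a boolean Series.
mask = (
    (sample_df["median_house_value"] > 100_000)
    & (sample_df["ocean_proximity"] == "ISLAND")
    & (sample_df["total_bedrooms"] > 10)
)

matching_rows = int(mask.sum())  # True counts as 1, so the sum is the row count
print(matching_rows)  # 1 for these sample rows
```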
Drawing Charts Using PandasAI
Charts are a vital part of any data analysis process. They help data analysts visualize the data in a human-friendly manner. PandasAI also has a chart-drawing feature. You just have to pass the dataframe and the instruction.
Start by creating a histogram for each column in the dataset. This will help you visualize the distribution of the variables.
pandas_ai(df, prompt="Plot a histogram for each column in the dataset")
The output is as follows:
PandasAI was able to draw a histogram for every column without you having to pass the column names in the prompt.
PandasAI can also plot charts without you telling it explicitly which chart to use. For example, you may want to find out how the variables in the housing dataset correlate with each other. To achieve this, you can pass a prompt as follows:
pandas_ai(df, prompt="Plot the correlation in the dataset")
PandasAI responds by plotting a correlation matrix, and it chooses a heatmap for it without being told to.
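The numbers behind such a heatmap can be computed directly in Pandas. The sketch below is an illustration on made-up rows with Kaggle-style column names, not the code that PandasAI runs. It keeps only the numeric columns and computes their pairwise Pearson correlation. A plotting library could then render the resulting matrix as a heatmap.

```python
import pandas as pd

# Hypothetical rows; values are made up for illustration.
sample_df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1],
    "median_house_value": [452600.0, 352100.0, 341300.0, 342200.0, 150000.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

# Drop text columns first: Pearson correlation is defined for numbers only.
numeric_df = sample_df.select_dtypes(include="number")
corr_matrix = numeric_df.corr()

print(corr_matrix.round(2))
```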
Passing in Multiple Dataframes to the PandasAI Instance
Working with multiple dataframes can be tricky, especially for someone who is new to data analysis. PandasAI bridges this gap: all you need to do is pass both dataframes and start using prompts to manipulate the data.
Create two dataframes using Pandas.
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance']
}
salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}
employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)
You can ask PandasAI a question that cuts across both of the dataframes. You only have to pass both dataframes to the PandasAI instance.
pandas_ai([employees_df, salaries_df], "Which employee has the largest salary?")
It returns Olivia, which is again the correct answer.
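To see what this question involves without PandasAI, the sketch below joins the two dataframes on their shared EmployeeID column and looks up the row with the highest salary. This is one plausible manual equivalent, shown for comparison. It is not necessarily the code that PandasAI generates.

```python
import pandas as pd

employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
})
salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
})

# Join the two tables on the column they share.
merged_df = employees_df.merge(salaries_df, on="EmployeeID", how="inner")

# idxmax returns the index label of the row with the largest salary.
top_earner = merged_df.loc[merged_df["Salary"].idxmax(), "Name"]
print(top_earner)  # Olivia
```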
Performing data analysis has never been easier: PandasAI lets you chat with your data and analyze it with ease.
Understanding the Technology That Powers PandasAI
PandasAI simplifies data analysis, which saves data analysts a lot of time. But it abstracts away what is happening in the background. Familiarize yourself with generative AI so that you have an overview of how PandasAI operates under the hood. This will also help you keep up with the latest innovations in the generative AI domain.