To analyze a dataset, you first need to understand the data. Sometimes you have no prior knowledge of a dataset, which keeps you from getting the most out of it. As a data analyst, you can use exploratory data analysis (EDA) to get to know your dataset before in-depth analysis.
EDA investigates a dataset to gain meaningful insights. Performing it involves querying information about the structure and contents of the dataset.
Installing the Gota Package
The Gota package is one of the most popular packages for data analysis in Go; it’s like the Python Pandas package but for Go. Gota provides many methods for analyzing datasets, along with functions for reading the JSON, CSV, and HTML formats.
Run this command in your terminal, from the directory where you initialized your Go module:
go get -u github.com/go-gota/gota
The command adds Gota to your module’s dependencies (recorded in the go.mod file), ready for you to import and use.
Just like Pandas, Gota supports series and dataframe operations. The Gota package has two sub-packages: series and dataframe. You can import either one or both, depending on your needs.
import (
"github.com/go-gota/gota/series"
"github.com/go-gota/gota/dataframe"
)
Reading a Dataset Using the Gota Package
You can use any CSV file you like, but the following examples show results from a Kaggle dataset containing laptop price data.
Gota lets you read the CSV, JSON, and HTML formats to create dataframes using the ReadCSV, ReadJSON, and ReadHTML functions of the dataframe package. Here’s how you load a CSV file into a dataframe object:
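// Open the CSV file; stop if it can't be opened.
file, err := os.Open("/path/to/csv-file.csv")
if err != nil {
	log.Fatal(err)
}
defer file.Close()

// Parse the file contents into a dataframe and print it.
dataFrame := dataframe.ReadCSV(file)
fmt.Println(dataFrame)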
You can use the Open function of the os package to open a CSV file. The ReadCSV function reads from the opened file and returns a dataframe object.
When you print this object, the output is in a tabular format. You can further manipulate the dataframe object using the various methods Gota provides.
If a dataset has more columns than a set limit, printing the object shows only some of them.
Fetching the Dimensions of the Dataset
The dimensions of a dataframe are the number of rows and columns it contains. You can fetch these dimensions using the Dims method of the dataframe object.
var rows, columns = dataFrame.Dims()
Replace one of the variables with an underscore to fetch the other dimension only. You can also query the number of rows and columns individually, using the Nrow and Ncol methods.
var rows = dataFrame.Nrow()
var columns = dataFrame.Ncol()
Fetching the Data Types of Columns
You’ll need to know the data types of a dataset’s columns to analyze it. You can fetch these using the Types method of your dataframe object:
var types = dataFrame.Types()
fmt.Println(types)
The Types method returns a slice containing the columns’ data types, in the same order as the columns.
Fetching the Column Names
You’ll need the column names to select specific columns for operations. You can use the Names method to fetch them.
columnNames := dataFrame.Names()
fmt.Println(columnNames)
The Names method returns a slice of the column names.
Checking for Missing Values
You might have a dataset that contains missing values, which Gota represents as NaN. You can check for such values using the HasNaN and IsNaN methods of a series object:
aCol := dataFrame.Col("display_size")
var hasNull = aCol.HasNaN()
var isNotNumber = aCol.IsNaN()
HasNaN checks whether a column contains any NaN elements. IsNaN returns a slice of booleans, one per value in the column, that is true wherever the value is NaN.
Performing Descriptive Statistical Analysis
Descriptive statistical analysis helps you understand the distribution of numerical columns. Using the Describe method, you can generate a descriptive statistical analysis of your dataset:
description := dataFrame.Describe()
fmt.Println(description)
The Describe method returns metrics like the mean, standard deviation, and maximum values of columns in a dataset. It summarizes these in a tabular format.
You can also be specific and focus on columns and metrics by selecting a particular column, then querying for the metric you want. You should first fetch the series representing a specific column, then use its methods like so:
aCol := dataFrame.Col("display_size")
var mean = aCol.Mean()
var median = aCol.Median()
var minimum = aCol.Min()
var standardDeviation = aCol.StdDev()
var maximum = aCol.Max()
var quantile25 = aCol.Quantile(0.25) // the argument is a fraction between 0 and 1
These methods mirror the results from the descriptive statistical analysis that Describe performs.
Fetching the Elements in a Column
One of the final tasks you’ll want to perform is to check the values in a column for a general overview. You can use the Records method to view the values of a column.
aCol := dataFrame.Col("brand")
fmt.Println(aCol.Records())
This method returns a slice of strings containing the values in your selected column.
Exporting a Gota Dataframe to a File
If you choose to go further and use the Gota package for full data analysis, you’ll need to save data in files. You can use the WriteCSV and WriteJSON methods of a dataframe object to export it. Both methods take a file that you can create using the os package’s Create function.
Here’s how you can export a dataframe using the Gota package.
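dataFrame := dataframe.ReadCSV(file)

// Create the output file, or truncate it if it already exists.
outputFile, err := os.Create("output.csv")
if err != nil {
	log.Fatal(err)
}
defer outputFile.Close()

// Write the dataframe contents to the file as CSV.
if err := dataFrame.WriteCSV(outputFile); err != nil {
	log.Fatalln("error writing the dataframe contents to the file:", err)
}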
The dataFrame variable holds the dataframe. The Create function of the os package creates a new, empty file with the specified name (or empties the file if it already exists) and returns it. The WriteCSV method takes the file instance and returns an error, or nil if the write succeeds.
Exploratory Data Analysis Is Important
An understanding of data and datasets is essential for data analysts and machine learning specialists. Building that understanding is a critical step in their work cycle, and exploratory data analysis is one of the techniques they use to achieve it.
There’s more to the Gota package. You can use it for various data wrangling functions in the same way that you’d use the Python Pandas library for data analysis. However, Gota doesn’t support quite as much functionality as Pandas.