Data Science is a multidisciplinary field that combines various techniques and processes to extract knowledge from structured and unstructured data.

One of the key concepts in this field is data cleaning and preprocessing. Data sources often come with information that is not useful for the project we are working on, or that must be cleaned and processed to deal with missing values, inconsistencies, and other problems.

Python has many powerful tools to achieve this. I am going to use the TripAdvisor Rating Impact on Hotel Popularity dataset from Kaggle to show some of the functions we can use for data cleaning and manipulation.

You can download the original dataset at the following URL:

https://www.kaggle.com/datasets/jocelyndumlao/tripadvisor-rating-impact-on-hotel-popularity

For this exercise, I’ll use Jupyter Notebook because it’s easy to run commands cell by cell, so I can put each task in an individual cell that I can run separately. I will leave the link to the github repository at the end of the post.

First of all, we need to import the Pandas library under its common alias, pd:

import pandas as pd

Then we need to load the dataset into a Data Frame object

df = pd.read_csv("./datasets/tripadvisor-rating-impact-on-hotel-popularity/data_rdd.csv",
                 encoding_errors="ignore")

The previous line of code reads the CSV file named ‘data_rdd.csv’ located in the “./datasets/tripadvisor-rating-impact-on-hotel-popularity/” directory. It then loads the data from the CSV file into a DataFrame, which is stored with the variable name “df.” The “encoding_errors” parameter is set to “ignore”, which means any encoding errors encountered during reading will be disregarded.

One of the first things we can do is get summary information for an initial insight into the object, using the DataFrame info() method. This shows the index dtype, the columns, non-null counts, and memory usage.

df.info()

This particular dataset is composed of 272 columns, so in this case df.info() will show a reduced summary without individual column detail information.
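If you still want the full per-column listing on a wide frame, info() accepts the verbose and show_counts parameters. A minimal sketch on a toy DataFrame (the column names here are made up, not from the Kaggle file):

```python
import io
import pandas as pd

# Toy stand-in for the real dataset (hypothetical columns)
df = pd.DataFrame({"name": ["Hotel A", "Hotel B"], "rating": [4.5, None]})

# Force the full per-column listing even on very wide frames;
# capture the output in a buffer so we can inspect it as a string
buf = io.StringIO()
df.info(verbose=True, show_counts=True, buf=buf)
summary = buf.getvalue()
print(summary)
```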

**Describing statistical information:**

Now that we have the structure of the DataFrame, we can view some statistical information about the numerical columns. This is done using the describe() method.

df.describe()

This will show the following information from each numerical column:

- Count: The number of non-null values in the column
- Mean
- Standard deviation (std)
- Min value
- 25%, 50% and 75%: These represent percentiles of the data distribution. Percentiles are statistical measures that indicate the relative standing of a particular data point within a dataset. 25% is the first quartile, the point below which 25% of the data falls. 50% is the second quartile, the median of the dataset. 75% is the third quartile, the point below which 75% of the data falls, separating the lowest 75% of the data from the upper 25%.
- Max value
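To make the quartile definitions concrete, here is a toy Series (invented values, not from the dataset) whose describe() output can be checked by hand:

```python
import pandas as pd

# Five evenly spaced values make the quartiles easy to verify manually
s = pd.Series([10, 20, 30, 40, 50])
stats = s.describe()

# count=5, mean=30, min=10, 25%=20, 50%=30 (the median), 75%=40, max=50
print(stats)
```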

**Showing some data:**

Other common functions used to get a quick view of the structure and some data are the head and tail methods. Used without any additional parameters, these commands show the list of columns with the first 5 rows of the DataFrame.

df.head()

You can modify the number of rows shown by entering the desired number inside the parenthesis, for example, to show the first 20 rows:

df.head(20)

The tail method shows the same but starting from the last rows. So if you want to show the last 5 rows of the Data Frame you can use the following:

df.tail()

As with df.head(), it can be used with a number inside the parentheses to select the desired number of rows to be shown.

**Show column data types:**

Knowing the type of data is very important in data analysis and manipulation. Datasets can come with a variety of data types, and it is often necessary to convert from one type to another for specific tasks. You can use the dtypes attribute to show the data type of each column.

df.dtypes

In this dataset we have three different types of data.

- int64: Represents 64-bit integer numbers.
- float64: Denotes 64-bit floating-point (decimal) numbers.
- object: Typically used for string or text data.
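A quick toy illustration (hypothetical columns, not the real dataset) of how these three dtypes show up:

```python
import pandas as pd

# One column of each dtype mentioned above
df = pd.DataFrame({
    "views": [100, 200],              # inferred as int64
    "rating": [4.5, 3.9],             # inferred as float64
    "name": ["Hotel A", "Hotel B"],   # inferred as object
})

types = df.dtypes
print(types)
```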

**Handling missing values**

Identifying and working with missing values is a very common task when dealing with datasets and Python has a lot of tools to work with.

First of all, we can list the number of missing values in each column of our DataFrame by iterating over df.columns.

The isna() function is used to detect missing values and returns True or False indicating whether the evaluated value is empty or NA. Missing values can appear simply as empty values or as NaN values. NaN stands for “Not a Number” and is a special value representing the absence of a valid number, typically used to indicate missing or undefined data in numeric datasets. In Python, NaN is a specific floating-point value, while None is a more general object used to represent missing or undefined values in non-numeric data.
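A small sketch of the NaN/None distinction described above (toy values, not from the dataset):

```python
import numpy as np
import pandas as pd

# NaN is literally a float value; None is a generic Python object
assert isinstance(np.nan, float)

# In a numeric Series, pandas stores None as NaN, so isna() flags both
s = pd.Series([1.0, None, np.nan])
mask = s.isna()
print(mask)
```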

for column in df.columns:
    print("Column", column, "has", df[column].isna().sum(), "empty values.")

The above loop finds the total number of NaN values in each column of the DataFrame; chaining sum() after isna() returns the total number of NaN values for each feature.

You can also show missing values for one specific feature or column. In this case it will return rows with NA values in the column “photos”. Notice that this will not show the total number of missing values, but list the content of rows that have NaN values in the “photos” column.

df[df["photos"].isna()]

You can also add a list of columns at the end of the line to limit the output to the desired features. In this example I will use “name” and “photos”.

df[df["photos"].isna()][["name", "photos"]]

**Replace missing values:**

fillna() is a method in Pandas used to fill or replace missing values in a Data Frame or Series with specified values.

First, we can use isna() to identify missing values in a particular column.

df[df["location_grade"].isna()][["name", "location_grade"]]

The above line will display the NaN values in the “location_grade” column, showing the names and location_grades of the DataFrame. Rows 4498, 4500, 4516, 4528 and 4587, among others, have missing values.

Then we can use the fillna() method to replace NaN values with a simple 0. fillna() accepts an inplace=True parameter to modify a DataFrame in place, but in recent versions of Pandas calling it with inplace=True on a single column selection is discouraged, because it may operate on a copy; the recommended pattern is to assign the result back to the column.

df["location_grade"] = df["location_grade"].fillna(0)

Now we can check the results and see the new output. “iloc” stands for Integer Location and it is used to select rows and columns by their integer-based positions. By specifying the index numbers of the rows, we can filter the output to see a group of records. In this case I passed a list with the same index numbers that we evaluated before to see the changes.
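Since iloc is described here but not shown, the following is a minimal sketch on a toy frame; the row positions are illustrative, not the dataset's actual 4498, 4500, … indices:

```python
import pandas as pd

# Toy frame with gaps in location_grade (hypothetical data)
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "location_grade": [3.0, None, 4.5, None, 5.0],
})

# Replace missing grades with 0, assigning the result back to the column
df["location_grade"] = df["location_grade"].fillna(0)

# Select rows by integer position to verify the replacement
checked = df["location_grade"].iloc[[1, 3]]
print(checked)
```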

The fillna() function can also be combined with other functions like max(), min() or mean(). There are cases where we need to replace missing values with calculated values; these are called aggregation functions and are useful when you want to use a general summary statistic to replace missing values.

For example, we can list NA values for the price_min column as we did before, but in this case, we will use aggregation functions.

#Show NA values for the price_min column.
df[df["price_min"].isna()][["name", "price_min"]]

Before replacing missing values, we can check the mean of all values in “price_min” column.

df["price_min"].mean()

Then we can use fillna() method to replace values.

#Replace NaN values with the mean() in the price_min column
df["price_min"] = df["price_min"].fillna(df["price_min"].mean())

Once it’s done, we can check the results as we did before, but with the index numbers of the desired rows.

We can also use fillna() to replace values in a group of columns. To do this, we pass a key/value dictionary with the proper values, as shown below.

#Replace NaN values with the mean(), min() and max() in two or more columns (location_grade, price_curr_min and price_min).
df.fillna({"location_grade": df["location_grade"].mean(),
           "price_curr_min": df["price_curr_min"].max(),
           "price_min": df["price_min"].min()}, inplace=True)

**Removing rows**

An alternative way of handling missing values is to remove them from the Data Frame. This is done using the dropna() method. In the following example we use print(df.shape[0]) to show the number of rows in the Data Frame before and after.

#Delete rows with at least one missing value
print("Before shape:", df.shape[0])
df.dropna(inplace=True)
print("After shape:", df.shape[0])

You can also include parameters to provide control over which rows and columns to consider for removal.

“how”: Specifies the condition for dropping rows or columns based on the presence of missing values. It can take one of the following values:

- “any”: Removes any row or column that contains at least one missing value.
- “all”: Removes a row or column only if all its values are missing.

“axis”: Determines whether you want to drop rows or columns. axis=0 (the default) removes rows, and axis=1 removes columns.

“subset”: Allows you to specify a subset of columns to consider for missing values; it is passed as a list.
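These three parameters can be sketched on a toy DataFrame (the columns a, b and c are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [1.0, None, None],
    "c": [1.0, 2.0, 3.0],
})

# how="any": drop a row if ANY of the subset columns is missing
kept_any = df.dropna(how="any", subset=["a", "b"])

# how="all": drop a row only if ALL of the subset columns are missing
kept_all = df.dropna(how="all", subset=["a", "b"])

# axis=1: drop COLUMNS that contain at least one missing value
kept_cols = df.dropna(axis=1, how="any")

print(len(kept_any), len(kept_all), list(kept_cols.columns))
```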

One example of dropping rows with missing values using these additional parameters could be the following:

df.dropna(subset=["column_1", "column_2"], how="all", inplace=True)

The above line will remove rows where both “column_1” and “column_2” have missing (NaN) values in the same row (the column names here are placeholders).

**Missing values imputation**

Missing values imputation is a crucial step in data preprocessing for data analysis and machine learning. It involves filling in or estimating missing data points in a dataset to make it more suitable for analysis.

Python has various imputation methods, including mean/median imputation, mode imputation and backward/forward fill imputation, among others.

*Mean imputation:*

#Impute missing numerical values.

#Mean Imputation: Replace missing values with the mean (average) value
#of the non-missing values in the same column.

#Show NaN values for price_min column.
print("price_min has", df["price_min"].isna().sum(), "NaN values.")

#show rows with NaN values for price_min column

df[df["price_min"].isna()][["name", "price_min"]].head()

#show mean value for price_min

print("Mean value for price_min column is:", df["price_min"].mean())

#fill NA's with the mean
df["price_min"] = df["price_min"].fillna(df["price_min"].mean())

#check for result

print("\nNow, column price_min has", df["price_min"].isna().sum(), "NaN values.")

df["price_min"].iloc[[790, 858, 859, 866, 893]]

*Median imputation:*

Median imputation is a data cleaning technique where missing values are replaced with the median value of the respective column. This method helps maintain the central tendency of the data and can be particularly useful when outliers are present. Median imputation ensures that missing values do not significantly affect the overall statistical properties of the dataset while preserving its integrity.
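A toy example of why the median is more robust than the mean when an outlier is present (the prices below are invented):

```python
import pandas as pd

# One extreme outlier (100000.0) drags the mean far from typical prices
prices = pd.Series([100.0, 120.0, 110.0, None, 100000.0])

mean_filled = prices.fillna(prices.mean())      # mean = 25082.5, distorted
median_filled = prices.fillna(prices.median())  # median = 115.0, representative

print(mean_filled.iloc[3], median_filled.iloc[3])
```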

#Median Imputation: Replace missing values with the median value of

#the non-missing values in the same column.

#This is less sensitive to outliers compared to mean imputation.

#Show NaN values for price_min column.

print("price_min has", df["price_min"].isna().sum(), "NaN values.")

#show rows with NaN values for price_min column

df[df["price_min"].isna()][["name", "price_min"]].head()

#show median value for price_min

print("Median value for price_min column is:", df["price_min"].median())

#fill NA's with the median
df["price_min"] = df["price_min"].fillna(df["price_min"].median())

#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "NaN values.")

df["price_min"].iloc[[790, 858, 859, 866, 893]]

*Mode imputation:*

Mode imputation is a simple way to preserve the original distribution of categorical data while filling in gaps.

#Mode Imputation:

#you can replace missing values with the mode (most frequent) value in the same column.

#Show NaN values for price_min column

print("price_min has", df["price_min"].isna().sum(), "NaN values.")

#show rows with NaN values for price_min column

df[df["price_min"].isna()][["name", "price_min"]].head()

#show most frequent value for price_min

print("Mode value for price_min column is:", df["price_min"].mode()[0])

#fill NA's with the most frequent value
df["price_min"] = df["price_min"].fillna(df["price_min"].mode()[0])

#check for result

print("\nNow, column price_min has", df["price_min"].isna().sum(), "NaN values.")

df["price_min"].iloc[[790, 858, 859, 866, 893]]

*Forward fill imputation:*

Forward fill imputation fills missing values with the most recent non-missing value in the column, propagating the last known value forward.

In the following example, the last non-missing value in the column is 5438.0, at row 789. Row 790 holds the first NaN value in the column, so that NaN will be replaced by 5438.0. As can be seen in the final output after the ffill method, row 790 now has the imputed value. This process is repeated for each row that has a NaN value in the price_min column.

#Forward fill imputation. Fill missing values with the most recent non-missing value in the column.

#show rows with NaN values for price_min column
print("Rows with missing values")
print(df[df["price_min"].isna()][["name", "price_min"]])

#show price_min for row 789 (last non-missing value of the column)

print("First value for forward fill will be:", df["price_min"].iloc[789])

#fill NA's with forward fill imputation
df["price_min"] = df["price_min"].ffill()

#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "NaN values.")

df["price_min"].iloc[[790, 858, 859, 866, 893]]

*Backward fill imputation:*

It is the same as forward fill imputation, with the difference that NaN values are replaced by the next valid observation in the column, propagating values backward through the DataFrame. Usage is the same as ffill.

In this example, the value used to replace row 790 will be the value of row 791, as it is the next valid observation in the column after the first NaN value in row 790.
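A compact toy comparison of ffill and bfill, reusing the 5438.0 value mentioned above (the other numbers are invented):

```python
import pandas as pd

# Two gaps between two known values
s = pd.Series([5438.0, None, None, 7000.0])

forward = s.ffill()   # gaps take the last value seen above: 5438.0
backward = s.bfill()  # gaps take the next value below: 7000.0

print(forward.tolist(), backward.tolist())
```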

#Backward fill imputation. Fill missing values with next valid observation in the column

#show rows with NaN values for price_min column

print("Rows with missing values")
print(df[df["price_min"].isna()][["name", "price_min"]])

#show price_min for row 791 (next valid value of the column)

print("First value for backward fill will be:", df["price_min"].iloc[791])

#fill NA's with backward fill imputation
df["price_min"] = df["price_min"].bfill()

df["price_min"].iloc[[790, 858, 859, 866, 893]]

*Dropping rows and columns:*

Sometimes it is necessary to remove missing values instead of replacing them with others. For this task we have the dropna() function.

dropna() can be used on both rows and columns; here is an overview of how it works.

Drop all rows with at least one missing value:

#Drop rows with at least one missing value
#df.shape[0] shows the number of rows in the DataFrame

print("Before shape:", df.shape[0])

df.dropna(inplace=True)

print("After shape:", df.shape[0])

*Drop all columns with at least one missing value:*

#Drop all columns with at least one missing value

#df.shape[1] shows the number of columns in the DataFrame

print("Before shape:", df.shape[1])

df.dropna(inplace=True, axis=1)

print("After shape:", df.shape[1])

*Drop rows with missing values in one specific column, when more than 15% of its values are missing:*

#Drop rows with missing values in one specific column when more than 15% of its values are missing.

#will use price_min for this example

#df.shape[0] shows the number of rows in the DataFrame
print("Before shape:", df.shape[0])

if df["price_min"].isna().sum() / df.shape[0] > 0.15:
    df.dropna(subset=["price_min"], inplace=True)

print("After shape:", df.shape[0])

**Handling duplicates**

Handling duplicates is a critical step in data science. Duplicates can skew statistical analysis, lead to incorrect results, and consume unnecessary computational resources. By identifying and removing duplicates, you can ensure the accuracy and reliability of the analysis, enabling more informed decision-making. Additionally, eliminating duplicates streamlines data processing, making it more efficient and reducing processing time.
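Before the dataset examples, here is a toy sketch of duplicated() and drop_duplicates() (the prices are invented):

```python
import pandas as pd

# Rows 0 and 1 are identical across all columns
df = pd.DataFrame({
    "price_min": [100, 100, 200],
    "price_max": [150, 150, 250],
})

# duplicated() flags every occurrence after the first
dup_mask = df.duplicated()

# drop_duplicates() keeps the first row of each duplicate group
deduped = df.drop_duplicates()

print(dup_mask.tolist(), len(deduped))
```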

Count duplicated values in each column:

#Handling Duplicates:

#Count duplicated values in each column.

for column in df.columns:
    print("Column", column, "has", df[column].duplicated().sum(), "duplicated values")

*Detect and remove duplicated rows for a group of columns:*

#Handling Duplicates:

#Detect and remove all duplicated rows for a group of columns.

#df.shape[0] returns the number of rows in the DataFrame

print("Before shape:", df.shape[0])

df.drop_duplicates(subset=["price_min", "price_max", "views_binary"], inplace=True)

print("After drop duplicates shape:", df.shape[0])

*Remove all duplicated rows for a group of columns. Drop all except the first occurrence:*

#Handling Duplicates:

#Detect and remove all duplicated rows for a group of columns.

#Drop all except first occurrence

print("Before shape:", df.shape[0])

df.drop_duplicates(subset=["price_min", "price_max", "views_binary"], keep="first", inplace=True)

print("After drop duplicates shape:", df.shape[0])

**Data type conversions:**

Data type conversion in Pandas is vital for data integrity, memory efficiency, and analysis. It ensures proper data handling for operations like filtering, sorting, and statistical analysis. Consistent data types are essential for data visualization and machine learning compatibility. Overall, it’s a foundational step in data preparation for robust data science workflows.

You can apply several functions to do many tasks in data type conversion like the following:

**Convert all numeric columns to string:**

#Data Type Conversion:

#Convert all numeric columns to string

import numpy as np

#print number of numerical columns before converting

#df.shape[1] returns the number of columns in the Data Frame

print("Number of numerical columns before conversion is",
      df.select_dtypes(include=np.number).shape[1])

#generate a list of numeric columns

columns_list = list(df.select_dtypes(include=np.number))

for item in columns_list:
    df[item] = df[item].astype(str)

#print number of numerical columns after converting

print("Number of numerical columns after conversion is",
      df.select_dtypes(include=np.number).shape[1])

*Convert one numerical column to string:*

#Data Type Conversion:

#Convert one numerical column to string

print("Before conversion, price_min is", df["price_min"].dtypes)

df["price_min"] = df["price_min"].astype(str)

print("After conversion, price_min is", df["price_min"].dtypes)

**Convert one categorical 0/1 column to no/yes:**

#Data Type Conversion
#Convert one categorical 0/1 column to categorical no/yes

#will use views_binary as categorical column

print("Before conversion:", df["views_binary"].head(7))

#convert column to string
df["views_binary"] = df["views_binary"].astype(str)

#map 0 for no and 1 for yes
df["views_binary"] = df["views_binary"].map({"0": "no", "1": "yes"})

#check result
print("\nAfter conversion:", df["views_binary"].head(7))

**Renaming columns:**

*Rename one column by name:*

#Renaming Columns:

#Rename column by name

#show column names

print(df.columns)

#change hotel_url to hotel_url_address
df.rename(columns={"hotel_url": "hotel_url_address"}, inplace=True)

#check result
print("\n", df.columns)

*Rename columns using a list:*

#Renaming Columns:

#Rename a list of columns

#print column names

print(df.columns)

#create list of columns to be renamed

original_columns = ["name", "views", "views_binary"]
new_columns = ["name_x", "views_x", "views_binary_x"]

#create dictionary from the two lists above using a dict comprehension
#(named rename_dict to avoid shadowing the built-in dict)
rename_dict = {original_columns[i]: new_columns[i] for i in range(len(original_columns))}

df.rename(columns=rename_dict, inplace=True)

#check for result
print("\n", df.columns)

*Rename all columns. Convert to lowercase:*

#Renaming Columns:

#Rename all columns. Convert to lowercase

#show column names (the first column "Unnamed 0" is capitalized)

print(df.columns)

#convert to lowercase
df.rename(str.lower, axis="columns", inplace=True)

#check result
print("\n", df.columns)