
Exploratory data analysis to unveil patterns in car insurance data

Khalid Bin Rashid


The dataset can be viewed and downloaded from here.

Steps to follow

To perform an Exploratory Data Analysis (EDA) on the dataset, we will follow these main steps:

1. Data Inspection

2. Data Cleaning

3. Data Visualisation

4. Hypothesis Testing

Step 1: Data Inspection

Data inspection is the first step in any analysis, whether the goal is to build models or perform an EDA, as in our case.

The aim is to examine different aspects of the data to identify any issues that need attention during the cleaning process.

Think of it as a check-up done by doctors before making a diagnosis or prescription.

Importing Necessary Libraries

For this analysis, we’ll primarily use `pandas` for data manipulation, `numpy` for numerical operations, and `matplotlib` along with `seaborn` for data visualisation.
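A minimal sketch of those imports, using the standard aliases:

```python
import pandas as pd                # data manipulation
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt   # plotting
import seaborn as sns              # statistical visualisation
```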


Loading the Data

Using the pandas library, you can easily load data with functions like read_csv, read_excel, etc., depending on the file format. Since our dataset is in CSV format, we’ll use the read_csv function to load it.
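A minimal sketch of the loading step; the file name `car_insurance.csv` is an assumption about how the download was saved:

```python
# Load the dataset (the file name is an assumption)
df = pd.read_csv("car_insurance.csv")

# (rows, columns) — expected to be (205, 26) here
print(df.shape)
```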


In the code above, we can see the number of rows and columns, but there’s more we need to investigate.

How many columns have missing values? How can we view all 26 columns and understand what they represent?

Accessing the columns

To get a comprehensive overview of the dataset, including details like the data types, number of non-null entries, and column names, we can use the info() method in pandas.

This method provides everything we need to know about the dataset at a glance.
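For example:

```python
# Column names, non-null counts, dtypes, and memory usage in one call
df.info()
```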


In the output above, not every column is listed, so we can only partially inspect the dataset’s features.

From those displayed, we can identify columns with missing values, the data types of each column, and the memory usage.

It can be observed that 12 columns are of float data type, 6 are integers, and 8 are strings/objects. This means that 18 columns are numerical, while 8 are categorical or ordinal.

Dealing with Null values

Dealing with null values is essential and cannot be avoided. For example, out of the 205 total rows, the price column has only 201 non-null entries, horsepower and peak-rpm have 203, and stroke and bore have 201, among others.

To get a clearer picture, we can use the isna() method followed by sum() to return the number of null values in each column.
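A minimal sketch:

```python
# Missing-value count per column
print(df.isna().sum())
```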


This method reveals everything we need to know about missing values, showing that we need to address seven columns in total.

Duplicate Rows

Another important aspect to check is the presence of duplicate rows. Thankfully, Pandas makes this process simple.

By using the duplicated() method followed by sum(), we can quickly determine the total number of duplicate rows in the dataset.
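A minimal sketch:

```python
# Total number of fully duplicated rows
print(df.duplicated().sum())
```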


Since the return value is 0, we can confidently say that there are no duplicate rows in the dataset.

Statistical Representation of Data

To carry out this assessment, the describe() method in Pandas provides an easy way to view key statistical values for all numeric features. See the code and result below for an example.
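A minimal sketch:

```python
# Summary statistics for all numeric columns
print(df.describe())
```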


Information about the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum value for each column can be easily accessed.

This information suggests that most features might have distributions close to normal and hence no outliers. That can be verified using a histogram and a box plot.
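A sketch of both checks for a handful of numeric columns; the column names mirror the ones analysed below and are assumptions about this particular CSV:

```python
cols = ["highway-mpg", "city-mpg", "peak-rpm",
        "horsepower", "height", "normalized-losses"]

# Histograms: a symmetric, bell-shaped plot hints at normality
df[cols].hist(figsize=(12, 8), bins=20)
plt.tight_layout()
plt.show()

# Box plots: points beyond the whiskers are potential outliers
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, col in zip(axes.ravel(), cols):
    df[col].plot(kind="box", ax=ax)
plt.tight_layout()
plt.show()
```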


From the histograms above, none of them seem to perfectly follow a normal distribution. Here’s a quick analysis of each:

  1. highway-mpg: This distribution appears bimodal (two peaks) rather than bell-shaped and normally distributed. The bimodality suggests two distinct groups, perhaps faster and slower cars.
  2. city-mpg: Similar to highway-mpg, this distribution is not normally distributed. It also appears bimodal with two peaks.
  3. peak-rpm: This histogram is quite irregular with a few prominent peaks, suggesting that the data is skewed and not normally distributed.
  4. horsepower: The distribution is skewed to the right, meaning there are more data points with lower horsepower, and the tail extends towards higher horsepower values. This is not a normal distribution.
  5. height: The height histogram is the closest to a normal distribution. It is not perfectly normal but it is pretty close.
  6. normalized-losses: This distribution shows a right skew, where most data points are concentrated on the lower side and a long tail stretches towards higher values. This is also not normally distributed.

A normal distribution has several characteristics: its mean and median are equal, it is bell-shaped and therefore symmetric with no skew, and it has no outliers.

It is a good idea to verify this fully with box plots as well.


The box plots reveal that some of the variables contain outliers. The points outside the whiskers are outliers, and even though they are few, they need to be dealt with.

From this inspection step, we know we have to deal with missing data and outliers.

Step 2: Cleaning the data

Data cleaning involves preparing the data in a way that makes it suitable for analysis.

Sorting Out Missing Rows

From the previous step, we know there are missing values in the price, num-of-doors, peak-rpm, horsepower, stroke, bore, and normalized-losses columns.

While we can replace these with the mean or median based on statistical reasoning, it’s important to also consider the context.

For instance, if a feature is specific to a certain manufacturer, it might be more appropriate to fill in missing values with what’s common for that manufacturer.

First, let’s assess the rows with missing horsepower values to decide the best approach for filling them in.
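A minimal sketch of that check:

```python
# Rows where horsepower is missing
print(df[df["horsepower"].isna()])
```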

– Horsepower and peak-rpm

There are two rows where both horsepower and peak-rpm are missing. A closer look reveals that these rows belong to the same car brand, and notably, these are the only two records for that manufacturer in the dataset.

This indicates that there are no historical values to reference, making it inappropriate to simply use the average or median of the entire column.

This is because a car’s horsepower is influenced by various factors, including the manufacturer.

In a real-world scenario, with insufficient data and an inability to afford losing more rows, the best option would be to research the car brand.

The dataset contains details like the number of doors and the body style (e.g., wagon or hatchback), which could help identify the specific model or a similar one to estimate the missing horsepower.

However, in this case, we will drop these rows. In addition, it’s worth noting that the normalised-losses—which represent the historical losses incurred by an insurance company for a specific car, after being normalised—are also missing.

Since normalised losses are crucial in this analysis, we will exclude rows without this data. If this were a modeling problem, we might consider training a model to predict these losses, and then use the model to estimate the missing values.

– Stroke and bore features

Next, let’s discuss the stroke and bore features. The bore refers to the diameter of the engine’s cylinders, measured in inches. Larger bores allow for larger valves and increased airflow.

The stroke, on the other hand, is the distance the piston travels inside the cylinder, also measured in inches. Longer strokes generally provide more torque.

There may be a correlation between these features and the normalised losses, making them potentially important.

Upon inspecting the rows with missing stroke and bore, we find that four rows are missing both features.

These rows correspond to cars from the Mazda brand. We should look into historical records for Mazda to determine the typical values for stroke and bore in their vehicles.
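A sketch of that lookup; the manufacturer column is assumed to be called `make`:

```python
# All Mazda records, to see whether usable stroke/bore values exist
# (the column name "make" is an assumption)
mazda = df[df["make"] == "mazda"]
print(mazda[["make", "stroke", "bore"]])
```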


The code and result above confirm that historical data is available. While the best approach would be to conduct research, a viable alternative is to replace the missing values with the average values from similar records.

This ensures the replacement values stay within a realistic range. To proceed, we first calculate the average stroke and bore values for Mazda vehicles in the dataset.

Once we have these averages, we can replace the missing values accordingly. This can be done by grouping the data by the car manufacturer and then computing the average for the stroke and bore columns.
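A minimal sketch of the grouping:

```python
# Average stroke and bore per manufacturer
brand_means = df.groupby("make")[["stroke", "bore"]].mean()
print(brand_means.loc["mazda"])  # both averages come out near 3.3
```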


The mean values for Mazda’s stroke and bore are approximately 3.3, rounded to the nearest decimal.

We can replace the missing values (NaN) with these averages. We can apply the same strategy for the missing prices.

First, we identify the rows with missing prices, check the car brands, and then decide the best course of action.

There are three different car brands with missing prices, which also have their normalised losses missing. Since our analysis focuses on normalised losses, we’ll drop rows with missing normalised losses for this analysis.

Handling outliers

Regarding outliers, there are various approaches to handling them depending on the dataset’s intended use. When building models, one option is to use algorithms that are robust to outliers.

If the outliers are due to errors, they can be replaced with the mean or removed altogether.

In this case, however, we will leave them as they are. This decision is based on their scarcity across variables, which allows us to retain all instances in the data.

The outliers are not erroneous; in fact, having more data would enable us to capture a broader range of their instances.

With that decided, let’s proceed with the data cleaning (a code sketch follows the list):

  1. First, remove all rows with missing normalised losses. This will also handle rows with missing prices and horsepower.
  2. Then, fill in the mean stroke and bore values for Mazda only.
  3. Finally, print the sum of all remaining null values to confirm the cleaning process.
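A sketch of those three steps, under the same column-name assumptions as before:

```python
# 1. Drop rows with missing normalised losses (this also removes the
#    rows with missing price and horsepower)
df = df.dropna(subset=["normalized-losses"])

# 2. Fill Mazda's missing stroke and bore with the Mazda averages
mazda_mask = df["make"] == "mazda"
for col in ["stroke", "bore"]:
    mazda_mean = df.loc[mazda_mask, col].mean()
    df.loc[mazda_mask, col] = df.loc[mazda_mask, col].fillna(mazda_mean)

# 3. Confirm the cleaning worked
print(df.isna().sum().sum())  # expect 0
```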

This leaves us with zero missing values in every column.

Drop Unnecessary Columns

At this point, it’s crucial to be clear about the focus of the analysis and remove irrelevant columns.

In this case, the analysis is centered on understanding the features that influence normalised losses.

With a total of 26 columns, we should drop those that are unlikely to contribute to accidents or damage.

A straightforward approach is to examine the correlation of the numeric features with normalised losses. This can be easily done using the corr() method.
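A minimal sketch (the numeric_only flag, available in recent pandas versions, keeps the string columns out of the computation):

```python
# Correlation of every numeric feature with normalised losses
corr = df.corr(numeric_only=True)["normalized-losses"].sort_values(ascending=False)
print(corr)
```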


Step 3: Visualising the data

To effectively perform data visualisation, we can formulate key questions that need answers:

Some examples can be:

  1. Does a certain body type, fuel type, or aspiration lead to increased insurance losses?
  2. What is the relationship between selected features and normalised losses?

To answer the question about body type, we can analyse the relationship between body type and normalised losses (a code sketch follows the list). Here’s how:

  1. Select the body-type and normalised-losses columns.
  2. Group the data by body type and calculate the mean normalised losses for each group.
  3. Use Seaborn’s barplot to visualise the mean normalised losses by body type.
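A sketch of those steps; the column name `body-style` is an assumption. Swapping in aspiration or fuel-type columns produces the plots discussed next:

```python
# Mean normalised losses per body style
# (the column name "body-style" is an assumption)
by_body = df.groupby("body-style")["normalized-losses"].mean().reset_index()

sns.barplot(data=by_body, x="body-style", y="normalized-losses")
plt.ylabel("mean normalised losses")
plt.show()
```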

The resulting plot shows that mean losses indeed vary by body type: convertibles have the highest average losses paid by insurance companies, while wagons have the lowest.

I suspect this is because convertibles tend to be sports cars, built for speed, which can lead to more accidents and, in turn, higher insurance payouts.


Plotting aspiration the same way shows that whether a car is standard (naturally aspirated) or turbocharged has no effect on the losses. I find this surprising.


Checking the fuel type reveals that cars running on gas incur more losses for insurance companies than diesel-powered vehicles.

This could be because gas-powered cars often provide better acceleration, leading drivers to push them harder and potentially increasing the likelihood of accidents.

The next form of analysis is to check the relationship between numerical features.
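A sketch of those scatter plots, under the same column-name assumptions:

```python
features = ["city-mpg", "highway-mpg", "peak-rpm", "horsepower", "height"]

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for ax, col in zip(axes.ravel(), features):
    ax.scatter(df[col], df["normalized-losses"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("normalised losses")
axes.ravel()[-1].axis("off")  # only five features, six panels
plt.tight_layout()
plt.show()
```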


The resulting scatter plots show the relationship between selected features and normalised losses.


Key Takeaways:

  1. Fuel Efficiency: There is a slight positive correlation between a car’s fuel efficiency in both city and highway driving and the normalised losses.
  2. Engine Performance: The engine speed at which maximum horsepower is produced (peak RPM) shows a slight positive correlation with normalised losses. In addition, there is an even stronger positive relationship between horsepower and losses. Higher horsepower often leads to faster acceleration and higher top speeds, which can increase the likelihood of accidents and, consequently, more damage.
  3. Car Height: Taller cars tend to incur fewer losses. This is an example of negative correlation. It might be due to better visibility and enhanced crash protection offered by their taller roofs, particularly in certain types of collisions.
  4. Number of Doors: The number of doors is moderately correlated with normalised losses, with cars having fewer doors (typically two-door cars) incurring more losses. This could be because many two-door cars are sports cars, which are often driven more aggressively.

Step 4: Hypothesis Testing

While visualisations can suggest potential positive relationships, it’s important to test the statistical significance of these relationships.

Are the features truly correlated, or is it just due to chance?

When performing significance tests, several factors must be considered. These include the type of data (e.g., numerical vs. categorical, numerical vs. numerical), the normality of the data, and the variances, as many statistical tests rely on specific assumptions.

In this case, we are comparing our numerical variables with the normalised losses using Spearman’s Rank Correlation since the data distribution is not normal.

Let’s define our hypotheses:

Null Hypothesis (H₀): There is no correlation between the two variables.

Alternative Hypothesis (H₁): There is a correlation between the two variables.

We set the significance level (alpha) at 0.05. If the p-value is less than alpha, we reject the null hypothesis and conclude that there is a significant relationship.

However, if the p-value is greater than alpha, we fail to reject the null hypothesis, meaning no significant correlation is found.
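A minimal sketch of the test with scipy.stats.spearmanr, which returns the correlation coefficient and its p-value:

```python
from scipy.stats import spearmanr

features = ["city-mpg", "highway-mpg", "peak-rpm", "horsepower", "height"]
alpha = 0.05

for col in features:
    rho, p_value = spearmanr(df[col], df["normalized-losses"])
    verdict = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{col}: rho={rho:.3f}, p={p_value:.4f} -> {verdict}")
```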


The results indicate that our assumptions are statistically significant: for each variable, the null hypothesis is rejected.

Conclusion

From the analysis, it is evident that certain factors related to a car’s stability and acceleration capabilities often lead to increased insurance losses.

This comprehensive analysis covered various approaches to data wrangling and visualisation, uncovering valuable insights.

These findings can aid insurance companies in making more informed decisions, predicting potential losses associated with specific car models, and adjusting their pricing strategies accordingly.
