SANTTU
11 Oct 2018

DIY: How to do Information Visualisation efficiently on any given Dataset?

Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?

I will use the CoIL 2000 challenge dataset, which contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes. I aim to show you the trends and features of this dataset through visualization, without actually doing any machine learning.

Dataset

Download the dataset from here: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/

Hypothesis

First and foremost, we need to state the hypotheses. Mine for this dataset are as follows:

  1. People who have a caravan policy are very likely to have a car, because a caravan must be towed by a vehicle (a car is the cheapest and easiest way to move a caravan).
  2. Hence they will also have a car policy, and a fire policy too (caravans have a high risk of catching fire because of the materials used in their construction).
  3. They will also have a boat policy, because boating and caravanning both indicate an outdoor lifestyle.
  4. Caravan policies will mostly be taken by low-status people, who use caravans as mobile homes, and by high-status people, who use them for an outdoor lifestyle.

Introduction

“A picture is worth a thousand words.” It takes a long time to analyze data in a table, but only a couple of seconds to read a visualization.

The huge amount of information available today makes information visualization a necessity. It supports decision making across many departments.

Investigating and analyzing large amounts of data is a genuinely difficult task, but combining data mining or machine learning techniques with information visualization can make it much simpler.

Data Preprocessing and Feature Selection

The CoIL 2000 dataset contains 5822 samples (instances) with 86 features. The 86th feature is binary and records whether the customer is interested in taking a caravan policy or not.

Hence the CoIL dataset provides 85 possible input features.

I cannot rely on all of these variables, so the first important task is to find a reliable subset of features on which to base the hypotheses.

I wrote a simple script in Matlab that counts, for each feature, the customers who have a caravan policy. The features with the highest counts led me to the hypotheses stated above.
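The counting step can be sketched roughly as follows. This is not the original Matlab script but a minimal Python illustration of the same idea; the column names and the five toy rows are made up for the example and do not come from the CoIL data.

```python
from collections import Counter

# Toy rows standing in for the real data; column names are hypothetical.
COLUMNS = ["car_policies", "fire_policies", "boat_policies", "caravan"]
ROWS = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [2, 1, 1, 1],
    [0, 0, 0, 0],
]


def count_hits(rows, columns, target="caravan"):
    """For each feature, count caravan-policy holders with a non-zero value."""
    target_idx = columns.index(target)
    hits = Counter()
    for row in rows:
        if row[target_idx] != 1:
            continue  # skip customers without a caravan policy
        for name, value in zip(columns, row):
            if name != target and value > 0:
                hits[name] += 1
    return hits


hits = count_hits(ROWS, COLUMNS)
for name, count in hits.most_common():
    print(name, count)
```

Features that come out at the top of such a ranking are natural candidates for a hypothesis; those near zero can be set aside.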

Finding such a subset of features is called feature selection. I selected the following features for my hypotheses.

1) Car policies

2) Fire policies

3) Social Status

4) Boat policies.

I also selected some other related features to compare with the ones listed above.

Methods of Visualization

Doughnut chart

A doughnut chart is a kind of pie chart with a blank centre. Unlike a pie chart, a doughnut chart can show several data series at once, as concentric rings.

A simple pie chart is typically used to show the proportion of data within a single category.

Rather than using a doughnut chart to squeeze more categories into a single chart, it can be used to show a greater level of detail within a single category of information.

It displays the contribution of each variable to the total.
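As a rough illustration, a doughnut chart could be drawn in Python with matplotlib along these lines. The figures in this post were not necessarily produced this way, and the category names and counts below are invented for the example.

```python
def shares(counts):
    """Convert raw counts into percentage contributions that sum to 100."""
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}


def plot_doughnut(counts, path="doughnut.png"):
    # Imported here so that shares() stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # A wedge width below 1 leaves the hole that turns a pie into a doughnut.
    ax.pie(
        list(counts.values()),
        labels=list(counts.keys()),
        autopct="%1.0f%%",
        pctdistance=0.8,
        wedgeprops={"width": 0.4},
    )
    ax.set_title("Contribution of each category")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


# Made-up counts, for illustration only.
example_counts = {"car": 6, "moped": 2, "motorcycle": 1, "other": 1}
print(shares(example_counts))
# plot_doughnut(example_counts)  # uncomment to write doughnut.png
```

The chart stays flat (no 3D, no shadow), and colour is used only to tell the categories apart.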

Horizontal Bar chart

Horizontal bar charts compare values across categories using horizontal bars or rectangles. They are used when the categories represent durations or when the category names are too long to fit under vertical columns.
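A horizontal bar chart of this kind might be sketched in matplotlib as shown here. The function names are hypothetical and the colour and axis label are assumptions chosen for the example.

```python
def sorted_for_barh(counts):
    """Sort ascending so the largest bar ends up at the top of a barh plot."""
    return sorted(counts.items(), key=lambda item: item[1])


def plot_barh(counts, path="barh.png"):
    # Imported here so the sorting helper works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    names, values = zip(*sorted_for_barh(counts))
    fig, ax = plt.subplots()
    # One colour for all bars: they all measure the same thing.
    ax.barh(names, values, color="steelblue")
    ax.set_xlabel("Customers with a caravan policy")
    # Dropping the top and right spines removes ink that carries no data.
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


# Made-up counts, for illustration only.
example = {"fire": 5, "car": 7, "boat": 1}
print(sorted_for_barh(example))
# plot_barh(example)  # uncomment to write barh.png
```

Because `barh` draws the first item at the bottom, sorting in ascending order puts the largest category at the top, where the eye lands first.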

Column chart

Column charts, also called bar charts, compare values across categories using vertical rectangles. They are used when the order of the categories is not important or when displaying category counts.
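Displaying category counts as columns could look like the sketch below. The labels are invented, and the display order A, B1, B2, C, D is an assumption about how the social-status classes are named.

```python
from collections import Counter


def category_counts(labels, order):
    """Count occurrences per category and return them in a fixed display order."""
    counts = Counter(labels)
    return [(category, counts.get(category, 0)) for category in order]


def plot_columns(pairs, path="columns.png"):
    # Imported here so the counting helper works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    categories, values = zip(*pairs)
    fig, ax = plt.subplots()
    ax.bar(categories, values, color="steelblue")
    ax.set_ylabel("Count")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)


# Made-up social-status labels, for illustration only.
labels = ["A", "D", "B1", "D", "A", "C", "D"]
pairs = category_counts(labels, order=["A", "B1", "B2", "C", "D"])
print(pairs)
# plot_columns(pairs)  # uncomment to write columns.png
```

Passing the order explicitly keeps empty categories visible as zero-height columns instead of silently dropping them.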

Experiments

Hypothesis 1

People who have a caravan policy are very likely to have a car, because a caravan must be towed by a vehicle (a car is the cheapest and easiest way to move a caravan).

According to this hypothesis, cars are mostly used to tow caravans because they are a cheap and efficient means of transport. I therefore compared the contributions of the different vehicle policies that influence the caravan policy.

 

This visualization uses a doughnut chart. Figure 1 shows that there are 8 categories (classes) of vehicle and that the contribution of cars is larger than that of any other category.

A 3D doughnut chart might have looked more attractive, but it would lower the data-ink ratio (the share of a graphic's ink that actually presents data) and would count as chart junk. The flat chart therefore follows Tufte's principles. The multiple colours in the chart clearly distinguish the categories and the size of each contribution.

 

Hypothesis 2

Hence they will also have a car policy, and a fire policy too (caravans have a high risk of catching fire because of the materials used in their construction).

Because caravans are made of materials that are prone to fire accidents, fire policies should also influence the caravan policy.

Figure 1 shows that cars contribute the most, and a car owner will obviously have a car policy. Hence the car policy should also influence the caravan policy.

Figure 2 clearly shows that the fire policy and the car policy have the highest number of hits among caravan policy holders.

On the other hand, vehicle policies such as those for lorries and agricultural machines have no hits for the caravan policy.

This may be because lorry policy customers have their own means of transport, which they generally use for trading and carrying goods. Agricultural machine policy customers are focused on cultivating their land and show little interest in moving from one place to another.

 

The data in Figure 2 is visualized using a horizontal bar chart, again following Tufte's principles. Multiple colours were not needed here, because all the bars belong to a single category and no other feature distinguishes them.

Hypothesis 3

Customers who have a boat policy will also have a caravan policy, because boating and caravanning both indicate an outdoor lifestyle.

Figure 2 shows that the boat policy has very few hits. This may be because low-status people use caravans only to live in, as mobile homes, and not for an outdoor lifestyle.

Figure 3 visualizes the features that were hidden in Figure 2, where everything except the car and fire policies was dwarfed by the high hit counts of those two. According to the hypothesis the boat policy should have more hits, but here the scooter policy has more.
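One simple way to reveal small categories is to leave the dominant ones out before plotting. The sketch below shows that idea only; the helper name is hypothetical and the hit counts are made up, not taken from the figures.

```python
def without_dominant(counts, dominant):
    """Return the counts with the dominant categories left out."""
    return {name: value for name, value in counts.items() if name not in dominant}


# Made-up hit counts, for illustration only.
hits = {"car": 300, "fire": 280, "scooter": 12, "boat": 3, "lorry": 0}
remaining = without_dominant(hits, dominant={"car", "fire"})
print(remaining)
```

Plotting `remaining` on its own axis lets the bar scale adapt to the small values, so differences such as scooter versus boat become visible.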

 

Hypothesis 4

Caravan policies will mostly be taken by low-status people, who use caravans as mobile homes, and by high-status people, who use them for an outdoor lifestyle.

In other words, I expected the hits to concentrate at both ends of the social scale.

Figure 4 clearly shows that customers of social status A and B1 (high status) have more hits for caravan policies, and that customers of social status D (low status) also have a high number of hits.


There are 5 categories of social status in the data. I purposely added unnecessary grid lines and the colour indicator ‘social status’ on the right-hand side to show what a lower data-ink (DI) ratio looks like. In Figures 1 and 2 this was avoided by following Tufte’s principles. Moreover, the coloured bars were unnecessary, since they represent only one feature.

Results

The results agreed with the hypotheses made at the beginning of this article, except for the boat policies. That is because low-status customers (social status D) use caravans as mobile homes rather than out of interest in an outdoor lifestyle.

Take Away

The visualizations here are based on multivariate data analysis.

Figure 4 is an example of a low data-ink ratio, with unnecessary gridlines and a colour bar. In the other figures the data-ink ratio is kept high. A single line would be enough to draw each bar, but I kept filled bars for the sake of elegance, so the ratio is high without being strictly maximal.

3D graphs are generally not recommended: they distort the data (raising the lie factor) and introduce chart junk and ducks, all of which go against Tufte’s principles.

Multiple colours are used in the doughnut chart only to distinguish the contribution of each category.

A single colour is used in the other figures to show that the bars belong to the same category, in line with the Gestalt law of similarity.
