How machine learning improves P&C insurer territory mapping

A combination of actuarial concepts and advanced analytics can be used to review old territory definitions for rating and underwriting.

Most personal auto and home insurance carriers use geographic risk as a primary rating variable for calculating policy premium rates. This ensures that customers living in the same neighborhood pay similar insurance premiums, as they are likely to experience the same geographical risks to their automobiles and property.

Some common geographically shared risks include weather perils such as hail and wind, along with fire, theft, vandalism and traffic accidents.

Over years of managing insurance underwriting operations, carriers have defined rating territories that group geographical areas with homogeneous risk profiles. However, these territory definitions are often based on subjective information, such as agent feedback, or on claims information with limited credibility. The result is poor statistical homogeneity within each territory in terms of pure premiums and claim frequencies.

A combination of actuarial concepts and advanced analytics can be used to review old territory definitions for rating and underwriting. Under this fresh approach, new rating territories are defined using unsupervised machine learning algorithms, such as cluster analysis techniques, after applying the principle of locality to increase the credibility of the data.

Principle of locality

At a 2004 Casualty Actuarial Society Ratemaking Seminar, in a session titled “Determination of Geographical Territories,” Michael J. Miller discussed a technique for defining rating territories based on homogeneity of risk classification. Ideally, a homogeneous territory carries the same geographical risk throughout. This homogeneity is measured using within-cluster variance: the variance of a cluster is proportional to the total squared distance of each observation from the cluster centroid.
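As a minimal sketch of that measure, the within-cluster sum of squares can be computed directly in R (the function and argument names below are illustrative, not taken from Miller's presentation):

    # Within-cluster sum of squares: total squared distance of each
    # observation from the centroid of its assigned cluster.
    wcss <- function(x, cluster) {
      x <- as.matrix(x)
      sum(sapply(unique(cluster), function(k) {
        g <- x[cluster == k, , drop = FALSE]
        sum(scale(g, center = colMeans(g), scale = FALSE)^2)
      }))
    }

For clusters produced by R's kmeans(), this quantity matches the tot.withinss value the fitted object already reports.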

A fresh approach

To understand the advanced analytics-driven process of territory allocation via machine learning, consider the problem from the perspective of a P&C insurer that wants to review its old territory definitions (for its auto and home insurance divisions) in one of the states where it operates, and to define new contiguous territories using a data-driven technique that captures historical risk exposures.

The goal is to identify new territories that display a lower risk variance within their boundaries. This would help the insurance company offer better-tailored premiums, more closely aligned with the actual risk profiles of its customers.

To kick-start the project, the insurer must provide historical data on underwritten insurance policies and incurred claims transactions for the state under study to the analytics team, which then commences work on territory mapping. The process flow of the territory mapping procedure is as follows:

Data preparation

At this stage, the analytics team integrates the claims and policy data at the policy level. This is followed by the identification of building blocks (or geographic groups) based on data density. Once these blocks are identified, the analytics team summarizes the data at the block level and geocodes each building block to obtain its latitude and longitude.
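As an illustration only, the policy-level integration and block-level roll-up might look like the following in R; the table and column names (policies, claims, policy_id, block_id, earned_exposure, incurred_loss, block_coords) are assumptions, not the insurer's actual schema:

    # Summarize claims to the policy level, then join onto policies;
    # policies without claims become zero-loss records.
    claims$claim_count <- 1
    claims_by_policy <- aggregate(cbind(incurred_loss, claim_count) ~ policy_id,
                                  data = claims, FUN = sum)
    policy_claims <- merge(policies, claims_by_policy, by = "policy_id", all.x = TRUE)
    policy_claims$incurred_loss[is.na(policy_claims$incurred_loss)] <- 0
    policy_claims$claim_count[is.na(policy_claims$claim_count)] <- 0

    # Roll up to the building-block level.
    block_summary <- aggregate(
      cbind(earned_exposure, incurred_loss, claim_count) ~ block_id,
      data = policy_claims, FUN = sum
    )

    # Geocoding: attach a representative latitude/longitude per block,
    # here from a prepared lookup table.
    block_summary <- merge(block_summary, block_coords, by = "block_id")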

Adding variables from external data sources

Once the data preparation is complete, the analytics team uses web scraping to extract demographic data at the building-block level and supplements it with industry policy and loss data. The extracted data is then integrated with the internal data sets to create the master data set.
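A minimal sketch of this integration step, assuming the scraped demographics and the industry data are keyed by the same block_id (the table names demographics and industry_loss are illustrative):

    # Combine internal block summaries with scraped demographic variables
    # and external industry policy/loss data to form the master data set.
    master <- merge(block_summary, demographics, by = "block_id", all.x = TRUE)
    master <- merge(master, industry_loss, by = "block_id", all.x = TRUE)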

Variable creation and normalization

Under this step, all the numerical variables in the master data set are binned using their percentile distribution, followed by the creation of dummy variables. The analytics team then calculates the percentage of losses and the loss amounts by claim cause, which helps in assigning weights to the building blocks' latitudes and longitudes. This is followed by the calculation of credibility-weighted pure premiums and claim frequencies using the principle of locality.
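A hedged sketch of these calculations in R follows; the quartile binning, the column names and the limited-fluctuation credibility form (with the common 1,082-claim full-credibility standard) are illustrative assumptions, and the locality complement would in practice be built from neighboring blocks in line with the principle of locality:

    # Percentile (quartile) binning of a numeric variable, then dummy coding.
    brks <- unique(quantile(master$incurred_loss, probs = seq(0, 1, 0.25), na.rm = TRUE))
    master$loss_bin <- cut(master$incurred_loss, breaks = brks,
                           include.lowest = TRUE, labels = FALSE)
    dummies <- model.matrix(~ factor(loss_bin) - 1, data = master)

    # Credibility-weighted pure premium, blending each block with its locality.
    n_full <- 1082                                   # full-credibility claim standard (assumed)
    z <- pmin(1, sqrt(master$claim_count / n_full))
    block_pp    <- master$incurred_loss / master$earned_exposure
    locality_pp <- master$neighbor_loss / master$neighbor_exposure   # complement from nearby blocks (assumed columns)
    master$cred_pp <- z * block_pp + (1 - z) * locality_pp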

Preliminary data analysis

At this stage, the analytics team explores the data set in its current form to build a better understanding ahead of the subsequent analysis.

Variable selection to reduce noise

Once the data sets are integrated and the variables have been created and normalized, the master data set may consist of approximately 250-300 features and variables. While most unsupervised machine learning algorithms do not mandate variable selection, it serves three key purposes:

  1. Enables the algorithm to run faster. With improving big data infrastructure and expanding processor capacity, this is usually not a challenge.
  2. Makes it easier to profile clusters and extract meaningful insights from variables and features.
  3. Retains variables that maximize business sense and minimize multi-collinearity.

The purpose of variable selection is to identify the factors influencing pure premium or loss in a given building block. These factors include variables available from the claims, policy, demographic and industry data. Multiple approaches can be explored to arrive at a reduced set of variables that still sufficiently explains the data variance; one simple filter is sketched below.
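A minimal, hedged example of such a filter in R: drop near-constant variables, then prune one variable of each highly correlated pair. The 0.9 correlation threshold and the choice of which variable to keep are arbitrary illustrative decisions, not a prescribed method:

    # Keep numeric variables that actually vary across building blocks.
    num_vars <- master[sapply(master, is.numeric)]
    num_vars <- num_vars[, apply(num_vars, 2, sd, na.rm = TRUE) > 1e-6]

    # Prune one variable from each pair with absolute correlation above 0.9.
    cors <- abs(cor(num_vars, use = "pairwise.complete.obs"))
    high_cor <- which(cors > 0.9 & upper.tri(cors), arr.ind = TRUE)
    drop_vars <- unique(colnames(cors)[high_cor[, "col"]])
    reduced <- num_vars[, setdiff(colnames(num_vars), drop_vars)]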

Cluster profiling via unsupervised machine learning

This involves assessing the clustering tendencies of the data, identifying the optimal number of clusters, profiling and visualizing the clusters, and comparing the results of multiple clustering algorithms; one such comparison is sketched below.
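As an illustration, a comparison of two widely used algorithms, k-means and hierarchical (Ward) clustering, might look like this in R. The choice of algorithms, the candidate cluster counts and the silhouette-based evaluation are assumptions made for the sketch, not the only reasonable setup:

    library(cluster)   # for silhouette()

    feat <- scale(reduced)     # standardize the selected features
    d    <- dist(feat)

    # Score a candidate partition by its average silhouette width.
    eval_sil <- function(labels) mean(silhouette(labels, d)[, "sil_width"])

    ks     <- 2:10
    km_sil <- sapply(ks, function(k) eval_sil(kmeans(feat, centers = k, nstart = 25)$cluster))
    hc     <- hclust(d, method = "ward.D2")
    hc_sil <- sapply(ks, function(k) eval_sil(cutree(hc, k = k)))

    best_k <- ks[which.max(pmax(km_sil, hc_sil))]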

Based on the results of this comparative study, the analytics team can finalize the grouping of variables for the auto and home insurance categories and provide it to the insurer. These groupings can include variables such as incurred losses for perils including hail, wind, fire, theft, vandalism and accidents, as well as geographical coordinates, credibility weightings for premiums and claims, customer earnings, number of owned properties, and home value.

Defining new territories

Having finalized the variables and the methodology to be used for the analysis, the analytics team can bring together the right people and digital technologies to drive meaningful business outcomes. This includes using tools such as R/RStudio and Tableau: R scripts to extract the demographic data and populate the identified fields in the master data set, after which the team can apply several clustering methods to study similarities in the characteristics of the data.

By the end of the procedure, the analytics team can identify separate sets of clusters for the home insurance category and the auto insurance category. These clusters can then be used to define distinct new contiguous territories which, when compared to the existing territory definitions, have a lower risk variance within their boundaries.
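To close the loop, the cluster labels can be attached back to the building blocks and the new grouping compared with the legacy territories on within-group variance, reusing the wcss() helper sketched earlier. The old_territory column holding the legacy territory codes is an assumed field, and the sketch assumes no rows were dropped between the master data set and the clustered feature matrix:

    # Assign each building block to its new territory (cluster) and compare
    # within-group variance of the credibility-weighted pure premium.
    master$new_territory <- kmeans(feat, centers = best_k, nstart = 25)$cluster

    old_wcss <- wcss(master["cred_pp"], master$old_territory)
    new_wcss <- wcss(master["cred_pp"], master$new_territory)
    new_wcss < old_wcss   # expected TRUE if the new territories are more homogeneous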

Remaining competitive

The aforementioned machine learning exercise allows carriers to appropriately capture risks and stay competitive by quoting tailor-made prices for customer groups with geographically similar risk profiles. The incorporation of demographic risks in the territory definitions gives insurers a far clearer picture of premium ratemaking, along with reusable processes and tools.

Aditya Sehgal (Aditya.Sehgal@exlservice.com) is senior assistant vice president of EXL Service, an operations management and analytics company based in New York City that serves the insurance, banking, financial services, utilities, health care, travel, transportation and logistics industries. Varun Manchanda (Varun.Manchanda@exlservice.com) is the company’s assistant vice president.
