Traditionally, insurance fraud detection strategies focus on identifying fraudulent claims after the claim has been paid to the claimant. Losses are far easier to mitigate, however, when the fraud is identified before the claim is paid.
With the advancement in computing and data analytics, it is now possible to adopt a predictive approach to fraud detection. As a result, insurers are turning to data-driven fraud detection programs aimed at prevention, detection, and management of fraudulent claims.
|Advanced analytics in fraud detection
Since insurers hold large amounts of data, it makes sense to evaluate both internal and external data to identify claims with a higher propensity for fraud. By carefully analyzing this accumulated data with advanced analytical tools and techniques, insurers can identify patterns and anomalies. This helps in determining the characteristics of a fraudster and whether a claim warrants further investigation.
The key lies in employing predictive techniques such as statistical modeling and machine learning algorithms, which provide proactive insights into potential fraud events. In this article, we discuss two advanced analytics techniques used in fraud detection: logistic regression and the gradient boosting model (GBM).
Before we move on to explaining the inner mechanics of using these advanced analytics techniques, it is critical to understand the flow of information in an insurance claim procedure. This includes:
- First Notice of Loss (FNOL): The stage at which the claimant first notifies the insurer that a loss has occurred.
- First Contact (FC): The stage at which the insurer contacts the claimant, after FNOL, asking for more information about the loss that has occurred.
- Ongoing (OG): The continuous back and forth of information between the claimant and insurer after FC until the claim is closed.
This flow of information makes the first stage the most valuable point for robust identification of potential fraud; the next two stages can then be used to steer investigations in a particular direction. The following process describes a stair-step approach to deploying data analytics to identify fraud at the different stages of the insurance claim process.
|Step No. 1: Collating the right data
To uncover factors/KPIs indicating fraudulent behavior, an exhaustive data sourcing exercise needs to be undertaken, covering both internal and external data. Internal data comprises information centered on customers, claims, claimants and policies. External data, on the other hand, consists of information not captured by the insurer: regional demographics, industry-accepted standard scores, weather conditions that prevailed when the loss occurred, and catastrophes that may have occurred during the time period of interest. The end result of this step is a 'master dataset' created by weaving together the collected internal and external data.
The variables are then classified by their availability during the various claim stages and made available accordingly during the three stages of model building.
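As a rough illustration, assembling such a master dataset and tagging variables by stage availability might look like the following Python sketch; every file name, join key, and column name here is a hypothetical placeholder rather than a prescribed schema.

```python
# A minimal sketch of Step 1, assuming pandas; all file names, join
# keys, and column names are hypothetical placeholders.
import pandas as pd

# Internal data: customers, claims, claimants, policies
claims = pd.read_csv("claims.csv")        # one row per claim
policies = pd.read_csv("policies.csv")    # policy-level attributes

# External data: regional demographics, weather at the time of loss
demographics = pd.read_csv("demographics.csv")
weather = pd.read_csv("weather.csv")

# Weave internal and external sources into a single master dataset
master = (
    claims
    .merge(policies, on="policy_id", how="left")
    .merge(demographics, on="zip_code", how="left")
    .merge(weather, on=["zip_code", "loss_date"], how="left")
)

# Classify predictors by the earliest stage at which they are known,
# so each stage's model sees only information available at that stage
stage_availability = {
    "FNOL": ["loss_type", "report_lag_days", "zip_code"],
    "FC":   ["claimant_statement_score", "injury_flag"],
    "OG":   ["repair_estimate", "attorney_involved"],
}
```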
|Step No. 2: Applying analytics techniques
Once the master dataset is ready and the variables are identified, two analytics techniques are applied to identify fraudulent behavior:
Logistic regression: A statistical method for analyzing a dataset in which one or more independent variables determine a binary outcome. This predictive analytics technique produces an outcome measured with a dichotomous variable (one with only two possible values). Potentially fraudulent claims are a rare event, typically less than 1% of all claims. Because logistic regression underestimates probability scores for rare events, an oversampled dataset with an event rate of at least 5% needs to be created to ensure unbiased results.
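A minimal sketch of the oversampling step, assuming Python with pandas and scikit-learn; the `is_fraud` target and the `master_train` frame are hypothetical names carried over from the Step 1 sketch.

```python
# A minimal sketch of oversampling a rare fraud event to a >=5% event
# rate before fitting logistic regression; column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def oversample(df, target="is_fraud", event_rate=0.05, seed=42):
    """Replicate fraud rows until they form at least `event_rate` of rows."""
    events, non_events = df[df[target] == 1], df[df[target] == 0]
    # Solve n_e / (n_e + n_ne) >= event_rate for the event count n_e
    needed = int(event_rate * len(non_events) / (1 - event_rate)) + 1
    boosted = events.sample(n=max(needed, len(events)),
                            replace=True, random_state=seed)
    return pd.concat([non_events, boosted]).sample(frac=1, random_state=seed)

train = oversample(master_train)               # master_train from Step 1
X, y = train.drop(columns="is_fraud"), train["is_fraud"]
fnol_logit = LogisticRegression(max_iter=1000).fit(X, y)
```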
Since the flow of information occurs in three stages (FNOL, FC, and OG), a residual modeling technique is applied for logistic regression: the logistic score from one stage is used as an offset variable in the subsequent stage (see the sketch below). The information gains at one stage are thus passed on to the next, so that as claims move forward from one stage to the other, insurers have more clarity on whether they are genuine or fraudulent.
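One way to implement the offset idea is with statsmodels, whose GLM accepts an offset term added to the linear predictor; this is a sketch under that assumption, with hypothetical per-stage predictor matrices `X_fnol` and `X_fc` and a binary fraud target `y`.

```python
# A minimal sketch of stage-wise residual (offset) modeling with
# statsmodels; X_fnol and X_fc are hypothetical per-stage predictors.
import numpy as np
import statsmodels.api as sm

# Stage 1: FNOL model on variables known at first notice of loss
fnol_fit = sm.GLM(y, sm.add_constant(X_fnol),
                  family=sm.families.Binomial()).fit()

# The FNOL log-odds enter the FC model as a fixed offset, so the FC
# model learns only the incremental signal in the newer information
fnol_log_odds = np.dot(sm.add_constant(X_fnol), fnol_fit.params)
fc_fit = sm.GLM(y, sm.add_constant(X_fc),
                family=sm.families.Binomial(),
                offset=fnol_log_odds).fit()  # repeat FC -> OG analogously
```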
Gradient boosting model (GBM): A machine-learning technique that improves on a single model by fitting many models and combining them for prediction. There is no need to create an oversampled dataset, and the modeling exercise can be performed by gradient boosting of classification trees.
GBM does not support the sequential offset modeling described above, so a parallel development approach is followed, with an independent model at each of the three stages (FNOL, FC, and OG), as sketched below.
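As a sketch of the parallel approach, assuming scikit-learn's histogram-based gradient boosting of trees (which tolerates missing values natively); the per-stage feature matrices are hypothetical names continued from the earlier sketches.

```python
# A minimal sketch of one independent GBM per stage, fit on the
# original (not oversampled) data; feature names are hypothetical.
from sklearn.ensemble import HistGradientBoostingClassifier

stage_features = {"FNOL": X_fnol, "FC": X_fc, "OG": X_og}
gbm_models = {}
for stage, X_stage in stage_features.items():
    # Boosted classification trees handle non-linearities and missing
    # values without the preprocessing logistic regression requires
    gbm_models[stage] = HistGradientBoostingClassifier(
        max_iter=300, learning_rate=0.05).fit(X_stage, y)

# The scored dataset: a fraud probability for every claim at each stage
fnol_scores = gbm_models["FNOL"].predict_proba(X_fnol)[:, 1]
```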
|Step No. 3: Running the analysis and analyzing the results
Under logistic regression, a standard approach to variable selection is carried out. Variables can be eliminated on the basis of fill rates, correlation analysis and clustering. Tools like SAS can be used for stepwise selection of variables in the logistic procedure, and further shortlisting can be done to remove multicollinearity. No such treatment is required under GBM. The output of both techniques can be measured and analyzed in terms of 'lift', 'K-S' (Kolmogorov-Smirnov) and precision values for each of the three stages.
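For concreteness, the K-S and lift measures might be computed as follows; this sketch assumes NumPy/SciPy and reuses the hypothetical `fnol_scores` and `y` from the earlier sketches.

```python
# A minimal sketch of the evaluation metrics: the two-sample
# Kolmogorov-Smirnov statistic and top-decile lift; names hypothetical.
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(scores, y):
    """Max separation between fraud and non-fraud score distributions."""
    scores, y = np.asarray(scores), np.asarray(y)
    return ks_2samp(scores[y == 1], scores[y == 0]).statistic

def top_decile_lift(scores, y):
    """Fraud rate in the top-scored 10% relative to the overall rate."""
    scores, y = np.asarray(scores), np.asarray(y)
    top = y[np.argsort(scores)[::-1]][: max(len(y) // 10, 1)]
    return top.mean() / y.mean()

print("K-S:", ks_statistic(fnol_scores, y))
print("Top-decile lift:", top_decile_lift(fnol_scores, y))
```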
|Logistic Regression and GBM: A comparison
The two techniques run different algorithms in the background. Logistic regression requires human intervention at different stages, whereas GBM is based on machine-learning algorithms that require minimal human involvement. In terms of output, GBM produces a scored dataset with probability values for all observations, whereas logistic regression provides both a scored dataset and a mathematical equation that can then be used to score new incoming claims (see the sketch after the table below). Hence, logistic regression lets insurers see explicitly how each predictor relates to the predicted variable. With logistic regression, interaction terms and variable transformations are subject to the discretion of the data scientist building the model, whereas GBM introduces and tests for interactions and variable transformations itself.
| | Logistic regression | GBM |
| --- | --- | --- |
| Human intervention | Yes, at every step | Minimal |
| Output | Scored dataset + mathematical equation | Scored dataset (probability values) |
| Handles non-linear data | No | Yes |
| Handles observations with missing values | Requires imputation | Yes |
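To make the "mathematical equation" output concrete, a fitted logistic model reduces to a formula that can score each new claim directly; the variable names and coefficient values below are purely illustrative, not fitted results.

```python
# A minimal sketch of scoring a new claim with a fitted logistic
# equation; the intercept and coefficient values are illustrative only.
import numpy as np

intercept = -4.2
coefs = {"report_lag_days": 0.08, "prior_claims": 0.35}

def score_claim(claim):
    # log-odds = intercept + sum(b_i * x_i); probability via the sigmoid
    log_odds = intercept + sum(coefs[k] * claim[k] for k in coefs)
    return 1.0 / (1.0 + np.exp(-log_odds))

print(score_claim({"report_lag_days": 30, "prior_claims": 2}))
```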
|The takeaway
Both techniques have their own merits and limitations, and an appropriate technique can be selected depending on what the business wants to accomplish. Nor do the two have to be used independently: applying logistic regression in tandem with GBM on the same dataset can provide a better perspective on the authenticity of claims.
Insurers can use these techniques to fast-track the claim-handling process at FNOL and realign claim resources to more complex claim-handling activities.