Synthetic data enables insurers to get more value from AI

By early 2022, insurers should at least be experimenting with synthetic data, which can address a variety of AI challenges.

Since it lacks personal details, synthetic data can help insurers address a number of challenges, including regulatory compliance and dataset bias.

Through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms or the teams managing them, according to Gartner.

Most of the blame lies with the data sets used to train AI models, which often underrepresent women, people of color and other minority groups.

Over time, this underrepresentation not only leads to bad business decisions for insurers, but also has a real impact on consumers. For instance, many customers in minority neighborhoods are being charged higher car insurance premiums than those in primarily white neighborhoods.

Insurer AI challenges

Of course, bias is just one challenge insurers have had with AI. Others include:

- Complying with privacy regulations such as GDPR and CCPA, which limit how customer data can be used.
- Slow time-to-data, which delays model development and deployment.
- Model performance degradation over time, which demands fresh training data.

Synthetic data is emerging as a possible solution to all of these challenges. Synthetic datasets look just as real as a company's original customer data, and carry just as much detail, but contain none of the original personal data points. They can also be tweaked for better balance and representation, and because they hold no personal data they can help insurers comply with privacy regulations such as GDPR and CCPA.
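
To make the concept concrete, here is a minimal sketch of one classical synthesis technique, a Gaussian copula fit to numeric columns. The column names and distributions below are hypothetical, and production platforms rely on far more capable deep generative models; this toy version only illustrates the core idea of learning a dataset's statistical structure and sampling fresh rows from it.

```python
import numpy as np
import pandas as pd
from scipy import stats

def synthesize(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Naive Gaussian-copula synthesizer for numeric columns (illustration only)."""
    rng = np.random.default_rng(seed)
    # Map each column to standard-normal space via its empirical ranks.
    ranks = real.rank(method="average") / (len(real) + 1)
    z = stats.norm.ppf(ranks)
    # Capture cross-column dependence with a correlation matrix.
    corr = np.corrcoef(z, rowvar=False)
    # Sample fresh latent rows, then map back through empirical quantiles,
    # so no synthetic row is a copy of an original record.
    latent = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_rows)
    u = stats.norm.cdf(latent)
    return pd.DataFrame({
        col: np.quantile(real[col], u[:, i]) for i, col in enumerate(real.columns)
    })

# Hypothetical "real" customer data; the synthetic sample mirrors its statistics.
rng = np.random.default_rng(1)
real = pd.DataFrame({
    "age": rng.normal(45, 12, 1000),
    "annual_premium": rng.gamma(2.0, 600.0, 1000),
})
synthetic = synthesize(real, n_rows=1000)
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```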

These datasets are created by AI models built by a growing community of startups. The technology, while nascent, is already in use at multiple Fortune 100 insurers and financial services companies.

Synthetic data use cases

Model retraining when performance degrades: Scheduled performance evaluations found that several AI models used by an insurer were degrading over time. The team created synthetic datasets to retrain those models, restoring performance and reducing the potential for bias.
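
A hedged sketch of what such a retraining loop might look like, assuming a scikit-learn model and a placeholder `generate_synthetic()` function standing in for an actual synthesis platform (the function, threshold and features here are all hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

ACCEPTABLE_AUC = 0.75  # assumed performance floor; tune per model

def generate_synthetic(n_rows: int, seed: int = 0):
    """Placeholder for a real synthesizer: returns feature matrix and labels."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
    return X, y

def scheduled_evaluation(model, X_recent, y_recent):
    """Refit on fresh synthetic data if live performance has degraded."""
    auc = roc_auc_score(y_recent, model.predict_proba(X_recent)[:, 1])
    if auc < ACCEPTABLE_AUC:
        X_syn, y_syn = generate_synthetic(n_rows=20_000)
        model.fit(X_syn, y_syn)  # retrain on the refreshed synthetic set
    return model, auc

# Usage: train once, then run the scheduled check against recent data.
X0, y0 = generate_synthetic(5_000, seed=1)
model = GradientBoostingClassifier().fit(X0, y0)
X_recent, y_recent = generate_synthetic(2_000, seed=2)
model, auc = scheduled_evaluation(model, X_recent, y_recent)
print(f"recent AUC: {auc:.3f}")
```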

Eliminating racial bias: A large financial services company evaluated a crime/fraud prediction dataset and then created synthetic data that reduced its racial skew from 24% to just 1%.
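
The mechanics of measuring and correcting that kind of skew can be sketched in a few lines. The `group` column, group labels and target shares below are hypothetical, and the toy rebalancer resamples existing rows, whereas a real workflow would synthesize new records for the underrepresented group:

```python
import pandas as pd

def skew(df: pd.DataFrame, target_shares: dict) -> float:
    """Largest absolute gap between actual and target group shares."""
    actual = df["group"].value_counts(normalize=True)
    return max(abs(actual.get(g, 0.0) - s) for g, s in target_shares.items())

def rebalance(df: pd.DataFrame, target_shares: dict, n_rows: int, seed: int = 0):
    """Draw a sample whose group shares match the targets (toy resampler)."""
    parts = [
        df[df["group"] == g].sample(int(n_rows * s), replace=True, random_state=seed)
        for g, s in target_shares.items()
    ]
    return pd.concat(parts, ignore_index=True)

target = {"A": 0.5, "B": 0.5}  # assumed fair target shares
df = pd.DataFrame({"group": ["A"] * 740 + ["B"] * 260})  # 24-point skew
print(f"skew before: {skew(df, target):.0%}")
balanced = rebalance(df, target, n_rows=1_000)
print(f"skew after:  {skew(balanced, target):.0%}")
```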

Fraud detection: A large insurer created a synthetic dataset, then boosted the incidence of fraud within it to train the company's fraud detection models. The higher fraud share made patterns easier for the models to pick up, improving their accuracy.
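
A rough illustration of that upsampling step, using simulated claims data (all feature distributions, rates and names here are invented for the sketch):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def make_claims(n: int, fraud_rate: float):
    """Toy claims data: fraudulent rows have a shifted feature distribution."""
    y = (rng.random(n) < fraud_rate).astype(int)
    X = rng.normal(size=(n, 4)) + y[:, None] * 1.5
    return X, y

def upsample_fraud(X, y, target_rate: float):
    """Replicate fraud rows until they make up `target_rate` of the data."""
    fraud = np.flatnonzero(y == 1)
    n_extra = int((target_rate * len(y) - len(fraud)) / (1 - target_rate))
    extra = rng.choice(fraud, size=max(n_extra, 0), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

X_train, y_train = make_claims(20_000, fraud_rate=0.01)  # fraud is rare
X_test, y_test = make_claims(5_000, fraud_rate=0.01)

base = LogisticRegression().fit(X_train, y_train)
X_up, y_up = upsample_fraud(X_train, y_train, target_rate=0.2)
boosted = LogisticRegression().fit(X_up, y_up)

print("recall, original mix: ", recall_score(y_test, base.predict(X_test)))
print("recall, fraud-boosted:", recall_score(y_test, boosted.predict(X_test)))
```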

Optimizing pricing: Privacy regulations such as CCPA and HIPAA restrict how insurers can use customers' personal data in modeling, which makes building accurate pricing models a challenge. A Fortune 100 insurer instead turned to synthetic data, leveraging geolocation data and 15M synthetic addresses. The resulting model was as accurate as a comparison model trained on real data, with a 60x shorter time-to-data, meaning it got into production much faster.
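
That accuracy claim implies a simple validation pattern, sometimes called train-synthetic-test-real: fit the same model once on real data and once on synthetic data, then score both on a real holdout set. A sketch with simulated stand-in data (nothing here reflects the insurer's actual features):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def simulate(n: int, seed: int):
    """Stand-in for real (or synthesized) geolocation-derived features."""
    r = np.random.default_rng(seed)
    X = r.normal(size=(n, 6))
    y = X @ np.array([3.0, -1.5, 2.0, 0.5, 0.0, 1.0]) + r.normal(scale=2.0, size=n)
    return X, y

X_real, y_real = simulate(10_000, seed=1)
X_syn, y_syn = simulate(10_000, seed=2)   # pretend this came from a synthesizer
X_train, X_hold, y_train, y_hold = train_test_split(X_real, y_real, random_state=0)

on_real = Ridge().fit(X_train, y_train)
on_syn = Ridge().fit(X_syn, y_syn)

# If synthesis preserved the signal, the two holdout scores should be close.
print("R2 trained on real:     ", round(r2_score(y_hold, on_real.predict(X_hold)), 3))
print("R2 trained on synthetic:", round(r2_score(y_hold, on_syn.predict(X_hold)), 3))
```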

Wave of the future

Synthetic datasets are quickly replacing older approaches to data anonymization, including data masking, "pseudonymization," randomization, permutation and generalization. Synthesis is much more secure: masked or pseudonymized records can often be re-identified by linking them with other data sources, whereas synthetic records correspond to no real individual. Synthetic datasets can also be created in a fraction of the time, saving money and accelerating AI model development.

Little wonder, then, that Gartner predicts that by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.

Insurers should at least be experimenting with synthetic data by early 2022, with an eye toward making it a core element of AI modeling in the very near future.

Tobias Hann is CEO of MOSTLY AI, developers of a synthetic data platform. MOSTLY AI is on a mission to revolutionize how companies think about and work with data. Hann can be reached by sending email to hello@mostly.ai.

These opinions are the author’s own.