An illustrative guide to modeling insurance claim frequencies using generalized linear models in R
This article discusses a step-by-step approach to count-data modeling with a focus on insurance claim frequencies: we familiarize ourselves with the relevant diagnostics, explore techniques to overcome the challenges encountered, and finally select the modeling approach that fits the data best.
The first step in developing any pricing model (i.e., a model predicting pure premium, also known as a loss-cost model) is predicting claim frequency: the expected claim count per unit of exposure, which is a rate rather than a simple count. The common assumption is that insurance claim counts follow a Poisson distribution, which implies the mean and variance are equal. A generalized linear model with a Poisson distribution and a log link function is therefore a natural choice to begin with.
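As a minimal sketch, here is how such a Poisson GLM might look in R, fitted to a small simulated portfolio. The `age_group` rating variable and its underlying frequencies are hypothetical, purely for illustration:

```r
# Simulated portfolio: claim counts driven by a hypothetical age-group rating variable
set.seed(1)
n <- 1000
age_group <- factor(sample(c("young", "middle", "senior"), n, replace = TRUE))
true_freq <- c(young = 0.30, middle = 0.10, senior = 0.15)[as.character(age_group)]
claims <- rpois(n, true_freq)

# Poisson GLM with the canonical log link
fit <- glm(claims ~ age_group, family = poisson(link = "log"))
summary(fit)

# The log link makes effects multiplicative: exp(coef) gives frequency relativities
exp(coef(fit))
```

Because of the log link, the exponentiated coefficients are multiplicative relativities against the baseline level, which is how rating factors are usually read in practice.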
What are the various challenges we face while modeling claim frequency?
As noted above, the mean and variance must be equal under a Poisson distribution. In many data sets this property is violated because the data are overdispersed, in which case a Poisson model underestimates the variance of the observed counts.
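A quick diagnostic for overdispersion is to compare the sample mean and variance of the counts; a quasi-Poisson fit then estimates the dispersion explicitly. A minimal sketch on simulated negative-binomial counts (all variable names and parameters here are illustrative):

```r
set.seed(11)
n <- 1000
x <- rnorm(n)
mu <- exp(0.1 + 0.3 * x)
# Negative binomial with size = 1: variance = mu + mu^2, clearly overdispersed
claims <- rnbinom(n, size = 1, mu = mu)

c(mean = mean(claims), variance = var(claims))  # variance exceeds the mean

# Quasi-Poisson keeps the Poisson mean structure but estimates a dispersion parameter
fit <- glm(claims ~ x, family = quasipoisson)
summary(fit)$dispersion  # well above 1 for overdispersed data
```

A dispersion estimate well above 1 inflates the standard errors relative to a plain Poisson fit, which is the practical cost of ignoring overdispersion.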
Count data often contain more zero outcomes than a Poisson regression expects. For example, the proportion of zero claims in automobile insurance data may be large because people tend not to report small claims.
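A simple way to check for excess zeros is to compare the observed proportion of zero counts against what a Poisson distribution with the same mean would predict. A sketch on simulated zero-inflated data (the 30% "never report" share is an assumption for illustration only):

```r
set.seed(3)
n <- 2000
# Hypothetical zero-inflated process: 30% of policyholders never report a claim
never_report <- rbinom(n, 1, prob = 0.3)
claims <- ifelse(never_report == 1, 0L, rpois(n, lambda = 1.0))

observed_zeros <- mean(claims == 0)
# Proportion of zeros a plain Poisson with the same mean would imply
expected_zeros <- exp(-mean(claims))
c(observed = observed_zeros, poisson_expected = expected_zeros)
```

When the observed share of zeros substantially exceeds the Poisson-implied share, a zero-inflated or hurdle model (available, for example, in the `pscl` package) is worth considering.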
Exposures often vary, which means the observations are not directly comparable. For example, 4 claims over 12 months of exposure represent a much higher frequency than 1 claim over 6 months.
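Varying exposure is handled by including log(exposure) as an offset, so the model targets the claim rate per unit of exposure rather than the raw count. A minimal sketch with simulated exposures (the variable names and coefficients are illustrative):

```r
set.seed(7)
n <- 800
exposure <- runif(n, 0.1, 1)   # fraction of a policy year in force
x <- rnorm(n)                  # a generic rating variable
# True frequency per unit of exposure is exp(-1 + 0.4 * x)
claims <- rpois(n, exposure * exp(-1 + 0.4 * x))

# offset(log(exposure)) enters the linear predictor with its coefficient fixed at 1
fit <- glm(claims ~ x + offset(log(exposure)), family = poisson)
coef(fit)
```

With the offset in place, `predict(fit, type = "response")` returns expected counts at each record's own exposure, while the coefficients describe the frequency per unit of exposure.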
Let’s look at the techniques to overcome these challenges.
Run a preliminary Poisson regression and test the null hypothesis of equidispersion against the alternative hypothesis of overdispersion or underdispersion.
The test outputs an estimate of a dispersion parameter α: if α > 0, the data are overdispersed, and if α < 0, they are underdispersed. We will discuss this test through a working example during the modeling exercise in the next section.
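This is the Cameron–Trivedi dispersion test, available in R as `AER::dispersiontest()`. Its core is an auxiliary least-squares regression, which can be sketched in base R as follows (the data are simulated and the parameter values are illustrative):

```r
set.seed(42)
n <- 500
x <- rnorm(n)
# Overdispersed counts: variance = mu + mu^2 / 2 under this negative binomial
y <- rnbinom(n, size = 2, mu = exp(0.2 + 0.5 * x))

fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Auxiliary regression: ((y - mu)^2 - y) / mu = alpha * mu + error
aux <- ((y - mu_hat)^2 - y) / mu_hat
alpha <- coef(lm(aux ~ mu_hat - 1))[["mu_hat"]]
alpha  # positive for these overdispersed data
```

A significantly positive α points toward a quasi-Poisson or negative binomial specification instead of the plain Poisson model.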