Parallel Monte Carlo: Simulating Compound Poisson Processes using C++ and TBB

Introduction

In this post we implement a function to simulate random samples of a compound Poisson variable. A random variable \(L\) is a compound Poisson (CP) random variable if there exists a Poisson random variable \(N\) and a random variable \(S\) such that

\[ L = \sum_{i = 1}^N S_i \]

where \(S_i\) for \(i \in \mathbb{N}\) is an IID sequence of random variables with the same distribution as \(S\), independent of \(N\). Expressions of this kind are typical in operational risk modeling, insurance modeling, and ruin theory. Typically, \(N\) counts the loss events hitting an insurance portfolio over a given time period, and \(S_i\) is the severity of the \(i\)-th loss event. A CP variable/process is the most common way banks model operational risk under the Advanced Measurement Approach. The model suits operational risk because loss data tends to be sparse in this domain, and losses tend to be heavy tailed. By separating event frequency (\(N\)) from event severity (\(S\)), the model developer can fit the frequency and severity distributions independently, each on all of the available data (after accepting the assumptions of the CP process). The IID assumption on losses lets the modeler focus on fitting the tail of the severity distribution without worrying about the frequency of occurrences.
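Under these assumptions the first two moments of \(L\) follow from conditioning on \(N\) (Wald's identity and the law of total variance); with \(\lambda = \mathbb{E}[N]\),

\[ \mathbb{E}[L] = \lambda\, \mathbb{E}[S], \qquad \operatorname{Var}(L) = \lambda\, \mathbb{E}[S^2]. \]

For log-normal severities, \(\mathbb{E}[S] = e^{\mu + \sigma^2/2}\) and \(\mathbb{E}[S^2] = e^{2\mu + 2\sigma^2}\), which gives a quick sanity check on the simulated samples below.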

We’ll implement a function that simulates random samples of \(L\) in R, serially in C++, and in parallel in C++ using the Threading Building Blocks (TBB) library.

R Implementation

R is extremely expressive when it comes to mathematical, statistical, and graphing operations, and the base-R implementation is quite simple:

## Function to simulate random samples from a CP variable with 
## log-normal severities
rCP = function(lambda, mu, sigma, N)
{
    sapply(1:N, FUN = function(dummy){
        sum(rlnorm(n = rpois(1, lambda), meanlog = mu, sdlog = sigma))
    })
}

Serial C++ Implementation

We now implement the sampler in C++ using the Rcpp package. We also make use of the dqrng package, which conveniently ships the PCG family of random number generators by Melissa O'Neill.

#include <Rcpp.h>

// [[Rcpp::depends(dqrng)]]

#include <pcg_random.hpp>
#include <random>
#include <algorithm>

// [[Rcpp::export]]
Rcpp::NumericVector cppCP(const double lambda,
                          const double mu,
                          const double sigma,
                          const int N)
{
    // Seed with a real random value, if available
    pcg_extras::seed_seq_from<std::random_device> seed_source;

    // Make a random number engine
    pcg64 rng(seed_source);
    
    // Distribution for frequency
    std::poisson_distribution<int> Freq(lambda);
    
    // Distribution for severity
    std::lognormal_distribution<double> Sev(mu, sigma);
    
    // Allocate vector
    Rcpp::NumericVector out(N);

    // Simulate samples
    std::generate(out.begin(), out.end(), [&](){
        
        // Simulating loss event count
        int n = Freq(rng);
        
        // Accumulating loss severities
        double s = 0;
        for(int i = 0; i < n; ++i) s += Sev(rng);
        
        return s;
    });
    
    return out;
}
With both implementations in hand, we draw one million samples from each and compare the resulting distributions:

x_r = rCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6)
x_cpp = cppCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6)

ks.test(x = x_r, y = x_cpp, alternative = "two.sided")
## Warning in ks.test(x = x_r, y = x_cpp, alternative = "two.sided"): p-value will
## be approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x_r and x_cpp
## D = 0.000963, p-value = 0.7427
## alternative hypothesis: two-sided
Next we benchmark the two implementations:

microbenchmark::microbenchmark(
    "R" = rCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
    "C++" = cppCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
    times = 10L
    )
## Unit: milliseconds
##  expr       min         lq       mean     median         uq       max neval
##     R 4268.0577 4331.33293 4358.51965 4357.96885 4401.17094 4416.6745    10
##   C++   70.8535   70.96943   77.22742   71.20739   71.51722  128.0817    10

We see that a Kolmogorov-Smirnov test for equality of distributions cannot distinguish the two samples. The C++ version is also roughly 60 times faster than the R version (median 71 ms vs. 4.36 s). Note that we added the compilation flags -O3 -march=native to R's Makeconf file (although one could simply register a plugin with Rcpp, or use a Makevars file in a package).
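For a per-user setup that does not require editing Makeconf, the same flags can go in ~/.R/Makevars. A sketch only: the exact CXX*FLAGS variable names depend on your R version, and -march=native ties the compiled code to the build machine:

```make
## ~/.R/Makevars -- picked up by R CMD SHLIB and Rcpp::sourceCpp
CXXFLAGS   += -O3 -march=native
CXX11FLAGS += -O3 -march=native
```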

Parallel C++ Implementation

There are many libraries we can use to parallelize the C++ code above. These include OpenMP and OpenACC (both of which offer standards-based parallelization through directive-based APIs, with newer standards allowing GPU offloading), MPI and Boost.MPI for distributed message passing, Kokkos, Taskflow, Boost.Compute, HPX, and others.

However, Threading Building Blocks (TBB) is powerful, expressive, and mature, and it is conveniently bundled with the RcppParallel package. I find TBB's API to be very well designed. So TBB it is!

#include <Rcpp.h>
#include <RcppParallel.h>

// [[Rcpp::depends(dqrng, RcppParallel)]]

#include <pcg_random.hpp>
#include <random>
#include <algorithm>



// [[Rcpp::export]]
Rcpp::NumericVector tbbCP(const double lambda,
                          const double mu,
                          const double sigma,
                          const int N,
                          const uint64_t seed)
{
    using brange = tbb::blocked_range<size_t>;
    
    // Allocate vector
    Rcpp::NumericVector out(N);
    
    // Getting pointer to data
    auto begin = out.begin();
    
    tbb::parallel_for(brange(0, N), [=](const brange& range){
        
        // Distribution for frequency
        std::poisson_distribution<int> Freq(lambda);
    
        // Distribution for severity
        std::lognormal_distribution<double> Sev(mu, sigma);
        
        // RNG local to thread, with unique stream
        pcg64 rng(seed, range.end());
        
        // Serial version of sampler
        auto seq_CP = [&](){
            
            // Simulating loss event count
            int n = Freq(rng);
            
            // Accumulating loss severities
            double s = 0;
            for(int i = 0; i < n; ++i) s += Sev(rng);
            
            return s;
        };
        
        // Loop to simulate samples
        std::generate(begin + range.begin(), begin + range.end(), seq_CP);
    });
    
    return out;
}

C++11 and TBB allow for pretty parallel code. Let’s see if this function is as useful as it is pretty.

x_tbb = tbbCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6, seed = 42)

ks.test(x = x_cpp, y = x_tbb, alternative = "two.sided")
## Warning in ks.test(x = x_cpp, y = x_tbb, alternative = "two.sided"): p-value
## will be approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x_cpp and x_tbb
## D = 0.000891, p-value = 0.8222
## alternative hypothesis: two-sided
ks.test(x = x_r, y = x_tbb, alternative = "two.sided")
## Warning in ks.test(x = x_r, y = x_tbb, alternative = "two.sided"): p-value will
## be approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  x_r and x_tbb
## D = 0.001349, p-value = 0.3227
## alternative hypothesis: two-sided
Comparing quantiles, with emphasis on the tail:

p = c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 0.9999)
data.frame(R = quantile(x_r, probs = p), 
           Cpp = quantile(x_cpp, probs = p), 
           TBB = quantile(x_tbb, probs = p))
##                   R          Cpp          TBB
## 50%        525.6721     527.8515     525.2247
## 75%       3188.0650    3205.9785    3215.0528
## 90%      13007.0777   12989.6880   13043.5109
## 95%      29284.5478   29428.5408   29399.1686
## 99%     137016.4653  136261.8314  139049.8941
## 99.9%   821332.7436  794316.1106  845872.3186
## 99.99% 3864860.8800 3319502.5073 4282836.7037

The Kolmogorov-Smirnov test again cannot tell apart the samples from the R, serial C++, and TBB implementations. Let's also look at the spacing between consecutive sorted values (order statistics) of the non-zero samples. If the RNG streams of different threads overlapped in the parallel implementation, this diagnostic would be a first way to find out: duplicated draws would show up as a spike of zero gaps between consecutive sorted values.

library(magrittr)

y_r = x_r[x_r > 0] %>% sort(decreasing = FALSE) %>% diff()
y_tbb = x_tbb[x_tbb > 0 ] %>% sort(decreasing = FALSE) %>% diff()

ks.test(y_r, y_tbb, alternative = "two.sided")
## Warning in ks.test(y_r, y_tbb, alternative = "two.sided"): p-value will be
## approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  y_r and y_tbb
## D = 0.0013749, p-value = 0.4224
## alternative hypothesis: two-sided
data.frame(R = quantile(y_r, probs = p), 
           TBB = quantile(y_tbb, probs = p))
##                   R          TBB
## 50%    4.706155e-03 4.696845e-03
## 75%    2.789482e-02 2.799611e-02
## 90%    1.927010e-01 1.922044e-01
## 95%    7.343613e-01 7.353753e-01
## 99%    1.342726e+01 1.425215e+01
## 99.9%  6.736032e+02 6.651423e+02
## 99.99% 2.541149e+04 2.965516e+04

Final Performance Comparison

microbenchmark::microbenchmark(
    "R" = rCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
    "C++" = cppCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
    "TBB" = tbbCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6, seed = 42),
    times = 10L
    )
## Unit: milliseconds
##  expr        min          lq        mean      median          uq        max
##     R 4257.52277 4386.657195 4419.765148 4439.142598 4464.874526 4497.83768
##   C++   70.69678   70.932022   71.548681   71.373135   72.201374   72.53289
##   TBB    9.31567    9.440707    9.578117    9.507002    9.641842   10.27413
##  neval
##     10
##     10
##     10

TBB ran ~7.5 times faster than the serial version. The algorithm's runtime scales linearly with \(\lambda\) (each sample requires on average \(\lambda\) severity draws), so let's compare the two C++ versions with a higher lambda:

microbenchmark::microbenchmark(
    "C++" = cppCP(lambda = 17, mu = 5.5, sigma = 2.5, N = 10^6),
    "TBB" = tbbCP(lambda = 17, mu = 5.5, sigma = 2.5, N = 10^6, seed = 42),
    times = 50L
    )
## Unit: milliseconds
##  expr       min        lq      mean    median       uq       max neval
##   C++ 585.05328 587.97749 595.12652 595.08056 601.1128 610.35413    50
##   TBB  68.66768  69.74909  71.06729  70.30209  71.6615  82.82758    50

TBB ran ~8.5 times faster than the serial version (median 595 ms vs. 70 ms).
