Parallel Monte Carlo: Simulating Compound Poisson Processes using C++ and TBB
Introduction
In this post we implement a function to simulate random samples of a Compound Poisson variable. A random variable \(L\) is a compound Poisson (CP) random variable if there exists a Poisson random variable \(N\), and a random variable \(S\) such that
\[ L = \sum_{i = 1}^N S_i \]
where \(S_i\) for \(i \in \mathbb{N}\) is an IID sequence of random variables with the same distribution as \(S\), independent of \(N\). Expressions of this kind are typical in operational risk modeling, insurance modeling, and ruin theory. Typically, the variable \(N\) counts the loss events hitting an insurance portfolio over a given time period, and \(S_i\) is the severity of the \(i\)-th loss event. A CP variable/process is the most common way banks model operational risk as part of their Advanced Measurement Approach. The model suits operational risk because losses (and hence data) tend to be sparse in this domain, and also heavy-tailed. By separating event frequency (\(N\)) from event severity (\(S\)), the model developer can fit the loss frequency and loss severity distributions independently, each on more data (after accepting the assumptions of the CP process). The IID-losses assumption then lets the modeler focus on fitting the tail of the severity distribution without worrying about the frequency of occurrences.
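As a quick sanity check to hold against the simulated samples later, the first two moments of \(L\) follow from Wald's identity and the law of total variance; for Poisson \(N\) with mean \(\lambda\) and lognormal severities \(S \sim \mathrm{LN}(\mu, \sigma^2)\):
\[ \mathbb{E}[L] = \lambda\,\mathbb{E}[S] = \lambda\, e^{\mu + \sigma^2/2}, \qquad \operatorname{Var}(L) = \lambda\,\mathbb{E}[S^2] = \lambda\, e^{2\mu + 2\sigma^2}. \]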
We’ll implement a function that simulates random samples of \(L\) in R, serially in C++, and in parallel in C++ using the Threading Building Blocks (TBB) library.
R Implementation
R is extremely expressive when it comes to mathematical, statistical, and graphing operations and the base implementation is quite simple:
## Function to simulate random samples from a CP variable with
## log-normal severities
rCP = function(lambda, mu, sigma, N)
{
  sapply(1:N, FUN = function(dummy){
    sum(rlnorm(n = rpois(1, lambda), meanlog = mu, sdlog = sigma))
  })
}
Serial C++ Implementation
We now implement the sampler in C++ using the Rcpp package. We will also make use of the dqrng package so that we can conveniently include the PCG random number generator by Melissa O’Neill.
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <pcg_random.hpp>
#include <random>
#include <algorithm>
// [[Rcpp::export]]
Rcpp::NumericVector cppCP(const double lambda,
                          const double mu,
                          const double sigma,
                          const int N)
{
  // Seed with a real random value, if available
  pcg_extras::seed_seq_from<std::random_device> seed_source;
  // Make a random number engine
  pcg64 rng(seed_source);
  // Distribution for frequency
  std::poisson_distribution<int> Freq(lambda);
  // Distribution for severity
  std::lognormal_distribution<double> Sev(mu, sigma);
  // Allocate vector
  Rcpp::NumericVector out(N);
  // Simulate samples
  std::generate(out.begin(), out.end(), [&](){
    // Simulate the loss event count
    int n = Freq(rng);
    // Accumulate the loss severities
    double s = 0;
    for(int i = 0; i < n; ++i) s += Sev(rng);
    return s;
  });
  return out;
}
x_r = rCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6)
x_cpp = cppCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6)
ks.test(x = x_r, y = x_cpp, alternative = "two.sided")
## Warning in ks.test(x = x_r, y = x_cpp, alternative = "two.sided"): p-value will
## be approximate in the presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x_r and x_cpp
## D = 0.000963, p-value = 0.7427
## alternative hypothesis: two-sided
microbenchmark::microbenchmark(
  "R" = rCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
  "C++" = cppCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
  times = 10L
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## R 4268.0577 4331.33293 4358.51965 4357.96885 4401.17094 4416.6745 10
## C++ 70.8535 70.96943 77.22742 71.20739 71.51722 128.0817 10
We see that a Kolmogorov-Smirnov test for equality of distributions cannot distinguish the two samples. The C++ version is also ~60 times faster than the R version. Note that we added the compilation flags -O3 -march=native to R’s Makeconf file (although one could simply register a plugin with Rcpp or use a Makevars file in a package).
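For reference, the Makevars route can look like the sketch below (variable names follow R's Makeconf conventions; note that -march=native ties the compiled code to the build machine):

```make
## src/Makevars inside a package (use ~/.R/Makevars to affect all
## sourceCpp() compilations instead, with CXXFLAGS rather than PKG_CXXFLAGS)
PKG_CXXFLAGS = -O3 -march=native
```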
Parallel C++ Implementation
There are many libraries we can use to parallelize the C++ code above. These include OpenMP and OpenACC (both of which offer standards-based parallelization through directive-based APIs, with newer standards allowing GPU offloading), MPI and Boost.MPI for distributed message passing, Kokkos, Taskflow, Boost.Compute, HPX, etc.
However, Threading Building Blocks (TBB) is powerful, expressive, and mature, and it is conveniently bundled with the RcppParallel package. I find TBB’s API to be very well designed. So TBB it is!
#include <Rcpp.h>
#include <RcppParallel.h>
// [[Rcpp::depends(dqrng, RcppParallel)]]
#include <pcg_random.hpp>
#include <random>
#include <algorithm>
// [[Rcpp::export]]
Rcpp::NumericVector tbbCP(const double lambda,
                          const double mu,
                          const double sigma,
                          const int N,
                          const uint64_t seed)
{
  using brange = tbb::blocked_range<size_t>;
  // Allocate vector
  Rcpp::NumericVector out(N);
  // Get a pointer to the data
  auto begin = out.begin();
  tbb::parallel_for(brange(0, N), [=](const brange& range){
    // Distribution for frequency
    std::poisson_distribution<int> Freq(lambda);
    // Distribution for severity
    std::lognormal_distribution<double> Sev(mu, sigma);
    // RNG local to this block, with a unique stream
    pcg64 rng(seed, range.end());
    // Serial version of the sampler
    auto seq_CP = [&](){
      // Simulate the loss event count
      int n = Freq(rng);
      // Accumulate the loss severities
      double s = 0;
      for(int i = 0; i < n; ++i) s += Sev(rng);
      return s;
    };
    // Fill this block with samples
    std::generate(begin + range.begin(), begin + range.end(), seq_CP);
  });
  return out;
}
C++11 and TBB allow for pretty parallel code. Let’s see if this function is as useful as it is pretty.
x_tbb = tbbCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6, seed = 42)
ks.test(x = x_cpp, y = x_tbb, alternative = "two.sided")
## Warning in ks.test(x = x_cpp, y = x_tbb, alternative = "two.sided"): p-value
## will be approximate in the presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x_cpp and x_tbb
## D = 0.000891, p-value = 0.8222
## alternative hypothesis: two-sided
ks.test(x = x_r, y = x_tbb, alternative = "two.sided")
## Warning in ks.test(x = x_r, y = x_tbb, alternative = "two.sided"): p-value will
## be approximate in the presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x_r and x_tbb
## D = 0.001349, p-value = 0.3227
## alternative hypothesis: two-sided
p = c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 0.9999)
data.frame(R = quantile(x_r, probs = p),
           Cpp = quantile(x_cpp, probs = p),
           TBB = quantile(x_tbb, probs = p))
## R Cpp TBB
## 50% 525.6721 527.8515 525.2247
## 75% 3188.0650 3205.9785 3215.0528
## 90% 13007.0777 12989.6880 13043.5109
## 95% 29284.5478 29428.5408 29399.1686
## 99% 137016.4653 136261.8314 139049.8941
## 99.9% 821332.7436 794316.1106 845872.3186
## 99.99% 3864860.8800 3319502.5073 4282836.7037
The Kolmogorov-Smirnov test again can’t tell the three samples (R, serial C++, and TBB) apart. Let’s also look at the spacing between consecutive sorted non-zero values. If the RNG streams of different threads overlapped in the parallel implementation, this diagnostic would be a first way to find out: duplicated draws would show up as a spike at 0 in the differences between consecutive sorted values.
library(magrittr)
y_r = x_r[x_r > 0] %>% sort(decreasing = FALSE) %>% diff()
y_tbb = x_tbb[x_tbb > 0] %>% sort(decreasing = FALSE) %>% diff()
ks.test(y_r, y_tbb, alternative = "two.sided")
## Warning in ks.test(y_r, y_tbb, alternative = "two.sided"): p-value will be
## approximate in the presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: y_r and y_tbb
## D = 0.0013749, p-value = 0.4224
## alternative hypothesis: two-sided
data.frame(R = quantile(y_r, probs = p),
           TBB = quantile(y_tbb, probs = p))
## R TBB
## 50% 4.706155e-03 4.696845e-03
## 75% 2.789482e-02 2.799611e-02
## 90% 1.927010e-01 1.922044e-01
## 95% 7.343613e-01 7.353753e-01
## 99% 1.342726e+01 1.425215e+01
## 99.9% 6.736032e+02 6.651423e+02
## 99.99% 2.541149e+04 2.965516e+04
Final Performance Comparison
microbenchmark::microbenchmark(
  "R" = rCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
  "C++" = cppCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6),
  "TBB" = tbbCP(lambda = 1.7, mu = 5.5, sigma = 2.5, N = 10^6, seed = 42),
  times = 10L
)
## Unit: milliseconds
## expr min lq mean median uq max
## R 4257.52277 4386.657195 4419.765148 4439.142598 4464.874526 4497.83768
## C++ 70.69678 70.932022 71.548681 71.373135 72.201374 72.53289
## TBB 9.31567 9.440707 9.578117 9.507002 9.641842 10.27413
## neval
## 10
## 10
## 10
TBB ran ~7.5 times faster than the serial version. The algorithm’s runtime scales linearly with \(\lambda\), so let’s compare the two C++ versions with a higher lambda:
microbenchmark::microbenchmark(
  "C++" = cppCP(lambda = 17, mu = 5.5, sigma = 2.5, N = 10^6),
  "TBB" = tbbCP(lambda = 17, mu = 5.5, sigma = 2.5, N = 10^6, seed = 42),
  times = 50L
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## C++ 585.05328 587.97749 595.12652 595.08056 601.1128 610.35413 50
## TBB 68.66768 69.74909 71.06729 70.30209 71.6615 82.82758 50
TBB ran ~8.5 times faster than the serial version.