Abstract
Consensus Monte Carlo is an algorithm for conducting Monte
Carlo-based Bayesian inference on large data sets distributed across
many worker machines in a data center. The algorithm runs a separate
Monte Carlo sampler on each worker machine, and each worker sees
only a portion of the full data set. The worker-level posterior
samples are then combined to form a Monte Carlo approximation to the
full posterior distribution based on the complete data set. We
compare several methods of carrying out the combination, including a
new method based on approximating worker-level simulations using a
mixture of multivariate Gaussian distributions. We find that
resampling and kernel-density-based methods break down beyond 10
dimensions, and sometimes fewer, while the new mixture-based approach
works well but relies on mixture models that take too long to fit.
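
The split-sample-combine workflow described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the implementation evaluated here: it uses a Gaussian mean model with known identity covariance, draws each worker posterior in closed form as a stand-in for a real Monte Carlo sampler, splits the prior evenly across workers, and combines draws by precision-weighted averaging, which is only one possible combination rule. The function name consensus_combine and all variable names are hypothetical.

```python
import numpy as np

def consensus_combine(worker_draws):
    """Combine per-worker draws by precision-weighted averaging.

    worker_draws: list of arrays, each of shape (n_draws, dim).
    Each worker's weight is the inverse of its sample covariance
    (an assumed weighting choice for this sketch).
    """
    weights = [
        np.linalg.inv(np.atleast_2d(np.cov(d, rowvar=False)))
        for d in worker_draws
    ]
    total_inv = np.linalg.inv(sum(weights))
    # Row g of the result is total_inv @ sum_s W_s @ theta[s, g].
    weighted = sum(d @ w.T for d, w in zip(worker_draws, weights))
    return weighted @ total_inv.T

rng = np.random.default_rng(0)
n_workers, n_draws, tau2 = 10, 5000, 100.0  # tau2: prior variance (assumed)

# Toy data: y_i ~ N(mu, I), with prior mu ~ N(0, tau2 * I).
y = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(10_000, 2))
shards = np.array_split(y, n_workers)

worker_draws = []
for shard in shards:
    # Each worker sees only its shard and a 1/n_workers share of the prior.
    prec = len(shard) + 1.0 / (n_workers * tau2)
    mean = shard.sum(axis=0) / prec
    # Closed-form posterior draws stand in for a worker-level sampler.
    draws = mean + rng.standard_normal((n_draws, 2)) / np.sqrt(prec)
    worker_draws.append(draws)

combined = consensus_combine(worker_draws)

# Exact full-data posterior, available here only because the model is conjugate.
full_prec = len(y) + 1.0 / tau2
full_mean = y.sum(axis=0) / full_prec
full_sd = full_prec ** -0.5

print("combined mean:", combined.mean(axis=0), "exact:", full_mean)
print("combined sd:  ", combined.std(axis=0), "exact:", full_sd)
```

In this Gaussian setting, weighted averaging should track the exact posterior closely. The sketch says nothing about non-Gaussian posteriors, which is where the choice of combination method matters.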