This article was written for *Qvintensen*, the magazine of the Swedish Statistical Association, in May 2014. Long-overdue credit goes to Professor Dan Hedlin of Stockholm University and Ingegerd Jansson of Statistics Sweden for their very kind invitation and careful typesetting. A PDF version of this article is available from *Qvintensen*.

# Complex Model for Complex Data via the Bayesian Approach

## Background

Statistical methods have developed rapidly in the past twenty years. One driving factor is that increasingly complicated high-dimensional data require sophisticated methods of analysis. A notably successful case is machine learning, which is now widely used in industry. Another reason is the dramatic advancement of the statistical computing environment. Computationally intensive methods that in the past could only be run on expensive supercomputers can now be run on a standard PC. This has created an enormous momentum for Bayesian analysis, where complex models are typically analyzed with modern computer-intensive simulation methods.

Traditional linear models with Gaussian assumptions are challenged by the new large and complicated datasets, which have in turn spurred interest in flexible modeling approaches with less restrictive assumptions. Moreover, research has shifted from merely modeling the mean and variance of the data to sophisticated modeling of skewness, tail-dependence, and outliers. However, such work demands efficient inference tools. The development of highly efficient Markov chain Monte Carlo (MCMC) methods has lowered this barrier. The Bayesian approach provides a natural way to carry out prediction, model comparison, and evaluation of complicated models, and has the additional advantage of being intimately connected with decision making.

## The Bayesian density estimation

In statistics, density estimation is the procedure of estimating an unknown density p(y) from observed data. Density estimation techniques trace back to the use of histograms, later followed by kernel density estimation in which the shape of the data is approximated through a kernel function with a smoothing parameter. However, kernel density estimation suffers from one obstacle, which is the necessary step of specifying the bandwidth.
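To make the role of the bandwidth concrete, here is a minimal sketch of a Gaussian kernel density estimator in plain Python. The function name `gaussian_kde`, the toy sample, and the two bandwidth values are illustrative assumptions and do not come from the article.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function x -> kernel density estimate using a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred at each observation.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

# Hypothetical toy sample with two clusters (not data from the article).
sample = [-2.1, -1.9, -2.0, -1.7, 1.8, 2.0, 2.2, 2.1]

narrow = gaussian_kde(sample, bandwidth=0.3)  # small bandwidth: follows both clusters
wide = gaussian_kde(sample, bandwidth=3.0)    # large bandwidth: smooths them together

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  h=0.3: {narrow(x):.4f}  h=3.0: {wide(x):.4f}")
```

With the small bandwidth the estimate is close to zero between the two clusters, while the large bandwidth merges them into a single bump. The same data therefore give two quite different pictures, depending only on the bandwidth.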

Mixture models have recently become a popular alternative approach. A mixture density combines several component densities, usually as a weighted sum in which the weights are non-negative and sum to one. Mixture densities can be used to capture data characteristics such as multi-modality, fat tails, and skewness. Figure 1 shows four examples.

[Fig-1: Using mixtures of normal densities (thin lines) to mimic a flexible density (bold line)]
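The idea of a mixture of normals can be sketched in a few lines of Python. The weights, means, and standard deviations below are hypothetical choices made to produce a bimodal, a fat-tailed, and a skewed density. They are not the settings behind Figure 1.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(y, weights, means, sds):
    """Weighted sum of normal densities; weights must be non-negative and sum to one."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

# Illustrative (assumed) component settings.
bimodal = dict(weights=[0.5, 0.5], means=[-2.0, 2.0], sds=[1.0, 1.0])
fat_tailed = dict(weights=[0.9, 0.1], means=[0.0, 0.0], sds=[1.0, 5.0])
skewed = dict(weights=[0.7, 0.3], means=[0.0, 2.5], sds=[1.0, 2.0])

for name, spec in [("bimodal", bimodal), ("fat-tailed", fat_tailed), ("skewed", skewed)]:
    values = ", ".join(f"{mixture_pdf(y, **spec):.4f}" for y in (-4.0, 0.0, 4.0))
    print(f"{name:>10}: density at y = -4, 0, 4 -> {values}")
```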

Conditional density estimation concentrates on modeling the relationship between a response *y* and a set of covariates *x* through a conditional density function p(y|x). In the simplest case, the homoscedastic Gaussian linear regression y = x′ β + ε is trivially equivalent to modeling p(y|x) by a Gaussian density with mean function μ = x′ β and constant variance.

In Bayesian statistics, inference about an unknown quantity θ is based on the posterior distribution p(θ|y), which combines the information in the data y, expressed through the likelihood p(y|θ), with prior beliefs about θ, expressed through the prior p(θ). In many simple statistical models with vague priors that play a minimal role in the posterior distribution, Bayesian inference draws similar conclusions to those obtained from a traditional frequentist approach. The Bayesian approach is, however, more easily extended to more complicated models using MCMC simulation techniques. In principle, MCMC can be applied to many hard-to-estimate models. In practice, however, the quality of the inference depends heavily on the efficiency of the MCMC algorithm. This is especially true in nonlinear models with many correlated parameters.
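As a rough illustration of how MCMC turns the prior and the likelihood into posterior draws, the sketch below runs a random-walk Metropolis sampler. The model is deliberately simple and assumed for illustration: a normal mean with known standard deviation and a vague normal prior. This is the textbook algorithm, not the tailored samplers needed for the complex models discussed in this article. The simulated data, step size, and burn-in length are all assumptions.

```python
import math
import random

def log_posterior(mu, data, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Unnormalized log posterior: Gaussian log-likelihood plus Gaussian log-prior."""
    log_lik = -0.5 * sum((y - mu) ** 2 for y in data) / sigma ** 2
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior

def random_walk_metropolis(data, n_draws=20000, step=0.5, start=0.0, seed=1):
    """Draw from the posterior of mu with Gaussian random-walk proposals."""
    rng = random.Random(seed)
    mu = start
    current = log_posterior(mu, data)
    draws, accepted = [], 0
    for _ in range(n_draws):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
            accepted += 1
        draws.append(mu)
    return draws, accepted / n_draws

# Simulated data (an assumption for illustration, not data from the article).
data_rng = random.Random(42)
data = [data_rng.gauss(1.5, 1.0) for _ in range(20)]

draws, acceptance_rate = random_walk_metropolis(data)
kept = draws[2000:]  # discard the first draws as burn-in
mcmc_mean = sum(kept) / len(kept)

# In this conjugate model the exact posterior mean is available for comparison.
precision = len(data) / 1.0 ** 2 + 1.0 / 10.0 ** 2
exact_mean = sum(data) / precision

print(f"MCMC posterior mean:  {mcmc_mean:.3f}")
print(f"Exact posterior mean: {exact_mean:.3f}")
print(f"Acceptance rate:      {acceptance_rate:.2f}")
```

The step size governs the efficiency. Very small steps are accepted often but explore slowly. Very large steps are rarely accepted. That trade-off becomes much harder to manage when there are many correlated parameters.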

A key factor in evaluating a method’s performance is how it balances the trade-off between goodness-of-fit and overfitting. It is common that a model that wins in goodness-of-fit loses in prediction. Variable selection is a technique commonly used in this context. Historically, the purposes of variable selection have been to select meaningful covariates that contribute to the model, to avoid ill-behaved design matrices, and to prevent model overfitting. Methods like backward and forward selection are standard routines in most statistical software packages. However, these techniques have obvious drawbacks, e.g. the selection depends heavily on the starting point, which becomes more problematic for high-dimensional data with many covariates. Most current methods rely on Bayesian variable selection via MCMC. A standard Bayesian variable selection approach is to augment the regression model with a variable selection indicator for each covariate. To overcome problems with overfitting, shrinkage estimation can also be used as an alternative, or even complementary, approach to variable selection. A shrinkage estimator shrinks the regression coefficients towards zero rather than eliminating the covariate completely. One way to select a proper amount of shrinkage is by cross-validation.
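Shrinkage with a cross-validated penalty can be sketched with ridge regression, used here as a simple stand-in for the shrinkage estimators mentioned above. The data are simulated, with only two of the eight covariates truly active. The penalty grid and the helper names `solve`, `ridge`, and `cv_error` are assumptions made for this example.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ridge(xs, ys, lam):
    """Ridge estimate (X'X + lam I)^{-1} X'y for a model without intercept."""
    p = len(xs[0])
    xtx = [[sum(row[i] * row[j] for row in xs) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(xs, ys)) for i in range(p)]
    return solve(xtx, xty)

def cv_error(xs, ys, lam, folds=5):
    """Mean squared prediction error from K-fold cross-validation."""
    n = len(ys)
    total = 0.0
    for k in range(folds):
        test = set(range(k, n, folds))
        train_x = [xs[i] for i in range(n) if i not in test]
        train_y = [ys[i] for i in range(n) if i not in test]
        beta = ridge(train_x, train_y, lam)
        for i in test:
            pred = sum(b * v for b, v in zip(beta, xs[i]))
            total += (ys[i] - pred) ** 2
    return total / n

# Simulated data (assumed): 40 observations, 8 covariates, only the first two matter.
rng = random.Random(0)
n, p = 40, 8
true_beta = [2.0, -1.0] + [0.0] * (p - 2)
xs = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
ys = [sum(b * v for b, v in zip(true_beta, row)) + rng.gauss(0, 1) for row in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]  # candidate shrinkage penalties
errors = {lam: cv_error(xs, ys, lam) for lam in grid}
best_lam = min(errors, key=errors.get)

for lam in grid:
    print(f"lambda={lam:6.1f}  CV error={errors[lam]:.3f}")
print("selected lambda:", best_lam)
```

A larger penalty pulls all coefficients closer to zero without setting any of them exactly to zero. That is the difference from variable selection, where an indicator switches a covariate in or out of the model.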

## Bayesian models for complex data

Modeling the volatility and variability in financial data has been a highly active research area since the seminal paper by Engle introduced the ARCH model in 1982, and there are large financial markets for volatility-based instruments. Financial data, such as stock market returns, are typically heavy tailed and subject to volatility clustering, i.e. time-varying variance. They also frequently show skewness and kurtosis that evolve in a very persistent fashion or that result from a financial crisis with unprecedented volatility; see Figure 2, which models the degrees of freedom of S&P 500 returns. Modeling such data requires sophisticated MCMC treatment, but in return we obtain better insights into situations that other methods can hardly tackle.

[Fig-2: Time series plot of the posterior median and 95% probability intervals for kurtosis in terms of degrees of freedom of the return distribution for S&P 500 stock returns.]
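To see why degrees of freedom can stand in for kurtosis, consider the standard Student t distribution. It is used here only as an assumed reference case, since the exact return distribution behind Figure 2 is not spelled out in this article. Its excess kurtosis is 6/(ν − 4) when the degrees of freedom ν exceed 4, and infinite when ν lies between 2 and 4. The helper below is a hypothetical illustration of that formula.

```python
import math

def student_t_excess_kurtosis(df):
    """Excess kurtosis of a Student t distribution with df degrees of freedom."""
    if df > 4:
        return 6.0 / (df - 4.0)
    if df > 2:
        return math.inf  # variance exists, but the fourth moment does not
    raise ValueError("kurtosis is undefined for df <= 2")

for df in (3, 5, 10, 30, 100):
    print(f"df={df:>3}  excess kurtosis={student_t_excess_kurtosis(df)}")
```

Small degrees of freedom mean very heavy tails, and the distribution approaches the normal (zero excess kurtosis) as the degrees of freedom grow.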

LIDAR, Light Detection And Ranging, is a technique that uses laser-emitted light to detect chemical compounds in the atmosphere. In the dataset we have analyzed, the response variable (logratio) consists of 221 observations on the log ratio of received light from two laser sources: one at the resonance frequency of the target compound, and the other at a frequency off this target frequency. The predictor is the distance traveled before the light is reflected back to its source (range). Our aim is to model the predictive density p(logratio | range). A smooth mixture of asymmetric densities is used to model this predictive density, which involves a large number of parameters; see Figure 3 for the predictive mean and predictive regions. The model is therefore likely to overfit the data unless model complexity is controlled effectively. Bayesian variable selection in all parameters can lead to important simplifications of the mixture components. Not only does this control complexity for a given number of components, but it also simplifies the existing components if an additional component is added to the model.

[Fig-3: Smooth mixture models for the LIDAR data. The figure displays the actual data overlaid on predictive regions and the predictive mean.]

In finance applications, a firm’s leverage (fraction of external financing) is usually modeled as a function of the proportion of fixed assets, the firm’s market value in relation to its book value, firm sales, and profits. The relationships between leverage and the covariates are highly nonlinear. There are also outliers. Strong nonlinearities seem to be a quite general feature of balance sheet data, but only a handful of articles have suggested using nonlinear/nonparametric models. One approach is to extend the regression model with a large number of auxiliary variables, known as *splines*. A nonlinear curve or surface can then be constructed by choosing the right number of splines and placing them at suitable locations in the covariate space (see Figure 4 for the fitted mean surface and its standard deviation). Nonetheless, correctly allocating the splines in covariate space is not trivial. Bayesian methods treat the locations as unknown parameters, which allocates the splines efficiently and therefore keeps their number to a minimum. Compared with the traditional deterministic spline approach, the Bayesian approach allows the splines to move freely in the covariate space and provides a dynamic surface together with a measure of its uncertainty.

[Fig-4.1, Fig-4.2: The posterior mean (left) and standard deviation (right) of the posterior surface for the model for firm leverage data. The depth of the color indicates how the leverage varies with book values and profits. The subplot to the right also shows an overlay of the covariate observations.]
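The role of spline locations (knots) can be illustrated with a one-dimensional truncated linear basis. This is a deliberately simple stand-in, not the spline surface used for the leverage data. The coefficients and knot positions are hypothetical, and the knots are fixed by hand here rather than treated as unknown parameters.

```python
def truncated_linear_basis(x, knots):
    """Design row [1, x, (x - k1)+, (x - k2)+, ...] for a linear spline."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]

def spline_mean(x, coefs, knots):
    """Mean curve: inner product of the coefficients with the spline basis at x."""
    return sum(c * b for c, b in zip(coefs, truncated_linear_basis(x, knots)))

# Hypothetical coefficients: slope +1 before the knot, slope -1 after it.
coefs = [0.0, 1.0, -2.0]

for knots in ([0.5], [0.8]):
    curve = [round(spline_mean(x / 10, coefs, knots), 2) for x in range(0, 11, 2)]
    print(f"knot at {knots[0]}: mean at x = 0.0, 0.2, ..., 1.0 -> {curve}")
```

Moving the single knot from 0.5 to 0.8 changes where the curve bends, with the coefficients left unchanged. This is why the knot locations matter as much as the number of splines.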

## A model bigger than an elephant?

The linear regression model, considered very advanced in the 1950s, is now standard course content for university students. Data today are far more complicated, not only because their volume has grown but also because their structure is more intricate. Very high-dimensional data that mix numeric variables, character strings, images, or videos are no longer rare. Sophisticated models are essential in such situations. In principle, a more complicated model should be able to capture more complicated features of the data, but estimating and interpreting such a model is far from straightforward. In my view, there is a huge space to explore, both computationally and statistically. Statistical models that can adapt to modern computational architectures already flourish in industry. Techniques such as high-performance computing will become more widely used in statistics, and statisticians will eventually become familiar with them.

(*I would like to thank Professor Mattias Villani who introduced me to this exciting area.*)