Bayesian Statistics: From Intuition to Practice
Introduction: Model-Based Thinking
Statistics shapes how we understand uncertainty. For much of the 20th century, the frequentist approach dominated scientific research. This framework treats probability as the long-run frequency of events. A coin has a 50% chance of heads because, over many flips, it lands heads about half the time.
Bayesian statistics offers a fundamentally different perspective. Rather than defining probability as long-run frequency, Bayesians treat probability as a degree of belief — a tool for reasoning about uncertainty itself. This shift in interpretation unlocks a powerful modeling philosophy: model-based thinking.
In the model-based view, every statistical analysis begins with an explicit generative model — a story about how the data could have been produced. You write down the process that generated your observations, assign probability distributions to each unknown quantity, and then use Bayesian inference to learn from data. The model is not just a convenience; it is the entire analysis.
As Richard McElreath emphasizes in Statistical Rethinking, the goal is not to test whether a model is "true" — no model is true. The goal is to understand how different assumptions produce different inferences, and to compare models by how well they predict. This philosophy makes Bayesian methods uniquely suited for modern data science: small datasets, complex hierarchical structures, and the need to quantify uncertainty in predictions all arise naturally within this framework.
The Garden of Forking Data
Before formalizing Bayes' theorem, it helps to think about inference in a way McElreath calls the Garden of Forking Data. Imagine all the possible ways a study could have turned out. Each possible dataset is a "path" through this garden. The data you actually observed is just one path.
The key insight: a good model should have produced your observed data as a plausible path. If your model assigns very low probability to the data you actually saw, something is wrong — either the model or the data. This idea underpins both prior and posterior predictive checks.
Consider tossing a small globe and noting whether it lands on water (W) or land (L). Suppose we believe the true proportion of water is 70%. How surprising would it be to observe the sequence W, W, L, W? Tracing the branching paths, that sequence has probability 0.7 × 0.7 × 0.3 × 0.7 ≈ 0.103 under this assumption.
The probability of any particular sequence depends on the parameter value we assume. Different parameter values produce different probability distributions over the "garden" of possible datasets. Bayesian inference works backward: given the observed path, which parameter values are most consistent with it?
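To make this backward reasoning concrete, here is a small Python sketch (the candidate values of p are chosen arbitrarily for illustration) comparing how probable the observed sequence is under different parameter values:

```python
# Probability of the observed sequence W, W, L, W (3 water, 1 land)
# under several candidate values of p
def sequence_prob(p):
    return p ** 3 * (1 - p)

for p in [0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}: P(W,W,L,W) = {sequence_prob(p):.4f}")
```

Among these candidates, p = 0.7 makes the observed path most plausible, which is exactly the intuition Bayesian inference formalizes.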
Bayes' Theorem: The Mathematical Foundation
At the heart of Bayesian statistics lies Bayes' theorem, a simple equation with profound implications:
P(theta | data) = P(data | theta) * P(theta) / P(data)
This formula tells us how to update our beliefs about a parameter theta after observing data. Let's unpack each component:
- P(theta | data) is the posterior probability — our updated belief about theta after seeing the data
- P(data | theta) is the likelihood — how probable the observed data is, given a particular value of theta
- P(theta) is the prior probability — what we believed about theta before collecting data
- P(data) is the marginal likelihood or evidence — a normalizing constant that ensures the posterior sums to one
The theorem follows directly from the definition of conditional probability. The joint probability of A and B can be written two ways: P(A and B) = P(A | B) P(B) = P(B | A) P(A). Setting these equal and dividing by P(B) gives Bayes' theorem.
In practice, the denominator is often hard to compute directly. Instead, we work with the unnormalized posterior:
Posterior ~ Likelihood * Prior
This proportionality is all we need for most computational approaches, including grid approximation and MCMC sampling.
A Concrete Example: Medical Testing
Consider a disease that affects 1% of the population. A test for this disease has 95% sensitivity (correctly identifies diseased patients) and 90% specificity (correctly identifies healthy patients). You test positive. What is the probability you actually have the disease?
Define:
- D = having the disease
- T+ = testing positive
We know:
- P(D) = 0.01 (prevalence, our prior)
- P(T+ given D) = 0.95 (sensitivity)
- P(T+ given not D) = 0.10 (false positive rate)
Using Bayes' theorem:
P(D | T+) = P(T+ | D) * P(D) / P(T+)
The denominator expands to:
P(T+) = P(T+ | D) * P(D) + P(T+ | not D) * P(not D)
= 0.95 * 0.01 + 0.10 * 0.99
= 0.0095 + 0.099
= 0.1085
Therefore:
P(D | T+) = 0.0095 / 0.1085 = 0.088
Despite testing positive with a reasonably accurate test, you have only about an 8.8% chance of actually having the disease. This counterintuitive result stems from the low base rate — the prior probability is doing heavy lifting. Bayes' theorem forces us to account for it.
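The arithmetic above can be checked with a few lines of Python:

```python
# Bayes' theorem for the medical testing example
prevalence = 0.01          # P(D), the prior
sensitivity = 0.95         # P(T+ | D)
false_positive = 0.10      # P(T+ | not D)

# Total probability of a positive test (the denominator)
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Posterior probability of disease given a positive test
posterior = sensitivity * prevalence / p_positive
print(f"P(D | T+) = {posterior:.3f}")  # 0.088
```

Doubling the prevalence to 2% roughly doubles the posterior, which makes the base rate's influence easy to explore.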
Prior, Likelihood, and Posterior: The Three Pillars
The Prior Distribution
The prior distribution encodes what we know about a parameter before observing data. In the model-based framework, the prior is part of the generative model — it is not an optional add-on but a statement about what parameter values are plausible.
Priors come in several flavors:
Informative priors incorporate substantive knowledge. If previous studies suggest a drug reduces blood pressure by 10-15 mmHg, we might use a normal prior centered at 12.5 with appropriate uncertainty.
Weakly informative priors provide mild regularization without strong assumptions. A normal distribution centered at zero with a standard deviation of 10 on a standardized effect size says large effects are unlikely but does not rule them out.
Flat or non-informative priors attempt minimal influence. A uniform prior across all possible values lets the data speak loudest. However, truly non-informative priors rarely exist, and flat priors can sometimes be surprisingly informative on transformed scales.
Critics often object that priors introduce subjectivity. Yet all statistical methods involve assumptions — Bayesian analysis simply makes these explicit rather than hiding them. Moreover, as data accumulates, the prior's influence diminishes. The honest approach is to choose priors thoughtfully, justify them, and check sensitivity.
The Likelihood Function
The likelihood function connects our model to observed data. For a parameter theta and data y, the likelihood quantifies how well different parameter values explain the observations. It is the probability of the data given the parameter, viewed as a function of the parameter.
Consider tossing a globe 10 times and observing 7 water landings. If p represents the proportion of water on the globe, the likelihood follows a binomial distribution:
L(p | data) = C(10,7) * p^7 * (1-p)^3
This function peaks at p = 0.7, the maximum likelihood estimate. Values near 0.7 explain the data well; extreme values like p = 0.1 or p = 0.9 explain it poorly.
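A short Python sketch makes the shape of this likelihood visible (the evaluation points are chosen arbitrarily):

```python
from math import comb

# Binomial likelihood for 7 water in 10 tosses,
# evaluated at several values of p
def likelihood(p, k=7, n=10):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

for p in [0.1, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}: L = {likelihood(p):.6f}")
```

The function attains its maximum at p = 0.7 and falls off steeply toward the extremes.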
The Posterior Distribution
The posterior distribution combines prior and likelihood through Bayes' theorem:
Posterior ~ Likelihood * Prior
The posterior represents our updated knowledge. It balances what we knew before with what the data tells us. With limited data, the prior matters more. With abundant data, the likelihood dominates.
Continuing the globe-tossing example, suppose we use a Beta(2, 2) prior (a slight belief the globe is half water). The posterior becomes Beta(2+7, 2+3) = Beta(9, 5). The prior pulled our estimate slightly toward 0.5, but the data dominated.
Bayesian Inference by Simulation: Grid Approximation
Before diving into MCMC, it is worth understanding the simplest approach to Bayesian inference: grid approximation. The idea is straightforward: evaluate the posterior at a grid of parameter values, then normalize. This makes the logic of Bayesian updating completely transparent.
For the globe-tossing example, suppose we want to infer the proportion of water p. We define a grid of possible values and compute the posterior at each point:
# Grid approximation of globe-tossing posterior
# Data: 6 water, 3 land in 9 tosses
# Define the grid of possible p values
p_grid <- seq(from = 0, to = 1, length.out = 100)
# Define the prior: uniform (Beta(1,1))
prior <- rep(1, 100)
# Define the likelihood: Binomial(9, p)
likelihood <- dbinom(6, size = 9, prob = p_grid)
# Compute the unstandardized posterior
unstd_posterior <- likelihood * prior
# Standardize so it sums to 1
posterior <- unstd_posterior / sum(unstd_posterior)
# Plot
plot(p_grid, posterior, type = "l",
     xlab = "proportion of water (p)",
     ylab = "posterior probability")
Grid approximation is pedagogically valuable because every step is visible. You can see exactly how the prior and likelihood combine. In practice, grid approximation scales poorly to many parameters — the curse of dimensionality means the grid grows exponentially. But for one- or two-parameter problems, it works well and builds intuition.
A Worked Example: Globe Tossing
Let us work through a complete example following McElreath's globe-tossing setup. We toss a small globe 10 times. It lands on water 7 times and land 3 times. We want to infer the true proportion of water, p.
Specifying the Model
We write down a complete generative model:
W ~ Binomial(n=10, p) # the data-generating process
p ~ Beta(2, 2) # the prior
The Beta(2, 2) prior is weakly informative — it expresses a mild preference for p near 0.5 but allows values across the full range. The choice of Beta distribution is convenient because it is conjugate with the binomial, giving us a closed-form posterior.
Computing the Posterior
With Beta-Binomial conjugacy, the posterior is:
p | data ~ Beta(2 + 7, 2 + 3) = Beta(9, 5)
The posterior mean is:
E[p | data] = 9 / (9 + 5) = 0.643
The posterior mode is:
Mode[p | data] = (9 - 1) / (9 + 5 - 2) = 8/12 = 0.667
Credible Intervals
Bayesian uncertainty quantification uses credible intervals. An 89% credible interval contains 89% of the posterior probability. We can directly state: "There is an 89% probability the parameter lies in this interval."
Why 89%? McElreath advocates this as a default because it avoids the magical thinking around 95% that pervades frequentist practice. There is nothing special about 95% — it is a historical accident. Using 89% makes it clear that the threshold is a choice, not a law of nature. It also produces slightly narrower intervals that are more informative for model comparison.
For our Beta(9, 5) posterior, we compute the 89% interval using quantiles:
qbeta(c(0.055, 0.945), shape1 = 9, shape2 = 5)
# Returns approximately [0.44, 0.83]
This means: given our model and data, there is an 89% probability that the true proportion of water on the globe lies between about 0.44 and 0.83. Compare this to the maximum likelihood estimate of 0.7 — the interval tells a richer story.
Prior Predictive Checks
Before fitting a model, we should check whether our prior produces plausible data. A prior predictive check simulates data from the model using only the prior, without conditioning on the observed data. If the simulated data look ridiculous, the prior is poorly specified.
# Prior predictive check for the globe-tossing model
set.seed(42)
n_sim <- 1000
n_tosses <- 10
# Sample p from the prior
p_prior <- rbeta(n_sim, shape1 = 2, shape2 = 2)
# Simulate data for each prior draw
w_sim <- rbinom(n_sim, size = n_tosses, prob = p_prior)
# Visualize the distribution of simulated water counts
hist(w_sim, breaks = -0.5:10.5,
     main = "Prior predictive: water counts",
     xlab = "Number of water landings (out of 10)",
     col = "steelblue", border = "white")
If we instead used a very strong prior like Beta(100, 100) — which concentrates probability tightly around 0.5 — the prior predictive distribution would cluster around 5 water out of 10. If we then observed 7 water, the prior would be in tension with the data. Prior predictive checks surface these tensions before you waste time fitting a model.
Posterior Predictive Checks
After fitting, we verify that the model can reproduce the observed data patterns. A posterior predictive check simulates new data from the posterior — plugging in posterior samples of p and generating replicate datasets. We then compare these replicates to what we actually observed.
# Posterior predictive check
set.seed(42)
n_sim <- 10000
# Sample from the posterior: Beta(9, 5)
p_post <- rbeta(n_sim, shape1 = 9, shape2 = 5)
# Simulate replicate datasets
w_rep <- rbinom(n_sim, size = 10, prob = p_post)
# How often does the simulation match our observation (7 water)?
mean(w_rep == 7)
# Roughly 0.2 here: the observed count is typical of the replicates
# Compare the distribution of replicates to observed
hist(w_rep, breaks = -0.5:10.5,
     main = "Posterior predictive: water counts",
     xlab = "Number of water landings (out of 10)",
     col = "steelblue", border = "white")
abline(v = 7, col = "red", lwd = 2)
# The red line marks the observed count
If the observed data fall in the tails of the posterior predictive distribution, the model is inadequate — it cannot reproduce what you actually saw. This might indicate a missing predictor, a wrong likelihood family, or a poorly specified prior. Posterior predictive checks are one of the most important diagnostic tools in Bayesian data analysis.
Conjugate Priors: Common Families
Conjugate priors simplify Bayesian analysis. When the prior and posterior belong to the same distribution family, calculations become tractable. Here are three essential conjugate pairs:
Beta-Binomial Conjugacy
For binomial data (successes out of n trials), the Beta distribution serves as a conjugate prior for the success probability p.
- Prior: Beta(alpha, beta)
- Likelihood: Binomial(n, p) with k successes
- Posterior: Beta(alpha + k, beta + n - k)
If we start with Beta(1, 1), a uniform prior, and observe 7 water in 10 globe tosses, the posterior is Beta(8, 4). The mean shifts from 0.5 to 8/(8+4) = 0.67.
Normal-Normal Conjugacy
For normally distributed data with known variance, the normal distribution is conjugate for the mean mu.
- Prior: Normal(mu_0, sigma_0^2)
- Likelihood: Normal(mu, sigma^2) with n observations
- Posterior: Normal(mu_n, sigma_n^2)
The posterior mean is a weighted average of prior mean and sample mean, with weights proportional to their precisions (inverse variances).
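As a sketch, here is the update in Python, with hypothetical prior settings and data:

```python
# Normal-Normal conjugate update with known data variance.
# The posterior precision is the sum of prior and data precisions;
# the posterior mean is the precision-weighted average.
mu0, sigma0 = 0.0, 10.0                 # prior mean and sd (hypothetical)
sigma = 2.0                             # known data sd
data = [4.8, 5.1, 5.3, 4.9, 5.2]        # made-up observations
n = len(data)
ybar = sum(data) / n

prior_precision = 1 / sigma0 ** 2
data_precision = n / sigma ** 2
post_precision = prior_precision + data_precision

post_mean = (prior_precision * mu0 + data_precision * ybar) / post_precision
post_sd = post_precision ** -0.5
print(f"posterior: Normal({post_mean:.3f}, {post_sd:.3f}^2)")
```

With five observations and a diffuse prior, the posterior mean sits very close to the sample mean, pulled only slightly toward the prior mean of zero.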
Gamma-Poisson Conjugacy
For count data following a Poisson distribution, the Gamma distribution is conjugate for the rate parameter lambda.
- Prior: Gamma(alpha, beta)
- Likelihood: Poisson(lambda) with n observations summing to k
- Posterior: Gamma(alpha + k, beta + n)
This pairing appears in modeling event rates, such as customer arrivals or website clicks.
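A minimal Python sketch of the update, using made-up counts:

```python
# Gamma-Poisson conjugate update for an event rate.
# Hypothetical prior Gamma(2, 1) and hypothetical counts per interval.
alpha, beta = 2.0, 1.0
counts = [3, 5, 4, 6, 2]          # observed counts over 5 intervals
k, n = sum(counts), len(counts)

alpha_post = alpha + k            # posterior: Gamma(alpha + k, beta + n)
beta_post = beta + n
post_mean = alpha_post / beta_post
print(f"posterior: Gamma({alpha_post}, {beta_post}), mean = {post_mean:.3f}")
```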
Bayes Factors
The Bayes factor quantifies evidence for one model over another, independent of prior probabilities:
BF = P(data | H1) / P(data | H0)
A Bayes factor of 10 means the data is 10 times more likely under model H1 than under model H0. Common interpretation guidelines (following Jeffreys) suggest:
- 1 to 3: barely worth mentioning
- 3 to 10: substantial
- 10 to 30: strong
- 30 to 100: very strong
- above 100: decisive
Bayes factors naturally penalize model complexity, addressing overfitting concerns. A more complex model must substantially improve the likelihood to overcome the penalty of its broader prior. However, Bayes factors can be sensitive to prior specification, especially for the alternative hypothesis. They are most useful when the models under comparison are well-motivated scientifically, not when used as an automatic model selection procedure.
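As an illustration (not from the text), here is a Python sketch comparing a point null p = 0.5 against a uniform-prior alternative for 7 water in 10 globe tosses:

```python
from math import comb

# Bayes factor for the globe data: 7 water in 10 tosses.
# H0: p = 0.5 exactly.  H1: p ~ Uniform(0, 1).
k, n = 7, 10

# Marginal likelihood under H0: a plain binomial at p = 0.5
m0 = comb(n, k) * 0.5 ** n

# Marginal likelihood under H1: the binomial likelihood integrated
# over a uniform prior, which equals 1 / (n + 1) for any k
m1 = 1 / (n + 1)

bf10 = m1 / m0
print(f"BF10 = {bf10:.3f}")  # below 1: the data mildly favor the point null
```

The result, a Bayes factor slightly below 1, shows the complexity penalty at work: the diffuse alternative spreads its prior mass over many parameter values that predict the data poorly.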
MCMC Methods: Sampling from Posteriors
Analytical solutions exist only for simple models. Real-world problems require numerical approximation. Markov Chain Monte Carlo (MCMC) methods generate samples from the posterior distribution, which can then be summarized to produce estimates, intervals, and predictions.
Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm constructs a Markov chain that converges to the target posterior:
- Start at an initial parameter value theta_0
- Propose a new value theta_prime from a proposal distribution
- Calculate the acceptance ratio r = P(theta_prime | data) / P(theta | data) (for a symmetric proposal; the general Metropolis-Hastings ratio also multiplies by the ratio of proposal densities)
- Accept theta_prime with probability min(1, r); otherwise stay at theta
- Repeat
The algorithm explores the parameter space, spending more time in high-probability regions. After a burn-in period, samples approximate the posterior.
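The algorithm is short enough to sketch in plain Python. This toy implementation targets the globe-tossing posterior (7 water in 10 tosses with a Beta(2, 2) prior) and works on the log scale for numerical stability; the proposal scale and iteration counts are arbitrary choices:

```python
import math
import random

random.seed(42)

def log_post(p):
    # Unnormalized log posterior: binomial likelihood (constant dropped)
    # times a Beta(2, 2) prior gives p^8 * (1-p)^4, i.e. Beta(9, 5)
    if not 0 < p < 1:
        return float("-inf")
    return 8 * math.log(p) + 4 * math.log(1 - p)

samples = []
p = 0.5
for _ in range(60000):
    proposal = p + random.gauss(0, 0.1)           # symmetric proposal
    log_r = log_post(proposal) - log_post(p)      # log acceptance ratio
    if random.random() < math.exp(min(0.0, log_r)):
        p = proposal                              # accept; otherwise stay
    samples.append(p)

posterior = samples[10000:]                       # discard burn-in
mean_p = sum(posterior) / len(posterior)
print(f"posterior mean ~= {mean_p:.3f}")          # analytic Beta(9, 5) mean: 0.643
```

Even this crude sampler recovers the analytic posterior mean to two decimal places, which is a useful sanity check before trusting it on models without closed forms.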
Gibbs Sampling
Gibbs sampling simplifies Metropolis-Hastings for multi-parameter models. Instead of proposing joint updates, it samples each parameter from its full conditional distribution (given all other parameters and data).
For parameters (theta_1, theta_2, theta_3):
- Sample theta_1 from P(theta_1 | theta_2, theta_3, data)
- Sample theta_2 from P(theta_2 | theta_1, theta_3, data)
- Sample theta_3 from P(theta_3 | theta_1, theta_2, data)
- Repeat
Gibbs sampling always accepts proposals but requires tractable conditional distributions.
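As a sketch, here is Gibbs sampling for normal data with unknown mean and precision, a standard conjugate setup; the priors and simulated data below are hypothetical:

```python
import random

random.seed(1)

# Simulated data: 200 draws from Normal(5, 2^2)
y = [random.gauss(5, 2) for _ in range(200)]
n = len(y)
ybar = sum(y) / n

# Hypothetical weak priors: mu ~ Normal(mu0, 1/tau0), tau ~ Gamma(a0, b0)
mu0, tau0 = 0.0, 0.01
a0, b0 = 1.0, 1.0

mu, tau = 0.0, 1.0          # initial values
mu_draws = []
for _ in range(5000):
    # Full conditional for mu: Normal with precision-weighted mean
    prec = tau0 + n * tau
    mean = (tau0 * mu0 + tau * n * ybar) / prec
    mu = random.gauss(mean, prec ** -0.5)
    # Full conditional for tau: Gamma(a0 + n/2, rate = b0 + SS/2)
    ss = sum((yi - mu) ** 2 for yi in y)
    tau = random.gammavariate(a0 + n / 2, 1 / (b0 + ss / 2))  # shape, scale
    mu_draws.append(mu)

kept = mu_draws[500:]
post_mu = sum(kept) / len(kept)
print(f"posterior mean of mu ~= {post_mu:.2f}")  # close to the sample mean
```

Note that each update draws from an exact conditional distribution, so no proposals are ever rejected.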
Practical Considerations
MCMC introduces diagnostic challenges:
- Convergence: Has the chain reached the target distribution?
- Mixing: How efficiently does the chain explore the space?
- Autocorrelation: How many samples are effectively independent?
Tools like trace plots, the Gelman-Rubin statistic (R-hat), and effective sample size calculations help assess MCMC quality. Modern probabilistic programming frameworks like Stan handle most of these diagnostics automatically.
Multilevel Models
Bayesian multilevel models (also called hierarchical models or mixed-effects models) are one of the greatest strengths of the Bayesian framework. They capture structure in grouped data by allowing parameters to vary across groups while sharing information between groups.
The Problem: Pooled vs. Unpooled Estimates
Consider test scores from students across multiple schools. We could pool all data and estimate one overall mean, ignoring school differences. Or we could estimate each school separately (no pooling), but small schools would have wildly unstable estimates. Multilevel models offer a middle path: partial pooling.
The Hierarchical Structure
score_ij ~ Normal(mu_j, sigma^2) # student-level model
mu_j ~ Normal(mu, tau^2) # school-level model (prior)
mu ~ Normal(0, 10) # group-level prior
tau ~ Exponential(1) # prior on variation between schools
School-level parameters mu_j share information through the group-level prior. Schools with more data get estimates closer to their own sample mean, while schools with less data are pulled toward the group average. This partial pooling is automatic — the model figures out how much to shrink each estimate based on the data.
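The shrinkage mechanics can be sketched directly. Treating sigma, tau, and the grand mean as known (a simplification; a full model estimates them), each school's conditional posterior mean is a precision-weighted average. The numbers below are invented for illustration:

```python
# Partial pooling as a precision-weighted average.
# Three hypothetical schools, all with sample mean 60, but different sizes:
# the smaller the school, the stronger the pull toward the grand mean.
sigma = 10.0         # student-level sd (assumed known)
tau = 5.0            # between-school sd (assumed known)
grand_mean = 50.0

schools = {          # name: (sample mean, number of students)
    "A": (60.0, 100),
    "B": (60.0, 10),
    "C": (60.0, 2),
}

estimates = {}
for name, (ybar, n) in schools.items():
    data_prec = n / sigma ** 2
    prior_prec = 1 / tau ** 2
    estimates[name] = (data_prec * ybar + prior_prec * grand_mean) \
        / (data_prec + prior_prec)
    print(f"school {name}: n = {n:3d}, sample mean = {ybar}, "
          f"partially pooled estimate = {estimates[name]:.1f}")
```

All three schools report the same sample mean, yet the two-student school is shrunk much closer to the grand mean than the hundred-student school, exactly the behavior described above.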
Why Multilevel Models Matter
Multilevel models shine whenever data has a grouped or clustered structure: students within schools, patients within hospitals, repeated measures within individuals, or observations across geographic regions. They provide better estimates for small groups, naturally quantify variation at each level, and avoid the ad-hoc decisions that plague traditional approaches (such as deciding whether to pool or not).
Applications in AI and Machine Learning
Naive Bayes Classifiers
Naive Bayes applies Bayes' theorem to classification with a simplifying assumption: features are conditionally independent given the class. For text classification:
P(class | words) ~ P(class) * product(P(word_i | class))
Despite the unrealistic independence assumption, Naive Bayes performs remarkably well for spam filtering, sentiment analysis, and document categorization.
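A toy sketch in Python (the word probabilities are invented for illustration); working with log probabilities avoids numerical underflow when many words multiply together:

```python
import math

# Toy Naive Bayes spam filter with hypothetical per-class word probabilities
p_spam = 0.4
p_word_given_spam = {"free": 0.30, "meeting": 0.01, "winner": 0.20}
p_word_given_ham = {"free": 0.02, "meeting": 0.15, "winner": 0.001}

def log_score(words, p_class, p_word):
    # log P(class) + sum of log P(word_i | class)
    return math.log(p_class) + sum(math.log(p_word[w]) for w in words)

message = ["free", "winner"]
spam_score = log_score(message, p_spam, p_word_given_spam)
ham_score = log_score(message, 1 - p_spam, p_word_given_ham)
print("spam" if spam_score > ham_score else "ham")
```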
Bayesian Optimization
Bayesian optimization finds optimal inputs for expensive black-box functions. It maintains a probabilistic surrogate model (often a Gaussian process) and uses an acquisition function to balance exploration and exploitation.
This approach powers hyperparameter tuning in machine learning, where each evaluation requires training a model. Bayesian optimization finds good configurations with fewer evaluations than grid or random search.
Bayesian Neural Networks
Bayesian neural networks place probability distributions over weights instead of point estimates. This provides:
- Uncertainty quantification: Predictions include confidence estimates
- Regularization: The prior prevents overfitting
- Model averaging: Predictions average over plausible weight configurations
Variational inference and MCMC enable training, though computational costs remain high for large networks.
Tools and Software
Modern Bayesian modeling is made practical by a rich ecosystem of tools. Here are the three most important ones, each with a different philosophy:
rethinking (R)
McElreath's rethinking package accompanies Statistical Rethinking and is designed for teaching. It exposes the mechanics of Bayesian computation rather than hiding them:
library(rethinking)
# Globe-tossing model using map (quadratic approximation;
# renamed quap in rethinking 2.x)
data <- list(
  w = 6,  # water count
  n = 9   # total tosses
)
fit <- map(
  alist(
    w ~ dbinom(n, p),  # likelihood
    p ~ dbeta(2, 2)    # prior
  ),
  data = data
)
# Extract 89% credible interval
precis(fit, prob = 0.89)
# Or use MCMC with map2stan (renamed ulam in rethinking 2.x)
fit_mcmc <- map2stan(
  alist(
    w ~ dbinom(n, p),
    p ~ dbeta(2, 2)
  ),
  data = data,
  iter = 4000, warmup = 1000, chains = 4
)
# Posterior predictive check
postcheck(fit_mcmc)
brms (R)
The brms package provides a formula interface for Bayesian regression models powered by Stan. It feels familiar to users of R's lm() and lmer() functions:
library(brms)
# Fit a multilevel regression
fit <- brm(
  score ~ 1 + (1 | school),
  data = student_data,
  family = gaussian(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(cauchy(0, 2), class = sd),
    prior(exponential(1), class = sigma)
  ),
  iter = 4000, warmup = 1000, chains = 4,
  seed = 42
)
# Summary with 89% credible intervals
summary(fit, prob = 0.89)
# Posterior predictive checks
pp_check(fit)
# Extract posterior samples for custom summaries
posterior_samples <- as_draws_df(fit)
PyMC (Python)
PyMC brings probabilistic programming to Python with a clean, intuitive API and efficient NUTS sampling:
import pymc as pm
import arviz as az

# Globe-tossing model
water_observed = 6
total_tosses = 9

with pm.Model() as globe_model:
    # Prior
    p = pm.Beta("p", alpha=2, beta=2)
    # Likelihood
    w = pm.Binomial("w", n=total_tosses, p=p,
                    observed=water_observed)
    # Sample from the posterior
    trace = pm.sample(4000, tune=1000, chains=4,
                      cores=4, random_seed=42)

# Summary with 89% credible intervals
az.summary(trace, hdi_prob=0.89)

# Posterior predictive check: add replicate data to the trace,
# then compare replicates to the observation
pm.sample_posterior_predictive(trace, model=globe_model,
                               extend_inferencedata=True)
az.plot_ppc(trace)
Other notable tools include Stan (the underlying engine for rethinking and brms, usable from R, Python, and Julia), Turing.jl (probabilistic programming in Julia), and NumPyro (fast MCMC in Python built on JAX).
Common Misconceptions
"Bayesian methods are subjective"
While priors encode subjective knowledge, this transparency is a strength. Different analysts can use different priors and compare results. Moreover, with sufficient data, priors matter less. The real question is whether assumptions are reasonable, not whether they exist.
"Bayesian computation is too slow"
MCMC does require more computation than point estimates, but modern tools like Stan, PyMC, and Turing make this manageable. Approximate methods like variational inference trade some accuracy for speed. For many problems, the benefits outweigh computational costs.
"Priors bias results"
Priors influence posteriors, but this is the point. Ignoring relevant prior information wastes data. Weakly informative priors provide regularization without strong assumptions. Sensitivity analysis — trying different priors — assesses robustness.
"Bayesian methods don't control error rates"
Bayesian inference focuses on posterior probabilities, not long-run error rates. This is not a bug but a feature. When error rate control matters (for example, in clinical trials), Bayesian designs can incorporate frequentist operating characteristics through simulation.
Practical Considerations: When to Use Bayesian Methods
Choose Bayesian approaches when:
- Prior information exists: Previous studies, expert knowledge, or physical constraints inform parameters
- Uncertainty quantification matters: Decisions require full posterior distributions, not just point estimates
- Data is limited: Priors stabilize estimates when data is sparse
- Complex models are needed: Hierarchical structures, missing data, or latent variables fit naturally
- Sequential updating is required: Posteriors become priors for new data
Frequentist methods may suffice when:
- Data is abundant: Likelihood dominates, making prior choice irrelevant
- Computational simplicity is critical: Quick analyses with standard methods
- Regulatory requirements demand it: Some fields mandate frequentist approaches
- Only point estimates matter: No need for full uncertainty quantification
Many practitioners use both frameworks, selecting tools based on the problem at hand.
Further Reading and Resources
Books
- Statistical Rethinking by Richard McElreath — The best starting point. A model-based introduction to Bayesian statistics with R and Stan, full of intuition and practical wisdom
- Bayesian Data Analysis by Gelman et al. — The comprehensive reference for practitioners
- Doing Bayesian Data Analysis by Kruschke — Gentle introduction for beginners with detailed visual explanations
- Probabilistic Programming and Bayesian Methods for Hackers by Davidson-Pilon — Practical, code-focused approach using PyMC
Online Resources
- Statistical Rethinking (YouTube): McElreath's lecture series — pedagogically outstanding
- Bayesian Methods for Machine Learning (Coursera): HSE University course
- Stan Documentation: Extensive examples and case studies
- ArviZ Documentation: Visualization and diagnostics for Bayesian models
Bayesian statistics offers a coherent framework for reasoning under uncertainty. The model-based philosophy — writing explicit generative models, checking them against data, and using posterior distributions to quantify what we know — provides clarity that few other approaches can match. By combining prior knowledge with observed data, Bayesian methods produce principled probability statements about parameters and predictions. Whether analyzing clinical trials, building machine learning systems, or making business decisions, Bayesian thinking helps us navigate an uncertain world with honesty and precision.